Let's train a basic model on this dataset first. We'll use this model as a starting point for all sorts of tricks later on.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("goemotions_1.csv")
If you're curious to see some examples from the dataset, don't forget to increase the column width in pandas. That way, the full text is shown instead of being truncated.
pd.set_option('display.max_colwidth', None)
df[['text', 'excitement']].loc[lambda d: d['excitement'] == 0].sample(2)
When you make a model for this dataset, it helps to remember that the dataset is imbalanced.
df['excitement'].value_counts()
# 0    68100
# 1     1900
That's why you shouldn't forget to set the class_weight parameter in the logistic regression.
X, y = df['text'], df['excitement']

pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
pipe.fit(X, y)
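To get a feel for what class_weight='balanced' does with an imbalance like this one, here is a small sketch. It rebuilds a label array from the counts reported by value_counts() rather than loading the CSV, so the `labels` array is a stand-in and not the real data. scikit-learn documents the balanced weights as n_samples / (n_classes * class_count), and compute_class_weight lets you inspect them directly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels built from the counts shown by value_counts()
# (an assumption for illustration; the real labels come from df['excitement']).
n_negative, n_positive = 68100, 1900
labels = np.array([0] * n_negative + [1] * n_positive)

# The weights that class_weight='balanced' would assign to each class.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)

# The same numbers computed by hand: n_samples / (n_classes * class_count).
manual = len(labels) / (2 * np.bincount(labels))

for cls, w in zip([0, 1], weights):
    print(f"class {cls}: weight {w:.3f}")
```

With these counts, the rare class ends up weighted at roughly 18.4 and the common class at roughly 0.51, so each mistake on an excitement example counts about 36 times as much during training.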
Given that we now have a trained pipeline, let's explore some tricks we can do with it!