Let's train a basic model on this dataset first. We'll use this model as a starting point for all sorts of tricks later on.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("goemotions_1.csv")
If you're curious to see some examples from the dataset, don't forget to increase the column width in pandas. That way, the full text is shown instead of being truncated.
pd.set_option('display.max_colwidth', None)
df[['text', 'excitement']].loc[lambda d: d['excitement'] == 0].sample(2)
Class Imbalance
When you build a model for this dataset, it helps to remember that the classes are heavily imbalanced: only about 3% of the rows are labelled as excitement.
df['excitement'].value_counts()
# 0 68100
# 1 1900
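That's why you shouldn't forget to set the class_weight parameter in the logistic regression. Setting it to 'balanced' weights each class inversely to its frequency, so the rare positive class isn't drowned out by the negatives.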
X, y = df['text'], df['excitement']
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
pipe.fit(X, y)
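If you want to see what class_weight='balanced' amounts to here, the sketch below recomputes the weights outside of the pipeline. It assumes the class counts printed earlier (68100 and 1900) and rebuilds a stand-in label vector from them, so it doesn't need the CSV file. The variable names are only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in label vector built from the class counts shown earlier
# (assumption: 68100 negatives, 1900 positives). Only the counts matter.
y_counts = np.array([0] * 68100 + [1] * 1900)

# scikit-learn documents the 'balanced' heuristic as
# n_samples / (n_classes * np.bincount(y))
manual = len(y_counts) / (2 * np.bincount(y_counts))

# The helper in sklearn.utils computes the same weights
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

for label, weight in zip([0, 1], weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these counts, a single positive example ends up counting roughly 36 times as much as a single negative one during training.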
Given that we now have a trained pipeline, let's explore some tricks we can do with it!