Calmcode - bad labels: basic model

How to find Bad Labels with Machine Learning models


Let's train a basic model on this dataset first. We'll use this model as a starting point for all sorts of tricks later on.

import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("goemotions_1.csv")

If you're curious to see some examples from the dataset, first set the maximum column width in pandas to None. That way, pandas shows the full text instead of truncating it.

pd.set_option('display.max_colwidth', None)

df[['text', 'excitement']].loc[lambda d: d['excitement'] == 0].sample(2)

Class Imbalance

When you make a model for this dataset, it helps to remember that the excitement label is heavily imbalanced: only a small fraction of the rows are positive.

df['excitement'].value_counts()
# 0    68100
# 1     1900

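That's why you shouldn't forget to set the class_weight parameter in the logistic regression. With class_weight='balanced', scikit-learn weights each class inversely to its frequency, so the rare positive class isn't drowned out by the negatives.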

X, y = df['text'], df['excitement']

pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)

pipe.fit(X, y)
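
To get a feel for what class_weight='balanced' does with the counts shown above, here is a minimal sketch in plain Python. It applies the formula from the scikit-learn documentation, n_samples / (n_classes * count), by hand. The helper name balanced_class_weights is made up for this illustration and is not part of scikit-learn.

```python
def balanced_class_weights(counts):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Class counts taken from the value_counts() output above.
counts = {0: 68100, 1: 1900}
weights = balanced_class_weights(counts)

# The rare class gets a much larger weight (about 18.42 versus about 0.51).
for label, weight in weights.items():
    print(label, round(weight, 2))
```

Note that count times weight comes out the same for both classes, so each class contributes equally to the loss in total.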

Given that we now have a trained pipeline, let's explore some tricks we can do with it!
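
As a first look at what a trained pipeline offers, the sketch below asks for class probabilities with predict_proba. It is an illustration only: it fits the same kind of pipeline on a handful of made-up sentences (toy_texts and toy_labels are invented here and are not from the goemotions data) so that it runs on its own. On the real data you would call pipe.predict_proba(X) instead.

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy data, only here to keep the example self-contained.
toy_texts = [
    "I am so excited about this",
    "this is exciting news",
    "the meeting is at noon",
    "I had soup for lunch",
    "the bus was late again",
    "please send the report",
]
toy_labels = [1, 1, 0, 0, 0, 0]

toy_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
toy_pipe.fit(toy_texts, toy_labels)

# One row per input text, one column per class (ordered as in classes_).
proba = toy_pipe.predict_proba(["so excited for the concert"])
classes = toy_pipe.classes_
print(classes, proba)
```

The column order of the probabilities follows classes_, so with 0/1 labels the second column is the predicted probability of the positive class.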