Calmcode - bad labels: noisy label learning

Learning with Noisy Labels via Cleanlab


Note: this tutorial uses cleanlab v1. The code examples run, but you may need to install the older version explicitly.

python -m pip install cleanlab==1.0

Noisy Labels

Another interesting feature of cleanlab is the ability to train a classifier that is more robust against bad labels. There's a meta model called LearningWithNoisyLabels that can take any scikit-learn compatible pipeline and adapt it to be more robust to label noise.

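from cleanlab.classification import LearningWithNoisyLabels
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Wrap around any classifier that supports `sample_weight`.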
fresh_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
lnl = LearningWithNoisyLabels(clf=fresh_pipe)

# Pay attention! The labels are passed as s=, not y=!
# `X` and `y` are assumed to be defined earlier in this series.
lnl.fit(X=X, s=y.values)

The LearningWithNoisyLabels model from cleanlab is meant as a general method that lets machine learning models work with uncertain labels. As long as the underlying pipeline implements a .predict_proba() method, you're good to use this trick. The approach is sometimes referred to as "confident learning" and the idea is gaining traction.
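To give a feel for why predicted probabilities matter here, below is a minimal, pure-Python sketch of the thresholding idea commonly associated with confident learning. It is not cleanlab's implementation. The function name `find_suspect_labels` and the toy data are made up for illustration, and the per-class threshold rule is a simplification.

```python
def find_suspect_labels(labels, probs):
    """Flag indices whose given label disagrees with a confident prediction.

    labels: list of int class ids (the given, possibly noisy labels)
    probs:  list of per-class probability lists, one row per example
    NOTE: hypothetical helper for illustration, not part of cleanlab.
    """
    n_classes = len(probs[0])

    # Per-class threshold: the mean predicted probability of class k,
    # taken over the examples that were labelled k.
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for label, row in zip(labels, probs):
        sums[label] += row[label]
        counts[label] += 1
    thresholds = [s / c if c else float("inf") for s, c in zip(sums, counts)]

    suspects = []
    for i, (label, row) in enumerate(zip(labels, probs)):
        # Classes whose probability clears their own threshold.
        confident = [k for k in range(n_classes) if row[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: row[k])
        if best != label:
            suspects.append(i)
    return suspects


# Toy data: index 2 is labelled 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
print(find_suspect_labels(labels, probs))
```

In practice the probabilities should come from out-of-sample predictions (for example via cross-validation), otherwise the model may simply have memorised the noisy labels.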

LearningWithNoisyLabels vs. Utility

Although it's useful to have a more robust model via this trick, it would be best not to rely on it fully. It's better to have higher quality labels, even when you're using robustness methods. There's another use-case here though. We might be able to re-use the lnl model for another purpose: finding bad labels!

A simple way to use it is to compare its output with the output of another machine learning model. When these two models disagree, we might again have a good proxy for labels that need fixing.

Let's first make a new pipeline (one that isn't used in the LearningWithNoisyLabels model).

new_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)

new_pipe.fit(X=X, y=y)

With these two models trained, you can now start comparing.

# Rows where the robust model and the plain model disagree.
disagree = lnl.predict(X) != new_pipe.predict(X)
df.loc[disagree, ['text', 'excitement']].sample(5)
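If you'd like to see both predictions side by side rather than only the sampled rows, a small helper along these lines might do. This is a sketch with made-up data: `disagreements` is a hypothetical helper, and the two prediction lists stand in for `lnl.predict(X)` and `new_pipe.predict(X)`.

```python
def disagreements(texts, preds_a, preds_b):
    """Return (index, text, pred_a, pred_b) for rows where two models disagree.

    NOTE: hypothetical helper for illustration only.
    """
    if not (len(texts) == len(preds_a) == len(preds_b)):
        raise ValueError("all inputs must have the same length")
    return [
        (i, text, a, b)
        for i, (text, a, b) in enumerate(zip(texts, preds_a, preds_b))
        if a != b
    ]


# Made-up stand-ins for the real texts and model outputs.
texts = ["great news", "meh", "so excited", "fine I guess"]
robust_preds = [1, 0, 1, 0]  # stand-in for lnl.predict(X)
plain_preds = [1, 0, 0, 1]   # stand-in for new_pipe.predict(X)

for row in disagreements(texts, robust_preds, plain_preds):
    print(row)
```

Each returned row is a candidate for manual review. A disagreement is only a hint that the label might be wrong, not proof.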