Note: this tutorial uses cleanlab v1. The code examples still run, but you may need to install the older version explicitly:
python -m pip install cleanlab==1.0
Noisy Labels
Another interesting feature of cleanlab is the ability to train a classifier that is more robust against bad labels. It offers a meta-model called LearningWithNoisyLabels that can take any scikit-learn compatible pipeline and adapt it so that it becomes more robust to label noise.
from cleanlab.classification import LearningWithNoisyLabels
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Wrap around any classifier; if its .fit() accepts `sample_weight`,
# cleanlab will make use of it.
fresh_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)

lnl = LearningWithNoisyLabels(clf=fresh_pipe)
# Pay attention! It's s=, not y=!
lnl.fit(X=X, s=y.values)
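Once fitted, the lnl model behaves like any other scikit-learn classifier. A quick usage sketch, re-using the X from above:
# The meta-model exposes the usual scikit-learn interface.
preds = lnl.predict(X)
probas = lnl.predict_proba(X)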
The LearningWithNoisyLabels model from cleanlab is meant as a general method that lets machine learning work on uncertain labels. As long as the underlying pipeline implements a .predict_proba() method, you can use this trick. The approach is sometimes referred to as "confident learning" and the idea is gaining traction.
LearningWithNoisyLabels vs. Utility
Although it's useful to have a more robust model via this trick, it would be best not to rely on it fully. Higher-quality labels remain preferable, even when you're using robustness methods. There's another use-case here though: we might be able to re-use the lnl model for a different purpose, namely finding bad labels!
A simple way to do this is to compare its output with the output of another machine learning model. When these two models disagree, we might again have a good proxy for labels that need fixing.
Let's first make a new pipeline (one that isn't used in the LearningWithNoisyLabels
model).
new_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
new_pipe.fit(X=X, y=y)
With these two models trained, you can now start comparing.
df.loc[lnl.predict(X) != new_pipe.predict(X)][['text', 'excitement']].sample(5)
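Sampling five random disagreements is a fine start, but you can also rank them. A small sketch, assuming the df, X, lnl and new_pipe from above; the confidence column name is introduced here purely for illustration:
# Rank disagreements by how confident new_pipe is in its own
# prediction, so the most suspicious rows surface first.
disagree = lnl.predict(X) != new_pipe.predict(X)
confidence = new_pipe.predict_proba(X).max(axis=1)
(df.loc[disagree, ['text', 'excitement']]
   .assign(confidence=confidence[disagree])
   .sort_values('confidence', ascending=False)
   .head(10))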