We can also use cleanlab to help us find bad labels. Cleanlab offers an interesting suite of tools built around the concept of "confident learning". Its goal is to make it possible to learn with noisy labels, and it also offers features that help estimate uncertainty in dataset labels.
Note this tutorial uses cleanlab v1. The code examples run, but you may need to install the older version explicitly.
python -m pip install cleanlab==1.0
Pruning with Cleanlab
Let's first explore the pruning submodule.
from cleanlab.pruning import get_noise_indices
ordered_label_errors = get_noise_indices(
s=y,
psx=pipe.predict_proba(X),
sorted_index_method='prob_given_label',
)
The get_noise_indices function will retrieve rows for us that are worth double-checking. The labels that should be checked first will be at the top of the list, which means that we can use the .iloc indexer in pandas to find the examples that we're looking for.
df.iloc[ordered_label_errors][['text', 'excitement']].head(20)
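To get a feel for what the 'prob_given_label' ordering means, here is a minimal sketch in plain Python. It only illustrates the ranking idea: rows are sorted by the probability that the model assigns to the label they were given, lowest first. The toy labels and pred_probs values are made up for illustration, and the sketch skips the confident-learning step that cleanlab uses to decide which rows count as candidates in the first place, so it is not a stand-in for the library.

```python
# Hypothetical toy data: 5 examples, 2 classes (values invented for illustration).
labels = [0, 1, 0, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.40, 0.60],
    [0.05, 0.95],
    [0.30, 0.70],
]

# For every row, look up the probability the model gives to the assigned label.
prob_given_label = [row[label] for row, label in zip(pred_probs, labels)]

# Sort row indices so the least plausible labels come first.
ranked = sorted(range(len(labels)), key=lambda i: prob_given_label[i])

for i in ranked:
    print(i, labels[i], prob_given_label[i])
```

Rows near the top of this ranking are the ones where the model disagrees most with the given label, which is why they are worth checking first.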
In general, get_noise_indices seems to work very well, but it always helps to use more than one method to find bad labels.