Calmcode - bad labels: prune

Using Cleanlab to find Bad Labels


We can also use cleanlab to help us find bad labels. Cleanlab offers a suite of tools built around the concept of "confident learning". Its goal is to make it possible to learn from noisy labels, and it also offers features that help estimate the uncertainty in dataset labels.

Note this tutorial uses cleanlab v1. The code examples run, but you may need to install the older version explicitly.

python -m pip install cleanlab==1.0

Pruning with Cleanlab

Let's first explore the pruning submodule.

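from cleanlab.pruning import get_noise_indices

# `y`, `X` and the fitted `pipe` are assumed to come from earlier in this series.
ordered_label_errors = get_noise_indices(
    s=y,                          # the given (possibly noisy) labels
    psx=pipe.predict_proba(X),    # predicted probabilities, one column per class
    sorted_index_method='prob_given_label',  # order the returned indices
)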

The get_noise_indices function retrieves the rows that are worth double-checking. The labels that should be checked first appear at the top of the list, which means that we can use the .iloc indexer in pandas to find the examples that we're looking for.

df.iloc[ordered_label_errors][['text', 'excitement']].head(20)
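To build some intuition for the 'prob_given_label' ordering, here is a minimal sketch in plain Python. It assumes the ordering means "rows whose given label received the lowest predicted probability come first". This is not cleanlab's implementation: cleanlab first decides which rows to flag at all, while this sketch simply ranks every row. The function name rank_by_prob_given_label and the toy data are made up for illustration.

```python
def rank_by_prob_given_label(labels, probs):
    """Order row indices so the least plausible given labels come first."""
    # Probability the model assigns to the label each row currently carries.
    prob_of_given_label = [row[label] for row, label in zip(probs, labels)]
    # Ascending sort: the lower that probability, the earlier the row appears.
    return sorted(range(len(labels)), key=lambda i: prob_of_given_label[i])


# Toy data (made up): four rows, two classes.
labels = [0, 1, 1, 0]
probs = [
    [0.9, 0.1],  # label 0 gets 0.9: looks fine
    [0.8, 0.2],  # label 1 gets 0.2: suspicious
    [0.3, 0.7],  # label 1 gets 0.7
    [0.4, 0.6],  # label 0 gets 0.4
]
order = rank_by_prob_given_label(labels, probs)
print(order)  # [1, 3, 2, 0]
```

Row 1 comes first because the model gives its current label a probability of only 0.2, so that is the row we would inspect first.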

In general, this method seems to work very well, but it always helps to use more than one method to find bad labels.
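One simple way to combine methods is to look first at the rows that several of them flag. The sketch below assumes we already have two ranked lists of row indices, one per method. The helper agreed_candidates, the index values, and the top_k cutoff are all hypothetical, and this is only one of many ways to combine rankings.

```python
def agreed_candidates(ranked_a, ranked_b, top_k):
    """Rows in the top_k of both rankings, kept in ranked_a's order."""
    top_b = set(ranked_b[:top_k])
    return [i for i in ranked_a[:top_k] if i in top_b]


# Hypothetical row indices, most suspicious first.
ranked_by_cleanlab = [12, 7, 3, 40, 5]
ranked_by_other = [7, 40, 9, 12, 1]

shared = agreed_candidates(ranked_by_cleanlab, ranked_by_other, top_k=4)
print(shared)  # [12, 7, 40]
```

Rows that only one method flags can still be wrong, so the overlap is a place to start checking and not a reason to skip the rest.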