bad labels: prune

Notes

We can also use cleanlab to help us find bad labels. Cleanlab is a suite of tools built around the concept of "confident learning". Its goal is to make it possible to learn from noisy labels, and it also offers features for estimating how uncertain the labels in a dataset are.

Pruning with Cleanlab

Let's first explore the pruning submodule.

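# Note: this import path is the cleanlab 1.x API; cleanlab 2.x moved this
# functionality to cleanlab.filter.find_label_issues with renamed arguments.
from cleanlab.pruning import get_noise_indices

ordered_label_errors = get_noise_indices(
    s=y,                                      # the given (possibly noisy) labels
    psx=pipe.predict_proba(X),                # predicted probabilities per class
    sorted_index_method='prob_given_label',   # least-trusted labels come first
)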

The get_noise_indices function retrieves the indices of rows that are worth double-checking. The labels that should be checked first are at the top of the list, so we can pass the result to the .iloc indexer in pandas to pull up the examples we're looking for.

df.iloc[ordered_label_errors][['text', 'excitement']].head(20)
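
To build some intuition for the 'prob_given_label' ordering, here is a minimal sketch in plain Python. It is a simplified illustration and not cleanlab's actual implementation: cleanlab first decides which rows are likely errors using confident learning, while this sketch only does the ranking step. The function name rank_by_self_confidence and the toy data are made up for this example.

```python
def rank_by_self_confidence(labels, pred_probs):
    """Return row indices sorted so the least-trusted labels come first.

    labels: list of integer class ids (the given, possibly noisy labels).
    pred_probs: list of per-row probability lists, one entry per class.
    """
    # Probability the model assigns to the label each row currently carries.
    self_confidence = [probs[label] for label, probs in zip(labels, pred_probs)]
    # Sort ascending: low confidence in the given label hints at a possible error.
    return sorted(range(len(labels)), key=lambda i: self_confidence[i])


# Toy data (hypothetical): four rows, two classes.
labels = [0, 1, 1, 0]
pred_probs = [
    [0.9, 0.1],  # model agrees with label 0
    [0.8, 0.2],  # model strongly disagrees with label 1
    [0.3, 0.7],  # model agrees with label 1
    [0.4, 0.6],  # model mildly disagrees with label 0
]

order = rank_by_self_confidence(labels, pred_probs)
print(order)
```

Rows where the model gives the current label a low probability end up first, which is why the top of the list is the best place to start checking by hand.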

In general, this method seems to work very well, but no single technique catches every bad label, so it helps to always use more than one method to find them.