We can also use cleanlab to help us find bad labels. Cleanlab offers an interesting suite of tools built around the concept of "confident learning". Its goal is to make it possible to learn with noisy labels, and it also offers features that help estimate the uncertainty in dataset labels.
Pruning with Cleanlab
Let's first explore the `get_noise_indices` function.

    from cleanlab.pruning import get_noise_indices

    # s: the given (possibly noisy) labels
    # psx: the predicted class probabilities for every row
    ordered_label_errors = get_noise_indices(
        s=y,
        psx=pipe.predict_proba(X),
        sorted_index_method='prob_given_label',
    )

The `get_noise_indices` function will retrieve rows for us that are worth double-checking. The labels that should be checked first are at the top of the list, which means that we can use the `.iloc` method in pandas to find the examples that we're looking for.
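To make the `.iloc` step concrete, here is a minimal sketch. The dataframe `df` and the contents of `ordered_label_errors` are made up for illustration; in practice the index array would come from the `get_noise_indices` call, and `df` would be the dataframe that `X` and `y` were taken from.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "text": ["great movie", "terrible plot", "loved it", "awful acting", "fine"],
    "label": [1, 1, 1, 0, 0],
})

# Hypothetical stand-in for the output of get_noise_indices:
# positional row indices, most suspicious first.
ordered_label_errors = np.array([1, 4])

# .iloc selects rows by position and keeps the order of the index array,
# so the rows to check first end up at the top.
to_check = df.iloc[ordered_label_errors]
print(to_check)
```

Note that `cleanlab.pruning` belongs to the 1.x releases of cleanlab. In cleanlab 2.x the comparable entry point is, as far as we know, `cleanlab.filter.find_label_issues`, so the import may need adjusting depending on the installed version.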
In general, this method seems to work very well, but it helps to always use more than one method to find bad labels.
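As one illustration of combining methods, the sketch below pairs the cleanlab output with a much simpler heuristic: flag rows where the model confidently predicts a different class than the given label. This heuristic is not part of cleanlab and is not confident learning. The function name, the threshold and the toy inputs are all assumptions made for the example, and it assumes that labels are integer class indices matching the columns of the probability matrix.

```python
import numpy as np

def confident_disagreements(labels, pred_probs, threshold=0.9):
    """Indices where the model disagrees with the given label while being
    at least `threshold` confident, ordered from most to least confident."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    predicted = pred_probs.argmax(axis=1)
    confidence = pred_probs.max(axis=1)
    mask = (predicted != labels) & (confidence >= threshold)
    idx = np.flatnonzero(mask)
    # Sort the flagged rows by descending confidence.
    return idx[np.argsort(-confidence[idx], kind="stable")]

# Hypothetical toy inputs: four rows, two classes.
labels = [0, 1, 1, 0]
pred_probs = [[0.95, 0.05], [0.97, 0.03], [0.20, 0.80], [0.05, 0.95]]
flagged = confident_disagreements(labels, pred_probs)

# Hypothetical indices reported by another method, e.g. cleanlab.
other_method = np.array([1, 2])

# Rows that both methods agree on are strong candidates for relabeling.
agreed = np.intersect1d(flagged, other_method)
print(flagged, agreed)
```

Rows flagged by both methods are a sensible place to start reviewing, while rows flagged by only one method can be checked afterwards.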