
scikit dummy


You need to benchmark your models properly, and it is easy to forget this step. That is why we want to demonstrate the dummy module of scikit-learn (sklearn.dummy). It is an underappreciated part of the library that provides simple baseline estimators, which makes it a very useful place to start.


Notes

To download the dataset that we'll use, run this block of code:

from sklearn.datasets import fetch_openml 
X, y = fetch_openml(
    data_id=1597,
    return_X_y=True,
)

Next, we'll build a pipeline and fit it on this dataset with a grid search.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
])

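# An empty param_grid means a single candidate: the pipeline with its defaults.
grid = GridSearchCV(estimator=pipe,
                    param_grid={},
                    cv=2,  # two cross-validation folds
                    scoring={'acc': make_scorer(accuracy_score)},
                    refit='acc',  # refit the best model according to accuracy
                    return_train_score=True)

# The trailing semicolon only suppresses the printed output in a notebook.
grid.fit(X, y);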

We can look at the results from the grid search.

import pandas as pd 
pd.DataFrame(grid.cv_results_)
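
The full results table is wide, so it can help to keep only the score columns. The sketch below is a minimal, self-contained illustration of that idea. It uses synthetic data from make_classification as a stand-in (an assumption, not the OpenML dataset), and the names X_demo, y_demo, pipe_demo, grid_demo and summary are hypothetical. Because the scorer is registered under the key 'acc', scikit-learn names the columns mean_train_acc, mean_test_acc and so on.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the dataset from the notes).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

grid_demo = GridSearchCV(
    estimator=pipe_demo,
    param_grid={},
    cv=2,
    scoring={"acc": make_scorer(accuracy_score)},
    refit="acc",
    return_train_score=True,
)
grid_demo.fit(X_demo, y_demo)

# Keep only the accuracy columns; the suffix matches the key in `scoring`.
summary = pd.DataFrame(grid_demo.cv_results_)[
    ["mean_train_acc", "mean_test_acc", "std_test_acc"]
]
print(summary)
```

With an empty parameter grid there is a single candidate, so the summary has exactly one row.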

The scores will make it look like everything has gone well, but we should be critical. Counting the labels helps us understand what is happening.

from collections import Counter
Counter(y)
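
If the counts turn out to be heavily skewed toward one label, a high accuracy says very little on its own. One way to check is to compare against a baseline from the dummy module. The sketch below is a rough illustration, not a definitive recipe: it uses synthetic, imbalanced data from make_classification (an assumption, roughly 99% of one class) and hypothetical names such as X_demo, y_demo and baseline. DummyClassifier with strategy="most_frequent" ignores the features and always predicts the most common label.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, heavily imbalanced stand-in data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.99], flip_y=0, random_state=0
)

# Share of the most common label: the accuracy you get by always guessing it.
counts = Counter(y_demo)
majority_share = max(counts.values()) / len(y_demo)

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(
    baseline, X_demo, y_demo, cv=2, scoring="accuracy"
).mean()

print(counts)
print(f"majority share: {majority_share:.3f}")
print(f"dummy accuracy: {baseline_acc:.3f}")
```

A real model is only interesting if it clearly beats this baseline. When it does not, accuracy is probably the wrong metric for the problem.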
