scikit dummy


Models need to be benchmarked against a proper baseline, and it is easy to forget this step. That is why we want to demonstrate the dummy module of scikit-learn. It is an underappreciated part of the library that makes a very useful starting point.


1 - Intro
2 - Dummy Classifier
3 - Comparison
4 - Dummy Regression

To download the dataset that we'll use, run this block of code:

from sklearn.datasets import fetch_openml 
X, y = fetch_openml(
    data_id=1597,
    return_X_y=True,
)

Next we'll build a pipeline that scales the features and fits a logistic regression, then run it on this dataset through a grid search.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
])

grid = GridSearchCV(estimator=pipe, 
                    param_grid={}, 
                    cv=2, 
                    scoring={'acc': make_scorer(accuracy_score)}, 
                    refit='acc', 
                    return_train_score=True)

grid.fit(X, y);

We can look at the results from the grid search.

import pandas as pd 
pd.DataFrame(grid.cv_results_)
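
The full results table is wide, so it can help to narrow it down to the score columns. The sketch below is a minimal illustration on a small synthetic dataset (`X_demo`, `y_demo` and `demo_grid` are stand-in names, not part of the original example) so that it runs without a download. With the scorer registered under the key `acc`, the averaged scores should appear as `mean_train_acc` and `mean_test_acc`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in dataset (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={},
    cv=2,
    scoring={"acc": make_scorer(accuracy_score)},
    refit="acc",
    return_train_score=True,
)
demo_grid.fit(X_demo, y_demo)

# Keep only the averaged train and test accuracy columns.
results = pd.DataFrame(demo_grid.cv_results_)
print(results[["mean_train_acc", "mean_test_acc"]])
```

The same column selection should apply to the `grid` object fitted above, since it uses the same scorer key.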

At first glance everything seems to have gone well, but we should be critical: a high accuracy says little if one label dominates the data. We can count the labels to understand what is happening.

from collections import Counter
Counter(y)
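
To make the label counts concrete, here is a rough sketch of the kind of baseline the dummy module provides. It uses a made-up imbalanced dataset (990 samples of one class and 10 of the other; these numbers are hypothetical and not taken from the downloaded data) together with `DummyClassifier(strategy="most_frequent")`, which ignores the features and always predicts the most common label.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced data: 990 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 990 + [1] * 10)

# The dummy ignores the features and always predicts the majority label.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, dummy.predict(X_demo))
print(baseline_acc)  # 0.99
```

A model that learns nothing still reaches 99% accuracy here, so any real model on data like this should be judged against that number rather than against zero.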