You need to properly benchmark your models, and it's an easy step to forget. That's why we want to demonstrate scikit-learn's dummy module: an underappreciated part of the library that is a great place to start.
Download Data
To download the dataset that we'll use, run this block of code:
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    data_id=1597,
    return_X_y=True,
)
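If you want to confirm that the download worked, a quick look at the shapes is enough (the exact numbers depend on the dataset):

# Sanity check: number of rows/columns in the features and length of the target.
print(X.shape, y.shape)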
Build a Pipeline
Next, we'll build a pipeline with a logistic regression and run it through a cross-validated grid search on this dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score

# Scale the features before passing them to the linear model.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
])

# An "empty" grid search: no hyperparameters to tune, but we still get
# cross-validated train/test accuracy scores.
grid = GridSearchCV(
    estimator=pipe,
    param_grid={},
    cv=2,
    scoring={'acc': make_scorer(accuracy_score)},
    refit='acc',
    return_train_score=True
)

grid.fit(X, y);
GridSearch Results
We can inspect the cross-validation results from the grid search.
import pandas as pd
pd.DataFrame(grid.cv_results_)
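The full table has many columns. Since we defined a single scorer named 'acc', the interesting ones follow scikit-learn's mean_test_<scorer> and mean_train_<scorer> naming, so you could narrow the view down like this:

# Only keep the averaged train/test accuracy across the folds.
pd.DataFrame(grid.cv_results_)[['mean_train_acc', 'mean_test_acc']]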
At first glance it will look like everything has gone well, but we should be critical. Counting the labels helps us understand what is actually happening.
from collections import Counter
Counter(y)
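If the counts turn out to be heavily skewed, it helps to make that concrete: the share of the most common label is the accuracy you would get by always predicting that label.

counts = Counter(y)
# Accuracy of a model that always predicts the majority class.
max(counts.values()) / sum(counts.values())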
Maybe we should guard ourselves against optimism by using a dummy model as a benchmark.
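As a minimal sketch of what such a benchmark might look like (the exact setup is just an illustration; the dummy module is what we explore next):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')

# Cross-validated accuracy of the baseline; if the logistic regression
# barely beats this number, the high accuracy above was mostly an
# artifact of the class imbalance.
cross_val_score(baseline, X, y, cv=2, scoring='accuracy').mean()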