Calmcode - scikit meta: inflated regression

Use ZeroInflatedRegressor to deal with zeros in the regression label.


Simulate a zero-inflated dataset.

We'll first need to generate a dataset.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklego.meta import ZeroInflatedRegressor

# Note the final line of code in this block. We're setting y=0 for all weekend dates
# while we simulate standard regression data for all the other dates.
df = (pd.DataFrame({'dt': pd.date_range("2018-01-01", "2021-01-01")})
      .assign(x=lambda d: np.random.normal(0, 1, d.shape[0]))
      .assign(weekend = lambda d: (d['dt'].dt.weekday >= 5).astype(np.int16))
      .assign(y=lambda d: np.where(d['weekend'], 0, 1.5 + 0.87 * d['x'] + np.random.normal(0, 0.2, d.shape[0]))))
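As a quick sanity check, it can help to confirm how many labels are exactly zero. The sketch below rebuilds the same dataframe and compares the share of zero labels with the share of weekend rows. The seeded `rng` generator and the names `zero_share` and `weekend_share` are assumptions added here for illustration; they are not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so the check is reproducible.
rng = np.random.default_rng(42)

# Same construction as above: y is 0 on weekends, linear in x otherwise.
df = (pd.DataFrame({'dt': pd.date_range("2018-01-01", "2021-01-01")})
      .assign(x=lambda d: rng.normal(0, 1, d.shape[0]))
      .assign(weekend=lambda d: (d['dt'].dt.weekday >= 5).astype(np.int16))
      .assign(y=lambda d: np.where(d['weekend'], 0,
                                   1.5 + 0.87 * d['x'] + rng.normal(0, 0.2, d.shape[0]))))

# Share of labels that are exactly zero versus share of weekend rows.
zero_share = (df['y'] == 0).mean()
weekend_share = df['weekend'].mean()

print(f"rows: {len(df)}")
print(f"share of zero labels:  {zero_share:.3f}")
print(f"share of weekend rows: {weekend_share:.3f}")
```

Roughly two out of every seven labels are zero, which is the kind of spike at zero that a single linear model struggles to fit.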

Next, we convert this dataframe to an X and y array.

X, y = df[['x', 'weekend']].values, df['y'].values

Benchmarking the ZeroInflatedRegressor

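Finally, we run a small benchmark. The data is generated without a fixed random seed, so your scores will differ slightly from the ones shown in the comments.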

zir = ZeroInflatedRegressor(
    classifier=LogisticRegression(),
    regressor=Ridge()
)

lr = Ridge(random_state=0)

print('ZIR r²:', cross_val_score(zir, X, y).mean()) # ZIR r²: 0.9715677148308327
print(' LR r²:', cross_val_score(lr, X, y).mean())  #  LR r²: 0.8154520977784985
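To build some intuition for why the gap appears, here is a rough sketch of the two-stage idea written by hand with plain scikit-learn: a classifier decides whether the label is non-zero, a regressor is fitted only on the non-zero rows, and the prediction is zero wherever the classifier says so. This is an illustration of the concept under simplified assumptions, not sklego's actual implementation. The data (a repeating seven-day pattern instead of real dates), the seed, and all variable names are made up for this example, and the scores are computed in-sample rather than cross-validated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import r2_score

# Assumption: simplified stand-in data, two "weekend" rows in every seven.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0, 1, n)
weekend = (np.arange(n) % 7 >= 5).astype(int)
y = np.where(weekend, 0.0, 1.5 + 0.87 * x + rng.normal(0, 0.2, n))
X = np.column_stack([x, weekend])

# Stage 1: a classifier learns whether the label is non-zero.
is_nonzero = y != 0
clf = LogisticRegression().fit(X, is_nonzero)

# Stage 2: a regressor is fitted on the non-zero rows only.
reg = Ridge().fit(X[is_nonzero], y[is_nonzero])

# Prediction: use the regressor where the classifier predicts non-zero, else 0.
pred_nonzero = clf.predict(X)
pred_two_stage = np.where(pred_nonzero, reg.predict(X), 0.0)

# Baseline: a single Ridge model fitted on all rows.
pred_plain = Ridge().fit(X, y).predict(X)

r2_two_stage = r2_score(y, pred_two_stage)
r2_plain = r2_score(y, pred_plain)
print('two-stage r² (in-sample):', r2_two_stage)
print('   plain r² (in-sample):', r2_plain)
```

The single Ridge model has to share one slope for x between the weekday rows, where x matters, and the weekend rows, where it does not. Splitting the problem lets the regressor focus on the rows where a linear relationship actually holds.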

You can read more about the possible settings for this tool in the getting started docs and the API docs. A shoutout goes to Robert Kübler for implementing this feature.