Simulate a zero-inflated dataset
We'll first need to generate a dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklego.meta import ZeroInflatedRegressor
# Note the final line of code in this block. We're setting y=0 for all weekend dates
# while we simulate standard regression data for all the other dates.
df = (pd.DataFrame({'dt': pd.date_range("2018-01-01", "2021-01-01")})
      .assign(x=lambda d: np.random.normal(0, 1, d.shape[0]))
      .assign(weekend=lambda d: (d['dt'].dt.weekday >= 5).astype(np.int16))
      .assign(y=lambda d: np.where(d['weekend'], 0, 1.5 + 0.87 * d['x'] + np.random.normal(0, 0.2, d.shape[0]))))
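As a quick sanity check on how zero-inflated this target is, the share of weekend rows can be computed directly, since every weekend row gets y = 0. This is a minimal sketch that rebuilds only the date column; the names `dates`, `weekend` and `zero_share` are introduced here for illustration. With two weekend days per week, the share should land at roughly two sevenths.

```python
import numpy as np
import pandas as pd

# Same date range as the simulated dataset; only the weekend flag matters here.
dates = pd.DataFrame({"dt": pd.date_range("2018-01-01", "2021-01-01")})

# weekday is 0 for Monday through 6 for Sunday, so >= 5 marks Saturday and Sunday.
weekend = (dates["dt"].dt.weekday >= 5).astype(np.int16)

# Every weekend row has y = 0, so this is the share of zero targets.
zero_share = weekend.mean()
print(f"{len(dates)} days, share of zeros: {zero_share:.3f}")
```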
Next, we convert this dataframe into the X and y arrays.
X, y = df[['x', 'weekend']].values, df['y'].values
Benchmarking the ZeroInflatedRegressor
Finally, we run a small benchmark.
zir = ZeroInflatedRegressor(
    classifier=LogisticRegression(),
    regressor=Ridge()
)
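# Baseline: a plain Ridge regression on the same features.
lr = Ridge(random_state=0)
# cross_val_score defaults to 5-fold CV and, for regressors, the r² score.
# The data above is unseeded, so exact values differ from run to run.
print('ZIR r²:', cross_val_score(zir, X, y).mean()) # ZIR r²: 0.9715677148308327
print(' LR r²:', cross_val_score(lr, X, y).mean()) # LR r²: 0.8154520977784985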
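To get a feel for why the two-stage model scores higher here, the idea can be sketched by hand with scikit-learn alone. This is a rough illustration, not scikit-lego's actual implementation. It assumes the classifier learns whether the target is non-zero and that the regressor is fit only on the non-zero rows. The toy data, the seed and all variable names below are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)  # seed chosen only for this illustration

# Toy data in the same spirit as above: y is exactly 0 whenever flag == 1.
n = 700
x = rng.normal(0, 1, n)
flag = (np.arange(n) % 7 >= 5).astype(int)
y = np.where(flag == 1, 0.0, 1.5 + 0.87 * x + rng.normal(0, 0.2, n))
X = np.column_stack([x, flag])

# Stage 1 (assumed): a classifier learns whether the target is non-zero.
nonzero = y != 0
clf = LogisticRegression().fit(X, nonzero)

# Stage 2 (assumed): a regressor is fit on the non-zero rows only.
reg = Ridge().fit(X[nonzero], y[nonzero])

# Prediction: 0 where the classifier says "zero", the regressor's output elsewhere.
pred_nonzero = clf.predict(X)
y_hat = np.where(pred_nonzero, reg.predict(X), 0.0)

print("mean absolute error:", np.abs(y - y_hat).mean())
```

A single Ridge model has to fit the zeros and the linear trend with one set of coefficients, whereas the split above lets each model handle the part it suits.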
You can read more about the possible settings for this tool in the getting started docs and the API docs. A shoutout to Robert Kübler for implementing this feature.