... patsy.

There are many ways to get data from pandas to scikit-learn but when you're hacking in a notebook you may prefer to have something that is expressive. Like a domain specific grammar. The tool patsy offers exactly this by mocking features from the R language.


Here's the function that creates the plot.

def plot_model(mod, df, X, y):
    plt.figure(figsize=(12, 4))
    df_ml = df.copy().assign(pred = mod.predict(X))
    for day in 'Mon,Tues,Wed,Thurs,Fri,Sat,Sun'.split(','):
        df_subset = df_ml.loc[lambda d: d['wday'] == day]
        plt.scatter(df_subset['date'], df_subset['n_born'], color='steelblue')
        plt.plot(df_subset['date'], df_subset['pred'], label=day, linewidth=2)
    plt.title(f"mean absolute error = {np.mean(np.abs(mod.predict(X) - y))}");

The final model that we end up with is created with the code below;

from sklearn.linear_model import LinearRegression
import matplotlib.pylab as plt

df_ml = df_clean.head(2000).loc[lambda d: d['n_born'] > 2000]
y, X = ps.dmatrices("n_born ~ (cc(yday, df=12) + wday + date_to_num(date))**2", df_ml)
mod = LinearRegression().fit(X, y)
plot_model(mod, df_ml, X, y)

Feedback? See an issue? Something unclear? Feel free to mention it here.

If you want to be kept up to date, consider getting the newsletter.