There are many ways to get data from pandas into scikit-learn, but when you're hacking in a notebook you may prefer something expressive, like a domain-specific grammar. The tool patsy offers exactly this by mimicking the formula syntax of the R language.
For this video you'll need to install the following dependencies:
python -m pip install jupyterlab pandas scikit-learn patsy matplotlib
You'll also need the dataset. It can be downloaded directly or via:
The Python code at the beginning of this notebook is:
import patsy as ps
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("birthdays.csv")

def clean_data(dataf):
    return (dataf
            .drop(columns=['Unnamed: 0'])
            .assign(date = lambda d: pd.to_datetime(d['date']))
            .groupby(['date', 'wday', 'month'])
            .agg(n_born=('births', 'sum'))
            .reset_index()
            .assign(yday = lambda d: d['date'].dt.dayofyear))

df_clean = df.pipe(clean_data)
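To give a feel for where this setup leads, here is a minimal sketch of how a patsy formula might feed scikit-learn. It assumes a frame with the same column names as df_clean; the `toy` frame and its values are made up so the snippet runs without birthdays.csv, and the formula is only an illustration, not necessarily the one used in the video.

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for df_clean: same column names, made-up values.
rng = np.random.default_rng(42)
dates = pd.date_range("2000-01-01", periods=60, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "wday": dates.day_name(),
    "yday": dates.dayofyear,
    "n_born": rng.integers(7000, 10000, size=len(dates)),
})

# R-style formula: the left of "~" is the target, the right lists the features.
# A string column such as wday is treated as categorical and dummy encoded.
y, X = ps.dmatrices("n_born ~ wday + yday", data=toy)

# patsy adds an "Intercept" column itself, so scikit-learn need not fit one.
model = LinearRegression(fit_intercept=False)
model.fit(np.asarray(X), np.asarray(y))
predictions = model.predict(np.asarray(X))

# Inspect which columns the formula produced.
print(X.design_info.column_names)
print(predictions.shape)
```

Because the formula is just a string, trying a different feature set is a one-line change, which is the appeal when exploring in a notebook.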