There are many ways to get data from pandas into scikit-learn, but when you're hacking in a notebook you may prefer something more expressive, like a domain-specific grammar. The tool patsy offers exactly this by mimicking the formula syntax of the R language.
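To give a feel for that grammar, here is a minimal sketch on a made-up frame. The column names `y`, `x` and `group` are hypothetical and do not come from the video's dataset. It assumes the usual patsy entry point `dmatrices`, which takes a formula string plus a DataFrame and returns a target matrix and a feature matrix that scikit-learn can consume.

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x": [0.1, 0.4, 0.5, 0.9, 1.1, 1.2],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Left of `~` is the target, right of it are the features.
# C(group) asks patsy to treat the column as categorical (dummy encoding).
y, X = ps.dmatrices("y ~ x + C(group)", toy)

# The design matrix remembers which formula term produced which column.
print(X.design_info.column_names)

# patsy already adds an Intercept column, so scikit-learn is told not to fit one.
# np.asarray turns the patsy matrices into plain NumPy arrays.
model = LinearRegression(fit_intercept=False)
model.fit(np.asarray(X), np.asarray(y).ravel())
pred = model.predict(np.asarray(X))
```

The formula string is the only part that changes when you want to try a different set of features, which is what makes this style convenient in a notebook.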
For this video you'll need to install the following dependencies:
python -m pip install jupyterlab pandas scikit-learn patsy matplotlib
You'll also need the dataset, which you can download directly or fetch via:
wget https://calmcode.io/datasets/birthdays.csv
The Python code at the beginning of this notebook is:
import patsy as ps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv("birthdays.csv")
def clean_data(dataf):
    return (dataf
            .drop(columns=['Unnamed: 0'])
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .groupby(['date', 'wday', 'month'])
            .agg(n_born=('births', 'sum'))
            .reset_index()
            .assign(yday=lambda d: d['date'].dt.dayofyear))
df_clean = df.pipe(clean_data)
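To see what the cleaning step does without downloading the file, the sketch below runs the same pipeline on a handful of made-up rows. The `state` column and every value in it are assumptions for illustration; only the columns that `clean_data` touches are taken from the code above.

```python
import pandas as pd

# Made-up rows shaped like the columns clean_data uses.
# The `state` column and all values are assumptions, not real data.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "state": ["AK", "AL", "AK", "AL"],
    "date": ["1969-01-01", "1969-01-01", "1969-01-02", "1969-01-02"],
    "wday": ["Wed", "Wed", "Thurs", "Thurs"],
    "month": [1, 1, 1, 1],
    "births": [14, 174, 20, 245],
})

def clean_data(dataf):
    return (dataf
            .drop(columns=['Unnamed: 0'])                       # leftover index column
            .assign(date=lambda d: pd.to_datetime(d['date']))   # strings to datetimes
            .groupby(['date', 'wday', 'month'])                 # one group per day
            .agg(n_born=('births', 'sum'))                      # total births per day
            .reset_index()
            .assign(yday=lambda d: d['date'].dt.dayofyear))     # day of the year

toy_clean = raw.pipe(clean_data)
print(toy_clean)
```

Each date ends up as a single row with the summed `n_born` and a numeric `yday` column, while columns that are neither grouped on nor aggregated (here `state`) drop out.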