Sometimes you want to calculate an average per group. Other times you want the average over a window of time. An average over 30 days is much less noisy than a daily figure.
Make a Simple Chart
To keep the chart simple, we'll only take a subset of our data first:
import pandas as pd

df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")

subset_df = (df[['state', 'date', 'births']]
  .assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
  .loc[lambda d: d['state'] == 'CA']
  .tail(365 * 2))
To visualise this timeseries with zoom functionality we can run:
import altair as alt

(alt.Chart(subset_df)
  .mark_line()
  .encode(x='date', y='births')
  .properties(width=600, height=250)
  .interactive())
To calculate a rolling mean, you can call .rolling() on the dataframe. This returns an object that represents rolling subsets of the entire dataframe. When we call .mean() on this object, we get the rolling mean.
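To make the two steps concrete, here is a minimal sketch on a toy series. The values and the window size of 3 are made up for illustration and are not part of the births dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# .rolling(3) returns a Rolling object; nothing is averaged yet.
windows = values.rolling(3)

# Calling .mean() averages each window of 3 consecutive values.
rolling_mean = windows.mean()

print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```

The first two entries are NaN because a full window of three values is not yet available at those positions.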
It's incredibly important that your data is sorted before you run the .rolling() method. The windows are built from neighbouring rows, not from neighbouring dates, so with unsorted data you may be calculating the wrong thing.
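The effect of row order can be seen in a small sketch. The dataframe below is hypothetical toy data with the rows deliberately shuffled. It assumes the frame has a date column to sort on, as subset_df does.

```python
import pandas as pd

# Hypothetical toy data with rows deliberately out of date order.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-04", "2020-01-02"]),
    "births": [30, 10, 40, 20],
})

# Rolling over the rows as stored pairs up days that are not adjacent.
unsorted_mean = toy_df["births"].rolling(2).mean()

# Sorting by date first makes each window cover consecutive days.
sorted_df = toy_df.sort_values("date")
sorted_mean = sorted_df["births"].rolling(2).mean()

print(unsorted_mean.tolist())  # [nan, 20.0, 25.0, 30.0]
print(sorted_mean.tolist())    # [nan, 15.0, 25.0, 35.0]
```

Both calls run without complaint, which is why the mistake is easy to miss: pandas does not check whether the rows are in chronological order.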
Creating new columns
Typically, you'd like to add a new column when you compute such a rolling statistic. You can use .assign() for that.
subset_df.assign(rolling_births=lambda d: d['births'].rolling(10).mean())
The first few rows will have NaN values when you do this, though.
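As a rough illustration of how many rows are affected, the sketch below applies the same .assign() pattern to hypothetical toy data and counts the missing values. The column name births mirrors the dataset; the numbers and the window size are invented.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"births": [10, 20, 30, 40, 50, 60]})
window = 3

# Select the numeric column before rolling, then attach the result as a new column.
result = toy_df.assign(
    rolling_births=lambda d: d["births"].rolling(window).mean()
)

# Rows without a full window behind them are left as NaN.
n_missing = int(result["rolling_births"].isna().sum())

print(result)
print(n_missing)  # 2, which equals window - 1
```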
Preventing Empty Values
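When you run a rolling average with window size n, you get NaN values for the first (n-1) rows because they do not have a full window behind them. To prevent this, you may like using the min_periods parameter, which sets the minimum number of observations a window needs before a value is calculated.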
subset_df.assign(rolling_births=lambda d: d['births'].rolling(10, min_periods=1).mean())
Now the first rows contain the mean of however many values are available so far, rather than NaN.
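A side-by-side comparison on hypothetical toy data shows what min_periods=1 changes. The column names strict and lenient are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"births": [10, 20, 30, 40, 50]})

result = toy_df.assign(
    # Default behaviour: a value only appears once 3 rows are available.
    strict=lambda d: d["births"].rolling(3).mean(),
    # min_periods=1: average whatever rows are available, up to 3.
    lenient=lambda d: d["births"].rolling(3, min_periods=1).mean(),
)

print(result["strict"].tolist())   # [nan, nan, 20.0, 30.0, 40.0]
print(result["lenient"].tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Keep in mind that the early values in the lenient column are averaged over fewer observations, so they are noisier than the rest of the series.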