Sometimes you want to calculate an average per group. Other times you'd want the average over a window of time. An average over 30 days is much less noisy than an a daily figure.
Make a Simple Chart
Let's start by making a chart of our data. We'll use the altair library
for this. You can install it via pip install altair
. If you'd like to learn more about the tool you may enjoy our course!
To keep the chart simple, we'll only take a subset of our data first:
import pandas as pd
df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")
subset_df = (df[['state', 'date', 'births']]
.assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
.loc[lambda d: d['state'] == 'CA']
.tail(365 * 2))
To visualise this timeseries with zoom functionality we can run:
import altair as alt
(alt.Chart(subset_df)
.mark_line()
.encode(x='date', y='births')
.properties(width=600, height=250)
.interactive())
Rolling Average
To calculate a rolling mean, you can call .rolling()
on the dataframe. This
returns an object that represents rolling subsets of the entire dataframe. When
we call .mean()
on this object we can calculate the rolling mean.
subset_df.rolling(10).mean()
It's incredibly important that your data is sorted before you run the .rolling()
method. Otherwise you may be calculating the wrong thing.
Creating new columns
Typically, you'd like to add a new column when you create such a rolling function.
Let's use .assign()
for that.
subset_df.assign(rolling_births=lambda d: d.rolling(10).mean())
The first few rows will have NaN
values when you do this though.
state | date | births | rolling_births |
---|---|---|---|
CA | 1969-01-01 00:00:00 | 824 | nan |
CA | 1969-01-02 00:00:00 | 816 | nan |
CA | 1969-01-03 00:00:00 | 940 | nan |
CA | 1969-01-04 00:00:00 | 906 | nan |
CA | 1969-01-05 00:00:00 | 804 | nan |
CA | 1969-01-06 00:00:00 | 922 | nan |
CA | 1969-01-07 00:00:00 | 866 | nan |
Preventing Empty Values
When you run a rolling average with window size n
, you typically get a lot of NaN
values
because the first (n-1)
rows cannot be used. To prevent this, you may like using the
min_periods
argument.
subset_df.assign(rolling_births=lambda d: d.rolling(10, min_periods=1).mean())
Now, the results looks something like:
state | date | births | rolling_births |
---|---|---|---|
CA | 1969-01-01 00:00:00 | 824 | 824 |
CA | 1969-01-02 00:00:00 | 816 | 820 |
CA | 1969-01-03 00:00:00 | 940 | 860 |
CA | 1969-01-04 00:00:00 | 906 | 871.5 |
CA | 1969-01-05 00:00:00 | 804 | 858 |
CA | 1969-01-06 00:00:00 | 922 | 868.667 |
CA | 1969-01-07 00:00:00 | 866 | 868.286 |