Calmcode - pandas datetime: rolling stats

Rolling average in pandas.

1 2 3 4 5 6

Sometimes you want to calculate an average per group. Other times you'd want the average over a window of time. An average over 30 days is much less noisy than an a daily figure.

Make a Simple Chart

Let's start by making a chart of our data. We'll use the altair library for this. You can install it via pip install altair. If you'd like to learn more about the tool you may enjoy our course!

To keep the chart simple, we'll only take a subset of our data first:

import pandas as pd

df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")

subset_df = (df[['state', 'date', 'births']]
.assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
.loc[lambda d: d['state'] == 'CA']
.tail(365 * 2))

To visualise this timeseries with zoom functionality we can run:

import altair as alt

(alt.Chart(subset_df)
  .mark_line()
  .encode(x='date', y='births')
  .properties(width=600, height=250)
  .interactive())

Rolling Average

To calculate a rolling mean, you can call .rolling() on the dataframe. This returns an object that represents rolling subsets of the entire dataframe. When we call .mean() on this object we can calculate the rolling mean.

subset_df.rolling(10).mean()

It's incredibly important that your data is sorted before you run the .rolling() method. Otherwise you may be calculating the wrong thing.

Creating new columns

Typically, you'd like to add a new column when you create such a rolling function. Let's use .assign() for that.

subset_df.assign(rolling_births=lambda d: d.rolling(10).mean())

The first few rows will have NaN values when you do this though.

state	date	births	rolling_births
CA	1969-01-01 00:00:00	824	nan
CA	1969-01-02 00:00:00	816	nan
CA	1969-01-03 00:00:00	940	nan
CA	1969-01-04 00:00:00	906	nan
CA	1969-01-05 00:00:00	804	nan
CA	1969-01-06 00:00:00	922	nan
CA	1969-01-07 00:00:00	866	nan

Preventing Empty Values

When you run a rolling average with window size n, you typically get a lot of NaN values because the first (n-1) rows cannot be used. To prevent this, you may like using the min_periods argument.

subset_df.assign(rolling_births=lambda d: d.rolling(10, min_periods=1).mean())

Now, the results looks something like:

state	date	births	rolling_births
CA	1969-01-01 00:00:00	824	824
CA	1969-01-02 00:00:00	816	820
CA	1969-01-03 00:00:00	940	860
CA	1969-01-04 00:00:00	906	871.5
CA	1969-01-05 00:00:00	804	858
CA	1969-01-06 00:00:00	922	868.667
CA	1969-01-07 00:00:00	866	868.286