Calmcode - pandas datetime: rolling stats

Rolling average in pandas.

1 2 3 4 5 6

Sometimes you want to calculate an average per group. Other times you'd want the average over a window of time. An average over 30 days is much less noisy than an a daily figure.

Make a Simple Chart

Let's start by making a chart of our data. We'll use the altair library for this. You can install it via pip install altair. If you'd like to learn more about the tool you may enjoy our course!

To keep the chart simple, we'll only take a subset of our data first:

import pandas as pd

df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")

subset_df = (df[['state', 'date', 'births']]
.assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
.loc[lambda d: d['state'] == 'CA']
.tail(365 * 2))

To visualise this timeseries with zoom functionality we can run:

import altair as alt

(alt.Chart(subset_df)
  .mark_line()
  .encode(x='date', y='births')
  .properties(width=600, height=250)
  .interactive())

Rolling Average

To calculate a rolling mean, you can call .rolling() on the dataframe. This returns an object that represents rolling subsets of the entire dataframe. When we call .mean() on this object we can calculate the rolling mean.

subset_df.rolling(10).mean()

It's incredibly important that your data is sorted before you run the .rolling() method. Otherwise you may be calculating the wrong thing.

Creating new columns

Typically, you'd like to add a new column when you create such a rolling function. Let's use .assign() for that.

subset_df.assign(rolling_births=lambda d: d.rolling(10).mean())

The first few rows will have NaN values when you do this though.

state date births rolling_births
CA 1969-01-01 00:00:00 824 nan
CA 1969-01-02 00:00:00 816 nan
CA 1969-01-03 00:00:00 940 nan
CA 1969-01-04 00:00:00 906 nan
CA 1969-01-05 00:00:00 804 nan
CA 1969-01-06 00:00:00 922 nan
CA 1969-01-07 00:00:00 866 nan

Preventing Empty Values

When you run a rolling average with window size n, you typically get a lot of NaN values because the first (n-1) rows cannot be used. To prevent this, you may like using the min_periods argument.

subset_df.assign(rolling_births=lambda d: d.rolling(10, min_periods=1).mean())

Now, the results looks something like:

state date births rolling_births
CA 1969-01-01 00:00:00 824 824
CA 1969-01-02 00:00:00 816 820
CA 1969-01-03 00:00:00 940 860
CA 1969-01-04 00:00:00 906 871.5
CA 1969-01-05 00:00:00 804 858
CA 1969-01-06 00:00:00 922 868.667
CA 1969-01-07 00:00:00 866 868.286