Calmcode - pandas datetime: resample

Grouping date ranges in pandas time series using resample.

1 2 3 4 5 6

Pandas can convert datetime frequencies by resampling. This allows you to turn per-second data into 5-minutely data. This is a very common task when you're dealing with timeseries. So let's explore how this works.

Timeseries

Let's start by making a subset of our data.

subset_df = (df
      .assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
      [['state', 'date', 'births']]
      .loc[lambda d: d['state'] == 'CA'])

This dataframe represents the births in the state of California. Here's the top 5 rows.

state date births
CA 1969-01-01 00:00:00 824
CA 1969-01-02 00:00:00 816
CA 1969-01-03 00:00:00 940
CA 1969-01-04 00:00:00 906
CA 1969-01-05 00:00:00 804

If you're interested in calculating aggregates here you could could generate a grouping-feature, like year, pass it in a group-by and aggregate. There is a more convenient method though, which involves using the .resample method.

Using resample

To use .resample() you'll need to make sure that the dataframe has an index that's a datetime column first. Then you'll be able to call resample, which acts kind of like a group-by but has a convenient string-syntax to declare time windows. After that you'll be able to call an aggregation method to summarise the data.

Calculate the sum per year.

This example calculates the total birhts per year in California.

subset_df.set_index('date').resample('Y').sum()
date births
1969-12-31 00:00:00 352858
1970-12-31 00:00:00 362682
1971-12-31 00:00:00 329816
1972-12-31 00:00:00 306538
1973-12-31 00:00:00 298062

Calculate the average per two weeks.

This example calculates the average births every two weeks in California. Note that we're able to use a number here to indicate we're dealing with two weeks instead of one!

subset_df.set_index('date').resample('2W').sum()
date births
1969-01-05 00:00:00 858
1969-01-19 00:00:00 924.429
1969-02-02 00:00:00 913.143
1969-02-16 00:00:00 923.714
1969-03-02 00:00:00 879.867

Together with Groupby

In the previous example we calculated statistics but we only calculated statistics for the state of California. We could combine the use of .resample with .groupby to calculate these numbers for all states. If you do this, remember that you can only use .resample() after .groupby(), not the other way around.

(df[['state', 'date', 'births']]
      .assign(date=lambda d: pd.to_datetime(d['date'], format='%Y-%m-%d'))
      .set_index('date')
      .groupby('state')
      .resample('2W').sum())

The top 5 rows of this query look like:

state date births
AK 1969-01-05 00:00:00 88
AK 1969-01-19 00:00:00 246
AK 1969-02-02 00:00:00 246
AK 1969-02-16 00:00:00 244
AK 1969-03-02 00:00:00 260

The bottom 5 of this table look like:

state date births
WY 1988-11-13 00:00:00 243
WY 1988-11-27 00:00:00 223
WY 1988-12-11 00:00:00 271
WY 1988-12-25 00:00:00 238
WY 1989-01-08 00:00:00 114

Offsets in Resample

There are many ways to group dates together using the string-like syntax in resample. The table below shows a subset of all possible offsets.

Alias Description
B business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
Q quarter end frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds

This table represents a subset of all offsets found on the pandas documentation

Summary Methods in Resample

Typically, you'll be interested in calculating the .mean() or the .sum(). It's good to know though that many other methods are also supported.

Method Description
.count() Compute count of group, excluding missing values.
.nunique() Return number of unique elements in the group.
.first() Compute first of group values.
.last() Compute last of group values.
.max() Compute max of group values.
.mean() Compute mean of groups, excluding missing values.
.median() Compute median of groups, excluding missing values.
.min() Compute min of group values.
.size() Compute group sizes.
.sum() Compute sum of group values.
.quantile(q) Return value at the given quantile.

A full list of descriptive statistics can be found on the pandas documentation.

Caveat

There is also a more general .agg() method which is similar to the .agg() method you may know from .groupby-objects but there are subtle differences to be aware of. One of the main difference is that named aggregations are not impemented yet at the time of writing this document.