Pandas can convert between datetime frequencies by resampling. This allows you to turn, say, per-second data into five-minute data. It's a very common task when you're dealing with timeseries, so let's explore how this works.
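To make that concrete, here's a minimal sketch on synthetic data (not the births dataset used below) that downsamples per-second measurements into five-minute averages.
import pandas as pd
import numpy as np

# One synthetic measurement per second for an hour.
idx = pd.date_range("2023-01-01", periods=3600, freq="S")
per_second = pd.DataFrame({"value": np.random.rand(3600)}, index=idx)

# Downsample to five-minute buckets by taking the mean of each bucket.
per_five_min = per_second.resample("5min").mean()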
Timeseries
Let's start by making a subset of our data.
subset_df = (df
  .assign(date=lambda d: pd.to_datetime(d['date'], format="%Y-%m-%d"))
  [['state', 'date', 'births']]
  .loc[lambda d: d['state'] == 'CA'])
This dataframe represents the births in the state of California. Here are the top 5 rows.
state | date | births |
---|---|---|
CA | 1969-01-01 00:00:00 | 824 |
CA | 1969-01-02 00:00:00 | 816 |
CA | 1969-01-03 00:00:00 | 940 |
CA | 1969-01-04 00:00:00 | 906 |
CA | 1969-01-05 00:00:00 | 804 |
If you're interested in calculating aggregates here, you could generate a grouping feature, like the year, pass it to a group-by and aggregate. There is a more convenient method though, which involves using the .resample method.
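For contrast, this is roughly what that manual group-by approach looks like; a minimal sketch that reuses the subset_df defined above.
(subset_df
  .assign(year=lambda d: d['date'].dt.year)
  .groupby('year')
  ['births']
  .sum())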
Using resample
To use .resample() you'll first need to make sure that the dataframe has a datetime index. Then you'll be able to call resample, which acts kind of like a group-by but has a convenient string syntax to declare time windows. After that you'll be able to call an aggregation method to summarise the data.
Calculate the sum per year.
This example calculates the total births per year in California.
subset_df.set_index('date').resample('Y').sum()
date | births |
---|---|
1969-12-31 00:00:00 | 352858 |
1970-12-31 00:00:00 | 362682 |
1971-12-31 00:00:00 | 329816 |
1972-12-31 00:00:00 | 306538 |
1973-12-31 00:00:00 | 298062 |
Calculate the average per two weeks.
This example calculates the average births every two weeks in California. Note that we're able to use a number here to indicate we're dealing with two weeks instead of one!
subset_df.set_index('date').resample('2W').mean()
date | births |
---|---|
1969-01-05 00:00:00 | 858 |
1969-01-19 00:00:00 | 924.429 |
1969-02-02 00:00:00 | 913.143 |
1969-02-16 00:00:00 | 923.714 |
1969-03-02 00:00:00 | 879.867 |
Together with Groupby
In the previous examples we calculated statistics, but only for the state of California. We can combine .resample with .groupby to calculate these numbers for all states. If you do this, remember that you can only use .resample() after .groupby(), not the other way around.
(df[['state', 'date', 'births']]
  .assign(date=lambda d: pd.to_datetime(d['date'], format='%Y-%m-%d'))
  .set_index('date')
  .groupby('state')
  .resample('2W').sum())
The top 5 rows of this query look like:
state | date | births |
---|---|---|
AK | 1969-01-05 00:00:00 | 88 |
AK | 1969-01-19 00:00:00 | 246 |
AK | 1969-02-02 00:00:00 | 246 |
AK | 1969-02-16 00:00:00 | 244 |
AK | 1969-03-02 00:00:00 | 260 |
The bottom 5 rows of this table look like:
state | date | births |
---|---|---|
WY | 1988-11-13 00:00:00 | 243 |
WY | 1988-11-27 00:00:00 | 223 |
WY | 1988-12-11 00:00:00 | 271 |
WY | 1988-12-25 00:00:00 | 238 |
WY | 1989-01-08 00:00:00 | 114 |
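As an aside, you can often get the same result with a single .groupby() call by passing pd.Grouper for the date column; the sketch below assumes the same df as above and may differ slightly in how empty periods are handled.
(df[['state', 'date', 'births']]
  .assign(date=lambda d: pd.to_datetime(d['date'], format='%Y-%m-%d'))
  .groupby(['state', pd.Grouper(key='date', freq='2W')])
  ['births']
  .sum())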
Offsets in Resample
There are many ways to group dates together using the string-like syntax in resample. The table below shows a subset of all possible offsets.
Alias | Description |
---|---|
B | business day frequency |
D | calendar day frequency |
W | weekly frequency |
M | month end frequency |
Q | quarter end frequency |
H | hourly frequency |
T, min | minutely frequency |
S | secondly frequency |
L, ms | milliseconds |
U, us | microseconds |
N | nanoseconds |
This table represents a subset of all offsets found on the pandas documentation
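For example, switching the yearly totals from before to quarter-end or month-end totals is just a matter of changing the offset string; a small sketch that reuses subset_df from above.
subset_df.set_index('date').resample('Q').sum()  # totals per quarter
subset_df.set_index('date').resample('M').sum()  # totals per month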
Summary Methods in Resample
Typically, you'll be interested in calculating the .mean()
or the .sum()
. It's
good to know though that many other methods are also supported.
Method | Description |
---|---|
.count() |
Compute count of group, excluding missing values. |
.nunique() |
Return number of unique elements in the group. |
.first() |
Compute first of group values. |
.last() |
Compute last of group values. |
.max() |
Compute max of group values. |
.mean() |
Compute mean of groups, excluding missing values. |
.median() |
Compute median of groups, excluding missing values. |
.min() |
Compute min of group values. |
.size() |
Compute group sizes. |
.sum() |
Compute sum of group values. |
.quantile(q) |
Return value at the given quantile. |
A full list of descriptive statistics can be found on the pandas documentation.
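Any of these can be chained onto a resample just like .sum(); here's a small sketch, again reusing subset_df, that computes the yearly median and a yearly 90th percentile of the births column.
subset_df.set_index('date')['births'].resample('Y').median()        # median births per year
subset_df.set_index('date')['births'].resample('Y').quantile(0.9)   # 90th percentile per year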
Caveat
There is also a more general .agg() method which is similar to the .agg() method you may know from .groupby objects, but there are subtle differences to be aware of. One of the main differences is that named aggregations are not implemented yet at the time of writing this document.
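The list- and dict-style aggregations do work though; here's a minimal sketch, again reusing subset_df, that computes the yearly sum and mean of births in one call.
subset_df.set_index('date').resample('Y').agg({'births': ['sum', 'mean']})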