Calmcode - pandas pipe: logs

Adding logging to Pandas Pipelines with decorators


The code that follows is advanced (especially if you're new to decorators), but in Python you can write a function that decorates another function. If you'd appreciate a refresher on how decorators work, feel free to check out our course on decorators.

Here's an example meant for pandas pipelines.

from functools import wraps
import datetime as dt

def log_step(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        # time_taken is a timedelta string such as "0:00:00.012345", so no unit suffix is added
        print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}")
        return result
    return wrapper

You can use this code to decorate your pipeline steps.

import pandas as pd

df = pd.read_csv('https://calmcode.io/datasets/bigmac.csv')

@log_step
def start_pipeline(dataf):
    return dataf.copy()

@log_step
def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))

@log_step
def remove_outliers(dataf, min_row_country=32):
    countries = (dataf
                .groupby('currency_code')
                .agg(n=('name', 'count'))
                .loc[lambda d: d['n'] >= min_row_country]
                .index)
    return (dataf
            .loc[lambda d: d['currency_code'].isin(countries)])

When you run this code, each decorated step prints a log line as a side effect.

(df
  .pipe(start_pipeline)
  .pipe(set_dtypes)
  .pipe(remove_outliers, min_row_country=20))
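
To try the decorator without downloading anything, here is a minimal sketch that runs two of the steps on a tiny in-memory dataframe. The `toy` dataframe and its values are made up for illustration and only mimic the column names used in the steps; they are not taken from the bigmac dataset.

```python
from functools import wraps
import datetime as dt

import pandas as pd


def log_step(func):
    @wraps(func)  # keeps func.__name__ intact so the log shows the real step name
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}")
        return result
    return wrapper


@log_step
def start_pipeline(dataf):
    return dataf.copy()


@log_step
def remove_outliers(dataf, min_row_country=32):
    # Keep only currencies that have at least `min_row_country` rows.
    countries = (dataf
                 .groupby('currency_code')
                 .agg(n=('name', 'count'))
                 .loc[lambda d: d['n'] >= min_row_country]
                 .index)
    return dataf.loc[lambda d: d['currency_code'].isin(countries)]


# Hypothetical toy data: three rows for currency "AAA", one row for "BBB".
toy = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B'],
    'currency_code': ['AAA', 'AAA', 'AAA', 'BBB'],
    'date': ['2020-01-01', '2020-07-01', '2021-01-01', '2020-01-01'],
})

result = (toy
          .pipe(start_pipeline)
          .pipe(remove_outliers, min_row_country=2))
```

Each step prints one line. The reported shape goes from `(4, 3)` after `start_pipeline` to `(3, 3)` after `remove_outliers`, because the single "BBB" row falls below the threshold. The timing part of each line varies from run to run.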

Note

We're only showing the example here with a print statement. A better next step would be to use Python's logging framework. That's out of scope for this series of videos, but if you're interested you can learn more in our logging course. Alternatively, you might also enjoy the implementation in scikit-lego.
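
As a rough idea of what that next step could look like, here is a minimal sketch that swaps the print statement for the standard library's `logging` module. The logger name `"pipeline"` and the message format are assumptions made for this example; this is not the scikit-lego implementation.

```python
import logging
import time
from functools import wraps

import pandas as pd

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("pipeline")


def log_step(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - tic
        # Lazy %-style arguments: the message is only formatted if INFO is enabled.
        logger.info("just ran step %s shape=%s took %.4fs",
                    func.__name__, result.shape, elapsed)
        return result
    return wrapper


@log_step
def start_pipeline(dataf):
    return dataf.copy()


# Without some configuration, INFO messages are not shown.
logging.basicConfig(level=logging.INFO)

out = pd.DataFrame({'a': [1, 2, 3]}).pipe(start_pipeline)
```

Compared with print, this lets you silence the pipeline logs by raising the log level, or route them to a file by attaching a handler, without touching the pipeline steps themselves.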