Calmcode - polars: over expresssions

You can get very expessive in polars with the `.over()` expresson.

1 2 3 4 5 6 7 8 9

We still need to remove bots from our dataset. Before doing that though, we've taken the libery to turn our sessionization code into a function so we may use it in a pipeline.

import polars as pl

# First, "read" the data.
df = pl.read_csv("wowah_data.csv", parse_dates=False)
df.columns = [c.replace(" ", "") for c in df.columns]
df = df.lazy()

def set_types(dataf):
    return (dataf.with_columns([
                pl.col("guild") != -1,
                pl.col("timestamp").str.strptime(pl.Datetime, fmt="%m/%d/%y %H:%M:%S"),
            ]))

def sessionize(dataf, threshold=1_000_000):
    return (dataf
             .sort(["char", "timestamp"])
             .with_columns([
                 (pl.col("timestamp").diff().cast(pl.Int64) > threshold).fill_null(True).alias("ts_diff"),
                 (pl.col("char").diff() != 0).fill_null(True).alias("char_diff"),
             ])
             .with_columns([
                 (pl.col("ts_diff") | pl.col("char_diff")).alias("new_session_mark")
             ])
             .with_columns([
                 pl.col("new_session_mark").cumsum().alias("session")
             ])
             .drop(['char_diff', 'ts_diff', 'new_session_mark']))

(df
 .pipe(set_types)
 .pipe(sessionize, threshold=20 * 60 * 1000))

Filtering Bots

There are a number of ways to filter bots from a dataset. A simple method for this dataset is to remove all users that have a session length that is too long. No human would play for 24 hours straight.

That means we need to calculate two new columns;

  1. We need to have a column that represents the session length.
  2. We need to have a column that represents the longest session length per character.

Partitions with .over()

You might be temped to think that you'll need to run a query with a group_by and join statement at this join. Although you certainly could do this, there's a nicer way in polars. You can simply keep using the expressions that we've used before. You merely need to indicate that you want to calculate statistics .over() a column. This way, you only need a one-liner!

(df
 .pipe(set_types)
 .pipe(sessionize, threshold=20 * 60 * 1000)
 .with_columns([
    pl.col("char").count().over("session").alias("session_length"),
    pl.col("session").n_unique().over("char").alias("n_sessions_per_char")
 ]))

With these two columns set, it's very easy to filter away the bots.

Pay Attention

When you're building expressions you should check that the order makes sense. This would work;

pl.col("char").count().over("session").alias("session_length"),

This wouldn't;

pl.col("char").over("session").count().alias("session_length"),

The .over() command needs to follow an aggregating expression. Not the other way around. If you're used to pandas then you may hit your head a few times on this.