We still need to remove bots from our dataset. Before doing that though, we've taken the libery to turn our sessionization code into a function so we may use it in a pipeline.
import polars as pl
# First, "read" the data.
df = pl.read_csv("wowah_data.csv", parse_dates=False)
df.columns = [c.replace(" ", "") for c in df.columns]
df = df.lazy()
def set_types(dataf):
return (dataf.with_columns([
pl.col("guild") != -1,
pl.col("timestamp").str.strptime(pl.Datetime, fmt="%m/%d/%y %H:%M:%S"),
]))
def sessionize(dataf, threshold=1_000_000):
return (dataf
.sort(["char", "timestamp"])
.with_columns([
(pl.col("timestamp").diff().cast(pl.Int64) > threshold).fill_null(True).alias("ts_diff"),
(pl.col("char").diff() != 0).fill_null(True).alias("char_diff"),
])
.with_columns([
(pl.col("ts_diff") | pl.col("char_diff")).alias("new_session_mark")
])
.with_columns([
pl.col("new_session_mark").cumsum().alias("session")
])
.drop(['char_diff', 'ts_diff', 'new_session_mark']))
(df
.pipe(set_types)
.pipe(sessionize, threshold=20 * 60 * 1000))
Filtering Bots
There are a number of ways to filter bots from a dataset. A simple method for this dataset is to remove all users that have a session length that is too long. No human would play for 24 hours straight.
That means we need to calculate two new columns;
- We need to have a column that represents the session length.
- We need to have a column that represents the longest session length per character.
Partitions with .over()
You might be temped to think that you'll need to run a query with a group_by
and join
statement at this join. Although you certainly could do this, there's a nicer way in polars. You can simply keep using the expressions that we've used before. You merely need to indicate that you want to calculate statistics .over()
a column. This way, you only need a one-liner!
(df
.pipe(set_types)
.pipe(sessionize, threshold=20 * 60 * 1000)
.with_columns([
pl.col("char").count().over("session").alias("session_length"),
pl.col("session").n_unique().over("char").alias("n_sessions_per_char")
]))
With these two columns set, it's very easy to filter away the bots.
Pay Attention
When you're building expressions you should check that the order makes sense. This would work;
pl.col("char").count().over("session").alias("session_length"),
This wouldn't;
pl.col("char").over("session").count().alias("session_length"),
The .over()
command needs to follow an aggregating expression. Not the other way around. If you're used to pandas then you may hit your head a few times on this.