Calmcode - polars: with_columns

How to add columns (or mutate) in polars.

Adding Columns

In pandas you may be used to calling .assign() when you want to add a new column. In polars you'd use the with_columns method instead. The example below demonstrates how you might use it.

import polars as pl

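# Dates are left as strings here (the default) and parsed explicitly below.
# The original `parse_dates=False` argument was renamed to `try_parse_dates` in later polars releases.
df = pl.read_csv("wowah_data.csv")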
df.columns = [c.replace(" ", "") for c in df.columns]
df = df.lazy()

# Note that you need to call `.collect()` if you want to see results.
df.with_columns([
    pl.col("guild") != -1,
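    # The format is passed positionally; its keyword was renamed from `fmt` to `format` in later polars releases.
    pl.col("timestamp").str.strptime(pl.Datetime, "%m/%d/%y %H:%M:%S"),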
]).collect()

The idea here is that you can pass a list of "expressions". You'd use the pl.col() function to refer to a pre-existing column.

Expressions

Polars expressions can be used in various contexts and produce new columns. Just to compare, in pandas you'd write:

df.assign(guild=lambda d: d['guild'] != -1)

In polars you'd write this:

df.with_columns([
    pl.col("guild") != -1
])

There are a few differences here worth pointing out.

  1. Pandas is given a (lambda) function that .assign() calls. This function describes how the new column is created. You could use numpy inside of it, but also slow list comprehensions if you'd like. It also runs immediately when .assign() is called. That means pandas has no opportunity to inspect what's inside of the function, which rules out any optimisation.
  2. Polars uses a reference to a column via pl.col("guild"). The expression describes "what" needs to happen, but not "how". That means polars can leave it up to the internal Rust code to figure out the most performant way to compute it. It also leaves room for parallelisation.

Syntax

Expressions can be constructed by chaining together many commands, like below:

df.with_columns([
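    # The format is passed positionally; its keyword was renamed from `fmt` to `format` in later polars releases.
    pl.col("timestamp").str.strptime(pl.Datetime, "%m/%d/%y %H:%M:%S"),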
])

Just like pandas, polars uses .str and .dt submodules to expose methods for string and datetime columns. Although it should feel familiar to a pandas user, there are minor differences in the API. You can check the docs for expressions that work on columns.