Adding Columns
In pandas you may be used to calling .assign() when you want to add a new column. In polars you'd use the with_columns method instead. The example below demonstrates how you might use it.
import polars as pl
df = pl.read_csv("wowah_data.csv", parse_dates=False)
df.columns = [c.replace(" ", "") for c in df.columns]
df = df.lazy()
# Note that you need to call `.collect()` if you want to see results.
df.with_columns([
    pl.col("guild") != -1,
    pl.col("timestamp").str.strptime(pl.Datetime, fmt="%m/%d/%y %H:%M:%S"),
]).collect()
The idea here is that you can pass a list of "expressions". You'd use the pl.col() function to refer to a pre-existing column.
Expressions
Polars expressions can be used in various contexts and produce new columns. For comparison, in pandas you'd write:
df.assign(guild=lambda d: d['guild'] != -1)
In polars you'd write:
df.with_columns([
    pl.col("guild") != -1
])
There are a few differences here worth pointing out.
- Pandas needs a (lambda) function to be called by .assign(). This function describes how a new column is created. You could use numpy, but also slow list comprehensions if you'd like. It also runs immediately when .assign() is called. That means there's no opportunity for pandas to inspect what's inside the function, which rules out any optimisation.
- Polars uses a reference to a column via pl.col("guild"). It describes what needs to happen, but doesn't describe how. That means that polars can leave it up to the internal Rust code to figure out the most performant way to compute it. It also opens the door to parallelisation.
Syntax
Expressions can be constructed by chaining together many commands, like below:
df.with_columns([
    pl.col("timestamp").str.strptime(pl.Datetime, fmt="%m/%d/%y %H:%M:%S"),
])
Just like pandas, polars uses .str and .dt submodules to expose methods for string and datetime columns. Although it should feel familiar to a pandas user, there are minor differences in the API. You can check the docs for expressions that work on columns.