Calmcode - polars: read csv

Reading CSV files in Polars.


The Dataset

If you want to follow along, you can download the "World of Warcraft Avatar History" dataset from Kaggle. It's about 644 MB.

Reading Data

Reading with Pandas

To check the first five rows in pandas you can run:

import pandas as pd

df = pd.read_csv("wowah_data.csv")
df.columns = [c.replace(" ", "") for c in df.columns]
df.head(5)

On our machine this takes about 6.14s.
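If you want to measure a timing like this yourself outside of a notebook, a small helper around `time.perf_counter` is one way to do it. This is only a sketch: the `timed` helper is a made-up name for this example, and your numbers will depend on your machine and disk.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without the dataset.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f}s")

# With the dataset on disk, the same helper could wrap the read:
# df, seconds = timed(pd.read_csv, "wowah_data.csv")
```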

Reading with Polars

You could do the same in polars via:

import polars as pl

df = pl.read_csv("wowah_data.csv", parse_dates=False)

df.head(5)

This only takes 604ms. That's roughly ten times faster, but right now the data is still being loaded into memory. There's another speedup that we can try.

Reading, but with Laziness!

If you really want to supercharge your data reading you can also run:

df = pl.scan_csv("wowah_data.csv")
df.fetch(5)

The pl.scan_csv function won't actually read all the data into memory. It gives you a lazy frame instead, and if you then call .fetch(5) you'll retrieve 5 (potentially random) rows of data. Because there's no concern about ordering, polars doesn't need to load in all the data, which is why it only takes 896 µs!

Content

To make the queries in this course just a bit faster we'll read the data into memory first, but from there on we will keep it lazy. You can turn an eagerly loaded dataframe into a lazy one by calling:

df = pl.read_csv("wowah_data.csv", parse_dates=False)
df = df.lazy()