The Dataset
If you want to follow along, you can download the "World of Warcraft Avatar History" dataset from Kaggle. It's about 644 MB.
Reading Data
Reading with Pandas
To check the first five rows in pandas you can run:
import pandas as pd
df = pd.read_csv("wowah_data.csv")
df.columns = [c.replace(" ", "") for c in df.columns]
df.head(5)
On our machine this takes about 6.14s.
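If you'd like to reproduce timings like this one yourself, a small helper around time.perf_counter is enough. This is only a sketch: the name timed is made up for illustration, and the workload below is a stand-in so the snippet runs without the dataset.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (hypothetical); swap in a real read to time it.
result, elapsed = timed(sum, range(1_000_000))
print(f"took {elapsed:.4f}s")
```

To time the pandas read, you could call timed(pd.read_csv, "wowah_data.csv") instead. Expect the numbers to differ from machine to machine.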
Reading with Polars
You could do the same in polars via:
import polars as pl
df = pl.read_csv("wowah_data.csv", parse_dates=False)
df.head(5)
This takes only 604ms, roughly ten times faster than pandas. The data is still being loaded into memory in full, though, so there is one more speedup available.
Reading, but with Laziness!
If you really want to supercharge your data reading you can also run:
df = pl.scan_csv("wowah_data.csv")
df.fetch(5)
The pl.scan_csv method won't read all the data into memory. If you then call .fetch(5), you'll retrieve 5 (potentially random) rows of data. Because there's no concern about ordering, we don't need to load all the data, which is why it only takes 896 µs!
Content
To make the queries in this course just a bit faster we'll read the data into memory first, but from there on we will keep it lazy. You can turn an eagerly loaded dataframe into a lazy one by calling:
df = pl.read_csv("wowah_data.csv", parse_dates=False)
df = df.lazy()