This series is about the Polars project, an alternative to pandas that has been implemented from the ground up in Rust. It's one of the fastest dataframe implementations out there, and in this series of videos we'll benchmark the library while also explaining how to use it.
Prerequisites
This series will assume that you're already familiar with pandas. It may also help to have a gut feeling for what a compiler might do. If you haven't done so before, we recommend checking:
- The pandas pipe course.
- The numba course.
Install
We're going to use polars and pandas in this series; you can install both via:
python -m pip install polars pandas
Pandas vs. Polars
Pandas is a great tool, but there's a limit to what we might expect from its performance. This can be explained by considering a query like the one below.
(pd.read_csv("path/to/file.csv")
   .groupby("city")
   .agg(n_addresses=("address", "nunique"),
        n_people=("name", "size"))
   .loc[lambda d: d["city"] == "Amsterdam"])
If you consider the end result of this query, you may realize that a lot of unnecessary compute is happening: the whole file is aggregated before all but one city is thrown away. The query would be a lot faster if we rewrote it as:
(pd.read_csv("path/to/file.csv")
   .loc[lambda d: d["city"] == "Amsterdam"]
   .groupby("city")
   .agg(n_addresses=("address", "nunique"),
        n_people=("name", "size")))
But there are two issues here.
- The pd.read_csv call may still be reading a lot of data that's not relevant to us. You could argue that's wasted compute power.
- While this particular query is simple to fix, we should be mindful that larger queries won't be as easy to address.
In general, it would be better if we could just describe what we're interested in and leave it up to the library to figure out the most performant way to get the right data.
This is one of the goals of the polars project. It tries to be a lightweight pandas alternative that's able to handle much larger files. It's implemented in Rust and, as we'll see, can deliver impressive speedups.
Let's dive in!