You can run the benchmark shown in this video yourself by downloading this notebook. As always with benchmarks, please take it with a grain of salt. There are certainly things we could do to make the pandas code much faster. The goal here was to see what speedups might be possible if we write code in a similar style in both libraries.
That said, it seems like an 80x speedup is possible, which is very impressive.
You can also have a look at the h2o benchmarks to confirm that polars performs very well there too. Those benchmarks also deserve to be taken with a grain of salt, because they may not resemble a task that you're dealing with. But the point remains: polars is pretty fast.
Again, one needs to keep the grain of salt in mind. Pandas is still a great tool and it's certainly not "slow" for many tasks. It even comes with lots of features, some of which polars may never have. Polars doesn't come with an index and it currently does not integrate directly with many plotting libraries. Polars is also a relatively new project, and there may be many lessons it still has to learn as the community uses it more and more, while pandas has been battle-hardened for years.
Polars is cool beans.
There's a lot to like about polars. It's fast and it comes with a novel expressions API that makes it very easy to handle tasks like sessionization, which are very common when you're dealing with weblogs.
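In case the term is unfamiliar: sessionization means splitting each user's events into sessions, starting a new session whenever the gap between two consecutive events gets too large. Here is a minimal plain-Python sketch of the idea. Note that this is not polars code, and the function name `sessionize`, the toy `events` list, and the 30-minute threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


def sessionize(events, max_gap=1800):
    """Attach a per-user session number to (user, timestamp) events.

    A new session starts when the gap to the user's previous event
    is larger than max_gap. Timestamps are plain seconds here, and
    the 1800-second default is an assumed threshold.
    """
    last_seen = {}                 # user -> timestamp of the previous event
    session_no = defaultdict(int)  # user -> current session counter
    out = []
    # Sort by user, then by time, so gaps are computed within each user.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        if user in last_seen and ts - last_seen[user] > max_gap:
            session_no[user] += 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out


# Hypothetical weblog: user "a" comes back after a long pause.
events = [("a", 0), ("a", 600), ("a", 4000), ("b", 100), ("b", 200)]
result = sessionize(events)
```

The explicit loop and bookkeeping dictionaries are the kind of thing an expressions API lets you state declaratively instead.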
On a personal note from Vincent, the coolest feature of the library is the feeling that it gives you. Maybe we can keep using laptops when we're dealing with files that are tens of GBs large. Maybe we can build data processing pipelines that only need small docker containers in order to run. Maybe we don't need to spin up clusters of compute (with overhead) to handle these tasks. Maybe we don't have to learn a new operating system's worth of cloud tools when we're dealing with bigger datasets.
It's that feeling of a brighter future that really introduces a lot of calm, which is very welcome.