Calmcode - ibis: differences

Backend Differences


Let's consider the following code snippet.

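def set_types(dataf):
    # Parse the string 'date' column into a proper date type.
    return dataf.mutate(dataf.date.to_date("%Y-%m-%d").name('date'))

def counter(dataf, *args):
    # Group by the given columns, then total and average the births.
    return (
        dataf
        .group_by(args)
        .agg(
            dataf.births.sum().name('sum'),
            dataf.births.mean().name('mean')
        )
        .order_by(args)
    )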

tbl_polars.pipe(set_types).pipe(counter, 'date')

This code works: it first casts the types and then aggregates the dataframe. The interesting thing about this example is that the same pipeline will fail for the DuckDB backend.

For example, this code fails:

tbl_duckdb.pipe(set_types).pipe(counter, 'date')

While this code totally runs:

tbl_duckdb.pipe(counter, 'date')

The reason is that the set_types function assumes that the 'date' column is a string. The polars backend does not automatically cast the string to a date, but the DuckDB backend does. With DuckDB the column is already a date by the time set_types runs, so trying to parse it as a string fails. This means that the set_types function is not only unnecessary for the DuckDB backend; it's breaking it!
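One way to sidestep this failure is to check the column type before parsing. The sketch below is only an illustration, not something from the lesson: set_types_if_needed is a hypothetical name, and it assumes that the table schema can be indexed by column name and that the resulting type exposes is_string(), as recent Ibis versions do.

```python
def set_types_if_needed(dataf):
    # Hypothetical helper: only parse 'date' when the backend left it as a string.
    # Assumes dataf.schema()["date"] returns a type object with is_string().
    if dataf.schema()["date"].is_string():
        return dataf.mutate(dataf.date.to_date("%Y-%m-%d").name("date"))
    # Otherwise assume the backend already inferred a date-like type and
    # return the table untouched.
    return dataf
```

With a guard like this, the same call, for example tbl_duckdb.pipe(set_types_if_needed).pipe(counter, 'date'), could be shared across both tables, at the cost of hiding the backend difference inside the function.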

Ibis might not be to blame here though. A future version of Ibis might address this, but in general there are lots of differences between the backends. Some backends don't support types the way others do, so you may need to pay extra attention to the first function in a pipeline. It could make sense to have a function per backend that sets the types the right way.
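The idea of one type-setting function per backend could be sketched roughly as follows. Everything here is hypothetical naming for illustration (TYPE_SETTERS, parse_date_strings, leave_as_is, set_types_for), and the backend is passed in explicitly as a plain string rather than detected from the table.

```python
def parse_date_strings(dataf):
    # For backends that keep 'date' as a string (polars in this example).
    return dataf.mutate(dataf.date.to_date("%Y-%m-%d").name("date"))

def leave_as_is(dataf):
    # For backends that already infer the date type (DuckDB in this example).
    return dataf

# Hypothetical registry: backend name -> function that sets the types.
TYPE_SETTERS = {
    "polars": parse_date_strings,
    "duckdb": leave_as_is,
}

def set_types_for(dataf, backend):
    # Look up the type-setting step for this backend and fail loudly
    # when no step has been registered for it.
    if backend not in TYPE_SETTERS:
        raise ValueError(f"No type setter registered for backend: {backend!r}")
    return TYPE_SETTERS[backend](dataf)
```

A pipeline could then start with tbl_polars.pipe(set_types_for, 'polars') or tbl_duckdb.pipe(set_types_for, 'duckdb') and share the counter step afterwards. The explicit registry keeps the backend-specific assumptions in one visible place.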