We've been updating our models one datapoint at a time so far. Although this way of training makes it very clear what is happening, in practice it's usually better to learn on batches of data instead.
So let's demonstrate a practical way to go about that. First, let's create a .csv file on disk.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate a small regression dataset with known coefficients.
X, y, w = make_regression(n_features=2, n_samples=4000,
                          random_state=42, coef=True, noise=1.0)
y = y + 1.5

# Keep a train/test split around for later experiments.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5,
                                                    random_state=42)

# Write the full dataset to disk so we can read it back in chunks.
df_save = pd.DataFrame(X).assign(y=y)
df_save.columns = ["x1", "x2", "y"]
df_save.to_csv("batch_example.csv", index=False)
Let's now read this file back in chunks. Pandas has a convenient feature for this!
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    print(chunk)
The chunked object behaves like a Python generator, and each chunk in our loop is a pandas DataFrame. Each chunk can be used as a batch of data for our regressor, as the training loop further below shows.
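To make this concrete, here is a minimal sketch that pulls a single chunk out of a fresh reader with next() and confirms it is a regular DataFrame of (at most) chunksize rows. The variable name first_chunk is just illustrative.

# A fresh reader is needed here, because the previous loop already exhausted the old one.
first_chunk = next(pd.read_csv("batch_example.csv", chunksize=1000))
print(type(first_chunk))  # <class 'pandas.core.frame.DataFrame'>
print(first_chunk.shape)  # (1000, 3) -> the x1, x2 and y columns

With that in mind, the actual training loop looks like this: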
from sklearn.linear_model import SGDRegressor

mod = SGDRegressor()

# Stream the file again and update the model one chunk at a time.
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    x_to_train = chunk[['x1', 'x2']].values
    y_to_train = chunk['y'].values
    mod.partial_fit(x_to_train, y_to_train)
This will train our model: each call to partial_fit updates it with one chunk of 1000 rows.
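Since make_regression handed us the true coefficients in w, and we shifted y up by 1.5, a quick sanity check is to compare the learned parameters against those values. This is only a rough sketch: after a single pass over the data the estimates should be in the right ballpark, but don't expect an exact match.

import numpy as np

# Compare the learned parameters to the values used to generate the data.
# w holds the true coefficients from make_regression; the intercept should
# be close to the 1.5 we added to y.
print("learned coef:", mod.coef_, "true coef:", w)
print("learned intercept:", mod.intercept_, "true intercept:", 1.5)
print("max coef error:", np.max(np.abs(mod.coef_ - w)))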
Protip
Try varying the chunksize when you use this on your own dataset. It can have a real impact on how quickly your system trains.
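One rough way to measure that impact is to time the full read-and-train loop for a few candidate chunk sizes. The snippet below is just an illustrative sketch using Python's time module; the candidate sizes are arbitrary.

import time
import pandas as pd
from sklearn.linear_model import SGDRegressor

for size in [100, 500, 1000, 2000]:
    mod = SGDRegressor()
    start = time.perf_counter()
    for chunk in pd.read_csv("batch_example.csv", chunksize=size):
        mod.partial_fit(chunk[["x1", "x2"]].values, chunk["y"].values)
    elapsed = time.perf_counter() - start
    print(f"chunksize={size}: trained in {elapsed:.3f} seconds")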