We've been updating our models one datapoint at a time so far. While that makes it easy to explain exactly what is happening, in practice it's usually better to learn from batches of data instead.
So let's demonstrate a practical way to go about that. First, we'll create a
file on disk.
```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y, w = make_regression(n_features=2, n_samples=4000, random_state=42, coef=True, noise=1.0)
y = y + 1.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

df_save = pd.DataFrame(X).assign(y=y)
df_save.columns = ["x1", "x2", "y"]
df_save.to_csv("batch_example.csv", index=False)
```
Let's now read this file back in chunks. Pandas has a convenient feature for this!
```python
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    print(chunk)
```
The `chunked` object behaves like a Python generator, and each chunk
in our loop is a pandas DataFrame. Each chunk can be used
as a batch of data for our regressor.
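You can see this lazy, chunk-by-chunk behaviour with a tiny in-memory example. The `StringIO` stand-in below is just for illustration; `read_csv` chunks a file on disk in exactly the same way.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for batch_example.csv, so this snippet
# runs on its own (5 rows, read 2 at a time).
csv_data = io.StringIO("x1,x2,y\n" + "\n".join("0.1,0.2,1.0" for _ in range(5)))

chunked = pd.read_csv(csv_data, chunksize=2)
for chunk in chunked:
    # Each chunk is a regular DataFrame; the last one may be smaller.
    print(type(chunk).__name__, chunk.shape)
```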
```python
from sklearn.linear_model import SGDRegressor

mod = SGDRegressor()
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    x_to_train = chunk[['x1', 'x2']].values
    y_to_train = chunk['y'].values
    mod.partial_fit(x_to_train, y_to_train)
```
This will train our model one batch at a time: `partial_fit` updates the weights with each chunk instead of refitting from scratch.
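To check that the incremental pass actually learned something, you can score the model afterwards. The sketch below regenerates a dataset so it runs standalone (in the text above, `batch_example.csv` already exists), and it scores on the training data purely as a sanity check, not as a proper evaluation.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Recreate a dataset so this snippet runs on its own; in the text
# above, "batch_example.csv" has already been written to disk.
X, y = make_regression(n_features=2, n_samples=4000, random_state=42, noise=1.0)
pd.DataFrame(X, columns=["x1", "x2"]).assign(y=y).to_csv("batch_example.csv", index=False)

# One incremental pass over the file in batches of 1000 rows.
mod = SGDRegressor(random_state=42)
for chunk in pd.read_csv("batch_example.csv", chunksize=1000):
    mod.partial_fit(chunk[["x1", "x2"]].values, chunk["y"].values)

# Scoring on the data we trained on is only a sanity check;
# use a held-out set for a real evaluation.
df = pd.read_csv("batch_example.csv")
print("R^2:", round(mod.score(df[["x1", "x2"]].values, df["y"].values), 3))
```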
Try varying the chunksize when you use this on your own dataset. It can really impact the speed at which your model trains.
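A rough way to compare chunk sizes is to time a full training pass for each candidate. The file name and the candidate sizes below are just placeholders for this example; swap in your own path and sizes when profiling a real dataset.

```python
import time
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Recreate the example file so the snippet is standalone.
X, y = make_regression(n_features=2, n_samples=4000, random_state=42, noise=1.0)
pd.DataFrame(X, columns=["x1", "x2"]).assign(y=y).to_csv("batch_example.csv", index=False)

for size in [100, 1000, 4000]:
    mod = SGDRegressor(random_state=42)
    start = time.perf_counter()
    # One full pass over the file with the given chunk size.
    for chunk in pd.read_csv("batch_example.csv", chunksize=size):
        mod.partial_fit(chunk[["x1", "x2"]].values, chunk["y"].values)
    print(f"chunksize={size}: {time.perf_counter() - start:.3f}s")
```

Very small chunks spend more time on per-call overhead; very large ones use more memory, so there is usually a sweet spot in between.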