We've been updating our models one datapoint at a time so far. Although this way of training makes it very clear what is happening, in practice it's usually better to learn on batches of data instead.
So let's demonstrate a practical way to go about that. First, let's create a .csv file on disk.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate a small regression dataset with known coefficients.
X, y, w = make_regression(n_features=2, n_samples=4000,
                          random_state=42, coef=True, noise=1.0)
y = y + 1.5

# Keep a train/test split around for later experiments.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5,
                                                    random_state=42)

# Write the full dataset to disk so we can read it back in chunks.
df_save = pd.DataFrame(X).assign(y=y)
df_save.columns = ["x1", "x2", "y"]
df_save.to_csv("batch_example.csv", index=False)
Let's now read this file back in chunks. Pandas has a convenient feature for this!
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    print(chunk)
The chunked object behaves like a Python generator, and each chunk in our loop is a pandas DataFrame. Each chunk can be used as a batch of data for our regressor, as the training loop further below shows.
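To make this concrete, here is a minimal sketch that pulls a single chunk out of a fresh reader with next() and confirms it is a regular DataFrame of (at most) chunksize rows. The variable name first_chunk is just illustrative.

# A fresh reader is needed here, because the previous loop already exhausted the old one.
first_chunk = next(pd.read_csv("batch_example.csv", chunksize=1000))
print(type(first_chunk))  # <class 'pandas.core.frame.DataFrame'>
print(first_chunk.shape)  # (1000, 3) -> the x1, x2 and y columns

With that in mind, the actual training loop looks like this: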
from sklearn.linear_model import SGDRegressor

mod = SGDRegressor()

# Stream the file again and update the model one chunk at a time.
chunked = pd.read_csv("batch_example.csv", chunksize=1000)
for chunk in chunked:
    x_to_train = chunk[['x1', 'x2']].values
    y_to_train = chunk['y'].values
    mod.partial_fit(x_to_train, y_to_train)
This will train our model: each call to partial_fit updates it with one chunk of 1000 rows.
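Since make_regression handed us the true coefficients in w, and we shifted y up by 1.5, a quick sanity check is to compare the learned parameters against those values. This is only a rough sketch: after a single pass over the data the estimates should be in the right ballpark, but don't expect an exact match.

import numpy as np

# Compare the learned parameters to the values used to generate the data.
# w holds the true coefficients from make_regression; the intercept should
# be close to the 1.5 we added to y.
print("learned coef:", mod.coef_, "true coef:", w)
print("learned intercept:", mod.intercept_, "true intercept:", 1.5)
print("max coef error:", np.max(np.abs(mod.coef_ - w)))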
Protip
Try varying the chunksize when you use this on your own dataset. It can have a real impact on how quickly your system trains.
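One rough way to measure that impact is to time the full read-and-train loop for a few candidate chunk sizes. The snippet below is just an illustrative sketch using Python's time module; the candidate sizes are arbitrary.

import time
import pandas as pd
from sklearn.linear_model import SGDRegressor

for size in [100, 500, 1000, 2000]:
    mod = SGDRegressor()
    start = time.perf_counter()
    for chunk in pd.read_csv("batch_example.csv", chunksize=size):
        mod.partial_fit(chunk[["x1", "x2"]].values, chunk["y"].values)
    elapsed = time.perf_counter() - start
    print(f"chunksize={size}: trained in {elapsed:.3f} seconds")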