When you're dealing with a dataset that does not fit in memory, you
can still train a machine learning model on it. You'd merely need to
cut the dataset into pieces and train your model on batches of data.
Not all machine learning pipelines can train using this method, but
many scikit-learn components do via the .partial_fit
API. To learn
more about this, check out our course.
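To give you a feel for that API: an estimator such as SGDClassifier can be fed one batch at a time, as long as you declare all the class labels on the first call. The batches below are random data, just to show the shape of the loop.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()

# Pretend these batches are streamed from disk; here they're random, purely for illustration.
for _ in range(5):
    X_batch = np.random.rand(100, 20)
    y_batch = np.random.randint(0, 2, size=100)
    # All class labels must be declared on (at least) the first call.
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])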
The main Pipeline
in scikit-learn, however, does not support this .partial_fit
API, which is why we made a variant that does: scikit-partial.
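You can check this yourself: the standard pipeline simply doesn't expose the method.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# The standard scikit-learn Pipeline does not offer partial_fit.
pipe = make_pipeline(HashingVectorizer(), SGDClassifier())
hasattr(pipe, "partial_fit")  # False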
Scikit Partial
To get started with this new pipeline you'll first need to install it:
python -m pip install scikit-partial
Once installed, you can use it to train models in multiple batches. The code below gives an example of this.
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from skpartial.pipeline import make_partial_pipeline
# First, load some data.
url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']
# Construct a pipeline with components that are `.partial_fit()` compatible
# Note that the `HashingVectorizer` is a stateless transformer and that
# the `SGDClassifier` implements `partial_fit`!
pipe = make_partial_pipeline(HashingVectorizer(), SGDClassifier(loss="log"))
# Run the learning algorithm on batches of data
for i in range(10):
    # We could also do a whole bunch of data augmentation here!
    pipe.partial_fit(X, y, classes=[0, 1])
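Note that the loop above passes the full X and y on every iteration. If the data truly doesn't fit in memory, you could read the csv in chunks instead, so that only one batch is loaded at a time. The chunksize below is an arbitrary pick, and we're assuming the partial pipeline exposes .predict() just like a normal pipeline.
# Stream the csv in chunks so only one batch sits in memory at a time.
for chunk in pd.read_csv(url, chunksize=500):
    X_batch, y_batch = list(chunk['text']), chunk['label']
    pipe.partial_fit(X_batch, y_batch, classes=[0, 1])

# Afterwards the pipeline predicts just like you'd expect.
pipe.predict(["what a wonderful movie"])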
Features
The library implements the following pipeline components that you can use to build partial pipelines.
from skpartial.pipeline import (
PartialPipeline,
PartialFeatureUnion,
make_partial_pipeline,
make_partial_union,
)
If you're familiar with scikit-learn, these names should feel right at home: they mirror Pipeline, FeatureUnion, make_pipeline and make_union.
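As a sketch of how these might combine, assuming make_partial_union accepts transformers just like scikit-learn's make_union, you could build a union of word-level and character-level hashing features and drop it into a partial pipeline. The settings here are illustrative.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from skpartial.pipeline import make_partial_pipeline, make_partial_union

# Combine word-level and character-level hashing features in one union.
union = make_partial_union(
    HashingVectorizer(),
    HashingVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
)

# The union behaves like any other stateless transformer in the partial pipeline.
pipe = make_partial_pipeline(union, SGDClassifier())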
When to use
If your dataset fits in memory, you should stick to the normal .fit()
API, because that's numerically more stable for the optimiser. But if you're
interested in training models on large datasets that don't fit in
memory, this library should help you construct proper pipelines.
You could also use this library to augment the dataset while you're training. In the case of text classification, you could consider simulating some spelling errors in each batch of text to make the model more robust to them. You could do something similar with images by rotating them slightly in each batch.
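As a rough sketch of that idea, the add_typos helper below is hypothetical; it just swaps a couple of adjacent characters in each text so that every pass over the data looks slightly different.
import random

def add_typos(texts, n_swaps=2):
    # Hypothetical augmentation: swap a few adjacent characters per text.
    out = []
    for text in texts:
        chars = list(text)
        for _ in range(n_swaps):
            if len(chars) > 2:
                i = random.randrange(len(chars) - 1)
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        out.append("".join(chars))
    return out

# Each pass over the data trains on a freshly perturbed copy of the text.
for i in range(10):
    pipe.partial_fit(add_typos(X), y, classes=[0, 1])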