
Calmcode Labs Presents

scikit-partial.

Pipelines for partial_fit.

When you're dealing with a dataset that does not fit in memory, you can still train a machine learning model on it. You'd merely need to cut the dataset into pieces and train your model on batches of data. Not all machine learning pipelines can train using this method, but many scikit-learn components do via the .partial_fit API. To learn more about this, check out our course.
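
For instance, scikit-learn's SGDClassifier can already be trained this way on its own. The sketch below (using random data purely for illustration) shows the basic .partial_fit loop:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
rng = np.random.default_rng(0)

for _ in range(10):
    # In practice each batch would be read from disk or a database.
    X_batch = rng.normal(size=(100, 20))
    y_batch = rng.integers(0, 2, size=100)
    # `classes` must list all labels up front, because no single batch
    # is guaranteed to contain every class.
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])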

The main Pipeline in scikit-learn, however, does not support this .partial_fit API, which is why we made a variant that does: scikit-partial.
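
You can verify this quickly: the standard pipeline simply has no such method.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

pipe = make_pipeline(HashingVectorizer(), SGDClassifier())
print(hasattr(pipe, "partial_fit"))  # False: the regular Pipeline only offers .fit()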

Scikit Partial

To get started with this new pipeline you'll first need to install it:

python -m pip install scikit-partial

Once installed, you can use it to train models in multiple batches. The code below gives an example.

import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from skpartial.pipeline import make_partial_pipeline

# First, load some data.
url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Construct a pipeline with components that are `.partial_fit()` compatible
# Note that the `HashingVectorizer` is a stateless transformer and that
# the `SGDClassifier` implements `partial_fit`!
# Note: on scikit-learn versions before 1.1 this loss was spelled "log".
pipe = make_partial_pipeline(HashingVectorizer(), SGDClassifier(loss="log_loss"))

# Run the learning algorithm in multiple passes; in a real out-of-core
# setting each call would receive a fresh batch read from disk.
for i in range(10):
    # We could also do a whole bunch of data augmentation here!
    pipe.partial_fit(X, y, classes=[0, 1])
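
Once trained, the pipeline can be used like any other scikit-learn estimator; assuming it exposes the usual .predict() method, a rough sanity check on the training data looks like this:

preds = pipe.predict(X)
print((preds == y).mean())  # crude accuracy on the training data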

Features

The library implements the following pipeline components that you can use to build partial pipelines.

from skpartial.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

If you're familiar with scikit-learn, these names should look familiar: they mirror Pipeline, FeatureUnion, make_pipeline and make_union.
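
As a sketch of how these pieces combine (the exact feature settings are just an illustration), a PartialFeatureUnion built with make_partial_union can merge word-level and character-level hashing features before the classifier:

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from skpartial.pipeline import make_partial_pipeline, make_partial_union

# Combine word-level and character-level hashing features; both
# transformers are stateless, so they remain partial_fit-friendly.
featurizer = make_partial_union(
    HashingVectorizer(),
    HashingVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
)

pipe = make_partial_pipeline(featurizer, SGDClassifier())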

When to use

If your dataset fits in memory, you should stick to the normal .fit() API because it is numerically more stable for the optimiser. But if you want to train models on large datasets that don't fit in memory, this library should help you construct proper pipelines.

You could also use this library to augment the dataset while you're training. In the case of text classification, you could simulate some spelling errors in each batch of text to make the model more robust to them. You could do something similar with images by rotating them in each batch.
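
A rough sketch of that idea, reusing the pipeline from before (the add_typos helper below is hypothetical and not part of scikit-partial):

import random

def add_typos(text, p=0.05):
    # Hypothetical augmentation: randomly drop characters to simulate typos.
    return "".join(ch for ch in text if random.random() > p)

for epoch in range(10):
    # Each pass sees a slightly different, noisier version of the text.
    X_augmented = [add_typos(text) for text in X]
    pipe.partial_fit(X_augmented, y, classes=[0, 1])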

