When you're dealing with a dataset that does not fit in memory, you
can still train a machine learning model on it. You'd merely need to
cut the dataset into pieces and train your model on batches of data.
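To make the idea of "cutting the dataset into pieces" concrete, here's a minimal sketch of a batching helper. The `batches` function is a hypothetical example, not part of any library mentioned here:

```python
def batches(X, y, batch_size):
    """Yield consecutive (X, y) chunks of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# A toy dataset of 10 rows, split into batches of 4.
X = list(range(10))
y = [x % 2 for x in X]
for X_batch, y_batch in batches(X, y, batch_size=4):
    print(len(X_batch))  # batch sizes: 4, 4, 2
```

Each batch can then be fed to a model one at a time instead of loading everything at once.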
Not all machine learning pipelines can train using this method, but
many scikit-learn components do via the
.partial_fit API. To learn
more about this, check out our course.
The `Pipeline` in scikit-learn, however, does not support this `.partial_fit` API, which is why we made a variant that does: scikit-partial.
To get started with this new pipeline you'll first need to install it:
```
python -m pip install scikit-partial
```
Once installed you can use it to train models in multiple batches. The code below gives an example of this.
```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from skpartial.pipeline import make_partial_pipeline

# First, load some data.
url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Construct a pipeline with components that are `.partial_fit()` compatible.
# Note that the `HashingVectorizer` is a stateless transformer and that
# the `SGDClassifier` implements `partial_fit`!
pipe = make_partial_pipeline(
    HashingVectorizer(),
    SGDClassifier(loss="log_loss"),  # "log" in scikit-learn versions before 1.3
)

# Run the learning algorithm in multiple passes; each `partial_fit`
# call could instead receive a fresh batch of data.
for i in range(10):
    # We could also do a whole bunch of data augmentation here!
    pipe.partial_fit(X, y, classes=[0, 1])
```
The library implements the following pipeline components that you can use to build partial pipelines.
```python
from skpartial.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)
```
If you're familiar with scikit-learn, you'll recognise the naming conventions used here.
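To illustrate the idea behind these components, here's a stripped-down sketch of what a partial-fit pipeline does: pass each batch through stateless transformers, then update the final estimator incrementally. This is a simplified illustration with made-up stand-in components (`Doubler`, `MeanEstimator`), not skpartial's actual implementation:

```python
class TinyPartialPipeline:
    """Sketch of a partial_fit pipeline: every step except the last must
    be a stateless transformer, and the final estimator must implement
    `partial_fit`. Not the real skpartial code, just the core idea."""

    def __init__(self, *steps):
        self.steps = steps

    def partial_fit(self, X, y=None, **kwargs):
        # Transform the batch through the stateless steps...
        for step in self.steps[:-1]:
            X = step.transform(X)
        # ...then update the final estimator incrementally.
        self.steps[-1].partial_fit(X, y, **kwargs)
        return self


# Hypothetical stand-ins to demonstrate the flow.
class Doubler:
    def transform(self, X):
        return [2 * x for x in X]


class MeanEstimator:
    """Keeps a running mean of the targets it has seen so far."""
    def __init__(self):
        self.n, self.total = 0, 0.0

    def partial_fit(self, X, y, **kwargs):
        self.n += len(y)
        self.total += sum(y)
        return self

    @property
    def mean_(self):
        return self.total / self.n


pipe = TinyPartialPipeline(Doubler(), MeanEstimator())
pipe.partial_fit([1, 2], [1.0, 3.0])  # first batch
pipe.partial_fit([3], [5.0])          # second batch
print(pipe.steps[-1].mean_)           # running mean over both batches: 3.0
```

The real `PartialPipeline` works with scikit-learn estimators, but the flow per batch is the same: transform, then `partial_fit`.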
When to use
If your dataset fits in memory you should stick to the normal `.fit()`
API, because full-batch training is numerically more stable for the optimiser. But if you're
interested in training models on large datasets that don't fit in
memory, this library should help you construct proper pipelines.
You could also use this library to augment the dataset while you're training. In the case of text classification, you might simulate spelling errors in each batch of text to make the model more robust; you could do something similar with images by rotating them in each batch. It also pairs well with our icepickle experiment if you're interested in pre-training a model that you'd like other people to fine-tune later.