Scikit-learn models can take a while to train. That's why it's preferable to store them on disk so that you can re-use them elsewhere.
The standard method of doing this in scikit-learn is to
store a pickle file. The snippet of code below, which can be found in full
on this Github repository, demonstrates
how you might do that.
```python
import pandas as pd
from joblib import dump
from rich.console import Console
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

console = Console()

# Load the training data
df = pd.read_csv("clinc_oos-plus.csv").loc[lambda d: d['split'] == 'train']
console.log("Training data loaded.")

X = df['text'].to_list()
y = df['label']

# Make a very basic machine learning pipeline
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression()
)
pipe.fit(X, y)
console.log("ML Pipeline fitted.")

# Save the pickled object to disk.
dump(pipe, 'pipe.joblib')
console.log("Joblib pickle saved.")
```
The pipe.joblib file on disk represents the Python object that
contains the trained pipeline. This approach comes with a few downsides though.
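To use the stored pipeline elsewhere, you can read the file back with `joblib.load`. Below is a minimal self-contained sketch; note that the tiny inline training set is made up for illustration (the original script trains on `clinc_oos-plus.csv`).

```python
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in data, just so this snippet runs on its own.
X = ["book a flight", "play some jazz", "book a hotel", "play a song"]
y = ["travel", "music", "travel", "music"]

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(X, y)
joblib.dump(pipe, "pipe.joblib")

# Loading the file back gives a ready-to-use, fitted estimator.
reloaded = joblib.load("pipe.joblib")
print(reloaded.predict(["play music"]))
```

The reloaded object behaves exactly like the original pipeline, which is what makes this workflow so convenient despite its caveats.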
The file may not be compatible with other Python versions, different operating systems, or different versions of the packages involved. It also comes with a big security concern: loading a pickle file can execute arbitrary code. The scikit-learn docs even warn about this.
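To make the security concern concrete, here is a small sketch of why you should never unpickle data you don't trust. The pickle protocol lets an object's `__reduce__` method specify any callable to invoke during loading; the payload below is harmless (`eval` of an arithmetic expression), but it could just as easily be `os.system` with a destructive shell command.

```python
import pickle

class Malicious:
    def __reduce__(self):
        # Tells pickle: "to reconstruct me, call eval('6 * 7')".
        # An attacker would put os.system(...) or worse here.
        return (eval, ("6 * 7",))

payload = pickle.dumps(Malicious())

# Merely loading the bytes runs the attacker's callable.
result = pickle.loads(payload)
print(result)  # → 42
```

No method on the loaded object ever needs to be called; the code runs during `pickle.loads` itself, which is why the warning applies to any untrusted `.joblib` or `.pkl` file.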
In this series of videos we will dive deeper into this security concern, and we will also explore some alternatives in this space.