Scikit-learn models can take a while to train, which is why it's often preferable to store a trained model on disk so that you can re-use it elsewhere.
The standard method of doing this in scikit-learn is to use joblib to store a pickle file. The snippet of code below, which can be found in full on this GitHub repository, demonstrates how you might do that.
import pandas as pd
from joblib import dump
from rich.console import Console
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
console = Console()
# Load the training data
df = pd.read_csv("clinc_oos-plus.csv").loc[lambda d: d['split'] == 'train']
console.log("Training data loaded.")
X = df['text'].to_list()
y = df['label']
# Make a very basic machine learning pipeline
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression()
)
pipe.fit(X, y)
console.log("ML Pipeline fitted.")
# Save the pickled object to disk.
dump(pipe, 'pipe.joblib')
console.log("Joblib pickle saved.")
The pipe.joblib file on disk represents the Python object that contains the trained pipeline. This file comes with a few downsides, though. It may not be compatible with other Python versions, different operating systems, or different versions of packages. It also comes with a big security concern; the scikit-learn docs even warn about this.
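To re-use the stored pipeline elsewhere, you would load it back with joblib. The sketch below assumes the file is read in the same environment (same Python, scikit-learn, and joblib versions) that produced it; the example sentence passed to predict is just a made-up input.

from joblib import load

# Load the pickled pipeline back from disk.
pipe = load('pipe.joblib')

# The loaded object behaves like the fitted pipeline from before.
print(pipe.predict(["can you book me a table for two tonight"]))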
In this series of videos, we will dive deeper into this security concern, and we will also explore some alternatives in this space.
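To give a rough sense of where that security concern comes from, here is a small standalone sketch, not specific to scikit-learn or joblib (which uses pickle under the hood), showing that unpickling data can execute arbitrary code. The class name and the shell command are made up purely for illustration.

import os
import pickle

class Malicious:
    # __reduce__ tells pickle how to rebuild the object; here it asks
    # pickle to call os.system with a shell command when it is loaded.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code ran during unpickling",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # the shell command runs the moment this is loaded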