Calmcode - scikit save: introduction

Saving Scikit-Learn Models


Scikit-learn models can take a while to train. That's why it's preferable to store them on disk so that you can reuse them elsewhere.

The standard way to do this in scikit-learn is to use joblib to store a pickle file. The snippet of code below, which can be found in full in this GitHub repository, demonstrates how you might do that.

import pandas as pd
from joblib import dump
from rich.console import Console

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

console = Console()

# Load the training data
df = pd.read_csv("clinc_oos-plus.csv").loc[lambda d: d['split'] == 'train']
console.log("Training data loaded.")

X = df['text'].to_list()
y = df['label']

# Make a very basic machine learning pipeline
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression()
)

pipe.fit(X, y)
console.log("ML Pipeline fitted.")

# Save the pickled object to disk.
dump(pipe, 'pipe.joblib')
console.log("Joblib pickle saved.")

The pipe.joblib file on disk represents the Python object that contains the trained pipeline. This file ... comes with a few downsides though.
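To see what "represents the Python object" means in practice, here is a minimal round-trip sketch. It assumes a tiny made-up dataset (the `texts` and `labels` lists are hypothetical stand-ins for the CSV above) and writes to a temporary directory rather than the working directory. It is only meant to show that `joblib.load` returns a fitted pipeline that predicts like the original.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, standing in for the CSV used in the lesson.
texts = ["book a flight", "fly me to paris", "play some jazz", "put on a song"]
labels = ["travel", "travel", "music", "music"]

# Same kind of pipeline as above, fitted on the toy data.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# Save the fitted pipeline, then load it back as a new Python object.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipe.joblib"
    dump(pipe, path)
    restored = load(path)

# The restored pipeline should give the same predictions as the original.
queries = ["book a flight to rome", "play a song"]
same_predictions = list(pipe.predict(queries)) == list(restored.predict(queries))
print(same_predictions)
```

The restored object needs no retraining. In another script you would only need the `load` call, plus an environment where scikit-learn is installed.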

It may not be compatible with other Python versions, different operating systems, or different package versions. It also comes with a big security concern: the file is a pickle, and loading a pickle can execute arbitrary code. The scikit-learn docs even warn about this.
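To make the security concern concrete, here is a small sketch that uses only the standard library `pickle` module, which joblib relies on under the hood. The `Payload` class is a hypothetical example, and the expression it evaluates is deliberately harmless. The point is that the code runs during loading, before you ever get to inspect the object.

```python
import pickle


class Payload:
    """Hypothetical object that hijacks unpickling via __reduce__."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # An attacker could put any callable and arguments here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the expression instead of rebuilding a Payload instance.
result = pickle.loads(blob)
print(result)                       # 42
print(isinstance(result, Payload))  # False
```

Nothing in the file name or extension reveals this behavior, which is why you should only load pickle or joblib files from sources you trust.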

In this series of videos we will dive deeper into this security concern and we will also explore some alternatives in this space.