Calmcode - model-mining: introduction

Introduction

In this series we will discuss a data mining technique that can be reused as a data modelling technique. To implement it we will make heavy use of human-learn. If that package is new to you, it might be good to explore our course on human-learn first. You can install human-learn via:

pip install human-learn

Dataset

For the first demo we will use the Titanic dataset.

from hulearn.datasets import load_titanic

df = load_titanic(as_frame=True)
df.head()

Let's first turn the dataset into an X, y-pair that's scikit-learn ready.

X = (df
     .assign(sex=lambda d: d['sex'] == 'male')  # encode `sex` as a boolean
     .drop(columns=['survived', 'name']))
y = df['survived']

Once that's done we can create a rule-based model based on the fare that folks paid for the Titanic trip. Maybe a higher fare implies that you were more likely to have access to a lifeboat.

import numpy as np
from hulearn.classification import FunctionClassifier

def fare_based(dataf, threshold=10):
    """
    The assumption is that folks who paid more are wealthier and are more
    likely to have received access to lifeboats.
    """
    return np.array(dataf['fare'] > threshold).astype(int)

mod = FunctionClassifier(fare_based)
preds = mod.fit(X, y).predict(X)
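
As a quick sanity check we can score these predictions against the labels. A minimal sketch, using scikit-learn's accuracy_score (any metric from sklearn.metrics would do here):

from sklearn.metrics import accuracy_score

# Compare the rule-based predictions against the actual survival labels.
accuracy_score(y, preds)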

We can also use this model in a grid search if we are interested in finding the best threshold.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, make_scorer

# Note the threshold keyword argument in this function.
def fare_based(dataf, threshold=10):
    return np.array(dataf['fare'] > threshold).astype(int)

# Pay attention here: we pass the threshold argument so that GridSearchCV can tune it.
mod = FunctionClassifier(fare_based, threshold=10)

# The GridSearch object can now "grid-search" over this argument.
# We also add a few metrics so that we can measure precision and recall as well.
grid = GridSearchCV(mod,
                    cv=2,
                    param_grid={'threshold': np.linspace(0, 100, 30)},
                    scoring={'accuracy': make_scorer(accuracy_score),
                             'precision': make_scorer(precision_score),
                             'recall': make_scorer(recall_score)},
                    refit='accuracy')
grid.fit(X, y);
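
Because we passed refit='accuracy', the fitted grid also keeps the best-scoring threshold around. These are standard GridSearchCV attributes, so we can inspect them directly:

# The threshold with the highest mean accuracy, and that accuracy score.
grid.best_params_, grid.best_score_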

From the cross-validation results we can plot how accuracy, precision and recall change as the threshold varies.

import pandas as pd

(pd.DataFrame(grid.cv_results_)
  [['param_threshold', 'mean_test_recall', 'mean_test_accuracy', 'mean_test_precision']]
  .set_index('param_threshold')
  .plot(figsize=(12, 4)));
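
If recall mattered more to us than accuracy, the same cv_results_ table lets us read off a different threshold. A small sketch (best_recall_threshold is just an illustrative name):

results = pd.DataFrame(grid.cv_results_)

# Pick the threshold whose mean cross-validated recall is highest.
best_recall_threshold = results.loc[results['mean_test_recall'].idxmax(), 'param_threshold']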

This example is nice, but it also gives us a moment to pause. Is this the best way to come up with domain models? It might be better to draw inspiration from the data itself instead of trying to come up with hypotheses up front.