Calmcode - model-mining: benchmark

Benchmark


Let's now repeat the technique on a harder dataset. We'll also use the hiplot parallel coordinates implementation this time around. If you're unfamiliar with hiplot, check out our hiplot course first.

You can retrieve the credit card dataset via the fetch_openml API.

from sklearn.datasets import fetch_openml

# Download the credit card fraud dataset from OpenML.
df_credit = fetch_openml(
    data_id=1597,
    as_frame=True
)

# Rename the target column and turn it into a boolean fraud indicator.
df_credit = df_credit['frame'].rename(columns={"Class": "group"})
df_credit['group'] = df_credit['group'] == '1'

df_credit.head()

You can confirm that this dataset suffers from a class imbalance.

df_credit.group.value_counts()
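
If you prefer relative numbers, you can also normalise the counts. This isn't part of the video, just a small extra check on the same column.

# Fraction of fraud vs. non-fraud rows: the fraud class is a tiny minority.
df_credit.group.value_counts(normalize=True)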

So let's now try building a model with a train/test split.

from sklearn.model_selection import train_test_split

credit_train, credit_test = train_test_split(df_credit, test_size=0.5, shuffle=True)
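
As an aside that isn't covered in the video: because fraud cases are so rare, you could also stratify the split on the group column so that both halves keep the same fraud ratio. A minimal sketch:

from sklearn.model_selection import train_test_split

# Alternative to the split above: stratify on the target so that the
# fraud ratio is identical in both halves.
credit_train, credit_test = train_test_split(
    df_credit, test_size=0.5, shuffle=True, stratify=df_credit['group']
)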

Next, we pass the train data to hiplot. We take a sample of the non-fraud examples to ensure that hiplot remains responsive.

import json
import pandas as pd
import hiplot as hip

# Keep every fraud case and add a sample of 5000 rows so hiplot stays responsive.
samples = [credit_train.loc[lambda d: d['group'] == True], credit_train.sample(5000)]
json_data = pd.concat(samples).to_json(orient='records')

hip.Experiment.from_iterable(json.loads(json_data)).display()

Again, we can play around to find some rules. Here's the model, translated into code, that we found in the video.

from hulearn.experimental import CaseWhenRuler

def make_prediction(dataf):
    # Predict "no fraud" (0) unless one of the rules below fires.
    ruler = CaseWhenRuler(default=0)

    (ruler
     .add_rule(lambda d: (d['V11'] > 4), 1)
     .add_rule(lambda d: (d['V17'] < -3), 1)
     .add_rule(lambda d: (d['V14'] < -8), 1))

    return ruler.predict(dataf)

from hulearn.classification import FunctionClassifier
from sklearn.metrics import classification_report

# FunctionClassifier doesn't learn from the data; .fit() is just an API formality.
clf = FunctionClassifier(make_prediction)
y_pred = clf.fit(credit_test, credit_test['group']).predict(credit_test)
print(classification_report(credit_test['group'], y_pred))

This is the report that we got in the end.

              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142164
        True       0.69      0.72      0.70       240

    accuracy                           1.00    142404
   macro avg       0.85      0.86      0.85    142404
weighted avg       1.00      1.00      1.00    142404

If you'd like, you can compare these results with the ones from the Keras blog.

Notes on Benchmarking

The main point we hope to demonstrate is that this technique has merit, but the result should be taken with a grain of salt. The author of the Keras blog likely wasn't trying to build a state-of-the-art model and was probably more concerned with clearly explaining a technique (which the blogpost does quite well). Our approach also involves an element of "luck", since this dataset lends itself particularly well to the visualisation technique.

It has also been correctly pointed out that this course calculates a slightly different number, because the Keras blogpost uses a slightly different validation set than we do. We shuffle and take 50% of the data for validation, while the Keras blog doesn't shuffle and takes 20% for validation. A community member took the time to explore this and noticed that the numbers change slightly when you account for the difference.

              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

The numbers still suggest there's plenty of merit to mining a model here, but this is a fairer statistic. For a discussion on the matter, see this github issue.
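
If you'd like to approximate the Keras-style split yourself, a rough sketch could look like the snippet below. It assumes the blogpost holds out the last 20% of rows without shuffling, so the exact numbers you get may still differ slightly from the report above.

from sklearn.metrics import classification_report

# Keras-blog-style split: no shuffling, roughly the last 20% of rows held out.
n_valid = int(len(df_credit) * 0.2)
credit_valid_ks = df_credit.iloc[-n_valid:]

# The rules don't learn from data, so .fit() is again just an API formality.
preds = clf.fit(credit_valid_ks, credit_valid_ks['group']).predict(credit_valid_ks)
print(classification_report(credit_valid_ks['group'], preds))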