Let's use a parallel coordinates chart to help us find a better model.
from hulearn.experimental import parallel_coordinates
parallel_coordinates(df, label="survived")
The color of each line indicates the class that we'd like to predict. That makes it easy to eyeball which sub-selections might become a rule for a model.
Here's the model that we found while playing around.
from hulearn.classification import FunctionClassifier

def make_prediction(dataf, age=15):
    # Rule 1: women travelling in first or second class.
    women_rule = (dataf['pclass'] < 3.0) & (dataf['sex'] == "female")
    # Rule 2: children up to the age cutoff in first or second class.
    children_rule = (dataf['pclass'] < 3.0) & (dataf['age'] <= age)
    return women_rule | children_rule

mod = FunctionClassifier(make_prediction)
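Before tuning anything, it's worth a quick sanity check that this estimator behaves like any other scikit-learn model. A minimal sketch, assuming the df loaded for the chart above is still in scope and that fit/score follow the usual estimator API:

X, y = df.drop(columns=['survived']), df['survived']

# The rule-based model exposes the familiar fit/score methods,
# so we can check the training accuracy of the default age=15 cutoff.
mod.fit(X, y)
mod.score(X, y)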
We've kept "age" as a hyperparameter, which means we can optimise it with a grid search. We'll track accuracy, precision, and recall, but refit on accuracy to select the final model.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score

grid_rule = GridSearchCV(mod,
                         cv=10,
                         param_grid={'age': range(5, 50)},
                         scoring={'accuracy': make_scorer(accuracy_score),
                                  'precision': make_scorer(precision_score),
                                  'recall': make_scorer(recall_score)},
                         refit='accuracy')
from hulearn.datasets import load_titanic

df = load_titanic(as_frame=True)
X, y = df.drop(columns=['survived']), df['survived']
grid_rule.fit(X, y)
When you then look at the results from the grid search, you can confirm an 80% accuracy and a 95% precision. That's better than before!
import pandas as pd

score_df = (pd.DataFrame(grid_rule.cv_results_)
            .set_index('param_age')
            [['mean_test_accuracy', 'mean_test_precision', 'mean_test_recall']])

score_df.head(15)
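If you only need the winning setting, the fitted grid search exposes it directly; best_params_ and best_estimator_ are standard GridSearchCV attributes, so no extra bookkeeping is required.

# The age cutoff that won on accuracy, plus the model refitted with it.
print(grid_rule.best_params_)
best_rule = grid_rule.best_estimator_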