... dirty cat: introduction


For this video you'll need to install the following dependencies:

python -m pip install jupyterlab scikit-learn scikit-lego dirty-cat
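If you want to confirm the install worked before starting, a small check like the one below can help. It is only a sketch: the mapping from pip package names to import names is written out by hand, and `missing_packages` is a helper made up for this note, not part of any of these libraries.

```python
from importlib.util import find_spec

# Map each pip package name to the module name you import in Python.
PACKAGES = {
    "jupyterlab": "jupyterlab",
    "scikit-learn": "sklearn",
    "scikit-lego": "sklego",
    "dirty-cat": "dirty_cat",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Still missing: " + ", ".join(missing))
else:
    print("All dependencies found.")
```

If anything is listed as missing, rerun the pip command above in the same environment that Jupyter uses.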

To get the same dataset as we use here, run:

import numpy as np
import pandas as pd
from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()

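You end up with the same dataframe when you run the code below. Note that it expects `data` to be the pandas DataFrame taken from `employee_salaries`, with the salary in a 'Current Annual Salary' column; how you get that DataFrame depends on the dirty-cat version you have installed: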

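# The column we want to predict.
target_column = 'Current Annual Salary'
# Keep only the target and three features, then drop rows with missing values.
ml_df = data[[target_column, 'year_first_hired', 'assignment_category', 'employee_position_title']].dropna()
# Target as a 1-D NumPy array, features as a DataFrame.
y = ml_df[target_column].values.ravel()
X = ml_df[['employee_position_title', 'year_first_hired', 'assignment_category']]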
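To see what this selection does without downloading anything, here is a minimal sketch on a tiny, made-up dataframe. The rows are invented and only mimic the column names used above; the names `toy`, `toy_ml`, `X_toy` and `y_toy` are just for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the column names used above.
toy = pd.DataFrame({
    "Current Annual Salary": [50000.0, 62000.0, None, 48000.0],
    "year_first_hired": [2001, 2010, 2015, 1998],
    "assignment_category": ["Fulltime-Regular", "Parttime-Regular", "Fulltime-Regular", None],
    "employee_position_title": ["Office Aide", "Bus Operator", "Police Officer", "Librarian"],
    "gender": ["F", "M", "F", "M"],
})

target_column = "Current Annual Salary"
features = ["employee_position_title", "year_first_hired", "assignment_category"]

# Selecting columns first means `gender` is left out of the modelling data.
# dropna() then removes every row with a missing value in the kept columns.
toy_ml = toy[[target_column] + features].dropna()
y_toy = toy_ml[target_column].to_numpy()
X_toy = toy_ml[features]

print(X_toy.shape, y_toy.shape)
```

Two of the four toy rows contain a missing value, so only two rows survive, and `X_toy` and `y_toy` stay aligned because both come from the same filtered dataframe.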

Note, again, that we explicitly drop the gender column here. Dropping gender is not enough to make a model fair, since other features may still correlate with it. We need to keep this in mind if we ever think of doing this in practice. In this series of videos we won't explore this further; instead we will explore how to encode the features.
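As a rough idea of what "encoding the features" means, here is a minimal sketch that one-hot encodes a few made-up job titles with scikit-learn's `OneHotEncoder`. This is a plain baseline chosen for illustration, not necessarily the approach taken in the videos, and the titles are invented.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up job titles standing in for the 'employee_position_title' column.
titles = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Bus Operator", "Office Aide", "Librarian"],
})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(titles).toarray()

print(encoder.categories_[0])  # the distinct titles found during fit
print(encoded.shape)           # one column per distinct title

# A title the encoder has never seen, even though it resembles a known one.
unseen = pd.DataFrame({"employee_position_title": ["Senior Librarian"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)
```

One-hot encoding treats "Librarian" and "Senior Librarian" as completely unrelated, so the unseen title becomes a row of zeros. Messy, near-duplicate categories like these are the kind of situation where other encoders become interesting.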
