When you're working with scikit-learn you'll often need to deal with categorical data, and the way you encode this type of data really matters. In this series of videos we'll explore the dirty-cat library while we try to deal with categorical data.
For this video you'll need to install the following dependencies:
python -m pip install jupyterlab scikit-learn scikit-lego dirty-cat
To get the same dataset as we use here, run:
import numpy as np
import pandas as pd
from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()
data = employee_salaries['data']
To end up with the same dataframe as in the video, run:
target_column = 'Current Annual Salary'
ml_df = data[[target_column, 'year_first_hired', 'assignment_category', 'employee_position_title']].dropna()
y = ml_df[target_column].values.ravel()
X = ml_df[['employee_position_title', 'year_first_hired', 'assignment_category']]
Note, again, that we explicitly drop the gender column here. Dropping gender is not enough to guarantee fairness, because other columns can still correlate with it. We need to keep this in mind if we ever think of doing this in practice. In this series of videos we won't explore this further; instead we will explore how to encode the features.
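As a rough starting point for what "encoding the features" can mean, here is a minimal sketch that one-hot encodes the two categorical columns with plain scikit-learn and passes the numeric column through unchanged. It is only a baseline for comparison, not necessarily the approach the videos take. The toy rows in `X_toy` are made up for illustration and merely reuse the column names from above, so the snippet runs without downloading the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows (not taken from the real dataset) that reuse
# the column names of X from above.
X_toy = pd.DataFrame({
    "employee_position_title": ["Police Officer III", "Bus Operator", "Police Officer III"],
    "year_first_hired": [1998, 2010, 2005],
    "assignment_category": ["Fulltime-Regular", "Parttime-Regular", "Fulltime-Regular"],
})

# One-hot encode the two categorical columns; keep the numeric column as-is.
encoder = ColumnTransformer(
    transformers=[
        (
            "onehot",
            OneHotEncoder(handle_unknown="ignore"),
            ["employee_position_title", "assignment_category"],
        ),
    ],
    remainder="passthrough",
)

X_encoded = encoder.fit_transform(X_toy)

# 2 distinct titles + 2 distinct categories + 1 passthrough column
print(X_encoded.shape)
```

With a column like employee_position_title, which has many distinct and near-duplicate values in the real data, a one-hot encoding grows one column per distinct value. That is a reason to look at alternative encoders.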