Calmcode - dirty cat: introduction

Introduction

When you're working with scikit-learn you'll often need to deal with categorical data, and the way you encode this type of data really matters. In this series of videos we'll explore the dirty-cat library while we try to deal with categorical data.

For this video you'll need to install the following dependencies:

python -m pip install jupyterlab scikit-learn scikit-lego dirty-cat

To get the same dataset as we use here, run:

import numpy as np
import pandas as pd
from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()
data = employee_salaries['data']

To end up with the same dataframe as in the video, run:

target_column = 'Current Annual Salary'
ml_df = data[[target_column, 'year_first_hired', 'assignment_category', 'employee_position_title']].dropna()
y = ml_df[target_column].values.ravel()
X = ml_df[['employee_position_title', 'year_first_hired', 'assignment_category']]
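
If you want to check what the selection step does without downloading the dataset, here is a minimal sketch on a tiny hand-made dataframe. The rows below are made up and only mimic the column names used above; the real dataset has many more rows and columns.

```python
import pandas as pd

# Hypothetical rows that only mimic the column names used above.
# These values are made up and do not come from the real dataset.
data = pd.DataFrame({
    "Current Annual Salary": [60000.0, 72000.0, 55000.0, 81000.0],
    "year_first_hired": [2001, 2010, 1998, 2015],
    "assignment_category": ["Fulltime-Regular", "Parttime-Regular", None, "Fulltime-Regular"],
    "employee_position_title": ["Office Clerk", "Bus Operator", "Librarian", "Police Officer"],
    "gender": ["F", "M", "F", "M"],
})

target_column = "Current Annual Salary"
feature_columns = ["employee_position_title", "year_first_hired", "assignment_category"]

# Keep only the target and the three features, then drop rows with a missing value.
ml_df = data[[target_column] + feature_columns].dropna()
y = ml_df[target_column].values.ravel()
X = ml_df[feature_columns]

print(X.shape, y.shape)
```

The third row has a missing assignment category, so dropna removes it. The gender column never makes it into X because it is not part of the selection.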

Note, again, that we explicitly drop the gender column here. Dropping gender is not enough to guarantee fairness, because the remaining features can still carry information about it. We need to keep this in mind if we ever think of doing this in practice. In this series of videos we won't explore this further; instead we will explore how to encode the features.
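
To make "encoding the features" a bit more concrete, here is a minimal sketch of plain one-hot encoding with scikit-learn. It is an assumed baseline for illustration, not necessarily the approach the videos take, and the toy dataframe below is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical features that mimic two of the columns used above.
X_toy = pd.DataFrame({
    "employee_position_title": ["Office Clerk", "Bus Operator", "Office Clerk"],
    "assignment_category": ["Fulltime-Regular", "Parttime-Regular", "Fulltime-Regular"],
})

# One-hot encoding creates one binary column per distinct category value.
# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X_toy).toarray()

print(encoded.shape)
```

Each of the two columns has two distinct values here, so the encoded matrix has four columns. With a column like employee_position_title in the real data, the number of distinct values can be much larger, so the encoded matrix can get very wide.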