... smoking.

It is easier than you might think to fool yourself with data. It is quantified, so there is less bias, right? This series of videos walks through an analysis in pandas that shows why that might not be true.


It's always a good idea to clean the dataset before analysing it. In this case we're cleaning it merely to make our analysis easier.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("~/Downloads/smoking.csv")

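def clean_dataframe(dataf):
    # Encode the text columns as 0/1 integers, which are easier to aggregate.
    # The np.int alias was removed in NumPy 1.24, so the builtin int is used.
    return (dataf
            .assign(alive=lambda d: (d['outcome'] == 'Alive').astype(int))
            .assign(smokes=lambda d: (d['smoker'] == 'Yes').astype(int)))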

clean_df = df.pipe(clean_dataframe)
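
If you want to see what the cleaning step does without downloading the file, here is a minimal sketch on a tiny hand-made dataframe. The rows are made up for illustration and are not taken from smoking.csv. The values 'Dead' and 'No' are assumptions about what the other categories look like. The function is repeated here, with the builtin int, only so that the snippet runs on its own.

```python
import pandas as pd

def clean_dataframe(dataf):
    # Same cleaning step as above: add 0/1 columns next to the text columns.
    return (dataf
            .assign(alive=lambda d: (d['outcome'] == 'Alive').astype(int))
            .assign(smokes=lambda d: (d['smoker'] == 'Yes').astype(int)))

# Hypothetical rows, made up for illustration; not taken from smoking.csv.
toy_df = pd.DataFrame({
    'outcome': ['Alive', 'Dead', 'Alive', 'Alive'],
    'smoker': ['Yes', 'No', 'No', 'Yes'],
})

toy_clean = toy_df.pipe(clean_dataframe)

# The original text columns are kept; two integer columns are added.
print(toy_clean[['outcome', 'alive', 'smoker', 'smokes']])

# .assign returns a new dataframe, so the input keeps its original columns.
print(list(toy_df.columns))
```

Because the function returns a new dataframe instead of changing the one passed in, it can be chained with .pipe and the raw data stays available for comparison.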

Now that the dataset is clean we can start with the analysis.
