Does smoking increase life expectancy?
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#
The table below shows data on 1,314 women from a survey conducted in the UK during 1972-74 [1]. The women were classified either as current smokers ('Y') or as never having smoked ('N'). A follow-up study was conducted twenty years later, recording whether each individual in the original survey was still alive at that time. From this data, it seems that smoking is beneficial, as only 24% of smokers had died compared to 31% of nonsmokers.
df = pd.DataFrame(data={
    'age_group': ["18-24", "18-24", "25-34", "25-34",
                  "35-44", "35-44", "45-54", "45-54",
                  "55-64", "55-64", "65-74", "65-74",
                  "75+", "75+"],
    'smoker': ['Y', 'N'] * 7,
    'dead': [2, 1, 3, 5, 14, 7, 27, 12, 51, 40, 29, 101, 13, 64],
    'alive': [53, 61, 121, 152, 95, 114, 103, 66, 64, 81, 7, 28, 0, 0]
})
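# Percentage alive per (age group, smoker) row; scale first, then round, to avoid float noise
df['pct_alive'] = (100 * df.alive / (df.alive + df.dead)).round(2)
# Aggregate over age groups to get the overall counts per smoking status
sdf = df.groupby('smoker').agg({'dead': 'sum', 'alive': 'sum'})
sdf['pct_alive'] = (100 * sdf.alive / (sdf.alive + sdf.dead)).round(0)
sdf.reset_index(inplace=True)
# 'precision' is not a CSS property, so the number format is set separately from the highlight
sdf.style.format({'pct_alive': '{:.0f}'}).set_properties(**{'background-color': 'yellow'}, subset=['pct_alive'])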
#
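As a quick check on the 24% and 31% figures quoted above, the death percentages can be recomputed directly from the raw counts. This is a minimal standalone sketch; the variable names (`smoker_dead`, `counts`, and so on) are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Deaths and survivors per age group (18-24 through 75+), copied from the table above.
smoker_dead = [2, 3, 14, 27, 51, 29, 13]
smoker_alive = [53, 121, 95, 103, 64, 7, 0]
nonsmoker_dead = [1, 5, 7, 12, 40, 101, 64]
nonsmoker_alive = [61, 152, 114, 66, 81, 28, 0]

# Totals per smoking status, ignoring age.
counts = pd.DataFrame(
    {'dead': [sum(smoker_dead), sum(nonsmoker_dead)],
     'alive': [sum(smoker_alive), sum(nonsmoker_alive)]},
    index=pd.Index(['Y', 'N'], name='smoker'),
)
counts['total'] = counts.dead + counts.alive
counts['pct_dead'] = (100 * counts.dead / counts.total).round(1)
print(counts)
```

The totals add up to the 1,314 women in the survey, and the `pct_dead` column rounds to the 24% and 31% quoted in the text.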
Before we jump to the conclusion that smoking is beneficial, let's compare the percentage alive within subpopulations of the same age. In the bar chart below on the left, we see that a higher percentage of smokers are alive when compared to nonsmokers. However, in the chart on the right, which splits the data by age group, a higher percentage of nonsmokers are alive in almost every age group!
So should we infer the following?
- If we don't know age, then smoking increases life expectancy
- If we know age, then smoking decreases life expectancy
Obviously, that is ridiculous! This is an example of Simpson's Paradox [2], where the association between two variables (smoker and pct_alive) reverses sign when we condition on a third variable (age_group). Here the reversal shows up in five of the seven age groups; the exceptions are 25-34, where smokers fare slightly better, and 75+, where no one in either group was still alive. In this example, we need to include the age_group variable in our analysis and conclude that smoking decreases life expectancy.
with plt.xkcd():
    fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(16, 4), gridspec_kw={'width_ratios': [1, 3]})
    # Marginalized over age group
    sns.barplot(x='smoker', y='pct_alive', data=sdf, order=['Y', 'N'], ax=ax0)
    ax0.set_xlabel('Smoker')
    ax0.set_ylabel('% Alive')
    # With age groups
    sns.barplot(x='age_group', y='pct_alive', hue='smoker', hue_order=['Y', 'N'], data=df, ax=ax1)
    ax1.set_xlabel('Age Group')
    ax1.set_ylabel('% Alive')
#
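To see where the reversal comes from, it can help to put the age-specific survival rates next to each group's age composition. The sketch below is an illustration, not part of the original analysis: it rebuilds the table in wide form and then applies direct age standardization (weighting each group's age-specific survival by the pooled age distribution), a technique swapped in here to make the conditioning on age concrete. All names (`table`, `summary`, `adj_smoker`, `adj_nonsmoker`, `n_better`) are assumptions introduced for this example.

```python
import pandas as pd

age_groups = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
table = pd.DataFrame({
    'smoker_dead': [2, 3, 14, 27, 51, 29, 13],
    'smoker_alive': [53, 121, 95, 103, 64, 7, 0],
    'nonsmoker_dead': [1, 5, 7, 12, 40, 101, 64],
    'nonsmoker_alive': [61, 152, 114, 66, 81, 28, 0],
}, index=age_groups)

# Group sizes per age band
smoker_n = table.smoker_dead + table.smoker_alive
nonsmoker_n = table.nonsmoker_dead + table.nonsmoker_alive

# Age-specific survival rates (fractions, unrounded)
smoker_rate = table.smoker_alive / smoker_n
nonsmoker_rate = table.nonsmoker_alive / nonsmoker_n

# Survival per age band, plus the share of each group that falls in that band
summary = pd.DataFrame({
    'smoker_pct_alive': (100 * smoker_rate).round(1),
    'nonsmoker_pct_alive': (100 * nonsmoker_rate).round(1),
    'smoker_age_share': (100 * smoker_n / smoker_n.sum()).round(1),
    'nonsmoker_age_share': (100 * nonsmoker_n / nonsmoker_n.sum()).round(1),
})
print(summary)

# Number of age bands in which nonsmokers survive at a strictly higher rate
n_better = int((nonsmoker_rate > smoker_rate).sum())
print('Age groups where nonsmokers do better:', n_better)

# Direct standardization: give both groups the same (pooled) age distribution
weights = (smoker_n + nonsmoker_n) / (smoker_n + nonsmoker_n).sum()
adj_smoker = 100 * (weights * smoker_rate).sum()
adj_nonsmoker = 100 * (weights * nonsmoker_rate).sum()
print(f'Age-adjusted % alive: smokers {adj_smoker:.1f}, nonsmokers {adj_nonsmoker:.1f}')
```

The age-share columns show that the nonsmokers in this survey skew older, which is why pooling over age flatters the smokers. Once both groups are given the same age distribution, the adjusted survival of nonsmokers comes out higher than that of smokers.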
Is it always the case that adding more variables, such as age above, lets us make better inferences? Not necessarily. In order to make better inferences regarding causal questions like the one above, we need to understand the story behind the data [3].
References
1. Appleton et al., Ignoring a Covariate: An Example of Simpson's Paradox. The American Statistician, Vol. 50, No. 4 (Nov. 1996), pp. 340-341
2. Judea Pearl, Understanding Simpson's Paradox
3. Pearl et al., Causal Inference in Statistics: A Primer. Wiley