Does exposure to welding fumes cause DNA damage?
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy.spatial import distance
from scipy import optimize
%config InlineBackend.figure_format = 'retina'
#
The following data is from a study conducted in 1993 [1] on 21 welders and 26 controls not exposed to welding fumes. All subjects were men. The variables in the dataset are:
- age : in years
- race : African American or Caucasian
- smoker : Y or N
- welder : Y or N
- dpc : DNA protein crosslinks percentage
There are 3 covariates: age, race, and current smoking behavior, i.e. variables measured at the beginning of the study. The treatment variable is whether the subject is a welder. The response variable is the DNA-protein crosslinks percentage (dpc), an indicator of DNA damage [2]. The hypothesis is that exposure to welding fumes causes DNA damage. This was an observational study, as subjects were not assigned randomly to the treatment.
Matching is a technique used in observational studies to set up a valid, apples-to-apples comparison between treatment and control groups by balancing observed covariates. Even though the groups may be matched on observed data, they may not be comparable in terms of unmeasured covariates, e.g. whether a welder's father was also a welder. If such an unmeasured variable is a confounder, then matching only on observed covariates will not balance it. Randomization is the gold standard for causal inference because it balances observed and unobserved confounders alike. In an observational study, we balance the observed covariates and hope that the unobserved confounders are balanced as well.
Notation ¶
- $\mathbf{x}$: Observed covariates, in this case a three-dimensional vector (age, race, smoker) for each subject
- Let $t$ denote a treated subject and $c$ denote a control subject. Then $\mathbf{x}_t$ denotes the covariates of $t$ and $\mathbf{x}_c$ denotes the covariates of control $c$.
- Let $e(\mathbf{x})$ denote the true propensity score, the conditional probability of treatment given $\mathbf{x}$, i.e. $e(\mathbf{x}) = Pr(Z = 1 \vert \mathbf{x})$. In this example, the treatment assignment $Z$ is whether a subject is a welder.
- Let $\hat{e}(\mathbf{x})$ denote an estimate of the propensity score, based on a model relating the treatment assignment $Z$ to the observed covariates $\mathbf{x}$.
Why use Propensity Score? ¶
Matching directly on $\mathbf{x}$ becomes impractical as the dimensionality of $\mathbf{x}$ increases. Instead, we match on the propensity score, a single number that tends to balance all the observed covariates.
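The reason a single number can stand in for the whole covariate vector is the balancing property of the propensity score: conditional on $e(\mathbf{x})$, treatment assignment is independent of the observed covariates,
$$\mathbf{x} \perp Z \mid e(\mathbf{x}).$$
So treated and control subjects with similar propensity scores tend to have similar distributions of $\mathbf{x}$, even though their individual covariate values may differ.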
What's coming up? ¶
- Exploratory analysis: Compare means of covariates between treatment and control groups
- Estimate propensity score using a logistic regression model
- Pairwise matching (one treated subject matched to a unique control) using propensity score
- Improve balance using propensity score calipers and Mahalanobis distance
- Optimal pair matching to minimize the sum of distances over all pairs
- Compare means of covariates with what we saw at the beginning
This case study is based on the chapter Basic Tools of Multivariate Matching [3].
df = pd.read_csv("data/costa.csv", index_col=0)
df.head()
#
Let's first compare the covariate distributions in the two groups before matching.
fig = plt.figure(figsize=(16,10), constrained_layout=True)
with sns.plotting_context("notebook", font_scale=1.3):
    gs = fig.add_gridspec(3, 3)
    ax0 = fig.add_subplot(gs[0,:])
    sns.scatterplot(x=df.age, y=df.dpc,
                    hue=df.welder, hue_order=['Y','N'],
                    size=df.smoker, size_order=['Y','N'], sizes={'Y':120,'N':60},
                    style=df.race, style_order=['C','A'],
                    ax=ax0)
    ax0.set_xlabel('Age')
    ax0.set_ylabel('% DPC')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    ax1 = fig.add_subplot(gs[1,:])
    sns.kdeplot(data=df, x='age', hue='welder', ax=ax1, hue_order=['Y','N'])
    ax1.set_xlabel('Age')
    ax2 = fig.add_subplot(gs[2,0])
    sns.boxplot(x='smoker', y='dpc', data=df, ax=ax2, order=['Y','N'])
    ax2.set_xlabel('Smoker')
    ax2.set_ylabel('% DPC')
    ax3 = fig.add_subplot(gs[2,1])
    sns.boxplot(x='race', y='dpc', data=df, ax=ax3, order=['C','A'])
    ax3.set_xlabel('Race')
    ax3.set_ylabel('% DPC')
    ax4 = fig.add_subplot(gs[2,2])
    sns.boxplot(x='welder', y='dpc', data=df, ax=ax4, order=['Y','N'])
    ax4.set_xlabel('Welder')
    ax4.set_ylabel('% DPC')
plt.tight_layout();
#
Notice that the welder group skews younger and contains more smokers. Also notice that smokers and welders have elevated dpc.
Encode categorical variables ¶
- welder = 'Y' $\rightarrow$ treatment = 1 else treatment = 0
- smoker = 'Y' $\rightarrow$ smoker_y = 1 else smoker_y = 0
- race = 'A' $\rightarrow$ race_a = 1 else race_a = 0
# Encode treatment: welder == 'Y' -> 1, else 0
df['treatment'] = 0
df.loc[df.welder=="Y",'treatment'] = 1
covariates = ['age','race_a','smoker_y']
# get_dummies with drop_first keeps the 'C' indicator for race; 1 minus it flags race == 'A'
df['race_a'] = (1 - pd.get_dummies(df.race, drop_first=True))
# drop_first keeps the 'Y' indicator for smoker, i.e. smoker_y == 1 for smokers
df['smoker_y'] = pd.get_dummies(df.smoker, drop_first=True)
df.sample(5)
#
The two groups appear imbalanced:
- Controls are about 4 years older than welders on average
- 19% African Americans among controls vs. 10% among welders
- 35% smokers among controls vs. 52% among welders
df.groupby('welder').agg(dict(zip(covariates,['mean']*3))).round(2)
Estimate Propensity Score using Logistic Regression ¶
propensity_formula = (
'treatment ~ age + race_a + smoker_y'
)
model = sm.Logit.from_formula(propensity_formula, data=df)
res = model.fit();
df['pscore'] = res.predict(df)
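Before matching on these scores, it helps to check that welders and controls overlap on the estimated propensity score; without common support, some welders would have no comparable control. A minimal sketch (an addition to the original analysis) using the pscore column created above:
# Check overlap (common support) of the estimated propensity scores
print(df.groupby('welder').pscore.describe().round(2))
fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(data=df, x='pscore', hue='welder', hue_order=['Y','N'], ax=ax)
ax.set_xlabel('Estimated propensity score');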
#
Compute L1 distance based on propensity score ¶
$$\text{l1dist}[t, c] = \vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert$$
pscore0 = df[df.welder=='N'].pscore.values
pscore1 = df[df.welder=='Y'].pscore.values
l1dist = np.abs(pscore1.reshape(-1,1) - pscore0.reshape(1,-1))
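From this matrix, the controls closest to any given welder can be read off by sorting the corresponding row. A small illustrative helper (the row number is the welder's position in the matrix, not the original dataframe index):
def nearest_controls(welder_row, dist, k=4):
    # Column positions of the k controls with the smallest distance to this welder
    return np.argsort(dist[welder_row])[:k]
nearest_controls(3, l1dist)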
#
We can visualize the distances of the welders (21 subjects) to the controls (26 subjects) as shown below. As an example, let's consider welder 3 and the nearest four controls by L1 distance (highlighted in red).
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(l1dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
from matplotlib.patches import Rectangle
# Highlight the four controls nearest to welder 3 (columns 2, 9, 14 and 19)
ax.add_patch(Rectangle((19, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((9, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((2, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((14, 3), 1, 1, fill=False, edgecolor='red', lw=3));
#
Let's compare the covariates (age, race, smoker) of welder 3 with controls 14 and 19. Even though the propensity score distance to control 14 is larger, control 14 seems a better match: all covariates agree except for a 2-year difference in age. Control 14's propensity score is higher because younger subjects are more likely to be in the welder group.
welder3_df = pd.concat((df[df.welder=='Y'].loc[[3]], df[df.welder=='N'].loc[[2,9,14,19]]))
welder3_df = welder3_df[['age','race','smoker','welder','dpc','pscore']]
def highlight_row(row):
    if row.name == 3:
        return ['background : lightblue'] * row.shape[0]
    elif row.name in [14,19]:
        return ['background : lightgreen'] * row.shape[0]
    else:
        return [''] * row.shape[0]
welder3_df.style.apply(highlight_row, axis=1).format({"pscore" : "{:20,.2f}"})
#
Can we first shortlist controls that are close on the propensity score, and then pick the best match based on the similarity of the covariates themselves? The Mahalanobis distance is a natural metric for that second step, as it measures how similar two covariate vectors are.
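For reference, the Mahalanobis distance between the covariate vectors of a treated subject $t$ and a control $c$ is
$$d_M(\mathbf{x}_t, \mathbf{x}_c) = \sqrt{(\mathbf{x}_t - \mathbf{x}_c)^\top \Sigma^{-1} (\mathbf{x}_t - \mathbf{x}_c)}$$
where $\Sigma$ is the covariance matrix of the covariates (estimated below from the pooled sample). Unlike a plain Euclidean distance, it accounts for the different scales of the covariates and for correlations between them.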
Matching using Mahalanobis Distance within propensity score calipers ¶
We'll first use a caliper with its width $w$ set as a multiple of the standard deviation of the propensity score. In this example, we set it to one half of the standard deviation. In practice, a caliper of 10-20% of the standard deviation of the propensity score is more common.
caliper_width = np.std(df.pscore, ddof=1)/2
print(f"Caliper width: {np.round(caliper_width,3)}")
#
The matrix below shows the Mahalanobis distance between welders and controls after masking out cells where the propensity score distance exceeds the caliper width. Note that, as we wanted, the best match for welder 3 now is control 14.
np.set_printoptions(precision=3, suppress=True)
X_t = df[df.welder=='Y'][covariates].values
X_c = df[df.welder=='N'][covariates].values
# Covariance of the covariates over the pooled sample, used in the Mahalanobis distance
X = np.vstack((X_t, X_c))
V = np.cov(X, rowvar=False)
VI = np.linalg.pinv(V)
maha_dist = distance.cdist(X_t, X_c, metric='mahalanobis', VI=VI)
# Mask out welder-control pairs whose propensity scores differ by more than the caliper width
mask = l1dist > caliper_width
# Visualize
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(maha_dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax, mask=mask)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
ax.add_patch(Rectangle((14, 3), 1, 1, fill=False, edgecolor='red', lw=3));
#
Optimal Pair Matching ¶
We'll now pair each welder with a distinct control such that the total distance within matched pairs is minimized. In the matrix above, if we mark each selected welder-control pair $(i,j)$ with a one and the remaining cells with zeros, then every row must contain exactly one 1, every column at most one 1, and the sum of the distances in the selected cells must be as small as possible over all such pairings. This is minimum-weight matching in a bipartite graph whose two vertex sets are the welders and the controls. We'll use scipy.optimize.linear_sum_assignment to solve it.
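As a toy illustration (with made-up numbers) of why the pairs must be chosen jointly rather than greedily:
# Greedily giving row 0 its cheapest column (cost 1) forces row 1 into cost 5, total 6;
# the optimal assignment instead takes costs 2 and 1, total 3.
toy_cost = np.array([[1.0, 2.0],
                     [1.0, 5.0]])
rows, cols = optimize.linear_sum_assignment(toy_cost)
print(list(zip(rows, cols)), toy_cost[rows, cols].sum())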
Due to competition among welders for the same controls, the caliper width may not be respected for all pairs. Therefore, we'll add a penalty to the Mahalanobis distance. The penalty is $$1000 \times \max(0, \vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert - w)$$
The penalty is 0 if the L1 distance is within the caliper width. Otherwise, a positive penalty is added to the distance in the masked white cells above.
Add penalties ¶
maha_dist[mask] += 1000 * (l1dist - caliper_width)[mask]
Compute matches ¶
The optimal pair matches are shown below. Notice that even though control 14 was the best match for welder 3 when considered in isolation, the best match now is control 13 as our objective is to minimize the total distance between matched pairs.
xidx, yidx = optimize.linear_sum_assignment(maha_dist)
# Visualize
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(maha_dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax, mask=mask)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
# Outline the optimal welder-control pairs
for i, j in zip(xidx, yidx):
    ax.add_patch(Rectangle((j, i), 1, 1, fill=False, edgecolor='red', lw=2));
#
We can now compare the covariates of the matched pairs side by side.
matched_treatment_df = df[df.welder=="Y"].iloc[xidx][covariates + ['pscore','dpc']]
matched_treatment_df.columns = [f"t_{col}" for col in matched_treatment_df.columns]
matched_treatment_df['c_idx'] = yidx
matched_control_df = df[df.welder=="N"].iloc[yidx][covariates + ['pscore','dpc']]
matched_control_df.columns = [f"c_{col}" for col in matched_control_df.columns]
matched_treatment_df.reset_index(drop=True, inplace=True)
matched_control_df.reset_index(drop=True, inplace=True)
pair_match_df = pd.concat((matched_treatment_df, matched_control_df),axis=1)
def highlight_columns(col):
    # Shade welder (t_) columns blue and matched-control (c_) columns green
    if col.name.startswith("t_"):
        return ['background-color: lightblue'] * len(col)
    elif col.name.startswith("c_") and col.name != "c_idx":
        return ['background-color: lightgreen'] * len(col)
    else:
        return [''] * len(col)
pair_match_df.style.apply(highlight_columns).format({"t_pscore" : "{:20,.2f}",
                                                     "t_dpc" : "{:20,.2f}",
                                                     "c_pscore" : "{:20,.2f}",
                                                     "c_dpc" : "{:20,.2f}",
                                                     })
#
Compare groups pre and post matching ¶
- Age: the matched groups are about one year apart vs. 4 years before
- Race: a 4 percentage-point difference vs. 9 before
- Smoker: a 14 percentage-point difference vs. 17 before
cols = covariates + ["pscore"]
all_cols = []
for col in cols:
all_cols += [f"t_{col}",f"c_{col}"]
pair_match_df[all_cols].mean().round(2).to_frame('mean')
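Raw mean differences depend on each covariate's scale, so a common complementary balance diagnostic is the standardized mean difference, computed before and after matching. A minimal sketch (an addition to the original analysis), using the columns created above:
def std_diff(t_vals, c_vals):
    # Difference in means divided by the pooled standard deviation
    pooled_sd = np.sqrt((t_vals.var(ddof=1) + c_vals.var(ddof=1)) / 2)
    return (t_vals.mean() - c_vals.mean()) / pooled_sd
for col in covariates:
    before = std_diff(df[df.treatment==1][col], df[df.treatment==0][col])
    after = std_diff(pair_match_df[f"t_{col}"], pair_match_df[f"c_{col}"])
    print(f"{col}: before = {before:.2f}, after = {after:.2f}")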
#
Pre-match means ¶
df.groupby('welder').agg(dict(zip(covariates,['mean']*3))).round(2)
The mean difference in dpc between matched welders and controls is 0.64. ¶
np.round((pair_match_df.t_dpc - pair_match_df.c_dpc).mean(),2)
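Whether a difference of this size is distinguishable from zero can be assessed with a test on the matched-pair differences. A minimal sketch using scipy.stats (a paired t-test is shown purely for illustration; a Wilcoxon signed-rank test would be a natural nonparametric alternative, and the original chapter's inference may differ):
from scipy import stats
# Paired t-test on within-pair differences in dpc (welder minus matched control)
t_stat, p_value = stats.ttest_rel(pair_match_df.t_dpc, pair_match_df.c_dpc)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")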