Does exposure to welding fumes cause DNA damage?
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy.spatial import distance
from scipy import optimize
%config InlineBackend.figure_format = 'retina'
#
The following data is from a study conducted in 1993 [1] on 21 welders and 26 controls not exposed to welding fumes. All subjects were men. The variables in the dataset are:
- age : in years
- race : African American or Caucasian
- smoker : Y or N
- welder : Y or N
- dpc : DNA protein crosslinks percentage
There are 3 covariates: age, race, and current smoking behavior, i.e. variables measured at the beginning of the study. The treatment variable is whether the subject is a welder. The response variable is the DNA-protein crosslinks percentage (dpc), an indicator of DNA damage [2]. The hypothesis is that exposure to welding fumes causes DNA damage. This was an observational study, as subjects were not assigned randomly to the treatment.
Matching is a technique used in observational studies to set up a valid, apples-to-apples comparison between treatment and control groups by balancing observed covariates. Even though the groups may be matched on observed data, they may not be comparable in terms of unmeasured covariates, e.g. whether a welder's father was also a welder. If such an unmeasured variable is a confounder, then matching only on observed covariates will not balance it. Randomization is the gold standard for causal inference because it balances observed and unobserved confounders alike. In an observational study, we balance the observed covariates and hope that the unobserved confounders are balanced as well.
Notation ¶
- $\mathbf{x}$: Observed covariates, in this case a three-dimensional vector (age, race, smoker) for each subject
- Let $t$ denote a treated subject and $c$ denote a control subject. Then $\mathbf{x}_t$ denotes the covariates of $t$ and $\mathbf{x}_c$ denotes the covariates of control $c$.
- Let $e(\mathbf{x})$ denote the true propensity score, the conditional probability of treatment given $\mathbf{x}$, i.e. $e(\mathbf{x}) = Pr(Z = 1 \vert \mathbf{x})$. In this example, the treatment assignment $Z$ is whether a subject is a welder.
- Let $\hat{e}(\mathbf{x})$ denote an estimate of the propensity score, based on a model relating the treatment assignment $Z$ to the observed covariates $\mathbf{x}$.
Why use Propensity Score? ¶
Matching directly on $\mathbf{x}$ becomes impractical as the dimensionality of $\mathbf{x}$ increases. Instead, we match on the propensity score, a single number that tends to balance all the observed covariates.
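The reason a single number can stand in for the whole covariate vector is the balancing property of the propensity score: conditional on $e(\mathbf{x})$, treatment assignment is independent of the observed covariates,
$$\mathbf{x} \perp Z \mid e(\mathbf{x}).$$
So treated and control subjects with similar propensity scores tend to have similar distributions of $\mathbf{x}$, even though their individual covariate values may differ.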
What's coming up? ¶
- Exploratory analysis: Compare means of covariates between treatment and control groups
- Estimate propensity score using a logistic regression model
- Pairwise matching (one treated subject matched to a unique control) using propensity score
- Improve balance using propensity score calipers and Mahalanobis distance
- Optimal pair matching to minimize the sum of distances over all pairs
- Compare means of covariates with what we saw at the beginning
This case study is based on the chapter Basic Tools of Multivariate Matching [3].
df = pd.read_csv("data/costa.csv", index_col=0)
df.head()
#
Let's first compare the covariate distributions in the two groups before matching.
fig = plt.figure(figsize=(16,10), constrained_layout=True)
with sns.plotting_context("notebook", font_scale=1.3):
    gs = fig.add_gridspec(3, 3)
    ax0 = fig.add_subplot(gs[0,:])
    sns.scatterplot(x=df.age, y=df.dpc,
                    hue=df.welder, hue_order=['Y','N'],
                    size=df.smoker, size_order=['Y','N'], sizes={'Y':120,'N':60},
                    style=df.race, style_order=['C','A'],
                    ax=ax0)
    ax0.set_xlabel('Age')
    ax0.set_ylabel('% DPC')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    ax1 = fig.add_subplot(gs[1,:])
    sns.kdeplot(data=df, x='age', hue='welder', ax=ax1, hue_order=['Y','N'])
    ax1.set_xlabel('Age')
    ax2 = fig.add_subplot(gs[2,0])
    sns.boxplot(x='smoker', y='dpc', data=df, ax=ax2, order=['Y','N'])
    ax2.set_xlabel('Smoker')
    ax2.set_ylabel('% DPC')
    ax3 = fig.add_subplot(gs[2,1])
    sns.boxplot(x='race', y='dpc', data=df, ax=ax3, order=['C','A'])
    ax3.set_xlabel('Race')
    ax3.set_ylabel('% DPC')
    ax4 = fig.add_subplot(gs[2,2])
    sns.boxplot(x='welder', y='dpc', data=df, ax=ax4, order=['Y','N'])
    ax4.set_xlabel('Welder')
    ax4.set_ylabel('% DPC')
plt.tight_layout();
#
Notice that the welder group skews younger and contains more smokers. Also notice that smokers and welders have elevated dpc.
Encode categorical variables ¶
- welder = 'Y' $\rightarrow$ treatment = 1 else treatment = 0
- smoker = 'Y' $\rightarrow$ smoker_y = 1 else smoker_y = 0
- race = 'A' $\rightarrow$ race_a = 1 else race_a = 0
# Encode treatment: welder == 'Y' -> 1, else 0
df['treatment'] = 0
df.loc[df.welder=="Y",'treatment'] = 1
covariates = ['age','race_a','smoker_y']
# get_dummies with drop_first keeps the 'C' indicator for race; 1 minus it flags race == 'A'
df['race_a'] = (1 - pd.get_dummies(df.race, drop_first=True))
# drop_first keeps the 'Y' indicator for smoker, i.e. smoker_y == 1 for smokers
df['smoker_y'] = pd.get_dummies(df.smoker, drop_first=True)
df.sample(5)
#
The two groups appear imbalanced:
- Controls are about 4 years older than welders on average
- 19% African Americans among controls vs. 10% among welders
- 35% smokers among controls vs. 52% among welders
df.groupby('welder').agg(dict(zip(covariates,['mean']*3))).round(2)
Estimate Propensity Score using Logistic Regression ¶
propensity_formula = (
'treatment ~ age + race_a + smoker_y'
)
model = sm.Logit.from_formula(propensity_formula, data=df)
res = model.fit();
df['pscore'] = res.predict(df)
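Before matching on these scores, it helps to check that welders and controls overlap on the estimated propensity score; without common support, some welders would have no comparable control. A minimal sketch (an addition to the original analysis) using the pscore column created above:
# Check overlap (common support) of the estimated propensity scores
print(df.groupby('welder').pscore.describe().round(2))
fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(data=df, x='pscore', hue='welder', hue_order=['Y','N'], ax=ax)
ax.set_xlabel('Estimated propensity score');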
#
Compute L1 distance based on propensity score ¶
$$\text{l1dist}[t, c] = \vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert$$
pscore0 = df[df.welder=='N'].pscore.values
pscore1 = df[df.welder=='Y'].pscore.values
l1dist = np.abs(pscore1.reshape(-1,1) - pscore0.reshape(1,-1))
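From this matrix, the controls closest to any given welder can be read off by sorting the corresponding row. A small illustrative helper (the row number is the welder's position in the matrix, not the original dataframe index):
def nearest_controls(welder_row, dist, k=4):
    # Column positions of the k controls with the smallest distance to this welder
    return np.argsort(dist[welder_row])[:k]
nearest_controls(3, l1dist)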
#
We can visualize the distances of the welders (21 subjects) to the controls (26 subjects) as shown below. As an example, let's consider welder 3 and the nearest four controls by L1 distance (highlighted in red).
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(l1dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
from matplotlib.patches import Rectangle
# Highlight the four controls nearest to welder 3 (columns 2, 9, 14 and 19)
ax.add_patch(Rectangle((19, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((9, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((2, 3), 1, 1, fill=False, edgecolor='red', lw=3))
ax.add_patch(Rectangle((14, 3), 1, 1, fill=False, edgecolor='red', lw=3));
#
Let's compare the covariates (age, race, smoker) of welder 3 with controls 14 and 19. Even though the propensity score distance to control 14 is larger, control 14 seems a better match: all covariates agree except for a 2-year difference in age. Control 14's propensity score is higher because younger subjects are more likely to be in the welder group.
welder3_df = pd.concat((df[df.welder=='Y'].loc[[3]], df[df.welder=='N'].loc[[2,9,14,19]]))
welder3_df = welder3_df[['age','race','smoker','welder','dpc','pscore']]
def highlight_row(row):
    if row.name == 3:
        return ['background : lightblue'] * row.shape[0]
    elif row.name in [14,19]:
        return ['background : lightgreen'] * row.shape[0]
    else:
        return [''] * row.shape[0]
welder3_df.style.apply(highlight_row, axis=1).format({"pscore" : "{:20,.2f}"})
#
Can we first shortlist controls that are close on the propensity score, and then pick the best match based on the similarity of the covariates themselves? The Mahalanobis distance is a natural metric for that second step, as it measures how similar two covariate vectors are.
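For reference, the Mahalanobis distance between the covariate vectors of a treated subject $t$ and a control $c$ is
$$d_M(\mathbf{x}_t, \mathbf{x}_c) = \sqrt{(\mathbf{x}_t - \mathbf{x}_c)^\top \Sigma^{-1} (\mathbf{x}_t - \mathbf{x}_c)}$$
where $\Sigma$ is the covariance matrix of the covariates (estimated below from the pooled sample). Unlike a plain Euclidean distance, it accounts for the different scales of the covariates and for correlations between them.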
Matching using Mahalanobis Distance within propensity score calipers ¶
We'll first use a caliper with its width $w$ set as a multiple of the standard deviation of the propensity score. In this example, we set it to one half of the standard deviation. In practice, a caliper of 10-20% of the standard deviation of the propensity score is more common.
caliper_width = np.std(df.pscore, ddof=1)/2
print(f"Caliper width: {np.round(caliper_width,3)}")
#
The matrix below shows the Mahalanobis distance between welders and controls after masking out cells where the propensity score distance exceeds the caliper width. Note that, as we wanted, the best match for welder 3 now is control 14.
np.set_printoptions(precision=3, suppress=True)
X_t = df[df.welder=='Y'][covariates].values
X_c = df[df.welder=='N'][covariates].values
# Covariance of the covariates over the pooled sample, used in the Mahalanobis distance
X = np.vstack((X_t, X_c))
V = np.cov(X, rowvar=False)
VI = np.linalg.pinv(V)
maha_dist = distance.cdist(X_t, X_c, metric='mahalanobis', VI=VI)
# Mask out welder-control pairs whose propensity scores differ by more than the caliper width
mask = l1dist > caliper_width
# Visualize
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(maha_dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax, mask=mask)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
ax.add_patch(Rectangle((14, 3), 1, 1, fill=False, edgecolor='red', lw=3));
#
Optimal Pair Matching ¶
We'll now pair each welder with a distinct control such that the total distance within matched pairs is minimized. In the matrix above, if we mark each selected welder-control pair $(i,j)$ with a one and the remaining cells with zeros, then every row must contain exactly one 1, every column at most one 1, and the sum of the distances in the selected cells must be as small as possible over all such pairings. This is minimum-weight matching in a bipartite graph whose two vertex sets are the welders and the controls. We'll use scipy.optimize.linear_sum_assignment to solve it.
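As a toy illustration (with made-up numbers) of why the pairs must be chosen jointly rather than greedily:
# Greedily giving row 0 its cheapest column (cost 1) forces row 1 into cost 5, total 6;
# the optimal assignment instead takes costs 2 and 1, total 3.
toy_cost = np.array([[1.0, 2.0],
                     [1.0, 5.0]])
rows, cols = optimize.linear_sum_assignment(toy_cost)
print(list(zip(rows, cols)), toy_cost[rows, cols].sum())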
Due to competition among welders for the same controls, the caliper width may not be respected for all pairs. Therefore, we'll add a penalty to the Mahalanobis distance. The penalty is $$1000 \times \max(0, \vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert - w)$$
The penalty is 0 if the L1 distance is within the caliper width. Otherwise, a positive penalty is added to the distance in the masked white cells above.
Add penalties ¶
maha_dist[mask] += 1000 * (l1dist - caliper_width)[mask]
Compute matches ¶
The optimal pair matches are shown below. Notice that even though control 14 was the best match for welder 3 when considered in isolation, the best match now is control 13 as our objective is to minimize the total distance between matched pairs.
xidx, yidx = optimize.linear_sum_assignment(maha_dist)
# Visualize
fig, ax = plt.subplots(figsize=(14,10))
with sns.plotting_context("notebook", font_scale=0.9):
    hm = sns.heatmap(maha_dist, cmap="Blues_r", robust=True, annot=True, fmt=".2f",
                     cbar=False, ax=ax, mask=mask)
    ax.set_xlabel('Control')
    ax.set_ylabel('Welder')
# Outline the optimal welder-control pairs
for i, j in zip(xidx, yidx):
    ax.add_patch(Rectangle((j, i), 1, 1, fill=False, edgecolor='red', lw=2));
#
We can now compare the covariates of the matched pairs side by side.
matched_treatment_df = df[df.welder=="Y"].iloc[xidx][covariates + ['pscore','dpc']]
matched_treatment_df.columns = [f"t_{col}" for col in matched_treatment_df.columns]
matched_treatment_df['c_idx'] = yidx
matched_control_df = df[df.welder=="N"].iloc[yidx][covariates + ['pscore','dpc']]
matched_control_df.columns = [f"c_{col}" for col in matched_control_df.columns]
matched_treatment_df.reset_index(drop=True, inplace=True)
matched_control_df.reset_index(drop=True, inplace=True)
pair_match_df = pd.concat((matched_treatment_df, matched_control_df),axis=1)
def highlight_columns(col):
    # Shade welder (t_) columns blue and matched-control (c_) columns green
    if col.name.startswith("t_"):
        return ['background-color: lightblue'] * len(col)
    elif col.name.startswith("c_") and col.name != "c_idx":
        return ['background-color: lightgreen'] * len(col)
    else:
        return [''] * len(col)
pair_match_df.style.apply(highlight_columns).format({"t_pscore" : "{:20,.2f}",
                                                     "t_dpc" : "{:20,.2f}",
                                                     "c_pscore" : "{:20,.2f}",
                                                     "c_dpc" : "{:20,.2f}",
                                                     })
#
Compare groups pre and post matching ¶
- Age: the matched groups are about one year apart vs. 4 years before
- Race: a 4 percentage-point difference vs. 9 before
- Smoker: a 14 percentage-point difference vs. 17 before
cols = covariates + ["pscore"]
all_cols = []
for col in cols:
all_cols += [f"t_{col}",f"c_{col}"]
pair_match_df[all_cols].mean().round(2).to_frame('mean')
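Raw mean differences depend on each covariate's scale, so a common complementary balance diagnostic is the standardized mean difference, computed before and after matching. A minimal sketch (an addition to the original analysis), using the columns created above:
def std_diff(t_vals, c_vals):
    # Difference in means divided by the pooled standard deviation
    pooled_sd = np.sqrt((t_vals.var(ddof=1) + c_vals.var(ddof=1)) / 2)
    return (t_vals.mean() - c_vals.mean()) / pooled_sd
for col in covariates:
    before = std_diff(df[df.treatment==1][col], df[df.treatment==0][col])
    after = std_diff(pair_match_df[f"t_{col}"], pair_match_df[f"c_{col}"])
    print(f"{col}: before = {before:.2f}, after = {after:.2f}")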
#
Pre-match means ¶
df.groupby('welder').agg(dict(zip(covariates,['mean']*3))).round(2)
The mean difference in dpc between matched welders and controls is 0.64. ¶
np.round((pair_match_df.t_dpc - pair_match_df.c_dpc).mean(),2)
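Whether a difference of this size is distinguishable from zero can be assessed with a test on the matched-pair differences. A minimal sketch using scipy.stats (a paired t-test is shown purely for illustration; a Wilcoxon signed-rank test would be a natural nonparametric alternative, and the original chapter's inference may differ):
from scipy import stats
# Paired t-test on within-pair differences in dpc (welder minus matched control)
t_stat, p_value = stats.ttest_rel(pair_match_df.t_dpc, pair_match_df.c_dpc)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")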