Does exposure to welding fumes cause DNA damage?

In [1]:

The following data is from a study conducted in 1993 [1] on 21 welders and 26 controls not exposed to welding fumes. All subjects were men. The variables in the dataset are:

  • age : in years
  • race : African American or Caucasian
  • smoker : Y or N
  • welder : Y or N
  • dpc : DNA-protein crosslinks percentage

There are 3 covariates: age, race, and current smoking behavior, i.e. variables measured at the beginning of the study. The treatment variable is whether the subject is a welder. The response variable is the DNA-protein crosslinks percentage (dpc), an indicator of DNA damage [2]. The hypothesis is that exposure to welding fumes causes DNA damage. This was an observational study, as subjects were not randomly assigned to the treatment.

Matching is a technique used in observational studies to set up a valid, apples-to-apples comparison between treatment and control groups by balancing observed covariates. Even though the groups may be matched on observed data, they may not be comparable in terms of unmeasured covariates, e.g. whether a welder's father was also a welder. If such an unmeasured variable is a confounder, then matching only on observed covariates will not balance the unobserved ones. Randomization is the gold standard for causal inference as it balances both observed and unobserved confounders. In an observational study, we balance the observed covariates and hope that the unobserved confounders are balanced as well.

Notation

  • $\mathbf{x}$: Observed covariates; in this case, a three-dimensional vector (age, race, smoker) for each subject
  • Let $t$ denote a treated subject and $c$ denote a control subject. Then $\mathbf{x}_t$ denotes the covariates of $t$ and $\mathbf{x}_c$ denotes the covariates of control $c$.
  • Let $e(\mathbf{x})$ denote the true propensity score, the conditional probability of treatment given $\mathbf{x}$, i.e. $e(\mathbf{x}) = Pr(Z = 1 \vert \mathbf{x})$. In this example, the treatment assignment $Z$ is whether a subject is a welder.
  • Let $\hat{e}(\mathbf{x})$ denote an estimate of the propensity score, based on a model that relates the treatment assignment $Z$ to the observed covariates $\mathbf{x}$.

Why use the Propensity Score?

Matching directly on $\mathbf{x}$ becomes impractical as the dimensionality of $\mathbf{x}$ increases. Instead, we match on the propensity score, which tends to balance all of the covariates.

What's coming up?

  1. Exploratory analysis: Compare means of covariates between treatment and control groups
  2. Estimate propensity score using a logistic regression model
  3. Pairwise matching (one treated subject matched to a unique control) using propensity score
  4. Improve balance using propensity score calipers and Mahalanobis distance
  5. Optimal pair matching to minimize the sum of distances over all pairs
  6. Compare means of covariates with what we saw at the beginning

This case study is based on the chapter Basic Tools of Multivariate Matching [3].
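
The next cell loads the data. A minimal sketch of what it might look like, assuming a hypothetical CSV file welders.csv with one row per subject:

import pandas as pd

# Hypothetical file name; rows are indexed by subject id
df = pd.read_csv('welders.csv', index_col='subject')
df.head()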

In [2]:
Out[2]:
         age race smoker welder   dpc
subject
0         48    A      N      N  1.08
1         63    C      N      N  1.09
2         44    C      Y      N  1.10
3         40    C      N      N  1.10
4         50    C      N      N  0.93

Let's first compare the covariate distributions in the two groups before matching.
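
A sketch of one way to draw this comparison; the choice of seaborn and the specific panels are assumptions:

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sns.boxplot(data=df, x='welder', y='age', ax=axes[0])         # age by group
sns.countplot(data=df, x='smoker', hue='welder', ax=axes[1])  # smokers by group
sns.boxplot(data=df, x='welder', y='dpc', ax=axes[2])         # dpc by group
plt.tight_layout()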

In [3]:

Notice that the welder group is younger and includes a larger share of smokers. Also notice that smokers and welders have elevated dpc.

Encode categorical variables

  1. welder = 'Y' $\rightarrow$ treatment = 1 else treatment = 0
  2. smoker = 'Y' $\rightarrow$ smoker_y = 1 else smoker_y = 0
  3. race = 'A' $\rightarrow$ race_a = 1 else race_a = 0
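
A minimal sketch of this encoding; the exact cell isn't shown, but the column names match the output below (the covariates helper is an assumption used by later cells):

# 0/1 indicator columns derived from the categorical variables
df['treatment'] = (df['welder'] == 'Y').astype(int)
df['smoker_y'] = (df['smoker'] == 'Y').astype(int)
df['race_a'] = (df['race'] == 'A').astype(int)
covariates = ['age', 'race_a', 'smoker_y']  # assumed helper for later cells
df.sample(5)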
In [4]:
Out[4]:
         age race smoker welder   dpc  treatment  race_a  smoker_y
subject
25        35    C      Y      N  1.59          0       0         1
18        38    C      Y      Y  1.15          1       0         1
22        45    C      N      N  1.02          0       0         0
10        42    C      N      N  0.55          0       0         0
12        41    C      N      N  1.66          0       0         0

The two groups appear imbalanced:

  1. Controls are about 4 years older than welders on average
  2. 19% African Americans among controls vs. 10% among welders
  3. 35% smokers among controls vs. 52% among welders
In [5]:
# Covariate means by group, before matching
df.groupby('welder').agg(dict(zip(covariates, ['mean']*3))).round(2)
Out[5]:
          age  race_a  smoker_y
welder
N       42.69    0.19      0.35
Y       38.24    0.10      0.52

Estimate Propensity Score using Logistic Regression
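
A minimal sketch of the model fit with statsmodels, assuming the treatment and covariates columns defined above (adding an explicit intercept is an assumption):

import statsmodels.api as sm

# Design matrix: intercept plus age, race_a, smoker_y
X = sm.add_constant(df[covariates].astype(float))
logit_model = sm.Logit(df['treatment'], X).fit()  # prints the optimizer message below

# Estimated propensity score e-hat(x) for every subject
df['pscore'] = logit_model.predict(X)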

In [6]:
Optimization terminated successfully.
         Current function value: 0.624527
         Iterations 6

Compute L1 distance based on propensity score

l1dist = $\vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert$
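
A sketch of this computation via NumPy broadcasting, assuming df['pscore'] from the previous step:

import numpy as np

treated = df[df.treatment == 1]
control = df[df.treatment == 0]

# 21 x 26 matrix: |e-hat(x_t) - e-hat(x_c)| for every welder-control pair
l1dist = np.abs(treated.pscore.values[:, None] - control.pscore.values[None, :])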

In [7]:

We can visualize the distances from the welders (21 subjects) to the controls (26 subjects) as shown below. As an example, let's consider welder 3 and the four nearest controls by L1 distance (highlighted in red).

In [8]:

Let's compare the covariates (age, race, smoker) of welder 3 with those of controls 14 and 19. Even though the propensity score distance to control 14 is larger, it seems the better match: all covariates agree except for a 2-year difference in age. The propensity score of control 14 is higher because younger subjects are more likely to be in the welder group.

In [9]:
Out[9]:
         age race smoker welder   dpc  pscore
subject
3         33    A      Y      Y  0.65    0.51
2         44    C      Y      N  1.10    0.47
9         34    C      N      N  1.55    0.54
14        31    A      Y      N  1.36    0.55
19        35    C      N      N  2.33    0.52

Can we first shortlist the nearest controls based on propensity score and then pick the best match based on similarity of covariates? We can use the Mahalanobis distance, a metric that accounts for the scale of and correlations among the covariates, to find better matches within that shortlist.

Matching using Mahalanobis Distance within propensity score calipers

We'll first use a caliper whose width $w$ is set as a multiple of the standard deviation of the propensity score. In this example, we set it to one half of the standard deviation; in practice, a caliper of 10-20% of the standard deviation is more common.
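
A sketch of the caliper and the distance matrix, assuming the treated/control frames from above; estimating the covariance by pooling over all subjects is an assumption:

import numpy as np
from scipy.spatial.distance import cdist

# Caliper: half the standard deviation of the estimated propensity score
caliper_width = 0.5 * df.pscore.std()
print(f'Caliper width: {caliper_width:.3f}')

# Mahalanobis distance on the covariates (VI is the inverse covariance matrix)
VI = np.linalg.inv(np.cov(df[covariates].astype(float).values, rowvar=False))
maha_dist = cdist(treated[covariates].astype(float).values,
                  control[covariates].astype(float).values,
                  metric='mahalanobis', VI=VI)

# Pairs whose propensity scores differ by more than the caliper
mask = l1dist > caliper_width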

In [10]:
Caliper width: 0.086

The matrix below shows the Mahalanobis distance between welders and controls after masking out cells where the propensity score distance exceeds the caliper width. Note that, as we wanted, the best match for welder 3 now is control 14.

In [11]:

Optimal Pair Matching

We'll now pair each welder with a distinct control such that the total distance within matched pairs is minimized. In the matrix above, if we mark each selected welder-control pair $(i,j)$ with a 1 and fill the remaining cells with 0s, then every row must contain exactly one 1 and every column at most one 1. In addition, the sum of the distances of the selected cells should be the smallest possible over all such pairings. This is exactly minimum-weight matching in a bipartite graph, where one partite set is the welders and the other is the controls. We'll use SciPy's linear_sum_assignment to solve it.

Due to competition among welders for the same controls, the caliper width may not be respected for all pairs. Therefore, we'll add a penalty to the Mahalanobis distance. The penalty is $$1000 \times \max(0, \vert \hat{e}(\mathbf{x}_t) - \hat{e}(\mathbf{x}_c) \vert - w)$$

The penalty is 0 if the L1 distance is within the caliper width. Otherwise, a positive penalty is added to the distance in the masked white cells above.

Add penalties

In [12]:
# Penalize pairs whose propensity score distance exceeds the caliper width
maha_dist[mask] += 1000 * (l1dist - caliper_width)[mask]

Compute matches

The optimal pair matches are shown below. Notice that even though control 14 was the best match for welder 3 when considered in isolation, the best match now is control 13 as our objective is to minimize the total distance between matched pairs.
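
A sketch of the assignment step, assuming the penalized maha_dist from above:

from scipy.optimize import linear_sum_assignment

# Assign each welder (row) to a distinct control (column),
# minimizing the total penalized distance over all pairs
row_ind, col_ind = linear_sum_assignment(maha_dist)
pairs = list(zip(treated.index[row_ind], control.index[col_ind]))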

In [13]:

We can now compare the covariates of the matched pairs side by side.
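
One way to build such a table (a sketch, assuming row_ind/col_ind from the previous step; the t_/c_ prefixes and the c_idx column mirror the output below):

cols = ['age', 'race_a', 'smoker_y', 'pscore', 'dpc']
t = treated.iloc[row_ind][cols].reset_index(drop=True).add_prefix('t_')
c = control.iloc[col_ind][cols].reset_index().add_prefix('c_')
c = c.rename(columns={'c_subject': 'c_idx'})
pair_match_df = pd.concat([t, c], axis=1)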

In [14]:
Out[14]:
    t_age  t_race_a  t_smoker_y  t_pscore  t_dpc  c_idx  c_age  c_race_a  c_smoker_y  c_pscore  c_dpc
0      38         0           0      0.46   1.77     18     44         0           0      0.34   0.42
1      44         0           0      0.34   1.02      7     47         0           0      0.28   2.20
2      39         0           1      0.57   1.44     21     39         0           1      0.57   0.62
3      33         1           1      0.51   0.65     13     41         1           1      0.35   1.49
4      35         0           1      0.65   2.08     20     34         0           1      0.67   0.97
5      39         0           1      0.57   0.61      9     34         0           0      0.54   1.55
6      27         0           0      0.68   2.86     24     30         0           0      0.63   0.95
7      43         0           1      0.49   4.19     22     45         0           0      0.32   1.02
8      39         0           1      0.57   4.88     19     35         0           0      0.52   2.33
9      43         1           0      0.20   1.08      0     48         1           0      0.14   1.08
10     41         0           1      0.53   2.03      2     44         0           1      0.47   1.10
11     36         0           0      0.50   2.81     12     41         0           0      0.40   1.66
12     35         0           0      0.52   0.94      3     40         0           0      0.42   1.10
13     37         0           0      0.48   1.43     23     42         0           0      0.38   1.78
14     39         0           1      0.57   1.25     14     31         1           1      0.55   1.36
15     34         0           0      0.54   2.97      8     38         0           0      0.46   0.88
16     35         0           1      0.65   1.01     25     35         0           1      0.65   1.59
17     53         0           0      0.19   2.07      5     52         0           0      0.20   1.11
18     38         0           1      0.60   1.15     11     36         0           1      0.64   1.04
19     37         0           0      0.48   1.07     10     42         0           0      0.38   0.55
20     38         0           1      0.60   1.63     17     36         0           1      0.64   0.65

Compare groups pre and post matching

  1. Age: matched controls and welders are about 1.5 years apart vs. about 4.5 years before
  2. Race: 4% difference vs. 9% difference before
  3. Smoker: 14% difference vs. 17% difference before
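
A sketch of this comparison over the matched pairs; the column selection mirrors the output below:

cols = ['t_age', 'c_age', 't_race_a', 'c_race_a',
        't_smoker_y', 'c_smoker_y', 't_pscore', 'c_pscore']
pair_match_df[cols].mean().round(2).to_frame('mean')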
In [15]:
Out[15]:
             mean
t_age       38.24
c_age       39.71
t_race_a     0.10
c_race_a     0.14
t_smoker_y   0.52
c_smoker_y   0.38
t_pscore     0.51
c_pscore     0.45

Pre-match means

In [16]:
# Pre-match covariate means by group (same summary as Out[5] above)
df.groupby('welder').agg(dict(zip(covariates, ['mean']*3))).round(2)
Out[16]:
          age  race_a  smoker_y
welder
N       42.69    0.19      0.35
Y       38.24    0.10      0.52

The mean difference in dpc between matched welders and controls is 0.64.

In [17]:
np.round((pair_match_df.t_dpc - pair_match_df.c_dpc).mean(),2)
Out[17]:
0.64