14  Overview of doubly robust estimators: Example with the ATE

Recall our motivation for doing mediation analysis: we would like to decompose the total effect of a treatment \(A\) on an outcome \(Y\) into two distinct components, an indirect effect that operates through mediator(s) \(M\) and a direct effect that operates independently of \(M\).

Recall that we define the average treatment effect (ATE) as \(\E(Y_1-Y_0)\); since \(Y_a = Y_{a,M_a}\) (by composition), the ATE decomposes as follows:

\[ \E[Y_{1,M_1} - Y_{0,M_0}] = \underbrace{\E[Y_{\color{red}{1},\color{blue}{M_1}} - Y_{\color{red}{1},\color{blue}{M_0}}]}_{\text{natural indirect effect}} + \underbrace{\E[Y_{\color{blue}{1},\color{red}{M_0}} - Y_{\color{blue}{0},\color{red}{M_0}}]}_{\text{natural direct effect}} \]

To introduce some of the ideas that we will use for estimation of the NDE, let us first briefly discuss estimation of \(\E(Y_1)\) (estimation of \(\E(Y_0)\) can be performed analogously).

First, notice that under the assumption of no unmeasured confounders (\(Y_1\indep A\mid W\)), we have

\[ \begin{aligned} \E(Y_1) &= \E[ \E(Y_1 \mid W) ] \\ &= \E[ \E(Y_1 \mid A=1, W) ] \\ &= \E[ \E(Y \mid A=1, W) ] \ \ , \end{aligned} \]

where the first step uses the law of iterated expectation (conditioning on \(W\) and then marginalizing over it), the second step uses no unmeasured confounding, i.e., exchangeability \((Y_a \indep A \mid W)\), and the last step applies consistency (\(Y_a = Y\) when \(A=a\)).
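Before turning to estimators, here is a minimal Monte Carlo check of this identity in Python; the data-generating process (a single standard-normal confounder \(W\), a logistic treatment model, and a linear outcome) is an illustrative assumption, not part of the original development. Because the simulation draws the counterfactual \(Y_1\) directly and the true conditional mean \(\E(Y \mid A=1, W)\) is known, both sides of the identity can be compared.

```python
# A minimal sketch, assuming a simple simulated data-generating process
# in which no unmeasured confounding holds by construction.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

W = rng.normal(size=n)                         # confounder
A = rng.binomial(1, 1 / (1 + np.exp(-W)))      # treatment depends on W only
eps = rng.normal(size=n)
Y1 = 2 + W + eps                               # counterfactual outcome under A = 1
Y0 = 1 + W + eps                               # counterfactual outcome under A = 0
Y = np.where(A == 1, Y1, Y0)                   # consistency: Y = Y_A

# Left-hand side: E(Y_1), observable only because this is a simulation.
lhs = Y1.mean()

# Right-hand side: E[E(Y | A = 1, W)], using the true conditional mean
# E(Y | A = 1, W) = 2 + W, averaged over the marginal distribution of W.
rhs = (2 + W).mean()

print(lhs, rhs)  # both approximately 2, up to Monte Carlo error
```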

14.1 Plug-in (G-computation) estimator

The first estimator of \(\E[ \E(Y \mid A=1, W) ]\) can be obtained by a three-step procedure:

  1. Fit a regression for \(Y\) on \(A\) and \(W\), then
  2. use the above regression to predict the outcome mean if everyone’s \(A\) is set to \(A=1\), and then
  3. average these predictions.

The resultant estimator can be expressed as

\[ \frac{1}{n} \sum_{i=1}^n \hat{\E}(Y \mid A_i=1, W_i) \ . \]

  • Note that this plug-in estimator directly uses the above identification formula (called a g-formula, arrived at via g-computation): \(\E[\E(Y \mid A=1, W)]\).
  • This estimator requires that the (outcome) regression model for \(\hat{\E}(Y \mid A_i=1, W_i)\) is correctly specified.
  • Downside: If we use arbitrary machine learning for this model, general theory for computing standard errors and confidence intervals (i.e., statistical inference) is not available.
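Below is a minimal sketch of the three-step procedure in Python, using scikit-learn for the outcome regression; the simulated data and the linear outcome model are illustrative assumptions.

```python
# A minimal sketch of the plug-in (g-computation) estimator,
# assuming the simulated (W, A, Y) data-generating process used above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))
Y = 1 + A + W + rng.normal(size=n)

# Step 1: fit a regression of Y on A and W.
outcome_model = LinearRegression().fit(np.column_stack([A, W]), Y)

# Step 2: predict with everyone's treatment set to A = 1.
preds = outcome_model.predict(np.column_stack([np.ones(n), W]))

# Step 3: average the predictions to obtain the plug-in estimate.
print(preds.mean())  # approximately E(Y_1) = 2
```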

14.2 Inverse probability weighted (IPW) estimator

An alternative method of estimation can be constructed after noticing the following equivalence, which holds under positivity (\(\P(A=1 \mid W) > 0\)):

\[ \E[ \E(Y \mid A=1, W) ] = \E\left[ \frac{A}{\P(A=1\mid W)} Y \right] \ , \]

Estimation may then be carried out by way of the following procedure:

  1. Fit a regression of \(A\) on \(W\), then
  2. use the above regression to predict each unit's probability of treatment, \(\hat{\P}(A_i=1 \mid W_i)\), then
  3. compute the inverse probability weights \(A_i / \hat{\P}(A_i =1 \mid W_i)\). This weight will be zero for untreated units, and the inverse of the probability of treatment for treated units.
  4. Finally, compute the weighted average of the outcome:

\[ \frac{1}{n} \sum_{i=1}^n \frac{A_i}{\hat{\P}(A_i=1 \mid W_i)} Y_i \ . \]

  • This estimator requires that the (treatment) regression model for \(\hat{\P}(A=1 \mid W)\) is correctly specified.
  • Downside: If we use arbitrary machine learning for this model, general theory for computing standard errors and confidence intervals (i.e., statistical inference) is not available.
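A minimal sketch of this procedure in Python follows; the simulated data and the logistic treatment model are illustrative assumptions.

```python
# A minimal sketch of the IPW estimator, assuming the same simulated
# (W, A, Y) data-generating process as in the plug-in sketch above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))
Y = 1 + A + W + rng.normal(size=n)

# Steps 1-2: fit a regression of A on W and predict P(A = 1 | W).
ps_model = LogisticRegression().fit(W.reshape(-1, 1), A)
pscore = ps_model.predict_proba(W.reshape(-1, 1))[:, 1]

# Step 3: inverse probability weights (zero for untreated units).
weights = A / pscore

# Step 4: weighted average of the outcome.
print((weights * Y).mean())  # approximately E(Y_1) = 2
```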

14.3 Augmented inverse probability weighted (AIPW) estimator

Fortunately, we can combine these two estimators to obtain one with better properties than either alone.

The improved estimator can be seen both as a corrected (or augmented) IPW estimator:

\[ \underbrace{\frac{1}{n} \sum_{i=1}^n \frac{A_i}{\hat{\P}(A_i=1 \mid W_i)} Y_i}_{\text{IPW estimator}} - \underbrace{\frac{1}{n} \sum_{i=1}^n \frac{\hat{\E}(Y \mid A_i=1, W_i)} {\hat{\P}(A_i=1 \mid W_i)}[A_i - \hat{\P}(A_i=1 \mid W_i)]}_{\text{Correction term}} \ , \]

or

\[ \underbrace{\frac{1}{n} \sum_{i=1}^n \hat{\E}(Y \mid A_i=1, W_i)}_{\text{G-comp estimator}} + \underbrace{\frac{1}{n} \sum_{i=1}^n \frac{A_i}{\hat{\P}(A_i=1\mid W_i)} [Y_i - \hat{\E}(Y \mid A_i=1, W_i)]}_{\text{Correction term}} \ . \]

This estimator has some desirable properties:

  • It is consistent if at least one of the two models (outcome or treatment) is correctly specified; that is, it is robust to misspecification of one of them. (Can you see why?) This is the double robustness that gives these estimators their name.
  • Its distribution is approximately normal as the sample size grows. This allows us to easily compute confidence intervals and perform hypothesis tests.
  • It allows us to use machine learning to estimate the treatment and outcome regressions to alleviate model misspecification bias.
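To make these properties concrete, here is a minimal sketch of the AIPW estimator in Python, combining the two nuisance regressions from the sketches above; the simulated data and the parametric nuisance models are illustrative assumptions. The standard error shown at the end is a sketch that leans on the approximate normality noted above, treating the averaged scores as the basis for a Wald-style interval.

```python
# A minimal sketch of the AIPW estimator, assuming the same simulated
# (W, A, Y) data-generating process as in the earlier sketches.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
W = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-W)))
Y = 1 + A + W + rng.normal(size=n)

# Nuisance 1: outcome regression E(Y | A, W), evaluated at A = 1.
outcome_model = LinearRegression().fit(np.column_stack([A, W]), Y)
m1 = outcome_model.predict(np.column_stack([np.ones(n), W]))

# Nuisance 2: treatment regression (propensity score) P(A = 1 | W).
ps_model = LogisticRegression().fit(W.reshape(-1, 1), A)
g1 = ps_model.predict_proba(W.reshape(-1, 1))[:, 1]

# AIPW: g-computation term plus the IPW-style correction term.
scores = m1 + (A / g1) * (Y - m1)
aipw_est = scores.mean()

# Approximate normality suggests a simple standard error and a
# Wald-style 95% confidence interval based on the scores.
se = scores.std(ddof=1) / np.sqrt(n)
print(aipw_est, (aipw_est - 1.96 * se, aipw_est + 1.96 * se))
```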

Next, we will work towards constructing estimators with these same properties for the mediation parameters that we have introduced.