6  Estimation of effects using causal mediation analysis

#| include: false
library(here)
library(DiagrammeR)
library(tidyverse)
Learning outcomes
  • Explain the theoretical concepts and statistical methods regarding causal mediation analysis
  • Perform causal mediation analysis using R software
  • Interpret results from causal mediation analysis

6.1 Recap from yesterday (part 1)

  • causal inference in general
  • causal mediation analysis
  • (many) different estimands

DAG under the assumption of no mediator-outcome confounder affected by treatment:

Reflection on yesterday

Take 2 minutes to reflect on what your two main learning points from yesterday were.

Then, we will take 3 minutes for you to talk to your neighbor about these.

Lastly, we will take 5 minutes to share in plenum.

6.2 Regression-based approach

The traditional approach runs into problems when there is exposure-mediator interaction or when relationships are non-linear.

Counterfactual-based direct and indirect effects can be estimated, provided that the no-confounding assumptions hold. To do so, we need a model for the mediator and a model for the outcome.

Model for the mediator:

\(E(M|A = a, W = w) = \beta_0 + \beta_1a + \beta'_2w\)

Model for the outcome:

\(E(Y|A = a, M = m, W = w) = \sigma_0 + \sigma_1a + \sigma_2m + \sigma_3am + \sigma'_4w\)

From these two regression models we can estimate the CDE, NDE and NIE.

To try this out in practice, we load the NHANES dataset, only keep the variables we need for the example, remove missing data and scale the intake of red meat to be per 65 g/day (i.e., 1-unit higher = 65 g/day).

Note that for your own data, you may need to handle missing data in another way.

The example here includes a continuous treatment, mediator and outcome. We use equations on the additive scale because it is easier to explain the intuition behind the estimands. Similar equations exist for other types and combinations of variables.

nhanes <- read_csv(here::here("data/nhanes_dataset.csv"))
# only keeping the variables we need for the example and removing missing variables
nhanes <- nhanes %>%
  select(
    id = seqn, 
    w1 = age, 
    w2 = gender, 
    w3 = education_clean, 
    w4 = smoke,
    a = total_redmeat, # this is the exposure
    m = magic_biomarker, # this is the mediator
    y = blood_glucose # this is the outcome
  ) %>% 
  na.omit()
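The text above mentions scaling red meat to be per 65 g/day. Here we assume the exposure a already comes on that scale in the prepared dataset; if your copy of total_redmeat is still in raw g/day, a minimal rescaling sketch would be:

# Hypothetical rescaling step, only needed if total_redmeat is still in raw g/day;
# afterwards a 1-unit difference in a corresponds to 65 g/day
nhanes <- nhanes %>%
  mutate(a = a / 65)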

We can now fit a model for the mediator, adjusting for only a single variable. In practice you would adjust for many more variables to satisfy the no-confounding assumptions.

lm_m <- lm(m ~ a + w1, data = nhanes)
summary(lm_m)
#> 
#> Call:
#> lm(formula = m ~ a + w1, data = nhanes)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.2706 -0.6733  0.0109  0.6650  4.0061 
#> 
#> Coefficients:
#>               Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) -1.946e+01  1.309e-01 -148.697   <2e-16 ***
#> a            2.591e+01  1.429e-01  181.280   <2e-16 ***
#> w1          -4.214e-04  5.687e-04   -0.741    0.459    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.002 on 8566 degrees of freedom
#> Multiple R-squared:  0.7995, Adjusted R-squared:  0.7995 
#> F-statistic: 1.708e+04 on 2 and 8566 DF,  p-value: < 2.2e-16

Here, we see that higher red meat intake (a) is associated with higher levels of the inflammatory biomarker (m).

Now, we run the model for the outcome, including an interaction.

lm_y <- lm(y ~ a + m + a:m + w1, data = nhanes)

summary(lm_y)
#> 
#> Call:
#> lm(formula = y ~ a + m + a:m + w1, data = nhanes)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -6.9317 -1.3422  0.0016  1.3923  8.1954 
#> 
#> Coefficients:
#>              Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 17.765686   0.603136   29.456   <2e-16 ***
#> a           52.286147   0.748403   69.864   <2e-16 ***
#> m            0.642066   0.051865   12.379   <2e-16 ***
#> w1           1.499853   0.001142 1313.423   <2e-16 ***
#> a:m          0.055135   0.054827    1.006    0.315    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.007 on 8564 degrees of freedom
#> Multiple R-squared:  0.9951, Adjusted R-squared:  0.9951 
#> F-statistic: 4.335e+05 on 4 and 8564 DF,  p-value: < 2.2e-16

In this model we can see:

  • higher red meat intake (a) increases blood glucose levels, independent of the magical biomarker
  • there is no statistically significant interaction between red meat (a) and the magical mediator (m). We keep it in the model anyway, because even a non-significant interaction can still contribute to the outcome, and detecting an interaction typically requires a lot of statistical power
  • the magical mediator (m) is associated with higher blood glucose levels, independent of red meat intake

Based on the coefficients of these two models, we can now estimate the different effects.

The equations are from (1) and (2).

6.2.1 Controlled direct effect

The controlled direct effect can be obtained using this formula based on the regression coefficients:

\(CDE(m) = (\sigma_1 + \sigma_3*m)(a - a^*)\)

The contrast is a change from level \(a^*\) (control) to level \(a\) (intervention).

For the controlled direct effect we set m to a specific value.

Let’s look at the distribution of m.

nhanes %>%
  select(m) %>%
  summary()
#>        m         
#>  Min.   :-3.624  
#>  1st Qu.: 1.336  
#>  Median : 2.469  
#>  Mean   : 2.813  
#>  3rd Qu.: 3.837  
#>  Max.   :21.958
median_m <- round(median(nhanes$m), digits = 1)

In this example we set it to the median of the inflammatory biomarker, 2.5, which is a low level.

CDE_m <- (lm_y$coefficients[2] + lm_y$coefficients[5] * median_m) * (1 - 0)

round(as.numeric(CDE_m), digits = 2)
#> [1] 52.42

The CDE is how much the outcome, here blood glucose, would change on average if the mediator, here inflammation measured by CRP levels, were fixed at level (m = 2.5) uniformly in the population but the treatment, red meat intake, was changed from 0 to 65 g/day. As we operate with a linear model for simplicity, it does not matter here if the change is from 0 to 65 or from 50 to 115 g/day.
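As a small check of this point, we can evaluate the same formula for a 50 to 115 g/day contrast (also a 1-unit change on the per-65 g/day scale assumed above) and confirm that it gives the same value as CDE_m:

# A 50 -> 115 g/day change is also a 1-unit change on the per-65 g/day scale,
# so the CDE is identical to the 0 -> 65 g/day contrast above
CDE_m_alt <- (lm_y$coefficients[2] + lm_y$coefficients[5] * median_m) * (115 / 65 - 50 / 65)
round(as.numeric(CDE_m_alt), digits = 2)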

The CDE answers the question: what would be the effect of A on Y if M were fixed at a specific value for everyone in the population (3)?

6.2.2 (Pure) natural direct effect

PNDE can be obtained using this formula:

\(PNDE = (\sigma_1 + \sigma_3 * (\beta_0 + \beta_1*a^* + \beta'_2*w))(a - a^*)\)

We set the value of the confounder w1, which enters the formula through the mediator model, to its mean.

mean_w1 <- mean(nhanes$w1)

Then we calculate the PNDE.

PNDE <- (lm_y$coefficients[2] + (lm_y$coefficients[5] * (lm_m$coefficients[1] + lm_m$coefficients[2] * 0 + lm_m$coefficients[3] * mean_w1))) * (1 - 0)

round(as.numeric(PNDE), digits = 2)
#> [1] 51.21

The PNDE is how much the outcome, blood glucose, would change if the treatment a, red meat, was set at 65 versus 0 but for each individual the mediator was kept at the level it would have taken, for that individual, in the absence of the exposure.

In other words, to what extent does red meat cause higher blood glucose via pathways other than through inflammation (3).

6.2.3 (Total) natural direct effect

TNDE can be obtained by this formula:

\(TNDE = (\sigma_1 + \sigma_3 * (\beta_0 + \beta_1*a + \beta'_2*w))(a - a^*)\)

TNDE <- (lm_y$coefficients[2] + (lm_y$coefficients[5] * (lm_m$coefficients[1] + lm_m$coefficients[2] * 1 + lm_m$coefficients[3] * mean_w1))) * (1 - 0)

round(as.numeric(TNDE), digits = 2)
#> [1] 52.64

Here, the mediator is allowed to take the value it would have under the exposure (as opposed to the PNDE, where it is kept at the value it would have without the exposure). We see a slightly higher estimate, and the difference is the extra contribution from the interaction with inflammation.

The TNDE asks the question: “to what extent does red meat cause higher blood glucose via pathways other than through inflammation, allowing inflammation to boost up or tune down such effect at the same time?” (3)

6.2.4 Total natural indirect effect

The TNIE can be obtained by this formula:

\(TNIE = (\sigma_2 * \beta_1 + \sigma_3 * \beta_1 * a)(a - a^*)\)

TNIE <- ((lm_y$coefficients[3] * lm_m$coefficients[2]) + (lm_y$coefficients[5] * lm_m$coefficients[2] * 1)) * (1 - 0)

round(as.numeric(TNIE), digits = 2)
#> [1] 18.06

The TNIE is how much the outcome would change on average if the treatment was fixed at level a = 65 g/day but the mediator was changed from the level it would take if a* = 0 to the level it would take if a = 65 g/day.

Note that the exposure has to have an effect on M; otherwise the TNIE will be zero.

The TNIE asks the question: to what extent does red meat cause higher blood glucose via inflammation (due to red meat affecting inflammation and subsequently, inflammation affecting blood glucose) and the possible interaction between red meat and inflammation in affecting blood glucose levels? In other words, the effect of exposure that ‘would be prevented if the exposure did not cause the mediator’ (i.e., the portion of the effect for which mediation is ‘necessary’) (3).

In biomedical research, this is often the effect we are interested in when asking questions about mediation.

6.2.5 Pure natural indirect effect

The PNIE can be obtained by this formula:

\(PNIE = (\sigma_2 * \beta_1 + \sigma_3 * \beta_1 * a^*)(a - a^*)\)

PNIE <- ((lm_y$coefficients[3] * lm_m$coefficients[2]) + (lm_y$coefficients[5] * lm_m$coefficients[2] * 0)) * (1 - 0)

round(as.numeric(PNIE), digits = 2)
#> [1] 16.64

The PNIE is different from the TNIE because it does not include the interaction effect. We estimate the effect of red meat on inflammation and then the subsequent effect of inflammation on blood glucose.

The PNIE answers the question: "to what extent does red meat cause higher blood glucose via inflammation only (i.e., due to red meat affecting inflammation and, subsequently, inflammation affecting blood glucose), not accounting for the possible interaction between red meat and inflammation? In other words, the effect that the exposure would have had if 'its only action were to cause the mediator' (i.e., the portion of the effect for which mediation is 'sufficient')" (3).

6.2.6 Total effect

The total effect can be decomposed as:

\(TE = PNDE + TNIE\)

TE <- PNDE + TNIE

round(as.numeric(TE), digits = 2)
#> [1] 69.28

This is the overall effect of red meat on blood glucose levels. Higher red meat intake is associated with higher blood glucose.

6.2.7 Proportion mediation

From this, we can calculate the proportion mediated.

\(PM = \frac{TNIE}{TE}\)

PM <- TNIE / TE

as.numeric(PM) * 100
#> [1] 26.07476

26% of the effect of red meat on blood glucose is mediated by M, inflammation.

6.2.8 R package example for causal mediation analysis: the CMAverse package

All of this can be done using an R package. Here we use the CMAverse R package (4). With this package we can use several different approaches for the estimation and obtain confidence intervals.

First, we load the package and then we set up the model object. We have to specify:

  • model: the type of model, or approach, for the causal mediation analysis. We used the regression-based approach above, so we will also do that here. However, one can also use the weighting-based approach (marginal structural models) or the g-formula.
  • outcome: your outcome variable
  • exposure: your exposure/intervention/treatment
  • mediator: your mediator
  • mreg: the type of model for the mediator. It is a list because you can have multiple mediators
  • yreg: type of regression for the outcome. E.g., linear, logistic, cox.
  • astar: Control/comparison intervention (value from)
  • a: Intervention (value to)
  • mval: the value for the mediator when estimating the CDE
  • EMint: indicator of whether to include an exposure-mediator interaction in the model.
  • estimation: method for estimating causal effects. Can use parametric functions or imputation.
  • inference: how to calculate confidence intervals. For the regression-based approach we use the delta method; for the other approaches we use the bootstrap.

In short, all the things we did above can be estimated like this:

library(CMAverse)

res_rb <- cmest(
  data = nhanes, model = "rb", outcome = "y", exposure = "a",
  mediator = "m", basec = c("w1"), EMint = TRUE,
  mreg = list("linear"), yreg = "linear",
  astar = 0, a = 1, mval = list(2.5),
  estimation = "paramfunc", inference = "delta"
)

summary(res_rb)
#> Causal Mediation Analysis
#> 
#> # Outcome regression:
#> 
#> Call:
#> glm(formula = y ~ a + m + a * m + w1, family = gaussian(), data = getCall(x$reg.output$yreg)$data, 
#>     weights = getCall(x$reg.output$yreg)$weights)
#> 
#> Coefficients:
#>              Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) 17.765686   0.603136   29.456   <2e-16 ***
#> a           52.286147   0.748403   69.864   <2e-16 ***
#> m            0.642066   0.051865   12.379   <2e-16 ***
#> w1           1.499853   0.001142 1313.423   <2e-16 ***
#> a:m          0.055135   0.054827    1.006    0.315    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 4.027311)
#> 
#>     Null deviance: 7017320  on 8568  degrees of freedom
#> Residual deviance:   34490  on 8564  degrees of freedom
#> AIC: 36262
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> 
#> # Mediator regressions: 
#> 
#> Call:
#> glm(formula = m ~ a + w1, family = gaussian(), data = getCall(x$reg.output$mreg[[1L]])$data, 
#>     weights = getCall(x$reg.output$mreg[[1L]])$weights)
#> 
#> Coefficients:
#>               Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept) -1.946e+01  1.309e-01 -148.697   <2e-16 ***
#> a            2.591e+01  1.429e-01  181.280   <2e-16 ***
#> w1          -4.214e-04  5.687e-04   -0.741    0.459    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1.003485)
#> 
#>     Null deviance: 42877.0  on 8568  degrees of freedom
#> Residual deviance:  8595.9  on 8566  degrees of freedom
#> AIC: 24353
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> 
#> # Effect decomposition on the mean difference scale via the regression-based approach
#>  
#> Closed-form parameter function estimation with 
#>  delta method standard errors, confidence intervals and p-values 
#>  
#>               Estimate Std.error   95% CIL 95% CIU  P.val    
#> cde          52.423985  0.684107 51.083161  53.765 <2e-16 ***
#> pnde         51.212266  1.601394 48.073591  54.351 <2e-16 ***
#> tnde         52.640738  0.631797 51.402438  53.879 <2e-16 ***
#> pnie         16.635018  1.346886 13.995170  19.275 <2e-16 ***
#> tnie         18.063490  0.603457 16.880736  19.246 <2e-16 ***
#> te           69.275756  1.308153 66.711823  71.840 <2e-16 ***
#> intref       -1.211719  1.204962 -3.573401   1.150  0.315    
#> intmed        1.428472  1.420505 -1.355666   4.213  0.315    
#> cde(prop)     0.756744  0.012980  0.731303   0.782 <2e-16 ***
#> intref(prop) -0.017491  0.017715 -0.052212   0.017  0.323    
#> intmed(prop)  0.020620  0.020884 -0.020312   0.062  0.323    
#> pnie(prop)    0.240128  0.015561  0.209628   0.271 <2e-16 ***
#> pm            0.260748  0.011256  0.238687   0.283 <2e-16 ***
#> int           0.003129  0.003169 -0.003082   0.009  0.323    
#> pe            0.243256  0.012980  0.217816   0.269 <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (cde: controlled direct effect; pnde: pure natural direct effect; tnde: total natural direct effect; pnie: pure natural indirect effect; tnie: total natural indirect effect; te: total effect; intref: reference interaction; intmed: mediated interaction; cde(prop): proportion cde; intref(prop): proportion intref; intmed(prop): proportion intmed; pnie(prop): proportion pnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
#> 
#> Relevant variable values: 
#> $a
#> [1] 1
#> 
#> $astar
#> [1] 0
#> 
#> $mval
#> $mval[[1]]
#> [1] 2.5
#> 
#> 
#> $basecval
#> $basecval[[1]]
#> [1] 45.5381

We can check that the fitted models are the ones we wanted to specify, and we can see that the results match our calculations above.
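As a quick sanity check, we can collect the estimates we calculated by hand and compare them with the cde, pnde, tnde, pnie, tnie and te rows of the summary above (a small sketch using only objects created earlier):

# Hand-calculated estimates, to be compared with the cmest() summary table
tibble(
  effect  = c("CDE(2.5)", "PNDE", "TNDE", "PNIE", "TNIE", "TE"),
  by_hand = round(as.numeric(c(CDE_m, PNDE, TNDE, PNIE, TNIE, TE)), 2)
)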

We can now try to adjust for multiple other confounders.

res_rb_confounders <- cmest(
  data = nhanes, model = "rb", outcome = "y", exposure = "a",
  mediator = "m", basec = c("w1", "w2", "w3"), EMint = TRUE,
  mreg = list("linear"), yreg = "linear",
  astar = 0, a = 1, mval = list(2.5),
  estimation = "paramfunc", inference = "delta"
)

summary(res_rb_confounders)
#> Causal Mediation Analysis
#> 
#> # Outcome regression:
#> 
#> Call:
#> glm(formula = y ~ a + m + a * m + w1 + w2 + w3, family = gaussian(), 
#>     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
#> 
#> Coefficients:
#>                      Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept)         17.793668   0.609882   29.176   <2e-16 ***
#> a                   52.238325   0.758400   68.880   <2e-16 ***
#> m                    0.639863   0.052164   12.266   <2e-16 ***
#> w1                   1.499841   0.001149 1305.718   <2e-16 ***
#> w2Male               0.019112   0.045521    0.420    0.675    
#> w3College and above  0.015804   0.048412    0.326    0.744    
#> a:m                  0.057466   0.055168    1.042    0.298    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 4.028118)
#> 
#>     Null deviance: 7017320  on 8568  degrees of freedom
#> Residual deviance:   34489  on 8562  degrees of freedom
#> AIC: 36266
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> 
#> # Mediator regressions: 
#> 
#> Call:
#> glm(formula = m ~ a + w1 + w2 + w3, family = gaussian(), data = getCall(x$reg.output$mreg[[1L]])$data, 
#>     weights = getCall(x$reg.output$mreg[[1L]])$weights)
#> 
#> Coefficients:
#>                       Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept)         -1.943e+01  1.339e-01 -145.052   <2e-16 ***
#> a                    2.585e+01  1.491e-01  173.390   <2e-16 ***
#> w1                  -4.449e-04  5.716e-04   -0.778    0.436    
#> w2Male               3.090e-02  2.258e-02    1.369    0.171    
#> w3College and above  2.017e-02  2.416e-02    0.835    0.404    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 1.003419)
#> 
#>     Null deviance: 42877.0  on 8568  degrees of freedom
#> Residual deviance:  8593.3  on 8564  degrees of freedom
#> AIC: 24354
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> 
#> # Effect decomposition on the mean difference scale via the regression-based approach
#>  
#> Closed-form parameter function estimation with 
#>  delta method standard errors, confidence intervals and p-values 
#>  
#>               Estimate Std.error   95% CIL 95% CIU  P.val    
#> cde          52.381990  0.692505 51.024705  53.739 <2e-16 ***
#> pnde         51.121905  1.617002 47.952640  54.291 <2e-16 ***
#> tnde         52.607442  0.637358 51.358243  53.857 <2e-16 ***
#> pnie         16.540891  1.351851 13.891312  19.190 <2e-16 ***
#> tnie         18.026428  0.603132 16.844311  19.209 <2e-16 ***
#> te           69.148334  1.325762 66.549888  71.747 <2e-16 ***
#> intref       -1.260085  1.209714 -3.631081   1.111  0.298    
#> intmed        1.485537  1.426153 -1.309671   4.281  0.298    
#> cde(prop)     0.757531  0.013025  0.732002   0.783 <2e-16 ***
#> intref(prop) -0.018223  0.017834 -0.053177   0.017  0.307    
#> intmed(prop)  0.021483  0.021025 -0.019724   0.063  0.307    
#> pnie(prop)    0.239209  0.015627  0.208581   0.270 <2e-16 ***
#> pm            0.260692  0.011312  0.238522   0.283 <2e-16 ***
#> int           0.003260  0.003191 -0.002993   0.010  0.307    
#> pe            0.242469  0.013025  0.216941   0.268 <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (cde: controlled direct effect; pnde: pure natural direct effect; tnde: total natural direct effect; pnie: pure natural indirect effect; tnie: total natural indirect effect; te: total effect; intref: reference interaction; intmed: mediated interaction; cde(prop): proportion cde; intref(prop): proportion intref; intmed(prop): proportion intmed; pnie(prop): proportion pnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
#> 
#> Relevant variable values: 
#> $a
#> [1] 1
#> 
#> $astar
#> [1] 0
#> 
#> $mval
#> $mval[[1]]
#> [1] 2.5
#> 
#> 
#> $basecval
#> $basecval[[1]]
#> [1] 45.5381
#> 
#> $basecval[[2]]
#> [1] 0.4969075
#> 
#> $basecval[[3]]
#> [1] 0.2800794

After adjustment for age, sex and education, the associations remain statistically significant.

6.3 G-computation-based approach

Recall the no-confounding assumptions of causal mediation analysis:
  1. no exposure-mediator confounding
  2. no mediator-outcome confounding
  3. no exposure-outcome confounding
  4. no exposure-induced mediator-outcome confounding (or intermediate confounding)

Note: Assumption 4 is also called post-treatment confounding or, in randomized trials, treatment-induced mediator-outcome confounding.

Assumptions 1 and 3 can be satisfied by design in randomized trials. Assumptions 2 and 4 may be problematic, particularly in observational studies.

6.3.1 Intermediate confounding

As shown in the DAG below, L is influenced by the exposure A and also serves as a confounder of M and Y. When we consider the effect of A on Y via a specific mediator M, the presence of L is challenging, and regression-based methods can no longer handle this situation.

Conditioning on L blocks the backdoor path M ← L → Y between the mediator and the outcome, but it also blocks the path A → L → Y, which represents part of the direct effect of A on Y (all the paths not going through M). Furthermore, if there is an unmeasured confounder U of L and Y, adjusting for L will induce a noncausal association among A, U and Y.
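A minimal sketch of this DAG using DiagrammeR (loaded in the setup chunk); A is the exposure, M the mediator, Y the outcome, L the intermediate confounder and U an unmeasured confounder of L and Y:

# Sketch of the intermediate-confounding DAG: L is affected by A and
# confounds the M-Y relation; U is an unmeasured confounder of L and Y
DiagrammeR::grViz("
digraph intermediate_confounding {
  rankdir = LR
  node [shape = plaintext]
  A -> L
  A -> M
  A -> Y
  L -> M
  L -> Y
  M -> Y
  U -> L
  U -> Y
}
")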

6.3.2 Time-varying confounding

6.3.3 G-computation

Because of these limitations, we need to resort to a more general framework for mediation analysis, utilizing causal diagrams and the potential outcomes language.

The g-computation algorithm was first introduced by Robins in 1986 (5) for estimating the causal effects of time-varying exposures in the presence of time-varying confounders. When estimating a total effect, g-computation is generally equivalent to inverse-probability-of-treatment weighting (IPTW), but in high-dimensional settings g-computation is more powerful. G-computation (using the g-formula) also provides an intuitive method for decomposing the total effect.

The basic idea of g-computation is to estimate the probability of the outcome under a hypothetical action (i.e., intervention).

Note

Ingredients: A, Y, and controls W

  • Step 1: Model the outcome as a function of A and W.

  • Step 2: Duplicate the initial dataset in two counterfactual datasets. In one world, set A=1; in the other, set A=0. All other variables keep the original values.

  • Step 3: Apply the function in step 1 to predict each individual’s outcome in the two counterfactual datasets, then get potential outcomes Y1 and Y0.

  • Step 4: Aggregate these potential outcomes (e.g., average across all individuals) and contrast them (e.g., by taking the difference) to arrive at an estimate.

We will now demonstrate how to conduct g-computation in a simulated dataset.

datasim <- function(n) { 
  # This small function simulates a dataset with n rows
  # containing covariates, an action, an outcome and
  # the underlying potential outcomes
  x1 <- rbinom(n, size = 1, prob = 0.5) 
  x2 <- rbinom(n, size = 1, prob = 0.65)
  x3 <- rnorm(n, 0, 1)
  x4 <- rnorm(n, 0, 1)
  # The action (independent variable, treatment, exposure...)
  # is a function of x2, x3, x4 and
  # the product of (i.e., an interaction of) x2 and x4
  A <- rbinom(n, size = 1, prob = plogis(-1.6 + 2.5*x2 + 2.2*x3 + 0.6*x4 + 0.4*x2*x4)) 
  # Simulate the two potential outcomes
  # as functions of x1, x2, x4 and the product of x2 and x4
  
  # Potential outcome if the action is 1
  # note that here, we add 1
  Y.1 <- rbinom(n, size = 1, prob = plogis(-0.7 + 1 - 0.15*x1 + 0.45*x2 + 0.20*x4 + 0.4*x2*x4)) 
  # Potential outcome if the action is 0
  # note that here, we do not add 1, so that there
  # is a difference between the two potential outcomes (i.e., an effect of A)
  Y.0 <- rbinom(n, size = 1, prob = plogis(-0.7 + 0 - 0.15*x1 + 0.45*x2 + 0.20*x4 + 0.4*x2*x4)) 
  
  # Observed outcome 
  # is the potential outcomes (Y.1 or Y.0)
  # corresponding to action the individual experienced (A)
  Y <- Y.1*A + Y.0*(1 - A) 
  
  # Return a data.frame as the output of this function
  data.frame(x1, x2, x3, x4, A, Y, Y.1, Y.0) 
} 



set.seed(120110) # for reproducibility
ObsData <- datasim(n = 500000) # really large sample
TRUE_EY.1 <- mean(ObsData$Y.1); TRUE_EY.1  # mean outcome under A = 1
#> [1] 0.619346
# Fit a logistic regression model predicting Y from A and the relevant confounders
Q <- glm(Y ~ A + x2 + x4 + x2*x4, data = ObsData, family=binomial)


# Copy the "actual" (simulated) data twice
A1Data <- A0Data <- ObsData
# In one world A equals 1 for everyone, in the other one it equals 0 for everyone
# The rest of the data stays as is (for now)
A1Data$A <- 1; A0Data$A <- 0
head(ObsData); head(A1Data); head(A0Data)
#> # A tibble: 6 × 8
#>      x1    x2     x3     x4     A     Y   Y.1   Y.0
#>   <int> <int>  <dbl>  <dbl> <int> <dbl> <int> <int>
#> 1     0     1 -0.291  0.837     1     1     1     0
#> 2     1     1 -0.160 -1.70      0     0     1     0
#> 3     1     0 -0.545  1.60      0     0     1     0
#> 4     0     1 -0.473 -0.829     0     0     1     0
#> 5     1     1 -0.848  1.17      1     1     1     1
#> 6     1     0 -1.29   1.53      0     0     1     0
#> # A tibble: 6 × 8
#>      x1    x2     x3     x4     A     Y   Y.1   Y.0
#>   <int> <int>  <dbl>  <dbl> <dbl> <dbl> <int> <int>
#> 1     0     1 -0.291  0.837     1     1     1     0
#> 2     1     1 -0.160 -1.70      1     0     1     0
#> 3     1     0 -0.545  1.60      1     0     1     0
#> 4     0     1 -0.473 -0.829     1     0     1     0
#> 5     1     1 -0.848  1.17      1     1     1     1
#> 6     1     0 -1.29   1.53      1     0     1     0
#> # A tibble: 6 × 8
#>      x1    x2     x3     x4     A     Y   Y.1   Y.0
#>   <int> <int>  <dbl>  <dbl> <dbl> <dbl> <int> <int>
#> 1     0     1 -0.291  0.837     0     1     1     0
#> 2     1     1 -0.160 -1.70      0     0     1     0
#> 3     1     0 -0.545  1.60      0     0     1     0
#> 4     0     1 -0.473 -0.829     0     0     1     0
#> 5     1     1 -0.848  1.17      0     1     1     1
#> 6     1     0 -1.29   1.53      0     0     1     0
# Predict Y in the world where everybody receives the action (A = 1)
Y_A1 <- predict(Q, A1Data, type="response")
# Predict Y in the world where nobody receives the action (A = 0)
Y_A0 <- predict(Q, A0Data, type="response")
# Taking a look at the predictions
data.frame(Y_A1=head(Y_A1), Y_A0=head(Y_A0), TRUE_Y=head(ObsData$Y)) |> round(digits = 2)
#> # A tibble: 6 × 3
#>    Y_A1  Y_A0 TRUE_Y
#>   <dbl> <dbl>  <dbl>
#> 1  0.77  0.55      1
#> 2  0.42  0.21      0
#> 3  0.63  0.39      0
#> 4  0.55  0.31      0
#> 5  0.8   0.6       1
#> 6  0.63  0.38      0
# Mean outcomes in the two worlds
pred_A1 <- mean(Y_A1); pred_A0 <- mean(Y_A0)

# Marginal odds ratio
MOR_gcomp <- (pred_A1*(1 - pred_A0))/((1 - pred_A1)*pred_A0)
# ATE (risk difference)
RD_gcomp <- pred_A1 - pred_A0
c(MOR_gcomp, RD_gcomp) |> round(digits=3)
#> [1] 2.568 0.231
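Because we simulated the potential outcomes, we can also compare the g-computation estimates with the simulation truth (a small check using only the objects created above):

# Compare g-computation estimates with the simulation truth,
# which we know here because the potential outcomes were simulated
TRUE_RD <- mean(ObsData$Y.1) - mean(ObsData$Y.0)
c(EY1_gcomp = pred_A1, EY1_true = TRUE_EY.1,
  RD_gcomp = RD_gcomp, RD_true = TRUE_RD) |> round(digits = 3)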

We recommend taking advantage of R packages to do the work (the decomposition).

set.seed(1)
expit <- function(x) exp(x) / (1 + exp(x))
n <- 10000
# Simulate a baseline confounder (w), a binary exposure (a), an
# exposure-induced mediator-outcome confounder (l), a binary mediator (m)
# and a binary outcome (y)
w <- rnorm(n, mean = 1, sd = 0.2)
a <- rbinom(n, 1, expit(0.3 + 0.5 * w))
l <- rnorm(n, mean = 1 + a + 0.5 * w, sd = 0.5)
m <- rbinom(n, 1, expit(1 + 2 * a - l + 2 * w))
y <- rbinom(n, 1, expit(0.3 * a + 0.8 * m + 0.5 * a * m - 0.3 * l + 0.4 * w))
data <- data.frame(a, m, y, w, l)

library(CMAverse)

res_gformula <- cmest(data = data, model = "gformula", outcome = "y", exposure = "a",
                 mediator = "m", basec = c("w"), postc = "l", EMint = TRUE,
                 mreg = list("logistic"), yreg = "logistic",
                 postcreg = list("linear"),
                 astar = 0, a = 1, mval = list(1),
                 # nboot = 2 keeps the example fast; use many more bootstrap samples in practice
                 estimation = "imputation", inference = "bootstrap", nboot = 2)
summary(res_gformula)
#> Causal Mediation Analysis
#> 
#> # Outcome regression:
#> 
#> Call:
#> glm(formula = y ~ a + m + a * m + w + l, family = binomial(), 
#>     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -0.004126   0.143422  -0.029 0.977052    
#> a            0.425094   0.122396   3.473 0.000514 ***
#> m            0.880876   0.092010   9.574  < 2e-16 ***
#> w            0.293455   0.115138   2.549 0.010812 *  
#> l           -0.312479   0.046346  -6.742 1.56e-11 ***
#> a:m          0.418939   0.125377   3.341 0.000833 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 12068  on 9999  degrees of freedom
#> Residual deviance: 11504  on 9994  degrees of freedom
#> AIC: 11516
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> 
#> # Mediator regressions: 
#> 
#> Call:
#> glm(formula = m ~ a + w + l, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
#>     weights = getCall(x$reg.output$mreg[[1L]])$weights)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.98466    0.16862   5.839 5.24e-09 ***
#> a            2.00490    0.09257  21.658  < 2e-16 ***
#> w            1.92755    0.15859  12.155  < 2e-16 ***
#> l           -0.98356    0.06417 -15.328  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 7578.1  on 9999  degrees of freedom
#> Residual deviance: 6981.3  on 9996  degrees of freedom
#> AIC: 6989.3
#> 
#> Number of Fisher Scoring iterations: 5
#> 
#> 
#> # Regressions for mediator-outcome confounders affected by the exposure: 
#> 
#> Call:
#> glm(formula = l ~ a + w, family = gaussian(), data = getCall(x$reg.output$postcreg[[1L]])$data, 
#>     weights = getCall(x$reg.output$postcreg[[1L]])$weights)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.01522    0.02593   39.15   <2e-16 ***
#> a            1.00188    0.01087   92.17   <2e-16 ***
#> w            0.49342    0.02466   20.01   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.2488027)
#> 
#>     Null deviance: 4744.8  on 9999  degrees of freedom
#> Residual deviance: 2487.3  on 9997  degrees of freedom
#> AIC: 14473
#> 
#> Number of Fisher Scoring iterations: 2
#> 
#> 
#> # Effect decomposition on the odds ratio scale via the g-formula approach
#>  
#> Direct counterfactual imputation estimation with 
#>  bootstrap standard errors, percentile confidence intervals and p-values 
#>  
#>                  Estimate Std.error   95% CIL 95% CIU  P.val    
#> Rcde            1.696e+00 4.949e-03 1.721e+00   1.728 <2e-16 ***
#> rRpnde          1.505e+00 1.253e-05 1.512e+00   1.512 <2e-16 ***
#> rRtnde          1.603e+00 8.186e-03 1.624e+00   1.635 <2e-16 ***
#> rRpnie          1.109e+00 8.029e-03 1.125e+00   1.136 <2e-16 ***
#> rRtnie          1.182e+00 1.479e-02 1.209e+00   1.229 <2e-16 ***
#> Rte             1.778e+00 2.234e-02 1.828e+00   1.858 <2e-16 ***
#> ERcde           1.689e-01 2.354e-03 1.743e-01   0.177 <2e-16 ***
#> rERintref       3.359e-01 2.366e-03 3.343e-01   0.337 <2e-16 ***
#> rERintmed       1.641e-01 1.432e-02 1.908e-01   0.210 <2e-16 ***
#> rERpnie         1.093e-01 8.029e-03 1.252e-01   0.136 <2e-16 ***
#> ERcde(prop)     2.170e-01 2.739e-03 2.068e-01   0.211 <2e-16 ***
#> rERintref(prop) 4.316e-01 1.338e-02 3.897e-01   0.408 <2e-16 ***
#> rERintmed(prop) 2.108e-01 1.070e-02 2.305e-01   0.245 <2e-16 ***
#> rERpnie(prop)   1.405e-01 5.421e-03 1.512e-01   0.159 <2e-16 ***
#> rpm             3.513e-01 1.612e-02 3.818e-01   0.403 <2e-16 ***
#> rint            6.425e-01 2.682e-03 6.346e-01   0.638 <2e-16 ***
#> rpe             7.830e-01 2.739e-03 7.895e-01   0.793 <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Rcde: controlled direct effect odds ratio; rRpnde: randomized analogue of pure natural direct effect odds ratio; rRtnde: randomized analogue of total natural direct effect odds ratio; rRpnie: randomized analogue of pure natural indirect effect odds ratio; rRtnie: randomized analogue of total natural indirect effect odds ratio; Rte: total effect odds ratio; ERcde: excess relative risk due to controlled direct effect; rERintref: randomized analogue of excess relative risk due to reference interaction; rERintmed: randomized analogue of excess relative risk due to mediated interaction; rERpnie: randomized analogue of excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; rERintref(prop): proportion rERintref; rERintmed(prop): proportion rERintmed; rERpnie(prop): proportion rERpnie; rpm: randomized analogue of overall proportion mediated; rint: randomized analogue of overall proportion attributable to interaction; rpe: randomized analogue of overall proportion eliminated)
#> 
#> Relevant variable values: 
#> $a
#> [1] 1
#> 
#> $astar
#> [1] 0
#> 
#> $yval
#> [1] "1"
#> 
#> $mval
#> $mval[[1]]
#> [1] 1