Some fun data simulations I use to think about confounders and colliders

I love conducting very simple simulations and getting terrified by the implications of their results.

Let’s say you want to measure the association between the outcome that you care about, y, and your key predictor, x. You begin with a simple regression model, predicting y with x.

I'll run three simulations below. In one, the simple regression estimate for x is correct (1); in another, it's about 35% smaller than it should be (2); in a third, it's about 80% larger than it should be (3). Let's start with the too-small estimate (2). Below is the Stata code to set it up:

clear all
set obs 10000
* create a matrix to establish correlation of x and z
matrix C = (1,-.35\-.35,1) 
* simulate x and z, with correlation set to -.35 (via matrix C)
corr2data x z , means(0 0) sd(1 1) corr(C)  n(10000)
twoway scatter x z
graph export s8811-02-simuation_lecture-02.emf, replace
gen y = 1 + x*1 + z*1 + rnormal(0,1) // true coefficients: 1 on x, 1 on z
reg y x
eststo m1
reg y z
eststo m2
reg y x z
eststo m3

esttab m1 m2 m3

Variables x and z are negatively correlated (-0.35: higher x, lower z), and both are positively related to y. So if you include only x, or only z, in your regression equation, you won't recover its actual association with y.
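Where does that 35% come from? With everything standardized, the textbook omitted-variable-bias formula puts the bias on x's coefficient at beta_z * corr(x,z). A quick check of the arithmetic (mine, not output from the post):

* omitted-variable bias: with sd(x) = sd(z) = 1, omitting z biases x by beta_z * corr(x,z)
display "expected coefficient on x when z is omitted: " 1 + 1*(-.35)

That 0.65 is the "35% smaller" estimate, and by symmetry z on its own comes out at 0.65 too.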

In m1 above, the coefficient on x is too small; in m2, the coefficient on z is too small. With both included in the model (m3), the coefficient magnitudes rise to their correct size of 1. Now let's look at (3), which is a weird one.

cls
clear all
set obs 10000
* create a matrix to establish the correlations among x, z1, and z2
matrix C = (1, .3 , .5 \ .3 ,1 , 0 \ .5 , 0 , 1)
* simulate x, z1, and z2; corr(x,z1) = .3, corr(x,z2) = .5, corr(z1,z2) = 0 (via matrix C)
corr2data x z1 z2 , means(0 0 0) sd(1 1 1) corr(C)  n(10000)
corr x z1 z2
gen y = 1 + x*1 + z1*1 + z2*1 + rnormal(0,1) // true coefficients: 1 on each of x, z1, z2
reg y x
eststo m1
reg y x z1
eststo m2 
reg y x z2 
eststo m3 
reg y x z1 z2
eststo m4
esttab m1 m2 m3 m4

This setup is fun and weird. Now there isn't just one confounding z variable. There are two! x is positively correlated with both z1 (at 0.3) and z2 (at 0.5), while z1 and z2 are uncorrelated with each other. All three have the same true association with y: 1.

Without z1 or z2 in the regression model, x's association with y is much too large! But including only z1, or only z2, still leaves x's coefficient much too large (by the omitted-variable-bias formula, you'd expect roughly 1.55 with only z1 in the model and 1.40 with only z2). It's only when z1 and z2 are both included that x's coefficient is correctly reduced to 1. Even if your model is mostly correct (z1 or z2 is in there), you'll still get the wrong answer! This setup is fun to play with: you can let z2 affect all of the other variables, just pairs (x and y, x and z1, z1 and y), or just a single variable (x only). You can also vary the magnitudes of the associations and correlations to see how the magnitude and direction of the bias change.
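As an example of that last suggestion, here's a minimal sketch (my own, not from the original post) that loops over a few values for the correlation between x and z1 and prints the naive estimate you get when both z's are left out of the model:

* vary corr(x,z1) and watch the bias in the naive regression of y on x
foreach r in 0 .2 .4 {
    clear
    matrix C = (1, `r', .5 \ `r', 1, 0 \ .5, 0, 1)
    corr2data x z1 z2 , means(0 0 0) sd(1 1 1) corr(C) n(10000)
    gen y = 1 + x*1 + z1*1 + z2*1 + rnormal(0,1)
    quietly reg y x
    display "corr(x,z1) = `r' -> naive coefficient on x: " %5.3f _b[x]
}

The naive coefficient should land near 1 + r + .5: each omitted confounder contributes its true coefficient times its correlation with x.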

Let's look at example (1), where the simple regression of y on x gets the coefficient right: 1. Below is the code to conduct this simulation.

cls 
clear all
set obs 10000 // set sample to 10,000 observations
gen x = rnormal(0,1)                // x is exogenous noise
gen y = x*1 + rnormal(0,1)          // y is generated from x alone; true coefficient 1
gen z = x*1 + y*1 + rnormal(0,1)    // z is generated from both x and y
corr x z
reg y x
eststo m1
reg y z
eststo m2
reg y x z
eststo m3
esttab m1 m2 m3

Here, the correctly specified model includes only x (y is generated from x plus random noise, rnormal(0,1)). But maybe you haven't thought carefully about which variables to control for, and you notice that z has a positive association with y and is also positively associated with x.

(Here we see a large positive correlation between z and x. Higher z, higher x).

In model m1, we see that x has a positive and significant association with y, with a coefficient of 1. In model m2, we see that z has a positive and significant association with y, with a coefficient of 0.5. When both z and x are included (m3), z remains positive and significant with its coefficient essentially unchanged, while x loses significance and its coefficient goes to zero.
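Those numbers aren't a coincidence. Here's a small population-level check (my own sketch, not from the post): write down the variance-covariance matrix implied by the data-generating process and solve for the regression coefficients directly.

* y = x + e_y and z = x + y + e_z imply: var(x)=1, var(y)=2, var(z)=6,
* cov(x,y)=1, cov(x,z)=2, cov(y,z)=3
matrix V = (1, 2, 1 \ 2, 6, 3 \ 1, 3, 2)   // rows/columns ordered x, z, y
matrix Vxz = V[1..2, 1..2]                  // var/cov of the two regressors
matrix Vxy = V[1..2, 3]                     // their covariances with y
matrix b = invsym(Vxz) * Vxy
matrix list b                               // x: exactly 0; z: exactly 0.5

Regressing y on z alone gives cov(y,z)/var(z) = 3/6 = 0.5, matching m2; once x joins z in the model, z keeps its 0.5 and x is driven to exactly 0, matching m3.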

You may think to yourself: aha! The result in m1 isn't real, because z is a confounder, and when z is included in the regression model, x's effect goes away. A concrete example: SAT scores predict college success, but both are shaped by childhood socioeconomic origins. Control for that, and the association between SAT and college success declines.

BUT. I made the data. And I made m1 the true model: x's effect on y really is 1. I ALSO made z a collider in this setup: x and y cause z, but z doesn't cause y. Just as in the examples above, including z changes the estimated association between x and y. But here that change is inappropriate, whereas in the earlier cases the change WAS appropriate. How do I know? Because I made the data. When you don't get to make the data for your research, the variation across these simulations should hopefully cause some sleepless nights as you fret over whether you correctly specified the theoretical model that guides your model specification.
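One more variant worth running (again my own sketch, same machinery): make x and y completely independent, so the true effect of x is exactly zero. Conditioning on the collider still manufactures an association, and by the same covariance algebra as above, it comes out around -0.5:

clear all
set obs 10000
gen x = rnormal(0,1)
gen y = rnormal(0,1)              // pure noise: x has NO effect on y
gen z = x*1 + y*1 + rnormal(0,1)  // z is still a collider, caused by x and y
reg y x                           // correctly finds nothing
reg y x z                         // conditioning on z conjures a negative "effect" of x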

Anyways, I thought this was neat. Hope it helps you!