When you conduct a research project, you must ensure that you don’t select on the dependent variable. Doing so will create a risk that you produce biased results. I’m thinking about this because I’m reading the excellent Thinking Clearly with Data: A Guide to Quantitative Reasoning and Analysis by Ethan Bueno de Mesquita and Anthony Fowler (BF).

Let’s list out three simple examples they use for selecting on the dependent variable.

Malcolm Gladwell’s 10,000 hours. Gladwell studied successful people and found they tended to invest 10,000 hours of practice into their craft. So…you do the same to be successful. But BF note that Gladwell doesn’t survey non-successful people. Perhaps you would find the same rate of 10k hour investment among non-successful individuals as well. In that case, this practice investment is not predictive of success.
School dropouts. The Gates Foundation studied high school dropouts and found that 70% of them found classes boring. So, make high school classes interesting to increase retention? Nope! A national study of high school aged individuals found that roughly the same percentage of those who did, and those who did not, drop out found their classes boring.
Terrorism. A study of suicide bombings found that they frequently occur in countries occupied by the US military. But the study selected mostly on countries with terrorist attacks. Many countries have been occupied by the US military but have not produced suicide bombers (Germany, South Korea, Japan, etc.).
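
To make the dropout example concrete, here is a minimal Python sketch. Everything in it is an assumption of mine rather than anything from BF or the Gates Foundation: the 20% dropout rate, the function names, and the simulated students are all made up, and boredom is generated independently of dropping out on purpose. The only number borrowed from the example is the 70% boredom rate.

```python
import random

def simulate_students(n, p_dropout=0.2, p_bored=0.7, seed=0):
    """Draw n hypothetical students.

    Boredom is drawn independently of dropping out, so by construction
    boredom has no relationship with the dependent variable.
    The 20% dropout rate is an arbitrary assumption.
    """
    rng = random.Random(seed)
    return [
        {"dropout": rng.random() < p_dropout, "bored": rng.random() < p_bored}
        for _ in range(n)
    ]

def share_bored(students):
    """Fraction of the given students who report being bored."""
    return sum(s["bored"] for s in students) / len(students)

students = simulate_students(100_000)

# Selecting on the dependent variable: look only at dropouts.
dropouts = [s for s in students if s["dropout"]]
# The comparison group a dropouts-only study never sees.
graduates = [s for s in students if not s["dropout"]]

print(f"bored among dropouts:  {share_bored(dropouts):.2f}")
print(f"bored among graduates: {share_bored(graduates):.2f}")
```

Looking only at the dropouts, you would see that most of them were bored and might conclude boredom matters. Only the second line, the one a dropouts-only study can't produce, shows that the two groups look about the same.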

So far so good. Don’t just study a restricted range of your DV. Got it.

…but I actually don’t got it.

This point also relates to issues of selection and conditioning on a collider. In the linked blog post, a spurious negative relationship between conscientiousness and IQ is found among college graduates because college attendance is positively correlated with both conscientiousness and IQ. So one could argue that by restricting the sample to college graduates, you’re partially selecting on the dependent variable (college graduates have slightly higher mean conscientiousness). I think that’s right.
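
The collider story can be sketched the same way. This is a toy simulation under assumptions I am making up for illustration, not the analysis from the blog post: both traits are standard normal and independent in the full population, and the graduation rule (sum of the two traits plus noise above an arbitrary cutoff of 1.0) is hypothetical.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
n = 50_000

# Independent traits in the full (simulated) population.
conscientiousness = [rng.gauss(0, 1) for _ in range(n)]
iq = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical rule: graduating depends positively on both traits plus luck.
graduated = [
    c + q + rng.gauss(0, 1) > 1.0
    for c, q in zip(conscientiousness, iq)
]

# Restrict the sample to graduates, i.e. condition on the collider.
grad_consc = [c for c, g in zip(conscientiousness, graduated) if g]
grad_iq = [q for q, g in zip(iq, graduated) if g]

r_all = pearson(conscientiousness, iq)
r_grad = pearson(grad_consc, grad_iq)

print(f"correlation, everyone:       {r_all:+.2f}")
print(f"correlation, graduates only: {r_grad:+.2f}")
```

In the full sample the correlation should sit near zero, while among graduates it should come out clearly negative, even though nothing in the setup links the two traits directly.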

Let’s go back to the HS dropout question.

We don’t want to select on those who drop out of HS because we won’t be able to see whether boredom correlates with the probability of dropping out. Makes sense. But at what point do we stop selecting on the dependent variable? If I’m willing to not think about this question too much, the answer’s obvious. But when I think about it a bit more seriously, I get thoroughly confused.

We must include individuals who are not simply HS dropouts in our sample to understand the factors that associate with the probability of dropping out. So we include HS grads, perhaps by conducting a nationally representative study of, say, 17-25 year olds, and asking all folks about how boring they found their HS classes (the national study BF cite found everyone was about equally bored). Easy peasy, lemon squeezy. But is that the only group we should account for?

Consider migration. Every year some number of children under the age of 18 leave the United States. It is highly unlikely the migratory children are representative of the distribution of all key observable and unobservable characteristics that may predict both dropping out of high school and whether a young person finds school classes boring. So a random sample of 18-year-olds who have not migrated probably has some amount of selection that confounds results, right?

Consider time. Why restrict your population to today? Many of the factors affecting high school attainment play out across multiple years, multiple decades. Are you selecting on the dependent variable if you select from a year with a disproportionate concentration of non-dropouts? Seems likely.

Consider non-respondents. Obviously not evenly distributed across dropouts and the overly bored.

What reason do we have to not get weirder? What about the deceased? I suspect less educated and more risk tolerant (or more easily bored) individuals are more likely to die at young ages. So wouldn’t their exclusion potentially introduce selection-based bias?

Let’s get even weirder. What about the potential humans, those trillions of zygotic potentialities who didn’t successfully make it to be a physical person and are thus excluded from our sample? Are you telling me the actualized people are a random selection from the broader potentialities? I’m not convinced. But then aren’t we again in a situation where we’re selecting on the actual folks, and so potentially introducing bias in the association we care about, the one between HS completion and feeling bored in class? We care more about that association than the observed distribution of humans who won out the competition to be born, right?

I realize that some of these examples are silly. But when I think seriously about this topic, the potential problems introduced by the obvious examples and the ones by the sillier examples don’t feel so terribly distinct.

It seems that the typical response is to handwave and say something like, “think seriously about your question,” “use your substantive knowledge of the case,” or “be guided by theory.” But I’ve noticed that some of these subtler points of selection, identification, and causality don’t really respond well to arguments such as, “my theory suggests doing X.” Don’t the issues of selection, etc. also suggest that previous research is probably fairly shoddy, so previous knowledge of one’s case, and the theories used to understand it, are likely poor bases for one’s decision making? Seems at least plausible.

In my own research, I’m pretty milquetoast, conventional, and practical when it comes to these questions. I feel pretty confident in deploying and addressing issues of selection on the DV in standard ways that will make the typical social scientist satisfied. But I think the deeper and weirder concerns remain lurking in the shadows, and I don’t have a good answer to the concerns they raise.