A closer look at the UMC Call to Action, Part 2

This is the second part in a series taking a closer look at the United Methodist Church’s (UMC) Call to Action Steering Team Report.

The first part is here.

Correlation by itself does not imply causation

(Wikipedia’s article on this subject.)

For correlation, I’m going to use a description from John Allen Paulos:

There are various kinds and various measures of statistical correlation, but all of them indicate that two or more quantities are related in some way and in some degree, not necessarily that one causes the other (Beyond Numeracy: John Allen Paulos; “Correlation, Intervals, and Testing”).

Paulos mentions that children with bigger feet spell better.  Based on a correlation between foot size and spelling ability, we can’t seriously conclude that foot size is a driver of spelling ability and start a vigorous foot-stretching campaign.  What’s more likely is that age is the hidden variable: older children have larger feet than younger children, and (somewhat less certainly) older children tend to spell better than younger children.
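Here is a minimal simulation of this kind of confounding in Python (the numbers are entirely made up, used only to illustrate the idea): age drives both foot size and spelling score, so the two correlate strongly even though neither causes the other.  Controlling for age makes the apparent relationship all but disappear.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    age = rng.uniform(6, 12, n)                       # ages 6 through 12
    foot_size = 14 + 1.2 * age + rng.normal(0, 1, n)  # grows with age
    spelling = 20 + 5.0 * age + rng.normal(0, 5, n)   # improves with age

    # Raw correlation: foot size and spelling look strongly related.
    print(np.corrcoef(foot_size, spelling)[0, 1])     # large positive value

    # "Control" for age: regress each variable on age, then correlate
    # the residuals. The foot-size/spelling relationship vanishes.
    foot_resid = foot_size - np.polyval(np.polyfit(age, foot_size, 1), age)
    spell_resid = spelling - np.polyval(np.polyfit(age, spelling, 1), age)
    print(np.corrcoef(foot_resid, spell_resid)[0, 1]) # near zero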

For another example regarding correlation and causation, consider the following headline:

Happiness wards off heart disease, study suggests

(example taken from this page)

The above headline suggests (to me, anyway) the following scenario: a gloomy patient is talking to the doctor. The doctor says, “You’re suffering from heart disease. You’d better cheer up from now on!” So the patient cheers up. The patient’s change in mood all by itself causes the heart disease to go away, and the patient lives long and happily ever after.

That absurd scenario is not what the study found.  The study found a lower risk of developing heart disease among people who rated higher on the happiness scale.  There could be many subtleties in trying to interpret this finding.  To oversimplify: could good health cause general happiness, could general happiness cause good health, or could some other variable affect both health and happiness?  My point in bringing up this example is that even with a statistically significant correlation, there can still be disputes about how to interpret such a finding.  (The BBC News article linked above quotes three sources regarding what this particular study might mean.)

To summarize: correlation by itself does not imply causation.  It’s easier to see this in the above examples, in part because in the examples we already have a sense of how things work.  But suppose we’re wandering into unfamiliar territory.  We need more to guide us than just correlations uncovered by a single study.

What about data mining?

The section I quoted in the previous post (from page 41 of the Report) suggested that the Report’s “unprecedented data-mining research . . . objectively and systematically uses massive amounts of data to determine cause and effect relationships.”  I suppose establishing cause-and-effect through one data mining study is possible, but I have a difficult time understanding how Towers Watson (TW) has established this.  (TW does not appear to directly make this claim about cause and effect.)

Perhaps an appropriate data mining algorithm could establish cause and effect.  Still, there are known ways of misusing data mining.  Michael Berry in 2004:

In our consulting practice, we have seen how often data mining is misused:

  1. to learn things that aren’t true; or
  2. to learn things that are true, but not useful….

Finding data that is inaccurate is more dangerous than finding factual data that is not useful because important business decisions may be based on incorrect information. Data mining results often seem reliable because they are based on actual data derived in a seemingly scientific manner. This appearance of reliability can be deceiving. The data itself may be incorrect or not relevant to the question at hand. The patterns discovered may reflect past business decisions or nothing at all. Data transformations, within the system, such as summarization, may have destroyed or hidden important information.

With that warning out of the way, what data mining technique did TW use?  From page 111:

The data mining process used regression analysis, a long-established statistical technique used to identify the impact of multiple factors on a specific desired outcome. …

Regression analysis is commonly used in consumer, employee and political research to help identify and prioritize actions that will have the greatest impact on a desired outcome. In the Vital Congregations research project, regression analysis was used to statistically identify the significant factors that impact the desired outcome – indicators of church vitality.
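As a concrete sketch of the generic technique the Report describes (and only that: the congregation-level variable names below are invented for illustration, and this is not TW’s data or model), a regression that “statistically identifies significant factors” might look like the following in Python, using the statsmodels library:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500

    # Invented congregation-level "factors" (standardized scores).
    small_groups = rng.normal(0, 1, n)
    lay_leadership = rng.normal(0, 1, n)
    worship_style = rng.normal(0, 1, n)

    # An invented outcome built, by construction, from two of the factors.
    vitality = 2.0 * small_groups + 1.0 * lay_leadership + rng.normal(0, 1, n)

    X = sm.add_constant(np.column_stack(
        [small_groups, lay_leadership, worship_style]))
    result = sm.OLS(vitality, X).fit()

    # Small p-values flag small_groups and lay_leadership as "significant";
    # worship_style's p-value will typically be large.
    print(result.pvalues)

Note what the output gives us: coefficients and significance levels, which is to say correlations, and nothing that distinguishes cause from effect.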

It’s true that regression analysis is a “long-established statistical technique” used to identify how different variables (such as X) correlate with a specific desired outcome (which we can call Y).  It’s also true that, for decades, textbooks have included warnings such as the following regarding interpreting regression analysis:

An observed correlation between Y and X should not be interpreted to mean a causal relationship between X and Y regardless of the magnitude of the correlation.  The correlation may be due to a causal effect of X on Y, of Y on X, or of other variables on both X and Y.  Causation should be inferred only after a careful examination of a variety of theoretical and experimental evidence, not merely from statistical studies based on correlation (Computer-Aided Multivariate Analysis, Fourth Edition: Abdelmonem A. Afifi, Virginia Clark, Susanne May; page 118).
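The textbook warning is easy to demonstrate.  In the sketch below (again with made-up data), Y is constructed to be caused by X, yet regressing Y on X and regressing X on Y produce equally “significant” fits; significance alone cannot tell us which direction, if either, causation runs.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(0, 1, 200)
    y = 0.7 * x + rng.normal(0, 1, 200)    # by construction, X drives Y

    forward = stats.linregress(x, y)       # regress Y on X
    reverse = stats.linregress(y, x)       # regress X on Y
    print(forward.pvalue, reverse.pvalue)  # both tiny: both "significant"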

Unless TW used a secret data mining algorithm that establishes cause-and-effect, TW has not established cause-and-effect here.  For the rest of this discussion, I will assume that a secret data mining algorithm was not used.

Why say “impact” so much?

Regression analysis uncovers correlations.  Regression analysis alone does not allow us to talk about causation.  So why would management consultants repeatedly use the word “impact” to describe the relationship between the “drivers of vitality” and the “indicators of vitality” (pages 49, 52, 70, 71, and 74 are a few of many examples)?

I can think of a couple of reasons, both having to do with “consumer, employee and political research.”

First, these types of research are often concerned with identifying short-term opportunities (in the current marketplace, in the current technological workplace, until the next election), and in these situations finding correlations might be good enough.  To use a mundane example: suppose we find a correlation between “having a runny nose” and “buying Brand-Z over-the-counter decongestant.”  Perhaps surprisingly, this finding might not be trivial from a marketing perspective: it might tell us, for example, that the people more likely to purchase “Brand-Z” are purchasing it for themselves and generally don’t plan ahead.  Informally, we can talk about “having a runny nose” “impacting” the purchase of Brand-Z.  Strictly speaking, though, we’re better off thinking of this finding as a “snapshot” of a specific opportunity: if a competitor releases a cheaper “Brand-ZZ” in six months, the finding might be worthless by then.  Even so, the snapshot could represent a useful short-term marketing opportunity.

Second, the desired outcome in these research areas generally has a clear definition.  Some possible examples: in consumer research, consumers purchase the product; in employee research, call time is reduced; in political research, voters vote for the desired candidate.  In each of these examples, we have a clear understanding of what the desired outcome is.  Let’s consider again the example of purchasing Brand-Z.  There might be some question about the details of what a purchase of Brand-Z consists of (for example, paying by cash or credit; buying on the web, by mail order, or in person), but ultimately there will be an exchange of money for the product.  This clear understanding of the desired outcome makes it even easier to talk about “having a runny nose” “impacting” the purchase of Brand-Z.

The Report’s analysis differs from these scenarios.  The Report’s desired outcome is determined by the vitality index, an intricate calculation and ranking process rather than an easily understood outcome.  It is entirely possible that the regression analysis has merely uncovered correlations between the “drivers of vitality” and the way vitality happens to be defined.  In other words, it’s not clear what these mathematical relationships mean beyond what’s inside the computer.  Furthermore, these could be just short-term correlations: without a deeper understanding, we don’t know how well these results might generalize several years into the future.
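To illustrate the first concern with a deliberately circular toy example (the component names are invented): if an index is defined as a weighted combination of certain inputs, a regression of the index on those inputs will simply rediscover the weights used to define it, telling us nothing about the world outside the computation.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 300

    # Invented component scores.
    attendance_growth = rng.normal(0, 1, n)
    small_groups = rng.normal(0, 1, n)

    # The index is *defined* as a weighted sum of the components.
    vitality_index = 0.6 * attendance_growth + 0.4 * small_groups

    # Least squares "discovers" the weights we just used to build the index.
    X = np.column_stack([attendance_growth, small_groups])
    weights, *_ = np.linalg.lstsq(X, vitality_index, rcond=None)
    print(weights)  # approximately [0.6, 0.4]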

None of the methods stated in the Report allow us to say cause-and-effect relationships have been established.  (Page 114 mentions analysis of variance, or ANOVA.  ANOVA can tell us whether different sample means likely differ from each other by more than chance alone would produce; it does not tell us why they differ.)  To put the situation more bluntly: it’s possible that the UMC is about to start the equivalent of a multi-million-dollar “Cheer Up or Else!” campaign to fight heart disease.  Such an initiative cannot be justified by a single study alone.
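To make the parenthetical point about ANOVA concrete, here it is in miniature (with made-up group data): scipy’s f_oneway can report that group means likely differ, but nothing in its output speaks to why.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    group_a = rng.normal(10.0, 2.0, 50)
    group_b = rng.normal(11.5, 2.0, 50)
    group_c = rng.normal(13.0, 2.0, 50)

    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_stat, p_value)  # small p-value: means likely differ; no "why" given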

In part 3, I look at a specific problem possibly related to confusing correlation and causation: bias in the Report’s vitality index.