(This is the third part in a series taking a closer look at the United Methodist Church’s (UMC) Call to Action Steering Team Report.
The first part is here.)
In this post, I examine the bias of the Report’s vitality index. Specifically, I’m looking at how often the vitality index finds congregations vital. (This assignment of vitality takes place prior to the analysis that discovers the “drivers of vitality.”) The Report’s stated goal is to increase the number of vital congregations. If we’re trying to increase the number of vital congregations, it’s only fair that we make sure everyone is starting in the same place. If a clearly identifiable group is in fact starting out behind other groups, fairness requires us to deal with this imbalance before we start any great initiatives.
What does the Report say about how often different kinds of churches are called vital?
From page 24 (emphasis in original):
The reliable statistical findings indicate that high-vitality churches come in all sizes, ethnic representations, church settings, and geographies, but they consistently share common factors that work together to influence congregational vitality.
From page 73 (also on page 37 and 47):
Based on vitality index, Towers Watson found that all kinds of UMC churches are vital – small, large, across different geographies, and church setting (e.g., urban, rural)
From page 115 (emphasis in original):
The analysis identified four areas described in more detail in the Steering Team report: Small Groups and Programs, Worship, Lay Leadership and Pastoral Leadership. These areas were not just related to the vitality of a single church or a handful of churches. These areas were found to have a strong, positive impact on the Indicators of Church Vitality across thousands of churches. There are examples of churches with high vitality that have been successful in each of these areas in every district in North America, in larger churches, in smaller churches, in predominantly minority churches in churches in urban communities and in churches in suburban or rural communities.
One might conclude from the statements above (combined with the discussion on page 102) that, everything else being equal,
- A small congregation has the same chance of being called vital as a large congregation;
- A congregation in the Northeastern jurisdiction has the same chance of being called vital as a congregation in the South Central jurisdiction;
- A predominantly Hispanic congregation has the same chance of being called vital as a predominantly white congregation.
In all three of the above cases, one would be wrong.
Larger congregations are more likely to be vital
We don’t have to look far in the Report regarding congregation size and vitality. From page 69:
While larger churches are more likely to be vital than smaller churches there are many high vital churches across all church sizes
This could be an important finding: larger churches are more likely to be vital. It could also have operational consequences: for example, it could make it easier to justify closing smaller congregations. But on further reflection, I'm not sure what this finding ("larger churches are more likely to be vital than smaller churches") is actually telling us:
- It’s possible to interpret this finding to mean that an essential thing called “vitality” is more likely to express itself in larger congregations. It’s also possible to interpret it to mean that larger congregations simply have an easier time meeting the Report’s definition of “vitality.” There could be other interpretations as well. Without any research beyond one study showing a correlation, how can we choose among these competing interpretations?
- We’re told that one of the “key drivers of vitality” is “Mix of traditional and contemporary worship services” (page 73). I can’t help but read page 84 and think, “This mix of services only helps churches with an average weekly attendance (AWA) of at least 350.”
- Another “key driver of vitality” is the Number of Small Groups & Programs (page 73). Page 75 says, “Regardless of size, more vital churches have more small groups.” On page 75, I don’t see that big a difference in the number of small groups for churches with an AWA below 100, and I don’t see much of a difference on pages 76 and 77 either. This brings to mind two issues: (a) statistical significance does not necessarily mean practical importance (“statistical significance” only means that the difference is likely greater than what chance alone would produce); and (b) I can’t help but wonder if the ecological fallacy is at work in the Report. (From the Wikipedia link: “An ecological fallacy is a logical fallacy in the interpretation of statistical data . . . whereby inferences about the nature of specific individuals are based solely upon aggregate statistics collected for the group to which those individuals belong.”)
- The previous two points about how smaller churches should implement the “key drivers of vitality” are not trivial since they would affect many churches in the U.S. In the U.S. in 2008, UMC churches with a membership below 100 accounted for 10% of church membership, but 47% of all churches (page 189). Admittedly, church membership below 100 is not the same measure as church AWA below 100; my point here is that there are many smaller churches in the U.S.
How much more likely is a larger congregation to be vital (compared to a smaller one)? Page 69 does not say. But it does display a pie chart showing “Percent of Total Vital Churches by Church Size.” (Incidentally, the total number of vital churches in this pie chart is 4,971. This differs from the total of 4,961 vital churches given on pages 37, 68, and 111.)
Next I want to take a closer look at the analysis on page 102. I’m going to use a hypothetical example to help sort out the issues involved.
(The next section might be difficult for someone who hasn’t worked with some of these concepts before. That’s OK. The most important point of the next section is that despite some vague reassurances, the Report does not analyze the question of vitality index bias.)
Let’s imagine a company with 1000 employees: 500 from ethnic group 1 and 500 from ethnic group 2. This company hires independent consultants to select noteworthy employees: employees who will be studied for what they can teach everyone else about being employees.
In the first step of their analysis, the independent consultants choose 150 noteworthy employees: 60 happen to be from group 1, and 90 happen to be from group 2. We can summarize this in TABLE 1 below:

TABLE 1

|                 | Group 1 | Group 2 | Total |
|-----------------|---------|---------|-------|
| Noteworthy      | 60      | 90      | 150   |
| Not noteworthy  | 440     | 410     | 850   |
| Total           | 500     | 500     | 1000  |
| Noteworthy rate | 12%     | 18%     | 15%   |
The independent consultants are getting ready to move on to the next step in their analysis. Suppose we freeze the action right here and take a closer look at TABLE 1. I can think of two possible reactions to the above table:
- “It looks like employees in each group are being called noteworthy at about the same rate. After all, 12% and 18% are pretty similar numbers. This looks fair.”
- “Why are employees in Group 2 fifty percent more likely to be called noteworthy than employees in Group 1? This isn’t fair!”
We can think of the question here as, “Are employees being selected as noteworthy at the same rate in each group?” Now I’m not going to claim that statistics alone can resolve this question, but statistics can help clarify it. If we frame the problem as whether two variables – ethnicity and noteworthiness – are independent of each other, we can analyze the situation with a statistical test any first-year statistics student should have seen: the chi-squared test of independence. In running this test, we’re checking for independence of the two variables. If the two are independent, we can reasonably dismiss any difference in noteworthiness as random. If the two are not independent, we need to consider that the assignment of noteworthiness might not be “color-blind.”
When we run the chi-squared test of independence, we get an associated p-value. We can interpret this value as answering the question: assuming the two variables are independent, what is the probability of observing a result at least this extreme? Typically a p-value less than 0.05 (5%) is considered statistically significant, and we reject the assumption of independence. If the p-value is less than 0.05, we conclude that the two measurements are not independent of each other. If the p-value is at least 0.05, we have no evidence against the assumption that the two measurements are independent. (Usually when analysts look at p-values, they want to find statistically significant results and reject the assumption of independence – to oversimplify, this could mean they’ve found something worth studying. In this particular case, however, we don’t want to find a statistically significant result: we want a p-value of 0.05 or greater. In order to assume that noteworthiness assignments are color-blind, we want to keep the assumption of independence.)
If the independent consultants run the chi-squared test of independence, in this particular case the associated p-value is 0.010. Since this value is less than 0.05, we should conclude that ethnicity and noteworthiness are not independent of each other (the assignment of noteworthiness might not be colorblind). In other words, the difference in noteworthiness between group 1 and group 2 cannot be dismissed as simply due to chance alone. This test by itself does not tell us why there is a difference. It only gives us evidence that a difference exists. We need to investigate further.
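The consultants’ test is easy to reproduce. Below is a minimal, self-contained Python sketch; `chi_squared_2x2` is a helper written for this post, not part of any library. It applies the Yates continuity correction, which is what R’s `chisq.test()` does by default for 2×2 tables (without the correction the p-value comes out closer to 0.008).

```python
import math

# Hypothetical counts from TABLE 1: rows = noteworthy status, cols = ethnic group.
table = [[60, 90],    # noteworthy:     group 1, group 2
         [440, 410]]  # not noteworthy: group 1, group 2

def chi_squared_2x2(table, yates=True):
    """Chi-squared test of independence for a 2x2 contingency table.

    Applies the Yates continuity correction by default, matching
    R's chisq.test() behavior for 2x2 tables.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff * diff / expected
    # Survival function of the chi-squared distribution with 1 degree
    # of freedom: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p = chi_squared_2x2(table)
print(f"chi-squared = {stat:.3f}, p-value = {p:.3f}")
# chi-squared = 6.596, p-value = 0.010
```

Anyone with the appendix tables can run the same few lines against them and check my figures.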
Now contrast the above discussion with the following responses:
- The independent consultants could say, truthfully, “Employees in ethnic group 1 and ethnic group 2 have been found to be noteworthy.” It is true that both ethnic groups are represented in the group of noteworthy employees. This statement does not address how often employees in each ethnic group have been found noteworthy, which is the question.
- The independent consultants could run regression analysis and say, truthfully, that (according to the regression analysis) ethnicity does not correlate significantly with noteworthiness. (Page 102 of the Report basically does this.) The regression analysis answers a different question: given that we know an employee’s ethnicity, can we use that alone to predict whether that employee has been labeled noteworthy? And of course we shouldn’t be able to: the noteworthy designation isn’t common! Only 15% of all employees have the noteworthy designation. (Even the group with the most noteworthy employees, group 2, sees only 18% of its employees deemed noteworthy.)
- The independent consultants could produce a chart like the one in Figure 1A and say, “As you can see, the numbers in the chart are very similar.” This is also a true statement: all the numbers in the chart are Hindu-Arabic numerals.
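To see concretely why the regression framing answers a different question, here is an illustrative Python sketch using the counts from the hypothetical TABLE 1. Because fewer than half the employees in either group carry the noteworthy label, the best prediction rule that uses ethnicity alone is to predict “not noteworthy” for everyone – exactly what a rule that ignores ethnicity would do.

```python
# Hypothetical counts from TABLE 1: (noteworthy, total) per ethnic group.
groups = {"group 1": (60, 500), "group 2": (90, 500)}

# Baseline rule: ignore ethnicity and always predict "not noteworthy".
total = sum(n for _, n in groups.values())        # 1000 employees
noteworthy = sum(k for k, _ in groups.values())   # 150 noteworthy
baseline_accuracy = (total - noteworthy) / total

# Best rule using ethnicity alone: within each group, predict that
# group's majority label.
correct = 0
for k, n in groups.values():
    correct += max(k, n - k)  # the majority label is right this often
accuracy_with_ethnicity = correct / total

print(baseline_accuracy, accuracy_with_ethnicity)
# 0.85 0.85
```

Both rules score 85%: ethnicity adds no predictive power here, even though the chi-squared test on the very same table detects a real difference in rates. That is the sense in which “ethnicity does not predict noteworthiness” can be true while the selection rates still differ by fifty percent.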
With regard to the Report’s vitality index, in a couple of instances we can run the chi-squared test of independence ourselves. We can combine the data on page 68 with data from page 102’s Figure 1 and take a closer look at jurisdiction and the vitality index (Appendix A), and at predominant ethnicity and the vitality index (Appendix B). We can then run the chi-squared test of independence on these two data sets (Appendix C).
Remember: in these cases, we want the associated p-value for each test to be at least 0.05. (This is because we want jurisdiction or ethnicity to be independent of the vitality index.) For each of these tests, the software package R reports a p-value of less than 0.000 000 000 000 000 22 (2.2 × 10⁻¹⁶, the smallest value R’s output will display). This p-value is far less than 0.05.
Using the same format as TABLE 1, let’s take a look at these numbers.
“Vitality” by Jurisdiction and “Vitality” by Predominant Ethnicity
I want to focus on TABLE 8. The Report’s stated goal is “to increase the number of vital congregations.” We can think of TABLE 8 as the “starting line.” Why is there such an obvious discrepancy among congregations with different predominant ethnicities? (Even more important: why doesn’t the Report discuss this discrepancy?) This is not in keeping with the stated Social Principles (PDF) of The United Methodist Church. To quote two representative sentences from ¶162A) Rights of Racial and Ethnic Persons:
We define racial discrimination as the disparate treatment and lack of full access to resources and opportunities in the church and in society based on race or ethnicity.
We support affirmative action as one method of addressing the inequalities and discriminatory practices within our Church and society.
Specifically, what are we to make of the Hispanic demographic being so poorly represented among “vital” congregations?
One might try to diminish the importance of this result by saying, “We’re only discussing 290 congregations here.” That’s true. It’s also true that the “vitality index” only identified about 5 of these congregations as “vital.” If the “vitality index” identified vitality for this demographic at the same rate it did for predominantly white congregations, it should have identified about 46 of these congregations as vital. More importantly, however, we’re dealing with a demographic that is one of the fastest growing in the United States. The Report claims that some of the factors it hopes to reverse are “the four-decade decline in membership; an aging and predominantly Anglo constituency” (page 10). If the Report systematically neglects this important demographic, I’m not sure where the Report’s value lies.
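The arithmetic behind those figures is worth spelling out. In the sketch below, the roughly 16% rate for predominantly white congregations is my back-derivation from the Report’s own counts (46 expected out of 290), not a number the Report states directly.

```python
hispanic_total = 290
hispanic_vital = 5    # roughly, per the vitality index
expected_vital = 46   # if flagged at the predominantly white rate

observed_rate = hispanic_vital / hispanic_total       # about 1.7%
implied_white_rate = expected_vital / hispanic_total  # about 15.9%

print(f"{observed_rate:.1%} vs {implied_white_rate:.1%}, "
      f"a factor of about {implied_white_rate / observed_rate:.0f}")
```

A ninefold gap in how often the index bestows “vitality” is not the kind of discrepancy that should pass without comment.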
One might complain about this being a “fluke.” The Report repeatedly tells us that its results are reliable (pages 7, 24, 34, 38, 39, 40, 60, 107, 110, 111) and representative (pages 34, 38, 60). In a discussion of results being representative and reliable we are explicitly told that “Churches from ethnic minorities (Asian, Black, Hispanic) are represented” (page 60). If in fact such churches are not properly represented in the Report, page 60 would have been an excellent opportunity to say so.
One might wring one’s hands and protest, “We have to follow the data wherever it leads us.” This response ignores that (1) one of the Guiding Principles was being inclusive (page 51); (2) the Steering Team chose the definition of vitality themselves (page 63); and (3) the Report provides no evidence of any checking for vitality index bias.
One might try to argue, “The vitality index isn’t meant to be the only definition of vitality.” Yet the Report used this definition of vitality to allegedly uncover “crystal clear findings that are actionable” (page 73).
Finally, one might want to start talking about the good intentions and hard work of the authors of the Report. All I can say to this is that the Report relies on a measurement that outputs biased results. Being inclusive requires that this output be explained. Page 4 of the Report calls for “a much greater emphasis on outputs as contrasted with intentions and activities…” An investigation of this biased output would be a great place to start.
What else is there?
Towers Watson concludes one section of the Report by saying:
Recognizing that there are many expressions of church vitality, some of which are not readily measurable, the research findings will serve as only one input to decisions taken by the project Steering Team in determining the implications and significance of the drivers of church vitality (page 107).
In the next post, I take a look at this bigger picture.