THRESHOLDS FOR INTERPRETING EFFECT SIZES
Paul D. Ellis, Hong Kong Polytechnic University
 
Interpretation is essential if researchers are to extract meaning from their results. However, the interpretation of effect sizes is a subjective process. What is an important and meaningful effect to you may not be so important to someone else. Many researchers trained in the pseudo-objectivity of statistical significance testing are uncomfortable making these sorts of value judgments. Consequently, research results often go uninterpreted. Sometimes researchers try to draw meaningful conclusions from their p values. That is, statistically significant results are passed off as if they were of substantive significance. But a statistically significant result is not necessarily important or meaningful. As Cohen (1994, p.1001) famously observed:
 
  All psychologists know that statistically significant does not mean plain English significant, but if one reads the literature, one often discovers that a finding reported in the Results section studded with asterisks implicitly becomes in the Discussion section highly significant or very highly significant, important, big!
 
The problem with confusing statistical with substantive significance is that p values are confounded indexes that reflect the effect size, the sample size, and the test type (Lang et al. 1998). In many cases a statistically significant result merely tells us that a big sample was used. But what we really want to know is: how big is the effect, what does it mean, and for whom?
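The dependence of p on sample size is easy to demonstrate. The short Python sketch below is purely illustrative: it holds the standardized mean difference fixed at a small d = 0.2 and varies only the group size, and the p value swings from clearly non-significant to highly significant even though the effect itself never changes.

  # Illustrative only: the same effect size gives very different p values as n grows.
  from math import sqrt
  from scipy import stats

  d = 0.2  # a "small" standardized mean difference by Cohen's (1988) benchmarks

  for n_per_group in (20, 100, 1000):
      # For two equal groups, the pooled t statistic is t = d * sqrt(n/2)
      t = d * sqrt(n_per_group / 2)
      df = 2 * n_per_group - 2
      p = 2 * stats.t.sf(abs(t), df)  # two-tailed p value
      print(f"n = {n_per_group:>4} per group: t = {t:.2f}, p = {p:.4f}")
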
 
Across the social science disciplines, editors and academy presidents are increasingly calling for researchers to interpret the substantive, as opposed to the statistical, significance of their results. These calls are framed as appeals for research that is relevant, useful, and that matters. To extract meaning from their results, researchers need to look beyond p values and effect sizes and make informed judgments about what they see. No one is better placed to do this than the researcher who collected and analyzed the data (Kirk 2001).
 
To assess the substantive significance of a result, we need to interpret our estimates of the effect size. The critical question is not "how big is it?" but "is it big enough to mean something?" Effects by themselves are meaningless unless they can be contextualized against some frame of reference, such as a well-known scale (e.g., IQ) or a previous result (e.g., 15% more efficient). The interpretation of results becomes problematic when effects are measured indirectly using arbitrary or unfamiliar scales. Imagine your doctor gave you the following information:
 
  Research shows that people with your body-mass index and sedentary lifestyle score on average two points lower on a cardiac risk assessment test in comparison with active people with a healthy body weight.
 
Would this prompt you to make drastic changes to your lifestyle? Probably not. Not because the effect reported in the research is trivial but because you have no way of interpreting its meaning. What does "two points lower" mean? Does it mean you are more or less healthy than a normal person? Is two points a big deal? Should you be worried? Being unfamiliar with the scale, you are unable to draw any conclusion.
   
Now imagine your doctor said this to you instead:

  Research shows that people with your body-mass index and sedentary lifestyle are four times more likely to suffer a serious heart attack within 10 years in comparison with active people with a normal body weight.
 
Now the doctor has your full attention. This time you're sitting on the edge of your seat, gripped with a resolve to lose weight and start exercising again. Hearing the research result in terms that are familiar to you, you are better able to extract its meaning and draw conclusions.
 
Unfortunately the medical field is something of a special case when it comes to reporting results in metrics that are widely understood. Most people have heard of cholesterol, blood pressure, the body-mass index, blood-sugar levels, etc. But in the social sciences many phenomena (e.g., self-esteem, trust, satisfaction, power distance, opportunism, depression) can only be observed indirectly by getting people to circle numbers on an arbitrary scale. The challenge for the researcher is to translate results obtained from such scales into effects that are non-arbitrary. Ordinary people don't care if a treatment leads to a statistically significant difference between "before" and "after" scores. What they want to know is whether the treatment works. In other words, how effective is the treatment in terms of outcomes they value and understand?
 
When these sorts of assessments are difficult to make, one can refer to effect size conventions such as those proposed by Jacob Cohen. In his authoritative Statistical Power Analysis for the Behavioral Sciences, Cohen (1988) outlined criteria for gauging small, medium and large effect sizes (see Table 1). According to Cohen's logic, a standardized mean difference of d = .18 would be trivial in size, not big enough to register even as a small effect. Conversely, a correlational effect of r = .18 would qualify as a small effect. It's not particularly big, but it's certainly nontrivial. The thresholds used in the Result Whacker are taken from Cohen and can be summarized as follows:

  Table 1: Cohen's (1988) effect size benchmarks
  Effect size index                  Small    Medium    Large
  Standardized mean difference (d)    .20       .50      .80
  Correlation coefficient (r)         .10       .30      .50

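For readers who like to see the labelling logic spelled out, the sketch below applies Cohen's cut-offs in a few lines of Python. It is only an illustration of the idea; the function name and structure are illustrative and are not the Result Whacker's actual code.

  # Illustrative sketch of threshold-based labelling using Cohen's (1988) benchmarks.
  COHEN_THRESHOLDS = {
      "d": (0.20, 0.50, 0.80),  # standardized mean difference
      "r": (0.10, 0.30, 0.50),  # correlation coefficient
  }

  def label_effect(value, metric="d"):
      """Return Cohen's verbal label for an effect size (the sign is ignored)."""
      small, medium, large = COHEN_THRESHOLDS[metric]
      size = abs(value)
      if size < small:
          return "trivial"
      if size < medium:
          return "small"
      if size < large:
          return "medium"
      return "large"

  print(label_effect(0.18, "d"))  # trivial - below the small-d threshold of .20
  print(label_effect(0.18, "r"))  # small - above the small-r threshold of .10

Of course, a lookup like this says nothing about whether the effect matters in any particular context, which is precisely the objection raised next.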
These thresholds are simple to grasp and have arguably achieved conventional status. Yet their use in informing judgments about research results is controversial. Noted scholars such as Gene Glass, one of the developers of meta-analysis, have vigorously argued against classifying effects into "t-shirt sizes" of small, medium and large:
 
  There is no wisdom whatsoever in attempting to associate regions of the effect size metric with descriptive adjectives such as "small," "moderate," "large," and the like. Dissociated from a context of decision and comparative value, there is little inherent value to an effect size of 3.5 or .2. Depending on what benefits can be achieved at what cost, an effect size of 2.0 might be "poor" and one of .1 might be "good." (Glass et al. 1981, p.104)
 
The temptation to plug in a result and whack out a ready-made interpretation based on an arbitrary benchmark may discourage the researcher from thinking about what the results really mean. Cohen himself was well aware of the "many dangers" associated with benchmarking effect sizes, noting that the conventions were devised "with much diffidence, qualifications, and invitations not to employ them if possible" (1988, pp.12, 532). A similar warning is made here: excessive use of the Result Whacker is bad for your health!
   
Ideally, scholars will interpret the substantive significance of their research results by grounding them in a meaningful context or by assessing their contribution to knowledge. When this is problematic, Cohen's benchmarks may serve as a last resort. The fact that they are used at all, given that they have no raison d'être beyond Cohen's own judgment, speaks volumes about the inherent difficulties researchers have in drawing conclusions about the real-world significance of their results.
 
 
Technical note:
 
Rosenthal and Rosnow (1984, p.361) noted that the thresholds for small, medium and large effects are not consistent for r and d. This can be seen by converting Cohen's d benchmarks into their r equivalents. Although a small d = .20 is equivalent to a small r = .10, the r equivalent of a medium-sized d = .50 would be rated a small effect (.24, not .30) in the correlational metric. Thus Cohen's benchmarks effectively require larger underlying effects when strength of association is the metric.
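The figures quoted above follow from the standard conversion between d and r for two equal-sized groups, r = d / sqrt(d^2 + 4). A quick check (illustrative Python, with the equal-group-size assumption built in) reproduces them:

  # r equivalent of d for two equal-sized groups: r = d / sqrt(d^2 + 4)
  from math import sqrt

  def r_from_d(d):
      return d / sqrt(d ** 2 + 4)

  print(f"{r_from_d(0.20):.2f}")  # 0.10 -> matches Cohen's small r
  print(f"{r_from_d(0.50):.2f}")  # 0.24 -> short of Cohen's medium r of .30
  print(f"{r_from_d(0.80):.2f}")  # 0.37 -> short of Cohen's large r of .50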
 
References
Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Hillsdale: Lawrence Erlbaum.
Cohen, J. (1994), "The earth is round (p<.05)," American Psychologist, 49(12), 997-1003.
Glass, G.V., B. McGaw, and M.L. Smith (1981), Meta-Analysis in Social Research. Beverly Hills: Sage.
Kirk, R.E. (2001), "Promoting good statistical practices: Some suggestions," Educational and Psychological Measurement, 61(2), 213-218.
Lang, J.M., K.J. Rothman and C.I. Cann (1998), "That confounded p-value," Epidemiology, 9(1), 7-8.
Rosenthal, J.A. (1996), "Qualitative descriptors of strength of association and effect size," Journal of Social Service Research, 21(4), 37-59.
Rosenthal, R. and R.L. Rosnow (1984), Essentials of Behavioral Research: Methods and Data Analysis. New York: McGraw-Hill.
 
Links
 
Click here to go to Paul Ellis’s effect size website
The effect size equations
Effect size calculators
The Result Whacker
The Essential Guide to Effect Sizes
 
How to cite this page: Ellis, P.D. (2009), "Thresholds for interpreting effect sizes," website: [insert domain name here] accessed on [insert access date here].
 
Last updated: 7 Sept 2009