THRESHOLDS FOR INTERPRETING EFFECT SIZES 

Paul D. Ellis,
Hong Kong Polytechnic University 

Interpretation is essential if researchers are to extract meaning from their results. However, the interpretation of effect sizes is a subjective process. What is an important and meaningful effect to you may not be so important to someone else. Many researchers trained in the pseudo-objectivity of statistical significance testing are uncomfortable making these sorts of value judgments. Consequently, research results often go uninterpreted. Sometimes researchers try to draw meaningful conclusions from their p values; that is, statistically significant results are passed off as if they were of substantive significance. But a statistically significant result is not necessarily important or meaningful. As Cohen (1994, p. 1001) famously observed:


The problem with confusing statistical with substantive significance is that p values are confounded indexes that reflect the effect size, the sample size, and the test type (Lang et al. 1998). In many cases a statistically significant result merely tells us that a big sample was used. But what we really want to know is: how big is the effect, what does it mean, and for whom?

Across the social science disciplines, editors and academy presidents are increasingly calling for researchers to interpret the substantive, as opposed to the statistical, significance of their results. These calls are framed in appeals for research that is relevant, useful, and that matters. To extract meaning from their results researchers need to look beyond p values and effect sizes and make informed judgments about what they see. No one is better placed to do this than the researcher who collected and analyzed the data (Kirk 2001).
To assess the substantive significance of a result we need to interpret our estimates of the effect size. The critical question is not "how big is it?" but "is it big enough to mean something?" Effects by themselves are meaningless unless they can be contextualized against some frame of reference, such as a well-known scale (e.g., IQ) or a previous result (e.g., 15% more efficient). The interpretation of results becomes problematic when effects are measured indirectly using arbitrary or unfamiliar scales. Imagine your doctor gave you the following information:




Would this prompt you to make drastic changes to your lifestyle? Probably not. Not because the effect reported in the research is trivial but because you have no way of interpreting its meaning. What does "two points lower" mean? Does it mean you are more or less healthy than a normal person? Is two points a big deal? Should you be worried? Being unfamiliar with the scale, you are unable to draw any conclusion.  




Now the doctor has your full attention. This time you're sitting on the edge of your seat, gripped with a resolve to lose weight and start exercising again. Hearing the research result in terms that are familiar to you, you are better able to extract its meaning and draw conclusions.
Unfortunately the medical field is something of a special case when it comes to reporting results in metrics that are widely understood. Most people have heard of cholesterol, blood pressure, the body-mass index, blood-sugar levels, etc. But in the social sciences many phenomena (e.g., self-esteem, trust, satisfaction, power distance, opportunism, depression) can only be observed indirectly by getting people to circle numbers on an arbitrary scale. The challenge for the researcher is to translate results obtained from such scales into effects that are non-arbitrary. Ordinary people don't care if a treatment leads to a statistically significant difference between "before" and "after" scores. What they want to know is whether the treatment works. In other words, how effective is the treatment in terms of outcomes they value and understand?
When these sorts of assessments are difficult to make, one can refer to effect size conventions such as those proposed by Jacob Cohen. In his authoritative Statistical Power Analysis for the Behavioral Sciences, Cohen (1988) outlined criteria for gauging small, medium and large effect sizes (see Table 1). According to Cohen's logic, a standardized mean difference of d = .18 would be trivial in size, not big enough to register even as a small effect. Conversely, a correlational effect of r = .18 would qualify as a small effect. It's not particularly big, but it's certainly nontrivial. The thresholds used in the Result Whacker are taken from Cohen and can be summarized as follows:

Small effect: d = .20, r = .10
Medium effect: d = .50, r = .30
Large effect: d = .80, r = .50
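Applying Cohen's cutoffs is mechanical, which is exactly what tools like the Result Whacker do. The sketch below (the function name and data structure are illustrative, not taken from any published tool) labels an effect size using Cohen's (1988) conventions for d and r:

```python
# Cohen's (1988) conventional thresholds. A value at or above a cutoff
# earns that label; anything below the "small" cutoff is trivial.
COHEN_THRESHOLDS = {
    "d": [(0.80, "large"), (0.50, "medium"), (0.20, "small")],  # standardized mean difference
    "r": [(0.50, "large"), (0.30, "medium"), (0.10, "small")],  # correlation coefficient
}

def label_effect(value: float, metric: str = "d") -> str:
    """Return Cohen's qualitative label for an effect size (sign is ignored)."""
    magnitude = abs(value)
    for cutoff, label in COHEN_THRESHOLDS[metric]:
        if magnitude >= cutoff:
            return label
    return "trivial"

print(label_effect(0.18, "d"))  # prints "trivial"
print(label_effect(0.18, "r"))  # prints "small"
```

This reproduces the asymmetry noted above: the same number, .18, is trivial as a d but small as an r.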
These thresholds are simple to grasp and have arguably achieved conventional status. Yet their use in informing judgments about research results is controversial. Noted scholars such as Gene Glass, one of the developers of meta-analysis, have vigorously argued against classifying effects into "t-shirt sizes" of small, medium and large:


The temptation to plug in a result and whack out a ready-made interpretation based on an arbitrary benchmark may hinder the researcher from thinking about what the results really mean. Cohen himself was not unaware of the "many dangers" associated with benchmarking effect sizes, noting that the conventions were devised "with much diffidence, qualifications, and invitations not to employ them if possible" (1988, pp. 12, 532). A similar warning is made here: excessive use of the Result Whacker is bad for your health!



Technical note:
Rosenthal and Rosnow (1984, p. 361) noted that the thresholds for small, medium and large effects are not consistent for r and d. This can be seen by converting each d threshold into r using r = d/√(d² + 4) and comparing the result with Cohen's r cutoffs:

Small: d = .20 → r = .10 (Cohen's r cutoff: .10)
Medium: d = .50 → r = .24 (Cohen's r cutoff: .30)
Large: d = .80 → r = .37 (Cohen's r cutoff: .50)

Although a small d = .20 is equivalent to a small r = .10, the r equivalent of a medium-sized d = .50 would be rated a small effect (.24, not .30) in the correlational metric. Thus Cohen requires larger effects when measuring the strength of association.
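The inconsistency is easy to verify. A short sketch using the standard d-to-r conversion, r = d/√(d² + 4), which assumes equal group sizes:

```python
from math import sqrt

def d_to_r(d: float) -> float:
    """Convert a standardized mean difference d to a correlation r,
    assuming equal group sizes: r = d / sqrt(d**2 + 4)."""
    return d / sqrt(d ** 2 + 4)

# Convert Cohen's small, medium and large d thresholds into r.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f}  ->  r = {d_to_r(d):.2f}")
```

For a medium-sized d = .50 this yields r ≈ .24, short of Cohen's .30 cutoff for a medium correlation, which is exactly Rosenthal and Rosnow's point.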
References 

Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Hillsdale: Lawrence Erlbaum.  
Cohen, J. (1994), "The earth is round (p < .05)," American Psychologist, 49(12), 997-1003.

Glass, G.V., B. McGaw, and M.L. Smith (1981), Meta-Analysis in Social Research. Beverly Hills: Sage.
Kirk, R.E. (2001), "Promoting good statistical practices: Some suggestions," Educational and Psychological Measurement.



Lang, J.M., K.J. Rothman and C.I. Cann (1998), "That confounded p-value," Epidemiology, 9(1), 7-8.

Rosenthal, J.A. (1996), "Qualitative descriptors of strength of association and effect size," Journal of Social Service Research.




How to cite this page: Ellis, P.D. (2009), "Thresholds for interpreting effect sizes," website: [insert domain name here] accessed on [insert access date here].  
Last updated: 7 Sept 2009 
