Likert-Type Scales, Statistical Methods, and Effect Sizes

The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts.

Follow us on Twitter at http://twitter.com/blogCACM

http://cacm.acm.org/blogs/blog-cacm
Judy Robertson writes about researchers' use of the wrong statistical techniques to analyze attitude questionnaires.

By Judy Robertson

Posted May 1 2012

Judy Robertson "Stats: We're Doing It Wrong"
http://cacm.acm.org/blogs/blog-cacm/107125
April 4, 2011

It is quite common for HCI or computer science education researchers to use attitude questionnaires to examine people’s opinions of new software or teaching interventions. These are often on a Likert-type scale of "strongly agree" to "strongly disagree." And the sad truth is that researchers typically use the wrong statistical techniques to analyze them. Kaptein, Nass, and Markopoulos³ published a paper in CHI last year that found that in the previous year’s CHI proceedings, 45% of the papers reported on Likert-type data, but only 8% used nonparametric stats to do the analysis. Ninety-five percent reported on small sample sizes (under 50 people). This is statistically problematic even if it gets past reviewers! Here’s why.

Likert-type scales give ordinal data. That is, the data is ranked: "strongly agree" is usually better than "agree." However, it is not interval data. You cannot say the distance between "strongly agree" and "agree" is the same as the distance between "neutral" and "disagree," for example. People tend to think there is a bigger difference between items at the extremes of the scale than in the middle (there is some evidence cited in Kaptein et al.’s paper that this is the case). For ordinal data, one should use nonparametric statistical tests (so 92% of the CHI papers got that wrong!), which do not assume a normal distribution of the data. Furthermore, because of this it makes no sense to report means of Likert-scale data; you should report the mode (the entry that occurs most frequently in the dataset).
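As a minimal sketch of reporting the mode rather than the mean, the snippet below tabulates a set of Likert-type responses and picks out the most frequent answer. The response data, the five scale labels, and the `summarize` helper are all invented for illustration; only the Python standard library is used.

```python
from collections import Counter
from statistics import multimode

# Hypothetical 5-point scale, ordered from least to most favorable.
SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical responses from ten participants to a single questionnaire item.
responses = [
    "agree", "strongly agree", "agree", "neutral", "agree",
    "disagree", "strongly agree", "agree", "neutral", "agree",
]


def summarize(responses, scale=SCALE):
    """Return a frequency table in scale order and the list of modal answers."""
    counts = Counter(responses)
    # Keep every scale point in the table, even those nobody chose.
    frequencies = {label: counts.get(label, 0) for label in scale}
    # multimode returns every answer tied for most frequent.
    modes = multimode(responses)
    return frequencies, modes


frequencies, modes = summarize(responses)
for label, count in frequencies.items():
    print(f"{label:>17}: {count}")
print("mode:", ", ".join(modes))
```

Printing the full frequency table alongside the mode keeps the ordering of the scale visible without pretending the gaps between answers are equal.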

Which classic nonparametric tests should you use? I strongly recommend the flow chart on p. 274 of How to Design and Report Experiments by Field and Hole. This textbook is also pretty good for explaining how to do the tests in SPSS and how to report the results. It also mentions how to calculate effect sizes (see later).

Why is it so common to use parametric tests such as the t-test or ANOVA instead of their nonparametric counterparts? Kaptein, Nass, and Markopoulos³ suggest it is because HCI researchers know that nonparametric tests lack power. This means they are worried the nonparametric tests will fail to find an effect where one exists. They also suggest it is because there aren’t handy nonparametric tests that let you do analysis of factorial designs. So what’s a researcher to do?

Robust Modern Statistical Methods

It turns out that statisticians have been busy in the last 40 years inventing improved tests that are not vulnerable to the various problems that trip up classic parametric tests on real-world data, and which are also at least as powerful as classic parametric tests (Erceg-Hurn and Mirosevich¹). Why this is not mentioned in psychology textbooks is not clear to me. It must be quite annoying for statisticians to have their research ignored! A catch about modern robust statistical methods is that you cannot use SPSS to do them. You have to start messing around with extra packages in R or SAS, which are slightly more frightening than SPSS, which itself is not a model of usability. Erceg-Hurn and Mirosevich¹ and Kaptein, Nass, and Markopoulos³ both describe ANOVA-type statistics, which are powerful, usable in factorial designs, and work for nonparametric data.

A lot of interval data from behavioral research, such as reaction times, does not have a normal distribution or is heteroscedastic (groups have unequal variance), and so should not be analyzed with classic parametric tests either. To make matters worse, the tests that people typically use to check the normality or heteroscedasticity of data are not reliable when both problems are present. So, basically, you should always run modern robust tests in preference to the classic ones. I have come to the sad conclusion that I am going to have to learn R. However, at least it is free and a package called nparLD does ANOVA-type statistics. Kaptein et al.’s paper gives an example of such analysis, which I am currently practicing with.

Effect Sizes

You might think this is the end of the statistical jiggery pokery required to publish some seemingly simple results correctly. Uh-uh, it gets more complicated. The APA style guidelines require authors to publish effect size as well as significance results. What is the difference? Significance testing checks to see if differences in the means could have occurred by chance alone. Effect size tells you how big the difference was between the groups. Randolph, Julnes, Sutinen, and Lehman⁴, in what amounts to a giant complaint about the reporting practices of researchers in computer science education, pointed out that the way stats are reported by computer science education folk does not contain enough information, and missing effect sizes is one problem. Apparently it is not just us: Paul Ellis reports similar results with psychologists in The Essential Guide to Effect Sizes.

Ellis also comments that there is a viewpoint that not reporting effect size is tantamount to withholding evidence. Yikes! Robert Coe has a useful article, "It’s the Effect Size, Stupid," on what effect size is, why it matters, and which measures one can use. Researchers often use Cohen’s d or the correlation coefficient r as a measure of effect size. For Cohen’s d, there is even a handy way of saying whether the effect size is small, medium, or big. Unfortunately, if you have nonparametric data, effect size reporting seems to get trickier, and Cohen’s way of interpreting the size of effect no longer makes sense (indeed, some people question whether it makes sense at all). Also, it is difficult for nonexperts to understand.

Common language effect sizes or probability of superiority statistics can solve this problem (Grissom²). It is "the probability that a randomly sampled member of a population given one treatment will have a score (y1) that is higher on the dependent variable than that of a randomly sampled member of a population given another treatment (y2)" (Grissom²). An example from Robert Coe: Consider a common language effect size of 0.92 in a comparison of heights of males and females. In other words, "in 92 out of 100 blind dates among young adults, the male will be taller than the female."

If you have Likert-type data with an independent design and you want to report an effect size, it is quite easy. SPSS won’t do it for you, but you can do it with Excel: PS = U/(mn), where U is the Mann-Whitney U result, m is the number of people in condition 1, and n is the number of people in condition 2 (Grissom²). If you have a repeated measures design, refer to Grissom and Kim’s Effect Sizes for Research (2006, p. 115): PSdep = w/n, where n is the number of participants and w refers to "wins," the cases where the score was higher in the second measure than in the first. Grissom² has a handy table for converting between probability of superiority and Cohen’s d, as well as a way of interpreting the size of the effect.
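The two formulas above can also be sketched in a few lines of Python instead of Excel. This is only an illustration under stated assumptions: the scores are invented, the function names are hypothetical, U is computed by directly counting pairs (with ties counted as half a win, a common convention for U that is assumed here rather than taken from Grissom), and ties in the repeated measures case are simply not counted as wins.

```python
def mann_whitney_u(group_a, group_b):
    """U for group_a: number of (a, b) pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5  # assumption: a tie is half a win
    return u


def probability_of_superiority(group_a, group_b):
    """PS = U / (m * n) for an independent design."""
    u = mann_whitney_u(group_a, group_b)
    return u / (len(group_a) * len(group_b))


def ps_dependent(first, second):
    """PSdep = w / n, where w counts participants scoring higher the second time."""
    if len(first) != len(second):
        raise ValueError("repeated measures need one pair of scores per participant")
    wins = sum(1 for x, y in zip(first, second) if y > x)  # ties are not wins
    return wins / len(first)


# Hypothetical Likert responses coded 1 (strongly disagree) to 5 (strongly agree).
new_tool = [4, 5, 4, 3, 5]
old_tool = [3, 2, 4, 3, 1]

print("U  =", mann_whitney_u(new_tool, old_tool))
print("PS =", probability_of_superiority(new_tool, old_tool))
print("PSdep =", ps_dependent(old_tool, new_tool))
```

One thing to check when taking U from a statistics package instead: some packages report the smaller of the two possible U values, so make sure the U you divide by mn is the one that counts wins for the group you are describing.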

I am not a stats expert in any way. This is just my current understanding of the topic from recent reading, although I have one or two remaining questions. If you want to read more, you could consult a forthcoming paper by Maurits Kaptein and myself in this year’s CHI conference (Kaptein and Robertson⁵). I welcome any corrections from stats geniuses! I hope it is useful but I suspect colleagues will hate me for bringing it up. I hate myself for reading any of this in the first place. It is much easier to do things incorrectly.

Reader’s comment

By replacing ANOVA by nonparametric or robust statistics we risk ending up in another local maximum. Robust statistics are just another way to squeeze your data into a shape appropriate for the infamous "minimizing sum of squares" statistics. Those had their rise in the 20th century because they were computable by pure brainpower (or the ridiculously slow computers in those times).

If only HCI researchers and psychologists would learn their tools and acknowledge the progress achieved in econometrics and biostatistics. For example, linear regression and model selection strategies are there to replace one-by-one null hypothesis testing with subsequent adjustment of the alpha level. With maximum likelihood estimation, a person no longer needs to worry about Gaussian error terms. Just use Poisson regression for counts and logistic regression for binary outcomes. The latter can also model Likert-type scales appropriately with meaningful parameters and in multifactorial designs.

Once you start discovering this world of modern regression techniques, you start seeing more in your data than just a number of means and their differences. You start seeing its shape and begin reasoning about the underlying processes. This can truly be a source of inspiration.

—Martin Schmettow

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2160718.2160721

May 2012 Issue

Published: May 1, 2012

Vol. 55 No. 5

Pages: 6-7
