guessing and the reliability of multiple choice tests

Off and on over the past year, I’ve been exploring the literature on multiple-choice tests. Starting with Brigham Young University’s handbook How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty (1991), I found myself deep in scholarly publications by many authors, but especially Haladyna, Downing, and Rodriguez, alone or in some combination (1989a, 1989b, 2002, 2005, 2013).

After putting together several question banks, I think I’ve become fairly adept at internalizing and applying best practices (and at noticing errors). At least I’d like to think so, though it can still take me hours to put together a quality set of questions. And full disclosure: they are as yet untouched by any sort of quantitative analysis.

As I embark on a couple of educational research projects that mention or rely on multiple-choice tests as an assessment instrument, an FAQ format seems like a useful way of parsing the literature and retrieving readings. Writing a paper might, after all, be thought of as answering a series of questions. Below are two questions I had at the beginning.

In making this public, I should also disclose that some part of me wants to debunk common myths about this format. There are legitimate reasons it gets a bad rap, especially in discussions of standardized testing, but I think it still has a place in higher education, as long as it’s not the sole or even primary assessment instrument.

Are three alternatives per item really the ideal?

According to a review of multiple-choice item-writing guidelines and research by Haladyna et al. (2002), there is no consensus on the number of alternatives (also known as answer options) an item should have. Four to five options appear to be the standard on major instruments, but some studies support three-option items or show no difference between three-option and four-option assessments (Tarrant & Ware, 2012). Rodriguez (2005) in particular argued that three options are “optimal,” based on a meta-analysis spanning 80 years of research.

Assuming one follows best writing practices, I lean toward three options, since writing more than two plausible distractors can be a difficult task. However, writing many different questions is also difficult, so I might opt to add answer options to an existing set of questions, especially if I’m trying to improve the reliability of the test.

Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3-13.

Tarrant, M., & Ware, J. (2012). A framework for improving the quality of multiple-choice assessments. Nurse Educator, 37(3).

But with only three or four alternatives, I have a 33% or 25% chance of getting a question right if I just guess. How reliable are these tests, really?

Reliability refers to how consistently a test produces the same result. Put another way: would a student be able to repeat his or her performance on a test (or an equivalent test) that he or she has already taken? Or is each instance a game of chance once random measurement errors such as guessing, ambiguous questions, and careless scoring are taken into account?
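To put that a bit more formally (a classical test theory sketch of my own, not something spelled out in the readings below): an observed score can be thought of as a true score plus random error, and reliability is the share of observed-score variance that isn’t error, assuming the error is uncorrelated with the true score.

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
           = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```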

Well, consider the following: according to BYU’s aforementioned handbook (Burton et al., 1991, p. 6), if someone is blindly guessing on a ten-question test because they don’t care or waited until the last minute, there is only a 1 in 285 chance of scoring 70% or higher, assuming four answer alternatives per question. At 20 questions, the odds lengthen to 1 in 33,885. Clearly, this is not the best strategy for students.
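Those odds are just the upper tail of a binomial distribution, so they’re easy to check. Here’s a quick sketch (the function name and the 70% passing threshold are my own choices) that reproduces the handbook’s figures:

```python
from math import ceil, comb

def odds_of_passing_by_guessing(n_questions, n_options=4, pass_fraction=0.7):
    """Probability of scoring at least pass_fraction by blind guessing,
    assuming each question is answered independently with a 1/n_options
    chance of being correct (the upper tail of a binomial distribution)."""
    p = 1.0 / n_options
    needed = ceil(pass_fraction * n_questions)  # minimum number correct to pass
    return sum(
        comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(needed, n_questions + 1)
    )

for n in (10, 20):
    prob = odds_of_passing_by_guessing(n)
    print(f"{n} questions: about 1 in {round(1 / prob):,}")
# 10 questions: about 1 in 285
# 20 questions: about 1 in 33,885
```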

There are actually several measures of reliability (more sophisticated than the example above) and strong evidence that multiple-choice tests are accurate assessment tools, but there isn’t space to cover them fully here. However, one common qualitative argument I want to mention is the comparison of a multiple-choice test with an essay test. The latter may have inconsistent graders with subjective biases, whereas a multiple-choice test is immune to many of the factors that have been shown to affect essay results and therefore should more accurately capture students’ learning outcomes.
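For the curious, here is a minimal sketch of what one such measure looks like in code: KR-20, a common internal-consistency index for tests scored right/wrong. The formula is standard, but the toy response matrix and the population-variance convention are my own choices for illustration, not taken from the sources below.

```python
def kr20(scores):
    """Kuder-Richardson 20 (KR-20) internal-consistency estimate.

    scores is a list of rows, one per student, each a list of 0/1 item scores.
    Uses population variance for the total-score variance."""
    n_students = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum of item variances p*(1-p), where p is the proportion answering item j correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p_correct = sum(row[j] for row in scores) / n_students
        pq_sum += p_correct * (1 - p_correct)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

# Hypothetical toy data: 4 students by 5 items, scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
print(kr20(responses))  # about 0.81 for this toy matrix
```

In practice you’d compute this over a whole class’s responses rather than four toy rows, but even this small example shows what goes into the number: item difficulty and the spread of total scores.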

Burton, S. J., Sudweeks, R. R., Merrill, P. F., & Wood, B. (1991). How to prepare better multiple-choice test items: Guidelines for university faculty. Brigham Young University Testing Services and The Department of Instructional Science. Available: http://testing.byu.edu/info/handbooks/betteritems.pdf

Considine, J., Botti, M., & Thomas, S. (2005). Design, format, validity and reliability of multiple choice questions for use in nursing research and education. Collegian, 12(1), 19-24.

Wells, C. S., & Wollack, J. A. (2003). An instructor’s guide to understanding test reliability. Testing & Evaluation Services, University of Wisconsin. Available: http://testing.wisc.edu/Reliability.pdf
