A Little Test Theory

October 25, 2007

For the greater part of the twentieth century, measurement and assessment specialists employed a simple but surprisingly durable mathematical model to describe and analyze tests used in education, psychology, clinical practice and employment. Briefly, a person’s observed score (call it X) on a test designed to assess a given attribute is modeled as a linear, additive function of two more fundamental constructs: a “true” score, T say, and an error component, E:

X = T + E

It is called the Classical Test Theory Model, or the Classical True Score Model, but it is more a model about errors of measurement than true scores. Technically, the true score is defined as the mean score an individual would obtain on either (1) a very large number of “equivalent” or “parallel” tests, or (2) a very large number of administrations of the same test, assuming that each administration is a “new” experience. This definition is purely hypothetical but, as we will see below, it allows us to get a handle on the central concept of an “error” score. The true score is assumed to be stable over some reasonable interval of time. That is, we assume that such human attributes as vocabulary, reading comprehension, practical judgment and introversion are relatively stable traits that do not change dramatically from day to day or week to week. This does not mean that true scores do not change at all. Quite the contrary. A non-French speaking person’s true score on a test of basic French grammar would change dramatically after a year’s study of French.

By contrast, the error component (E) is assumed to be completely random, fluctuating up and down on each measurement occasion. In the hypothetically infinite number of administrations of a test, errors of measurement are assumed to arise from virtually every imaginable source: temporary lapses of attention; lucky guesses on multiple-choice tests; misreading a question; fortuitous (or unlucky) sampling of the domain, and so on. The theory assumes that in the long run positive errors and negative errors balance each other out. More precisely, the assumption is that errors of measurement are normally distributed around zero, so that the X scores themselves are normally distributed around individuals’ true scores.
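The idea of a true score as a long-run average can be made concrete with a small simulation. The following is a minimal sketch, not part of the classical formulation itself: the true score of 70, the error standard deviation of 5, and the function name simulate_observed_scores are all illustrative assumptions. It draws normally distributed errors with mean zero and shows that the average observed score approaches T as the number of administrations grows.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Return observed scores X = T + E for repeated administrations.

    Assumption: errors are independent draws from a normal distribution
    with mean 0 and standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd)
            for _ in range(n_administrations)]


T = 70.0         # hypothetical true score
ERROR_SD = 5.0   # hypothetical standard deviation of the error component

few = simulate_observed_scores(T, ERROR_SD, 5)
many = simulate_observed_scores(T, ERROR_SD, 100_000)

mean_few = statistics.fmean(few)
mean_many = statistics.fmean(many)
errors = [x - T for x in many]   # E = X - T for each administration

print(f"mean of 5 administrations:       {mean_few:.2f}")
print(f"mean of 100,000 administrations: {mean_many:.2f}")
print(f"mean error over 100,000:         {statistics.fmean(errors):.3f}")
```

With only a handful of administrations the average can sit noticeably above or below T; with a very large number, the positive and negative errors largely cancel and the average settles close to the true score.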

Two fundamental testing concepts are validity and reliability. They are the cornerstones of formal test theory. Validity has traditionally been defined as the extent to which a test measures what it purports to measure. So, for example, a test that claims to measure “quantitative reasoning ability” should measure this ability as “purely” as possible and should not be too contaminated with, say, verbal ability. What this means in practice is that the reading level required by the test should not be so high as to interfere with assessment of the intended construct.

The foregoing definition of validity implies that in a certain sense validity inheres in the test itself. But the modern view is that validity is not strictly a property of the test; a test does not “possess” validity. Rather, validity properly refers to the soundness and defensibility of the interpretations, inferences and uses of test results. It is the interpretations and uses of tests that are either valid or invalid. A test can be valid for one purpose and invalid for another. The use of the SAT-Math test to predict success in college mathematics may constitute a valid use of this test, but using the test to make inferences about the relative quality of high schools would be an invalid use.

Reliability refers to the “repeatability” and stability of the test scores themselves. Note clearly that, unlike validity, the concern here is with the behavior of the numbers themselves, not with their underlying meaning. Specifically, the score a person obtains on an assessment should not change the moment our back is turned. Suppose a group of individuals were administered the same test on two separate occasions. Let us assume that memory per se plays no part in performance on the second administration. (This would be the case, for example, if the test were a measure of proficiency in basic arithmetic operations such as addition and subtraction, manipulation of fractions, long division, and so on. It is unlikely that people would remember each problem and their answers to each problem.) If the test is reliable it should rank order the individuals in essentially the same way on both occasions. If one person obtains a score that places him in the 75th percentile of a given population one week and in the 25th percentile the next week, one would be rightly suspicious of the test’s reliability.

A major factor affecting test reliability is the length of the assessment. An assessment with ten items or exercises will, other things being equal, be less reliable than one with 20 items or exercises. To see why this is so, consider the following thought experiment. Suppose we arranged a golf match between a typical weekend golfer and the phenomenal Tiger Woods. The match (read “test”) will consist of a single, par-3 hole at a suitable golf course. Although unlikely, it is entirely conceivable that the weekend golfer could win this “one item” contest. He or she could get lucky and birdie the hole, or if they are really lucky, get a hole in one. Mr. Woods might well simply par the hole, as he has done countless times in his career. Now suppose that the match consisted not of one hole, but of an entire round of 18 holes. The odds against the weekend golfer winning this longer, more reliable match are enormous. Being lucky once or twice is entirely credible, but being lucky enough to beat Mr. Woods over the entire round taxes credulity. The longer the “test,” the more reliably it reflects the two golfers’ relative ability.
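The same point can be illustrated numerically. The sketch below rests on simplifying assumptions that are not part of the discussion above: each simulated person has a fixed probability of answering any item correctly (drawn uniformly between 0.3 and 0.9), items are independent, and reliability is estimated as the Pearson correlation between two administrations. The names pearson and retest_correlation are hypothetical helpers written for this example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def retest_correlation(n_items, n_people=2000, seed=1):
    """Correlate scores from two administrations of an n_items-long test."""
    rng = random.Random(seed)
    # Assumption: a person's chance of answering any item correctly
    # stands in for his or her standing on the attribute.
    abilities = [rng.uniform(0.3, 0.9) for _ in range(n_people)]

    def administer(p):
        # Number-correct score: each item is an independent success/failure.
        return sum(rng.random() < p for _ in range(n_items))

    first = [administer(p) for p in abilities]
    second = [administer(p) for p in abilities]
    return pearson(first, second)


r1 = retest_correlation(1)
r10 = retest_correlation(10)
r20 = retest_correlation(20)

for length, r in ((1, r1), (10, r10), (20, r20)):
    print(f"{length:2d} items: test-retest correlation = {r:.2f}")
```

Under these assumptions the one-item "test" ranks people almost haphazardly from one occasion to the next, while the 20-item test is noticeably more consistent than the 10-item test.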

Newcomers to testing theory often confuse validity and reliability; some even use the terms interchangeably. A brief, exaggerated example will illustrate the difference between these two essential testing concepts. We noted above that a reliable test rank orders individuals in essentially the same way on two separate administrations of the test. Now, let us suppose that one were to accept, foolishly, the length of a person’s right index finger as a measure of their vocabulary. To exclude the confounding effects of age, we will restrict our target population to persons 18 years of age and older. This is obviously a hopelessly invalid measure of the construct “vocabulary.” But note that were we to administer our “vocabulary” test on two separate occasions (that is, were we to measure the length of the index fingers of a suitable sample of adults on two separate occasions), the resulting two rank orderings would be virtually identical. We have a highly reliable but utterly invalid test.
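A short simulation makes the contrast visible. It is a sketch under invented conditions: finger lengths (mean 7.5 cm, standard deviation 0.5 cm), a small measurement error (0.05 cm), and vocabulary scores (mean 50, standard deviation 10) are all hypothetical, and vocabulary is assumed to be statistically independent of finger length among adults.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(7)
N_ADULTS = 500  # hypothetical sample size

# Hypothetical "true" finger lengths in centimetres.
finger = [rng.gauss(7.5, 0.5) for _ in range(N_ADULTS)]

# Two measurement occasions, each with a small random measuring error.
occasion_1 = [f + rng.gauss(0.0, 0.05) for f in finger]
occasion_2 = [f + rng.gauss(0.0, 0.05) for f in finger]

# Assumption: vocabulary is unrelated to finger length in adults.
vocabulary = [rng.gauss(50.0, 10.0) for _ in range(N_ADULTS)]

reliability = pearson(occasion_1, occasion_2)   # consistency across occasions
validity = pearson(occasion_1, vocabulary)      # relation to the intended construct

print(f"finger test, occasion 1 vs occasion 2: r = {reliability:.3f}")
print(f"finger test vs vocabulary:             r = {validity:.3f}")
```

The first correlation is close to 1 and the second hovers around 0: the measurements repeat almost perfectly, yet they say nothing about vocabulary.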

The numerical index of reliability is scaled from 0, the total absence of reliability, to 1, perfect reliability. What does zero reliability mean? Consider another thought experiment. Many people believe that there are individuals who are naturally luckier than the rest of us. Suppose we were to test this notion by attempting to rank order people according to their “coin tossing ability.” Our hypothesis is that when instructed to “toss heads,” some individuals can do so consistently more often than others. We instruct 50 randomly chosen individuals to toss a coin 100 times. They are to “attempt to toss as many heads as possible.” We record the number of heads tossed by each individual. The experiment is then repeated and the results are again recorded. It should come as no surprise that the correlation between the first rank order and the second would likely be near zero. The coin-tossing test has essentially zero reliability.
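The coin-tossing experiment is easy to simulate. The sketch below assumes fair coins and independent tosses, and it uses the Pearson correlation of the head counts rather than a rank-order correlation, which is a simplification. Because any single run with 50 people can drift some distance from zero by chance, the whole experiment is repeated 200 times and the correlations are averaged. The function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def coin_toss_experiment(rng, n_people=50, n_tosses=100):
    """Run the two-occasion coin-tossing 'test' once and return r."""
    def heads():
        # Assumption: every toss is fair, whatever the person intends.
        return sum(rng.random() < 0.5 for _ in range(n_tosses))

    first = [heads() for _ in range(n_people)]
    second = [heads() for _ in range(n_people)]
    return pearson(first, second)


rng = random.Random(42)
correlations = [coin_toss_experiment(rng) for _ in range(200)]
mean_r = sum(correlations) / len(correlations)

print(f"single run:            r = {correlations[0]:.2f}")
print(f"mean over 200 repeats: r = {mean_r:.3f}")
```

Individual runs scatter on either side of zero, and their average lies very close to it, which is what zero reliability looks like in practice.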

Perfect reliability, on the other hand, implies that both the rank orders and score distributions of a large sample of persons on two administrations of the same test, or on the administrations of equivalent tests, would be identical. The finger test of vocabulary discussed above comes very close to this ideal. In educational and psychological assessment, perfect reliability is hard to come by.