Item Response Theory

Item Response Theory (aka IRT) is also sometimes called latent trait theory. This is a modern test theory (as opposed to classical test theory). It is not the only modern test theory, but it is the most popular one and is currently an area of active research. IRT requires stronger assumptions than classical test theory (we will cover these in a moment). IRT is a much more intuitive approach to measurement once you get used to it. In IRT, the true score is defined on the latent trait of interest rather than on the test, as is the case in classical test theory. IRT is popular because it provides a theoretical justification for doing lots of things that classical test theory does not. Some applications where IRT is handy include:

• Item bias analysis--IRT provides a test of item equivalence across groups. We can test whether an item is behaving differently for blacks and whites or for males and females, for example. The same logic can be applied to translations of attitude scales into different languages. We can test whether the item means the same thing in English and French, for example.
• Equating--Sometimes we have scores on one test and we would like to know what the equivalent score would be on another test (e.g., versions or forms of the SAT). IRT provides a theoretical justification for equating scores from one test to another.
• Tailored Testing--IRT provides an estimate of the true score that is not based on the number of correct items. This frees us to give different people different test items but still place people on the same scale. One particularly exciting feature of tailored testing is the capability to give people test items that are matched (close in difficulty) to their ability. A tailored testing program for the SAT would give more difficult items to brighter test takers. This also has implications for test security -- different people get different tests.

Basics of IRT

Assumptions

1. A single common factor accounts for all item covariances. This common factor is the latent trait of interest. This is stated a couple of different ways in the literature.

a) unidimensionality -- there is a single latent trait

b) local independence--if you partial out the test common factor from any two items, their residual covariance is zero.

This assumption is never met precisely. It is obviously a problem when the test format includes several items that are related by a common problem. For example, several different items may be asked about the same story. Monte Carlo work and experience with IRT programs suggest that minor violations of this assumption don't make much difference. The programs appear to work well so long as there is a clear dominant first factor in the data.

2. Relations between the latent trait and observed response have a specific form. The line relating the trait and response is called an item characteristic curve or ICC for short (this is not the same ICC as the intraclass correlation coefficient). It is theoretically possible to have several different kinds of relations between the trait and observed response, and there is a history of test theories that correspond to different relations. I want to share a few of these with you.

Let's start by thinking about cognitive ability, say a vocabulary test. People vary in their knowledge of the meaning of words in English. A vocabulary test will be composed of a number of items. A person taking the test will either get an item right or wrong. The form of the relation between the trait (vocabulary) and an observed score can be represented as a graph. It is customary in IRT to talk about the latent trait (in our example, individual differences in vocabulary) as theta (θ). It is also customary to set the theta scale by considering the population mean equal to zero and the population standard deviation equal to one. Note that in the graph the center of the theta scale is zero (the mean vocabulary in the population) and the numbers go up and down from there. Thus, 1 corresponds to one standard deviation above the mean and -1 to one standard deviation below the mean.

For a vocabulary item, it stands to reason that the probability of getting the item right should increase as ability increases. The vertical axis is the probability of getting the item right. The item trace or ICC shows the expected increasing function. In this example the ICC is a linear function. Some early theories assumed linear functions between trait and response. There is a problem with this conception, however: if we move far enough up or down the ability (θ) scale, the probability of a correct response must fall below zero or rise above 1.0. The early response to this problem was to require that "nobody was home" at such places on the scale. Later, the more realistic approach was to change the assumed relation, that is, the ICC.
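The boundary problem with a linear ICC is easy to verify numerically. The sketch below uses a made-up linear ICC (the slope and intercept are illustrative, not from the text) and shows that the "probability" escapes the [0, 1] interval at the extremes of theta:

```python
# A hypothetical linear ICC: P(theta) = 0.5 + 0.25 * theta.
# The slope and intercept here are arbitrary illustrative values.
def linear_icc(theta):
    return 0.5 + 0.25 * theta

# Reasonable near the middle of the scale...
for theta in [-1, 0, 1]:
    print(theta, linear_icc(theta))

# ...but impossible at the extremes: the "probability" leaves [0, 1].
print(linear_icc(3))   # 1.25 -- above 1.0
print(linear_icc(-3))  # -0.25 -- below zero
```

This is exactly why the linear form was abandoned in favor of curves bounded between zero and one.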

The second historically important approach was to use a step function for the ICC: In this graph, the probability of getting the item right stays at zero until theta is about .75, at which point the probability shoots to 1.0. Suppose we measure height in people with a collection of sticks. If you think of an item as a single stick and we judge whether a person is taller than the stick by standing next to it, we might get an ICC that looks like the one above, where theta is analogous to height and P is the probability of being judged taller than the stick.

Another curve that was examined was the cumulative normal distribution. This also varies between zero and one, but does so more gradually than the step function. The normal is a very messy function. It has been replaced by the logistic function, which is fairly tractable to math-stat types (even if it looks really ugly to the rest of us). The next figure shows an example of a logistic ICC. The most general form of the logistic curve is called the 3 parameter logistic model. It is written as:

P(θ) = c + (1 - c) / (1 + e^(-a(θ - b))),

where P is the probability of a correct response, and theta (θ) is the standing on the underlying trait. The symbol e is the base of the natural logarithm (its value is about 2.718). The variables a, b, and c are the parameters of the curve. They vary from item to item, and they define the specific shape of the ICC.
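The 3-parameter logistic is simple to compute. The sketch below implements the formula above; the particular parameter values (a = 1.5, b = 0.5, c = 0.2) are chosen for illustration only:

```python
import math

def three_pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + e^(-a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Illustrative item: moderate discrimination (a = 1.5), slightly hard
# (b = 0.5), and a guessing floor of c = 0.2.
p = three_pl(0.5, a=1.5, b=0.5, c=0.2)
print(round(p, 3))  # 0.6 -- at theta = b the curve sits halfway between c and 1
```

Note what the example shows: when c is nonzero, the probability at theta = b is (1 + c) / 2 rather than .50, which is why the ".50 chance at b" rule discussed below requires c = 0.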

A first question to be addressed is whether the assumed curve is reasonable. For cognitive ability testing, such a relation often appears reasonable (see the figure below). Data from large samples are necessary to make the curve behave properly at the ends of the graph because samples are small at the extremes of the ability distribution.

The function of the parameters. Each of the parameters will be described in turn.

The most central of the parameters is the b parameter. The b parameter is called the difficulty parameter. It sets the location of the inflection point of the curve. The difficulty parameter sets the location of the curve on the horizontal axis; it shifts the curve from left to right as the item becomes more and more difficult. If the c parameter is equal to zero, then the inflection point on the curve will be at p = .50, that is, a person whose ability is b has a .50 chance of getting the item right. The location of b can be found by dropping a vertical line from the inflection point to the horizontal axis. The rightmost curve in the above figure has a b value of 1.0. The classical test theory statistic most closely related to b is P, the proportion correct. One of the really interesting things about IRT is that both people and items are located on the same scale. The scale is that of the underlying trait. The IRT model in which only the b parameter varies is called the Rasch model.

The a parameter is found by taking the slope of the line tangent to the ICC at b. The a parameter is the steepness of the curve at its steepest point. The a parameter is called the discrimination parameter. Its closest relative in classical test theory is the item total correlation. The steeper the curve, the more discriminating the item is, and the greater its item total correlation. As a limit, we could have a step function where below some level, the probability of getting the item right is zero, and just above that, the probability jumps to 1.0. Using a stick to measure height might come close. As the a parameter decreases, the curve gets flatter until there is virtually no change in probability across the ability continuum. Items with very low a values are pretty useless for distinguishing among people, just like items with very low item total correlations. The 2-parameter model allows both a and b parameters to vary to describe the items. This model is used to represent attitude scales and some ability tests where there is no guessing.

The c parameter is known as the guessing parameter. The c parameter is a lower asymptote. It is the low point of the curve as it moves toward negative infinity on the horizontal axis. You can think of c as the probability that a chicken would get the item right. The c parameter can be used to model guessing in multiple choice items. The 3 parameter model is usually used for representing cognitive tests.

The psychologist's measure of weight.

Suppose we measured weight with tippy iron rods. If we graphed the probability of a marble falling out of the cup against the weight of people, we might see curves like these. The thicker rods would be the curves to the right, that is, the more difficult items, because it takes a heavier person to knock out the marble. The a parameter would be related to the 'tippy-ness' of the rods. Rods particularly vulnerable to angle of attack or simply loosely attached to the base would have lower a values. It is likely that there would be some positive c value for the items because even the lightest person will sometimes knock the marble out of the cup. In the figure below, the flattest curve would be for the tippiest rod, and the rightmost curve would be for the thickest rod. The steep curve in the middle represents an excellent rod for lots of people because it is discriminating and located near the mean of the weight distribution.

Concepts from IRT that Move Beyond Classical Test Theory

Invariance of Parameters

According to IRT, the parameters of the ICC (that is, a, b, and c) are invariant across populations. In other words, if you pick different samples and estimate the ICCs, you should get the same values of a, b, and c, that is, you get the same ICC. This should happen because if you have part of the curve, you can recover the rest of it. That is, we should get the same expected values. This is shown in Figure 2.5.1 (Hulin, Drasgow, & Parsons, 1983). Now the precision of estimation is another matter; the more of the curve that is covered by the population, the more accurate the estimates will be. Populations in which there is little variance in the trait will not be useful in estimating item parameters. For example, in Figure 2.5.1, if the hired group were all above theta = 2, there would be very little of the curve exposed to estimate the rest of it.

Lord's Paradox. There is no invariance of parameter estimates in classical test theory. As you can see in Figure 2.5.2 below, for the not hired population, item 2 is easier than item 1. An eyeball estimate is that about 40 percent of the not hired get item 2 right, and about 20 percent get item 1 right. For the hired population, however, item 1 is easier than item 2. About 52 percent of the hired population get item 1 right, whereas about 50 percent get item 2 right. Lord's paradox is that one item can be both more and less difficult than another item at the same time. This cannot happen in IRT because of the invariance property. It is desirable to have invariant item properties because then decisions you make about them (what items to keep in a test, for example) should be the same no matter where you got your data, that is, any population of examinees will do.
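The paradox can be reproduced in a few lines. The sketch below (a hypothetical setup, not the items from Figure 2.5.2) defines two 2PL items with fixed, invariant parameters, then computes the classical proportion correct in a low-theta and a high-theta population. The ordering of the items by proportion correct reverses across groups even though a and b never change:

```python
import math, random

def p2pl(theta, a, b):
    """2PL probability of a correct response (no guessing, c = 0)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Two hypothetical items whose ICCs cross: their parameters are the
# same in every population (invariance).
item1 = dict(a=2.0, b=0.5)   # steep and relatively hard
item2 = dict(a=0.5, b=0.0)   # flat and easier on average

random.seed(0)
not_hired = [random.gauss(-1, 0.5) for _ in range(10_000)]  # low-theta group
hired     = [random.gauss(+1, 0.5) for _ in range(10_000)]  # high-theta group

def prop_correct(group, item):
    # Expected classical item difficulty: mean probability correct in the group.
    return sum(p2pl(t, **item) for t in group) / len(group)

# Item 2 is "easier" for the not hired, but item 1 is "easier" for the hired,
# even though the ICC parameters are identical across groups.
print(prop_correct(not_hired, item1), prop_correct(not_hired, item2))
print(prop_correct(hired, item1), prop_correct(hired, item2))
```

The classical statistic (proportion correct) depends on where the population sits on the trait; the ICC parameters do not.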

Information

In statistics, information means reduction in uncertainty. The analogous measure in classical test theory is reliability. Reliability concerns the amount of error in a test. There is one reliability for a test that indicates the amount of true and error variance in observed scores. In IRT, each item contributes some information by reducing our uncertainty about the examinee's standing on the trait. When we start testing a person, we have no idea what his or her standing will be. If you have been trained as a statistician, you would probably guess the person's standing to be at the mean of the distribution (i.e., θ = 0), and the confidence interval would be a function of the standard deviation of that distribution. You are likely to be right if you guess a person's weight to be within plus or minus two standard deviations of the population mean weight. After we administer each item, we will have a little more information about the person's standing on the trait of interest. We can give a point estimate of the standing (e.g., θ = .78) and place a new confidence interval about that estimate. At the end of the test we will have a final estimate of the person's standing and a final confidence interval around that point. In classical test theory, the final estimate is defined in terms of the total score on the test, and the confidence interval is a function of the standard error of measurement (i.e., SEM = SD × √(1 - reliability)). With IRT, each item contains information. The quality and location of that information depend upon the ICC. From the figure below, we can see that the amount of information depends upon the steepness of the ICC (that is, the a parameter), and the location of the information depends upon the difficulty or b parameter.

The ICCs are for two job satisfaction scale items, "satisfying" and "fascinating." The top figure shows the two ICCs. The bottom figure shows the corresponding information functions. A first point to notice is that both curves indicate that there are certain places on the continuum where the items give maximum information, and other places where the items basically provide no information. Notice that the curve for satisfying is much higher, indicating that it gives more information at its maximum point. However, the item fascinating provides more information at the high, or very satisfied, end of the satisfaction continuum. This makes sense because only the most satisfied people will describe their job as fascinating, so the item fascinating will only discriminate among the most satisfied people. Similarly, an item like "rotten" would only discriminate among those who were very dissatisfied with their jobs. According to IRT, the test information is simply the sum of each item's information. This means that we can create a test to have any kind of information we like by judiciously choosing our items. We can pick items to give maximum information right at our cutoff point, for example.

In classical test theory, there is one reliability for a test. In IRT, there is local reliability, that is, an amount of information at each point of the underlying continuum. The size of the confidence interval around the person's score will depend upon where that score is. Assume that a test is made up of the following items: most of the items are located (have b parameter values) near the center of the distribution of theta. This means that there will be much greater precision of information near the average theta than toward the extremes. If you consider measuring height with sticks as an analogy, this test would be composed of several sticks near the average height, one short stick, and one tall stick. Our test would order or differentiate people relatively well near the center of the distribution, but not at the extremes. For all of the tall people, there is really only one item in this test. They will be taller than the rest of the sticks. For them, this test puts them into two piles: tall and taller. Now if we loaded up the test with other sticks with b values between 2 and 3, we could do a decent job measuring tall people. In fact, most conventional tests look a lot like this one, where most of the items are located in the center with most of the people. This is an efficient design, but we don't get good (informative, reliable) measurements of people at the extremes with such tests. IRT really is superior to classical test theory with respect to the idea of local error, local reliability, and local information. The precision of all real tests varies across the value of the thing measured.

Applications

1. Test bias. If the ICCs for two populations are the same, the item is not biased. If the ICCs are different, the item is biased, that is, it is functioning or behaving differently across the groups.

2. Tailored testing. You begin the test by picking an item of average difficulty (b about 0). If the person gets it right, select a more difficult item. Keep making them more difficult until the person gets an item wrong. If the person gets the first item wrong, give them an easier item. Keep making the items easier until they get an item right. As soon as at least one item is right and at least one item is wrong, we can get a maximum likelihood estimate of the person's standing on the trait. As soon as we have a point estimate, we can compute a confidence interval, that is, a local standard error of measurement for the person. Now we will choose that item for the person that is expected to provide the maximum information for that person. After administering each item, we can compute their standing on the trait and their confidence interval. When the confidence interval is small enough, we stop testing. This means that each person is likely to get a different test but that the scores will be on the same scale and measured with approximately equal error.
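The adaptive loop described above can be sketched in a short simulation. This is a toy version under simplifying assumptions: 2PL items, a randomly generated (hypothetical) item pool, a grid-search maximum-likelihood estimate of theta, and a stopping rule based on the local standard error SE = 1/√(information):

```python
import math, random

def p2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    p = p2pl(theta, a, b)
    return a * a * p * (1 - p)

def mle_theta(responses):
    """Grid-search maximum-likelihood estimate of theta from (item, u) pairs."""
    grid = [g / 100 for g in range(-300, 301)]
    def loglik(t):
        ll = 0.0
        for (a, b), u in responses:
            p = p2pl(t, a, b)
            ll += math.log(p) if u else math.log(1 - p)
        return ll
    return max(grid, key=loglik)

random.seed(1)
true_theta = 0.8                       # the simulated examinee's true standing
# Hypothetical pool of 50 items with assorted discriminations and difficulties.
pool = [(random.uniform(0.8, 2.0), random.uniform(-2, 2)) for _ in range(50)]
answered, theta_hat, se = [], 0.0, float("inf")

for _ in range(20):                    # administer at most 20 items
    used = [it for it, _ in answered]
    # Choose the unused item with maximum information at the current estimate.
    item = max((it for it in pool if it not in used),
               key=lambda it: info(theta_hat, *it))
    u = random.random() < p2pl(true_theta, *item)   # simulate the response
    answered.append((item, u))
    # MLE exists once there is at least one right and one wrong answer.
    if any(r for _, r in answered) and any(not r for _, r in answered):
        theta_hat = mle_theta(answered)
        se = 1 / math.sqrt(sum(info(theta_hat, *it) for it, _ in answered))
        if se < 0.35:                  # stop once the estimate is precise enough
            break

print(len(answered), round(theta_hat, 2), round(se, 2))
```

Different simulated examinees would see different items, but each ends up with a theta estimate on the same scale and with roughly the same target standard error, which is the point of tailored testing.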

3. Practical notes. IRT is an extraordinary theoretical advance. It provides theoretical justification for lots of practical activities. On the other hand, it doesn't let us do anything we weren't already doing; it just gives us a justification for going ahead and doing it. IRT typically requires large sample sizes and lots of computational power. It can also be difficult to publish IRT analyses because accepted procedures are constantly changing.