**Item Response Theory**

Item Response Theory (aka IRT) is also sometimes called latent trait theory.
This is a modern test theory (as opposed to classical test theory). It is not
the only modern test theory, but it is the most popular one and is currently an
area of active research. IRT requires stronger assumptions than classical test
theory (we will cover these in a moment). IRT is a much more intuitive approach to
measurement once you get used to it. In IRT, the true score is defined on the
latent trait of interest rather than on the test, as is the case in classical
test theory. IRT is popular because it provides a **theoretical justification**
for doing lots of things that classical test theory does not. Some applications
where IRT is handy include:

- Item bias analysis--IRT provides a test of item equivalence across groups. We can test whether an item is behaving differently for blacks and whites or for males and females, for example. The same logic can be applied to translations of attitude scales into different languages. We can test whether the item means the same thing in English and French, for example.
- Equating--Sometimes we have scores on one test and we would like to know what the equivalent score would be on another test (e.g., versions or forms of the SAT). IRT provides a theoretical justification for equating scores from one test to another.
- Tailored Testing--IRT provides an estimate of the true score that is not based on the number of correct items. This frees us to give different people different test items but still place people on the same scale. One particularly exciting feature of tailored testing is the capability to give people test items that are matched (close) to their ability level. A tailored testing program for the SAT will give more difficult items to brighter test takers. This also has implications for test security -- different people get different tests.

**Basics of IRT**

**Assumptions**

1. A single common factor accounts for all item covariances. This common factor is the latent trait of interest. This is stated a couple of different ways in the literature.

a) unidimensionality -- there is a single latent trait

b) local independence--if you partial out the test common factor from any two items, their residual covariance is zero.

This assumption is never met precisely. It is obviously a problem when the test format includes several items that are related by a common problem. For example, several different items may be asked about the same story. Monte Carlo work and experience with IRT programs suggest that minor violations of this assumption don't make much difference. The programs appear to work well so long as there is a clear dominant first factor in the data.

2. Relations between the latent trait and observed response have a specific form. The line relating the trait and response is called an item characteristic curve or ICC for short (this is not the same ICC as the intraclass correlation coefficient). It is theoretically possible to have several different kinds of relations between the trait and observed response, and there is a history of test theories that correspond to different relations. I want to share a few of these with you.

Let's start by thinking about cognitive ability, say a vocabulary test. People vary in their knowledge of the meaning of words in English. A vocabulary test will be composed of a number of items. A person taking the test will either get an item right or wrong. The form of the relation between the trait (vocabulary) and an observed score can be represented as a graph:

It is customary in IRT to talk about the latent trait (in our example,
individual differences in vocabulary) as theta (θ). It is also customary to set
the theta scale by setting the population mean equal to zero and the population
standard deviation to one. Note that in the graph the center of the theta scale
is zero (the mean vocabulary in the population) and the numbers go up and down
from there. Thus, 1 corresponds to one standard deviation above the mean and -1
to one standard deviation below the mean. For a vocabulary item, it stands to
reason that the probability of getting the item right should increase as
ability increases. The vertical axis is the probability of getting the item
right. The item trace or ICC shows the expected increasing function. In this
example the ICC is a linear function. Some early theories assumed linear
functions between trait and response. There is a problem with this conception,
however: if we move far enough up or down the ability (θ) scale, the
probability of a correct response must fall below zero or rise above 1.0. The
early response to this problem was to require that "nobody was home" at such
places on the scale. Later, the more realistic approach was to change the
assumed relation or ICC.

The second historically important approach was to use a step function
for the ICC:

[Figure: a step-function ICC]

In this graph, the probability of getting the item right stays at zero
until theta is about .75, at which point the probability shoots to 1.0. Suppose
we measure height in people with a collection of sticks. If you think of an
item as a single stick and we judge whether a person is taller than the stick
by standing next to it, we might get an ICC that looks like the one above,
where theta is analogous to height and P is the probability of being judged
taller than the stick.

Another curve that was examined was the cumulative normal distribution.
This also varies between zero and one, but does so more gradually than the step
function. The normal is a very messy function. It has been replaced by the
logistic function, which is fairly tractable to math-stat types (even if it
looks really ugly to the rest of us). The next figure shows an example of a
logistic ICC.

The most general form of the logistic curve is called the 3-parameter
logistic model. It is written as:

P(θ) = c + (1 - c) / (1 + e^(-a(θ - b))),

where P is the probability of a correct response, and theta (θ) is the standing
on the underlying trait. The symbol e is the base of the natural logarithm (its
value is about 2.718). The variables a, b, and c are the parameters of the
curve. They vary from item to item, and they define the specific shape of the ICC.
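As a concrete sketch, the 3-parameter logistic curve can be computed directly. This is a minimal Python illustration; the particular a, b, and c values are hypothetical ones chosen for the example.

```python
import math

def p_correct(theta, a, b, c):
    """3PL item characteristic curve: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A hypothetical item: moderate discrimination (a), average difficulty (b),
# and a guessing floor (c) of .20.
print(p_correct(0.0, a=1.0, b=0.0, c=0.2))  # at theta = b this is c + (1 - c)/2 = 0.6
```

Note that the curve runs from c (far below b) up to 1.0 (far above b), never dipping below zero the way the early linear ICCs did.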

A first question to be addressed is whether the assumed curve is
reasonable. For cognitive ability testing, such a relation often appears
reasonable (see the figure below). Data from large samples are necessary to
make the curve behave properly at the ends of the graph because samples are
small at the extremes of the ability distribution.

**The function of the parameters.** Each of the parameters will be
described in turn.

The most central of the parameters is the b parameter. The b
parameter is called the difficulty parameter. It sets the location of the
inflection point of the curve.

[Figure: logistic ICCs with different b (difficulty) values]

The difficulty parameter sets the location of the curve on the
horizontal axis; it shifts the curve from left to right as the item becomes
more and more difficult. If the c parameter is equal to zero, then the
inflection point on the curve will be at p = .50, that is, a person whose
ability is b has a .50 chance of getting the item right. The location of
b can be found by dropping a vertical line from the inflection point to
the horizontal axis. The rightmost curve in the above figure has a b
value of 1.0. The classical test theory statistic most closely related to b
is P, the proportion correct. One of the really interesting things about IRT
is that both people and items are located on the same scale. The scale is that
of the underlying trait. The IRT model in which only the b parameter varies is
called the Rasch model.
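The role of b can be checked numerically. In this small Python sketch (parameter values are arbitrary), c is set to zero, so a person whose ability equals b always has a .50 chance, no matter where b shifts the curve.

```python
import math

def p_2pl(theta, a, b):
    """Logistic ICC with no guessing (c = 0)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Shifting b moves the curve left or right, but at theta = b the
# probability is always .50:
for b in (-1.0, 0.0, 1.0):
    print(b, p_2pl(b, a=1.0, b=b))  # 0.5 each time
```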

The a parameter is found by taking the slope of the line tangent
to the ICC at b. The a parameter is the steepness of the curve at
its steepest point. The a parameter is called the discrimination
parameter. Its closest relative in classical test theory is the item total
correlation.

[Figure: logistic ICCs with different a (discrimination) values]

The steeper the curve, the more discriminating the item is, and the
greater its item total correlation. As a limit, we could have a step function
where below some level, the probability of getting the item right is zero, and
just above that, the probability jumps to 1.0. Using a stick to measure height
might come close. As the a parameter decreases, the curve gets flatter until
there is virtually no change in probability across the ability continuum. Items
with very low a values are pretty useless for distinguishing among
people, just like items with very low item total correlations. The 2-parameter
model allows both a and b parameters to vary to describe the
items. This model is used to represent attitude scales and some ability tests
where there is no guessing.
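For the logistic ICC with c = 0, the tangent slope at the inflection point works out to a/4 (the derivative is a·P·(1 − P), which is a/4 at P = .50). A numerical derivative confirms this; the a and b values here are arbitrary:

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Central-difference estimate of the slope at the inflection point theta = b:
a, b, h = 1.5, 0.0, 1e-5
slope = (p_2pl(b + h, a, b) - p_2pl(b - h, a, b)) / (2 * h)
print(slope)  # steepest slope is a/4 = 0.375 here
```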

The c parameter is known as the guessing parameter. The c
parameter is a lower asymptote. It is the low point of the curve as it moves
toward negative infinity on the horizontal axis. You can think of c as the
probability that a chicken would get the item right.

[Figure: a logistic ICC with a nonzero lower asymptote c]

The c parameter can be used to model guessing in multiple choice items.
The 3-parameter model is usually used for representing cognitive tests.
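A quick numerical check that c really is the lower asymptote. The values are hypothetical; c = .25 would suit a four-option multiple-choice item, where even a chicken picks the right answer a quarter of the time.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Far down the theta scale the curve flattens out at c, not at zero:
print(round(p_3pl(-8.0, 1.0, 0.0, 0.25), 3))  # 0.25
```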

**The psychologist's measure of weight.**

Suppose we measured weight with tippy iron rods. If we graphed the
probability of a marble falling out of the cup against the weight of people, we
might see curves like these. The thicker rods would be the curves to the right,
that is, the more difficult items, because it takes a heavier person to knock
out the marble. The a parameter would be related to the 'tippy-ness' of
the rods. Rods particularly vulnerable to angle of attack or simply loosely
attached to the base would have lower a values. It is likely that there
would be some positive c value for the items because even the lightest person
will sometimes knock the marble out of the cup. In the figure below, the
flattest curve would be for the tippiest rod, and the rightmost curve would be
for the thickest rod. The steep curve in the middle represents an excellent
rod for lots of people because it is discriminating and located near the mean
of the weight distribution.

[Figure: ICCs for the tippy-rod weight items]

**Concepts from IRT that Move Beyond Classical Test Theory**

**Invariance of Parameters**

According to IRT, the parameters of the ICC (that is, a, b, and c) are
invariant across populations. In other words, if you pick different samples and
estimate the ICCs, you should get the same values of a, b, and c, that is, you
get the same ICC. This should happen because if you have part of the curve, you
can recover the rest of it. That is, we should get the same expected values.
This is shown in Figure 2.5.1 (Hulin, Drasgow, & Parsons, 1983). Now the
precision of estimation is another matter; the more of the curve that is
covered by the population, the more accurate the estimates will be. Populations
in which there is little variance in the trait will not be useful in estimating
item parameters. For example, in Figure 2.5.1, if the hired group were all
above theta = 2, there would be very little of the curve exposed to estimate
the rest of it.

**Lord's Paradox.** There is no invariance of parameter
estimates in classical test theory. As you can see in Figure 2.5.2 below,
for the not-hired population, item 2 is easier than item 1. An eyeball estimate
is that about 40 percent of the not hired get item 2 right, and about 20 percent
get item 1 right. For the hired population, however, item 1 is easier than item
2. About 52 percent of the hired population get item 1 right, whereas about 50
percent get item 2 right. Lord's paradox is that one item can be more and less
difficult than another item at the same time. This cannot happen in IRT because
of the invariance property. It is desirable to have invariant item properties
because then decisions you make about them (what items to keep in a test, for
example) should be the same no matter where you got your data, that is, any
population of examinees will do.

[Figure 2.5.2: proportions correct on items 1 and 2 for the hired and not-hired groups]

**Information**

In statistics, information means reduction in uncertainty.
The analogous measure in classical test theory is reliability. Reliability
concerns the amount of error in a test. There is one reliability for a test
that indicates the amount of true and error variance in observed scores. In
IRT, each item contributes some information by reducing our uncertainty about
the examinee's standing on the trait. When we start testing a person, we have
no idea what his or her standing will be. If you have been trained as a
statistician, you would probably guess the person's standing to be at the mean
of the distribution (i.e., theta = 0) and the confidence interval would be a
function of the standard deviation of that distribution. You are likely to be
right if you guess a person's weight to be within plus or minus two standard
deviations of the population weight. Now after we administer each item, we will
have a little more information about the person's standing on the trait of
interest. We can give a point estimate of the standing (e.g., theta = .78) and
place a new confidence interval about that estimate. At the end of the test
we will have a final estimate of the person's standing and a final confidence
interval around that point. In classical test theory, the final estimate is
defined in terms of the total score on the test, and the confidence interval is
a function of the standard error of measurement (i.e., SEM = s√(1 - r_xx)).
With IRT, each item contains information. The quality and location of that
information depend upon the ICC. From the figure below, we can see that the
amount of information depends upon the steepness of the ICC (that is, the a
parameter) and the location of the information depends upon the difficulty or
b parameter.
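These two dependencies can be made concrete with Birnbaum's standard 3PL item information formula, I(θ) = a²·(q/p)·((p − c)/(1 − c))², which with c = 0 reduces to a²·p·q. A Python sketch with arbitrary item values:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    """Fisher information of a 3PL item (Birnbaum's formula).
    With c = 0 this reduces to a**2 * p * (1 - p)."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

# A steep (high-a) item peaks at theta = b and falls off quickly on each side:
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(item_info(theta, a=2.0, b=0.0, c=0.0), 3))  # 0.071, 1.0, 0.071
```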

The ICCs are for two job satisfaction scale items,
"satisfying" and "fascinating." The top figure shows the
two ICCs. The bottom figure shows the corresponding information functions. A
first point to notice is that both curves indicate that there are certain
places on the continuum where the items give maximum information, and other
places where the items basically provide no information. Notice that the curve
for satisfying is much higher, indicating that it gives more information at its
maximum point. However, the item fascinating provides more information at the
high, or very satisfied, end of the satisfaction continuum. This makes sense
because only the most satisfied people will describe their job as fascinating,
so the item fascinating will only discriminate among the most satisfied people.
Similarly, an item like "rotten" would only discriminate among those
who were very dissatisfied with their jobs.

[Figure: ICCs (top) and information functions (bottom) for "satisfying" and "fascinating"]

According to IRT, the test information is simply the sum of each item's
information. This means that we can create a test to have any kind of
information we like by judiciously choosing our items. We can pick items to
give maximum information right at our cutoff point, for example.

In classical test theory, there is one reliability for a test. In IRT,
there is local reliability, that is, an amount of information at each point of
the underlying continuum. The size of the confidence interval around the
person's score will depend upon where that score is located. Assume that a
test is made up of the following items:

[Table: item difficulties for the example test--most b values near 0, one very low and one very high]

Most of the items are located (have b parameter values) near the
center of the distribution of theta. This means that there will be much greater
precision of information near the average theta than toward the extremes. If
you consider measuring height with sticks as an analogy, this test would be
composed of several sticks near the average height, one short stick and one
tall stick. Our test would order or differentiate people relatively well near
the center of the distribution, but not at the extremes. For all of the tall
people, there is really only one item in this test. They will be taller than
the rest of the sticks. For them, this test puts them into two piles: tall and
taller. Now if we loaded up the test with other sticks with b values
between 2 and 3, we could do a decent job measuring tall people. In fact,
most conventional tests look a lot like this one, where most of the items are
located in the center with most of the people. This is an efficient design, but
we don't get good (informative, reliable) measurements of people at the
extremes with such tests. IRT really is superior to classical test theory with
respect to the idea of local error, local reliability, and local information.
The precision of all real tests varies across the value of the thing measured.

**Applications**

**1. Test bias.** If the ICCs for two populations are the
same, the item is not biased. If the ICCs are different, the item is biased,
that is, it is functioning or behaving differently across the groups.

**2. Tailored testing.** You begin the test by picking an
item of average difficulty (b about 0). If the person gets it right,
select a more difficult item. Keep making them more difficult until the person
gets an item wrong. If the person gets the first item wrong, give them an
easier item. Keep making the items easier until they get an item right. As soon
as at least one item is right and at least one item is wrong, we can get a
maximum likelihood estimate of the person's standing on the trait. As soon as
we have a point estimate, we can compute a confidence interval, that is, a
local standard error of measurement for the person. Now we will choose the
item that is expected to provide the maximum information for that person.
After administering each item, we can compute the person's standing on
the trait and the confidence interval. When the confidence interval is small
enough, we stop testing. This means that each person is likely to get a
different test but that the scores will be on the same scale and measured with
approximately equal error.
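The loop above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the item bank, the parameter values, a simple grid-search maximum likelihood estimator, item selection by nearest difficulty rather than maximum information, and a fixed 10-item stopping rule in place of the confidence-interval rule described above.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def mle_theta(responses, items):
    """Grid-search maximum-likelihood estimate of theta.
    responses: 0/1 scores; items: matching (a, b, c) tuples."""
    def loglik(t):
        return sum(math.log(p_3pl(t, a, b, c)) if u
                   else math.log(1.0 - p_3pl(t, a, b, c))
                   for u, (a, b, c) in zip(responses, items))
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=loglik)

random.seed(1)
pool = [(1.5, b / 4.0, 0.2) for b in range(-8, 9)]  # hypothetical item bank
true_theta, est = 0.8, 0.0
used, resp = [], []
for _ in range(10):
    # pick the unused item whose difficulty is closest to the current estimate
    item = min((i for i in pool if i not in used), key=lambda i: abs(i[1] - est))
    used.append(item)
    resp.append(1 if random.random() < p_3pl(true_theta, *item) else 0)
    if 0 < sum(resp) < len(resp):  # need at least one right and one wrong
        est = mle_theta(resp, used)
print(round(est, 2))
```

The guard on the update line matches the text: the likelihood has no interior maximum until the person has at least one item right and one wrong.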

**3. Practical notes.** IRT is an extraordinary
theoretical advance. It provides theoretical justification for lots of practical
activities. On the other hand, it doesn't let us do anything we weren't
already doing; it just gives us an excuse to go ahead and do it. IRT typically
requires large sample sizes and lots of computational power. It is difficult to
publish IRT analyses because accepted procedure constantly changes.