Validity Outline

I. Definitions

II. Designs

A. Content

B. Known Groups

C. Criterion

D. Construct

III. Special Cases

A. Construct

1. Factor analysis

2. Multitrait multimethod matrix

B. Criterion

1. Taylor-Russell Tables

Definitions

Lots of texts say that reliability is the degree to which a test measures something (anything) and validity is the degree to which the test measures what it says it measures. And they are right in a way, but they are wrong in a way, too. Validity doesn’t reside in a test. Validity is a property of inferences based on test scores. Validity is really the degree to which test scores support the inferences we want to make based on those scores. The closer test scores come to letting us infer what we want, the more valid they are.

Suppose I dust off my bathroom scale, calibrate it and weigh some people with it. Is my scale valid? Are my resulting numbers valid? Well, if I am trying to find out the weight of people in this class, my numbers will support inferences of individual differences in weight. Does this mean that the scale is a valid measure of weight? Only if we qualify what I am trying to do. If I use the scale to measure the weight of your automobiles, I am not going to get useful measures from the scale. In fact, after the first car, I am probably never going to use the scale again. Would a typical bathroom scale be useful for measuring differences in weight of various boxes of cereal or different facial tissues we have hanging around the house? So in a sense validity is the degree to which the test measures what it says it measures, but you would have to qualify what it says it measures a lot before that statement is accurate. A bathroom scale measures the weight of people within a certain range with some error.

Would a bathroom scale be a valid measure of height? Your first reaction is probably, "Of course not, it measures weight." But if we used the scale to estimate the relative heights of people in this class, it would do a lot better than chance because the heavier people also tend to be taller. Therefore, there is some validity (even if not a lot) to the scale, because it lends some support to the inferences I want to make, in this case, ordering individuals on the basis of height. When I measure people, I want to order them in some way, whether in terms of their height, weight, introversion, skill in welding or mastery of the contents of this course. A test is valid to the degree that it helps me do so, that is, to the degree that it supports the inferences I want to make.
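To make "better than chance" concrete, here is a minimal sketch in Python. The weights and heights are made-up numbers for six imaginary people, and the function name pairwise_agreement is my own invention for illustration. It counts how often the scale orders a pair of people the same way a tape measure would; pure guessing would get about half of the pairs right.

```python
from itertools import combinations


def pairwise_agreement(scores, target):
    """Fraction of pairs of people that `scores` orders the same way as `target`.

    Pairs tied on either variable are skipped. A value near 0.5 means the
    scores order people no better than chance; 1.0 means perfect ordering.
    """
    agree = 0
    total = 0
    for (s1, t1), (s2, t2) in combinations(zip(scores, target), 2):
        if s1 == s2 or t1 == t2:
            continue  # a tie gives no ordering information
        total += 1
        if (s1 - s2) * (t1 - t2) > 0:
            agree += 1
    if total == 0:
        raise ValueError("no untied pairs to compare")
    return agree / total


# Hypothetical data, invented for illustration only.
weights_kg = [55, 62, 70, 81, 90, 68]
heights_cm = [160, 172, 168, 180, 178, 175]

agreement = pairwise_agreement(weights_kg, heights_cm)
print(f"Scale orders {agreement:.0%} of pairs correctly by height")
```

With these invented numbers the scale gets most pairs right but not all of them, which is the sense in which it has some validity, but not a lot, as a measure of height.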

A test is almost never perfectly valid or perfectly invalid, although it turns out to be easier to create invalid tests than valid ones. An in-class multiple choice test of your mastery of the contents of this class, if well constructed, will provide some information about the ordering of students in the class in terms of content mastery. On the other hand, it will also reflect luck in guessing answers, luck in the choice of particular items that you happen to remember (or not), individual differences in hangovers, lovers’ quarrels, and other sources of unwanted influence too numerous to mention. A large part of the literature on validity deals with how we would know how valid some measure is, and we turn to this next.

Validation Designs

A validation design is a way of collecting data to evaluate the validity of a measure. All of the validation designs rest on a very simple principle, which can be stated as a question: "If our measure is valid, what do we expect to see in the data?" We will describe four designs: content, known groups, criterion, and construct.

Content

Content validation designs, or content validity for short, are rather different from the other three. Content validity refers to the degree to which the contents of a test reflect the domain of contents of interest. For example, a test of American History would be said to be content valid to the degree that the items in the test matched or represented things that happened in America’s past. If the test were content valid, you would expect to see items about the American Revolution, the Industrial Revolution, the World Wars, and various U.S. Presidents among others, e.g., "Who is buried in Grant’s tomb?" On the other hand, if you found that most of the items on the same test read like this: "One train leaves Chicago Westbound at 60 mph. The other leaves Boise Eastbound traveling at 2,301.7 mph. Where do they collide?", you would doubt the content validity of the test.

Discussion of content validity is most appropriate where there is a well defined body of knowledge or skill that can be sampled. We could sample words from a dictionary to test vocabulary, for example. Or we could sample various types of shot production in tennis (e.g., backhand overhead, lob, kick serve). Content validity is not of much use when there isn’t a clear relation between a sample of items and some well defined domain. For example, it is hard to talk about the content validity of a personality test because personality doesn’t lend itself well to a given domain that can be sampled. True, you can ask people to describe themselves given a list of adjectives (brave, cheerful, thrifty, clean and reverent), but how do you know you didn’t miss something important? Maybe it would be better to test for personality by putting two people in a room and watching them talk to one another about politics. The point is that personality doesn’t have a well defined domain like mathematics, American history, or tennis shots.

There is one other difference between content validity and the other kinds. Content validity refers to the adequacy of sampling of stimulus situations, that is, test items. All the others refer to responses to the test items, that is, the other kinds of designs all refer to test scores instead of test items. In my opinion, content validity is very good to have, but it is not sufficient because it doesn’t tell you how well your test scores support the inferences you want to make.

Recall that the question asked by validation is:

"If our measure is valid, what do we expect to see in the data?"

Content validity provides this answer:

If the test is valid, then I expect the sample of items in the test to be representative of the domain (population) of possible items.

Known Groups

With the known groups validation design, we collect data on a measure from two or more groups who everybody agrees are different to see whether our measure shows the expected difference. For example, if we developed a measure of body image anxiety (worry about the way we look), we might take a group of anorexics and a group of normal adults and compare their test scores. We would expect that the anorexic group would score higher on body image anxiety than would the normal group. A finding of such a difference helps support the validity of the measure. A finding of no difference would make us doubt that the scale was valid. Several clinical assessment tools have been developed and validated using a known groups design. The MMPI (Minnesota Multiphasic Personality Inventory) was developed by comparing responses of presumably normal people to those of groups of patients diagnosed with different disorders, such as depression, schizophrenia, and so forth. You could use groups of men and women to validate a scale of masculine or feminine interests. Any group will do so long as it is naturally occurring and everyone can agree about who belongs in what group.

"If our measure is valid, what do we expect to see in the data?"

Known groups provides this answer:

If the test is valid, then I expect the test scores to differ according to group membership in a predictable way.
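As a rough sketch of what a known groups comparison looks like in practice, the Python below computes a standardized mean difference (Cohen's d) between two groups. The scores are invented for illustration, and Cohen's d is simply one common way to summarize a group difference; it is my choice for the example, not something the design requires. In a real study you would also want a significance test and much larger samples.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical body image anxiety scores, invented for illustration only.
anorexia_group = [31, 28, 35, 30, 33, 29]
comparison_group = [18, 22, 20, 25, 17, 21]

d = cohens_d(anorexia_group, comparison_group)
print(f"Standardized difference between groups: d = {d:.2f}")
```

A clearly positive d is what we would expect to see if the measure is valid; a d near zero would make us doubt the scale.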

Criterion

Criterion related validation strategies compare test scores to some other variable which is called a criterion (criteria is plural). The basic idea here is that if a test is valid, it should correlate with or predict some criterion of interest. For example, the SAT should predict college grade point average. Here the SAT would be the test and GPA would be the criterion. Finding a correlation between the SAT and the GPA would help support inferences about which student applicants to accept into college (based on test scores). Finding no relation between the two would cause us to doubt the validity of the test.

Criterion related validation strategies are frequently used by industrial psychologists, who use tests to predict job performance measures. We might find, for example, that a test of manual dexterity (ability to manipulate objects quickly and accurately with one’s hands) predicts the job performance of sewing machine operators and dentists. Here the test would be one of manual dexterity and the criterion would be some measure of job performance, such as the number of garments sewed per hour or the number of patient complaints of being hurt by the dentist.

"If our measure is valid, what do we expect to see in the data?"

Criterion related validity provides this answer:

If the test is valid, then I expect the test to correlate with one or more criteria of interest in a predictable way.
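The correlation between a test and a criterion is often called a validity coefficient. Here is a minimal sketch of computing one in Python, assuming made-up manual dexterity scores and garments sewn per hour for eight imaginary sewing machine operators. The function pearson_r is written out by hand so the arithmetic is visible.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviation scores.
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data, invented for illustration only.
dexterity = [42, 55, 61, 48, 70, 66, 53, 59]   # test scores
garments = [11, 14, 15, 12, 19, 16, 13, 16]    # criterion: garments per hour

validity_coefficient = pearson_r(dexterity, garments)
print(f"Test-criterion correlation: r = {validity_coefficient:.2f}")
```

If the criterion were the number of patient complaints about a dentist, we would expect the correlation to be negative instead, which is why the expected direction needs to be stated before looking at the data.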

Construct

A construct is an unobservable variable hypothesized to account for behavior. Extroversion is an example in personality. Some people are outgoing and like to be the center of attention. They strike up conversations with strangers, are at ease with other people, and can be the life of a party. The construct of extroversion is hypothesized to account for a broad array of interpersonal behavior. Construct validation designs are the most comprehensive of the validation designs. They involve collecting multiple measures and examining the pattern of relations to see whether they conform to what is expected or predicted. For example, if I had a measure of workplace hostility (angry and resentful feelings toward other people at work) and I gave it to a large group of workers, I would expect the measure to be negatively correlated with satisfaction with coworkers and supervision. I would also expect more hostile people to have a greater incidence of drug and alcohol problems, and to be less well liked by their coworkers than those with lower scores on the measure. In a construct validation design one collects multiple measures (e.g., hostility, job satisfaction, coping mechanisms, peer ratings) and looks to see an expected pattern of relations among the measures.

"If our measure is valid, what do we expect to see in the data?"

Construct related validity provides this answer:

If the test is valid, then I expect the test to show an expected pattern of relations with other variables.

This expected pattern of relations is really a generalization of known groups and criterion related validity. Another way of saying this is that known groups and criterion related validity are special cases of construct validity.
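The idea of checking an expected pattern can be sketched in Python using the workplace hostility example. All of the scores below are invented, the helper names are hypothetical, and the expected directions (negative for coworker satisfaction and peer liking, positive for alcohol problems) are my reading of that example. The sketch only checks the sign of each correlation; a real construct validation study would also look at the size of the relations and use far more people.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


def check_pattern(focal, others, expected_signs):
    """Return {name: (r, matches)} where matches is True when the observed
    correlation with the focal measure has the expected sign (+1 or -1)."""
    results = {}
    for name, scores in others.items():
        r = pearson_r(focal, scores)
        results[name] = (r, r * expected_signs[name] > 0)
    return results


# Hypothetical scores for eight workers, invented for illustration only.
hostility = [2, 5, 3, 8, 6, 1, 7, 4]
other_measures = {
    "coworker satisfaction": [9, 6, 7, 2, 4, 9, 3, 6],
    "peer liking": [8, 5, 7, 3, 4, 9, 2, 6],
    "alcohol problems": [0, 2, 1, 4, 2, 0, 3, 1],
}
expected = {"coworker satisfaction": -1, "peer liking": -1, "alcohol problems": +1}

results = check_pattern(hostility, other_measures, expected)
for name, (r, matches) in results.items():
    print(f"{name}: r = {r:+.2f}, as expected: {matches}")
```

With a single other variable this reduces to the criterion related case, which is one way to see why criterion related validity is a special case of construct validity.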