Validity Outline
I. Definitions
II. Designs
A. Content
B. Known Groups
C. Criterion
D. Construct
III. Special Cases
A. Construct
1. Factor Analysis
2. Multitrait multimethod matrix
B. Criterion -- Taylor-Russell Tables
Definitions
Lots of experts say that reliability is the degree to which a test measures something (anything) and validity is the degree to which the test measures what it says it measures. They are right in a way, but they are wrong in a way, too. Validity doesn't reside in a test. Validity is a property of inferences based on test scores: it is the degree to which test scores support the inferences we want to make based on those scores. The closer test scores come to letting us infer what we want, the more valid they are.
Suppose I dust off my bathroom scale, calibrate it, and weigh some people with it. Is my scale valid? Are my resulting numbers valid? Well, if I am trying to find out the weight of people in this class, my numbers will support inferences of individual differences in weight. Does this mean that the scale is a valid measure of weight? Only if we qualify what I am trying to do. If I use the scale to measure the weight of your automobiles, I am not going to get useful measures from the scale. In fact, after the first car, I am probably never going to use the scale again. Would a typical bathroom scale be useful for measuring differences in the weight of various boxes of cereal or different facial tissues we have hanging around the house? So in a sense validity is the degree to which the test measures what it says it measures, but you would have to qualify what it says it measures quite a lot before that statement is accurate. A bathroom scale measures the weight of certain people within a certain range with some error.
Would a bathroom scale be a valid measure of height? Your first reaction is probably, "Of course not, it measures weight." But if we used the scale to estimate the relative heights of people in this class, it would do a lot better than chance because the heavier people also tend to be taller. Therefore there is some validity (even if not a lot) to the bathroom scale as a measure of height, because it lends some support to the inference I want to make, in this case, ordering individuals on the basis of height. When I measure people, I want to order them in some way, whether in terms of height, weight, introversion, skill in welding, or mastery of the contents of this course. A test is valid to the degree that it helps me to do so, that is, to the degree that it supports the inferences I want to make.
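The bathroom scale argument can be illustrated with a small simulation. The sketch below uses entirely synthetic data: the height and weight figures, the assumed strength of their relation, and the helper name `pearson_r` are all made up for illustration and are not measurements from any class. It only shows that a scale reading can order people by height better than chance while still being far from a perfect measure of height.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic "class" of 200 people (all numbers are invented assumptions).
heights_cm = [random.gauss(170, 10) for _ in range(200)]
# Assumed: weight rises with height, plus a lot of variation unrelated to height.
weights_kg = [70 + 0.9 * (h - 170) + random.gauss(0, 12) for h in heights_cm]

r_weight_height = pearson_r(weights_kg, heights_cm)
print(f"correlation of scale reading with height: {r_weight_height:.2f}")
```

A correlation well above zero but well below one is what "some validity, even if not a lot" looks like in data.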
It is almost always the case that the test is not perfectly valid or perfectly invalid, although it turns out to be easier to create invalid tests than valid ones. An in-class multiple choice test of your mastery of the contents of this class, if well constructed, will provide some information about the ordering of students in the class in terms of content mastery. On the other hand, it will also reflect luck in guessing answers, luck in the choice of particular items that you happen to remember (or not), individual differences in hangovers, lovers' quarrels, and other sources of unwanted influence too numerous to mention. A large part of the literature on validity deals with how we would know how valid some measure is, and we turn to this next.
Validation Designs
A validation design is a way of collecting data to evaluate the validity of a measure. All of the validation designs rest on a very simple principle, which can be stated "If our measure is valid, what do we expect to see in the data?" We will discuss four designs: content, known groups, criterion, and construct.
Content
The content validation design, or content validity for short, is rather different from the other three. Content validity refers to the degree to which the contents of a test reflect the domain of contents of interest. For example, a test of American History would be said to be content valid to the degree that the items in the test matched or represented things that happened in America's past. If the test were content valid, you would expect to see items about the American Revolution, the Industrial Revolution, the World Wars, and various US presidents, among others, e.g., "Who is buried in Grant's tomb?" On the other hand, if you found that most of the items on the same test read like this: "One train leaves Chicago Westbound at 60 mph. The other leaves Boise Eastbound traveling 210 mph. Where do they collide?" you would doubt the content validity of the test. Discussion of content validity is most appropriate where there is a well defined body of knowledge or skill that can be sampled. We could sample words from a dictionary to test vocabulary, for example. Or we could sample various types of shot production in tennis (e.g., backhand overhead, lob, kick serve). Content validity is not much use when there is no clear relation between a sample of items and some well defined domain. For example, it is hard to talk about the content validity of a personality test because personality doesn't lend itself well to a given domain that can be sampled. True, you can ask people to describe themselves given a list of adjectives (brave, cheerful, thrifty, clean, and reverent), but how do you know you didn't miss something important? Maybe it would be better to test for personality by putting two people in a room and watching them talk to one another about politics. The point is that personality doesn't have a well defined domain like vocabulary or American History, so it is difficult to establish the content validity of a personality test.
There is one other major difference between content validity and the other aspects of validity. Content validity refers to the adequacy of a sample of stimulus situations, that is, test items. All the others refer to responses to the test items; that is, the other kinds of designs all refer to test scores instead of test items. In my opinion, content validity is very good to have, but it is not sufficient to support validity because it doesn't tell you how well your test scores support the inferences you want to make. In some domains content validity may be necessary; in others it is not.
Recall that validation asks:
"If our measure is valid, what do we expect to see in the data?"
Content validity provides this answer:
"If the test is valid, then I expect the sample of items in the test to be representative of the domain (population) of possible items."
Known Groups
With the known groups validation design, we collect data on a measure from two or more groups that everybody agrees are different to see whether our measure shows the expected difference. For example, if we developed a measure of body image anxiety (worry about the way we look), we might take a group of anorexics and a group of normal adults and compare their test scores. We would expect that the anorexic group would score higher on body image anxiety than would the normal group. A finding of such a difference helps support the validity of the measure. A finding of no difference would make us doubt that the scale was valid. Several clinical assessment tools have been developed and validated using a known groups design. The MMPI (Minnesota Multiphasic Personality Inventory) was developed by comparing responses of presumably normal people to those of groups of patients diagnosed with different disorders, such as depression, schizophrenia, and so forth. You could use groups of men and women to validate a scale of masculine or feminine interests. Any group will do so long as it is naturally occurring and everyone can agree about who belongs in which group.
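A known groups comparison can be sketched as a difference between two group means. The scores below are hypothetical values invented for illustration, and the function names `cohens_d` and `welch_t` are my own labels; the standardized difference (Cohen's d) and Welch's t statistic are just two common ways to summarize such a difference, not a procedure prescribed by these notes.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def welch_t(group_a, group_b):
    """Welch's t statistic for a difference between two group means."""
    se = sqrt(stdev(group_a) ** 2 / len(group_a) +
              stdev(group_b) ** 2 / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical body image anxiety scores (invented, not real data).
clinical_group = [31, 28, 35, 30, 33, 29, 34, 32]
comparison_group = [22, 25, 19, 24, 21, 26, 20, 23]

d = cohens_d(clinical_group, comparison_group)
t = welch_t(clinical_group, comparison_group)
print(f"mean difference = {mean(clinical_group) - mean(comparison_group):.1f}")
print(f"Cohen's d = {d:.2f}, Welch t = {t:.2f}")
```

A large difference in the expected direction supports the measure; a difference near zero, or in the wrong direction, would make us doubt it.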
"If our measure is valid, what do we expect to see in the data?"
Known groups provides this answer:
"If the test is valid, then I expect the test scores to differ according to group membership in a predictable way."
Criterion
Criterion related validation strategies compare test scores to some other variable that is called a criterion (criteria is plural). The basic idea here is that if a test is valid, it should correlate with or predict some criterion of interest. For example, the SAT should predict college grade point average. Here the SAT would be the test and the GPA would be the criterion. Finding a correlation between the SAT and the GPA would help support inferences about which student applicants to accept into college (based on test scores). Finding no relation between the two would cause us to doubt the validity of the test. Criterion related validation strategies are frequently used by industrial psychologists, who use tests to predict job performance measures. We might find, for example, that a test of manual dexterity (ability to manipulate objects quickly and accurately with one's hands) predicts the job performance of sewing machine operators and dentists. Here the test would be a test of manual dexterity and the criterion would be some measure of job performance, such as the number of garments sewn per hour or the number of patient complaints of being hurt by the dentist. ("...now, if this hurts, you just holler. Just keep on hollerin' 'till you feel better.")
"If our measure is valid, what do we expect to see in the data?"
Criterion related validity provides this answer:
"If the test is valid, then I expect the test to correlate with one or more criteria of interest in the predicted direction."
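The criterion related expectation can be sketched as a correlation between test scores and a criterion, plus a least-squares line for prediction. The admission test scores and grade point averages below are hypothetical numbers invented for illustration, and the function name `criterion_validity` is an assumption of this sketch.

```python
from math import sqrt

def criterion_validity(test, criterion):
    """Return (r, slope, intercept) for predicting the criterion from the test."""
    n = len(test)
    mean_t = sum(test) / n
    mean_c = sum(criterion) / n
    s_tc = sum((t - mean_t) * (c - mean_c) for t, c in zip(test, criterion))
    s_tt = sum((t - mean_t) ** 2 for t in test)
    s_cc = sum((c - mean_c) ** 2 for c in criterion)
    r = s_tc / sqrt(s_tt * s_cc)        # the validity coefficient
    slope = s_tc / s_tt                 # least-squares slope
    intercept = mean_c - slope * mean_t
    return r, slope, intercept

# Hypothetical admission test scores and later GPAs (invented, not real data).
test_scores = [1010, 1100, 1180, 1250, 1300, 1360, 1420, 1500]
first_year_gpa = [2.4, 2.9, 2.6, 3.1, 3.0, 3.5, 3.3, 3.8]

r, slope, intercept = criterion_validity(test_scores, first_year_gpa)
predicted_gpa = intercept + slope * 1200  # prediction for a new applicant

print(f"validity coefficient r = {r:.2f}")
print(f"predicted GPA for a score of 1200 = {predicted_gpa:.2f}")
```

A correlation near zero, or one in the opposite direction from what was predicted, would be evidence against the inference the test is supposed to support.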
Construct
A construct is an unobservable variable hypothesized to account for behavior. Extroversion is an example in personality. Some people are outgoing and like to be the center of attention. They strike up conversations with strangers, are at ease with other people, and can be the life of a party. The construct of extroversion is hypothesized to account for (explain) a broad array of interpersonal behavior. Construct validation designs are the most comprehensive of the validation designs. They involve collecting multiple measures and examining the pattern of relations to see whether they conform to what is expected or predicted. For example, if I had a measure of workplace hostility (angry and resentful feelings toward other people at work) and I gave it to a large group of workers, I would expect the measure to be correlated with satisfaction with coworkers and with satisfaction with supervision. I would also expect more hostile people to have a greater incidence of drug and alcohol problems, and to be less well liked by their coworkers than those with lower scores on the measure. In a construct validation design, one collects multiple measures (e.g., hostility, job satisfaction, coping mechanisms, peer ratings) and looks to see an expected pattern of relations among the measures.
There are two specific designs that are most often associated with construct validity: factor analysis and the multitrait multimethod matrix. Both techniques embody the same idea, namely, that a measure should correlate more highly with similar measures than with different measures. For example, a measure of depression should correlate more highly with other measures of depression than with measures of other things like curiosity or anxiety. This is a very specific expectation for a pattern of relations with other variables. On the other hand, the expected pattern of relations can also be a generalization of known groups and criterion related validity. Another way of saying this is that known groups and criterion related validity are special cases of construct validity.
"If our measure is valid, what do we expect to see in the data?"
Construct related validity provides this answer:
"If the test is valid, then I expect the test to show an expected pattern of relations with other variables."
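The idea that a measure should correlate more highly with similar measures than with different ones can be checked directly. The sketch below uses hypothetical scores for two depression measures and one curiosity measure (all values and variable names are invented for illustration). It is a bare-bones comparison of correlations, not a full factor analysis or multitrait multimethod matrix.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight people (invented, not real data).
measures = {
    "depression_a": [12, 18, 9, 22, 15, 7, 20, 11],
    "depression_b": [14, 19, 10, 24, 13, 8, 21, 12],
    "curiosity": [28, 31, 33, 30, 25, 27, 29, 32],
}

# Similar measures: the two depression scales should agree with each other.
convergent = pearson_r(measures["depression_a"], measures["depression_b"])
# Different measures: depression should relate only weakly to curiosity.
discriminant = [
    pearson_r(measures["depression_a"], measures["curiosity"]),
    pearson_r(measures["depression_b"], measures["curiosity"]),
]

pattern_holds = all(convergent > abs(r) for r in discriminant)
print(f"convergent r = {convergent:.2f}")
print("discriminant r =", [round(r, 2) for r in discriminant])
print("expected pattern holds:", pattern_holds)
```

If the depression scale correlated as strongly with curiosity as with the other depression scale, the expected pattern would fail and we would question what the scale measures.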