
Setting Standards or Cutoff Scores

How can we set passing scores or decision scores? I asked my advisor about doing this for tests in industry, and he said "It can't be done." But of course it can; we do it all the time. How well we do it is a different matter, and that is what he was getting at.

Two main methods

1) Judgment based methods

e.g., Angoff

2) Psychometric (test data driven) methods

The main idea here is to maximize either

a) the number of correct decisions, or

b) the utility of the decision maker.

In the development of tests it is common to pick two groups (e.g., normals and those with eating disorders) in equal numbers, administer the test, and look at the mean difference on the test. (This is the known groups validation method.) Then folks will either

a) pick the test score that maximizes correct decisions in the sample, or

b) use discriminant analysis or something like it to estimate the score that will result in the most correct decisions in the population.
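Option (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_cutoff` and the scores are made up, higher scores are assumed to indicate the clinical group, and ties go to the lowest cutoff.

```python
def best_cutoff(scores_a, scores_b):
    """Return (cutoff, n_correct) for the rule "score >= cutoff means group B".

    scores_a: scores of the lower-scoring group (e.g., normals)
    scores_b: scores of the higher-scoring group (e.g., the clinical group)
    """
    candidates = sorted(set(scores_a) | set(scores_b))
    best = (None, -1)
    for c in candidates:
        # Correct decisions: group A members below c plus group B members at or above c
        correct = sum(s < c for s in scores_a) + sum(s >= c for s in scores_b)
        if correct > best[1]:  # strict ">" keeps the lowest cutoff among ties
            best = (c, correct)
    return best

# Made-up scores for illustration only (equal group sizes, as in known groups designs)
normals = [3, 4, 5, 5, 6, 7, 8, 9]
clinical = [6, 8, 9, 10, 10, 11, 12, 13]
cutoff, n_correct = best_cutoff(normals, clinical)
print(cutoff, n_correct)
```

A cutoff found this way is tuned to the group sizes in the sample, not to the sizes of the groups in the population where the test will be used.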

In practice, there are often problems. One problem is base rates.

Example 1.

In an article published in Psychological Bulletin in 1955, Meehl & Rosen reported an army study of using a test to predict adjustment problems. They found 415 soldiers with good adjustment and 89 who were discharged for reasons of disability or AWOL (absent without official leave, hooky).

The test correctly flagged 55% of the poorly adjusted soldiers (valid positives) and incorrectly flagged 19% of the well-adjusted soldiers (false positives). Not bad, right? Well, it turns out that of people who are inducted into the Army, 5% ultimately are discharged for adjustment problems. Let's look at some numbers for 10,000 inductees.

 

                         Actual Adjustment
Predicted Adjustment     Poor           Good             Predicted Total
Poor                     275 (55%)      1,805 (19%)      2,080
Good                     225 (45%)      7,695 (81%)      7,920
Actual Total             500            9,500            10,000

If the test is used, 2,080 will be predicted to have poor adjustment. Of those, 275 will actually have poor adjustment. That is, 13 percent of those predicted to have poor adjustment will in fact have poor adjustment. If we use the test we will make 225 + 1,805 or 2,030 errors. Without the test and predicting all to be well, we would make a total of 500 errors. Therefore, using the test produces about four times as many errors (an increase of roughly 300 percent). This is a base rate problem because so few people (5%) are in the poor adjustment group.
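The arithmetic in Example 1 can be reproduced with a short Python sketch. The function name `decision_table` and its argument names are illustrative and do not come from Meehl & Rosen; counts are rounded to whole people.

```python
def decision_table(n, base_rate, hit_rate, false_pos_rate):
    """Cross-classify n people by actual and predicted adjustment.

    base_rate: proportion who actually have poor adjustment
    hit_rate: proportion of the actual-poor group that the test flags
    false_pos_rate: proportion of the actual-good group that the test flags
    """
    actual_poor = round(n * base_rate)
    actual_good = n - actual_poor
    true_pos = round(actual_poor * hit_rate)
    false_pos = round(actual_good * false_pos_rate)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": actual_poor - true_pos,
        "true_neg": actual_good - false_pos,
    }

table = decision_table(10_000, 0.05, 0.55, 0.19)
flagged = table["true_pos"] + table["false_pos"]
share_of_flagged_correct = table["true_pos"] / flagged
errors_with_test = table["false_neg"] + table["false_pos"]
# Without the test we predict "good" for everyone, so every actual-poor case is an error
errors_without_test = table["true_pos"] + table["false_neg"]
print(table)
print(round(share_of_flagged_correct, 3), errors_with_test, errors_without_test)
```

Changing `base_rate` while holding the other two rates fixed shows how quickly the balance of errors shifts.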

Example 2

This example concerns using a Rorschach scoring scheme for schizophrenia to select adult workers. A scoring scheme was developed for the famous inkblot test to differentiate normals from schizophrenics, and the authors suggested using it to screen workers. A cross-validation study showed that 8.1 percent of schizophrenics showed the pattern, and 0 percent of the normals showed it. Therefore, we don't have to worry about misclassifying someone who is normal as schizophrenic. The estimated proportion of schizophrenia in the population is .85 percent (it is debatable whether this represents the proportion of schizophrenics in the job applicant pool, but that is another story). Let's look at some numbers:

 

 

                 Criterion Classification
Test             Schiz          Normal            Predicted Total
Schiz            7 (8.1%)       0 (0%)            7
Normal           78 (91.9%)     9,915 (100%)      9,993
Actual Total     85             9,915             10,000

Here we would have to test 10,000 people to find 7 schizophrenics. This is probably not worth the cost and effort.

Remember, however, that base rates can change, so that a test may or may not be useful depending on the circumstances. Consider the following differences:

  1. Screening for maladjustment for potential inductees versus recent inductees (could the base rates be different and if so, why?).
  2. Use the test after an inductee complains about some other problem (e.g., physical health complaint).
  3. Use the test after an interview produces an equivocal judgment from the interviewer.
  4. Use the test to determine whether someone gets interviewed (that is, use the test to decide whether to accept or to examine).
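To see how much the base rate matters across situations like these, here is a hedged Python sketch. It assumes the test keeps the 55% hit rate and 19% false positive rate from Example 1 in every setting (which may not hold in practice), and it compares the test only against the strategy of predicting that everyone is well adjusted, which is sensible only when the base rate is below 50%. The function names are made up for illustration.

```python
def error_rate_with_test(base_rate, hit_rate, false_pos_rate):
    """Proportion of all decisions that are wrong when the test is used."""
    misses = base_rate * (1 - hit_rate)              # poor adjustment, not flagged
    false_alarms = (1 - base_rate) * false_pos_rate  # good adjustment, flagged
    return misses + false_alarms

def break_even_base_rate(hit_rate, false_pos_rate):
    """Base rate at which the test ties with predicting "good" for everyone.

    Solves b*(1 - h) + (1 - b)*f = b for b, which gives b = f / (h + f).
    """
    return false_pos_rate / (hit_rate + false_pos_rate)

HIT, FALSE_POS = 0.55, 0.19  # rates taken from Example 1
for base_rate in (0.05, 0.15, 0.30, 0.45):
    with_test = error_rate_with_test(base_rate, HIT, FALSE_POS)
    # Without the test, the error rate equals the base rate
    verdict = "test helps" if with_test < base_rate else "test hurts"
    print(base_rate, round(with_test, 3), verdict)
print(round(break_even_base_rate(HIT, FALSE_POS), 3))
```

Under these assumptions the test only reduces errors once the base rate of poor adjustment exceeds about 26 percent, which is why the same test can be useless for general screening yet helpful in a referred or pre-screened group.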

Other psychometric issues in classification

We implicitly assume that our two groups have score distributions that overlap but differ in their means, so that there is a nice mean difference and we can choose a single number that maximizes correct decisions. The base rate problem bites us when we collect data on equal numbers of people in each group and then apply the results to a population in which the groups are very unequal in size. The base rate problem can be illustrated by graphing the two score distributions with heights proportional to group size.

Note that even though the groups differ in their means, the larger group is so much larger that there are always more of its members, even at high levels of x.

 

Groups may also differ with respect to variability. This can result in having two cut points rather than one to maximize correct classifications.

It is even possible for two cutoff points to be useful when there is NO mean difference between the groups.
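A minimal Python sketch of where the cut points fall, assuming (this is an assumption, not something stated above) that scores in each group are normally distributed. The cut points are the scores at which the two densities, weighted by group size, are equal; the function name `density_crossings` and the parameter values are hypothetical.

```python
import math

def density_crossings(m1, s1, m2, s2, p1=0.5, p2=0.5):
    """Scores where p1*Normal(m1, s1) and p2*Normal(m2, s2) have equal height.

    m, s: group mean and standard deviation; p: group proportion (base rate).
    Setting the two log densities equal gives a*x**2 + b*x + c = 0.
    """
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
         + math.log(p1 / s1) - math.log(p2 / s2))
    if a == 0:  # equal variances: at most one cut point
        return [] if b == 0 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:  # one weighted curve is higher everywhere: no useful cut point
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# Same mean, different spread, equal group sizes: two cut points
print(density_crossings(0, 1, 0, 2))
# Equal spread, different means, very unequal group sizes: one extreme cut point
print(density_crossings(0, 1, 1, 1, p1=0.95, p2=0.05))
```

In the first call the cut points come out at about -1.36 and 1.36: people outside that band are more likely to belong to the more variable group, and people inside it to the less variable group, even though the means are identical.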

The technique used for placing people into groups, that is, classifying them, based on multivariate data is called discriminant analysis. Discriminant analysis can yield a probability estimate of group membership for any number of groups. PAR (Psych Assessment Resources) has recently developed a computer application based on discriminant analysis. You enter the raw scores for the MMPI(2?) and it shows a profile of probability of membership in 55 different classes.

Utility of Testing

Utility is a name for subjective value. It is something like well-being, or like the psychological counterpart of money rather than its market value. In lots of decision problems it seems reasonable to maximize utility. This may or may not be the same as maximizing income or minimizing expense. Recall the Taylor-Russell Tables and their conception of the hiring process. Suppose we have a scattergram showing the relation between test scores and job performance. We can dichotomize job performance and call those above some level successes, while those below are failures. We can set a cutoff point on the test so that we reject all those who fall below it, and accept those at or above the point.

Each quadrant represents the outcome of a decision. Two of these outcomes we would like (reject failures and hire successes). The other two are not so happy; that is, we want to avoid hiring failures and rejecting successes. Note that if we move the cutoff point up we will hire fewer failures but reject more successes; the opposite will occur if we move the cutoff point down. Now if we attach numbers to each of the four outcomes and note the relative proportions of people in each quadrant, we can assign a value to the total selection system. We can find the utility of the selection procedure by finding the proportion of people in each quadrant, multiplying by the utility of the outcome for that quadrant, and summing over the quadrants. For example, suppose our utility structure looked like this:

             Reject               Hire
Succeed      Outcome 1 = -10      Outcome 2 = 10
             P1 = .15             P2 = .30
Fail         Outcome 3 = 10       Outcome 4 = -10
             P3 = .30             P4 = .25

Then the utility of the test in this scenario would be (-10)(.15) + (10)(.30) + (10)(.30) + (-10)(.25) = 2.0. If we move the cut point up or down, the utility of the outcome in each cell will remain constant, but the proportions (relative frequencies) in the cells will change. This will change the overall utility of the test. We can then pick the cutoff with the largest subjective value, that is, the test cutoff point that results in the greatest utility.
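The calculation and the search over cut points can be sketched in Python as follows. The utilities and proportions are the ones from the example; the applicant records, the function names, and the rule "hire when score >= cutoff" are assumptions made for illustration.

```python
def expected_utility(utilities, proportions):
    """Sum over the four cells of (utility of outcome) x (proportion in cell)."""
    return sum(utilities[cell] * proportions[cell] for cell in utilities)

def utility_at_cutoff(records, cutoff, utilities):
    """records: (test_score, succeeded) pairs; hire when score >= cutoff."""
    counts = {cell: 0 for cell in utilities}
    for score, succeeded in records:
        outcome = "succeed" if succeeded else "fail"
        decision = "hire" if score >= cutoff else "reject"
        counts[(outcome, decision)] += 1
    proportions = {cell: k / len(records) for cell, k in counts.items()}
    return expected_utility(utilities, proportions)

utilities = {
    ("succeed", "reject"): -10, ("succeed", "hire"): 10,
    ("fail", "reject"): 10, ("fail", "hire"): -10,
}
example_proportions = {
    ("succeed", "reject"): 0.15, ("succeed", "hire"): 0.30,
    ("fail", "reject"): 0.30, ("fail", "hire"): 0.25,
}
example_utility = expected_utility(utilities, example_proportions)
print(example_utility)

# Made-up applicants: (test score, succeeded on the job)
records = [(1, False), (2, False), (3, False), (4, True),
           (5, True), (6, False), (7, True), (8, True)]
cutoffs = sorted({score for score, _ in records})
best = max(cutoffs, key=lambda c: utility_at_cutoff(records, c, utilities))
best_utility = utility_at_cutoff(records, best, utilities)
print(best, best_utility)
```

Because the utilities stay fixed while the cell proportions move with the cutoff, a different utility structure (say, a heavier penalty for hiring a failure) can shift the best cutoff even when the data are unchanged.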