Review Guide for the CRC Examination:

Validity and Reliability

Tests are valid when they measure what they claim to measure, and reliable when they measure consistently. Because we measure qualities that are relatively stable (e.g., intelligence, personality characteristics), a valid measure of a construct will also be a reliable one. But a reliable measurement is not necessarily valid.

In a situation where a reliable measure is not valid, the test is consistently measuring something other than what it claims to measure. A test claiming to measure spatial perception could actually be measuring more form perception than spatial perception. Client scores on the test would be quite consistent from one administration to the next, but their scores would too strongly reflect their abilities at form perception, and the test would not be a valid measure of what it claims to measure -- spatial perception. If we needed a measure of spatial perception for vocational planning with a client, and we used their score on such a test, it would not be valid.

Another concept to understand in validity and reliability is that we can use a valid and reliable test in totally invalid ways. Here the problem is not the design or psychometric properties of the test, but rather how we ourselves are using the instrument. A valid test of intelligence, for example, could be misused by a counselor if he or she concluded that an I.Q. score of 80 is too low for successful completion of some technical training program. The more important factor may be reading level. If the training text is written at an 8th grade level, a client with an I.Q. of 80 and grade reading level of 10.3 is likely to do better than a client with an I.Q. of 90 and 5.5 reading level.

Are there any situations where a valid test would not be a reliable one? Only if what we are measuring is changing from one administration to the next. A test of mood, for example, might be valid but have low test-retest reliability. The result is valid at the moment obtained, but mood can change again a moment later. From a pragmatic perspective, there is not much point in measuring what is highly unstable or constantly changing. We are interested in measuring more enduring client qualities and characteristics that will help us to predict what is likely to happen in the future.

So far we have discussed validity and reliability in rather general and pragmatic ways. There is more to both concepts, and in preparing for the CRC exam you should definitely know the different types of validity and reliability.

Types of Validity

Face Validity: Many say this is not a true type of validity for it cannot be measured. Face validity simply means that the test looks reasonable to a lay person in relation to what it claims to measure. When you take the CRC Exam you expect to see questions dealing with the practice of rehabilitation counseling. If you see a question such as, "Who was the 30th President of the United States?" you might correctly wonder what that has to do with the practice of your profession. What you are experiencing with the test is a problem with its face validity.

Content Validity: This is where a test adequately samples from the entire domain it intends to measure. If it does so, the test may be said to have good content validity. On the CRC Examination you can expect to see questions dealing with many different aspects of rehabilitation and counseling. If there were no questions on the examination dealing with validity and reliability, or far too many compared with other areas, the test would have a problem with content validity. The test would be failing to assess an area of knowledge important in the practice of the profession, or placing far too much emphasis on it, invariably at the expense of measuring other areas.

The content validity of an instrument can be studied empirically in a number of different ways. Studies of internal consistency, item analysis, and expert ratings are commonly employed methods. But content validity is never an end in itself. When a test has content validity, other and more important types of validity tend to follow.

Criterion or Predictive Validity: This is when a test score is correlated with some event beyond itself. This is the type of validity counselors are most interested in for they need to predict what will happen if certain courses of action are followed. Will the client be able to successfully complete a certain job training program? Scores on tests may help the counselor answer the question. Is the client presently depressed, and if so how severe is the depression? Test scores can provide quantifiable information not available any other way.

When two events are correlated, one can be used to predict the other. The client's test score is one event, and the other event is the criterion, or what we are trying to predict. We measure validity with correlation coefficients, and by far the most widely used correlation is the Pearson r. Correlations can range from -1.0 to +1.0, and a correlation of 0.0 means there is no relationship between the two events. Correlations from -.30 to +.30 are generally considered too weak to be useful for prediction.
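
As a rough illustration of how such a coefficient is obtained, the Python sketch below computes a Pearson r between a set of test scores and a criterion. The function name `pearson_r` and the sample data are hypothetical, made up for this example only.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of each pair's deviations from the two means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: aptitude test scores (the predictor) and
# later training ratings (the criterion) for six clients.
test_scores = [85, 90, 78, 92, 70, 88]
training_ratings = [3.1, 3.6, 2.8, 3.9, 2.5, 3.0]

print(round(pearson_r(test_scores, training_ratings), 2))
```

A result near +1.0 or -1.0 would indicate a strong relationship, and a result near 0.0 would indicate little or none.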

To be useful in prediction, a test should correlate at .40 or higher with the criterion. The higher the correlation, the better one event predicts the other, and .60 is a good correlation for a validity coefficient. What does a correlation of .60 mean? It reflects how much variance the two events share with one another. A conservative estimate of the shared variance can be obtained by squaring the correlation coefficient. A .60 correlation means that 36% of what contributes to one observation also contributes to the other.

Sometimes the term concurrent validity is used to describe a particular type of criterion or predictive validity. When two tests claim to measure the same thing, they should have a high concurrent validity coefficient. If two tests, for example, both claim to measure intelligence, you would expect relative performance on one to be similar to relative performance on the other. If it is not, the tests are not measuring the same thing.
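
The squaring rule for shared variance can be checked with a few lines of Python. This is only a sketch; the function name `shared_variance` is hypothetical, and the correlations chosen are the ones mentioned in this guide.

```python
def shared_variance(r):
    """Proportion of variance two measures share: the squared correlation."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1.0 and +1.0")
    return r ** 2

# Correlations discussed in this guide, shown with their shared variance
for r in (0.30, 0.40, 0.60, 0.80):
    print(f"r = {r:.2f} -> shared variance = {shared_variance(r):.0%}")
# r = 0.30 -> shared variance = 9%
# r = 0.40 -> shared variance = 16%
# r = 0.60 -> shared variance = 36%
# r = 0.80 -> shared variance = 64%
```

Note how quickly shared variance falls off: a correlation of .30 accounts for only 9% of the variance.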

Construct Validity: This exists when a test has wide acceptance as a means for measuring a construct. The published literature on the construct frequently uses the test to measure it, and after a time our understanding of the construct starts to merge with the instrument. As an example of this, if you ask some psychologists what intelligence is, they will tell you it is what the WAIS-R measures. They say that because all the published research on intelligence uses the WAIS-R or an instrument that has established concurrent validity with it, and our very understanding of the construct has merged, to a significant degree, with what the WAIS-R measures.

Types of Reliability

Reliability refers to the consistency and stability of measurements. It is essentially the test predicting itself from one administration to the next. A test that cannot predict itself will be unable to predict anything else. If, for example, a subject earns an IQ score of 105 ... when tested again a few days later the second IQ score should be similar. If it is not, the reliability of the test might be questioned.

Like validity, reliability is usually measured by Pearson product-moment correlations. Reliability coefficients tend to be higher than validity coefficients because it is easier for a test to predict itself than a quality or occurrence beyond itself. Reliability coefficients are usually in the .80 range or higher.

Test-Retest Reliability: This exists when subjects produce scores on a second administration of the test that are similar to the scores they produced on the first administration.

Split-Half Reliability: One of the ways to study and estimate the reliability of an instrument without going through the work of two administrations (where a test-retest effect may influence second scores) is the split-half procedure. It is usually done by correlating subject scores on the odd items of the test with scores for the even items of the test. The correlation obtained is the split-half reliability. The reliability of the larger instrument can be estimated using a simple Spearman-Brown formula to correct the split-half correlation. This formula raises the reliability estimate somewhat because longer tests are generally more reliable than shorter ones.
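
The split-half procedure described above can be sketched in Python as follows. The function names and the response data are hypothetical. The correction shown is the standard Spearman-Brown formula for a test twice the length of each half: corrected reliability = 2r / (1 + r), where r is the correlation between the two halves.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between halves."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(item_scores):
    """item_scores holds one list per subject with one score per item."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)

# Hypothetical responses: 5 subjects x 6 items, 1 = correct, 0 = incorrect
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]

r_half, r_full = split_half_reliability(responses)
print(f"split-half r = {r_half:.2f}, corrected estimate = {r_full:.2f}")
```

For any positive split-half correlation below 1.0, the corrected estimate is higher than the uncorrected one. A split-half correlation of .60, for instance, corrects to .75.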

Parallel Form Reliability: In order to avoid test-retest effects it is often useful to have two or more forms or versions of the same test. Form A, for example, can be given at the start of a program and Form B at the end without duplicating the test questions. An example of this would be taking 200 test items designed to measure a domain and building two 100-item parallel test forms from them.
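
The 200-item example can be sketched in Python as a simple random split. This is a simplified illustration with a hypothetical function name, `build_parallel_forms`. In practice parallel forms are usually also matched on content and difficulty, which this sketch does not attempt.

```python
import random

def build_parallel_forms(item_ids, seed=0):
    """Randomly split an item pool into two equal-length, non-overlapping forms."""
    if len(item_ids) % 2 != 0:
        raise ValueError("item pool must contain an even number of items")
    shuffled = list(item_ids)
    # A fixed seed makes the split reproducible from run to run
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

# Hypothetical pool of 200 items, numbered 1 to 200
pool = list(range(1, 201))
form_a, form_b = build_parallel_forms(pool)
print(len(form_a), len(form_b))  # 100 100
```

Once both forms have been given to the same subjects, the correlation between Form A and Form B scores serves as the parallel form reliability estimate.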



Copyright, University of South Florida.