Generalizability Theory

What is Generalizability Theory?

Classical Test Theory is a special case of Generalizability Theory in which error is unitary. In Generalizability Theory, error is separated into pieces, each of which can be estimated (if we collect the data properly). Recall the schematic for scientific measurement from the philosophy of science:

In classical test theory, we have true scores and (a single) error. In generalizability theory, we have multiple error terms that might correspond to things like fatigue, test form, and attributes of the test giver in our example.
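The contrast can be written out as score models. This is only a sketch for the simplest case, one person p observed under one condition i of a single facet (say, test form). The symbols (the nu effects and their variance components) follow common generalizability-theory notation and are assumptions here, not notation taken from these notes.

```latex
% Classical test theory: one undifferentiated error term
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E

% Generalizability theory, persons crossed with one facet:
% the error is split into a condition effect and a residual
X_{pi} = \mu + \nu_p + \nu_i + \nu_{pi,e}, \qquad
\sigma^2(X_{pi}) = \sigma^2_p + \sigma^2_i + \sigma^2_{pi,e}
```

Here the person component plays the role of true-score variance, and each further facet (fatigue, test giver, and so on) would add its own main-effect and interaction components.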

When you take a measurement, you estimate some quantity. The quality of the estimate -- how good it is -- depends upon the amount of error in the measurement. Generalizability theory (GT) lets you separate that error into parts due to different measurement conditions.

For example, four counselors rate drug clients over two sessions on coping. We may want to know the error due to counselors and to sessions. We could ask more specific questions, such as how well does a rating by counselor 1 in session 1 generalize to the average of 4 counselors' ratings in session 2?
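To make the counselor example concrete, here is a minimal sketch of how a generalizability coefficient might be projected for different numbers of counselors and sessions. It assumes a fully crossed clients x counselors x sessions design with both facets random, and it uses the standard relative-error formula for that design. The variance components are invented for illustration, and the function and argument names are hypothetical.

```python
def relative_g_coefficient(var_p, var_pc, var_ps, var_pcs_e, n_c, n_s):
    """Generalizability coefficient for relative decisions.

    Assumes clients (p) crossed with counselors (c) and sessions (s),
    both facets random. Only components that involve the client
    contribute to relative error; each is divided by the number of
    conditions averaged over.
    """
    relative_error = var_pc / n_c + var_ps / n_s + var_pcs_e / (n_c * n_s)
    return var_p / (var_p + relative_error)


# Hypothetical variance components (not estimated from real data).
components = dict(var_p=4.0, var_pc=1.0, var_ps=0.5, var_pcs_e=2.5)

# One counselor rating in one session.
single = relative_g_coefficient(**components, n_c=1, n_s=1)

# The average of four counselors over two sessions.
averaged = relative_g_coefficient(**components, n_c=4, n_s=2)

print(f"1 counselor, 1 session:   {single:.3f}")
print(f"4 counselors, 2 sessions: {averaged:.3f}")
```

With these made-up numbers, averaging over more counselors and sessions shrinks the error term and raises the coefficient. The more specific question about one counselor in one session predicting a four-counselor average in another session would need its own error term, which this sketch does not attempt.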

We can also use generalizability theory to answer questions such as how many judges we need to reliably judge Olympic figure skating competitions.
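The judges question can be sketched in code. The example below assumes the simplest design, every skater scored once by every judge (skaters crossed with judges), estimates variance components from the two-way ANOVA mean squares, and then projects the coefficient for different numbers of judges. The scores are invented, the function names are hypothetical, and the coefficient is the one for relative decisions (ranking skaters), which is only one of the choices a real analysis would have to make.

```python
def variance_components(scores):
    """Estimate variance components for a crossed persons x judges design.

    `scores` has one row per person and one column per judge, with
    exactly one score per cell. Negative estimates are set to zero.
    """
    n_p, n_j = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_j)
    person_means = [sum(row) / n_j for row in scores]
    judge_means = [sum(row[j] for row in scores) / n_p for j in range(n_j)]

    # Sums of squares for persons, judges, and the residual.
    ss_p = n_j * sum((m - grand) ** 2 for m in person_means)
    ss_j = n_p * sum((m - grand) ** 2 for m in judge_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = max(0.0, ss_total - ss_p - ss_j)

    ms_p = ss_p / (n_p - 1)
    ms_j = ss_j / (n_j - 1)
    ms_res = ss_res / ((n_p - 1) * (n_j - 1))

    # Solve the expected mean squares for the components.
    return {
        "person": max(0.0, (ms_p - ms_res) / n_j),
        "judge": max(0.0, (ms_j - ms_res) / n_p),
        "residual": ms_res,
    }


def g_coefficient(components, n_judges):
    """Relative coefficient for the mean of `n_judges` judges."""
    error = components["residual"] / n_judges
    total = components["person"] + error
    return components["person"] / total if total > 0 else 0.0


def judges_needed(components, target, max_judges=50):
    """Smallest number of judges reaching `target`, or None if never reached."""
    for n in range(1, max_judges + 1):
        if g_coefficient(components, n) >= target:
            return n
    return None


# Hypothetical scores: 5 skaters (rows) by 3 judges (columns).
scores = [
    [5.0, 5.5, 5.0],
    [4.0, 4.5, 4.5],
    [5.5, 6.0, 5.5],
    [3.5, 4.5, 4.0],
    [4.5, 5.0, 5.5],
]
components = variance_components(scores)
for n in (1, 2, 3, 5):
    print(n, "judges:", round(g_coefficient(components, n), 3))
print("judges needed for 0.90:", judges_needed(components, 0.90))
```

The first function plays the part of a G-study (estimating the components), and the other two play the part of planning a D-study (choosing how many judges to use).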

Generalizability theory is a statistical theory of measurement that expands classical test theory to include multiple sources of error and explicitly connects measurement operations to the purpose of measurement.

 

Generalizability Jargon

G-study and D-study. A G-study is a generalizability study. It addresses questions of how well measures taken in one context generalize to another. The G-study is essentially a study of the reliability of measurement. More formally, the G-study asks to what extent a sample of measurements generalizes to a universe of measurements. A universe is a superpopulation defined by measurement conditions. Conditions include such aspects as time of day, alternate forms, raters, union activity, psychotherapy, etc. A universe is the thing you want to generalize to. For example, in grading my exams in this class, I would like to think that the grades I assign would be very similar to those assigned by another person competent in psychometrics (e.g., Spector; I don't care if I agree with Penner). I would also like to think that the grades I assign would be very similar to those I would have given if the specific test questions had changed but the content covered was the same.

A D-study is a decision study. In a D-study, measures are taken to make a decision. Examples include whom to hire, whether one form of psychotherapy works better than another, and whether psychological stages of development can be skipped.

The difference between the G-study and the D-study is not necessarily one of data (although they can vary in this). The principal difference between the two is purpose.

The purpose of the G-study is to assess the adequacy (reliability, generalizability) of the measures. The G-study is logically prior to the D-study because the D-study assumes adequate measures.

Facet. A facet is a set of measurement conditions. It is conceptually similar to a variable in ANOVA, that is, a categorical variable. Examples include judges, workload, time of day, essay topic, etc. In GT, we ask whether the facet affects the measure. Is there a main effect of the facet on the target (thing to be measured)? Is there an interaction between facets?

Facets may be considered fixed or random. If fixed, the specified conditions are the only conditions of interest. You generalize only to them. If random, you want to generalize to a population which has been sampled. In that case the levels of the facet included in the G-study must be representative of the population (universe). For applied work, the choice of fixed versus random is almost always determined by whether the targets (things to be measured) are crossed with or nested within facets. In general, crossed targets imply fixed facets, while nested targets imply random facets. The reason is that the design determines whether different conditions will or will not contribute error variance in the D-study. If the facets will contribute error, you need to reflect this in your estimate of generalizability.

If the facets will not contribute error, your generalizability coefficient should reflect this as well.
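A small sketch may help show how the crossed versus nested distinction changes the coefficient. It assumes a single judge facet and variance components that were already estimated in a crossed G-study. When every target is rated by the same judges (crossed), differences in judge leniency affect all targets equally and drop out of the error for comparing targets. When each target gets its own judges (nested), those differences become error. The numbers and names below are hypothetical.

```python
def projected_coefficient(var_target, var_judge, var_residual, n_judges, design):
    """Generalizability coefficient for the mean of `n_judges` ratings.

    design="crossed": all targets share the same judges, so only the
    residual (target x judge interaction plus error) counts as error.
    design="nested": each target has different judges, so the judge
    main effect is added to the error as well.
    """
    if design == "crossed":
        error = var_residual / n_judges
    elif design == "nested":
        error = (var_judge + var_residual) / n_judges
    else:
        raise ValueError("design must be 'crossed' or 'nested'")
    return var_target / (var_target + error)


# Hypothetical variance components, two judges per target.
crossed = projected_coefficient(4.0, 1.0, 2.0, n_judges=2, design="crossed")
nested = projected_coefficient(4.0, 1.0, 2.0, n_judges=2, design="nested")

print(f"crossed: {crossed:.3f}")
print(f"nested:  {nested:.3f}")
```

With the same components, the nested design gives the lower coefficient because judge differences now count against it, which is the sense in which the coefficient should reflect whether the facet contributes error in the D-study.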