Shrout and Fleiss Expected Mean Squares and Computational Formulas
In general, all k judges rate all n targets, and the targets are random, that is, drawn from some larger population. You can still do reliability and generalizability studies if you have some nesting (that is, not all k judges see all n targets), but the appropriate models for data analysis are tricky. If at all possible, you should cross judges and targets. Otherwise you will probably need to hire a statistician unless you can find an example in the literature where the appropriate estimators are already worked out for you.
Shrout and Fleiss consider two major cases for analysis: Case 2, in which judges are a random facet, and Case 3, in which judges are a fixed facet. Remember that, according to the Book of Brannick, fixed means that in the D study judges are crossed with targets, and random means that in the D study judges are not crossed with targets.
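The linear model for the design is
$$x_{ij} = \mu + a_i + b_j + (ab)_{ij} + e_{ij},$$
where $x_{ij}$ is the rating that judge i gives to target j.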
If you have had analysis of variance, this should look vaguely familiar to you. It says that an observation is composed of the grand mean and some deviations from that mean. The a stands for the judge (leniency and severity; if judge were replaced by test item, it would refer to the difficulty of the item), b stands for the target (differences in the trait of interest; differences in ratee), ab stands for the interaction, and e stands for error.
Olympic diving is a handy example: the judges are the raters and the divers are the targets.
Typically, (ab) and e are not separately estimable because there is only one observation per cell. Error and the interaction are separately estimable if judges repeatedly rate targets. However, this produces other problems because (a) there are now repeated measures, and (b) people remember what their first rating was, and the effects of this are not in the model; that is, there is no term for this in the equation. The following is an ANOVA source table for the Shrout and Fleiss fully crossed design in which each judge rates each target once.
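| Source | df | name | Case 2 expected mean square | Case 3 expected mean square |
|---|---|---|---|---|
| Between targets | n - 1 | BMS | $k\sigma_T^2 + \sigma_I^2 + \sigma_E^2$ | $k\sigma_T^2 + \sigma_E^2$ |
| Within targets | | | | |
| between judges | k - 1 | JMS | $n\sigma_J^2 + \sigma_I^2 + \sigma_E^2$ | $n\theta_J^2 + f\sigma_I^2 + \sigma_E^2$ |
| residual | (k - 1)(n - 1) | EMS | $\sigma_I^2 + \sigma_E^2$ | $f\sigma_I^2 + \sigma_E^2$ |

Here $\sigma_T^2$, $\sigma_J^2$, $\sigma_I^2$, and $\sigma_E^2$ are the variances of the target (b), judge (a), interaction (ab), and error (e) terms. For fixed judges, $\theta_J^2 = \sum a_i^2/(k-1)$ takes the place of $\sigma_J^2$, and $f = k/(k-1)$. The expected mean squares are those given by Shrout and Fleiss.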
Generalizability coefficients from Shrout and Fleiss
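| Estimate | Case 2 (random judges) | Case 3 (fixed judges) |
|---|---|---|
| r (population value) | $\frac{\sigma_T^2}{\sigma_T^2 + \sigma_J^2 + \sigma_I^2 + \sigma_E^2}$ | $\frac{\sigma_T^2 - \sigma_I^2/(k-1)}{\sigma_T^2 + \sigma_I^2 + \sigma_E^2}$ |
| ICC(2,1); ICC(3,1) intraclass correlation case 2 or case 3, 1 judge | $\frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n}$ | $\frac{BMS - EMS}{BMS + (k-1)EMS}$ |
| ICC(2,k); ICC(3,k) intraclass correlation case 2 or case 3, all k judges | $\frac{BMS - EMS}{BMS + (JMS - EMS)/n}$ | $\frac{BMS - EMS}{BMS}$ |

In the first row, $\sigma_T^2$, $\sigma_J^2$, $\sigma_I^2$, and $\sigma_E^2$ are the target, judge, interaction, and error variances.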
ICC(3,k) is equal to Cronbach's alpha, which provides another interpretation of fixed judges or items that are 'just like these.'
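The equality with Cronbach's alpha can be checked numerically. The following is a minimal Python sketch, not part of the SAS programs, that treats each judge as an "item" and computes alpha from the judge variances and the variance of the target totals. The function names are made up for illustration, and the 6 x 4 ratings are assumed to be the Shrout and Fleiss example data (they are not printed in these notes).

```python
from statistics import variance

# One row per target, one column per judge.
# ASSUMPTION: these are the Shrout and Fleiss example ratings; they do not
# appear in these notes, so check them against ICCIN.SAS.
RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def cronbach_alpha(ratings):
    """Alpha with judges playing the role of test items."""
    k = len(ratings[0])
    judge_variances = [variance(col) for col in zip(*ratings)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(judge_variances) / total_variance)


def icc_3k(ratings):
    """ICC(3,k) = (BMS - EMS) / BMS from the two-way ANOVA."""
    n, k = len(ratings), len(ratings[0])
    correction = sum(sum(row) for row in ratings) ** 2 / (n * k)
    ss_targets = sum(sum(row) ** 2 for row in ratings) / k - correction
    ss_judges = sum(sum(col) ** 2 for col in zip(*ratings)) / n - correction
    ss_total = sum(x ** 2 for row in ratings for x in row) - correction
    bms = ss_targets / (n - 1)
    ems = (ss_total - ss_targets - ss_judges) / ((n - 1) * (k - 1))
    return (bms - ems) / bms


print(f"alpha    = {cronbach_alpha(RATINGS):.4f}")
print(f"ICC(3,k) = {icc_3k(RATINGS):.4f}")
```

Both lines should print the same value (about .91 for these data), which is the sense in which fixed judges behave like the fixed items of a test.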
Calculations of Generalizability Coefficients from the Shrout & Fleiss data as analyzed by SAS. (Review the SAS programs ICCIN.SAS and ICCOUT.SAS before reading the following section.)
From the SAS printout:
| Source | Shrout & Fleiss Name | df | Type III SS | MS |
|---|---|---|---|---|
| Judge | JMS | 3 | 97.4583 | 32.49 |
| Target | BMS | 5 | 56.2083 | 11.24 |
| Judge*Target | EMS | 15 | 15.2917 | 1.02 |
Note: do not round to two decimals until you are finished calculating; that is, do not round in intermediate steps. Rounding early sometimes causes huge errors in the final result. I recommend using a spreadsheet or calculator for best accuracy.
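The mean squares in the printout can be reproduced without SAS. The sketch below is one way to do it in Python, assuming the ratings are laid out with one row per target and one column per judge. The function name `mean_squares` is made up for illustration, and the 6 x 4 ratings are assumed to be the Shrout and Fleiss example data (they are not printed in these notes), so check them against ICCIN.SAS before relying on them.

```python
# One row per target, one column per judge.
# ASSUMPTION: these are the Shrout and Fleiss example ratings; they do not
# appear in these notes.
RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def mean_squares(ratings):
    """Return (BMS, JMS, EMS) for a fully crossed targets-by-judges table."""
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # number of judges
    grand_total = sum(sum(row) for row in ratings)
    correction = grand_total ** 2 / (n * k)

    # Sums of squares for targets (rows), judges (columns), and the total.
    ss_targets = sum(sum(row) ** 2 for row in ratings) / k - correction
    ss_judges = sum(sum(col) ** 2 for col in zip(*ratings)) / n - correction
    ss_total = sum(x ** 2 for row in ratings for x in row) - correction
    ss_residual = ss_total - ss_targets - ss_judges

    bms = ss_targets / (n - 1)
    jms = ss_judges / (k - 1)
    ems = ss_residual / ((n - 1) * (k - 1))
    return bms, jms, ems


bms, jms, ems = mean_squares(RATINGS)
# Round only when printing, never in the intermediate steps.
print(f"BMS = {bms:.4f}, JMS = {jms:.4f}, EMS = {ems:.4f}")
```

Run as written, this should print BMS = 11.2417, JMS = 32.4861, EMS = 1.0194, which are the Type III SS values divided by their df.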
Calculations
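Here n = 6 targets and k = 4 judges (from the df in the printout), and each mean square is SS/df carried to four decimals.

One Random Judge
$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n} = \frac{11.2417 - 1.0194}{11.2417 + 3(1.0194) + 4(32.4861 - 1.0194)/6} = .29$$

One Fixed Judge
$$\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS} = \frac{11.2417 - 1.0194}{11.2417 + 3(1.0194)} = .71$$

All K (four) Random Judges
$$\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (JMS - EMS)/n} = \frac{11.2417 - 1.0194}{11.2417 + (32.4861 - 1.0194)/6} = .62$$

All K (four) Fixed Judges
$$\mathrm{ICC}(3,k) = \frac{BMS - EMS}{BMS} = \frac{11.2417 - 1.0194}{11.2417} = .91$$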
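If you would rather not do these by hand, the four coefficients can be scripted. This is a minimal sketch using the Shrout and Fleiss computational formulas; the function name `shrout_fleiss_icc` and the dictionary layout are my own choices, not anything from the SAS programs. The mean squares are formed as SS/df from the printout so that nothing is rounded along the way.

```python
def shrout_fleiss_icc(bms, jms, ems, n, k):
    """Case 2 and Case 3 intraclass correlations from the ANOVA mean squares."""
    signal = bms - ems
    return {
        "ICC(2,1)": signal / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": signal / (bms + (k - 1) * ems),
        "ICC(2,k)": signal / (bms + (jms - ems) / n),
        "ICC(3,k)": signal / bms,
    }


# Mean squares = Type III SS / df from the printout, left unrounded.
n, k = 6, 4  # 6 targets, 4 judges
icc = shrout_fleiss_icc(
    bms=56.2083 / 5, jms=97.4583 / 3, ems=15.2917 / 15, n=n, k=k
)
for name, value in icc.items():
    print(f"{name} = {value:.2f}")
```

This should print .29, .71, .62, and .91 for the four coefficients, in that order.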
Note that it matters whether the judges are fixed or random. The greater the differences in means among the raters, the more it matters whether they are fixed or random. It also matters whether there are 1 or 4 judges. Very reliable estimates come from either good agreement among judges, larger numbers of judges, or both. You can control the reliability of the final results by increasing the number of judges or increasing the training of the judges or both, whichever is best (or necessary or cheapest) for your situation. Suppose you get a result like .62 and you want to have a reliability of .90. How many judges would you need?
Prophecy
A variant of the Spearman-Brown prophecy formula can be used to choose the number of judges for any given level of reliability. The same formula can also be used to choose the number of items in a test.
The formula is:
$$m = \frac{r^*(1 - r_L)}{r_L(1 - r^*)}$$
where m is the result, to be rounded up to the next highest integer, $r^*$ is an aspiration level, and $r_L$ is a reliability estimate, typically either ICC(2,1) or ICC(3,1).
Suppose we had the Shrout and Fleiss results and an aspiration of a reliability of .80 when judges are random. Then ICC(2,1) (one random judge) = .29 = $r_L$ and our aspiration is .80 = $r^*$. To find the number of judges, we just plug the numbers in:
$$m = \frac{.80(1 - .29)}{.29(1 - .80)} = \frac{.568}{.058} = 9.793.$$
Therefore, we would need 10 judges to get a reliability of at least .80. Using 10 judges should give us a reliability just above .80. Using 9 judges would result in a reliability of less than the desired .80, and fractional judges are in short supply.
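The claim about 9 versus 10 judges is easy to check. The sketch below pairs the prophecy formula with the ordinary Spearman-Brown step-up, m r / (1 + (m - 1) r), which runs the same relationship in the other direction. The function names `judges_needed` and `stepped_up` are made up for illustration.

```python
import math


def judges_needed(r_single, r_target):
    """Prophecy formula: (possibly fractional) number of judges required."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))


def stepped_up(r_single, m):
    """Spearman-Brown reliability of the average of m judges."""
    return m * r_single / (1 + (m - 1) * r_single)


m = judges_needed(0.29, 0.80)
print(f"m = {m:.3f}, so use {math.ceil(m)} judges")
for judges in (9, 10):
    print(f"{judges} judges -> reliability {stepped_up(0.29, judges):.3f}")
```

With ICC(2,1) = .29, nine judges come out a little under .80 and ten come out a little over, which is why we round up rather than to the nearest integer.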
Suppose we had the Shrout and Fleiss results but the judges are fixed. Then ICC(3,1) = .71 and our aspiration is still .80. The calculation is:
$$m = \frac{.80(1 - .71)}{.71(1 - .80)} = \frac{.232}{.142} = 1.63$$
Therefore we would need two judges.
Note that the same results can be applied to whole tests, but don't forget the number of items within the test. For example, the reliability of four random judges was .62. If we wanted a reliability of .80 for random judges, we could compute
$$m = \frac{.80(1 - .62)}{.62(1 - .80)} = \frac{.304}{.124} = 2.45$$
We would need 2.45 tests like this one to get a reliability of .80. But remember that the test we have contains 4 people. Therefore the number of people we actually need is 2.45*4 or 9.8 people. Rounding up, we would need 10 judges for a reliability of .80. This agrees with our first estimate (10 judges based on 1 random judge); the unrounded figures (9.8 here, 9.793 there) differ slightly because of rounding error.
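To see that rounding really is the only source of the difference, you can carry both routes at full precision. This is a rough sketch, assuming the sums of squares from the SAS printout and the Shrout and Fleiss formulas for ICC(2,1) and ICC(2,k); the variable names are mine.

```python
import math

# Mean squares = Type III SS / df from the SAS printout, left unrounded.
n, k = 6, 4
bms, jms, ems = 56.2083 / 5, 97.4583 / 3, 15.2917 / 15

icc_2_1 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
icc_2_k = (bms - ems) / (bms + (jms - ems) / n)


def prophecy(r_estimate, r_target):
    """Multiple of the current 'test length' needed to reach r_target."""
    return r_target * (1 - r_estimate) / (r_estimate * (1 - r_target))


# Route 1: start from one random judge.
judges_from_single = prophecy(icc_2_1, 0.80)
# Route 2: start from the four-judge composite, then multiply by 4.
judges_from_four = k * prophecy(icc_2_k, 0.80)

print(f"from ICC(2,1): {judges_from_single:.3f} judges")
print(f"from ICC(2,k): {judges_from_four:.3f} judges")
print(f"round up to {math.ceil(judges_from_single)} judges")
```

The two routes give the same unrounded number (about 9.80), so both call for 10 judges once you round up.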