How the stat packages compute a correlation matrix
Step 1. Start with the raw data. Variables are columns and people are rows, thus:
| Person | X1 | X2 | X3 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
| 2 | 2 | 3 | 4 |
| 3 | 1 | 2 | 3 |
| 4 | 5 | 4 | 3 |
| 5 | 4 | 4 | 4 |
| Means | 2.6 | 3.0 | 3.4 |
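As a minimal sketch of Step 1, the column means can be computed in plain Python. The names `data`, `columns`, and `means` are illustrative choices, not something the stat packages expose.

```python
# Raw scores from Step 1: one row per person, columns are X1, X2, X3.
data = [
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3],
    [5, 4, 3],
    [4, 4, 4],
]

n = len(data)                    # number of people (rows)
columns = list(zip(*data))       # transpose so each tuple is one variable
means = [sum(col) / n for col in columns]

print(means)  # [2.6, 3.0, 3.4]
```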
Step 2. Subtract the mean for each variable from each observation for that variable, that is, subtract the column mean from each entry in that column. Find the deviates!
Notation: $x = X - \mu$ for the population or $x = X - \bar{X}$ for the sample.
Variable in Deviation Form

| Person | x1 | x2 | x3 |
|---|---|---|---|
| 1 | -1.6 | -1 | -0.4 |
| 2 | -0.6 | 0 | 0.6 |
| 3 | -1.6 | -1 | -0.4 |
| 4 | 2.4 | 1 | -0.4 |
| 5 | 1.4 | 1 | 0.6 |
What are the means of the new variables?
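One way to check the answer is to compute the deviation scores and then average each new column. The sketch below uses plain Python with illustrative names (`deviations`, `deviation_means`). The means come out as zero, apart from floating-point noise, which the rounding hides.

```python
data = [
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3],
    [5, 4, 3],
    [4, 4, 4],
]

n = len(data)
means = [sum(col) / n for col in zip(*data)]

# Subtract each column mean from every entry in that column.
deviations = [[value - m for value, m in zip(row, means)] for row in data]

# Column means of the deviation scores, rounded to hide floating-point noise.
deviation_means = [round(sum(col) / n, 10) for col in zip(*deviations)]

for row in deviations:
    print([round(v, 1) for v in row])
print(deviation_means)
```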
Step 3. Create the SSCP (Sums of Squares and Cross Products) matrix.
This is done through a little matrix multiplication, thus:
$$
\mathrm{SSCP} = \mathbf{x}'\mathbf{x} =
\begin{bmatrix}
-1.6 & -0.6 & -1.6 & 2.4 & 1.4 \\
-1 & 0 & -1 & 1 & 1 \\
-0.4 & 0.6 & -0.4 & -0.4 & 0.6
\end{bmatrix}
\begin{bmatrix}
-1.6 & -1 & -0.4 \\
-0.6 & 0 & 0.6 \\
-1.6 & -1 & -0.4 \\
2.4 & 1 & -0.4 \\
1.4 & 1 & 0.6
\end{bmatrix}
$$
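SSCP Matrix in Symbols

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | $\sum x_1^2$ | | |
| 2 | $\sum x_2 x_1$ | $\sum x_2^2$ | |
| 3 | $\sum x_3 x_1$ | $\sum x_3 x_2$ | $\sum x_3^2$ |

SSCP Matrix in Numbers

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 13.2 | | |
| 2 | 7 | 4 | |
| 3 | 0.8 | 1 | 1.2 |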
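The matrix product in Step 3 can be sketched in plain Python, without a linear algebra library. The helper names `deviation_scores` and `sscp_matrix` are hypothetical, chosen only for this illustration. The full symmetric matrix is printed, and its lower triangle matches the SSCP numbers given for this example.

```python
def deviation_scores(data):
    """Subtract each column mean from the entries in that column."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[value - m for value, m in zip(row, means)] for row in data]


def sscp_matrix(x):
    """Return x'x: entry (i, j) is the sum over people of x_i * x_j."""
    k = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(k)]
            for i in range(k)]


data = [
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3],
    [5, 4, 3],
    [4, 4, 4],
]

sscp = sscp_matrix(deviation_scores(data))
for row in sscp:
    print([round(v, 2) for v in row])
# [13.2, 7.0, 0.8]
# [7.0, 4.0, 1.0]
# [0.8, 1.0, 1.2]
```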
Note that matrix multiplication pairs each row of the first matrix with a column of the second. Paired elements are multiplied and the products are then added. For example, the entry in the first row and first column of SSCP is the sum of the products of row 1 of x' and column 1 of x:

| Row 1 of x' | Column 1 of x | Product |
|---|---|---|
| -1.6 | -1.6 | 2.56 |
| -0.6 | -0.6 | 0.36 |
| -1.6 | -1.6 | 2.56 |
| 2.4 | 2.4 | 5.76 |
| 1.4 | 1.4 | 1.96 |

The products in the third column add to 13.2. This result goes into the first row and first column of SSCP. In symbols, the result is $\sum x_1^2$, the sum of the squared deviations of the first column.
The entry in the second row and first column of SSCP comes from the second row of x' and the first column of x, thus:

| Row 2 of x' | Column 1 of x | Product |
|---|---|---|
| -1 | -1.6 | 1.6 |
| 0 | -0.6 | 0 |
| -1 | -1.6 | 1.6 |
| 1 | 2.4 | 2.4 |
| 1 | 1.4 | 1.4 |

The products add to 7. In symbols, the result is the sum of the cross products of the deviations, or $\sum x_2 x_1$.
Step 4. Find the variance-covariance (VCV) matrix. The SSCP matrix shows the total sums of squares and cross products. The larger the number of observations, the larger the totals tend to be. We want to know the average sums of squares and cross products. These turn out to be extremely handy quantities in statistics. So handy, in fact, that they have special names. The average squared deviation is called a variance. The average cross product is called a covariance. The resulting matrix is called the variance-covariance matrix, or sometimes just the covariance matrix for short. I usually use the symbol VCV to represent this matrix. To get this matrix, you divide the entries in the SSCP matrix by N-1 if you are working with sample data and want to estimate the population variances and covariances (this is what the stat packages do). If you had the population, you would divide by N rather than N-1. I'm using N rather than N-1 in the symbols of my development because it reminds you that we are talking about an average here.
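Variance-Covariance Matrix in Symbols

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | $\frac{\sum x_1^2}{N}$ | | |
| 2 | $\frac{\sum x_2 x_1}{N}$ | $\frac{\sum x_2^2}{N}$ | |
| 3 | $\frac{\sum x_3 x_1}{N}$ | $\frac{\sum x_3 x_2}{N}$ | $\frac{\sum x_3^2}{N}$ |

Variance-Covariance Matrix in Numbers (using N-1 instead of N)

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 3.3 | | |
| 2 | 1.75 | 1 | |
| 3 | 0.2 | 0.25 | 0.3 |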
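Step 4 amounts to dividing every SSCP entry by N-1. A minimal sketch in plain Python follows. The function name `covariance_matrix` and its `ddof` parameter are hypothetical, introduced only for this example: `ddof=1` gives the N-1 divisor used for the numbers in this step, and `ddof=0` would give the population divisor N.

```python
def covariance_matrix(data, ddof=1):
    """Variance-covariance matrix: SSCP entries divided by (N - ddof)."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    x = [[value - m for value, m in zip(row, means)] for row in data]
    k = len(means)
    return [[sum(row[i] * row[j] for row in x) / (n - ddof)
             for j in range(k)]
            for i in range(k)]


data = [
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3],
    [5, 4, 3],
    [4, 4, 4],
]

vcv = covariance_matrix(data)  # sample estimate, divides by N - 1 = 4
for row in vcv:
    print([round(v, 2) for v in row])
# [3.3, 1.75, 0.2]
# [1.75, 1.0, 0.25]
# [0.2, 0.25, 0.3]
```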
The covariance matrix is so important that special symbols are used for the population values by convention (just as $\mu$ is used for the mean).
Symbols for the population variances and covariances

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | $\sigma_1^2$ | | |
| 2 | $\sigma_{21}$ | $\sigma_2^2$ | |
| 3 | $\sigma_{31}$ | $\sigma_{32}$ | $\sigma_3^2$ |
Note that the Greek symbol in all cells is sigma. For the variances, sigma is squared; for the covariances it is not squared.
In some areas of science, the variance and covariance have direct meaning. For example, we could talk about the variance of a distribution of silverware in squared inches. We can and sometimes do talk about pressure in pounds per square inch. In psychology, however, the raw units of measure rarely have intuitive meaning. In-class examinations have variances in squared points, but the meaning of a point varies from class to class. Note that the size of the covariance terms depends upon the constituent variances. Large covariances tend to involve one or both variables with large variances. In absolute value, the covariance can never be larger than the larger of the two constituent variances. It would be nice to standardize the covariance matrix so that the variances were all equal. Then the relative size of the covariances would be directly interpretable.
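The bound on the covariance follows from the Cauchy-Schwarz inequality. A short sketch, where $s_{12}$ (the covariance of variables 1 and 2) and $s_1$, $s_2$ (their standard deviations) are labels assumed here for illustration:

```latex
\[
|s_{12}| \;\le\; s_1 s_2 \;\le\; \max\left(s_1^2,\; s_2^2\right)
\]
```

With the example data, the covariance of variables 1 and 2 is 1.75, the product of their standard deviations is $\sqrt{3.3 \times 1} \approx 1.82$, and the larger variance is 3.3, so $1.75 \le 1.82 \le 3.3$.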
Step 5. Standardize the variance-covariance matrix. Do this by dividing each entry by two standard deviations, one for the relevant row and one for the relevant column. So in the first row and column, divide by the product of the standard deviation of the first variable with itself, which is the variance of the first variable. For row 2, column 1, divide by the product of the standard deviation of variable 1 and the standard deviation of variable 2. The resulting standardized VCV matrix is called the correlation matrix.
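Correlation Matrix in Symbols (here $s_{21}$ is the covariance of variables 2 and 1, and $s_1$ is the standard deviation of variable 1)

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | $\frac{s_1^2}{s_1 s_1}$ | | |
| 2 | $\frac{s_{21}}{s_2 s_1}$ | $\frac{s_2^2}{s_2 s_2}$ | |
| 3 | $\frac{s_{31}}{s_3 s_1}$ | $\frac{s_{32}}{s_3 s_2}$ | $\frac{s_3^2}{s_3 s_3}$ |

Correlation Matrix in Numbers

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 1 | | |
| 2 | .96 | 1 | |
| 3 | .20 | .46 | 1 |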
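Putting the five steps together, a minimal plain-Python sketch of the whole computation might look like this. The function name `correlation_matrix` is a hypothetical choice for this example, and it uses the N-1 divisor. Because the divisor cancels when the matrix is standardized, N would give the same correlations.

```python
import math


def correlation_matrix(data):
    """Steps 1-5: means, deviations, SSCP, VCV (N - 1), standardize."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]                        # Step 1
    x = [[value - m for value, m in zip(row, means)] for row in data]   # Step 2
    k = len(means)
    sscp = [[sum(row[i] * row[j] for row in x) for j in range(k)]       # Step 3
            for i in range(k)]
    vcv = [[entry / (n - 1) for entry in row] for row in sscp]          # Step 4
    sd = [math.sqrt(vcv[i][i]) for i in range(k)]
    return [[vcv[i][j] / (sd[i] * sd[j]) for j in range(k)]             # Step 5
            for i in range(k)]


data = [
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3],
    [5, 4, 3],
    [4, 4, 4],
]

r = correlation_matrix(data)
for row in r:
    print([round(v, 2) for v in row])
# [1.0, 0.96, 0.2]
# [0.96, 1.0, 0.46]
# [0.2, 0.46, 1.0]
```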
The correlation matrix is also so important that population correlations have their own symbols by convention. The matrix is usually denoted by R.
Symbols for the population correlations

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 1 | | |
| 2 | $\rho_{21}$ | 1 | |
| 3 | $\rho_{31}$ | $\rho_{32}$ | 1 |
Note that the ones on the main diagonal are the result of standardizing the variance: in the variance-covariance matrix the variance is $\sigma_1^2$, and we standardize by $\sigma_1 \sigma_1$, which of course is equal to $\sigma_1^2$.
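You need to remember two handy formulas for the covariance, because they are commonly used. Here $x_1$ and $x_2$ are deviation scores, and $s_1$ and $s_2$ are the standard deviations of the two variables:

$$
\mathrm{cov}(X_1, X_2) = \frac{\sum x_1 x_2}{N}
$$

$$
\mathrm{cov}(X_1, X_2) = r_{12}\, s_1 s_2
$$

The second formula is so because of the formula for the correlation, namely:

$$
r_{12} = \frac{\mathrm{cov}(X_1, X_2)}{s_1 s_2}
$$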
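The agreement between the two covariance formulas can be checked numerically for X1 and X2. This is a sketch under two assumptions that are not part of the text above: it divides by N-1 to match the numbers in Step 4, and it computes r independently as the average product of z-scores so that the check is not circular.

```python
import math

x1_raw = [1, 2, 1, 5, 4]
x2_raw = [2, 3, 2, 4, 4]
n = len(x1_raw)


def deviations(values):
    """Return the values with their mean subtracted."""
    m = sum(values) / len(values)
    return [v - m for v in values]


x1, x2 = deviations(x1_raw), deviations(x2_raw)

# First formula: average cross product of the deviations (N - 1 divisor here).
cov_direct = sum(a * b for a, b in zip(x1, x2)) / (n - 1)

# Standard deviations with the same divisor.
s1 = math.sqrt(sum(a * a for a in x1) / (n - 1))
s2 = math.sqrt(sum(b * b for b in x2) / (n - 1))

# Correlation computed separately, as the average product of z-scores.
r = sum((a / s1) * (b / s2) for a, b in zip(x1, x2)) / (n - 1)

# Second formula: covariance recovered from r and the standard deviations.
cov_from_r = r * s1 * s2

print(round(cov_direct, 4), round(cov_from_r, 4))  # 1.75 1.75
```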