Correlation and Regression
There are two major types of data analysis. One type is applied to nominal independent variables; the other is applied to continuous independent variables. (Most data analysis applies to continuous dependent variables.) We will spend a lot of time later on analyzing nominal independent variables, including the t-test and analysis of variance (ANOVA). Now we will talk about analyzing continuous independent variables. The simplest method for doing this is linear correlation.
Linear correlation
Linear correlation deals with techniques that summarize the relations between variables. Linear means straight line. Correlation means co-relation, or the degree to which two variables "go together" or, more technically, covary. Linear correlation means going together in a straight line. The correlation coefficient is a number that summarizes the direction and degree (closeness) of the linear relation between two variables.
The correlation coefficient is also known as the Pearson Product-Moment Correlation Coefficient. The sample value is called r, and the population value is called ρ (rho). The correlation coefficient can take values from -1 through 0 to +1. The sign (+ or -) of the correlation affects its interpretation. When the correlation is positive (r > 0), as the value of one variable increases, so does the other. For example, on average, as height in people increases, so does weight.
Example of a Positive Correlation
| N | Ht. (in.) | Wt. (lbs.) |
| 1 | 60 | 102 |
| 2 | 62 | 120 |
| 3 | 63 | 130 |
| 4 | 65 | 150 |
| 5 | 65 | 120 |
| 6 | 68 | 145 |
| 7 | 69 | 175 |
| 8 | 70 | 170 |
| 9 | 72 | 185 |
| 10 | 74 | 210 |
If the correlation is positive, when one variable increases, so does the other.
If a correlation is negative, when one variable increases, the other variable decreases. This means there is an inverse or negative relationship between the two variables. For example, as study time increases, the number of errors on an exam decreases.
Example of a Negative Correlation
| N | Study Time (minutes) | # Errors |
| 1 | 90 | 25 |
| 2 | 100 | 28 |
| 3 | 130 | 20 |
| 4 | 150 | 20 |
| 5 | 180 | 15 |
| 6 | 200 | 12 |
| 7 | 220 | 13 |
| 8 | 300 | 10 |
| 9 | 350 | 8 |
| 10 | 400 | 6 |
If the correlation is negative, when one variable increases, the other decreases.
If there is no relationship between the two variables, then as one variable increases, the other variable neither increases nor decreases. In this case, the correlation is zero. For example, if we measure the SAT-V scores of college freshmen and also measure the circumference of their right big toes, there will be a zero correlation.
Example of a Zero Correlation
| N | SAT-V | Toe Size |
| 1 | 450 | 1.7 |
| 2 | 480 | 1.8 |
| 3 | 500 | 1.6 |
| 4 | 510 | 1.8 |
| 5 | 520 | 1.9 |
| 6 | 550 | 1.7 |
| 7 | 600 | 1.6 |
| 8 | 630 | 1.7 |
| 9 | 650 | 1.9 |
| 10 | 700 | 1.7 |
Note that as either toe size or SAT increases, the other variable stays the same on average.
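To see the three cases as numbers, here is a minimal Python sketch that computes the correlation for each of the three example data sets. The function name pearson_r and the variable names are ours, and the function uses the deviation-score form of the Pearson formula, which is a standard equivalent of the definition rather than anything specific to these notes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Data copied from the three example tables.
height = [60, 62, 63, 65, 65, 68, 69, 70, 72, 74]
weight = [102, 120, 130, 150, 120, 145, 175, 170, 185, 210]
study_minutes = [90, 100, 130, 150, 180, 200, 220, 300, 350, 400]
errors = [25, 28, 20, 20, 15, 12, 13, 10, 8, 6]
sat_v = [450, 480, 500, 510, 520, 550, 600, 630, 650, 700]
toe_size = [1.7, 1.8, 1.6, 1.8, 1.9, 1.7, 1.6, 1.7, 1.9, 1.7]

r_positive = pearson_r(height, weight)
r_negative = pearson_r(study_minutes, errors)
r_zero = pearson_r(sat_v, toe_size)

print(f"height and weight:     r = {r_positive:.2f}")
print(f"study time and errors: r = {r_negative:.2f}")
print(f"SAT-V and toe size:    r = {r_zero:.2f}")
```

The three values come out at roughly .96, -.93, and -.01, matching the positive, negative, and zero labels on the tables.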
Some other examples of positive, negative, and zero correlations:
| Variable X | Variable Y | Correlation |
| Salary | Taxes paid | Positive |
| Shyness | N of people greeted at party | Negative |
| Price of car | Prestige of car | Positive |
| Price of tennis shoe | Foot support | Zero |
| Time of use of flashlight | Battery life | Negative |
| Weight in lbs. | Average daily caloric intake | Positive |
| Price of quartz watch | Accuracy of time kept | Zero |
| Salary of sales people | Number of cars sold | Positive |
| Instructor knowledge of subject matter | Clarity of presentation | Zero? (I don't really know) |
Thus far, we have shown examples in which the correlation (co-relation) between the two variables is less than perfect. For example, in the positive correlations we have seen, when one variable increases, so does the other, on average, but not necessarily for every given score. When correlations are perfect, an increase in one variable results in a proportionate increase or decrease in the other. For example, if salary is determined entirely by commission, the function might look like this:
| N | Cars sold | Salary ($) |
| 1 | 10 | 1000 |
| 2 | 15 | 1500 |
| 3 | 20 | 2000 |
| 4 | 25 | 2500 |
| 5 | 30 | 3000 |
| 6 | 35 | 3500 |
| 7 | 40 | 4000 |
| 8 | 45 | 4500 |
This example shows a perfect positive correlation. The value of the correlation coefficient that corresponds to the example is r = 1.00. For a perfect negative relationship, r = -1.00, and for no relationship, r = 0.00.
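The commission example can be checked with a short Python sketch. The function pearson_r is our own helper (the deviation-score form of the Pearson formula), and the "dollars owed" column is a hypothetical variable we made up only to illustrate a perfect negative relationship; it is not part of the example above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Commission data from the table: every car sold adds exactly $100.
cars_sold = [10, 15, 20, 25, 30, 35, 40, 45]
salary = [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]

# Hypothetical variable for illustration only: it falls by exactly $100
# for every car sold, so it is a perfect negative function of cars sold.
dollars_owed = [5000 - s for s in salary]

r_perfect_positive = pearson_r(cars_sold, salary)
r_perfect_negative = pearson_r(cars_sold, dollars_owed)

print(f"cars sold and salary:       r = {r_perfect_positive:.2f}")
print(f"cars sold and dollars owed: r = {r_perfect_negative:.2f}")
```

Because every point falls exactly on a straight line in both cases, the two coefficients sit at the limits of the scale, +1 and -1.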
The conceptual (definitional) formula of the correlation coefficient is:
$$ r = \frac{\sum z_X z_Y}{N} $$
where $z_X$ is X in z-score form,
$z_Y$ is Y in z-score form,
and Σ and N have their customary meaning (summation and the number of pairs of scores). This says that r is the average cross-product of z-scores.
Example: Height and Weight Revisited
| N | Ht | Wt | Zht | Zwt | Zh*Zw |
| 1 | 60 | 102 | -1.50 | -1.43 | 2.15 |
| 2 | 62 | 120 | -1.06 | -0.90 | 0.96 |
| 3 | 63 | 130 | -0.84 | -0.61 | 0.51 |
| 4 | 65 | 150 | -0.40 | -0.02 | 0.01 |
| 5 | 65 | 120 | -0.40 | -0.90 | 0.36 |
| 6 | 68 | 145 | 0.26 | -0.17 | -0.04 |
| 7 | 69 | 175 | 0.48 | 0.72 | 0.35 |
| 8 | 70 | 170 | 0.70 | 0.57 | 0.40 |
| 9 | 72 | 185 | 1.15 | 1.01 | 1.16 |
| 10 | 74 | 210 | 1.59 | 1.75 | 2.77 |
| Mean | 66.8 | 150.7 | 0 | 0 | 0.96 |
| S | 4.54 | 33.95 | 1 | 1 | |
Points to notice: The mean height is 66.8 inches and the SD is 4.54 inches. The first height is 60 inches, which is 1.50 standard deviations below the mean, or a z-score of -1.50. The first weight is 102 pounds, which is 1.43 standard deviations below the mean weight: z = (102-150.7)/33.95 = -1.43. The product of the two z-scores is 2.15 (-1.50*-1.43 = 2.15). If we average the products, we get .96, which is the correlation coefficient. (The standard deviations here were computed with N-1 in the denominator, so the average is the sum of the products, 8.63, divided by N-1 = 9.)
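The table can be reproduced with a few lines of Python. This is a sketch under one assumption that we inferred from the numbers rather than from the text: the standard deviations (4.54 and 33.95) match the sample formula with N-1 in the denominator, so the products are averaged over N-1 as well. The helper name z_scores is ours.

```python
import statistics

height = [60, 62, 63, 65, 65, 68, 69, 70, 72, 74]
weight = [102, 120, 130, 150, 120, 145, 175, 170, 185, 210]

def z_scores(values):
    """Convert raw scores to z-scores using the sample SD (N - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # stdev divides by N - 1
    return [(v - mean) / sd for v in values]

z_ht = z_scores(height)
z_wt = z_scores(weight)
products = [zh * zw for zh, zw in zip(z_ht, z_wt)]

# Average cross-product; N - 1 matches the SD used for the z-scores.
r = sum(products) / (len(products) - 1)

for n, (zh, zw, p) in enumerate(zip(z_ht, z_wt, products), start=1):
    print(f"{n:2d}  {zh:6.2f}  {zw:6.2f}  {p:6.2f}")
print(f"r = {r:.2f}")
```

Dividing the same sum by N instead of N-1 would give a smaller value (about .86), which is why the divisor has to agree with the one used for the standard deviations.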
Why does the correlation coefficient have a maximum of 1, and a min of -1? Why is the correlation positive when both increase together?
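Let's look at graphs of height and weight. First in raw scores:
[Figure: scatterplot of weight against height in raw scores]
Now in z-scores:
[Figure: scatterplot of weight against height in z-scores]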
Points to notice:
Regression
Correlation and regression are closely related mathematically and in use. We use the correlation coefficient to summarize the relation between two variables. Regression is used to predict values of one variable from the values of another variable. We are going to talk about linear regression, which is also based on the idea of a straight line. If we draw a line to predict the values of Y from the values of X, we have a regression line. For example,
[Figure: scatterplot of weight against height with the regression line drawn through the points]
Note that the line passes through the means of both height and weight. The line passes close to the points (height-weight pairs), but it does not and cannot touch all of them unless the correlation is perfect. The line is called a regression line. Like any line, it can be described by an equation. You might recall from geometry or algebra the equation
Y = mx + b, which describes a line.
The standard formula used in linear regression is
Y = a + bX, where
X and Y are the independent and dependent variables (X is height and Y is weight in our example), a is the intercept, and b is the slope (also known as the regression weight).
The b weight or regression weight is a slope. A slope indicates the steepness of a line. You may recall that slope equals rise over run, or in our case:
$$ b = \frac{\Delta Y}{\Delta X} = \frac{Y_2 - Y_1}{X_2 - X_1} $$
(slope equals change in Y over change in X; rise over run)
The slope tells us how much Y changes when X changes by 1 unit. In our example, the line has a slope of 7.15. This means that as height increases by 1 inch, predicted weight increases by 7.15 pounds.
The intercept (or Y intercept) is the value where the line crosses the Y-axis (where X is zero). It is found by
$$ a = \bar{Y} - b\bar{X} $$
The actual result is not very meaningful in our example. It says essentially that if you were zero inches tall, we would expect you to weigh -327 lbs. The intercept has the function of moving the entire line up or down the Y-axis so that the line falls in the middle of the points.
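As a check on the intercept, here is the arithmetic with the values used in this example (mean weight 150.7 lbs., mean height 66.8 in., slope 7.15). The symbols $\bar{Y}$ and $\bar{X}$ for the two means are our notation:

$$ a = \bar{Y} - b\bar{X} = 150.7 - 7.15(66.8) = 150.7 - 477.62 = -326.92 \approx -327 $$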
The correlation coefficient and regression coefficient are closely related. The correlation coefficient is in fact the slope of the regression line if both X and Y are measured in z-scores (so that their means are zero and standard deviations are one). The correlation coefficient says how many standard score units Y changes when X changes by 1 standard score unit (in absolute value, between zero and 1). Recall that for our example, the correlation between height and weight was .96, so that when height in standard scores increases by 1, we can expect weight to increase by .96 in standard scores. Remember that the slope indicates rise over run, and the correlation indicates standardized rise over standardized run. We use the regression coefficient for raw scores, so we need to move from standardized scores to raw scores. We can do this by multiplying the correlation coefficient by the ratio of the standard deviations of the Y and X variables, thus:
$$ b = r \frac{S_Y}{S_X} $$
In our example, the regression weight can be found by
b = .96*(33.95/4.54) = .96*7.48 = 7.15 (carrying the unrounded correlation, .9566, through the calculation; with r rounded to .96 the product is 7.18).
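The step from the correlation to the raw-score slope can be sketched in Python. The helper pearson_r and the variable names are ours. The cross-check at the end uses the usual least-squares slope (sum of cross-products of deviations over the sum of squared X deviations), which these notes do not state, so treat it as an outside assumption.

```python
import math
import statistics

height = [60, 62, 63, 65, 65, 68, 69, 70, 72, 74]
weight = [102, 120, 130, 150, 120, 145, 175, 170, 185, 210]

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the two means."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

r = pearson_r(height, weight)
s_x = statistics.stdev(height)   # SD of X (height), N - 1 denominator
s_y = statistics.stdev(weight)   # SD of Y (weight), N - 1 denominator

# Slope in raw scores: correlation times the ratio of the SDs.
b = r * s_y / s_x

# Cross-check (assumption: standard least-squares slope formula).
mean_x = statistics.mean(height)
mean_y = statistics.mean(weight)
b_direct = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(height, weight))
    / sum((x - mean_x) ** 2 for x in height)
)

print(f"r = {r:.4f}, S_Y/S_X = {s_y / s_x:.2f}, b = {b:.2f}")
print(f"b from least squares = {b_direct:.2f}")
```

Both routes give a slope of about 7.15 pounds per inch.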
To use regression to make predictions, we just plug numbers into the equation or use the graph. For example, if someone were 68 inches tall, we would predict their weight to be 68*7.15 - 327, or 159.2 pounds. Compare this with the graph.
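Plugging numbers into the equation is easy to script. The sketch below uses the rounded coefficients quoted in the text (a = -327, b = 7.15); the function name predict_weight and the residual column (observed minus predicted weight) are our additions for illustration.

```python
# Rounded coefficients quoted in the text.
A = -327.0   # intercept, in pounds
B = 7.15     # slope, in pounds per inch

def predict_weight(height_in, a=A, b=B):
    """Predicted weight in pounds for a height in inches: Y = a + bX."""
    return a + b * height_in

predicted_68 = predict_weight(68)
print(f"Predicted weight at 68 in.: {predicted_68:.1f} lbs.")

# Predictions for the ten people in the example, next to what was observed.
height = [60, 62, 63, 65, 65, 68, 69, 70, 72, 74]
weight = [102, 120, 130, 150, 120, 145, 175, 170, 185, 210]
residuals = [w - predict_weight(h) for h, w in zip(height, weight)]

for h, w, e in zip(height, weight, residuals):
    print(f"{h} in.  observed {w}  predicted {predict_weight(h):.1f}  residual {e:+.1f}")
```

The residuals show how far each person falls from the line. They would all be zero only if the correlation were perfect.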
Both correlation and regression show the linear relation between two variables. Correlation shows this information in standard scores; regression shows it in raw scores. The slope in regression is the correlation times the ratio of the two standard deviations. The intercept in z-scores is zero. The intercept in raw scores is the place where the regression line crosses the Y-axis (where X is zero).
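These summary points can be checked by fitting the line twice, once to the raw scores and once to the z-scores. This is a minimal Python sketch. The helpers z_scores and fit_line are ours, and fit_line uses the usual least-squares formulas, which we are assuming rather than taking from the text above.

```python
import statistics

height = [60, 62, 63, 65, 65, 68, 69, 70, 72, 74]
weight = [102, 120, 130, 150, 120, 145, 175, 170, 185, 210]

def z_scores(values):
    """Convert raw scores to z-scores using the sample SD (N - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line Y = a + bX."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    b = (
        sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        / sum((xi - mean_x) ** 2 for xi in x)
    )
    a = mean_y - b * mean_x
    return a, b

a_raw, b_raw = fit_line(height, weight)
a_z, b_z = fit_line(z_scores(height), z_scores(weight))

print(f"raw scores: a = {a_raw:.1f}, b = {b_raw:.2f}")
print(f"z-scores:   a = {a_z:.2f}, b = {b_z:.2f}")
```

In raw scores the slope is about 7.15 and the intercept about -327. In z-scores the intercept is zero (up to rounding error) and the slope is the correlation, about .96.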