SAS Manual
PSY 6217
Univariate Statistics
by
Michael T. Brannick, Ph.D.
Contents
SAS Introduction
DATA and PROC Steps
DATA Step
  Reading and Writing Data
    Reading
    Writing
  Manipulating Data
    Arithmetic Operations
    Functions
    Internal Counters
    OUTPUT Statement
    DO Statement
    ARRAY Statement
    Recoding
  Selecting & Concatenating (Merge & Stack)
    MERGE
    DELETE
    DROP
PROC Step Utilities
  PROC SORT
  PROC STANDARD
Exploring Your Data
  PROC PRINT
  PROC MEANS
  PROC UNIVARIATE
  PROC PLOT & GPLOT
Statistical Analyses
  PROC CORR
    Example 1: Height & Weight
    Example 2: Pairwise vs. Listwise Deletion
    Example 3: BY Statement
  PROC GLM
    Example 1: t-test
    Example 2: 3 X 2 ANOVA
    Example 3: 2 factor ANOVA with Repeated Measures
    Example 4: ANCOVA for Pizza Sales
    Example 5: Aptitude Treatment Interaction with Post Hoc Test
  PROC LOGISTIC
    Example 1: Heart Attack Relapse
  PROC REG
    Example 1: Simple Regression
    Example 2: Multiple Regression with Diagnostics
    Example 3: Output and Partial Correlation
    Example 4: Variable Selection for Model Building
    Example 5: Nonlinear Relations
SAS Introduction
SAS stands for the
Statistical Analysis System. It was
written by statisticians for statisticians.
SAS is extremely powerful, flexible and opaque. This manual introduces many of the features
of SAS that I have found most useful in my work as a psychologist. There is much more to SAS than what is shown
here. Chances are good that whatever
the analysis you want, SAS has it. It
just may be covered in other documentation. Check the online SAS System help
for an overview of all that SAS does.
Getting in touch with SAS
SAS is networked and can be
accessed from the open use computing labs.
You must have a valid computer account to use the computing labs and
SAS. You can also lease SAS through the
College of Arts and Sciences. Once you
install the system on your PC or find an open computer at a lab, you can invoke
SAS by double clicking its icon.
You will use three primary windows
to deal with SAS for Windows: a program window, a log window, and an output
window. The program window is where you
type in your program, that is, where you issue commands for SAS to
execute. The log window is the place
where SAS tells you what it thinks you said and shows you any errors or
differences of opinion. It also
contains little technical notes to thrill geeks. The output window contains the results of the SAS program; that
is, it contains the output that you requested (in the case of a successful run)
or nothing (in the case of a completely unsuccessful run).
SAS Help
Help is one of the main menu
items at the top of the SAS interface.
Many of the topics in this document are covered in detail in the Help
materials. To find explicit discussion
of many of the topics in this document, click Help on the main menu, then 1)
SAS System, 2) Base SAS Documentation, and 3) Language Reference. At that point, you can choose such topics as
SAS Functions or Informats.
Sample Program
DATA D1;
INPUT A 1 B 3;
CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
PROC PRINT;
RUN;
SAS Grammar
SAS reads along until it
encounters a semicolon (;). Then it
stops and interprets everything up to the semicolon as either an instruction or
data. Once upon a time various
instructions had to fall within certain columns or else the computer became
confused. The semicolon as a delimiter
ended this necessity. This is nice
because you don't have to worry about columns or even lines. The statement
D = A+B+C;
could legally be spread over several lines, as follows, and would produce the same result:
D =
A+
B
+ C;
The one exception to the
semicolon at the end is when data are input.
In the above file, note that data immediately follow the CARDS
statement. SAS goes until the
semicolon, and then backs up one row.
SAS figures therefore that whatever precedes PROC PRINT; must be the end
of the data. If you want to use a
semicolon to end the data, put it one line below the last line of data, like
this:
1 2 2
4 4 5
2 2 3
;
PROC UNIVARIATE PLOT NORMAL;
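The following would produce
an error in the SAS log:
1 2 2
4 4 5
2 2 3
PROC UNIVARIATE
PLOT NORMAL;
because the first semicolon falls on the line after PROC UNIVARIATE. SAS backs up
only one line from that semicolon, so it would think that
PROC UNIVARIATE is part of the data.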
DATA and PROC Steps
SAS divides the program into
two parts or steps, DATA and PROC. In
the DATA step you describe your data to SAS, as well as create and manipulate
variables. In the PROC step you tell
SAS what procedure to use, that is, what data analysis you want (e.g.,
correlation, regression, ANOVA, etc.).
The DATA step
Begin the data step with the
word DATA followed by a name. In the
above example, you can see DATA D1; SAS
interprets this to mean "create a data set and call it D1."
INPUT STATEMENT
The INPUT statement tells
SAS what variables are read from a data set.
In this case, there are two variables called A and B, and they are read
from columns 1 and 3. Variable A is
read from column 1 and variable B is read from column 3.
Data contained in multiple
columns are handled by linking column numbers with a dash, like this:
INPUT A 1 B 2-3;
SAS would interpret this to
mean "find data for variable A in column 1, and data for B in columns 2
and 3." Blanks within a number are
illegal and will result in an error.
For example
INPUT A 1-3;
would result in an error in
reading the data in our example because of the blanks.
CARDS;
1 2
2 2
Note that CARDS is followed
by data in columns 1 and 3 but a blank in column 2, therefore the error caused
by the input statement INPUT A 1-3;
Compute statements such as
D = A+B+C;
also belong in the data
step. SAS interprets this statement to
mean "create a new variable called D.
Define D as the sum of A, B, and C. Compute the result for each row
(person) and collect this as part of the data contained in set D1." Note that D will be missing when any of A, B
and C are missing. Compute statements
can be quite complex. You can find the
natural logarithm, do modular arithmetic or operate on DO loops with compute
statements.
The CARDS statement tells
SAS that the data follow immediately, and that they are to be read from the
same file that contains the SAS instructions.
The PROC step
You run the PROC step after
telling SAS all about your data. Once
SAS knows what and where your data are, it can do what you want it to with
them. The PROC step begins with the
word "PROC" followed by the name of the procedure you want. In our sample
program, the only PROC is PRINT. PRINT
will cause SAS to print out all of the data we input. This is useful to check and see that SAS is analyzing the data we
want it to. It is surprisingly common
to find that the data are a row or column off so that SAS isn't actually
analyzing the data you want it to.
PRINT lets us verify that the data are correct. In SAS there are many, many PROCs to choose
from, such as ANOVA.
SAS Output for the Sample Program
Results of PROC PRINT:
OBS    A    B
  1    1    2
  2    2    2
  3    2    3
  4    3    4
  5    2    3
  6    1    3
  7    2    1
  8    1    2
  9    4    4
 10    2    2
This
shows that SAS thinks there are records on 10 people, each represented by a
row. There are two columns,
corresponding to variables A and B. SAS
thinks that the data for the first person include a "1" for variable
A and a "2" for variable B.
Data Step
DATA statement. You must start the Data step with the DATA statement. Following the DATA statement is the name of
the data set. For example, if you write
'DATA D1;' you create a dataset named D1.
The name of the dataset cannot be more than 8 characters long. Start with a letter. You can also use numbers and the underscore
(_). For example, you could create
datasets named D1, D2 or ALL_SET. The
following would create error messages: 'DATA 1234;' (begins with a digit)
and 'DATA ALLTOOMANY;' (too many
characters).
Reading and Writing Data
Reading
Records and Observations.
Your input data file consists of one or more lines of data. Each line of data in your input data file is
a record. SAS keeps track of the number of records
that are read from your file. Your data
file is a collection of numbers that describes one or more entities, objects,
or people. The entities or objects or
people of interest are called OBSERVATIONS (OBS) by SAS. There may be more than one record for each
observation. For example, you might
have asked people to respond to 150 questions, and have room for 80 columns on
a record in your data file. In that
case, you would need at least 2 records per observation.
Missing Data. When you
collect data, it is almost inevitable that you do not collect all variables for
all people. In surveys, for example,
some people omit their age or skip one or more items on the survey. SAS represents missing data with a period
(.). You can explicitly tell SAS that a
value is missing by putting a period in the data file. However, you can also leave the missing item
blank and tell SAS exactly where to look for the values. I typically leave missing values as
blanks. So long as you specify the
exact location of your numeric variables, SAS will substitute the missing value
for blanks. (If you are reading alpha
[letters and symbols] instead of numbers, the value for blank will be legal and
read by SAS. You can deal with this
later by telling SAS to substitute missing for blank.)
INPUT Statement. You need to include an INPUT statement for any dataset that you
want SAS to read. You may read data
from either (a) the file that is in your SAS Program Window or (b) an external
file, such as a file on your hard drive.
For example, you might say
INPUT A B C;
SAS interprets this to mean read three
variables, named A, B and C respectively.
Input Formats
SAS allows many different instructions,
called formats, about how to read data.
I will describe three main types and give examples of each.
Free (List) format. The simplest format is free or list format. For example
INPUT X Y Z A1 A2 A3;
This instructs SAS to read six variables (X
through Z and A1 through A3). SAS will
read numbers from the following data and assign them serially to each of the
six variables. The first number SAS
finds will be assigned to X, the second to Y, and so forth, until it gets to A3
with the sixth number. The seventh
number will be assigned to the second observation (person) for X. SAS knows when numbers start and end through
blank spaces. Each time SAS finds a
blank space or the end of a record it considers one number to end and another
to begin.
Free format is attractive because it is
simple. It has a major drawback,
however. Because SAS skips over
blanks, it does not distinguish between intentional and unintentional
blanks. If a value for a variable is missing,
SAS will just read the value for the next variable and place it (wrongly) into
the variable whose value is missing. If you want to use
free format and you have missing data, you must put a single period (.) to stand for
each missing value. SAS considers a
period listed by itself (without adjacent numbers) to be a symbol for missing
data.
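As a minimal sketch of this rule, the program below reads three variables in free format. The dataset name FREE and the data values are made up for illustration. The period on the second data line holds the place of a missing Y, so Z still receives the right number.

```sas
DATA FREE;
   INPUT X Y Z;      /* free (list) format: values are separated by blanks */
   CARDS;
1 2 3
4 . 6
7 8 9
;
PROC PRINT DATA=FREE;
RUN;
```

If the period were left out of the second line, SAS would take 6 as the value of Y and then go on to the next line to find a value for Z.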
Column format. With column format, you name the variable and then tell SAS the
number of the column(s) in which to find the data for that variable. For example
INPUT A 1 B 2-3 C 10-15;
This says to find the data for variable A in
column 1, the data for variable B in columns 2 and 3 and the data for variable
C in columns 10 through 15.
Alpha variables ($). The default in SAS is to assume that variables are numeric, that
is, they are numbers rather than letters or symbols. If you want variables that are alpha rather than numeric, such as
names or social security numbers, you communicate this to SAS with the dollar sign following the
variable name. For example
INPUT NAME $ 1-20 A1 21-22 A2 23-24;
This says to read the alpha variable NAME in
the first 20 columns, and then to read A1 and A2 in the next two sets of two
columns.
Pointing or jumping (@). You can tell SAS to go directly to a column using the at sign
(@). For example
INPUT NAME $ 1-20 @ 23 A2;
This tells SAS to read the alpha variable
NAME in the first 20 columns and then to jump to column 23 to read variable A2.
Record number (#). If you have more than one record ('card') per person, you can
tell SAS which record to read using the number sign. Suppose you have 10 records per person, but the only information you need is on the first card and the
last card. You might have an input
statement something like this:
INPUT NAME $ 1-20 #10 @ 50 SALARY;
This tells SAS to read the alpha variable
NAME on the first record, then to skip to the 10th record to find SALARY data
starting at column 50. Note that if the
salary data were missing on the record, you would need to note this by placing
a period (.) for missing data there.
Otherwise, SAS would read the next available number and place it in the
SALARY variable. To avoid this problem,
assuming that SALARY is left blank when it is missing, you would simply list
the columns in which the salary data is found.
For example,
INPUT NAME $ 1-20 # 10 SALARY 50-55;
When you have multiple records and you only
need information from some of them, you must tell SAS to go to the last record
as part of the INPUT statement.
Otherwise, SAS will think that each observation has as many records as
the last record from which data were read.
Suppose you have 10 records per observation, but you only need data from
the first two. You would write
INPUT NAME $ 1-20 #2 SALARY 50-55 #10;
If you leave off the #10, then SAS will think
that there are 2 records per observation.
It will then read whatever is on the third record for the first person
in columns 1-20 and assign that as the
NAME of the second observation.
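The sketch below illustrates the point with three records per person rather than ten, to keep the listing short. The dataset name PAY, the column positions, and the data are assumptions made for the example. The closing #3 reads nothing; it only tells SAS that each observation occupies three records.

```sas
DATA PAY;
   INPUT NAME $ 1-10      /* record 1: name in columns 1-10    */
         #2 SALARY 1-6    /* record 2: salary in columns 1-6   */
         #3;              /* record 3: present but not read    */
   CARDS;
Joe
52000
unused line
Mary
61000
unused line
;
PROC PRINT DATA=PAY;
RUN;
```

With the #3 removed, SAS would treat every two lines as one observation and would try to read "unused line" as the NAME of the second person.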
Fixed (informat). Instead of specifying the column numbers, you can tell SAS the number
of columns and where to place the decimal.
Suppose we have administered a survey with 10 items in it and each
response takes 1 column (e.g., a 1 to 5 Likert scale where 1=Strongly Disagree
and 5 = Strongly agree). Our data might
look like this:
            Item
Name        1    2    3    4    5    6    7    8    9    10
Col 1-10    11   12   13   14   15   16   17   18   19   20
Joe         2    3    2    4    1    5    5    5    2    3
Mary        2    4    4    3    2    5    4    5    3    3
For each person, we record their name in the
first ten columns, and then the responses for each of the 10 items in columns
11 through 20. With column format, we
would write
INPUT NAME $ 1-10 I1 11 I2 12 I3 13 I4 14 I5
15 I6 16 I7 17 I8 18 I9 19 I10 20;
With an informat statement, we could write
INPUT NAME $ 1-10 I1 1. I2 1. I3 1. I4 1. I5
1. I6 1. I7 1. I8 1. I9 1. I10 1.;
This says to give each item 1 column. A quicker way to write this is:
INPUT NAME $ 1-10 (I1-I10)(1.);
SAS allows the hyphen (-) to refer to
variables that have the same beginning, but differ only by a final digit. In this case, I1-I10 means I1 through
I10. The hyphen cannot be used for
variables that do not have the same stem.
For example, suppose we had two surveys, the first of which was 10 items
long and the second of which was 20 items long. Further suppose that both scales have single column responses and
that the second scale follows the first immediately on the same record. We would write
INPUT (I1-I10)(1.)(T1-T20)(1.);
The statement
INPUT (I1-T20)(1.) ;
would cause an error message, because I1 and T20 do not share the same stem.
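To make the shorthand concrete, here is a small sketch that reads the Joe and Mary responses shown in the table above. The dataset name SURVEY is an assumption, and the data lines assume the name is padded with blanks to fill columns 1 through 10 so that the item responses start in column 11.

```sas
DATA SURVEY;
   INPUT NAME $ 1-10 (I1-I10)(1.);   /* ten one-column items after the name */
   CARDS;
Joe       2324155523
Mary      2443254533
;
PROC PRINT DATA=SURVEY;
RUN;
```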
Informats for Decimals. We usually put the decimal point in our data when we write data
to a file. For example, if the number
is 32.8, we would put that into 4 columns, such as:
32.8
With column format, we would tell SAS to read
the appropriate 4 columns, e.g.,
INPUT TEMP 1-4;
With informats, however, we do not need to put
the decimal in the data, thus saving space in our data file. We can tell SAS where to put the decimal
during the input statement. Suppose we
had punched
328
and we wanted SAS to read this as 32.8. Then we would use the statement
INPUT TEMP 3.1;
This tells SAS to read three columns and to
put one digit to the right of the decimal point. If we instead said
INPUT TEMP 3.2;
SAS would interpret the number as 3.28 rather
than 32.8. Therefore the number to the
left of the decimal in the INPUT statement is the number of columns, and the
number to the right of the decimal is the number of digits to the right of the
decimal.
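A quick way to see the difference is to read the same three columns twice with the two informats. In this sketch the names T, TEMP1 and TEMP2 are made up for the example, and the @1 pointer sends SAS back to column 1 before each read.

```sas
DATA T;
   INPUT @1 TEMP1 3.1     /* 328 read as 32.8 */
         @1 TEMP2 3.2;    /* 328 read as 3.28 */
   CARDS;
328
;
PROC PRINT DATA=T;
RUN;
```

Note that the decimal part of the informat only applies when the data contain no decimal point. If a decimal point is actually punched in the data, SAS uses it as written.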
Combining Formats. SAS allows any combination of free, column, and informat
instructions in a single INPUT statement.
This gives you great flexibility in inputting your data. For example, you could write
INPUT #10 (I1-I50)(1.0) #2 @ 50 A1 A2 A3 #3
SSN $ 2-10;
I wouldn't ordinarily tell SAS to read the
10th record first. I like to start with
the 1st record and 1st column and move across columns to the end of each
record. That way I don't get
confused. Nor would I allow free format
on the 2nd record, because of the problems with missing data. However, you CAN do all that if you want to.
CARDS Statement. The CARDS statement tells SAS that your data follow immediately
in the same program where you have your INPUT statement. For example
INPUT A 1 B 2 C 3;
CARDS;
123
234
456
;
shows an input statement that tells SAS to
find 3 variables in column format. The
CARDS statement follows, and is in turn followed by three records, one for each observation. SAS continues to read each line as a record
of data until it finds a semicolon (;).
It then backs up one line (record) and considers this to be the last
line (record) of data.
INFILE Statement. The INFILE Statement tells SAS where to find a file containing
data that is external to the Program Window.
For example, suppose you kept your data in a file on a floppy disk. Let's suppose that the floppy disk drive is
A: and the name of your file is Sumdata.txt.
You would write:
DATA D1;
INFILE 'A:Sumdata.txt';
INPUT A 1 B 2 C 3;
Note that you put the INFILE statement before
the INPUT statement. Also note the
single quotes and semicolon. The single
quotes enclose the location of the external file. The semicolon must be placed after the closing quote to end the
statement. Suppose your hard drive is
C: and you keep your data in a directory inside the SAS directory. If you keep your data in a directory called
MYDATA, then your INFILE statement would look something like this:
INFILE 'C:\SAS\MYDATA\Sumdata.txt';
Writing
You can use SAS to write or output data as
well as to read or input data. The
formats that you use to do this are identical to the input formats, which were
described earlier.
FILE Statement. Use the FILE statement to tell SAS where to write the new
file. For example
DATA D2; SET D1;
FILE 'A:Newdat.sas';
I always create a new dataset and copy the
contents of an earlier dataset when I use the FILE statement. The statements DATA D2; SET D1; tell SAS to create a new dataset called D2 and
to copy the contents of D1 into it.
The reason for this is that I am generally outputting derived data to
the new dataset. I may have added items
together to form scales, combined the judgments of two or more judges into a
composite, or merged datasets from different places to form my new data.
The above FILE statement will create a file
on your A: drive called Newdat.sas. Be
sure to follow the DOS conventions for legal file names and extensions (e.g.,
eight or fewer characters for the name, three for the extension). The file that SAS writes is an ASCII file
with one line per record.
SAS will over-write files with identical
names. It will not ask you about this
either, so be careful. If you already
have a file named Newdat.sas on your A: drive, you will lose it by running SAS
with the above lines in it.
PUT Statement. Use the put statement to tell SAS the format of the new
file. For example, you might write
DATA D2; SET D1;
FILE 'A:Newdat.sas';
PUT SSN 1-10 @ 15 (A1-A10)(3.2) #2
(I1-I50)(1.0);
You are not limited to 80 columns in your PUT
statements. However, it is often a good
idea to use no more than 80 columns for a record because you may have problems
viewing the new file having more than 80 columns with some software.
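The following is a small end-to-end sketch of FILE and PUT, not a recipe for any particular project. The variable names, the data, and the file name scores.txt are all hypothetical; with no drive or directory given, the file is assumed to land in whatever directory your SAS session treats as current.

```sas
DATA D1;
   INPUT NAME $ 1-10 SCORE 11-14;
   CARDS;
Joe       32.8
Mary      41.5
;
DATA D2; SET D1;
   FILE 'scores.txt';             /* hypothetical file name; an existing file is overwritten */
   PUT NAME $ 1-10 SCORE 12-15;   /* column format, just as in an INPUT statement */
RUN;
```

Each observation in D1 produces one line in the new file, with the name in columns 1 through 10 and the score in columns 12 through 15.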
Manipulating Data in the Data Step
Arithmetic Operations
You can combine variables, take functions of
them, and generally assign values to variables in the data step. For example, you might want to add items to
form a scale:
Scale = I1+I2+I3+I4+I5;
SAS uses the following symbols to perform
arithmetic operations
Operation           Symbol
Addition            +
Subtraction         -
Multiplication      *
Division            /
Power (exponent)    **
SAS carries out the exponent operation first, then multiplication and division, and finally
subtraction and addition.
Therefore you need to use parentheses whenever you want a different order, to make sure that SAS is doing
what you want. For example
A = B + C**2;
SAS interprets this to mean "set
variable A equal to variable B plus C-squared."
That is,
A=B+(C**2);
It does not mean to set A equal to the square
of the sum of B and C. The latter value
would be written:
A = (B+C)**2;
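You can check the order of operations for yourself with a one-observation dataset. In this sketch the names PREC, A1, A2 and A3 and the values 2 and 3 are chosen only for illustration.

```sas
DATA PREC;
   B = 2;
   C = 3;
   A1 = B + C**2;      /* exponent first: 2 + 9 = 11              */
   A2 = (B + C)**2;    /* parentheses first: 5 squared = 25       */
   A3 = B + C*2;       /* multiplication before addition: 2 + 6 = 8 */
PROC PRINT DATA=PREC;
RUN;
```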
Arithmetic with missing values. SAS will set the result of any arithmetic operation to be missing
if ANY part of the operation contains a missing value. So for example if
Scale = I1+I2+I3+I4+I5;
Then Scale will be missing if any 1 of the 5
items is missing. This is not always
true with functions. Some functions
ignore missing values.
IF Statement. The IF statement causes SAS to check on the truth of a statement
or expression, and then to execute a command if the statement or expression is
true.
For example
IF A LT 3 THEN B=0;
Would be interpreted by SAS to mean that if
the value of variable A is less than 3, set variable B equal to 0.
Suppose we had a distribution of depression
scores in a variable named BECK, and we wanted to dichotomize them so that
values greater than 14 were set to 1 (high) and values of 14 or less were set
to zero (low). We could do this by
BECK2 = 0;
IF BECK GT 14 THEN BECK2=1;
The first statement creates a variable called
BECK2, and sets its value to zero. The
second statement sets the value of BECK2 to 1 if the score on BECK is greater
than 14.
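One caution about the two statements above, offered as a sketch rather than as part of the original example: SAS treats a missing numeric value as smaller than any number, so a person with no BECK score would be left with BECK2 = 0 (low). If you would rather keep such people missing, you could test for missing first. The dataset name DEP and the three data values are made up.

```sas
DATA DEP;
   INPUT BECK;
   IF BECK EQ . THEN BECK2 = .;          /* no score: leave the dichotomy missing */
   ELSE IF BECK GT 14 THEN BECK2 = 1;    /* high */
   ELSE BECK2 = 0;                       /* low  */
   CARDS;
20
14
.
;
PROC PRINT DATA=DEP;
RUN;
```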
Functions
SAS supports a large number of functions,
including arithmetic (math, trig, Boolean) and statistical functions. For example LOG is the natural logarithm
function. We can write the Fisher r to z
transformation as:
Z = .5*LOG((1+r)/(1-r));
Note that I needed to include the parentheses
around the addition and subtraction operations. Otherwise, SAS would have done the division before the addition
and subtraction and the result would have been wrong.
To find a complete list of SAS functions,
click on Help at the main SAS menu.
Then 1) SAS System, 2) Base SAS Documentation, 3) Language Reference and
finally 4) SAS Functions.
A few functions ignore missing values. For example, suppose we have five items that
we want to sum to form a scale. We can
compute this by:
SCALE = SUM(OF I1-I5);
But the SUM(OF) function ignores missing
values, so that if one of these items is missing for a person, that person will have as their scale score
the sum of the non-missing items.
To avoid this problem, you should either set
the result to missing like this:
ARRAY A I1-I5;
DO OVER A;
IF A EQ . THEN SCALE =.;
END;
(see the ARRAY statement) or else use
arithmetic to form the scale, that is:
SCALE = I1+I2+I3+I4+I5;
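The sketch below puts the approaches side by side so you can see the difference on a person with one missing item. The names SC, TOTAL1, TOTAL2 and TOTAL3 and the data are made up; the third version uses the NMISS function, which counts missing values, as an alternative to the ARRAY approach shown above.

```sas
DATA SC;
   INPUT I1-I5;
   TOTAL1 = SUM(OF I1-I5);                    /* ignores missing items          */
   TOTAL2 = I1+I2+I3+I4+I5;                   /* missing if any item is missing */
   IF NMISS(OF I1-I5) GT 0 THEN TOTAL3 = .;   /* same rule, using a function    */
   ELSE TOTAL3 = SUM(OF I1-I5);
   CARDS;
1 2 3 4 5
1 2 . 4 5
;
PROC PRINT DATA=SC;
RUN;
```

For the second person, TOTAL1 is 12 while TOTAL2 and TOTAL3 are both missing.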
Internal Counters
Internal counters are values that SAS uses
for its own record- or house-keeping purposes.
Sometimes you will want to use them for your own purpose.
The _N_ counter. The symbol _N_ refers to the Observation number. The third observation in a dataset is known
to SAS as _N_ = 3, the fourth observation is _N_=4 and so on. If you
wanted to examine your first 25 people in the dataset you could use the
following statements:
DATA D1;
INPUT A B C;
IF _N_ LE 25;
CARDS;
data go here...
PROC PRINT;
RUN;
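The _I_ counter. The symbol _I_ refers to the internal counter that SAS keeps when it steps through an array with a DO OVER loop (see the ARRAY statement). A DO LOOP is a set of commands that are
executed repeatedly. The first time a
DO OVER loop is executed, _I_ =1, the second time through, _I_=2, and so on. A DO loop with an explicit counter, such as DO K=1 TO 10, uses the variable you name (here K) instead.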
The OUTPUT Statement. The OUTPUT Statement causes SAS to write an observation to the
current dataset. For example, suppose
we have the following program:
DATA D1;
INPUT A B C;
IF A LT 2 THEN OUTPUT;
CARDS;
1 2 3
4 5 6
PROC PRINT;
RUN;
Then the output will be:
OBS A B C
1 1 2 3
Only the first observation is written to the
dataset because only the first observation has a value of A less than 2.
DO Statement. The DO statement causes SAS to execute a set of operations as one
unit. The operations in the unit are
those encountered prior to the END statement.
Its form is:
DO Name = Start TO Stop <BY Increment>;
For example,
INPUT A B C;
If A EQ . THEN DO K=1 TO 10;
A = NORMAL(0);
OUTPUT;
END;
The NORMAL function returns a random number
from an approximately Normal (0,1) distribution. The above program reads values for each person for A, B, and
C. If A is missing, then the program
generates 10 values for A, each from a normal distribution, and outputs them to
the current dataset.
If we want to create values of the r to z
function to place in a table, we could use the following program:
DATA D1;
DO r = -.9 to .9 by .1;
Z = .5*log((1+r)/(1-r));
OUTPUT;
END;
PROC PRINT;
RUN;
SAS would produce the
following output:
OBS        R           Z
  1    -0.90000    -1.47222
  2    -0.80000    -1.09861
  3    -0.70000    -0.86730
  4    -0.60000    -0.69315
  5    -0.50000    -0.54931
  6    -0.40000    -0.42365
  7    -0.30000    -0.30952
  8    -0.20000    -0.20273
  9    -0.10000    -0.10034
 10    -0.00000    -0.00000
 11     0.10000     0.10034
 12     0.20000     0.20273
 13     0.30000     0.30952
 14     0.40000     0.42365
 15     0.50000     0.54931
 16     0.60000     0.69315
 17     0.70000     0.86730
 18     0.80000     1.09861
 19     0.90000     1.47222
ARRAY Statement. An ARRAY statement is a convenient way to refer to a collection of
variables. The form of the statement is
ARRAY Name [Variable List];
For example, suppose we have a 50-item scale
that we want to put into an array. We
could do this with:
ARRAY Scale I1-I50;
SAS interprets this to mean Create an array
called Scale that contains the variables I1 through I50 (our 50 items).
Arrays are often used with DO LOOPS. If we have a set of variables X and want to
create a set of variables Y each of which is equal to X, we could use the
following, where the (K) after each array name tells SAS to use K as the counter for that array:
ARRAY A (K) X1-X10;
ARRAY B (K) Y1-Y10;
DO K=1 TO 10;
B = A;
END;
The DO OVER command is also used with
arrays. The example immediately above
could be written as:
ARRAY A X1-X10;
ARRAY B Y1-Y10;
DO OVER A;
B = A;
END;
In either case, the variable Y1 would be set
equal to X1, Y2 would be set equal to X2, and so forth. Note that we refer to each array by its name
in our compute statements. The compute
statements are executed once for each variable in the array (for the DO OVER
command) or using the variables that correspond to an explicit counter in the
case of the DO statement.
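SAS also lets you state the size of an array and pick out its elements with a subscript. The sketch below copies X to Y that way, using three variables instead of ten to keep the data short. The dataset name COPY and the data values are made up, and this subscripted form is offered as an alternative to the forms shown above, not as a correction of them.

```sas
DATA COPY;
   INPUT X1-X3;
   ARRAY A{3} X1-X3;     /* the size of the array is stated in braces */
   ARRAY B{3} Y1-Y3;     /* Y1-Y3 are created by this statement       */
   DO K = 1 TO 3;
      B{K} = A{K};       /* the subscript K picks the element         */
   END;
   DROP K;               /* the counter is not needed in the dataset  */
   CARDS;
4 5 6
;
PROC PRINT DATA=COPY;
RUN;
```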
Recoding
It is usually the case with questionnaires or
surveys that some of the items are written so that they are opposite in meaning
to others. For example one item might
be "I love my job" and another item might be "I hate my
job." In both cases the response
would be 1=strongly disagree to 5=strongly agree. One set of responses needs to be scored opposite to the other to
provide a consistent scoring key. This
can be done using an ARRAY statement.
Suppose we have 20 items in a scale,
and each response is coded 1 to 5 as described above. If items 2, 4, 5, 7, 11 and 19 are to be
reverse scored, we would write:
ARRAY A I2 I4 I5 I7 I11 I19;
DO OVER A;
A = 6-A;
END;
This would cause each item to be subtracted
from 6, which will result in the proper recoding. In general with Likert-type items, if there are k response
options, recode by subtracting the input response from k+1 (in our case 5
response options, so we subtract from 6).
Selecting & Concatenating
(Merging & Stacking Selected Data)
SAS allows lots of flexibility in managing
your data. You can read data from
several different sources and then merge them.
For example, you might have one set of data from Human Resources that
contains information about the amount of training that employees have received. Another dataset might come from Accounting,
where they keep information on store profits.
Or you might want to stack some new data that you generated on top of
data that already exist within a file.
SET Statement. The SET statement causes SAS to stack vertically all datasets
named in the SET statement. For example
DATA ALL; SET D1 D2 D3;
Would cause SAS to create a new data set
called ALL and then to stack D1 on top of D2 and D2 on top of D3. Variables that are meant to line up
in the stacked dataset must have the same names in each of the datasets.
For example, after the SET statement,
if D1 and D2 both have variables named A, B, and C, then the
observations for D1 for variable A will be in the dataset above those for A in
D2. If only D1 has a variable named A,
then after the SET statement, there will be values in the variable A for the
observations from D1, but the observations from D2 will have missing values for
A.
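Here is a small sketch of stacking two datasets that share only one variable name. The data values are made up. D1 has variables A and B, while D2 has B and C.

```sas
DATA D1;
   INPUT A B;
   CARDS;
1 2
3 4
;
DATA D2;
   INPUT B C;
   CARDS;
5 6
;
DATA ALL;
   SET D1 D2;     /* the observations from D1 come first, then those from D2 */
PROC PRINT DATA=ALL;
RUN;
```

The result has three observations. The two from D1 have missing values for C, and the one from D2 has a missing value for A.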
MERGE Statement. The MERGE statement causes SAS to stack horizontally all datasets
named in the statement. For example
DATA ALL; MERGE D1 D2;
Would cause SAS to create a new dataset
called ALL and concatenate all the variables in D1 and D2. The MERGE statement is usually used with the
BY statement. Most files have some
identifying variable, such as social security number, that is used to match records
for the merge. So if we had such an
identifier we would use it like this:
PROC SORT DATA = D1; BY SSN;
PROC SORT DATA = D2; BY SSN;
DATA ALL; MERGE D1 D2; BY SSN;
For people where there is a matching SSN,
there will be values assigned for all variables contained in both D1 and D2
after the merge. If the person is
listed in only D1 or D2 but not both, that person will have missing data
corresponding to either D1 or D2 after the merge. Be sure to give your variables (other than the BY variable) different names prior to merging
datasets. Otherwise SAS will simply
replace the values from one dataset with the values from the other.
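The sketch below runs a complete merge on two tiny datasets. The dataset names HR and ACCT, the variable names, and the values are all hypothetical, and a short ID number stands in for the social security number.

```sas
DATA HR;
   INPUT ID TRAIN;
   CARDS;
3 12
1 40
2 8
;
DATA ACCT;
   INPUT ID PROFIT;
   CARDS;
2 500
4 900
1 750
;
PROC SORT DATA=HR;   BY ID;
PROC SORT DATA=ACCT; BY ID;
DATA ALL;
   MERGE HR ACCT;    /* match observations on ID */
   BY ID;
PROC PRINT DATA=ALL;
RUN;
```

IDs 1 and 2 appear in both datasets and get values for both TRAIN and PROFIT. ID 3 appears only in HR, so PROFIT is missing; ID 4 appears only in ACCT, so TRAIN is missing.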
DELETE Statement. Use the DELETE statement to delete an observation from a dataset.
For example:
IF _N_ EQ 10 THEN DELETE;
IF A EQ 3 THEN DELETE;
IF B EQ . THEN DELETE;
The first statement would delete the 10th
observation. The second statement would
delete observations in which the variable A has the value 3. The third statement would delete any
observation with a missing value for B.
DROP Statement. Use the DROP statement to eliminate variables that you do not
need. For example, if you created some
scales from items, you might want to keep the scale scores and drop the item
scores. To see the correlations among
the scales without the items, we could write:
DATA D1;
INFILE 'A:ITEM.DAT';
INPUT (I1-I150)(1.0);
SCALE1=SUM(OF I1-I50);
SCALE2=SUM(OF I51-I100);
SCALE3=SUM(OF I101-I150);
DATA D2; SET D1;
DROP I1-I150;
PROC CORR;
RUN;
PROC STEP
Most Procedures (PROCs) are
used for statistical analysis. However,
there are some procedures for managing data.
The most frequently used of these is PROC SORT.
Sorting
PROC SORT
Use PROC SORT to arrange the observations by
the values of one or more variables in your dataset. PROC SORT can be used to organize your data for you to see or it can
be used as a precursor to another PROC.
You should also sort datasets that you plan to merge by a common
variable. For example, if you want to
merge two datasets by a common social security number, sort each dataset by
social security number before the merge.
Form:
PROC SORT <DATA = dataset name><OUT = dataset name>;
BY <DESCENDING> variable name(s);
DATA Option. By default, SORT uses the most recently created dataset. You can tell SAS to SORT any dataset by naming
that dataset here.
OUT Option. By default, SAS will replace the current dataset with the sorted
data. To create a new dataset for the
sorted data and still keep the original dataset, use this option. If you use a name for a dataset that you
used earlier, SAS will replace that earlier dataset, so be careful with long
programs.
BY statement. The BY statement tells SAS which variable(s) to use for
sorting. For example
PROC SORT DATA=D1 OUT=SD1; BY AGE;
The above command tells SAS to sort the data in
D1 by the values in the variable AGE. If you want a descending sort, you
specify this in the BY statement immediately before the variable to be used for
sorting. For example
PROC SORT; BY DESCENDING AGE DESCENDING
INCOME;
This statement would sort the data descending by age and
descending by income. If the second
DESCENDING keyword were left off, the data would have been sorted descending
by age and ascending by income. When
more than one variable is in the BY statement, SAS sorts first by the left-most
variable and proceeds serially to the right.
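The sort-before-merge advice above can be sketched as follows. The dataset names, the variable SSN, and all of the values are hypothetical (three-digit numbers stand in for social security numbers).

```sas
DATA D1;
   INPUT SSN AGE;
   CARDS;
333 41
111 25
222 37
;
DATA D2;
   INPUT SSN INCOME;
   CARDS;
222 52000
333 61000
111 30000
;
* Sort each dataset by the common variable before merging;
PROC SORT DATA=D1; BY SSN;
PROC SORT DATA=D2; BY SSN;
DATA BOTH;
   MERGE D1 D2;
   BY SSN;
PROC PRINT DATA=BOTH;
RUN;
```

Each observation in BOTH should carry the AGE and INCOME that belong to the same SSN.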
PROC STANDARD
PROC STANDARD allows you to set the mean and
standard deviation of one or more variables to a value that you choose. Suppose
you want to find z scores for a variable called AGE in your data. You would use the command
PROC STANDARD M=0 S=1; VAR AGE;
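Here is a small sketch of the same idea as a complete program. The ages are made up, and Z1 is a hypothetical name for the output dataset. MEAN= and STD= are the full spellings of the options (M= and S= are the short forms), and OUT= names the dataset that will hold the standardized values.

```sas
DATA D1;
   INPUT AGE;
   CARDS;
25
37
41
52
30
;
* Z1 holds AGE as z scores; D1 keeps the raw values;
PROC STANDARD DATA=D1 MEAN=0 STD=1 OUT=Z1;
   VAR AGE;
PROC MEANS DATA=Z1;
   VAR AGE;
RUN;
```

PROC MEANS is included only as a check: the mean of AGE in Z1 should be 0 and the standard deviation 1.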
Exploring Your Data
PROC PRINT
Form:
PROC PRINT <DATA=Dataset name>;
<VAR name(s);>
PROC PRINT causes SAS to print a list of
observations by variables in a dataset.
DATA= Option. By default, SAS prints the most recently created dataset. You can tell SAS which dataset you want by
placing another dataset name here.
You should always print out your data and
check it to be sure that what is in the computer matches exactly the paper or
original source of the data. To print
all the data, use the command
PROC PRINT;
This will print everything. If you do not want to see all of the data,
you can reduce the amount in several ways.
VAR Statement. The VAR (variable selection) statement tells SAS to use the
chosen procedure (PROC) only on variables that you name. Otherwise, the default for nearly all PROCs
is to use all applicable variables in the dataset. To select only the variables A B and C for printing, you would
write:
PROC PRINT; VAR A B C;
You can also reduce the number of
observations in your dataset by selecting observations with the internal
counter.
DATA D1;
INPUT A B C;
IF _N_ LE 25;
CARDS;
Data lines...
PROC PRINT;
I recommend that you print all of the data
before doing any analyses. If possible,
check every data point against the raw data (e.g., check the numbers in the
printout against the survey responses).
If this is not possible, check the first few observations, the last few
observations, and a sprinkling of data
throughout the file. You need to
discover errors in your data early and to correct them. Other procedures that may assist you with
finding errors are FREQ, UNIVARIATE, MEANS, and PLOT.
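If the file is too long to print in full, one way to look at the first few and last few observations is with the OBS= and FIRSTOBS= dataset options. The sketch below uses made-up data with six observations; with a real file you would need to know the number of observations to choose the FIRSTOBS= value.

```sas
DATA D1;
   INPUT A B C;
   CARDS;
1 2 3
2 2 1
2 3 3
3 4 2
2 3 1
1 3 2
;
* First two observations;
PROC PRINT DATA=D1 (OBS=2);
* Observations 5 through the end of the file;
PROC PRINT DATA=D1 (FIRSTOBS=5);
RUN;
```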
Example of PROC PRINT
DATA D1;
INPUT A 1 B 3;
CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
PROC PRINT;
RUN;
Results of PROC PRINT:
OBS    A    B
  1    1    2
  2    2    2
  3    2    3
  4    3    4
  5    2    3
  6    1    3
  7    2    1
  8    1    2
  9    4    4
 10    2    2
This
shows that SAS thinks there are records on 10 people, each represented by a
row. There are two columns, corresponding
to variables A and B. SAS thinks that
the data for the first person include a "1" for variable A and a
"2" for variable B.
PROC MEANS
This procedure produces means for your variables, of course. It also produces other descriptive
information. If you have a large number
of variables to view, MEANS is often a good bet. You can use the maximum and minimum values to see whether any of
your variables have any illegal values.
For example, you might have a variable that takes values from 1 to 5,
but find that it somehow has a value of 50.
This would cause you to go back to your data to spot the error.
Form:
PROC MEANS <DATA=Dataset name>;
<VAR Variable name(s);>
DATA= Option. The default is to use the most recently
created dataset. If you want means for
another dataset, specify the dataset name here.
VAR Statement. The default for MEANS is to use all numeric
variables. Use the VAR statement to
select a subset of your variables.
Example Program
DATA D1;
INPUT A 1 B 3;
CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
PROC MEANS; VAR A B;
RUN;
Output from PROC MEANS:
Variable    N         Mean      Std Dev      Minimum      Maximum
--------------------------------------------------------------------
A          10    2.0000000    0.9428090    1.0000000    4.0000000
B          10    2.6000000    0.9660918    1.0000000    4.0000000
--------------------------------------------------------------------
Note how MEANS produces a compact table that
shows the name of the variable, the number of non-missing observations, the
mean, standard deviation, minimum and maximum.
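The range check described above can be sketched with a few statistic keywords on the PROC MEANS statement. The variables ITEM1 and ITEM2 and their values are hypothetical; both items are supposed to run from 1 to 5, and one value has been mistyped on purpose.

```sas
DATA D1;
   INPUT ITEM1 ITEM2;
   CARDS;
1 5
3 4
50 2
2 .
4 1
;
* Request only the count, number missing, minimum and maximum;
PROC MEANS DATA=D1 N NMISS MIN MAX;
   VAR ITEM1 ITEM2;
RUN;
```

The maximum of 50 for ITEM1 flags the illegal value, and NMISS shows that ITEM2 has one missing observation.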
PROC UNIVARIATE
Univariate provides a great
deal of descriptive information about a variable.
Form:
PROC UNIVARIATE <PLOT> <NORMAL> <DATA=Dataset name>;
VAR Variable name(s);
PLOT Option. The PLOT option causes SAS to produce a stem-and-leaf diagram,
a box plot and a normal probability plot for visual representations of your
variable(s).
NORMAL Option. The NORMAL option causes SAS to produce a
test of the hypothesis that the data were drawn from a population in which the
variable is normally distributed.
DATA= Option. The default is to use the
most recently created dataset. You can
specify a different dataset here.
VAR Statement. The default is to produce an analysis of all numeric variables in
the dataset. Use the VAR statement to
choose a subset of the variables.
Example Program
DATA D1;
INPUT A 1 B 3;
CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
PROC UNIVARIATE PLOT
NORMAL;
RUN;
PROC UNIVARIATE Results
Univariate shows lots of information about the distribution of the variable. It shows the mean, median and mode. It shows the variance, standard deviation,
range, quartiles, and various percentiles.
It shows the skew and kurtosis.
Variable=A
            Moments                              Quantiles(Def=5)
N               10   Sum Wgts         10     100% Max     4     99%     4
Mean             2   Sum              20      75% Q3      2     95%     4
Std Dev       0.94   Variance       0.89      50% Med     2     90%   3.5
Skewness      0.99   Kurtosis       1.19      25% Q1      1     10%     1
USS             48   CSS               8       0% Min     1      5%     1
CV           47.14   Std Mean   0.298142                         1%     1
T:Mean=0  6.708204   Pr>|T|       0.0001     Range        3
Num ^= 0        10   Num > 0          10     Q3-Q1        1
M(Sign)          5   Pr>=|M|      0.0020     Mode         2
Sgn Rank      27.5   Pr>=|S|      0.0020
W:Normal  0.840083   Pr<W         0.0430
Then Univariate shows the extremes, or highest and lowest numbers in the
distribution and where they came from.
For example, the lowest number in this variable is 1.0, and that number
can be found for persons (rows) 8, 6 and 1.
The highest number in the distribution is 4, and this can be found for
person 9. This is useful in large
datasets. You can find errors or
inspect people with unusually high or low scores.
Extremes
Lowest    Obs     Highest    Obs
     1(     8)          2(     5)
     1(     6)          2(     7)
     1(     1)          2(    10)
     2(    10)          3(     4)
     2(     7)          4(     9)
The stem-leaf diagram and the boxplot are both graphical methods for representing
the distribution. The stem-leaf diagram
is a histogram. This one shows that 1
person scored a 4.0, one person scored a 3.0, five people scored a 2.0, and 3
people scored a 1.0. The boxplot is a
drawing of the distribution. The middle
of the distribution is shown by the box, and the tails of the distribution are
shown by the whiskers extending from the box.
This plot shows that the distribution is skewed.
Stem Leaf          #    Boxplot
   4 0             1       0
   3
   3 0             1       |
   2                       |
   2 00000         5    +--+--+
   1                    |     |
   1 000           3    +-----+
     ----+----+----+----+
Stem Leaf and Boxplot for variable B
Stem Leaf          #    Boxplot
   4 00            2       |
   3                       |
   3 000           3    +-----+
   2                    *--+--*
   2 0000          4    +-----+
   1                       |
   1 0             1       |
     ----+----+----+----+
This
distribution (B) is closer to being Normal.
The mean of the distribution is shown by the + in the middle of the
box. The median is shown by the
*------* across the box. For variable
B, the mean and median are about equal, as they should be when data are
approximately normally distributed. The
quartiles are shown by the lines
+------+
that form the edges or shoulders of the box.
For variable A, the mean is graphed on the upper quartile and the median
is not shown. The tails or extremes are
the vertical bars, | . Unusual scores
are shown at the ends of the tails as either 0 as in variable A, or for more
extreme scores, the symbol * is used (not shown in either graph). The graph for variable B is what you want to
see for a nice normal distribution. The
mean and median occur together. The
shoulders are equally spaced from the mean and median. The tails are equally long and there are no
circles or asterisks at the ends.
Variables that have distributions that look like those of variable B are
best for statistical analysis. Note
that the stem-leaf and boxplot graphs both show the same information, that is,
the shape of the distribution. They
just do it in slightly different ways.
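When you use the table of extremes to track down unusual scores, it can help to label the observations with an identifier rather than the observation number. A minimal sketch using the ID statement follows; the variables NAME and SCORE and the data are hypothetical, with one score made deliberately extreme.

```sas
DATA D1;
   INPUT NAME $ SCORE;
   CARDS;
ANN 12
BOB 15
CAL 14
DEE 13
EVE 41
FAY 16
;
PROC UNIVARIATE DATA=D1 PLOT NORMAL;
   VAR SCORE;
   * ID adds the value of NAME to the table of extreme observations;
   ID NAME;
RUN;
```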
PROC PLOT & PROC GPLOT
PROC PLOT is a simple
program for getting a look at your data.
PROC GPLOT produces nicer graphs that you can use in publications or
other presentations that require high quality.
With GPLOT, you can copy the entire graph as a window and then paste it
into other Windows software, such as MS Word and CorelDRAW.
Form
PROC [G]PLOT
<DATA=Dataset name>;
PLOT Vertical * Horizontal
</ OVERLAY>;
DATA= Option. The default is to use the most recently created dataset. If you want a different dataset, specify
that dataset's name here.
PLOT Statement. You must have at least one PLOT statement. The first named variable will be the
vertical axis; the variable to the right of the asterisk (*) will be the
horizontal axis. In a regression
problem, we would write
PLOT Y*X;
You can have more than one
PLOT statement within a PROC PLOT or PROC GPLOT, and a single PLOT statement can hold more than one plot request (for example, PLOT Y*X Z*X;). By default, each plot request will cause SAS to produce one graph.
OVERLAY Option. The OVERLAY option causes SAS to put the graphs for two
or more plot requests in the same PLOT statement onto a single graph.
Example Plot Program
DATA D1;
INPUT A 1 B 3;
CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
PROC PLOT; PLOT B*A;
RUN;
Results of PROC PLOT
Results will be more
meaningful if (a) you have more observations, and also (b) both variables take
on many values. Each of the variables
plotted here takes only 4 values. In
cases with small numbers of values (like ours) it probably makes more sense to
produce a two-way frequency table.
Plot of B*A.  Legend: A = 1 obs, B = 2 obs, etc.
B |
  |
4 +                               A              A
  |
  |
  |
3 +   A             B
  |
  |
  |
2 +   B             B
  |
  |
  |
1 +                 A
  ----+-------------+-------------+--------------+------
      1             2             3              4
                           A
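As noted above, a two-way frequency table is probably more useful than a plot when each variable takes only a few values. A sketch with PROC FREQ on the same ten observations follows; the TABLES request puts B in the rows and A in the columns, to match the plot.

```sas
DATA D1;
   INPUT A 1 B 3;
   CARDS;
1 2
2 2
2 3
3 4
2 3
1 3
2 1
1 2
4 4
2 2
;
PROC FREQ DATA=D1;
   TABLES B*A;
RUN;
```

Each cell of the table shows how many observations share that combination of A and B, which is the same information the plot conveys with the letters A and B.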
Example for GPLOT
To use GPLOT, you must have installed SAS
GRAPH. USF has installed this on the
open use facilities. If you leased SAS,
you may or may not have installed SAS GRAPH.
This example illustrates the use of
GPLOT. It also uses several other
elements described in this document, including a DO loop, a function (EXP, the
exponential function, which raises e, the base of natural logarithms, to a power) and the OUTPUT statement. What we do is to create two logistic curves
(see, for example, Pedhazur's chapter on logistic regression). The DO loop has a counter called L. L varies from -10 to 10 by 1. At each value, we find two functions of
L. These are called P and P2,
respectively. Each value of both is
output for each iteration of the DO loop.
The OVERLAY option causes both plots to
appear in a single graph. Otherwise,
SAS would produce one graph for each plot.
Input Program:
Data d1;
A=1;
A2=0;
B=.50;
Do l = -10 to 10;
P=(exp(a+b*l)/(1+exp(a+b*l)));
P2=(exp(a2+b*l)/(1+exp(a2+b*l)));
Output;
End;
Proc gplot;
Plot p*l p2*l/overlay;
Run;
Output of GPLOT:

This plot was copied from SAS to a word file
and re-sized. That is, you can cut and
paste from SAS to another Windows application. This feature is particularly nice with the graphs that GPLOT
produces.
Example 2
The second example shows a scatterplot with a
regression line plotted on the same graph.
The data are generated using the NORMAL random number generator. The argument of 0 in
NORMAL(0) tells SAS to take the seed, or start, value from the computer clock, so each run produces different data.
SAS Input:
DATA D1;
DO L=1 TO 100;
X1=NORMAL(0);
X2=NORMAL(0);
Y=X1+X2;
Y2=.0593+.94*X1;
OUTPUT;
END;
PROC PRINT;
PROC GPLOT;
PLOT Y*X1 Y2*X1 /OVERLAY;
RUN;
SAS Output:

Statistical Analyses
PROC CORR
Form
PROC CORR <DATA=dataset name>
<NOMISS> <ALPHA>;
<BY variable name(s);>
<VAR variable name(s);>
Description
PROC CORR computes Pearson product-moment
correlation coefficients. The default
is to find the correlation matrix with all non-missing observations for each
pair of numeric variables, thus maximizing the sample size and power for each
significance test. Such an approach is
called pairwise deletion of missing values.
For other purposes, such as multivariable applications (multiple
regression, etc.) or reliability analyses, you will want to delete any
observation that is missing on any variable in the analysis. This is called listwise deletion of missing
data.
DATA = dataset name option. Use the DATA=dataset name option to
choose a dataset for CORR to analyze.
The default is the most recently created dataset.
NOMISS option. For listwise deletion (see the description paragraph on PROC
CORR), use the NOMISS option.
ALPHA option. This produces an item analysis using Cronbach's alpha, an
estimate of internal consistency reliability.
Use this only if you specify NOMISS.
This is used primarily in scale development or prior to other
statistical analysis to check scale reliability.
VAR statement. Use the VAR statement to
select a subset of variables. The
default is to use all numeric variables in the dataset. The choice of variables in the VAR
statement may affect the sample size when NOMISS is specified. If a variable with a small number of
observations is omitted, the sample size for the other variables may increase.
BY statement. Use the BY statement to compute correlation matrices separately
for each value of the BY variable. The data must be sorted using PROC SORT
before using the BY statement with PROC CORR.
For example, suppose you wanted to compute correlations separately for
males and females, which were coded using a variable named SEX. This would be accomplished by:
PROC SORT;
BY SEX;
PROC CORR;
BY SEX;
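The NOMISS and ALPHA options described above might be combined as in the sketch below. The four items I1-I4 and their values are hypothetical; the fifth observation has a missing item so that the effect of NOMISS can be seen.

```sas
DATA D1;
   INPUT I1-I4;
   CARDS;
4 5 4 5
2 1 2 2
3 3 4 3
5 4 5 5
1 2 1 .
3 4 3 3
;
* NOMISS drops the observation with the missing item before ALPHA is computed;
PROC CORR DATA=D1 NOMISS ALPHA;
   VAR I1-I4;
RUN;
```

With these data the analysis should be based on 5 observations rather than 6.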
Example 1: Height and Weight
Input:
Data d1;
**********************************************
* We input height and weight from the table  *
* using column format.                       *
**********************************************;
input ht 1-2 wt 4-6;
cards;
60 102
62 120
63 130
65 150
65 120
68 145
69 175
70 170
72 185
74 210
proc print;
* to verify that the data are correct;
proc means;
* to find the means and standard deviations
of the raw scores;
proc standard m=0 std=1 out=d1;
************************************************
* Proc standard allows us to set the means     *
* and standard deviations of our variables     *
* to be whatever values we want.  If we        *
* choose 0 and 1 as I did, we get z scores.    *
************************************************;
data d2; set d1;
* to find the products of z scores;
zhXzw = ht*wt;
proc print;
* to see the z scores and their products;
proc corr; var ht wt;
*to find the correlation between height and
weight;
run;
Output:
First, the raw data, which are output
from Proc Print.
OBS    HT     WT
  1    60    102
  2    62    120
  3    63    130
  4    65    150
  5    65    120
  6    68    145
  7    69    175
  8    70    170
  9    72    185
 10    74    210
Next, descriptive statistics, which are
output from Proc Means.
Variable    N        Mean       Std Dev    Minimum    Maximum
HT         10    66.80000    4.54116987     60.000     74.000
WT         10    150.7000    33.9511086    102.000    210.000
Z scores output from the second Proc Print.
OBS         HT          WT       ZHXZW
  1   -1.49741    -1.43442     2.14791
  2   -1.05700    -0.90424     0.95578
  3   -0.83679    -0.60970     0.51019
  4   -0.39637    -0.02062     0.00817
  5   -0.39637    -0.90424     0.35842
  6    0.26425    -0.16789    -0.04436
  7    0.48446     0.71574     0.34674
  8    0.70466     0.56846     0.40058
  9    1.14508     1.01028     1.15685
 10    1.58549     1.74663     2.76927
Results of Proc Corr.
Correlation Analysis
2 'VAR' Variables: HT WT
Simple Statistics
Variable    N    Mean    Std Dev    Sum      Minimum     Maximum
HT         10       0      1.000      0    -1.497412    1.585495
WT         10       0      1.000      0    -1.434416    1.746629
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 10
             HT         WT
HT         1.00    0.95662
           0.0      0.0001
WT      0.95662       1.00
         0.0001     0.0
The correlation between height and itself is
1.00. SAS prints a probability value of 0 for it. Of course, this isn't very meaningful, because a correlation of any
variable with itself is 1.0. The correlation between height and weight is
.95662. If the sample of 10 had been drawn from a population in which
r = 0, the probability of obtaining a correlation at least this large in absolute value
would be 0.0001. In other words, the correlation is significant (or to be more
precise, significantly different from zero by conventional standards). PROC CORR also produces descriptive statistics,
including the number of non-missing observations, the mean and standard
deviation of the variable, along with the sum, minimum and maximum.
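The reason for computing the products of z scores is that the correlation equals their sum divided by N - 1. The sketch below carries the example through that last step. The dataset names Z1, Z2, S and RCHECK and the variable names SUMZZ, NOBS and R are my own additions for illustration; the data are the same ten heights and weights, read several observations per line with the @@ line-hold specifier.

```sas
DATA D1;
   INPUT HT WT @@;
   CARDS;
60 102 62 120 63 130 65 150 65 120
68 145 69 175 70 170 72 185 74 210
;
PROC STANDARD DATA=D1 MEAN=0 STD=1 OUT=Z1;
   VAR HT WT;
DATA Z2; SET Z1;
   * Product of the two z scores for each person;
   ZHXZW = HT*WT;
* Save the sum of the products and the number of observations;
PROC MEANS DATA=Z2 NOPRINT;
   VAR ZHXZW;
   OUTPUT OUT=S SUM=SUMZZ N=NOBS;
DATA RCHECK; SET S;
   * Correlation = sum of z score products divided by N - 1;
   R = SUMZZ/(NOBS-1);
PROC PRINT DATA=RCHECK;
   VAR SUMZZ NOBS R;
RUN;
```

The value of R printed here should agree with the correlation of .95662 reported by PROC CORR. The divisor is N - 1 because PROC STANDARD, by default, uses the standard deviation computed with N - 1 in the denominator.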
Example 2: Pairwise vs. Listwise Deletion
Assume that the data in this example are
1. Race (column 1) – 1=Asian, 2=Black, 3=Hispanic, 4=White, 5=Other
2. Job Satisfaction (columns 6-7) – This indicates to what extent the person finds his/her work satisfying and enjoyable. Self report; larger numbers mean more satisfied.
3. Life Satisfaction (columns 12-13) – Indicates the extent the person finds life in general to be enjoyable and satisfying. Self report; larger numbers mean more satisfied.
4. Trait Optimism (columns 18-19) – Indicates the extent to which the person takes an optimistic view of challenges and life in general. Self report; larger numbers mean more optimistic.
INPUT
DATA D1;
INPUT RACE 1 JSAT 6-7 LSAT 12-13 OPTIMIST 18-19;
CARDS;
1    24    32    36
.    24    21    31
1    17    31    28
4    39    39    39
4    21    22    38
4    16    21    52
1    18    27
4    27    26
3    17    36    41
4    29    32    24
3    27    34    42
4    32    47    40
4    12    28    28
1    19          45
2    11          44
2    18    43    23
2    11    37    38
3    22    28    47
4    24    43    41
4    40    28    38
4    17    19    36
4    26    30    52
4    20    23    50
2    27     .    37
4    18    23    48
4    17    42    44
2    23    35    43
4    25    41    15
3    31    27    53
2    26    38    35
4    11    41    39
3    26    24    32
2    16    27    49
4    18    24    25
4    28    30    51
1    12    40    30
1    23    38    33
3    13    27    46
4    16    43    50
2    25    26    45
4    21    51    38
2    24    44    41
4    28    37    41
4    16    31    43
4    15    48    46
2    18    43    31
4    25    35    37
2    19    37    22
4    19    45    20
2    12    24    40
4    18    25    49
1    14    44    50
2    19    30    35
4     .    39    43
1    12    44    38
.    25    26    35
4    12    34    26
2    17    42    31
1    17    47    28
2    23    37    50
4    18    39    49
3    17    25    41
4    23    46    34
3    25    41    47
4    17    40    54
.    16    41    38
4    23    26    47
4    20    45    36
PROC CORR; VAR JSAT LSAT OPTIMIST;
This
statement produces pairwise deletion.
It maximizes the number of observations per correlation. The VAR statement was used to select the
variables for the analysis. RACE is not
suitable for the analysis because it is categorical rather than continuous.
PROC CORR NOMISS;
VAR JSAT LSAT OPTIMIST;
RUN;
Output of PAIRWISE deletion
Correlation Analysis
3 'VAR' Variables: JSAT LSAT OPTIMIST
SAS says that there are three variables in the analysis and lists them.
Simple Statistics
Variable     N        Mean     Std Dev           Sum     Minimum     Maximum
JSAT        67   20.582090    6.233218   1379.000000   11.000000   40.000000
LSAT        65   34.446154    8.301037   2239.000000   19.000000   51.000000
OPTIMIST    66   39.060606    8.914774   2578.000000   15.000000   54.000000
SAS produces descriptive statistics. It then produces the actual correlations.
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations
                JSAT        LSAT    OPTIMIST
JSAT         1.00000    -0.06700     0.02583
              0.0         0.5988      0.8381
                  67          64          65
LSAT        -0.06700     1.00000    -0.18162
              0.5988      0.0         0.1543
                  64          65          63
OPTIMIST     0.02583    -0.18162     1.00000
              0.8381      0.1543      0.0
                  65          63          66
The first (top) number in each set of three
is a correlation coefficient. The
correlation between job satisfaction (JSAT) and itself is 1.000. The correlation between job satisfaction and
life satisfaction (LSAT) is -0.06700.
The second (middle) number in each set of three is the probability (p)
value for testing r=0. The p value for the correlation between job
satisfaction and itself is 0.0. The probability
value for the correlation between job satisfaction and life satisfaction is
.5988. The third (bottom) number in
each set of three is the sample size upon which the correlation was computed. The correlation for job satisfaction with
itself was computed on 67 people. The
correlation between job and life satisfaction was computed on 64 people, all of
whom had scores on both variables.
Output of LISTWISE deletion
Correlation Analysis
3 'VAR' Variables: JSAT LSAT OPTIMIST
Simple Statistics
Variable     N        Mean     Std Dev           Sum     Minimum     Maximum
JSAT        62   20.596774    6.247517   1277.000000   11.000000   40.000000
LSAT        62   34.629032    8.357256   2147.000000   19.000000   51.000000
OPTIMIST    62   38.854839    9.129033   2409.000000   15.000000   54.000000
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 62
                JSAT        LSAT    OPTIMIST
JSAT         1.00000    -0.05786     0.04696
              0.0         0.6551      0.7170
LSAT        -0.05786     1.00000    -0.18615
              0.6551      0.0         0.1474
OPTIMIST     0.04696    -0.18615     1.00000
              0.7170      0.1474      0.0
Note that all the same
information is given with LISTWISE as with PAIRWISE deletion. Note that there are only 62 observations in
the analysis. These 62 had no missing
data on any of the variables listed in the analysis. SAS only prints the sample size at the header (top) of the
correlation matrix rather than under each p value.
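Listwise deletion can also be done by hand in a DATA step, which is handy when you want the same set of complete cases for several procedures. The sketch below is an illustration only: it uses a handful of records in list format rather than the full file, and COMPLETE is a hypothetical dataset name. The NMISS function counts how many of its arguments are missing.

```sas
DATA D1;
   INPUT JSAT LSAT OPTIMIST;
   CARDS;
24 32 36
18 27 .
19 . 45
21 22 38
. 39 43
16 21 52
;
DATA COMPLETE; SET D1;
   * Subsetting IF keeps only observations with no missing values;
   IF NMISS(JSAT, LSAT, OPTIMIST) EQ 0;
PROC CORR DATA=COMPLETE;
   VAR JSAT LSAT OPTIMIST;
RUN;
```

Because every remaining observation is complete, PROC CORR without NOMISS should give the same result on COMPLETE that PROC CORR NOMISS would give on D1.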
Example 3: the BY Statement
This example uses the same data as example
2. We will sort the data by RACE and
then compute the correlation between job and life satisfaction separately for
each value of race.
Input
PROC SORT;
BY RACE;
PROC CORR;
BY RACE;
VAR JSAT LSAT;
In this example, we first used PROC SORT to
sort the data by RACE. Then we asked for a correlation matrix for each race,
but only for the variables job satisfaction and life satisfaction.
Output
-------------------------------------------- RACE=. -------------------------------------
SAS first produces a
correlation matrix for those whose value of RACE is missing. You may not want this information, so you
can delete people who are missing or you may disregard this part of the
printout. On the other hand, you may
want to know the correlation for people who are missing - especially if they
had a choice about what information to provide you.
Correlation Analysis
2 'VAR' Variables: JSAT LSAT
Simple Statistics
Variable    N        Mean      Std Dev          Sum     Minimum     Maximum
JSAT        3   21.666667     4.932883    65.000000   16.000000   25.000000
LSAT        3   29.333333    10.408330    88.000000   21.000000   41.000000
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3
            JSAT        LSAT
JSAT     1.00000    -0.94138
          0.0         0.2191
LSAT    -0.94138     1.00000
          0.2191      0.0
-------------------------------------------- RACE=1 -------------------------------------
Correlation Analysis
2 'VAR' Variables: JSAT LSAT
Simple Statistics
Variable    N        Mean     Std Dev           Sum     Minimum     Maximum
JSAT        9   17.333333    4.301163    156.000000   12.000000   24.000000
LSAT        8   37.875000    7.199950    303.000000   27.000000   47.000000
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations
            JSAT        LSAT
JSAT     1.00000    -0.48357
          0.0         0.2247
               9           8
LSAT    -0.48357     1.00000
          0.2247      0.0
               8           8
The matrices for the
other values of RACE are omitted to conserve space. The information provided by PROC CORR is given separately for
each value of the BY variable.
PROC GLM
Form
PROC GLM <DATA = dataset name>;
<CLASS name(s) of categorical variable(s);>
MODEL dependent variable name(s) = independent variable name(s);
<TEST H = term for hypothesis effect E = term for error effect </ HTYPE=number ETYPE=number>;>
<OUTPUT OUT = dataset name P=name R=name;>
<MEANS name(s) of categorical variable(s) / [Post Hoc Test];>
Description
GLM stands for General Linear Model. GLM is the most flexible of the regression
programs in SAS. It is less efficient
than other programs, but this is less of a concern now that even personal
computers can handle very large problems in a very short period of time. I use this program for models that contain
categorical independent variables, that is, for ANOVA. I also use it for models that contain both
continuous and categorical variables, such as ANCOVA.
DATA = Option. You can specify the dataset to
be analyzed with the DATA = option. The default is the most recently created dataset.
The CLASS Statement tells SAS
which variables are categorical in nature.
SAS assumes that all variables are continuous unless you tell it
otherwise with this statement. In other
words, you don't need to tell SAS which variables are continuous, just which
are categorical.
The MODEL
Statement tells SAS what the dependent variable is (or what the
dependent variables are) and the names of the independent variables. The
dependent variable(s) are listed to the left of the equals sign; the
independent variables go on the right.
The TEST
Statement allows you to construct your own tests of effects. To do this, you specify both the term for
the hypothesis and the term for error.
For example, for a mixed design with a repeated measure, you might have
a test for factor A with an error term of subjects within A. You could then write
TEST H=A E=SUB(A);
The hypothesis and
error terms must appear in the previous MODEL statement so that SAS will know
what they are.
HTYPE and ETYPE
Options. You can specify the type of
sums of squares to use in the TEST statement.
The default is the highest TYPE of sums of squares computed by SAS for
the problem (usually TYPE=3). Choosing
the proper type is tricky. I recommend that you either find an example in a
textbook to match with your output to verify your choice, or consult a
statistician.
The OUTPUT
Statement allows you to output predicted and residual values along
with your original variables. You can
also name the output dataset with OUT=. If you do
not supply a name, SAS creates a new dataset and names it automatically (DATA1, DATA2, and so on);
the original dataset is not changed.
The Means Statement tells SAS to compute means
and standard deviations of the dependent variable at each level of the
categorical variable you name. You can
also specify post hoc tests such as the Tukey HSD (TUKEY) and the Scheffe (SCHEFFE),
among others.
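The OUTPUT and MEANS statements described above might be used together as in this sketch. The data are made up (three groups of four scores), and the names D2, YHAT and RESID are hypothetical choices for the output dataset, the predicted values and the residuals.

```sas
DATA D1;
   INPUT GROUP Y @@;
   CARDS;
1 10 1 12 1 11 1 13
2 15 2 14 2 16 2 17
3 20 3 19 3 22 3 21
;
PROC GLM DATA=D1;
   CLASS GROUP;
   MODEL Y = GROUP;
   * Group means and standard deviations, with Tukey HSD comparisons;
   MEANS GROUP / TUKEY;
   * Save predicted values and residuals along with the original variables;
   OUTPUT OUT=D2 P=YHAT R=RESID;
PROC PRINT DATA=D2;
RUN;
```

In a one-way design like this one, the predicted value for each observation is simply its group mean, so RESID is the deviation of each score from its group mean.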
Example 1: The t-test.
This is a simple
example that shows how to use GLM for the t-test or for one-way ANOVA. The difference between the two is simply the
number of levels of the single independent variable. If there are two levels of the IV, then the t-test and ANOVA amount
to the same thing, because F = t² in such a case.
INPUT:
data d1;
*********************************************************
* Example of a single factor experiment with            *
* two independent groups.  Data from Thorne & Slane,    *
* 1997, p. 224.  Data represent weight in grams of two  *
* groups of rats.  All rats have a tendency to become   *
* obese.  Half (experimental group) were randomly       *
* selected for weight loss drug.  Control group got     *
* saline.  Data after 4 weeks:                          *
*********************************************************;
* Experimental group = 1, Control group = 2;            *
*********************************************************;
input group weight;
cards;
2 437
2 455
2 483
2 392
2 455
2 469
2 513
2 452
2 444
2 410
1 383
1 355
1 313
1 410
1 289
1 344
1 418
1 400
1 344
1 310
proc glm;
class group;
model weight = group;
means group/ hovtest;
run;
Notice the structure of
the data. SAS wants the data stacked by
columns, one for the independent variable (group) and one for the dependent
variable (weight). The CLASS statement
tells SAS that our IV is categorical.
The MODEL statement tells SAS the proper DV and IV. The MEANS statement causes SAS to print the
mean and standard deviation for each group; the HOVTEST option tells SAS to
test for homogeneity of variance.
OUTPUT
The GLM Procedure
Class Level Information
Class     Levels    Values
group          2    1 2
Number of observations    20
The GLM Procedure
Dependent Variable: weight
                                Sum of
Source             DF          Squares     Mean Square    F Value    Pr > F
Model               1      44556.80000     44556.80000      27.85    <.0001
Error              18      28796.40000      1599.80000
Corrected Total    19      73353.20000
The first part of the
output shows that SAS sees Group as a categorical variable with 2 levels. There are 20 observations total (so far, so
good). Then there is an ANOVA source table
for the entire model. You can see the
sources of variance, the sums of squares, degrees of freedom, mean squares, and
overall F (27.85 in this example). The
significant value of F means that the difference between means is significant
with p < .05 (actually, it shows that p < .0001, but this is not
typically hypothesized).
R-Square    Coeff Var    Root MSE    weight Mean
0.607428     9.905275    39.99750       403.8000
Next, (just above) there are summary statistics
for the model as a whole (more on this later).
Source    DF      Type I SS    Mean Square    F Value    Pr > F
group      1    44556.80000    44556.80000      27.85    <.0001
Source    DF    Type III SS    Mean Square    F Value    Pr > F
group      1    44556.80000    44556.80000      27.85    <.0001
Below that are
different summaries of the test for the Group variable. Notice that the tests all have the same
value of F in this case
(27.85). In more complicated models,
this equality will not hold, and you will have reason to prefer some tables
over others. However, in this example,
life is simple and any way you look at it, the result is significant.
The GLM Procedure
Levene's Test for Homogeneity of weight Variance
ANOVA of Squared Deviations from Group Means
                   Sum of        Mean
Source    DF      Squares      Square    F Value    Pr > F
group      1      2836852     2836852       1.21    0.2861
Error     18     42255843     2347547
Levene’s test for the
homogeneity of variance assumption is shown above. A non-significant difference means that homogeneity of variance
is plausible. Note, however, that this
is not a very powerful test with small numbers of people. On the other hand, the t-test and ANOVA are
supposed to be robust to departures from homogeneity of variance.
The GLM Procedure
Level of          ------------weight-----------
group      N            Mean         Std Dev
1         10      356.600000      44.9251229
2         10      451.000000       34.370530
The last bit of
printout is the values of the mean and standard deviation for each group. As you can see, the control group of rats is
the heavier on average.
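The statement at the start of this example that F = t² with two groups can be checked by running the same data through PROC TTEST. This is a sketch rather than part of the original example; the data are the same 20 weights, read several observations per line with the @@ line-hold specifier.

```sas
DATA D1;
   INPUT GROUP WEIGHT @@;
   CARDS;
2 437 2 455 2 483 2 392 2 455 2 469 2 513 2 452 2 444 2 410
1 383 1 355 1 313 1 410 1 289 1 344 1 418 1 400 1 344 1 310
;
PROC TTEST DATA=D1;
   CLASS GROUP;
   VAR WEIGHT;
RUN;
```

The pooled (equal variance) t should be about 5.28 in absolute value with 18 degrees of freedom, and squaring it gives the F of 27.85 from the GLM source table.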
Example 2: Effects of performance feedback and norm
group on level of aspiration for a motor pursuit task.
In a classic social
psychological experimental waste of time, Dr. Penner decided to test males on
their performance using a difficult motor pursuit task. The task yields a numerical score for each
trial that shows how well the participant did on that trial (large numbers
indicate better performance). But as
you have probably already guessed, participants are not told how well they
actually did. Instead, they are fed a
brazen lie that they scored either Above Average, Average, or Below Average
compared to a norm group. Additionally
they were told (again, dishonestly) that the norm group was either College Men
or Professional Athletes. The dependent
variable was a quantitative score given by the participant about their
‘aspiration level’ for the next trial.
That is, did the feedback that they got about how well they did and
compared to whom make a difference in how well they thought they could do (and
aspired to do) during the next trial?
[Part B. How did Penner get this
through the IRB? Stay tuned.]
data d1;
input aspire standing norm;
**********************************************************************
* Data from Hays, 1994, p. 491, Table 12.7.1.                        *
* Aspiration experiment.                                             *
* Design is fully crossed (independent groups).  Standing has three  *
* levels and Norm has two levels.                                    *
**********************************************************************;
* Standing - 1=Above, 2=Average, 3=Below.
* Norm - 1=College men, 2=Professional Athletes.
**********************************************************************;
cards;
52 1 1
48 1 1
43 1 1
50 1 1
43 1 1
44 1 1
46 1 1
46 1 1
43 1 1
49 1 1
28 2 1
35 2 1
34 2 1
32 2 1
34 2 1
27 2 1
31 2 1
27 2 1
29 2 1
25 2 1
15 3 1
14 3 1
23 3 1
21 3 1
14 3 1
20 3 1
21 3 1
16 3 1
20 3 1
14 3 1
38 1 2
42 1 2
42 1 2
35 1 2
33 1 2
38 1 2
39 1 2
34 1 2
33 1 2
34 1 2
43 2 2
34 2 2
33 2 2
42 2 2
41 2 2
37 2 2
37 2 2
40 2 2
36 2 2