SAS Manual

PSY 6217

Univariate Statistics

 

 

 

 

 

 

 

by

Michael T. Brannick, Ph.D.


 


Contents

SAS Introduction
  DATA and PROC Steps
DATA Step
  Reading and Writing Data
    Reading
    Writing
  Manipulating Data
    Arithmetic Operations
    Functions
    Internal Counters
    OUTPUT Statement
    DO Statement
    ARRAY Statement
    Recoding
  Selecting & Concatenating (Merge & Stack)
    MERGE
    DELETE
    DROP
PROC Step Utilities
  PROC SORT
  PROC STANDARD
Exploring Your Data
  PROC PRINT
  PROC MEANS
  PROC UNIVARIATE
  PROC PLOT & GPLOT
Statistical Analyses
  PROC CORR
    Example 1:  Height & Weight
    Example 2:  Pairwise vs. Listwise Deletion
    Example 3:  BY Statement
  PROC GLM
    Example 1:  t-test
    Example 2:  3 X 2 ANOVA
    Example 3:  2 factor ANOVA with Repeated Measures
    Example 4:  ANCOVA for Pizza Sales
    Example 5:  Aptitude Treatment Interaction with Post Hoc Test
  PROC LOGISTIC
    Example 1:  Heart Attack Relapse
  PROC REG
    Example 1:  Simple Regression
    Example 2:  Multiple Regression with Diagnostics
    Example 3:  Output and Partial Correlation
    Example 4:  Variable Selection for Model Building
    Example 5:  Nonlinear Relations

 


 

 

SAS Introduction

 

SAS stands for the Statistical Analysis System.  It was written by statisticians for statisticians.  SAS is extremely powerful, flexible and opaque.  This manual introduces many of the features of SAS that I have found most useful in my work as a psychologist.  There is much more to SAS than what is shown here.  Chances are good that whatever the analysis you want, SAS has it.  It just may be covered in other documentation. Check the online SAS System help for an overview of all that SAS does.

 

Getting in touch with SAS

 

SAS is networked and can be accessed from the open use computing labs.  You must have a valid computer account to use the computing labs and SAS.  You can also lease SAS through the College of Arts and Sciences.  Once you install the system on your PC or find an open computer at a lab, you can invoke SAS by double clicking its icon.

 

You will use three primary windows to deal with SAS for Windows: a program window, a log window, and an output window.  The program window is where you type in your program, that is, where you issue commands for SAS to execute.  The log window is the place where SAS tells you what it thinks you said and shows you any errors or differences of opinion.  It also contains little technical notes to thrill geeks.  The output window contains the results of the SAS program; that is, it contains the output that you requested (in the case of a successful run) or nothing (in the case of a completely unsuccessful run).

 

SAS Help

 

Help is one of the main menu items at the top of the SAS interface.  Many of the topics in this document are covered in detail in the Help materials.  To find explicit discussion of many of the topics in this document, click Help on the main menu, then 1) SAS System, 2) Base SAS Documentation, and 3) Language Reference.  At that point, you can choose such topics as SAS Functions or Informats.

 

Sample Program

 

DATA D1;

INPUT A 1 B 3;

CARDS;

1 2

2 2

2 3

3 4

2 3

1 3

2 1

1 2

4 4

2 2

PROC PRINT;

RUN;

 

SAS Grammar

 

SAS reads along until it encounters a semicolon (;).  Then it stops and interprets everything up to the semicolon as either an instruction or data.  Once upon a time various instructions had to fall within certain columns or else the computer became confused.  The semicolon as a delimiter ended this necessity.  This is nice because you don't have to worry about columns or even lines.  The following would be legal and produce the same result as

D = A+B+C;

 

D =

 A+

    B

  +   C;

 

The one exception to the semicolon at the end is when data are input.  In the above file, note that data immediately follow the CARDS statement.  SAS goes until the semicolon, and then backs up one row.  SAS figures therefore that whatever precedes PROC PRINT; must be the end of the data.  If you want to use a semicolon to end the data, put it one line below the last line of data, like this:

 

1 2 2

4 4 5

2 2 3

;

PROC UNIVARIATE PLOT NORMAL;

 

 

The following would produce an error in the SASLOG:

 

1 2 2

4 4 5

2 2 3

PROC UNIVARIATE

PLOT NORMAL;

 

because SAS would think that PROC UNIVARIATE is part of the data.

 


DATA and PROC steps

 

SAS divides the program into two parts or steps, DATA and PROC.  In the DATA step you describe your data to SAS, as well as create and manipulate variables.  In the PROC step you tell SAS what procedure to use, that is, what data analysis you want (e.g., correlation, regression, ANOVA, etc.).

 

 

The DATA step

 

Begin the data step with the word DATA followed by a name.  In the above example, you can see DATA D1;  SAS interprets this to mean "create a data set and call it D1."

 

INPUT STATEMENT

 

The INPUT statement tells SAS what variables are read from a data set.   In this case, there are two variables called A and B, and they are read from columns 1 and 3.  Variable A is read from column 1 and variable B is read from column 3.

 

Data contained in multiple columns are handled by linking column numbers with a dash, like this:

 

INPUT A 1 B 2-3;

 

SAS would interpret this to mean "find data for variable A in column 1, and data for B in columns 2 and 3."  Blanks within a number are illegal and will result in an error.  For example

 

INPUT A 1-3;

 

would result in an error in reading the data in our example because of the blanks.

CARDS;

1 2

2 2

Note that the data following CARDS have values in columns 1 and 3 but a blank in column 2.  That embedded blank is what causes the error when the data are read with the input statement INPUT A 1-3;

 

Compute statements such as

 

D = A+B+C;

 

also belong in the data step.  SAS interprets this statement to mean "create a new variable called D.  Define D as the sum of A, B, and C. Compute the result for each row (person) and collect this as part of the data contained in set D1."  Note that D will be missing when any of A, B and C are missing.  Compute statements can be quite complex.  You can find the natural logarithm, do modular arithmetic or operate on DO loops with compute statements.

 

The CARDS statement tells SAS that the data follow immediately, and that they are to be read from the same file that contains the SAS instructions.

 

The PROC step

 

You run the PROC step after telling SAS all about your data.  Once SAS knows what and where your data are, it can do what you want it to with them.  The PROC step begins with the word "PROC" followed by the name of the procedure you want.  In our sample program, the PROC is PRINT.  PRINT will cause SAS to print out all of the data we input.  This is useful for checking that SAS is analyzing the data we want it to.  It is surprisingly common to find that the data are a row or column off, so that SAS isn't actually analyzing the data you intended.  PRINT lets us verify that the data are correct.  In SAS there are many, many PROCs to choose from, such as ANOVA.

 

SAS Output for the Sample Program

 

Results of PROC PRINT:

 

                                      OBS    A    B

 

                                        1    1    2

                                        2    2    2

                                        3    2    3

                                        4    3    4

                                        5    2    3

                                        6    1    3

                                        7    2    1

                                        8    1    2

                                        9    4    4

                                       10    2    2

 

This shows that SAS thinks there are records on 10 people, each represented by a row.  There are two columns, corresponding to variables A and B.  SAS thinks that the data for the first person include a "1" for variable A and a "2" for variable B.


 

 

Data Step

 

DATA statement.  You must start the DATA step with the DATA statement.  Following the DATA statement is the name of the data set.  For example, if you write 'DATA D1;' you create a dataset named D1.  The name of the dataset cannot be more than 8 characters long.  Start with a letter.  You can also use numbers and the underscore (_).  For example, you could create datasets named D1, D2 or ALL_SET.  The following would create error messages: 'DATA 1234;' (begins with a digit) and 'DATA ALLTOOMANY;' (too many characters).

Reading and Writing Data

Reading

 

Records and Observations.  Your input data file consists of one or more lines of data.  Each line of data in your input data file is a record.  SAS keeps track of the number of records that are read from your file.  Your data file is a collection of numbers that describes one or more entities, objects, or people.  The entities or objects or people of interest are called OBSERVATIONS (OBS) by SAS.  There may be more than one record for each observation.  For example, you might have asked people to respond to 150 questions, and have room for 80 columns on a record in your data file.  In that case, you would need at least 2 records per observation.

 

Missing Data.  When you collect data, it is almost inevitable that you do not collect all variables for all people.  In surveys, for example, some people omit their age or skip one or more items on the survey.  SAS represents missing data with a period (.).  You can explicitly tell SAS that a value is missing by putting a period in the data file.  However, you can also leave the missing item blank and tell SAS exactly where to look for the values.  I typically leave missing values as blanks.  So long as you specify the exact location of your numeric variables, SAS will substitute the missing value for blanks.  (If you are reading alpha [letters and symbols] instead of numbers, the value for blank will be legal and read by SAS.  You can deal with this later by telling SAS to substitute missing for blank.)

 

INPUT Statement.  You need to include an INPUT statement for any dataset that you want SAS to read.  You may read data from either (a) the file that is in your SAS Program Window or (b) an external file, such as a file on your hard drive.  For example, you might say

INPUT A B C;

SAS interprets this to mean read three variables, named A, B and C respectively.

Input Formats

SAS allows many different instructions, called formats, about how to read data.  I will describe three main types and give examples of each. 

 

Free (List) format.  The simplest format is free or list format.  For example

INPUT X Y Z A1 A2 A3;

This instructs SAS to read six variables (X through Z and A1 through A3).  SAS will read numbers from the following data and assign them serially to each of the six variables.  The first number SAS finds will be assigned to X, the second to Y, and so forth, until it gets to A3 with the sixth number.  The seventh number will be assigned to the second observation (person) for X.  SAS knows when numbers start and end through blank spaces.  Each time SAS finds a blank space or the end of a record it considers one number to end and another to begin.

 

Free format is attractive because it is simple.  It has a major drawback, however.  Because SAS skips over blanks, it does not distinguish between intentional and unintentional blanks.  If a value for a variable is missing, SAS will just read the value for the next variable and place it (wrongly) into the first variable.  If you want to use free format and you have missing data, you must put a single period (.) to stand for each missing value.  SAS considers a period listed by itself (without adjacent numbers) to be a symbol for missing data.
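As a small sketch of the period as a placeholder in free format (the dataset name FREE and the data values are made up for illustration), the program below reads three variables from three records.  The second record has a period where Y would be, so SAS should keep 6 in Z and leave Y missing for that observation.

```sas
DATA FREE;
INPUT X Y Z;
* The period in the second record marks a missing value for Y;
CARDS;
1 2 3
4 . 6
7 8 9
;
PROC PRINT;
RUN;
```

If the period were left out of the second record, SAS would read 6 into Y and then go looking on the next record for Z, which is exactly the problem described above.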

 

Column format.  With column format, you name the variable and then tell SAS the number of the column(s) in which to find the data for that variable.  For example

INPUT A 1 B 2-3 C 10-15;

This says to find the data for variable A in column 1, the data for variable B in columns 2 and 3 and the data for variable C in columns 10 through 15.

 

Alpha variables ($).  The default in SAS is to assume that variables are numeric, that is, they are numbers rather than letters or symbols.  If you want variables that are alpha rather than numeric, such as names or social security numbers, you communicate this to  SAS with the dollar sign following the variable name.  For example

INPUT NAME $ 1-20 A1 21-22 A2 23-24;

This says to read the alpha variable NAME in the first 20 columns, and then to read A1 and A2 in the next two sets of two columns.

 

Pointing or jumping (@).  You can tell SAS to go directly to a column using the at sign (@).  For example

INPUT NAME $ 1-20 @ 23 A2;

This tells SAS to read the alpha variable NAME in the first 20 columns and then to jump to column 23 to read variable A2.

 

Record number (#).  If you have more than one record ('card') per person, you can tell SAS which record to read using the number sign.  Suppose you have 10 records per person,  but the only information you need is on the first card and the last card.  You might have an input statement something like this:

INPUT NAME $ 1-20 #10 @ 50 SALARY;

This tells SAS to read the alpha variable NAME on the first record, then to skip to the 10th record to find SALARY data starting at column 50.  Note that if the salary data were missing on the record, you would need to note this by placing a period (.) for missing data there.  Otherwise, SAS would read the next available number and place it in the SALARY variable.  To avoid this problem, assuming that SALARY is left blank when it is missing, you would simply list the columns in which the salary data is found.  For example,

INPUT NAME $ 1-20 # 10 SALARY 50-55;

When you have multiple records and you only need information from some of them, you must tell SAS to go to the last record as part of the INPUT statement.  Otherwise, SAS will think that each observation has as many records as the last record from which data were read.  Suppose you have 10 records per observation, but you only need data from the first two.  You would write

INPUT NAME $ 1-20 #2 SALARY 50-55 #10;

If you leave off the #10, then SAS will think that there are 2 records per observation.  It will then read whatever is on the third record for the first person in columns 1-20 and assign  that as the NAME of the second observation.

 

Fixed (informat).  Instead of specifying the column numbers, you can tell SAS the number of columns and where to place the decimal.  Suppose we have administered a survey with 10 items in it and each response takes 1 column (e.g., a 1 to 5 Likert scale where 1=Strongly Disagree and 5 = Strongly agree).  Our data might look like this:

 

 

Name        Item
             1   2   3   4   5   6   7   8   9  10
Col 1-10    11  12  13  14  15  16  17  18  19  20
Joe          2   3   2   4   1   5   5   5   2   3
Mary         2   4   4   3   2   5   4   5   3   3

 

For each person, we record their name in the first ten columns, and then the responses for each of the 10 items in columns 11 through 20.  With column format, we would write

INPUT NAME $ 1-10 I1 11 I2 12 I3 13 I4 14 I5 15 I6 16 I7 17 I8 18 I9 19 I10 20;

With an informat statement, we could write

INPUT NAME $ 1-10 I1 1. I2 1. I3 1. I4 1. I5 1. I6 1. I7 1. I8 1. I9 1. I10 1.;

This says to give each item 1 column.  A quicker way to write this is:

INPUT NAME $ 1-10 (I1-I10)(1.);

SAS allows the hyphen (-) to refer to variables that have the same beginning, but differ only by a final digit.  In this case, I1-I10 means I1 through I10.  The hyphen cannot be used for variables that do not have the same stem.  For example, suppose we had two surveys, the first of which was 10 items long and the second of which was 20 items long.  Further suppose that both scales have single column responses and that the second scale follows the first immediately on the same record.  We would write

INPUT (I1-I10)(1.)(T1-T20)(1.);

The statement

INPUT (I1-T20)(1.) ;

would cause an error message.

 

Informats for Decimals.  We usually put the decimal point in our data when we write data to a file.  For example, if the number is 32.8, we would put that into 4 columns, such as:

32.8

With column format, we would tell SAS to read the appropriate 4 columns, e.g.,

INPUT TEMP 1-4;

With informats, however, we do not need to put the decimal in the data, thus saving space in our data file.  We can tell SAS where to put the decimal during the input statement.  Suppose we had punched

328

and we wanted SAS to read this as 32.8.  Then we would use the statement

INPUT TEMP 3.1;

This tells SAS to read three columns and to put one digit to the right of the decimal point.  If we instead said

INPUT TEMP 3.2;

SAS would interpret the number as 3.28 rather than 32.8.  Therefore the number to the left of the decimal in the INPUT statement is the number of columns, and the number to the right of the decimal is the number of digits to the right of the decimal.
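To see the width and decimal parts of the informat at work, here is a minimal sketch (the dataset name TEMPS and the data values are hypothetical).  The first two records have no decimal point punched, so SAS supplies one; the third record already contains a decimal point, and in that case SAS uses the decimal point in the data rather than the one implied by the informat.

```sas
DATA TEMPS;
* Read three columns and assume one digit to the right of the decimal;
INPUT TEMP 3.1;
CARDS;
328
451
4.5
;
PROC PRINT;
RUN;
```

PROC PRINT should show TEMP as 32.8, 45.1 and 4.5 for the three observations.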

 

Combining Formats.  SAS allows any combination of free, column, and informat instructions in a single INPUT statement.  This gives you great flexibility in inputting your data.  For example, you could write

INPUT #10 (I1-I50)(1.0) #2 @ 50 A1 A2 A3 #3 SSN $ 2-10;

I wouldn't ordinarily tell SAS to read the 10th record first.  I like to start with the 1st record and 1st column and move across columns to the end of each record.  That way I don't get confused.  Nor would I allow free format on the 2nd record, because of the problems with missing data.  However, you CAN do all that if you want to.

 

CARDS Statement.  The CARDS statement tells SAS that your data follow immediately in the same program where you have your INPUT statement.  For example

INPUT A 1 B 2 C 3;

CARDS;

123

234

456

;

shows an input statement that tells SAS to find 3 variables in column format.  The CARDS statement follows, and is in turn followed  by three records, one for each observation.  SAS continues to read each line as a record of data until it finds a semicolon (;).  It then backs up one line (record) and considers this to be the last line (record) of data.  

           

INFILE Statement.  The INFILE statement tells SAS where to find a file containing data that is external to the Program Window.  For example, suppose you kept your data in a file on a floppy disk.  Let's suppose that the floppy disk drive is A: and the name of your file is Sumdata.txt.  You would write:

DATA D1;

INFILE 'A:Sumdata.txt';

INPUT A 1 B 2 C 3;

Note that you put the INFILE statement before the INPUT statement.  Also note the single quotes and semicolon.  The single quotes tell SAS where to find the external file.  The semicolon must be placed after the single quotes to end the statement.  Suppose your hard drive is C: and you keep your data in a directory inside the SAS directory.  If you keep your data in a directory called MYDATA, then your INFILE statement would look something like this:

INFILE 'C:\SAS\MYDATA\Sumdata.txt';

 

Writing

 

You can use SAS to write or output data as well as to read or input data.  The formats that you use to do this are identical to the input formats, which were described earlier.

 

FILE Statement.  Use the FILE statement to tell SAS where to write the new file.  For example

DATA D2; SET D1;

FILE 'A:Newdat.sas';

 

I always create a new dataset and copy the contents of an earlier dataset when I use the FILE statement.  The statements  DATA D2; SET D1; tell SAS to create a new dataset called D2 and to copy the contents of D1 into it.   The reason for this is that I am generally outputting derived data to the new dataset.  I may have added items together to form scales, combined the judgments of two or more judges into a composite, or merged datasets from different places to form my new data.

 

The above FILE statement will create a file on your A: drive called Newdat.sas.  Be sure to follow the DOS conventions for legal file names and extensions (e.g., eight or fewer characters for the name, three for the extension).  The file that SAS writes is an ASCII file with one line per record.

 

SAS will over-write files with identical names.  It will not ask you about this either, so be careful.  If you already have a file named Newdat.sas on your A: drive, you will lose it by running SAS with the above lines in it.

 

PUT Statement.  Use the put statement to tell SAS the format of the new file.  For example, you might write

DATA D2; SET D1;

FILE 'A:Newdat.sas';

PUT SSN 1-10 @ 15 (A1-A10)(3.2) #2 (I1-I50)(1.0);

You are not limited to 80 columns in your PUT statements.  However, it is often a good idea to use no more than 80 columns for a record because you may have problems viewing the new file having more than 80 columns with some software.

 

Manipulating Data in the Data Step

Arithmetic Operations

 

You can combine variables, take functions of them, and generally assign values to variables in the data step.  For example, you might want to add items to form a scale:

Scale = I1+I2+I3+I4+I5;

 

SAS uses the following symbols to perform arithmetic operations

 

Operation

Symbol

Addition

+

Subtraction

-

Multiplication

*

Division

/

Power (exponent)

**

 

SAS will carry out the exponent operation first, then multiplication and division, and only then subtraction and addition.  Therefore you need to use parentheses to make sure that SAS is doing what you want.  For example

A = B + C**2;

SAS interprets this to mean "set variable A equal to variable B plus C-squared." 

That is,

A=B+(C**2);

It does not mean to set A equal to the square of the sum of B and C.  The latter value would be written:

A = (B+C)**2;

 

Arithmetic with missing values.  SAS will set the result of any arithmetic operation to be missing if ANY part of the operation contains a missing value.  So for example if

Scale = I1+I2+I3+I4+I5;

Then Scale will be missing if any 1 of the 5 items is missing.  This is not always true with functions.  Some functions ignore missing values.

 

IF Statement.  The IF statement causes SAS to check on the truth of a statement or expression, and then to execute a command if the statement or expression is true.

For example

IF A LT 3 THEN B=0;

would be interpreted by SAS to mean that if the value of variable A is less than 3, set variable B equal to 0.

Suppose we had a distribution of depression scores in a variable named BECK, and we wanted to dichotomize them so that values greater than 14 were set to 1 (high) and values of 14 or less were set to zero (low).  We could do this by

BECK2 = 0;

IF BECK GT 14 THEN BECK2=1;

The first statement creates a variable called BECK2, and sets its value to zero.  The second statement sets the value of BECK2 to 1 if the score on BECK is greater than 14.
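One thing to watch with this kind of recode is missing data.  SAS treats a missing value as smaller than any number, so a person with a missing BECK score would be coded 0 (low) by the two statements above.  The sketch below is one way to keep such people missing on BECK2 as well; the data values are made up for illustration.

```sas
DATA D1;
INPUT BECK;
* Handle the missing case first so that it is not coded as low;
IF BECK EQ . THEN BECK2 = .;
ELSE IF BECK GT 14 THEN BECK2 = 1;
ELSE BECK2 = 0;
CARDS;
20
9
.
;
PROC PRINT;
RUN;
```

Here the first person should get BECK2 = 1, the second BECK2 = 0, and the third a missing value for BECK2.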

 

Functions

 

SAS supports a large number of functions, including arithmetic (math, trig, Boolean) and statistical functions.  For example LOG is the natural logarithm function.  We can write the Fisher r to z transformation as:

Z = .5*LOG((1+r)/(1-r));

Note that I needed to include the parentheses around the addition and subtraction operations.  Otherwise, SAS would have done the division before the addition and subtraction and the result would have been wrong.

 

To find a complete list of SAS functions, click on Help at the main SAS menu.  Then 1) SAS System, 2) Base SAS Documentation, 3) Language Reference and finally 4) SAS Functions.

 

A few functions ignore missing values.  For example, suppose we have five items that we want to sum to form a scale.  We can compute this by:

SCALE = SUM(OF I1-I5);

But the SUM(OF) function ignores missing values, so that if one of these items is missing for a person,  that person will have as their scale score the sum of the non-missing items.

To avoid this problem, you should either set the result to missing like this:

 

ARRAY A I1-I5;

DO OVER A;

IF A EQ . THEN SCALE =.;

END;

 

(see the ARRAY statement) or else use arithmetic to form the scale, that is:

 

SCALE = I1+I2+I3+I4+I5;
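The difference between the two ways of forming a scale is easy to see side by side.  The following is a sketch with made-up data; the variable names SCALESUM, SCALEADD and NMISSING are my own choices for illustration.  It also uses the NMISS function, which counts the missing values in a list of variables and so offers another way to find the people who skipped an item.

```sas
DATA D1;
INPUT I1-I5;
* SUM ignores missing items and adds up whatever is present;
SCALESUM = SUM(OF I1-I5);
* Plain arithmetic gives a missing result if any item is missing;
SCALEADD = I1 + I2 + I3 + I4 + I5;
* NMISS counts how many of the five items are missing;
NMISSING = NMISS(OF I1-I5);
CARDS;
1 2 3 4 5
1 . 3 4 5
;
PROC PRINT;
RUN;
```

For the first person both scales should equal 15 and NMISSING should be 0.  For the second person SCALESUM should be 13, SCALEADD should be missing, and NMISSING should be 1.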

 

Internal Counters

 

Internal counters are values that SAS uses for its own record- or house-keeping purposes.  Sometimes you will want to use them for your own purpose.

 

The _N_ counter.  The symbol _N_ refers to the observation number.  The third observation in a dataset is known to SAS as _N_ = 3, the fourth observation is _N_ = 4, and so on.  If you wanted to examine the first 25 people in the dataset, you could use the following statements:

 

DATA D1;

INPUT A B C;

IF _N_ LE 25;

CARDS;

data go here...

PROC PRINT;

RUN;

 

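The _I_ counter.  The symbol _I_ refers to the internal counter that SAS keeps for a DO OVER loop on an array (see the ARRAY statement).  A DO LOOP is a set of commands that are executed repeatedly.  The first time through a DO OVER loop, _I_ = 1, the second time through, _I_ = 2, and so on.  A DO LOOP of the form DO K = 1 TO 10 uses the counter that you name (here K) instead.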

 

The OUTPUT Statement.  The OUTPUT Statement causes SAS to write an observation to the current dataset.  For example, suppose we have the following program:

 

DATA D1;

INPUT A B C;

IF A LT 2 THEN OUTPUT;

CARDS;

1 2 3

4 5 6

PROC PRINT;

RUN;

 

Then the output will be:

OBS  A   B   C

1    1   2   3

 

Only the first observation is written to the dataset because only the first observation has a value of A less than 2.

 

DO Statement.  The DO statement causes SAS to execute a set of operations as one unit.  The operations in the unit are those encountered prior to the END statement.

Its form is:

DO Name = Start TO Stop <BY Increment>;

 

For example,

 

INPUT A B C;

If A EQ . THEN DO K=1 TO 10;

A = NORMAL(0);

OUTPUT;

END;

 

The NORMAL function returns a random number from an approximately Normal (0,1) distribution.  The above program reads values for each person for A, B, and C.  If A is missing, then the program generates 10 values for A, each from a normal distribution, and outputs them to the current dataset.

 

If we want to create values of the r to z function to place in a table, we could use the following program:

 

DATA D1;

DO r = -.9 to .9 by .1;

Z = .5*log((1+r)/(1-r));

OUTPUT;

END;

PROC PRINT;

RUN;

 

SAS would produce the following output:

 

                                  OBS        R           Z

 

                                    1    -0.90000    -1.47222

                                    2    -0.80000    -1.09861

                                    3    -0.70000    -0.86730

                                    4    -0.60000    -0.69315

                                    5    -0.50000    -0.54931

                                    6    -0.40000    -0.42365

                                    7    -0.30000    -0.30952

                                    8    -0.20000    -0.20273

                                    9    -0.10000    -0.10034

                                   10    -0.00000    -0.00000

                                   11     0.10000     0.10034

                                   12     0.20000     0.20273

                                   13     0.30000     0.30952

                                   14     0.40000     0.42365

                                   15     0.50000     0.54931

                                   16     0.60000     0.69315

                                   17     0.70000     0.86730

                                   18     0.80000     1.09861

                                   19     0.90000     1.47222

 

ARRAY Statement.  An ARRAY statement is a convenient way to refer to a collection of variables.  The form of the statement is

ARRAY Name [Variable List];

For example, suppose we have a 50-item scale that we want to put into an array.  We could do this with:

ARRAY Scale I1-I50;

SAS interprets this to mean "create an array called Scale that contains the variables I1 through I50 (our 50 items)."

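Arrays are often used with DO LOOPS.  If we have a set of variables X and want to create a set of variables Y each of which is equal to X, we could use the following, naming the loop counter K in parentheses after each array name so that SAS knows which element of the array to use on each pass:

ARRAY A(K) X1-X10;

ARRAY B(K) Y1-Y10;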

DO K=1 TO 10;

B = A;

END;

The DO OVER command is also used with arrays.  The example immediately above could be written as:

ARRAY A X1-X10;

ARRAY B Y1-Y10;

DO OVER A;

B = A;

END;

In either case, the variable Y1 would be set equal to X1, Y2 would be set equal to X2, and so forth.  Note that we refer to each array by its name in our compute statements.  The compute statements are executed once for each variable in the array (for the DO OVER command) or using the variables that correspond to an explicit counter in the case of the DO statement.
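Another way to write the same copy, assuming your release of SAS supports explicitly subscripted arrays, is to state the number of elements in braces and to put the counter in braces each time you refer to the array.  This is only a sketch with three variables and one made-up record, but it makes plain which element is used on each pass through the loop.

```sas
DATA D1;
INPUT X1-X3;
* Each array has three elements, and Y1-Y3 are created here;
ARRAY A{3} X1-X3;
ARRAY B{3} Y1-Y3;
DO K = 1 TO 3;
   B{K} = A{K};
END;
* The counter is not needed in the dataset;
DROP K;
CARDS;
4 5 6
;
PROC PRINT;
RUN;
```

PROC PRINT should show Y1, Y2 and Y3 equal to 4, 5 and 6, matching X1, X2 and X3.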

 

Recoding

 

It is usually the case with questionnaires or surveys that some of the items are written so that they are opposite in meaning to others.  For example one item might be "I love my job" and another item might be "I hate my job."  In both cases the response would be 1=strongly disagree to 5=strongly agree.  One set of responses needs to be scored opposite to the other to provide a consistent scoring key.  This can be done using an ARRAY statement.  Suppose we have 20 items in a scale, and each response is coded 1 to 5 as described above.  If items 2, 4, 5, 7, 11 and 19 are to be reverse scored, we would write:

 

ARRAY A I2 I4 I5 I7 I11 I19;

DO OVER A;

A = 6-A;

END;

 

This would cause each item to be subtracted from 6, which will result in the proper recoding.  In general with Likert-type items, if there are k response options, recode by subtracting the input response from k+1 (in our case 5 response options, so we subtract from 6).

 

Selecting & Concatenating (Merging & Stacking Selected Data)

 

SAS allows lots of flexibility in managing your data.  You can read data from several different sources and then merge them.  For example, you might have one set of data from Human Resources that contains information about the amount of training that employees have received.  Another dataset might come from Accounting, where they keep information on store profits.  Or you might want to stack some new data that you generated on top of data that already exist within a file.

 

SET Statement.  The SET statement causes SAS to stack vertically all datasets named in the SET statement.  For example

 

DATA ALL; SET D1 D2 D3;

 

would cause SAS to create a new data set called ALL and then to stack D2 on top of D3 and D1 on top of D2.  SAS lines up the variables by name, so variables that are meant to be stacked together must have the same names in each dataset.  For example, after the SET statement, if D1 and D2 both have variables named A, B, and C, then the observations for D1 for variable A will be in the dataset above those for A in D2.  If only D1 has a variable named A, then after the SET statement, there will be values in the variable A for the observations from D1, but the observations from D2 will have missing values for A.
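Here is a small sketch of stacking two datasets that share only one variable name (the data values are made up for illustration).  D1 has variables A and B, while D2 has variables B and C.

```sas
DATA D1;
INPUT A B;
CARDS;
1 2
3 4
;
DATA D2;
INPUT B C;
CARDS;
5 6
;
* Stack D1 on top of D2;
DATA ALL; SET D1 D2;
PROC PRINT;
RUN;
```

The dataset ALL should have three observations and three variables.  The two observations from D1 come first and have missing values for C; the observation from D2 comes last and has a missing value for A.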

 

MERGE Statement.  The MERGE statement causes SAS to stack horizontally all datasets named in the statement.  For example

 

DATA ALL; MERGE D1 D2;

 

would cause SAS to create a new dataset called ALL and concatenate all the variables in D1 and D2.  The MERGE statement is usually used with the BY statement.  Most files have some identifying variable, such as social security number, that is used to match records for the merge.  So if we had such an identifier we would use it like this:

 

PROC SORT DATA = D1; BY SSN;

PROC SORT DATA = D2; BY SSN;

DATA ALL; MERGE D1 D2; BY SSN;

 

For people where there is a matching SSN, there will be values assigned for all variables contained in both D1 and D2 after the merge.  If the person is listed in only D1 or D2 but not both, that person will have missing data corresponding to either D1 or D2 after the merge.  Be sure that the variables other than the BY variable have different names in the two datasets prior to merging.  Otherwise SAS will simply replace the values from one dataset with the values from the other.
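The following sketch shows a matched merge from start to finish.  The dataset names HR and ACCT, the identifier ID, and the data values are all hypothetical; ID simply stands in for an identifier such as SSN.  Person 1 appears only in HR and person 4 appears only in ACCT.

```sas
DATA HR;
INPUT ID TRAIN;
CARDS;
1 10
2 20
3 30
;
DATA ACCT;
INPUT ID PROFIT;
CARDS;
2 500
3 700
4 900
;
* Sort both datasets by the matching variable before the merge;
PROC SORT DATA = HR; BY ID;
PROC SORT DATA = ACCT; BY ID;
DATA ALL; MERGE HR ACCT; BY ID;
PROC PRINT;
RUN;
```

The merged dataset should have four observations.  Persons 2 and 3 have values for both TRAIN and PROFIT, person 1 has a missing value for PROFIT, and person 4 has a missing value for TRAIN.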

 

DELETE Statement.  Use the DELETE statement to delete an observation from  a dataset.  For example:

 

IF _N_ EQ 10 THEN DELETE;

IF A EQ 3 THEN DELETE; 

IF B EQ . THEN DELETE;

 

The first statement would delete the 10th observation.  The second statement would delete observations in which the variable A has the value 3.  The third statement would delete any observation with a missing value for B.
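
As a brief sketch with made-up data, the DELETE statement goes inside the DATA step, after the INPUT statement that reads the variable being tested.  Here the second data line has a missing value for B and should not appear in the printout.

```sas
* Hypothetical data: the second line is missing B;
DATA D1;
  INPUT A B;
  IF B EQ . THEN DELETE;
  CARDS;
1 5
3 .
2 7
;
PROC PRINT DATA=D1;
RUN;
```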

 

DROP Statement.  Use the DROP statement to eliminate variables that you do not need.  For example, if you created some scales from items, you might want to keep the scale scores and drop the item scores.  To see the correlations among the scales without the items, we could write:

 

DATA D1;

INFILE 'A:ITEM.DAT';

INPUT (I1-I150)(1.0);

SCALE1=SUM(OF I1-I50);

SCALE2=SUM(OF I51-I100);

SCALE3=SUM(OF I101-I150);

DATA D2; SET D1;

DROP I1-I150;

PROC CORR;

RUN;
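
When you want to retain only a few variables, the KEEP statement is often shorter to write than DROP.  This sketch assumes that D1 was created as in the program above; it should leave the same three scale scores in D2.

```sas
* Keep the scale scores and discard everything else;
DATA D2; SET D1;
  KEEP SCALE1-SCALE3;
PROC CORR DATA=D2;
RUN;
```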

PROC STEP

Most Procedures (PROCs) are used for statistical analysis.  However, there are some procedures for managing data.  The most frequently used of these is PROC SORT.

 

Sorting

PROC SORT

 

Use PROC SORT to arrange the observations by the values of one or more variables in your dataset.  PROC SORT can be used to organize your data for you to see or it can be used as a precursor to another PROC.  You should also sort datasets that you plan to merge by a common variable.  For example, if you want to merge two datasets by a common social security number, sort each dataset by social security number before the merge.

 

Form:

PROC SORT <DATA = dataset name><OUT = dataset name>;

BY <DESCENDING> variable name(s);

 

DATA Option.  By default, SORT uses the most recently created dataset.  You can tell SAS to SORT any dataset by naming that dataset here.

 

OUT Option.  By default, SAS will replace the current dataset with the sorted data.  To create a new dataset for the sorted data and still keep the original dataset, use this option.  If you use a name for a dataset that you used earlier, SAS will replace that earlier dataset, so be careful with long programs.

 

BY statement.  The BY statement tells SAS which variable(s) to use for sorting.  For example

 

PROC SORT DATA=D1 OUT=SD1; BY AGE;

 

The above command tells SAS to sort the data in D1 by the values in the variable AGE. If you want a descending sort, you specify this in the BY statement immediately before the variable to be used for sorting.  For example

 

PROC SORT; BY DESCENDING AGE DESCENDING INCOME;

 

Would sort the data descending by age and descending by income.  If the second DESCENDING keyword were left off, the data would have been sorted descending by age and ascending by income.  When more than one variable is in the BY statement, SAS sorts first by the left-most variable and proceeds serially to the right.
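
The left-to-right rule is easiest to see with a few observations.  This sketch uses invented values for AGE and INCOME and sorts descending by age, then ascending by income within each age.

```sas
* Hypothetical data;
DATA D1;
  INPUT AGE INCOME;
  CARDS;
30 50
45 20
30 70
45 90
;
* DESCENDING applies only to the variable that follows it;
PROC SORT DATA=D1 OUT=SD1;
  BY DESCENDING AGE INCOME;
PROC PRINT DATA=SD1;
RUN;
```

The sorted dataset SD1 should list the two 45-year-olds first (incomes 20, then 90), followed by the two 30-year-olds (incomes 50, then 70).  D1 itself is left unchanged because OUT= was used.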

 

PROC STANDARD

 

PROC STANDARD allows you to set the mean and standard deviation of one or more variables to a value that you choose. Suppose you want to find z scores for a variable called AGE in your data.  You would use the command

PROC STANDARD M=0 S=1; VAR AGE;
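
A slightly fuller sketch, assuming a dataset D1 that contains AGE (the dataset name ZD1 is made up here): naming the output dataset with OUT= makes it explicit where the z scores end up, and a PROC MEANS afterward lets you confirm that the new mean is 0 and the new standard deviation is 1.

```sas
* Write the standardized scores to a new dataset, ZD1;
PROC STANDARD DATA=D1 MEAN=0 STD=1 OUT=ZD1;
  VAR AGE;
* Check the result: the mean should be 0 and the Std Dev 1;
PROC MEANS DATA=ZD1;
  VAR AGE;
RUN;
```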

 

 

Exploring Your Data

PROC PRINT

Form:

PROC PRINT <DATA=Dataset name>;

<VAR name(s);>

PROC PRINT causes SAS to print a list of observations by variables in a dataset.

 

DATA= Option.  By default, SAS prints the most recently created dataset.  You can tell SAS which dataset you want by placing another dataset name here.

 

You should always print out your data and check it to be sure that what is in the computer matches exactly the paper or original source of the data.  To print all the data, use the command

 

PROC PRINT;

 

This will print everything.  If you do not want to see all of the data, you can reduce the amount in several ways.

 

VAR Statement.  The VAR (variable selection) statement tells SAS to use the chosen procedure (PROC) only on variables that you name.  Otherwise, the default for nearly all PROCs is to use all applicable variables in the dataset.  To select only the variables A B and C for printing, you would write:

 

PROC PRINT; VAR A B C;

 

You can also reduce the number of observations in your dataset by selecting observations with the internal counter.

 

DATA D1;

INPUT A B C;

IF _N_ LE 25;

CARDS;

Data lines...

PROC PRINT;
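
Another way to limit the printout, sketched here under the assumption that D1 already exists, is the OBS= dataset option.  It restricts what PROC PRINT reads without removing any observations from the dataset itself.

```sas
* Print only the first 25 observations of D1;
PROC PRINT DATA=D1 (OBS=25);
  VAR A B C;
RUN;
```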

 

I recommend that you print all of the data before doing any analyses.  If possible, check every data point against the raw data (e.g., check the numbers in the printout against the survey responses).  If this is not possible, check the first few observations, the last few observations, and a sprinkling of  data throughout the file.  You need to discover errors in your data early and to correct them.  Other procedures that may assist you with finding errors are FREQ, UNIVARIATE, MEANS, and PLOT.

 

Example of PROC PRINT

 

DATA D1;

INPUT A 1 B 3;

CARDS;

1 2

2 2

2 3

3 4

2 3

1 3

2 1

1 2

4 4

2 2

PROC PRINT;

RUN;

 

Results of PROC PRINT:

 

                                         OBS    A    B

 

                                           1    1    2

                                           2    2    2

                                           3    2    3

                                           4    3    4

                                           5    2    3

                                           6    1    3

                                           7    2    1

                                           8    1    2

                                           9    4    4

                                          10    2    2

 

This shows that SAS thinks there are records on 10 people, each represented by a row.  There are two columns, corresponding to variables A and B.  SAS thinks that the data for the first person include a "1" for variable A and a "2" for variable B.

 

 

 

PROC MEANS

 

This procedure produces means for your variables, of course.  It also produces other descriptive information.  If you have a large number of variables to view, MEANS is often a good bet.  You can use the maximum and minimum values to see whether any of your variables have any illegal values.  For example, you might have a variable that takes values from 1 to 5, but find that it somehow has a value of 50.  This would cause you to go back to your data to spot the error.

 

Form:

PROC MEANS <DATA=Dataset name>;

<VAR Variable name(s);>

 

DATA= Option.  The default is to use the most recently created dataset.  If you want means for another dataset, specify the dataset name here.

 

VAR Statement.  The default for MEANS is to use all numeric variables.  Use the VAR statement to select a subset of your variables.

 

Example Program

 

DATA D1;

INPUT A 1 B 3;

CARDS;

1 2

2 2

2 3

3 4

2 3

1 3

2 1

1 2

4 4

2 2

PROC MEANS; VAR A B;

RUN;

 

Output from PROC MEANS:

 

              Variable   N          Mean       Std Dev       Minimum       Maximum

              --------------------------------------------------------------------

              A         10     2.0000000     0.9428090     1.0000000     4.0000000

              B         10     2.6000000     0.9660918     1.0000000     4.0000000

              --------------------------------------------------------------------

 

Note how MEANS produces a compact table that shows the name of the variable, the number of non-missing observations, the mean, standard deviation, minimum and maximum.
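
If you want a different set of statistics, you can list statistic keywords on the PROC MEANS statement.  The sketch below assumes the dataset D1 from the example program; NMISS (the number of missing values) and MAXDEC= (the number of decimal places to print) are additions that do not appear in the output above.

```sas
* Ask for specific statistics and print two decimal places;
PROC MEANS DATA=D1 N NMISS MEAN STD MIN MAX MAXDEC=2;
  VAR A B;
RUN;
```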

 

 

PROC UNIVARIATE

 

Univariate provides a great deal of descriptive information about a variable.

 

Form:

PROC UNIVARIATE <PLOT> <NORMAL><DATA=Dataset name>;

VAR Variable name(s);

 

PLOT Option.  The PLOT option causes SAS to produce a stem-and-leaf diagram, a box plot and a normal probability plot for visual representations of your variable(s).

 

NORMAL Option.  The NORMAL option causes SAS to produce a test of the hypothesis that the data were drawn from a population in which the variable is normally distributed.

 

DATA= Option.  The default is  to use the most recently created dataset.  You can specify a different dataset here.

 

VAR Statement.  The default is to produce an analysis of all numeric variables in the dataset.  Use the VAR statement to choose a subset of the variables.

 

Example Program

 

DATA D1;

INPUT A 1 B 3;

CARDS;

1 2

2 2

2 3

3 4

2 3

1 3

2 1

1 2

4 4

2 2

PROC UNIVARIATE PLOT NORMAL;

RUN;

 

 

 

 

PROC UNIVARIATE Results

 

Univariate shows lots of information about the distribution of the variable.  It shows the mean, median and mode.  It shows the variance, standard deviation, range, quartiles, and various percentiles.  It shows the skew and kurtosis.

 

Variable=A

 

 Moments                                            Quantiles(Def=5)

 

 N       10     Sum Wgts  10       100% Max         4       99%         4

 Mean     2     Sum       20        75% Q3          2       95%         4

 Std Dev  0.94  Variance  0.89      50% Med         2       90%       3.5

 Skewness 0.99  Kurtosis  1.19      25% Q1          1       10%         1

 USS     48     CSS       8          0% Min         1        5%         1

 CV      47.14  Std Mean  0.298142                           1%         1

 T:Mean=0   6.708204  Pr>|T|       0.0001                Range            3

 Num ^= 0         10  Num > 0          10                Q3-Q1            1

 M(Sign)           5  Pr>=|M|      0.0020                Mode             2

 Sgn Rank       27.5  Pr>=|S|      0.0020

 W:Normal   0.840083  Pr<W         0.0430

 

Then UNIVARIATE shows the extremes, or highest and lowest numbers in the distribution and where they came from.  For example, the lowest number in this variable is 1.0, and that number can be found for persons (rows) 8, 6 and 1.  The highest number in the distribution is 4, and this can be found for person 9.  This is useful in large datasets.  You can find errors or inspect people with unusually high or low scores.

 

 

                                            Extremes

 

 

                               Lowest  Obs     Highest    Obs

                               1(       8)        2(       5)

                               1(       6)        2(       7)

                               1(       1)        2(      10)

                               2(      10)        3(       4)

                               2(       7)        4(       9)

 

The stem-leaf diagram and the boxplot are both graphical methods for representing the distribution.  The stem-leaf diagram is a kind of histogram.  This one shows that one person scored a 4.0, one person scored a 3.0, five people scored a 2.0, and three people scored a 1.0.  The boxplot is a drawing of the distribution.  The middle of the distribution is shown by the box, and the tails of the distribution are shown by the whiskers extending from the box.  This plot shows that the distribution is skewed.


 

 

                        Stem Leaf                     #             Boxplot

                           4 0                        1                0

                           3

                           3 0                        1                |

                           2                                           |

                           2 00000                    5             +--+--+

                           1                                        |     |

                           1 000                      3             +-----+

                             ----+----+----+----+

 

 

Stem Leaf and Boxplot for variable B

 

                        Stem Leaf                     #             Boxplot

                           4 00                       2                |

                           3                                           |

                           3 000                      3             +-----+

                           2                                        *--+--*

                           2 0000                     4             +-----+

                           1                                           |

                           1 0                        1                |

                             ----+----+----+----+

 

This distribution (B) is closer to being Normal.  The mean of the distribution is shown by the + in the middle of the box.  The median is shown by the *-----* across the box.  For variable B, the mean and median are about equal, as they should be when data are approximately normally distributed.  The quartiles are shown by the lines +-----+ that form the edges or shoulders of the box.  For variable A, the mean is graphed on the upper quartile and the median is not shown separately, because the mean, the median and the upper quartile all equal 2.  The tails or extremes are the vertical bars, | .  Unusual scores are shown at the ends of the tails as either 0 as in variable A, or for more extreme scores, the symbol * is used (not shown in either graph).  The graph for variable B is what you want to see for a nice normal distribution.  The mean and median occur together.  The shoulders are equally spaced from the mean and median.  The tails are equally long and there are no circles or asterisks at the ends.  Variables that have distributions that look like those of variable B are best for statistical analysis.  Note that the stem-leaf and boxplot graphs both show the same information, that is, the shape of the distribution.  They just do it in slightly different ways.

 


PROC PLOT & PROC GPLOT

 

PROC PLOT is a simple program for getting a look at your data.  PROC GPLOT produces nicer graphs that you can use in publications or other presentations that require high quality.  With GPLOT, you can copy the entire graph as a window and then paste it into other Windows software, such as MS Word and CorelDRAW.

Form

 

PROC [G]PLOT <DATA=Dataset name>;

PLOT Vertical * Horizontal </ OVERLAY>;

 

DATA= Option.  The default is to use the most recently created dataset.  If you want a different dataset, specify that dataset's name here.

 

PLOT Statement.  You must have at least one PLOT statement.  The first named variable will be the vertical axis; the variable to the right of the asterisk (*) will be the horizontal axis.  In a regression problem, we would write

 

PLOT Y*X;

 

You can have more than one PLOT statement within a PROC PLOT or PROC GPLOT, and more than one plot request (Vertical*Horizontal pair) within a PLOT statement.  Each plot request will cause SAS to produce one graph.

 

OVERLAY Option.  The OVERLAY option causes SAS to put the graphs for all of the plot requests in a single PLOT statement onto one graph.

 

Example Plot Program

 

DATA D1;

INPUT A 1 B 3;

CARDS;

1 2

2 2

2 3

3 4

2 3

1 3

2 1

1 2

4 4

2 2

PROC PLOT; PLOT B*A;

RUN;

 

 

 

Results of PROC PLOT

 

Results will be more meaningful if (a) you have more observations, and also (b) both variables take on many values.  Each of the variables plotted here takes only 4 values.  In cases with small numbers of values (like ours) it probably makes more sense to produce a two-way frequency table.

 

 

         Plot of B*A.  Legend: A = 1 obs, B = 2 obs, etc.

   B

   4 +                                A              A

   3 +  A              B

   2 +  B              B

   1 +                 A

     ---+--------------+--------------+--------------+--

        1              2              3              4

                              A
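
For variables with only a few distinct values, a two-way frequency table can be produced with PROC FREQ.  This is a sketch that assumes the dataset D1 from the plot program; the options after the slash simply suppress the row, column and overall percentages so that only the counts are printed.

```sas
* Cross-tabulate B (rows) by A (columns), counts only;
PROC FREQ DATA=D1;
  TABLES B*A / NOROW NOCOL NOPERCENT;
RUN;
```

Each cell count in the table should correspond to a letter in the plot (A = 1 observation, B = 2 observations).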

 

 

Example for GPLOT

 

To use GPLOT, you must have installed SAS/GRAPH.  USF has installed this on the open use facilities.  If you leased SAS, you may or may not have installed SAS/GRAPH.

 

This example illustrates the use of GPLOT.  It also uses several other elements described in this document, including a DO loop, a function (EXP, which raises e, the base of the natural logarithm, to a power) and the OUTPUT statement.  What we do is to create two logistic curves (see, for example, Pedhazur's chapter on logistic regression).  The DO loop has a counter called L.  L varies from -10 to 10 by 1.  At each value of L, we compute two functions of L, called P and P2.  Both values are output on each iteration of the DO loop.

 

The OVERLAY command causes both plots to appear in a single graph.  Otherwise, SAS would produce one graph for each plot.

 

 

Input Program:

 

Data d1;

A=1;

A2=0;

B=.50;

Do l = -10 to 10;

P=(exp(a+b*l)/(1+exp(a+b*l)));

P2=(exp(a2+b*l)/(1+exp(a2+b*l)));

Output;

End;

Proc gplot;

Plot p*l p2*l/overlay;

Run;
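
To check the formula by hand, it helps to compute a single point.  The sketch below (the dataset name CHECK is made up) evaluates both curves at L=0 and writes the results to the SAS log with the PUT statement.  At L=0 the first curve reduces to exp(1)/(1+exp(1)), which is about 0.73, and the second reduces to exp(0)/(1+exp(0)), which is 0.5.

```sas
* Evaluate both logistic curves at a single point, L = 0;
DATA CHECK;
  A=1;
  A2=0;
  B=.50;
  L=0;
  P=EXP(A+B*L)/(1+EXP(A+B*L));
  P2=EXP(A2+B*L)/(1+EXP(A2+B*L));
  * PUT writes the named values to the log;
  PUT P= P2=;
RUN;
```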

 

Output of GPLOT:

This plot was copied from SAS to a Word file and re-sized.  That is, you can copy and paste from SAS to another Windows application.  This feature is particularly nice with the graphs that GPLOT produces.

 

Example 2

 

The second example shows a scatterplot with a regression line plotted on the same graph.  The data are generated using the NORMAL random number function.  The argument 0 in NORMAL(0) tells SAS to use the computer clock as the seed, or start, value, so the data will differ from run to run.

 

 

 

 

 

SAS Input:

 

DATA D1;

DO L=1 TO 100;

X1=NORMAL(0);

X2=NORMAL(0);

Y=X1+X2;

Y2=.0593+.94*X1;

OUTPUT;

END;

PROC PRINT;

PROC GPLOT;

PLOT Y*X1 Y2*X1 /OVERLAY;

RUN;

 

SAS Output:

 

 


Statistical Analyses

PROC CORR

 

Form

 

PROC CORR <DATA=dataset name> <NOMISS> <ALPHA>;

<BY variable name(s);>

<VAR variable name(s);>

 

Description

 

PROC CORR computes Pearson product-moment correlation coefficients.  The default is to find the correlation matrix with all non-missing observations for each pair of numeric variables, thus maximizing the sample size and power for each significance test.  Such an approach is called pairwise deletion of missing values.  For other purposes, such as multivariable applications (multiple regression, etc.) or reliability analyses, you will want to delete any observation that is missing on any variable in the analysis.  This is called listwise deletion of missing data. 

 

DATA = dataset name option.  Use the DATA=dataset name option to choose a dataset for CORR to analyze.  The default is the most recently created dataset.

 

NOMISS option.  For listwise deletion (see the description paragraph on PROC CORR), use the NOMISS option. 

 

ALPHA option.  This produces an item analysis using Cronbach's alpha, an estimate of internal consistency reliability.  Use this only if you specify NOMISS.  This is used primarily in scale development or prior to other statistical analysis to check scale reliability.
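
A minimal sketch of an item analysis, assuming a dataset D1 that holds fifty items named I1 to I50 that are meant to form one scale (the names are borrowed from the DROP example and are only illustrative):

```sas
* Cronbach's alpha for a 50-item scale, listwise deletion;
PROC CORR DATA=D1 NOMISS ALPHA;
  VAR I1-I50;
RUN;
```

Only the items that belong to the scale should be listed in the VAR statement, because alpha is computed across all of the variables named there.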

 

VAR statement. Use the VAR statement to select a subset of variables.  The default is to use all numeric variables in the dataset.   The choice of variables in the VAR statement may affect the sample size when NOMISS is specified.  If a variable with a small number of observations is omitted, the sample size for the other variables may increase.

 

BY statement.  Use the BY statement to compute correlation matrices separately for  each value of the BY variable.  The data must be sorted using PROC SORT before using the BY statement with PROC CORR.  For example, suppose you wanted to compute correlations separately for males and females, which were coded using a variable named SEX.  This would be accomplished by:

 

PROC SORT;

BY SEX;

PROC CORR;

BY SEX;

Example 1: Height and Weight

Input:

Data d1;

**********************************************

* We input height and weight from the table *

* using column format. *

**********************************************;

input ht 1-2 wt 4-6;

cards;

60 102

62 120

63 130

65 150

65 120

68 145

69 175

70 170

72 185

74 210

proc print;

* to verify that the data are correct;

proc means;

* to find the means and standard deviations of the raw scores;

proc standard m=0 std=1 out=d1;

************************************************

* Proc standard allows us to set the means     *

* and standard deviations of our variables     *

* to be whatever values we want. If we         *

* choose 0 and 1 as I did, we get z scores.    *

************************************************;

data d2; set d1;

* to find the products of z scores;

zhXzw = ht*wt;

proc print;

* to see the z scores and their products;

proc corr; var ht wt;

*to find the correlation between height and weight;

run;

 

 

 

 

 

 

 

 

Output:

 

 First, the raw data, which are output from Proc Print.

 

OBS    HT     WT

  1    60    102

  2    62    120

  3    63    130

  4    65    150

  5    65    120

  6    68    145

  7    69    175

  8    70    170

  9    72    185

 10    74    210

 

 Next, descriptive statistics, which are output from Proc Means.

 

Variable   N          Mean       Std Dev       Minimum       Maximum

HT        10      66.80000    4.54116987        60.000        74.000

WT        10      150.7000    33.9511086       102.000       210.000

 

 

 

 

 

 

Z scores output from the second Proc Print.

 

OBS          HT          WT       ZHXZW

  1    -1.49741    -1.43442     2.14791

  2    -1.05700    -0.90424     0.95578

  3    -0.83679    -0.60970     0.51019

  4    -0.39637    -0.02062     0.00817

  5    -0.39637    -0.90424     0.35842

  6     0.26425    -0.16789    -0.04436

  7     0.48446     0.71574     0.34674

  8     0.70466     0.56846     0.40058

  9     1.14508     1.01028     1.15685

 10     1.58549     1.74663     2.76927

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Results of Proc Corr.

Correlation Analysis

2 'VAR' Variables: HT WT

Simple Statistics

Variable     N    Mean    Std Dev    Sum      Minimum      Maximum

HT          10       0      1.000      0    -1.497412     1.585495

WT          10       0      1.000      0    -1.434416     1.746629

 

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 10

 

                  HT            WT

HT              1.00       0.95662

                0.0         0.0001

 

WT           0.95662          1.00

              0.0001           0.0

 

The correlation between height and itself is 1.00.  The p value printed beneath it is 0.0, but this isn't very meaningful, because the correlation of any variable with itself is always 1.0.  The correlation between height and weight is .95662.  The p value beneath it, 0.0001, is the probability of obtaining a correlation at least this large in absolute value in a sample of 10 if the sample were drawn from a population in which the correlation is zero.  In other words, the correlation is significant (or to be more precise, significantly different from zero by conventional standards).  PROC CORR also produces descriptive statistics, including the number of non-missing observations, the mean and standard deviation of each variable, along with the sum, minimum and maximum.
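
The column of z-score products connects directly to the correlation: Pearson's r is the sum of the products divided by N - 1 (N - 1 rather than N because PROC STANDARD uses the sample standard deviation).  The sketch below assumes that the dataset D2 from the program above still exists; the dataset names SUMS and R and the variable names SUMPROD and N are made up for illustration.

```sas
* Sum the z-score products and count them;
proc means data=d2 noprint;
  var zhXzw;
  output out=sums sum=sumprod n=n;
* Divide the sum of products by N - 1 to get r;
data r; set sums;
  r = sumprod / (n - 1);
proc print data=r;
  var n sumprod r;
run;
```

The value of r printed here should agree with the .95662 reported by PROC CORR.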

 

Example 2:  Pairwise vs. Listwise Deletion

 

Assume that the data in this example are

1.  Race (column 1) – 1=Asian, 2=Black, 3=Hispanic, 4=White, 5=Other

2.  Job Satisfaction (columns 6-7) – This indicates to what extent the person finds his/her work satisfying and enjoyable.  Self report; larger numbers mean more satisfied.

3.  Life Satisfaction (columns 12-13) – Indicates the extent the person finds life in general to be enjoyable and satisfying.  Self report; larger numbers mean more satisfied.

4.  Trait Optimism (columns 18-19) – Indicates the extent to which the person takes an optimistic view of challenges and life in general.  Self report; larger numbers mean more optimistic.

 

INPUT

 

DATA D1;

INPUT RACE 1 JSAT 6-7 LSAT 12-13 OPTIMIST 18-19;

CARDS;

1    24    32    36

.    24    21    31

1    17    31    28

4    39    39    39

4    21    22    38

4    16    21    52

1    18    27   

4    27    26   

3    17    36    41

4    29    32    24

3    27    34    42

4    32    47    40

4    12    28    28

1    19          45

2    11          44

2    18    43    23

2    11    37    38

3    22    28    47

4    24    43    41

4    40    28    38

4    17    19    36

4    26    30    52

4    20    23    50

2    27     .    37

4    18    23    48

4    17    42    44

2    23    35    43

4    25    41    15

3    31    27    53

2    26    38    35

4    11    41    39

3    26    24    32

2    16    27    49

4    18    24    25

4    28    30    51

1    12    40    30

1    23    38    33

3    13    27    46

4    16    43    50

2    25    26    45

4    21    51    38

2    24    44    41

4    28    37    41

4    16    31    43

4    15    48    46

2    18    43    31

4    25    35    37

2    19    37    22

4    19    45    20

2    12    24    40

4    18    25    49

1    14    44    50

2    19    30    35

4     .    39    43

1    12    44    38

.    25    26    35

4    12    34    26

2    17    42    31

1    17    47    28

2    23    37    50

4    18    39    49

3    17    25    41

4    23    46    34

3    25    41    47

4    17    40    54

.    16    41    38

4    23    26    47

4    20    45    36

PROC CORR; VAR JSAT LSAT OPTIMIST;

 

This statement produces pairwise deletion.  It maximizes the number of observations per correlation.  The VAR statement was used to select the variables for the analysis.  RACE is not suitable for the analysis because it is categorical rather than continuous.

 

PROC CORR NOMISS;

VAR JSAT LSAT OPTIMIST;

RUN;

 

Output of PAIRWISE deletion

 

Correlation Analysis

 

                           3 'VAR' Variables:  JSAT     LSAT     OPTIMIST

 

SAS says that there are three variables in the analysis and lists them.

 

                                       Simple Statistics

 

  Variable             N          Mean       Std Dev           Sum       Minimum       Maximum

 

  JSAT                67     20.582090      6.233218   1379.000000     11.000000     40.000000

  LSAT                65     34.446154      8.301037   2239.000000     19.000000     51.000000

  OPTIMIST            66     39.060606      8.914774   2578.000000     15.000000     54.000000

 

 

SAS produces descriptive statistics.  It then produces the actual correlations.

 

 

 

 

 

    Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations

 

                                       JSAT              LSAT          OPTIMIST

 

                 JSAT               1.00000          -0.06700           0.02583

                                     0.0               0.5988            0.8381

                                         67                64                65

 

                 LSAT              -0.06700           1.00000          -0.18162

                                     0.5988            0.0               0.1543

                                         64                65                63

 

                 OPTIMIST           0.02583          -0.18162           1.00000

                                     0.8381            0.1543            0.0

                                         65                63                66

 

The first (top) number in each set of three is a correlation coefficient.  The correlation between job satisfaction (JSAT) and itself is 1.000.  The correlation between job satisfaction and life satisfaction (LSAT) is -0.06700.  The second (middle) number in each set of three is the probability (p) value for testing r=0.  The p value for the correlation between job satisfaction and itself is 0.0.  The probability value for the correlation between job satisfaction and life satisfaction is .5988.  The third (bottom) number in each set of three is the sample size upon which the correlation was computed.  The correlation for job satisfaction with itself was computed on 67 people.  The correlation between job and life satisfaction was computed on 64 people, all of whom had scores on both variables.

 

Output of LISTWISE deletion

 

Correlation Analysis

 

                           3 'VAR' Variables:  JSAT     LSAT     OPTIMIST

 

 

                                       Simple Statistics

 

  Variable             N          Mean       Std Dev           Sum       Minimum       Maximum

 

  JSAT                62     20.596774      6.247517   1277.000000     11.000000     40.000000

  LSAT                62     34.629032      8.357256   2147.000000     19.000000     51.000000

  OPTIMIST            62     38.854839      9.129033   2409.000000     15.000000     54.000000

 

 

            Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 62

 

                                       JSAT              LSAT          OPTIMIST

 

                 JSAT               1.00000          -0.05786           0.04696

                                     0.0               0.6551            0.7170

 

                 LSAT              -0.05786           1.00000          -0.18615

                                     0.6551            0.0               0.1474

 

                 OPTIMIST           0.04696          -0.18615           1.00000

                                     0.7170            0.1474            0.0

Note that all the same information is given with LISTWISE as with PAIRWISE deletion.  Note that there are only 62 observations in the analysis.  These 62 had no missing data on any of the variables listed in the analysis.  SAS only prints the sample size at the header (top) of the correlation matrix rather than under each p value.
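
If you want to see which observations survive listwise deletion, one option is to build the complete-case dataset yourself.  This sketch assumes the dataset D1 from the program above and uses the NMISS function, which counts the missing values among its arguments; the dataset name COMPLETE is made up.

```sas
* Keep only people with no missing values on the three variables;
DATA COMPLETE; SET D1;
  IF NMISS(JSAT, LSAT, OPTIMIST) = 0;
PROC CORR DATA=COMPLETE;
  VAR JSAT LSAT OPTIMIST;
RUN;
```

Because every remaining observation is complete, this run should reproduce the NOMISS results, and a PROC PRINT of COMPLETE would show exactly who was included.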

 

 

Example 3:  the BY Statement

 

This example uses the same data as example 2.  We will sort the data by RACE and then compute the correlation between job and life satisfaction separately for each value of race.

 

Input

 

PROC SORT;

   BY RACE;

PROC CORR;

  BY RACE;

VAR JSAT LSAT;

 

In this example, we first used PROC SORT to sort the data by RACE. Then we asked for a correlation matrix for each race, but only for the variables job satisfaction and life satisfaction.

 

Output

 

-------------------------------------------- RACE=. -------------------------------------

SAS first produces a correlation matrix for those whose value of RACE is missing.  You may not want this information, so you can delete people who are missing or you may disregard this part of the printout.  On the other hand, you may want to know the correlation for people who are missing - especially if they had a choice about what information to provide you.

 

                                      Correlation Analysis

 

                               2 'VAR' Variables:  JSAT     LSAT

 

 

                                       Simple Statistics

 

  Variable             N          Mean       Std Dev           Sum       Minimum       Maximum

 

  JSAT                 3     21.666667      4.932883     65.000000     16.000000     25.000000

  LSAT                 3     29.333333     10.408330     88.000000     21.000000     41.000000

 

 

 

 

 

 

            Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3

 

                                              JSAT              LSAT

 

                            JSAT           1.00000          -0.94138

                                            0.0               0.2191

 

                            LSAT          -0.94138           1.00000

                                            0.2191            0.0

 

-------------------------------------------- RACE=1 -------------------------------------

 

                                      Correlation Analysis

 

                               2 'VAR' Variables:  JSAT     LSAT

 

                                       Simple Statistics

 

  Variable             N          Mean       Std Dev           Sum       Minimum       Maximum

 

  JSAT                 9     17.333333      4.301163    156.000000     12.000000     24.000000

  LSAT                 8     37.875000      7.199950    303.000000     27.000000     47.000000

 

 

    Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations

 

                                              JSAT              LSAT

 

                            JSAT           1.00000          -0.48357

                                            0.0               0.2247

                                                 9                 8

 

                            LSAT          -0.48357           1.00000

                                            0.2247            0.0

                                                 8                 8

 

The matrices for the other values of RACE are omitted to conserve space.  The information provided by PROC CORR is given separately for each value of the BY variable.
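
If you decide that you do not want the matrix for people with a missing value of RACE, one way to skip it (a sketch, assuming the same dataset D1) is to add a WHERE statement to the procedure.  The observations stay in the dataset; they are simply left out of this analysis.

```sas
PROC SORT DATA=D1;
  BY RACE;
* Leave out observations with a missing value for RACE;
PROC CORR DATA=D1;
  WHERE RACE NE .;
  BY RACE;
  VAR JSAT LSAT;
RUN;
```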


PROC GLM

 

Form

 

PROC GLM <DATA = dataset name>;

<CLASS name(s) of categorical variable(s);>

MODEL dependent variable name(s) = independent variable name(s);

<TEST H = term for hypothesis effect E=term for error effect</HTYPE=number ETYPE=number>;>

<OUTPUT OUT = dataset name P=name R=name;>

<MEANS name(s) of categorical variable(s)/[Post Hoc Test];>

 

Description

 

GLM stands for General Linear Model.  GLM is the most flexible of the regression programs in SAS.  It is less efficient than other programs, but this is less of a concern now that even personal computers can handle very large problems in a very short period of time.  I use this program for models that contain categorical independent variables, that is, for ANOVA.  I also use it for models that contain both continuous and categorical variables, such as ANCOVA.

 

DATA = Option.  You can specify the dataset to be analyzed with the DATA = option.  The default is the most recently created dataset.

 

The CLASS Statement tells SAS which variables are categorical in nature.  SAS assumes that all variables are continuous unless you tell it otherwise with this statement.  In other words, you don't need to tell SAS which variables are continuous, just which are categorical.

 

The MODEL Statement tells SAS what the dependent variable is (or what the dependent variables are) and the names of the independent variables. The dependent variable(s) are listed to the left of the equals sign; the independent variables go on the right.
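
As a rough sketch of how the CLASS and MODEL statements fit together, the step below requests a two-factor ANOVA with an interaction.  The dataset name STUDY and the variable names Y, A, and B are hypothetical placeholders, not names used elsewhere in this manual.

```sas
* Hypothetical dataset STUDY with outcome Y and categorical factors A and B;
proc glm data=study;
  class a b;
  * Dependent variable on the left of the equals sign, independent variables on the right;
  * The term a*b requests the A by B interaction;
  model y = a b a*b;
run;
```

Writing the right-hand side as a|b is shorthand for the same three terms.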

 

The TEST Statement allows you to construct your own tests of effects.  To do this, you specify both the term for the hypothesis and the term for error.  For example, for a mixed design with a repeated measure, you might have a test for factor A with an error term of subjects within A.  You could then write

 

TEST H=A E=SUB(A);

 

The hypothesis and error terms must appear in the previous MODEL statement so that SAS will know what they are.

 

HTYPE and ETYPE Options.  You can specify the type of sums of squares to use in the TEST statement.  The default is the highest TYPE of sums of squares computed by SAS for the problem (usually TYPE=3).  Choosing the proper type is tricky.  I recommend that you either find a textbook example to match against your output to verify your choice, or consult a statistician.
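
Putting the TEST statement and its options together, a minimal sketch for the mixed design described above might look like the following.  The dataset MIXED and the variables Y, A, B, and SUB are hypothetical, and the data are assumed to be stacked with one line per subject per level of the repeated factor B.

```sas
* Hypothetical dataset MIXED: Y is the outcome, A is the between-subjects factor;
* B is the repeated factor and SUB identifies subjects;
proc glm data=mixed;
  class a sub b;
  * SUB(A) has to appear in the MODEL statement so that it can be used as an error term;
  model y = a sub(a) b a*b;
  * Test A against subjects within A, using Type III sums of squares for both terms;
  test h=a e=sub(a) / htype=3 etype=3;
run;
```

The F for A printed in the main source table uses the residual as its error term; the F printed by the TEST statement is the one that uses subjects within A.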

 

The OUTPUT Statement allows you to output predicted and residual values along with your original variables.  You can name the output dataset with OUT=.  If you do not supply a name, SAS names the new dataset itself (DATA1, DATA2, and so on); the original dataset is left unchanged.
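
A short sketch of the OUTPUT statement follows.  It borrows the dataset D1 and the variables GROUP and WEIGHT from Example 1 below; the names RESIDS, PRED, and RESID are my own choices for illustration.

```sas
proc glm data=d1;
  class group;
  model weight = group;
  * Save predicted values as PRED and residuals as RESID in a new dataset called RESIDS;
  output out=resids p=pred r=resid;
run;

* Look at the original variables next to the new ones;
proc print data=resids;
  var group weight pred resid;
run;
```

In a one-way design like this one, the predicted value for each observation is simply its group mean, and the residual is the observation's deviation from that mean.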

 

The MEANS Statement tells SAS to compute means and standard deviations of the dependent variable at each level of the categorical variable you name.  You can also specify post hoc tests such as the Tukey HSD (TUKEY) and the Scheffe (SCHEFFE), among others.
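
For instance, a post hoc test could be requested as in the sketch below.  The dataset TRIAL, the outcome SCORE, and the three-level factor DOSE are hypothetical names used only for illustration.

```sas
* Hypothetical dataset TRIAL with outcome SCORE and a three-level factor DOSE;
proc glm data=trial;
  class dose;
  model score = dose;
  * Print the mean and standard deviation at each level of DOSE;
  * and compare all pairs of means with the Tukey HSD test;
  means dose / tukey;
run;
```

Replacing TUKEY with SCHEFFE requests the Scheffe test instead; both options can also be listed after the slash.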

 

Example 1:  The t-test. 

 

This is a simple example that shows how to use GLM for the t-test or for one-way ANOVA.  The difference between the two is simply the number of levels of the single independent variable.  If there are two levels of the IV, then the t-test and ANOVA amount to the same thing, because F = t² in such a case.

 

INPUT:

 

data d1;

*********************************************************

* Example of a single factor experiment with            *

* two independent groups.   Data from Thorne & Slane,   *

* 1997, p. 224.  Data represent weight in grams of two  *

*  groups of rats.  All rats have a tendency to become  *

*  obese.  Half (experimental group) were randomly      *

*  selected for weight loss drug.  Control group got    *

*  saline.  Data after 4 weeks:                         *

*********************************************************;

* Experimental group = 1, Control group = 2             *

*********************************************************;

input group weight;

cards;

2 437

2 455

2 483

2 392

2 455

2 469

2 513

2 452

2 444

2 410

1 383

1 355

1 313

1 410

1 289

1 344

1 418

1 400

1 344

1 310

proc glm;

class group;

model weight = group;

means group/ hovtest;

run;

 

Notice the structure of the data.  SAS wants the data stacked by columns, one for the independent variable (group) and one for the dependent variable (weight).  The CLASS statement tells SAS that our IV is categorical.  The MODEL statement tells SAS the proper DV and IV.  The MEANS statement causes SAS to print the mean and standard deviation for each group; the HOVTEST option tells SAS to test for homogeneity of variance.
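
If your data happen to arrive with one column per group instead, a short DATA step can stack them.  The sketch below assumes a hypothetical side-by-side dataset called WIDE with variables EXPER and CONTROL (only the first two rats from each group are typed in) and uses the OUTPUT statement to write two stacked records for each line read.

```sas
* Hypothetical side-by-side layout with one column per group;
data wide;
  input exper control;
  cards;
383 437
355 455
;
run;

* Stack the two columns into one GROUP column and one WEIGHT column;
data stacked;
  set wide;
  group = 1;
  weight = exper;
  output;
  group = 2;
  weight = control;
  output;
  keep group weight;
run;
```

Because the groups are independent, which experimental rat shares a line with which control rat in WIDE is arbitrary; only the stacked version carries meaning for GLM.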

 

OUTPUT

 

                                       The GLM Procedure

 

                                    Class Level Information

 

                                 Class         Levels    Values

 

                                 group              2    1 2

 

 

                                  Number of observations    20

 

 

                                       The GLM Procedure

 

Dependent Variable: weight

 

                                              Sum of

      Source                      DF         Squares     Mean Square    F Value    Pr > F

 

      Model                        1     44556.80000     44556.80000      27.85    <.0001

 

      Error                       18     28796.40000      1599.80000

 

      Corrected Total             19     73353.20000

 

The first part of the output shows that SAS sees group as a categorical variable with 2 levels.  There are 20 observations total (so far, so good).  Then there is an ANOVA source table for the entire model.  You can see the sources of variance, the degrees of freedom, sums of squares, mean squares, and overall F (27.85 in this example).  The significant value of F means that the difference between the two group means is significant at p < .05 (the printout actually shows p < .0001, but .05 is the conventional criterion).
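
One way to check the claim that F equals t squared when there are two groups is to run the same data through PROC TTEST.  This is only a sketch, and it assumes the dataset D1 from the program above is still available in your session.

```sas
* Same data and same grouping variable, analyzed as an independent groups t-test;
proc ttest data=d1;
  class group;
  var weight;
run;
```

The pooled (equal variance) t should be about 5.28 in absolute value, because 5.28 squared is roughly 27.85, the F value in the source table.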

 

                      R-Square     Coeff Var      Root MSE    weight Mean

 

                      0.607428      9.905275      39.99750       403.8000

 

Next (just above) are summary statistics for the model as a whole (more on this later).
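
These summary statistics can be recovered from the source table by hand.  The sketch below does the arithmetic in a DATA step, with the sums of squares, mean square error, and grand mean typed in from the printout; the variable names are my own.

```sas
data _null_;
  * Values copied from the GLM printout above;
  ss_model   = 44556.8;
  ss_total   = 73353.2;
  ms_error   = 1599.8;
  grand_mean = 403.8;
  * R-Square is the model sum of squares over the corrected total;
  rsquare = ss_model / ss_total;
  * Root MSE is the square root of the mean square error;
  root_mse = sqrt(ms_error);
  * Coeff Var is Root MSE as a percentage of the grand mean;
  coeff_var = 100 * root_mse / grand_mean;
  * F is the model mean square (1 df here) over the mean square error;
  f_value = (ss_model / 1) / ms_error;
  put rsquare= 8.6 root_mse= 8.5 coeff_var= 8.6 f_value= 8.2;
run;
```

The values written to the log should agree with the R-Square, Root MSE, Coeff Var, and F Value in the printout.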

 

 

 

 

      Source                      DF       Type I SS     Mean Square    F Value    Pr > F

 

      group                        1     44556.80000     44556.80000      27.85    <.0001

 

 

      Source                      DF     Type III SS     Mean Square    F Value    Pr > F

 

      group                        1     44556.80000     44556.80000      27.85    <.0001

 

 

Below that (the Type I and Type III tables just above) are different summaries of the test for the group variable.  Notice that both tables have the same value of F in this case (27.85).  In more complicated models, this equality will not hold, and you will have reason to prefer some tables over others.  However, in this example, life is simple and any way you look at it, the result is significant.

 

The GLM Procedure

 

                       Levene's Test for Homogeneity of weight Variance

                         ANOVA of Squared Deviations from Group Means

 

                                       Sum of        Mean

                 Source        DF     Squares      Square    F Value    Pr > F

 

                 group          1     2836852     2836852       1.21    0.2861

                 Error         18    42255843     2347547                                      

 

Levene’s test for the homogeneity of variance assumption is shown above.  A non-significant result means that homogeneity of variance is plausible.  Note, however, that this is not a very powerful test with small samples.  On the other hand, the t-test and ANOVA are supposed to be robust to departures from homogeneity of variance, particularly when the group sizes are equal, as they are here.
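
If the homogeneity assumption is in doubt, the MEANS statement has other options for one-way designs.  As a sketch (again assuming the dataset D1 from this example), HOVTEST=BF requests the Brown-Forsythe version of the test, and WELCH requests Welch's variance-weighted ANOVA, which does not assume equal variances.

```sas
proc glm data=d1;
  class group;
  model weight = group;
  * Brown-Forsythe test of homogeneity plus the Welch ANOVA;
  means group / hovtest=bf welch;
run;
```

Both options apply only when the model has a single categorical independent variable, as it does here.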

 

The GLM Procedure

 

                       Level of            ------------weight-----------

                       group         N             Mean          Std Dev

 

                       1            10       356.600000       44.9251229

                       2            10       451.000000       34.370530

 

 

 

The last bit of printout gives the mean and standard deviation for each group.  As you can see, the control group of rats (group 2) is the heavier on average.

 

Example 2:  Effects of performance feedback and norm group on level of aspiration for a motor pursuit task.

 

In a classic social psychological experimental waste of time, Dr. Penner decided to test males on their performance using a difficult motor pursuit task.  The task yields a numerical score for each trial that shows how well the participant did on that trial (large numbers indicate better performance).  But as you have probably already guessed, participants were not told how well they actually did.  Instead, they were fed a brazen lie that they scored either Above Average, Average, or Below Average compared to a norm group.  Additionally, they were told (again, dishonestly) that the norm group was either College Men or Professional Athletes.  The dependent variable was a quantitative score given by each participant for his ‘aspiration level’ for the next trial.  That is, did the feedback about how well they did, and compared to whom, make a difference in how well they thought they could do (and aspired to do) on the next trial?  [Part B.  How did Penner get this through the IRB?  Stay tuned.]

 

INPUT

 

data d1;

input aspire standing norm;

**********************************************************************

*  Data from Hays, 1994, p. 491, Table 12.7.1.                       * 

*  Aspiration experiment.                                            *

*  Design is fully crossed (independent groups).  Standing has three *

*  levels and Norm has two levels.                                   *

**********************************************************************;

* Standing - 1=Above, 2=Average, 3=Below.

* Norm - 1=College men, 2=Professional Athletes.

**********************************************************************;

cards;

52 1 1

48 1 1

43 1 1

50 1 1

43 1 1

44 1 1

46 1 1

46 1 1

43 1 1

49 1 1

28 2 1

35 2 1

34 2 1

32 2 1

34 2 1

27 2 1

31 2 1

27 2 1

29 2 1

25 2 1

15 3 1

14 3 1

23 3 1

21 3 1

14 3 1

20 3 1

21 3 1

16 3 1

20 3 1

14 3 1

38 1 2

42 1 2

42 1 2

35 1 2

33 1 2

38 1 2

39 1 2

34 1 2

33 1 2

34 1 2

43 2 2

34 2 2

33 2 2

42 2 2

41 2 2

37 2 2

37 2 2

40 2 2

36 2 2