APPENDIX D

INTRODUCTION TO SAS

One of the hardest parts about learning SAS is creating data sets. For the most part, this appendix deals with data set creation. It is vital to note that the default data set used by SAS at any given time is the data set most recently created. We can specify the data set for any SAS procedure (PROC). Suppose we wish to do multiple regression analysis on a data set named delivery. The appropriate PROC REG statement is

proc reg data=delivery;

We now consider in more detail how to create SAS data sets.

D.1 BASIC DATA ENTRY

A. Using the SAS Editor Window

The easiest way to enter data into SAS is to use the SAS Editor. We will use the delivery time data, given in Table 3.2, as the example throughout this appendix.

Step 1: Open the SAS Editor Window The SAS Editor window opens automatically upon starting the Windows or UNIX versions of SAS.

Step 2: The Data Command Each SAS data set requires a name, which the data statement provides. This appendix uses a convention whereby all capital letters within a SAS command indicate a name the user must provide. The simplest form of the data statement is

data NAME;

The most painful lesson in learning SAS is the use of the semicolon (;). Each SAS command must end in a semicolon. It seems as though 95% of the mistakes made by SAS novices involve a forgotten semicolon. SAS is merciless about the use of the semicolon! For the delivery time data, an appropriate data command is

data delivery;

Later, we will discuss appropriate options for the data command.

Step 3: The Input Command The input command tells SAS the name of each variable in the data set. SAS assumes that each variable is numeric. The general form of the input command is

input VAR1 VAR2 … ;

We first consider the command when all of the variables are numeric, as in the delivery data from Chapter 3:

input time cases distance;

We designate a variable as alphanumeric (it contains some characters other than numbers) by placing a $ after the variable name. For example, suppose we know the delivery person's name for each delivery. We could read these names through the following input command:

input time cases distance person $;

Step 4: Give the Actual Data We alert SAS to the actual data with either the cards command (which is fairly archaic) or the lines command. The simplest way to enter the data is in space-delimited form. Each line represents a row from Table 3.2. Do not place a semicolon (;) at the end of the data rows. Many SAS users do place a semicolon on a row unto itself after the data to indicate the end of the data set. This semicolon is not required, but many people consider it good practice. For the delivery data, the actual data portion of the SAS code follows:

cards;
                         16.68        7      560
                         11.50        3      220
                         12.03        3      340
                         14.88        4       80
                         13.75        6      150
                         18.11        7      330
                          8.00        2      110
                         17.83        7      210
                         79.24       30     1460
                         21.50        5      605
                         40.33       16      688
                         21.00       10      215
                         13.50        4      255
                         19.75        6      462
                         24.00        9      448
                         29.00       10      776
                         15.35        6      200
                         19.00        7      132
                          9.50        3       36
                         35.10       17      770
                         17.90       10      140
                         52.32       26      810
                         18.75        9      450
                         19.83        8      635
                         10.75        4      150
                         ;
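
Steps 2 through 4 can be assembled into one complete data step. The sketch below is a minimal illustration rather than the full data set: it uses only the first three rows of Table 3.2, the data set name deliv3 is hypothetical, and the names in the person column are hypothetical values added purely to show how the $ marker reads a character variable.

```sas
/* Steps 2-4 together: name the data set, name the variables, give the data. */
/* deliv3 is a hypothetical name so that the full delivery data set is left alone. */
data deliv3;
   /* person is read as a character variable because of the trailing $ */
   input time cases distance person $;
   cards;
16.68  7  560  Adams
11.50  3  220  Baker
12.03  3  340  Clark
;
run;
```

Comments are kept outside the data rows because anything between the cards command and the closing semicolon is treated as data.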

Step 5: Using PROC PRINT to Check Data Entry It is very easy to make mistakes in entering data. If the data set is sufficiently small, it is always wise to print it. The simplest statement to print a data set in SAS is

proc print;

which prints the most recently created data set. This statement prints the entire data set. If we wish to print a subset of the data, we can print specific variables:

proc print;
var VAR1 VAR2 … ;

Many SAS users believe that it is good practice to specify the desired data set. In this manner, we guarantee that we print the data set we want. The modified command is

proc print data=NAME;

The following command prints the entire delivery data set:

proc print data=delivery;

The following commands print only the times from the delivery data set:

proc print data=delivery;
var time;
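
For a larger data set it is often enough to look at the first few rows. The following sketch combines the var statement with the obs= data set option, which is a standard SAS option but is not otherwise discussed in this appendix; the choice of five rows is arbitrary.

```sas
/* Print only the first five observations, and only two of the variables. */
proc print data=delivery (obs=5);
   var time cases;
run;
```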

The run command submits the code. When submitted, SAS produces two files: the output file and the log file. The output file for the delivery data PROC PRINT command follows:

The SAS System
                       Obs      time       cases     distance

                        1       16.68        7          560
                        2       11.50        3          220
                        3       12.03        3          340
                        4       14.88        4           80
                        5       13.75        6          150
                        6       18.11        7          330
                        7        8.00        2          110
                        8       17.83        7          210
                        9       79.24       30         1460
                       10       21.50        5          605
                       11       40.33       16          688
                       12       21.00       10          215
                       13       13.50        4          255
                       14       19.75        6          462
                       15       24.00        9          448
                       16       29.00       10          776
                       17       15.35        6          200
                       18       19.00        7          132
                       19        9.50        3           36
                       20       35.10       17          770
                       21       17.90       10          140
                       22       52.32       26          810
                       23       18.75        9          450
                       24       19.83        8          635
                       25       10.75        4          150

The resulting log file follows:

NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
       NC, USA.
NOTE:  SAS (r) 9.1 (TS1M2)
        Licensed to VA POLYTECHNIC INST &
       STATE UNIV-CAMPUSWIDE-IN, Site 0001798011.
NOTE:  This session is executing on the WIN_PRO platform
NOTE:  SAS initialization used:
       real time      19.30 seconds
       cpu time      1.56 seconds
1      data delivery;
2      input time cases distance;
3      cards;
NOTE:  The data set WORK.DELIVERY has 25 observations and 3
       variables.
NOTE:  DATA statement used (Total process time):
       real time     1.22 seconds
       CPU time      0.23 seconds
29     proc print data=delivery;
30     run;
NOTE:  There were 25 observations read from the data set
       WORK.DELIVERY.
NOTE:  PROCEDURE PRINT used (Total process time):
       real time     0.55 seconds
       cpu time      0.17 seconds

The log file provides a brief summary of the SAS session. It tells the analyst how many observations are in the data set, how many observations have missing data (in this case, there are no missing data), the commands executed, and any errors. The log file is almost essential for debugging SAS code. Section D.5 provides more details about this file.

B. Entering Data from a Text File

We can use the infile statement to read data from a text file. The form of this statement is

infile 'FULL FILE NAME';

The infile statement requires the full file name, including all path information (all the directories). The full file name must be enclosed by single quotes. Of course, the statement must end in a semicolon (;). The following example has the data in a text file named delivery.txt that is located in the directory

C:\My Stuff\Disk-Books\Regression 5th Ed

of my Windows laptop. UNIX follows a slightly different path convention. The following example illustrates how to use the infile statement for the delivery data:

data delivery;
   infile 'C:\My Stuff\Disk-Books\Regression 5th Ed\delivery.txt';
   input time cases distance;
run;
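
The infile statement also accepts options that describe the layout of the text file. As a sketch, assume a hypothetical comma-separated file delivery.csv in a hypothetical directory C:\data, whose first line holds column headings rather than data. The dlm=, dsd, and firstobs= options used here are standard infile options, but they are not part of the original example.

```sas
data delivery;
   /* dlm=','    : values are separated by commas                 */
   /* dsd        : two consecutive commas mean a missing value    */
   /* firstobs=2 : start reading at line 2, skipping the headings */
   infile 'C:\data\delivery.csv' dlm=',' dsd firstobs=2;   /* hypothetical file */
   input time cases distance;
run;
```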

D.2 CREATING PERMANENT SAS DATA SETS

There are many occasions where we expect to use a single data set many times. For example, many regression courses require projects that involve analyzing a single data set several times over the semester as the students learn more analytical techniques. In such a situation, it is nice to read the data only once and then create a permanent data set that is available for future use.

Step 1: Specify the Directory for the Permanent Data Set We specify the directory for our permanent data set through the libname statement, which has the form

libname NAME1 'FULL DIRECTORY NAME';

NAME1 is the name for the directory that we use purely within the SAS code. FULL DIRECTORY NAME is the actual name of the directory, including the full path information.

Step 2: Use the Data Statement to Create the Data Set The key point is to use the appropriate permanent name for the data set in the data statement. Specifically, suppose that we wish to create a data set named setname and that we named the directory name1. The appropriate name for the permanent SAS data set is name1.setname. The following example creates a SAS data set named book.delivery in the directory C:\My Stuff\Disk-Books\Regression 5th Ed:

libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
data book.delivery;
    infile 'C:\My Stuff\Disk-Books\Regression 5th Ed\delivery.txt';
    input time cases distance;
run;
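
One way to confirm that the permanent data set was written is PROC CONTENTS, which lists the variables and the number of observations without printing the data. This is a minimal sketch; the directory C:\regdata is a hypothetical stand-in for whatever directory the libname statement actually names.

```sas
libname book 'C:\regdata';   /* hypothetical directory */

/* Describe the permanent data set: variable names, types, number of observations. */
proc contents data=book.delivery;
run;
```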

The following code illustrates how to use the permanent data set. The libname statement must appear somewhere in the SAS code prior to the data set's use by a procedure:

libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
proc reg data=book.delivery;
  model time=cases distance;
run;

The output from this code follows:

The REG Procedure
                                   Model: MODEL1
                               Dependent Variable: time

                     Number of Observations Read         25
                     Number of Observations Used         25

                                 Analysis of Variance

                                         Sum of          Mean
            Source             DF       Squares        Square   F Value   Pr > F
            Model               2    5550.81092    2775.40546    261.24   <.0001
            Error              22     233.73168      10.62417
            Corrected Total    24    5784.54260

            Root MSE                3.25947  R-Square     0.9596
            Dependent Mean         22.38400  Adj R-Sq     0.9559
            Coeff Var              14.56162

                                 Parameter Estimates

                                    Parameter   Standard
            Variable          DF  Estimate     Error    t Value  Pr > |t|
            Intercept         1     2.34123    1.09673    2.13   0.0442
            cases             1     1.61591    0.17073    9.46   <.0001
            distance          1     0.01438    0.00361    3.98   0.0006

D.3 IMPORTING DATA FROM AN EXCEL FILE

The PC version of SAS has a nice wizard for importing an EXCEL spreadsheet as a SAS data set. The user has the option to bring the data in as a permanent data set or a temporary data set. A temporary data set exists purely for the duration of the SAS session. To bring in the EXCEL spreadsheet as a permanent data set, we need to run an appropriate libname statement prior to using the wizard.

The first row of the EXCEL spreadsheet needs to provide the variable names associated with each column. The names provided in the first row will become the variables in the SAS data set.
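
The wizard's work can usually also be done in code with PROC IMPORT. The following is a sketch under several assumptions: the workbook name and path are hypothetical, the worksheet is assumed to be called Sheet1, and dbms=xlsx is assumed to be available in the installation (this depends on the SAS release and on whether the SAS/ACCESS Interface to PC Files is licensed).

```sas
proc import datafile='C:\data\delivery.xlsx'   /* hypothetical workbook */
            out=work.delivery                  /* temporary data set    */
            dbms=xlsx
            replace;                           /* overwrite if it already exists */
   sheet='Sheet1';     /* assumed worksheet name */
   getnames=yes;       /* first row supplies the variable names */
run;
```

Replacing work.delivery with a two-part name such as book.delivery, after an appropriate libname statement, would make the imported data set permanent.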

It is not as easy to import an EXCEL spreadsheet into the UNIX version of SAS. The steps required follow.

Step 1: Export the EXCEL Spreadsheet We will need the EXCEL spreadsheet in dbf format (DBF III, IV, or V), which is easily done by the Save As button in EXCEL.

Step 2: Get the dbf File into UNIX Format If the dbf file was created on a Windows computer, we need to change its format for UNIX. Save the file in a UNIX directory and then execute the following UNIX command:

dos2unix -ascii data > newdata

Step 3: Import the File into SAS Let NAME.dbf be the name of the dbf file. The following command creates a temporary work file named NAME:

proc import dbms=dbf out=work.NAME datafile="NAME.dbf";

Step 4: When in Doubt, Contact Your System's Administrator! Things often seem to go wrong when crossing platforms, such as from Windows to UNIX. What works for one set of systems may not work perfectly for another.

D.4 OUTPUT COMMAND

The output command allows the user to augment a previously created data set with information generated by a SAS procedure. Many SAS procedures support the output command. Its general form is

output out=SAS-NAME (output list) ;

In this case SAS-NAME is the name of the data set created by the output command. The resulting data set is the data set used by the procedure plus the variables added through (output list). Suppose we wish to add the predicted values and the raw residuals to the delivery time data set. Let delivery2 be the new data set. Suppose that we call the predicted delivery times ptime and that we call the raw residuals res. The appropriate output command is

output out=delivery2 p=ptime r=res;

The p is SAS's designation for the predicted values generated by PROC REG, and r is the designation for the raw residuals. In the output list, the SAS designation always is on the left-hand side of the = sign. The variable name within the new data set is always on the right-hand side. To create a data set with

It is very important to remember that the default data set used by a SAS procedure is the one most recently created. One of the saving graces of the output command is that it includes the data set used by the procedure to create the output data set.
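
The output command appears inside the procedure that generates the new variables. As a sketch, assuming the temporary data set delivery already exists, the following code fits the regression, writes delivery2 with the added variables ptime and res, and then prints the result for checking:

```sas
proc reg data=delivery;
   model time=cases distance;
   /* p= names the predicted values, r= names the raw residuals */
   output out=delivery2 p=ptime r=res;
run;

/* delivery2 holds time, cases, and distance plus ptime and res */
proc print data=delivery2;
run;
```

Naming delivery2 explicitly in the PROC PRINT statement avoids any reliance on the most-recently-created default.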

D.5 LOG FILE

Every SAS session generates a “log” file that provides a brief summary. New SAS users find out very quickly (and very painfully) that SAS source code is a computer program that must be compiled. As such, the code must follow certain syntax rules. It is important to note that SAS can produce an incorrect, even nonsensical, analysis even if SAS does not reject the syntax. The log file is almost essential for debugging SAS code.

The log file provides a brief summary of the SAS session. It tells the analyst how many observations are in the data set, how many observations have missing data (in this case, there are no missing data), the commands executed, and any errors. Below is a simple example from a correct analysis:

NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
       NC, USA.
NOTE:  SAS (r) 9.1 (TS1M2)
       Licensed to VA POLYTECHNIC INST & STATE UNIV-
       CAMPUSWIDE-IN, Site 0001798011.
NOTE:  This session is executing on the WIN_PRO platform.

NOTE:  SAS initialization used:
       real time      19.30 seconds
       cpu time        1.56 seconds

1      data delivery;
2      input time cases distance;
3      cards;

NOTE:  The data set WORK.DELIVERY has 25 observations and 3
       variables.
NOTE:  DATA statement used (Total process time):
       real time      1.22 seconds
       cpu time       0.23 seconds

29     proc print data=delivery;
30     run;
NOTE:  There were 25 observations read from the data set
       WORK.DELIVERY.
NOTE:  PROCEDURE PRINT used (Total process time):
       real time      0.55 seconds
       cpu time       0.17 seconds

Below is an example where we give the command

print data=delivery;

instead of the proper syntax

proc print data=delivery;

NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
       NC, USA.
NOTE:  SAS (r) 9.1 (TS1M2)
       Licensed to VA POLYTECHNIC INST & STATE UNIV-
       CAMPUSWIDE-IN, Site 0001798011.
NOTE:  This session is executing on the WIN_PRO
       platform.
NOTE:  SAS initialization used:
       real time      5.03 seconds
       cpu time       1.73 seconds
1      libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
NOTE:  Libref BOOK was successfully assigned as follows:
       Engine:        V9
       Physical Name: c:\My Stuff\Disk-Books\Regression 5th Ed
2      print data=book.delivery;
       --
       180
ERROR 180-322: Statement is not valid or it is used out of
       proper order.
3      run;

One of the most frustrating errors in SAS occurs when we forget a semicolon. SAS rarely, if ever, flags a missing semicolon directly as an error! It flags a syntax problem later in the source code that is the consequence of the missing semicolon.
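
The following sketch illustrates the point using the delivery data set. The faulty version is kept inside a comment so that it is not submitted by accident; the exact wording of the error message is not reproduced here because it may differ across SAS versions.

```sas
/* Faulty version (semicolon missing after the first line):

      proc print data=delivery
      var time;
      run;

   SAS reads "var" and "time" as if they were options of the
   PROC PRINT statement, so the error is reported at "var" on the
   second line, not at the end of the first line where the
   semicolon is actually missing. */

/* Corrected version */
proc print data=delivery;
   var time;
run;
```

When the log flags a statement that looks correct, the line just before it is a good first place to look.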

Finally, for large data sets it is not practical to print the entire data set. Many people use SAS to create massive data sets through “merges,” among other techniques. In these circumstances, the log file gives the first indication of problems, usually through the number of observations in the data set. As such, the log file is essential to good SAS programming.

D.6 ADDING VARIABLES TO AN EXISTING SAS DATA SET

We can add variables to a previously created SAS data set. For example, suppose that we would like to use cases2 = cases² as a regressor for the delivery data, and suppose that the delivery data are in a SAS data set named delivery. We shall call the new data set delivery2. The appropriate SAS commands are

data delivery2;
   set delivery;
   cases2=cases*cases;
run;
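
Once delivery2 exists, the new variable can be used like any other. As a sketch, assuming the data step above has been run, the following code adds cases2 to the regression model used earlier in this appendix; whether the quadratic term belongs in the model is a separate modeling question.

```sas
/* Regress time on cases, its square, and distance. */
proc reg data=delivery2;
   model time=cases cases2 distance;
run;
```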

Suppose we wish to create a new permanent SAS data set where we add cases2 to the permanent SAS data set book.delivery. Suppose further that our code already includes the appropriate libname statement. The appropriate SAS commands are

data book.delivery2;
   set book.delivery;
   cases2=cases*cases;
run;