Chapter 2: Foundations for Analyzing Data and Reading Data from Other Sources

2.1 Learning Objectives

2.2 Case Study Activity

2.3 Getting Started with Data Exploration in SAS

2.3.1 Assigning Labels and Using SAS Formats

2.3.2 PROC SORT and BY-Group Processing

2.4 Using the MEANS Procedure for Quantitative Summaries

2.4.1 Choosing Analysis Variables and Statistics in PROC MEANS

2.4.2 Using the CLASS Statement in PROC MEANS

2.5 User-Defined Formats

2.5.1 The FORMAT Procedure

2.5.2 Permanent Storage and Inspection of Defined Formats

2.6 Subsetting with the WHERE Statement

2.7 Using the FREQ Procedure for Categorical Summaries

2.7.1 Choosing Analysis Variables in PROC FREQ

2.7.2 Multi-Way Tables in PROC FREQ

2.8 Reading Raw Data

2.8.1 Introduction to Reading Delimited Files

2.8.2 More with List Input

2.8.3 Introduction to Reading Fixed-Position Data

2.9 Details of the DATA Step Process

2.9.1 Introduction to the Compilation and Execution Phases

2.9.2 Building Blocks of a Data Set: Input Buffers and Program Data Vectors

2.9.3 Debugging the DATA Step

2.10 Validation

2.11 Wrap-Up Activity

2.12 Chapter Notes

2.13 Exercises

2.1 Learning Objectives

At the conclusion of this chapter, mastery of the concepts covered in the narrative includes the ability to:

  • Apply the MEANS procedure to produce a variety of quantitative summaries, potentially grouped across several categories
  • Apply the FREQ procedure to produce frequency and relative frequency tables, including cross-tabulations
  • Categorize data for analyses in either the MEANS or FREQ procedures using internal SAS formats or user-defined formats
  • Formulate a strategy for selecting only the necessary rows when processing a SAS data set
  • Apply the DATA step to read data from delimited or fixed-position raw text files
  • Describe the operations carried out during the compilation and execution phases of the DATA step
  • Compare and contrast the input buffer and program data vector
  • Apply DATA step statements to assist in debugging
  • Apply the COMPARE procedure to compare and validate a data set against a standard

Use the concepts of this chapter to solve the problems in the wrap-up activity. Additional exercises and case studies are also available to test these concepts.

2.2 Case Study Activity

This section introduces a case study that is used as a basis for most of the concepts and associated activities in this book. The data comes from the Current Population Survey by the Integrated Public Use Microdata Series (IPUMS CPS). IPUMS CPS contains a wide variety of information; only a subset of the data collected from 2001-2015 is included in the examples here. Further, the data used is introduced in various segments, starting with simple sets of variables and eventually adding more information that must be assembled to achieve the objectives of each section.

This chapter works with data that includes household-level information from the 2005 and 2010 IPUMS CPS data sets of over one million observations each. Included are variables on state, county, metropolitan area/city, household income, home value, mortgage status, ownership status, and mortgage payment. Outputs 2.2.1 through 2.2.4 show tabular summaries from the 2010 data, including quantitative statistics, frequencies, and/or percentages. Reproducing these tables in the wrap-up activity in Section 2.11 is the primary objective for this chapter.

The first sample output, shown in Output 2.2.1, gives a set of six statistics on mortgage payments across metropolitan status for mortgages of $100 per month or more. In order to make this table, and the slightly more complicated Output 2.2.2, several components of the MEANS procedure must be understood.

Output 2.2.1: Basic Statistics on Mortgage Payments Grouped on Metropolitan Status

Analysis Variable : MortgagePayment Mortgage Payment

Metro                             N     Mean   Median   Std Dev   Minimum   Maximum
Not Identifiable              42927    970.2    800.0     668.5     100.0    7400.0
Not in Metro Area             97603    815.0    670.0     576.0     100.0    6800.0
Metro, Inside City            56039   1363.5   1100.0     974.8     100.0    7400.0
Metro, Outside City          185967   1480.8   1300.0     974.7     100.0    7400.0
Metro, City Status Unknown   163204   1233.2   1000.0     846.4     100.0    7400.0

Output 2.2.2: Minimum, Median, and Maximum on Mortgage Payments Across Multiple Categories

Metro                        Household Income   Variable          Label              Minimum    Median    Maximum
Metro, Inside City           Negative           MortgagePayment   Mortgage Payment       440      1200       4500
                                                HomeValue         Home Value           70000    250000     675000
                             $0 to $45K         MortgagePayment   Mortgage Payment       100       740       6800
                                                HomeValue         Home Value               0    130000    5303000
                             $45K to $90K       MortgagePayment   Mortgage Payment       100      1000       7400
                                                HomeValue         Home Value               0    180000    4915000
                             Above $90K         MortgagePayment   Mortgage Payment       100      1600       7400
                                                HomeValue         Home Value               0    340000    5303000
Metro, Outside City          Negative           MortgagePayment   Mortgage Payment       100      1450       5400
                                                HomeValue         Home Value           10000    250000    4152000
                             $0 to $45K         MortgagePayment   Mortgage Payment       100       850       7400
                                                HomeValue         Home Value               0    150000    4304000
                             $45K to $90K       MortgagePayment   Mortgage Payment       100      1100       6800
                                                HomeValue         Home Value               0    199000    4915000
                             Above $90K         MortgagePayment   Mortgage Payment       100      1600       7400
                                                HomeValue         Home Value               0    330000    4915000
Metro, City Status Unknown   Negative           MortgagePayment   Mortgage Payment       180      1200       5300
                                                HomeValue         Home Value           17000    245000    2948000
                             $0 to $45K         MortgagePayment   Mortgage Payment       100       720       7400
                                                HomeValue         Home Value               0    125000    4915000
                             $45K to $90K       MortgagePayment   Mortgage Payment       100       960       7400
                                                HomeValue         Home Value               0    160000    4915000
                             Above $90K         MortgagePayment   Mortgage Payment       100      1400       7400
                                                HomeValue         Home Value               0    270000    4915000

In Outputs 2.2.3 and 2.2.4, frequencies and percentages are summarized across combinations of various categories, which requires mastery of the fundamentals of the FREQ procedure.

Output 2.2.3: Income Status Versus Mortgage Payment

Table of HHIncome by MortgagePayment

HHIncome(Household Income)   MortgagePayment(Mortgage Payment)

Frequency
Row Pct        $350 and Below   $351 to $1000   $1001 to $1600   Over $1600    Total
Negative                   30              97               92           83      302
                         9.93           32.12            30.46        27.48
$0 to $45K              22929           83125            22617        11436   140107
                        16.37           59.33            16.14         8.16
$45K to $90K            13877          103660            54778        27052   199367
                         6.96           51.99            27.48        13.57
Above $90K               5944           52679            62474        84867   205964
                         2.89           25.58            30.33        41.20
Total                   42780          239561           139961       123438   545740

Output 2.2.4: Income Status Versus Mortgage Payment for Metropolitan Households (Table 1 of 3)

Table 1 of HHIncome by MortgagePayment

Controlling for Metro=Metro, Inside City

HHIncome(Household Income)   MortgagePayment(Mortgage Payment)

Frequency
Row Pct        $350 and Below   $351 to $1000   $1001 to $1600   Over $1600    Total
Negative                    0               7                9            7       23
                         0.00           30.43            39.13        30.43
$0 to $45K               1596            8949             2597         1700    14842
                        10.75           60.30            17.50        11.45
$45K to $90K              910            9215             5571         3450    19146
                         4.75           48.13            29.10        18.02
Above $90K                504            4947             6321        10256    22028
                         2.29           22.46            28.70        46.56
Total                    3010           23118            14498        15413    56039

2.3 Getting Started with Data Exploration in SAS

This section reviews and extends some fundamental SAS concepts demonstrated in code supplied for Chapter 1, with these examples built upon a simplified version of the case study data. First, Program 2.3.1 uses the CONTENTS and PRINT procedures to make an initial exploration of the Ipums2005Mini data set. To begin, make sure the BookData library is assigned as done in Chapter 1.

Program 2.3.1: Using the CONTENTS and PRINT Procedures to View Data and Attributes

proc contents data=bookdata.ipums2005mini;
  ods select variables;
run;

proc print data=bookdata.ipums2005mini(obs=5);
  var state MortgageStatus MortgagePayment HomeValue Metro;
run;

The BookData.Ipums2005Mini data set is a modification of a data set used later in this chapter, BookData.Ipums2005Basic. It subsets the original data set down to a few records and is used for illustration of these initial concepts.

The ODS SELECT statement limits the output of a given procedure to the chosen tables, with the Variables table from PROC CONTENTS containing the names and attributes of the variables in the chosen data set. Look back to Program 1.4.4, paying attention to the ODS TRACE statement and its results, to review how this choice is made.

The OBS= data set option limits the number of observations processed by the procedure. It is in place here simply to limit the size of the table shown in Output 2.3.1B. At various times in this text, the output shown may be limited in scope; however, the code given may not include this option for all such cases.

The VAR statement is used in the PRINT procedure to select the variables to be shown and the column order in which they appear.

Output 2.3.1A: Using the CONTENTS Procedure to View Attributes

Alphabetic List of Variables and Attributes

 #   Variable          Type   Len   Format
 4   CITYPOP           Num      8
 2   COUNTYFIPS        Num      8
10   City              Char    43
 6   HHINCOME          Num      8
 7   HomeValue         Num      8
 3   METRO             Num      8   BEST12.
 5   MortgagePayment   Num      8
 9   MortgageStatus    Char    45
11   Ownership         Char     6
 1   SERIAL            Num      8
 8   state             Char    57

Output 2.3.1B: Using the PRINT Procedure to View Data

Obs   state            MortgageStatus                                  MortgagePayment   HomeValue   METRO
  1   South Carolina   Yes, mortgaged/ deed of trust or similar debt               200       32500       4
  2   North Carolina   No, owned free and clear                                      0        5000       1
  3   South Carolina   Yes, mortgaged/ deed of trust or similar debt               360       75000       4
  4   South Carolina   Yes, contract to purchase                                   430       22500       3
  5   North Carolina   Yes, mortgaged/ deed of trust or similar debt               450       65000       4

2.3.1 Assigning Labels and Using SAS Formats

As seen in Chapter 1, SAS variable names have a certain set of restrictions they must meet, including no special characters other than an underscore. This potentially limits the quality of the display for items such as the headers in PROC PRINT. SAS does permit the assignment of labels to variables, substituting more descriptive text into the output in place of the variable name, as demonstrated in Program 2.3.2.

Program 2.3.2: Assigning Labels

proc print data=bookdata.ipums2005mini(obs=5) noobs label;
  var state MortgageStatus MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home ($)' state='State';
run;

By default, the output from PROC PRINT includes an Obs column, which is simply the row number for the record—the NOOBS option in the PROC PRINT statement suppresses this column.

Most SAS procedures use labels when they are provided or assigned; however, PROC PRINT defaults to using variable names. To use labels, the LABEL option is provided in the PROC PRINT statement. See Chapter Note 1 in Section 2.12 for more details.

The LABEL statement assigns labels to selected variables. The general syntax is: LABEL variable1='label1' variable2='label2' …; where the labels are given as literal values in either single or double quotation marks, as long as the opening and closing quotation marks match.

Output 2.3.2: Assigning Labels

State            MortgageStatus                                  MortgagePayment   Value of Home ($)   METRO
South Carolina   Yes, mortgaged/ deed of trust or similar debt               200               32500       4
North Carolina   No, owned free and clear                                      0                5000       1
South Carolina   Yes, mortgaged/ deed of trust or similar debt               360               75000       4
South Carolina   Yes, contract to purchase                                   430               22500       3
North Carolina   Yes, mortgaged/ deed of trust or similar debt               450               65000       4

In addition to using labels to alter the display of variable names, altering the display of data values is possible with formats. The general form of a format reference is:

<$>format<w>.<d>

The <> symbols denote a portion of the syntax that is sometimes used/required—the <> characters are not part of the syntax. The dollar sign is required for any format that applies to a character variable (character formats) and is not permitted in formats used for numeric variables (numeric formats). The w value is the total number of characters (width) available for the formatted value, while d controls the number of digits displayed after the decimal for numeric formats. The dot is required in all format assignments, and in many cases is the means by which the SAS compiler can distinguish between a variable name and a format name. The value of format is called the format name; however, standard numeric and character formats have a null name; for example, the 5.2 format assigns the standard numeric format with a total width of 5 and up to 2 digits displayed past the decimal. Program 2.3.3 uses the FORMAT statement to apply formats to the HomeValue, MortgagePayment, and MortgageStatus variables.
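The effect of the w and d values can be seen directly by writing one value to the log under several format specifications. This is a minimal sketch, not taken from the text; the value 1234.567 is chosen only for illustration.

```sas
/* Sketch: one value written to the log under different formats
   to illustrate the roles of w (width) and d (decimals). */
data _null_;
  x = 1234.567;
  put x= 5.2;       /* standard numeric format, width 5, up to 2 decimals */
  put x= dollar9.;  /* dollar format; $ and comma count toward the width */
  put x= comma10.2; /* comma format, width 10, 2 digits past the decimal */
run;
```

Because the formatted value must fit in w characters, a narrow width may force SAS to drop decimals or other characters, a point revisited with the SUM statement later in this chapter.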

Program 2.3.3: Assigning Formats

proc print data=bookdata.ipums2005mini(obs=5) noobs label;
  var state MortgageStatus MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

In the FORMAT statement, a list of one or more variables is followed by a format specification. Both HomeValue and MortgagePayment are assigned a dollar format with a total width of nine—any commas and dollar signs inserted by this format count toward the total width.

The MortgageStatus variable is character and can only be assigned a character format. The $1. format is the standard character format with width one, which truncates the display of MortgageStatus to one letter, but does not alter the actual value. In general, formats assigned in procedures are temporary and only apply to the output for the procedure.

Output 2.3.3: Assigning Formats

State            MortgageStatus   MortgagePayment   Value of Home   METRO
South Carolina   Y                           $200         $32,500       4
North Carolina   N                             $0          $5,000       1
South Carolina   Y                           $360         $75,000       4
South Carolina   Y                           $430         $22,500       3
North Carolina   Y                           $450         $65,000       4

2.3.2 PROC SORT and BY-Group Processing

Rows in a data set can be reordered using the SORT procedure to sort the data on the values of one or more variables in ascending or descending order. Program 2.3.4 sorts the BookData.Ipums2005Mini data set by the HomeValue variable.

Program 2.3.4: Sorting Data with the SORT Procedure

proc sort data=bookdata.ipums2005mini out=work.sorted;
  by HomeValue;
run;

proc print data=work.sorted(obs=5) noobs label;
  var state MortgageStatus MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

The default behavior of the SORT procedure is to replace the input data set, specified in the DATA= option, with the sorted data set. To create a new data set from the sorted observations, use the OUT= option.

The BY statement is required in PROC SORT and must name at least one variable. As shown in Output 2.3.4, the rows are now ordered in increasing levels of HomeValue.

Output 2.3.4: Sorting Data with the SORT Procedure

State            MortgageStatus   MortgagePayment   Value of Home   METRO
North Carolina   N                             $0          $5,000       1
South Carolina   Y                           $430         $22,500       3
North Carolina   Y                           $300         $22,500       3
South Carolina   Y                           $200         $32,500       4
North Carolina   N                             $0         $45,000       1

Sorting on more than one variable gives a nested or hierarchical sorting. In those cases, values are ordered on the first variable, then for groups of records having the same value of the first variable those records are sorted on the second variable, and so forth. A specification of ascending (the default) or descending order is made for each variable. Program 2.3.5 sorts the BookData.Ipums2005Mini data set on three variables present in the data set.

Program 2.3.5: Sorting on Multiple Variables

proc sort data=bookdata.ipums2005mini out=work.sorted;
  by MortgagePayment descending State descending HomeValue;
run;

proc print data=work.sorted(obs=6) noobs label;
  var state MortgageStatus MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $1.;
run;

The first sort is on MortgagePayment, in ascending order. Since 0 is the lowest value and that value occurs on six records in the data set, Output 2.3.5 shows one block of records with MortgagePayment 0.

The next sort is on State in descending order—note that the DESCENDING option precedes the variable it applies to. For the six records shown in Output 2.3.5, the first three are South Carolina and the final three are North Carolina—descending alphabetical order. Note, when sorting character data, casing matters—uppercase values are before lowercase in such a sort. For more details about determining the sort order of character data, see Chapter Note 2 in Section 2.12.

The final sort is on HomeValue, also in descending order—note that the DESCENDING option must precede each variable it applies to. So, within each State group in Output 2.3.5, values of the HomeValue variable are in descending order.

Output 2.3.5: Sorting on Multiple Variables

State            MortgageStatus   MortgagePayment   Value of Home   METRO
South Carolina   N                             $0        $137,500       3
South Carolina   N                             $0         $95,000       4
South Carolina   N                             $0         $45,000       3
North Carolina   N                             $0        $162,500       0
North Carolina   N                             $0         $45,000       1
North Carolina   N                             $0          $5,000       1

Most SAS procedures, including PROC PRINT, can take advantage of BY-group processing for data that is sorted into groups. The procedure must use a BY statement that corresponds to the sorting in the data set. If the data is sorted using PROC SORT, the BY statement in a subsequent procedure does not have to completely match the BY statement in PROC SORT; however, it must match the first level of sorting if only one variable is included, the first two levels if two variables are included, and so forth. It must also match ordering, ascending or descending, on each included variable. Program 2.3.6 groups output from the PRINT procedure based on BY grouping constructed with PROC SORT.

Program 2.3.6: BY-Group Processing in PROC PRINT

proc sort data=bookdata.ipums2005mini out=work.sorted;
  by MortgageStatus State descending HomeValue;
run;

proc print data=work.sorted noobs label;
  by MortgageStatus State;
  var MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

The original data is sorted first on MortgageStatus, then on State, and finally in descending order of HomeValue for each combination of MortgageStatus and State.

PROC PRINT uses a BY statement matching on the MortgageStatus and State variables, which groups the output into sections based on each unique combination of values for these two variables, with the final sorting on HomeValue appearing in each table. Note that a BY statement with only MortgageStatus can be used as well, but a BY statement with only State cannot—the data is not sorted on State primarily.

Output 2.3.6: BY-Group Processing in PROC PRINT (First 2 of 6 Groups Shown)

MortgageStatus=No, owned State=North Carolina

MortgagePayment   Value of Home   METRO
             $0        $162,500       0
             $0         $45,000       1
             $0          $5,000       1

MortgageStatus=No, owned State=South Carolina

MortgagePayment   Value of Home   METRO
             $0        $137,500       3
             $0         $95,000       4
             $0         $45,000       3

The structure of BY groups in PROC PRINT can be altered slightly through use of an ID statement, as shown in Program 2.3.7. Assuming the variables listed in the ID statement match those in the BY statement, BY-group variables are placed as the left-most columns of each table, rather than between tables.

Program 2.3.7: Using BY and ID Statements Together in PROC PRINT

proc print data=work.sorted noobs label;
  by MortgageStatus State;
  id MortgageStatus State;
  var MortgagePayment HomeValue Metro;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

Output 2.3.7: Using BY and ID Statements Together in PROC PRINT (First 2 of 6 Groups Shown)

MortgageStatus   State            MortgagePayment   Value of Home   METRO
No, owned        North Carolina                $0        $162,500       0
                                               $0         $45,000       1
                                               $0          $5,000       1

MortgageStatus   State            MortgagePayment   Value of Home   METRO
No, owned        South Carolina                $0        $137,500       3
                                               $0         $95,000       4
                                               $0         $45,000       3

PROC PRINT is limited in its ability to do computations (later in this text, the REPORT procedure is used to create various summary tables); however, it can produce sums of numeric variables with the SUM statement, as shown in Program 2.3.8.

Program 2.3.8: Using the SUM Statement in PROC PRINT

proc print data=work.sorted noobs label;
  by MortgageStatus State;
  id MortgageStatus State;
  var MortgagePayment HomeValue Metro;
  sum MortgagePayment HomeValue;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar9. MortgageStatus $9.;
run;

Output 2.3.8: Using the SUM Statement in PROC PRINT (Last of 6 Groups Shown)

MortgageStatus   State            MortgagePayment   Value of Home   METRO
Yes, mort        South Carolina              $360         $75,000       4
                                             $500         $65,000       3
                                             $200         $32,500       4
Yes, mort        South Carolina            $1,060        $172,500
Yes, mort                                  $2,200        $315,000
                                           $4,230        $1200000

Sums are produced at the end of each BY group (the SUMBY statement is available to modify this behavior) and at the end of the full table. Note that the format applied to the HomeValue column is not sufficient to display the grand total with the dollar sign and comma. If a format is of insufficient width, SAS removes what it determines to be the least important characters; it is therefore good programming practice to determine the minimum format width needed for all values a format is applied to. If the format does not include sufficient width to display the value with full precision, SAS may substitute a different format entirely. See Chapter Note 3 in Section 2.12 for further discussion of format widths.
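The SUMBY behavior mentioned above can be sketched as a variation of Program 2.3.8. This is a hedged illustration, assuming the same Work.Sorted data set: SUMBY names a BY variable, and subtotals are then produced only when that variable changes, rather than for every combination of BY values.

```sas
/* Sketch (assumes Work.Sorted from Program 2.3.6 exists):
   SUMBY MortgageStatus limits subtotals to changes in MortgageStatus,
   suppressing the per-State subtotals within each status. */
proc print data=work.sorted noobs label;
  by MortgageStatus State;
  id MortgageStatus State;
  var MortgagePayment HomeValue;
  sum MortgagePayment HomeValue;
  sumby MortgageStatus;
  label HomeValue='Value of Home' state='State';
  format HomeValue MortgagePayment dollar10.;
run;
```

A variable named in the SUMBY statement must also appear in the BY statement; the grand total at the end of the table is still produced.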

2.4 Using the MEANS Procedure for Quantitative Summaries

Producing tables of statistics like those shown for the case study in Outputs 2.2.1 and 2.2.2 uses the MEANS procedure. This section covers the fundamentals of PROC MEANS, including how to select variables for analysis, how to choose statistics, and how to separate analyses across categories.

2.4.1 Choosing Analysis Variables and Statistics in PROC MEANS

To begin, make sure the BookData library is assigned as done in Chapter 1, submit PROC CONTENTS on the IPUMS2005Basic SAS data set from the BookData library, and review the output. Also, to ensure familiarity with the data, open the data set for viewing or run the PRINT procedure to direct it to an output table. Once these steps are complete, enter and submit the code given in Program 2.4.1.

Program 2.4.1: Default Statistics and Behavior for PROC MEANS

options nolabel;

proc means data=BookData.IPUMS2005Basic;
run;

For variables that have labels, PROC MEANS includes them as a column in the output table; using NOLABEL in the OPTIONS statement suppresses their use. Here DATA= is technically an option; however, the default data set in any SAS session is the last data set created. If no data sets have been created during the session, which is the most likely scenario currently, PROC MEANS does not have a data set to process unless this option is provided. Beyond having a data set to work with, no other options or statements are required for PROC MEANS to compile and execute successfully. In this case, the default behavior, as shown in Output 2.4.1, is to summarize all numeric variables on a set of five statistics: number of nonmissing observations, mean, standard deviation, minimum, and maximum.

Output 2.4.1: Default Statistics and Behavior for PROC MEANS

Variable                N          Mean          Std Dev       Minimum       Maximum
SERIAL            1159062     621592.24        359865.41     2.0000000    1245246.00
COUNTYFIPS        1159062    42.2062901       78.9543285             0   810.0000000
METRO             1159062     2.5245354        1.3085302             0     4.0000000
CITYPOP           1159062       2916.66         12316.27             0      79561.00
MortgagePayment   1159062   500.2042634      737.9885592             0       7900.00
HHIncome          1159062      63679.84         66295.97     -29997.00    1739770.00
HomeValue         1159062    2793526.49       4294777.18       5000.00    9999999.00

SAS differentiates variable types as numeric and character only; therefore, variables stored as numeric that are not quantitative are summarized even if those summaries do not make sense. Here, the Serial, CountyFIPS, and Metro variables are stored as numbers, but means and standard deviations are of no utility on these since they are nominal. It is, of course, important to understand the true role and level of measurement (for instance, nominal versus ratio) for the variables in the data set being analyzed.

To select the variables for analysis, the MEANS procedure includes the VAR statement. Any variables listed in the VAR statement must be numeric, but should also be appropriate for quantitative summary statistics. As in the previous example, the summary for each variable is listed in its own row in the output table. (If only one variable is provided, it is named in the header above the table instead of in the first column.) Program 2.4.2 modifies Program 2.4.1 to summarize only the truly quantitative variables from BookData.IPUMS2005Basic, with the results shown in Output 2.4.2.

Program 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS

proc means data=BookData.IPUMS2005Basic;
  var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.2: Selecting Analysis Variables Using the VAR Statement in MEANS

Variable                N          Mean        Std Dev     Minimum      Maximum
CITYPOP           1159062       2916.66       12316.27           0     79561.00
MortgagePayment   1159062   500.2042634    737.9885592           0      7900.00
HHIncome          1159062      63679.84       66295.97   -29997.00   1739770.00
HomeValue         1159062    2793526.49     4294777.18     5000.00   9999999.00

The default summary statistics for PROC MEANS can be modified by including statistic keywords as options in the PROC MEANS statement. Several statistics are available, with the available set listed in the SAS Documentation, and any subset of those may be used. The listed order of the keywords corresponds to the order of the statistic columns in the table, and those replace the default statistic set. One common set of statistics is the five-number summary (minimum, first quartile, median, third quartile, and maximum), and Program 2.4.3 provides a way to generate these statistics for the four variables summarized in the previous example.

Program 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

proc means data=BookData.IPUMS2005Basic min q1 median q3 max;
  var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.3: Setting the Statistics to the Five-Number Summary in MEANS

Variable            Minimum   Lower Quartile      Median   Upper Quartile      Maximum
CITYPOP                   0                0           0                0     79561.00
MortgagePayment           0                0           0      830.0000000      7900.00
HHIncome          -29997.00         24000.00    47200.00         80900.00   1739770.00
HomeValue           5000.00        112500.00   225000.00       9999999.00   9999999.00

Confidence limits for the mean are included in the keyword set, both as a pair with the CLM keyword and separately with LCLM and UCLM. The default confidence level is 95%, but it can be changed by setting the error rate with the ALPHA= option. Consider Program 2.4.4, which constructs 99% confidence intervals for the means, with the estimated mean displayed between the lower and upper limits.
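For comparison, the CLM keyword alone requests both limits as a pair at the default 95% level. This is a minimal sketch using the same data set; only the HHIncome variable is summarized to keep the output small.

```sas
/* Sketch: CLM produces lower and upper confidence limits together;
   with no ALPHA= option, the default error rate of 0.05 (95% level) applies. */
proc means data=BookData.IPUMS2005Basic n mean clm;
  var HHIncome;
run;
```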

Program 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01;
  var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.4: Using the ALPHA= Option to Modify Confidence Levels

Variable          Lower 99% CL for Mean          Mean   Upper 99% CL for Mean
CITYPOP                         2887.19       2916.66                 2946.12
MortgagePayment             498.4385749   500.2042634             501.9699520
HHIncome                       63521.22      63679.84                63838.46
HomeValue                    2783250.94    2793526.49              2803802.04

There are also options for controlling the column display; rounding can be controlled by the MAXDEC= option (maximum number of decimal places). Program 2.4.5 modifies the previous example to report the statistics to a single decimal place.

Program 2.4.5: Using MAXDEC= to Control Precision of Results

proc means data=BookData.IPUMS2005Basic lclm mean uclm alpha=0.01 maxdec=1;
  var Citypop MortgagePayment HHIncome HomeValue;
run;

Output 2.4.5: Using MAXDEC= to Control Precision of Results

Variable          Lower 99% CL for Mean        Mean   Upper 99% CL for Mean
CITYPOP                          2887.2      2916.7                  2946.1
MortgagePayment                   498.4       500.2                   502.0
HHIncome                        63521.2     63679.8                 63838.5
HomeValue                     2783250.9   2793526.5               2803802.0

MAXDEC= is limited in that it sets the precision for all columns. Also, no direct formatting of the statistics is available. The REPORT procedure, introduced in Chapter 4 and discussed in detail in Chapters 6 and 7, provides much more control over the displayed table at the cost of increased complexity of the syntax.

2.4.2 Using the CLASS Statement in PROC MEANS

In several instances, it is desirable to split an analysis across a set of categories and, if those categories are defined by a variable in the data set, PROC MEANS can separate those analyses using a CLASS statement. The CLASS statement accepts either numeric or character variables; however, the role assigned to class variables by SAS is special. Any variable included in the CLASS statement (regardless of type) is taken as categorical, which results in each distinct value of the variable corresponding to a unique category. Therefore, variables used in the CLASS statement should provide useful groupings or, as shown in Section 2.5, be formatted into a set of desired groups. Two examples follow, the first (Program 2.4.6) providing an illustration of a reasonable class variable, the second (Program 2.4.7) showing a poor choice.

Program 2.4.6: Setting a Class Variable in PROC MEANS

proc means data=BookData.IPUMS2005Basic;
  class MortgageStatus;
  var HHIncome;
run;

Output 2.4.6: Setting a Class Variable in PROC MEANS

Analysis Variable : HHIncome

MortgageStatus                                   N Obs        N       Mean    Std Dev     Minimum      Maximum
N/A                                             303342   303342   37180.59   39475.13   -19998.00   1070000.00
No, owned free and clear                        300349   300349   53569.08   63690.40   -22298.00   1739770.00
Yes, contract to purchase                         9756     9756   51068.50   46069.11    -7599.00    834000.00
Yes, mortgaged/ deed of trust or similar debt   545615   545615   84203.70   72997.92   -29997.00   1407000.00

In this data, MortgageStatus provides a clear set of distinct categories and is potentially useful for subsetting the summarization of the data. In Program 2.4.7, Serial is used as an extreme example of a poor choice since Serial is unique to each household.

Program 2.4.7: A Poor Choice for a Class Variable

proc means data=BookData.IPUMS2005Basic;
  class Serial;
  var HHIncome;
run;

Output 2.4.7: A Poor Choice for a Class Variable (Partial Table Shown)

Analysis Variable : HHIncome

SERIAL   N Obs   N        Mean   Std Dev     Minimum     Maximum
2            1   1    12000.00         .    12000.00    12000.00
3            1   1    17800.00         .    17800.00    17800.00
4            1   1   185000.00         .   185000.00   185000.00
5            1   1     2000.00         .     2000.00     2000.00

Choosing Serial as a class variable results in each class being a single observation, making the mean, minimum, and maximum the same value and creating a situation where the standard deviation is undefined. Again, this would be an extreme case; however, class variables are best when structured to produce relatively few classes that represent a useful stratification of the data.

Of course, more than one variable can be used in a CLASS statement; the categories are then defined as all combinations of the categories from the individual variables. The order of the variables listed in the CLASS statement only alters the nesting order of the levels; therefore, the same information is produced in a different row order in the table. Consider the two MEANS procedures in Program 2.4.8.

Program 2.4.8: Using Multiple Class Variables and Effects of Order

proc means data=BookData.IPUMS2005Basic nonobs n mean std;
  class MortgageStatus Metro;
  var HHIncome;
run;

proc means data=BookData.IPUMS2005Basic nonobs n mean std;
  class Metro MortgageStatus;
  var HHIncome;
run;

Output 2.4.8A: Using Multiple Class Variables (Partial Listing)

Analysis Variable : HHIncome

MortgageStatus             METRO       N       Mean    Std Dev
N/A                            0   19009   31672.81   32122.89
                               1   48618   29122.73   29160.23
                               2   69201   38749.69   46226.50
                               3   73234   43325.25   42072.78
                               4   93280   36514.56   36974.63
No, owned free and clear       0   30370   46533.14   50232.50
                               1   85696   42541.06   44664.64
                               2   27286   60011.10   76580.75
                               3   76727   63925.99   75404.62
                               4   80270   55915.02   66293.39

Output 2.4.8B: Effects of Order (Partial Listing)

Analysis Variable : HHIncome

METRO   MortgageStatus                                      N       Mean    Std Dev
0       N/A                                             19009   31672.81   32122.89
        No, owned free and clear                        30370   46533.14   50232.50
        Yes, contract to purchase                        1030   46069.26   36225.80
        Yes, mortgaged/ deed of trust or similar debt   41619   71611.01   55966.31
1       N/A                                             48618   29122.73   29160.23
        No, owned free and clear                        85696   42541.06   44664.64
        Yes, contract to purchase                        3034   42394.12   35590.14
        Yes, mortgaged/ deed of trust or similar debt   93427   62656.54   48808.66

The same statistics are present in both tables, but the primary ordering is on MortgageStatus in Output 2.4.8A as opposed to metropolitan status (Metro) in Output 2.4.8B. Two additional items are of note in this example. First, note the use of the NONOBS option in each. By default, using a CLASS statement produces a column (N Obs) for the number of observations in each class level; NONOBS suppresses that column. N Obs may differ from the statistic N due to missing data, though that is not an issue in this example. Second, the numeric values of Metro have no clear meaning on their own. Titles and footnotes, as shown in Chapter 1, are available to add information about the meaning of these numeric values. However, a better solution is to build a format and apply it to that variable, a concept covered in the next section.

2.5 User-Defined Formats

As seen in Section 2.3, SAS provides a variety of formats for altering the display of data values. It is also possible to define formats using the FORMAT procedure. These formats are used to assign replacements for individual data values or for groups or ranges of data, and they may be permanently stored in a library for subsequent use. Formats, both native SAS formats and user-defined formats, are an invaluable tool that are used in a variety of contexts throughout this book.

2.5.1 The FORMAT Procedure

The FORMAT procedure provides the ability to create custom formats, both for character and numeric variables. The principal tool used in writing formats is the VALUE statement, which defines the name of the format and its rules for converting data values to formatted values. Program 2.5.1 gives an example of a format written to improve the display of the Metro variable from the BookData.IPUMS2005Basic data set.

Program 2.5.1: Defining a Format for the Metro Variable

proc format;

  value Metro

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    2 = "Metro, Inside City"

    3 = "Metro, Outside City"

    4 = "Metro, City Status Unknown"

  ;

run;

The VALUE statement tends to be rather long given the number of items it defines. Remember, SAS code is generally free-form outside of required spaces and delimiters, along with the semicolon that ends every statement. Adopt a sound strategy for using indentation and line breaks to make code readable.

The VALUE statement requires the format name, which follows the SAS naming conventions of up to 32 characters, but with some special restrictions. Format names must meet an additional restriction of being distinct from the names of any formats supplied by SAS. Also, given that numbers are used to define format widths, a number at the end of a format name would create an ambiguity in setting lengths; therefore, format names cannot end with a number. If the format is for character values, the name must begin with $, and that character counts toward the 32-character limit.
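A few hypothetical format names illustrate these rules; the names below are invented for this sketch and are not part of the case study.

```sas
proc format;
  value AgeGrp                 /* legal: does not end in a number        */
    low-<18 = 'Minor'
    18-high = 'Adult'
  ;
  value $YesNo                 /* legal: $ marks a character format and  */
    'Y' = 'Yes'                /* counts toward the 32-character limit   */
    'N' = 'No'
  ;
  /* value AgeGrp2 ...;  illegal: format names cannot end in a number    */
  /* value dollar  ...;  illegal: DOLLAR is a format supplied by SAS     */
run;
```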

In this format, individual values are set equal to their replacements (as literals) for all values intended to be formatted. Values other than 0, 1, 2, 3, and 4 are not matched by the format and therefore may not display as intended. For a discussion of displaying values other than those that appear in the VALUE statement, see Chapter Note 4 in Section 2.12.

The semicolon that ends the VALUE statement is set out on its own line here for readability, making it easy to verify that it is present.

Submitting Program 2.5.1 creates a format named Metro in the format catalog in the Work library. The format only takes effect when used, and it is used in effectively the same manner as a format supplied by SAS. Program 2.5.2 uses the Metro format for the class variable Metro to alter the appearance of its values in Output 2.5.2. Note that since the variable Metro and the format Metro have the same name, and since no width is required, the only syntax element that distinguishes them to the SAS compiler is the required dot (.) in the format name.

Program 2.5.2: Using the Metro Format

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class Metro;

  var HHIncome;

  format Metro Metro.;

run;

Output 2.5.2: Using the Metro Format

Analysis Variable : HHIncome

METRO                             N    Mean   Std Dev   Minimum   Maximum
Not Identifiable              92028   54800     52333    -19998   1076000
Not in Metro Area            230775   47856     45547    -29997   1050000
Metro, Inside City           154368   60328     70874    -19998   1391000
Metro, Outside City          340982   77648     75907    -29997   1739770
Metro, City Status Unknown   340909   64335     66110    -22298   1536000

For this case, a simplified format that distinguishes metro, non-metro, and non-identifiable observations may be desired. Program 2.5.3 contains two approaches to this, the first being clearly the more efficient.

Program 2.5.3: Assigning Multiple Values to the Same Formatted Value

proc format;

  value MetroB

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    2,3,4  = "In a Metro Area"

  ;

  value MetroC

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    2 = "In a Metro Area"

    3 = "In a Metro Area"

    4 = "In a Metro Area"

  ;

run;

A comma-separated list of values is legal on the left side of each assignment, which assigns the formatted value to each listed data value.

This format accomplishes the same result; however, it is important that the literal values on the right side of the assignment are exactly the same. Differences in even simple items like spacing or casing results in different formatted values.

Either format given in Program 2.5.3 can replace the Metro format in Program 2.5.2 to create the result in Output 2.5.3.

Output 2.5.3: Assigning Multiple Values to the Same Formatted Value

Analysis Variable : HHIncome

METRO                             N    Mean   Std Dev   Minimum   Maximum
Not Identifiable              92028   54800     52333    -19998   1076000
Not in Metro Area            230775   47856     45547    -29997   1050000
In a Metro Area              836259   69024     71495    -29997   1739770

It is also possible to use the dash character as an operator, in the form ValueA-ValueB, to define a range on the left side of an assignment; the formatted value is then assigned to every data value between ValueA and ValueB, inclusive. Program 2.5.4 gives an alternate strategy for constructing the formats given in Program 2.5.3, and the resulting format can also be placed into Program 2.5.2 to produce Output 2.5.3.

Program 2.5.4: Assigning a Range of Values to a Single Formatted Value

proc format;

  value MetroD

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    2-4 = "In a Metro Area"

  ;

run;

Certain keywords are also available for use on the left side of an assignment, one of which is OTHER. OTHER applies the assigned format to any value not listed on the left side of an assignment elsewhere in the format definition. Program 2.5.5 uses OTHER to give another method for creating a format that can be used to generate Output 2.5.3. It is important to note that using OTHER often requires significant knowledge of exactly what values are present in the data set.

Program 2.5.5: Using the OTHER Keyword to Assign Remaining Values to a Formatted Value

proc format;

  value MetroE

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    other = "In a Metro Area"

  ;

run;

In general, value ranges should be non-overlapping, and the < symbol—called an exclusion operator in this context—can be used at either end (or both ends) of the dash to indicate the value should not be included in the range. Overlapping ranges are discussed in Chapter Note 5 in Section 2.12. Using exclusion operators to create non-overlapping ranges allows for the categorization of a quantitative variable without having to know the precision of measurement. Program 2.5.6 gives two variations on creating bins for the MortgagePayment data and uses those bins as classes in PROC MEANS, with the results shown in Output 2.5.6A and Output 2.5.6B.

Program 2.5.6: Binning a Quantitative Variable Using a Format

proc format;

  value Mort

    0='None'

    1-350="$350 and Below"

    351-1000="$351 to $1000"

    1001-1600="$1001 to $1600"

    1601-high="Over $1600"

  ;

 

  value MortB

    0='None'

    1-350="$350 and Below"

    350<-1000="Over $350, up to $1000"

    1000<-1600="Over $1000, up to $1600"

    1600<-high="Over $1600"

  ;

run;

 

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class MortgagePayment;

  var HHIncome;

  format MortgagePayment Mort.;

run;

 

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class MortgagePayment;

  var HHIncome;

  format MortgagePayment MortB.;

run;

The keywords LOW and HIGH are available so that the maximum and minimum values need not be known. When applied to character data, LOW and HIGH refer to the sorted alphanumeric values. Note that the LOW keyword excludes missing values for numeric variables but includes missing values for character variables.
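As a sketch of those rules, the hypothetical formats below use LOW and HIGH on numeric and character variables; the names and ranges are invented for illustration.

```sas
proc format;
  value AgeF
    low-<18 = 'Under 18'       /* LOW excludes the numeric missing value (.) */
    18-high = '18 and Over'
  ;
  value $InitF
    low-<'M' = 'A through L'   /* for character values, LOW includes missing */
    'M'-high = 'M through Z'
  ;
run;
```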

In these value ranges, the values used exploit the fact that the mortgage payments are reported to the nearest dollar.

Using the < symbol to exclude the starting value of each range allows the bins to be mutually exclusive and exhaustive irrespective of the precision of the data values. The exclusion operator, <, omits the adjacent value from the range so that 350<-1000 omits only 350, 350-<1000 omits only 1000, and 350<-<1000 omits both 350 and 1000.

When a format is present for a class variable, the format is used to construct the unique values for each category, and this behavior persists in most cases where SAS treats a variable as categorical.

Output 2.5.6A: Binning a Quantitative Variable Using the Mort Format

Analysis Variable : HHIncome

MortgagePayment        N     Mean   Std Dev   Minimum   Maximum
None              603691    45334     53557    -22298   1739770
$350 and Below     59856    47851     42062    -16897    841000
$351 to $1000     283111    64992     45107    -19998   1060000
$1001 to $1600    128801    96107     63008    -29997   1125000
Over $1600         83603   153085    117134    -29997   1407000

Output 2.5.6B: Binning a Quantitative Variable Using the MortB Format

Analysis Variable : HHIncome

MortgagePayment                N     Mean   Std Dev   Minimum   Maximum
None                      603691    45334     53557    -22298   1739770
$350 and Below             59856    47851     42062    -16897    841000
Over $350, up to $1000    283111    64992     45107    -19998   1060000
Over $1000, up to $1600   128801    96107     63008    -29997   1125000
Over $1600                 83603   153085    117134    -29997   1407000

2.5.2 Permanent Storage and Inspection of Defined Formats

Formats can be permanently stored in a catalog (with the default name of Formats) in any assigned SAS library via the use of the LIBRARY= option in the PROC FORMAT statement. As an example, consider Program 2.5.7, which is a revision and extension of Program 2.5.6.

Program 2.5.7: Revisiting Program 2.5.6, Adding LIBRARY= and FMTLIB Options

proc format library=sasuser;

  value Mort

    0='None'

    1-350="$350 and Below"

    351-1000="$351 to $1000"

    1001-1600="$1001 to $1600"

    1601-high="Over $1600"

  ;

  value MortB

    0='None'

    1-350="$350 and Below"

    350<-1000="Over $350, up to $1000"

    1000<-1600="Over $1000, up to $1600"

    1600<-high="Over $1600"

  ;

run;

 

proc format fmtlib library=sasuser;

run;

Using the LIBRARY= option in this manner places the format definitions into the Formats catalog in the Sasuser library and accessing them in subsequent coding sessions requires the system option FMTSEARCH=(SASUSER) to be specified prior to their use. An alternate format catalog can also be used via two-level naming of the form libref.catalog, with the catalog being created if it does not already exist. Any catalog in any library that contains stored formats to be used in a given session can be listed as a set inside the parentheses following the FMTSEARCH= option. Those listed are searched in the given order, with WORK.FORMATS being defined implicitly as the first catalog to be searched unless it is included explicitly in the list.
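For example, a later session that uses the formats stored by Program 2.5.7 might begin with an OPTIONS statement like the sketch below; the two-level catalog name BookData.MyFmts is hypothetical.

```sas
/* Search the Sasuser catalog first, then a hypothetical BookData catalog */
options fmtsearch=(sasuser bookdata.myfmts);

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;
  class MortgagePayment;
  var HHIncome;
  format MortgagePayment Mort.;   /* resolved via the FMTSEARCH= list */
run;
```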

The FMTLIB option shows information about the formats in the chosen library in the Output window; Output 2.5.7 shows the results for this case.

Output 2.5.7: Revisiting Program 2.5.6, Adding LIBRARY= and FMTLIB Options

The top of the table includes general information about the format, including the name, various lengths, and the number of format categories. The default length corresponds to the longest format label set in the VALUE statement. The rows below have columns for each format label and the start and end of each value range. Note that the first category in each of these formats is assigned to a range, even though it only contains a single value, with the start and end values being the same. The use of < as an exclusion operator is also shown in ranges where it is used, and the keyword HIGH is left-justified in the column where it is used. Note that the exclusion operator is applied to the value of 1600 at the low end of the range; it is a syntax error to attempt to apply it to the keyword HIGH (or LOW).

2.6 Subsetting with the WHERE Statement

In many cases, only a subset of the data is used, with the subsetting criteria based on the values of variables in the data set. In these cases, using the WHERE statement allows conditions to be set which choose the records a SAS procedure processes while ignoring the others—no modification to the data set itself is required. If the OBS= data set option is in use, the number chosen corresponds to the number of observations meeting the WHERE condition.
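As a sketch of that interaction, the following step (hypothetical, using OBS=500) summarizes the first 500 observations that satisfy the WHERE condition, not the first 500 rows of the data set.

```sas
proc means data=BookData.IPUMS2005Basic(obs=500);
  var HHIncome;
  where Metro in (2,3,4);   /* OBS= counts only rows meeting this condition */
run;
```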

In order to use the WHERE statement, it is important to understand the comparison and logical operators available. Basic comparisons like equality or various inequalities can be done with symbolic or mnemonic operators—Table 2.6.1 shows the set of comparison operators.

Table 2.6.1: Comparison Operators

Operation               Symbol   Mnemonic
Equal                   =        EQ
Not Equal               ^=       NE
Less Than               <        LT
Less Than or Equal      <=       LE
Greater Than            >        GT
Greater Than or Equal   >=       GE

In addition to comparison operators, Boolean operators for negation and compounding (along with some special operators) are also available—Table 2.6.2 summarizes these operators.

Table 2.6.2: Boolean and Associated Operators

Symbol   Mnemonic      Logic
&        AND           True result if both conditions are true
|        OR            True result if either, or both, conditions are true
         IN            True if matches any element in a list
         BETWEEN-AND   True if in a range of values (including endpoints)
~        NOT           Negates the condition that follows

Revisiting Program 2.5.2 and Output 2.5.2, subsetting the results to only include observations known to be in a metro area can be accomplished with any one of the following WHERE statements.

where Metro eq 2 or Metro eq 3 or Metro eq 4;

where Metro ge 2 and Metro le 4;

where Metro in (2,3,4);

where Metro between 2 and 4;

where Metro not in (0,1);

Each possible value can be checked by using the OR operator between equality comparisons for each possible value. When using OR, each comparison must be complete and specific. For example, it is not legal to write Metro eq 2 or eq 3 or eq 4. It is legal, but unhelpful, to write Metro eq 2 or 3 or 4 because SAS uses numeric values for truth (it has no Boolean type): the values 0 and missing are false, while any other value is true. Hence, Metro eq 2 or 3 or 4 is always true.

This conditioning takes advantage of the fact that the desired values fall into a range. As with OR, each condition joined by the AND must be complete; again, it is not legal to say: Metro ge 2 and le 4. Also, with knowledge of the values of Metro, this condition could have been simplified to Metro ge 2. However, good programming practice dictates that specificity is preferred to avoid incorrect assumptions about data values.

IN allows for simplification of a set of conditions that might otherwise be written using the OR operator, as in the first WHERE statement above. The list is given as a set of values separated by commas or spaces and enclosed in parentheses.

BETWEEN-AND allows for simplification of a value range that can otherwise be written using AND between appropriate comparisons, as in the second WHERE statement above.

The NOT operator makes the truth condition the opposite of what is specified. This is a slight improvement over the IN comparison above, as the list of values not desired is shorter than the list of those that are.

Adding any of these WHERE statements (or any other logically equivalent WHERE statement) to Program 2.5.2 produces the results shown in Table 2.6.3.

Table 2.6.3: Using WHERE to Subset Results to Specific Values of the Metro Variable

Analysis Variable : HHIncome

METRO                             N    Mean   Std Dev   Minimum   Maximum
Metro, Inside City           154368   60328     70874    -19998   1391000
Metro, Outside City          340982   77648     75907    -29997   1739770
Metro, City Status Unknown   340909   64335     66110    -22298   1536000

The tools available allow for conditioning on more than one variable, and the variable(s) conditioned on need only be in the data set in use and do not have to be present in the output generated. In Program 2.6.1, the output is conditioned additionally on households known to have an outstanding mortgage.

Program 2.6.1: Conditioning on a Variable Not Used in the Analysis

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class Metro;

  var HHIncome;

  format Metro Metro.;

  where Metro in (2,3,4) 

         and 

        MortgageStatus in 

         ('Yes, contract to purchase',

          'Yes, mortgaged/ deed of trust or similar debt');

run;

Output 2.6.1: Conditioning on a Variable Not Used in the Analysis

Analysis Variable : HHIncome

METRO                             N    Mean   Std Dev   Minimum   Maximum
Metro, Inside City            57881   86277     82749    -19998   1361000
Metro, Outside City          191021   96319     80292    -29997   1266000
Metro, City Status Unknown   167359   83879     72010    -19998   1407000

The condition on the MortgageStatus variable is a bit daunting, particularly noting that matching character values is a precise operation. Seemingly simple differences like casing or spacing lead to values that are non-matching. Therefore, the literals used in Program 2.6.1 are specified to be an exact match for the data. In Section 3.9, functions are introduced that are useful in creating consistency among character values, along with others that allow for extraction and use of relevant portions of a string. However, the WHERE statement provides some special operators, shown in Table 2.6.4, that allow for simplification in these types of cases without the need to intervene with a function.

Table 2.6.4: Operators for General Comparisons

Symbol   Mnemonic   Logic
?        CONTAINS   True result if the specified value is contained in the
                    data value (character only).
         LIKE       True result if the data value matches the specified value,
                    which may include wildcards: _ is any single character,
                    % is any set of characters.

Program 2.6.2 offers two methods for simplifying the condition on MortgageStatus, one using CONTAINS, the other using LIKE. Either reproduces Output 2.6.1.

Program 2.6.2: Conditioning on a Variable Using General Comparison Operators

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class Metro;

  var HHIncome;

  format Metro Metro.;

  where Metro in (2,3,4) and MortgageStatus contains 'Yes';

run;

proc means data=BookData.IPUMS2005Basic nonobs maxdec=0;

  class Metro;

  var HHIncome;

  format Metro Metro.;

  where Metro in (2,3,4) and MortgageStatus like '%Yes%';

run;

CONTAINS checks to see if the data value contains the string Yes; again, note that the casing must be correct to ensure a match. Also, ensure single or double quotation marks enclose the value to search for—in this case, without the quotation marks, Yes forms a legal variable name and is interpreted by the compiler as a reference to a variable.

LIKE allows for the use of wildcards as substitutes for non-essential character values. Here the % wildcard before and after Yes results in a true condition if Yes appears anywhere in the string and is thus logically equivalent to the CONTAINS conditioning above.

2.7 Using the FREQ Procedure for Categorical Summaries

To produce tables of frequencies and relative frequencies (percentages) like those shown for the case study in Outputs 2.2.3 and 2.2.4, the FREQ procedure is the tool of choice, and this section covers its fundamentals.

2.7.1 Choosing Analysis Variables in PROC FREQ

As in previous sections, the examples here use the IPUMS2005Basic SAS data set, so make sure the BookData library is assigned. As a first step, enter and submit Program 2.7.1. (Note that the use of labels has been re-established in the OPTIONS statement.)

Program 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

options label;

proc freq data=BookData.IPUMS2005Basic;

  table metro mortgageStatus;

run;

The TABLE statement allows for specification of the variables to summarize, and a space-delimited list of variables produces a one-way frequency table for each, as shown in Output 2.7.1.

Output 2.7.1: PROC FREQ with Variables Listed Individually in the TABLE Statement

Metropolitan status

METRO   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0           92028      7.94                  92028                 7.94
1          230775     19.91                 322803                27.85
2          154368     13.32                 477171                41.17
3          340982     29.42                 818153                70.59
4          340909     29.41                1159062               100.00

MortgageStatus                                  Frequency   Percent   Cumulative Frequency   Cumulative Percent
N/A                                                303342     26.17                 303342                26.17
No, owned free and clear                           300349     25.91                 603691                52.08
Yes, contract to purchase                            9756      0.84                 613447                52.93
Yes, mortgaged/ deed of trust or similar debt      545615     47.07                1159062               100.00

The TABLE statement is not required; however, in that case the default behavior produces a one-way frequency table for every variable in the data set. Both types of SAS variables, character and numeric, are legal in the TABLE statement. Given that variables listed in the TABLE statement are treated as categorical (in the same manner as variables listed in the CLASS statement in PROC MEANS), it is best for the summary variables to be categorical or to be formatted into a set of categories.
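To see the default behavior, submitting PROC FREQ with no TABLE statement, as sketched below, produces a one-way table for every variable in IPUMS2005Basic, which is rarely useful for a data set with many variables.

```sas
/* No TABLE statement: one-way frequency tables for every variable */
proc freq data=BookData.IPUMS2005Basic;
run;
```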

The default summaries in a one-way frequency table are: frequency (count), percent, cumulative frequency, and cumulative percent. Of course, the cumulative statistics only make sense if the categories are ordinal, which these are not. Many options are available in the TABLE statement to control what is displayed, and one is given in Program 2.7.2 to remove the cumulative statistics.

Program 2.7.2: PROC FREQ Option for Removing Cumulative Statistics

proc freq data=BookData.IPUMS2005Basic;

  table metro mortgageStatus / nocum;

run;

As with the CLASS statement in the MEANS procedure, variables listed in the TABLE statement in PROC FREQ use the format provided with the variable to construct the categories. Program 2.7.3 uses a format defined in Program 2.5.6 to bin the MortgagePayment variable into categories and, as this is an ordinal set, the cumulative statistics are appropriate.

Program 2.7.3: Using a Format to Control Categories for a Variable in the TABLE Statement

proc format;

  value Mort

    0='None'

    1-350="$350 and Below"

    351-1000="$351 to $1000"

    1001-1600="$1001 to $1600"

    1601-high="Over $1600"

  ;

run;

proc freq data=BookData.IPUMS2005Basic;

  table MortgagePayment;

  format MortgagePayment Mort.;

run;

Output 2.7.3: Using a Format to Control Categories for a Variable in the TABLE Statement

First mortgage monthly payment

MortgagePayment   Frequency   Percent   Cumulative Frequency   Cumulative Percent
None                 603691     52.08                 603691                52.08
$350 and Below        59856      5.16                 663547                57.25
$351 to $1000        283111     24.43                 946658                81.67
$1001 to $1600       128801     11.11                1075459                92.79
Over $1600            83603      7.21                1159062               100.00

The FREQ procedure is not limited to one-way frequencies—special operators between variables in the TABLE statement allow for construction of multi-way tables.

2.7.2 Multi-Way Tables in PROC FREQ

The * operator constructs cross-tabular summaries for two categorical variables, which includes the following statistics:

  • cross-tabular and marginal frequencies
  • cross-tabular and marginal percentages
  • conditional percentages within each row and column

Program 2.7.4 summarizes all combinations of Metro and MortgagePayment, with Metro formatted to add detail and MortgagePayment formatted into the bins used in the previous example.

Program 2.7.4: Using the * Operator to Create a Cross-Tabular Summary with PROC FREQ

proc format;

  value METRO

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    2 = "Metro, Inside City"

    3 = "Metro, Outside City"

    4 = "Metro, City Status Unknown"

  ;

  value Mort

    0='None'

    1-350="$350 and Below"

    351-1000="$351 to $1000"

    1001-1600="$1001 to $1600"

    1601-high="Over $1600"

  ;

run;

 

proc freq data=BookData.IPUMS2005Basic;

  table Metro*MortgagePayment;

  format Metro Metro. MortgagePayment Mort.;

run;

The first variable listed in any request of the form A*B is placed on the rows in the table. Requesting MortgagePayment*Metro transposes the table and the included summary statistics.

The format applied to the Metro variable is merely a change in display and has no effect on the structure of the table—it is five rows with or without the format. The format on MortgagePayment is essential to the column structure—allowing each unique value of MortgagePayment to form a column does not produce a useful summary table.

Output 2.7.4: Using the * Operator to Create a Cross-Tabular Summary with PROC FREQ

Table of METRO by MortgagePayment

METRO(Metropolitan status) by MortgagePayment(First mortgage monthly payment)

Frequency
Percent                           $350 and   $351 to   $1001 to      Over
Row Pct                   None       Below     $1000      $1600     $1600     Total
Col Pct
-----------------------------------------------------------------------------------
Not Identifiable         49379        6979     25488       7307      2875     92028
                          4.26        0.60      2.20       0.63      0.25      7.94
                         53.66        7.58     27.70       7.94      3.12
                          8.18       11.66      9.00       5.67      3.44

Not in Metro Area       134314       21698     60948      10464      3351    230775
                         11.59        1.87      5.26       0.90      0.29     19.91
                         58.20        9.40     26.41       4.53      1.45
                         22.25       36.25     21.53       8.12      4.01

Metro, Inside City       96487        4410     28866      14049     10556    154368
                          8.32        0.38      2.49       1.21      0.91     13.32
                         62.50        2.86     18.70       9.10      6.84
                         15.98        7.37     10.20      10.91     12.63

Metro, Outside City     149961       12148     79388      56330     43155    340982
                         12.94        1.05      6.85       4.86      3.72     29.42
                         43.98        3.56     23.28      16.52     12.66
                         24.84       20.30     28.04      43.73     51.62

Metro, City Status      173550       14621     88421      40651     23666    340909
Unknown                  14.97        1.26      7.63       3.51      2.04     29.41
                         50.91        4.29     25.94      11.92      6.94
                         28.75       24.43     31.23      31.56     28.31

Total                   603691       59856    283111     128801     83603   1159062
                         52.08        5.16     24.43      11.11      7.21    100.00

Various options are available to control the displayed statistics. Program 2.7.5 illustrates some of these with the result shown in Output 2.7.5.

Program 2.7.5: Using Options in the TABLE Statement.

proc freq data=BookData.IPUMS2005Basic;

  table Metro*MortgagePayment / nocol nopercent format=comma10.;

  format Metro Metro. MortgagePayment Mort.;

run;

NOCOL and NOPERCENT suppress the column and overall percentages, respectively, with NOPERCENT also applying to the marginal totals. NOROW and NOFREQ are also available, with NOFREQ also applying to the marginal totals.

A format can be applied to the frequency statistic; however, this only applies to cross-tabular frequency tables and has no effect in one-way tables.

Output 2.7.5: Using Options in the TABLE Statement

Table of METRO by MortgagePayment

METRO(Metropolitan status) by MortgagePayment(First mortgage monthly payment)

Frequency                         $350 and   $351 to   $1001 to      Over
Row Pct                   None       Below     $1000      $1600     $1600       Total
-------------------------------------------------------------------------------------
Not Identifiable        49,379       6,979    25,488      7,307     2,875      92,028
                         53.66        7.58     27.70       7.94      3.12

Not in Metro Area      134,314      21,698    60,948     10,464     3,351     230,775
                         58.20        9.40     26.41       4.53      1.45

Metro, Inside City      96,487       4,410    28,866     14,049    10,556     154,368
                         62.50        2.86     18.70       9.10      6.84

Metro, Outside City    149,961      12,148    79,388     56,330    43,155     340,982
                         43.98        3.56     23.28      16.52     12.66

Metro, City Status     173,550      14,621    88,421     40,651    23,666     340,909
Unknown                  50.91        4.29     25.94      11.92      6.94

Total                  603,691      59,856   283,111    128,801    83,603   1,159,062

Higher dimensional requests can be made; however, they are constructed as a series of two-dimensional tables. Therefore, a request of A*B*C in the TABLE statement creates the B*C table for each level of A, while a request of A*B*C*D makes the C*D table for each combination of A and B, and so forth. Program 2.7.6 generates a three-way table, where a cross-tabulation of Metro and HomeValue is built for each level of MortgageStatus, as shown in Output 2.7.6. The VALUE statement that defines the character format $MortStatus takes advantage of the fact that value ranges are legal for character variables. Be sure to understand the difference between uppercase and lowercase letters when ordering the values of a character variable.

Program 2.7.6: A Three-Way Table in PROC FREQ

proc format;

  value MetroB

    0 = "Not Identifiable"

    1 = "Not in Metro Area"

    other = "In a Metro Area"

  ;

  value $MortStatus

    'No'-'Nz'='No'

    'Yes'-'Yz'='Yes'

  ;

  value Hvalue

    0-65000='$65,000 and Below'

    65000<-110000='$65,001 to $110,000'

    110000<-225000='$110,001 to $225,000'

    225000<-500000='$225,001 to $500,000'

    500000<-high='Above $500,000'

  ;

run;

 

proc freq data=BookData.IPUMS2005Basic;

  table MortgageStatus*Metro*HomeValue/nocol nopercent format=comma10.;

  format MortgageStatus $MortStatus. Metro MetroB. HomeValue Hvalue.;

  where MortgageStatus ne 'N/A';

run;

Output 2.7.6: A Three-Way Table in PROC FREQ

Table 1 of METRO by HomeValue
Controlling for MortgageStatus=No

METRO(Metropolitan status) by HomeValue(House value)

Frequency             $65,000       $65,001      $110,001      $225,001      Above
Row Pct             and Below   to $110,000   to $225,000   to $500,000   $500,000      Total
---------------------------------------------------------------------------------------------
Not Identifiable       10,777         5,460        10,415         2,584      1,134     30,370
                        35.49         17.98         34.29          8.51       3.73

Not in Metro Area      34,766        16,261        26,889         5,553      2,227     85,696
                        40.57         18.98         31.38          6.48       2.60

In a Metro Area        34,176        23,706        71,133        33,590     21,678    184,283
                        18.55         12.86         38.60         18.23      11.76

Total                  79,719        45,427       108,437        41,727     25,039    300,349

Table 2 of METRO by HomeValue
Controlling for MortgageStatus=Yes

METRO(Metropolitan status) by HomeValue(House value)

Frequency             $65,000       $65,001      $110,001      $225,001      Above
Row Pct             and Below   to $110,000   to $225,000   to $500,000   $500,000      Total
---------------------------------------------------------------------------------------------
Not Identifiable        7,486         7,142        19,453         6,468      2,100     42,649
                        17.55         16.75         45.61         15.17       4.92

Not in Metro Area      24,443        19,396        40,668         9,164      2,790     96,461
                        25.34         20.11         42.16          9.50       2.89

In a Metro Area        26,351        37,345       175,482       110,412     66,671    416,261
                         6.33          8.97         42.16         26.52      16.02

Total                  58,280        63,883       235,603       126,044     71,561    555,371

It is also possible for the FREQ procedure to count based on a quantitative variable using the WEIGHT statement, effectively tabulating the sum of the weights. Program 2.7.7 uses the WEIGHT statement to summarize total HomeValue for combinations of Metro and MortgagePayment.

Program 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value.

proc freq data=BookData.IPUMS2005Basic;

  table Metro*MortgagePayment /nocol nopercent format=dollar14.;

  weight HomeValue;

  format Metro MetroB. MortgagePayment Mort.;

run;

Output 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value

Table of HomeValue by METRO

HomeValue(House value)    METRO(Metropolitan status)

Frequency
Row Pct

                         Not             Not in          In a
                         Identifiable    Metro Area      Metro Area          Total
$65,000 and Below          $2,737,530      $8,736,986     $10,969,600    $22,444,116
                                12.20           38.93           48.88
$65,001 to $110,000        $3,770,840      $9,887,454     $21,448,052    $35,106,346
                                10.74           28.16           61.09
$110,001 to $225,000      $15,896,854     $30,632,556    $156,700,074   $203,229,484
                                 7.82           15.07           77.10
$225,001 to $500,000       $8,192,908     $10,741,258    $151,601,294   $170,535,460
                                 4.80            6.30           88.90
Above $500,000             $3,854,280      $4,735,288    $139,862,780   $148,452,348
                                 2.60            3.19           94.21
Total                     $34,452,412     $64,733,542    $480,581,800   $579,767,754

2.8 Reading Raw Data

Data is often not available as a SAS data set; in practice, it comes from external sources including raw files such as text files, spreadsheets such as Microsoft Excel®, or relational databases such as Oracle®. In this section, work with external data sources begins by exploring how to read raw data files.

Raw data refers to certain files that contain unprocessed data that is not in a SAS data set. (Certain other structures also qualify as raw data. See Chapter Note 6 in Section 2.12 for additional details.) Generally, these are plain-text files and some common file types are:

  • tab-delimited text (.txt or .tsv)
  • comma-separated values (.csv)
  • fixed-position files (.dat)

Choices for file extensions are not fixed, so the extension does not dictate the type of data a file contains, and many other file types exist; it is therefore always important to explore the raw data before importing it into SAS. While SAS provides multiple ways to read raw data, this chapter focuses on the DATA step due to its flexibility and ubiquity: understanding the DATA step is a necessity for a successful SAS programmer.

To assist in finding the column numbers when displaying raw files, a ruler is included in the first line when presenting raw data in the book, but the ruler is not present in the actual data file. Input Data 2.8.1 provides an example of such a ruler. Each dash in the ruler represents a column, while a plus represents multiples of five, and a digit represents multiples of ten. For example, the 1 in the ruler represents column 10 in the raw file and the plus sign between the 1 and the 2 represents column 15 in the raw file.

Input Data 2.8.1: Space Delimited Raw File (Partial Listing)

----+----1----+----2----+

1 1800 9998 9998 9998

2 480 1440 9998 9998

3 2040 360 100 9998

4 3000 9998 360 9998

5 840 1320 90 9998 

2.8.1 Introduction to Reading Delimited Files

Delimiters, often used in raw files, are a single character such as a tab, space, comma, or pipe (vertical bar) used to indicate the break between one value and the next in a single record. Input Data 2.8.1 includes a partial representation of the first five records from a space-delimited file (Utility 2001.prn). Reading in this file, or any raw file, requires determining whether the file is delimited and, if so, what delimiters are present. If a file is delimited, it is important to note whether the delimiters also appear as part of the values for one or more variables. The data presented in Input Data 2.8.1 follows a basic structure and uses spaces to separate each record into five distinct values or fields. SAS can read this file correctly using simple list input without the need for additional options or statements using the following rules:

1. At least one blank/space must separate the input values and SAS treats multiple, sequential blanks as a single blank.

2. Character values cannot contain embedded blanks.

3. Character variables are given a length of eight bytes by default.

4. Data must be in standard numeric or character format. Standard numeric values must only contain digits, decimal point, +/-, and E for scientific notation.

Input Data 2.8.1 satisfies these rules using the default delimiter (space). Options and statements are available to help control the behavior associated with rules 1 through 3, which are covered in subsequent sections of this chapter. Violating rule 4 precludes the use of simple list input but is easily addressed with modified list input, as shown in Chapter 3. However, no such options or modifications are required to read Input Data 2.8.1, which is done using Program 2.8.1.

Program 2.8.1: Reading the Utility 2001 Data

data Utility2001;
  infile "--insert path here--Utility 2001.prn";
  input Serial$ Electric Gas Water Fuel;
run;

proc print data = Utility2001 (obs = 5);
run;

The DATA statement begins the DATA step and here names the data set as Utility2001, placing it in the Work library given the single-level naming. Explicit specification of the library is available with two-level naming, for example, Sasuser.Utility2001 or Work.Utility2001—see Program 1.4.3. If no data set name appears, SAS provides the name as DATAn, where n is the smallest whole number (1, 2, 3, …) that makes the data set name unique.

The INFILE statement specifies the location of the file via a full path specification to the file—this path must be completed to reflect the location of the raw file for the code to execute successfully.

The INPUT statement sets the names of each variable from the raw file in the INFILE statement with those names following the conventions outlined in Section 1.6.2. By default, SAS assumes the incoming variables are numeric. One way to indicate character data is shown here – place a dollar sign after each character variable.

Good programming practice dictates that all steps end with an explicit step boundary, including the DATA step.

The OBS= option selects the last observation for processing. Because procedures start with the first observation by default, this step uses the first five observations from the Utility2001 data set, as shown in Output 2.8.1.
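Because OBS= marks the last observation rather than a count, it can be paired with the FIRSTOBS= data set option to select a window of rows. As a brief sketch (not part of the original program):

```sas
*process observations 2 through 5 of the Utility2001 data set;
proc print data = Utility2001 (firstobs = 2 obs = 5);
run;
```

With FIRSTOBS=2 and OBS=5, the procedure processes four observations, since OBS= still indicates the last observation number, not how many to read.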

Output 2.8.1: Reading the Utility 2001 Data (Partial Listing)

Obs    Serial    Electric    Gas     Water    Fuel
  1    1             1800    9998     9998    9998
  2    2              480    1440     9998    9998
  3    3             2040     360      100    9998
  4    4             3000    9998      360    9998
  5    5              840    1320       90    9998

In Program 2.8.1, Serial is read as a character variable; however, it contains only digits and therefore can be stored as numeric. The major advantage in storing Serial as character is size—its maximum value is six digits long and therefore requires six bytes of storage as character, while all numeric variables have a default size of eight bytes. The major disadvantage to storing Serial as character is ordering—for example, as a character value, 11 comes before 2. While the other four variables can be read as character as well, it is a very poor choice as no mathematical or statistical operations can be done on those values. For examples in subsequent sections, Serial is read as numeric.
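Since subsequent examples read Serial as numeric, the only change needed to Program 2.8.1 is dropping the dollar sign from the INPUT statement. A minimal sketch:

```sas
data Utility2001;
  infile "--insert path here--Utility 2001.prn";
  *no dollar sign after Serial, so it is stored as numeric;
  input Serial Electric Gas Water Fuel;
run;
```

Stored as numeric, Serial occupies the default eight bytes but sorts numerically, so 2 comes before 11.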

In Program 2.8.1, the INFILE statement is used to specify the raw data file that the DATA step reads. In general, the INFILE statement may include references to a single file or to multiple files, with each reference provided one of the following ways:

  • A physical path to the files. Physical paths can be either relative or absolute.
  • A file reference created via the FILENAME statement.

Program 2.8.1 is set up to use the first method, with either an absolute or relative path chosen. An absolute path starts with a drive letter or name, while any other specification is a relative path. All relative paths are built from the current working directory. (Refer to Section 1.5 for a discussion of the working directory and setting its value.) It is often more efficient to use a FILENAME statement to build references to external files or folders. Programs 2.8.2 and 2.8.3 demonstrate these uses of the FILENAME statement, producing the same data set as Program 2.8.1.
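To make the distinction concrete, both path styles are shown below in FILENAME statements; the specific paths are hypothetical and must be replaced with the actual location of the file.

```sas
*absolute path: begins with a drive letter or name;
filename Util2001 "C:\BookData\Utility 2001.prn";

*relative path: resolved against the current working directory;
filename Util2001 "Utility 2001.prn";
```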

Program 2.8.2: Using the FILENAME Statement to Point to an Individual File

filename Util2001 "--insert path here--Utility 2001.prn";

data work.Utility2001A;
  infile Util2001;
  input Serial$ Electric Gas Water Fuel;
run;

The FILENAME statement creates a file reference, called a fileref, named Util2001. Naming conventions for a fileref are the same as those for a libref.

The path specified, which can be relative or absolute as in Program 2.8.1, includes the file name. SAS assigns the fileref Util2001 to this file.

The INFILE statement now references the fileref Util2001 rather than the path or file name. Note, quotation marks are not used on Util2001 since it is to be interpreted as a fileref and not a file name or path.

Program 2.8.3: Associating the FILENAME Statement with a Folder

filename RawData '--insert path to folder here--';

data work.Utility2001B;
  infile RawData("Utility 2001.prn");
  input Serial$ Electric Gas Water Fuel;
run;

It is assumed here that the path, either relative or absolute, points to a folder and not a specific file. In that case, the FILENAME statement associates a folder with the fileref RawData. The path specified should be to the folder containing the raw files downloaded from the author page, much like the BookData library was assigned to the folder containing the SAS data sets.

The INFILE statement references both the fileref and the file name. Although the file reference can be made without the quotation marks in certain cases, good programming practice includes the quotation marks.

Since Programs 2.8.2 and 2.8.3 each generate the same result as Program 2.8.1 while actually requiring slightly more code, the benefits of using the FILENAME statement may not be obvious. The form of the FILENAME statement in Program 2.8.2 is useful if a single file needs to be read repeatedly under different conditions, allowing the multiple references to that file to be shortened. More commonly, the form used in Program 2.8.3 is more efficient when reading multiple files from a common location. Again, if the path specified is to the folder containing the raw files downloaded from the author page, the fileref RawData refers to the location for all non-SAS data sets used in examples for Chapters 2 through 7.

2.8.2 More with List Input

Input Data 2.8.4 includes a partial representation of the first five records from a comma-delimited file (IPUMS2005Basic.csv). Due to the width of the file, Input Data 2.8.4 truncates the third and fifth records.

Input Data 2.8.4: Comma Delimited Raw File (Partial Listing)

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+

2,Alabama,Not in identifiable city (or size group),0,4,73,Rented,N/A,0,12000,9999999

3,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,17800,9999999

4,Alabama,Not in identifiable city (or size group),0,4,73,Owned,”Yes, mortgaged/ deed

5,Alabama,Not in identifiable city (or size group),0,1,0,Rented,N/A,0,2000,9999999

6,Alabama,Not in identifiable city (or size group),0,3,97,Owned,”No, owned free and 

Not only is this file delimited by commas, but the eighth field on the third and fifth rows also includes data values containing a comma, with those values embedded in quotation marks. (Recall these records are truncated in the text due to their length so the final quote is not shown for these two records.) To successfully read this file, the DATA step must recognize the delimiter as a comma, but also that commas embedded in quoted values are not delimiters. The DSD option is introduced in Program 2.8.4 to read such a file.

Program 2.8.4: Reading the 2005 Basic IPUMS CPS Data

data work.Ipums2005Basic;
  infile RawData("IPUMS2005basic.csv") dsd;
  input Serial State $ City $ CityPop Metro
        CountyFIPS Ownership $ MortgageStatus $
        MortgagePayment HHIncome HomeValue;
run;

proc print data = work.Ipums2005Basic (obs = 5);
run;

The DSD option included in the INFILE statement modifies the delimiter and some additional default behavior as listed below.

Again, the INPUT statement names each of the variables read from the raw file in the INFILE statement and sets their types. By default, SAS assumes the incoming variables are numeric; however, State, City, Ownership, and MortgageStatus must be read as character values.

Output 2.8.4 shows that, while Program 2.8.4 executes successfully, the resulting data set does not correctly represent the values from Input Data 2.8.4—the City and MortgageStatus variables are truncated. This truncation occurs due to the default length of 8 assigned to character variables; therefore, SAS did not allocate enough memory to store the values in their entirety. Only the first five records are shown; however, further investigation reveals this truncation occurs for the variable State as well.

Output 2.8.4: Reading the 2005 Basic IPUMS CPS Data (Partial Listing)

Obs    Serial    State      City        CityPop    Metro    CountyFIPS    Ownership
  1    2         Alabama    Not in i    0          4        73            Rented
  2    3         Alabama    Not in i    0          1        0             Rented
  3    4         Alabama    Not in i    0          4        73            Owned
  4    5         Alabama    Not in i    0          1        0             Rented
  5    6         Alabama    Not in i    0          3        97            Owned

Obs    MortgageStatus    MortgagePayment    HHIncome    HomeValue
  1    N/A               0                  12000       9999999
  2    N/A               0                  17800       9999999
  3    Yes, mor          900                185000      137500
  4    N/A               0                  2000        9999999
  5    No, owne          0                  72600       95000

Program 2.8.4 uses the DSD option in the INFILE statement to change three default behaviors:

1. Change the delimiter to comma

2. Treat two consecutive delimiters as a missing value

3. Treat delimiters inside quoted strings as part of a character value and strip off the quotation marks

For Input Data 2.8.4, the first and third actions are necessary to successfully match the structure of the delimiters in the data since (a) the file uses commas as delimiters and (b) commas are included in the quoted strings in the data for the MortgageStatus variable. Because the file does not contain consecutive delimiters, the second modification has no effect.

Of course, it might be necessary to produce the second and third effects while using blanks—or any other character—as the delimiter. It is also often necessary to change the delimiter without making the other modifications included with the DSD option. In those cases, use the DLM= option to specify one or more delimiters by placing them in a single set of quotation marks, as shown in the following examples.

1. DLM = '/' causes SAS to move to a new field when it encounters a forward slash

2. DLM = ',' causes SAS to move to a new field when it encounters a comma

3. DLM = ',/' causes SAS to move to a new field when it encounters either a comma or forward slash
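For instance, reading a hypothetical pipe-delimited file whose quoted values may contain embedded pipes requires combining DLM= with DSD; the file and variable names below are illustrative only.

```sas
data work.PipeExample;
  *DSD handles quoted values and consecutive delimiters;
  *DLM= overrides the comma that DSD would otherwise assume;
  infile RawData("pipe example.txt") dlm="|" dsd;
  input Serial State $ HHIncome;
run;
```

Here DSD supplies the quoted-string and missing-value behavior, while DLM= sets the delimiter itself.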

Introduction to Variable Attributes

In SAS, the amount of memory allocated to a variable is called the variable’s length; length is one of several attributes that each variable possesses. Other attributes include the name of the variable, its position in the data set (1st column, 2nd column, ...), and its type (character or numeric). As with all the variable attributes, the length is set either by use of a default value or by explicitly setting a value.

By default, both numeric and character variables have a length of eight bytes. For character variables, one byte of memory can hold one character in the English language. Thus, the DATA step truncates several values of State, City, and MortgageStatus from Input Data 2.8.4 since they exceed the default length of eight bytes. For numeric variables, the default length of eight bytes is sufficient to store up to 16 decimal digits (commonly known as double-precision). When using the Microsoft Windows® operating system, numeric variables have a minimum allowable length of three bytes and a maximum length of eight bytes. Character variables may have a minimum length of 1 byte and a maximum length of 32,767 bytes. While there are many options and statements that affect the length of a variable implicitly, the LENGTH statement allows for explicit declaration of the length and type attributes for any variables. Program 2.8.5 demonstrates the usage of the LENGTH statement.

Program 2.8.5: Using the LENGTH Statement

data work.Ipums2005Basic;
  length state $ 20 City$ 25 MortgageStatus$50;
  infile RawData("IPUMS2005basic.csv") dsd;
  input Serial State City CityPop Metro
        CountyFIPS Ownership $ MortgageStatus$
        MortgagePayment HHIncome HomeValue;
run;

proc print data = work.Ipums2005Basic (obs = 5);
run;

The LENGTH statement sets the lengths of State, City, and MortgageStatus to 20, 25, and 50 characters, respectively, with the dollar sign indicating these are character variables. Separating the dollar sign from the variable name or length value is optional, though good programming practices dictate using a consistent style to improve readability.

Type (character or numeric) is an attribute that cannot be changed in the DATA step once it has been established. Because the LENGTH statement sets these variables as character, the dollar sign is optional in the INPUT statement. However, good programming practices generally dictate including it for readability and so that removal of the LENGTH statement does not lead to a data type mismatch. (This would be an execution-time error.)

The spacing between the dollar sign and the variable name is optional in the INPUT statement as well. Good programming practices still dictate selecting a consistent spacing style.

Output 2.8.5 shows the results of explicitly setting the length of the State, City, and MortgageStatus variables. In addition to the lengths of these three variables changing, their column position in the SAS data set has changed as well. Variables are added to the data set based on the order they are encountered during compilation of the DATA step, so since the LENGTH statement precedes the INPUT statement, it has actually changed two attributes—length and position—for these three variables (while also defining the type attribute as character).

Output 2.8.5: Using the LENGTH Statement (Partial Listing)

Obs    state      City                        MortgageStatus                                   Serial    CityPop    Metro    CountyFIPS
  1    Alabama    Not in identifiable city    N/A                                              2         0          4        73
  2    Alabama    Not in identifiable city    N/A                                              3         0          1        0
  3    Alabama    Not in identifiable city    Yes, mortgaged/ deed of trust or similar debt    4         0          4        73
  4    Alabama    Not in identifiable city    N/A                                              5         0          1        0
  5    Alabama    Not in identifiable city    No, owned free and clear                         6         0          3        97

Obs    Ownership    MortgagePayment    HHIncome    HomeValue
  1    Rented       0                  12000       9999999
  2    Rented       0                  17800       9999999
  3    Owned        900                185000      137500
  4    Rented       0                  2000        9999999
  5    Owned        0                  72600       95000

Like the type attribute, the position and length attributes cannot be changed after their initial values are set. Attempting to change the length attribute after the INPUT statement, as shown in Program 2.8.6, results in a warning in the log.

Program 2.8.6: Using the LENGTH Statement After the INPUT Statement

data work.Ipums2005Basic;
  infile RawData("IPUMS2005basic.csv") dsd;
  input Serial State $ City $ CityPop Metro
        CountyFIPS Ownership $ MortgageStatus $
        MortgagePayment HHIncome HomeValue;
  length state $20 City $25 MortgageStatus $50;
run;

Log 2.8.6: Warning Generated by Attempting to Reset Length

WARNING: Length of character variable State has already been set. Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

Tab-Delimited Files

If the delimiter is not a standard keyboard character, such as the tab used in tab-delimited files, an alternate method is used to specify the delimiter via its hexadecimal code. While the correct hexadecimal representation depends on the operating system, Microsoft Windows and Unix/Linux machines typically use ASCII codes. The ASCII hexadecimal code for a tab is 09 and is written in the SAS language as '09'x; the x appended to the literal value of 09 instructs the compiler to make the conversion from hexadecimal. Program 2.8.7 uses hexadecimal encoding in the DLM= option to correctly set the delimiter to a tab. The results of Program 2.8.7 are identical to those of Program 2.8.5.

Program 2.8.7: Reading Tab-Delimited Data

data work.Ipums2005Basic;
  length state $ 20 City $ 25 MortgageStatus $ 50;
  infile RawData('ipums2005basic.txt') dlm = '09'x;
  input Serial State $ City $ CityPop Metro
        CountyFIPS Ownership $ MortgageStatus $
        MortgagePayment HHIncome HomeValue;
run;

Because there are no missing values denoted by sequential tabs, nor any tabs included in data values, the DSD option is no longer needed in the INFILE statement for this program.

To specify multiple delimiters that include the tab, each must use a hexadecimal representation; for example, DLM = '2C09'x selects commas and tabs as delimiters since 2C is the hexadecimal value for a comma. For records with different delimiters within the same DATA step, see Chapter Note 7 in Section 2.12.

2.8.3 Introduction to Reading Fixed-Position Data

While delimited data takes advantage of delimiting characters in the data, other files depend on the starting and stopping position of the values being read. These types of files are referred to by several names: fixed-width, fixed-position, and fixed-field, among others. The first five records from a fixed-position file (IPUMS2005Basic.dat) are shown in Input Data 2.8.8. As with Input Data 2.8.4, truncation of this display occurs due to the length of the record—now occurring in each of the five records.

Input Data 2.8.8: Excerpt from a Fixed-Position Data File

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+

       2 Alabama              Not in identifiable city (or size group)     0   4  73

       3 Alabama              Not in identifiable city (or size group)     0   1   0 

       4 Alabama              Not in identifiable city (or size group)     0   4  73 

       5 Alabama              Not in identifiable city (or size group)     0   1   0

       6 Alabama              Not in identifiable city (or size group)     0   3  97

Since fixed-position files do not use delimiters, reading a fixed-position file requires knowledge of the starting position of each data value. In addition, either the length or stopping position of the data value must be known. Using the ruler, the first displayed field, Serial, appears to begin and end in column 8. However, inspection of the complete raw file reveals that is only the case for the single-digit values of Serial. The longest value is eight digits wide, so the variable Serial truly starts in column 1 and ends in column 8. Similarly, the next field, State, begins in column 10 and ends in column 29. Some text editors, such as Notepad++ and Visual Studio Code, show the column number in a status bar as the cursor moves across lines in the file.

The DATA step for reading fixed-position data looks similar to the DATA step for reading delimited data, but there are several important modifications. For fixed-position files, the syntax of the INPUT statement provides information about column positions of the variable values in the raw file, as it cannot rely on delimiters for separating values. Therefore, delimiter-modifying INFILE options such as DSD and DLM= have no utility with fixed-position data. Two different forms of input are commonly used for fixed-position data: column input or formatted input. This section focuses on column input while Chapter 4 discusses formatted input.

Column Input

Column input takes advantage of the fixed positions in which variable values are found by directly placing the starting and ending column positions into the INPUT statement. Program 2.8.8 shows how to use column input to read the IPUMS CPS 2005 basic data. The results of Program 2.8.8 are identical to Output 2.8.5.

Program 2.8.8: Reading Data Using Column Input

data work.ipums2005basicFPa;
  infile RawData('ipums2005basic.dat');
  input serial 1-8 state $ 10-29 city $ 31-70 cityPop 72-76
        metro 78-80 countyFips 82-84 ownership $ 86-91
        mortgageStatus $ 93-137 mortgagePayment 139-142
        HHIncome 144-150 homeValue 152-158;
run;

The LENGTH statement is no longer needed—when using column input, SAS assigns the length based on the number of columns read if the length attribute is not previously defined. Here, SAS assigns State a length of 20 bytes, just as was done in the LENGTH statement in Program 2.8.5.

The first value indicates the column position—31—from which SAS should start reading for the current variable, City. The second number—70—indicates the last column SAS reads to determine the value of City.

The default length of eight bytes is still used for numeric variables, regardless of the number of columns.

Beyond the differences between column input and list input shown in Program 2.8.8, because column input uses column positions, the INPUT statement can read variables in any order and can even reread columns if necessary. Furthermore, the INPUT statement can skip unwanted variables. Program 2.8.9 reads Input Data 2.8.8 and demonstrates the ability to reorder and reread columns.

Program 2.8.9: Reading the Input Variables Differently than Column Order

data work.ipums2005basicFPb;
  infile RawData('ipums2005basic.dat');
  input serial 1-8 hhIncome 144-150 homeValue 152-158
        ownership $ 86-91 ownershipCoded $ 86
        state $ 10-29 city $ 31-70 cityPop 72-76
        metro 78-80 countyFips 82-84
        mortgageStatus $ 93-137 mortgagePayment 139-142;
run;

proc print data = work.ipums2005basicFPb (obs = 5);
  var serial -- state;
run;

Output 2.8.9 shows that HHIncome and HomeValue are now earlier in the data set. Column input allows for reading variables in a user-specified order.

Column 86 is read twice: first as part of a full value for Ownership, and second as a simplified version using only the first character as the value of a new variable, OwnershipCoded.

As discussed in Chapter Note 3 in Section 1.7, the double-dash selects all variables between Serial and State, inclusive.

Output 2.8.9: Reading the Input Variables Differently than Column Order

Obs    serial    hhIncome    homeValue    ownership    ownershipCoded    state
  1    2         12000       9999999      Rented       R                 Alabama
  2    3         17800       9999999      Rented       R                 Alabama
  3    4         185000      137500       Owned        O                 Alabama
  4    5         2000        9999999      Rented       R                 Alabama
  5    6         72600       95000        Owned        O                 Alabama
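Column input also permits skipping fields entirely: any columns not referenced in the INPUT statement are simply ignored. As a sketch (the data set name here is arbitrary), only three of the fields from Input Data 2.8.8 are kept:

```sas
data work.ipums2005basicFPc;
  infile RawData('ipums2005basic.dat');
  *columns for the other eight fields are never read;
  input serial 1-8 state $ 10-29 hhIncome 144-150;
run;
```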

Mixed Input

Programs 2.8.1 through 2.8.7 make use of simple list input for every variable, and Programs 2.8.8 and 2.8.9 use column input for every variable. However, the choice is not always between one style or the other. If a file contains some delimited fields while other fields have fixed positions, it is necessary to use multiple input styles simultaneously. This process, called mixed input, requires mastery of two other input methods covered in Chapter 3, modified list input and formatted input, along with a substantial understanding of how the DATA step processes raw data. For a discussion of the fifth and final input style, named input, see the SAS Documentation.

2.9 Details of the DATA Step Process

This section provides further details about how the DATA step functions. While this material can initially be considered optional for many readers, understanding it makes writing high-quality code easier by providing a foundation for how certain coding decisions lead to particular outcomes. This material is also essential for successful completion of the base certification exam.

2.9.1 Introduction to the Compilation and Execution Phases

SAS processes every step in Base SAS, including the DATA step, in two phases: compilation and execution. Each of the DATA steps seen so far in this text has several elements in common: each reads data from one or more sources (for example, a SAS data set or a raw data file), and each creates a data set as a result. For DATA steps such as these, the flowchart in Figure 2.9.1 provides a high-level overview of the actions taken by SAS upon submission of a DATA step. Details about the individual actions are included in this section, in the Chapter Notes in Section 2.12, and in subsequent chapters.

Figure 2.9.1: Flowchart of Default DATA Step Actions

Compilation Phase

During the compilation phase, SAS begins by tokenizing the submitted code and sending complete statements to the compiler. (For more details, see Chapter Note 8 in Section 2.12.) Once a complete statement is sent to the compiler, the compiler checks the statement for syntax errors. If there is a syntax error, SAS attempts to make a substitution that creates legal syntax and prints a warning to the SAS log indicating the substitution made. For example, misspelling the keyword DATA as DAAT produces the following warning.

WARNING 14-169: Assuming the symbol DATA was misspelled as daat.

Be sure to review these warnings and correct the syntax even if SAS makes an appropriate substitution. If there is a syntax error and SAS cannot make a substitution, then an error message is printed to the log, and the current step is not executed. For example, misspelling the keyword DATA as DSTS results in the following error.

ERROR 180-322: Statement is not valid or it is used out of proper order.

If there is not a syntax error, or if SAS can make a substitution to correct a syntax error, then the compilation phase continues to the next statement, tokenizes it, and checks it for syntax errors. This process continues until SAS compiles all statements in the current DATA step.

When reading raw data, SAS creates an input buffer to load individual records from the raw data and creates the program data vector to assign the parsed values to variables for later delivery to the SAS data set. During this process, SAS also creates the shell for the descriptor portion, or metadata, for the data set, which is accessible via procedures such as the CONTENTS procedure from Chapter 1. Of course, not all elements of the descriptor portion, such as the number of observations, are known during the compilation phase. Once the compilation phase ends, SAS enters the execution phase where the compiled code is executed. At the conclusion of the execution phase, SAS populates any such remaining elements of the descriptor portion.

Execution Phase

The compilation phase creates the input buffer (when reading from a raw data source) and creates the program data vector; however, it is the execution phase that populates them. SAS begins by initializing the variables in the program data vector based on data type (character or numeric) and variable origin (for example, a raw data file or a SAS data set). SAS then executes the programming statements included in the DATA step. Certain statements, such as the LENGTH or FORMAT statements shown earlier in this chapter, are considered compile-time statements because SAS completes their actions during the compilation phase. Compile-time statements take effect during the compilation phase, and their effects cannot be altered during the execution phase. Statements active during the execution phase are referred to as execution-time statements.

Finally, when SAS encounters the RUN statement (or any other step boundary) the default actions are as follows:

1. output the current values of user-selected variables to the data set

2. return to the top of the DATA step

3. reset the values in the input buffer and program data vector

At this point, the input buffer (if it exists) is empty, and the program data vector variables are incremented/reinitialized as appropriate so that the execution phase can continue processing the incoming data set. For more information about step boundaries, see Chapter Note 9 in Section 2.12.

When reading in data from various sources, the execution phase ends when it is determined that no more data can or should be read, based on the programming statements in the DATA step. Because multiple factors affect this, an in-depth discussion is not provided here. Instead, as each new technique for reading and combining data is presented, a review of when the DATA step execution phase ends is included. This chapter includes examples of reading a single raw data file using an INFILE statement; in this case, the execution phase ends when SAS encounters an end-of-file (EOF) marker in the incoming data source. For plain text files, the EOF marker is a non-printable character that lets software reading the file know that the file contains no further information. At the conclusion of the execution phase, SAS completes the content portion of the data set, which contains the data values, and finalizes the descriptor portion.

2.9.2 Building blocks of a Data Set: Input Buffers and Program Data Vectors

Input Buffer

When reading raw data, SAS needs to parse the characters from the plain text in order to determine the values to place in the data set. Parsing involves dividing the text into groups of characters and interpreting each group as a value for a variable. To facilitate this, the default is for SAS to read a single line of text from the raw file and place it into the input buffer—a section of logical memory. In the input buffer, SAS places each character into its own column and uses a column pointer to keep track of the column the INPUT statement is currently reading.

Program Data Vector

Regardless of the data source used in a DATA step (raw data files or SAS data sets), a program data vector (PDV) is created. Like the input buffer, the PDV is a section of logical memory; but, instead of storing raw, unstructured data, the PDV is where SAS stores variable values. SAS determines these values in potentially many ways: by parsing information in the input buffer, by reading values from structured sources such as Excel spreadsheets or SAS data sets, or by executing programming statements in the DATA step. Just as the input buffer holds a single line of raw text, the PDV holds only the values of each variable for a single record.

In addition to user-defined variables, SAS places automatic variables into the PDV. Two automatic variables, _N_ and _ERROR_, are present in every PDV. By default, the DATA step acts as a loop that repeatedly processes any executable statements and builds the final data set one record at a time. These loops are referred to as iterations and are tracked by the automatic variable _N_. _N_ is a counter that keeps track of the number of DATA step iterations—how many times the DATA statement has executed—and is initialized to one at the invocation of the DATA step. _N_ is not necessarily the same as the number of observations in the data set since programming elements are available to selectively output records to the final data set. Similarly, certain statements and options are available to select only a subset of the variables for the final data set.

The second automatic variable, _ERROR_, is an indicator that SAS initializes to zero and sets to one at the first instance of certain non-syntax errors. Details about the errors it tracks are discussed in Chapter Note 10 in Section 2.12. Automatic variables are not written to the resulting data set, though their values can be assigned to new variables or used in other DATA step programming statements.
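Because automatic variables are dropped from the output data set, preserving them requires copying their values into ordinary variables, as noted above. A minimal sketch (the data set names Work.IterInfo and Sashelp.Class are illustrative choices, not from the chapter):

```sas
data work.iterinfo;
  set sashelp.class;     /* read an existing SAS data set              */
  Iteration = _N_;       /* copy the automatic iteration counter       */
  HadError  = _ERROR_;   /* copy the error indicator (zero if no error */
                         /* tracked by _ERROR_ has occurred)           */
run;
```

Iteration and HadError are ordinary variables, so they survive into Work.IterInfo even though _N_ and _ERROR_ do not.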

Example

Some aspects of the compilation and execution phases are demonstrated below using a raw data set having the five variables shown in Input Data 2.9.1: flight number (FlightNum), flight date (Date), destination city (Destination), number of first-class passengers (FirstClass), and number of economy passengers (EconClass). The first line contains a ruler to help locate the values; it is not included in the raw file. Program 2.9.1 reads in this data set.

Input Data 2.9.1: Flights.prn data set

----+----1----+----2----+----3

439 12/11/2000  LAX 20 137

921 12/11/2000 DFW 20 131

114  12/12/2000 LAX 15 170

982 12/12/2000  dfw 5  85

439 12/13/2000 LAX 14 196

982  12/13/2000 DFW 15 116

431 12/14/2000 LaX 17 166

982 12/14/2000  DFW 7  88

114  12/15/2000 LAX  0  187

982 12/15/2000  DFW 14 31

Program 2.9.1: Demonstrating the Input Buffer and Program Data Vector (PDV)

data work.flights;

  infile RawData('flights.prn');

  input FlightNum Date $ Destination $ FirstClass EconClass;

run;

During the compilation phase, SAS scans each statement for syntax errors, and finding none in this code, it creates various elements as each statement compiles. The DATA statement triggers the initial creation of the PDV with the two automatic variables: _N_ and _ERROR_. SAS then determines an input buffer is necessary when it encounters the INFILE statement. SAS automatically allocates the maximum amount of memory, 32,767 bytes, when it creates the input buffer; if explicit control is needed, the INFILE option LRECL= allows specification of a value. Compilation of the INPUT statement completes the PDV, establishing its five variables in the order they are encountered, along with their attributes.

A visual representation of the input buffer and PDV at various points in the compilation and execution phases is given in Tables 2.9.1 through 2.9.10, with the input buffer showing 26 columns here for the sake of brevity—the actual size is 32,767 columns.

Table 2.9.1: Representation of Input Buffer During Compilation Phase

Column:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Buffer:  (empty; no record has been loaded during the compilation phase)

Table 2.9.2: Representation of the PDV at the Completion of the Compilation Phase

_N_   _ERROR_   FlightNum   Date   Destination   FirstClass   EconClass
(no values are assigned during the compilation phase)

Note that while SAS is not case-sensitive when referencing variables, variable names are stored as they are first referenced.

Once the compilation phase has completed, the execution phase begins by initializing the variables in the PDV. Each of the user-defined variables in this example comes from a raw data file, so they are all initialized to missing. Recall missing numeric data is represented with a single period, and missing character values are represented as a null string. The automatic variables _N_ and _ERROR_ are initialized to one and zero, respectively, since this is the first iteration through the DATA step, and no errors tracked by _ERROR_ have been encountered.

Table 2.9.3: Representation of the PDV at the Beginning of the Execution Phase

_N_   _ERROR_   FlightNum   Date   Destination   FirstClass   EconClass
1     0         .                                .            .

When SAS encounters the INPUT statement on the first iteration of the DATA step, it reads the first line of data and places it in the input buffer with each character in its own column. Table 2.9.4 illustrates this for the first record of Flights.prn.

Table 2.9.4: Illustration of the Input Buffer After Reaching the INPUT Statement on the First Iteration

Column:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Buffer:   4  3  9     1  2  /  1  1  /  2  0  0  0        L  A  X     2  0     1  3  7

To move raw data from the input buffer to the PDV, SAS must parse the character string from the input buffer to determine which characters should be grouped together and whether any of these character groupings must be converted to numeric values. The parsing process uses information found in the INFILE statement (for example, DSD and DLM=), the INPUT statement (such as the $ for character data), and other sources (like the LENGTH statement) to determine the values it places in the PDV.

No delimiter options are present in the INFILE statement in Program 2.9.1; thus, a space is used as the default delimiter, and the first variable, FlightNum, is read using simple list input. SAS uses column pointers to keep track of where the parsing begins and ends for each variable and, with simple list input, SAS begins in the first column and scans until the first non-delimiter character is found. In this record, the first column is non-blank, so the starting pointer is placed there, indicated by the blue triangle below Table 2.9.5. Next, SAS scans until it finds another delimiter (a blank in this case), which is indicated below Table 2.9.5 with the red octagon. Thus, when reading the input buffer to create FlightNum, SAS has read from column 1 up to column 4.

Table 2.9.5: Column Pointers at the Starting and Ending Positions for Parsing FlightNum

Column:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Buffer:   4  3  9     1  2  /  1  1  /  2  0  0  0        L  A  X     2  0     1  3  7

(The starting pointer, shown as a blue triangle in the original figure, is at column 01; the ending pointer, shown as a red octagon, is at column 04.)

Based on information defined in the descriptor portion of the data during compilation of the INPUT statement, FlightNum is a numeric variable with a default length of eight bytes. There are no additional instructions on how this value should be handled, so SAS converts the extracted character string “439” to the number 439 and sends it to the PDV. Note that the blank found in column 4 is not part of the parsed value—only non-delimiter columns are included. Table 2.9.6 shows the results of parsing FlightNum from the first record.

Table 2.9.6: Representation of the PDV After FlightNum is Read During the First Iteration

_N_   _ERROR_   FlightNum   Date   Destination   FirstClass   EconClass
1     0         439                              .            .

Before parsing begins for the next value, SAS advances the column pointer one position, in this case advancing to column 5. This prevents SAS from beginning the next value in a column that was used to create a previous value. This automatic advancement of a single column occurs regardless of the input style, even though it is only demonstrated here for simple list input.

Since Date is also read in using simple list input, the parsing process is similar to how FlightNum was read. SAS begins at column 5 and reads until it encounters the next delimiter, which is in column 15. Table 2.9.7 shows this with, as before, the blue triangle indicating the starting column and the red octagon the ending column.

Table 2.9.7: Column Pointers at the Starting and Ending Positions for Parsing Date

Column:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Buffer:   4  3  9     1  2  /  1  1  /  2  0  0  0        L  A  X     2  0     1  3  7

(The starting pointer, the blue triangle, is at column 05; the ending pointer, the red octagon, is at column 15.)

The characters read for Date are “12/11/2000”, and SAS must parse this string using the provided instructions, which are given by the dollar sign used in the INPUT statement for this variable. This declares its type as character and, since there are no instructions about the length, the default length of eight is used. Unlike numeric variables, where the length attribute does not control the displayed width, the length of a character variable is typically equal to the number of characters that can be stored. (Check the SAS Documentation to determine how length is related to printed width for character values in various languages and encodings.) The resulting value for date, shown in Table 2.9.8, has been truncated to the first 8 characters. This highlights that while list input reads from delimiter to delimiter, it only stores the value parsed from the input buffer subject to any attributes, such as type and length, previously established.

Table 2.9.8: Representation of the PDV after Date is Read During the First Iteration

_N_   _ERROR_   FlightNum   Date       Destination   FirstClass   EconClass
1     0         439         12/11/20                 .            .

As demonstrated in Program 2.8.5, one way to prevent the truncation of Date values is to use a LENGTH statement to set the length of Date to at least 10 bytes. Another means of avoiding this issue with Date is to read it with an informat, a concept covered in Chapter 3. This has the added benefit of converting the Date values to numeric, allowing for easier sorting, computations, and changes in display formats.
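A sketch of the first approach repeats Program 2.9.1 with a LENGTH statement added ahead of the INPUT statement; since length is a compile-time attribute, the statement order matters:

```sas
data work.flights;
  length Date $ 10;                 /* compile-time: set Date's length before INPUT fixes it at 8 */
  infile RawData('flights.prn');
  input FlightNum Date $ Destination $ FirstClass EconClass;
run;
```

With this change, the full ten-character string 12/11/2000 is stored rather than the truncated 12/11/20.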

SAS continues through the input buffer and, since all variables in this example are read using simple list input, the reading and parsing follows the same process as before. Table 2.9.9 shows the starting and stopping position of each of the remaining variables. Note that because SAS automatically advances the column pointer by one column after every variable and because list input scans for the next non-delimiter, the starting position for Destination in this record is column 17 rather than column 16.

Table 2.9.9: Column Pointers at the Starting and Ending Positions for Parsing the Remaining Values

Column:  01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Buffer:   4  3  9     1  2  /  1  1  /  2  0  0  0        L  A  X     2  0     1  3  7

(Pointer pairs: Destination starts at column 17 and ends at column 20; FirstClass starts at column 21 and ends at column 23; EconClass starts at column 24 and ends with the record at column 26.)

Table 2.9.10 contains the final PDV for the first record—the values that are sent to the data set at the end of the first iteration of the DATA step which corresponds to the location of the RUN statement (or other step boundary).

Table 2.9.10: Representation of the PDV After Reading All Values

_N_   _ERROR_   FlightNum   Date       Destination   FirstClass   EconClass
1     0         439         12/11/20   LAX           20           137

After this record is sent to the data set, the execution phase returns to the top of the DATA step and the variables are reinitialized as needed. _N_ is incremented to 2, _ERROR_ remains at zero since no tracked errors have been encountered, and the remaining variables are set back to missing in this case. Further iterations continue the process: the next row is loaded into the input buffer, values are parsed to the PDV, and those values are sent to the data set at the bottom of the DATA step. This implicit iteration terminates when the end-of-file marker is encountered.

2.9.3 Debugging the DATA Step

Following the process from Section 2.9.1 may seem tedious at first, but understanding how SAS parses data when moving it from the raw file through the input buffer to the PDV (and ultimately to the data set) is crucial for success in more complex cases. It is often inefficient to develop a program through trial and error; instead, knowledge of the data-handling process makes program development smoother and more reliable. This section discusses several statements that SAS provides to help follow aspects of the parsing process through iterations of the DATA step. Program 2.9.2 demonstrates the LIST statement using Input Data 2.9.1.

Program 2.9.2: Demonstrating the LIST Statements

data work.flights;

  infile RawData('flights.prn');

  input FlightNum Date $ Destination $ FirstClass EconClass;

  list;

run;

The LIST statement writes the contents of the input buffer to the log at the end of each iteration of the DATA step, placing a ruler before the first observation is printed. Log 2.9.2 shows the results of including the LIST statement in Program 2.9.2 for the first five records. The complete input buffer, including the delimiters, appears for each record.

Log 2.9.2: Demonstrating the LIST Statements

RULE:     ----+----1----+----2----+----3----+----4

1         439 12/11/2000  LAX 20 137  26 

2         921 12/11/2000 DFW 20 131 25

3         114  12/12/2000 LAX 15 170 26

4         982 12/12/2000  dfw 5  85 25

5         439 12/13/2000 LAX 14 196 25

Before writing the input buffer contents for the first time, the LIST statement prints a ruler to the log.

The LIST statement writes the complete input buffer for the record.

If the records have variable length, then the LIST statement also prints the number of characters in the input buffer.

Since the log is a plain-text environment, SAS cannot display non-printable characters such as tabs. However, in these cases, SAS prints additional information to the log to ensure an unambiguous representation of the input buffer. Program 2.9.3 demonstrates the results of using the LIST statement with a tab-delimited file.

Program 2.9.3: LIST Statement Results with Non-Printable Characters

data work.flights;

  infile RawData(‘flights.txt’) dlm = ‘09’x;

  input FlightNum Date $ Destination $ FirstClass EconClass;

  list;

run;

Note the different file extension. This data is similar to the data in Program 2.9.2, but in a tab-delimited file.

The DLM= option uses the hexadecimal representation of a tab, 09, along with the hexadecimal literal modifier, x.

No change is necessary in the LIST statement, regardless of what, if any, delimiter is used in the file.

Because of the increase in information present in the log, Log 2.9.3 only shows the results for the first two records. For each record, the contents from the input buffer now occupy three lines in the log.

Log 2.9.3: LIST Statement Results with Non-Printable Characters

RULE:     ----+----1----+----2----+----3----+----4

 

1   CHAR  439.12/11/2000.LAX.20.137 25

    ZONE  3330332332333304450330333

    NUMR  439912F11F20009C189209137

             

2   CHAR  921.12/11/2000.DFW.20.131 25

    ZONE  3330332332333304450330333

    NUMR  921912F11F200094679209131

The CHAR line represents the printable data from the input buffer. It displays non-printable characters as periods.

The ZONE and NUMR rows represent the two digits in the hexadecimal representation.

Note the fourth column in the input buffer appears to be a period. However, combining the information from the ZONE and NUMR lines indicates the hexadecimal value is 09—a tab.

Because SAS converts all non-printable characters to a period when writing the input buffer to the log, the ZONE and NUMR lines provide crucial information to determine the actual value stored in the input buffer. In particular, they provide a way to differentiate a period that was in the original data (hexadecimal code 2E) from a period that appears as a result of a non-printable character (for example, a tab with the hexadecimal code 09).

When debugging, two other useful statements are the PUT and PUTLOG statements. Both PUT and PUTLOG statements provide a way for SAS to write out information from the PDV along with other user-defined messages. The statements differ only in their destination—the PUTLOG statement can only write to the SAS log, while the PUT statement can write to the log or any file destination specified in the DATA step. The PUT statement is covered in more detail in Chapter 7; this section focuses on the PUTLOG statement. Program 2.9.4 uses Input Data 2.9.1 to demonstrate various uses of the PUTLOG statement, with Log 2.9.4 showing the results for the first record.

Program 2.9.4: Demonstrating the PUTLOG Statement

data work.flights;

  infile RawData(‘flights.prn’);

  input FlightNum Date $ Destination $ FirstClass EconClass;

  putlog 'NOTE: It is easy to write the PDV to the log';

  putlog _all_;

  putlog 'NOTE: Selecting individual variables is also easy';

  putlog 'NOTE: ' FlightNum= Date;

  putlog 'WARNING: Without the equals sign variable names are omitted';

run;

This PUTLOG statement writes the quoted string to the log once for every record. If the string starts with 'NOTE:', then SAS color-codes the statement just like a system-generated note, and it is indexed in SAS University Edition with other system notes.

The _ALL_ keyword selects every variable from the PDV, including the automatic variables _N_ and _ERROR_.

The PUTLOG statements accept a mix of quoted strings and variable names or keywords.

A variable name followed by the equals sign prints both the name and current value of the variable.

Omitting the equals sign only prints the value of the variable, with a potentially adverse effect on readability.

Beginning the string with ‘WARNING:’ or ‘ERROR:’ also ensures SAS color-codes the messages to match the formatting of the system-generated warnings and errors, and indexes them in SAS University Edition. If this behavior is not desired, use alternate terms such as QC_NOTE, QC_WARNING, and QC_ERROR to differentiate these user-generated quality control messages from automatically generated messages.

Log 2.9.4 Demonstrating the PUTLOG Statement

NOTE: It is easy to write the PDV to the log

FlightNum=439 Date=12/11/20 Destination=LAX FirstClass=20 EconClass=137 _ERROR_=0 _N_=1

NOTE: Selecting individual variables is also easy

NOTE: FlightNum=439 12/11/20

WARNING: Without the equals sign variable names are omitted

When used together in a single DATA step, the PUTLOG and LIST statements provide easy access to both the PDV and the input buffer contents, offering a simple way to track down the source of data errors. In these examples, the PUTLOG results shown in Log 2.9.4 reveal the truncation of the Date values in the PDV, while the LIST results from Log 2.9.2 or 2.9.3 show the full date is present in the input buffer. Using them together, it is clear the issue with Date values is not present in the raw data and must relate to how the INPUT statement has parsed those values from the input buffer.

One additional debugging tool, the ERROR statement, acts exactly as the PUTLOG statement does, while also setting the automatic variable _ERROR_ to one.
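A sketch of the ERROR statement follows; the check on Destination is an invented quality-control rule (Input Data 2.9.1 contains the inconsistently cased values dfw and LaX, which this check would flag):

```sas
data work.flights;
  infile RawData('flights.prn');
  input FlightNum Date $ Destination $ FirstClass EconClass;
  if Destination not in ('LAX', 'DFW') then
    error 'QC_ERROR: unexpected destination code; ' Destination= _N_=;
    /* writes the message to the log like PUTLOG and also sets _ERROR_ to 1 */
run;
```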

2.10 Validation

The term validation has many definitions in the context of programming. It can refer to ensuring software is correctly performing the intended actions—for example, confirming that PROC MEANS is accurately calculating statistics on the provided data set. Validation can also refer to the processes of self-validation, independent validation, or both. Self-validation occurs when a programmer performs checks to ensure any data sets read in accurately reflect the source data, data set manipulations are correct, or that any analyses represent the input data sets. Independent validation occurs when two programmers develop code separately to achieve the same results. Once both programmers have produced their intended results, those results (such as listings of data sets, tables of summary statistics, or graphics) are compared and are considered validated when no substantive differences appear between the two sets of results. SAS provides a variety of tools to use for validation, including PROC COMPARE, which is an essential tool for independent validation of data sets.

The COMPARE procedure compares the contents of two SAS data sets, selected variables across different data sets, or selected variables within a single data set. When two data sets are used, one is specified as the base data set and the other as the comparison data set. Program 2.10.1 shows the options used to specify these data sets, BASE= and COMPARE=, respectively.

Program 2.10.1: Comparing the Contents of Two Data Sets

proc compare base = sashelp.fish compare = sashelp.heart;

run;

If no statements beyond those shown in Program 2.10.1 are included, the complete contents portions of the two data sets are compared, along with meta-information such as types, lengths, labels, and formats. PROC COMPARE only compares attributes and values for variables with common names across the two data sets. Thus, even though the full data sets are specified for comparison, it is possible for individual variables in one data set to not be compared against any variable in the other set. Program 2.10.1 compares two data sets from the Sashelp library: Fish and Heart. These data sets are not intended to match, so submitting Program 2.10.1 produces a summary of the mismatches to demonstrate the types of output available from the COMPARE procedure.

Output 2.10.1: Comparing the Contents of Two Data Sets

While the output in this text is normally delivered in an RTF format using the Output Delivery System, the output from PROC COMPARE is not well-suited to such an environment, so Output 2.10.1 shows the results as they appear in the Output window in the SAS Windowing Environment. Regardless of the destination, the output from this example of the COMPARE procedure includes sections for the following:

  • Data set summary—data set names, number of variables, number of observations
  • Variables summary—number of variables in common, along with the number in each data set which are not found in the other set
  • Observation summary—location of first/last unequal record and number of matching/nonmatching observations
  • Values comparison summary—number of variables compared with matches/mismatches, listing of mismatched variables and their differences

A review of the output provided by PROC COMPARE shows, in this case, only two variables are compared, despite the Fish and Heart data sets containing 7 and 17 variables, respectively. This is because only two variables (Weight and Height) have names in common. As such, even if the results indicate the base and comparison data sets have no mismatches, it is important to confirm that all variables were compared before declaring the data sets are identical. Similarly, the number of records compared is the minimum of the number of records in the two data sets, so the number of records must be compared as well. Several options and statements exist to alter how comparisons are done and to direct some comparison information to data sets.
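One way to make unmatched variables easier to spot is the LISTVAR option, which asks PROC COMPARE to list the variables found in only one of the two data sets; a sketch under the same comparison:

```sas
proc compare base = sashelp.fish compare = sashelp.heart listvar;
run;
```

With the Fish and Heart data sets, this listing would make it clear that only Weight and Height are common to both and that the remaining variables in each data set go uncompared.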

Since the Heart and Fish data sets are not expected to be similar, applying PROC COMPARE to them is a simplistic demonstration of the procedure. A more typical comparison is given in Program 2.10.2, which applies the COMPARE procedure to the data set read in by Program 2.8.8 (using fixed-position data) and the IPUMS2005Basic data set in the BookData library.

Program 2.10.2: Comparing IPUMS 2005 Basic Data Generated from Different Sources

data work.ipums2005basicSubset;

  set work.ipums2005basicFPa;

  where homeValue ne 9999999;

run;

 

proc compare base = BookData.ipums2005basic compare = work.ipums2005basicSubset

             out = work.diff  outbase  outcompare  outdif  outnoequal  

             method = absolute  criterion = 1E-9 ;

run;

 

proc print data = work.diff(obs=6);

  var _type_ _obs_ serial countyfips metro citypop homevalue;

run;

 

proc print data = work.diff(obs=6);

  var _type_ _obs_ city ownership;

run;

To create a data set which differs from the provided BookData.IPUMS2005Basic data set, a WHERE statement is used to remove any homes with a home value of $9,999,999.

OUT= produces a data set containing information about the differences for each pair of compared observations for all matching variables. SAS includes all compared variables and two automatic variables, _TYPE_ and _OBS_.

OUTBASE copies the record being compared in the BASE= data set into the OUT= data set.

Like OUTBASE, OUTCOMPARE copies the record being compared in the COMPARE= data set into the OUT= data set.

OUTDIF produces a record that contains the difference between the OUTBASE and OUTCOMPARE records.

OUTNOEQUAL outputs the requested records (base, compare, and difference) only if a difference is found in at least one of the compared variables for the two records.

METHOD= selects the way in which two numeric values are compared. ABSOLUTE considers two numbers equal if their absolute difference is smaller than a fixed precision level.

CRITERION= specifies the precision used when comparing two numeric values.

Output 2.10.2A: Comparing IPUMS 2005 Basic Data Generated from Different Sources

Obs   _TYPE_    _OBS_   SERIAL   COUNTYFIPS   METRO   CITYPOP   HomeValue
1     BASE      1       2        73           4       0         9999999
2     COMPARE   1       4        73           4       0         137500
3     DIF       1       2        E            E       E         -9862499
4     BASE      2       3        0            1       0         9999999
5     COMPARE   2       6        97           3       0         95000
6     DIF       2       3        97           2       E         -9904999

Output 2.10.2A shows the first six records of the data set created by the OUT= option in PROC COMPARE in Program 2.10.2, restricted to the columns requested in the first PROC PRINT, including _TYPE_ and _OBS_, which are both automatic variables. The automatic variable _TYPE_ is a character variable (with the default length of 8) that indicates the source of the observation—BASE, COMPARE, or DIF—based on the options in use. The second automatic variable is _OBS_, which uniquely identifies the observations from the source data sets that were used for comparison. In Output 2.10.2A, the first three records are: the first record from the base data set, the first record from the compare data set, and the difference between them. The second set of three records repeats this pattern for the second observation from the BASE= and COMPARE= data sets.

The remaining columns contain the values compared and their differences. The first compared column, Serial, has different values in the BASE (Serial = 2) and COMPARE (Serial = 4) data sets. Since the absolute value of the difference (|BASE - COMPARE| = |4 - 2| = 2) is larger than the specified criterion, 1E-9, these two values are evaluated as unequal and the difference is placed in this column on the DIF record. For the next three columns, CountyFips, Metro, and CityPop, the values are evaluated as equal. In these cases, numeric variables display a single E as the value in the difference record when using the OUTNOEQUAL option, unless a format overrides this behavior. All comparisons are made on unformatted values; the comparison of formats is done as part of the metadata comparison and is shown in the procedure output.

Output 2.10.2B: Results of Additional Options in Program 2.10.2

Obs   _TYPE_    _OBS_   City                                          Ownership
1     BASE      1       Not in identifiable city (or size group)      Rented
2     COMPARE   1       Not in identifiable city (or size group)      Owned
3     DIF       1       ...........................................   XX.XXX
4     BASE      2       Not in identifiable city (or size group)      Rented
5     COMPARE   2       Not in identifiable city (or size group)      Owned
6     DIF       2       ...........................................   XX.XXX

Output 2.10.2B shows the comparison between two character variables, City and Ownership, for the same records. SAS compares strings character by character, and if the strings are not of the same length, the end of the shorter string is padded with spaces until the strings are the same length. In any position where identical characters are encountered, a single period is used to represent the match in the _TYPE_ = DIF record. Since the two values for the City variable are identical at every character, the difference records (the third and sixth lines) only contain periods. However, when SAS encounters non-matching characters, that position in the difference record is indicated with a single X. Thus, for the variable Ownership, the first two characters do not match, the third matches, and the last three do not match. This is of particular use when comparing long character strings, as the difference record not only indicates that a mismatch is present but also exactly where in the string mismatches occur.

Finally, since Program 2.10.2 uses the OUTNOEQUAL option, only records corresponding to differences appear in the output data set. Thus, a critical indicator of a successful comparison is an empty OUT= data set when the OUTNOEQUAL option is active. However, because the OUT= data set contains only value comparisons, and only for variables matched by default on their common names, care must still be taken to ensure that the comparison is complete and accurate.

The data set, variables, observations, and values comparison summaries produced in the PROC COMPARE output are invaluable for ensuring that all variables and records were compared and, if necessary, that their attributes matched as well. PROC COMPARE cannot be used to directly create an output data set for attribute comparisons in the same way as it does for the content portion of a data set.

Earlier, this section mentions that PROC COMPARE can implement three distinct types of comparisons: comparing the complete content portions of two different data sets, comparing user-selected variables across data sets, or comparing user-selected variables from a single data set. The options shown in Program 2.10.2 are useful for all three scenarios; however, the second and third comparison types require the use of additional statements in PROC COMPARE. Specifically, VAR and WITH statements can be used to facilitate those comparison types. By default, observations are compared in all three scenarios based on their row order; that is, the first observations from each data set (base and comparison) are compared. To compare based on key variables instead of by position, the ID statement can be added. Details of using the ID, VAR, and WITH statements can be found in the SAS Documentation.
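The statements can be sketched as follows; the variable choices are illustrative, with Serial serving as a key present in both data sets from Program 2.10.2:

```sas
proc compare base = BookData.ipums2005basic compare = work.ipums2005basicSubset;
  id Serial;        /* match observations by key value instead of by row position */
  var HomeValue;    /* compare HomeValue from the base data set ...               */
  with HomeValue;   /* ... against HomeValue from the comparison data set         */
run;
```

When an ID statement is used, the data sets generally need to be sorted by the ID variables before the comparison.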

2.11 Wrap-Up Activity

Use the lessons and examples contained in this chapter to complete the activity shown in Section 2.2.

Data

Use the basic IPUMS CPS data from 2010 to complete the activity. This data set is similar in structure and contents to the 2005 data used for many of the in-chapter examples. Completing this activity requires the following files:

  • Ipums2010basic.sas7bdat
  • Ipums2010basic.txt
  • Ipums2010basic.csv
  • Ipums2010basic.dat

Scenario

Read in the raw files and validate them to ensure the SAS data sets created match the provided SAS data set. Use any of these SAS data sets to generate the tables shown in Outputs 2.2.1 through 2.2.4.

2.12 Chapter Notes

1. LABEL/NOLABEL. In addition to variable labels, data sets may have temporary or permanent labels, and each procedure (such as FREQ and MEANS) is labeled in SAS output destinations that include a table of contents. By default, the system option LABEL is in effect, which allows procedures to display variable labels (if they exist) instead of variable names when producing output. Similarly, this option allows procedures to appear with custom labels in the table of contents if a custom label is provided in the ODS PROCLABEL statement. For procedures that do not use labels by default, such as PROC PRINT, local LABEL options are typically available to give the procedure access to variable labels. However, when the NOLABEL system option is in effect, no procedure can display variable labels, even if its local LABEL option is in effect, and procedure labels given via ODS PROCLABEL do not appear in the table of contents for output destinations. Data set labels are unaffected by the NOLABEL system option.
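As an illustrative sketch of the system option's effect, the following creates a small data set with a labeled variable (the names are arbitrary) and summarizes it with labels suppressed, then restored.

```sas
/* Build a one-variable data set whose variable carries a label. */
data work.pay;
  label Wage = 'Hourly Wage (USD)';
  input Wage @@;
  datalines;
12.5 18 22.25
;

options nolabel;              /* no procedure may display variable labels */
proc means data=work.pay mean;
  var Wage;                   /* column header shows the name Wage */
run;

options label;                /* restore the default behavior */
proc means data=work.pay mean;
  var Wage;                   /* column header shows Hourly Wage (USD) */
run;
```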

2. Determining Sort Order. PROC SORT determines the order of the resulting data set based on the collating sequence option given, or the default sequence if no option is given. The SORT procedure code in this text always uses the default collating sequence, which depends on information provided to SAS by the operating system. The examples in this text use ASCII, whose default ordering sorts character values so that, for example, capital letters sort ahead of lowercase letters. Similarly, it sorts digits, spaces, and many special characters before all letters. It is important to understand the sort method used since the default is machine-dependent.

For English-language data, the two primary collating sequences are ASCII and EBCDIC, which are two different methods for encoding characters as numbers. EBCDIC is used primarily on IBM mainframe systems, while most other systems use ASCII; either is available as a collating sequence option in PROC SORT. Other options provide foreign language support (for example, DANISH and SWEDISH) or even customized methods via the SORTSEQ= option. PROC SORT also supplies options for controlling many other aspects of sorting. For programs where code portability across operating systems is needed, it is good programming practice to state sequencing options explicitly to prevent machine-dependent results. For more information about the available options in PROC SORT, see the SAS Documentation.
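When portability matters, the sequence can be requested explicitly, as in the following sketch (the data set and BY variables are hypothetical):

```sas
/* Request the ASCII collating sequence explicitly so the sort
   order does not depend on the host operating system. */
proc sort data=work.accounts out=work.accountsSorted sortseq=ascii;
  by LastName FirstName;
run;
```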

3. Widths of Formats. Every named format includes a default width that SAS applies when no other width is specified. For example, the DOLLAR format has a default width of 6, the $REVERJ format has a default width of 1, and some other formats have default widths that apply only if the length of the variable has not been previously defined. When values exceed the width, whether default or user-provided, SAS applies various rules depending on several contributing factors.

Character. For character formats provided by SAS, SAS formats values longer than the stated width by simply truncating the value after the specified width has been reached. For example, applying the format $9. to the value ‘This is only a test.’ yields a value of ‘This is o’ because SAS truncates after reaching the ninth character.

Numeric. For numeric formats provided by SAS, SAS applies a variety of rules to limit the impact on the representation of the number. For a complete description, see the SAS Documentation. In general, SAS begins by removing characters such as dollar signs, percent signs, and commas that do not affect the precision of the displayed value. If this is still not sufficient, SAS may apply a different format such as BEST, represent the value using scientific notation, or simply display asterisks in place of the value. For example, applying the format DOLLAR6. to the value 12345.67 yields $12346, omitting the comma. Applying DOLLAR4. to that same value results in 12E3, making use of scientific notation. Using DOLLAR2. does not provide enough width to display a usable value and is displayed as **. In certain cases, SAS also produces the following note in the log.

NOTE: At least one W.D format was too small for the number to be printed. The
      decimal may be shifted by the "BEST" format.

It is important to recognize the source of this note since the note always refers to the W.D format even when other formats are used.
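The numeric width rules discussed in this note can be observed in the log with PUT statements in a DATA _NULL_ step:

```sas
/* Write one value under successively narrower DOLLAR widths. */
data _null_;
  x = 12345.67;
  put x dollar6.;   /* $12346 : the comma is omitted to fit    */
  put x dollar4.;   /* 12E3   : scientific notation is applied */
  put x dollar2.;   /* **     : the width is unusable          */
run;
```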

4. Displaying Values Not Defined in a User-Defined Format. When creating a custom format, it is not necessary to format every possible value. However, values that do not appear in the format definition may not be displayed as intended when the custom format is used. Values that do not correspond to a formatted value are displayed using their internal, unformatted value. However, the width SAS uses to display the value is inherited from the custom format applied, even though the value did not appear in that format definition. As an example, consider the following format intended to classify a numeric value into two categories.

proc format;
  value Audit 70-high = 'AU'
                 0-69 = 'NR'
  ;
run;

For the value 69.82, which is not in either value range defined in the Audit format, the result is an unexpected display value. This occurs because AUDIT has a default width of two (the default width of a custom format is the length of its longest formatted value), which SAS applies to the internal value of 69.82. Thus, SAS rounds the value to 70 to accommodate the width of two and displays that value, a confusing result since a value of 70 is defined by the format to appear as AU. Of course, either using AUDIT5. or including all values in the format definition alleviates the issue. As shown in Program 2.5.7, the default width of a format is part of the information provided by the FMTLIB option in PROC FORMAT. Good programming practice dictates that format definitions cover all values to which the format is applied.

5. Overlapping Ranges in PROC FORMAT. As stated in Section 2.5.1, ranges used in the VALUE statement should be non-overlapping in most cases. If two ranges overlap only at the boundary (for example, 1-3 and 3-5), then no error occurs and SAS assigns a formatted value based on the range that it encountered first in the VALUE statement. Use the MULTILABEL option if overlapping formatted ranges are necessary. Using this option allows assignment of overlapping ranges to unique formatted values or assignment of a single range to multiple formatted values. However, not all procedures support the multilabel formats. The exclusion operator, <, can be used to prevent overlapping ranges, as shown in Section 2.5.1. Another useful option is FUZZ=, which allows specification of an allowable margin when matching values to a range. For example, using FUZZ=0.25 with the assignment 1=‘Low’ would be equivalent to assigning 0.75-1.25=‘Low’ in the VALUE statement. For more details, see the SAS Documentation.
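The exclusion operator and the FUZZ= option can be sketched together in a single FORMAT procedure (the format names here are illustrative):

```sas
proc format;
  /* 1 -< 3 excludes 3 from the first range, so the boundary
     value 3 maps only to 'High'. */
  value NoLap
    1 -< 3 = 'Low'
    3 - 5  = 'High';
  /* FUZZ=0.25 makes the assignment 1='Low' behave like
     0.75-1.25='Low'. */
  value Near (fuzz=0.25)
    1 = 'Low';
run;
```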

6. Raw Data. Not all external data qualifies as raw data. Recall that raw data was defined as data that has not been processed into a SAS data set; however, Database Management System (DBMS) files are not considered raw data, despite not being SAS data sets. This includes files from Microsoft Access, Excel, Oracle, SPSS, JMP, Stata, and many other non-SAS sources. To retrieve data from DBMS sources and process it in SAS, the LIBNAME statement provided by SAS/ACCESS can potentially be used if the SAS/ACCESS product is licensed. To determine which SAS products are licensed in a given session (including SAS/ACCESS), run Program 2.12.1. This program creates a list of all licensed products and prints the list in the Log window.

Program 2.12.1: Determining Licensed SAS Products

proc setinit;
run;

7. Variable Delimiters. Occasionally, such as when reading data from multiple raw files in a single DATA step, the delimiters are not consistent across records. In cases such as these, it is possible to use a variable, rather than a fixed string, to identify the delimiter. For example, setting DLM=MyDLM in the INFILE statement uses the current value of the variable MyDLM as the delimiter(s) for that record. Note that the value of MyDLM must already be populated in the PDV before SAS can read the delimited fields.
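A minimal sketch of this technique, assuming a hypothetical raw file whose first record is comma-delimited while the remaining records are pipe-delimited:

```sas
data work.mixed;
  infile 'MultiDelim.txt' dlm=MyDLM;  /* hypothetical file name */
  if _n_ = 1 then MyDLM = ',';        /* first record: commas   */
  else MyDLM = '|';                   /* later records: pipes   */
  input FlightNum $ Destination $ Passengers;
  drop MyDLM;
run;
```

Because the assignment to MyDLM executes before the INPUT statement on each iteration, the delimiter is already in the PDV by the time the fields are read.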

8. Tokens. When submitting a program, SAS sends the code to a location called the input stack to parse the characters submitted. To facilitate this, SAS uses a word scanner that reads the program character by character and groups the characters into tokens, a process called tokenization. There are four classes of tokens that SAS recognizes:

1. Number—Any numeric value, including real numbers; date, time, and datetime constants; and hexadecimal constants. The decimal point, sign (+ or -), and E for scientific notation are allowed. For example, -6.2, 7E-2, '27OCT1977'd, '13:44't, and 1C99x.

2. Name—A group of characters not beginning with a digit and containing only letters, underscores, and digits. For example, PROC, HHIncome, COMPARE, _OBS_.

3. Literal—A string of characters contained in single or double quotation marks. For example, '$350 and Below' or 'Metro, Inside City'.

4. Special Character—Any character other than letters, numbers, underscores, and blanks. For example, = ( ) @ ^ &.

Tokens end when a new token begins, a blank is encountered (after name and number tokens), or when a literal token is ended by the same type of quotation mark (single or double) that began the literal string.

9. Step Boundaries. In this chapter, each DATA and PROC step included a RUN statement as the final statement in the step. For example, Program 2.10.2 uses a RUN statement at the end of each of the four steps. Since the nesting of steps is not permitted, invocation of any step is seen as a step boundary, ending the previous step and forcing its compilation and execution prior to those processes occurring for the next step. For an example of this behavior, see Program 1.4.2. In order to explicitly indicate a step or procedure is complete, include a statement or action that creates a step boundary; this is the most common purpose for the RUN statement (though other statements, such as QUIT, may sometimes fit this role). Even though the additional RUN statements are not required syntax, they are considered a good programming practice. RUN statements do not impact the run time of a program and can prevent ambiguity when submitting global statements, such as TITLE or OPTIONS. They also allow for easy submissions of sections of code, creating a much simpler debugging process.

10. _ERROR_. This error flag does not track all errors and is not a counter. If _ERROR_ is zero, it is not an indication that the code has no errors, only that no tracked errors are present; and if it is one, there is at least one tracked error. The _ERROR_ automatic variable only tracks data errors and some semantic and execution-time errors. To review the error types, see the SAS Documentation.
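As a sketch of the flag's behavior, the following step triggers a data error on the second record (character text read into a numeric variable) and writes the flag's value to the log:

```sas
/* _ERROR_ is 0 on clean iterations and 1 on the iteration
   where the data error occurs. */
data work.checkFlag;
  input x;
  putlog 'NOTE: ' _n_= _error_=;
  datalines;
12
oops
7
;
```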

2.13 Exercises

Concepts: Multiple Choice

1. In the following FORMAT procedure, the dollar sign preceding the format name Codes is:

proc format;
  value $codes
    '1'='East'
    '2'='West'
    '3'='North'
    '4'='South'
  ;
run;

a. A syntax error

b. A logic error

c. Both a syntax and a logic error

d. None of the above

2. Which of the following answer choices correctly identifies the variables included in the results of PROC FREQ if only a data set is specified?

a. All numeric variables in the data set.

b. All character variables in the data set.

c. All variables in the data set.

d. No variables—a TABLE statement is required.

3. Which of the following answer choices correctly identifies the variables included in the results of PROC MEANS if only a data set is specified?

a. All numeric variables in the data set.

b. All character variables in the data set.

c. All variables in the data set.

d. No variables—a VAR statement is required.

4. Which of the MEANS procedures given below generates the table shown?

Analysis Variable : MPG_Highway MPG (Highway)

Origin   Type    N    Mean         Median       Std Dev
Asia     Sedan   94   29.9680851   29.0000000   4.8845865
         Wagon   11   28.1818182   28.0000000   5.3817875
Europe   Sedan   78   27.1153846   27.0000000   3.8069682
         Wagon   12   26.5833333   26.5000000   2.8431204
USA      Sedan   90   28.5444444   28.0000000   4.1412182
         Wagon    7   29.7142857   30.0000000   4.8550416

a.

proc means data=sashelp.cars nonobs n mean median std;
  class origin type;
  var mpg_Highway;
  where type eq ('Sedan','Wagon');
run;

b.

proc means data=sashelp.cars n mean median std;
  class origin type;
  var mpg_Highway;
  where type in ('Sedan','Wagon');
run;

c.

proc means data=sashelp.cars nonobs n mean median std;
  class type origin;
  var mpg_Highway;
  where type in ('Sedan','Wagon');
run;

d.

proc means data=sashelp.cars nonobs n mean median std;
  class origin type;
  var mpg_Highway;
  where type in ('Sedan','Wagon');
run;

5. Which of the answer choices contains the FREQ procedure that generates the given table?

Table of Cause by Day

Cause(Cause of Failure)          Day

Frequency
Row Pct         Monday    Tuesday   Wednesday   Thursday   Friday    Total
Contamination   23        25        24          14         24        110
                20.91     22.73     21.82       12.73      21.82
Corrosion       4         3         2           3          4         16
                25.00     18.75     12.50       18.75      25.00
Total           27        28        26          17         28        126

a.

proc freq data=sashelp.failure;
  table day*cause / nocol nopercent;
  weight count;
  where cause eq 'Corrosion' or cause eq 'Contamination';
run;

b.

proc freq data=sashelp.failure;
  table cause*day / norow nopercent;
  weight count;
  where cause eq 'Corrosion' or cause eq 'Contamination';
run;

c.

proc freq data=sashelp.failure;
  table cause*day / nocol nopercent;
  weight count;
  where cause eq 'Corrosion' or cause eq 'Contamination';
run;

d.

proc freq data=sashelp.failure;
  table cause*day / nocol nopercent;
  weight count;
  where cause eq 'Corrosion' and cause eq 'Contamination';
run;

6. Which of the following format definitions could be used to alter the PROC MEANS output in the first table to match that shown in the second table?

Analysis Variable : Systolic

Smoking Status      N Obs   Lower Quartile   Median        Upper Quartile
Heavy (16-25)       1046    120.0000000      130.0000000   142.0000000
Light (1-5)         579     120.0000000      130.0000000   144.0000000
Moderate (6-15)     576     118.0000000      126.0000000   140.0000000
Non-smoker          2501    122.0000000      136.0000000   152.0000000
Very Heavy (> 25)   471     122.0000000      134.0000000   145.0000000

Analysis Variable : Systolic

Smoking Status   N Obs   Lower Quartile   Median        Upper Quartile
Non-Smoker       2501    122.0000000      136.0000000   152.0000000
Smoker           2672    120.0000000      130.0000000   144.0000000

a.

proc format;
  value $smk
    low - < 'N','O' - high='Smoker'
    other = 'Non-Smoker';
run;

b.

proc format;
  value $smk
    'Non-smoker' = 'Non-Smoker'
    Other = 'Smoker';
run;

c. Both a and b

d. Neither a nor b

7. Based on the following program, what is the correct order of the user-defined variables in the PDV?

data Athletes;
  format Salary dollar12.;
  infile Teams;
  input Division $ League $ FirstName $ LastName $ Salary;
run;

a. Salary Division League FirstName LastName

b. Division League FirstName LastName Salary

c. Salary Teams Division League FirstName LastName

d. Teams Division League FirstName LastName Salary

8. Given the raw data file shown below, which of the following answer choices contains an INPUT statement that produces the following table?

----+----1----+
Bill30E8567411
Joe 32D7344211
Sue 29C6652327
Jane33C9359121

Name   Age   IDNum   JobCode   Salary
Bill   30    E85     E         67411
Joe    32    D73     D         44211
Sue    29    C66     C         52327
Jane   33    C93     C         59121

a. input name $ age IDNum $ jobcode $ salary;

b. input name $ 1-4 age 5-6 IDNum $ 7-9 jobcode $ 7 salary 10-14;

c. input name $ 1-4 age 5-6 IDNum 8-9 jobcode $ 7 salary 10-14;

d. input name $ low-4 age 5-6 IDNum $ 8-9 jobcode $ 7 salary 10-high;

10. Given the data table created with the OUT= option in PROC COMPARE, which of the following COMPARE procedures creates this table?

one    two                         three   four       five
1999   Children Sports             14      288.80     580.40
0      .........................   0       0.00       0.00
1999   Children Sports             19      344.80     693.40
0      .........................   0       0.00       0.00
1999   Clothes                     25      3767.90    7192.10
0      .XXXXXXX.XXXXXX..........   4       -3271.50   -6193.50

a.

proc compare base=b compare=c out=diff outbase outdif;
run;

b.

proc compare base=b compare=c out=diff outbase outdif outnoequal;
run;

c.

proc compare base=b compare=c out=diff outnoequal;
run;

d.

proc compare base=b compare=c out=diff outall;
run;

10. Which of the following initial objectives should be verified first (via PROC COMPARE or another method) before going on to use PROC COMPARE as a validation tool for variable values?

a. Ensure the number of records in each data set is the same.

b. Ensure the variables in each data set have matching names and types.

c. Ensure the data sets are both sorted in the same order.

d. All of the above.

Concepts: Short-Answer

1. For each of the following, find the syntax error in the provided code.

a.

proc format;
  value bins
    low-<10='Less than 10'
    10-20='10 to 20'
    20<-high='More than 20';
  value bins2
    low-<5='Less than 5'
    5-15='5 to 15'
    15<-high='More than 15';
run;

b.

proc format;
  value $codes;
    '1'='New'
    '2'='Old'
    '3'='Unknown';
run;

c.

proc means data=stuff;
  class category;
  var sales;
  where store ge 1 and le 4;
run;

2. If a variable is used in the CLASS statement in PROC MEANS or the TABLE statement in PROC FREQ, how does SAS determine the levels of the CLASS variables? To alter the set of classes without altering the data, what approach can be taken?

3. When the DSD option is used in the INFILE statement, what three characteristics for handling delimiters does it alter and how do they change? Which of these three can be further modified with an additional option in the INFILE statement?

4. True or false: When reading fixed-position data, it is possible to list the variables in any order in the input statement. Based on the concepts from this chapter, is the answer the same for list input?

5. Even when using OUT= in PROC COMPARE to create a data set containing the information requested on differences, why is it also important to carefully consider the tables PROC COMPARE places in the Output window?

6. Suppose the SORT procedure shown below is successfully executed.

proc sort data = example out = SortedExample;
  by w x descending y z;
run;

Explain whether the following BY statements are valid in a subsequent step that uses the SortedExample data set.

a. by w;

b. by x;

c. by w x y;

d. by w x descending y;

e. by w x descending y descending z;

7. Consider the comma-delimited data set shown below and answer the following items. Note that this is a variation on the data in Input Data 2.9.1, where the value of 0 for FirstClass in the ninth observation is now missing.

----+----1----+----2----+----3
439,12/11/2000,LAX,20,137
921,12/11/2000,DFW,20,131
114,12/12/2000,LAX,15,170
982,12/12/2000,dfw,5,85
439,12/13/2000,LAX,14,196
982,12/13/2000,DFW,15,116
431,12/14/2000,LaX,17,166
982,12/14/2000,DFW,7,88
114,12/15/2000,LAX,,187
982,12/15/2000,DFW,14,31

a. Without submitting any code, give the input buffer and program data vector for the ninth observation if the following DATA step is used.

data PartA;
  infile myData dsd;
  input FlightNum $ Date $ Destination $ FirstClass EconClass;
run;

b. Without submitting any code, give the input buffer and value of FlightNum in the PDV for the first observation if the following DATA step is used.

data PartB;
  infile myData dsd dlm=' ';
  input FlightNum $ Date $ Destination $ FirstClass EconClass;
run;

Programming Basics

1. Using the Sashelp.Cars data set, create the tables given below.

Analysis Variable : MPG_City MPG (City)

Origin   N Obs   Minimum   Lower Quartile   Median   Upper Quartile   Maximum
Asia     158     13.0      18.0             20.5     24.0             60.0
Europe   123     12.0      17.0             19.0     20.0             38.0
USA      147     10.0      17.0             18.0     21.0             29.0

Table of Type by Origin

Type         Origin

Frequency
Row Pct    Asia     Europe   USA      Total
SUV        25       10       25       60
           41.67    16.67    41.67
Sedan      94       78       90       262
           35.88    29.77    34.35
Sports     17       23       9        49
           34.69    46.94    18.37
Wagon      11       12       7        30
           36.67    40.00    23.33
Total      147      123      131      401

2. Using the Sashelp.Heart data set, create the tables given below:

Systolic                     N Obs   Variable      Mean    Median   Minimum   Maximum
Normal (Below 120)           1029    Cholesterol   214.2   209.0    118.0     479.0
                                     Weight        140.6   138.0     71.0     226.0
Pre-Hypertension (120-139)   2157    Cholesterol   224.9   221.0     96.0     435.0
                                     Weight        151.3   149.0     67.0     276.0
Stage 1 (140-159)            1226    Cholesterol   232.8   228.0    124.0     568.0
                                     Weight        160.4   159.0     89.0     281.0
Stage 2 (160 or More)        797     Cholesterol   242.9   238.0    142.0     492.0
                                     Weight        162.8   160.0     92.0     300.0

Table of Systolic by Sex

Systolic                     Sex

Percent
Row Pct
Col Pct                      Female   Male     Total
Normal (Below 120)           12.50    7.26     19.75
                             63.27    36.73
                             22.66    16.18
Pre-Hypertension (120-139)   21.65    19.75    41.41
                             52.29    47.71
                             39.26    44.05
Stage 1 (140-159)            11.63    11.90    23.54
                             49.43    50.57
                             21.09    26.54
Stage 2 (160 or More)        9.37     5.93     15.30
                             61.23    38.77
                             16.99    13.23
Total                        2873     2336     5209
                             55.15    44.85    100.00

3. Write a DATA step that reads the data in the file Heart.csv and uses PROC COMPARE to demonstrate that the result is equivalent to the Sashelp.Heart data set.

4. Write a DATA step that reads the data in the file Heart.dat and uses PROC COMPARE to demonstrate that the result is equivalent to the Sashelp.Heart data set.

5. Program 2.3.8 introduces the SUM statement for computing cumulative sums in PROC PRINT and shows that, in addition to the overall sum, the SUM statement computes certain cumulative sums associated with the BY groups. The SUMBY statement is available to provide additional control over which cumulative sums PROC PRINT displays. Use the SUMBY statement to produce the following report from the Sashelp.Mdv data set.

TYPE   CODE        COUNTRY         SALES95       _4CAST96
MD11   THIRD DAY   UNITED STATES   $1,672.00     $1,880.10
MD11   THIRD DAY                   $10,563.00    $11,278.39
MD11                               $32,720.19    $35,865.77
                                   $240,958.47   $267,273.82

6. Write a program that does the following:

  • Reads in the Flights.dat data set
  • Uses the PUTLOG statement to write the FlightNum and FirstClass values along with the associated variable names
  • Uses the LIST statement
  • Compares the data set against the Flights data set created in Program 2.9.1

7. Write a program that does the following:

  • Reads in the Flights.txt data set
  • Uses the PUTLOG statement to write the FlightNum and FirstClass values along with the associated variable names
  • Uses the LIST statement
  • Compares the data set against the Flights data set created in Program 2.9.1

Case Studies

For additional practice, multiple case studies are available in addition to the IPUMS CPS case study used in the chapters. See Section 8.2 to apply the skills from this chapter to the Clinical Trials Case Study. For additional case studies including extensions to the IPUMS CPS case study, see the author pages.
