Chapter 4: Combining Data Vertically in the DATA Step

4.1 Learning Objectives

4.2 Case Study Activity

4.3 Vertically Combining SAS Data Sets in the DATA Step

4.3.1 Concatenation

4.3.2 Interleaving

4.4 Managing Data Sets During Combination

4.4.1 Selecting Variables with KEEP and DROP

4.4.2 Changing Variable names with RENAME

4.4.3 Selecting Records with WHERE and Subsetting IF

4.4.4 Identifying Record Sources with IN=

4.4.5 Implicit Conversion

4.5 Creating Variables Conditionally

4.5.1 IF-THEN Statements

4.5.2 IF-THEN/ELSE Statements

4.5.3 SELECT Groups

4.6 Working with Dates and Times

4.7 Data Exploration with the UNIVARIATE Procedure

4.7.1 Summary Statistics in PROC UNIVARIATE

4.7.2 Graphing Statements in PROC UNIVARIATE

4.8 Data Distribution Plots

4.8.1 Histograms

4.8.2 Histograms in SGPANEL

4.8.3 Boxplots

4.8.4 High-Low Plots

4.9 Wrap-Up Activity

4.10 Chapter Notes

4.11 Exercises

4.1 Learning Objectives

At the conclusion of this chapter, mastery of the concepts covered in the narrative includes the ability to:

  • Differentiate between concatenation and interleaving and apply the correct technique in a given scenario
  • Formulate a strategy for selecting only the necessary rows and columns when processing a SAS data set
  • Apply conditional logic to create a new variable
  • Develop sound strategies for the use of IF-THEN, IF-THEN/ELSE, and SELECT when applying conditional logic in a given scenario
  • Describe how SAS stores date and time values; apply functions and arithmetic operations to perform calculations on date and time variables; apply formats to control how date and time values appear
  • Apply the UNIVARIATE and SGPLOT procedures to explore a SAS data set

Use the concepts of this chapter to solve the problems in the wrap-up activity. Additional exercises and case-studies are also available to test these concepts.

4.2 Case Study Activity

For a further continuation of the case study covered in the previous chapters, the output shown below provides summaries on mortgage payments across the years of 2005, 2010, and 2015 from the IPUMS CPS Basic data sets. The objective, as explained in the Wrap-Up Activity in Section 4.9, is to assemble the data from the individual files for these three years and produce the results shown.

Output 4.2.1: Basic Statistical Summaries on Nonzero Mortgage Payments

Variable: MortgagePayment (First mortgage monthly payment)
Year = 2005

Basic Statistical Measures

Location

Variability

Mean

1043.929

Std Deviation

754.34069

Median

860.000

Variance

569030

Mode

1200.000

Range

7896

Interquartile Range

750.00000

Variable: MortgagePayment (First mortgage monthly payment)
Year = 2010

Basic Statistical Measures

Location

Variability

Mean

1228.807

Std Deviation

890.39318

Median

1000.000

Variance

792800

Mode

1200.000

Range

7396

Interquartile Range

970.00000

Variable: MortgagePayment (First mortgage monthly payment)
Year = 2015

Basic Statistical Measures

Location

Variability

Mean

1227.864

Std Deviation

871.51891

Median

1000.000

Variance

759545

Mode

1200.000

Range

6896

Interquartile Range

960.00000

Output 4.2.2: Histograms Across Years for Nonzero Mortgage Payments

Output 4.2.2: Histograms Across Years for Nonzero Mortgage Payments

Output 4.2.3: Boxplots Across Years for Nonzero Mortgage Payments

Output 4.2.3: Boxplots Across Years for Nonzero Mortgage Payments

Output 4.2.4: Boxplots Across Years, Separated by Metro Status

Output 4.2.4: Boxplots Across Years, Separated by Metro Status

Output 4.2.5: Customized Distribution Plots Across Years for Nonzero Mortgage Payments

Output 4.2.5: Customized Distribution Plots Across Years for Nonzero Mortgage Payments

4.3 Vertically Combining SAS Data Sets in the DATA Step

Chapters 2 and 3 show how to read raw files with various structures using the INFILE and INPUT statements and Chapter 1 has examples on how to read a single SAS data set into a DATA step by using the SET statement. Sometimes the information needed for an analysis comes from multiple files. This scenario can occur naturally when data is collected over time or when related information for a single record is stored in separate data tables.

There are several ways to describe methods available in SAS for combining data sets. The two basic criteria used to classify techniques are:

  • Orientation. Orientation is either vertical, a stacking of rows from multiple data sets based on matching columns, or horizontal, joining columns together on a criterion that matches rows.
  • BY-Grouping. BY-grouping can either be used or not used.

These result in four classifications of data combination, with the SAS terminology for each shown in Table 4.3.1.

Table 4.3.1: Overview of the Four Methods for Combining Data via the DATA Step

Orientation of Combination

Vertical

Horizontal

Grouping Used?

No

Concatenation

One-to-One Read

One-to-One Merge

Yes

Interleave

Match-Merge

The following subsections provide details about each of the vertical methods, with horizontal methods discussed in Chapter 5.

4.3.1 Concatenation

In the SAS DATA step, concatenation is a vertical combination method that creates a data set by stacking records from the data sets in the SET statement, with Table 4.3.2 illustrating the concept of concatenating records for a single matching variable. From this illustration, the vertical concept implies a “taller” data set—the number of records is the sum of those from the contributing data sets. No grouping of records is used, the records from one data set are simply appended to the other.

Table 4.3.2: Illustration of Concatenation

Amount

Amount

Amount

1

2

1

2

4

2

3

+

4

=

3

4

9

4

5

9

5

2

4

4

9

9

Regardless of the number of data sets to concatenate, use a single SET statement and place all the data sets in the same statement, as shown in Program 4.3.1, which concatenates two of the IPUMS CPS data sets into a single data set.

Program 4.3.1: A Simple Concatenation

data work.Ipums0105Basic;

  set BookData.Ipums2001Basic(obs=3) 

      BookData.Ipums2005Basic(obs=3);

run;

proc report data= work.Ipums0105Basic;

  column Serial HHIncome HomeValue State MortgageStatus Metro;

run;

The OBS=3 data set option is used to limit the number of records input for this example only to make the display of results easier. For other examples and wrap-up activities, full data sets are used.

Instead of using PROC PRINT to display the results, this program uses the REPORT procedure. This portion of the code is replicated in Program 4.3.2 with some introductory information about using PROC REPORT.

Output 4.3.1: A Simple Concatenation

Household serial number

Total household income

House value

state

MortgageStatus

Metropolitan status

1

6400

9999999

Alabama

N/A

.

2

30000

45000

Alabama

No, owned free and clear

.

3

37500

12500

Alabama

No, owned free and clear

.

2

12000

9999999

Alabama

N/A

4

3

17800

9999999

Alabama

N/A

1

4

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

4

Recall the compilation phase is responsible for creating the PDV and, in this case, during the compilation phase the data from 2001 is encountered first and the data from 2005 second. As a result, SAS adds variables from the 2001 data set to the PDV first, followed by any variables in the 2005 data set that are not present in the 2001. Metro is not present in the 2001 data, but is present in the concatenated result since it is in the 2005 data. Because all values are initialized to missing in the PDV for each iteration of this DATA step, Metro is missing for those first three records read from the 2001 data.

During execution, records are added to the output data set in the order in which they are encountered. SAS reads left-to-right in the SET statement and top-to-bottom within each data set – records from the second data set are not accessed until the first data set is exhausted. The result is a data set with all records from 2001 stacked on top of all records from 2005.

Program 4.3.2: Using PROC REPORT to Display Data

proc report data= work.Ipums0105Basic nowd ;

  columns  Serial HHIncome HomeValue State MortgageStatus Metro;

run;

Prior to SAS 9.4, the REPORT procedure default is to deliver results in an interactive report window instead of producing a static report file. Starting with SAS 9.4, the default is to produce a static file instead. To ensure a PROC REPORT step produces the same output regardless of software version, explicitly include an option to produce the desired results—NOWINDOWS (or NOWD) to produce the static file or WINDOWS (or WD) to deliver the report to the interactive report window.

By default, PROC REPORT includes every variable from the data set in its output. The COLUMN statement names specific variables and sets the order in which they appear in the results—in that role, it is very similar to the VAR statement in the PRINT procedure. Both the COLUMN and COLUMNS keywords are valid syntax, and there is no difference in their effects in the REPORT procedure.

PROC REPORT provides a substantial increase in functionality for creating reports compared to the PRINT procedure. However, due to that additional functionality, one important characteristic of PROC REPORT needs to be demonstrated, as shown in Program 4.3.3.

Program 4.3.3: Behavior of PROC REPORT with All Numeric Columns

proc report data= work.Ipums0105Basic;

  columns Serial HHIncome HomeValue;

run;

Because Program 4.3.3 contains a subset of the columns from Program 4.3.2, it is reasonable to assume that the report generated by Program 4.3.3 simply contains fewer columns than Output 4.3.2 but otherwise has the same structure. However, Output 4.3.3 shows this is not the case. Since the COLUMNS statement in Program 4.3.3 now only contains numeric variables, PROC REPORT produces a summary report—specifically with the sums of the columns—instead of reporting the individual values. This is an example of how the REPORT procedure assigns a usage to each variable, which can be controlled with a DEFINE statement. The examples in Chapters 4 and 5 introduce this and other small additions to the PROC REPORT syntax in order to present results. Chapter 6 provides a more thorough introduction to PROC REPORT and Chapter 7 provides a more in-depth look. For now, as long as a character variable is present in the COLUMNS statement, no summarization occurs.

Output 4.3.3: Behavior of PROC REPORT with All Numeric Columns

Household serial number

First mortgage monthly payment

Total household income

House value

15

900

288700

30194997

4.3.2 Interleaving

Interleaving in the DATA step is similar to concatenation; both combine two or more data sets vertically and use the SET statement. However, unlike concatenation which stacks observations sequentially across data sets and records within each, an interleave stacks records vertically based on the values of key variables. In order to interleave data sets, they must be sorted or indexed by the key variables. Key variables are often referred to as BY variables in SAS because they are typically specified in a BY statement. Table 4.3.3 illustrates the concept of interleaving using Amount as the key variable.

Table 4.3.3: Illustration of Interleaving

Amount

Amount

Amount

1

2

1

2

4

2

3

+

4

=

2

4

9

3

5

9

4

4

4

5

9

9

In Table 4.3.1, the interleave is classified as a vertical, grouped technique. Comparing Table 4.3.3 to Table 4.3.2, the vertical component is the same—the number of records is the sum of the contributing records from the two data sets. Unlike Table 4.3.2, these records are in groups based on the value of Amount, rather than the records from the second table simply following the first. The grouping is not unique as there are ties on the key variable—these ties are then resolved by the concatenation rules. Effectively, interleaving can be viewed as a method to concatenate sorted data sets in a manner that preserves the sorting, up to any ambiguity introduced by ties.

Program 4.3.4 demonstrates a simple interleave on two of the IPUMS CPS data sets into a single file using a single key variable.

Program 4.3.4: A Simple Interleave

proc sort data = BookData.Ipums2001Basic out = work.Ipums2001Basic;

  by serial;

run;

proc sort data = BookData.Ipums2005Basic out = work.Ipums2005Basic;

  by serial;

run;

data work.Ipums0105Basic;

  set work.Ipums2001Basic work.Ipums2005Basic;

  by serial;

run;

proc report data= work.Ipums0105Basic(obs = 5);

  column Serial MortgagePayment HHIncome HomeValue State MortgageStatus;

run;

Sort all data sets to be interleaved using the same key variables.

Place all data sets to be interleaved in the same SET statement.

Use key variables compatible with previous SORT procedures. In this case, each of the BY statements uses the same key variable.

Output 4.3.4: A Simple Interleave

Household serial number

First mortgage monthly payment

Total household income

House value

state

MortgageStatus

1

0

6400

9999999

Alabama

N/A

2

0

30000

45000

Alabama

No, owned free and clear

2

0

12000

9999999

Alabama

N/A

3

0

37500

12500

Alabama

No, owned free and clear

3

0

17800

9999999

Alabama

N/A

As with concatenation, the PDV is built by adding the variables from 2001 first and the variables from 2005 second, and variable attributes are based on the first encounter of the variable during the compilation phase. The interleave differs from concatenation in that the resulting data set records are ordered by the key variables. In the case of Program 4.3.4, the Ipums0105Basic data set is ordered by Serial, while the results of Program 4.3.1 contain the same records, but not sorted by Serial.

To further explore interleaving, consider Program 4.3.5 which combines the 2001 and 2005 Basic IPUMS CPS data by both HHIncome and MortgagePayment using sorting principles discussed in Section 2.3.

Program 4.3.5: Interleaving Using Two Variables

proc sort data = BookData.Ipums2001Basic out = work.Sort2001Basic;

  by HHIncome descending  mortgagePayment;

  where HHIncome gt 84000 and state eq ‘Vermont’;

run;

proc sort data = BookData.Ipums2005Basic out = work.Sort2005Basic;

  by HHIncome descending mortgagePayment;

  where HHIncome gt 84000 and state eq ‘Vermont’;

run;

data work.Ipums0105Basic;

 set work.Sort2001Basic work.Sort2005Basic;

 by HHIncome descending mortgagePayment;

run;

proc report data= work.Ipums0105Basic(obs=8);

 column Serial HHIncome MortgagePayment HomeValue MortgageStatus;

run;

DESCENDING changes the sort order from the default value of ASCENDING. It must precede the variable name; here it is being used to sort MortgagePayment from largest to smallest within each value of HHIncome.

Sort order for all variables included in the BY statement in the DATA step must match across all data sets named in the SET statement. If any variable included in the BY statement is sorted in descending order, then the DESCENDING option must accompany that variable.

The BY statement used in the DATA step does not have to be identical to those used in the SORT procedures. However, they must be compatible. The BY statement can successfully use HHIncome alone, since MortgagePayment is sorted within each level of HHIncome, but cannot use MortgagePayment alone as the primary sort of the data set is not on its values.

Output 4.3.5: Interleaving Using Two Variables

Household serial number

Total household income

First mortgage monthly payment

House value

MortgageStatus

469684

84020

1300

137500

Yes, mortgaged/ deed of trust or similar debt

467495

84100

1400

225000

Yes, mortgaged/ deed of trust or similar debt

467086

84100

800

275000

Yes, mortgaged/ deed of trust or similar debt

469463

84110

990

112500

Yes, mortgaged/ deed of trust or similar debt

1148721

84110

290

85000

Yes, mortgaged/ deed of trust or similar debt

468355

84141

1200

95000

Yes, mortgaged/ deed of trust or similar debt

469870

84180

890

112500

Yes, mortgaged/ deed of trust or similar debt

1148731

84200

650

112500

Yes, mortgaged/ deed of trust or similar debt

4.4 Managing Data Sets During Combination

When combining data sets it is often necessary to manage variables, records, or both. Variables with different names may contain comparable information, for example, MortPay and MortgagePayment may both contain the amounts homeowners pay on their mortgages. However, it is not possible to use these as a key variable during an interleave because they have different names and, as shown in Program 4.4.5, they do not align in the same column. Furthermore, after either a concatenation or an interleave, there is no way to track which record came from which data set. Outputs 4.3.1 and 4.3.4 are results from combining data from the years 2001 and 2005, but that information is lost in the final data set (it was only present in the data set names). This section introduces several tools for managing variables and records that have a wide variety of applications and which are available during the concatenation and interleaving processes.

4.4.1 Selecting Variables with KEEP and DROP

When concatenating or interleaving data sets, it is not uncommon to encounter a situation in which the data sets have only some columns in common. This is one example of where it is beneficial to select variables using a KEEP or DROP list. Program 4.4.1 demonstrates the usage of the DROP= data set option.

Program 4.4.1: Selecting Variables with DROP=

data work.Ipums0105Basic;

  set BookData.Ipums2001Basic 

      BookData.Ipums2005Basic (drop= CountyFips Metro CityPop City);

run;

proc contents data= work.Ipums0105Basic;

  ods select variables;

run;

Data set options are valid any time a program references a SAS data set. Use parentheses immediately after the data set name to indicate what options to apply. Use of multiple options simultaneously is demonstrated later in this section.

The DROP= option provides a space-delimited list of variables to omit when accessing the data set. The list is taken as exhaustive; that is, any variable not in this list is read into the step. The KEEP= option specifies an exhaustive list of variables to include in the current step. This DROP= option is equivalent to the following.

KEEP= Serial MortgagePayment HHIncome HomeValue State MortgageStatus Ownership

Output 4.4.1: Viewing Selected Variables with PROC CONTENTS

Alphabetic List of Variables and Attributes

#

Variable

Type

Len

Label

3

HHINCOME

Num

8

Total household income

4

HomeValue

Num

8

House value

2

MortgagePayment

Num

8

First mortgage monthly payment

6

MortgageStatus

Char

45

7

Ownership

Char

6

1

SERIAL

Num

8

Household serial number

5

state

Char

57

When DROP= and KEEP= are used, their effects are local to the current step and its results. For example, in Program 4.4.1, the four listed variables are not included when the Ipums2005Basic data set is read and, therefore, are not in the output data set. However, the Ipums2005Basic data set itself remains unchanged; that is, the variables are not dropped from the data set—they are just not processed by the current DATA step.

In the DATA step, there are two ways to apply KEEP and DROP lists: via the KEEP= and DROP= options as shown above, or via the KEEP and DROP statements. Program 4.4.2 demonstrates the use of a KEEP statement. The syntax is nearly identical to that of the KEEP= option with two notable differences: no parentheses or equals sign are needed since this is a stand-alone statement. The results of Program 4.4.2 are identical to Output 4.4.1.

Program 4.4.2: Using a KEEP Statement

data work.Ipums2005Basic;

  set BookData.Ipums2005Basic;

  keep Serial MortgagePayment HHIncome HomeValue State MortgageStatus Ownership;

run;

proc contents data= work.Ipums2005Basic;

  ods select variables;

run;

In Program 4.4.2 there is no difference between using the KEEP statement as demonstrated or using a KEEP= option in the SET statement. This is because in both programs only one data set is read in and only one data set is created. However, KEEP and DROP lists in the data set options are local to the data set they modify, while the DROP and KEEP statements apply to all data sets created by the DATA step. For illustration, Program 4.4.3 compares the effects of the KEEP statement, the KEEP= option in the SET statement, and the KEEP= option in the DATA statement.

Program 4.4.3: Contrasting the KEEP Statement and KEEP= Options

data work.Ipums2005Basic work.Serial2005Basic(keep = Serial); 

  set BookData.Ipums2005Basic(keep = Serial MortgagePayment State Ownership);

  keep Serial MortgagePayment State;

run;

ods select variables;

proc contents data= work.Ipums2005Basic;

run;

ods select variables;

proc contents data= work.Serial2005Basic;

run;

Listing multiple data set names in the DATA statement instructs SAS to create multiple data sets. This KEEP= option directs the DATA step to omit every variable in the PDV except Serial when writing to the Serial2005Basic data set. As such, it is considered a write-condition since it only has an effect when SAS writes records to the data set.

This KEEP= option directs the SET statement to omit any variable not in this list when it reads variables into the PDV. As such, it is considered a read-condition since it only has an effect when SAS reads records into the current data set.

The KEEP statement is like the KEEP= option from , except it controls the variables when writing to all data sets created in the current DATA step. Thus, it affects both Ipums2005Basic and Serial2005Basic while only affects Serial2005Basic. As such, it is also considered a write-condition since it only has an effect when SAS writes records to those data sets.

Because ODS statements are global, they are valid outside of the PROC steps as well. By default, ODS SELECT and EXCLUDE are only in effect until they encounter a step boundary.

Output 4.4.3A: Contrasting the KEEP Statement and KEEP= SET Statement Option

Alphabetic List of Variables and Attributes

#

Variable

Type

Len

Label

2

MortgagePayment

Num

8

First mortgage monthly payment

1

SERIAL

Num

8

Household serial number

3

state

Char

57

Output 4.4.3B: Using the KEEP Statement and KEEP= Option in the SET and DATA Statements

Alphabetic List of Variables and Attributes

#

Variable

Type

Len

Label

1

SERIAL

Num

8

Household serial number

The DATA step in Program 4.4.3 simultaneously creates two data sets in the Work library: Ipums2005Basic and Serial2005Basic. Each data set contains the same records; however, they do not contain the same variables. The first data set created, Ipums2005Basic, contains the three variables shown in Output 4.4.3A since the KEEP= option in the SET statement adds four variables to the PDV, and the KEEP statement only flags three of them for use in the output data sets. However, the Serial2005Basic data set is further altered by a KEEP= option that names only the Serial variable; thus, this data set then contains only one variable, Serial. Just as in previous programs in this section, Program 4.4.3 could be modified by using drop lists in place of any of the keep lists.

4.4.2 Changing Variable names with RENAME

Just as variables that do not exist in one data set can cause issues when combining data sets, so can variables with mismatched names. Common examples are cases where variables have different names but contain comparable information (MortPay versus MortgagePayment) or where variables have the same name but contain different information. The latter case is of particular concern if the variables have a different type, preventing the data sets from being combined. To handle the former case, SAS provides the ability to rename data set variables.

Program 4.4.4: Demonstrating the RENAME= Option

data work.Ipums2005Basic work.Serial2005Basic (rename = (Id = IdNum));

  set BookData.Ipums2005Basic;

  rename Serial = Id MortgagePayment = MortPay;

run;

proc contents data= work.Serial2005Basic varnum ;

  ods select position ;

run;

As noted with the DROP= and KEEP= options, data set options must be enclosed in parentheses and follow immediately after the target data set.

The RENAME= option requires a second set of parentheses to enclose its arguments.

When specifying the new name, note the old variable name appears on the left of the equals sign, and the new variable name appears on the right.

Note the RENAME statement goes into effect prior to the RENAME= option, which is why the option specifies Id=IdNum and not Serial=IdNum—the variable Serial had already been renamed.

The VARNUM option instructs PROC CONTENTS to produce a table of variable names listed in position order, rather than the default alphanumeric order.

When using the VARNUM option, PROC CONTENTS no longer creates an ODS table named Variables and instead creates an ODS table named Position. This is a good example of when ODS TRACE is useful for determining ODS table names.

Output 4.4.4 shows the position-ordered table created by PROC CONTENTS. Note that a much simpler alternative for renaming Serial to IdNum is to either use only the RENAME statement or RENAME= option in Program 4.4.4 and complete the rename in a single location. However, there are certain situations where a variable may need to change names multiple times and so it is important to understand how the RENAME statement and RENAME= option affect the results of a DATA step.

Output 4.4.4: Demonstrating the RENAME= Option

Variables in Creation Order

#

Variable

Type

Len

Format

Label

1

IdNum

Num

8

Household serial number

2

COUNTYFIPS

Num

8

County (FIPS code)

3

METRO

Num

8

BEST12.

Metropolitan status

4

CITYPOP

Num

8

City population

5

MortPay

Num

8

First mortgage monthly payment

6

HHINCOME

Num

8

Total household income

7

HomeValue

Num

8

House value

8

state

Char

57

9

MortgageStatus

Char

45

10

City

Char

43

11

Ownership

Char

6

The following examples show how renaming is potentially useful for fixing alignment of columns during the concatenation process, but also show some pitfalls to avoid. Program 4.4.5 reads raw CSV files for each of 2005 and 2010 and then concatenates them, with subsequent output from PROC CONTENTS and PROC PRINT revealing problems with the concatenation.

Program 4.4.5: Concatenation with Mismatched Variable Names

data work.Basic2005;

  infile RawData(“IPUMS2005formatted.csv”) dsd obs=10;

  input serial : 7. state : $25. city : $50. citypop : comma6. 

        metro : 1. countyfips : 3. ownership : $6. MortgageStatus : $50. 

        MortgagePayment : dollar12. HHIncome : dollar12. HomeValue : dollar12.; 

run;

data work.Basic2010;

  infile RawData(“IPUMS2010formatted.csv”) dsd obs=10;

  input serial : 7. state : $25. city : $50. metro : 1. countyfips : 3. 

        citypop : comma6. ownership : $6. MortStat : $50. 

        MortPay : dollar12. HomeValue : dollar12. Inc : dollar12.; 

run;

data work.Basic05And10;

  set work.basic2005 work.basic2010;

run;

proc contents data= work.basic05And10 varnum;

  ods select position;

run;

proc print data= work.Basic05And10(obs=1);

  var MortgageStatus MortStat MortgagePayment MortPay HHIncome Inc;

run; 

proc print data= work.Basic05And10(firstobs=11 obs=11);

  var MortgageStatus MortStat MortgagePayment MortPay HHIncome Inc;

run;

Reading of the raw files is limited to a few observations in order to make diagnostics on the result more manageable. When working with large data sets, it is a good programming practice to check the code logic on smaller subsets before committing to the execution time required for the complete data.

Each of these CSV files is read using modified list input. The columns in the raw files are ordered differently so the variables in the INPUT statements are ordered differently as well. Also, different variable names are used for three of the fields.

Here, the CONTENTS procedure is used to show all the variables in their creation order, as shown in Output 4.4.5A. Recall that creation order is the order in which the variables are encountered during compilation of the DATA step.

The invocations of PROC PRINT display the variables with mismatched names containing matching information, showing the first record read from each of the data sets in Output 4.4.5B.

Output 4.4.5A: Variable Set from Concatenation with Mismatched Columns

Variables in Creation Order

#

Variable

Type

Len

1

serial

Num

8

2

state

Char

25

3

city

Char

50

4

citypop

Num

8

5

metro

Num

8

6

countyfips

Num

8

7

ownership

Char

6

8

MortgageStatus

Char

50

9

MortgagePayment

Num

8

10

HHIncome

Num

8

11

HomeValue

Num

8

12

MortStat

Char

50

13

MortPay

Num

8

14

Inc

Num

8

Output 4.4.5B: Data in Columns Mismatched During Concatenation

MortgageStatus

MortStat

MortgagePayment

MortPay

HHIncome

Inc

N/A

0

.

12000

.

MortgageStatus

MortStat

MortgagePayment

MortPay

HHIncome

Inc

No, owned free and clear

.

0

.

7500

Output 4.4.5A shows that during the compilation of the DATA step in Program 4.4.5 that concatenates Basic05 and Basic10, the 11 variables from Basic05 are added to the PDV first, followed by the three variables from Basic10 that are not in Basic05. The fact that variables such as CityPop are not in the same column position in Basic05 and Basic10 does not affect the matching of those columns because it is based on the variable name, with the additional requirement that the type also matches. Of course, one fix for the name mismatches is to use the same variable names in the two INPUT statements. However, the RENAME methods discussed previously offer a method to fix the alignment without having to alter existing data sets or any previous code generating them. Program 4.4.6 uses the RENAME statement in an attempt to align the columns—several renaming choices are available to align names, this renames the variables in Basic10 to match those in Basic05.

Program 4.4.6: Attempting to Align Columns During Concatenation with the RENAME Statement

data work.Basic05And10Try2;

  set work.basic2005 work.basic2010;

  rename MortStat=MortgageStatus MortPay=MortgagePayment Inc=HHIncome;

run;

Running the same PROC CONTENTS and PRINT procedures from Program 4.4.5 on this data set reveals the same result, and the SAS log contains the following warnings:

WARNING: Variable MortStatus cannot be renamed to MortgageStatus because 

         MortgageStatus already exists.

WARNING: Variable MortPayment cannot be renamed to MortgagePayment because 

         MortgagePayment already exists.

WARNING: Variable Income cannot be renamed to HHIncome because HHIncome already 

         exists.

The RENAME statement sets up renaming flags in the PDV to rename the variables as they are output to the final data set, which is a conflict between two different variables having the same name, so the RENAME statement is ignored. Using the RENAME option on the output data set is effectively the same—the solution is to apply the RENAME option to the input data set(s) so that the alignment occurs during construction of the PDV, as in Program 4.4.7.

Program 4.4.7: Renaming Mismatched Variables During Concatenation in the SET Statement

data work.Basic05And10Try3;

  set work.basic2005

      work.basic2010(rename=(MortStat=MortgageStatus MortPay=MortgagePayment  

                             Inc=HHIncome));

run;

proc contents data= work.basic05And10Try3 varnum;

  ods select position;

run;

Using the same strategy of renaming variables in Basic10 to match those in Basic05 requires the RENAME option to be attached to the Basic2010 data set in the SET statement. As Output 4.4.7 shows, the 11 variables in the final data set match the names (and order) from Basic05.

Output 4.4.7: Successfully Renaming Mismatched Variables During Concatenation

Variables in Creation Order

#

Variable

Type

Len

1

serial

Num

8

2

state

Char

25

3

city

Char

50

4

citypop

Num

8

5

metro

Num

8

6

countyfips

Num

8

7

ownership

Char

6

8

MortgageStatus

Char

50

9

MortgagePayment

Num

8

10

HHIncome

Num

8

11

HomeValue

Num

8

Program 4.4.8 shows a case of mismatching columns on type—fixing this problem requires intervention in the data sets prior to the concatenation.

Program 4.4.8: Type Mismatches During Concatenation

data work.Basic2010B;

  infile RawData(“IPUMS2010formatted.csv”) dsd obs=100;

  input serial : $7. state : $25. city : $50. metro : 1. countyfips : 3. 

        citypop : comma6. ownership : $6. MortStat : $50. 

        MortPay : dollar12. HomeValue : dollar12. Inc : dollar12.;

run;

data work.Basic05And10B;

  set work.basic2005

      work.basic2010B(rename=(MortStatus=MortgageStatus MortPayment=MortgagePayment 

                              Income=HHIncome));

run;

Submission of this code results in the following error in the SAS log:

ERROR: Variable serial has been defined as both character and numeric.

The concatenation DATA step fails to compile (and execute) because the difference in types for Serial, numeric in Basic05 and character in Basic2010B, cannot be resolved. When aligning columns for concatenation or interleaving, it is important to determine whether columns that are expected to align match on name and type. Matching on name can be accomplished with the RENAME option on the input data—matching type requires an intervention in the data itself prior to concatenation. Other variable attributes, such as length, format, and label, should be inspected as well to ensure the desired result occurs; however, mismatches on these do not affect the column alignment. For more information about how these attribute mismatches are resolved, see Chapter Note 1 in Section 4.10.

4.4.3 Selecting Records with WHERE and Subsetting IF

Previous chapters demonstrate various applications of the WHERE statement for filtering records when using a SAS data set. However, WHERE conditioning is available in a statement or an option just like DROP, KEEP, and RENAME, but WHERE is different in that the statement is valid in procedures as well as in the DATA step. The following examples demonstrate the WHERE= option and introduce an alternative approach to filtering records—the subsetting IF, which is only available in the DATA step. Program 4.4.9 demonstrates several instances of conditional logic via WHERE= options.

Program 4.4.9: WHERE= Option Applications to Input and Output Data Sets in a DATA Step

data work.Metro4(where=(metro eq 4)) work.AllMetro;

  set BookData.Ipums2001Basic(obs = 3)

      BookData.Ipums2005Basic(where=(countyfips eq 73) obs = 3);

run;

proc report data = work.metro4;

  column Serial MortgagePayment HHIncome HomeValue State MortgageStatus;

run;

proc report data = work.allmetro;

  column Serial MortgagePayment HHIncome HomeValue State MortgageStatus;

run;

As with the RENAME= option, the WHERE= option requires a set of inner parentheses to separate its arguments from any other data set options in use. In this case, the Metro4 data set is created by only writing out records from the PDV if the current value of Metro is 4.

The work.AllMetro data set contains all records loaded into the PDV.

The 2001 data set does not have a WHERE= option, so all three requested observations are sent to the PDV.

The 2005 data set has a WHERE= option that only sends the first 3 records with CountyFips = 73 to the PDV.

Output 4.4.9A: WHERE= Option Applications to Input and Output Data Sets in a DATA Step

Household serial number

First mortgage monthly payment

Total household income

House value

state

MortgageStatus

2

0

12000

9999999

Alabama

N/A

4

900

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

10

0

10200

9999999

Alabama

N/A

Output 4.4.9B: WHERE= Option Applications to Input and Output Data Sets in a DATA Step

Household serial number

First mortgage monthly payment

Total household income

House value

state

MortgageStatus

1

0

6400

9999999

Alabama

N/A

2

0

30000

45000

Alabama

No, owned free and clear

3

0

37500

12500

Alabama

No, owned free and clear

2

0

12000

9999999

Alabama

N/A

4

900

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

10

0

10200

9999999

Alabama

N/A

As shown in Program 4.4.9, the WHERE= options can be used to filter out records before they enter the PDV, as was done for the 2005 data, or after they enter the PDV but before they are sent to the data set, as was done for the Metro4 data set. If the records are not needed in the current DATA step, then using a WHERE= option on the incoming data set allows for a more efficient DATA step because it prevents unnecessary records from being processed. However, if the records are needed for processing in the current DATA step but are not needed in the final data set, then using the WHERE= option on the created data set is necessary.

In addition to being valid in both DATA and PROC steps, another difference between the WHERE statement and the KEEP/DROP/RENAME statements is that the former applies to SAS data sets as their records are read into the PDV, while the latter apply as the PDV is output to the created data set. Thus, the WHERE statement acts as a filter that prevents records from entering the PDV, just as the WHERE= option does when it applies to an input data set. Program 4.4.10 demonstrates the effects of using both the WHERE statement and WHERE= options simultaneously on input and output data sets.

Program 4.4.10: Using the WHERE Statement and WHERE= Option Simultaneously

data work.Metro4(where=(metro eq 4)) work.AllMetro;

  set BookData.Ipums2001Basic(obs = 3)

      BookData.Ipums2005Basic(where=(countyfips eq 73) obs = 3);

  where MortgagePayment gt 0;

run;

The first WHERE= option still prevents records from being written to the Metro4 data set unless the value of Metro is 4.

The 2005 data is still filtered to only allow three records with CountyFips = 73 to enter the PDV.

The WHERE statement is applied to all incoming data sets that do not already have a WHERE= option attached. In this case, only the 2001 data is filtered to include records with mortgages over 0 in the PDV.

Using the same REPORT procedures from Program 4.4.9 on the data sets generated in Program 4.4.10 produces Outputs 4.4.10A and 4.4.10B. Note that the results in Output 4.4.10A match those in 4.4.9A, while the results in Output 4.4.10B differ from those in Output 4.4.9B. Although this may not be the intended result, this serves as a reminder to always check the SAS log after submitting a program.

Output 4.4.10A: Using the WHERE Statement and WHERE= Option Simultaneously

Household serial number

First mortgage monthly payment

Total household income

House value

state

MortgageStatus

2

0

12000

9999999

Alabama

N/A

4

900

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

10

0

10200

9999999

Alabama

N/A

Output 4.4.10B: Using the WHERE Statement and WHERE= Option Simultaneously

Household serial number

First mortgage monthly payment

Total household income

House value

state

MortgageStatus

4

890

12000

95000

Alabama

Yes, mortgaged/ deed of trust or similar debt

5

170

20000

27500

Alabama

Yes, mortgaged/ deed of trust or similar debt

7

790

63670

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

2

0

12000

9999999

Alabama

N/A

4

900

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

10

0

10200

9999999

Alabama

N/A

Log 4.4.10 shows that SAS issues a warning as a result of submitting Program 4.4.10 because a WHERE statement is used while a WHERE= option is also specified for at least one incoming data set. As a result, SAS warns that it could not apply the WHERE statement to all incoming data sets. However, the program runs successfully and the data set is created as requested using the following criteria.

1. work.AllMetro is created by concatenating records from 2001 with MortgagePayment > 0 with records from 2005 with CountyFips = 73.

2. work.Metro4 is created with the same records as AllMetro but limited to those with Metro = 4.

Log 4.4.10: Partial Log from Program 4.4.10

where MortgagePayment gt 0;

WARNING: The WHERE statement cannot be applied to the data set on the last SET/MERGE/UPDATE/MODIFY statement.  Either the data set failed to open or it already specifies a WHERE data set option 

Since the WHERE statement is applied to all incoming data sets, the variables in the WHERE statement must exist in each of the incoming data sets. If the variables do not exist, then the WHERE statement cannot check whether the current record should be loaded into the PDV. Program 4.4.11 demonstrates several common ways this issue is encountered, and Log 4.4.11 shows the errors generated as a result of the program.

Program 4.4.11: Attempting to Apply a WHERE Statement when Variables Do Not Exist

data work.Combined;

  set BookData.Ipums2001Basic BookData.Ipums2005Basic;

  Year = ‘2001 or 2005’;

  where CityPop ne 0 and Year eq ‘2001 or 2005’;

run;

Log 4.4.11: Partial Log after Submitting Program 4.4.11

ERROR: Variable CityPop is not on file BOOKDATA.IPUMS2001BASIC.

ERROR: Variable Year is not on file BOOKDATA.IPUMS2005BASIC.

CityPop is one of the variables included in the 2005 basic data but not the 2001 data. Because the WHERE statement is applied to all incoming data sets, this results in an error when SAS cannot evaluate the WHERE condition against the Ipums2001Basic file.

Year is a new variable created during the current DATA step. Thus, it is not in either of the incoming data sets, and again SAS generates an error since the WHERE condition cannot be checked against all incoming data sets.

Of course, checking the condition CityPop is not zero in Program 4.4.11 is possible by modifying the 2005 data with the appropriate WHERE= option. However, checking the value of Year cannot be handled in this manner since Year is not in any of the incoming data sets. In order to filter out records based on a variable that does not enter the PDV from a SAS data set, an alternative method is necessary. Program 4.4.12 demonstrates the use of this method, the subsetting IF statement.

Program 4.4.12: Using the Subsetting IF Statement

data work.SubIF;

  set BookData.Ipums2001Basic BookData.Ipums2005Basic;

  Year = ‘2001 or 2005’;

  if CityPop ne 0 and Year eq ‘2001 or 2005’;

run;

The results of a subsetting IF statement are similar to those of a WHERE statement: records that do not meet the condition are not present in the final data set. However, the process SAS uses to select the records is quite different. In Program 4.4.11, all records from the 2001 and 2005 files are read into the PDV and are available for use in the DATA step. However, only those that meet the conditions in the subsetting IF statement are sent from the PDV to the data set. As such, while the WHERE statement serves as a way to filter records on the way into the PDV, the subsetting IF statement is a way to filter records on the way out of the PDV. Figure 4.4.1 illustrates the location of the effects of the WHERE and subsetting IF statements relative to the creation of the PDV.

Figure 4.4.1: Illustrating Flow of Records in DATA Steps using WHERE and Subsetting IF Statements

Figure 4.4.1: Illustrating Flow of Records in DATA Steps using WHERE and Subsetting IF Statements

4.4.4 Identifying Record Sources with IN=

Notice that when the 2001 and 2005 IPUMS CPS data was combined using concatenation in Program 4.3.1 or using an interleave in Programs 4.3.4 and 4.3.5, once the new data sets are created there is no information in the data sets to identify the year corresponding those records. One common approach to include such information in the data sets is via the IN= data set option.

Unlike the DROP=/KEEP=/RENAME=/WHERE= data set options, the IN= option is only valid for data sets being read during a DATA step. Specifically, it can only be used in the SET, MERGE, UPDATE, and MODIFY statements. In this chapter, various applications of the SET statement are covered, while Chapter 5 discusses uses of the MERGE statement. For more information about the UPDATE and MODIFY statements, see the SAS Documentation.

The IN= option also differs in that it creates a temporary variable in the PDV. Temporary variables are similar to automatic variables in that they are present in the PDV, automatically dropped and cannot be kept, and are not written to the output data set. However, since they appear in the PDV, they are available for use during the current DATA step for programming purposes. Program 4.4.13 updates the interleave from Program 4.3.4 by including the IN= options.

Program 4.4.13: Using the IN= Option to Keep Track of Record Sources

data work.Ipums0105Basic;

  set work.Ipums2001Basic(in = a)

      work.Ipums2005Basic(in = b);

  by serial;

run;

The IN= option specifies the name of the temporary variable, which is given the name A here. Any record in the PDV that comes from Ipums2001Basic has A=1; all other records have A=0. This variable must follow SAS naming conventions and should be given a name distinct from any other variable used in the DATA step.

Records from the 2005 data set are associated with the temporary variable B. All records from the Ipums2005Basic file have B=1 in the PDV, while all other records have B=0.

Program 4.4.13 added the temporary variables A and B to the PDV but did not use them. As a result, the final data set looks the same as the results from Program 4.3.4. Because variables created via the IN= option are temporary, they are not available in the output data set to identify the source of the records. To create a variable that tracks the source of these records, the temporary variables A and B are used in additional programming statements in Program 4.4.14. This assignment statement creates the variable Year, which results in either 2001 (when A=1) or 2005 (when B=1) as shown in Output 4.4.14. Since an interleave is a vertical method for combining the data sets, A and B never have the value 1 simultaneously.

Program 4.4.14: Using the IN= Tracking to Compute an Identifying Variable

data work.InDemo1;

  set BookData.Ipums2001Basic(in = a)

      BookData.Ipums2005Basic(in = b);

  by serial;

  Y05=a;

  Y10=b;

  Year = 2001*a + 2005*b;

run;

proc report data = work.InDemo1;

 column Y05 Y10 Year Serial HHIncome HomeValue MortgageStatus;

run;

Output 4.4.14: Using the IN= Tracking to Compute an Identifying Variable

Y05

Y10

Year

Household serial number

Total household income

House value

MortgageStatus

1

0

2001

1

6400

9999999

N/A

1

0

2001

2

30000

45000

No, owned free and clear

1

0

2001

3

37500

12500

No, owned free and clear

0

1

2005

2

12000

9999999

N/A

0

1

2005

3

17800

9999999

N/A

0

1

2005

4

185000

137500

Yes, mortgaged/ deed of trust or similar debt

4.4.5 Implicit Conversion

Section 3.9.2 introduces the INPUT function for explicitly converting character values into numeric values and a similar function, PUT, is available for carrying out the reverse process explicitly. Failure to use these functions (or an equivalent mechanism) forces SAS to interpret this type mismatch using an automatic process called automatic or implicit conversion. There are several reasons to avoid implicit conversion—in some cases it is not available and in other cases it does not perform as expected. In addition, implicit conversion uses substantially more computational power than explicit conversion to accomplish the same task, and as such, the loss in efficiency is an unavoidable pitfall.

The two common potential drawbacks are failed conversions and loss of data integrity. Not all programming statements in SAS support implicit conversion, naturally limiting its utility. Moreover, even when SAS does support implicit conversion, there is a potential for missing, truncated, or imprecise values. Log 4.4.15 demonstrates both pitfalls and includes the code that generates these notes, warnings, and errors.

Log 4.4.15: Notes and Errors Generated During Implicit Conversion (Partial Log Shown)

55   data work.Base;

56     length y $6;

57     x=123456789; 

58     y=x;

59     put _all_;

60   run;

NOTE: Numeric values have been converted to character values at the places given by:

      (Line):(Column).

      58:5 

y=1.23E8 x=123456789 _ERROR_=0 _N_=1 

61

62   data work.Subset1;

63     set Base;

64     where x = ‘123456789’; 

ERROR: WHERE clause operator requires compatible variables. 

65   run;

NOTE: The SAS System stopped processing this step because of errors. 

WARNING: The data set WORK.SUBSET1 may be incomplete.  When this step was stopped there were 0 observations and 2 variables. 

66

67   data work.Subset2;

68     set Base;

69     if x = ‘123456789’;

70   run;

NOTE: Character values have been converted to numeric values at the places given by:

      (Line):(Column).

      69:10 

NOTE: There were 1 observations read from the data set WORK.BASE.

NOTE: The data set WORK.SUBSET2 has 1 observations and 2 variables.

The LENGTH statement declares Y as a character variable with a length of six bytes. As written, the assignment statement requests that the value of the numeric variable X be assigned to the character variable Y. Since this is not possible directly, it forces SAS to carry out an implicit conversion of the value of X to character, generating the note in the SAS log in the process. This note clearly indicates that column 5 in log line 58 (where SAS encounters the numeric value of X in the assignment statement) is the cause of the note.

The PUT statement along with the keyword _ALL_ places the current PDV into the log, which shows the result of the implicit conversion in this case. The value of 123456789—which has 9 digits—is too wide to fit into the six-byte variable, Y. Using the character length of 6 bytes, SAS formats the numeric value using a format of a matching width before converting. Due to this limited format width, the value is expressed in scientific notation and then converted to character, resulting in the value of Y of 1.23E8.

This WHERE statement compares the numeric variable, X, against a character string, causing a type mismatch. Unlike in the assignment statement in , implicit conversion is not valid in a WHERE statement or WHERE= option. SAS generates the associated error message as a result of the type mismatch when implicit conversion is not available. The note and warning indicate the DATA step does not create a data set and a similar result occurs for procedures.

The logical condition in this statement is identical to the WHERE statement in , but now appears in a subsetting IF statement. Notice that instead of a warning or error, this only generates a note—the same note that appears in . Unlike the WHERE statement and WHERE= option, the subsetting IF statement does support implicit conversion. This support for implicit conversion is also present in IF-THEN/ELSE statements.

While it is not reasonable for this section to present an exhaustive list of the programming statements that do or do not support implicit conversion, it is important to note the pitfalls—potential or guaranteed—that accompany implicit conversion. The potential for a program that fails to compile, or worse a program that compiles but leads to a loss of data integrity, along with the guaranteed loss of computational efficiency are all drawbacks to implicit conversion. As a result, it is considered a good programming practice to always use explicit conversion via PUT or INPUT functions or an equivalent mechanism.

4.5 Creating Variables Conditionally

The assignment statement used in Program 4.4.14 allows for programmatic determination of the year associated with each record. However, it has limited utility since it requires the desired identifying value to be the result of an arithmetic expression.

Conditional logic is used in the WHERE statement in this and previous chapters, as well as in the subsetting IF statement earlier in this chapter. The WHERE and subsetting IF statements are only used to select observations; however, additional conditional logic statements are available and offer a wide variety of applications.

4.5.1 IF-THEN Statements

As an alternative to creating variables using a single formula for all records as was done in Program 4.4.14, Program 4.5.1 uses conditional logic statements to create a new variable for which the values are determined based on the values of the temporary variables created via the IN= data set options.

Program 4.5.1: Using IF-THEN Statements to Create New Variables

data work.IfThenDemo1;

  set BookData.Ipums2001Basic(in = in2001 obs = 3)

      BookData.Ipums2005Basic(in = in2005 obs = 3);

  if (in2001 eq 1) then year = 2001;

  if (in2005 eq 1) then year = 2005;

run;

It is a good programming practice for the IN= variable names to relate meaningfully to the corresponding data set, improving the readability of the code in more complex DATA steps. As demonstrated in Program 4.4.13, they must be legal variable names and not the same as any other variables in the PDV during the current DATA step.

The IF-THEN statement is broken into two clauses. The first clause begins with keyword IF.

The IF clause must contain a logical expression. The parentheses are optional, but are included here to help distinguish the logical expression from the SAS keywords on either side. Any logical expression or an expression that resolves to a numeric value is valid. Recall from Section 2.6—missing and zero are interpreted as false; any other number is interpreted as true.

Unlike the subsetting IF, the IF-THEN statement has a second clause that begins with the keyword THEN.

The THEN clause must include a single SAS statement to execute when the IF clause resolves as true. In this case, an assignment statement is used to create a new variable, Year, and assign it a value of 2001 for records read from the Ipums2001Basic data set.

A second IF-THEN statement is used to detect records from the Ipums2005Basic data set and assign the value 2005 to Year for those records.

The IF-THEN statement differs from the subsetting IF statement due to the requirement of the THEN clause in the IF-THEN statement. Even if the THEN clause contains only a semicolon, which is valid syntax, the IF-THEN statement is not equivalent to the subsetting IF statement. Other differences between the IF-THEN and subsetting IF statements exist and are investigated in Chapter 5.

Program 4.5.1 demonstrates the use of two IF-THEN statements to create the variable Year in a way that is easily programmed. However, it is an inefficient approach since the use of multiple IF-THEN statements requires each IF-THEN statement to be evaluated for every record. As a result, a record that has already been identified as coming from Ipums2001Basic and has already had Year set to 2001 is still checked to determine whether it comes from Ipums2005Basic, which is not possible. As a result, the 483,000 records from 2001 are each checked twice, and the 1.2 million records from 2005 are also checked twice. The loss in efficiency grows as the number of data sets, records, or conditions to check grows. Thus, the IF-THEN statements are typically used only when independent checks need to be carried out on each record.

4.5.2 IF-THEN/ELSE Statements

A more efficient solution to creating the Year variable in Program 4.5.1 is provided by modifying the IF-THEN statement slightly. This modified approach uses IF-THEN/ELSE statements, which are actually a pair of statements—an IF-THEN statement paired with an ELSE statement. The gain in efficiency is due to the fact that SAS only executes an ELSE statement if the condition in the preceding IF-THEN statement evaluates as false. Program 4.5.2 demonstrates the use of the IF-THEN/ELSE statement.

Program 4.5.2: Using IF-THEN/ELSE Statements to Create New Variables

data work.IfThenDemo2;

  set BookData.Ipums2001Basic(in = in2001 obs = 3)

      BookData.Ipums2005Basic(in = in2005 obs = 3);

  length MetFlag $ 3;

  if in2001 eq 1 then year = 2001;

    else if in2005 eq 1 then year = 2005;

  if Metro in (1,2,3) then MetFlag = ‘Yes’;

    else if Metro eq 4 then MetFlag = ‘No’;

      else MetFlag = ‘N/A’;

run;

proc report data = work.IfThenDemo2;

  column Year MetFlag Serial HHIncome HomeValue State MortgageStatus;

run;

It is a good programming practice to define the length of user-defined character variables, helping to prevent unexpected truncation.

The IF-THEN/ELSE statement begins with an IF-THEN statement.

Instead of another IF-THEN statement, an ELSE statement is used and, like the THEN clause, it contains a single SAS statement. That statement executes only if the previous IF-THEN condition was false—in this case, if the data did not come from the Ipums2001Basic data set. This prevents rechecking the 483,000 records from 2001 as done with two IF-THEN statements in Program 4.5.1.

A second IF-THEN/ELSE chain creates another new variable, MetFlag. The first statement checks if the value of Metro is 1 or 2 or 3, assigning Yes to MetFlag if true.

For records where Metro is neither 1 nor 2 nor 3, the first ELSE statement is executed. Since the ELSE statement includes a single SAS statement, it can invoke another IF-THEN statement. IF-THEN/ELSE statements can thus be chained to check a sequence of conditions of practically any length, subject only to the amount of memory allocated to SAS.

Recall variable length is defined based on the first encounter of the variable during DATA step compilation. If MetFlag first appeared in the assignment of the literal value ‘No’ in the IF-THEN/ELSE chain, SAS would define MetFlag as a character variable with length two. The LENGTH statement is thus placed before the IF-THEN/ELSE chain to ensure the length of MetFlag is set explicitly.

The final ELSE statement serves as a “catch-all”—defining what action SAS takes for records with a value for Metro other than 1, 2, 3, or 4. Without this final ELSE statement, MetFlag is missing (blank) if Metro has a value other than 1, 2, 3, or 4—no assignment to MetFlag is made under that condition and its value is unchanged from being initialized to missing at the start of the DATA step iteration.

Output 4.5.2: Using IF-THEN/ELSE Statements to Create New Variables

year

MetFlag

Household serial number

Total household income

House value

state

MortgageStatus

2001

N/A

1

6400

9999999

Alabama

N/A

2001

N/A

2

30000

45000

Alabama

No, owned free and clear

2001

N/A

3

37500

12500

Alabama

No, owned free and clear

2005

No

2

12000

9999999

Alabama

N/A

2005

Yes

3

17800

9999999

Alabama

N/A

2005

No

4

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

The IF-THEN/ELSE chains allow increased efficiency over IF-THEN statements by only evaluating the ELSE statements when all preceding conditions in the same IF-THEN/ELSE chain are false. However, it is possible to further improve the efficiency of Program 4.5.2 by checking the condition for Ipums2005Basic first since that data set has more than twice as many records as the Ipums2001Basic data. Thus, reversing the order would save nearly 700,000 additional executions of the conditional logic. Be sure to consider all aspects of efficiency when writing and maintaining programs.

Each THEN clause and ELSE statement concludes with a single statement; however, it is often necessary to execute several statements when a condition is true. It is possible, but inefficient, to write multiple IF-THEN statements with the same IF clause but different THEN clauses. Instead, SAS allows the use of a DO group to carry out multiple actions based on a single logical expression, as Program 4.5.3 demonstrates. DO groups greatly increase the flexibility of conditional logic statements as they allow for multiple actions as the result of a single condition.

Program 4.5.3: Using Do Groups to Create Multiple Variables

data work.IfThenDemo3;

  set BookData.Ipums2001Basic(in = in2001 obs = 3)

      BookData.Ipums2005Basic(in = in2005 obs = 3);

  length FirstFlag $ 3;

  if in2001 eq 1 then do;

                        year = 2001;

                        FirstFlag = ‘Yes’;

                      end;

    else if in2005 then year = 2005;

run;

proc report data = work.IfThenDemo3;

  column Year FirstFlag Serial HHIncome HomeValue State MortgageStatus;

run;

As before, define the length of new character variables.

DO statements contain only the keyword DO.

All statements to be executed when the logical expression is true must appear in the DO group.

A DO group ends with the END statement. As always, the indentation is for increased readability.

It is not necessary to explicitly compare to 0 or 1 since SAS interprets 0 as FALSE and 1 as TRUE. As a result, when In2005 has the value 1 the IF clause resolves as TRUE. However, the explicit comparison is generally considered a good programming practice.

Output 4.5.3: Using Do Groups to Create Multiple Variables

year

FirstFlag

Household serial number

Total household income

House value

state

MortgageStatus

2001

Yes

1

6400

9999999

Alabama

N/A

2001

Yes

2

30000

45000

Alabama

No, owned free and clear

2001

Yes

3

37500

12500

Alabama

No, owned free and clear

2005

2

12000

9999999

Alabama

N/A

2005

3

17800

9999999

Alabama

N/A

2005

4

185000

137500

Alabama

Yes, mortgaged/ deed of trust or similar debt

4.5.3 SELECT Groups

SELECT groups are most similar to the IF-THEN/ELSE chains seen previously since the purpose of a SELECT group is to evaluate an expression based on one or more logical conditions. When that expression resolves to TRUE, SAS executes one or more statements. However, SELECT groups differ from IF-THEN/ELSE chains by more than just syntax—the execution also differs slightly. This section provides several examples of how SELECT groups operate and gives a brief comparison between SELECT groups and IF-THEN/ELSE chains.

SELECT Groups Without A Select-Expression

Program 4.5.2 used the code snippet below to create a new variable, MetFlag. In this snippet the values of Metro are grouped into three sets of values, each associated with a value for the new variable MetFlag.

if Metro in (1,2,3) then MetFlag = ‘Yes’;

  else if Metro eq 4 then MetFlag = ‘No’;

    else MetFlag = ‘N/A’;

The following snippet is logically equivalent to the previous snippet, but uses a SELECT group.

select;

  when(Metro in (1,2,3)) MetFlag = ‘Yes’;

  when(Metro eq 4) MetFlag = ‘No’;

  otherwise MetFlag = ‘N/A’;

end;

Every SELECT group begins with a SELECT statement. SAS syntax allows for two versions of the SELECT statement. The version shown here includes only the keyword SELECT—the other version is shown in a later example.

WHEN statements identify distinct cases exactly like the IF and ELSE IF clauses do. The conditions listed in the parentheses are collectively referred to as a when-expression.

Like a THEN clause or ELSE statement, the WHEN statement includes a single statement to execute if the current record satisfies the when-expression. Similar to the THEN clause in an IF-THEN statement, the WHEN statement also supports use of the DO statement as well as the use of a single semicolon to take no action.

The OTHERWISE statement is available in a SELECT group to indicate what action SAS should take when the current record does not satisfy any of the when-expressions. In this example, any values of Metro other than 1, 2, 3, or 4 result in the SELECT group assigning a value of N/A as the value of MetFlag. The OTHERWISE and WHEN statements support the same programming statements.

Every SELECT group must close with an END statement.

The results of the two snippets above, one using IF-THEN/ELSE chains and one using a SELECT group, produce identical results. Furthermore, the execution of the two blocks of code is similar as well—the first condition is checked, and if true, then SAS performs an action. Just as ELSE IF statements are not executed unless all previous conditions in the current chain are not true, a WHEN statement is not executed unless all previous WHEN conditions in the same SELECT group are not met. As such, in both snippets a subsequent condition is only checked if preceding conditions are all false. Finally, if neither condition is met, both approaches assign a value of N/A to MetFlag.

One major difference between the two is that an OTHERWISE statement is required in a SELECT block if any record fails to satisfy one of the set of WHEN conditions. In other words, the SELECT block must have an explicit action to apply to every record; otherwise, it produces an error in the SAS log and stops execution. The IF-THEN/ELSE chain has no such requirement as shown in the following modified code snippet.

if Metro in (1,2,3) then MetFlag = ‘Yes’;

  else if Metro eq 4 then MetFlag = ‘No’;

  

Recall, variables created via an assignment statement are initialized to missing at the beginning of each iteration of the DATA step. Only for records where one of these conditions is met is the value of MetFlag assigned.

Without any further statements in the IF-THEN/ELSE chain, records with Metro values other than 1, 2, 3, or 4 do not result in the assignment of a value to MetFlag, and the initial value of missing (blank) remains.

In the DATA step shown in Log 4.5.4, the OTHERWISE statement is omitted and the set of WHEN conditions is not exhaustive. The submission of this DATA step generates the following lines in the SAS log as soon as a value other than 1,2,3, or 4 is encountered. Note that the line numbers are only for reference within the log—they do not correspond to line numbers in the Editor.

Log 4.5.4: Results of Omitting a Necessary OTHERWISE Statement

80   data work.OtherwiseNeeded;

81     Metro = 5;

82     select(Metro);

83       when(Metro in (1,2,3)) MetFlag = ‘Yes’;

84       when(Metro eq 4) MetFlag = ‘No’;

85     end;

86   run;

ERROR: Unsatisfied WHEN clause and no OTHERWISE clause at line 85 column 3.

Metro=5 MetFlag=  _ERROR_=1 _N_=1

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.OTHERWISENEEDED may be incomplete.  When this step was stopped there were 0 observations and 2 variables.

As with many errors, SAS identifies the exact cause—no satisfied WHEN clause and no OTHERWISE statement—and indicates the location in the SAS log (line and column number) at which the error occurs. In addition, SAS prints the current PDV to the log following the error to assist in identification of the unexpected value. From one perspective, this requirement is an advantage—the SELECT group does not permit the conditional logic to accidentally ignore any cases present in the data.

SELECT Groups with a Select-Expression

As shown in above, the first statement in a SELECT group must be the SELECT statement. The version of the SELECT statement demonstrated so far uses only the keyword SELECT in this statement. Another version of the SELECT statement is shown in the snippet below.

select(Metro);

  when(1,2,3) MetFlag = ‘Yes’;

  when(4) MetFlag = ‘No’;

  otherwise MetFlag = ‘N/A’;

end;

The SELECT statement now contains a value, called a select-expression, in parentheses following the keyword SELECT, which can be any valid expression in SAS.

WHEN statements still identify the distinct cases, a comma delimited list is used to indicate multiple values or a single value is valid as well. The comparison is for equality of the select-expression to values in the various when-expressions.

This SELECT group produces the same results as the SELECT snippet given at the opening of this section, and the use of a select-expression makes the conditioning slightly more compact than the equivalent IF-THEN/ELSE. If the number of conditions to check is large, and the select-expression is complex (and all comparisons are equalities), this can be a significant advantage in coding efficiency. Programs 4.5.5 and 4.5.6 consider conditioning on relatively complex select-expressions, each of which introduces a function useful in a variety of contexts. Program 4.5.5 creates a flag variable that is 1 if the property is mortgaged, 0 if it is not, and missing (.) if unknown. Conditioning on the full value of MortgageStatus is possible, but the full value is not required to achieve the conditional assignment, so the SCAN function is used to simplify this process.

Program 4.5.5: Using the SCAN Function to Simplify Conditioning

data work.one;

  set BookData.Ipums2005Basic;

  select(upcase(scan(MortgageStatus,1,’,’)));

    when(‘YES’) MortFlag = 1;

    when(‘NO’) MortFlag = 0;

    when(‘N/A’) MortFlag = .;

  end;

run;

The select-expression, uses both the SCAN and UPCASE functions—the UPCASE function is discussed in Section 3.9.2. The first argument of the SCAN function is an expression which is taken as a character value (numeric values are converted). The second argument is the piece or “word” to extract from the string—words are determined based on a set of default delimiters that are operating system dependent. The third argument allows for specification of the set of delimiters to use, much like the DLM= option in the INFILE statement. A fourth argument is possible to set special behaviors for the SCAN function. Here, the value of MortgageStatus up to the first comma is extracted, which is sufficient to determine each potential value of the flag being created.

Even though the casing of the values of MortgageStatus is consistent throughout, defensive programming techniques do not assume this. Using the UPCASE function allows for a single condition to be checked irrespective of what casings are used in the data set.

In Program 4.5.6, the flag variable is created with a simplified set of rules: 1 if the property is mortgaged, 0 if it is not or the status is unknown. While extracting the first word with the SCAN function is still effective, the first letter is sufficient to properly make this assignment, which is achieved with the SUBSTR (substring) function.

Program 4.5.6: Using the SUBSTR Function to Simplify Conditioning

data work.two;

  set BookData.Ipums2005Basic;

  select(upcase(substr(MortgageStatus,1,1)));

    when(‘Y’) MortFlag = 1;

    when(‘N’) MortFlag = 0;

  end;

run;

The first argument of the SUBSTR function is an expression which is taken as a character value (numeric values are converted). The second argument is the starting position that forms the substring, and the third argument is the number of characters to read. The third argument is optional—if omitted, SUBSTR reads from the starting position to the end of the string. Here, starting at the first character and reading one character means only the first character from the value of MortgageStatus is extracted.

Using the UPCASE function again provides for some measure of defensive programming. Also, both here and in Program 4.5.5, not using the OTHERWISE statement is a form of defensive programming—if any values not accounted for in the conditioning actually exist in the data, the processing stops and an error is transmitted to the log.

It is possible to use a select-expression with logical expressions in the when-expressions, but this is typically a dangerous practice—this danger is highlighted in Program 4.5.7.

Program 4.5.7: Demonstrating Inappropriate Usage of a Select-Expression

data work.wrong;

  length MetFlag $ 3;

  Metro = 2;

  select(Metro);

    when(Metro in (1,2,3)) MetFlag = ‘Yes’;

    when(Metro eq 4) MetFlag = ‘No’;

    otherwise MetFlag = ‘N/A’;

  end;

run;

For the current record, the value of Metro has been set to 2.

The SELECT statement contains a select-expression, Metro. When-expressions are compared to its value, 2, until one is found to match.

For the first WHEN statement, since Metro=2 the first when-expression resolves to TRUE, and SAS assigns the value 1 for this TRUE logical expression (it assigns 0 for those that are FALSE). As a result, SAS compares the select-expression value of 2 with the when-expression value of 1, giving a result of FALSE.

SAS moves to the next when-expression, but with a similar result. Metro eq 4 evaluates to FALSE, which SAS assigns the value 0. Thus, the select-expression to when-expression comparison is 2 = 0, which also resolves as FALSE.

Finally, since no matches occurred between the select-expression and any when-expression, SAS moves to the OTHERWISE statement and sets MetFlag equal to N/A.

Select-expressions are commonly used in cases where a single variable with distinct levels is of interest. In cases where intervals of values are of interest, such as household income or mortgage payment amounts, a SELECT group without a select-expression is more beneficial. In cases where multiple variables are of interest, either compound when-expressions or nested SELECT groups may be used exactly as is possible with IF-THEN/ELSE chains.

4.6 Working with Dates and Times

Section 3.6 introduced the concept of how SAS dates, times, and datetimes are numeric values: dates are the number of days since January 1, 1960; times are the number of seconds from midnight; and datetimes are the number of seconds since midnight on January 1, 1960. Informats provide an easy method for reading such values into numeric variables for future use, and formats provide a variety of ways to display date, time, and datetime values. SAS also provides a rich set of functions for working with dates, times, and datetimes. Program 4.6.1 demonstrates several techniques for creating and working with dates.

Program 4.6.1: Creating and Doing Arithmetic Operations on Dates

data work.DatesDemo01;

  set BookData.Ipums2001Basic(in = in2001 obs = 3)

      BookData.Ipums2005Basic(in = in2005 obs = 3);

  if in2001 eq 1 then year = 2001;

    else if in2005 eq 1 then year = 2005;

  Date = mdy(12,31,year);

  DynamicDay = today();

  FixedDay = ‘14MAR2019’d;

  Diff1 = DynamicDay – Date;

  Diff2 = FixedDay – Date;

  Diff3 = Time() – ‘13:56’t; 

run;

proc report data = work.DatesDemo01;

  column Serial State Year Date DynamicDay FixedDay Diff1 Diff2 Diff3;

  label Serial = ‘Serial’;

run;

The MDY function requires three numeric arguments that represent the month, day, and year for a date, respectively. As shown here, both numeric constants and DATA step expressions are valid.

The TODAY function retrieves the current date and stores it as a SAS date in the DynamicDay variable. If the code is executed again on a different date, then the new value is retrieved and stored. The function has no arguments, but the parentheses are required—this is to distinguish the TODAY function from a DATA step variable named Today. The DATE function also exists and performs the same action.

In the case where fixed dates are necessary, one option is to place the date in a literal string followed immediately by the letter d—this is called a date literal. Note that only dates in the SAS DATE format, two- or four-digit year, are valid. SAS interprets the value in the literal by converting it to a SAS date. This is similar to how SAS interprets the value ‘09’x as a hexadecimal value due to the presence of the hexadecimal literal modifier.

A simple approach to calculating a duration is the subtraction of the two date variables.

Like the TODAY() function, the TIME() function retrieves the current time. The time literal modifier, t, interprets the quoted string as a SAS time (operating like the hexadecimal and date literal modifiers). A datetime modifier, dt, is also available.

Output 4.6.1 shows the results of the calculations run on a specific date and demonstrates the need for formats when displaying most date, time, and datetime fields. Variables such as Date, DynamicDay, and FixedDay are all calendar dates – but the stored values are not interpretable by most readers. However, the difference variables simply represent the number of days between two dates—for Diff1 and Diff2—and seconds between two times—for Diff3. Date or time formats are inappropriate for these variables, though they would perhaps benefit from the use of a COMMA format to improve readability.

Output 4.6.1: Creating and Doing Arithmetic Operations on Dates

Serial

state

year

Date

DynamicDay

FixedDay

Diff1

Diff2

Diff3

1

Alabama

2001

15340

21691

21622

6351

6282

-3636.229

2

Alabama

2001

15340

21691

21622

6351

6282

-3636.229

3

Alabama

2001

15340

21691

21622

6351

6282

-3636.229

2

Alabama

2005

16801

21691

21622

4890

4821

-3636.229

3

Alabama

2005

16801

21691

21622

4890

4821

-3636.229

4

Alabama

2005

16801

21691

21622

4890

4821

-3636.229

SAS provides several functions for calculating durations at specific scales and extracting information from date, time, and datetime values. Program 4.6.2 demonstrates several useful date functions as well as the application of a few common date formats.

Program 4.6.2: Using Date Functions

data work.DatesDemo02;

  set BookData.Ipums2001Basic(in = in2001 obs = 3)

      BookData.Ipums2005Basic(in = in2005 obs = 3);

  if in2001 eq 1 then year = 2001;

    else if in2005 eq 1 then year = 2005;

  Date = mdy(12,31,year);

  FixedDay = ‘14MAR2019’d;

  Years = yrdif(mdy(12,31,year), ‘14MAR2019’d, ‘Actual’);

  Int1 = intck(‘year’, mdy(12,31,year), ‘14MAR2019’d,’continuous’);

  Int2 = intck(‘month’, mdy(12,31,year), ‘14MAR2019’d);

  Next1 = intnx(‘month’, ‘14MAR2019’d, 3, ‘beginning’);

  Next2 = intnx(‘month’, ‘14MAR2019’d, 3, ‘sameday’);

  Day1a = weekday(Next1); 

  Day1b = weekday(Next2); 

  Format Date date7. FixedDay yymmdd8. Day1b next2 downame.  Next1 date7.;

run;

proc print data = work.DatesDemo02 noobs;

  var Serial Date FixedDay Years Int1 Int2 Next1 Day1a Day1b Next2;

run;

The results of date functions and date literals are valid for direct input into other functions.

Based on the dates in the first and second argument, the YRDIF function calculates the time between the dates, in years, using a predefined calculation method. Here, the ACTUAL method instructs SAS the actual number of days (365/non-leap year and 366/leap year). The DATDIF function is available for calculating differences, in days, between dates. For details about the various calculation methods available, see the SAS Documentation.

The INTCK function calculates the number of time interval boundaries between two dates. Here, the INTCK function uses the CONTINUOUS method to calculate the number of complete years between the two dates. For example, when using the CONTINUOUS method, the number of years between 14MAR2019 and 13MAR2023 is three since only three full years have elapsed.

Here, the INTCK function calculates the number of months between the two dates using the default DISCRETE method. For example, the number of months between 14MAR2019 and 01APR2019 is one since the dates include one month boundary—the first of the month. Custom intervals and boundaries are available. For more information, see the SAS Documentation.

The INTNX function is similar to the INTCK function in that it is based on time intervals. However, it is used to determine the future date based on a time interval and a target date during the interval. Here, 14MAR2019 is the initial date, and the INTNX function calculates a date three months in the future. The BEGINNING method aligns the calculated date to the beginning of the interval—first day of the month in this case.

This INTNX function also projects a date three months in the future, but sets the target date using the SAMEDAY method to ensure the day of the month is the same as the initial date.

The WEEKDAY function extracts the day of the week from the provided date and stores it as a numeric value (1=Sunday, 2=Monday, ... ). Other functions such as QTR, YEAR, MONTH, and more are also available. The same value is calculated twice to demonstrate the use of formats in .

As discussed in previous chapters, a wide variety of formats are available—including some specifically designed for use with dates and times. If the width value is too small to allow for a full representation of the date, SAS adjusts how it displays the dates. For more details about this, see the SAS Documentation and Chapter Note 2 in Section 4.10.

It is important to understand that date formats treat the underlying value as a SAS date. Here, Day1b is no longer representing a date, it represents the day of the week, but the DOWNAME format interprets the current value—six—as a date value (Thursday, January 7, 1960) and thus Day1b is displayed as Thursday while Next2 appears as Friday.

Output 4.6.2: Using Date Functions

Serial

state

year

Date

DynamicDay

FixedDay

Diff1

Diff2

Diff3

1

Alabama

2001

15340

21696

21622

6356

6282

14776.169

2

Alabama

2001

15340

21696

21622

6356

6282

14776.169

3

Alabama

2001

15340

21696

21622

6356

6282

14776.169

2

Alabama

2005

16801

21696

21622

4895

4821

14776.169

3

Alabama

2005

16801

21696

21622

4895

4821

14776.169

4

Alabama

2005

16801

21696

21622

4895

4821

14776.169

4.7 Data Exploration with the UNIVARIATE Procedure

Demonstrations on using the MEANS and FREQ procedures as data exploration tools are given in Chapter 3; this section presents data exploration methods using PROC UNIVARIATE. The UNIVARIATE procedure produces several summary statistics by default, is capable of producing others, and can also produce a variety of plot types.

4.7.1 Summary Statistics in PROC UNIVARIATE

When PROC UNIVARIATE is executed with only the PROC UNIVARIATE statement and DATA= as its only option, the default behavior is to summarize all numeric variables. Since the IPUMS CPS data contains variables such as Serial and CountyFIPS that are numeric but not quantitative, many of the default summaries from PROC UNIVARIATE are undesirable. To choose variables for analysis, PROC UNIVARIATE uses a VAR statement that works in much the same manner as it does in PROC MEANS. Program 4.7.1 summarizes selected variables from the combination of the IPUMS CPS data from 2001 and 2005 (similar to Program 4.5.2) using PROC UNIVARIATE.

Program 4.7.1: Default Summaries Generated by PROC UNIVARIATE

data work.Ipums2001and2005;

  set BookData.Ipums2001Basic(in = in2001)

      BookData.Ipums2005Basic(in = in2005);

  if in2001 eq 1 then Year = 2001;

    else if in2005 eq 1 then Year = 2005;

run;

proc univariate data= work.Ipums2001and2005;

  var MortgagePayment HomeValue Citypop;

run;

By default, PROC UNIVARIATE creates five output tables for each variable listed. The resulting tables for the MortgagePayment variable are shown in Output 4.7.1

Output 4.7.1: Default Output from PROC UNIVARIATE on the MortgagePayment Variable

Moments

N

1642156

Sum Weights

1642156

Mean

472.363975

Sum Observations

775695336

Std Deviation

707.232195

Variance

500177.377

Skewness

2.45508546

Kurtosis

10.4176149

Uncorrected SS

1.18778E12

Corrected SS

8.21369E11

Coeff Variation

149.721874

Std Error Mean

0.55189291

Basic Statistical Measures

Location

Variability

Mean

472.3640

Std Deviation

707.23219

Median

0.0000

Variance

500177

Mode

0.0000

Range

7900

Interquartile Range

790.00000

Tests for Location: Mu0=0

Test

Statistic

p Value

Student’s t

t

855.8979

Pr > |t|

<.0001

Sign

M

387391.5

Pr >= |M|

<.0001

Signed Rank

S

1.501E11

Pr >= |S|

<.0001

Quantiles (Definition 5)

Level

Quantile

100% Max

7900

99%

3000

95%

1800

90%

1400

75% Q3

790

50% Median

0

25% Q1

0

10%

0

5%

0

1%

0

0% Min

0

Extreme Observations

Lowest

Highest

Value

Obs

Value

Obs

0

1.64E6

7900

694144

0

1.64E6

7900

694186

0

1.64E6

7900

694359

0

1.64E6

7900

695015

0

1.64E6

7900

695754

The default tables in Output 4.7.1 contain information about moments (basic and central moments of various orders), some measures of center and variability (a few of which are repeats from the moments table), tests for a mean of zero, a set of common quantiles, and an extreme observations table (denoting the five largest and smallest values for the variable and the record number in the data set for each). In several instances, a subset of these tables is desired and, to achieve this, the ODS SELECT or ODS EXCLUDE statements are available, as first shown in Program 1.5.2. (As noted in Program 1.5.1, table names can be determined via the ODS TRACE statement.) Program 4.7.2 modifies Program 4.7.1 to display only the table for basic measures of center and variability along with the quantiles table for all variables (output not shown).

Program 4.7.2: Using ODS SELECT to Subset Output Tables

proc univariate data= work.Ipums2001and2005;

  var MortgagePayment HomeValue Citypop;

  ods select Quantiles BasicMeasures;

run;

PROC UNIVARIATE also permits the use of a CLASS statement, and it behaves in much the same manner as seen with PROC MEANS in Section 2.3.2. Year is used as a class variable in Program 4.7.3, with ODS SELECT limiting results to the Basic Statistical Measures table, with the results shown in Output 4.7.3.

Program 4.7.3: Using a Class Variable in PROC UNIVARIATE

proc univariate data= work.Ipums2001and2005;

  class year;

  var HomeValue Citypop;

  ods select BasicMeasures;

run;

Output 4.7.3: Using a Class Variable in PROC UNIVARIATE

Variable: HomeValue (House value)
Year = 2001

Basic Statistical Measures

Location

Variability

Mean

2964943

Std Deviation

4444513

Median

162500

Variance

1.97537E13

Mode

9999999

Range

9994999

Interquartile Range

9914999

Variable: HomeValue (House value)
Year = 2005

Basic Statistical Measures

Location

Variability

Mean

2793526

Std Deviation

4294777

Median

225000

Variance

1.84451E13

Mode

9999999

Range

9994999

Interquartile Range

9887499

Variable: CITYPOP (City population)
Year = 2005

Basic Statistical Measures

Location

Variability

Mean

2916.656

Std Deviation

12316

Median

0.000

Variance

151690509

Mode

0.000

Range

79561

Interquartile Range

0

Note that the only level for the CLASS variable Year is 2005 for the analysis variable CityPop; CityPop is not present in the IPUMS CPS 2001 data, so all values are missing for that class. By default, PROC UNIVARIATE produces a table noting the missing values when they occur, but this is suppressed by the selection of only the BasicMeasures table in ODS SELECT. Also note the mode of 9,999,999 for HomeValue, which is due to a special encoding in the IPUMS CPS data. If it is desired to restrict the analysis to records without that special encoding, care should be taken to achieve the desired result. Program 4.7.4 uses a WHERE statement to eliminate these values, with two of the output tables shown in Output 4.7.4A and 4.7.4B.

Program 4.7.4: Subsetting Records with the WHERE Statement in PROC UNIVARIATE

proc univariate data= work.Ipums2001and2005;

  class year;

  var HomeValue Citypop;

  ods select BasicMeasures;

  where HomeValue ne 9999999;

run;

Output 4.7.4A: Where Subsetting (Partial Output, HomeValue for 2005)

Variable: HomeValue (House value)
Year = 2005

Basic Statistical Measures

Location

Variability

Mean

238922.4

Std Deviation

219007

Median

162500.0

Variance

4.7964E10

Mode

225000.0

Range

995000

Interquartile Range

255000

The WHERE subsetting has the desired effect on the HomeValue results. However, since CityPop is also part of the PROC UNIVARIATE analysis, those results are also different.

Output 4.7.4B: Where Subsetting (Partial Output, CityPop for 2005)

Variable: CITYPOP (City population)
Year = 2005

Basic Statistical Measures

Location

Variability

Mean

1799.078

Std Deviation

9196

Median

0.000

Variance

84572682

Mode

0.000

Range

79561

Interquartile Range

0

Since the WHERE condition subsets the entire record based on the provided condition, a record having a HomeValue of 9,999,999 also has its Citypop value excluded from the analysis. If this is not what is desired, separate submissions of PROC UNIVARIATE must be made with the appropriate variables and conditions. Program 4.7.5 also produces the table shown in Output 4.7.4B.

Program 4.7.5: How WHERE Subsets Records

proc univariate data= work.Ipums2001and2005;

  class year;

  var CityPop;

  ods select BasicMeasures;

  where HomeValue ne 9999999;

run;

4.7.2 Graphing Statements in PROC UNIVARIATE

The UNIVARIATE procedure also allows for a variety of diagnostic plots to be constructed which typically include the ability to check the fit of certain theoretical distributions. PROC UNIVARIATE includes probability plots, cumulative distribution plots, probability-probability plots, quantile-quantile plots, and histograms. Only quantile-quantile plots and histograms are considered in this section. Plotting statements can be made for any variable listed in the VAR statement, as is shown in Program 4.7.6.

Program 4.7.6: Generating Plots in PROC UNIVARIATE

proc univariate data= work.Ipums2001and2005;

  class year;

  var MortgagePayment;

  histogram MortgagePayment  / normal(mu=est sigma=est);

  qqplot MortgagePayment  / weibull(c=est sigma=est theta=est);

  where MortgagePayment gt 0 ;

run;

The HISTOGRAM statement creates a two-panel graph in Output 4.7.6A, one panel for each level of the class variable Year.

Several different distributions can be requested with parameters specified or estimated (EST) from the data. See the SAS Documentation for more information about distribution choices and associated parameters.

The QQPLOT statement also generates a two-panel graph in Output 4.7.6B.

The quantiles from the data are plotted against the theoretical quantiles of the chosen distribution. If a sufficient set of parameters is chosen, a reference line is produced.

MortgagePayment is subset in this case to remove the dominant number of zeros in the data and to make the parameter set of the Weibull distribution estimable.

Output 4.7.6A: Generating Histograms in PROC UNIVARIATE

Output 4.7.6A: Generating Histograms in PROC UNIVARIATE

Output 4.7.6B: Generating QQ Plots in PROC UNIVARIATE

Output 4.7.6B: Generating QQ Plots in PROC UNIVARIATE

The next section revisits histograms as part of the SGPLOT procedure; however, quantile, probability, and probability-probability plots are often useful diagnostic tools for assessing distributions of data, but are not directly implemented in PROC SGPLOT. As these plots and their use are beyond the scope of this book, refer to the SAS Documentation for more information about using them.

4.8 Data Distribution Plots

Histograms and boxplots are commonly used to display information about data distributions and are available through the SGPLOT procedure. Various statements and options for creating data distribution plots are covered in this section, drawing on similarities to the charting concepts covered in Sections 3.3 through 3.5.

4.8.1 Histograms

The HISTOGRAM statement in the SGPLOT procedure requires a numeric variable from a data set, producing a relative frequency histogram across a set of bins. Program 4.8.1 uses the combined 2001 and 2005 IPUMS CPS data created in Program 4.7.1 to create a histogram for nonzero values of MortgagePayment shown in Output 4.8.1.

Program 4.8.1: Creating a Histogram in PROC SGPLOT

proc sgplot data= work.Ipums2001and2005;

  histogram MortgagePayment;

  where MortgagePayment gt 0;

run;

Output 4.8.1: Histogram of Mortgage Payments

Output 4.8.1: Histogram of Mortgage Payments

In this case, the default choice of bins is poor, and other aspects of the graph can be improved as well. Program 4.8.2 uses AXIS statements, initially discussed in Section 3.4.2, and some options specific to the HISTOGRAM statement to improve Output 4.8.1.

Program 4.8.2: Histogram Options and Axis Modifications

proc sgplot data= work.Ipums2001and2005;

  histogram MortgagePayment / binstart=250 binwidth=500 scale=proportion

                              dataskin=gloss;

  xaxis label=’Mortgage Payment’ valuesformat=dollar8.;

  yaxis display=(nolabel) valuesformat=percent7.;

  where MortgagePayment gt 0;

run;

BINSTART= and BINWIDTH= set the starting point for the first bin and the width of all bins, respectively. The location of the bin is referenced by its midpoint, so making the first bin span 0 to 500 requires a starting value of 250 in conjunction with a width of 500.

The SCALE= option allows for choices for summarizing frequency or relative frequency, here PROPORTION summarizes relative frequency with a decimal value. The VALUESFORMAT= option in the YAXIS statement alters the display of the values.

Like bar charts, bar fills for a histogram accept the DATASKIN= option.

XAXIS and YAXIS statements are available as discussed in Section 3.4.2. Refer to that section and the SAS Documentation for more details about these and other options.

Output 4.8.2: Setting Options for a Histogram

Output 4.8.2: Setting Options for a Histogram

These histograms cannot be separated by year with a CLASS statement as PROC UNIVARIATE does in Section 4.7.2. Previous chapters show that many plotting calls in PROC SGPLOT allow for grouping; however, the HISTOGRAM statement does not. To produce a similar plot to the histogram panels in Output 4.7.6, the SGPANEL procedure can be used.

4.8.2 Histograms in SGPANEL

The SGPANEL procedure provides the same essential set of plotting statements as PROC SGPLOT with other statements and options to create multi-panel graphs. The panels are determined by the levels of the variables listed in the PANELBY statement, and the XAXIS and YAXIS statements are replaced by the COLAXIS and ROWAXIS statements, respectively. Program 4.8.3 modifies the previous histogram into two panels across the Year variable.

Program 4.8.3: Multi-Panel Histogram

proc sgpanel data= work.Ipums2001and2005;

  panelby Year;

  histogram MortgagePayment / binstart=250 binwidth=500 scale=proportion 

                              dataskin=gloss;

  colaxis label=’Mortgage Payment’ valuesformat=dollar8.;

  rowaxis display=(nolabel) valuesformat=percent7.;

  where MortgagePayment gt 0;

run;

The PANELBY statement treats each variable as categorical, much like a CLASS statement in PROC MEANS or PROC UNIVARIATE. The specified plots are repeated in panels for each level of the variable or each combination of levels when multiple variables are given.

The options listed in COLAXIS and ROWAXIS here are the same as those in Program 4.8.2. However, given the nature of the panel graph, these options now apply across axes on both histograms

Output 4.8.3: Multi-Panel Histogram

Output 4.8.3: Multi-Panel Histogram

Several options are available as part of the PANELBY statement to control the arrangement and appearance of the panels. Program 4.8.4 illustrates some of these, see the SAS Documentation for information about more available options.

Program 4.8.4: Altering Panel Structure with PANELBY Statement Options

proc sgpanel data= work.Ipums2001and2005;

  panelby Year / columns=1  novarname  headerattrs=(family=’Georgia’);

  histogram MortgagePayment / binstart=250 binwidth=500 scale=proportion 

                              dataskin=gloss;

  colaxis label=’Mortgage Payment’ valuesformat=dollar8.;

  rowaxis display=(nolabel) valuesformat=percent7.;

  where MortgagePayment gt 0;

run;

COLUMNS= or ROWS= can be used to control the size of the grid in either direction—only one of the two should be specified.

NOVARNAME suppresses the variable name from appearing in the heading label—only the variable value appears.

ATTRS are available for several elements, including the headers. As discussed in Section 3.4.2, FAMILY= sets the font face for text elements.

Output 4.8.4: Altering Panel Structure with PANELBY Statement Options

Output 4.8.4: Altering Panel Structure with PANELBY Statement Options

4.8.3 Boxplots

Boxplots are available in PROC SGPLOT via the VBOX or HBOX statement for vertical or horizontal boxplots, respectively. The default boxplot style is the typical outlier boxplot, though other styles are available. The VBOX and HBOX statements each permit the GROUP= option and the GROUPDISPLAY=CLUSTER setting, first used for bar charts in Section 3.3.3, as Program 4.8.5 shows.

Program 4.8.5: Grouped Boxplot

proc sgplot data= work.Ipums2001and2005;

  vbox MortgagePayment / group=year groupdisplay=cluster;

  where MortgagePayment gt 0;

run;

Output 4.8.5: Grouped Boxplot

Output 4.8.5: Grouped Boxplot

Various options are available for modifying the boxplot, and axis and legend modifications are available to set options similar to those discussed in previous plotting examples. Program 4.8.6 modifies the type of boxplot and several styles on the graph in Output 4.8.6.

Program 4.8.6: Boxplot Modifications

proc sgplot data= work.Ipums2001and2005;

  vbox MortgagePayment / group=year groupdisplay=cluster extreme  

                         whiskerattrs=(color=red);

  keylegend / position=topright location=inside title=’’;

  yaxis display=(nolabel) valuesformat=dollar8.;

  where MortgagePayment gt 0;

run;

EXTREME draws the whiskers of the boxplot to the maximum and minimum values; thus, no outliers are displayed. Other options to control the positioning of whiskers and display of outliers are available.

The boxplot is made up of several line elements, each of which has an ATTRS option. Here, both sets of whiskers are colored red in the WHISKERATTRS= option.

Since this is a grouped chart, a legend is produced, and its attributes are changeable via options in the KEYLEGEND statement, as first shown in Section 3.4.2.

Output 4.8.6: Boxplot Modifications

Output 4.8.6: Boxplot Modifications

The number of graphing statements and options available in PROC SGPLOT is extensive; however, the general logic enables relatively easy associations between them. Reviewing the PROC SGPLOT documentation and examples from various sources is important for developing sound plotting skills. Section 4.8.4 introduces the high-low plot and some advanced variations on it to create a customized version of the boxplot.

4.8.4 High-Low Plots

High-low plots have their origins and most common use in financial time series; however, they are valuable in a variety of other scenarios. The HIGHLOW statement requires specification of three variables to generate the plot. Two of these are variables specified for each of LOW= and HIGH= with the requirement, as expected, that the value for the variable set as HIGH= is greater than or equal to that of LOW= for all records in the data set. The third variable is set as one of X= or Y= and controls the orientation of the graph. Choosing an X= variable provides a vertical axis for the high-low range; Y= orients this range horizontally.

The programs in this section, starting with Program 4.8.7, use data containing several percentiles on MortgagePayment, which is generated from the combined IPUMS CPS data for 2001 and 2005 using PROC MEANS and the ODS OUTPUT statement.

Program 4.8.7: High-Low Plot

proc means data= work.Ipums2001and2005 min p10 p25 median p75 p90 max;

  class year;

  var MortgagePayment;

  where MortgagePayment gt 0;

  ods output summary= work.MPQuantiles;

run;

proc sgplot data= work.MPQuantiles;

  highlow x=year low=MortgagePayment_p25 high=MortgagePayment_P75;

run;

Output 4.8.7: High-Low Plot

Output 4.8.7: High-Low Plot

The default plot, as shown in Output 4.8.7, is a line between the high and low value at each distinct value of the third value chosen for X= or Y=. As with other statements, the HIGHLOW statement contains various options to alter the display. Other PROC SGPLOT statements are also of utility here, Program 4.8.8 uses some of these to enhance the graph.

Program 4.8.8: High-Low Chart Modifications

proc sgplot data= work.MPQuantiles;

  highlow x=year low=MortgagePayment_P25 high=MortgagePayment_P75

          / type=bar fillattrs=(color=cx99FF99) dataskin=sheen;

  xaxis values=(2001 2005) display=(nolabel);

  yaxis label=’Mortgage Payment’ valuesformat=dollar8.;

run;

The display between the high and low values can be set as a bar instead of a line.

Since the plotting elements are changed to bars, styling options for bars and their fills are available.

The Year variable is effectively treated as quantitative even though it only has two distinct values. This is an instance where specifying a list of individual values for that axis is desirable.

Output 4.8.8: High-Low Plot Modifications

Output 4.8.8: High-Low Plot Modifications

Using the ability to overlay several plots in a PROC SGPLOT call, Program 4.8.9 uses multiple HIGHLOW statements to create a custom variation of a boxplot in Output 4.8.9.

Program 4.8.9: Creating a Custom Boxplot from Multiple High-Low Plots

proc sgplot data= work.MPQuantiles;

  highlow x=year low=MortgagePayment_Min high=MortgagePayment_Max/

          legendlabel=’Minimum to Maximum’ lineattrs=(color=red) name=’Line’;

  highlow x=year low=MortgagePayment_P10 high=MortgagePayment_P90/

          legendlabel=’10th to 90th Percentile’ type=bar barwidth=.3   

          fillattrs=(color=cx66AA66) name=’Box1’;

  highlow x=year low=MortgagePayment_P25 high=MortgagePayment_P75

          / type=bar legendlabel=’Inter-Quartile Range’ barwidth=.5  

          fillattrs=(color=cx77FF77) name=’Box2’;

  xaxis values=(2001 2005) display=(nolabel);

  yaxis label=’Mortgage Payment’ valuesformat=dollar8.;

  keylegend ‘Line’ / position=topright location=inside noborder  

                     valueattrs=(size=8pt);

  keylegend ‘Box1’ ‘Box2’ / across=1 position=topleft location=inside noborder 

                            valueattrs=(size=8pt);

run;

The layer created by the first HIGHLOW statement is a red line from the maximum to minimum values. As a legend is generated for this overlay, the LEGENDLABEL= option is used to create a more useful description of this portion of the graph. NAME= uses a literal value to name the legend for use in a KEYLEGEND statement—this graph has two separate legends.

The next plot, created by the second HIGHLOW statement, is from the 10th to 90th percentiles as a narrow bar in a green hue. A description for this portion of the graph is also transmitted to the legend via the LEGENDLABEL= option, and it is named with the NAME=option.

The final plot spans the traditional first to third quartile span with a bar of somewhat more width in a different hue of green.

As in previous examples, any legend generated can be altered with the KEYLEGEND statement. Unlike previous examples, two KEYLEGEND statements are used, with NAME= values from the plotting statements provided before the slash. Each is used to position and style the legend information—KEYLEGEND statements with multiple names include all legend information generated by all named plotting statements. A KEYLEGEND statement without a name thus includes all legend information from plots without a NAME= option.

simage shown here

4.9 Wrap-Up Activity

Use the lessons and examples contained in this and previous chapters to complete the activity shown in Section 4.2.

Data

Various data sets are available for use to complete the activity. This data expands on the 2001 and 2005 data used for many of the in-chapter examples. Completing this activity requires the following files:

  • Ipums2005Basic.sas7bdat
  • Ipums2010Basic.sas7bdat
  • Ipums2015Basic.sas7bdat

Scenario

Use the skills mastered so far, including those from previous chapters, to assemble the listed files into a single data set and create the analyses shown in Section 4.2

4.10 Chapter Notes

1. Attributes for a Common Variable Across Multiple Input Data Sets. When multiple SAS data sets are referenced in a SET statement (or other DATA step statements, such as the MERGE statement introduced in Chapter 5), variables with the same name are established as a single variable in the PDV unless the types do not match, which generates an error and stops compilation of the DATA step. Mismatches for attributes like length, format, and label do not generate errors, and SAS sources these attributes from one of the data sets. Generally, these attributes are sourced from the first data set listed; however, this is not always the case. Default values for these attributes are often assigned, such as the default length of 8 bytes for character variables in list input, but default behaviors also exist for attributes that may be unassigned. The use of standard character or numeric formats or using variable names as labels in displayed output may be the result of a specific assignment of those attributes or a result default behavior when they are no format or label attribute is assigned—assignments of these attributes can be checked with PROC CONTENTS. If the attribute is not assigned in some of the referenced data sets, the first encounter of the attribute becomes the assignment for the final data set. Except in rare circumstances, the length attribute is always assigned and therefore is established by the first data set encountered. If that length is shorter than lengths assigned to that variable in other data sets, truncation can occur (and a warning is generated in the SAS log in such circumstances). It is a good programming practice to check attributes with PROC CONTENTS before attempting to combine data sets.

2. YEARCUTOFF=. When SAS displays a date using a format that limits the width so that full precision is not available, it generally begins by using only two-digit years, then by removing any delimiters—such as forward slashes—that appear in the date. Date, time, and datetime formats have a specific range of allowable widths, and the SAS Documentation demonstrates how values appear using a variety of widths. However, when reading date values—either from raw data or from a date literal—if only a two-digit year is present, then SAS uses the value of the system option YEARCUTOFF= to determine the century in which the two-digit year falls. For example, if YEARCUTOFF=1910 then SAS interprets the date 3/21/19 as March 21, 1919, while if YEARCUTOFF=1920, then 3/21/19 is interpreted as March 21, 2019. The two-digit year is compared to the last two digits in the YEARCUTOFF= value—if larger the century given is used; if smaller the next century is used. While two-digit years have become less common in modern databases, when dealing with historical data it is important to ensure date-related fields are properly represented. To determine the default value of the YEARCUTOFF= for a particular version of SAS, see the SAS Documentation.

4.11 Exercises

Concepts: Multiple Choice

1. Which of the following programs interleaves the data sets A and B according to the variable EmployeeID?

a.

data combined;

  set A B;

run;

b.

data combined;

  set A B;

  by employeeID;

run;

c.

proc sort data = A;

  by employeeID;

run;

proc sort data = B;

  by employeeID;

run;

data combined;

  set A B;

run;

d.

proc sort data = A;

  by employeeID;

run;

proc sort data = B;

  by employeeID;

run;

data combined;

  set A B;

  by employeeID;

run;

2. Which of the following variable attributes must match for a DATA step concatenation to combine columns from two contributing data sets?

a. Type and Name

b. Position and Name

c. Position only

d. Name only

3. Which of the following data set options can be used to identify which data sets contributed information to the current PDV?

a. WHERE=

b. KEEP=

c. RENAME=

d. IN=

4. Consider the following three items: 1) using a DROP= option on a data set in the SET statement, 2) using a DROP= option on the single data set listed in the DATA statement, and 3) using a DROP statement in the DATA step. Which of the following are equivalent?

a. Item 1 and Item 2

b. Item 1 and Item 3

c. Item 2 and Item 3

d. All three items are equivalent

5. Which of the following is false about DO groups?

a. Must close with an END statement

b. Can contain multiple SAS statements

c. Cannot be used in an ELSE statement

d. None of the above are false

6. Which of the following procedures can be used to generate histograms?

a. UNIVARIATE

b. SGPLOT

c. SGPANEL

d. All of the above

7. The GROUP= option is not available in PROC SGPLOT for which of the following plotting statements?

a. VBAR

b. VBOX

c. HISTOGRAM

d. HIGHLOW

8. For the SGPANEL procedure, which of the following is true regarding the XAXIS statement?

a. It works the same as it does in PROC SGPLOT

b. It is not available and is supplanted by the COLAXIS statement

c. It is not available and is supplanted by the ROWAXIS statement

d. None of the above

9. The orientation of a high-low plot in the SGPLOT procedure is determined by which of the following?

a. The VERTICAL or HORIZONTAL option

b. The choice of an X= or Y= variable

c. The variable type (character or numeric)

d. THE SGPLOT procedure itself

10. Which of the following programs could be used to interleave two data sets—Accounts1 and Accounts2—by the numeric key variables Date and Amount in order to produce the data set shown?

Combined

Date

Amount

Account

03JUL2019

$8,700

EK-1257

03JUL2019

$1,600

RJ-002X

18JUN2019

$3,200

JB-1977

18JUN2019

$425

JB-1941

a.

data combined;

  set Accounts1 Accounts2;

  by Date Amount;

run;

b.

data combined;

  set Accounts1 Accounts2;

  by descending Date Amount;

run;

c.

data combined;

  set Accounts1 Accounts2;

  by Date Amount descending;

run;

d.

data combined;

  set Accounts1 Accounts2;

  by descending Date descending Amount;

run;

Concepts: Short-Answer

1. For each of the following, what condition(s) must be satisfied?

a. A successful concatenation of multiple data sets.

b. A successful interleave of multiple data sets.

2. How is an IF-THEN/ELSE chain potentially more efficient than a series of IF-THEN statements?

3. Describe the difference between the WHERE and subsetting IF statements in the DATA step.

4. Both KEEP and DROP statements are available in the DATA step, and similarly both ODS SELECT and ODS EXCLUDE statements are available for use with any procedure. Why is it the case that both are available even though only one is used at a time?

5. Consider the following DATA step code.

data Logic;

  set InputData;

  select;

    when(1 <= status <= 5) rank = ‘Assistant’;

    when(3 <= status <=14) rank = ‘Associate’;

    when(status >= 10) rank = ‘Full’;

  end;

  if 0 <= SeqNum <= 100 then series = 3;

    else if 50 <= SeqNum <= 150 then series = 4;

run;

For each of the following, determine the results in each of the following scenarios and provide a justification.

a. What is the value of Rank for a record with Status = 5?

b. What is the value of Rank for a record with Status = 0?

c. What is the value of Series for a record with SeqNum = 75?

d. What is the value of Series for a record with SeqNum = 200?

6. Consider the following DATA step code, which is a modification of the DATA step from the previous question.

data Logic;

  set InputData;

  select;

    when(1 <= status <= 5) rank = ‘Assistant’;

    when(3 <= status <=14) ranks = ‘Associate’;

    when(status >= 10) rank = ‘Full’;

  end;

  if 0 <= SeqNum <= 100 then series = ‘3’;

    else if 50 <= SeqNum <= 150 then series = ‘10’;

      else series = 15;

run;

Use this DATA step to explain each of the following.

a. In the SELECT block, both Rank and Ranks appear in the assignment statements. Describe the effects on the data set and list any notes, warnings, or errors that appear in the SAS log as a result of this discrepancy.

b. In the IF-THEN/ELSE chain, Series appears in three assignment statements. Assuming each condition is satisfied at least once, what values does Series have in the data set?

7. For each of the following items, determine whether it correctly completes the DATA step below in order to write a subset of the data where Age is at least 55 or CholFlag is equal to Y.

data CheckUp;

  set PatientDemog(keep = Age Chol);

  if (Age <= 40 and Chol >= 130) or 

     (Age >= 40 and Chol >= 100) 

    then CholFlag = ‘Y’;

  <insert code here>

run;

a. if CholFlag = ‘Y’ or Age >= 55;

b. where CholFlag = ‘Y’ or Age >= 55;

c. if CholFlag = ‘Y’;
where Age >= 55;

d. where CholFlag = ‘Y’;
if Age >= 55;

8. The following code is a modified version of the code that generates Log 4.4.15. Execute the code below and use the results to help answer the following items.

data Base;

  length y $6;

  x=123456789;

  y=x;

  put _all_;

run;

data Subset3;

  set Base;

  if x = y;

run;

a. Why does the Subset3 data set contain zero observations?

b. Without introducing new statements or options, what modification of the code results in the implicit conversion successfully comparing X and Y to produce a non-empty Subset3 data set.

c. Rewrite the second DATA step to correctly use explicit conversion when matching X and Y.

Programming Basics

1. The data sets BookData.Ratings2016 and BookData.Ratings2017 contain several reviews of restaurants. In both data sets, ID is used as a unique identifier for the restaurants. Str1 represents the rating and Str2 represents the rater’s first name. Write a program that combines the data sets according to the following specifications. In each, ensure a variable exists that contains the year in which the restaurant was rated.

a. Concatenates the data sets

b. Interleaves the data sets based on Str2

c. Interleaves the data sets based on Str2 and Str1

2. Using the data sets from the previous question, use the appropriate function(s) to create a new variable, Stars, that is numeric and contains the number of stars assigned to each restaurant. (For example, in the 2016 data the values should be 4.5, 4, 5, 3, and 5.) Interleave the resulting data sets by Stars and ID, and include the year in which the restaurant was rated.

3. Use the Sashelp.Stocks data set to do the following.

a. Create a new data set that includes two new variables, with one being Year. The other is FiscalSeason, which has the values Federal1 during the months October, November, and December; Federal2 during the months January, February, and March; Federal3 during the months April, May, June; and Federal4 for the months of July, August, and September.

b. Summarize the maximum and minimum of monthly high price for each combination of Year and FiscalSeason.

c. Create the following panel graph:

image shown here

4. Using PROC UNIVARIATE, obtain the quantiles, extreme observations, and QQ-plots for the Sales, Inventory, and Returns variables in the Sashelp.Shoes data set. Ensure the procedure does not produce any additional results.

5. The previous question used the raw Sales, Inventory, and Returns values from Sashelp.Shoes. However, each record is based on a variable number of stores. Re-create the quantiles, extreme observations, and QQ-plots using a per-store average value of Sales, Inventory, and Returns.

6. Write a program that does all of the following:

◦ Reads in the Flights.csv data set, which contains the same fields read by Program 2.9.1, and ensures data values are read correctly and in a manner that allows for future analyses.

◦ Uses the PUTLOG statement to write the FlightNum and FirstClass values along with the associated variable names only when the value of FirstClass is missing.

◦ Uses the LIST statement only when the flight is on a date prior to 12/14/2000 or is from DFW.

7. Write a program that does all of the following:

◦ Reads in the Flights.txt data set, which contains the same fields read by Program 2.9.1.

◦ Uses the PUTLOG statement to write the complete PDV if any variable is missing.

◦ Uses the LIST statement to write the complete input buffer if any variable is missing.

◦ Includes a comment detailing whether the PUTLOG or LIST statement is a better debugging tool in this scenario.

◦ Validates the data set against the Flights data set created in the previous exercise.

8. Read in the raw data contained in Cars.datfile and compare it to the Cars data set in the Sashelp library. Modify the DATA step that reads the raw file to correct any data differences that arise so that the file generated validates against Sashelp.Cars.

Case Studies

For additional practice, multiple case studies are available in addition to the IPUMS CPS case study used in the chapters. See Section 8.4 to apply the skills from this chapter to the Clinical Trials Case Study. For additional case studies, including extensions to the IPUMS CPS case study, see the author pages.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset