© Randy Betancourt, Sarah Chen 2019
R. Betancourt and S. Chen, Python for SAS Users, https://doi.org/10.1007/978-1-4842-5001-3_5

5. Data Management

Randy Betancourt1  and Sarah Chen2
(1)
Chadds Ford, PA, USA
(2)
Livingston, NJ, USA
 

In this chapter we discuss common data management tasks beginning with combining DataFrames. Other tasks discussed include sorting, finding duplicate values, drawing samples, and transposing. Every analysis task requires data to be organized into a specific form before the data is used to render meaningful results. Data often comes from multiple sources with a range of different formats. This requires you to logically relate and organize data to fit the analysis task. In fact, most of the effort for any data analysis is “wrangling” data to shape it appropriately.

pandas has two main facilities for combining DataFrames: set logic for concatenation and relational-algebra capabilities for join/merge operations. The concat() method performs row-wise or column-wise concatenation and applies union or intersection set logic on DataFrames. The examples explored in this chapter are analogous to the SAS Data Step SET statement and PROC APPEND.

The merge() method offers an SQL-like interface for performing DataFrame join/merge operations. The SAS MERGE statement and PROC SQL are the analogs used to introduce the merge() method.

The merge() method is like the SAS match-merge operation. Since data is rarely tidy, we also explore cases where key columns are inferred, key columns have different names, and merges involve missing key values.
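
As a preview, here is a minimal sketch, assuming two hypothetical DataFrames (emp and pay) whose key columns carry different names; the left_on= and right_on= arguments map the keys to one another:

import pandas as pd

# Hypothetical DataFrames; EmpID and ID hold the same key values
emp = pd.DataFrame({'EmpID': [1, 2, 3], 'Name': ['Ava', 'Bo', 'Cy']})
pay = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 55000]})

# left_on=/right_on= name the key column in each DataFrame
matched = pd.merge(emp, pay, left_on='EmpID', right_on='ID', how='inner')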

Start by constructing the left and right DataFrames illustrated in Listing 5-1. In this example, the DataFrames use the ID column as a common key.
>>> import pandas as pd
>>> url_l = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/Left.csv"
>>> left = pd.read_csv(url_l)
>>> url_r = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/Right.csv"
>>> right = pd.read_csv(url_r)
>>>
>>> print(left)
    ID       Name Gender   Dept
0  929     Gunter      M    Mfg
1  446  Harbinger      M    Mfg
2  228     Benito      F    Mfg
3  299   Rudelich      M  Sales
4  442  Sirignano      F  Admin
5  321   Morrison      M  Sales
6  321   Morrison      M  Sales
7  882     Onieda      F  Admin
>>> print(right)
    ID  Salary
0  929  45,650
1  446  51,290
2  228  62,000
3  299  39,800
4  442  44,345
5  871  70,000
Listing 5-1

Build Left and Right DataFrames

Notice the left DataFrame has '321' as a duplicate value in the ID column, making a many-to-one relationship between the DataFrames. Also notice the right DataFrame has '871' as an ID value not found in the left DataFrame. These are the types of issues that may cause unexpected results when performing merge/join operations. These two DataFrames are used in several examples throughout this chapter.

Listing 5-2 builds the same input data into the left and right datasets in SAS. With the SAS examples, we explore the use of both Data Step and PROC SQL logic as analogs to the merge() and concat() methods for DataFrames.
4 data left;
5 infile datalines dlm=',';
6        length name $ 12 dept $ 5;
7        input name $
8              id
9              gender $
10             dept;
11 list;
12 datalines;
RULE:      --+--1--+--2--+--3--+--4--+--5
13        Gunter,    929, M, Mfg
14        Harbinger, 446, M, Mfg
15        Benito,    228, F, Mfg
16        Rudelich,  299, M, Sales
17        Sirignano, 442, F, Admin
18        Morrison,  321, M, Sales
19        Morrison,  321, M, Sales
20        Onieda,    882, F, Admin
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
21 ;;;;
22
23 data right;
24    input id
25          salary;
26 list;
27 datalines;
RULE:      --+--1--+--2--+--3--+--4--+--5
28         929 45650
29         446 51290
30         228 62000
31         299 39800
32         442 44345
33         871 70000
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
34 ;;;;
35
36 proc print data=left;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
37 proc print data=right;
38 run;
Listing 5-2

Create Left and Right Datasets

Figure 5-1 uses PROC PRINT to display the left dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig1_HTML.jpg
Figure 5-1

Left Dataset

Figure 5-2 uses PROC PRINT to display the right dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig2_HTML.jpg
Figure 5-2

Right Dataset

SAS Sort/Merge

To begin, consider Listing 5-3, also referred to as match-merge. This program is a common pattern for match-merging two SAS datasets containing a common key. In this example, both the left and right SAS datasets are sorted by the id variable, enabling SAS BY-group processing. After sorting, the match-merge joins the datasets by the id variable.

Experienced SQL users know the results from this match-merge are the same as a FULL OUTER join in PROC SQL in cases where the table relationships are one-to-one or many-to-one. In cases where the table relationships are many-to-many, the results from the Data Step and PROC SQL differ. The many-to-many use cases for SAS and pandas are detailed in Appendix B at the end of the book. For the remainder of this chapter, our examples deal with one-to-one or many-to-one join relationships represented by the left and right DataFrames and datasets created in Listing 5-1 and Listing 5-2.

Listing 5-3 combines observations from the left and right datasets into a single observation in the new merge_lr dataset according to the values for the id variable found in both datasets.
4 proc sort data=left;
5     by id;
6 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
7 proc sort data=right;
8      by id;
9 run;
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
10 data merge_lr;
11    merge left
12          right;
13    by id;
14 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.MERGE_LR has 9 observations and 5 variables.
15 proc print data=merge_lr;
16    id id;
17 run;
NOTE: There were 9 observations read from the dataset WORK.MERGE_LR.
Listing 5-3

SAS Sort/Merge

Figure 5-3 uses PROC PRINT to display the resulting merge_lr dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig3_HTML.jpg
Figure 5-3

Results from SAS Sort/Merge

To introduce the pandas merge() method, consider Listing 5-4. Using a single Data Step, the program creates seven output datasets, one for each of the following join operations:
  • Inner join

  • Right join

  • Left join

  • Outer join

  • Left join with no matched keys

  • Right join with no matched keys

  • Outer join with no matched keys

4 data inner
5      right
6      left
7      outer
8      nomatch_l
9      nomatch_r
10     nomatch;
11
12 merge left(in=l)
13       right(in=r);
14 by id;
15
16 if (l=1 and r=1) then output inner; *Inner Join;
17
18 if r = 1 then output right; * Right Join;
19
20 if l = 1 then output left;  * Left Join;
21
22 if (l=1 or r=1) then output outer; *Full Outer Join;
23
24 if (l=1 and r=0) then output nomatch_l;
25
26 if (l=0 and r=1) then output nomatch_r;
27
28 if (l=0 or r=0) then output nomatch;
29
30 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
NOTE: The dataset WORK.RIGHT has 6 observations and 5 variables.
NOTE: The dataset WORK.LEFT has 8 observations and 5 variables.
NOTE: The dataset WORK.OUTER has 9 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_L has 3 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_R has 1 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH has 4 observations and 5 variables.
Listing 5-4

Create Seven Output Datasets

Each of these seven join/merge operations is explored in detail along with its pandas counterpart.

Inner Join

An INNER JOIN selects only those rows whose key values are found in both tables; in other words, it returns the intersection of matched key values. The SAS Data Step for an INNER JOIN is shown in Listing 5-5. The id column in both datasets must have matching values to be included in the result set.
4  data inner;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if (l=1 and r=1) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
Listing 5-5

Data Step Inner Join

The dataset IN= option creates Boolean variables to indicate which dataset contributes values to the current observation being read. The IF statement applies a truth test selecting only those observations where the id variable has matched values in both the left and right datasets.

With the SAS example as a template, we illustrate the pandas merge() method. The merge() method signature is
pd.merge(left, right, how="inner", on=None, left_on=None,
         right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
Listing 5-6 uses the on='ID' argument to indicate the ID column is a key column found in both DataFrames. It turns out the on='ID' argument is not needed in this example, since the merge() method detects the presence of a column labeled ID in both DataFrames and automatically uses it as the key column. The how='inner' argument performs an inner join, which is the default. The sort= argument is set to False. Not surprisingly, merge() operations on large DataFrames gain substantial performance improvements by not having to return rows in sorted order.
>>> inner = pd.merge(left, right, on="ID", how="inner", sort=False)
>>> print(inner)
    ID       Name Gender   Dept  Salary
0  929     Gunter      M    Mfg  45,650
1  446  Harbinger      M    Mfg  51,290
2  228     Benito      F    Mfg  62,000
3  299   Rudelich      M  Sales  39,800
4  442  Sirignano      F  Admin  44,345
Listing 5-6

pandas Inner Join

Listing 5-7 illustrates the PROC SQL query for an INNER JOIN.
4 proc sql;
5    select *
6 from left
7     ,right
8 where left.id = right.id;
9 quit;
Listing 5-7

PROC SQL Inner Join

An alternative syntax uses the PROC SQL keywords INNER JOIN. The join predicate, l.id = r.id, must be true to be included in the result set.
proc sql;
select *
   from left as l
inner join
   right as r
on l.id = r.id;
quit;
Figure 5-4 displays the result set from PROC SQL.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig4_HTML.jpg
Figure 5-4

PROC SQL Inner Join Results

Right Join

A RIGHT JOIN returns all observations from the right dataset along with those observations from the left dataset whose id values match id values in the right dataset.

The SAS Data Step equivalent for a right join is illustrated in Listing 5-8.
4  data r_join;
5     merge left(in=l)
6           right(in=r);
7     by id;
8
9  if r=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN has 6 observations and 5 variables.
Listing 5-8

Data Step Right Join

The dataset IN= option creates Boolean variables to indicate whether each dataset contributed values to the current observation. The IF statement applies a truth test selecting all observations in the right dataset along with those observations in the left dataset where the id variable has matched values in the right dataset. Where an observation in the right dataset has no matching id in the left dataset, the values contributed by the left dataset are set to missing.

Listing 5-9 illustrates the compactness of the merge() method.
>>> r_join = pd.merge(left, right, how="right", sort=True)
>>> print(r_join)
    ID       Name Gender   Dept  Salary
0  228     Benito      F    Mfg  62,000
1  299   Rudelich      M  Sales  39,800
2  442  Sirignano      F  Admin  44,345
3  446  Harbinger      M    Mfg  51,290
4  871        NaN    NaN    NaN  70,000
5  929     Gunter      M    Mfg  45,650
Listing 5-9

pandas Right Join

The merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned r_join DataFrame. In cases where rows in the right DataFrame have no matching ID in the left DataFrame, the values contributed by the left DataFrame are set to NaN.

Listing 5-10 illustrates a right join with PROC SQL.
4  proc sql;
5     select coalesce(left.id, right.id) as id
6           ,name
7           ,dept
8           ,gender
9           ,salary
10      from left
11  right join
12      right
13  on left.id = right.id;
14  quit;
Listing 5-10

PROC SQL Right Join

The COALESCE function combines the id columns from both tables into a single column. Without it, the result set returns an ID column from each table, with the left table's ID column containing nulls for rows having no match in the right table.

Figure 5-5 displays the result set created from the right join.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig5_HTML.jpg
Figure 5-5

RIGHT JOIN Results

As the output shows, all rows from the right table are returned, and PROC SQL assigns missing values to columns from the left table for rows in the right table having no id match in the left table.

Left Join

A left join returns all rows from the left dataset along with any rows from the right table where the join predicate is true. In this case, all rows from the left are returned along with any rows in the right with id values matching in the left. Listing 5-11 illustrates a left join.
4  data l_join;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if l=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN has 8 observations and 5 variables.
Listing 5-11

Data Step Left Join

Listing 5-12 illustrates a left join on the left and right tables.
>>> l_join = pd.merge(left, right, how="left", sort=False)
>>> print(l_join)
    ID       Name Gender   Dept  Salary
0  929     Gunter      M    Mfg  45,650
1  446  Harbinger      M    Mfg  51,290
2  228     Benito      F    Mfg  62,000
3  299   Rudelich      M  Sales  39,800
4  442  Sirignano      F  Admin  44,345
5  321   Morrison      M  Sales     NaN
6  321   Morrison      M  Sales     NaN
7  882     Onieda      F  Admin     NaN
Listing 5-12

pandas Left Join

Like the pandas right join example in Listing 5-9, the merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned l_join DataFrame. The output shows all rows from the left DataFrame are returned, with the merge() method assigning NaNs to columns from the right DataFrame for rows whose ID values have no match in the right DataFrame.

Listing 5-13 illustrates a left join with PROC SQL.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6          ,name
7          ,dept
8          ,gender
9          ,salary
10     from left
11 left join
12     right
13 on left.id = right.id;
14 quit;
Listing 5-13

PROC SQL Left Join

Similar to Listing 5-10, this example uses the COALESCE function to combine the id columns from both tables into a single column. Figure 5-6 displays the PROC SQL output.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig6_HTML.jpg
Figure 5-6

Left Join Results

Outer Join

Earlier we stated that SAS Sort/Merge returns the same result set as a PROC SQL FULL OUTER JOIN in cases where the table relationships are one-to-one or many-to-one. The results from this example are the same as the SAS Sort/Merge results. See Listing 5-3 for the Data Step source program.

Listing 5-14 illustrates use of the how='outer' argument to perform an outer join, selecting all rows from the left and right tables.
>>> merge_lr = pd.merge(left, right, on="ID", how="outer", sort=True)
>>> print(merge_lr)
    ID       Name Gender   Dept  Salary
0  228     Benito      F    Mfg  62,000
1  299   Rudelich      M  Sales  39,800
2  321   Morrison      M  Sales     NaN
3  321   Morrison      M  Sales     NaN
4  442  Sirignano      F  Admin  44,345
5  446  Harbinger      M    Mfg  51,290
6  871        NaN    NaN    NaN  70,000
7  882     Onieda      F  Admin     NaN
8  929     Gunter      M    Mfg  45,650
Listing 5-14

pandas Outer Join

Listing 5-15 illustrates a PROC SQL outer join.
4  proc sql;
5     select coalesce(left.id, right.id)
6           ,name
7           ,dept
8           ,salary
9      from left
10 full join
11      right
12 on left.id=right.id;
13 quit;
Listing 5-15

PROC SQL Outer Join

All rows from the left and right tables are returned. In cases where there are no matched values for the id variable, the values are set to missing.

Figure 5-7 displays the result set from an outer join on the left and right tables with PROC SQL.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig7_HTML.jpg
Figure 5-7

PROC SQL Outer Join

Right Join Unmatched Keys

Up to this point, the examples are based on finding matching key values in the data to be joined. The next three examples illustrate joining data where keys are not matched.

Every SQL join is either a Cartesian product or a subset of one. In cases involving unmatched key values, a form of WHERE processing is required. The SAS Data Step with the IN= option and associated IF logic is a common pattern for this type of filtering; PROC SQL with a WHERE clause works as well.

The next three examples illustrate the indicator= argument for the pandas merge() method as an analog to the SAS IN= dataset option. For pandas, the filtering process utilizes a Boolean comparison based on the indicator= value.

Consider Listing 5-16.
4  data r_join_nmk;
5  merge left(in=l)
6         right(in=r);
7  by id;
8
9  if (l=0 and r=1) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN_NMK has 1 observations and 5 variables.
Listing 5-16

Data Step Unmatched Keys in Right

The statement
 if (l=0 and r=1) then output;

behaves as a WHERE filter matching on the id column in both datasets, returning rows from the right dataset with no matched values in the left dataset.

To perform a right join on unmatched keys on DataFrames, use the indicator= argument to the merge() method. This argument adds a column to the output DataFrame, named _merge by default, indicating the source of each row. The returned values are
left_only
right_only
both
By applying a Boolean filter to the indicator= values, we replicate the behaviors for the SAS IN= dataset option for merge operations. Consider Listing 5-17.
>>> nomatch_r = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> print(' ',
...       nomatch_r,
...       ' ')
     ID       Name Gender   Dept  Salary      in_col
0  929     Gunter      M    Mfg  45,650        both
1  446  Harbinger      M    Mfg  51,290        both
2  228     Benito      F    Mfg  62,000        both
3  299   Rudelich      M  Sales  39,800        both
4  442  Sirignano      F  Admin  44,345        both
5  321   Morrison      M  Sales     NaN   left_only
6  321   Morrison      M  Sales     NaN   left_only
7  882     Onieda      F  Admin     NaN   left_only
8  871        NaN    NaN    NaN  70,000  right_only
>>> nomatch_r[(nomatch_r['in_col'] == 'right_only')]
    ID Name Gender Dept  Salary      in_col
8  871  NaN    NaN  NaN  70,000  right_only
Listing 5-17

pandas Right Join Unmatched Keys

The indicator= argument adds the in_col column to the nomatch_r DataFrame in this example. We print the nomatch_r DataFrame as an intermediate step to display values for the in_col column added by the indicator= argument.

The Boolean expression
nomatch_r[(nomatch_r['in_col'] == 'right_only')]
is a subsetting operation to create the nomatch_r DataFrame with the Boolean test
(nomatch_r['in_col'] == 'right_only')

selecting rows where the in_col column value is 'right_only'.

The SAS analog to Listing 5-17 is produced by adding a WHERE filter to the example from Listing 5-10, as illustrated in Listing 5-18.
4  proc sql;
5     select coalesce(left.id, right.id) as id
6           ,name
7           ,dept
8           ,gender
9           ,salary
10
11  from left
12      right join right on left.ID = right.ID
13  where left.ID is NULL;
14  quit;
Listing 5-18

PROC SQL RIGHT JOIN Unmatched Keys

The WHERE clause returns those rows from the right table having no matching id values in the left table. Values for columns contributed by the left table are set to missing.

Figure 5-8 displays the result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig8_HTML.jpg
Figure 5-8

PROC SQL Right Join No Matched Keys

Left Join Unmatched Keys

To find the unmatched key values in the left table, consider Listing 5-19.
4  data l_join_nmk;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if (l=1 and r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN_NMK has 3 observations and 5 variables.
Listing 5-19

Data Step Unmatched Keys in Left

The statement
 if (l=1 and r=0) then output;

reverses the test from Listing 5-16. The IF statement behaves like a WHERE filter matching on the id column in both datasets, returning just the observations from the left dataset having no matched id values in the right dataset.

Finding the unmatched key values in the left DataFrame is illustrated in Listing 5-20. Notice the call to the merge() method is the same as in Listing 5-17; here, however, we create the nomatch_l DataFrame, and the subsequent filtering logic differs.
>>> nomatch_l = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> nomatch_l = nomatch_l[(nomatch_l['in_col'] == 'left_only')]
>>>
>>> print(' ',
...        nomatch_l,
...       ' ')
     ID      Name Gender   Dept Salary     in_col
5  321  Morrison      M  Sales    NaN  left_only
6  321  Morrison      M  Sales    NaN  left_only
7  882    Onieda      F  Admin    NaN  left_only
Listing 5-20

pandas Left Join Unmatched Keys

The statement
nomatch_l = nomatch_l[(nomatch_l['in_col'] == 'left_only')]
is the subsetting logic to create the nomatch_l DataFrame with the Boolean test
(nomatch_l['in_col'] == 'left_only')

selecting rows where the in_col column value is 'left_only'.

The same result set is generated by simply adding a WHERE filter to the example from Listing 5-13. This is illustrated in Listing 5-21.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6          ,name
7          ,dept
8          ,gender
9          ,salary
10      from left
11 left join
12     right
13 on left.id = right.id
14 where right.id is null;
15 quit;
Listing 5-21

PROC SQL Left Join on Unmatched Keys

The WHERE clause returns those rows from the left table having no matching id values in the right table. Values for columns contributed by the right table are set to missing.

Figure 5-9 displays the result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig9_HTML.jpg
Figure 5-9

PROC SQL Left Join on Unmatched Keys

Outer Join Unmatched Keys

An outer join on unmatched keys returns rows from each dataset with unmatched keys in the other dataset. To find the unmatched key values in the left or right dataset, consider Listing 5-22.
4  data outer_nomatch_both;
5      merge left (in=l)
6            right (in=r);
7      by id;
8
9  if (l=0 or r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.OUTER_NOMATCH_BOTH has 4 observations and 5 variables.
Listing 5-22

Data Step Outer Join Unmatched Keys in Both

The statement
 if (l=0 or r=0) then output;

behaves like a WHERE filter matching on the id column in both datasets, returning rows from the left dataset with no matched values in the right dataset or rows from the right dataset with no matched values in the left dataset.

Finding unmatched keys in both DataFrames requires an outer join with a corresponding filter identifying the null or missing values. Consider Listing 5-23.
>>> nomatch_both = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> print(' ',
...       nomatch_both,
...       ' ')
     ID       Name Gender   Dept  Salary      in_col
0  929     Gunter      M    Mfg  45,650        both
1  446  Harbinger      M    Mfg  51,290        both
2  228     Benito      F    Mfg  62,000        both
3  299   Rudelich      M  Sales  39,800        both
4  442  Sirignano      F  Admin  44,345        both
5  321   Morrison      M  Sales     NaN   left_only
6  321   Morrison      M  Sales     NaN   left_only
7  882     Onieda      F  Admin     NaN   left_only
8  871        NaN    NaN    NaN  70,000  right_only
>>>
>>> nomatch_both[(nomatch_both['in_col'] == 'right_only') |
...              (nomatch_both['in_col'] == 'left_only')]
    ID      Name Gender   Dept  Salary      in_col
5  321  Morrison      M  Sales     NaN   left_only
6  321  Morrison      M  Sales     NaN   left_only
7  882    Onieda      F  Admin     NaN   left_only
8  871       NaN    NaN    NaN  70,000  right_only
Listing 5-23

pandas Outer Join Unmatched Keys in Both

The nomatch_both DataFrame holds the rows resulting from the outer join. The indicator= argument adds the in_col column to the DataFrame, whose values are then subject to a Boolean test identifying the unmatched rows.

The statement
nomatch_both[(nomatch_both['in_col'] == 'right_only') |
             (nomatch_both['in_col'] == 'left_only')]
is the filtering logic to create the nomatch_both DataFrame with the Boolean test
(nomatch_both['in_col'] == 'right_only') |
(nomatch_both['in_col'] == 'left_only')

selecting rows where the in_col column value is 'left_only' or where the in_col column value is 'right_only'.

An alternative Boolean expression that is more Pythonic is
nomatch_both = nomatch_both[nomatch_both["in_col"] != 'both']

selecting those rows where the in_col column value is not 'both'.
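
Another equivalent filter, offered as an aside rather than taken from the original examples, uses the isin() method to test membership in a list of values:
nomatch_both[nomatch_both['in_col'].isin(['left_only', 'right_only'])]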

Listing 5-24 illustrates finding the unmatched key values in both the right and left tables.
4 proc sql;
5    select coalesce(left.id, right.id)
6          ,name
7          ,dept
8          ,salary
9     from left
10 full join
11    right
12 on left.id=right.id
13 where left.id ne right.id;
14 quit;
Listing 5-24

PROC SQL Outer Join Unmatched Keys in Both

The WHERE clause returns rows from the left table having no matching id values in the right table and rows from the right table having no matching id values in the left table. Figure 5-10 displays the PROC SQL output.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig10_HTML.jpg
Figure 5-10

Outer Join Unmatched Keys Found in Both Tables

Validate Keys

As part of any data management task, one must understand the relationships among the tables being joined or merged. For example, join results may not be what you expect if tables believed to have unique key values turn out to have non-unique keys instead. Fortunately, both the pandas library and SAS have methods to detect and enforce join key uniqueness when this is a requirement. Consider Listing 5-25.
>>> dups = pd.merge(left, right, on="ID", how="outer", sort=False, validate="one_to_one" )
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
Listing 5-25

pandas Validate 1:1 Relationship

In this example, the validate= argument determines if the merge is of a specified type of relationship. The valid values for validate= are
  • one_to_one or 1:1: Check if keys are unique in both the left and right DataFrames.

  • one_to_many or 1:m: Check if keys are unique in the left DataFrame.

  • many_to_one or m:1: Check if keys are unique in the right DataFrame.

  • many_to_many or m:m: Allowed, but does not result in a check.

The default value is None. In this example, validate="one_to_one" raises a pandas.errors.MergeError, reporting that the merge key, ID, is not unique in the left DataFrame.
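
Because validation failures raise an exception, they can be trapped with a try/except block. The following is a sketch, not from the original examples, reusing the left and right DataFrames from Listing 5-1; since ID values are unique in the right DataFrame, a many_to_one validation succeeds where one_to_one fails:

try:
    dups = pd.merge(left, right, on="ID", how="outer", validate="one_to_one")
except pd.errors.MergeError as e:
    print("Validation failed:", e)

# Passes: ID is unique in the right DataFrame; duplicates allowed in left
checked = pd.merge(left, right, on="ID", how="outer", validate="many_to_one")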

Joining on an Index

Recall in Chapter 4, “Indexing and GroupBy,” we detailed the use of indexes as row labels for DataFrames. The pandas library implements the join() method to combine columns from two differently indexed DataFrames into a single DataFrame. The join() method is a convenience for calling the merge() method when joining on indexes, or when joining a key column in one DataFrame against the index of the other.

Consider Listing 5-26. In this example, the left and right DataFrames use an index to label the rows.
>>> import pandas as pd
>>> left = pd.DataFrame(
...    { 'Style'  :  ['S1', 'S2', 'S3', 'S4'],
...      'Size'   :  ['SM', 'MD', 'LG', 'XL']},
...      index =     ['01', '02', '03', '05'])
>>> right = pd.DataFrame(
...    { 'Color' :  ['Red', 'Blue', 'Cyan', 'Pink'],
...      'Brand' :  ['X', 'Y', 'Z', 'J']},
...      index =    ['01', '02', '03', '04'])
>>> print(left)
   Style Size
01    S1   SM
02    S2   MD
03    S3   LG
05    S4   XL
>>> print(right)
   Color Brand
01   Red     X
02  Blue     Y
03  Cyan     Z
04  Pink     J
>>>
>>> df1 = left.join(right, how="left")
>>> print(df1)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
05    S4   XL   NaN   NaN
Listing 5-26

Left Join on Indexed DataFrames

This example creates the left DataFrame with Python lists of values, assigning '01', '02', '03', and '05' as the row labels. Similarly, the right DataFrame is created with an index for row labels, '01' to '04'. The df1 DataFrame is created by calling the join() method, which joins either on an index (as in this case) or on a designated key column.

In this example, the call to the join() method performs the default left join with the how='left' argument.

The index labels from both DataFrames are preserved in the new df1 DataFrame according to the type of join performed. Observe how the print() function displays the index as row labels in the df1 DataFrame.

We can appreciate how the join() method is a convenience for the merge() method, reducing the amount of typing needed, since the corresponding merge() syntax to produce the same results is
df1 = left.merge(right, how="left", left_index=True, right_index=True)
The join() method provides four joining methods:
  • Left: Uses the calling DataFrame’s index, or a key column, if specified. This is the default join() method.

  • Right: Uses the other DataFrame’s index (a brief sketch follows this list).

  • Outer: Returns union of calling DataFrame’s index with the other DataFrame index and sorts the index.

  • Inner: Returns intersection of calling DataFrame’s index with the other DataFrame index and preserves the order of the calling DataFrame index.
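
The left, outer, and inner joins are demonstrated in the listings that follow. For completeness, here is a brief sketch of a right join on the indexed DataFrames from Listing 5-26, which keeps the right DataFrame’s index:

# All rows from right are kept; left rows without a matching index drop out
df_r = left.join(right, how="right")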

Consider Listing 5-27.
>>> df2 = left.join(right, how="outer")
>>> print(df2)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
04   NaN  NaN  Pink     J
05    S4   XL   NaN   NaN
Listing 5-27

Outer Join on Indexed DataFrames

The how='outer' argument enables an outer join.

An inner join operation on the indexed DataFrames is illustrated in Listing 5-28.
>>> df3 = left.join(right, how="inner")
>>> print(df3)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
Listing 5-28

Inner Join on Index DataFrames

In both Listing 5-27 and Listing 5-28, the indexes labeling the rows are the join keys and remain a part of the joined DataFrames.

Join Key Column with an Index

In those cases where a DataFrame is not indexed, use the on= argument to identify the key column used in a join operation. Observe in Listing 5-29 that the left DataFrame has only the default RangeIndex, while the right DataFrame has a custom index.
>>> left = pd.DataFrame(
...    {'Style'  :  ['S1', 'S2', 'S3', 'S4'],
...     'Size'   :  ['SM', 'MD', 'LG', 'XL'],
...     'Key'    :  ['01', '02', '03', '05']})
>>> right = pd.DataFrame(
...    {'Color' :  ['Red', 'Blue', 'Cyan', 'Pink'],
...     'Brand' :  ['X', 'Y', 'Z', 'J']},
...      index =   ['01', '02', '03', '04'])
>>> print(left)
  Style Size Key
0    S1   SM  01
1    S2   MD  02
2    S3   LG  03
3    S4   XL  05
>>> print(right)
   Color Brand
01   Red     X
02  Blue     Y
03  Cyan     Z
04  Pink     J
>>>
>>> df4 = left.join(right, on="Key", how="outer")
>>> print(df4)
  Style Size Key Color Brand
0    S1   SM  01   Red     X
1    S2   MD  02  Blue     Y
2    S3   LG  03  Cyan     Z
3    S4   XL  05   NaN   NaN
3   NaN  NaN  04  Pink     J
Listing 5-29

Outer Join Key Column with Index

The on= argument identifies the column called Key in the left DataFrame as the join key, which is matched against the index of the right DataFrame.

As pointed out previously, the join() method is a convenience for calling the merge() method to join an indexed DataFrame with a non-indexed DataFrame. The same result set from Listing 5-29 is generated with Listing 5-30.
>>> df5 = pd.merge(left, right, left_on="Key", how="outer", right_index=True)
>>>
>>> print(df5)
  Style Size Key Color Brand
0    S1   SM  01   Red     X
1    S2   MD  02  Blue     Y
2    S3   LG  03  Cyan     Z
3    S4   XL  05   NaN   NaN
3   NaN  NaN  04  Pink     J
Listing 5-30

Merge Key Column with an Index

In this example, the how='outer' argument calls for an outer join using the left_on='Key' argument to designate the join key on the left DataFrame. The right_index=True argument designates the index as the join key on the right DataFrame.
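
A design alternative, offered as an aside, is to promote the Key column to an index with the set_index() method so both DataFrames join on their indexes; note the result then carries Key as the index rather than as a column:

df5_alt = left.set_index('Key').join(right, how="outer")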

Update

Update operations are used in cases where a master table contains original data values and a transaction dataset, typically shaped like the master dataset, contains new values for updating the master. In SAS, an observation from the transaction dataset that does not correspond to any observation in the master dataset becomes a new observation. Begin by observing the behavior of the SAS UPDATE statement in Listing 5-31.
4 data master;
5    input ID salary;
6 list;
7 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
8           023 45650
9           088 55350
10          099 55100
11          111 61625
NOTE: The dataset WORK.MASTER has 4 observations and 2 variables.
12 ;;;;
13
14 data trans;
15 infile datalines dlm=',';
16 input ID
17       salary
18       bonus;
19 list;
20  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
21        023, 45650, 2000
22        088, 61000,
23        099, 59100,
24        111, 61625, 3000
25        121, 50000,
NOTE: The dataset WORK.TRANS has 5 observations and 3 variables.
26 ;;;;
27 data new_pay;
28    update master(rename=(salary = old_salary))
29    trans (rename=(salary = new_salary));
30 by id;
31 run;
NOTE: There were 4 observations read from the dataset WORK.MASTER.
NOTE: There were 5 observations read from the dataset WORK.TRANS.
NOTE: The dataset WORK.NEW_PAY has 5 observations and 4 variables.
32
33 proc print data=new_pay;
34    id ID;
35 run;
Listing 5-31

SAS Update

The SAS UPDATE statement creates the new_pay dataset by applying transaction values from the trans dataset to the master dataset. A BY variable is required. Because the ID values are read in sorted order on input, a call to PROC SORT is not needed in this example.

All non-missing values for variables in the trans dataset replace the corresponding values in the master dataset. The RENAME= dataset option renames the salary variables in order to display the effects of the UPDATE operation. The resulting new_pay dataset is displayed in Figure 5-11.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig11_HTML.jpg
Figure 5-11

SAS Update Results

In this example, the master dataset does not contain a bonus variable. Since the trans dataset contains a bonus variable, it is applied to the new_pay dataset.

The pandas update() method is used to modify DataFrame values using non-NA values from another DataFrame. In contrast to SAS, the update() method performs an in-place update to the calling DataFrame, in our case the master DataFrame. The SAS UPDATE statement creates a new dataset and does not modify the master dataset. To understand these differences, consider Listing 5-32.
>>> import numpy as np
>>> master = pd.DataFrame({'ID': ['023', '088', '099', '111'],
...                'Salary': [45650, 55350, 55100, 61625]})
>>> trans = pd.DataFrame({'ID': ['023', '088', '099', '111', '121'],
...          'Salary': [45650, 61000, 59100, 61625, 50000],
...           'Bonus': [2000, np.NaN , np.NaN, 3000, np.NaN]})
>>> print(master)
    ID  Salary
0  023   45650
1  088   55350
2  099   55100
3  111   61625
>>> print(trans)
    ID  Salary   Bonus
0  023   45650  2000.0
1  088   61000     NaN
2  099   59100     NaN
3  111   61625  3000.0
4  121   50000     NaN
>>> master.update(trans)
>>> print(master)
    ID  Salary
0  023   45650
1  088   61000
2  099   59100
3  111   61625
Listing 5-32

DataFrame update() Method

The master DataFrame calls the update() method to overwrite its values with non-NA values from the trans DataFrame. The default join operation for the update() method is a left join, which explains why row ID 121 is not in the updated master DataFrame; this row exists only in the trans DataFrame. It also explains why the Bonus column in the trans DataFrame is not part of the update to the master DataFrame.
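
The update() method also accepts an overwrite= argument. As a minimal sketch with hypothetical DataFrames, setting overwrite=False updates only positions holding NaN in the calling DataFrame, preserving its existing non-NA values:

# Hypothetical master/transaction pair; alignment is by position (index)
m = pd.DataFrame({'ID': ['023', '088'], 'Salary': [45650, np.NaN]})
t = pd.DataFrame({'ID': ['023', '088'], 'Salary': [99999, 61000]})
m.update(t, overwrite=False)
# Salary stays 45650 in row 0; the NaN in row 1 becomes 61000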

An alternate approach is calling the DataFrame update() method with the join= argument set to "outer". Unfortunately, this raises the NotImplementedError: Only left join is supported error illustrated in Listing 5-33. (In order to run this example, copy and paste the code defining the master and trans DataFrames from Listing 5-32.) Remember, the update() method performs an in-place update.
>>> master.update(trans, join="outer")
NotImplementedError: Only left join is supported
Listing 5-33

DataFrame update() Outer Join

The correct alternative is to call the merge() method as illustrated in Listing 5-34.
>>> master = pd.DataFrame({'ID2': ['023', '088', '099', '111'],
...           'Salary2': [45650 , 55350, 55100, 61625]})
>>> trans = pd.DataFrame({'ID2': ['023' , '088', '099', '111', '121'],
...           'Salary2': [45650 , 61000, 59100, 61625, 50000],
...                     'Bonus2': [2000 , np.NaN, np.NaN, 3000, np.NaN]})
>>> df6 = pd.merge(master, trans,
...                on='ID2', how="outer",
...                suffixes=('_Old', '_Updated'))
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2
0  023      45650.0            45650  2000.0
1  088      55350.0            61000     NaN
2  099      55100.0            59100     NaN
3  111      61625.0            61625  3000.0
4  121          NaN            50000     NaN
Listing 5-34

DataFrame Update Using merge() Method

The on='ID2' argument uses the ID2 column common to the master and trans DataFrames as the join key. The how='outer' argument performs an outer join, and suffixes=('_Old', '_Updated') adds a suffix to like-named columns from both DataFrames to disambiguate each DataFrame's column contribution.

Conditional Update

There are cases when updates need to be applied conditionally. SAS users are accustomed to thinking in terms of IF/THEN/ELSE logic for conditional updates. To help understand how this logic works with pandas, we use two examples. The first example defines a Python function to conditionally calculate tax rates based on the Salary2_Updated column in the df6 DataFrame. This example will look familiar to SAS users: it uses row iteration with if/else logic to calculate the new Taxes column.

The second, better-performing approach uses the loc() indexer to apply calculated values to the DataFrame.

Begin with Listing 5-35. Our first step is to copy the df6 DataFrame, created in Listing 5-34, to one called df7.
>>> #copy df6 to df7 to be used in second example
... df7 = df6.copy()
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2
0  023      45650.0            45650  2000.0
1  088      55350.0            61000     NaN
2  099      55100.0            59100     NaN
3  111      61625.0            61625  3000.0
4  121          NaN            50000     NaN
>>> def calc_taxes(row):
...     if row['Salary2_Updated'] <= 50000:
...         val = row['Salary2_Updated'] * .125
...     else:
...         val = row['Salary2_Updated'] * .175
...     return val
...
>>> df6['Taxes'] = df6.apply(calc_taxes, axis=1)
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2      Taxes
0  023      45650.0            45650  2000.0   5706.250
1  088      55350.0            61000     NaN  10675.000
2  099      55100.0            59100     NaN  10342.500
3  111      61625.0            61625  3000.0  10784.375
4  121          NaN            50000     NaN   6250.000
Listing 5-35

Conditional DataFrame Column Update with a Function

This example defines the calc_taxes function, which is applied by iterating over the DataFrame rows. The row variable is local to the function definition, and the function returns the val object when called. The function contains two conditions, if and else. To cascade a series of if statements, use the elif keyword following the first if statement.

The if and else statements define the Taxes column calculated at a 12.5% rate when the Salary2_Updated value is less than or equal to $50,000. Otherwise the tax rate is 17.5%.

The statement
df6['Taxes'] = df6.apply(calc_taxes, axis=1)

creates the Taxes column in the df6 DataFrame by calling the apply method with the calc_taxes function; the axis=1 argument indicates the function is applied across columns, that is, once per row.

The second approach for conditional update uses the .loc indexer illustrated in Listing 5-36.
>>> df7.loc[df7['Salary2_Updated'] <= 50000, 'Taxes'] = df7.Salary2_Updated * .125
>>> df7.loc[df7['Salary2_Updated'] >  50000, 'Taxes'] = df7.Salary2_Updated * .175
>>> print(df7)
    ID2 Salary2_Old Salary2_Updated  Bonus2      Taxes
0  023     45650.0           45650  2000.0   5706.250
1  088     55350.0           61000     NaN  10675.000
2  099     55100.0           59100     NaN  10342.500
3  111     61625.0           61625  3000.0  10784.375
4  121         NaN           50000     NaN   6250.000
Listing 5-36

Conditional DataFrame Update with loc() Indexer

There are two conditions: 12.5% tax rate on the Salary2_Updated column less than or equal to $50,000 is
df7.loc[df7['Salary2_Updated'] <=50000, 'Taxes'] = df7.Salary2_Updated * .125

One way to read this statement is to treat the syntax to the left of the comma (,) like a WHERE clause: the df7 DataFrame calls the loc() indexer to find rows where df7['Salary2_Updated'] is less than or equal to $50,000. To the right of the comma is the assignment applied when the condition is True: the value for the Taxes column (which is created on the DataFrame) is calculated as 12.5% of the value found in the Salary2_Updated column.

The second condition, 17.5% tax rate on the Salary2_Updated column greater than $50,000
df7.loc[df7['Salary2_Updated'] >50000, 'Taxes'] = df7.Salary2_Updated * .175

works in a similar fashion.
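
An equivalent vectorized alternative, offered as an aside, uses NumPy’s where function to compute the entire Taxes column in a single expression, assuming the df7 DataFrame from Listing 5-36:

import numpy as np

# np.where(condition, value_if_true, value_if_false) operates element-wise
df7['Taxes'] = np.where(df7['Salary2_Updated'] <= 50000,
                        df7['Salary2_Updated'] * .125,
                        df7['Salary2_Updated'] * .175)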

Specifically, the loc() indexer creates a Boolean mask, which is used to index the DataFrame and return those rows meeting the logical condition. This is illustrated in Listing 5-37. For the logical condition df7['Salary2_Updated'] <= 50000, only when the Boolean mask returns True is the value multiplied by .125. Likewise, for the logical condition df7['Salary2_Updated'] > 50000, only when the mask returns True is the value multiplied by .175.
>>> nl = '\n'
>>>
>>> print(nl,
...       "Boolean Mask for 'Salary2_Updated' <= 50000",
...       nl,
...       df7['Salary2_Updated'] <= 50000,
...       nl,
...       "Boolean Mask for 'Salary2_Updated' > 50000",
...       nl,
...   df7['Salary2_Updated'] > 50000)
Boolean Mask for 'Salary2_Updated' <= 50000
0    True
1    False
2    False
3    False
4    True
Name: Salary2_Updated, dtype: bool
Boolean Mask for 'Salary2_Updated' > 50000
0     False
1     True
2     True
3     True
4     False
Name: Salary2_Updated, dtype: bool
Listing 5-37

Boolean Mask for Logical Conditions

A conditional update with a SAS Data Step is illustrated in Listing 5-38.
4 data calc_taxes;
5    set new_pay;
6    if new_salary <= 50000 then
7       taxes = new_salary * .125;
8    else taxes = new_salary * .175;
9 run;
NOTE: There were 5 observations read from the dataset WORK.NEW_PAY.
NOTE: The dataset WORK.CALC_TAXES has 5 observations and 5 variables.
10 proc print data=calc_taxes;
11    id ID;
12 run;
Listing 5-38

SAS Conditional Update

The SAS IF/ELSE logic is similar to the calc_taxes function defined in Listing 5-35.

Figure 5-12 uses PROC PRINT to display the newly created taxes variable.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig12_HTML.jpg
Figure 5-12

SAS Conditional Update

Concatenation

The pandas library implements a concat() method similar in behavior to the SAS SET statement. It is used to “glue” DataFrames together both row-wise, like the SAS SET statement, and column-wise, like the SAS MERGE statement. Here we examine its behavior as an analog to the SAS SET statement.
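
While the remainder of this section focuses on the row-wise case, a brief sketch, not from the original examples, shows the column-wise case: passing axis=1 aligns DataFrames on their indexes and places their columns side by side:

df_a = pd.DataFrame({'X': [1, 2]}, index=['r1', 'r2'])
df_b = pd.DataFrame({'Y': [3, 4]}, index=['r1', 'r2'])
wide = pd.concat([df_a, df_b], axis=1)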

The concat() method signature is
pd.concat(objs, axis=0, join="outer", join_axes=None,
            ignore_index=False, keys=None, levels=None,
            names=None, verify_integrity=False,
            sort=None, copy=True)
Begin by creating the example DataFrames shown in Listing 5-39.
>>> loc1 = pd.DataFrame({'Onhand': [21, 79, 33, 81],
...                       'Price': [17.99, 9.99, 21.00, .99]},
...                        index = ['A0', 'A1', 'A2', 'A3'])
>>>
>>> loc2 = pd.DataFrame({'Onhand': [12, 33, 233, 45],
...                       'Price': [21.99, 18.00, .19, 23.99]},
...                        index = ['A4', 'A5', 'A6', 'A7'])
>>>
>>> loc3 = pd.DataFrame({'Onhand': [37, 50, 13, 88],
...                       'Price': [9.99, 5.00, 22.19, 3.99]},
...                        index = ['A8', 'A9', 'A10', 'A11'])
>>> frames = [loc1, loc2, loc3]
>>> all = pd.concat(frames)
>>> print(all)
     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-39

DataFrames for concat() Method

This example uses the DataFrame() constructor to create the three DataFrames, loc1, loc2, and loc3. The objects are placed into a Python list and assigned to the frames object with the syntax
frames = [loc1, loc2, loc3]
The three DataFrames are concatenated by calling the concat() method using the syntax
all = pd.concat(frames)

creating the output all DataFrame.

The analog SAS program is shown in Listing 5-40. It creates the three SAS datasets, loc1, loc2, and loc3, and uses the SET statement to concatenate them, producing the all dataset.
4 data loc1;
5 length id $ 3;
6 input id $
7       onhand
8       price;
9 list;
10 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
12          A0 21 17.99
13          A1 79 9.99
14          A2 33 21
15          A3 81 .99
NOTE: The dataset WORK.LOC1 has 4 observations and 3 variables.
16  ;;;;
17 data loc2;
18 length id $ 3;
19  input id $
20       onhand
21       price;
22 list;
23 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
24          A4 12 21.99
25          A5 33 18
26          A6 233 .19
27          A7 45 23.99
NOTE: The dataset WORK.LOC2 has 4 observations and 3 variables.
28  ;;;;
29  data loc3;
30  length id $ 3;
31  input id $
32        onhand
34        price;
35  list;
36  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
37          A8 37 9.99
38          A9 50 5
39          A10 13 22.19
40          A11 88 3.99
NOTE: The dataset WORK.LOC3 has 4 observations and 3 variables.
41 ;;;;
42 data all;
43
44    set loc1
45        loc2
46        loc3;
47  run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 3 variables.
48 proc print data=all;
49   run;
Listing 5-40

SAS SET Statement

The PROC SQL UNION ALL set operator is an alternative to the SAS SET statement for creating the all table. This example is illustrated in Listing 5-41.
4 proc sql;
5    create table all as
6 select * from loc1
7    union all
8 select * from loc2
9    union all
10 select * from loc3;
NOTE: Table WORK.ALL created, with 12 rows and 3 columns.
11
12  select * from all;
13  quit;
Listing 5-41

PROC SQL UNION ALL

The PROC SQL output is displayed in Figure 5-13.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig13_HTML.jpg
Figure 5-13

PROC SQL Creating ALL Table

The concat() method is able to construct a hierarchical index by providing the keys= argument to form the outermost level. Listing 5-42 illustrates this feature.
>>> all = pd.concat(frames, keys=['Loc1', 'Loc2', 'Loc3'])
>>> print(all)
          Onhand  Price
Loc1 A0       21  17.99
     A1       79   9.99
     A2       33  21.00
     A3       81   0.99
Loc2 A4       12  21.99
     A5       33  18.00
     A6      233   0.19
     A7       45  23.99
Loc3 A8       37   9.99
     A9       50   5.00
     A10      13  22.19
     A11      88   3.99
>>> all.loc['Loc3']
     Onhand  Price
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-42

Hierarchical Index from concat() Method

With the hierarchical index in place, we can easily identify subsets using the loc() indexer discussed in Chapter 4, “Indexing and GroupBy.” In this example, the loc() indexer slices those rows belonging to the original loc3 DataFrame.

The IN= dataset option makes it possible to identify which SAS dataset contributed each observation, as illustrated in Listing 5-43.
4 data all;
5 length id $ 3;
6    set loc1 (in=l1)
7        loc2 (in=l2)
8        loc3 (in=l3);
9  if l1 then location = 'Loc1';
10 if l2 then location = 'Loc2';
11 if l3 then location = 'Loc3';
12 run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 4 variables.
13
14  proc print data = all(where=(location='Loc3'));
15     id id;
16     var onhand price;
17  run;
Listing 5-43

SET Statement Using IN=

The IF statements create the location variable, using the IN= dataset option to identify which dataset contributed each observation. To select those observations contributed by the loc3 dataset, apply a WHERE= filter; in this example, WHERE=(location='Loc3') is applied when calling PROC PRINT. The results are displayed in Figure 5-14.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig14_HTML.jpg
Figure 5-14

SET with IN= Option

Similar to SAS PROC APPEND, the pandas library provides an append() method. Consider Listing 5-44. This method is a convenience for calling the concat() method, and its syntax is likely more natural for SAS users.
>>> all_parts = loc1.append([loc2, loc3])
>>> print(all_parts)
     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-44

pandas append() Method

DataFrame loc1 calls the append() method to append the loc2 and loc3 DataFrames. For this example, the equivalent syntax using the concat() method is
all_parts = pd.concat([loc1, loc2, loc3], join="outer")
SAS uses PROC APPEND to append observations from a dataset to a base dataset as illustrated in Listing 5-45.
4 proc append base = loc1
5             data = loc2;
6 run;
NOTE: Appending WORK.LOC2 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 8 observations and 3 variables.
7 proc append base = loc1
8             data = loc3;
NOTE: Appending WORK.LOC3 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 12 observations and 3 variables.
9  proc print data=loc1;
10 run;
Listing 5-45

SAS PROC APPEND

In cases where more than one dataset is being appended, multiple calls to PROC APPEND are needed. Appending can perform better than the SET statement when adding smaller datasets to a larger dataset, because PROC APPEND avoids reading observations in the BASE= dataset by positioning the record pointer at its end. Observations from the DATA= dataset are then written to the end of the BASE= dataset. Figure 5-15 displays the loc1 dataset after the append operations.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig15_HTML.jpg
Figure 5-15

PROC APPEND Results

Finding Column Min and Max Values

A common task is finding the minimum and maximum values in a column. The pandas min() and max() methods return the minimum and maximum column values, respectively. This is illustrated in Listing 5-46.
>>> all_parts['Price'].max()
23.99
>>> all_parts['Price'].min()
0.19
>>> all_parts[all_parts['Price']==all_parts['Price'].max()]
    Onhand  Price
A7      45  23.99
Listing 5-46

Return DataFrame Column Min and Max

The first two statements return the maximum and minimum values for the Price column in the all_parts DataFrame. The last statement returns the row containing the maximum value of Price using the [ ] indexer with a Boolean comparison.
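
Relatedly, the idxmax() and idxmin() methods return the index label of the row holding the extreme value; a brief sketch using the all_parts DataFrame:

all_parts['Price'].idxmax()   # returns 'A7', the label of the maximum Price
all_parts['Price'].idxmin()   # returns 'A6', the label of the minimum Price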

Listing 5-47 illustrates the same logic with PROC SQL.
4 proc sql;
5 select min(price) as Price_min
6      , max(price) as Price_max
7 from loc1;
8 quit;
Listing 5-47

Return Dataset Column Min and Max

Figure 5-16 displays the PROC SQL result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig16_HTML.jpg
Figure 5-16

Returned Column Min and Max

Sorting

Both the pandas library and SAS provide sort methods and options for controlling how values are sorted. Examine Listing 5-48. It creates the df DataFrame and calls the sort_values() method with a Python list of column names indicating a multi-key sort.
>>> df = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3', 'A4', '5A', '5B'],
...                   'Age': [21, 79, 33, 81, np.NaN, 33, 33],
...                  'Rank': [1, 2, 3, 3, 4, 5, 6]})
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> df.sort_values(by=['Age', 'Rank'])
   ID   Age  Rank
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3
4  A4   NaN     4
Listing 5-48

pandas sort_values() Method

Both the sort_values() method and PROC SORT use ascending as the default sort order. Listing 5-49 is the analog, illustrating a call to PROC SORT with a multi-key sort. Of course, the same dataset can be generated with PROC SQL and an ORDER BY statement.
4  data df;
5  length id $ 2;
6  input id $
7        age
8        rank;
9  list;
10  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
11          A0 21 1
12          A1 79 2
13          A2 33 3
14          A3 81 3
15          A4 .  4
16          5A 33 5
17          5B 33 6
NOTE: The dataset WORK.DF has 7 observations and 3 variables.
18  ;;;;
19
20 proc sort data = df;
21   by age rank;
22 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
23 proc print data = df;
24    id id;
25 run;
Listing 5-49

SAS PROC SORT

Figure 5-17 displays the results of the sort operation.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig17_HTML.jpg
Figure 5-17

PROC SORT Results

PROC SORT sorts missing values as the smallest numeric value. The default behavior for the pandas sort_values() method is to sort NaNs as the largest numeric values. The sort_values() method has the na_position= argument, which accepts values of 'first' or 'last' for placing NaNs at the beginning or the end, respectively, of a DataFrame.

Listing 5-50 illustrates overriding the default na_position= value of 'last' with 'first'.
>>> df.sort_values(by=['Age', 'Rank'], na_position="first")
   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3
Listing 5-50

Sort NaNs to Beginning of DataFrame

The sort_values() method and PROC SORT also differ with respect to where the sort occurs. By default, the sort_values() method returns a new DataFrame, since the inplace= argument defaults to False. By contrast, PROC SORT sorts the dataset in place unless an output dataset is specified with the OUT= option.
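A minimal sketch of the in-place behavior, sorting a copy (the df2 name is introduced here) so the df DataFrame used in later listings is left unchanged; with inplace=True, sort_values() returns None and modifies the DataFrame directly:
>>> df2 = df.copy()
>>> df2.sort_values(by=['Age', 'Rank'], inplace=True)
>>> df2.head(3)
   ID   Age  Rank
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5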

For a multi-key sort with the sort_values() method, the sort order is specified with the ascending= argument, supplying one True or False value for each column listed with the by= argument. Take, for example, Listing 5-51.
>>> df.sort_values(by=['Age', 'Rank'], na_position="first", ascending = (True, False))
   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
6  5B  33.0     6
5  5A  33.0     5
2  A2  33.0     3
1  A1  79.0     2
3  A3  81.0     3
Listing 5-51

Multi-key Sort Ascending and Descending

In this example, the Age column is sorted in ascending order and the Rank column in descending order. Notice the sorted output retains the original row labels; the sketch below shows how to renumber them.
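When a renumbered index is preferred after a sort, chaining reset_index() with drop=True discards the old labels; a minimal sketch using the same sort as Listing 5-51:
>>> df.sort_values(by=['Age', 'Rank'], na_position="first",
...                ascending=(True, False)).reset_index(drop=True).head(3)
   ID   Age  Rank
0  A4   NaN     4
1  A0  21.0     1
2  5B  33.0     6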

PROC SQL uses the keyword DESCENDING following a column name in an ORDER BY clause to indicate a descending sort order. Listing 5-52 illustrates the DESCENDING keyword altering the default sort order.
4 proc sql;
5    select * from df
6 order by age, rank descending;
7 quit;
Listing 5-52

PROC SQL ORDER BY

The results from PROC SQL with a multi-key sort are displayed in Figure 5-18.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig18_HTML.jpg
Figure 5-18

PROC SQL Multi-key ORDER BY

Finding Duplicates

In some cases, data entry errors lead to duplicate data values, non-unique key values, or duplicate rows. In Listing 5-25, we discuss validating the join relationship among tables: the validate= argument for the merge() method checks whether a join is of a particular type. This raises the question of how to find and remove duplicate key values.

Consider Listing 5-53. This example illustrates how to find duplicate key values using the duplicated() method.
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> dup_mask = df.duplicated('Age', keep="first")
>>> df_dups = df.loc[dup_mask]
>>> print(df_dups)
   ID   Age  Rank
5  5A  33.0     5
6  5B  33.0     6
Listing 5-53

Find DataFrame Duplicate Keys

The duplicated() method returns a Boolean Series marking duplicate rows, in this case limited to duplicate values in the Age column. By default, the duplicated() method considers all DataFrame columns; in other words, the default behavior is to identify duplicate rows.

The statement
dup_mask = df.duplicated('Age', keep="first")
defines a Boolean mask using the keep='first' argument for the duplicated() method. The keep= argument has the following three values:
  • first: Mark duplicates as True except for the first occurrence.

  • last: Mark duplicates as True except for the last occurrence.

  • False: Mark all duplicates as True.

The statement
df_dups = df.loc[dup_mask]

creates the df_dups DataFrame containing the duplicate values for the Age column. This form of conditional slicing is covered in detail in Chapter 4, “Indexing and GroupBy,” in the section “Conditional Slicing.”
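To return every row sharing a duplicated Age value, including first occurrences, set keep=False; a minimal sketch against the same df DataFrame:
>>> df.loc[df.duplicated('Age', keep=False)]
   ID   Age  Rank
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6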

Dropping Duplicates

The drop_duplicates() method returns a de-duplicated DataFrame based on its argument values. This feature is illustrated in Listing 5-54.
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> df_deduped = df.drop_duplicates('Age', keep = 'first')
>>> print(df_deduped)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
Listing 5-54

Drop Duplicate Column Values

The keep= argument has the same values and behavior as the keep= argument for the duplicated() method.

Similarly, with SAS, the NODUPKEY and NODUPRECS options for PROC SORT are used to detect and remove duplicate values. The NODUPKEY option removes observations with duplicate BY values. This is analogous to the drop_duplicates() method with a column argument.

The PROC SORT NODUPRECS option is analogous to calling drop_duplicates() with no column subset. In each case, duplicate observations or rows, if found, are eliminated, keeping the first occurrence.
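A minimal sketch of whole-row de-duplication, using a hypothetical df9 DataFrame (introduced here) with one fully repeated row; drop_duplicates() with no arguments keeps the first occurrence:
>>> df9 = pd.DataFrame({'ID':  ['A0', 'A0', 'A1'],
...                     'Age': [21, 21, 79]})
>>> df9.drop_duplicates()
   ID  Age
0  A0   21
2  A1   79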

Consider Listing 5-55. This example replicates the logic for Listing 5-53 and Listing 5-54.
4 proc sort data = df nodupkey
5    out = df_deduped
6    dupout = df_dups;
7    by age;
8
9 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The dataset WORK.DF_DEDUPED has 5 observations and 3 variables.
NOTE: The dataset WORK.DF_DUPS has 2 observations and 3 variables.
10 proc print data = df;
11    id id;
12 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
13 proc print data = df_dups;
14    id id;
15 run;
NOTE: There were 2 observations read from the dataset WORK.DF_DUPS.
16 proc print data = df_deduped;
17    id id;
18 run;
NOTE: There were 5 observations read from the dataset WORK.DF_DEDUPED.
Listing 5-55

SAS Find and Drop Duplicates

In this example, the OUT= option creates the df_deduped dataset, dropping observations with duplicate values for the age variable found in the df dataset. Without the OUT= option, PROC SORT sorts in place, overwriting the df dataset. The NODUPKEY option removes observations with duplicate BY values for the age variable. The DUPOUT= option names the output dataset that receives the duplicate observations.

Figure 5-19 uses PROC PRINT to display the original dataset with duplicate age values.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig19_HTML.jpg
Figure 5-19

Original df Dataset

Figure 5-20 uses PROC PRINT to display the duplicate observations found using the NODUPKEY option that are written to the df_dups dataset using the DUPOUT= option.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig20_HTML.jpg
Figure 5-20

Duplicate Age Observations

Figure 5-21 uses PROC PRINT to display the de-duplicated dataset df_deduped.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig21_HTML.jpg
Figure 5-21

Deduped Observations

Sampling

Sampling is often a requirement in data analysis. The idea is to select a sample from a population and, by analyzing the sample, draw inferences about the population. In a simple random sample without replacement, every row has an equal chance of being selected, and once selected, a row cannot be chosen again. Listing 5-56 illustrates this concept.
>>> np.random.seed(987654)
>>> df = pd.DataFrame({'value': np.random.randn(360)},
...      index=pd.date_range('1970-01-01', freq="M", periods=360))
>>> print(df.head(5))
               value
1970-01-31 -0.280936
1970-02-28 -1.292098
1970-03-31 -0.881673
1970-04-30  0.518407
1970-05-31  1.829087
>>>
>>> samp1 = df.sample(n=100, replace=False)
>>> samp1.head(5)
               value
1993-10-31 -0.471982
1998-04-30 -0.906632
1980-09-30 -0.313441
1986-07-31 -0.872584
1975-01-31  0.237037
>>> print(samp1.shape)
(100, 1)
Listing 5-56

Simple Random Sample from a DataFrame

The df DataFrame is constructed using the numpy random.randn random number generator to draw 360 values from the standard normal distribution into the value column. The samp1 DataFrame is created by calling the sample() method with the n= argument setting the number of rows to draw. The replace= argument defaults to False, so the sample is drawn without replacement.
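For reproducible draws, the sample() method also accepts a random_state= argument to seed the sampler; the seed value 1234 below is arbitrary. A minimal sketch confirming that two calls with the same seed return the same rows:
>>> df.sample(n=5, random_state=1234).equals(df.sample(n=5, random_state=1234))
True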

The same logic for a simple random sample without replacement is illustrated in Listing 5-57.
4  data df;
5  do date = '01Jan1970'd to '31Dec2000'd by 31;
6     value = rand('NORMAL');
7     output;
8  end;
9  format date yymmdd10.;
10 run;
NOTE: The dataset WORK.DF has 366 observations and 2 variables.
11 data samp1 (drop = k n);
12 retain k 100 n;
13 if _n_ = 1 then n = total;
14   set df nobs=total;
15
16  if ranuni(654321) <= k/n then do;
17    output;
18       k = k -1;
19    end;
20 n = n -1;
21
22 if k = 0 then stop;
23 run;
NOTE: There were 360 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.SAMP1 has 100 observations and 2 variables.
24 proc print data = samp1 (obs=5);
25    id date;
26    var value;
27 run;
Listing 5-57

Simple Random Sample from a Dataset

The df dataset is generated using a DO/END block. The values assigned to the value variable are randomly generated from the standard normal distribution.

The samp1 dataset is created by reading the df dataset and retaining the k variable to the desired sample size of 100. The total variable holds the value for the number of observations read from the input df dataset using the NOBS= dataset option on the SET statement. When the first observation is read from the df dataset, the n variable is assigned the value from the total variable.

The IF-THEN/DO block is based on the logic that each remaining observation in the input dataset has a probability of k/n (initially 100/366, given the 366 observations created) of being selected for the sample. If a random number from the RANUNI function is less than or equal to k/n, the observation is included in the sample and k is decremented by one; n is decremented for every observation read. The iteration stops when k, which began with a retained value of 100, decrements to zero (0).

The pandas sample() method uses the frac= argument to specify the proportion of rows, for example, 30%, to include in the sample. This capability is illustrated in Listing 5-58.
>>> samp2 = df.sample(frac=.3, replace=True)
>>> print(samp2.head(5))
               value
1971-05-31  0.639097
1997-10-31  1.779798
1971-07-31  1.578456
1981-12-31  2.114340
1980-11-30  0.293887
>>> print(samp2.shape)
(108, 1)
Listing 5-58

Proportional Sample from a DataFrame with Replacement

The replace=True argument indicates the sample is drawn with replacement.
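The sample() method also supports weighted sampling through its weights= argument, which accepts a column name or a sequence of non-negative weights; rows with larger weights are more likely to be drawn. A sketch (the samp3 name is introduced here), weighting by the absolute value of the value column:
>>> samp3 = df.sample(n=10, weights=df['value'].abs(), random_state=99)
>>> samp3.shape
(10, 1)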

Convert Types

There are occasions when a value is assigned to a DataFrame column with a type that is not appropriate for a desired operation. For example, date values may have been read as strings, or numeric values may have been read as strings.

pandas provides the astype() method for converting types. This feature is illustrated in Listing 5-59. The DataFrame dtypes attribute returns each column’s type.
>>> df8 = pd.DataFrame({'String': ['2', '4', '6', '8'],
...                     'Ints'  : [1, 3, 5, 7]})
>>> df8.dtypes
String    object
Ints       int64
dtype: object
>>> df8['String'] = df8['String'].astype(float)
>>> df8['Ints']  = df8['Ints'].astype(object)
>>> df8.dtypes
String    float64
Ints       object
dtype: object
Listing 5-59

Converting Types

In this example, DataFrame df8 contains two columns, String and Ints, with types object and int64, respectively. After the calls to the astype() method, the String and Ints columns have data types float64 and object, respectively. The “Column Types” section in Chapter 3, “pandas Library,” provides more details on the common pandas types.
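Note that astype() raises an error when a string fails to parse. When malformed values are possible, the pd.to_numeric() function with errors='coerce' converts unparseable entries to NaN instead; a sketch using a hypothetical Series:
>>> s = pd.Series(['2', '4', 'six', '8'])
>>> pd.to_numeric(s, errors='coerce')
0    2.0
1    4.0
2    NaN
3    8.0
dtype: float64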

Rename Columns

Listing 5-60 illustrates calling the DataFrame rename() method to rename column labels.
>>> df8.rename(columns={"String": "String_to_Float",
...                     "Ints": "Ints_to_Object"},
... inplace=True)
>>> print(df8)
   String_to_Float Ints_to_Object
0              2.0              1
1              4.0              3
2              6.0              5
3              8.0              7
Listing 5-60

Rename Columns

The DataFrame rename() method accepts a Dictionary of key:value pairs where the key is the existing column name and the value is the new name. The default value for the inplace= argument is False.
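The columns= argument also accepts a function applied to every column label. A sketch lowercasing the labels; because inplace= defaults to False, df8 keeps the names shown above:
>>> df8.rename(columns=str.lower)
   string_to_float ints_to_object
0              2.0              1
1              4.0              3
2              6.0              5
3              8.0              7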

Map Column Values

Mapping column values is similar to the functionality provided by PROC FORMAT. Mapping allows translation of DataFrame column values into associated values, as illustrated in Listing 5-61.
>>> dow = pd.DataFrame({"Day_Num":[1,2,3,4,5,6,7]})
>>> dow_map={1:'Sunday',
...        2:'Monday',
...        3:'Tuesday',
...        4:'Wednesday',
...        5:'Thursday',
...        6:'Friday',
...        7:'Saturday'
...       }
>>>
>>> dow["Day_Name"] = dow["Day_Num"].map(dow_map)
>>> print(dow)
   Day_Num   Day_Name
0        1     Sunday
1        2     Monday
2        3    Tuesday
3        4  Wednesday
4        5   Thursday
5        6     Friday
6        7   Saturday
Listing 5-61

Mapping Column Values

The dow DataFrame is constructed with the Day_Num column holding the values 1 through 7. dow_map defines a Dictionary of key:value pairs where the keys match the values in the Day_Num column. The Dictionary values are the values mapped to these keys, in this case, weekday names. The statement
 dow["Day_Name"] = dow["Day_Num"].map(dow_map)

creates the Day_Name column in the dow DataFrame by calling the map() method with the dow_map Dictionary for the Day_Num column.
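Keys absent from the Dictionary map to NaN, so unlike a PROC FORMAT OTHER category, a default must be supplied explicitly, for example, with fillna(). A sketch using a hypothetical dow2 DataFrame with an out-of-range Day_Num value:
>>> dow2 = pd.DataFrame({"Day_Num": [1, 8]})
>>> dow2["Day_Name"] = dow2["Day_Num"].map(dow_map).fillna("Unknown")
>>> print(dow2)
   Day_Num Day_Name
0        1   Sunday
1        8  Unknown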

Transpose

Transposing restructures a DataFrame by turning its columns into rows and its rows into columns. The transpose does not happen in place; the T attribute returns a new transposed DataFrame, so in most cases the output DataFrame is captured with an assignment statement. Creating a new DataFrame by transposing an existing DataFrame is shown in Listing 5-62.
>>> uni = {'School'  :  ['NCSU',   'UNC',         'Duke'],
...        'Mascot'  :  ['Wolf',   'Ram',         'Devil'],
...       'Students' :  [22751,     31981,         19610],
...       'City'     :  ['Raleigh', 'Chapel Hill', 'Durham']}
>>> df_uni = pd.DataFrame(data=uni)
>>> print (df_uni)
  School Mascot  Students         City
0   NCSU   Wolf     22751      Raleigh
1    UNC    Ram     31981  Chapel Hill
2   Duke  Devil     19610       Durham
>>> df_uni.dtypes
School      object
Mascot      object
Students     int64
City        object
dtype: object
>>>
>>> t_df_uni = df_uni.T
>>> print(t_df_uni)
                0            1       2
School       NCSU          UNC    Duke
Mascot       Wolf          Ram   Devil
Students    22751        31981   19610
City      Raleigh  Chapel Hill  Durham
>>> t_df_uni.dtypes
0    object
1    object
2    object
dtype: object
>>>
>>> t_df_uni[0]
School         NCSU
Mascot         Wolf
Students      22751
City        Raleigh
Name: 0, dtype: object
Listing 5-62

Transpose a DataFrame

The example begins by defining the Dictionary uni. The statement
df_uni = pd.DataFrame(data=uni)

uses the uni Dictionary to create the df_uni DataFrame.

The call to create the transposed DataFrame and create the output t_df_uni DataFrame is the statement
t_df_uni = df_uni.T

In cases where the DataFrame has mixed data types, as this one does, the transposed DataFrame columns are returned as object (string) types. The Students column in the original df_uni DataFrame is int64, while columns 0–2 in the transposed DataFrame are object data types.

The [ ] indexer is used to return column 0 in this example.
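If arithmetic is needed on the transposed values, the object types must be converted back explicitly. A sketch recovering the Students row as int64 with the astype() method:
>>> t_df_uni.loc['Students'].astype('int64')
0    22751
1    31981
2    19610
Name: Students, dtype: int64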

Listing 5-63 illustrates transposing a SAS dataset.
4 data df_uni;
5 infile datalines dlm=',';
6        length school $ 4
7               mascot $ 5
8               city $ 11;
9        input school $
10              mascot $
11              students
12              city $;
13  list;
14  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--5
2922        NCSU, Wolf, 22751, Raleigh
2923        UNC, Ram, 31981, Chapel Hill
2924        Duke, Devil, 19610, Durham
NOTE: The dataset WORK.DF_UNI has 3 observations and 4 variables.
15  ;;;;
16
17 proc print data = df_uni;
18    id school;
19 run;
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
20
21 proc transpose data = df_uni
22                out = t_df_uni;
23 var school mascot students city;
24 run;
NOTE: Numeric variables in the input dataset will be converted to character in the output dataset.
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
NOTE: The dataset WORK.T_DF_UNI has 4 observations and 4 variables.
25
26 proc print data = t_df_uni;
27    run;
NOTE: There were 4 observations read from the dataset WORK.T_DF_UNI.
28
29 proc sql;
30    select name
31         , type
32    from dictionary.columns
33 where libname="WORK" and memname="T_DF_UNI";
34 quit;
Listing 5-63

SAS Transpose

The call to PROC TRANSPOSE transposes the input dataset df_uni and creates the transposed t_df_uni dataset using the OUT= option. Observe the note in the SAS log
Numeric variables in the input dataset will be converted to character in the output dataset.

The note indicates numeric variables, in this case the students variable, are converted to character. The PROC SQL code queries the SAS DICTIONARY table COLUMNS to display the column names and data types for the transposed t_df_uni dataset.

Figure 5-22 uses PROC PRINT to display the input dataset. Figure 5-23 uses PROC PRINT to display the transposed dataset. Figure 5-24 displays the PROC SQL results from the DICTIONARY.COLUMNS table.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig22_HTML.jpg
Figure 5-22

Input Dataset to Transpose

../images/440803_1_En_5_Chapter/440803_1_En_5_Fig23_HTML.jpg
Figure 5-23

Output Dataset from Transpose

../images/440803_1_En_5_Chapter/440803_1_En_5_Fig24_HTML.jpg
Figure 5-24

Transposed Dataset Variable Types

Summary

In this chapter, we examined a variety of essential data management tasks needed to reshape and organize data. The primary focus was to enable an understanding of joining/merging and reshaping DataFrames. These principles set the stage for the next chapter’s topic, “pandas Readers and Writers.”
