© Randy Betancourt, Sarah Chen 2019
R. Betancourt and S. Chen, Python for SAS Users, https://doi.org/10.1007/978-1-4842-5001-3_5

5. Data Management

Randy Betancourt1  and Sarah Chen2
(1)
Chadds Ford, PA, USA
(2)
Livingston, NJ, USA
 

In this chapter we discuss common data management tasks beginning with combining DataFrames. Other tasks discussed include sorting, finding duplicate values, drawing samples, and transposing. Every analysis task requires data to be organized into a specific form before the data is used to render meaningful results. Data often comes from multiple sources with a range of different formats. This requires you to logically relate and organize data to fit the analysis task. In fact, most of the effort for any data analysis is “wrangling” data to shape it appropriately.

pandas has two main facilities for combining DataFrames: set logic for concatenation and relational-algebra capabilities for join/merge operations. The concat() method performs row-wise or column-wise concatenation and applies union or intersection set logic on DataFrames. The examples explored in this chapter are analogous to the SAS Data Step SET statement and PROC APPEND.

The merge() method offers an SQL-like interface for performing DataFrame join/merge operations. The SAS MERGE statement and PROC SQL are the analogs used to introduce the merge() method.

The merge() method is like the SAS match-merge operation. Since data is rarely tidy, we also explore cases where key columns are inferred, key columns have different names, and merges involve missing key values.
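
As a preview, here is a minimal sketch, assuming two hypothetical DataFrames (emp and pay) whose key columns carry different names; the left_on= and right_on= arguments map the keys to one another:

import pandas as pd

# Hypothetical DataFrames; EmpID and ID hold the same key values
emp = pd.DataFrame({'EmpID': [1, 2, 3], 'Name': ['Ava', 'Bo', 'Cy']})
pay = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 55000]})

# left_on=/right_on= name the key column in each DataFrame
matched = pd.merge(emp, pay, left_on='EmpID', right_on='ID', how='inner')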

Start by constructing the left and right DataFrames illustrated in Listing 5-1. In this example, the DataFrames use the ID column as a common key.
>>> import pandas as pd
>>> url_l = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/Left.csv"
>>> left = pd.read_csv(url_l)
>>> url_r = "https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/Right.csv"
>>> right = pd.read_csv(url_r)
>>>
>>> print(left)
    ID       Name Gender   Dept
0  929     Gunter      M    Mfg
1  446  Harbinger      M    Mfg
2  228     Benito      F    Mfg
3  299   Rudelich      M  Sales
4  442  Sirignano      F  Admin
5  321   Morrison      M  Sales
6  321   Morrison      M  Sales
7  882     Onieda      F  Admin
>>> print(right)
    ID  Salary
0  929  45,650
1  446  51,290
2  228  62,000
3  299  39,800
4  442  44,345
5  871  70,000
Listing 5-1

Build Left and Right DataFrames

Notice the left DataFrame has '321' as a duplicate value in the ID column, making a many-to-one relationship between the DataFrames. Also notice the right DataFrame has '871' as an ID value not found in the left DataFrame. These are the types of issues that may cause unexpected results when performing merge/join operations. These two DataFrames are used in several examples throughout this chapter.

Listing 5-2 builds the same input data into the left and right datasets in SAS. With the SAS examples, we explore the use of both Data Step and PROC SQL logic as analogs to the merge() and concat() methods for DataFrames.
4 data left;
5 infile datalines dlm=',';
6        length name $ 12 dept $ 5;
7        input name $
8              id
9              gender $
10             dept;
11 list;
12 datalines;
RULE:      --+--1--+--2--+--3--+--4--+--5
13        Gunter,    929, M, Mfg
14        Harbinger, 446, M, Mfg
15        Benito,    228, F, Mfg
16        Rudelich,  299, M, Sales
17        Sirignano, 442, F, Admin
18        Morrison,  321, M, Sales
19        Morrison,  321, M, Sales
20        Onieda,    882, F, Admin
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
21 ;;;;
22
23 data right;
24    input id
25          salary;
26 list;
27 datalines;
RULE:      --+--1--+--2--+--3--+--4--+--5
28         929 45650
29         446 51290
30         228 62000
31         299 39800
32         442 44345
33         871 70000
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
34 ;;;;
35
36 proc print data=left;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
37 proc print data=right;
38 run;
Listing 5-2

Create Left and Right Datasets

Figure 5-1 uses PROC PRINT to display the left dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig1_HTML.jpg
Figure 5-1

Left Dataset

Figure 5-2 uses PROC PRINT to display the right dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig2_HTML.jpg
Figure 5-2

Right Dataset

SAS Sort/Merge

To begin, consider Listing 5-3, also referred to as match-merge. This program is a common pattern for match-merging two SAS datasets containing a common key. In this example, both the left and right SAS datasets are sorted by the id variable, enabling SAS BY-group processing. After sorting, the match-merge joins the datasets by the id variable.

Experienced SQL users know the results from this match-merge are the same as a FULL OUTER join in PROC SQL in cases where the table relationships are one-to-one or many-to-one. In cases where the table relationships are many-to-many, the results from the Data Step and PROC SQL differ. The many-to-many use cases for SAS and pandas are detailed in Appendix B at the end of the book. For the remainder of this chapter, our examples deal with one-to-one or many-to-one join relationships represented by the left and right DataFrames and datasets created in Listing 5-1 and Listing 5-2.

Listing 5-3 combines observations from the left and right datasets into a single observation in the new merge_lr dataset according to the values for the id variable found in both datasets.
4 proc sort data=left;
5     by id;
6 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
7 proc sort data=right;
8      by id;
9 run;
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
10 data merge_lr;
11    merge left
12          right;
13    by id;
14 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.MERGE_LR has 9 observations and 5 variables.
15 proc print data=merge_lr;
16    id id;
17 run;
NOTE: There were 9 observations read from the dataset WORK.MERGE_LR.
Listing 5-3

SAS Sort/Merge

Figure 5-3 uses PROC PRINT to display the resulting merge_lr dataset.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig3_HTML.jpg
Figure 5-3

Results from SAS Sort/Merge

To introduce the pandas merge() method, consider Listing 5-4. Using a single Data Step, the program creates seven output datasets, one for each of the following join operations:
  • Inner join

  • Right join

  • Left join

  • Outer join

  • Left join with no matched keys

  • Right join with no matched keys

  • Outer join with no matched keys

4 data inner
5      right
6      left
7      outer
8      nomatch_l
9      nomatch_r
10     nomatch;
11
12 merge left(in=l)
13       right(in=r);
14 by id;
15
16 if (l=1 and r=1) then output inner; *Inner Join;
17
18 if r = 1 then output right; * Right Join;
19
20 if l = 1 then output left;  * Left Join;
21
22 if (l=1 or r=1) then output outer; *Full Outer Join;
23
24 if (l=1 and r=0) then output nomatch_l;
25
26 if (l=0 and r=1) then output nomatch_r;
27
28 if (l=0 or r=0) then output nomatch;
29
30 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
NOTE: The dataset WORK.RIGHT has 6 observations and 5 variables.
NOTE: The dataset WORK.LEFT has 8 observations and 5 variables.
NOTE: The dataset WORK.OUTER has 9 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_L has 3 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_R has 1 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH has 4 observations and 5 variables.
Listing 5-4

Create Seven Output Datasets

Each of these seven join/merge operations is explored in detail along with its pandas counterpart.

Inner Join

An INNER JOIN selects only those rows whose key values are found in both tables; in other words, it returns the intersection of matched key values. The SAS Data Step for an INNER JOIN is shown in Listing 5-5. The id column in both datasets must have matching values to be included in the result set.
4  data inner;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if (l=1 and r=1) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
Listing 5-5

Data Step Inner Join

The dataset IN= option creates Boolean variables to indicate which dataset contributes values to the current observation being read. The IF statement applies a truth test selecting only those observations where the id variable has matched values in both the left and right datasets.

With the SAS example as a template, we illustrate the pandas merge() method. The merge() method signature is
pd.merge(left, right, how="inner", on=None, left_on=None,
         right_on=None,
         left_index=False, right_index=False, sort=False,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
Listing 5-6 uses the on='ID' argument to indicate the ID column is a key column found in both DataFrames. It turns out the on='ID' argument is not needed in this example, since the merge() method detects the presence of a column labeled ID in both DataFrames and automatically uses it as the key column. The how='inner' argument performs an inner join, which is the default. The sort= argument is set to False. Not surprisingly, merge() operations on large DataFrames gain substantial performance improvements by not having to return rows in sorted order.
>>> inner = pd.merge(left, right, on="ID", how="inner", sort=False)
>>> print(inner)
    ID       Name Gender   Dept  Salary
0  929     Gunter      M    Mfg  45,650
1  446  Harbinger      M    Mfg  51,290
2  228     Benito      F    Mfg  62,000
3  299   Rudelich      M  Sales  39,800
4  442  Sirignano      F  Admin  44,345
Listing 5-6

pandas Inner Join

Listing 5-7 illustrates the PROC SQL query for an INNER JOIN.
4 proc sql;
5    select *
6 from left
7     ,right
8 where left.id = right.id;
9 quit;
Listing 5-7

PROC SQL Inner Join

An alternative syntax uses the PROC SQL keywords INNER JOIN. The join predicate, l.id = r.id, must be true to be included in the result set.
proc sql;
select *
   from left as l
inner join
   right as r
on l.id = r.id;
quit;
Figure 5-4 displays the result set from PROC SQL.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig4_HTML.jpg
Figure 5-4

PROC SQL Inner Join Results

Right Join

A RIGHT JOIN returns all observations from the right dataset along with those observations from the left dataset whose id values match id values in the right dataset.

The SAS Data Step equivalent for a right join is illustrated in Listing 5-8.
4  data r_join;
5     merge left(in=l)
6           right(in=r);
7     by id;
8
9  if r=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN has 6 observations and 5 variables.
Listing 5-8

Data Step Right Join

The dataset IN= option creates Boolean variables to indicate whether each dataset contributed values to the current observation. The IF statement applies a truth test selecting all observations in the right dataset along with those observations in the left dataset where the id variable has matched values in the right dataset. Where an observation in the right dataset has no matching id in the left dataset, the values contributed by the left dataset are set to missing.

Listing 5-9 illustrates the compactness of the merge() method.
>>> r_join = pd.merge(left, right, how="right", sort=True)
>>> print(r_join)
    ID       Name Gender   Dept  Salary
0  228     Benito      F    Mfg  62,000
1  299   Rudelich      M  Sales  39,800
2  442  Sirignano      F  Admin  44,345
3  446  Harbinger      M    Mfg  51,290
4  871        NaN    NaN    NaN  70,000
5  929     Gunter      M    Mfg  45,650
Listing 5-9

pandas Right Join

The merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned r_join DataFrame. In cases where rows in the right DataFrame have no matching ID in the left DataFrame, the values contributed by the left DataFrame are set to NaN.

Listing 5-10 illustrates a right join with PROC SQL.
4  proc sql;
5     select coalesce(left.id, right.id) as id
6           ,name
7           ,dept
8           ,gender
9           ,salary
10      from left
11  right join
12      right
13  on left.id = right.id;
14  quit;
Listing 5-10

PROC SQL Right Join

The COALESCE function combines the id columns from both tables into a single column. Without it, the result set returns an ID column from each table, with the left table's ID column containing nulls for rows having no match in the right table.

Figure 5-5 displays the result set created from the right join.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig5_HTML.jpg
Figure 5-5

RIGHT JOIN Results

As the output shows, all rows from the right table are returned, and PROC SQL assigns missing values to columns from the left table for rows in the right table having no id match in the left table.

Left Join

A left join returns all rows from the left dataset along with any rows from the right table where the join predicate is true. In this case, all rows from the left are returned along with any rows in the right with id values matching in the left. Listing 5-11 illustrates a left join.
4  data l_join;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if l=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN has 8 observations and 5 variables.
Listing 5-11

Data Step Left Join

Listing 5-12 illustrates a left join on the left and right tables.
>>> l_join = pd.merge(left, right, how="left", sort=False)
>>> print(l_join)
    ID       Name Gender   Dept  Salary
0  929     Gunter      M    Mfg  45,650
1  446  Harbinger      M    Mfg  51,290
2  228     Benito      F    Mfg  62,000
3  299   Rudelich      M  Sales  39,800
4  442  Sirignano      F  Admin  44,345
5  321   Morrison      M  Sales     NaN
6  321   Morrison      M  Sales     NaN
7  882     Onieda      F  Admin     NaN
Listing 5-12

pandas Left Join

Like the pandas right join example in Listing 5-9, the merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned l_join DataFrame. The output shows all rows from the left DataFrame are returned, with the merge() method assigning NaNs to columns from the right DataFrame for rows whose ID values have no match in the right DataFrame.

Listing 5-13 illustrates a left join with PROC SQL.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6          ,name
7          ,dept
8          ,gender
9          ,salary
10     from left
11 left join
12     right
13 on left.id = right.id;
14 quit;
Listing 5-13

PROC SQL Left Join

Similar to Listing 5-10, this example uses the COALESCE function to combine the id columns from both tables into a single column. Figure 5-6 displays the PROC SQL output.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig6_HTML.jpg
Figure 5-6

Left Join Results

Outer Join

Earlier we stated that SAS Sort/Merge returns the same result set as a PROC SQL FULL OUTER JOIN in cases where the table relationships are one-to-one or many-to-one. The results from this example are the same as the SAS Sort/Merge results. See Listing 5-3 for the Data Step source program.

Listing 5-14 illustrates use of the how='outer' argument to perform an outer join, selecting all rows from the left and right tables.
>>> merge_lr = pd.merge(left, right, on="ID", how="outer", sort=True)
>>> print(merge_lr)
    ID       Name Gender   Dept  Salary
0  228     Benito      F    Mfg  62,000
1  299   Rudelich      M  Sales  39,800
2  321   Morrison      M  Sales     NaN
3  321   Morrison      M  Sales     NaN
4  442  Sirignano      F  Admin  44,345
5  446  Harbinger      M    Mfg  51,290
6  871        NaN    NaN    NaN  70,000
7  882     Onieda      F  Admin     NaN
8  929     Gunter      M    Mfg  45,650
Listing 5-14

pandas Outer Join

Listing 5-15 illustrates a PROC SQL outer join.
4  proc sql;
5     select coalesce(left.id, right.id)
6           ,name
7           ,dept
8           ,salary
9      from left
10 full join
11      right
12 on left.id=right.id;
13 quit;
Listing 5-15

PROC SQL Outer Join

All rows from the left and right tables are returned. In cases where there are no matched values for the id variable, the values are set to missing.

Figure 5-7 displays the result set from an outer join on the left and right tables with PROC SQL.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig7_HTML.jpg
Figure 5-7

PROC SQL Outer Join

Right Join Unmatched Keys

Up to this point, the examples are based on finding matching key values in the data to be joined. The next three examples illustrate joining data where keys are not matched.

Every SQL join is either a Cartesian product or a subset of one. In cases involving unmatched key values, a form of WHERE processing is required. The SAS Data Step with the IN= option and associated IF logic is a common pattern for this type of filtering; PROC SQL with a WHERE clause works as well.

The next three examples illustrate the indicator= argument for the pandas merge() method as an analog to the SAS IN= dataset option. For pandas, the filtering process utilizes a Boolean comparison based on the indicator= value.

Consider Listing 5-16.
4  data r_join_nmk;
5  merge left(in=l)
6         right(in=r);
7  by id;
8
9  if (l=0 and r=1) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN_NMK has 1 observations and 5 variables.
Listing 5-16

Data Step Unmatched Keys in Right

The statement
 if (l=0 and r=1) then output;

behaves as a WHERE filter matching on the id column in both datasets, returning rows from the right dataset with no matched values in the left dataset.

To perform a right join on unmatched keys on DataFrames, use the indicator= argument to the merge() method. This argument adds a column to the output DataFrame, named _merge by default, indicating the source of each row. The returned values are
left_only
right_only
both
By applying a Boolean filter to the indicator= values, we replicate the behaviors for the SAS IN= dataset option for merge operations. Consider Listing 5-17.
>>> nomatch_r = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> print(' ',
...       nomatch_r,
...       ' ')
     ID       Name Gender   Dept  Salary      in_col
0  929     Gunter      M    Mfg  45,650        both
1  446  Harbinger      M    Mfg  51,290        both
2  228     Benito      F    Mfg  62,000        both
3  299   Rudelich      M  Sales  39,800        both
4  442  Sirignano      F  Admin  44,345        both
5  321   Morrison      M  Sales     NaN   left_only
6  321   Morrison      M  Sales     NaN   left_only
7  882     Onieda      F  Admin     NaN   left_only
8  871        NaN    NaN    NaN  70,000  right_only
>>> nomatch_r[(nomatch_r['in_col'] == 'right_only')]
    ID Name Gender Dept  Salary      in_col
8  871  NaN    NaN  NaN  70,000  right_only
Listing 5-17

pandas Right Join Unmatched Keys

The indicator= argument adds the in_col column to the nomatch_r DataFrame in this example. We print the nomatch_r DataFrame as an intermediate step to display values for the in_col column added by the indicator= argument.

The Boolean expression
nomatch_r[(nomatch_r['in_col'] == 'right_only')]
is a subsetting operation to create the nomatch_r DataFrame with the Boolean test
(nomatch_r['in_col'] == 'right_only')

selecting rows where the in_col column value is 'right_only'.

The SAS analog to Listing 5-17 is produced by adding a WHERE filter to the example from Listing 5-10, as illustrated in Listing 5-18.
4  proc sql;
5     select coalesce(left.id, right.id) as id
6           ,name
7           ,dept
8           ,gender
9           ,salary
10
11  from left
12      right join right on left.ID = right.ID
13  where left.ID is NULL;
14  quit;
Listing 5-18

PROC SQL RIGHT JOIN Unmatched Keys

The WHERE clause returns those rows from the right table having no matching id values in the left table. Values for columns contributed by the left table are set to missing.

Figure 5-8 displays the result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig8_HTML.jpg
Figure 5-8

PROC SQL Right Join No Matched Keys

Left Join Unmatched Keys

To find the unmatched key values in the left table, consider Listing 5-19.
4  data l_join_nmk;
5  merge left(in=l)
6        right(in=r);
7  by id;
8
9  if (l=1 and r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN_NMK has 3 observations and 5 variables.
Listing 5-19

Data Step Unmatched Keys in Left

The statement
 if (l=1 and r=0) then output;

reverses the test from Listing 5-16. The IF statement behaves like a WHERE filter matching on the id column in both datasets, returning just the observations from the left dataset having no matched id values in the right dataset.

Finding the unmatched key values in the left DataFrame is illustrated in Listing 5-20. Notice the call to the merge() method is the same as in Listing 5-17; here, however, we create the nomatch_l DataFrame, and the subsequent filtering logic differs.
>>> nomatch_l = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> nomatch_l = nomatch_l[(nomatch_l['in_col'] == 'left_only')]
>>>
>>> print(' ',
...        nomatch_l,
...       ' ')
     ID      Name Gender   Dept Salary     in_col
5  321  Morrison      M  Sales    NaN  left_only
6  321  Morrison      M  Sales    NaN  left_only
7  882    Onieda      F  Admin    NaN  left_only
Listing 5-20

pandas Left Join Unmatched Keys

The statement
nomatch_l = nomatch_l[(nomatch_l['in_col'] == 'left_only')]
is the subsetting logic to create the nomatch_l DataFrame with the Boolean test
(nomatch_l['in_col'] == 'left_only')

selecting rows where the in_col column value is 'left_only'.

The same result set is generated by simply adding a WHERE filter to the example from Listing 5-13. This is illustrated in Listing 5-21.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6          ,name
7          ,dept
8          ,gender
9          ,salary
10      from left
11 left join
12     right
13 on left.id = right.id
14 where right.id is null;
15 quit;
Listing 5-21

PROC SQL Left Join on Unmatched Keys

The WHERE clause returns those rows from the left table having no matching id values in the right table. Values for columns contributed by the right table are set to missing.

Figure 5-9 displays the result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig9_HTML.jpg
Figure 5-9

PROC SQL Left Join on Unmatched Keys

Outer Join Unmatched Keys

An outer join on unmatched keys returns rows from each dataset with unmatched keys in the other dataset. To find the unmatched key values in the left or right dataset, consider Listing 5-22.
4  data outer_nomatch_both;
5      merge left (in=l)
6            right (in=r);
7      by id;
8
9  if (l=0 or r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.OUTER_NOMATCH_BOTH has 4 observations and 5 variables.
Listing 5-22

Data Step Outer Join Unmatched Keys in Both

The statement
 if (l=0 or r=0) then output;

behaves like a WHERE filter matching on the id column in both datasets, returning rows from the left dataset with no matched values in the right dataset or rows from the right dataset with no matched values in the left dataset.

Finding unmatched keys in both DataFrames requires an outer join with a corresponding filter identifying the null or missing values. Consider Listing 5-23.
>>> nomatch_both = pd.merge(left, right, on="ID", how="outer", sort=False, indicator="in_col")
>>> print(' ',
...       nomatch_both,
...       ' ')
     ID       Name Gender   Dept  Salary      in_col
0  929     Gunter      M    Mfg  45,650        both
1  446  Harbinger      M    Mfg  51,290        both
2  228     Benito      F    Mfg  62,000        both
3  299   Rudelich      M  Sales  39,800        both
4  442  Sirignano      F  Admin  44,345        both
5  321   Morrison      M  Sales     NaN   left_only
6  321   Morrison      M  Sales     NaN   left_only
7  882     Onieda      F  Admin     NaN   left_only
8  871        NaN    NaN    NaN  70,000  right_only
>>>
>>> nomatch_both[(nomatch_both['in_col'] == 'right_only') |
...              (nomatch_both['in_col'] == 'left_only')]
    ID      Name Gender   Dept  Salary      in_col
5  321  Morrison      M  Sales     NaN   left_only
6  321  Morrison      M  Sales     NaN   left_only
7  882    Onieda      F  Admin     NaN   left_only
8  871       NaN    NaN    NaN  70,000  right_only
Listing 5-23

pandas Outer Join Unmatched Keys in Both

The nomatch_both DataFrame holds the rows resulting from the outer join. The indicator= argument adds the in_col column to the DataFrame, whose values are then subject to a Boolean test identifying the unmatched rows.

The statement
nomatch_both[(nomatch_both['in_col'] == 'right_only') |
             (nomatch_both['in_col'] == 'left_only')]
is the filtering logic to create the nomatch_both DataFrame with the Boolean test
(nomatch_both['in_col'] == 'right_only') |
(nomatch_both['in_col'] == 'left_only')

selecting rows where the in_col column value is 'left_only' or where the in_col column value is 'right_only'.

An alternative Boolean expression that is more Pythonic is
nomatch_both = nomatch_both[nomatch_both["in_col"] != 'both']

selecting those rows where the in_col column value is not 'both'.
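
Another equivalent filter, offered as an aside rather than taken from the original examples, uses the isin() method to test membership in a list of values:
nomatch_both[nomatch_both['in_col'].isin(['left_only', 'right_only'])]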

Listing 5-24 illustrates finding the unmatched key values in both the right and left tables.
4 proc sql;
5    select coalesce(left.id, right.id)
6          ,name
7          ,dept
8          ,salary
9     from left
10 full join
11    right
12 on left.id=right.id
13 where left.id ne right.id;
14 quit;
Listing 5-24

PROC SQL Outer Join Unmatched Keys in Both

The WHERE clause returns rows from the left table having no matching id values in the right table and rows from the right table having no matching id values in the left table. Figure 5-10 displays the PROC SQL output.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig10_HTML.jpg
Figure 5-10

Outer Join Unmatched Keys Found in Both Tables

Validate Keys

As part of any data management task, one must understand the relationships among the tables being joined or merged. For example, join results may not be what you expect if tables believed to have unique key values turn out to have non-unique keys instead. Fortunately, both the pandas library and SAS have methods to detect and enforce join key uniqueness when this is a requirement. Consider Listing 5-25.
>>> dups = pd.merge(left, right, on="ID", how="outer", sort=False, validate="one_to_one" )
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
Listing 5-25

pandas Validate 1:1 Relationship

In this example, the validate= argument determines if the merge is of a specified type of relationship. The valid values for validate= are
  • one_to_one or 1:1: Check if keys are unique in both the left and right DataFrames.

  • one_to_many or 1:m: Check if keys are unique in the left DataFrame.

  • many_to_one or m:1: Check if keys are unique in the right DataFrame.

  • many_to_many or m:m: Allowed, but does not result in a check.

The default value is None. In this example, validate="one_to_one" raises a pandas.errors.MergeError, reporting that the merge key, ID, is not unique in the left DataFrame.
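
Because validation failures raise an exception, they can be trapped with a try/except block. The following is a sketch, not from the original examples, reusing the left and right DataFrames from Listing 5-1; since ID values are unique in the right DataFrame, a many_to_one validation succeeds where one_to_one fails:

try:
    dups = pd.merge(left, right, on="ID", how="outer", validate="one_to_one")
except pd.errors.MergeError as e:
    print("Validation failed:", e)

# Passes: ID is unique in the right DataFrame; duplicates allowed in left
checked = pd.merge(left, right, on="ID", how="outer", validate="many_to_one")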

Joining on an Index

Recall in Chapter 4, “Indexing and GroupBy,” we detailed the use of indexes as row labels for DataFrames. The pandas library implements the join() method to combine columns from two differently indexed DataFrames into a single DataFrame. The join() method is a convenience for calling the merge() method when joining on indexes, or when joining a key column in one DataFrame against the index of the other.

Consider Listing 5-26. In this example, the left and right DataFrames use an index to label the rows.
>>> import pandas as pd
>>> left = pd.DataFrame(
...    { 'Style'  :  ['S1', 'S2', 'S3', 'S4'],
...      'Size'   :  ['SM', 'MD', 'LG', 'XL']},
...      index =     ['01', '02', '03', '05'])
>>> right = pd.DataFrame(
...    { 'Color' :  ['Red', 'Blue', 'Cyan', 'Pink'],
...      'Brand' :  ['X', 'Y', 'Z', 'J']},
...      index =    ['01', '02', '03', '04'])
>>> print(left)
   Style Size
01    S1   SM
02    S2   MD
03    S3   LG
05    S4   XL
>>> print(right)
   Color Brand
01   Red     X
02  Blue     Y
03  Cyan     Z
04  Pink     J
>>>
>>> df1 = left.join(right, how="left")
>>> print(df1)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
05    S4   XL   NaN   NaN
Listing 5-26

Left Join on Indexed DataFrames

This example creates the left DataFrame with Python lists of values, assigning '01', '02', '03', and '05' as the row labels. Similarly, the right DataFrame is created with an index for row labels, '01' to '04'. The df1 DataFrame is created by calling the join() method, which joins either on an index (as in this case) or on a designated key column.

In this example, the call to the join() method performs the default left join with the how='left' argument.

The index labels from both DataFrames are preserved in the new df1 DataFrame according to the type of join performed. Observe how the print() function displays the index as row labels in the df1 DataFrame.

We can appreciate how the join() method is a convenience for the merge() method, reducing the amount of typing needed, since the corresponding merge() syntax to produce the same results is
df1 = left.merge(right, how="left", left_index=True, right_index=True)
The join() method provides four joining methods:
  • Left: Uses the calling DataFrame’s index, or a key column, if specified. This is the default join() method.

  • Right: Uses the other DataFrame’s index (a brief sketch follows this list).

  • Outer: Returns union of calling DataFrame’s index with the other DataFrame index and sorts the index.

  • Inner: Returns intersection of calling DataFrame’s index with the other DataFrame index and preserves the order of the calling DataFrame index.
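
The left, outer, and inner joins are demonstrated in the listings that follow. For completeness, here is a brief sketch of a right join on the indexed DataFrames from Listing 5-26, which keeps the right DataFrame’s index:

# All rows from right are kept; left rows without a matching index drop out
df_r = left.join(right, how="right")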

Consider Listing 5-27.
>>> df2 = left.join(right, how="outer")
>>> print(df2)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
04   NaN  NaN  Pink     J
05    S4   XL   NaN   NaN
Listing 5-27

Outer Join on Indexed DataFrames

The how='outer' argument enables an outer join.

An inner join operation on the indexed DataFrames is illustrated in Listing 5-28.
>>> df3 = left.join(right, how="inner")
>>> print(df3)
   Style Size Color Brand
01    S1   SM   Red     X
02    S2   MD  Blue     Y
03    S3   LG  Cyan     Z
Listing 5-28

Inner Join on Index DataFrames

In both Listing 5-27 and Listing 5-28, the indexes labeling the rows are the join keys and remain a part of the joined DataFrames.

Join Key Column with an Index

In those cases where a DataFrame is not indexed, use the on= argument to identify the key column used in a join operation. Observe in Listing 5-29 that the left DataFrame has only the default RangeIndex, while the right DataFrame has a custom index.
>>> left = pd.DataFrame(
...    {'Style'  :  ['S1', 'S2', 'S3', 'S4'],
...     'Size'   :  ['SM', 'MD', 'LG', 'XL'],
...     'Key'    :  ['01', '02', '03', '05']})
>>> right = pd.DataFrame(
...    {'Color' :  ['Red', 'Blue', 'Cyan', 'Pink'],
...     'Brand' :  ['X', 'Y', 'Z', 'J']},
...      index =   ['01', '02', '03', '04'])
>>> print(left)
  Style Size Key
0    S1   SM  01
1    S2   MD  02
2    S3   LG  03
3    S4   XL  05
>>> print(right)
   Color Brand
01   Red     X
02  Blue     Y
03  Cyan     Z
04  Pink     J
>>>
>>> df4 = left.join(right, on="Key", how="outer")
>>> print(df4)
  Style Size Key Color Brand
0    S1   SM  01   Red     X
1    S2   MD  02  Blue     Y
2    S3   LG  03  Cyan     Z
3    S4   XL  05   NaN   NaN
3   NaN  NaN  04  Pink     J
Listing 5-29

Outer Join Key Column with Index

The on= argument identifies the column called Key in the left DataFrame as the join key, which is matched against the index of the right DataFrame.

As pointed out previously, the join() method is a convenience for calling the merge() method to join an indexed DataFrame with a non-indexed DataFrame. The same result set from Listing 5-29 is generated with Listing 5-30.
>>> df5 = pd.merge(left, right, left_on="Key", how="outer", right_index=True)
>>>
>>> print(df5)
  Style Size Key Color Brand
0    S1   SM  01   Red     X
1    S2   MD  02  Blue     Y
2    S3   LG  03  Cyan     Z
3    S4   XL  05   NaN   NaN
3   NaN  NaN  04  Pink     J
Listing 5-30

Merge Key Column with an Index

In this example, the how='outer' argument calls for an outer join using the left_on='Key' argument to designate the join key on the left DataFrame. The right_index=True argument designates the index as the join key on the right DataFrame.
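
A design alternative, offered as an aside, is to promote the Key column to an index with the set_index() method so both DataFrames join on their indexes; note the result then carries Key as the index rather than as a column:

df5_alt = left.set_index('Key').join(right, how="outer")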

Update

Update operations are used in cases where a master table contains original data values and a transaction dataset, typically shaped like the master dataset, contains new values for updating the master. In SAS, an observation from the transaction dataset that does not correspond to any observation in the master dataset becomes a new observation. Begin by observing the behavior of the SAS UPDATE statement in Listing 5-31.
4 data master;
5    input ID salary;
6 list;
7 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
8           023 45650
9           088 55350
10          099 55100
11          111 61625
NOTE: The dataset WORK.MASTER has 4 observations and 2 variables.
12 ;;;;
13
14 data trans;
15 infile datalines dlm=',';
16 input ID
17       salary
18       bonus;
19 list;
20  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
21        023, 45650, 2000
22        088, 61000,
23        099, 59100,
24        111, 61625, 3000
25        121, 50000,
NOTE: The dataset WORK.TRANS has 5 observations and 3 variables.
26 ;;;;
27 data new_pay;
28    update master(rename=(salary = old_salary))
29    trans (rename=(salary = new_salary));
30 by id;
31 run;
NOTE: There were 4 observations read from the dataset WORK.MASTER.
NOTE: There were 5 observations read from the dataset WORK.TRANS.
NOTE: The dataset WORK.NEW_PAY has 5 observations and 4 variables.
32
33 proc print data=new_pay;
34    id ID;
35 run;
Listing 5-31

SAS Update

The SAS UPDATE statement creates the new_pay dataset by applying transaction values from the trans dataset to the master dataset. A BY variable is required. Because the ID values are read in sorted order on input, a call to PROC SORT is not needed in this example.

All non-missing values for variables in the trans dataset replace the corresponding values in the master dataset. The RENAME= dataset option renames the salary variables in order to display the effects of the UPDATE operation. The resulting new_pay dataset is displayed in Figure 5-11.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig11_HTML.jpg
Figure 5-11

SAS Update Results

In this example, the master dataset does not contain a bonus variable. Since the trans dataset contains a bonus variable, it is applied to the new_pay dataset.

The pandas update() method is used to modify DataFrame values using non-NA values from another DataFrame. In contrast to SAS, the update() method performs an in-place update to the calling DataFrame, in our case the master DataFrame. The SAS UPDATE statement creates a new dataset and does not modify the master dataset. To understand these differences, consider Listing 5-32.
>>> import numpy as np
>>> master = pd.DataFrame({'ID': ['023', '088', '099', '111'],
...                'Salary': [45650, 55350, 55100, 61625]})
>>> trans = pd.DataFrame({'ID': ['023', '088', '099', '111', '121'],
...          'Salary': [45650, 61000, 59100, 61625, 50000],
...           'Bonus': [2000, np.NaN , np.NaN, 3000, np.NaN]})
>>> print(master)
    ID  Salary
0  023   45650
1  088   55350
2  099   55100
3  111   61625
>>> print(trans)
    ID  Salary   Bonus
0  023   45650  2000.0
1  088   61000     NaN
2  099   59100     NaN
3  111   61625  3000.0
4  121   50000     NaN
>>> master.update(trans)
>>> print(master)
    ID  Salary
0  023   45650
1  088   61000
2  099   59100
3  111   61625
Listing 5-32

DataFrame update() Method

The master DataFrame calls the update() method to overwrite its values with non-NA values from the trans DataFrame. The default join operation for the update() method is a left join, which explains why row ID 121 is not in the updated master DataFrame; this row exists only in the trans DataFrame. It also explains why the Bonus column in the trans DataFrame is not part of the update to the master DataFrame.
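
The update() method also accepts an overwrite= argument. As a minimal sketch with hypothetical DataFrames, setting overwrite=False updates only positions holding NaN in the calling DataFrame, preserving its existing non-NA values:

# Hypothetical master/transaction pair; alignment is by position (index)
m = pd.DataFrame({'ID': ['023', '088'], 'Salary': [45650, np.NaN]})
t = pd.DataFrame({'ID': ['023', '088'], 'Salary': [99999, 61000]})
m.update(t, overwrite=False)
# Salary stays 45650 in row 0; the NaN in row 1 becomes 61000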

An alternate approach is calling the DataFrame update() method with the join= argument set to "outer". Unfortunately, this raises the NotImplementedError: Only left join is supported error illustrated in Listing 5-33. (In order to run this example, copy and paste the code defining the master and trans DataFrames from Listing 5-32.) Remember, the update() method performs an in-place update.
>>> master.update(trans, join="outer")
NotImplementedError: Only left join is supported
Listing 5-33

DataFrame update() Outer Join

The correct alternative is to call the merge() method as illustrated in Listing 5-34.
>>> master = pd.DataFrame({'ID2': ['023', '088', '099', '111'],
...           'Salary2': [45650 , 55350, 55100, 61625]})
>>> trans = pd.DataFrame({'ID2': ['023' , '088', '099', '111', '121'],
...           'Salary2': [45650 , 61000, 59100, 61625, 50000],
...                     'Bonus2': [2000 , np.NaN, np.NaN, 3000, np.NaN]})
>>> df6 = pd.merge(master, trans,
...                on='ID2', how="outer",
...                suffixes=('_Old', '_Updated'))
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2
0  023      45650.0            45650  2000.0
1  088      55350.0            61000     NaN
2  099      55100.0            59100     NaN
3  111      61625.0            61625  3000.0
4  121          NaN            50000     NaN
Listing 5-34

DataFrame Update Using merge() Method

The on='ID2' argument uses the ID2 column common to the master and trans DataFrames as the join key. The how='outer' argument performs an outer join, and suffixes=('_Old', '_Updated') adds a suffix to like-named columns from both DataFrames to disambiguate each DataFrame's column contribution.

Conditional Update

There are cases when updates need to be applied conditionally. SAS users are accustomed to thinking in terms of IF/THEN/ELSE logic for conditional updates. To help understand how this logic works with pandas, we use two examples. The first example defines a Python function to conditionally calculate tax rates based on the Salary2_Updated column in the df6 DataFrame. This example will look familiar to SAS users: it uses row iteration with if/else logic to calculate the new Taxes column.

The second, better-performing approach uses the loc() indexer to apply calculated values to the DataFrame.

Begin with Listing 5-35. Our first step is to copy the df6 DataFrame, created in Listing 5-34, to one called df7.
>>> #copy df6 to df7 to be used in second example
... df7 = df6.copy()
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2
0  023      45650.0            45650  2000.0
1  088      55350.0            61000     NaN
2  099      55100.0            59100     NaN
3  111      61625.0            61625  3000.0
4  121          NaN            50000     NaN
>>> def calc_taxes(row):
...     if row['Salary2_Updated'] <= 50000:
...         val = row['Salary2_Updated'] * .125
...     else:
...         val = row['Salary2_Updated'] * .175
...     return val
...
>>> df6['Taxes'] = df6.apply(calc_taxes, axis=1)
>>> print(df6)
   ID2  Salary2_Old  Salary2_Updated  Bonus2      Taxes
0  023      45650.0            45650  2000.0   5706.250
1  088      55350.0            61000     NaN  10675.000
2  099      55100.0            59100     NaN  10342.500
3  111      61625.0            61625  3000.0  10784.375
4  121          NaN            50000     NaN   6250.000
Listing 5-35

Conditional DataFrame Column Update with a Function

This example defines the calc_taxes function, which is applied by iterating over the DataFrame rows. The row variable is local to the function definition, and the function returns the val object when called. The function contains two conditions, if and else. To cascade a series of if statements, use the elif keyword following the first if statement.

The if and else statements define the Taxes column calculated at a 12.5% rate when the Salary2_Updated value is less than or equal to $50,000. Otherwise the tax rate is 17.5%.

The statement
df6['Taxes'] = df6.apply(calc_taxes, axis=1)

creates the Taxes column in the df6 DataFrame by calling the apply method with the calc_taxes function; the axis=1 argument indicates the function is applied across columns, that is, once per row.

The second approach for conditional update uses the .loc indexer illustrated in Listing 5-36.
>>> df7.loc[df7['Salary2_Updated'] <= 50000, 'Taxes'] = df7.Salary2_Updated * .125
>>> df7.loc[df7['Salary2_Updated'] >  50000, 'Taxes'] = df7.Salary2_Updated * .175
>>> print(df7)
    ID2 Salary2_Old Salary2_Updated  Bonus2      Taxes
0  023     45650.0           45650  2000.0   5706.250
1  088     55350.0           61000     NaN  10675.000
2  099     55100.0           59100     NaN  10342.500
3  111     61625.0           61625  3000.0  10784.375
4  121         NaN           50000     NaN   6250.000
Listing 5-36

Conditional DataFrame Update with loc() Indexer

There are two conditions: 12.5% tax rate on the Salary2_Updated column less than or equal to $50,000 is
df7.loc[df7['Salary2_Updated'] <=50000, 'Taxes'] = df7.Salary2_Updated * .125

One way to read this statement is to treat the syntax to the left of the comma (,) like a WHERE clause: the df7 DataFrame calls the loc() indexer to find rows where df7['Salary2_Updated'] is less than or equal to $50,000. To the right of the comma is the assignment applied when the condition is True: the value for the Taxes column (which is created on the DataFrame) is calculated as 12.5% of the value found in the Salary2_Updated column.

The second condition, 17.5% tax rate on the Salary2_Updated column greater than $50,000
df7.loc[df7['Salary2_Updated'] >50000, 'Taxes'] = df7.Salary2_Updated * .175

works in a similar fashion.
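
An equivalent vectorized alternative, offered as an aside, uses NumPy’s where function to compute the entire Taxes column in a single expression, assuming the df7 DataFrame from Listing 5-36:

import numpy as np

# np.where(condition, value_if_true, value_if_false) operates element-wise
df7['Taxes'] = np.where(df7['Salary2_Updated'] <= 50000,
                        df7['Salary2_Updated'] * .125,
                        df7['Salary2_Updated'] * .175)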

Specifically, the loc() indexer creates a Boolean mask, which is used to index the DataFrame and return those rows meeting the logical condition. This is illustrated in Listing 5-37. For the logical condition df7['Salary2_Updated'] <= 50000, only when the Boolean mask returns True is the value multiplied by .125. Likewise, for the logical condition df7['Salary2_Updated'] > 50000, only when the mask returns True is the value multiplied by .175.
>>> nl = '\n'
>>>
>>> print(nl,
...       "Boolean Mask for 'Salary2_Updated' <= 50000",
...       nl,
...       df7['Salary2_Updated'] <= 50000,
...       nl,
...       "Boolean Mask for 'Salary2_Updated' > 50000",
...       nl,
...   df7['Salary2_Updated'] > 50000)
Boolean Mask for 'Salary2_Updated' <= 50000
0    True
1    False
2    False
3    False
4    True
Name: Salary2_Updated, dtype: bool
Boolean Mask for 'Salary2_Updated' > 50000
0     False
1     True
2     True
3     True
4     False
Name: Salary2_Updated, dtype: bool
Listing 5-37

Boolean Mask for Logical Conditions

A conditional update with a SAS Data Step is illustrated in Listing 5-38.
4 data calc_taxes;
5    set new_pay;
6    if new_salary <= 50000 then
7       taxes = new_salary * .125;
8    else taxes = new_salary * .175;
9 run;
NOTE: There were 5 observations read from the dataset WORK.NEW_PAY.
NOTE: The dataset WORK.CALC_TAXES has 5 observations and 5 variables.
10 proc print data=calc_taxes;
11    id ID;
12 run;
Listing 5-38

SAS Conditional Update

The SAS IF/ELSE logic is similar to the calc_taxes function defined in Listing 5-35.

Figure 5-12 uses PROC PRINT to display the newly created taxes variable.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig12_HTML.jpg
Figure 5-12

SAS Conditional Update

Concatenation

The pandas library implements a concat() method similar in behavior to the SAS SET statement. It is used to “glue” DataFrames together both row-wise, like the SAS SET statement, and column-wise, like the SAS MERGE statement. Here we examine its behavior as an analog to the SAS SET statement.
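
While the remainder of this section focuses on the row-wise case, a brief sketch, not from the original examples, shows the column-wise case: passing axis=1 aligns DataFrames on their indexes and places their columns side by side:

df_a = pd.DataFrame({'X': [1, 2]}, index=['r1', 'r2'])
df_b = pd.DataFrame({'Y': [3, 4]}, index=['r1', 'r2'])
wide = pd.concat([df_a, df_b], axis=1)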

The concat() method signature is
pd.concat(objs, axis=0, join="outer", join_axes=None,
            ignore_index=False, keys=None, levels=None,
            names=None, verify_integrity=False,
            sort=None, copy=True)
Begin by creating the example DataFrames shown in Listing 5-39.
>>> loc1 = pd.DataFrame({'Onhand': [21, 79, 33, 81],
...                       'Price': [17.99, 9.99, 21.00, .99]},
...                        index = ['A0', 'A1', 'A2', 'A3'])
>>>
>>> loc2 = pd.DataFrame({'Onhand': [12, 33, 233, 45],
...                       'Price': [21.99, 18.00, .19, 23.99]},
...                        index = ['A4', 'A5', 'A6', 'A7'])
>>>
>>> loc3 = pd.DataFrame({'Onhand': [37, 50, 13, 88],
...                       'Price': [9.99, 5.00, 22.19, 3.99]},
...                        index = ['A8', 'A9', 'A10', 'A11'])
>>> frames = [loc1, loc2, loc3]
>>> all = pd.concat(frames)
>>> print(all)
     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-39

DataFrames for concat() Method

This example uses the DataFrame() constructor to create the three DataFrames, loc1, loc2, and loc3. The objects are placed into a Python list and assigned to the frames object with the syntax
frames = [loc1, loc2, loc3]
The three DataFrames are concatenated by calling the concat() method using the syntax
all = pd.concat(frames)

creating the output all DataFrame.

The analog SAS program is shown in Listing 5-40. It creates the three SAS datasets, loc1, loc2, and loc3, and uses the SET statement to concatenate them, producing the all dataset.
4 data loc1;
5 length id $ 3;
6 input id $
7       onhand
8       price;
9 list;
10 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
12          A0 21 17.99
13          A1 79 9.99
14          A2 33 21
15          A3 81 .99
NOTE: The dataset WORK.LOC1 has 4 observations and 3 variables.
16  ;;;;
17 data loc2;
18 length id $ 3;
19  input id $
20       onhand
21       price;
22 list;
23 datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
24          A4 12 21.99
25          A5 33 18
26          A6 233 .19
27          A7 45 23.99
NOTE: The dataset WORK.LOC2 has 4 observations and 3 variables.
28  ;;;;
29  data loc3;
30  length id $ 3;
31  input id $
32        onhand
34        price;
35  list;
36  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
37          A8 37 9.99
38          A9 50 5
39          A10 13 22.19
40          A11 88 3.99
NOTE: The dataset WORK.LOC3 has 4 observations and 3 variables.
41 ;;;;
42 data all;
43
44    set loc1
45        loc2
46        loc3;
47  run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 3 variables.
48 proc print data=all;
49   run;
Listing 5-40

SAS SET Statement

The PROC SQL UNION ALL set operator is an alternative to the SAS SET statement for creating the all table. This example is illustrated in Listing 5-41.
4 proc sql;
5    create table all as
6 select * from loc1
7    union all
8 select * from loc2
9    union all
10 select * from loc3;
NOTE: Table WORK.ALL created, with 12 rows and 3 columns.
11
12  select * from all;
13  quit;
Listing 5-41

PROC SQL UNION ALL

The PROC SQL output is displayed in Figure 5-13.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig13_HTML.jpg
Figure 5-13

PROC SQL Creating ALL Table

The concat() method is able to construct a hierarchical index by providing the keys= argument to form the outermost level. Listing 5-42 illustrates this feature.
>>> all = pd.concat(frames, keys=['Loc1', 'Loc2', 'Loc3'])
>>> print(all)
          Onhand  Price
Loc1 A0       21  17.99
     A1       79   9.99
     A2       33  21.00
     A3       81   0.99
Loc2 A4       12  21.99
     A5       33  18.00
     A6      233   0.19
     A7       45  23.99
Loc3 A8       37   9.99
     A9       50   5.00
     A10      13  22.19
     A11      88   3.99
>>> all.loc['Loc3']
     Onhand  Price
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-42

Hierarchical Index from concat() Method

With the hierarchical index in place, we can easily identify subsets using the loc() indexer discussed in Chapter 4, “Indexing and GroupBy.” In this example, the loc() indexer slices those rows belonging to the original loc3 DataFrame.

The IN= dataset option makes it possible to identify which SAS dataset contributed each observation, as illustrated in Listing 5-43.
4 data all;
5 length id $ 3;
6    set loc1 (in=l1)
7        loc2 (in=l2)
8        loc3 (in=l3);
9  if l1 then location = 'Loc1';
10 if l2 then location = 'Loc2';
11 if l3 then location = 'Loc3';
12 run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 4 variables.
13
14  proc print data = all(where=(location='Loc3'));
15     id id;
16     var onhand price;
17  run;
Listing 5-43

SET Statement Using IN=

The IF statements create the location variable, using the IN= dataset option to identify which dataset contributed each observation. To select those observations contributed by the loc3 dataset, apply a WHERE= filter; in this example, WHERE=(location='Loc3') is applied when calling PROC PRINT. The results are displayed in Figure 5-14.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig14_HTML.jpg
Figure 5-14

SET with IN= Option

Similar to SAS PROC APPEND, the pandas library provides an append() method. Consider Listing 5-44. This method is a convenience for calling the concat() method, and its syntax is likely more natural for SAS users.
>>> all_parts = loc1.append([loc2, loc3])
>>> print(all_parts)
     Onhand  Price
A0       21  17.99
A1       79   9.99
A2       33  21.00
A3       81   0.99
A4       12  21.99
A5       33  18.00
A6      233   0.19
A7       45  23.99
A8       37   9.99
A9       50   5.00
A10      13  22.19
A11      88   3.99
Listing 5-44

pandas append() Method

DataFrame loc1 calls the append() method to append the loc2 and loc3 DataFrames. For this example, the equivalent syntax using the concat() method is
all_parts = pd.concat([loc1, loc2, loc3], join="outer")
SAS uses PROC APPEND to append observations from a dataset to a base dataset as illustrated in Listing 5-45.
4 proc append base = loc1
5             data = loc2;
6 run;
NOTE: Appending WORK.LOC2 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 8 observations and 3 variables.
7 proc append base = loc1
8             data = loc3;
NOTE: Appending WORK.LOC3 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 12 observations and 3 variables.
9  proc print data=loc1;
10 run;
Listing 5-45

SAS PROC APPEND

In cases where more than one dataset is being appended, multiple calls to PROC APPEND are needed. Appending can perform better than the SET statement when adding smaller datasets to a larger dataset, because PROC APPEND avoids reading observations in the BASE= dataset by positioning the record pointer at its end. Observations from the DATA= dataset are then written to the end of the BASE= dataset. Figure 5-15 displays the loc1 dataset after the append operations.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig15_HTML.jpg
Figure 5-15

PROC APPEND Results

Finding Column Min and Max Values

A common task is finding the minimum and maximum values in a column. The pandas min() and max() methods return the minimum and maximum column values, respectively. This is illustrated in Listing 5-46.
>>> all_parts['Price'].max()
23.99
>>> all_parts['Price'].min()
0.19
>>> all_parts[all_parts['Price']==all_parts['Price'].max()]
    Onhand  Price
A7      45  23.99
Listing 5-46

Return DataFrame Column Min and Max

The first two statements return the maximum and minimum values for the Price column in the all_parts DataFrame. The last statement returns the row containing the maximum value of Price using the [ ] indexer with a Boolean comparison.
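
Relatedly, the idxmax() and idxmin() methods return the index label of the row holding the extreme value; a brief sketch using the all_parts DataFrame:

all_parts['Price'].idxmax()   # returns 'A7', the label of the maximum Price
all_parts['Price'].idxmin()   # returns 'A6', the label of the minimum Price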

Listing 5-47 illustrates the same logic with PROC SQL.
4 proc sql;
5 select min(price) as Price_min
6      , max(price) as Price_max
7 from loc1;
8 quit;
Listing 5-47

Return Dataset Column Min and Max

Figure 5-16 displays the PROC SQL result set.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig16_HTML.jpg
Figure 5-16

Returned Column Min and Max

Sorting

Both the pandas library and SAS provide sort methods and options for controlling how values are sorted. Examine Listing 5-48. It creates the df DataFrame and calls the sort_values() method with a Python list of column names indicating a multi-key sort.
>>> df = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3', 'A4', '5A', '5B'],
...                   'Age': [21, 79, 33, 81, np.NaN, 33, 33],
...                  'Rank': [1, 2, 3, 3, 4, 5, 6]})
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> df.sort_values(by=['Age', 'Rank'])
   ID   Age  Rank
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3
4  A4   NaN     4
Listing 5-48

pandas sort_values() Method

Both the sort_values() method and PROC SORT use ascending as the default sort order. Listing 5-49 is the analog, illustrating a call to PROC SORT with a multi-key sort. Of course, the same dataset can be generated with PROC SQL and an ORDER BY statement.
4  data df;
5  length id $ 2;
6  input id $
7        age
8        rank;
9  list;
10  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--
11          A0 21 1
12          A1 79 2
13          A2 33 3
14          A3 81 3
15          A4 .  4
16          5A 33 5
17          5B 33 6
NOTE: The dataset WORK.DF has 7 observations and 3 variables.
18  ;;;;
19
20 proc sort data = df;
21   by age rank;
22 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
23 proc print data = df;
24    id id;
25 run;
Listing 5-49

SAS PROC SORT

Figure 5-17 displays the results of the sort operation.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig17_HTML.jpg
Figure 5-17

PROC SORT Results

PROC SORT sorts missing values as the smallest numeric value. The default behavior for the pandas sort_values() method is to sort NaNs as the largest numeric values. The sort_values() method has the na_position= argument, which accepts values of 'first' or 'last' for placing NaNs at the beginning or the end, respectively, of a DataFrame.

Listing 5-50 illustrates overriding the default na_position= value of 'last' with 'first'.
>>> df.sort_values(by=['Age', 'Rank'], na_position="first")
   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6
1  A1  79.0     2
3  A3  81.0     3
Listing 5-50

Sort NaNs to Beginning of DataFrame

The sort_values() method and PROC SORT also differ with respect to where the sort occurs. By default, the sort_values() method returns a new DataFrame, since the inplace= argument defaults to False. By contrast, PROC SORT sorts the dataset in place unless an output dataset is specified with the OUT= option.
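A minimal sketch of the in-place behavior, sorting a copy (the df2 name is introduced here) so the df DataFrame used in later listings is left unchanged; with inplace=True, sort_values() returns None and modifies the DataFrame directly:
>>> df2 = df.copy()
>>> df2.sort_values(by=['Age', 'Rank'], inplace=True)
>>> df2.head(3)
   ID   Age  Rank
0  A0  21.0     1
2  A2  33.0     3
5  5A  33.0     5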

For a multi-key sort with the sort_values() method, the sort order is specified with the ascending= argument, supplying one True or False value for each column listed with the by= argument. Take, for example, Listing 5-51.
>>> df.sort_values(by=['Age', 'Rank'], na_position="first", ascending = (True, False))
   ID   Age  Rank
4  A4   NaN     4
0  A0  21.0     1
6  5B  33.0     6
5  5A  33.0     5
2  A2  33.0     3
1  A1  79.0     2
3  A3  81.0     3
Listing 5-51

Multi-key Sort Ascending and Descending

In this example, the Age column is sorted in ascending order and the Rank column in descending order. Notice the sorted output retains the original row labels; the sketch below shows how to renumber them.
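When a renumbered index is preferred after a sort, chaining reset_index() with drop=True discards the old labels; a minimal sketch using the same sort as Listing 5-51:
>>> df.sort_values(by=['Age', 'Rank'], na_position="first",
...                ascending=(True, False)).reset_index(drop=True).head(3)
   ID   Age  Rank
0  A4   NaN     4
1  A0  21.0     1
2  5B  33.0     6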

PROC SQL uses the keyword DESCENDING following a column name in an ORDER BY clause to indicate a descending sort order. Listing 5-52 illustrates the DESCENDING keyword altering the default sort order.
4 proc sql;
5    select * from df
6 order by age, rank descending;
7 quit;
Listing 5-52

PROC SQL ORDER BY

The results from PROC SQL with a multi-key sort are displayed in Figure 5-18.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig18_HTML.jpg
Figure 5-18

PROC SQL Multi-key ORDER BY

Finding Duplicates

In some cases, data entry errors lead to duplicate data values, non-unique key values, or duplicate rows. In Listing 5-25, we discuss validating the join relationship among tables: the validate= argument for the merge() method checks whether a join is of a particular type. This raises the question of how to find and remove duplicate key values.

Consider Listing 5-53. This example illustrates how to find duplicate key values using the duplicated() method.
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> dup_mask = df.duplicated('Age', keep="first")
>>> df_dups = df.loc[dup_mask]
>>> print(df_dups)
   ID   Age  Rank
5  5A  33.0     5
6  5B  33.0     6
Listing 5-53

Find DataFrame Duplicate Keys

The duplicated() method returns a Boolean Series marking duplicate rows, in this case limited to duplicate values in the Age column. By default, the duplicated() method considers all DataFrame columns; in other words, the default behavior is to identify duplicate rows.

The statement
dup_mask = df.duplicated('Age', keep="first")
defines a Boolean mask using the keep='first' argument for the duplicated() method. The keep= argument has the following three values:
  • first: Mark duplicates as True except for the first occurrence.

  • last: Mark duplicates as True except for the last occurrence.

  • False: Mark all duplicates as True.

The statement
df_dups = df.loc[dup_mask]

creates the df_dups DataFrame containing the duplicate values for the Age column. This form of conditional slicing is covered in detail in Chapter 4, “Indexing and GroupBy,” in the section “Conditional Slicing.”
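To return every row sharing a duplicated Age value, including first occurrences, set keep=False; a minimal sketch against the same df DataFrame:
>>> df.loc[df.duplicated('Age', keep=False)]
   ID   Age  Rank
2  A2  33.0     3
5  5A  33.0     5
6  5B  33.0     6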

Dropping Duplicates

The drop_duplicates() method returns a de-duplicated DataFrame based on its argument values. This feature is illustrated in Listing 5-54.
>>> print(df)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
5  5A  33.0     5
6  5B  33.0     6
>>> df_deduped = df.drop_duplicates('Age', keep = 'first')
>>> print(df_deduped)
   ID   Age  Rank
0  A0  21.0     1
1  A1  79.0     2
2  A2  33.0     3
3  A3  81.0     3
4  A4   NaN     4
Listing 5-54

Drop Duplicate Column Values

The keep= argument has the same values and behavior as the keep= argument for the duplicated() method.

Similarly, with SAS, the NODUPKEY and NODUPRECS options for PROC SORT are used to detect and remove duplicate values. The NODUPKEY option removes observations with duplicate BY values. This is analogous to the drop_duplicates() method with a column argument.

The PROC SORT NODUPRECS option is analogous to calling drop_duplicates() with no column subset. In each case, duplicate observations or rows, if found, are eliminated, keeping the first occurrence.
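A minimal sketch of whole-row de-duplication, using a hypothetical df9 DataFrame (introduced here) with one fully repeated row; drop_duplicates() with no arguments keeps the first occurrence:
>>> df9 = pd.DataFrame({'ID':  ['A0', 'A0', 'A1'],
...                     'Age': [21, 21, 79]})
>>> df9.drop_duplicates()
   ID  Age
0  A0   21
2  A1   79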

Consider Listing 5-55. This example replicates the logic for Listing 5-53 and Listing 5-54.
4 proc sort data = df nodupkey
5    out = df_deduped
6    dupout = df_dups;
7    by age;
8
9 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The dataset WORK.DF_DEDUPED has 5 observations and 3 variables.
NOTE: The dataset WORK.DF_DUPS has 2 observations and 3 variables.
10 proc print data = df;
11    id id;
12 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
13 proc print data = df_dups;
14    id id;
15 run;
NOTE: There were 2 observations read from the dataset WORK.DF_DUPS.
16 proc print data = df_deduped;
17    id id;
18 run;
NOTE: There were 5 observations read from the dataset WORK.DF_DEDUPED.
Listing 5-55

SAS Find and Drop Duplicates

In this example, the OUT= option creates the df_deduped dataset, dropping observations with duplicate values for the age variable found in the df dataset. Without the OUT= option, PROC SORT sorts in place, overwriting the df dataset. The NODUPKEY option removes observations with duplicate BY values for the age variable. The DUPOUT= option names the output dataset that receives the duplicate observations.

Figure 5-19 uses PROC PRINT to display the original dataset with duplicate age values.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig19_HTML.jpg
Figure 5-19

Original df Dataset

Figure 5-20 uses PROC PRINT to display the duplicate observations found using the NODUPKEY option that are written to the df_dups dataset using the DUPOUT= option.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig20_HTML.jpg
Figure 5-20

Duplicate Age Observations

Figure 5-21 uses PROC PRINT to display the de-duplicated dataset df_deduped.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig21_HTML.jpg
Figure 5-21

Deduped Observations

Sampling

Sampling is often a requirement in data analysis. The idea is to select a sample from a population and, by analyzing the sample, draw inferences about the population. In a simple random sample without replacement, every row has an equal chance of being selected, and once selected, a row cannot be chosen again. Listing 5-56 illustrates this concept.
>>> np.random.seed(987654)
>>> df = pd.DataFrame({'value': np.random.randn(360)},
...      index=pd.date_range('1970-01-01', freq="M", periods=360))
>>> print(df.head(5))
               value
1970-01-31 -0.280936
1970-02-28 -1.292098
1970-03-31 -0.881673
1970-04-30  0.518407
1970-05-31  1.829087
>>>
>>> samp1 = df.sample(n=100, replace=False)
>>> samp1.head(5)
               value
1993-10-31 -0.471982
1998-04-30 -0.906632
1980-09-30 -0.313441
1986-07-31 -0.872584
1975-01-31  0.237037
>>> print(samp1.shape)
(100, 1)
Listing 5-56

Simple Random Sample from a DataFrame

The df DataFrame is constructed using the numpy random.randn random number generator to draw 360 values from the standard normal distribution into the value column. The samp1 DataFrame is created by calling the sample() method with the n= argument setting the number of rows to draw. The replace= argument defaults to False, so the sample is drawn without replacement.
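For reproducible draws, the sample() method also accepts a random_state= argument to seed the sampler; the seed value 1234 below is arbitrary. A minimal sketch confirming that two calls with the same seed return the same rows:
>>> df.sample(n=5, random_state=1234).equals(df.sample(n=5, random_state=1234))
True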

The same logic for a simple random sample without replacement is illustrated in Listing 5-57.
4  data df;
5  do date = '01Jan1970'd to '31Dec2000'd by 31;
6     value = rand('NORMAL');
7     output;
8  end;
9  format date yymmdd10.;
10 run;
NOTE: The dataset WORK.DF has 366 observations and 2 variables.
11 data samp1 (drop = k n);
12 retain k 100 n;
13 if _n_ = 1 then n = total;
14   set df nobs=total;
15
16  if ranuni(654321) <= k/n then do;
17    output;
18       k = k -1;
19    end;
20 n = n -1;
21
22 if k = 0 then stop;
23 run;
NOTE: There were 360 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.SAMP1 has 100 observations and 2 variables.
24 proc print data = samp1 (obs=5);
25    id date;
26    var value;
27 run;
Listing 5-57

Simple Random Sample from a Dataset

The df dataset is generated using a DO/END block. The values assigned to the value variable are randomly generated from the standard normal distribution.

The samp1 dataset is created by reading the df dataset and retaining the k variable to the desired sample size of 100. The total variable holds the value for the number of observations read from the input df dataset using the NOBS= dataset option on the SET statement. When the first observation is read from the df dataset, the n variable is assigned the value from the total variable.

The IF-THEN/DO block is based on the logic that each remaining observation in the input dataset has a probability of k/n (initially 100/366, given the 366 observations created) of being selected for the sample. If a random number from the RANUNI function is less than or equal to k/n, the observation is included in the sample and k is decremented by one; n is decremented for every observation read. The iteration stops when k, which began with a retained value of 100, decrements to zero (0).

The pandas sample() method uses the frac= argument to specify the proportion of rows, for example, 30%, to include in the sample. This capability is illustrated in Listing 5-58.
>>> samp2 = df.sample(frac=.3, replace=True)
>>> print(samp2.head(5))
               value
1971-05-31  0.639097
1997-10-31  1.779798
1971-07-31  1.578456
1981-12-31  2.114340
1980-11-30  0.293887
>>> print(samp2.shape)
(108, 1)
Listing 5-58

Proportional Sample from a DataFrame with Replacement

The replace=True argument indicates the sample is drawn with replacement.
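The sample() method also supports weighted sampling through its weights= argument, which accepts a column name or a sequence of non-negative weights; rows with larger weights are more likely to be drawn. A sketch (the samp3 name is introduced here), weighting by the absolute value of the value column:
>>> samp3 = df.sample(n=10, weights=df['value'].abs(), random_state=99)
>>> samp3.shape
(10, 1)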

Convert Types

There are occasions when a value is assigned to a DataFrame column with a type that is not appropriate for a desired operation. For example, date values may have been read as strings, or numeric values may have been read as strings.

pandas provides the astype() method for converting types. This feature is illustrated in Listing 5-59. The DataFrame dtypes attribute returns each column’s type.
>>> df8 = pd.DataFrame({'String': ['2', '4', '6', '8'],
...                     'Ints'  : [1, 3, 5, 7]})
>>> df8.dtypes
String    object
Ints       int64
dtype: object
>>> df8['String'] = df8['String'].astype(float)
>>> df8['Ints']  = df8['Ints'].astype(object)
>>> df8.dtypes
String    float64
Ints       object
dtype: object
Listing 5-59

Converting Types

In this example, DataFrame df8 contains two columns, String and Ints, with types object and int64, respectively. After the calls to the astype() method, the String and Ints columns have data types float64 and object, respectively. The “Column Types” section in Chapter 3, “pandas Library,” provides more details on the common pandas types.
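Note that astype() raises an error when a string fails to parse. When malformed values are possible, the pd.to_numeric() function with errors='coerce' converts unparseable entries to NaN instead; a sketch using a hypothetical Series:
>>> s = pd.Series(['2', '4', 'six', '8'])
>>> pd.to_numeric(s, errors='coerce')
0    2.0
1    4.0
2    NaN
3    8.0
dtype: float64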

Rename Columns

Listing 5-60 illustrates calling the DataFrame rename() method to rename column labels.
>>> df8.rename(columns={"String": "String_to_Float",
...                     "Ints": "Ints_to_Object"},
... inplace=True)
>>> print(df8)
   String_to_Float Ints_to_Object
0              2.0              1
1              4.0              3
2              6.0              5
3              8.0              7
Listing 5-60

Rename Columns

The DataFrame rename() method accepts a Dictionary of key:value pairs where the key is the existing column name and the value is the new name. The default value for the inplace= argument is False.
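The columns= argument also accepts a function applied to every column label. A sketch lowercasing the labels; because inplace= defaults to False, df8 keeps the names shown above:
>>> df8.rename(columns=str.lower)
   string_to_float ints_to_object
0              2.0              1
1              4.0              3
2              6.0              5
3              8.0              7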

Map Column Values

Mapping column values is similar to the functionality provided by PROC FORMAT. Mapping allows translation of DataFrame column values into associated values, as illustrated in Listing 5-61.
>>> dow = pd.DataFrame({"Day_Num":[1,2,3,4,5,6,7]})
>>> dow_map={1:'Sunday',
...        2:'Monday',
...        3:'Tuesday',
...        4:'Wednesday',
...        5:'Thursday',
...        6:'Friday',
...        7:'Saturday'
...       }
>>>
>>> dow["Day_Name"] = dow["Day_Num"].map(dow_map)
>>> print(dow)
   Day_Num   Day_Name
0        1     Sunday
1        2     Monday
2        3    Tuesday
3        4  Wednesday
4        5   Thursday
5        6     Friday
6        7   Saturday
Listing 5-61

Mapping Column Values

The dow DataFrame is constructed with the Day_Num column holding the values 1 through 7. dow_map defines a Dictionary of key:value pairs where the keys match the values in the Day_Num column. The Dictionary values are the values mapped to these keys, in this case, weekday names. The statement
 dow["Day_Name"] = dow["Day_Num"].map(dow_map)

creates the Day_Name column in the dow DataFrame by calling the map() method with the dow_map Dictionary for the Day_Num column.
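Keys absent from the Dictionary map to NaN, so unlike a PROC FORMAT OTHER category, a default must be supplied explicitly, for example, with fillna(). A sketch using a hypothetical dow2 DataFrame with an out-of-range Day_Num value:
>>> dow2 = pd.DataFrame({"Day_Num": [1, 8]})
>>> dow2["Day_Name"] = dow2["Day_Num"].map(dow_map).fillna("Unknown")
>>> print(dow2)
   Day_Num Day_Name
0        1   Sunday
1        8  Unknown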

Transpose

Transposing restructures a DataFrame by turning its columns into rows and its rows into columns. The transpose does not happen in place; the T attribute returns a new transposed DataFrame, so in most cases the output DataFrame is captured with an assignment statement. Creating a new DataFrame by transposing an existing DataFrame is shown in Listing 5-62.
>>> uni = {'School'  :  ['NCSU',   'UNC',         'Duke'],
...        'Mascot'  :  ['Wolf',   'Ram',         'Devil'],
...       'Students' :  [22751,     31981,         19610],
...       'City'     :  ['Raleigh', 'Chapel Hill', 'Durham']}
>>> df_uni = pd.DataFrame(data=uni)
>>> print (df_uni)
  School Mascot  Students         City
0   NCSU   Wolf     22751      Raleigh
1    UNC    Ram     31981  Chapel Hill
2   Duke  Devil     19610       Durham
>>> df_uni.dtypes
School      object
Mascot      object
Students     int64
City        object
dtype: object
>>>
>>> t_df_uni = df_uni.T
>>> print(t_df_uni)
                0            1       2
School       NCSU          UNC    Duke
Mascot       Wolf          Ram   Devil
Students    22751        31981   19610
City      Raleigh  Chapel Hill  Durham
>>> t_df_uni.dtypes
0    object
1    object
2    object
dtype: object
>>>
>>> t_df_uni[0]
School         NCSU
Mascot         Wolf
Students      22751
City        Raleigh
Name: 0, dtype: object
Listing 5-62

Transpose a DataFrame

The example begins by defining the Dictionary uni. The statement
df_uni = pd.DataFrame(data=uni)

uses the uni Dictionary to create the df_uni DataFrame.

The call to create the transposed DataFrame and create the output t_df_uni DataFrame is the statement
t_df_uni = df_uni.T

In cases where the DataFrame has mixed data types, as this one does, the transposed DataFrame columns are returned as object (string) types. The Students column in the original df_uni DataFrame is int64, while columns 0–2 in the transposed DataFrame are object data types.

The [ ] indexer is used to return column 0 in this example.
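If arithmetic is needed on the transposed values, the object types must be converted back explicitly. A sketch recovering the Students row as int64 with the astype() method:
>>> t_df_uni.loc['Students'].astype('int64')
0    22751
1    31981
2    19610
Name: Students, dtype: int64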

Listing 5-63 illustrates transposing a SAS dataset.
4 data df_uni;
5 infile datalines dlm=',';
6        length school $ 4
7               mascot $ 5
8               city $ 11;
9        input school $
10              mascot $
11              students
12              city $;
13  list;
14  datalines;
RULE:       --+--1--+--2--+--3--+--4--+--5
2922        NCSU, Wolf, 22751, Raleigh
2923        UNC, Ram, 31981, Chapel Hill
2924        Duke, Devil, 19610, Durham
NOTE: The dataset WORK.DF_UNI has 3 observations and 4 variables.
15  ;;;;
16
17 proc print data = df_uni;
18    id school;
19 run;
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
20
21 proc transpose data = df_uni
22                out = t_df_uni;
23 var school mascot students city;
24 run;
NOTE: Numeric variables in the input dataset will be converted to character in the output dataset.
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
NOTE: The dataset WORK.T_DF_UNI has 4 observations and 4 variables.
25
26 proc print data = t_df_uni;
27    run;
NOTE: There were 4 observations read from the dataset WORK.T_DF_UNI.
28
29 proc sql;
30    select name
31         , type
32    from dictionary.columns
33 where libname="WORK" and memname="T_DF_UNI";
34 quit;
Listing 5-63

SAS Transpose

The call to PROC TRANSPOSE transposes the input dataset df_uni and creates the transposed t_df_uni dataset using the OUT= option. Observe the note in the SAS log
Numeric variables in the input dataset will be converted to character in the output dataset.

The note indicates numeric variables, in this case the students variable, are converted to character. The PROC SQL code queries the SAS DICTIONARY table COLUMNS to display the column names and data types for the transposed t_df_uni dataset.

Figure 5-22 uses PROC PRINT to display the input dataset. Figure 5-23 uses PROC PRINT to display the transposed dataset. Figure 5-24 displays the PROC SQL results from the DICTIONARY.COLUMNS table.
../images/440803_1_En_5_Chapter/440803_1_En_5_Fig22_HTML.jpg
Figure 5-22

Input Dataset to Transpose

../images/440803_1_En_5_Chapter/440803_1_En_5_Fig23_HTML.jpg
Figure 5-23

Output Dataset from Transpose

../images/440803_1_En_5_Chapter/440803_1_En_5_Fig24_HTML.jpg
Figure 5-24

Transposed Dataset Variable Types

Summary

In this chapter, we examined a variety of essential data management tasks needed to reshape and organize data. The primary focus was to enable an understanding of joining/merging and reshaping DataFrames. These principles set the stage for the next chapter’s topic, “pandas Readers and Writers.”
