SQL-like merging/joining of DataFrame objects

The merge function is used to join two DataFrame objects similar to those used in SQL database queries. It results in a merged DataFrame. DataFrame objects are analogous to SQL tables. The following command explains this:

    merge(left, right, how='inner', on=None, left_on=None,
          right_on=None, left_index=False, right_index=False, 
          sort=True, suffixes=('_x', '_y'), copy=True)

The following is a summary of the merge function:

  • The left argument: This is the first DataFrame object.
  • The right argument: This is the second DataFrame object.
  • The how argument: This is the type of join and can be inner, outer, left, or right. The default is inner.
  • The on argument: This shows the names of columns to join on as join keys.
  • The left_on and right_on arguments: These show the left and right DataFrame column names to join on.
  • The left_index and right_index arguments: These have a Boolean value. If this is True, use the left or right DataFrame index/row labels to join on.
  • The sort argument: This has a Boolean value. The default True setting results in a lexicographical sort. Setting the default value to False may improve performance.
  • The suffixes argument: The tuple of string suffixes to be applied to overlapping columns. The defaults are '_x' and '_y'.
  • The copy argument: The default True value causes data to be copied from the passed DataFrame objects.

The source of the preceding information is http://pandas.pydata.org/pandas-docs/stable/merging.html.

Let's create two DataFrames – left and right – to understand merging:

left

The following will be the output:

Left DataFrame for merge

The right dataframe can be viewed using the following:

right

The following will be the output:

 Right dataframe for merge

The DataFrames have five rows each, with Category and Region as the keys. Of these five rows, two rows from each DataFrame share the same set of keys. Let's perform a merge on both keys:

pd.merge(left, right, on = ["Category", "Region"])

The following will be the output:

 Default inner merge

By default, the how argument is set to inner, hence, in this scenario, an inner join is performed. Now, let's perform a left join:

pd.merge(left, right, how = "left", on = ["Category", "Region"])

The following will be the output:

 Left merge

In a left join, all the rows found in the left DataFrame are included in the result. The rows of left not found in right get NAs appended to the columns originating from the right DataFrame – Discount and Profit – for which keys do not exist in the left DataFrame. A right join would be the exact opposite: the result would contain all the rows from the right dataframe and NAs would be appended to Sales and Quantity for cases where keys are found in left but not in the right DataFrame:

pd.merge(left, right, how = "right", on = ["Category", "Region"])

The following will be the output:

 Right merge

In the case of an outer join, no rows are excluded and NAs are appended as necessary for missing values:

pd.merge(left, right, how = "outer", on = ["Category", "Region"])

The following will be the output:

Outer merge

Let's investigate the behavior of an outer merge when duplicate entries of a key are found. The following command duplicates the last key combination of the left DataFrame. The keys with the Office Supplies category and the Canada region occur twice:

left.loc[5,:] =["Office Supplies", "Canada", 111, 111]
left

The following will be the output:

Inserting duplicates in the left DataFrame

The result of the outer merge is as follows:

pd.merge(left, right, how = "outer", on = ["Category", "Region"])

The following will be the output:

 Outer merge for data with duplicates

As you can see, the right DataFrame gets merged on the left DataFrame for each occurrence of the key and duplicates are not dropped. This behavior may not be desirable in huge datasets. It may be necessary to drop duplicates before merging. For such instances, the validate argument of merge helps to keep a check to support only one-to-one merges:

pd.merge(left, right, how = "outer", on = ["Category", "Region"], validate = "one_to_one")

The following will be the output:

Error indicating duplicates in the DataFrame when merging

The indicator argument of merge indicates the source of a row – left, right, or both:

pd.merge(left, right, how = "outer", on = ["Category", "Region"], indicator = "Indicator")

The following will be the output:

The indicator parameter of merge
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset