The .join(...) transformation

The .join(...) transformation allows us to join two DataFrames. The first parameter is the other DataFrame we want to join with, the second parameter specifies the columns (or the join condition) on which to join, and the final parameter specifies the type of join. Available types are inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, and left_anti. In SQL, the equivalent is the JOIN statement.

If you're not familiar with the ANTI and SEMI joins, check out this blog: https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql/.

Consider the following code:

models_df = sc.parallelize([
    ('MacBook Pro', 'Laptop')
    , ('MacBook', 'Laptop')
    , ('MacBook Air', 'Laptop')
    , ('iMac', 'Desktop')
]).toDF(['Model', 'FormFactor'])

(
    sample_data_schema
    .join(
        models_df
        , sample_data_schema.Model == models_df.Model
        , 'left'
    ).show()
)

It produces the following output:

In SQL syntax, this would be:

SELECT a.*
    , b.FormFactor
FROM sample_data_schema AS a
LEFT JOIN models_df AS b
    ON a.Model = b.Model
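
To try the SQL form without a Spark session, the following is a minimal sketch that runs an equivalent LEFT JOIN in SQLite through Python's built-in sqlite3 module. SQLite stands in for Spark SQL here, and the Id column and the three sample rows are hypothetical placeholders rather than the real contents of sample_data_schema.

```python
import sqlite3

# Hypothetical stand-ins for the two DataFrames; the real
# sample_data_schema has its own columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data_schema (Id INTEGER, Model TEXT)")
conn.execute("CREATE TABLE models_df (Model TEXT, FormFactor TEXT)")
conn.executemany(
    "INSERT INTO sample_data_schema VALUES (?, ?)",
    [(1, "MacBook Pro"), (2, "MacBook"), (3, "iMac")],
)
conn.executemany(
    "INSERT INTO models_df VALUES (?, ?)",
    [
        ("MacBook Pro", "Laptop"),
        ("MacBook", "Laptop"),
        ("MacBook Air", "Laptop"),
        ("iMac", "Desktop"),
    ],
)

# Every row of the left table is kept; FormFactor is looked up
# in the right table by Model.
left_join_rows = conn.execute(
    """
    SELECT a.*
        , b.FormFactor
    FROM sample_data_schema AS a
    LEFT JOIN models_df AS b
        ON a.Model = b.Model
    ORDER BY a.Id
    """
).fetchall()

for row in left_join_rows:
    print(row)
```

Because every hypothetical Model has a match in models_df, each returned row carries a FormFactor value.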

If we had a DataFrame that did not list every Model (note that the MacBook is missing), the code would look as follows:

models_df = sc.parallelize([
    ('MacBook Pro', 'Laptop')
    , ('MacBook Air', 'Laptop')
    , ('iMac', 'Desktop')
]).toDF(['Model', 'FormFactor'])

(
    sample_data_schema
    .join(
        models_df
        , sample_data_schema.Model == models_df.Model
        , 'left'
    ).show()
)

This will generate a table with missing (null) FormFactor values for any Model that has no match in models_df:
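
To illustrate where the missing values come from, here is a minimal plain-Python sketch of the LEFT join logic. It is not PySpark code; the rows and the Id column are hypothetical, and the sketch assumes each Model appears at most once on the right side.

```python
# Hypothetical left-side rows; not the actual contents of sample_data_schema.
sample_rows = [
    {"Id": 1, "Model": "MacBook Pro"},
    {"Id": 2, "Model": "MacBook"},
    {"Id": 3, "Model": "iMac"},
]

# Right side mirrors the reduced models_df: 'MacBook' is missing.
form_factors = {
    "MacBook Pro": "Laptop",
    "MacBook Air": "Laptop",
    "iMac": "Desktop",
}


def left_join(rows, lookup):
    """Keep every left row; FormFactor is None when Model has no match."""
    return [{**row, "FormFactor": lookup.get(row["Model"])} for row in rows]


left_joined = left_join(sample_rows, form_factors)
for row in left_joined:
    print(row)
```

The row for the hypothetical MacBook record is kept, but its FormFactor is None, which is how a null shows up in this sketch.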

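The RIGHT join keeps every record from the right DataFrame, together with the records from the left DataFrame that match them. Left records without a match are dropped, and right records without a match get null values in the left columns. Consider the following code: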

(
    sample_data_schema
    .join(
        models_df
        , sample_data_schema.Model == models_df.Model
        , 'right'
    ).show()
)

This produces a table as follows:
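
As a rough plain-Python sketch of the same RIGHT join logic (again with hypothetical rows and a hypothetical Id column, not the actual data), every right row is emitted, paired with each matching left row or with None placeholders when nothing matches. Both Model columns appear in each result row, as they do when joining on an expression.

```python
# Hypothetical (Id, Model) rows standing in for sample_data_schema.
sample_rows = [(1, "MacBook Pro"), (2, "MacBook"), (3, "iMac")]

# (Model, FormFactor) rows mirroring the reduced models_df.
model_rows = [
    ("MacBook Pro", "Laptop"),
    ("MacBook Air", "Laptop"),
    ("iMac", "Desktop"),
]


def right_join(left, right):
    """Keep every right row; left columns are None when nothing matches."""
    joined = []
    for model, form_factor in right:
        matches = [row for row in left if row[1] == model]
        if matches:
            for left_id, left_model in matches:
                joined.append((left_id, left_model, model, form_factor))
        else:
            joined.append((None, None, model, form_factor))
    return joined


right_joined = right_join(sample_rows, model_rows)
for row in right_joined:
    print(row)
```

The hypothetical MacBook record disappears because it has no match on the right, while the MacBook Air row survives with empty left columns.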

The SEMI and ANTI joins are somewhat recent additions. The SEMI join keeps the records from the left DataFrame that have a match in the right DataFrame (so, as with the RIGHT join, unmatched left records are dropped), but it returns only the columns from the left DataFrame. The ANTI join is the opposite of the SEMI join: it keeps only the records from the left DataFrame that are not found in the right DataFrame. The following is an example of a SEMI join:

(
    sample_data_schema
    .join(
        models_df
        , sample_data_schema.Model == models_df.Model
        , 'left_semi'
    ).show()
)

This will produce the following result:

An example of an ANTI join is as follows:

(
    sample_data_schema
    .join(
        models_df
        , sample_data_schema.Model == models_df.Model
        , 'left_anti'
    ).show()
)

This will generate the following:
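
The difference between the two can be sketched in plain Python as a pair of filters on the left side. This is only an illustration with hypothetical rows, not PySpark code: the SEMI join keeps left rows whose Model exists on the right, the ANTI join keeps the rest, and neither adds any right-side columns.

```python
# Hypothetical (Id, Model) rows standing in for sample_data_schema.
sample_rows = [(1, "MacBook Pro"), (2, "MacBook"), (3, "iMac")]

# Models present in the reduced models_df ('MacBook' is missing).
right_models = {"MacBook Pro", "MacBook Air", "iMac"}

# SEMI: left rows with a match on the right, left columns only.
semi_joined = [row for row in sample_rows if row[1] in right_models]

# ANTI: left rows with no match on the right, left columns only.
anti_joined = [row for row in sample_rows if row[1] not in right_models]

print("semi:", semi_joined)
print("anti:", anti_joined)
```

Together the two results partition the left rows: every left record lands in exactly one of them.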
