Front Matter

Raju Kumar Mishra and Sundar Rajan Raman

PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes

Raju Kumar Mishra

Bangalore, Karnataka, India

Sundar Rajan Raman

Chennai, Tamil Nadu, India

ISBN 978-1-4842-4334-3

e-ISBN 978-1-4842-4335-0

https://doi.org/10.1007/978-1-4842-4335-0

Library of Congress Control Number: 2019934769

Apress Standard

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To the Almighty, who guides me in every aspect of my life. And to my mother, Smt. Savitri Mishra, and my lovely wife, Smt. Smita Rani Pathak.

Introduction

This book will take you on an interesting journey to learn about PySparkSQL and Big Data using a problem-solution approach. Every problem is followed by a detailed, step-by-step answer, which will improve your thought process for solving Big Data problems with PySparkSQL. The following is a brief description of each chapter:

Chapter 1, “Introduction to PySparkSQL,” covers many Big Data processing tools such as Apache Hadoop, Apache Pig, Apache Hive, and Apache Spark. It discusses the shortcomings of Hadoop and the evolution of Spark, introduces PySparkSQL and DataFrames, and covers structured streaming. It also includes a discussion of Apache Kafka and sheds light on some NoSQL databases like MongoDB and Cassandra.
Chapter 2, “Installation,” will take you to the real battleground. You’ll learn how to install many Big Data processing tools such as Hadoop, Hive, Spark, MongoDB, and Apache Cassandra.
Chapter 3, “IO in PySparkSQL,” will take you through many recipes that read data from a variety of data sources using PySparkSQL. You’ll read data from file formats like CSV, JSON, ORC, and Parquet, then from RDBMSs like MySQL and PostgreSQL. It also discusses how to read data from NoSQL databases like MongoDB and Cassandra using PySparkSQL. You’ll then see how to save data into sinks such as files, RDBMSs, and NoSQL databases.
Chapter 4, “Operations on PySparkSQL DataFrames,” explains different operations like data filtering, data transformation, and data sorting on DataFrames.
Chapter 5, “Data Merging and Data Aggregation Using PySparkSQL,” shows how to perform data aggregation and data merging on DataFrames.
Chapter 6, “SQL, NoSQL, and PySparkSQL,” shows how to perform SQL operations on DataFrames. It contains multiple recipes that will help you convert DataFrames to table-like structures and then apply SQL queries to them.
Chapter 7, “Optimizing PySparkSQL,” shows you how to perform optimal joins that run faster. You will understand the basics of how Spark works in the background and, based on that, you will see multiple recipes that will help you optimize your SQL queries on DataFrames.
Chapter 8, “Structured Streaming,” shows you how to use Spark streaming with streaming data. This chapter provides multiple recipes to help you apply Spark’s structured streaming APIs and SQL queries to streaming data.
Chapter 9, “GraphFrames,” shows you how to perform graph operations on DataFrames. There are multiple GraphFrame recipes, including PageRank, that will help you to appreciate and apply complex graph operations using Spark’s GraphFrame.

Acknowledgments

My heartiest thanks to the Almighty. I also would like to thank my mother, Smt. Savitri Mishra; my sisters, Mitan and Priya; my cousins, Suchitra and Chandni; and my maternal uncle, Shyam Bihari Pandey, for their support and encouragement. I am very grateful to my sweet and beautiful wife, Smt. Smita Rani Pathak, for her continuous encouragement and love while I was writing this book. I thank my brother-in-law, Mr. Prafull Chandra Pandey, for his encouragement to write this book. I am very thankful to my sisters-in-law, Rinky, Reena, Kshama, Charu, and Dhriti, for their encouragement as well. I am grateful to Anurag Pal Sehgal, Saurabh Gupta, Devendra Mani Tripathi, Avinash Dash, Rajesh Thakur, and all my friends, as well as to my nephews, Rashu and Rishu. Last but not least, thanks to coordinating editor Aditee Mirashi and acquisitions editor Celestin Suresh John at Apress; without them, this book would not have been possible.

Chapter 1: Introduction to PySpark SQL

Introduction to Big Data

Volume

Velocity

Variety

Veracity

Introduction to Hadoop

Introduction to HDFS

Introduction to MapReduce

Introduction to Apache Hive

Introduction to Apache Pig

Introduction to Apache Kafka

Producer

Broker

Consumer

Introduction to Apache Spark

PySpark SQL: An Introduction

Introduction to DataFrames

SparkSession

Structured Streaming

Catalyst Optimizer

Introduction to Cluster Managers

Introduction to PostgreSQL

Introduction to MongoDB

Introduction to Cassandra

Chapter 2: Installation

Recipe 2-1. Install Hadoop on a Single Machine

Problem

Solution

How It Works

Recipe 2-2. Install Spark on a Single Machine

Problem

Solution

How It Works

Recipe 2-3. Use the PySpark Shell

Problem

Solution

How It Works

Recipe 2-4. Install Hive on a Single Machine

Problem

Solution

How It Works

Recipe 2-5. Install PostgreSQL

Problem

Solution

How It Works

Recipe 2-6. Configure the Hive Metastore on PostgreSQL

Problem

Solution

How It Works

Recipe 2-7. Connect PySpark to Hive

Problem

Solution

How It Works

Recipe 2-8. Install MySQL

Problem

Solution

How It Works

Recipe 2-9. Install MongoDB

Problem

Solution

How It Works

Recipe 2-10. Install Cassandra

Problem

Solution

How It Works

Chapter 3: IO in PySpark SQL

Recipe 3-1. Read a CSV File

Problem

Solution

How It Works

Recipe 3-2. Read a JSON File

Problem

Solution

How It Works

Recipe 3-3. Save a DataFrame as a CSV File

Problem

Solution

How It Works

Recipe 3-4. Save a DataFrame as a JSON File

Problem

Solution

How It Works

Recipe 3-5. Read ORC Files

Problem

Solution

How It Works

Recipe 3-6. Read a Parquet File

Problem

Solution

How It Works

Recipe 3-7. Save a DataFrame as an ORC File

Problem

Solution

How It Works

Recipe 3-8. Save a DataFrame as a Parquet File

Problem

Solution

How It Works

Recipe 3-9. Read Data from MySQL

Problem

Solution

How It Works

Recipe 3-10. Read Data from PostgreSQL

Problem

Solution

How It Works

Recipe 3-11. Read Data from Cassandra

Problem

Solution

How It Works

Recipe 3-12. Read Data from MongoDB

Problem

Solution

How It Works

Recipe 3-13. Save a DataFrame to MySQL

Problem

Solution

How It Works

Recipe 3-14. Save a DataFrame to PostgreSQL

Problem

Solution

How It Works

Recipe 3-15. Save DataFrame Contents to MongoDB

Problem

Solution

How It Works

Recipe 3-16. Read Data from Apache Hive

Problem

Solution

How It Works

Chapter 4: Operations on PySpark SQL DataFrames

Recipe 4-1. Transform Values in a Column of a DataFrame

Problem

Solution

How It Works

Recipe 4-2. Select Columns from a DataFrame

Problem

Solution

How It Works

Recipe 4-3. Filter Rows from a DataFrame

Problem

Solution

How It Works

Recipe 4-4. Delete a Column from an Existing DataFrame

Problem

Solution

How It Works

Recipe 4-5. Create and Use a PySpark SQL UDF

Problem

Solution

How It Works

Recipe 4-6. Data Labeling

Problem

Solution

How It Works

Recipe 4-7. Perform Descriptive Statistics on a Column of a DataFrame

Problem

Solution

How It Works

Recipe 4-8. Calculate Covariance

Problem

Solution

How It Works

Recipe 4-9. Calculate Correlation

Problem

Solution

How It Works

Recipe 4-10. Describe a DataFrame

Problem

Solution

How It Works

Recipe 4-11. Sort Data in a DataFrame

Problem

Solution

How It Works

Recipe 4-12. Sort Data Partition-Wise

Problem

Solution

How It Works

Recipe 4-13. Remove Duplicate Records from a DataFrame

Problem

Solution

How It Works

Recipe 4-14. Sample Records

Problem

Solution

How It Works

Recipe 4-15. Find Frequent Items

Problem

Solution

How It Works

Chapter 5: Data Merging and Data Aggregation Using PySparkSQL

Recipe 5-1. Aggregate Data on a Single Key

Problem

Solution

How It Works

Recipe 5-2. Aggregate Data on Multiple Keys

Problem

Solution

How It Works

Recipe 5-3. Create a Contingency Table

Problem

Solution

How It Works

Recipe 5-4. Perform Joining Operations on Two DataFrames

Problem

Solution

How It Works

Recipe 5-5. Vertically Stack Two DataFrames

Problem

Solution

How It Works

Recipe 5-6. Horizontally Stack Two DataFrames

Problem

Solution

How It Works

Recipe 5-7. Perform Missing Value Imputation

Problem

Solution

How It Works

Chapter 6: SQL, NoSQL, and PySparkSQL

Recipe 6-1. Create a DataFrame from a CSV File

Problem

Solution

How It Works

Recipe 6-2. Create a Temp View from a DataFrame

Problem

Solution

How It Works

Recipe 6-3. Create a Simple SQL from a DataFrame

Problem

Solution

How It Works

Recipe 6-4. Apply Spark UDF Methods on Spark SQL

Problem

Solution

How It Works

Recipe 6-5. Create a New PySpark UDF

Problem

Solution

How It Works

Recipe 6-6. Join Two DataFrames Using SQL

Problem

Solution

How It Works

Recipe 6-7. Join Multiple DataFrames Using SQL

Problem

Solution

How It Works

Chapter 7: Optimizing PySpark SQL

Recipe 7-1. Apply Aggregation Using PySpark SQL

Problem

Solution

How It Works

Recipe 7-2. Apply Windows Functions Using PySpark SQL

Problem

Solution

How It Works

Recipe 7-3. Cache Data Using PySpark SQL

Problem

Solution

How It Works

Recipe 7-4. Apply the Distribute By, Sort By, and Cluster By Clauses in PySpark SQL

Problem

Solution

How It Works

Chapter 8: Structured Streaming

Recipe 8-1. Set Up a Streaming DataFrame on a Directory

Problem

Solution

How It Works

Recipe 8-2. Initiate a Streaming Query and See It in Action

Problem

Solution

How It Works

Recipe 8-3. Apply PySparkSQL on Streaming

Problem

Solution

How It Works

Recipe 8-4. Join Streaming Data with Static Data

Problem

Solution

How It Works

Chapter 9: GraphFrames

Recipe 9-1. Create GraphFrames

Problem

Solution

How It Works

Recipe 9-2. Apply Triangle Counting in a GraphFrame

Problem

Solution

How It Works

Recipe 9-3. Apply a PageRank Algorithm

Problem

Solution

How It Works

Recipe 9-4. Apply the Breadth First Algorithm

Problem

Solution

How It Works

Index

About the Authors and About the Technical Reviewer

About the Authors

Raju Kumar Mishra has strong interests in data science and systems that have the capability of handling large amounts of data and operating complex mathematical models through computational programming. He was inspired to pursue an M.Tech in computational sciences from the Indian Institute of Science in Bangalore, India. Raju primarily works in the areas of data science and its different applications. Working as a corporate trainer, he has developed unique insights that help him teach and explain complex ideas with ease. Raju is also a data science consultant solving complex industrial problems. He works on programming tools such as R, Python, scikit-learn, Statsmodels, Hadoop, Hive, Pig, Spark, and many others. His venture Walsoul Private Ltd provides training in data science, programming, and Big Data.

Sundar Rajan Raman has been working as a Big Data architect with strong hands-on experience in various technologies such as Hadoop, Spark, Hive, Pig, Oozie, Kafka, and others. With a strong Machine Learning background, he has implemented various Machine Learning projects that are based on huge volumes of data. Sundar completed his B.Tech with Honors from the National Institute of Technology. He has an innovative mind for solving complex problems and holds patents in his name. He is currently working for one of the top financial institutions in the United States of America.

About the Technical Reviewer

Pramod Singh is currently a data science manager at Publicis.Sapient, working with clients like Daimler, Nissan, and JCPenney. He has extensive hands-on experience in Machine Learning, data engineering, programming, and in designing algorithms for various business requirements in domains such as retail, telecom, automobile, and consumer goods. He drives lots of strategic initiatives that deal with Machine Learning and AI at Publicis.Sapient. He is a published author and has published books on Machine Learning and AI. He has also been a regular speaker at major conferences and universities. He lives in Bangalore with his wife and two-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.
