Raju Kumar Mishra and Sundar Rajan Raman
PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes
Raju Kumar Mishra
Bangalore, Karnataka, India
Sundar Rajan Raman
Chennai, Tamil Nadu, India
ISBN 978-1-4842-4334-3
e-ISBN 978-1-4842-4335-0
Library of Congress Control Number: 2019934769
© Raju Kumar Mishra and Sundar Rajan Raman 2019
Apress Standard
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To the Almighty, who guides me in every aspect of my life. And to my mother, Smt. Savitri Mishra, and my lovely wife, Smt. Smita Rani Pathak.

Introduction
This book will take you on an interesting journey to learn about PySparkSQL and Big Data using a problem-solution approach. Every problem is followed by a detailed, step-by-step answer, which will improve your thought process for solving Big Data problems with PySparkSQL. The following is a brief description of each chapter:
  • Chapter 1, “Introduction to PySparkSQL,” covers many Big Data processing tools such as Apache Hadoop, Apache Pig, Apache Hive, and Apache Spark. It discusses the shortcomings of Hadoop and the evolution of Spark, introduces PySparkSQL and the DataFrame, and covers structured streaming. A discussion of Apache Kafka is also included. This chapter also sheds light on some NoSQL databases like MongoDB and Cassandra.

  • Chapter 2, “Installation,” will take you to the real battleground. You’ll learn how to install many Big Data processing tools such as Hadoop, Hive, Spark, MongoDB, and Apache Cassandra.

  • Chapter 3, “IO in PySparkSQL,” will take you through recipes that read data from a variety of data sources using PySparkSQL. You’ll read data from file formats like CSV, JSON, ORC, and Parquet, then from RDBMSs like MySQL and PostgreSQL. It also discusses how to read data from NoSQL databases like MongoDB and Cassandra using PySparkSQL. Then you’ll see how to save the data into sinks such as files, RDBMSs, and NoSQL databases.

  • Chapter 4, “Operations on PySparkSQL DataFrames,” explains different operations like data filtering, data transformation, and data sorting on DataFrames.

  • Chapter 5, “Data Merging and Data Aggregation Using PySparkSQL,” shows how to perform data aggregation and data merging on DataFrames.

  • Chapter 6, “SQL, NoSQL, and PySparkSQL,” shows how to perform SQL operations on DataFrames. It contains multiple recipes that will help you convert DataFrames to table-like structures and then apply SQL queries to them.

  • Chapter 7, “Optimizing PySparkSQL,” shows you how to perform optimal joins that run faster. You will learn the basics of how Spark works in the background and, based on that, you will see multiple recipes that will help you optimize your SQL queries on DataFrames.

  • Chapter 8, “Structured Streaming,” shows you how to use Spark streaming with streaming data. This chapter provides multiple recipes to help you apply Spark’s structured streaming APIs and SQL to streaming data.

  • Chapter 9, “GraphFrames,” shows you how to perform graph operations on DataFrames. There are multiple GraphFrame recipes, including PageRank, that will help you to appreciate and apply complex graph operations using Spark’s GraphFrame.

Acknowledgments

My heartiest thanks to the Almighty. I also would like to thank my mother, Smt. Savitri Mishra; my sisters, Mitan and Priya; my cousins, Suchitra and Chandni; and my maternal uncle, Shyam Bihari Pandey, for their support and encouragement. I am very grateful to my sweet and beautiful wife, Smt. Smita Rani Pathak, for her continuous encouragement and love while I was writing this book. I thank my brother-in-law, Mr. Prafull Chandra Pandey, for his encouragement to write this book. I am very thankful to my sisters-in-law, Rinky, Reena, Kshama, Charu, and Dhriti, for their encouragement as well. I am grateful to Anurag Pal Sehgal, Saurabh Gupta, Devendra Mani Tripathi, Avinash Dash, Rajesh Thakur, and all my friends, as well as to my nephews, Rashu and Rishu. Last but not least, thanks to coordinating editor Aditee Mirashi and acquisitions editor Celestin Suresh John at Apress; without them, this book would not have been possible.

About the Authors

Raju Kumar Mishra has strong interests in data science and systems that have the capability of handling large amounts of data and operating complex mathematical models through computational programming. He was inspired to pursue an M.Tech in computational sciences from the Indian Institute of Science in Bangalore, India. Raju primarily works in the areas of data science and its different applications. Working as a corporate trainer, he has developed unique insights that help him teach and explain complex ideas with ease. Raju is also a data science consultant solving complex industrial problems. He works on programming tools such as R, Python, scikit-learn, Statsmodels, Hadoop, Hive, Pig, Spark, and many others. His venture Walsoul Private Ltd provides training in data science, programming, and Big Data.

 
Sundar Rajan Raman has been working as a Big Data architect with strong hands-on experience in various technologies such as Hadoop, Spark, Hive, Pig, Oozie, Kafka, and others. With a strong Machine Learning background, he has implemented various Machine Learning projects that are based on huge volumes of data. Sundar completed his B.Tech with Honors from the National Institute of Technology. He has an innovative mind for solving complex problems. He also has patents in his name. He is currently working for one of the top financial institutions in the United States of America.

 

About the Technical Reviewer

Pramod Singh is currently a data science manager at Publicis.Sapient, working with clients like Daimler, Nissan, and JCPenney. He has extensive hands-on experience in Machine Learning, data engineering, programming, and in designing algorithms for various business requirements in domains such as retail, telecom, automobile, and consumer goods. He drives many strategic initiatives that deal with Machine Learning and AI at Publicis.Sapient. He is the author of published books on Machine Learning and AI. He has also been a regular speaker at major conferences and universities. He lives in Bangalore with his wife and two-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

 