Raju Kumar Mishra and Sundar Rajan Raman
PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes
Raju Kumar Mishra
Bangalore, Karnataka, India
Sundar Rajan Raman
Chennai, Tamil Nadu, India
This book will take you on an interesting journey to learn about PySparkSQL and Big Data using a problem-solution approach. Every problem is followed by a detailed, step-by-step answer, which will improve your thought process for solving Big Data problems with PySparkSQL. The following is a brief description of each chapter:
  • Chapter 1, “Introduction to PySparkSQL,” covers many Big Data processing tools, such as Apache Hadoop, Apache Pig, Apache Hive, and Apache Spark. It discusses the shortcomings of Hadoop and the evolution of Spark, introduces PySparkSQL and the DataFrame, and covers structured streaming. A discussion of Apache Kafka is also included, and the chapter sheds light on NoSQL databases such as MongoDB and Cassandra.

  • Chapter 2, “Installation,” will take you to the real battleground. You’ll learn how to install many Big Data processing tools such as Hadoop, Hive, Spark, MongoDB, and Apache Cassandra.

  • Chapter 3, “IO in PySparkSQL,” takes you through recipes that read data from a variety of sources using PySparkSQL. You’ll read data from file formats such as CSV, JSON, ORC, and Parquet, and then from RDBMSs such as MySQL and PostgreSQL. The chapter also discusses how to read data from NoSQL databases such as MongoDB and Cassandra using PySparkSQL. You then see how to save data to sinks such as files, RDBMSs, and NoSQL databases.

  • Chapter 4, “Operations on PySparkSQL DataFrames,” explains different operations like data filtering, data transformation, and data sorting on DataFrames.

  • Chapter 5, “Data Merging and Data Aggregation Using PySparkSQL,” shows how to perform data aggregation and data merging on DataFrames.

  • Chapter 6, “SQL, NoSQL, and PySparkSQL,” shows how to perform SQL operations on DataFrames. It contains multiple recipes that will help you convert DataFrames to table-like structures and then apply SQL queries to them.

  • Chapter 7, “Optimizing PySparkSQL,” shows you how to write joins that run faster. You will learn the basics of how Spark works in the background and, building on that, work through multiple recipes that will help you optimize your SQL queries on DataFrames.

  • Chapter 8, “Structured Streaming,” shows you how to process streaming data with Spark. This chapter provides multiple recipes to help you apply Spark’s structured streaming APIs and SQL queries to streaming data.

  • Chapter 9, “GraphFrames,” shows you how to perform graph operations on DataFrames. There are multiple GraphFrames recipes, including PageRank, that will help you appreciate and apply complex graph operations using GraphFrames.


About the Authors and About the Technical Reviewer

About the Authors

Raju Kumar Mishra has strong interests in data science and systems that have the capability of handling large amounts of data and operating complex mathematical models through computational programming. He was inspired to pursue an M.Tech in computational sciences from the Indian Institute of Science in Bangalore, India. Raju primarily works in the areas of data science and its different applications. Working as a corporate trainer, he has developed unique insights that help him teach and explain complex ideas with ease. Raju is also a data science consultant solving complex industrial problems. He works on programming tools such as R, Python, scikit-learn, Statsmodels, Hadoop, Hive, Pig, Spark, and many others. His venture Walsoul Private Ltd provides training in data science, programming, and Big Data.

Sundar Rajan Raman has been working as a Big Data architect with strong hands-on experience in various technologies such as Hadoop, Spark, Hive, Pig, Oozie, Kafka, and others. With a strong Machine Learning background, he has implemented various Machine Learning projects that are based on huge volumes of data. Sundar completed his B.Tech with Honors from the National Institute of Technology. He has an innovative mind for solving complex problems and holds patents in his name. He is currently working for one of the top financial institutions in the United States of America.


About the Technical Reviewer

Pramod Singh is currently a data science manager at Publicis.Sapient, working with clients like Daimler, Nissan, and JCPenney. He has extensive hands-on experience in Machine Learning, data engineering, programming, and in designing algorithms for various business requirements in domains such as retail, telecom, automobile, and consumer goods. He drives many strategic initiatives that deal with Machine Learning and AI at Publicis.Sapient. He is the author of published books on Machine Learning and AI. He has also been a regular speaker at major conferences and universities. He lives in Bangalore with his wife and two-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

