Title Page
Copyright and Credits
    Hands-On Big Data Analytics with PySpark
About Packt
    Why subscribe?
    Packt.com
Contributors
    About the authors
    Packt is searching for authors like you
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
        Download the example code files
        Download the color images
        Conventions used
    Get in touch
        Reviews
Installing PySpark and Setting up Your Development Environment
    An overview of PySpark
        Spark SQL
    Setting up Spark on Windows and PySpark
    Core concepts in Spark and PySpark
        SparkContext
        Spark shell
        SparkConf
    Summary
Getting Your Big Data into the Spark Environment Using RDDs
    Loading data on to Spark RDDs
        The UCI machine learning repository
        Getting the data from the repository to Spark
        Getting data into Spark
    Parallelization with Spark RDDs
        What is parallelization?
        Basics of RDD operation
    Summary
Big Data Cleaning and Wrangling with Spark Notebooks
    Using Spark Notebooks for quick iteration of ideas
    Sampling/filtering RDDs to pick out relevant data points
    Splitting datasets and creating some new combinations
    Summary
Aggregating and Summarizing Data into Useful Reports
    Calculating averages with map and reduce
    Faster average computations with aggregate
    Pivot tabling with key-value paired data points
    Summary
Powerful Exploratory Data Analysis with MLlib
    Computing summary statistics with MLlib
    Using Pearson and Spearman correlations to discover correlations
        The Pearson correlation
        The Spearman correlation
        Computing Pearson and Spearman correlations
    Testing our hypotheses on large datasets
    Summary
Putting Structure on Your Big Data with SparkSQL
    Manipulating DataFrames with Spark SQL schemas
    Using Spark DSL to build queries
    Summary
Transformations and Actions
    Using Spark transformations to defer computations to a later time
    Avoiding transformations
    Using the reduce and reduceByKey methods to calculate the results
    Performing actions that trigger computations
    Reusing the same RDD for different actions
    Summary
Immutable Design
    Delving into the Spark RDD's parent/child chain
        Extending an RDD
        Chaining a new RDD with the parent
        Testing our custom RDD
    Using RDD in an immutable way
    Using DataFrame operations to transform
    Immutability in the highly concurrent environment
    Using the Dataset API in an immutable way
    Summary
Avoiding Shuffle and Reducing Operational Expenses
    Detecting a shuffle in a process
    Testing operations that cause a shuffle in Apache Spark
    Changing the design of jobs with wide dependencies
    Using keyBy() operations to reduce shuffle
    Using a custom partitioner to reduce shuffle
    Summary
Saving Data in the Correct Format
    Saving data in plain text format
    Leveraging JSON as a data format
    Tabular formats – CSV
    Using Avro with Spark
    Columnar formats – Parquet
    Summary
Working with the Spark Key/Value API
    Available actions on key/value pairs
    Using aggregateByKey instead of groupBy()
    Actions on key/value pairs
    Available partitioners on key/value data
    Implementing a custom partitioner
    Summary
Testing Apache Spark Jobs
    Separating logic from the Spark engine – unit testing
    Integration testing using SparkSession
    Mocking data sources using partial functions
    Using ScalaCheck for property-based testing
    Testing in different versions of Spark
    Summary
Leveraging the Spark GraphX API
    Creating a graph from a data source
        Creating the loader component
        Revisiting the graph format
        Loading Spark from file
    Using the Vertex API
        Constructing a graph using the vertex
        Creating couple relationships
    Using the Edge API
        Constructing the graph using edge
    Calculating the degree of the vertex
        The in-degree
        The out-degree
    Calculating PageRank
        Loading and reloading data about users and followers
    Summary
Other Books You May Enjoy
    Leave a review - let other readers know what you think