Index

A

adduser command
allBigData
Apache Hadoop
Apache Hive
code execution flow
HiveQL commands
RDBMS
SQL
Apache Kafka
broker
consumer
development
message flow
producer
Apache Mesos cluster manager
Apache Pig
Apache Spark
description
GraphFrames
MLlib
Resilient Distributed Datasets
Apache Storm
Apache Tez
Atomicity, Consistency, Isolation and Durability (ACID) principles

B

Big Data
Apache Hadoop
Apache Storm
Apache Tez
variety
velocity
veracity
volume
Big Data frameworks
Breadth-first algorithm
API
path-finding algorithms

C

Cassandra
Cassandra installation
Catalyst optimizer
Cluster By clause
Cluster managers
Apache Mesos
distributed system
standalone
YARN
Comma-separated values (CSV) file
DataFrame
header argument
inferSchema argument
reading
spark.read.csv() function
swimmerData.csv
Console sink
Contingency table
creation
MongoDB command
restaurant survey
Correlation
Covariance
Cross tabulations
CSV file, DataFrame creation
JSON format
parquet format
sample data

D, E

Data aggregation
multiple key
department columns
gender columns
multiple columns
single key
average frequency
data from MySQL
functions
gender-wise
mean value
DataFrames
categorical variables
correlation
covariance
CSV file
describe() function
frequent items
horizontal stacking
inner join
new creation
PostgreSQL
joining
arguments
Cassandra table
full outer joins
inner joins
left outer joins
right outer joins
students table
subjects table
types of joins
JSON file
MongoDB
MySQL
ORC file
Parquet file
PostgreSQL
removing duplicate records
sample records (see Sampling data)
simple SQL creation
create aliases
filtering on case-sensitive
use alias columns
using column names
where clause filtering
summary() function
swimming competition (see Swimming competition, DataFrames)
temp view creation
vertical stacking
new creation
PostgreSQL
SQL commands
DataFrame streaming
creation
join static data
output modes
PySparkSQL queries
set up
sink types
SQL filters
static data
temperature data
Data labeling
Degrees
Descriptive statistics
agg() function
corrData.json
counting, number of elements
population variance
pyspark.sql.functions submodule
sample mean
sample variance
spark.read.json function
summation, mean, and standard deviation
variance, mean, and standard deviation
Directed Acyclic Graphs
Distribute By clause
Distributed systems
drop() function
dropna() function

F

File sink
File source
fillna() function

G

Google File System (GFS)
GraphFrames
creation
DataFrames
degrees
error message
installation
persons.csv
personsDf DataFrame
relationship.csv
triangle count value
vertices
groupby() function

H, I

Hadoop
components
The Google File System
MapReduce
Hadoop Distributed File System (HDFS)
Hadoop ecosystem frameworks
Hadoop installation, single machine
.bashrc file update
CentOS user creation
directory installation
downloading
environment file
HDFS
Java installation
jps command checking
namenode format running
password-less login creation
properties files
start-dfs.sh script
start-yarn.sh script
stop-dfs.sh shell script
YARN
Hive installation, single machine
.bashrc file
datawarehouse directory creation
directory
downloading
extraction
hive-site.xml
metastore database
Hive metastore configuration, PostgreSQL
downloading JDBC connector
external RDBMS
grant command
hive-site.xml modification
lib directory
pg_hba.conf file
pymetastore database
testing
user and database creation
Horizontal scaling

J

JavaScript Object Notation (JSON) file
corrData.json
DoubleType()
spark.read.json() function
JobTracker
join() function

K, L

Kafka sink
Kafka source

M, N

MapReduce
Hadoop streaming module
HDFS
iterative algorithms
JobTracker and TaskTracker
MapReduce framework
maxIter
Missing value imputation
arguments
drop the rows
MongoDB
replace with zero
thresh argument
MongoDB
MySQL
MySQL server installation

O

Optimized Row Columnar (ORC) file

P

PageRank algorithm
Apache Spark
attributes
Parquet file
Partition-wise sorting
Occupation and swimTimeInSecond columns
process of
repartitioning and shuffling
single
sortWithinPartitions() function
passwd command
PostgreSQL
printSchema() function
Procedural Language/PostgreSQL (PL/pgSQL)
PySpark shell
PySpark SQL
aggregation
Group By clause
number of students per subject
number of subjects per student
cache data
catalyst optimizer
Cluster By clause
DataFrames
Distribute By and Sort By clause
domain-specific language
file format systems
SparkSession
Structured Streaming
window functions
partition
rank function
ranking in place
PySpark to Hive, connection
PySpark UDF creation
genderCodeToValue
Python function

Q

Query Optimizer

R

rank() function
Rate source
Relational Database Management Systems (RDBMSs)
Reset probability
Resilient Distributed Dataset (RDD)
restaurantSurvey

S

Sampling data
noDuplicateDf1 DataFrame
column iv1
without replacement
with replacement
sampleBy(col, fractions, seed=None)
sample(withReplacement, fraction, seed=None)
time-intensive and computation-intensive
Shuffling in Spark
Socket source
Sort By clause
Sorting
ascending order
descending order
Occupation and swimTimeInSecond columns
PySpark SQL API orderBy(*cols, **kwargs)
partition-wise (see Partition-wise sorting)
swimmerDf
Spark installation, single machine
.bashrc file
directory/allBigData
downloading
environment file changing
pyspark script
.tgz file extraction
SparkSession
Spark SQL
joining multiple DataFrames
applying queries
dataset
joining two DataFrames
boilerplate code
dataset
join query specifics
observations
running the queries
using full outer join
using left join
using right outer join
UDF methods
date functions
dateofbirth column
Spark terminology
Standalone cluster manager
Stream computation
Structured Streaming
Swimming competition, DataFrames
column deletion
swimmerDf2
structured datasets
column selection
id and swimmerSpeed columns
select() function
swimmerDf2 DataFrame
swimTimeInSecond column
data labeling
filtering process
gender column
occupation and swimmerSpeed > 1.17
where() function
sort data (see Sorting)
transformation operation
CSV file
printSchema() function
round() function
sample data
swimming speed
withColumn() function
UDF (see User-defined function (UDF))

T

TaskTracker
Temperature dataset
Triangle count value

U

union() function
User Defined Aggregate Functions (UDAF)
User-defined function (UDF)
average temperature
celsiustoFahrenheit
CSV file
Fahrenheit column
pyspark.sql.functions
spark.read.parquet() function
tempDfFahrenheit
temperatureData
temperature value, Celsius and Fahrenheit
tempInCelsius

V

Vertical scaling
Vertical stacking

W, X

writeStream method

Y, Z

Yet Another Resource Negotiator (YARN) cluster manager