Index

A

adduser command
allBigData
Apache Hadoop
Apache Hive
code execution flow
HiveQL commands
RDBMS
SQL
Apache Kafka
broker
consumer
development
message flow
producer
Apache Mesos cluster manager
Apache Pig
Apache Spark
description
GraphFrames
MLlib
Resilient Distributed Datasets
Apache Storm
Apache Tez
Atomicity, Consistency, Isolation and Durability (ACID) principles

B

Big Data
Apache Hadoop
Apache Storm
Apache Tez
variety
velocity
veracity
volume
Big Data frameworks
Breadth-first algorithm
API
path-finding algorithms

C

Cassandra
Cassandra installation
Catalyst optimizer
Cluster By clause
Cluster managers
Apache Mesos
distributed system
standalone
YARN
Comma-separated values (CSV) file
DataFrame
header argument
inferSchema argument
reading
spark.read.csv() function
swimmerData.csv
Console sink
Contingency table
creation
MongoDB command
restaurant survey
Correlation
Covariance
Cross tabulations
CSV file, DataFrame creation
JSON format
parquet format
sample data

D, E

Data aggregation
multiple key
department columns
gender columns
multiple columns
single key
average frequency
data from MySQL
functions
gender-wise
mean value
DataFrames
categorical variables
correlation
covariance
CSV file
describe() function
frequent items
horizontal stacking
inner join
new creation
PostgreSQL
joining
arguments
Cassandra table
full outer joins
inner joins
left outer joins
right outer joins
students table
subjects table
types of joins
JSON file
MongoDB
MySQL
ORC file
Parquet file
PostgreSQL
removing duplicate records
sample records (see Sampling data)
simple SQL creation
create aliases
filtering on case-sensitive
use alias columns
using column names
where clause filtering
summary() function
swimming competition (see Swimming competition, DataFrames)
temp view creation
vertical stacking
new creation
PostgreSQL
SQL commands
DataFrame streaming
creation
join static data
output modes
PySparkSQL queries
set up
sink types
SQL filters
static data
temperature data
Data labeling
Degrees
Descriptive statistics
agg() function
corrData.json
counting, number of elements
population variance
pyspark.sql.functions submodule
sample mean
sample variance
spark.read.json function
summation, mean, and standard deviation
variance, mean, and standard deviation
Directed Acyclic Graphs
Distribute By clause
Distributed systems
drop() function
dropna() function

F

File sink
File source
fillna() function

G

Google File System (GFS)
GraphFrames
creation
DataFrames
degrees
error message
installation
persons.csv
personsDf DataFrame
relationship.csv
triangle count value
vertices
groupby() function

H, I

Hadoop
components
The Google File System
MapReduce
Hadoop Distributed File System (HDFS)
Hadoop ecosystem frameworks
Hadoop installation, single machine
.bashrc file update
CentOS user creation
directory installation
downloading
environment file
HDFS
Java installation
jps command checking
namenode format running
password-less login creation
properties files
start-dfs.sh script
start-yarn.sh script
stop-dfs.sh shell script
YARN
Hive installation, single machine
.bashrc file
datawarehouse directory creation
directory
downloading
extraction
hive-site.xml
metastore database
Hive metastore configuration, PostgreSQL
downloading JDBC connector
external RDBMS
grant command
hive-site.xml modification
lib directory
pg_hba.conf file
pymetastore database
testing
user and database creation
Horizontal scaling

J

JavaScript Object Notation (JSON) file
corrData.json
DoubleType()
spark.read.json() function
JobTracker
join() function

K, L

Kafka sink
Kafka source

M, N

MapReduce
Hadoop streaming module
HDFS
iterative algorithms
JobTracker and TaskTracker
MapReduce framework
maxIter
Missing value imputation
arguments
drop the rows
MongoDB
replace with zero
thresh argument
MongoDB
MySQL
MySQL server installation

O

Optimized Row Columnar (ORC) file

P

PageRank algorithm
Apache Spark
attributes
Parquet file
Partition-wise sorting
Occupation and swimTimeInSecond columns
process of
repartitioning and shuffling
single
sortWithinPartitions() function
passwd command
PostgreSQL
printSchema() function
Procedural Language/PostgreSQL (PL/pgSQL)
PySpark shell
PySpark SQL
aggregation
Group By clause
number of students per subject
number of subjects per student
cache data
catalyst optimizer
Cluster By clause
DataFrames
Distribute By and Sort By clause
domain-specific language
file format systems
SparkSession
Structured Streaming
window functions
partition
rank function
ranking in place
PySpark to Hive, connection
PySpark UDF creation
genderCodeToValue
Python function

Q

Query Optimizer

R

rank() function
Rate source
Relational Database Management Systems (RDBMSs)
Reset probability
Resilient Distributed Dataset (RDD)
restaurantSurvey

S

Sampling data
noDuplicateDf1 DataFrame
column iv1
without replacement
with replacement
sampleBy(col, fractions, seed=None)
sample(withReplacement, fraction, seed=None)
time-intensive and computation-intensive
Shuffling in Spark
Socket source
Sort By clause
Sorting
ascending order
descending order
Occupation and swimTimeInSecond columns
PySpark SQL API orderBy(*cols, **kwargs)
partition-wise (see Partition-wise sorting)
swimmerDf
Spark installation, single machine
.bashrc file
directory/allBigData
downloading
environment file changing
pyspark script
.tgz file extraction
SparkSession
Spark SQL
joining multiple DataFrames
applying queries
dataset
joining two DataFrames
boilerplate code
dataset
join query specifics
observations
running the queries
using full outer join
using left join
using right outer join
UDF methods
date functions
dateofbirth column
Spark terminology
Standalone cluster manager
Stream computation
Structured Streaming
Swimming competition, DataFrames
column deletion
swimmerDf2
structured datasets
column selection
id and swimmerSpeed columns
select() function
swimmerDf2 DataFrame
swimTimeInSecond column
data labeling
filtering process
gender column
occupation and swimmerSpeed > 1.17
where() function
sort data (see Sorting)
transformation operation
CSV file
printSchema() function
round() function
sample data
swimming speed
withColumn() function
UDF (see User-defined function (UDF))

T

TaskTracker
Temperature dataset
Triangle count value

U

union() function
User Defined Aggregate Functions (UDAF)
User-defined function (UDF)
average temperature
celsiustoFahrenheit
CSV file
Fahrenheit column
pyspark.sql.functions
spark.read.parquet() function
tempDfFahrenheit
temperatureData
temperature value, Celsius and Fahrenheit
tempInCelsius

V

Vertical scaling
Vertical stacking

W, X

writeStream method

Y, Z

Yet Another Resource Negotiator (YARN) cluster manager