Programming Hive
by Jason Rutherglen, Dean Wampler, Edward Capriolo
Preface
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
What Brought Us to Hive?
Edward Capriolo
Dean Wampler
Jason Rutherglen
Acknowledgments
1. Introduction
An Overview of Hadoop and MapReduce
MapReduce
Hive in the Hadoop Ecosystem
Pig
HBase
Cascading, Crunch, and Others
Java Versus Hive: The Word Count Algorithm
What’s Next
2. Getting Started
Installing a Preconfigured Virtual Machine
Detailed Installation
Installing Java
Linux-specific Java steps
Mac OS X-specific Java steps
Installing Hadoop
Local Mode, Pseudodistributed Mode, and Distributed Mode
Testing Hadoop
Installing Hive
What Is Inside Hive?
Starting Hive
Configuring Your Hadoop Environment
Local Mode Configuration
Distributed and Pseudodistributed Mode Configuration
Metastore Using JDBC
The Hive Command
Command Options
The Command-Line Interface
CLI Options
Variables and Properties
Hive “One Shot” Commands
Executing Hive Queries from Files
The .hiverc File
More on Using the Hive CLI
Autocomplete
Command History
Shell Execution
Hadoop dfs Commands from Inside Hive
Comments in Hive Scripts
Query Column Headers
3. Data Types and File Formats
Primitive Data Types
Collection Data Types
Text File Encoding of Data Values
Schema on Read
4. HiveQL: Data Definition
Databases in Hive
Alter Database
Creating Tables
Managed Tables
External Tables
Partitioned, Managed Tables
External Partitioned Tables
Customizing Table Storage Formats
Dropping Tables
Alter Table
Renaming a Table
Adding, Modifying, and Dropping a Table Partition
Changing Columns
Adding Columns
Deleting or Replacing Columns
Alter Table Properties
Alter Storage Properties
Miscellaneous Alter Table Statements
5. HiveQL: Data Manipulation
Loading Data into Managed Tables
Inserting Data into Tables from Queries
Dynamic Partition Inserts
Creating Tables and Loading Them in One Query
Exporting Data
6. HiveQL: Queries
SELECT … FROM Clauses
Specify Columns with Regular Expressions
Computing with Column Values
Arithmetic Operators
Using Functions
Mathematical functions
Aggregate functions
Table generating functions
Other built-in functions
LIMIT Clause
Column Aliases
Nested SELECT Statements
CASE … WHEN … THEN Statements
When Hive Can Avoid MapReduce
WHERE Clauses
Predicate Operators
Gotchas with Floating-Point Comparisons
LIKE and RLIKE
GROUP BY Clauses
HAVING Clauses
JOIN Statements
Inner JOIN
Join Optimizations
LEFT OUTER JOIN
OUTER JOIN Gotcha
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI-JOIN
Cartesian Product JOINs
Map-side Joins
ORDER BY and SORT BY
DISTRIBUTE BY with SORT BY
CLUSTER BY
Casting
Casting BINARY Values
Queries that Sample Data
Block Sampling
Input Pruning for Bucket Tables
UNION ALL
7. HiveQL: Views
Views to Reduce Query Complexity
Views that Restrict Data Based on Conditions
Views and Map Type for Dynamic Tables
View Odds and Ends
8. HiveQL: Indexes
Creating an Index
Bitmap Indexes
Rebuilding the Index
Showing an Index
Dropping an Index
Implementing a Custom Index Handler
9. Schema Design
Table-by-Day
Over Partitioning
Unique Keys and Normalization
Making Multiple Passes over the Same Data
The Case for Partitioning Every Table
Bucketing Table Data Storage
Adding Columns to a Table
Using Columnar Tables
Repeated Data
Many Columns
(Almost) Always Use Compression!
10. Tuning
Using EXPLAIN
EXPLAIN EXTENDED
Limit Tuning
Optimized Joins
Local Mode
Parallel Execution
Strict Mode
Tuning the Number of Mappers and Reducers
JVM Reuse
Indexes
Dynamic Partition Tuning
Speculative Execution
Single MapReduce MultiGROUP BY
Virtual Columns
11. Other File Formats and Compression
Determining Installed Codecs
Choosing a Compression Codec
Enabling Intermediate Compression
Final Output Compression
Sequence Files
Compression in Action
Archive Partition
Compression: Wrapping Up
12. Developing
Changing Log4J Properties
Connecting a Java Debugger to Hive
Building Hive from Source
Running Hive Test Cases
Execution Hooks
Setting Up Hive and Eclipse
Hive in a Maven Project
Unit Testing in Hive with hive_test
The New Plugin Developer Kit
13. Functions
Discovering and Describing Functions
Calling Functions
Standard Functions
Aggregate Functions
Table Generating Functions
A UDF for Finding a Zodiac Sign from a Day
UDF Versus GenericUDF
Permanent Functions
User-Defined Aggregate Functions
Creating a COLLECT UDAF to Emulate GROUP_CONCAT
User-Defined Table Generating Functions
UDTFs that Produce Multiple Rows
UDTFs that Produce a Single Row with Multiple Columns
UDTFs that Simulate Complex Types
Accessing the Distributed Cache from a UDF
Annotations for Use with Functions
Deterministic
Stateful
DistinctLike
Macros
14. Streaming
Identity Transformation
Changing Types
Projecting Transformation
Manipulative Transformations
Using the Distributed Cache
Producing Multiple Rows from a Single Row
Calculating Aggregates with Streaming
CLUSTER BY, DISTRIBUTE BY, SORT BY
GenericMR Tools for Streaming to Java
Calculating Cogroups
15. Customizing Hive File and Record Formats
File Versus Record Formats
Demystifying CREATE TABLE Statements
File Formats
SequenceFile
RCFile
Example of a Custom Input Format: DualInputFormat
Record Formats: SerDes
CSV and TSV SerDes
ObjectInspector
Think Big Hive Reflection ObjectInspector
XML UDF
XPath-Related Functions
JSON SerDe
Avro Hive SerDe
Defining Avro Schema Using Table Properties
Defining a Schema from a URI
Evolving Schema
Binary Output
16. Hive Thrift Service
Starting the Thrift Server
Setting Up Groovy to Connect to HiveService
Connecting to HiveServer
Getting Cluster Status
Result Set Schema
Fetching Results
Retrieving Query Plan
Metastore Methods
Example Table Checker
Finding tables not marked as external
Administrating HiveServer
Productionizing HiveService
Cleanup
Hive ThriftMetastore
ThriftMetastore Configuration
Client Configuration
17. Storage Handlers and NoSQL
Storage Handler Background
HiveStorageHandler
HBase
Cassandra
Static Column Mapping
Transposed Column Mapping for Dynamic Columns
Cassandra SerDe Properties
DynamoDB
18. Security
Integration with Hadoop Security
Authentication with Hive
Authorization in Hive
Users, Groups, and Roles
Privileges to Grant and Revoke
Partition-Level Privileges
Automatic Grants
19. Locking
Locking Support in Hive with Zookeeper
Explicit, Exclusive Locks
20. Hive Integration with Oozie
Oozie Actions
Hive Thrift Service Action
A Two-Query Workflow
Oozie Web Console
Variables in Workflows
Capturing Output
Capturing Output to Variables
21. Hive and Amazon Web Services (AWS)
Why Elastic MapReduce?
Instances
Before You Start
Managing Your EMR Hive Cluster
Thrift Server on EMR Hive
Instance Groups on EMR
Configuring Your EMR Cluster
Deploying hive-site.xml
Deploying a .hiverc Script
Deploying .hiverc using a config step
Deploying a .hiverc using a bootstrap action
Setting Up a Memory-Intensive Configuration
Persistence and the Metastore on EMR
HDFS and S3 on EMR Cluster
Putting Resources, Configs, and Bootstrap Scripts on S3
Logs on S3
Spot Instances
Security Groups
EMR Versus EC2 and Apache Hive
Wrapping Up
22. HCatalog
Introduction
MapReduce
Reading Data
Writing Data
Command Line
Security Model
Architecture
23. Case Studies
m6d.com (Media6Degrees)
Data Science at M6D Using Hive and R
M6D UDF Pseudorank
M6D Managing Hive Data Across Multiple MapReduce Clusters
Cross deployment queries with Hive
Replicating Hive data between deployments
Outbrain
In-Site Referrer Identification
Cleaning up the URLs
Determining referrer type
Multiple URLs
Counting Uniques
Why this is a problem
Load a temp table
Querying the temp table
Sessionization
Setting it up
Finding origin pageviews
Bucketing PVs to origins
Aggregating on origins
Aggregating on origin type
Measure engagement
NASA’s Jet Propulsion Laboratory
The Regional Climate Model Evaluation System
Our Experience: Why Hive?
Some Challenges and How We Overcame Them
Conclusion
Photobucket
Big Data at Photobucket
What Hardware Do We Use for Hive?
What’s in Hive?
Who Does It Support?
SimpleReach
Experiences and Needs from the Customer Trenches
A Karmasphere Perspective
Introduction
Use Case Examples from the Customer Trenches
Customer trenches #1: Optimal data formatting for Hive
Customer trenches #2: Partitions and performance
Customer trenches #3: Text analytics with Regex, Lateral View Explode, Ngram, and other UDFs
Apache Hive in production: Incremental needs and capabilities
Collaborative multiuser environments
Productivity enhancements
Managing Hive assets
Extending Hive for advanced analytics
Extending Hive beyond the SQL skill set
Data exploration capabilities
Schedule and operationalize Hive queries
About Karmasphere
Hive features survey
Glossary
A. References
Index
About the Authors
Colophon
Copyright