Glossary

Amazon Elastic MapReduce

Amazon’s EMR is a hosted Hadoop service on top of Amazon EC2 (Elastic Compute Cloud).

Avro

Avro is a serialization format developed to address common problems with other serialization formats, notably the difficulty of evolving schemas over time. Some of its benefits are rich data structures, a fast and compact binary format, support for remote procedure calls, and built-in schema evolution.

Bash

The “Bourne-Again Shell” that is the default interactive command shell for Linux and Mac OS X systems.

S3 Bucket

The term for the top-level container you own and manage when using S3. A user may own many buckets; each bucket is analogous to the root of a physical hard drive.

Command-Line Interface

The command-line interface (CLI) can run “scripts” of Hive statements or allow the user to enter statements interactively.

Data Warehouse

A repository of structured data suitable for analysis, reporting, trend detection, etc. Warehouses run in batch mode or offline, as opposed to providing the real-time responsiveness required for online activity, like ecommerce.

Derby

A lightweight SQL database that can be embedded in Java applications. It runs in the same process and saves its data to local files. It is used as the default SQL data store for Hive’s metastore. See http://db.apache.org/derby/ for more information.

Dynamic Partitions

A HiveQL extension to SQL that allows you to insert query results into table partitions where you leave one or more partition column values unspecified and they are determined dynamically from the query results themselves. This technique is convenient for partitioning a query result into a potentially large number of partitions in a new table, without having to write a separate query for each partition column value.
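As an illustration, a dynamic-partition insert might look like the following sketch; the table and column names are hypothetical, and the two SET statements enable the feature:

```sql
-- Hypothetical tables: both have (name, salary) columns; employees also has
-- a country column, while employees_by_country is partitioned on country.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE employees_by_country
PARTITION (country)            -- no value specified; determined per row
SELECT name, salary, country
FROM employees;
```

Each distinct value of country in the query results becomes its own partition, without a separate INSERT statement per country.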

Ephemeral Storage

In the nodes of a virtual Amazon EC2 cluster, the on-node disk storage is called ephemeral because it vanishes when the cluster is shut down, in contrast to the disks of a physical cluster, which retain their data after shutdown. Hence, when using an EC2 cluster, such as an Amazon Elastic MapReduce cluster, it is important to back up important data to S3.

External Table

A table using a storage location and contents that are outside of Hive’s control. It is convenient for sharing data with other tools, but it is up to other processes to manage the life cycle of the data. That is, when an external table is created, Hive does not create the external directory (or directories for partitioned tables), nor are the directory and data files deleted when an external table is dropped.
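A minimal sketch of an external table, assuming hypothetical comma-separated stock data already resident in HDFS under /data/stocks:

```sql
-- Hive records the metadata but does not take ownership of the files.
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  symbol STRING,
  price  FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

-- Dropping the table removes only the metadata; the files remain:
DROP TABLE stocks;
```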

Hadoop Distributed File System

(HDFS) A distributed, resilient file system for data storage that is optimized for scanning large contiguous blocks of data on hard disks. Distribution across a cluster provides horizontal scaling of data storage. Blocks of HDFS files are replicated across the cluster (by default, three times) to prevent data loss when hard drives or whole servers fail.

HBase

The NoSQL database that uses HDFS for durable storage of table data. HBase is a column-oriented, key-value store designed to provide the responsiveness of a traditional database for queries and row-level updates and insertions. Column-oriented means that the data storage is organized on disk by groups of columns, called column families, rather than by row. This feature promotes fast queries for subsets of columns. Key-value means that rows are stored and fetched by a unique key and the value is the entire row. HBase does not provide an SQL dialect, but Hive can be used to query HBase tables.

Hive

A data warehouse tool that provides table abstractions on top of data resident in HDFS, HBase tables, and other stores. The Hive Query Language is a dialect of the Structured Query Language.

Hive Query Language

Hive’s own dialect of the Structured Query Language (SQL). Abbreviated HiveQL or HQL.

Input Format

The input format determines how input streams, usually from files, are split into records. A SerDe handles parsing each record into columns. A custom input format can be specified when creating a table using the INPUTFORMAT clause. The input format for the default STORED AS TEXTFILE specification is implemented by the Java class named org.apache.hadoop.mapred.TextInputFormat. See also Output Format.
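A sketch of a table declaration that names the text input and output format classes explicitly, which STORED AS TEXTFILE would otherwise supply implicitly (the table name is hypothetical):

```sql
CREATE TABLE logs (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```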

JDBC

The Java Database Connectivity (JDBC) API provides access to SQL systems, including Hive, from Java code.

Job

In the Hadoop context, a job is a self-contained workflow submitted to MapReduce. It encompasses all the work required to perform a complete calculation, from reading input to generating output. The MapReduce JobTracker will decompose the job into one or more tasks for distribution and execution around the cluster.

JobTracker

The top-level controller of all jobs using Hadoop’s MapReduce. The JobTracker accepts job submissions, determines what tasks to run and where to run them, monitors their execution, restarts failed tasks as needed, and provides a web console for monitoring job and task execution, viewing logs, etc.

Job Flow

A term used in Amazon Elastic MapReduce (EMR) for the sequence of jobs executed on a temporary EMR cluster to accomplish a particular goal.

JSON

JSON (JavaScript Object Notation) is a lightweight data serialization format used commonly in web-based applications.

Map

The mapping phase of a MapReduce process where an input set of key-value pairs are converted into a new set of key-value pairs. For each input key-value pair, there can be zero-to-many output key-value pairs. The input and output keys and the input and output values can be completely different.

MapR

A commercial distribution of Hadoop that replaces HDFS with the MapR File System (MapR-FS), a high-performance, distributed file system.

MapReduce

A computation paradigm invented at Google and based loosely on the common data operations of mapping a collection of data elements from one form to another (the map phase) and reducing a collection to a single value or a smaller collection (the reduce phase). MapReduce is designed to scale computation horizontally by decomposing map and reduce steps into tasks and distributing those tasks across a cluster. The MapReduce runtime provided by Hadoop handles decomposition of a job into tasks, distribution around the cluster, movement of a particular task to the machine that holds the data for the task, movement of data to tasks (as needed), and automated re-execution of failed tasks and other error recovery and logging services.

Metastore

The service that maintains “metadata” information, such as table schemas. Hive requires this service to be running. By default, it uses a built-in Derby SQL server, which provides limited, single-process SQL support. Production systems must use a full-service relational database, such as MySQL.

NoSQL

An umbrella term for data stores that don’t support the relational model for data management, dialects of the structured query language, and features like transactional updates. These data stores trade off these features in order to provide more cost-effective scalability, higher availability, etc.

ODBC

The Open Database Connectivity (ODBC) API provides access to SQL systems, including Hive, from other applications. Java applications typically use the JDBC API instead.

Output Format

The output format determines how records are written to output streams, usually to files. A SerDe handles serialization of each record into an appropriate byte stream. A custom output format can be specified when creating a table using the OUTPUTFORMAT clause. The output format for the default STORED AS TEXTFILE specification is implemented by the Java object named org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat. See also Input Format.

Partition

A subset of a table’s data set where one column has the same value for all records in the subset. In Hive, as in most databases that support partitioning, each partition is stored in a physically separate location—in Hive’s case, in a subdirectory of the root directory for the table. Partitions have several advantages. The column value corresponding to a partition doesn’t have to be repeated in every record in the partition, saving space, and queries with WHERE clauses that restrict the result set to specific values for the partition columns can perform more quickly, because they avoid scanning the directories of nonmatching partition values. See also dynamic partitions.
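The following sketch shows a hypothetical partitioned table and a query that benefits from partition pruning:

```sql
-- Each (country, state) pair becomes a subdirectory such as
-- .../employees/country=US/state=IL under the table's root directory.
CREATE TABLE employees (
  name   STRING,
  salary FLOAT)
PARTITIONED BY (country STRING, state STRING);

-- Restricting on the partition columns lets Hive skip the directories
-- for all nonmatching partitions:
SELECT name, salary FROM employees
WHERE country = 'US' AND state = 'IL';
```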

Reduce

The reduction phase of a MapReduce process where the key-value pairs from the map phase are processed. A crucial feature of MapReduce is that all the key-value pairs from all the map tasks that have the same key will be sent together to the same reduce task, so that the collection of values can be “reduced” as appropriate. For example, a collection of integers might be added or averaged together, a collection of strings might have all duplicates removed, etc.

Relational Model

The most common model for database management systems, it is based on a logical model of data organization and manipulation. A declarative specification of the data structure and how it should be manipulated is supplied by the user, most typically using the Structured Query Language. The implementation translates these declarations into procedures for storing, retrieving, and manipulating the data.

S3

The distributed file system for Amazon Web Services. It can be used with or instead of HDFS when running MapReduce jobs.

SerDe

The Serializer/Deserializer, or SerDe for short, is used to parse the bytes of a record into columns or fields, the deserialization process. It is also used to create those record bytes (i.e., serialization). In contrast, the Input Format is used to split an input stream into records and the Output Format is used to write records to an output stream. A SerDe can be specified when a Hive table is created. The default SerDe supports the field and collection separators discussed in Text File Encoding of Data Values, as well as various optimizations, such as lazy parsing.
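A sketch of specifying a custom SerDe at table creation time, here using the RegexSerDe from the Hive contrib jar (the table, columns, and regular expression are hypothetical; the contrib jar must be on the Hive classpath):

```sql
CREATE TABLE apache_log (
  host    STRING,
  request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Each capture group in the regex maps to one column, in order:
  "input.regex" = "([^ ]*) (.*)");
```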

Structured Query Language

A language that implements the relational model for querying and manipulating data. Abbreviated SQL. While there is an ANSI standard for SQL that has undergone periodic revisions, all SQL dialects in widespread use add their own custom extensions and variations.

Task

In the MapReduce context, a task is the smallest unit of work performed on a single cluster node, as part of an overall job. By default each task involves a separate JVM process. Each map and reduce invocation will have its own task.

Thrift

An RPC system invented by Facebook and integrated into Hive. Remote processes can send Hive statements to Hive through Thrift.

User-Defined Aggregate Functions

User-defined functions that take multiple rows (or columns from multiple rows) and return a single “aggregation” of the data, such as a count of the rows, a sum or average of number values, etc. The term is abbreviated UDAF. See also user-defined functions and user-defined table generating functions.
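The built-in aggregate functions follow the same calling convention that a UDAF would, as in this sketch against a hypothetical employees table:

```sql
-- Each aggregate collapses many input rows into a single output value:
SELECT count(*), avg(salary) FROM employees;
```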

User-Defined Functions

Functions implemented by users of Hive to extend its behavior. Sometimes the term is used generically to include built-in functions, and sometimes it refers to the specific case of functions that work on a single row (or columns in a row) and return a single value (i.e., functions that don’t change the number of records). Abbreviated UDF. See also user-defined aggregate functions and user-defined table generating functions.

User-Defined Table Generating Functions

User-defined functions that take a column from a single record and expand it into multiple rows. Examples include the explode function that converts an array into rows of single fields and, for Hive v0.8.0 and later, converts a map into rows of key and value fields. Abbreviated UDTF. See also user-defined functions and user-defined aggregate functions.
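A sketch of explode used with LATERAL VIEW, assuming a hypothetical employees table with an ARRAY&lt;STRING&gt; column named subordinates:

```sql
-- One output row is produced per array element, paired with the
-- enclosing row's name column:
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subView AS sub;
```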
