Amazon’s EMR is a hosted Hadoop service on top of Amazon EC2 (Elastic Compute Cloud).
Avro is a new serialization format developed to address some of the common problems associated with evolving other serialization formats. Some of the benefits are: rich data structures, fast binary format, support for remote procedure calls, and built-in schema evolution.
The “Bourne-Again Shell,” the default interactive command shell for Linux and Mac OS X systems.
The term for the top-level container you own and manage when using S3. A user may have many buckets, analogous to the root of a physical hard drive.
The command-line interface (CLI) can run “scripts” of Hive statements or allow the user to enter statements interactively.
A repository of structured data suitable for analysis for reports, trends, etc. Warehouses are batch mode or offline, as opposed to providing real-time responsiveness for online activity, like ecommerce.
A lightweight SQL database that can be embedded in Java applications. It runs in the same process and saves its data to local files. It is used as the default SQL data store for Hive’s metastore. See http://db.apache.org/derby/ for more information.
A HiveQL extension to SQL that allows you to insert query results into table partitions where you leave one or more partition column values unspecified and they are determined dynamically from the query results themselves. This technique is convenient for partitioning a query result into a potentially large number of partitions in a new table, without having to write a separate query for each partition column value.
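A minimal sketch of a dynamic-partition insert, assuming hypothetical tables named staged_logs and logs:

```sql
-- Enable dynamic partitioning (required before Hive will infer partition values).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The values of the country and state columns in the query results determine
-- which partition each row lands in; no partition value is named explicitly.
INSERT OVERWRITE TABLE logs
PARTITION (country, state)
SELECT message, country, state
FROM staged_logs;
```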
In the nodes of a virtual Amazon EC2 cluster, the on-node disk storage is called ephemeral because it vanishes when the cluster is shut down, in contrast to the disks of a physical cluster, which retain their data. Hence, when using an EC2 cluster, such as an Amazon Elastic MapReduce cluster, it is important to back up important data to S3.
A table using a storage location and contents that are outside of Hive’s control. It is convenient for sharing data with other tools, but it is up to other processes to manage the life cycle of the data. That is, when an external table is created, Hive does not create the external directory (or directories for partitioned tables), nor are the directory and data files deleted when an external table is dropped.
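A minimal sketch of creating and dropping an external table; the table name and path are hypothetical:

```sql
-- Hive records the metadata but does not manage the files at this location.
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/shared/page_views';

-- Dropping the table removes only Hive's metadata; the files remain.
DROP TABLE page_views;
```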
(HDFS) A distributed, resilient file system for data storage that is optimized for scanning large contiguous blocks of data on hard disks. Distribution across a cluster provides horizontal scaling of data storage. Blocks of HDFS files are replicated across the cluster (by default, three times) to prevent data loss when hard drives or whole servers fail.
The NoSQL database that uses HDFS for durable storage of table data. HBase is a column-oriented, key-value store designed to provide traditional responsiveness for queries and row-level updates and insertions. Column oriented means that the data storage is organized on disk by groups of columns, called column families, rather than by row. This feature promotes fast queries for subsets of columns. Key-value means that rows are stored and fetched by a unique key and the value is the entire row. HBase does not provide an SQL dialect, but Hive can be used to query HBase tables.
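A sketch of mapping an HBase table into Hive using the HBase storage handler, so the HBase data can be queried with HiveQL; the table and column names are hypothetical, and the Hive-HBase integration jars must be on the classpath:

```sql
CREATE TABLE hbase_users (
  key  STRING,
  name STRING,
  age  INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:age')
TBLPROPERTIES ('hbase.table.name' = 'users');

-- Once mapped, the HBase table can be queried with ordinary HiveQL.
SELECT name, age FROM hbase_users WHERE age > 30;
```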
A data warehouse tool that provides table abstractions on top of data resident in HDFS, HBase tables, and other stores. The Hive Query Language is a dialect of the Structured Query Language.
Hive’s own dialect of the Structured Query Language (SQL). Abbreviated HiveQL or HQL.
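A representative HiveQL query over a hypothetical orders table, showing that familiar SQL constructs such as GROUP BY, HAVING, and aggregate functions work as expected:

```sql
SELECT customer_id, count(*) AS order_count, sum(total) AS revenue
FROM orders
GROUP BY customer_id
HAVING count(*) > 10;
```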
The input format determines how input streams, usually from files, are split into records. A SerDe handles parsing each record into columns. A custom input format can be specified when creating a table using the INPUTFORMAT clause. The input format for the default STORED AS TEXTFILE specification is implemented by the Java object named org.apache.hadoop.mapreduce.lib.input.TextInputFormat. See also Output Format.
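A sketch of specifying input and output formats explicitly at table-creation time. The exact class names vary by Hive and Hadoop version, so treat the ones below as illustrative:

```sql
CREATE TABLE raw_events (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```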
The Java Database Connectivity (JDBC) API provides access to SQL systems, including Hive, from Java code.
In the Hadoop context, a job is a self-contained workflow submitted to MapReduce. It encompasses all the work required to perform a complete calculation, from reading input to generating output. The MapReduce JobTracker will decompose the job into one or more tasks for distribution and execution around the cluster.
The top-level controller of all jobs using Hadoop’s MapReduce. The JobTracker accepts job submissions, determines what tasks to run and where to run them, monitors their execution, restarts failed tasks as needed, and provides a web console for monitoring job and task execution, viewing logs, etc.
A term used in Amazon Elastic MapReduce (EMR) for the sequence of jobs executed on a temporary EMR cluster to accomplish a particular goal.
JSON (JavaScript Object Notation) is a lightweight data serialization format used commonly in web-based applications.
The mapping phase of a MapReduce process, where an input set of key-value pairs is converted into a new set of key-value pairs. For each input key-value pair, there can be zero to many output key-value pairs. The input and output keys and the input and output values can be completely different.
A commercial distribution of Hadoop that replaces HDFS with the MapR File System (MapR-FS), a high-performance, distributed file system.
A computation paradigm invented at Google and based loosely on the common data operations of mapping a collection of data elements from one form to another (the map phase) and reducing a collection to a single value or a smaller collection (the reduce phase). MapReduce is designed to scale computation horizontally by decomposing map and reduce steps into tasks and distributing those tasks across a cluster. The MapReduce runtime provided by Hadoop handles decomposition of a job into tasks, distribution around the cluster, movement of a particular task to the machine that holds the data for the task, movement of data to tasks (as needed), and automated re-execution of failed tasks and other error recovery and logging services.
The service that maintains “metadata” information, such as table schemas. Hive requires this service to be running. By default, it uses a built-in Derby SQL server, which provides limited, single-process SQL support. Production systems must use a full-service relational database, such as MySQL.
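A sketch of the hive-site.xml properties that point the metastore at a MySQL database instead of the default Derby store; the host, database name, and driver shown are hypothetical placeholders:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db.example.com/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```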
An umbrella term for data stores that don’t support the relational model for data management, dialects of the structured query language, and features like transactional updates. These data stores trade off these features in order to provide more cost-effective scalability, higher availability, etc.
The Open Database Connectivity (ODBC) API provides access to SQL systems, including Hive, from other applications. Java applications typically use the JDBC API instead.
The output format determines how records are written to output streams, usually to files. A SerDe handles serialization of each record into an appropriate byte stream. A custom output format can be specified when creating a table using the OUTPUTFORMAT clause. The output format for the default STORED AS TEXTFILE specification is implemented by the Java object named org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat. See also Input Format.
A subset of a table’s data set where one column has the same value for all records in the subset. In Hive, as in most databases that support partitioning, each partition is stored in a physically separate location—in Hive’s case, in a subdirectory of the root directory for the table. Partitions have several advantages. The column value corresponding to a partition doesn’t have to be repeated in every record in the partition, saving space, and queries with WHERE clauses that restrict the result set to specific values for the partition columns can perform more quickly, because they avoid scanning the directories of nonmatching partition values. See also dynamic partitions.
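A sketch of a partitioned table and a partition-pruned query; the table and column names are hypothetical:

```sql
-- Each distinct value of dt becomes a subdirectory under the table's
-- root directory; dt is not stored in the data files themselves.
CREATE TABLE weblogs (
  ip  STRING,
  url STRING
)
PARTITIONED BY (dt STRING);

-- This WHERE clause restricts the scan to a single partition directory.
SELECT url FROM weblogs WHERE dt = '2012-01-15';
```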
The reduction phase of a MapReduce process where the key-value pairs from the map phase are processed. A crucial feature of MapReduce is that all the key-value pairs from all the map tasks that have the same key will be sent together to the same reduce task, so that the collection of values can be “reduced” as appropriate. For example, a collection of integers might be added or averaged together, a collection of strings might have all duplicates removed, etc.
The most common model for database management systems, it is based on a logical model of data organization and manipulation. A declarative specification of the data structure and how it should be manipulated is supplied by the user, most typically using the Structured Query Language. The implementation translates these declarations into procedures for storing, retrieving, and manipulating the data.
The distributed file system for Amazon Web Services. It can be used with or instead of HDFS when running MapReduce jobs.
The Serializer/Deserializer, or SerDe for short, is used to parse the bytes of a record into columns or fields (the deserialization process). It is also used to create those record bytes (i.e., serialization). In contrast, the Input Format is used to split an input stream into records and the Output Format is used to write records to an output stream. A SerDe can be specified when a Hive table is created. The default SerDe supports the field and collection separators discussed in Text File Encoding of Data Values, as well as various optimizations, such as lazy parsing.
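A sketch of specifying a SerDe explicitly at table-creation time, using the RegexSerDe shipped in the hive-contrib jar, which parses each record with a regular expression; the table name and pattern are hypothetical:

```sql
CREATE TABLE apache_log (
  host    STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Each capture group maps to one column, in order.
  'input.regex' = '(\\S+) (.+)'
);
```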
A language that implements the relational model for querying and manipulating data. Abbreviated SQL. While there is an ANSI standard for SQL that has undergone periodic revisions, all SQL dialects in widespread use add their own custom extensions and variations.
In the MapReduce context, a task is the smallest unit of work performed on a single cluster node, as part of an overall job. By default each task involves a separate JVM process. Each map and reduce invocation will have its own task.
An RPC system invented by Facebook and integrated into Hive. Remote processes can send Hive statements to Hive through Thrift.
User-defined functions that take multiple rows (or columns from multiple rows) and return a single “aggregation” of the data, such as a count of the rows, a sum or average of number values, etc. The term is abbreviated UDAF. See also user-defined functions and user-defined table generating functions.
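Built-in aggregate functions illustrate the UDAF pattern of collapsing many rows to one value per group; the employees table here is hypothetical:

```sql
SELECT department,
       count(*)    AS headcount,
       avg(salary) AS avg_salary
FROM employees
GROUP BY department;
```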
Functions implemented by users of Hive to extend its behavior. Sometimes the term is used generically to include built-in functions, and sometimes it is used for the specific case of functions that work on a single row (or columns in a row) and return a single value (i.e., that don’t change the number of records). Abbreviated UDF. See also user-defined aggregate functions and user-defined table generating functions.
User-defined functions that take a column from a single record and expand it into multiple rows. Examples include the explode function, which converts an array into rows of single fields and, for Hive v0.8.0 and later, converts a map into rows of key and value fields. Abbreviated UDTF. See also user-defined functions and user-defined aggregate functions.
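A sketch of the explode UDTF in use; the articles table and its array-valued tags column are hypothetical:

```sql
-- Each element of the tags array becomes its own output row.
SELECT explode(tags) AS tag
FROM articles;

-- With LATERAL VIEW, other columns can be selected alongside the
-- generated rows.
SELECT title, tag
FROM articles LATERAL VIEW explode(tags) t AS tag;
```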