Chapter 17 – Hadoop EXPLAIN

“Lucy, you got some ‘splainin’ to do!”

- Ricky Ricardo

There are Many Options to See an EXPLAIN Plan

Syntax: EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

image

To get an EXPLAIN, you can either type the word EXPLAIN in front of any Hadoop SQL or click on the magnifying glass of Nexus. Hive provides an EXPLAIN command designed to show the user the execution plan for a query being explained. The EXPLAIN plan is designed to show the user exactly how Hive translates queries into MapReduce jobs.

Explain Output has Three Parts

An Abstract Syntax Tree is really the internal
source code that is represented as a tree of nodes.

image

The Explain has three parts, including the Abstract Syntax Tree, the plan dependencies and a description. The description of the stages shows the sequence of operations (operators) and their metadata, such as filter expressions, select expressions or output file names. Hive takes SQL and converts it into map reduce jobs so understanding the Explain can be valuable. Each Hive query gets converted into a sequence called a "Directed Acyclic Graph" of stages. These stages will either be a map/reduce stage or metastore/file system operations.

EXPLAIN EXTENDED and the Abstract Syntax Tree

image

The abstract syntax tree shows how Hive parsed the query into tokens and
literals, as part of the first step in building the query into the result set.

The Explain has three parts, including the Abstract Syntax Tree, the plan dependencies and a description. An Abstract Syntax Tree is really the internal source code that is represented as a tree of nodes. Each node represents constants or variables that are presented as a leaf with operators or statements that are the inner node portions. This is also called a "parse tree". Above, we have the nodes representing the Query, Insert and Select portion of the plan.

EXPLAIN EXTENDED Stage Plans and Stage Dependencies

image

Use the EXPLAIN EXTENDED in front of your SQL to get the Stage Dependencies and Stage Plans. The stage dependencies show the # of stages and the dependency relationship among stages. A stage could be a MapReduce job, a sampling stage, a merge stage, or a limit stage that Hive needs to do. Hive generally executes these stages one at a time, unless it can use parallel execution for some stages. For most queries, there is just one stage. The stage plan provides important information on how Hive is running the job, but it is verbose and complex. Understanding the how Hive parses and plans every query is not always useful, however it can help when analyzing complex or poorly performing queries.

EXPLAIN DEPENDENCY Keywords in an Explain

EXPLAIN DEPENDENCY

SELECT * FROM

Employee_Table

{"input_partitions":[],"input_tables":[{"tablename":"sql_class@employ

ee_table","tabletype":"MANAGED_TABLE"}]}

The dependencies above show that the table will be the Employee_Table and it is in the database SQL_Class (sometimes referred to as the schema) and that it is a managed table (not an External Table).

The use of DEPENDENCY in the EXPLAIN statement produces extra information concerning each of the inputs in the plan. It shows all the varying attributes for the inputs. The inputs contain all of the tables and any partitions that are being utilized. The dependencies will even show the parents in case a table is accessed via a view.

EXPLAIN AUTHORIZATION Keywords in an Explain

EXPLAIN AUTHORIZATION

SELECT * FROM

Employee_Table

image

The authorization phase shows the user, the operation, the input tables and the outputs that concern the authorization mechanisms that have been put in place by Hive.

Using a WHERE Clause Explains a Predicate

CREATE TABLE Employee_table

(Employee_No int

,Dept_No smallint

,Last_name string

,First_name string

,Salary decimal(8,2) ) ;

EXPLAIN SELECT * FROM Employee_Table

WHERE Employee_No = 2000000 ;

Stage-0

Fetch Operator

limit:-1

Select Operator [SEL_4528]

outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]

Filter Operator [FIL_4530]

predicate:(employee_no = 2000000 ) (type: boolean)

TableScan [TS_4526]

alias:employee_table

The predicate means the value in the WHERE clause.

EXPLAIN with an ORDER BY Statement

EXPLAIN SELECT * FROM Department_Table

ORDER BY Department_Name ;

Vertex dependency in root stage

Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0Fetch Operator

limit:-1

Stage-1 Reducer 2

File Output Operator [FS_4558] compressed:false

Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE

table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input

format:":"org.apache.hadoop.mapred.TextInputFormat","output

format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}

Select Operator [SEL_4557] outputColumnNames:["_col0","_col1","_col2","_col3"]

| Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE

|<-Map 1 [SIMPLE_EDGE]

Reduce Output Operator [RS_4556] key expressions:_col1 (type: string) sort order:+

Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE

value expressions:_col0 (type: smallint), _col2 (type: int), _col3 (type: string)

Select Operator [SEL_4555] outputColumnNames:["_col0","_col1","_col2","_col3"]

Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE

TableScan [TS_4554] alias:department_table

Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE

Notice now we are using 2 reducers. The table is a serde and the sort order is asc.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset