“Lucy, you got some ‘splainin’ to do!”
- Ricky Ricardo
Syntax: EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
To get an EXPLAIN, you can either type the word EXPLAIN in front of any Hadoop SQL or click on the magnifying glass of Nexus. Hive provides an EXPLAIN command designed to show the user the execution plan for a query being explained. The EXPLAIN plan is designed to show the user exactly how Hive translates queries into MapReduce jobs.
An Abstract Syntax Tree is really the internal
source code that is represented as a tree of nodes.
The Explain has three parts, including the Abstract Syntax Tree, the plan dependencies and a description. The description of the stages shows the sequence of operations (operators) and their metadata, such as filter expressions, select expressions or output file names. Hive takes SQL and converts it into map reduce jobs so understanding the Explain can be valuable. Each Hive query gets converted into a sequence called a "Directed Acyclic Graph" of stages. These stages will either be a map/reduce stage or metastore/file system operations.
The abstract syntax tree shows how Hive parsed the query into tokens and
literals, as part of the first step in building the query into the result set.
The Explain has three parts, including the Abstract Syntax Tree, the plan dependencies and a description. An Abstract Syntax Tree is really the internal source code that is represented as a tree of nodes. Each node represents constants or variables that are presented as a leaf with operators or statements that are the inner node portions. This is also called a "parse tree". Above, we have the nodes representing the Query, Insert and Select portion of the plan.
Use the EXPLAIN EXTENDED in front of your SQL to get the Stage Dependencies and Stage Plans. The stage dependencies show the # of stages and the dependency relationship among stages. A stage could be a MapReduce job, a sampling stage, a merge stage, or a limit stage that Hive needs to do. Hive generally executes these stages one at a time, unless it can use parallel execution for some stages. For most queries, there is just one stage. The stage plan provides important information on how Hive is running the job, but it is verbose and complex. Understanding the how Hive parses and plans every query is not always useful, however it can help when analyzing complex or poorly performing queries.
EXPLAIN DEPENDENCY
SELECT * FROM
Employee_Table
{"input_partitions":[],"input_tables":[{"tablename":"sql_class@employ
ee_table","tabletype":"MANAGED_TABLE"}]}
The dependencies above show that the table will be the Employee_Table and it is in the database SQL_Class (sometimes referred to as the schema) and that it is a managed table (not an External Table).
The use of DEPENDENCY in the EXPLAIN statement produces extra information concerning each of the inputs in the plan. It shows all the varying attributes for the inputs. The inputs contain all of the tables and any partitions that are being utilized. The dependencies will even show the parents in case a table is accessed via a view.
EXPLAIN AUTHORIZATION
SELECT * FROM
Employee_Table
The authorization phase shows the user, the operation, the input tables and the outputs that concern the authorization mechanisms that have been put in place by Hive.
CREATE TABLE Employee_table
(Employee_No int
,Dept_No smallint
,Last_name string
,First_name string
,Salary decimal(8,2) ) ;
EXPLAIN SELECT * FROM Employee_Table
WHERE Employee_No = 2000000 ;
Stage-0
Fetch Operator
limit:-1
Select Operator [SEL_4528]
outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]
Filter Operator [FIL_4530]
predicate:(employee_no = 2000000 ) (type: boolean)
TableScan [TS_4526]
alias:employee_table
The predicate means the value in the WHERE clause.
EXPLAIN SELECT * FROM Department_Table
ORDER BY Department_Name ;
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0Fetch Operator
limit:-1
Stage-1 Reducer 2
File Output Operator [FS_4558] compressed:false
Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE
table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}
Select Operator [SEL_4557] outputColumnNames:["_col0","_col1","_col2","_col3"]
| Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_4556] key expressions:_col1 (type: string) sort order:+
Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE
value expressions:_col0 (type: smallint), _col2 (type: int), _col3 (type: string)
Select Operator [SEL_4555] outputColumnNames:["_col0","_col1","_col2","_col3"]
Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE
TableScan [TS_4554] alias:department_table
Statistics:Num rows: 1 Data size: 185 Basic stats: COMPLETE Column stats: NONE
Notice now we are using 2 reducers. The table is a serde and the sort order is asc.