Hive Execution Flow and UDFs

The Execution Flow of Hive SQL

SQL on Hadoop ⇒ Cluster
SQL ⇒ Parser ⇒ AST (abstract syntax tree) ⇒ Analyzer ⇒ QB ⇒ Logical Plan ⇒ Operator Tree ⇒ Logical Optimizer ⇒ Operator Tree ⇒ Physical Plan ⇒ Task Tree ⇒ Physical Optimizer ⇒ Task Tree

First comes the parser: the SQL text is parsed into an abstract syntax tree (AST), which Hive renders as a string, for example:

(TOK_QUERY (TOK_FROM (TOK_TABREF src))  # source table
 (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) # destination table
 (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key))  # a selected column
 (TOK_SELEXPR (TOK_FUNCTION sum  # aggregate function
 (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) # nested function call
 (TOK_GROUPBY (TOK_COLREF src key))))

The compiler then walks the AST and extracts the basic building block of a query, the QueryBlock. Walking the QueryBlocks produces the logical plan, a tree of operators. The logical optimizer rewrites that Operator Tree (its output is still an Operator Tree). From the optimized tree the physical plan is generated, i.e. a tree of MapReduce tasks; the physical optimizer rewrites the Task Tree, and the result is finally submitted for execution.
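Each stage of this pipeline can be inspected with Hive's EXPLAIN statement; a minimal sketch, reusing the src table from the AST above:

EXPLAIN
SELECT key, sum(substr(value, 4))
FROM src
GROUP BY key;

The output lists the stage dependencies (e.g. a map-reduce stage followed by a fetch stage) and the operator tree inside each stage.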
For a more detailed walkthrough, see this article:

https://www.cnblogs.com/nashiyue/p/5751102.html


Almost every statement ultimately boils down to one of these two SQL shapes:

select yyy, aggregate_function(...) from xxx group by yyy;
select a.*, b.* from a join b on a.id=b.id; 
There is also distinct:
select dealid, count(distinct uid) num from order group by dealid;
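Note that count(distinct) funnels every uid of a given dealid through a single reducer. When the data is skewed, a common rewrite deduplicates first and then counts; a sketch, reusing the same order table:

SELECT dealid, count(1) AS num
FROM (SELECT DISTINCT dealid, uid FROM order) t
GROUP BY dealid;

The inner query shuffles on the pair (dealid, uid), so the deduplication work is spread across reducers instead of piling up on a single key.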

A query like this one:

select a.id, a.city, a.cate from access a 
where a.day='20190414' and a.cate='大奔'   
involves no shuffle at all; it is pure filtering (ETL-style data filtering) and can run as a map-only job.

Group by:

select city, count(1) cnt from access  
where day='20190414' and cate='大奔'
group by city;

The GROUP BY columns are combined into the map output key. MapReduce's sorting then brings identical keys together, and in the reduce phase the LastKey is tracked to tell different keys apart.
MapReduce word count:

  1. map: split ⇒ (word, 1)
  2. shuffle: (word, 1) partitioner ⇒ reduce
  3. reduce: (word, iterable of (1, 1, 1, 1, …))
    ⇒ (word, sum(iterable))
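The group-by query above compiles to exactly these three steps, with city playing the role of word. For comparison, a word count written directly in HiveQL; a sketch assuming a hypothetical table lines with a single string column line:

SELECT word, count(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM lines) w
GROUP BY word;

(explode, used here, is a UDTF; see the next section.)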

UDFs (User-Defined Functions)

UDF: one-to-one row mapping, e.g. upper, substr
UDAF: aggregation, many-to-one row mapping, e.g. sum, min
UDTF: table-generating, one-to-many row mapping, e.g. lateral view explode()
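For instance, explode turns one row into many and is typically paired with a lateral view; a minimal sketch, assuming a hypothetical table t with an array column arr:

SELECT id, x
FROM t LATERAL VIEW explode(arr) tmp AS x;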
Defining a custom UDF:

package com.ruozedata.hadoop.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class HelloUDF extends UDF {
    public Text evaluate(Text input) {  // null-safe per-row mapping
        return input == null ? null : new Text("Hello: " + input);
    }
}
Just extend UDF and define an evaluate method; Hive locates evaluate by reflection, so it can also be overloaded with different parameter types. Note the class name must match the one used when registering the function below.

Package the finished UDF into a jar and upload it to the Linux server.

Using the UDF temporarily in Hive:

ADD JAR /home/hadoop/lib/g6-hadoop-1.0.jar;
CREATE TEMPORARY FUNCTION sayHello AS 'com.ruozedata.hadoop.udf.HelloUDF';
This is only valid for the current session.

Registering the UDF in the metastore (persistent across sessions):

CREATE FUNCTION sayhello2 AS 'com.ruozedata.hadoop.udf.HelloUDF'
USING JAR 'hdfs://hadoop000:8020/lib/g6-hadoop-1.0.jar';
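
After registration the function can be called like a built-in, from any session (expected output shown as a comment):

SELECT sayhello2('world');
-- Hello: world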
