SQL ON HADOOP ⇒ Cluster
SQL ⇒ Parser ⇒ AST (abstract syntax tree) ⇒ Analyzer ⇒ QB ⇒ Logical Plan (Operator Tree) ⇒ Logical Optimizer ⇒ Operator Tree ⇒ Physical Plan (Task Tree) ⇒ Physical Optimizer ⇒ Task Tree
First comes the parser: the SQL text is parsed into an abstract syntax tree, which Hive can print as a string, for example:
(TOK_QUERY (TOK_FROM (TOK_TABREF src)) # source table
(TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) # destination table
(TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) # a selected column
(TOK_SELEXPR (TOK_FUNCTION sum # aggregate function
(TOK_FUNCTION substr (TOK_COLREF src value) 4)))) # nested substr(src.value, 4)
(TOK_GROUPBY (TOK_COLREF src key))))
The compiler walks the AST and abstracts out the basic units of the query, the QueryBlocks. Walking the QueryBlocks in turn produces the logical execution plan: a tree of operators. The logical optimizer then rewrites this Operator Tree (its output is still an Operator Tree), after which the physical execution plan, i.e. the MapReduce tasks, is generated. Physical-layer optimization runs over the task tree, and finally the tasks are submitted for execution.
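To see what the compiler produces for a concrete statement, Hive can print the plan with EXPLAIN (EXPLAIN EXTENDED adds more detail). A minimal sketch, using the access table from the examples below:

EXPLAIN
select city, count(1) cnt from access
where day='20190414'
group by city;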
For a more detailed walkthrough, see this article:
https://www.cnblogs.com/nashiyue/p/5751102.html
In the end, almost every statement reduces to one of these two SQL shapes:
select yyy, aggregate_function(...) from xxx group by yyy;
select a.*, b.* from a join b on a.id=b.id;
There is also distinct:
select dealid, count(distinct uid) num from order group by dealid;
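count(distinct uid) forces every uid of a given dealid through the same reduce group, which can skew badly on hot deals. A common rewrite, sketched against the same order table, de-duplicates with an inner group by first so the shuffle is keyed on (dealid, uid) instead:

select dealid, count(1) num
from (
  select dealid, uid from order
  group by dealid, uid
) t
group by dealid;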
A SQL statement like this one:
select a.id, a.city, a.cate from access a
where a.day='20190414' and a.cate='大奔'
involves no shuffle; it is nothing but filtering (a map-only ETL job that filters data).
group by:
select city, count(1) cnt from access
where day='20190414' and cate='大奔'
group by city ;
Hive combines the GROUP BY columns into the map output key, relies on MapReduce's shuffle sort, and in the reduce phase keeps the last key seen (LastKey) to tell one group from the next.
At heart this is just the classic MapReduce count pattern; see the annotated query below.
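As a sketch, here is the same query again, annotated with the MapReduce phase each clause maps to:

-- map:     read rows of access, apply the where filter, emit (city, 1)
-- shuffle: partition and sort the (city, 1) pairs by city
-- reduce:  for each city, sum the 1s and output (city, cnt)
select city, count(1) cnt from access
where day='20190414' and cate='大奔'
group by city;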
UDF: one-to-one row mapping, e.g. upper, substr
UDAF: aggregation, many-to-one row mapping, e.g. sum, min
UDTF: table-generating, one-to-many row mapping, e.g. lateral view explode()
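For instance, a UDTF like explode turns one row into many. A minimal sketch, assuming a hypothetical user_info table with a comma-separated hobbies column:

select id, hobby
from user_info
lateral view explode(split(hobbies, ',')) t as hobby;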
Writing a custom UDF:
package com.ruozedata.hadoop.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class HelloUDF extends UDF {
    // one-to-one mapping: prefix the input with "Hello: "
    public Text evaluate(Text input) {
        return input == null ? null : new Text("Hello: " + input);
    }
}
Extend UDF and override the evaluate method; that is all.
Package the finished UDF into a jar and upload it to the Linux server.
To use the UDF temporarily in Hive:
ADD JAR /home/hadoop/lib/g6-hadoop-1.0.jar;
CREATE TEMPORARY FUNCTION sayHello AS 'com.ruozedata.hadoop.udf.HelloUDF';
This registration is only valid for the current session.
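Once registered, it can be called like any built-in function; a quick check:

select sayHello('hive');
-- returns: Hello: hive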
To register the UDF permanently in the metastore:
CREATE FUNCTION sayhello2 AS 'com.ruozedata.hadoop.udf.HelloUDF'
USING JAR 'hdfs://hadoop000:8020/lib/g6-hadoop-1.0.jar';
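Because the function name and jar location are stored in the metastore, this registration survives the session; a quick check, in a fresh session:

-- no ADD JAR needed; Hive fetches the jar from HDFS
select sayhello2('hive');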