**
**
一个稍微复杂的Hive SQL,在执行过程中发现某个Task非常慢,怎么去定位这个Task是属于哪段SQL逻辑呢
对于如下逻辑的一个SQL,在Spark引擎和Tez引擎中执行DAG的比较。在两个引擎下,执行慢的逻辑是一样的,分别看两种引擎的执行情况。
WITH tbl_a AS
(
SELECT
a,
b,
name,
SUM(pv) AS pv
FROM (
SELECT a,b,name
, count(1) AS pv
FROM dataware.a
WHERE dt = '{@date}'
GROUP BY a,b,name
UNION ALL
SELECT a,b,name
, count(1) AS pv
FROM dateware.b
WHERE dt = '{@date}'
GROUP BY a,b,name
UNION ALL
SELECT a,b,name
, count(1) AS pv
FROM dateware.c
WHERE dt = '{@date}'
GROUP BY a,b,name
UNION ALL
SELECT a,b,name
, count(1) AS pv
FROM dataware.d
WHERE dt = '{@date}'
GROUP BY a,b,name
) a
GROUP BY a,
b,
name
),
tbl_b AS (
SELECT *
FROM (
SELECT
a,
,b
,row_number() OVER (PARTITION BY app ORDER BY min_time DESC) AS row_num
FROM dataware.demo_a
WHERE dt = '{@date}'
) tmp
WHERE row_num = 1
),
tbl_c AS (
SELECT
a,
b,
user_id
FROM dataware.demo_b
WHERE dt = '{@date}'
),
tbl_d AS (
SELECT user_id
, identity as role
FROM dataware.demo_c
WHERE dt = '{@date}'
)
INSERT OVERWRITE TABLE dataware.result PARTITION ( dt = '{@date}')
SELECT a,
b,
name,
other_fields,
, SUM(pv) AS pv
FROM tbl_a
LEFT JOIN tbl_b ON tbl_a.a = tbl_b.a AND tbl_a.b = tbl_b.b
LEFT JOIN tbl_c ON tbl_a.a = tbl_c.a AND tbl_a.b = tbl_c.b
LEFT JOIN tbl_d ON tbl_c.user_id = tbl_d.user_id
GROUP BY a,
b,
name,
other_fields
Spark引擎当中,从执行结果上看。执行慢的Stage分别是Stage1和Stage8。刚入门要想看出来Stage1和Stage8分别对应哪部分SQL逻辑有一定困难。
在Spark引擎中,我总结可能会有如下办法找到对应SQL逻辑:
1、根据经验将SQL分开。例子中union all的逻辑很好对应。
Stage1是读取tbl_b表阶段
2、也可以单独执行某个SQL逻辑段,看看【单独执行SQL逻辑段的STAGE】与【整体Stage哪个部分】相符。
Stage7对应读取dataware.demo_a表逻辑
3、如果是个读取文件的过程,一般Task的并行度与文件块个数是对应的。所以查看这个表的block块,确定Stage对应SQL逻辑段
Stage5对应是读取dataware.demo_b表逻辑
4、参考Tez引擎
Tez引擎执行情况:
Tez引擎执行情况:下图按Vertices的开始时间排序。看出Map8和Reduce9阶段耗时长。
在 Graphical View中可以很容易看到Map8和Reduce9对应的读取tbl_b的阶段