An Optimization Analysis Prompted by One Hive SQL Statement


1.      Let's start with this SQL (I have left out all the fields of the select in the middle to make it easier to read)

insert overwrite table dw_user_activity partition(pt='$env.date')
from s_user_activity a
where a.pt  ='$env.date'
order by a.deviceid,a.userid,a.time;

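Note the order by at the end. order by means a global sort, and a global sort means there is exactly one reducer, because only a single reducer can guarantee that the whole output is ordered (take a moment to think that through).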
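To make the point about a single reducer concrete, here is a small Python sketch. It is a toy model and not how Hive is implemented: the function names are made up for illustration, each "reducer" is just a list, and the partitioner (value mod N) is an assumption standing in for Hive's hash.

```python
# Toy model, not Hive internals: each "reducer" is a plain Python list.

def run_reducers(rows, num_reducers):
    """Partition rows across reducers, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for value in rows:
        # Assumed partitioner: value mod N (a stand-in for Hive's hash).
        reducers[value % num_reducers].append(value)
    return [sorted(part) for part in reducers]

def concatenate(outputs):
    """Read the reducer output files back to back, as a downstream reader would."""
    return [value for part in outputs for value in part]

rows = [7, 2, 9, 4, 1, 8]
three = concatenate(run_reducers(rows, 3))
one = concatenate(run_reducers(rows, 1))
print(three)  # [9, 1, 4, 7, 2, 8]  each part is sorted, the whole is not
print(one)    # [1, 2, 4, 7, 8, 9]  one reducer sees everything, so the result is globally sorted
```

With three reducers every output file is sorted on its own, but reading them one after another does not give a sorted result. Only the single-reducer run does, which is the price order by pays.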


set mapreduce.job.reduces=300;






insert overwrite table dw_user_activity partition(pt='$env.date')
from s_user_activity a
where a.pt  ='$env.date'
distribute by a.deviceid,a.userid sort by a.deviceid,a.userid,a.time;
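The rewritten statement can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the sample rows are invented, distribute_and_sort is a hypothetical helper, and CRC32 stands in for whatever hash Hive actually applies to the distribute by columns.

```python
import zlib

def distribute_and_sort(rows, num_reducers):
    """Model `distribute by deviceid, userid sort by deviceid, userid, time`."""
    reducers = [[] for _ in range(num_reducers)]
    for deviceid, userid, time in rows:
        # Assumed partitioner: CRC32 of the distribute-by key; Hive uses its own hash.
        key = f"{deviceid}|{userid}".encode()
        reducers[zlib.crc32(key) % num_reducers].append((deviceid, userid, time))
    # Each reducer sorts only its own rows, so the reducers can work in parallel.
    return [sorted(part) for part in reducers]

# Invented sample rows: (deviceid, userid, time)
rows = [
    ("d2", "u1", 30), ("d1", "u1", 20), ("d2", "u1", 10),
    ("d1", "u2", 15), ("d1", "u1", 5),
]
outputs = distribute_and_sort(rows, 3)
for index, part in enumerate(outputs):
    print(index, part)
```

Whichever reducer a (deviceid, userid) pair lands on, all of its rows land there together and come out ordered by time. The output is not globally sorted across reducers, but for per-user activity that is usually all that is needed.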



2.      Now let's look at the differences between order by, distribute by, sort by, and cluster by (first, an excerpt from an official explanation)

    ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
    SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
    DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
    CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.


Distribute by: shuffles rows to different nodes by hashing the key, which guarantees that rows with the same key value always end up on the same node.

Sort by: sorts the rows on each node. With only sort by and no distribute by, rows with the same key value may land on different nodes, so the sort is of little use.

Cluster by = distribute by + sort by (on the same columns)
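The difference between sort by alone and cluster by can be sketched in a few lines of Python. This is again a toy model: round-robin delivery stands in for "rows reach reducers without regard to the key", key mod N stands in for the hash, and both function names are hypothetical.

```python
def sort_by_only(keys, num_reducers):
    """`sort by` alone: rows reach reducers with no regard to the key (round-robin here)."""
    reducers = [[] for _ in range(num_reducers)]
    for position, key in enumerate(keys):
        reducers[position % num_reducers].append(key)
    return [sorted(part) for part in reducers]

def cluster_by(keys, num_reducers):
    """`cluster by`: distribute by the key (key mod N here), then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for key in keys:
        reducers[key % num_reducers].append(key)
    return [sorted(part) for part in reducers]

keys = [3, 1, 3, 2, 1, 3]
print(sort_by_only(keys, 2))  # [[1, 3, 3], [1, 2, 3]]  keys 1 and 3 show up in both files
print(cluster_by(keys, 2))    # [[2], [1, 1, 3, 3, 3]]  every key lives in exactly one file
```

In the first result each file is sorted, yet the same key is split across files, which is why the sort is of little use. In the second, all rows for a key sit next to each other in a single file.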

3.  What about group by?

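Group by is first of all an aggregation, so it is normally paired with aggregate functions such as sum or count. Under the hood it implicitly distributes the rows by the group by key first, and then aggregates.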

If an order by follows as well, there will usually be a later stage that launches another job for the order by, so that a single reducer performs the global sort.
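The two stages described above can be sketched as follows. This is a simplified model rather than Hive's actual execution plan: the rows are invented (key, amount) pairs, sum is the aggregate, key mod N is an assumed partitioner, and group_by_stage and order_by_stage are hypothetical names.

```python
from collections import defaultdict

def group_by_stage(rows, num_reducers):
    """Stage 1: distribute by the group-by key, then sum per key inside each reducer."""
    reducers = [defaultdict(int) for _ in range(num_reducers)]
    for key, amount in rows:
        # Assumed partitioner: key mod N (a stand-in for Hive's hash).
        reducers[key % num_reducers][key] += amount
    return [list(totals.items()) for totals in reducers]

def order_by_stage(partial_outputs):
    """Stage 2: a single reducer reads every stage-1 output and sorts it globally."""
    merged = [pair for part in partial_outputs for pair in part]
    return sorted(merged)

rows = [(4, 10), (1, 5), (4, 2), (3, 7), (1, 1)]
stage1 = group_by_stage(rows, 2)
result = order_by_stage(stage1)
print(stage1)  # [[(4, 12)], [(1, 6), (3, 7)]]
print(result)  # [(1, 6), (3, 7), (4, 12)]
```

The aggregation runs in parallel because each key is owned by one reducer. The final sort still goes through one reducer, but by then it only has to handle one row per group instead of the raw data.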


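4.      Finally done writing, and that was tiring. I'm not sure whether I explained it clearly. Thanks for reading all the way to the end :)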
