Hive Experiment 3: order by vs. sort by in HQL

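1. Overview

[sort by] is a clause specific to HQL; Hive also supports the [order by] familiar from RDBMSs.
[sort by] sorts locally (within each reduce task), while [order by] sorts globally.

P.S. This comes back to the nature of MapReduce, which is divide-and-conquer parallelism. If the data is processed by several subtasks, ordering is only guaranteed within each task's output; no order is guaranteed across tasks in the combined result.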
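
The difference can be illustrated outside Hive with a small Python sketch. It is only a toy model under stated assumptions: `local_sort` is a hypothetical helper, rows are dealt to tasks round-robin rather than by Hive's real partitioner, and the values are made up.

```python
def local_sort(rows, num_tasks):
    """Toy model of sort by: split rows across tasks, sort each task's share."""
    # Assumption: round-robin assignment stands in for the real partitioner.
    buckets = [rows[i::num_tasks] for i in range(num_tasks)]
    return [sorted(bucket) for bucket in buckets]

rows = [5, 3, 9, 1, 7, 2]

# One task: the local sort is also the global sort (what order by guarantees).
print(local_sort(rows, 1))   # [[1, 2, 3, 5, 7, 9]]

# Three tasks: each segment is sorted, but the concatenation is not.
segments = local_sort(rows, 3)
print(segments)              # [[1, 5], [3, 7], [2, 9]]
```

Reading the three segments one after another gives 1, 5, 3, 7, 2, 9, which is sorted inside each segment but not overall.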

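1.1 Corresponding MapReduce solutions

Problem: how can the overall result be kept in order?
Solutions:
1) Use a single reduce task (this gives up parallelism, so performance is poor).
2) Customize the partitioner on the map side so that partitions follow a defined order. If the partitions are ordered and each partition is sorted internally, the whole result is ordered. Partitioning by month is one example. (This requires writing a partitioner, and the data is not guaranteed to be split evenly, so task loads differ.)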
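
Solution 2 can be sketched in Python as a range partitioner. This is a simplified illustration, not MapReduce code: `range_partition_sort` and the split points in `boundaries` are hypothetical, and month numbers stand in for the real keys.

```python
from bisect import bisect_right

def range_partition_sort(rows, boundaries):
    """Route each row to an ordered range partition, then sort each partition."""
    # boundaries are hypothetical split points; partition i holds values
    # greater than or equal to boundaries[i-1] and less than boundaries[i].
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect_right(boundaries, row)].append(row)
    return [sorted(p) for p in partitions]

# Month numbers as keys, split into Jan-Apr, May-Aug, Sep-Dec.
rows = [11, 2, 7, 5, 12, 1, 8, 3]
parts = range_partition_sort(rows, boundaries=[5, 9])
print(parts)  # [[1, 2, 3], [5, 7, 8], [11, 12]]
```

Because the partitions themselves are ordered, concatenating them yields a globally sorted result. The partition sizes (3, 3, 2) also show why the load per task depends on how well the split points match the data.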

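2. Comparison experiments

2.1 Building the test table

The table has two fields, MOBILENAME (mname below) and TESTRECORDID (tid below, an ID that begins with a timestamp). Note that the tid range of P700000 overlaps that of P800000 (the last row, marked with <--).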

	"MOBILENAME"    "TESTRECORDID"
	"MOBILENAME"    "TESTRECORDID"
	"P700000"       "20150130204711874002901408624669XXXX1"
	"P700000"       "20150130203851342002901408624846XXXX1"
	"P700000"       "20150130203726033002901408624247XXXX1"
	"P700000"       "20150130203618099002901408624547XXXX1"
	"P700000"       "20150130172902677003014654832192XXXX1"
	"P800000"       "20150131205316663000040229744636XXXX1"
	"P800000"       "20150131204500648000040229744318XXXX1"
	"P800000"       "20150131200923817000040229744422XXXX1"
	"P800000"       "20150131173954764003014654832951XXXX1"
	"P800000"       "20150131152252599003014654832409XXXX1"
	"P700000"       "20150131205316663000040229744636XXXX1" <--

2.2 Global sort by tid

select mobilename, testrecordid from testtab_sort order by testrecordid asc;
	"P700000"       "20150130172902677003014654832192XXXX1"
	"P700000"       "20150130203618099002901408624547XXXX1"
	"P700000"       "20150130203726033002901408624247XXXX1"
	"P700000"       "20150130203851342002901408624846XXXX1"
	"P700000"       "20150130204711874002901408624669XXXX1"
	"P800000"       "20150131152252599003014654832409XXXX1"
	"P800000"       "20150131173954764003014654832951XXXX1"
	"P800000"       "20150131200923817000040229744422XXXX1"
	"P800000"       "20150131204500648000040229744318XXXX1"
	"P700000"       "20150131205316663000040229744636XXXX1"
	"P800000"       "20150131205316663000040229744636XXXX1"
	"MOBILENAME"    "TESTRECORDID"
	"MOBILENAME"    "TESTRECORDID"

2.3 sort by with default settings

Note: the result is no different from that of [order by].

select mobilename, testrecordid from testtab_sort sort by testrecordid asc;
	"P700000"       "20150130172902677003014654832192XXXX1"
	"P700000"       "20150130203618099002901408624547XXXX1"
	"P700000"       "20150130203726033002901408624247XXXX1"
	"P700000"       "20150130203851342002901408624846XXXX1"
	"P700000"       "20150130204711874002901408624669XXXX1"
	"P800000"       "20150131152252599003014654832409XXXX1"
	"P800000"       "20150131173954764003014654832951XXXX1"
	"P800000"       "20150131200923817000040229744422XXXX1"
	"P800000"       "20150131204500648000040229744318XXXX1"
	"P700000"       "20150131205316663000040229744636XXXX1"
	"P800000"       "20150131205316663000040229744636XXXX1"
	"MOBILENAME"    "TESTRECORDID"
	"MOBILENAME"    "TESTRECORDID"

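Why?
Because the data set is tiny, only one reduce task runs by default, so the local sort here is also the global sort.

How is the number of reducers set for a Hive MR job?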

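  1. Method 1: the number of reducers strongly affects job performance. When it is not specified, Hive estimates one from the following two settings:
    hive.exec.reducers.bytes.per.reducer (amount of data handled per reduce task; default 1000^3 = 1 GB in older releases, 256 MB since Hive 0.14)
    hive.exec.reducers.max (maximum number of reducers per job; default 999 in older releases, 1009 since Hive 0.14)
    The formula is simple: N = min(setting 2, total input size / setting 1)
    That is, if the total input size does not exceed 1 GB, there is only one reduce task.
  2. Method 2: force the value with set mapred.reduce.tasks = 15;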
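
The formula in Method 1 can be worked through with a short Python sketch. `estimate_reducers` is a hypothetical helper, not Hive code; the default arguments are the 1 GB and 999 values quoted above, and rounding up with a floor of 1 is an assumption of this sketch.

```python
import math

def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    """N = min(max_reducers, total input / bytes per reducer), at least 1."""
    # Rounding up and the floor of 1 are assumptions of this sketch.
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

print(estimate_reducers(10 * 1024))          # 1   (a tiny test table)
print(estimate_reducers(2_500_000_000))      # 3   (2.5 GB of input)
print(estimate_reducers(5_000_000_000_000))  # 999 (capped by the maximum)
```

The first call matches the experiment above: a table of a few rows is far below 1 GB, so a single reduce task is used.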

2.4 sort by with multiple reduce tasks

Building a large enough test data set is impractical, so the number of reducers is set directly.

set mapred.reduce.tasks = 3;
select mobilename, testrecordid from testtab_sort sort by testrecordid asc;
	"P700000"       "20150130172902677003014654832192XXXX1"
	"P700000"       "20150130203618099002901408624547XXXX1"
	"P700000"       "20150130203726033002901408624247XXXX1"
	"P800000"       "20150131200923817000040229744422XXXX1"
	"P800000"       "20150131204500648000040229744318XXXX1"
	"P800000"       "20150131205316663000040229744636XXXX1"
	--------
	"P700000"       "20150130203851342002901408624846XXXX1"
	"P700000"       "20150130204711874002901408624669XXXX1"
	"P800000"       "20150131152252599003014654832409XXXX1"
	"P800000"       "20150131173954764003014654832951XXXX1"
	"MOBILENAME"    "TESTRECORDID"
	--------
	"P700000"       "20150131205316663000040229744636XXXX1"
	"MOBILENAME"    "TESTRECORDID"	

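Note: the result matches expectations: it is split into 3 segments, each sorted internally.
However, this ordering is of little practical use, because the default hash partitioning splits the data in a way that has no business meaning.

2.5 sort by with a specified partition key

Note: the result is first grouped by mname and then sorted by tid within each group, which is more useful in practice.

select mobilename, testrecordid from testtab_sort distribute by mobilename sort by testrecordid asc;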
	"P800000"       "20150131152252599003014654832409XXXX1"
	"P800000"       "20150131173954764003014654832951XXXX1"
	"P800000"       "20150131200923817000040229744422XXXX1"
	"P800000"       "20150131204500648000040229744318XXXX1"
	"P800000"       "20150131205316663000040229744636XXXX1"
	"P700000"       "20150130172902677003014654832192XXXX1"
	"P700000"       "20150130203618099002901408624547XXXX1"
	"P700000"       "20150130203726033002901408624247XXXX1"
	"P700000"       "20150130203851342002901408624846XXXX1"
	"P700000"       "20150130204711874002901408624669XXXX1"
	"P700000"       "20150131205316663000040229744636XXXX1"
	"MOBILENAME"    "TESTRECORDID"
	"MOBILENAME"    "TESTRECORDID"

In essence, this manually specifies how the map side partitions the data.
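
The combined effect of [distribute by] and [sort by] can be mimicked in Python. This is a toy model under stated assumptions: `distribute_and_sort` is a hypothetical helper, the hash (sum of character codes) is a deterministic stand-in for Hive's real hash function, and the tid values are shortened for illustration.

```python
def distribute_and_sort(rows, num_reducers, key_index=0, sort_index=1):
    """Toy model of distribute by <key> sort by <col> across num_reducers."""
    # Assumption: a deterministic stand-in hash (sum of character codes);
    # Hive's real hash function differs.
    def bucket_of(key):
        return sum(ord(ch) for ch in key) % num_reducers

    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    # Each reducer sorts only its own rows by the sort column.
    return [sorted(b, key=lambda r: r[sort_index]) for b in buckets]

# Shortened, illustrative (mname, tid) pairs.
rows = [
    ("P700000", "20150130204711"),
    ("P800000", "20150131205316"),
    ("P700000", "20150130172902"),
    ("P800000", "20150131152252"),
    ("P700000", "20150131205316"),
]
for i, bucket in enumerate(distribute_and_sort(rows, 3)):
    print(i, bucket)
```

All rows that share an mname land in the same bucket and are sorted by tid there, so each reducer's output is a complete, ordered run for its keys. A bucket may stay empty when fewer distinct keys than reducers exist.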

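3. Summary

  1. [order by] is a total sort, achieved by running only 1 reduce task.
  2. [sort by] is a local sort: each partition is sorted, but overall order is not guaranteed. The number of partitions equals the number of reducers.
  3. [distribute by] specifies the partition field. Meaningful partitions that are sorted internally give a more useful result.
  4. Hive determines the number of reduce tasks in one of 2 ways: min(max reducers, total input size / data per reducer), or a directly specified reducer count.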
