Hive 1.2.1 Sorting

1. References

The four kinds of sorting in Hive, plus partitioning and bucketing
On Hive's order by, sort by, distribute by, and cluster by
group by & distribute by & partition by & cluster by & partitioned by & clustered by in Hive HQL

2. Getting Started

The four ways to sort:

  • order by
  • sort by
  • distribute by
  • cluster by

Let's go through them one by one.

order by

Global sort. Only a single reducer is used, regardless of the configured mapreduce.job.reduces, and all of the data is sorted (ascending by default).

select * from hotel_checkin_log order by order_time;
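Because every row funnels through that single reducer, order by on a large table can be slow. A common mitigation, assuming you only need the top rows, is to pair it with limit; desc is also supported:

```sql
-- Top 5 most recent check-in orders: the single reducer only has to
-- retain 5 rows, which is much cheaper than materializing a full global sort.
select * from hotel_checkin_log order by order_time desc limit 5;
```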

sort by

Sorts within each reducer; the number of reducers follows the configured mapreduce.job.reduces, so from a global perspective the output is not sorted.
Before sorting, let's confirm the number of reducers:

hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=4

There are 4 reducers. Now run sort by, ordering ascending by order_time:

select * from hotel_checkin_log sort by order_time;
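Like order by, sort by also accepts asc/desc; the ordering just applies within each reducer:

```sql
-- Each reducer sorts its own slice of rows descending by order_time;
-- there is still no global ordering across reducers.
select * from hotel_checkin_log sort by order_time desc;
```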

Now look at the sorted data; the third column from the end (order_time) is the sort column.
Judging from the result below against the order_time column, it is not globally sorted. From the data we can deduce how many rows each reducer handled (here I use blank lines to separate each reducer's results; the actual output has no blank lines).

hotel_checkin_log.code  hotel_checkin_log.checkin_date  hotel_checkin_log.checkin_time  hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time    hotel_checkin_log.uid   hotel_checkin_log.order_date
10271563        2020-03-20      14:00   2020-03-21      12:00   02:55:22        1124759681900888144     2020-03-19
10271563        2020-04-02      14:00   2020-04-03      12:00   02:55:22        1123659685898059853     2020-04-01
02371493        2020-04-01      17:00   2020-04-02      12:00   09:20:00        1133659694081146884     2020-03-31
00277553        2020-03-22      12:00   2020-03-25      12:00   12:00:00        1213559689173811267     2020-03-20

02371493        2020-04-02      17:00   2020-04-03      12:00   09:20:00        1133659694081146884     2020-04-01
02371493        2020-03-20      17:00   2020-03-21      12:00   09:20:00        1236559684014817343     2020-03-19
00277553        2020-04-02      12:00   2020-04-03      12:00   12:00:00        1213559689173811267     2020-04-01
00277553        2020-03-19      12:00   2020-03-22      12:00   12:00:00        1124559675462631493     2020-03-18

10271563        2020-03-19      14:00   2020-03-20      12:00   02:55:22        1431259675035954008     2020-03-18
10271563        2020-04-01      14:00   2020-04-02      12:00   02:55:22        1123659685898059853     2020-03-31
02371493        2020-03-19      17:00   2020-03-19      12:00   09:20:00        1634659676964192299     2020-03-18
00277553        2020-04-01      12:00   2020-04-02      12:00   12:00:00        1213559689173811267     2020-03-31
00277553        2020-03-20      12:00   2020-03-22      12:00   12:00:00        1234759682320318549     2020-03-19

10271563        2020-03-21      14:00   2020-03-23      12:00   02:55:22        1123659685898059853     2020-03-20
02371493        2020-03-22      17:00   2020-03-23      12:00   09:20:00        1133659694081146884     2020-03-20
Time taken: 33.041 seconds, Fetched: 15 row(s)

To make the per-reducer sorting easier to verify, let's write the sorted results to files. As we know, each reducer produces one output file.

insert overwrite local directory '/opt/envs/datas/sortby'
select * from hotel_checkin_log sort by order_time;

The file count tells the story:

[root@e2d21b78fdb0 sortby]# ll
total 16
-rw-r--r-- 1 root root 332 Jun 25 11:07 000000_0
-rw-r--r-- 1 root root 332 Jun 25 11:07 000001_0
-rw-r--r-- 1 root root 415 Jun 25 11:07 000002_0
-rw-r--r-- 1 root root 166 Jun 25 11:07 000003_0

I also checked the file contents, and they match my deduction exactly. Here I only paste the contents of 000002_0 (the fields appear to run together because, without a ROW FORMAT clause, Hive writes them separated by the non-printing '\001' delimiter).

102715632020-03-1914:002020-03-2012:0002:55:2214312596750359540082020-03-18
102715632020-04-0114:002020-04-0212:0002:55:2211236596858980598532020-03-31
023714932020-03-1917:002020-03-1912:0009:20:0016346596769641922992020-03-18
002775532020-04-0112:002020-04-0212:0012:00:0012135596891738112672020-03-31
002775532020-03-2012:002020-03-2212:0012:00:0012347596823203185492020-03-19
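If you want the exported files to be human-readable, Hive 0.11 and later let you attach a row format to the directory export, for example tab-separated:

```sql
-- Same export as above, but with visible tab delimiters between fields.
insert overwrite local directory '/opt/envs/datas/sortby'
row format delimited fields terminated by '\t'
select * from hotel_checkin_log sort by order_time;
```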

distribute by

Partitioned sort, used together with sort by; note that it has nothing whatsoever to do with whether the table is a partitioned table.
Rows are divided into different reducer output files according to the specified field.
By default hash partitioning is used, i.e., rows are assigned to reducers by the hash value of the distribute-by field.

Before sorting, let's confirm the number of reducers:

hive (db_company)> set mapred.reduce.tasks;
mapred.reduce.tasks=3

There are 3 reducers. Now run distribute by, partitioning by order_date and sorting ascending by order_time:

select * from hotel_checkin_log distribute by order_date sort by order_time;

The result is below. It is not very intuitive, so let's write it to files instead.

hotel_checkin_log.code  hotel_checkin_log.checkin_date  hotel_checkin_log.checkin_time  hotel_checkin_log.checkout_date hotel_checkin_log.checkout_time hotel_checkin_log.order_time    hotel_checkin_log.uid   hotel_checkin_log.order_date
10271563        2020-03-20      14:00   2020-03-21      12:00   02:55:22        1124759681900888144     2020-03-19
10271563        2020-04-01      14:00   2020-04-02      12:00   02:55:22        1123659685898059853     2020-03-31
02371493        2020-04-01      17:00   2020-04-02      12:00   09:20:00        1133659694081146884     2020-03-31
02371493        2020-03-20      17:00   2020-03-21      12:00   09:20:00        1236559684014817343     2020-03-19
00277553        2020-03-20      12:00   2020-03-22      12:00   12:00:00        1234759682320318549     2020-03-19
00277553        2020-04-01      12:00   2020-04-02      12:00   12:00:00        1213559689173811267     2020-03-31
10271563        2020-04-02      14:00   2020-04-03      12:00   02:55:22        1123659685898059853     2020-04-01
10271563        2020-03-21      14:00   2020-03-23      12:00   02:55:22        1123659685898059853     2020-03-20
02371493        2020-03-22      17:00   2020-03-23      12:00   09:20:00        1133659694081146884     2020-03-20
02371493        2020-04-02      17:00   2020-04-03      12:00   09:20:00        1133659694081146884     2020-04-01
00277553        2020-03-22      12:00   2020-03-25      12:00   12:00:00        1213559689173811267     2020-03-20
00277553        2020-04-02      12:00   2020-04-03      12:00   12:00:00        1213559689173811267     2020-04-01
10271563        2020-03-19      14:00   2020-03-20      12:00   02:55:22        1431259675035954008     2020-03-18
02371493        2020-03-19      17:00   2020-03-19      12:00   09:20:00        1634659676964192299     2020-03-18
00277553        2020-03-19      12:00   2020-03-22      12:00   12:00:00        1124559675462631493     2020-03-18

Let's save the data to files and take another look:

insert overwrite local directory '/opt/envs/datas/distributeby'
select * from hotel_checkin_log distribute by order_date sort by order_time;

First, the file count:

[root@e2d21b78fdb0 distributeby]# ll
total 12
-rw-r--r-- 1 root root 498 Jun 25 12:42 000000_0
-rw-r--r-- 1 root root 498 Jun 25 12:42 000001_0
-rw-r--r-- 1 root root 249 Jun 25 12:42 000002_0

Then the contents of each file:

[root@e2d21b78fdb0 distributeby]# more 000000_0
102715632020-03-2014:002020-03-2112:0002:55:2211247596819008881442020-03-19
102715632020-04-0114:002020-04-0212:0002:55:2211236596858980598532020-03-31
023714932020-04-0117:002020-04-0212:0009:20:0011336596940811468842020-03-31
023714932020-03-2017:002020-03-2112:0009:20:0012365596840148173432020-03-19
002775532020-03-2012:002020-03-2212:0012:00:0012347596823203185492020-03-19
002775532020-04-0112:002020-04-0212:0012:00:0012135596891738112672020-03-31

[root@e2d21b78fdb0 distributeby]# more 000001_0
102715632020-04-0214:002020-04-0312:0002:55:2211236596858980598532020-04-01
102715632020-03-2114:002020-03-2312:0002:55:2211236596858980598532020-03-20
023714932020-03-2217:002020-03-2312:0009:20:0011336596940811468842020-03-20
023714932020-04-0217:002020-04-0312:0009:20:0011336596940811468842020-04-01
002775532020-03-2212:002020-03-2512:0012:00:0012135596891738112672020-03-20
002775532020-04-0212:002020-04-0312:0012:00:0012135596891738112672020-04-01

[root@e2d21b78fdb0 distributeby]# more 000002_0
102715632020-03-1914:002020-03-2012:0002:55:2214312596750359540082020-03-18
023714932020-03-1917:002020-03-1912:0009:20:0016346596769641922992020-03-18
002775532020-03-1912:002020-03-2212:0012:00:0011245596754626314932020-03-18
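Notice that 000000_0 holds the rows for both 2020-03-19 and 2020-03-31: distribute by only guarantees that all rows with the same key land on the same reducer, not that each reducer gets exactly one key, so two dates whose hashes fall into the same bucket share a file. You can preview the bucketing with Hive's builtin hash() (a rough sketch: the actual partitioner takes the hash modulo the reducer count, so the exact file numbers may differ):

```sql
-- Roughly which of the 3 reducers each order_date should land in.
select distinct order_date, pmod(hash(order_date), 3) as bucket
from hotel_checkin_log;
```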

cluster by

1> When the distribute by and sort by fields are the same, cluster by can be used instead.
2> The sort can only be ascending; you cannot specify asc or desc.

select * from hotel_checkin_log cluster by order_date;
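In other words, the statement above is just shorthand for spelling out the two clauses on the same column:

```sql
-- Equivalent to: cluster by order_date
select * from hotel_checkin_log distribute by order_date sort by order_date;
```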

Specifying descending order produces the following error:

hive (db_company)> select * from hotel_checkin_log cluster by order_date desc;
FAILED: ParseException line 1:54 extraneous input 'desc' expecting EOF near ''

3. Done
