hive基础整理

1.建表将文章放入,完成wordcount

1.split函数

hive> select split("I am a student"," ");
OK
["I","am","a","student"]    

2.explode 把数组中每个元素单独变成一行【行转列】

hive> select explode(split("I am a teacher "," "));
OK
I
am
a
teacher

3.wordcount

select 
lower(regexp_extract(word,'[A-Za-z]+',0))as word, --大小写和正则位置变换一下
count(*) as cnt
from 
(
select 
explode(split(sentence,' ')) as word 
from article
)t
group by word
limit 10;
OK
I   2
a   2
am  2
student 1
teacher 1

2.外部表的概念

3.分区表

  • 对于查询效率上分区表更快:
  • 所有数据都放到一个表的文件夹中,查询数据需要遍历(扫一遍该文件中所有数据),如果这个文件夹中国的数据包含了一年的用户的行为数据,这样扫一遍就是一年所有数据中筛选出需要查询的那一天(或几天)的数据。
  • 如果做了分区,想要获取昨天数据,只需要先扫一遍所有文件夹,将昨天对应日期的文件夹找出来,再对该文件夹中数据进行读取分析。
  • 什么时候采用分区,结合业务,经常要用到的分析条件(口径)。where条件里面经常要被用到的,可以按照条件进行设计分区。

4.创建分区表,查看目录结构

5.分桶:

create table bucket_test(id int);
load data local inpath
'/home/badou/Documents/data/hive/bucket_test.txt'
into table bucket_test;
---------------------------------
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000000_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000001_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000002_2
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:55 /user/hive/warehouse/bucket_user/000003_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 06:08 /user/hive/warehouse/bucket_user/000004_2
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:56 /user/hive/warehouse/bucket_user/000005_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000006_1
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000007_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000008_0
-rwxr-xr-x   3 root supergroup          2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000009_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:05 /user/hive/warehouse/bucket_user/000010_1
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000011_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:55 /user/hive/warehouse/bucket_user/000012_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000013_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000014_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000015_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000016_1
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000017_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000018_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000019_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:08 /user/hive/warehouse/bucket_user/000020_2
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000021_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000022_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:00 /user/hive/warehouse/bucket_user/000023_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000024_1
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000025_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000026_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000027_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000028_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:04 /user/hive/warehouse/bucket_user/000029_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:05 /user/hive/warehouse/bucket_user/000030_0
-rwxr-xr-x   3 root supergroup          3 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000031_0

按id分桶,如果id都不够,又怎么会有那么多桶?

6.分桶采样

select * from bucket_user tablesample(bucket 1 out of 32 on id );  32 (对32取余数为0)
select * from bucket_user tablesample(bucket 1 out of 16 on id ); 32 16(对16取余数为0)
select * from bucket_user tablesample(bucket 1 out of 9 on id ); 9 18 27(对9取余数为0)
select * from bucket_user tablesample(bucket 2 out of 9 on id ); 1 10 19 28 (对9取余数为1)

7.test

--orders:订单数据表
order_id user_id 历史 序号 周 小时
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
3367565,1,prior,6,2,07,19.0
550135,1,prior,7,1,09,20.0
3108588,1,prior,8,1,14,14.0
2295261,1,prior,9,1,16,0.0
2550362,1,prior,10,4,08,30.0

select order_dow,count(1) as cnt
from orders
group by order_dow;

select order_hour_of_day,count(1) as cnt
from orders
group by order_hour_of_day;

你可能感兴趣的:(hive基础整理)