1.建表将文章放入,完成wordcount
1.split函数
hive> select split("I am a student"," ");
OK
["I","am","a","student"]
2.explode 把数组中每个元素单独变成一行【行转列】
hive> select explode(split("I am a teacher "," "));
OK
I
am
a
teacher
3.wordcount
select
lower(regexp_extract(word,'[A-Za-z]+',0))as word, --大小写和正则位置变换一下
count(*) as cnt
from
(
select
explode(split(sentence,' ')) as word
from article
)t
group by word
limit 10;
OK
I 2
a 2
am 2
student 1
teacher 1
2.外部表的概念
3.分区表
- 对于查询效率上分区表更快:
- 所有数据都放到一个表的文件夹中,查询数据需要遍历(扫一遍该文件中所有数据),如果这个文件夹中国的数据包含了一年的用户的行为数据,这样扫一遍就是一年所有数据中筛选出需要查询的那一天(或几天)的数据。
- 如果做了分区,想要获取昨天数据,只需要先扫一遍所有文件夹,将昨天对应日期的文件夹找出来,再对该文件夹中数据进行读取分析。
- 什么时候采用分区,结合业务,经常要用到的分析条件(口径)。where条件里面经常要被用到的,可以按照条件进行设计分区。
4.创建分区表,查看目录结构
5.分桶:
create table bucket_test(id int);
load data local inpath
'/home/badou/Documents/data/hive/bucket_test.txt'
into table bucket_test;
---------------------------------
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000000_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000001_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000002_2
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:55 /user/hive/warehouse/bucket_user/000003_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 06:08 /user/hive/warehouse/bucket_user/000004_2
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:56 /user/hive/warehouse/bucket_user/000005_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000006_1
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000007_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000008_0
-rwxr-xr-x 3 root supergroup 2 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000009_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:05 /user/hive/warehouse/bucket_user/000010_1
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000011_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:55 /user/hive/warehouse/bucket_user/000012_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:54 /user/hive/warehouse/bucket_user/000013_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000014_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000015_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000016_1
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000017_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000018_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000019_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:08 /user/hive/warehouse/bucket_user/000020_2
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000021_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 05:59 /user/hive/warehouse/bucket_user/000022_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:00 /user/hive/warehouse/bucket_user/000023_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:07 /user/hive/warehouse/bucket_user/000024_1
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000025_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000026_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:01 /user/hive/warehouse/bucket_user/000027_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000028_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:04 /user/hive/warehouse/bucket_user/000029_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:05 /user/hive/warehouse/bucket_user/000030_0
-rwxr-xr-x 3 root supergroup 3 2019-11-12 06:06 /user/hive/warehouse/bucket_user/000031_0
按id分桶,如果id都不够,又怎么会有那么多桶?
6.分桶采样
select * from bucket_user tablesample(bucket 1 out of 32 on id ); 32 (对32取余数为0)
select * from bucket_user tablesample(bucket 1 out of 16 on id ); 32 16(对16取余数为0)
select * from bucket_user tablesample(bucket 1 out of 9 on id ); 9 18 27(对9取余数为0)
select * from bucket_user tablesample(bucket 2 out of 9 on id ); 1 10 19 28 (对9取余数为1)
7.test
--orders:订单数据表
order_id user_id 历史 序号 周 小时
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
3367565,1,prior,6,2,07,19.0
550135,1,prior,7,1,09,20.0
3108588,1,prior,8,1,14,14.0
2295261,1,prior,9,1,16,0.0
2550362,1,prior,10,4,08,30.0
select order_dow,count(1) as cnt
from orders
group by order_dow;
select order_hour_of_day,count(1) as cnt
from orders
group by order_hour_of_day;