【需求】统计某一款产品的用户中,分别有多少用户是同一个年龄。例如25岁的有100人,60岁的有20人。
// 我已提前在hive数据仓库中建表customers,也已经把本地文件加载到了hive仓库里
hive> desc customers;
OK
id int
name string
age int
Time taken: 0.044 seconds, Fetched: 3row(s)
hive> select * from customers limit 15;// 分别对应id, name, age
OK
1 tom 12
2 tommy 13
3 kevin 14
4 goodman 15
5 tonykid 13
6 alice 14
7 alex 14
8 lucyclaire 13
9 woody 15
10 garyking 14
11 johnson 15
12 tracy 14
13 carl 13
14 hisonburg 58
15 laura 14
// 切割出age列,放到数组里,给每一行数组起个别名
hive> select split(age, ' ') as AGE from customers;
OK
["12"]
["13"]
["14"]
["15"]
["13"]
["14"]
["14"]
["13"]
["15"]
["14"]
["15"]
["14"]
["13"]
["58"]
// 炸开每一行数组后,把每个数组中的元素连接成为一个新表,把表名命名为AgeTable
// 再从这个新表里查出AGE字段及其对应的出现次数
hive> select AgeTable.AGE, count(*) from ((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.83 sec HDFSRead: 9463 HDFS Write: 206 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds830 msec
OK
12 3
13 6
14 7
15 5
27 1
48 3
58 5
// 获得统计结果后,需要存起来,供其他业务需求再次使用,
// 所以要建个新表(也可以建一个外部表),把刚才的统计结果直接写到表里
hive> CREATE TABLE customer_age_group AS select AgeTable.AGE, count(*) AS age_count FROM((select explode(split(age, ' ')) as AGE from customers) as AgeTable) group by AgeTable.AGE;
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.98 sec HDFSRead: 9187 HDFS Write: 117 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds980 sec
hive> show tables;// 成功将结果写入新表
OK
customer_age_group
customer_roj_orders
customers
hive> desc customer_age_group;
OK
age string
age_count bigint // 字段别名,hive给的默认数据类型是大整数
// 写入表时,按照统计结果的顺序写入
hive> select * from customer_age_group;
OK
12 3
13 6
14 7
15 5
27 1
48 3
58 5