Data Analysis Course Notes - 17 - Core Hive Skills: Common Functions

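Hi everyone! Starting with this lesson we move on to hands-on Hive. The exercises all run on the course provider's cloud platform, so you may not be able to follow along right now. Feel free to read through first and practice later when you get the chance.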

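This lesson covers:

1. Steps for using Hive on the cloud platform
2. Basic syntax
(1) SELECT …A… FROM …B… WHERE …C…
(2) GROUP BY
(3) ORDER BY
(4) HiveSQL parsing order
3. Common functions
(1) How do you convert a timestamp to a date-time?
(2) How do you calculate the interval between dates?
(3) Conditional functions
(4) String functions
(5) Aggregate functions
4. Key exercises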

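I. Steps for using Hive on the cloud platform

Note: this assumes Hadoop and Hive are already installed on the cloud platform. I won't cover installation here. I've heard it is fairly tedious and I haven't done it myself, so please search online if you need it.

1. Switch user in the terminal: su - root
2. Start Hadoop: start-all.sh
3. Verify that Hadoop started correctly: jps

Six processes should be listed: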

ResourceManager
Jps
NodeManager
NameNode
DataNode
SecondaryNameNode

# If Hadoop did not start correctly, stop it first (stop-all.sh), then start it again (start-all.sh)

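4. Start Hive

cd /opt/module/apache-hive-3.1.1-bin  # go to the Hive installation directory
bin/hive  # start Hive

Once you see the following prompt, Hive has started successfully:

hive (default)>

[Figure: Hive started]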
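
Once the prompt appears, a quick sanity check might look like the sketch below. It assumes the course tables user_info and user_trade live in the default database, which the prompt suggests but these notes do not state.

```sql
-- List the databases and switch to the one shown in the prompt
show databases;
use default;

-- List the tables, then inspect the columns of the two tables used in this lesson
show tables;
desc user_info;
desc user_trade;
```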

II. Basic syntax

1. Filtering data

SELECT …A… 
FROM …B… 
WHERE …C…

Requirement 1: select 10 users whose city is Beijing and whose sex is female.

Table structure:

[Figure: user_info table structure]

select user_name
from user_info
where city = 'beijing' and sex = 'female'
limit 10;

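[Figure: Requirement 1 result]

Note: if the table is partitioned, the WHERE clause should restrict the partition column (for example dt). Otherwise Hive scans every partition, and in strict mode it rejects the query outright.

How can you tell whether a table is partitioned?

Run desc user_trade; if the output contains a "# Partition Information" section, the columns listed there are the partition columns. If there is no such section, the table is not partitioned.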
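
Two other standard Hive statements can help confirm the partitioning. A minimal sketch, assuming user_trade is partitioned by dt as the examples in these notes imply:

```sql
-- Full table metadata, including a "# Partition Information" section if the table is partitioned
desc formatted user_trade;

-- Lists the existing partitions (for example dt=2018-12-31);
-- this statement fails with an error if the table is not partitioned
show partitions user_trade;
```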

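Requirement 2: for 2018-12-31, select the user name, quantity purchased and payment amount of users who bought goods in the food category.

Table structure:

[Figure: user_trade table structure]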

select user_name,
    piece,
    pay_amount
from user_trade
where goods_category = 'food' and dt = '2018-12-31';

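[Figure: Requirement 2 result]

3. Grouping

(1) GROUP BY

Requirement 3: for January through March 2019, how many users bought each goods category, and what was the total payment amount?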

select goods_category,
    count(distinct user_name) as user_num,
    sum(pay_amount) as total_amount
from user_trade
where dt between '2019-01-01' and '2019-03-31'
group by goods_category;

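[Figure: Requirement 3 execution]

[Figure: Requirement 3 result]

(2) GROUP BY … HAVING

HAVING filters the groups produced by GROUP BY. In other words, it filters the aggregated results rather than the rows of the original table.

Requirement 4: find the users whose total payment amount in April 2019 exceeded 50,000 yuan.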

select user_name,
    sum(pay_amount) as total_amount
from user_trade 
where dt between '2019-04-01' and '2019-04-30'
group by user_name
having sum(pay_amount) > 50000;

[Figure: Requirement 4 result]
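
To see WHERE and HAVING side by side, here is a sketch that narrows Requirement 4 to a single category. The 'food' filter and the 10000 threshold are made up for illustration and are not part of the course exercise.

```sql
select user_name,
    sum(pay_amount) as total_amount
from user_trade
where dt between '2019-04-01' and '2019-04-30'  -- row filter, applied before grouping
    and goods_category = 'food'                 -- also a row filter
group by user_name
having sum(pay_amount) > 10000;                 -- group filter, applied after aggregation
```

A condition on a raw column belongs in WHERE; a condition on an aggregate such as sum() can only go in HAVING.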

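4. Sorting with ORDER BY

ASC: ascending (the default, so omitting the keyword means ascending)
DESC: descending
Sorting on several columns: ORDER BY A ASC, B DESC. Each column has its own direction, and any column without a keyword falls back to ASC.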
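
A short sketch of multi-column sorting on the user_trade table used in the earlier requirements. The date and the limit are arbitrary choices for illustration.

```sql
select user_name,
    goods_category,
    pay_amount
from user_trade
where dt = '2019-04-09'
order by goods_category asc,   -- first by category, in ascending order
    pay_amount desc            -- then, within each category, largest payment first
limit 20;
```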

Requirement 5: find the top 5 users by total payment amount in April 2019.

select user_name,
    sum(pay_amount) as total_amount
from user_trade
where dt between '2019-04-01' and '2019-04-30'
group by user_name
order by total_amount desc
limit 5;

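[Figure: Requirement 5 result]

Why does ORDER BY use total_amount rather than sum(pay_amount)?

Because of execution order: ORDER BY runs after SELECT, so the alias defined in SELECT already exists and can be used for sorting. Some Hive versions, older ones in particular, also reject an aggregate expression written directly in ORDER BY.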

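5. HiveSQL parsing order

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

I made up a mnemonic for this, which you are welcome to borrow:

学习癌's exclusive mnemonic for the HiveSQL execution order:

--- 来自哪一组有选排限 ----

(Roughly: "which group you come from decides whether you have the right to choose what to rehearse".)

Breakdown: 来自 (from) 哪 (where) 一组 (group by) 有 (having) 选 (select) 排 (order by) 限 (limit)
Scene to picture: imagine an acting competition show where the contestants are split into groups by mentor, and the mentor's group that scored highest in the previous round gets the privilege of choosing which play to rehearse. So as a contestant, which group you come from decides whether you get first pick. (Forgive me, I got a bit too absorbed in acting shows a while back, haha.)

[Figure: HiveSQL parsing order]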
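
The order can be read off a single query. The sketch below combines Requirements 4 and 5 purely for illustration and numbers each clause by when it is evaluated:

```sql
select user_name,                                  -- 5. SELECT: compute output columns and aliases
    sum(pay_amount) as total_amount
from user_trade                                    -- 1. FROM: read the table
where dt between '2019-04-01' and '2019-04-30'     -- 2. WHERE: filter rows
group by user_name                                 -- 3. GROUP BY: form groups
having sum(pay_amount) > 50000                     -- 4. HAVING: filter groups
order by total_amount desc                         -- 6. ORDER BY: sort; the alias is visible here
limit 5;                                           -- 7. LIMIT: keep the first rows
```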

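III. Common functions

List the functions available in Hive: show functions;
See how a specific function is used:
① desc function function_name;
② desc function extended function_name; (a fuller description)

1. How do you convert a timestamp to a date-time?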

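SELECT pay_time,
    from_unixtime(pay_time,'yyyy-MM-dd HH:mm:ss')
FROM user_trade
WHERE dt='2019-04-09';

-- Note: from_unixtime(bigint unixtime, string format) converts a Unix timestamp (in seconds) to a date string in the given format.
-- In the format, HH is the 24-hour clock and hh is the 12-hour clock, so use HH unless you also print an AM/PM marker.

-- Common formats
1.yyyy-MM-dd HH:mm:ss
2.yyyy-MM-dd HH
3.yyyy-MM-dd HH:mm
4.yyyy-MM-dd

-- Extension: `unix_timestamp(string date)` converts a date string back to a timestamp

[Figure: timestamp-to-date example]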
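
A small round trip on a literal value shows both directions and why the hour pattern matters. The literal is made up, and the exact values returned depend on the Hive version and the session time zone, so treat this as a sketch to try rather than a reference result.

```sql
select unix_timestamp('2019-04-09 13:05:00') as ts,  -- string -> seconds since the epoch
    -- HH is the 24-hour clock (00-23)
    from_unixtime(unix_timestamp('2019-04-09 13:05:00'), 'yyyy-MM-dd HH:mm:ss') as hour_24,
    -- hh is the 12-hour clock (01-12), so morning and afternoon hours look identical
    from_unixtime(unix_timestamp('2019-04-09 13:05:00'), 'yyyy-MM-dd hh:mm:ss') as hour_12;
```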

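2. How do you calculate the interval between dates?

datediff(string enddate, string startdate)
-- Returns the number of days from startdate to enddate (end date minus start date).

-- Extension: functions for adding and subtracting days
date_add(string date, int days)
date_sub(string date, int days)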
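
A quick check of the three functions on literal dates (the dates are arbitrary):

```sql
select datediff('2019-05-01', '2019-04-24') as days_between,  -- 7
    date_add('2019-04-24', 7) as week_later,                  -- 2019-05-01
    date_sub('2019-05-01', 7) as week_earlier;                -- 2019-04-24
```

Note the argument order of datediff: the end date comes first, so swapping the arguments flips the sign of the result.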

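Requirement 6: calculate the number of days between each user's first activation time and 2019-05-01.

This requirement uses the user_info table, so let's look at its structure again:

[Figure: user_info table structure]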

select user_name,
    datediff('2019-05-01', to_date(firstactivetime))
from user_info
limit 10;

[Figure: Requirement 6 result]

3. Conditional functions

(1) CASE WHEN

Requirement 7: count the users in each of four age bands: under 20, 20-30, 30-40, and 40 and over.

select case when age < 20 then 'under 20'
    when age >=20 and age < 30 then '20-30'
    when age >=30 and age < 40 then '30-40'
    else '40 and over' end as age_type,
    count(user_id) as user_num
from user_info
group by case when age < 20 then 'under 20'
    when age >=20 and age < 30 then '20-30'
    when age >=30 and age < 40 then '30-40'
    else '40 and over' end;

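Note: GROUP BY runs before SELECT, so the alias age_type defined in SELECT does not exist yet when the rows are grouped. That is why GROUP BY has to repeat the whole CASE WHEN expression instead of using the alias.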
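
If repeating the expression feels error-prone, one common workaround is to compute the label in a subquery and group by the resulting column. This is a sketch of that alternative, not the course's solution. Note that because of the ELSE branch, rows whose age is NULL also land in the last band.

```sql
select t.age_type,
    count(t.user_id) as user_num
from (
    -- Label each user once; the outer query can then group by the plain column
    select user_id,
        case when age < 20 then 'under 20'
            when age >= 20 and age < 30 then '20-30'
            when age >= 30 and age < 40 then '30-40'
            else '40 and over' end as age_type
    from user_info
) t
group by t.age_type;
```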

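[Figure: Requirement 7 result]

(2) IF

Requirement 8: for each sex, show how users are distributed between high and low levels (assume a level above 5 counts as high).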

select sex,
    if(level > 5, 'high', 'low') as level_type,
    count(user_name) as user_num
from user_info
group by sex,
    if(level > 5, 'high', 'low');

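[Figure: Requirement 8 result]

4. String functions

substr(string A, int startindex, int len)
-- Positions start at 1. If len is omitted, the substring runs from the start position to the end of the string.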
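
Both forms on a literal value. The string is a made-up example in the 'yyyy-MM-dd HH:mm:ss' layout, which is what firstactivetime is assumed to look like here:

```sql
select substr('2018-03-15 10:22:31', 1, 7) as month_part,  -- 2018-03
    substr('2018-03-15 10:22:31', 12) as time_part;        -- 10:22:31
```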

Requirement 9: calculate new-user acquisition for each month.

First look at what the firstactivetime column contains:

select firstactivetime
from user_info
limit 1;

[Figure: query result]

We need the month, so we take the leftmost 7 characters of firstactivetime.

select substr(firstactivetime,1,7) as month,
    count(user_id) as user_num
from user_info
group by substr(firstactivetime,1,7);

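[Figure: Requirement 9 result]

Key requirement 10: count the users of each phone brand.

Let's review the table structure once more:

[Figure: user_info table structure]

The phone brand is stored under the phonebrand key of the extra columns, which come in two forms:

extra1(string):
{"systemtype":"ios","education":"master","marriage_status":"1",
"phonebrand":"iphone X"}
extra2(map):
{"systemtype":"ios","education":"master","marriage_status":"1",
"phonebrand":"iphone X"}

For the former, get_json_object(string json_str, string path) parses the JSON string and returns the value at the given path, written as '$.key'. For the latter, the value is looked up by key, much like a dictionary. The statements for this requirement are: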

-- Option 1: extra1 is a JSON string
SELECT get_json_object(extra1, '$.phonebrand') as phone_brand,
    count(distinct user_id) user_num
FROM user_info
GROUP BY get_json_object(extra1, '$.phonebrand');

-- Option 2: extra2 is a map
SELECT extra2['phonebrand'] as phone_brand,
    count(distinct user_id) user_num
FROM user_info
GROUP BY extra2['phonebrand'];

[Figure: Requirement 10 result]
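
The two access styles can also be tried on literal values without touching the table. In this sketch the JSON string and the map are shortened, made-up versions of the extra1 and extra2 samples above:

```sql
select get_json_object('{"systemtype":"ios","phonebrand":"iphone X"}', '$.phonebrand') as from_json,  -- iphone X
    map('systemtype', 'ios', 'phonebrand', 'iphone X')['phonebrand'] as from_map;                       -- iphone X
```

If the key is missing, both expressions return NULL, so such users would end up grouped under a NULL brand.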

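5. Aggregate functions

Requirement 11: for user ELLA in 2018, find the average amount per payment, and the number of days between the latest and earliest payment dates in 2018.

The first half is easy: just use avg. The second half needs a date interval, which means datediff. datediff expects its dates as strings, but pay_time in user_trade is a Unix timestamp (e.g. 1541361822), so from_unixtime is needed to convert the timestamp to a date string first.

The SQL is: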

select avg(pay_amount),
    datediff(max(from_unixtime(pay_time,'yyyy-MM-dd')),min(from_unixtime(pay_time,'yyyy-MM-dd')))
from user_trade
where user_name = 'ELLA' and year(dt)=2018;

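[Figure: Requirement 11 result]

An alternative is to take the max and min of the raw timestamps first and convert only those two values to strings, as below. One might worry that this introduces an error because timestamps are precise to the second, but converting a timestamp to a date preserves ordering, so the latest timestamp falls on the latest date and both versions should return the same interval.

select avg(pay_amount),
    datediff(from_unixtime(max(pay_time),'yyyy-MM-dd'),from_unixtime(min(pay_time),'yyyy-MM-dd'))
from user_trade
where user_name = 'ELLA' and year(dt)=2018;

In this example the result is indeed the same:

[Figure: query result]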

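IV. Key exercises

Requirement 12: count the users who bought more than two goods categories in 2018.

Breaking it down:

First, count the goods categories each user bought in 2018;
then keep the users whose category count is greater than 2;
finally, count the users that remain.

Statement for steps 1 and 2: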

select user_name,
    count(distinct goods_category)
from user_trade
where year(dt)=2018
group by user_name
having count(distinct goods_category) > 2;

[Figure: result of steps 1-2]

Step 3 uses a subquery:

select count(e.user_name)
from (select user_name,
    count(distinct goods_category)
from user_trade
where year(dt)=2018
group by user_name
having count(distinct goods_category) > 2) as e;

Note: the outer query could use select count(*) just as well.

[Figure: result of step 3]
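
The same three steps can also be written with a common table expression, which some readers find easier to follow than a nested subquery. This is an alternative sketch, not the course's solution. The names user_category and category_num are introduced here for illustration.

```sql
with user_category as (
    -- Step 1: number of distinct categories each user bought in 2018
    select user_name,
        count(distinct goods_category) as category_num
    from user_trade
    where year(dt) = 2018
    group by user_name
)
-- Steps 2 and 3: keep users with more than two categories, then count them
select count(*) as user_num
from user_category
where category_num > 2;
```

Here the filter moves from HAVING to a plain WHERE on the named column, because the aggregation has already happened inside the CTE.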

Requirement 13: the marital status of users who were activated in 2018 and fall in the 20-30 and 30-40 age bands