1. apache-hive-3.1.2 introduction and deployment (three modes: embedded, local, and remote) with verification, explained in detail
2. Hive concepts explained: architecture, file read/write mechanism, data storage
3. Hive usage examples: table creation, data types, internal and external tables, partitioned tables, bucketed tables
4. Hive usage examples: transactional tables, views, materialized views, DDL (database, table, and partition) management in detail
5. Hive load, insert, and transactional table usage explained with examples
6. Hive select (GROUP BY, ORDER BY, CLUSTER BY, SORT BY, LIMIT, union, CTE) and join usage explained with examples
7. Hive shell client and property configuration, built-in operators, functions (built-in operators and custom UDF operators)
8. Hive relational, logical, mathematical, numeric, date, conditional, and string functions: syntax and usage examples
9. Hive explode, Lateral View, aggregate functions, window functions, and sampling functions explained
10. Hive comprehensive examples: multi-delimiter data (RegexSerDe), URL parsing, and common row/column conversion functions (case when, union, concat, explode) with detailed examples
11. Hive comprehensive application examples: JSON parsing, window function applications (consecutive logins, cascading accumulation, topN), zipper tables
12. Hive optimization: file storage formats, compression formats, and job execution optimization (execution plans, MR properties, join, optimizer, predicate pushdown, data skew) with examples
13. Accessing Hive from the Java API with examples
This article covers JSON parsing in Hive, several window function applications, and a concrete zipper table example.
It is organized into three parts: JSON parsing, common window function scenarios, and a zipper table example.
It assumes a working Hive environment.
Some of the data used here comes from the internet.
To parse JSON-formatted data, Hive provides two approaches; in practice you can choose whichever fits the data and the requirement at hand.
get_json_object parses a JSON string and returns the value of one specified field.
get_json_object(json_txt, path) - Extract a json object from path
#First argument: the JSON string to parse
#Second argument: the field to return, given as a path of the form $.columnName
#Each call returns the value of only one field of the JSON object
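A minimal, self-contained sketch with an inline JSON literal (not tied to any table) to show the call shape:
--returns device_30 and 98.0, both as strings
select
get_json_object('{"device":"device_30","signal":98.0}', '$.device') as device,
get_json_object('{"device":"device_30","signal":98.0}', '$.signal') as signal;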
create table tb_json_test1 (
json string
);
--load data
load data local inpath '/usr/local/bigdata/device.json' into table tb_json_test1;
select * from tb_json_test1;
0: jdbc:hive2://server4:10000> select * from tb_json_test1;
+----------------------------------------------------+
| tb_json_test1.json |
+----------------------------------------------------+
| {"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201390} |
| {"device":"device_40","deviceType":"route","signal":99.0,"time":1616817201887} |
| {"device":"device_21","deviceType":"bigdata","signal":77.0,"time":1616817202142} |
| {"device":"device_31","deviceType":"kafka","signal":98.0,"time":1616817202405} |
| {"device":"device_20","deviceType":"bigdata","signal":12.0,"time":1616817202513} |
| {"device":"device_54","deviceType":"bigdata","signal":14.0,"time":1616817202913} |
...
+----------------------------------------------------+
select
--device name
get_json_object(json,"$.device") as device,
--device type
get_json_object(json,"$.deviceType") as deviceType,
--signal strength
get_json_object(json,"$.signal") as signal,
--time
get_json_object(json,"$.time") as stime
from tb_json_test1;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> --device name
. . . . . . . . . . . . . . .> get_json_object(json,"$.device") as device,
. . . . . . . . . . . . . . .> --device type
. . . . . . . . . . . . . . .> get_json_object(json,"$.deviceType") as deviceType,
. . . . . . . . . . . . . . .> --signal strength
. . . . . . . . . . . . . . .> get_json_object(json,"$.signal") as signal,
. . . . . . . . . . . . . . .> --time
. . . . . . . . . . . . . . .> get_json_object(json,"$.time") as stime
. . . . . . . . . . . . . . .> from tb_json_test1;
+------------+-------------+---------+----------------+
| device | devicetype | signal | stime |
+------------+-------------+---------+----------------+
| device_30 | kafka | 98.0 | 1616817201390 |
| device_40 | route | 99.0 | 1616817201887 |
| device_21 | bigdata | 77.0 | 1616817202142 |
| device_31 | kafka | 98.0 | 1616817202405 |
...
+------------+-------------+---------+----------------+
json_tuple also parses a JSON string, but it takes multiple field names and returns multiple columns in a single call.
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and returns a tuple
# First argument: the JSON string to parse
# Second argument: the 1st field to return
# ...
# (N+1)th argument: the Nth field to return
# Works like get_json_object, but one call can return multiple columns; it is a UDTF-type function and is usually used together with lateral view
# Every column is returned as a string
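Again a minimal, self-contained sketch, with the inline JSON literal wrapped in a one-row subquery just to show the call shape (both returned columns come back as strings):
select json_tuple(t.js, 'device', 'signal') as (device, signal)
from (select '{"device":"device_30","signal":98.0}' as js) t;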
--used on its own
select
--parse all fields
json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime)
from tb_json_test1;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> --parse all fields
. . . . . . . . . . . . . . .> json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime)
. . . . . . . . . . . . . . .> from tb_json_test1;
+------------+-------------+---------+----------------+
| device | devicetype | signal | stime |
+------------+-------------+---------+----------------+
| device_30 | kafka | 98.0 | 1616817201390 |
| device_40 | route | 99.0 | 1616817201887 |
| device_21 | bigdata | 77.0 | 1616817202142 |
| device_31 | kafka | 98.0 | 1616817202405 |
| device_20 | bigdata | 12.0 | 1616817202513 |
...
--used with a lateral view
select json,
device,deviceType,signal,stime
from tb_json_test1
lateral view json_tuple(json,"device","deviceType","signal","time") b as device,deviceType,signal,stime;
0: jdbc:hive2://server4:10000> select json,
. . . . . . . . . . . . . . .> device,deviceType,signal,stime
. . . . . . . . . . . . . . .> from tb_json_test1
. . . . . . . . . . . . . . .> lateral view json_tuple(json,"device","deviceType","signal","time") b as device,deviceType,signal,stime;
+----------------------------------------------------+------------+-------------+---------+----------------+
| json | device | devicetype | signal | stime |
+----------------------------------------------------+------------+-------------+---------+----------------+
| {"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201390} | device_30 | kafka | 98.0 | 1616817201390 |
| {"device":"device_40","deviceType":"route","signal":99.0,"time":1616817201887} | device_40 | route | 99.0 | 1616817201887 |
| {"device":"device_21","deviceType":"bigdata","signal":77.0,"time":1616817202142} | device_21 | bigdata | 77.0 | 1616817202142 |
...
In the approach above, each record is loaded into the table as one JSON string and then parsed with the JSON functions. This is quite flexible, but when the whole file consists of JSON documents it becomes cumbersome to use.
To simplify handling of JSON files, Hive ships with a SerDe dedicated to parsing JSON. When creating the table, simply declare that the table's files are parsed with the JsonSerDe, and every field of each JSON record is mapped to a column automatically.
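One practical note: org.apache.hive.hcatalog.data.JsonSerDe is packaged in the hive-hcatalog-core jar, which is not always on the session classpath. If the CREATE TABLE below fails with a class-not-found error, the jar can usually be added by hand; the path here is only an assumed example and has to be adapted to the actual installation:
--assumed path under the Hive installation; adjust to the local hive-hcatalog-core jar
add jar /usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-3.1.2.jar;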
--JsonSerDe
--create table
create table tb_json_test2 (
device string,
deviceType string,
signal double,
`time` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
load data local inpath '/usr/local/bigdata/device.json' into table tb_json_test2;
select * from tb_json_test2;
0: jdbc:hive2://server4:10000> create table tb_json_test2 (
. . . . . . . . . . . . . . .> device string,
. . . . . . . . . . . . . . .> deviceType string,
. . . . . . . . . . . . . . .> signal double,
. . . . . . . . . . . . . . .> `time` string
. . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . .> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
. . . . . . . . . . . . . . .> STORED AS TEXTFILE;
No rows affected (0.063 seconds)
0: jdbc:hive2://server4:10000> select * from tb_json_test2;
+-----------------------+---------------------------+-----------------------+---------------------+
| tb_json_test2.device | tb_json_test2.devicetype | tb_json_test2.signal | tb_json_test2.time |
+-----------------------+---------------------------+-----------------------+---------------------+
| device_30 | kafka | 98.0 | 1616817201390 |
| device_40 | route | 99.0 | 1616817201887 |
| device_21 | bigdata | 77.0 | 1616817202142 |
| device_31 | kafka | 98.0 | 1616817202405 |
| device_20 | bigdata | 12.0 | 1616817202513 |
The first window-function scenario is finding users who logged in on consecutive days. To spot the pattern in the data: a user logged in on consecutive days when rows with the same user ID have login dates that follow one another.
For example, to count users who logged in two days in a row, it is enough that the user IDs match and the login dates differ by exactly one day.
--1. consecutive login users
--create table
create table tb_login(
userid string,
logintime string
)
row format delimited fields terminated by '\t';
load data local inpath '/usr/local/bigdata/login.log' into table tb_login;
select * from tb_login;
0: jdbc:hive2://server4:10000> select * from tb_login;
+------------------+---------------------+
| tb_login.userid | tb_login.logintime |
+------------------+---------------------+
| A | 2021-03-22 |
| B | 2021-03-22 |
| C | 2021-03-22 |
| A | 2021-03-23 |
| C | 2021-03-23 |
| A | 2021-03-24 |
| B | 2021-03-24 |
+------------------+---------------------+
--self-join + filter implementation
--a. build the Cartesian product
select
a.userid as a_userid,
a.logintime as a_logintime,
b.userid as b_userid,
b.logintime as b_logintime
from tb_login a,tb_login b;
--save the result above as a temporary table
create table tb_login_tmp as
select
a.userid as a_userid,
a.logintime as a_logintime,
b.userid as b_userid,
b.logintime as b_logintime
from tb_login a,tb_login b;
--filter: same user id and login dates differing by 1
select
a_userid,a_logintime,b_userid,b_logintime
from tb_login_tmp
where a_userid = b_userid
and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> a_userid,a_logintime,b_userid,b_logintime
. . . . . . . . . . . . . . .> from tb_login_tmp
. . . . . . . . . . . . . . .> where a_userid = b_userid
. . . . . . . . . . . . . . .> and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
+-----------+--------------+-----------+--------------+
| a_userid | a_logintime | b_userid | b_logintime |
+-----------+--------------+-----------+--------------+
| A | 2021-03-23 | A | 2021-03-22 |
| C | 2021-03-23 | C | 2021-03-22 |
| A | 2021-03-24 | A | 2021-03-23 |
+-----------+--------------+-----------+--------------+
--users who logged in on two consecutive days
select
distinct a_userid
from tb_login_tmp
where a_userid = b_userid
and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
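Note that the substr-based comparison above only works because all of the sample dates fall inside one month; across month boundaries the day-of-month arithmetic breaks down. A sketch of a more robust filter on the same temporary table, using datediff instead:
--datediff(later, earlier) returns the difference in whole days, so it also works across month and year boundaries
select distinct a_userid
from tb_login_tmp
where a_userid = b_userid
and datediff(a_logintime, b_logintime) = 1;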
lead returns, for the current row, a value taken from a row that lies a given number of rows ahead within the partition.
--syntax
lead(colName,N,defaultValue)
--colName: the column to take the value from
--N: how many rows to look ahead
--defaultValue: the value returned when there is no such row
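As a quick standalone illustration against the tb_login table created above (lag, the backward-looking counterpart, is shown alongside; the '9999-12-31' default is just an arbitrary sentinel):
select
userid,
logintime,
--next login date of the same user, or the sentinel when there is none
lead(logintime, 1, '9999-12-31') over (partition by userid order by logintime) as next_login,
--previous login date of the same user, or NULL when there is none
lag(logintime, 1) over (partition by userid order by logintime) as prev_login
from tb_login;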
Based on the login data:
Two consecutive days: the user's next login date = the day after the current login
Three consecutive days: the user's login after next = two days after the current login
...
So we can partition by user ID, order by login date, and use lead to obtain each user's next login date.
Then compute the day after the current login with a date function; if the two dates are equal, the user logged in on two consecutive days.
----window function implementation
--2 consecutive login days
with t1 as (
select
userid,
logintime,
--the day after the current login
date_add(logintime,1) as nextday,
--partition by user id, order by login date, take the next login date; default to 0 when there is none
lead(logintime,1,0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;
0: jdbc:hive2://server4:10000> with t1 as (
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> logintime,
. . . . . . . . . . . . . . .> --the day after the current login
. . . . . . . . . . . . . . .> date_add(logintime,1) as nextday,
. . . . . . . . . . . . . . .> --partition by user id, order by login date, take the next login date; default to 0 when there is none
. . . . . . . . . . . . . . .> lead(logintime,1,0) over (partition by userid order by logintime) as nextlogin
. . . . . . . . . . . . . . .> from tb_login )
. . . . . . . . . . . . . . .> select distinct userid from t1 where nextday = nextlogin;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+
| userid |
+---------+
| A |
| C |
+---------+
--3 consecutive login days
with t1 as (
select
userid,
logintime,
--two days after the current login (the third day)
date_add(logintime,2) as nextday,
--partition by user id, order by login date, take the login after next; default to 0 when there is none
lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;
0: jdbc:hive2://server4:10000> with t1 as (
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> logintime,
. . . . . . . . . . . . . . .> --two days after the current login (the third day)
. . . . . . . . . . . . . . .> date_add(logintime,2) as nextday,
. . . . . . . . . . . . . . .> --partition by user id, order by login date, take the login after next; default to 0 when there is none
. . . . . . . . . . . . . . .> lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
. . . . . . . . . . . . . . .> from tb_login )
. . . . . . . . . . . . . . .> select distinct userid from t1 where nextday = nextlogin;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+
| userid |
+---------+
| A |
+---------+
--generalizing to N consecutive days (template)
select
userid,
logintime,
--N-1 days after the current login (the Nth day)
date_add(logintime,N-1) as nextday,
--partition by user id, order by login date, take the (N-1)th following login date; default to 0 when there is none
lead(logintime,N-1,0) over (partition by userid order by logintime) as nextlogin
from tb_login;
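The N above is only a placeholder. One way to turn the template into a runnable statement is Hive's variable substitution; a sketch assuming the gap N-1 is supplied as a hivevar named GAP (a name chosen here for illustration):
--GAP is N-1; GAP=2 reproduces the three-consecutive-days case
set hivevar:GAP=2;
with t1 as (
select
userid,
logintime,
date_add(logintime, ${hivevar:GAP}) as nextday,
lead(logintime, ${hivevar:GAP}, 0) over (partition by userid order by logintime) as nextlogin
from tb_login )
select distinct userid from t1 where nextday = nextlogin;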
The second scenario is cascading accumulation: first aggregate each user's spending per month, then build a self-join and group-aggregate under the appropriate condition.
--1. create table and load data
create table tb_money(
userid string,
mth string,
money int
)
row format delimited fields terminated by '\t';
load data local inpath '/usr/local/bigdata/money.tsv' into table tb_money;
select * from tb_money;
0: jdbc:hive2://server4:10000> select * from tb_money;
+------------------+---------------+-----------------+
| tb_money.userid | tb_money.mth | tb_money.money |
+------------------+---------------+-----------------+
| A | 2021-01 | 5 |
| A | 2021-01 | 15 |
| B | 2021-01 | 5 |
| A | 2021-01 | 8 |
| B | 2021-01 | 25 |
| A | 2021-01 | 5 |
| A | 2021-02 | 4 |
| A | 2021-02 | 6 |
| B | 2021-02 | 10 |
| B | 2021-02 | 5 |
| A | 2021-03 | 7 |
| B | 2021-03 | 9 |
| A | 2021-03 | 11 |
| B | 2021-03 | 6 |
+------------------+---------------+-----------------+
--2. total spending per user per month
create table tb_money_mtn as
select
userid,
mth,
sum(money) as m_money
from tb_money
group by userid,mth;
select * from tb_money_mtn;
--approach 1: self-join + group aggregation
--1. self-join the per-user per-month totals
select a.*,b.*
from tb_money_mtn a
join tb_money_mtn b on a.userid = b.userid;
--2. for each month, keep the rows of that month and all earlier months
select a.*,b.*
from tb_money_mtn a
join tb_money_mtn b on a.userid = b.userid
where b.mth <= a.mth;
--3. group rows of the same user and month, then order by user and month
select
a.userid,
a.mth,
max(a.m_money) as current_mth_money, --spending in the current month
sum(b.m_money) as accumulate_money --accumulated spending
from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid
where b.mth <= a.mth
group by a.userid,a.mth
order by a.userid,a.mth;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> a.userid,
. . . . . . . . . . . . . . .> a.mth,
. . . . . . . . . . . . . . .> max(a.m_money) as current_mth_money, --spending in the current month
. . . . . . . . . . . . . . .> sum(b.m_money) as accumulate_money --accumulated spending
. . . . . . . . . . . . . . .> from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid
. . . . . . . . . . . . . . .> where b.mth <= a.mth
. . . . . . . . . . . . . . .> group by a.userid,a.mth
. . . . . . . . . . . . . . .> order by a.userid,a.mth;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-----------+----------+--------------------+-------------------+
| a.userid | a.mth | current_mth_money | accumulate_money |
+-----------+----------+--------------------+-------------------+
| A | 2021-01 | 33 | 33 |
| A | 2021-02 | 10 | 43 |
| A | 2021-03 | 18 | 61 |
| B | 2021-01 | 30 | 30 |
| B | 2021-02 | 15 | 45 |
| B | 2021-03 | 15 | 60 |
+-----------+----------+--------------------+-------------------+
The same cascading totals can also be computed by aggregating each user's monthly spending and then using a window aggregate.
sum over a window adds up the values inside that window.
--syntax
sum(colName) over (partition by col order by col)
--colName: the column whose values are summed
Partition the per-user per-month totals by user and order them by month;
with the aggregation window running from the first row of each partition up to the current row, the running sum is the accumulated spending.
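When an order by is present and no frame is given, Hive's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is exactly the running total wanted here. Spelling the frame out makes the intent explicit; a sketch equivalent to the query below (rows and the default range behave the same here because each user has one row per month):
select
userid,
mth,
m_money,
sum(m_money) over (partition by userid order by mth rows between unbounded preceding and current row) as t_money
from tb_money_mtn;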
--approach 2: window function
--per-month spending and running total per user
select
userid,
mth,
m_money,
sum(m_money) over (partition by userid order by mth) as t_money
from tb_money_mtn;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> mth,
. . . . . . . . . . . . . . .> m_money,
. . . . . . . . . . . . . . .> sum(m_money) over (partition by userid order by mth) as t_money
. . . . . . . . . . . . . . .> from tb_money_mtn;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------+----------+----------+
| userid | mth | m_money | t_money |
+---------+----------+----------+----------+
| A | 2021-01 | 33 | 33 |
| A | 2021-02 | 10 | 43 |
| A | 2021-03 | 18 | 61 |
| B | 2021-01 | 30 | 30 |
| B | 2021-02 | 15 | 45 |
| B | 2021-03 | 15 | 60 |
+---------+----------+----------+----------+
--sliding-window total over nearby months (the previous month through the two following months)
select
userid,
mth,
m_money,
sum(m_money) over (partition by userid order by mth rows between 1 preceding and 2 following) as t_money
from tb_money_mtn;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> mth,
. . . . . . . . . . . . . . .> m_money,
. . . . . . . . . . . . . . .> sum(m_money) over (partition by userid order by mth rows between 1 preceding and 2 following) as t_money
. . . . . . . . . . . . . . .> from tb_money_mtn;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------+----------+----------+
| userid | mth | m_money | t_money |
+---------+----------+----------+----------+
| A | 2021-01 | 33 | 61 |
| A | 2021-02 | 10 | 61 |
| A | 2021-03 | 18 | 28 |
| B | 2021-01 | 30 | 60 |
| B | 2021-02 | 15 | 60 |
| B | 2021-03 | 15 | 30 |
+---------+----------+----------+----------+
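The frame above (1 preceding through 2 following) mainly demonstrates the frame syntax. For a trailing total such as "the last three months", a frame that ends at the current row is the more common pattern; a sketch on the same table:
--rolling total of the current month and the two months before it
select
userid,
mth,
m_money,
sum(m_money) over (partition by userid order by mth rows between 2 preceding and current row) as last3m_money
from tb_money_mtn;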
TopN functions: row_number, rank, dense_rank
row_number: numbers the rows in each partition; tied values still get distinct, consecutive numbers
rank: numbers the rows in each partition; tied values share a number and the numbers after the tie are skipped
dense_rank: numbers the rows in each partition; tied values share a number and no numbers are skipped
The example uses row_number, partitioning by department and sorting each department by salary in descending order.
--3. top-N per group
--create table and load data
create table tb_emp(
empno string,
ename string,
job string,
managerid string,
hiredate string,
salary double,
bonus double,
deptno string
)
row format delimited fields terminated by '\t';
load data local inpath '/usr/local/bigdata/emp.txt' into table tb_emp;
select * from tb_emp;
0: jdbc:hive2://server4:10000> select * from tb_emp;
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+
| tb_emp.empno | tb_emp.ename | tb_emp.job | tb_emp.managerid | tb_emp.hiredate | tb_emp.salary | tb_emp.bonus | tb_emp.deptno |
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7839 | KING | PRESIDENT | | 1981-11-17 | 5000.0 | NULL | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+
--row_number: partition by department, sort each department by salary in descending order
select
empno,
ename,
salary,
deptno,
row_number() over (partition by deptno order by salary desc) as rn
from tb_emp;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> empno,
. . . . . . . . . . . . . . .> ename,
. . . . . . . . . . . . . . .> salary,
. . . . . . . . . . . . . . .> deptno,
. . . . . . . . . . . . . . .> row_number() over (partition by deptno order by salary desc) as rn
. . . . . . . . . . . . . . .> from tb_emp;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+--------+---------+---------+---------+-----+
| empno | ename | salary | deptno | rn |
+--------+---------+---------+---------+-----+
| 7839 | KING | 5000.0 | 10 | 1 |
| 7782 | CLARK | 2450.0 | 10 | 2 |
| 7934 | MILLER | 1300.0 | 10 | 3 |
| 7788 | SCOTT | 3000.0 | 20 | 1 |
| 7902 | FORD | 3000.0 | 20 | 2 |
| 7566 | JONES | 2975.0 | 20 | 3 |
| 7876 | ADAMS | 1100.0 | 20 | 4 |
| 7369 | SMITH | 800.0 | 20 | 5 |
| 7698 | BLAKE | 2850.0 | 30 | 1 |
| 7499 | ALLEN | 1600.0 | 30 | 2 |
| 7844 | TURNER | 1500.0 | 30 | 3 |
| 7654 | MARTIN | 1250.0 | 30 | 4 |
| 7521 | WARD | 1250.0 | 30 | 5 |
| 7900 | JAMES | 950.0 | 30 | 6 |
+--------+---------+---------+---------+-----+
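Department 30 contains a salary tie (1250.0 twice), which is exactly where the three numbering functions differ. A sketch that puts them side by side on the same data: row_number keeps counting through the tie, rank gives both tied rows the same number and then skips one, dense_rank gives them the same number without skipping:
select
empno,
ename,
salary,
deptno,
row_number() over (partition by deptno order by salary desc) as rn,
rank() over (partition by deptno order by salary desc) as rk,
dense_rank() over (partition by deptno order by salary desc) as drk
from tb_emp;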
--keep the two highest-paid employees in each department
with t1 as (
select
empno,
ename,
salary,
deptno,
row_number() over (partition by deptno order by salary desc) as rn
from tb_emp )
select * from t1 where rn < 3;
In practice Hive is mainly used to build offline data warehouses: data is synchronized from various sources into Hive on a schedule and, after layered transformations, served to data applications.
For example, the latest order, user, and shop records are synchronized from an RDBMS into the warehouse every day for order and user analysis. When data that has already been synchronized later changes, the usual options are:
Update the data directly in Hive, i.e. overwrite the existing records.
Build a full snapshot table keyed by date whenever the data changes, one table per day.
Build a zipper table that uses timestamps to mark the validity period of each state of the changed data.
A zipper table records a new state only for rows that change; unchanged rows keep their existing state. It stores every state of every record over time and marks each state's life cycle with a start and end time, so a query can retrieve the data for any required time range. By convention the latest state is marked with a maximum end date such as 9999-12-31. An example follows:
To make the following steps easier to read, define:
dw_zipper: the zipper table, the one ultimately used for analysis
ods_zipper_update: the incremental table; a staging table that is usually emptied (overwritten) after use so it is ready for the next run
tmp_zipper: the temporary zipper table holding the merge of the historical zipper table and the incremental table; also a staging table, usually emptied (overwritten) after use
In general, after the system goes live, the first full load goes into the zipper table dw_zipper.
Changes are then captured at the collection frequency into the incremental table ods_zipper_update, which is emptied after each use.
dw_zipper and ods_zipper_update are merged into tmp_zipper; this step tends to be the time-consuming one.
Example:
--merge the zipper table with the incremental table
insert overwrite table tmp_zipper
select
userid,
phone,
nick,
gender,
addr,
starttime,
endtime
from ods_zipper_update
union all
--read all rows of the existing zipper table, changing the endTime of the rows being updated based on the update's startTime
select
a.userid,
a.phone,
a.nick,
a.gender,
a.addr,
a.starttime,
--if the row was not updated, or it is not one of the rows being changed, keep the original endtime; otherwise set it to the new record's start time minus 1 day
if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
from dw_zipper a
left join ods_zipper_update b on a.userid = b.userid ;
insert overwrite table dw_zipper
select * from tmp_zipper;
--1. create table and load data
--create the zipper table
create table dw_zipper(
userid string,
phone string,
nick string,
gender int,
addr string,
starttime string,
endtime string
)
row format delimited fields terminated by '\t';
--load sample data
load data local inpath '/root/hivedata/zipper.txt' into table dw_zipper;
--query
select userid,nick,addr,starttime,endtime from dw_zipper;
0: jdbc:hive2://server4:10000> select * from dw_zipper;
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| dw_zipper.userid | dw_zipper.phone | dw_zipper.nick | dw_zipper.gender | dw_zipper.addr | dw_zipper.starttime | dw_zipper.endtime |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| 001 | 186xxxx1234 | laoda | 0 | sh | 2021-01-01 | 9999-12-31 |
| 002 | 186xxxx1235 | laoer | 1 | bj | 2021-01-01 | 9999-12-31 |
| 003 | 186xxxx1236 | laosan | 0 | sz | 2021-01-01 | 9999-12-31 |
| 004 | 186xxxx1237 | laosi | 1 | gz | 2021-01-01 | 9999-12-31 |
| 005 | 186xxxx1238 | laowu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 006 | 186xxxx1239 | laoliu | 1 | bj | 2021-01-01 | 9999-12-31 |
| 007 | 186xxxx1240 | laoqi | 0 | sz | 2021-01-01 | 9999-12-31 |
| 008 | 186xxxx1241 | laoba | 1 | gz | 2021-01-01 | 9999-12-31 |
| 009 | 186xxxx1242 | laojiu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 010 | 186xxxx1243 | laoshi | 1 | bj | 2021-01-01 | 9999-12-31 |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
--create the ODS-layer incremental table and load data
create table ods_zipper_update(
userid string,
phone string,
nick string,
gender int,
addr string,
starttime string,
endtime string
)
row format delimited fields terminated by '\t';
load data local inpath '/usr/local/bigdata/update.txt' into table ods_zipper_update;
select * from ods_zipper_update;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> phone,
. . . . . . . . . . . . . . .> nick,
. . . . . . . . . . . . . . .> gender,
. . . . . . . . . . . . . . .> addr,
. . . . . . . . . . . . . . .> starttime,
. . . . . . . . . . . . . . .> endtime
. . . . . . . . . . . . . . .> from ods_zipper_update;
+---------+--------------+---------+---------+-------+-------------+-------------+
| userid | phone | nick | gender | addr | starttime | endtime |
+---------+--------------+---------+---------+-------+-------------+-------------+
| 008 | 186xxxx1241 | laoba | 1 | sh | 2021-01-02 | 9999-12-31 |
| 011 | 186xxxx1244 | laoshi | 1 | jx | 2021-01-02 | 9999-12-31 |
| 012 | 186xxxx1245 | laoshi | 0 | zj | 2021-01-02 | 9999-12-31 |
+---------+--------------+---------+---------+-------+-------------+-------------+
--create the temporary table
create table tmp_zipper(
userid string,
phone string,
nick string,
gender int,
addr string,
starttime string,
endtime string
)
row format delimited fields terminated by '\t';
--merge the zipper table with the incremental table
insert overwrite table tmp_zipper
select
userid,
phone,
nick,
gender,
addr,
starttime,
endtime
from ods_zipper_update
union all
--read all rows of the existing zipper table, changing the endTime of the rows being updated based on the update's startTime
select
a.userid,
a.phone,
a.nick,
a.gender,
a.addr,
a.starttime,
--if the row was not updated, or it is not one of the rows being changed, keep the original endtime; otherwise set it to the new record's start time minus 1 day
if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
from dw_zipper a
left join ods_zipper_update b on a.userid = b.userid ;
0: jdbc:hive2://server4:10000> insert overwrite table tmp_zipper
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .> userid,
. . . . . . . . . . . . . . .> phone,
. . . . . . . . . . . . . . .> nick,
. . . . . . . . . . . . . . .> gender,
. . . . . . . . . . . . . . .> addr,
. . . . . . . . . . . . . . .> starttime,
. . . . . . . . . . . . . . .> endtime
. . . . . . . . . . . . . . .> from ods_zipper_update
. . . . . . . . . . . . . . .> union all
. . . . . . . . . . . . . . .> --read all rows of the existing zipper table, changing the endTime of the rows being updated based on the update's startTime
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .> a.userid,
. . . . . . . . . . . . . . .> a.phone,
. . . . . . . . . . . . . . .> a.nick,
. . . . . . . . . . . . . . .> a.gender,
. . . . . . . . . . . . . . .> a.addr,
. . . . . . . . . . . . . . .> a.starttime,
. . . . . . . . . . . . . . .> --if the row was not updated, or it is not one of the rows being changed, keep the original endtime; otherwise set it to the new record's start time minus 1 day (the change is dated to the previous day because the sync runs daily)
. . . . . . . . . . . . . . .> if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
. . . . . . . . . . . . . . .> from dw_zipper a
. . . . . . . . . . . . . . .> left join ods_zipper_update b on a.userid = b.userid ;
WARN : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (121.538 seconds)
0: jdbc:hive2://server4:10000> select * from tmp_zipper;
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+
| tmp_zipper.userid | tmp_zipper.phone | tmp_zipper.nick | tmp_zipper.gender | tmp_zipper.addr | tmp_zipper.starttime | tmp_zipper.endtime |
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+
| 001 | 186xxxx1234 | laoda | 0 | sh | 2021-01-01 | 9999-12-31 |
| 002 | 186xxxx1235 | laoer | 1 | bj | 2021-01-01 | 9999-12-31 |
| 003 | 186xxxx1236 | laosan | 0 | sz | 2021-01-01 | 9999-12-31 |
| 004 | 186xxxx1237 | laosi | 1 | gz | 2021-01-01 | 9999-12-31 |
| 005 | 186xxxx1238 | laowu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 006 | 186xxxx1239 | laoliu | 1 | bj | 2021-01-01 | 9999-12-31 |
| 007 | 186xxxx1240 | laoqi | 0 | sz | 2021-01-01 | 9999-12-31 |
| 008 | 186xxxx1241 | laoba | 1 | gz | 2021-01-01 | 2021-01-01 |
| 009 | 186xxxx1242 | laojiu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 010 | 186xxxx1243 | laoshi | 1 | bj | 2021-01-01 | 9999-12-31 |
| 008 | 186xxxx1241 | laoba | 1 | sh | 2021-01-02 | 9999-12-31 |
| 011 | 186xxxx1244 | laoshi | 1 | jx | 2021-01-02 | 9999-12-31 |
| 012 | 186xxxx1245 | laoshi | 0 | zj | 2021-01-02 | 9999-12-31 |
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+
--overwrite the zipper table
insert overwrite table dw_zipper
select * from tmp_zipper;
0: jdbc:hive2://server4:10000> select * from dw_zipper;
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| dw_zipper.userid | dw_zipper.phone | dw_zipper.nick | dw_zipper.gender | dw_zipper.addr | dw_zipper.starttime | dw_zipper.endtime |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| 001 | 186xxxx1234 | laoda | 0 | sh | 2021-01-01 | 9999-12-31 |
| 002 | 186xxxx1235 | laoer | 1 | bj | 2021-01-01 | 9999-12-31 |
| 003 | 186xxxx1236 | laosan | 0 | sz | 2021-01-01 | 9999-12-31 |
| 004 | 186xxxx1237 | laosi | 1 | gz | 2021-01-01 | 9999-12-31 |
| 005 | 186xxxx1238 | laowu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 006 | 186xxxx1239 | laoliu | 1 | bj | 2021-01-01 | 9999-12-31 |
| 007 | 186xxxx1240 | laoqi | 0 | sz | 2021-01-01 | 9999-12-31 |
| 008 | 186xxxx1241 | laoba | 1 | gz | 2021-01-01 | 2021-01-01 |
| 009 | 186xxxx1242 | laojiu | 0 | sh | 2021-01-01 | 9999-12-31 |
| 010 | 186xxxx1243 | laoshi | 1 | bj | 2021-01-01 | 9999-12-31 |
| 008 | 186xxxx1241 | laoba | 1 | sh | 2021-01-02 | 9999-12-31 |
| 011 | 186xxxx1244 | laoshi | 1 | jx | 2021-01-02 | 9999-12-31 |
| 012 | 186xxxx1245 | laoshi | 0 | zj | 2021-01-02 | 9999-12-31 |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
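With the zipper table in this final state, a point-in-time snapshot is just a range filter on the validity columns. A sketch that reads the data as it looked on 2021-01-01 (user 008 still shows the old addr gz) and the data that is currently valid (endtime = '9999-12-31', where 008 shows sh):
--snapshot as of a given date
select * from dw_zipper
where starttime <= '2021-01-01' and endtime >= '2021-01-01';
--only the rows that are currently valid
select * from dw_zipper
where endtime = '9999-12-31';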
That wraps up the examples of JSON parsing in Hive, several window function applications (consecutive logins, cascading accumulation, topN), and the zipper table use case.