11. Hive applied examples: JSON parsing, window functions (consecutive logins, cascading cumulative sum, topN), and zipper tables

Apache Hive Series

1. Introduction to apache-hive-3.1.2 and its deployment (three modes: embedded, local, and remote) with verification
2. Hive concepts in detail: architecture, file read/write mechanism, data storage
3. Hive usage examples: table creation, data types, internal and external tables, partitioned tables, bucketed tables
4. Hive usage examples: transactional tables, views, materialized views, and DDL (database, table, and partition) management
5. Hive load, insert, and transactional tables, explained with examples
6. Hive select (GROUP BY, ORDER BY, CLUSTER BY, SORT BY, LIMIT, union, CTE) and join, explained with examples
7. Hive shell client and property configuration, built-in operators, functions (built-in operators and user-defined UDFs)
8. Hive relational, logical, mathematical, and numeric operations, plus date, conditional, and string functions, with syntax and examples
9. Hive explode, Lateral View, aggregate functions, window functions, and sampling functions explained
10. Hive applied examples: multi-delimiter data (RegexSerDe), URL parsing, and common row/column transposition functions (case when, union, concat, explode)
11. Hive applied examples: JSON parsing, window functions (consecutive logins, cascading cumulative sum, topN), zipper tables
12. Hive optimization: file storage and compression formats, and job execution optimization (execution plans, MR properties, join, optimizer, predicate pushdown, data skew)
13. Accessing Hive through the Java API, with examples


Contents

  • Apache Hive Series
  • I. Common JSON-parsing functions and JSONSerde
    • 1. Two processing approaches
    • 2. get_json_object
      • 1) Syntax
      • 2) Example
    • 3. json_tuple
      • 1) Syntax
      • 2) Example
    • 4. JSONSerde
  • II. Practical applications of window functions
    • 1. Example: consecutively logged-in users
      • 1) Approach 1: self-join the table into a Cartesian product
      • 2) Approach 2: window functions
        • 1. The lead window function
        • 2. Implementation
    • 2. Example: cascading cumulative sum
      • 1) Approach 1: without window functions
      • 2) Approach 2: window functions
        • 1. The sum window function
        • 2. Example
    • 3. Example: topN
  • III. Zipper tables: scenarios and usage
    • 1. Option 1: update the data in Hive directly
    • 2. Option 2: daily full snapshots
    • 3. Option 3: update through a zipper table
    • 4. Zipper-table design
    • 5. Implementation steps
      • 1) Initial full synchronization
      • 2) Incremental collection
      • 3) Merge the historical zipper table with the incremental table
      • 4) Overwrite the zipper table with the merged data
    • 6. Zipper-table walkthrough
      • 1) Create the zipper table
      • 2) Simulate incremental collection
      • 3) Create the temporary table
      • 4) Merge the historical zipper table with the incremental table
      • 5) Overwrite the zipper table


This article walks through JSON parsing in Hive, several practical applications of window functions, and a concrete zipper-table example.
It is organized into three parts: JSON parsing, common window-function scenarios, and a zipper-table walkthrough.
It assumes a working Hive environment.
Some of the sample data comes from the internet.

I. Common JSON-parsing functions and JSONSerde

1. Two processing approaches

To parse JSON-formatted data, Hive provides two approaches: built-in JSON functions (get_json_object and json_tuple) that parse JSON strings stored in a column, and a dedicated JSONSerde that parses whole JSON files at the table level. In practice, pick whichever suits the data and the requirement.

2. get_json_object

Parses a JSON string and returns the value of a single specified field.

1) Syntax

get_json_object(json_txt, path) - Extract a json object from path

#First argument: the JSON string to parse
#Second argument: the field to return, given as a path of the form $.columnName
#Each call returns the value of a single field of the JSON object
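The path syntax also reaches nested keys and array elements. A quick sketch with an inline JSON literal (not part of the sample data):

select get_json_object('{"a":{"b":[1,2,3]}}','$.a.b[1]');
--returns 2: $.a.b selects the array, [1] its second element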

2) Example

create table tb_json_test1 (
    json string
);
--load the data
load data local inpath '/usr/local/bigdata/device.json' into table tb_json_test1;

select * from tb_json_test1;
0: jdbc:hive2://server4:10000> select * from tb_json_test1;
+----------------------------------------------------+
|                 tb_json_test1.json                 |
+----------------------------------------------------+
| {"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201390} |
| {"device":"device_40","deviceType":"route","signal":99.0,"time":1616817201887} |
| {"device":"device_21","deviceType":"bigdata","signal":77.0,"time":1616817202142} |
| {"device":"device_31","deviceType":"kafka","signal":98.0,"time":1616817202405} |
| {"device":"device_20","deviceType":"bigdata","signal":12.0,"time":1616817202513} |
| {"device":"device_54","deviceType":"bigdata","signal":14.0,"time":1616817202913} |
...
+----------------------------------------------------+

select
    --device name
    get_json_object(json,"$.device") as device,
    --device type
    get_json_object(json,"$.deviceType") as deviceType,
    --signal strength
    get_json_object(json,"$.signal") as signal,
    --event time
    get_json_object(json,"$.time") as stime
from tb_json_test1;

0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     --device name
. . . . . . . . . . . . . . .>     get_json_object(json,"$.device") as device,
. . . . . . . . . . . . . . .>     --device type
. . . . . . . . . . . . . . .>     get_json_object(json,"$.deviceType") as deviceType,
. . . . . . . . . . . . . . .>     --signal strength
. . . . . . . . . . . . . . .>     get_json_object(json,"$.signal") as signal,
. . . . . . . . . . . . . . .>     --event time
. . . . . . . . . . . . . . .>     get_json_object(json,"$.time") as stime
. . . . . . . . . . . . . . .> from tb_json_test1;
+------------+-------------+---------+----------------+
|   device   | devicetype  | signal  |     stime      |
+------------+-------------+---------+----------------+
| device_30  | kafka       | 98.0    | 1616817201390  |
| device_40  | route       | 99.0    | 1616817201887  |
| device_21  | bigdata     | 77.0    | 1616817202142  |
| device_31  | kafka       | 98.0    | 1616817202405  |
...
+------------+-------------+---------+----------------+

3. json_tuple

Parses a JSON string; given several field names as arguments, it returns several columns in a single call.

1) Syntax

json_tuple(jsonStr, p1, p2, ..., pn)   like get_json_object, but it takes multiple names and returns a tuple
# First argument: the JSON string to parse
# Second argument: the 1st field to return
# ...
# The (N+1)th argument: the Nth field to return
# Works like get_json_object, but returns multiple columns in one call; it is a UDTF and is typically used with lateral view
# Every returned column is of type string

2) Example

--standalone usage
select
    --parse all fields
    json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime)
from tb_json_test1;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     --parse all fields
. . . . . . . . . . . . . . .>     json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime)
. . . . . . . . . . . . . . .> from tb_json_test1;
+------------+-------------+---------+----------------+
|   device   | devicetype  | signal  |     stime      |
+------------+-------------+---------+----------------+
| device_30  | kafka       | 98.0    | 1616817201390  |
| device_40  | route       | 99.0    | 1616817201887  |
| device_21  | bigdata     | 77.0    | 1616817202142  |
| device_31  | kafka       | 98.0    | 1616817202405  |
| device_20  | bigdata     | 12.0    | 1616817202513  |
...

--with a lateral view
select json,
  device,deviceType,signal,stime
from tb_json_test1
lateral view json_tuple(json,"device","deviceType","signal","time") b as device,deviceType,signal,stime;
0: jdbc:hive2://server4:10000> select json,
. . . . . . . . . . . . . . .>   device,deviceType,signal,stime
. . . . . . . . . . . . . . .> from tb_json_test1
. . . . . . . . . . . . . . .> lateral view json_tuple(json,"device","deviceType","signal","time") b as device,deviceType,signal,stime;
+----------------------------------------------------+------------+-------------+---------+----------------+
|                        json                        |   device   | devicetype  | signal  |     stime      |
+----------------------------------------------------+------------+-------------+---------+----------------+
| {"device":"device_30","deviceType":"kafka","signal":98.0,"time":1616817201390} | device_30  | kafka       | 98.0    | 1616817201390  |
| {"device":"device_40","deviceType":"route","signal":99.0,"time":1616817201887} | device_40  | route       | 99.0    | 1616817201887  |
| {"device":"device_21","deviceType":"bigdata","signal":77.0,"time":1616817202142} | device_21  | bigdata     | 77.0    | 1616817202142  |
...
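Note that json_tuple reads only top-level keys; it does not traverse into nested objects. For nested JSON, fall back to get_json_object with a multi-level path, as in this sketch with inline literals (not part of the sample data):

select json_tuple('{"a":1,"b":{"c":2}}','a','b');
--returns 1 and the raw string {"c":2}
select get_json_object('{"a":1,"b":{"c":2}}','$.b.c');
--returns 2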

4. JSONSerde

In the approaches above, each record is loaded into the table as a raw JSON string and then parsed with JSON functions. That is flexible, but when every line of a file is itself a JSON object it becomes cumbersome.
To simplify handling such files, Hive ships a SerDe dedicated to parsing JSON: create the table with this JSONSerde and every column is parsed from the JSON file automatically.
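One practical note: org.apache.hive.hcatalog.data.JsonSerDe ships in the hive-hcatalog-core jar, which is not always on HiveServer2's classpath. If the create-table statement below fails with a class-not-found error, add the jar first (the path below is an assumption; adjust it to your installation and version):

--assumption: the jar path depends on your Hive installation
add jar /usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-3.1.2.jar;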

--JsonSerDe
--create the table
create table tb_json_test2 (
    device string,
    deviceType string,
    signal double,
    `time` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

load data local inpath '/usr/local/bigdata/device.json' into table tb_json_test2;

select * from tb_json_test2;
0: jdbc:hive2://server4:10000> create table tb_json_test2 (
. . . . . . . . . . . . . . .>     device string,
. . . . . . . . . . . . . . .>     deviceType string,
. . . . . . . . . . . . . . .>     signal double,
. . . . . . . . . . . . . . .>     `time` string
. . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . .> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
. . . . . . . . . . . . . . .> STORED AS TEXTFILE;
No rows affected (0.063 seconds)
0: jdbc:hive2://server4:10000> select * from  tb_json_test2;
+-----------------------+---------------------------+-----------------------+---------------------+
| tb_json_test2.device  | tb_json_test2.devicetype  | tb_json_test2.signal  | tb_json_test2.time  |
+-----------------------+---------------------------+-----------------------+---------------------+
| device_30             | kafka                     | 98.0                  | 1616817201390       |
| device_40             | route                     | 99.0                  | 1616817201887       |
| device_21             | bigdata                   | 77.0                  | 1616817202142       |
| device_31             | kafka                     | 98.0                  | 1616817202405       |
| device_20             | bigdata                   | 12.0                  | 1616817202513       |

II. Practical applications of window functions

1. Example: consecutively logged-in users

The requirement: given login records, find users who logged in on consecutive days. The key observation is the relationship between the login dates of rows that share the same user ID.
For example, to count users who logged in on two consecutive days, find pairs of rows with equal user IDs whose login dates differ by exactly one day.

1) Approach 1: self-join the table into a Cartesian product

--1. consecutively logged-in users
--create the table
create table tb_login(
     userid string,
     logintime string
) 
row format delimited fields terminated by '\t';

load data local inpath '/usr/local/bigdata/login.log' into table tb_login;

select * from tb_login;
0: jdbc:hive2://server4:10000> select * from tb_login;
+------------------+---------------------+
| tb_login.userid  | tb_login.logintime  |
+------------------+---------------------+
| A                | 2021-03-22          |
| B                | 2021-03-22          |
| C                | 2021-03-22          |
| A                | 2021-03-23          |
| C                | 2021-03-23          |
| A                | 2021-03-24          |
| B                | 2021-03-24          |
+------------------+---------------------+

--self-join and filter implementation
--a. build the Cartesian product
select
    a.userid as a_userid,
    a.logintime as a_logintime,
    b.userid as b_userid,
    b.logintime as b_logintime
from tb_login a,tb_login b;

--save the result above as a temporary table
create table tb_login_tmp as
select
    a.userid as a_userid,
    a.logintime as a_logintime,
    b.userid as b_userid,
    b.logintime as b_logintime
from tb_login a,tb_login b;

--filter: same user id, login dates one day apart
select
    a_userid,a_logintime,b_userid,b_logintime
from tb_login_tmp
where a_userid = b_userid
  and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     a_userid,a_logintime,b_userid,b_logintime
. . . . . . . . . . . . . . .> from tb_login_tmp
. . . . . . . . . . . . . . .> where a_userid = b_userid
. . . . . . . . . . . . . . .>   and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
+-----------+--------------+-----------+--------------+
| a_userid  | a_logintime  | b_userid  | b_logintime  |
+-----------+--------------+-----------+--------------+
| A         | 2021-03-23   | A         | 2021-03-22   |
| C         | 2021-03-23   | C         | 2021-03-22   |
| A         | 2021-03-24   | A         | 2021-03-23   |
+-----------+--------------+-----------+--------------+
--users who logged in on two consecutive days
select
    distinct a_userid
from tb_login_tmp
where a_userid = b_userid
  and cast(substr(a_logintime,9,2) as int) - 1 = cast(substr(b_logintime,9,2) as int);
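Note that comparing only the day-of-month extracted with substr works for this sample but breaks across month boundaries (e.g. 2021-03-31 vs 2021-04-01). A more robust filter, sketched below, uses Hive's datediff, which counts the days between two dates:

--variant of the filter that works across month boundaries
select distinct a.userid
from tb_login a join tb_login b on a.userid = b.userid
where datediff(a.logintime, b.logintime) = 1;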

2) Approach 2: window functions

1. The lead window function

Reads a value from the row N rows ahead of the current row within the window.

--syntax
lead(colName,N,defaultValue)
--colName: the column to read the value from
--N: how many rows ahead to look
--defaultValue: the value returned when no row exists at that offset
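Its mirror is lag, which reads a value N rows back instead of ahead; a sketch on the same login table:

--previous login date per user (null on each user's first row)
select
    userid,
    logintime,
    lag(logintime,1) over (partition by userid order by logintime) as prevlogin
from tb_login;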

2. Implementation

Based on the user login records:
two consecutive days: the user's next login date = the day after the current login
three consecutive days: the user's login after next = two days after the current login
...
So partition by user ID, order by login date, and use lead to fetch the next login date;
then compute the day after the current login with a date function; if the two dates are equal, the user logged in on two consecutive days.

----window-function implementation

--two consecutive login days
with t1 as (
    select
        userid,
        logintime,
        --the day after this login
        date_add(logintime,1) as nextday,
        --partition by user id, order by login date, take the next login date, default 0 when absent
        lead(logintime,1,0) over (partition by userid order by logintime) as nextlogin
    from tb_login )
select distinct userid from t1 where nextday = nextlogin;
0: jdbc:hive2://server4:10000> with t1 as (
. . . . . . . . . . . . . . .>     select
. . . . . . . . . . . . . . .>         userid,
. . . . . . . . . . . . . . .>         logintime,
. . . . . . . . . . . . . . .>         --the day after this login
. . . . . . . . . . . . . . .>         date_add(logintime,1) as nextday,
. . . . . . . . . . . . . . .>         --partition by user id, order by login date, take the next login date, default 0 when absent
. . . . . . . . . . . . . . .>         lead(logintime,1,0) over (partition by userid order by logintime) as nextlogin
. . . . . . . . . . . . . . .>     from tb_login )
. . . . . . . . . . . . . . .> select distinct userid from t1 where nextday = nextlogin;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+
| userid  |
+---------+
| A       |
| C       |
+---------+

--three consecutive login days
with t1 as (
    select
        userid,
        logintime,
        --two days after this login (the 3rd day of the streak)
        date_add(logintime,2) as nextday,
        --partition by user id, order by login date, take the login after next, default 0 when absent
        lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
    from tb_login )
select distinct userid from t1 where nextday = nextlogin;
0: jdbc:hive2://server4:10000> with t1 as (
. . . . . . . . . . . . . . .>     select
. . . . . . . . . . . . . . .>         userid,
. . . . . . . . . . . . . . .>         logintime,
. . . . . . . . . . . . . . .>         --two days after this login (the 3rd day of the streak)
. . . . . . . . . . . . . . .>         date_add(logintime,2) as nextday,
. . . . . . . . . . . . . . .>         --partition by user id, order by login date, take the login after next, default 0 when absent
. . . . . . . . . . . . . . .>         lead(logintime,2,0) over (partition by userid order by logintime) as nextlogin
. . . . . . . . . . . . . . .>     from tb_login )
. . . . . . . . . . . . . . .> select distinct userid from t1 where nextday = nextlogin;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+
| userid  |
+---------+
| A       |
+---------+

--generalizing to N consecutive days
select
    userid,
    logintime,
    --N-1 days after this login (the Nth day of the streak)
    date_add(logintime,N-1) as nextday,
    --partition by user id, order by login date, take the login N-1 rows ahead, default 0 when absent
    lead(logintime,N-1,0) over (partition by userid order by logintime) as nextlogin
from tb_login;
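A related classic trick, sketched below under the assumption that each user has at most one login row per day: subtracting the row number from the login date collapses every unbroken run of dates to the same base date, so maximal streak lengths drop out of a group-by:

--length of every maximal consecutive-login streak per user
with t1 as (
    select
        userid,
        logintime,
        row_number() over (partition by userid order by logintime) as rn
    from tb_login )
select
    userid,
    min(logintime) as streak_start,
    count(*) as days
from (select userid, logintime, date_sub(logintime, rn) as base from t1) t2
group by userid, base
having count(*) >= 2;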

2. Example: cascading cumulative sum

Requirement: given per-user consumption records, compute each user's total spending per month and the running cumulative total across months.

1) Approach 1: without window functions

Aggregate each user's spending per month, then self-join the result and aggregate again under a grouping condition.

--1. create the table and load the data
create table tb_money(
     userid string,
     mth string,
     money int
) 
row format delimited fields terminated by '\t';

load data local inpath '/usr/local/bigdata/money.tsv' into table tb_money;

select * from tb_money;
0: jdbc:hive2://server4:10000> select * from tb_money;
+------------------+---------------+-----------------+
| tb_money.userid  | tb_money.mth  | tb_money.money  |
+------------------+---------------+-----------------+
| A                | 2021-01       | 5               |
| A                | 2021-01       | 15              |
| B                | 2021-01       | 5               |
| A                | 2021-01       | 8               |
| B                | 2021-01       | 25              |
| A                | 2021-01       | 5               |
| A                | 2021-02       | 4               |
| A                | 2021-02       | 6               |
| B                | 2021-02       | 10              |
| B                | 2021-02       | 5               |
| A                | 2021-03       | 7               |
| B                | 2021-03       | 9               |
| A                | 2021-03       | 11              |
| B                | 2021-03       | 6               |
+------------------+---------------+-----------------+

--2. total consumption per user per month
create table tb_money_mtn as
select
    userid,
    mth,
    sum(money) as m_money
from tb_money
group by userid,mth;

select * from tb_money_mtn;

--approach 1: self-join and grouped aggregation
--1. self-join the per-user monthly totals
select a.*,b.*
from tb_money_mtn a 
join tb_money_mtn b on a.userid = b.userid;

--2. for each month, keep that month and all earlier months
select a.*,b.*
from tb_money_mtn a 
join tb_money_mtn b on a.userid = b.userid
where  b.mth <= a.mth;

--3. group rows of the same user and month, then order by user and month
select
    a.userid,
    a.mth,
       max(a.m_money) as current_mth_money,  --current month's spending
       sum(b.m_money) as accumulate_money    --cumulative spending
from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid
where b.mth <= a.mth
group by a.userid,a.mth
order by a.userid,a.mth;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     a.userid,
. . . . . . . . . . . . . . .>     a.mth,
. . . . . . . . . . . . . . .>        max(a.m_money) as current_mth_money,  --current month's spending
. . . . . . . . . . . . . . .>        sum(b.m_money) as accumulate_money    --cumulative spending
. . . . . . . . . . . . . . .> from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid
. . . . . . . . . . . . . . .> where b.mth <= a.mth
. . . . . . . . . . . . . . .> group by a.userid,a.mth
. . . . . . . . . . . . . . .> order by a.userid,a.mth;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-----------+----------+--------------------+-------------------+
| a.userid  |  a.mth   | current_mth_money  | accumulate_money  |
+-----------+----------+--------------------+-------------------+
| A         | 2021-01  | 33                 | 33                |
| A         | 2021-02  | 10                 | 43                |
| A         | 2021-03  | 18                 | 61                |
| B         | 2021-01  | 30                 | 30                |
| B         | 2021-02  | 15                 | 45                |
| B         | 2021-03  | 15                 | 60                |
+-----------+----------+--------------------+-------------------+

2) Approach 2: window functions

Aggregate each user's spending per month, then apply a windowed aggregate function.

1. The sum window function

Sums values over a window.

--syntax
sum(colName) over (partition by col order by col)
--colName: the column whose values are summed

2. Example

Partition the per-user monthly totals by user and order them by month;
with an aggregate window running from the first row of each partition to the current row, sum yields the cumulative consumption.

--approach 2: window functions
--monthly spending and running total per user
select
    userid,
    mth,
    m_money,
    sum(m_money) over (partition by userid order by mth) as t_money
from tb_money_mtn;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     userid,
. . . . . . . . . . . . . . .>     mth,
. . . . . . . . . . . . . . .>     m_money,
. . . . . . . . . . . . . . .>     sum(m_money) over (partition by userid order by mth) as t_money
. . . . . . . . . . . . . . .> from tb_money_mtn;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------+----------+----------+
| userid  |   mth    | m_money  | t_money  |
+---------+----------+----------+----------+
| A       | 2021-01  | 33       | 33       |
| A       | 2021-02  | 10       | 43       |
| A       | 2021-03  | 18       | 61       |
| B       | 2021-01  | 30       | 30       |
| B       | 2021-02  | 15       | 45       |
| B       | 2021-03  | 15       | 60       |
+---------+----------+----------+----------+
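For the record, when an over clause has an order by but no explicit frame, Hive defaults the frame to range between unbounded preceding and current row, so rows tied on mth would share the same running total. Spelling the frame out makes the intent explicit:

select
    userid,
    mth,
    m_money,
    sum(m_money) over (partition by userid order by mth
                       rows between unbounded preceding and current row) as t_money
from tb_money_mtn;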

--spending over a sliding window (1 row preceding through 2 rows following)
select
    userid,
    mth,
    m_money,
    sum(m_money) over (partition by userid order by mth rows between 1 preceding and 2 following) as t_money
from tb_money_mtn;

0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     userid,
. . . . . . . . . . . . . . .>     mth,
. . . . . . . . . . . . . . .>     m_money,
. . . . . . . . . . . . . . .>     sum(m_money) over (partition by userid order by mth rows between 1 preceding and 2 following) as t_money
. . . . . . . . . . . . . . .> from tb_money_mtn;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+----------+----------+----------+
| userid  |   mth    | m_money  | t_money  |
+---------+----------+----------+----------+
| A       | 2021-01  | 33       | 61       |
| A       | 2021-02  | 10       | 61       |
| A       | 2021-03  | 18       | 28       |
| B       | 2021-01  | 30       | 60       |
| B       | 2021-02  | 15       | 60       |
| B       | 2021-03  | 15       | 30       |
+---------+----------+----------+----------+

3. Example: topN

The topN ranking functions are row_number, rank, and dense_rank:
row_number: numbers the rows within each partition; tied values still receive distinct, increasing numbers
rank: numbers the rows within each partition; tied values share a number, and the following numbers are skipped (gaps)
dense_rank: numbers the rows within each partition; tied values share a number, with no gaps
The example below uses row_number, partitioning by department and ordering each department by salary descending.

--3. per-group topN
--create the table and load the data

create table tb_emp(
    empno string,
    ename string,
    job string,
    managerid string,
    hiredate string,
    salary double,
    bonus double,
    deptno string
) 
row format delimited fields terminated by '\t';

load data local inpath '/usr/local/bigdata/emp.txt' into table tb_emp;

select * from tb_emp;
0: jdbc:hive2://server4:10000> select * from tb_emp;
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+
| tb_emp.empno  | tb_emp.ename  | tb_emp.job  | tb_emp.managerid  | tb_emp.hiredate  | tb_emp.salary  | tb_emp.bonus  | tb_emp.deptno  |
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+
| 7369          | SMITH         | CLERK       | 7902              | 1980-12-17       | 800.0          | NULL          | 20             |
| 7499          | ALLEN         | SALESMAN    | 7698              | 1981-2-20        | 1600.0         | 300.0         | 30             |
| 7521          | WARD          | SALESMAN    | 7698              | 1981-2-22        | 1250.0         | 500.0         | 30             |
| 7566          | JONES         | MANAGER     | 7839              | 1981-4-2         | 2975.0         | NULL          | 20             |
| 7654          | MARTIN        | SALESMAN    | 7698              | 1981-9-28        | 1250.0         | 1400.0        | 30             |
| 7698          | BLAKE         | MANAGER     | 7839              | 1981-5-1         | 2850.0         | NULL          | 30             |
| 7782          | CLARK         | MANAGER     | 7839              | 1981-6-9         | 2450.0         | NULL          | 10             |
| 7788          | SCOTT         | ANALYST     | 7566              | 1987-4-19        | 3000.0         | NULL          | 20             |
| 7839          | KING          | PRESIDENT   |                   | 1981-11-17       | 5000.0         | NULL          | 10             |
| 7844          | TURNER        | SALESMAN    | 7698              | 1981-9-8         | 1500.0         | 0.0           | 30             |
| 7876          | ADAMS         | CLERK       | 7788              | 1987-5-23        | 1100.0         | NULL          | 20             |
| 7900          | JAMES         | CLERK       | 7698              | 1981-12-3        | 950.0          | NULL          | 30             |
| 7902          | FORD          | ANALYST     | 7566              | 1981-12-3        | 3000.0         | NULL          | 20             |
| 7934          | MILLER        | CLERK       | 7782              | 1982-1-23        | 1300.0         | NULL          | 10             |
+---------------+---------------+-------------+-------------------+------------------+----------------+---------------+----------------+

--row_number: partition by department, salary descending within each department
select
    empno,
    ename,
    salary,
    deptno,
    row_number() over (partition by deptno order by salary desc) as rn
from tb_emp;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     empno,
. . . . . . . . . . . . . . .>     ename,
. . . . . . . . . . . . . . .>     salary,
. . . . . . . . . . . . . . .>     deptno,
. . . . . . . . . . . . . . .>     row_number() over (partition by deptno order by salary desc) as rn
. . . . . . . . . . . . . . .> from tb_emp;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+--------+---------+---------+---------+-----+
| empno  |  ename  | salary  | deptno  | rn  |
+--------+---------+---------+---------+-----+
| 7839   | KING    | 5000.0  | 10      | 1   |
| 7782   | CLARK   | 2450.0  | 10      | 2   |
| 7934   | MILLER  | 1300.0  | 10      | 3   |
| 7788   | SCOTT   | 3000.0  | 20      | 1   |
| 7902   | FORD    | 3000.0  | 20      | 2   |
| 7566   | JONES   | 2975.0  | 20      | 3   |
| 7876   | ADAMS   | 1100.0  | 20      | 4   |
| 7369   | SMITH   | 800.0   | 20      | 5   |
| 7698   | BLAKE   | 2850.0  | 30      | 1   |
| 7499   | ALLEN   | 1600.0  | 30      | 2   |
| 7844   | TURNER  | 1500.0  | 30      | 3   |
| 7654   | MARTIN  | 1250.0  | 30      | 4   |
| 7521   | WARD    | 1250.0  | 30      | 5   |
| 7900   | JAMES   | 950.0   | 30      | 6   |
+--------+---------+---------+---------+-----+
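To see how the three ranking functions differ, a sketch running them side by side; in deptno 30, MARTIN and WARD are tied at 1250.0, so rank assigns 4, 4 and skips to 6 for JAMES, while dense_rank assigns 4, 4, 5:

select
    empno,
    ename,
    salary,
    deptno,
    row_number() over (partition by deptno order by salary desc) as rn,
    rank()       over (partition by deptno order by salary desc) as rk,
    dense_rank() over (partition by deptno order by salary desc) as dr
from tb_emp;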

--keep the two highest-paid employees in each department
with t1 as (
    select
        empno,
        ename,
        salary,
        deptno,
        row_number() over (partition by deptno order by salary desc) as rn
    from tb_emp )
select * from t1 where rn < 3;
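Note that row_number breaks ties arbitrarily, so one of two equally paid employees could be dropped from the top two. If ties should be kept, swap rank() in for row_number():

--variant that keeps ties: equally paid employees share a place
with t1 as (
    select
        empno,
        ename,
        salary,
        deptno,
        rank() over (partition by deptno order by salary desc) as rn
    from tb_emp )
select * from t1 where rn < 3;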

III. Zipper tables: scenarios and usage

In practice, Hive is mainly used to build offline data warehouses: data is periodically synchronized from various sources into Hive and transformed layer by layer before being served to applications.
For example, the latest order, user, and shop records must be synchronized from an RDBMS into the warehouse every day for order and user analysis. When already-synchronized data later changes at the source, the usual options are the following.

1. Option 1: update the data in Hive directly

Update the records directly in Hive, overwriting the data that is already there.

2. Option 2: daily full snapshots

Every time the data changes, build a dated full snapshot, one table per day.

3. Option 3: update through a zipper table

Build a zipper table that uses timestamps to mark the validity period of each state of every changed record.

4. Zipper-table design

A zipper table records state changes: rows that do not change get no extra state rows, while every state of every changed record is stored, with timestamps marking the lifetime of each state. A query can therefore retrieve the data as it stood in any time range. By convention, a sentinel maximum date such as 9999-12-31 marks the currently valid state.
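For example, in the walkthrough below, user 008 changes address from gz to sh on 2021-01-02; the zipper table then carries two rows for 008: the old state closed out with endtime 2021-01-01, and the new state open-ended with endtime 9999-12-31.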

5. Implementation steps

To keep the steps below readable, define the following tables:
dw_zipper: the zipper table, the final table used for analysis
ods_zipper_update: the incremental table; a transient table, typically emptied (overwritten) after each use so it is ready for the next cycle
tmp_zipper: the temporary zipper table holding the merge of the historical zipper table and the incremental table; also transient and typically overwritten after each use

1) Initial full synchronization

Typically, when the system first goes live, a one-time full synchronization loads all existing data into the zipper table dw_zipper.

2) Incremental collection

At the chosen collection frequency, changed records are landed in the incremental table ods_zipper_update, which is emptied after each use.

3) Merge the historical zipper table with the incremental table

Merge dw_zipper and ods_zipper_update into tmp_zipper; this is the comparatively expensive step.
Example:

--merge the zipper table with the incremental table
insert overwrite table tmp_zipper
select
    userid,
    phone,
    nick,
    gender,
    addr,
    starttime,
    endtime
from ods_zipper_update
union all
--read all rows of the existing zipper table; for rows superseded in this cycle, close their endTime against the new rows' startTime
select
    a.userid,
    a.phone,
    a.nick,
    a.gender,
    a.addr,
    a.starttime,
    --if this row was not updated, or is already closed, keep its endtime; otherwise set it to the new row's start time minus one day
    if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
from dw_zipper a  
left join ods_zipper_update b on a.userid = b.userid ;

4) Overwrite the zipper table with the merged data

insert overwrite table dw_zipper
select * from tmp_zipper;

6. Zipper-table walkthrough

1) Create the zipper table

--1. create the table and load the data
--create the zipper table
create table dw_zipper(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) 
row format delimited fields terminated by '\t';

--load the sample data
load data local inpath '/root/hivedata/zipper.txt' into table dw_zipper;
--query
select userid,nick,addr,starttime,endtime from dw_zipper;
0: jdbc:hive2://server4:10000> select * from dw_zipper;
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| dw_zipper.userid  | dw_zipper.phone  | dw_zipper.nick  | dw_zipper.gender  | dw_zipper.addr  | dw_zipper.starttime  | dw_zipper.endtime  |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| 001               | 186xxxx1234      | laoda           | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 002               | 186xxxx1235      | laoer           | 1                 | bj              | 2021-01-01           | 9999-12-31         |
| 003               | 186xxxx1236      | laosan          | 0                 | sz              | 2021-01-01           | 9999-12-31         |
| 004               | 186xxxx1237      | laosi           | 1                 | gz              | 2021-01-01           | 9999-12-31         |
| 005               | 186xxxx1238      | laowu           | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 006               | 186xxxx1239      | laoliu          | 1                 | bj              | 2021-01-01           | 9999-12-31         |
| 007               | 186xxxx1240      | laoqi           | 0                 | sz              | 2021-01-01           | 9999-12-31         |
| 008               | 186xxxx1241      | laoba           | 1                 | gz              | 2021-01-01           | 9999-12-31         |
| 009               | 186xxxx1242      | laojiu          | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 010               | 186xxxx1243      | laoshi          | 1                 | bj              | 2021-01-01           | 9999-12-31         |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+

2) Simulate incremental collection

--create the ODS-layer incremental table and load the data
create table ods_zipper_update(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) 
row format delimited fields terminated by '\t';

load data local inpath '/usr/local/bigdata/update.txt' into table ods_zipper_update;

select * from ods_zipper_update;
0: jdbc:hive2://server4:10000> select
. . . . . . . . . . . . . . .>     userid,
. . . . . . . . . . . . . . .>     phone,
. . . . . . . . . . . . . . .>     nick,
. . . . . . . . . . . . . . .>     gender,
. . . . . . . . . . . . . . .>     addr,
. . . . . . . . . . . . . . .>     starttime,
. . . . . . . . . . . . . . .>     endtime
. . . . . . . . . . . . . . .> from ods_zipper_update;
+---------+--------------+---------+---------+-------+-------------+-------------+
| userid  |    phone     |  nick   | gender  | addr  |  starttime  |   endtime   |
+---------+--------------+---------+---------+-------+-------------+-------------+
| 008     | 186xxxx1241  | laoba   | 1       | sh    | 2021-01-02  | 9999-12-31  |
| 011     | 186xxxx1244  | laoshi  | 1       | jx    | 2021-01-02  | 9999-12-31  |
| 012     | 186xxxx1245  | laoshi  | 0       | zj    | 2021-01-02  | 9999-12-31  |
+---------+--------------+---------+---------+-------+-------------+-------------+

3) Create the temporary table

--create the temporary table
create table tmp_zipper(
    userid string,
    phone string,
    nick string,
    gender int,
    addr string,
    starttime string,
    endtime string
) 
row format delimited fields terminated by '\t';

4) Merge the historical zipper table with the incremental table

--merge the zipper table with the incremental table
insert overwrite table tmp_zipper
select
    userid,
    phone,
    nick,
    gender,
    addr,
    starttime,
    endtime
from ods_zipper_update
union all
--read all rows of the existing zipper table; for rows superseded in this cycle, close their endTime against the new rows' startTime
select
    a.userid,
    a.phone,
    a.nick,
    a.gender,
    a.addr,
    a.starttime,
    --if this row was not updated, or is already closed, keep its endtime; otherwise set it to the new row's start time minus one day
    if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
from dw_zipper a  
left join ods_zipper_update b on a.userid = b.userid ;

0: jdbc:hive2://server4:10000> insert overwrite table tmp_zipper
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .>     userid,
. . . . . . . . . . . . . . .>     phone,
. . . . . . . . . . . . . . .>     nick,
. . . . . . . . . . . . . . .>     gender,
. . . . . . . . . . . . . . .>     addr,
. . . . . . . . . . . . . . .>     starttime,
. . . . . . . . . . . . . . .>     endtime
. . . . . . . . . . . . . . .> from ods_zipper_update
. . . . . . . . . . . . . . .> union all
. . . . . . . . . . . . . . .> --read all rows of the existing zipper table; for rows superseded in this cycle, close their endTime against the new rows' startTime
. . . . . . . . . . . . . . .> select
. . . . . . . . . . . . . . .>     a.userid,
. . . . . . . . . . . . . . .>     a.phone,
. . . . . . . . . . . . . . .>     a.nick,
. . . . . . . . . . . . . . .>     a.gender,
. . . . . . . . . . . . . . .>     a.addr,
. . . . . . . . . . . . . . .>     a.starttime,
. . . . . . . . . . . . . . .>     --if this row was not updated, or is already closed, keep its endtime; otherwise set it to the new row's start time minus one day (sync runs daily, so the old state ends the previous day)
. . . . . . . . . . . . . . .>     if(b.userid is null or a.endtime < '9999-12-31', a.endtime , date_sub(b.starttime,1)) as endtime
. . . . . . . . . . . . . . .> from dw_zipper a  
. . . . . . . . . . . . . . .> left join ods_zipper_update b on a.userid = b.userid ;
WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (121.538 seconds)
0: jdbc:hive2://server4:10000> select * from tmp_zipper;
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+
| tmp_zipper.userid  | tmp_zipper.phone  | tmp_zipper.nick  | tmp_zipper.gender  | tmp_zipper.addr  | tmp_zipper.starttime  | tmp_zipper.endtime  |
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+
| 001                | 186xxxx1234       | laoda            | 0                  | sh               | 2021-01-01            | 9999-12-31          |
| 002                | 186xxxx1235       | laoer            | 1                  | bj               | 2021-01-01            | 9999-12-31          |
| 003                | 186xxxx1236       | laosan           | 0                  | sz               | 2021-01-01            | 9999-12-31          |
| 004                | 186xxxx1237       | laosi            | 1                  | gz               | 2021-01-01            | 9999-12-31          |
| 005                | 186xxxx1238       | laowu            | 0                  | sh               | 2021-01-01            | 9999-12-31          |
| 006                | 186xxxx1239       | laoliu           | 1                  | bj               | 2021-01-01            | 9999-12-31          |
| 007                | 186xxxx1240       | laoqi            | 0                  | sz               | 2021-01-01            | 9999-12-31          |
| 008                | 186xxxx1241       | laoba            | 1                  | gz               | 2021-01-01            | 2021-01-01          |
| 009                | 186xxxx1242       | laojiu           | 0                  | sh               | 2021-01-01            | 9999-12-31          |
| 010                | 186xxxx1243       | laoshi           | 1                  | bj               | 2021-01-01            | 9999-12-31          |
| 008                | 186xxxx1241       | laoba            | 1                  | sh               | 2021-01-02            | 9999-12-31          |
| 011                | 186xxxx1244       | laoshi           | 1                  | jx               | 2021-01-02            | 9999-12-31          |
| 012                | 186xxxx1245       | laoshi           | 0                  | zj               | 2021-01-02            | 9999-12-31          |
+--------------------+-------------------+------------------+--------------------+------------------+-----------------------+---------------------+

5) Overwrite the zipper table

--overwrite the zipper table
insert overwrite table dw_zipper
select * from tmp_zipper;

0: jdbc:hive2://server4:10000> select * from dw_zipper;
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| dw_zipper.userid  | dw_zipper.phone  | dw_zipper.nick  | dw_zipper.gender  | dw_zipper.addr  | dw_zipper.starttime  | dw_zipper.endtime  |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
| 001               | 186xxxx1234      | laoda           | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 002               | 186xxxx1235      | laoer           | 1                 | bj              | 2021-01-01           | 9999-12-31         |
| 003               | 186xxxx1236      | laosan          | 0                 | sz              | 2021-01-01           | 9999-12-31         |
| 004               | 186xxxx1237      | laosi           | 1                 | gz              | 2021-01-01           | 9999-12-31         |
| 005               | 186xxxx1238      | laowu           | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 006               | 186xxxx1239      | laoliu          | 1                 | bj              | 2021-01-01           | 9999-12-31         |
| 007               | 186xxxx1240      | laoqi           | 0                 | sz              | 2021-01-01           | 9999-12-31         |
| 008               | 186xxxx1241      | laoba           | 1                 | gz              | 2021-01-01           | 2021-01-01         |
| 009               | 186xxxx1242      | laojiu          | 0                 | sh              | 2021-01-01           | 9999-12-31         |
| 010               | 186xxxx1243      | laoshi          | 1                 | bj              | 2021-01-01           | 9999-12-31         |
| 008               | 186xxxx1241      | laoba           | 1                 | sh              | 2021-01-02           | 9999-12-31         |
| 011               | 186xxxx1244      | laoshi          | 1                 | jx              | 2021-01-02           | 9999-12-31         |
| 012               | 186xxxx1245      | laoshi          | 0                 | zj              | 2021-01-02           | 9999-12-31         |
+-------------------+------------------+-----------------+-------------------+-----------------+----------------------+--------------------+
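With the zipper table maintained this way, a point-in-time snapshot is just a range predicate on the validity period; a sketch (the date is an arbitrary example):

--state of every user as of 2021-01-01: the row whose validity period covers that date
select * from dw_zipper
where starttime <= '2021-01-01' and endtime >= '2021-01-01';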

This concludes the walkthrough of JSON parsing in Hive, several window-function applications, and a complete zipper-table example.
