Apache Hive Advanced Queries

Hive advanced queries

    • Hive queries
    • Join queries
      • Mapjoin
      • Hive set operations
    • Loading data
      • LOAD: moving data
      • INSERT: inserting data into tables
    • Data exchange: IMPORT and EXPORT
    • Hive data sorting
    • Hive aggregation
      • group by
      • having
      • Basic aggregation
      • Advanced aggregation
    • Window functions
    • Installing Zeppelin

Hive queries

  • SELECT basics
select 1;  -- returns 1
select explode(split('aa,bb',','));  -- one row per element: aa, bb
  • CTEs and nested queries
    The with…as… (CTE) syntax can simplify queries and effectively speed them up
  • Advanced statements
  • Matching columns with regular expressions
select `^o.*` from offers;    -- lists every column whose name starts with o (with hive.support.quoted.identifiers set to none)
  • Virtual columns (see the sketch below)
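Hive exposes two commonly used virtual columns, INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. A minimal sketch, reusing the offers table from above (offerid is assumed to be one of its columns):

-- Show which HDFS file and byte offset each row came from
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, offerid
from offers;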

Join queries

  • A join query queries multiple tables together
  • join combines rows from two or more tables in a single query
  • Similar to SQL joins, but Hive supports only equi-joins
    Inner join: inner join
    Outer join: outer join
    right join, left join, full outer join
    Cross join: cross join
    A cross join is also known as a Cartesian product
    Implicit join: implicit join
  • Joins are evaluated before the WHERE clause (the examples below use inner joins; a sketch of the other join types follows first)
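A minimal sketch of the outer and cross join variants, assuming the same customers and orders tables used below (and an order_id column in orders):

-- left join: keep every customer, even those without orders
select c.customer_id, o.order_id
from customers c left join orders o on c.customer_id = o.order_customer_id;
-- full outer join: keep unmatched rows from both sides
select c.customer_id, o.order_id
from customers c full outer join orders o on c.customer_id = o.order_customer_id;
-- cross join: Cartesian product of the two tables
select c.customer_id, o.order_id
from customers c cross join orders o limit 10;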

-- Which distinct customers have placed orders? (the following queries are equivalent)
select customer_id,concat(customer_fname,' ',customer_lname) as customer_name
from (select order_customer_id from orders group by order_customer_id) o 
inner join 
customers c 
on o.order_customer_id=c.customer_id
limit 10;

with
a1 as (select order_customer_id from orders group by order_customer_id)
select customer_id,concat(customer_fname,' ',customer_lname) as customer_name 
from customers a2 inner join a1 on a2.customer_id=a1.order_customer_id
limit 10;

select customer_id,concat(customer_fname,' ',customer_lname) as customer_name
from customers c where c.customer_id in
(select order_customer_id from orders group by order_customer_id)
limit 10;

select customer_id,concat(customer_fname,' ',customer_lname) as customer_name
from customers c where exists
(select order_customer_id from orders s where s.order_customer_id=c.customer_id)
limit 10;

with
a1 as (select distinct order_customer_id from orders)
select customer_id,concat(customer_fname,' ',customer_lname) as customer_name 
from customers a2 inner join a1 on a2.customer_id=a1.order_customer_id
limit 10;

-- Query the first five products in the products table
select * from products limit 5;

-- Query every customer's full name from the customers table (can also be written with a CTE or a subquery)
select concat(customer_fname,' ',customer_lname) as customer_name
from customers;

-- Use a correlated query to get all customers who have no orders
select customer_id,concat(customer_fname,' ',customer_lname) as customer_name
from customers c where not exists
(select order_customer_id from orders s where s.order_customer_id=c.customer_id);

50 MySQL built-in functions: https://blog.csdn.net/qq646040754/article/details/82721801

Mapjoin

  • A map join is performed entirely on the map side
    • Suited to joining a small table with a large table
    • Can perform non-equi joins
  • Enabling map joins (see the sketch below)
    • set hive.auto.convert.join=true (the default)
    • Joins are then converted to map joins automatically at runtime
  • Map join is not supported:
    • after union all, lateral view, group by/join/sort by/cluster by/distribute by
    • before union, join, or another map join
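A minimal sketch of letting Hive convert a common join into a map join automatically, reusing the customers and orders tables; hive.mapjoin.smalltable.filesize is the size threshold under which a table counts as "small":

-- With auto conversion on (the default), tables below the threshold
-- are loaded into memory and joined on the map side
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes
select c.customer_id, o.order_customer_id
from customers c join orders o on c.customer_id = o.order_customer_id;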

Hive set operations

  • All subqueries being combined must have the same column names and types
    union all: keeps duplicates after merging
    union: removes duplicates after merging
  • Can be used in top-level queries
  • order by, sort by, cluster by, distribute by and limit apply to the entire merged result
  • The remaining set operations, intersection and difference, can be implemented with join / outer join (see the sketch below)
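A minimal sketch, assuming two hypothetical tables t_a and t_b that each have a single id column of the same type:

-- union removes duplicates; union all keeps them
select id from t_a union select id from t_b;
select id from t_a union all select id from t_b;
-- Intersection: rows present in both tables
select a.id from t_a a inner join t_b b on a.id = b.id;
-- Difference: rows in t_a that are not in t_b
select a.id from t_a a left join t_b b on a.id = b.id where b.id is null;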

Loading data

LOAD: moving data

LOAD is used to move data into Hive.
With an HDFS path the source data is moved into the target table/partition and no longer exists at the original location; with a LOCAL path it is copied.

LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt' 
OVERWRITE INTO TABLE employee;
-- LOCAL means the file is on the local filesystem; OVERWRITE replaces any existing data
LOAD DATA LOCAL INPATH '/home/dayongd/Downloads/employee.txt' 
OVERWRITE INTO TABLE employee_partitioned  PARTITION (year=2014, month=12);
-- Without LOCAL, the file must already be in HDFS
LOAD DATA INPATH '/tmp/employee.txt'  
OVERWRITE INTO TABLE employee_partitioned PARTITION (year=2017, month=12);

INSERT: inserting data into tables

Use INSERT statements to insert data into a table/partition.

-- INSERT supports OVERWRITE (replace) and INTO (append)
INSERT OVERWRITE/INTO TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...)] 
select fields,... from tb_other;
  • Hive supports multiple inserts from the same source table in one statement
  • The table keyword is optional in insert into
  • insert into can specify the columns to insert into
    e.g. insert into t(x,y,z)
  • insert into table_name values supports inserting a list of values
  • The inserted data must match the number of specified columns

INSERT: inserting data into tables (examples)

INSERT OVERWRITE TABLE test select 'hello'; -- INSERT OVERWRITE does not accept a VALUES list; supply literals via SELECT instead
insert into employee select * from ctas_employee; -- insert via a query
-- Multi-insert: scan the source once, write to several targets
from ctas_employee
insert overwrite table employee select *
insert overwrite table employee_internal select *;
-- Insert into a partition
from ctas_partitioned 
insert overwrite table employee PARTITION (year, month)
select *,'2018','09';
-- Insert into specified columns (insert into may omit the table keyword)
insert into employee(name) select 'John' from test limit 1;
-- Insert specified values
insert into employee(name) values('Judy'),('John');

Using INSERT to write data to files

-- From a single source, write to a local file, an HDFS file, and a table
from ctas_employee
insert overwrite local directory '/tmp/out1'  select *
insert overwrite directory '/tmp/out1' select *
insert overwrite table employee_internal select *;
-- Write the data in a specified format
insert overwrite directory '/tmp/out3'
row format delimited fields terminated by ','
select * from ctas_employee;
-- Another way to pull a table's files out of HDFS
hdfs dfs -getmerge <table_file_path> <local_file>

Data exchange: IMPORT and EXPORT

  • IMPORT and EXPORT are used for importing and exporting data
    Commonly used in data migration scenarios
    All data and metadata can be exported/imported, except whole databases
  • Exporting data with EXPORT
EXPORT TABLE employee TO '/tmp/output3';
EXPORT TABLE employee_partitioned partition (year=2014, month=11) TO '/tmp/output5';
  • Importing data with IMPORT
IMPORT TABLE employee FROM '/tmp/output3';
IMPORT TABLE employee_partitioned partition (year=2014, month=11) FROM '/tmp/output5';
mysqldump -uroot -p1234 mydemo custom > abc      # MySQL analogue: export
mysql -uroot -p1234 mydemo1 < abc      # import

Hive data sorting

  • order by (ASC | DESC) works like standard SQL
    Uses a single reducer to perform a global sort
    Slow, so filter the data beforehand
    Supports case when expressions
    Supports ordering by column position
set hive.groupby.orderby.position.alias=true;
select * from offers order by case when offerid = 1 then 1 else 0 end;
select * from offers order by 1;
  • sort by sorts the data within each reducer
    With the number of reducers set to 1, it is equivalent to order by
    The sort columns must appear in the select column list
set mapred.reduce.tasks=2;
select * from mynitltab sort by mntname;

hive> set mapred.reduce.tasks=2;   -- split across two reducers
hive> select * from mynitltab distribute by mntid sort by mntname;   
select * from mynitltab distribute by mntname sort by mntname;
  • distribute by is similar to group by in standard SQL
    Ensures that rows with matching column values go to the same reducer
    Does not sort the reducer output
    Usually placed before a sort by clause
  • cluster by = distribute by + sort by
    Does not support ASC|DESC
    The sort columns must appear in the SELECT column list
    To use all reducers and still get a global order, CLUSTER BY first and then ORDER BY

sort by: partial sort within each reducer; first set the number of reduce tasks
cluster by = distribute by + sort by; ascending order only (see the sketch below)
order by: one reducer, one globally sorted result
distribute by: picks which column rows are partitioned by across reducers, i.e. the partitioning key; usually combined with sort by (comparable to a partition key)
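A minimal cluster by sketch on the mynitltab table from above; the two statements are equivalent:

set mapred.reduce.tasks=2;
-- cluster by is shorthand for distribute by + sort by on the same column (ascending only)
select * from mynitltab cluster by mntname;
select * from mynitltab distribute by mntname sort by mntname;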

-- MySQL: top three scores in each class (fewer than three classmates score higher)
select a.class,a.score from student a
where (select count(*) from student where a.class=class and a.score<score)<3
order by a.class,a.score desc;

Hive aggregation

group by

Hive's built-in aggregate functions are used together with group by.
If no group by clause is specified, the whole table is aggregated by default.
Apart from aggregate functions, every other selected column must also appear in the group by clause.
group by supports case when and other expressions.
Grouping by column position is also supported (see the sketch after the set statement below).
The more columns follow group by, the finer the grouping; requirements phrased with "per" or "each group" call for group by.
Between select and from, put only the group by columns and the aggregated columns; nothing else belongs there.

set hive.groupby.orderby.position.alias=true;
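A sketch of grouping by an expression and by position, reusing the offers table (offerid assumed as above):

-- Group by a case when expression
select case when offerid = 1 then 'one' else 'other' end as bucket, count(*)
from offers
group by case when offerid = 1 then 'one' else 'other' end;
-- Group by position: 1 refers to the first column in the select list
select offerid, count(*) from offers group by 1;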

having

Filters on the aggregated result of a group by.
Can avoid a subquery wrapped around the group by just for filtering.
Expressions may follow having, but this is not recommended; a subquery is generally used instead.

  • Difference between having and where: where filters the whole table before grouping, while having filters after grouping, keeping or dropping entire groups, which is less efficient (see the sketch below).
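A minimal having sketch on the orders table: keep only customers with at least five orders.

-- where would filter rows before grouping; having filters the groups afterwards
select order_customer_id, count(*) as cnt
from orders
group by order_customer_id
having count(*) >= 5;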

Basic aggregation

  • Aggregate data with the built-in aggregate functions
    max, min, count, sum, avg
    max(distinct col), avg(distinct col), etc.
    collect_set, collect_list: return a set/list of the column's values within each group
    collect_set additionally deduplicates (see the sketch below)
    https://www.cnblogs.com/cc11001100/p/9043946.html
  • Used with group by; can be applied to columns or expressions
    Without group by, all rows are aggregated together
    Selected columns must appear in the group by clause
    Aggregating over NULL yields 0:
    select count(null) = 0
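A collect_set/collect_list sketch over the t_window sample data shown in the window-function section below:

-- One row per customer: collect_list keeps duplicates, collect_set removes them
select name,
       collect_list(cost) as all_costs,
       collect_set(month(orderdate)) as active_months
from t_window
group by name;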

Advanced aggregation

grouping sets
Performs several group by aggregations over the same data set in one statement.
Essentially a union all of multiple group by queries.
group by ... with cube | rollup
cube: aggregates over every possible combination of the grouping columns
rollup: aggregates at each hierarchical level of the grouping dimensions (see the sketch below)
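A sketch over the t_window sample data from the next section, pre-computing the month so the grouping keys are plain columns:

with t1 as (select name, month(orderdate) as m, cost from t_window)
-- grouping sets: subtotals per (name, m), per name, and a grand total
select name, m, sum(cost) as total
from t1
group by name, m
grouping sets ((name, m), name, ());
-- With "group by name, m with cube" instead, every combination of name and m
-- is aggregated; "with rollup" gives (name, m) -> name -> grand total.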

Window functions

Window functions: https://blog.csdn.net/weixin_38750084/article/details/82779910
That article explains them very well, so here is just a brief summary.

jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94
-- Grouping vs. windowing
select name,count(cost) from t_window where substring(orderdate,1,7)='2015-04' group by name;
select name,count(*) over () from t_window where substring(orderdate,1,7)='2015-04';

-- Cross join: for April 2015, get each person's order count and the month's total count (the ratio itself is computed in the sketch after this block)
with 
t1 as (select * from t_window where substring(orderdate,1,7)='2015-04'),
t2 as (select name,count(cost) as consume from t1 group by name),
t3 as (select count(cost) as cnt from t1)
select * from t2,t3;
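A sketch completing the ratio, dividing each person's count by the month's total with the same CTEs:

with 
t1 as (select * from t_window where substring(orderdate,1,7)='2015-04'),
t2 as (select name,count(cost) as consume from t1 group by name),
t3 as (select count(cost) as cnt from t1)
select t2.name, round(t2.consume / t3.cnt, 2) as ratio from t2, t3;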
-- Window functions
A group by collapses each group into one row; a window function emits as many rows as the window defines.
An empty over() makes the whole table the window.
name and count(*) over() behave like two results that are then cross-joined; no rows are collapsed.
select name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04';

-- The over() count is computed after all rows are processed; it is independent of the other columns and serves only as a statistic
select distinct name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04';
select name,count(*) over() from t_window where substring(orderdate,1,7)='2015-04' group by name;
select name,max(ee) from (select name,count(*) over() as ee from t_window where substring(orderdate,1,7)='2015-04') s group by name;
select name,max(ee) from (select name,max(cost) over() as ee from t_window where substring(orderdate,1,7)='2015-04') s group by name;

-- Partitioning
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from t_window;

-- Ordering within each partition
select name,orderdate,cost,sum(cost) over(partition by month(orderdate) order by cost) from t_window;
-- Number the rows inside each window: row_number
select name,orderdate,cost,row_number() over(partition by month(orderdate) order by cost) from t_window;
insert into t_window values('smith','2015-01-31',50);
-- rank: ties share 2nd place and 3rd place is skipped
select name,orderdate,cost,rank() over(partition by month(orderdate) order by cost) from t_window;
-- dense_rank: ties share 2nd place and 3rd place still follows
select name,orderdate,cost,dense_rank() over(partition by month(orderdate) order by cost) from t_window;
-- Get each person's three highest-spending months (built up step by step below)
select name,orderdate,cost,row_number() over(partition by name,month(orderdate) order by cost) as rn from t_window;

with
t1 as (select name,substring(orderdate,1,7) as y_m,sum(cost) as sale from t_window group by name,substring(orderdate,1,7))
select name,y_m,sale,row_number() over(partition by name order by sale desc) from t1;

with
t1 as (select name,substring(orderdate,1,7) as y_m,sum(cost) as sale from t_window group by name,substring(orderdate,1,7)),
t2 as (select name,y_m,sale,row_number() over(partition by name order by sale desc) as mc from t1)
select name,y_m,sale,mc from t2 where mc<=3;

-- lag: the value from the previous row in the partition (NULL if none)
select name,orderdate,cost,lag(orderdate,1) over (partition by name order by orderdate) as time1 from t_window;

-- lead: the value from the next row in the partition (NULL if none)
select name,orderdate,cost,lead(orderdate,1) over (partition by name order by orderdate) as time1 from t_window;
count | avg | sum | max | min() over()    -- one global window
over(partition by)  -- one window per chosen key
over(partition by...order by)    -- ordered rows within each window
row_number() over(...order by)     -- numbering without ties
rank() over()   -- tied rows share a rank; the next rank is skipped
dense_rank()     -- tied rows share a rank; the next rank is not skipped
lag(,x) over(order by)     -- value x rows back; NULL if it does not exist
lead(,x) over(order by)   -- value x rows ahead; NULL if it does not exist
first_value() over(order by)   -- first value of the window as of the current row
last_value() over(order by)    -- last value of the window as of the current row

sum | max | min...() over(partition by...order by...rows between x and y)    -- window frame (see the sketch below)
x | y => UNBOUNDED PRECEDING (first row of the window)      UNBOUNDED FOLLOWING (last row of the window)
         n PRECEDING (n rows back)     n FOLLOWING (n rows ahead)
         CURRENT ROW (the current row)
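A frame sketch over t_window: a running total plus a three-row moving sum of each person's costs.

select name, orderdate, cost,
       -- running total: from the first row of the partition up to the current row
       sum(cost) over (partition by name order by orderdate
                       rows between unbounded preceding and current row) as running_total,
       -- moving sum: previous row, current row and next row
       sum(cost) over (partition by name order by orderdate
                       rows between 1 preceding and 1 following) as moving_sum
from t_window;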

OLTP vs. OLAP: https://www.cnblogs.com/lingchen-liang/p/10690619.html

Installing Zeppelin

######### Install Zeppelin ###########
tar -zxf zeppelin-0.8.1-bin-all.tgz
mv zeppelin-0.8.1-bin-all soft/zeppelin081
cd soft/zeppelin081/conf
cp zeppelin-site.xml.template zeppelin-site.xml
vi zeppelin-site.xml
cp zeppelin-env.sh.template zeppelin-env.sh
vi zeppelin-env.sh
echo $JAVA_HOME      # /opt/soft/jdk180
echo $HADOOP_HOME    # /opt/soft/hadoop260
vi zeppelin-env.sh
# add: export JAVA_HOME=/opt/soft/jdk180
#      export HADOOP_CONF_DIR=/opt/soft/hadoop260/etc/hadoop
./zeppelin-daemon.sh start       # start Zeppelin
[root@xuxu bin]# cp /opt/soft/hive110/conf/hive-site.xml /opt/soft/zeppelin081/conf/
[root@xuxu bin]# cp /opt/soft/hive110/lib/hive-jdbc-1.1.0-cdh5.14.2-standalone.jar /opt/soft/zeppelin081/interpreter/
[root@xuxu bin]# cp /opt/soft/hadoop260/share/hadoop/common/hadoop-common-2.6.0-cdh5.14.2.jar /opt/soft/zeppelin081/interpreter/
