Official documentation (recommended): https://cwiki.apache.org/confluence/display/Hive
Concepts: https://www.cnblogs.com/netuml/p/7841387.html
Installation: https://www.cnblogs.com/garfieldcgf/p/8134452.html
Chinese translation of the official docs: https://blog.csdn.net/strongyoung88/article/details/53743937
The official documentation is really the best place to learn.
Here are links to the more important / commonly used parts of the official docs:
Hive introduction: https://cwiki.apache.org/confluence/display/Hive/Tutorial
Commands: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Commands
Command line (CLI): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
Data types: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Create/drop/show tables (DDL): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Statistics: https://cwiki.apache.org/confluence/display/Hive/StatsDev
Indexes: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
Archiving: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving
Load data / insert / delete (DML): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Import/export: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
Execution plans: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
Queries: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
Functions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Locks: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Locks
Authorization: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization
Configuration properties: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
Transform: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
P.S. When an HQL query runs slowly, look at its execution plan and optimize the query based on what it shows (see the sketch below).
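A minimal sketch of inspecting a plan, using the person table that appears throughout these notes:
explain select city, count(*) from person group by city;
explain extended select city, count(*) from person group by city;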
Starting Hive
Start the server side (metastore) first:
hive --service metastore -p <port>
If you omit the port and just run hive --service metastore, it listens on the default port 9083.
Then start the client:
hive
Note: if the server side is not running, the client can still enter the CLI, but every command it runs will fail.
Show databases and tables
show databases;
show tables;
Show table information
desc person;
desc extended person;
Alter / drop operations
Rename a table: ALTER TABLE events RENAME TO 3koobecaf;
Add a column: ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a column with a comment: ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Replace all columns (effectively renaming/re-typing them): ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
Change a column's name and type: ALTER TABLE employee CHANGE name ename STRING;
Drop a table: DROP TABLE pokes;
Creating a table
create table if not exists person3(id string, name string, age int, height int, city string)
comment 'person table'
row format delimited fields terminated by '\t';
Creating a partitioned table
create table if not exists person2(id string, name string, age int, height int, city string)
partitioned by(area string, sex string)
row format delimited fields terminated by '\t';
Each partition gets its own sub-folder under the table's directory, named e.g. area=A, area=B; there are as many folders as the partition column has distinct values.
A second partition column creates another level of folders inside those, much like the feature splits of a decision tree.
Besides partitioning there is also bucketing; see https://blog.csdn.net/epitomizelu/article/details/41911657 and the sketch below.
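A minimal sketch of a bucketed table, reusing person2's columns (the table name person2_bucket and the bucket count are illustrative); on older Hive versions you may also need set hive.enforce.bucketing=true; before inserting into it:
create table if not exists person2_bucket(id string, name string, age int, height int, city string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';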
A table created this way is a managed table: its data has to be copied into the table's directory
(if the source file is already on HDFS it is moved there), and dropping the table deletes both the data and the metadata.
The other kind is an external table, created with the EXTERNAL keyword; dropping it removes only the metadata, not the data.
For details see https://blog.csdn.net/nianguodong/article/details/47358955 and the sketch below.
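A minimal sketch of an external table, assuming the data already sits under a hypothetical HDFS directory /data/person_ext:
create external table if not exists person_ext(id string, name string, age int, height int, city string)
row format delimited fields terminated by '\t'
location '/data/person_ext';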
You can also create a TEMPORARY table, which is visible only to the current session and is dropped automatically when the session ends (sketched below).
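A minimal sketch of keeping an intermediate result in a temporary table (Hive 0.14.0 or later; the name tmp_person is illustrative):
create temporary table tmp_person as select id, name from person where city='beijing';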
Column types
When creating a table, a column's data_type can be any of:
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
For array, map, struct and union columns you also have to specify how collection items and map keys are delimited when creating the table;
see https://blog.csdn.net/yfkiss/article/details/7842014 and the sketch below.
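A minimal sketch of a table with complex-typed columns and their delimiters (the table and column names are illustrative):
create table if not exists person_complex(
  id string,
  hobbies array<string>,
  scores map<string,int>,
  address struct<city:string, street:string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';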
For more detail on creating tables, indexes, macros, functions, etc., see the official DDL manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Loading data
load data local inpath '/home/linzhenpeng/Desktop/hadoop/hive/test.txt' into table person;
load data local inpath '/home/linzhenpeng/Desktop/hadoop/hive/test.txt' into table person2 partition(area='B',sex='A');
If the file is already on HDFS, drop the local keyword (see the sketch below).
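A minimal sketch of loading a file that already lives on HDFS (the path is illustrative); note that Hive moves, rather than copies, the file into the table's directory:
load data inpath '/tmp/hive/test.txt' into table person;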
Exporting query results
insert overwrite directory '/tem/person_out' select * from person where city='beijing';
insert overwrite local directory '/tem/person_out' select * from person where city='beijing';
Inserting query results into another table (the select list has to match the target table's columns)
insert overwrite table person3 select a.id, a.name, a.age, a.height, a.city from person a;
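A hedged sketch of writing a query result into one partition of the person2 table created above (the partition values are illustrative):
insert overwrite table person2 partition(area='A', sex='M')
select id, name, age, height, city from person where city='beijing';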
Create a new table from a query result (handy for saving intermediate results)
create table person_id_name as select id,name from person;
Viewing data
select * from person;
Show column names in query output
set hive.cli.print.header=true;
Group-by aggregation
select city,count(*) from person2 group by city;
select city from person group by city having sum(age)>25;
Sorting
select * from person order by age;
select * from person order by age desc;
Distinct aggregation
select city ,count(distinct id) from person group by city;
For CLUSTER BY / SORT BY / DISTRIBUTE BY, ordering and grouping, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy and the sketch below.
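A minimal sketch of sort by / distribute by / cluster by (unlike order by, these only sort within each reducer):
select * from person distribute by city sort by age desc;
select * from person cluster by city;  -- same as: distribute by city sort by city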
Joins
Inner join: JOIN ... ON; left (outer) join: LEFT OUTER JOIN ... ON; right (outer) join: RIGHT OUTER JOIN ... ON;
full join: FULL OUTER JOIN ... ON; left semi join: LEFT SEMI JOIN ... ON.
For the details see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins and the sketch below.
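A minimal sketch of a left outer join, assuming a hypothetical table orders(person_id string, amount int):
select p.id, p.name, o.amount
from person p
left outer join orders o on p.id = o.person_id;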
Union of tables
select * from ( select * from person union all select * from person3 ) person4;
Conditionals
1. CASE WHEN ... THEN ... END
select name,age,
case
when age<=6 then 'child'
when age>6 and age<=17 then 'teenager'
when age>17 and age<=40 then 'youth'
else 'other'
end as agetype from person;
2. IF
select name,age,if(age>=18,'adult','minor') as t from person;
3. COALESCE (returns its first non-NULL argument)
COALESCE(NULL,NULL,5,NULL,4) returns 5
Pattern matching
select * from person where city like 'beijing'
select * from person where city like 'beiji%'
You can also do fuzzy matching with RLIKE, which takes a Java regular expression:
select * from person where city rlike 'beijin.'
Limiting the number of rows
select * from person limit 2;
Calling Java methods via reflect
SELECT reflect("java.lang.String", "valueOf", 1),
reflect("java.lang.String", "isEmpty"),
reflect("java.lang.Math", "max", 2, 3),
reflect("java.lang.Math", "min", 2, 3),
reflect("java.lang.Math", "round", 2.5),
reflect("java.lang.Math", "exp", 1.0),
reflect("java.lang.Math", "floor", 1.9)
FROM src LIMIT 1;
Using built-in functions
select age,exp(age) exp from person;
For writing and using user-defined functions (UDFs), see: https://www.cnblogs.com/DreamDrive/p/5561113.html
Hive UDFs can be written in Java as well as in Python; see: https://blog.csdn.net/u010705209/article/details/52935245 (registering a Java UDF is sketched below).
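A hedged sketch of registering and calling a compiled Java UDF from the CLI (the jar path, class name and function name are illustrative):
add jar /tmp/my_udfs.jar;
create temporary function my_lower as 'com.example.hive.udf.MyLower';
select my_lower(name) from person limit 5;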
Custom map/reduce scripts (TRANSFORM)
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid,pv_users.date)
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt,map_output.uid)
USING 'reduce_script'
AS date, count;
The two queries above are equivalent; the second one uses SELECT TRANSFORM in place of the MAP and REDUCE clauses of the first.
For details see: https://blog.csdn.net/skywalker_only/article/details/37879187
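Before running the TRANSFORM examples, the map_script and reduce_script files have to be shipped to the cluster; a minimal sketch, assuming they sit in the current local directory:
add file map_script;
add file reduce_script;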