HIVE Basics

Official documentation (recommended): https://cwiki.apache.org/confluence/display/Hive
Concepts: https://www.cnblogs.com/netuml/p/7841387.html
Installation: https://www.cnblogs.com/garfieldcgf/p/8134452.html
Chinese translation of the official docs: https://blog.csdn.net/strongyoung88/article/details/53743937


The official documentation is really the best place to learn.
Below are the official documentation links for the most important / commonly used features:
Hive introduction (tutorial): https://cwiki.apache.org/confluence/display/Hive/Tutorial
Commands: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Commands
Command line (CLI): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
Data types: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Create/drop/describe tables (DDL): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Statistics: https://cwiki.apache.org/confluence/display/Hive/StatsDev
Indexes: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
Archiving: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving
Load/insert/delete (DML): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Import/export: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
Execution plans (EXPLAIN): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
Queries (SELECT): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
Functions (UDF): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Locks: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Locks
Authorization: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization
Configuration properties: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
Transform: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
P.S.: when an HQL query runs slowly, look at its execution plan and optimize the query accordingly.
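For example, prefixing a statement with EXPLAIN prints the plan Hive would run (a minimal sketch using the person table that appears later in these notes):

  explain select city, count(*) from person group by city;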

Starting Hive

First start the metastore service:
hive --service metastore -p <port>
If you omit the port and just run hive --service metastore, it listens on the default port 9083.
Then start the client:
hive
Note: if the metastore service is not running, you can still start the client and reach the prompt, but executing commands will fail.


Showing databases and tables

  show databases;
  show tables;

Showing table information

  desc person;
  desc extended person;


Altering tables

  Rename a table: ALTER TABLE events RENAME TO 3koobecaf;
  Add a column: ALTER TABLE pokes ADD COLUMNS (new_col INT);
  Add a column with a comment: ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  Replace the columns: ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
  Change a column's name and type: ALTER TABLE employee CHANGE name ename String;
  Drop a table: DROP TABLE pokes;


Creating tables

  create table if not exists person3(id string,name string,age int,height int,city string)
  comment 'person table'
  row format delimited fields terminated by '\t';
Creating a partitioned table
  create table if not exists person2(id string,name string,age int,height int,city string)
  partitioned by(area string,sex string)
  row format delimited fields terminated by '\t';
After you define partitions, Hive creates one subdirectory per partition value under the table's directory, named like area=A, area=B; there are as many directories as there are distinct values of the partition column.
If you add another partition level, further subdirectories are created inside each of those, much like the way a decision tree splits on features.
Besides partitioning, Hive also supports bucketing; for details see: https://blog.csdn.net/epitomizelu/article/details/41911657
A table created this way is a Managed Table: loading data copies the files into the table's directory
(if the file is already on HDFS it is moved there instead), and dropping the table deletes both the data and the metadata.
The other kind is the External Table, created with the EXTERNAL keyword; dropping such a table does not delete its data.
For details see: https://blog.csdn.net/nianguodong/article/details/47358955
You can also create TEMPORARY tables, which are visible only in the current session and are dropped automatically when the session ends.
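
A minimal sketch of the external, temporary, and bucketed variants mentioned above (the table names, columns, and HDFS location here are made up for illustration):

  -- External table: dropping it removes only the metadata, not the data files at the given location
  create external table if not exists person_ext(id string,name string,age int)
  row format delimited fields terminated by '\t'
  location '/user/hive/warehouse/person_ext';

  -- Temporary table: visible only in the current session
  create temporary table person_tmp(id string,name string);

  -- Bucketed table: rows are hashed on id into a fixed number of buckets
  create table if not exists person_bucketed(id string,name string,age int)
  clustered by(id) into 4 buckets
  row format delimited fields terminated by '\t';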


Column types

When creating a table, a column's type can be any of the following:
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)


primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
array_type
  : ARRAY < data_type >


map_type
  : MAP < primitive_type, data_type >


struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>


union_type
  : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
For columns of array, map, struct, or union type, you also have to specify how the collection items are delimited when you create the table;
for details see: https://blog.csdn.net/yfkiss/article/details/7842014
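
A rough sketch of what that looks like (the table and column names here are hypothetical, not taken from the linked post):

  -- fields terminated by: delimiter between top-level columns
  -- collection items terminated by: delimiter between elements of an array/struct and between map entries
  -- map keys terminated by: delimiter between a map key and its value
  create table if not exists person_complex(
    id string,
    hobbies array<string>,
    scores map<string,int>,
    address struct<city:string,street:string>
  )
  row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':';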


For more detail on creating tables/indexes/macros/functions, see the official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL


Loading data

  load data local inpath '/home/linzhenpeng/Desktop/hadoop/hive/test.txt' into table person;
  load data local inpath '/home/linzhenpeng/Desktop/hadoop/hive/test.txt' into table person2 partition(area='B',sex='A');
  If the data is already on HDFS, omit the LOCAL keyword.
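  For example, loading a file that is already on HDFS (the path below is a hypothetical example):
  load data inpath '/user/hive/input/test.txt' into table person;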


Exporting query results

Write the results to a directory on HDFS:
insert overwrite directory '/tem/person_out' select * from person where city='beijing';

Write the results to a directory on the local filesystem:
insert overwrite local directory '/tem/person_out' select * from person where city='beijing';

Inserting query results into another table
insert overwrite table person3 select a.id,a.name,a.age,a.height,a.city from person a;
Creating a new table from a query result (handy for saving intermediate results)
create table person_id_name as select id,name from person;


Viewing data

  select * from person;
To show column headers in query output:
  set hive.cli.print.header=true;


Group-by aggregation

select city,count(*) from person2 group by city;
select city from person group by city having sum(age)>25;


Sorting

select * from person order by age;
select * from person order by age desc;


Distinct counts and clustering

select city ,count(distinct id) from person group by city;
For more on clustering (CLUSTER BY), sorting, and grouping, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy


Joins

Inner join: JOIN ... ON; left (outer) join: LEFT OUTER JOIN ... ON; right (outer) join: RIGHT OUTER JOIN ... ON;
full (outer) join: FULL OUTER JOIN ... ON; left semi join: LEFT SEMI JOIN ... ON.
For more on joins, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
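
A minimal sketch of the join syntax, assuming a hypothetical orders table keyed by person_id (not defined elsewhere in these notes):

  -- keep every person; order columns come back NULL for people with no orders
  select p.name, o.amount
  from person p
  left outer join orders o on p.id = o.person_id;

  -- left semi join acts like an IN/EXISTS filter: only columns from the left table may be selected
  select p.id, p.name
  from person p
  left semi join orders o on p.id = o.person_id;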


Combining tables (UNION)

select * from ( select *  from person  union all select * from person3 ) person4;


Conditional expressions

1. CASE WHEN ... THEN ... END
select name,age, 
      case
        when age<=6 then 'child'
        when age>6 and age<=17 then 'teenager'
        when age>17 and age<=40 then 'youth'
        else 'other'
      end as agetype from person;
2. IF
select name,age,if(age>=18,'adult','minor') as t from person;
3. COALESCE (returns the first non-NULL argument)
COALESCE(NULL,NULL,5,NULL,4) returns 5
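For example, to fall back to a default value when a column is NULL (a small sketch using the person table):
  select name, coalesce(city, 'unknown') as city from person;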



Pattern matching

select * from person where city like 'beijing';
select * from person where city like 'beiji%';
You can also do pattern matching with RLIKE, which takes a Java regular expression:
select * from person where city rlike 'beijin.';


Limiting the number of rows

select * from person limit 2;


Calling Java methods with reflect

SELECT reflect("java.lang.String", "valueOf", 1),
       reflect("java.lang.String", "isEmpty"),
       reflect("java.lang.Math", "max", 2, 3),
       reflect("java.lang.Math", "min", 2, 3),
       reflect("java.lang.Math", "round", 2.5),
       reflect("java.lang.Math", "exp", 1.0),
       reflect("java.lang.Math", "floor", 1.9)
FROM src LIMIT 1;


Using built-in functions

select age,exp(age) exp from person; 
For how to write and use custom functions, see: https://www.cnblogs.com/DreamDrive/p/5561113.html


Hive UDFs can be written in either Java or Python; for details see: https://blog.csdn.net/u010705209/article/details/52935245
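
Once a custom UDF has been packaged into a jar, it is registered and used from HiveQL roughly like this (the jar path, class name, and function name below are hypothetical placeholders):

  add jar /path/to/my_udfs.jar;
  create temporary function my_upper as 'com.example.hive.MyUpperUDF';
  select my_upper(name) from person;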


Custom map-reduce scripts (TRANSFORM)

FROM (  
  FROM pv_users  
  MAP pv_users.userid, pv_users.date  
  USING 'map_script'  
  AS dt, uid  
  CLUSTER BY dt) map_output  
INSERT OVERWRITE TABLE pv_users_reduced  
  REDUCE map_output.dt, map_output.uid  
  USING 'reduce_script'  
  AS date, count;  
   
FROM (  
  FROM pv_users  
  SELECT TRANSFORM(pv_users.userid,pv_users.date)  
  USING 'map_script'  
  AS dt, uid  
  CLUSTER BY dt) map_output  
INSERT OVERWRITE TABLE pv_users_reduced  
  SELECT TRANSFORM(map_output.dt,map_output.uid)  
  USING 'reduce_script'  
  AS date, count;  
The two queries are equivalent: the second uses SELECT TRANSFORM in place of the MAP and REDUCE clauses of the first.
For details see: https://blog.csdn.net/skywalker_only/article/details/37879187












