Basic Hive Operations

Ways to Use Hive

1. Using the CLI

Run the hive command to enter the client.

2. Using the hiveserver2 service

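Modify hdfs-site.xml and core-site.xml:
- In hdfs-site.xml, add dfs.webhdfs.enabled --> true
- In core-site.xml, add hadoop.proxyuser.hadoop.hosts --> *
  ......groups --> *
Start Hive as a background service. Only after it is running as a background service can JDBC, ODBC, and similar programs connect to Hive.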

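nohup command &
nohup means "no hang up": the command keeps running after the terminal is closed.
Run: nohup hiveserver2 1>~/hive_std.log 2>~/hive_err.log &
File descriptors: 0 is standard input, 1 is standard output, 2 is standard error. The redirections must come before the trailing &, otherwise they do not apply to hiveserver2.
nohup hiveserver2 1>/dev/null 2>/dev/null &
Output is sent to /dev/null (the "black hole"), so no logs are kept.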
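
The placement of the redirections relative to & can be checked without a Hive installation. The sketch below is only an illustration: it uses an `sh -c` one-liner as a stand-in for hiveserver2 and a temporary directory instead of the home directory (both are assumptions, not part of the original setup).

```shell
# Hypothetical stand-in for hiveserver2: one line on stdout, one on stderr.
log_dir=$(mktemp -d)

# The redirections come BEFORE the trailing &, so they belong to the background job.
nohup sh -c 'echo "server started"; echo "a warning" >&2' \
    1>"$log_dir/hive_std.log" 2>"$log_dir/hive_err.log" &

# Wait for the background job so both log files are complete before reading them.
wait

cat "$log_dir/hive_std.log"   # standard output (fd 1)
cat "$log_dir/hive_err.log"   # standard error (fd 2)
```

Each stream ends up in its own file, which is what the two-file form of the hiveserver2 command relies on.
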
Run jps; a RunJar process indicates that the service started successfully.
Use the beeline client to connect to hiveserver2:
1. $ beeline
2. > !connect jdbc:hive2://hadoop02:10000

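Using HQL

DDL for Databases

Create a database
create database if not exists hadoop;
Use if not exists when creating to avoid an error if the database already exists.
Use if exists when dropping to avoid an error if the database does not exist.
Both clauses also apply to tables.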
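List databases
show databases;
Show the database currently in use
select current_database();
Switch database
use dname;
Show database details
desc database dname;
desc database extended dname;
Drop a database
drop database dname;
drop database dname restrict;
Both forms fail if the database still contains tables.
drop database dname cascade;
Cascading drop: removes the database together with its tables.
Alter a database: rarely used.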

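DDL for Tables

Create a table
create

comment: the table comment
partitioned by (col_name data_type, ...)
Partition columns must not appear among the table's regular columns.
clustered by (col_name, ...): bucketing
[sorted by (col_name [asc|desc], ...)]: whether rows inside each bucket are sorted, and by which columns
into num_buckets buckets: how many buckets the table is divided into
Bucketing columns must be a subset of the table's columns.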
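row format row_format: the field and line delimiters
row format delimited fields terminated by "," lines terminated by "\n"
stored as file_format: the storage file format
file_format:
1. textfile: plain text
2. sequencefile: serialized binary key-value file
3. rcfile: file combining row and columnar storage
4. custom file format
location hdfs_path
The table's storage path can be specified when the table is created.
Both internal and external tables can specify an HDFS storage path.
Best practice: if the data is already stored on HDFS and is shared by multiple clients, use an external table.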
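set hive.exec.mode.local.auto=true;
Lets Hive try to run queries in local mode.
The setting is lost when the session ends or after reset;
Copy a table
create table student_1 like student;
Copies a table's definition without its data.
CTAS
create table .... as select ...
set property
View configuration settings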

Six Table DDL Examples

Create an internal table: create table student (id int, name string) row format delimited fields terminated by ',';
Create an external table: create external table student_ext (id int, name string) row format delimited fields terminated by ',' location '/hive/student';
desc formatted table_name shows the table type as EXTERNAL_TABLE.
Partitioned tables:
create table student_ptn (id int, name string) partitioned by (city string) row format delimited fields terminated by ',';

create table t01_ptn02 (count int) partitioned by (username string, month string) row format delimited fields terminated by ',';

Add a partition: alter table student_ptn add partition (city="beijing");

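city is the partition column. If there is a second partition column such as zip, the directory structure is /city=beijing/zip=10011.

A partition column cannot reuse the name of a column that already exists in the table.
For a partitioned table, each partition is a partition directory under the table's directory.
Data files must be placed in partition directories, not directly in the table directory.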
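
The directory layout can be sketched as a small path-building function. This is an illustration only: `partition_dir` is a hypothetical helper, and the table directory under /user/hive/warehouse assumes the default warehouse location.

```shell
# Build the directory of one partition from a table directory and
# key=value pairs: one path level per partition column, in declaration order.
partition_dir() {
    table_dir=$1
    shift
    path=$table_dir
    for kv in "$@"; do
        path="$path/$kv"
    done
    printf '%s\n' "$path"
}

partition_dir /user/hive/warehouse/student_ptn city=beijing
partition_dir /user/hive/warehouse/student_ptn city=beijing zip=10011
```

With two partition columns the second call yields a nested directory, and data files belong at the deepest level.
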

Show partitions: show partitions student_ptn;
Bucketed table
create table student_bck (id int, name string) clustered by (id) sorted by (id asc, name desc) into 4 buckets row format delimited fields terminated by ',';
Create a table with CTAS
CTAS creates a table from the result of a query and stores that result in it.
create table student_ctas as select * from student where id < 10;
Copy a table structure
create table stu_copy like student;

Whether the source table is internal or external, the copy is an internal table unless external is written before table.

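Inspection Commands

show tables;
show tables in dname;
show tables like 'stu*'; -- pattern matching; * matches any characters

Show table details
desc student;
desc extended student;
desc formatted student;
show partitions stu; -- show partition information
show functions; -- list functions
desc function extended substring; -- show how a function is used
show create table stu; -- show the full create statement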

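Altering Tables

Rename a table
alter table stu rename to new_stu;
Change column definitions
- Add columns
  alter table stu add columns (sex string, age int);
- Change one column definition
  alter table stu change age new_age string;
- Drop a column
  Not supported.
- Replace all columns
  alter table stu replace columns (id int, name string);
  An int column can be changed to string, but a string column cannot be changed to int.
  In hive-1.2.2, however, columns can be replaced with any type.
  Hive is schema on read: the schema is applied when data is read, not when it is written.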
- Change partition information
  - Add static partitions: alter table stu_ptn add partition (city="chongqing") partition (city="kunming") ......;
  - Modify a partition
    Usually only the partition's storage directory is changed: alter table stu_ptn partition (city='beijing') set location '/stu_ptn_beijing';
  - Drop a partition
    alter table stu_ptn drop partition (city='beijing');
Truncate a table
truncate table stu;
Drop a table
drop table stu;

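DML (Data Manipulation Language)

Importing Data

Loading data with load
Hive is schema on read, so load accepts any data without validating it.
- load data local inpath "/home/" into table student;
  Loads data from the local Linux file system into the student table.
  The data files are uploaded to /user/hive/warehouse/student.
- load data inpath "/stu/test.txt" into table stu;
  Loads data from HDFS.
  If the data is already on HDFS, do not load it into an internal table,
  because the load moves the data into the /user/hive/warehouse/ directory
  and dropping the internal table then deletes the data.
- hadoop fs -put file /user/hive/warehouse/student/
  Uploads the file directly into the Hive table's directory.
- load data local inpath "....." overwrite into table table_name;
  Overwrites the existing data on load.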
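
Schema on read means that nothing is checked when a file is placed in the table directory; the declared columns are applied only when rows are read. The sketch below imitates that read step with awk for a table declared as (id int, name string) with ',' as the field delimiter. It is a simulation, not Hive itself: `read_student` and the sample rows are made up, and it assumes a value that cannot be parsed as the declared type, or a missing field, reads as NULL.

```shell
# Apply the schema (id int, name string) at read time only.
read_student() {
    awk -F',' '{
        id = ($1 ~ /^-?[0-9]+$/) ? $1 : "NULL"   # not an int: reads as NULL
        name = (NF >= 2) ? $2 : "NULL"            # missing field: reads as NULL
        print id "\t" name
    }' "$1"
}

data_file=$(mktemp)
# "Loading" accepts every line, including malformed ones.
printf '1,tom\nabc,jerry\n3\n' > "$data_file"
read_student "$data_file"
```

The malformed rows are still stored; they only show up as NULL values when the table is queried.
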
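Inserting data with insert
- insert into student (id,name,sex,age,department) values (1111,'ss','f',12,'nn'),(xx,xxx,xxx,);
  With this form, Hive first creates a temporary table such as values_tmp_table_1 to hold the rows from the insert statement, then inserts those rows into the target table.
- insert into table student_c select * from student where age <= 18;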

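Multi-Insert

Create a partitioned table: create table stu_ptn_age (id int, name string, sex string, department string) partitioned by (age int) .....

Split the rows of the student table into three groups and insert them into three partitions of stu_ptn_age.
When inserting into a partitioned table, the target partition does not need to exist; it is created automatically.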

insert into table  stu_ptn_age partition(age=18) select id,name,sex,department from student where age <=18;
insert into table  stu_ptn_age partition(age=19) select id,name,sex,department from student where age =19;
insert into table  stu_ptn_age partition(age=20) select  id,name,sex,department from student where age >=20;

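This approach is slow because each statement scans the source table again.

Multi-insert reduces the cost of the job,
mainly by cutting the number of scans over the source table to one.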

  from student
  insert into table stu_ptn_age partition(age=18) select id,name,sex,department where age <= 18
  insert into table stu_ptn_age partition(age=19) select id,name,sex,department where age = 19
  insert into table stu_ptn_age partition(age=20) select id,name,sex,department where age >= 20;

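truncate does not remove the age=xx partition information.
select * from stu_ptn_age;
The partition column is shown in the result as well.
In queries, a partition column is used just like a regular column.
Partition information is stored in the metastore's PARTITIONS table.

Question: what if the real requirement is one partition per age value?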

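Dynamic Partition Insert

Create a test table: create table stu_ptn_dpt ..... partitioned by (department string) ....
In strict mode this insert fails: insert into table t01_ptn02 partition(username, month) select count, username, month from table01;
set hive.exec.dynamic.partition.mode=nonstrict;
If a table has multiple partition columns, strict mode requires at least one of them to be a static partition in a dynamic partition insert. To lift this restriction, set the mode to nonstrict.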
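
The idea behind a dynamic partition insert is that the target partition is taken from the data instead of being written in the statement. The sketch below imitates this on the local file system: each row's last column picks the partition directory, which is created on first use. It is a simulation under assumptions, not Hive's implementation; `dynamic_insert`, the file name data.txt, and the sample rows are hypothetical.

```shell
# Route each row (id,name,department) to a directory named after the value
# of its last column, creating the directory on first use.
dynamic_insert() {
    src=$1
    table_dir=$2
    while IFS=',' read -r id name department; do
        part_dir="$table_dir/department=$department"
        mkdir -p "$part_dir"
        # The partition value lives in the directory name, not in the data file.
        printf '%s,%s\n' "$id" "$name" >> "$part_dir/data.txt"
    done < "$src"
}

src_file=$(mktemp)
table_dir=$(mktemp -d)
printf '1,tom,CS\n2,amy,MA\n3,bob,CS\n' > "$src_file"
dynamic_insert "$src_file" "$table_dir"
ls "$table_dir"
```

Two partition directories appear even though no partition value was named in advance.
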

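Avoid load when putting data into a partitioned table unless you are certain which partition the file belongs to; load does not check the rows, so a partition can end up with data that does not match it.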

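Exporting data with insert
insert overwrite local directory "/home/hadoop/tem/stu_le18" select * from student where age <= 18;
Be careful with the path: overwrite replaces whatever is already in that directory.
When viewing the exported data, use sed -e 's/\x01/\t/g' file.txt to replace the default Ctrl+A field delimiter with tabs.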
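
A quick way to see the effect is to build one sample row by hand and convert it. The row below is made up for illustration, and `to_tsv` is a hypothetical helper; it uses tr as a portable equivalent of the sed command, since the \x01 and \t escapes in sed are GNU extensions.

```shell
# One sample exported row with Ctrl+A (byte 0x01) between the fields.
export_file=$(mktemp)
printf '95001\001tom\001CS\n' > "$export_file"

# Replace every Ctrl+A with a tab so the columns become visible.
# Portable equivalent of: sed -e 's/\x01/\t/g' file.txt
to_tsv() {
    tr '\001' '\t' < "$1"
}

to_tsv "$export_file"
```
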

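String substitution: the s command
sed 's/hello/hi/g' sed.txt
##  Replaces hello with hi across the whole line. Without the g flag, only the first hello on each line is replaced.

Multiple edits: the -e option
sed -e '1,5d' -e 's/hello/hi/' sed.txt
##  The -e option allows several commands in one invocation. Here the first command deletes lines 1 to 5 and the second replaces hello with hi.
The order of the commands affects the result. If both are substitutions, the first substitution changes the input that the second one sees.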
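
The effect of command order can be seen by applying the same two substitutions in opposite orders to the same input. The function names here are only for illustration.

```shell
# Same two substitutions, opposite order, same input line.
first_order()  { printf 'hello\n' | sed -e 's/hello/hi/' -e 's/hi/hey/'; }
second_order() { printf 'hello\n' | sed -e 's/hi/hey/' -e 's/hello/hi/'; }

first_order    # hello becomes hi, then hi becomes hey
second_order   # no "hi" exists when the first command runs, so only hello becomes hi
```
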

sed --expression='s/hello/hi/' --expression='/today/d' sed.txt
##  --expression is the long form of -e; each occurrence adds one sed command.

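Queries

distinct removes duplicates
show functions; lists 271 built-in functions in Hive 2.3.3
UDF: single-row function, one row in, one value out
UDAF: aggregate (many-to-one) function, n rows in, one value out
UDTF: table-generating (one-to-many) function, one row in, n rows out
update and delete are not supported,
because Hive is a data warehouse built for online analytical processing (OLAP), not transaction processing.
in and exists are supported
select * from student where age in (18,19);
Older versions do not support them; Hive recommends left semi join instead.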
case when is supported

select id, t_job, t_edu, case t_edu
when "硕士" then 1 -- master's degree
when "本科" then 2 -- bachelor's degree
else 3
end as level
from lagou limit 1,100;

select count(distinct age) from ... join ... on ... where ... group by ... having ... cluster by ... distribute by ... sort by ... order by ... limit ...

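order by: global sort, e.g. select * from student order by age desc;
sort by
Local sort: the rows inside each reduce task's output are sorted, but rows with the same age can land in different reduce tasks because no hash distribution is applied.
Each SQL statement runs as a MapReduce job. Local sort means that when several reduce tasks run, each task's output is sorted on its own. With a single reduce task, sort by is equivalent to order by.
set mapreduce.job.reduces=3;
select * from student sort by age desc;
With sort by alone, as in this select * query, rows are assigned to reduce tasks randomly.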
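
Local sorting can be imitated with ordinary shell tools. In the sketch below, nine made-up ages are spread over three "reduce tasks" and each task sorts only its own rows. Round-robin assignment is used as a stand-in for the random distribution, which is an assumption made so the output is reproducible; `sort_by_desc` is a hypothetical helper.

```shell
# Ages of nine hypothetical students.
ages='18 20 19 18 21 19 20 18 22'

# Row n goes to reducer n % 3 (round-robin, not a hash of age);
# each reducer then sorts its own rows in descending order.
sort_by_desc() {
    reducer=$1
    printf '%s\n' $ages |
        awk -v r="$reducer" 'NR % 3 == r' |
        sort -rn | tr '\n' ' '
}

for r in 0 1 2; do
    echo "reducer $r: $(sort_by_desc "$r")"
done
```

Every reducer's output is ordered, yet age 18 shows up in two different reducers, so concatenating the outputs does not give a globally sorted result.
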
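distribute by
Bucketing operation
select * from student distribute by age sort by age desc;
Each row's bucket is the hash of age modulo the number of buckets, and the number of buckets equals the number of reduce tasks.
sort by then sorts locally, so the rows inside each bucket are ordered.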
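
The hash-then-modulo step can be sketched with awk. This is a simulation with made-up ages and three buckets; it assumes that the hash of an integer column is the integer itself, which may differ from the hash function Hive actually applies. `distribute_sort_desc` is a hypothetical helper.

```shell
# Ages of nine hypothetical students.
ages='18 20 19 18 21 19 20 18 22'

# distribute by age: bucket = hash(age) % 3, with hash(age) assumed to be age.
# sort by age desc: each bucket sorts only its own rows.
distribute_sort_desc() {
    bucket=$1
    printf '%s\n' $ages |
        awk -v b="$bucket" '$1 % 3 == b' |
        sort -rn | tr '\n' ' '
}

for b in 0 1 2; do
    echo "bucket $b: $(distribute_sort_desc "$b")"
done
```

All rows with the same age now land in the same bucket, and each bucket is sorted internally.
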
cluster by
cluster by age = distribute by age sort by age;
distribute by id sort by id,age != cluster by id sort by age;
cluster by cannot be used together with sort by.
To distribute rows by one column and then sort by several columns, use distribute by together with sort by.