Installation
Choosing the metadata store
By default Hive uses the embedded Derby database, which cannot be shared by multiple concurrent users and is mainly used for testing.
For production environments, MySQL is typically used to store the metadata.
HDFS warehouse directories
Create the warehouse directory:
hadoop fs -mkdir -p /user/hive/warehouse
Grant write permissions:
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -chmod a+w /tmp
Installing Hive
As the hadoop user, extract the Hive tarball into the installation directory /home/hadoop/apps:
tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /home/hadoop/apps
Switch to the root user.
Create a symlink:
ln -s /home/hadoop/apps/apache-hive-1.2.2-bin /usr/local/hive
Change the owner:
chown -R hadoop:hadoop /usr/local/hive
Add environment variables:
vim /etc/profile
Append the following:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:${HIVE_HOME}/bin
Apply the changes:
source /etc/profile
Switch back to the hadoop user.
Edit $HIVE_HOME/conf/hive-site.xml (create it if it does not exist) with the following properties:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.183.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive123</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
Note: change the IP address, username, and password to match your own MySQL server.
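The configuration above assumes a MySQL database named hive that a user hive with password hive123 can reach; it also assumes the MySQL Connector/J jar for com.mysql.jdbc.Driver is available to Hive (for example in $HIVE_HOME/lib), a step these notes do not show. If the database and user do not exist yet, a minimal sketch of preparing them on the MySQL server (adjust host, password, and syntax to your MySQL version) is:
-- hypothetical metastore preparation; adjust host and password to your environment
create database if not exists hive default character set latin1;
create user 'hive'@'%' identified by 'hive123';
grant all privileges on hive.* to 'hive'@'%';
flush privileges;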
Starting Hive
/usr/local/hive/bin/hive
[hadoop@hadoop4 bin]$ hive
Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
Startup succeeded.
To start Hive with log output printed to the console:
hive -hiveconf hive.root.logger=DEBUG,console
Hive operations
- List databases
show databases;
hive> show databases;
OK
default
Time taken: 0.02 seconds, Fetched: 1 row(s)
- Create a database
create database mytestDB;
create database IF NOT EXISTS mytestDB;
hadoop fs -ls /user/hive/warehouse
A mytestdb.db directory is added.
- Switch to a database
use mytestdb;
- List tables
show tables;
- Create a table
Create a user table: user_info
Columns: user id, area id, age, occupation
create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
After the table is created, a corresponding directory is also created in HDFS:
/user/hive/warehouse/mytestdb.db/user_info
- Drop a table
drop table user_info;
The user_info directory in HDFS is deleted along with the table.
Creating internal (managed) tables
- Create a student information table in the rel database
create table student_info(
student_id string comment '学号',
name string comment '姓名',
age int comment '年龄',
origin string comment '地域'
)
comment '学生信息表'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Load data from the local filesystem into student_info with load
load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
Check the student_info table's HDFS path; the file student_info_data.txt has been added:
hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
- Query students whose origin is 11
hive> select * from student_info where origin='11'
> ;
OK
1 xiaoming 20 11
6 zhangsan 20 11
7 lisi 19 11
Time taken: 0.473 seconds, Fetched: 3 row(s)
hive>
- Load a file from HDFS into student_info with load
Upload the file to HDFS:
hadoop fs -put student_info_data.txt /
Load in append mode:
load data inpath '/student_info_data.txt' into table student_info;
The table's HDFS directory now contains two copies of student_info_data.txt:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 2 items
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
-rwxrwxrwx 3 hadoop supergroup 341 2018-01-26 10:39 /user/hive/warehouse/mydb.db/student_info/student_info_data_copy_1.txt
Note that /student_info_data.txt in HDFS is moved (not copied) into the table's HDFS directory /user/hive/warehouse/mydb.db/student_info.
Load in overwrite mode:
load data inpath '/student_info_data.txt' overwrite into table student_info;
This replaces the existing data.
Data types
- Create an employee table: employee
Columns: user id, salary, cities worked in, social insurance contributions (pension, medical), welfare (meal allowance (float), regularized or not (boolean), commercial insurance (float))
create table employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
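Given these delimiters, a line of employee_data.txt has to be laid out roughly as follows. This is a hypothetical reconstruction based on the query output shown further below, not the actual file: the five fields are separated by tab characters, array items and struct members by commas, and map keys from their values by colons.
zhangsan	10800	beijing,shanghai	养老:1000.0,医疗:600.0	2000.0,true,500.0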
- Load local data into employee
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
hive> select * from employee;
OK
zhangsan 10800 ["beijing","shanghai"] {"养老":1000.0,"医疗":600.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi 20000 ["beijing","nanjing"] {"养老":2000.0,"医疗":1200.0} {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu 17000 ["shanghai","nanjing"] {"养老":1800.0,"医疗":1100.0} {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
- Query regularized employees' id, salary, first city worked in, pension contribution, and meal allowance
hive> select user_id,
> salary,
> worked_citys[0],
> social_security['养老'],
> welfare.meal_allowance
> from employee
> where welfare.if_regular=true;
OK
zhangsan 10800 beijing 1000.0 2000.0
wangwu 17000 shanghai 1800.0 2000.0
Creating external tables
The HDFS path can be created in advance:
hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
If it is not created beforehand, it is created automatically at the specified location when the external table is created.
- Create an external student enrollment table
Columns:
student id, name, institute id, major id, enrollment year
HDFS data path: /user/hive/warehouse/data/student_school_info
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
- Upload the local data file to HDFS
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
Query the table:
select * from student_school_info;
Creating internal partitioned tables
Create a student enrollment table
Columns: student id, name, institute id
Partition column: major id
create table student_school_info_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Use insert into to load the bigdata-major records from student_school_info into the partitioned table student_school_info_partition_maj
insert into table student_school_info_partition_maj partition(major_id ='bigdata')
select t1.student_id,t1.name,t1.institute_id
from student_school_info t1
where t1.major_id = 'bigdata';
List partitions:
show partitions student_school_info_partition_maj;
Check the HDFS path; a major_id=bigdata directory has been added:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj/
Drop a partition:
alter table student_school_info_partition_maj drop partition (major_id ='bigdata');
Run show partitions on the table again; the partition and its data are gone.
Adding data with dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1 ;
- List partitions
show partitions student_school_info_partition_maj;
hive> show partitions student_school_info_partition_maj;
OK
major_id=bigdata
major_id=computer
major_id=software
Time taken: 0.114 seconds, Fetched: 3 row(s)
- Check the HDFS path
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=software
Three directories are added, each holding the data for its partition.
Creating external partitioned tables
Create a student enrollment table
Columns: student id, name, institute id
Partition column: major id
create external table rel.student_school_info_external_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition_maj';
- Dynamic partition insert
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_external_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
- Drop the internal partitioned table: the table is removed and so is its data in HDFS
drop table student_school_info_partition_maj;
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/
Found 1 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:23 /user/hive/warehouse/rel.db/student_school_info_partition
- Drop the external partitioned table
hive> drop table student_school_info_external_partition_maj;
OK
Time taken: 0.63 seconds
hive> show tables;
OK
student_school_info
student_school_info_partition
Time taken: 0.027 seconds, Fetched: 2 row(s)
Check the files in HDFS; the data is still there:
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:06 /user/hive/warehouse/data/student_school_info
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj
Found 3 items
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=bigdata
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=computer
drwxrwxrwx - hadoop supergroup 0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
Found 1 items
-rwxrwxrwx 3 hadoop supergroup 46 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software/000000_0
Creating tables with LIKE and AS, renaming tables, adding, modifying, and dropping columns
Use the like keyword to create a new table with exactly the same structure as an existing table:
create table student_info2 like student_info;
Use the as keyword to create a table whose columns match a query result, and insert the query result into the new table:
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
Rename student_info4 to student_id_name:
alter table student_info4 rename to student_id_name;
Add a gender column to student_info3. A newly added column is placed after all existing columns and before any partition columns. For data files that existed before the column was added, the new column shows as NULL in queries. Separate multiple new columns with commas.
alter table student_info3 add columns (gender string comment '性别');
- Drop or modify columns
To modify a column, redefine every column that should remain, redefining the column to change as needed:
alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
To drop a column, redefine every column that should remain and leave out the column to drop:
alter table student_info3 replace columns(student_id string,name string,age int,origin string);
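To confirm the resulting column list after such a change, Hive's describe statement can be used (shown for the table name assumed above):
describe student_info3;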
Creating bucketed tables
Rows are assigned to buckets by hashing the specified column.
Create a bucketed student information table
Columns: student id, name, age, origin
Bucket column: student id, with 4 buckets, sorted within each bucket by student id in ascending order
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
- Insert data
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
Load is generally not used to put data into a bucketed table: load only copies the file into the table's storage directory, so Hive does not analyze the data and split it by the bucketing column; the whole file is imported as-is, without any bucketing. Bucketed tables are therefore usually populated with insert from another table.
Creating a bucketed table only defines the table's layout; each bucketed insert also requires the following:
In the session that runs the insert, set hive.enforce.bucketing = true to declare that this is a bucketed insert.
Set the number of reducers equal to the number of buckets (set mapreduce.job.reduces=4), so that exactly one file is produced per bucket.
If a sort order was declared for the buckets, the query feeding the insert must sort the rows by the bucketing column before they are written into the buckets; the bucketed table does not sort the data in each bucket by itself.
When the sort column and the bucket column are the same, use cluster by (col); cluster by distributes by the bucketing column and sorts ascending within each bucket. For descending order inside the buckets, combine distribute by (col) with sort by (col desc), as below:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
Exporting data
- Use insert to export the student_info table to a given local path
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data' row format delimited fields terminated by '\t' select * from student_info;
- A common way to export data to the local filesystem
hive -e "select * from rel.student_info" > ./student_info_data.txt
The default field separator in the output is '\t'.
Joins
- join / inner join
The two tables are joined on id and only rows whose id values match are returned. join and inner join produce the same result.
select * from a join b on a.id=b.id;
which is equivalent to
select * from a inner join b on a.id=b.id;
- full outer join / full join
The tables are joined on id and all rows from both tables are returned.
select * from a full join b on a.id=b.id;
- left join
All rows from the left table are kept; columns from the right table are NULL where there is no match.
select * from a left join b on a.id=b.id;
- right join
All rows from the right table are kept; columns from the left table are NULL where there is no match.
select * from a right join b on a.id=b.id;
- left semi join
A left semi join provides IN/EXISTS-like query semantics and returns the matching rows of the left table.
Hive traditionally does not support relational-style in/exists subqueries (left semi join is the substitute), and there is no right semi join.
select a.id, a.name from a where a.id in (select b.id from b);
In Hive this corresponds to:
select a.id,a.name from a left semi join b on a.id = b.id;
- map side join
A map-side join uses the distributed cache to load the small table into every map task and completes the join on the map side, so the data does not have to be shuffled to reducers for the join, which lowers the amount of data transferred between nodes. It is one way to mitigate data skew in multi-table joins. When only one of the joined tables is large and the others are all small, the join is converted into a map-only job; the run log shows "Number of reduce tasks is set to 0 since there's no reduce operator".
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
Hive built-in functions
- case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Meaning: if a equals b, return c; if a equals d, return e; otherwise return f.
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Meaning: if a is TRUE, return b; if c is TRUE, return d; otherwise return e.
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
User-defined functions (UDF)
When Hive's built-in functions cannot cover your processing needs, consider a user-defined function (UDF).
A UDF operates on a single row and produces a single row as output.
Steps:
- Develop a Java class that extends UDF and overloads the evaluate method
- Package it as a jar and upload it to the server
- When using it, add the jar to Hive's classpath
hive> add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
- Create a temporary function associated with the compiled Java class
hive> create temporary function age_partition as 'cn.hadoop.udf.AgePartitionFunction';
- The custom function can then be used in HQL:
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);
Problems encountered while installing and using Hive
- Creating a table fails
Cause: MySQL character set; the hive database in MySQL must use the latin1 character set.
- Dropping a table hangs
Cause: also a character-set problem. The character set was still utf-8 when the tables were created and was only changed to latin1 afterwards.
The setup has to be redone: drop the hive database in MySQL, recreate it, and set the character set, as sketched below.
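A minimal sketch of that fix, run on the MySQL server and assuming the metastore database is named hive as in the configuration earlier (option 2 removes all existing Hive metadata):
-- option 1: change the character set of the existing metastore database
alter database hive character set latin1;
-- option 2: drop and recreate the metastore database with the latin1 character set
drop database hive;
create database hive default character set latin1;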