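Hive has four main data models: Database, Table, Partition, and Bucket.
Database
Hive organizes tables into databases, and each database corresponds to a directory on HDFS. The default database is named default and maps to the warehouse root directory, /user/hadoop/hive/warehouse in this setup (Hive's stock default is /user/hive/warehouse). The location is set with the hive.metastore.warehouse.dir parameter in hive-site.xml.
Table
Hive tables are either managed (internal) tables or external tables.
Each Hive table corresponds to a directory on HDFS: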
/user/hadoop/hive/warehouse/[databasename.db]/table
Partition
Each partition corresponds to a subdirectory of the table's directory on HDFS. For example, if the table order_partition has a partition event_month=2014-05, that partition's data is stored in /user/hadoop/hive/warehouse/[databasename.db]/order_partition/event_month=2014-05
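Bucket
Hive computes a hash of the specified column and splits the data by hash value so that it can be processed in parallel; each bucket corresponds to one file. For example, to spread the empno column of the emp table across 10 buckets, Hive first hashes each empno value and takes the result modulo 10. Rows whose hash value is 0 or 10 are stored at: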
/user/hadoop/hive/warehouse/[databasename.db]/emp/part-00000
Rows whose hash value is 2 are stored at:
/user/hadoop/hive/warehouse/[databasename.db]/emp/part-00002
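The bucket assignment described above can be sketched in Python. This is a minimal model, not Hive's own code. It assumes that the hash of an INT column is the integer value itself and that the bucket index is the non-negative hash modulo the bucket count. The function name bucket_for is hypothetical.

```python
def bucket_for(hash_value: int, num_buckets: int) -> int:
    """Return the bucket index for a row, given the hash of its bucketing column.

    Assumption: the hash is masked to a non-negative 31-bit value before the
    modulo, so a negative hash still maps to a valid bucket.
    """
    return (hash_value & 0x7FFFFFFF) % num_buckets


# Assumption: for an INT column such as empno, the hash is the value itself.
for empno in (7369, 7499, 7900):
    print(empno, "-> bucket", bucket_for(empno, 10))
```

With 10 buckets, hash values 0 and 10 both map to bucket 0 and hash value 2 maps to bucket 2, which is why the first two share a file in the paths above.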
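Hive primitive data types
The types shown in red in the figure are the ones most commonly used in Hive.
Hive complex data types
Array: an ordered collection of elements, all of the same type
Map: an unordered set of key-value pairs. Keys must be of a primitive type and values can be of any type; within one map, all keys share one type and all values share one type
Struct: a set of named fields whose types may differ
Hive delimiters
Official documentation for DDL operations: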
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
(1). Create
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
The corresponding HDFS directory is /user/hive/warehouse
A Hive database corresponds to a directory on HDFS, named database_name.db
create database test;
create database if not exists test1;
create database test2 location '/myhive/mydb';
create database test3 comment 'my test db' with dbproperties ('creator'='zhangsan','date'='2017-8-8');
(2). Query
show databases;
show databases like 'test*';
desc database test3;
To see more detail than the command above shows, add the extended keyword:
desc database extended test3;
show create database test3;
use databasename;
(3). Alter
Syntax:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...); -- (Note: SCHEMA added in Hive 0.14.0)
Change the modifier recorded for the database:
alter database test3 set dbproperties ('modifier'='lisi');
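(4). Drop
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Drop the database test3:
drop database if exists test3;
If test3 still contains tables, the drop fails with the following error:
InvalidOperationException(message:Database test3 is not empty. One or more tables exist.)
A database that contains one or more tables cannot be dropped directly; drop the tables first, then the database.
To drop a database together with its tables, add cascade for a cascading delete (use with caution):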
drop database if exists test3 cascade;
(1). Create
Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)]
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
[STORED AS DIRECTORIES]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
| STRING
| BINARY -- (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
| DECIMAL -- (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
| DATE -- (Note: Available in Hive 0.12.0 and later)
| VARCHAR -- (Note: Available in Hive 0.12.0 and later)
| CHAR -- (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
constraint_specification:
: [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
[, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)row format delimited fields terminated by '\t';
Suppose there is a tab-separated file emp.txt whose rows match the emp table we just created (fields with no value, such as a missing comm, are empty):
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.00
Use the following command to load the data from emp.txt into the emp table:
load data local inpath '/home/hadoop/data/emp.txt' overwrite into table emp;
create table emp2 like emp;
create table emp3 as select * from emp;
After the load succeeds, check the data in the table
create table emp_copy as select empno,ename,job from emp;
(2). Query
show tables;
show tables 'emp*';
desc emp;
desc extended emp;
The detailed table information shown above is hard to read; use formatted for a cleaner layout (commonly used):
desc formatted emp;
show create table emp;
(3). Alter
Syntax:
ALTER TABLE table_name RENAME TO new_table_name;
Rename the table emp2 to emp_bak:
alter table emp2 rename to emp_bak;
(4). Drop
Syntax:
DROP TABLE [IF EXISTS] table_name [PURGE];
drop table if exists emp_bak;
truncate table emp_copy;
DML operations apply only to tables. Official DML documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
(1). Managed (internal) tables
create table emp_managed as select * from emp;
$ hadoop fs -ls /user/hive/warehouse/db_hive.db/
select * from TBLS;
This is the managed table emp_managed that we just created
drop table if exists emp_managed;
(2). External tables
create external table emp_external(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)row format delimited fields terminated by '\t'
location '/hive_external/emp';
$ hadoop fs -put emp.txt /hive_external/emp/
Once the upload completes, emp_external has data; check it with a SQL query
Drop the table emp_external in Hive:
drop table if exists emp_external;
$ hadoop fs -ls /hive_external/emp/
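Summary: differences between managed and external tables
Dropping a managed table deletes both the metadata in MySQL and the data on HDFS
Dropping an external table deletes the metadata in MySQL but leaves the data on HDFS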
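The drop behavior summarized above can be modeled with a short Python sketch. The two dictionaries are hypothetical in-memory stand-ins for the metastore and HDFS, and drop_table is an illustrative name, not a Hive API.

```python
def drop_table(metastore: dict, hdfs: dict, name: str) -> None:
    """Model DROP TABLE: metadata is always removed, data only for managed tables."""
    table = metastore.pop(name, None)      # remove the metadata entry
    if table is None:
        return                             # behaves like DROP TABLE IF EXISTS
    if not table["external"]:
        hdfs.pop(table["location"], None)  # managed table: delete the data too


# Hypothetical stand-ins for the metastore and the HDFS namespace.
metastore = {
    "emp_managed": {"external": False,
                    "location": "/user/hive/warehouse/db_hive.db/emp_managed"},
    "emp_external": {"external": True, "location": "/hive_external/emp"},
}
hdfs = {
    "/user/hive/warehouse/db_hive.db/emp_managed": ["000000_0"],
    "/hive_external/emp": ["emp.txt"],
}

drop_table(metastore, hdfs, "emp_managed")
drop_table(metastore, hdfs, "emp_external")
print(metastore)  # no table metadata is left
print(hdfs)       # only the external table's directory remains
```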
(1). Load data from a file into a table
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
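LOCAL: if present, the data is loaded from the local file system; if omitted (the default), it is loaded from HDFS
OVERWRITE: if present, the existing data in the table is replaced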
load data local inpath '/home/hadoop/data/emp.txt' into table emp;
Running the same load a second time appends the rows, so the table then holds 30 rows. Add overwrite to delete the existing data before loading.
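The difference between appending and overwriting can be illustrated with a small Python model. It treats a table as a plain list of rows; load_data is a hypothetical helper, and the row contents are placeholders standing in for the 15 lines of emp.txt.

```python
def load_data(table_rows: list, file_rows: list, overwrite: bool = False) -> list:
    """Model LOAD DATA: return the table contents after the load."""
    if overwrite:
        return list(file_rows)             # OVERWRITE: existing rows are replaced
    return table_rows + list(file_rows)    # default: the new rows are appended


emp_txt = [f"row-{i}" for i in range(15)]  # placeholder for the 15 rows of emp.txt

emp = load_data([], emp_txt)
emp = load_data(emp, emp_txt)              # second load without overwrite
print(len(emp))

emp = load_data(emp, emp_txt, overwrite=True)
print(len(emp))
```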
(2). Load data from HDFS into a Hive table
Create the directory /data/hive on HDFS, then upload emp.txt to it:
$ hadoop fs -mkdir -p /data/hive
$ hadoop fs -put emp.txt /data/hive/
Load the data into the emp table:
load data inpath '/data/hive/emp.txt' overwrite into table emp;
Insert rows into the emp table:
insert into emp(empno,ename,job) values(1001,'TOM','MANAGER');
insert into emp values(1002,'JOHN','CLERK',7786,'1995-8-28',1850.0,2.0,30);
Data in the emp table:
select * from emp where deptno=10;
select * from emp where ename='SMITH';
select * from emp where empno <= 7766;
Query employees whose salary is between 1000 and 1500 (between is inclusive at both ends):
select * from emp where sal between 1000 and 1500;
select * from emp limit 5;
select * from emp where ename in ('SCOTT','MARTIN');
select * from emp where comm is not null;
select count(*) from emp where deptno=10;
Result: 3
select max(sal),min(sal),avg(sal),sum(sal) from emp;
Result: 10300.0 800.0 2621.6666666666665 39325.0
select deptno,avg(sal) from emp group by deptno;
Result:
10 2916.6666666666665
20 2175.0
30 1566.6666666666667
select deptno,avg(sal) from emp group by deptno having avg(sal) > 2000;
Result:
10 2916.6666666666665
20 2175.0
sal less than or equal to 1000: show lower
sal greater than 1000 and less than or equal to 2000: show middle
sal greater than 2000 and less than or equal to 4000: show high
sal greater than 4000: show highest
select ename, sal, case when sal <= 1000 then 'lower' when sal > 1000 and sal <= 2000 then 'middle' when sal > 2000 and sal <= 4000 then 'high' else 'highest' end from emp;
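The four salary bands listed above can be checked with an equivalent Python function. As in a CASE expression, the first matching branch wins, so each later test only needs an upper bound. The name salary_band is hypothetical.

```python
def salary_band(sal: float) -> str:
    """Classify a salary into the bands described in the text."""
    if sal <= 1000:
        return "lower"
    if sal <= 2000:      # reached only when sal > 1000
        return "middle"
    if sal <= 4000:      # reached only when sal > 2000
        return "high"
    return "highest"     # everything above 4000


for ename, sal in [("SMITH", 800.0), ("ALLEN", 1600.0), ("JONES", 2975.0), ("KING", 5000.0)]:
    print(ename, sal, salary_band(sal))
```

A boundary value such as exactly 1000 is worth testing, because a strict comparison in the first branch would let it fall through to the final branch.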
Syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Export the data in emp to the local directory /home/hadoop/data/tmp:
insert overwrite local directory '/home/hadoop/data/tmp' row format delimited fields terminated by '\t' select * from emp;
Result: check the files under /home/hadoop/data/tmp
Export the data in emp to the HDFS directory /data/hive:
insert overwrite directory '/data/hive' row format delimited fields terminated by '\t' select * from emp;
Result: check on HDFS
$ hadoop fs -cat /data/hive/000000_0
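Hive supports partitioned tables, which mainly address the performance problems caused by a single table holding too much data.
Partitions in Hive come in two kinds: static partitions and dynamic partitions.
(1). Static partitioning
Static partitioning comes in two forms: single-level and multi-level.
Single-level partitioning
Create a partitioned order table: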
create table order_partition(
order_number string,
event_time string
)
partitioned by (event_month string)row format delimited fields terminated by '\t';
The file order.txt contains the following data:
10703007267488 2014-05-01 06:01:12.334+01
10101043505096 2014-05-01 07:28:12.342+01
10103043509747 2014-05-01 07:50:12.33+01
10103043501575 2014-05-01 09:27:12.33+01
10104043514061 2014-05-01 09:03:12.324+01
Load the data from order.txt into the order_partition table:
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_partition partition (event_month='2014-05');
$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive.db/order_partition/event_month=2014-06
$ hadoop fs -put /home/hadoop/data/order.txt /user/hive/warehouse/db_hive.db/order_partition/event_month=2014-06
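After the upload completes, query the table order_partition:
select * from order_partition;
The query does not return the data we just uploaded through HDFS. The file is on HDFS, but the Hive metastore has no record of the new partition yet. Run the following command to update the metadata, then query again: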
msck repair table order_partition;
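What the repair step accomplishes can be sketched as a set difference between the partition directories present on HDFS and the partitions the metastore knows about. This Python model is only an illustration; msck_repair and the two collections are hypothetical stand-ins, not Hive internals.

```python
def msck_repair(metastore_partitions: set, hdfs_dirs: list) -> set:
    """Register partition directories that the metastore does not know about.

    Returns the set of partitions that were added.
    """
    found = {d for d in hdfs_dirs if "=" in d}   # directories named col=value
    missing = found - metastore_partitions
    metastore_partitions |= missing              # update the metastore in place
    return missing


registered = {"event_month=2014-05"}                       # created by LOAD DATA ... PARTITION
on_hdfs = ["event_month=2014-05", "event_month=2014-06"]   # second one created with hadoop fs

added = msck_repair(registered, on_hdfs)
print(sorted(added))
print(sorted(registered))
```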
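For partitioned tables, a bare select * is not recommended because it scans every partition and is slow. Add a condition on the partition column so that Hive reads only the matching partition: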
select * from order_partition where event_month='2014-06';
Multi-level partitioning:
Create the table order_multi_partition
create table order_multi_partition(
order_number string,
event_time string)
partitioned by (event_month string, step string)
row format delimited fields terminated by '\t';
Load data into order_multi_partition:
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step='1');
Change step to 2 and load the data again:
load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step='2');
Inspect the directory structure on HDFS:
$ hadoop fs -ls /user/hive/warehouse/db_hive.db/order_multi_partition/event_month=2014-05
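The only difference between single-level and multi-level partitioning is that a multi-level partition maps to nested directories on HDFS.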
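The directory layout for both cases can be sketched with a small path builder. It assumes each partition column contributes one col=value path component, in the order the columns were declared. partition_dir is a hypothetical helper, and the base path is the warehouse location used in the commands above.

```python
import posixpath


def partition_dir(table_dir: str, spec: dict) -> str:
    """Build the HDFS directory for a partition spec (one level per column)."""
    parts = [f"{col}={val}" for col, val in spec.items()]
    return posixpath.join(table_dir, *parts)


base = "/user/hive/warehouse/db_hive.db"

# Single-level: one subdirectory under the table directory.
print(partition_dir(base + "/order_partition", {"event_month": "2014-05"}))

# Multi-level: one nested subdirectory per partition column.
print(partition_dir(base + "/order_multi_partition",
                    {"event_month": "2014-05", "step": "1"}))
```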
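(2). Dynamic partitioning
Partitions are static by default. To use dynamic partitioning, set the parameters below. Here they are set for the current session only; you can also put them in the configuration file (hive-site.xml) to make them permanent. The session settings are:
Enable dynamic partitioning (the default was false before Hive 0.9.0 and is true from 0.9.0 onward):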
set hive.exec.dynamic.partition=true;
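Set the dynamic partition mode. The default is strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic.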
set hive.exec.dynamic.partition.mode=nonstrict;
Create the table student
create table student(
id int,
name string,
tel string,
age int
)row format delimited fields terminated by '\t';
The file student.txt contains:
1001 zhangsan 18310982765 18
1002 lisi 15028761946 20
1003 wangwu 13018909873 25
Load the contents of student.txt into the student table:
load data local inpath '/home/hadoop/data/student.txt' overwrite into table student;
Create the partitioned table stu_age_partition
create table stu_age_partition(
id int,
name string,
tel string
)
partitioned by (age int)row format delimited fields terminated by '\t';
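Insert the rows of student into stu_age_partition, partitioned by age. If student held a lot of data, inserting the rows one statement at a time would be very inconvenient, so this is where Hive's dynamic partitioning helps: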
insert into table stu_age_partition partition (age) select id,name,tel,age from student;
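Conceptually, a dynamic-partition insert routes each selected row to a partition chosen by the value of its trailing column. The Python sketch below models that grouping with the three rows of student.txt; dynamic_partition_insert is a hypothetical name, and the dictionary stands in for the partition directories.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, partition_col="age"):
    """Group rows by the last selected column, which acts as the partition value."""
    partitions = defaultdict(list)
    for *data, part_value in rows:
        # The partition column is not stored in the data; it becomes the directory name.
        partitions[f"{partition_col}={part_value}"].append(tuple(data))
    return dict(partitions)


student = [
    (1001, "zhangsan", "18310982765", 18),
    (1002, "lisi", "15028761946", 20),
    (1003, "wangwu", "13018909873", 25),
]

result = dynamic_partition_insert(student)
for directory, data in sorted(result.items()):
    print(directory, data)
```

Because the partition value comes from the data, the partition column has to be the last column in the select list, as age is in the statement above.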
When querying, add a condition on the partition column where possible for better performance:
select * from stu_age_partition;
select * from stu_age_partition where age > 20;
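All operations so far were run in hive1. The installed version, Hive 2.3.0, supports both hive1 and hive2 (HiveServer2), so the following examples use hive2. Compared with hive1, hive2 removes the single point of failure and can be deployed as a highly available Hive cluster, and its output is clearer and easier to read (similar to the mysql shell).
To use hive2, first start the hive2 service with the following command: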
hiveserver2
Once it is running, connect to hive2 with the following command and start working:
beeline -u jdbc:hive2://
create table tb_array(
name string,
work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';
The file hive_array.txt contains:
zhangsan beijing,shanghai,tianjin,hangzhou
lisi changchu,chengdu,wuhan
load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table tb_array;
select name,work_locations[0] from tb_array where name='zhangsan';
select name,size(work_locations) from tb_array;
create table tb_map(
name string,
scores map<string,int>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';
The file hive_map.txt contains:
zhangsan math:80,chinese:89,english:95
lisi chinese:60,math:80,english:99
load data local inpath '/home/hadoop/data/hive_map.txt' overwrite into table tb_map;
select name,scores['english'] from tb_map;
Query every student's English and math scores:
select name,scores['english'],scores['math'] from tb_map;
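How the three delimiters split a row of hive_map.txt into a name and a map can be sketched in Python. This is an illustrative parser under the delimiters declared for tb_map (tab between fields, comma between entries, colon between key and value); parse_map_row is a hypothetical helper, not part of Hive.

```python
def parse_map_row(line: str, field_sep: str = "\t",
                  item_sep: str = ",", key_sep: str = ":"):
    """Split one text row into (name, scores) using the table's delimiters."""
    name, raw_scores = line.split(field_sep)
    scores = {}
    for item in raw_scores.split(item_sep):   # entries: "math:80", "chinese:89", ...
        key, value = item.split(key_sep)      # key and value within one entry
        scores[key] = int(value)              # map<string,int>
    return name, scores


name, scores = parse_map_row("zhangsan\tmath:80,chinese:89,english:95")
print(name, scores["english"], scores["math"])
```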
Create a table with a struct column:
create table tb_struct(
ip string,
userinfo struct<name:string,age:int>
)row format delimited fields terminated by '#'
collection items terminated by ':';
Load the data from hive_struct.txt into the table tb_struct
The file hive_struct.txt contains:
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70
load data local inpath '/home/hadoop/data/hive_struct.txt' overwrite into table tb_struct;
Query the name and age:
select userinfo.name,userinfo.age from tb_struct;
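Unlike a map, a struct assigns its delimited values to named fields by position. The Python sketch below parses one row of hive_struct.txt on that basis. It assumes the struct has a string field name followed by an integer field age, as the query above suggests; UserInfo and parse_struct_row are hypothetical names.

```python
from typing import NamedTuple


class UserInfo(NamedTuple):
    """Assumed struct layout: name (string) followed by age (int)."""
    name: str
    age: int


def parse_struct_row(line: str, field_sep: str = "#", item_sep: str = ":"):
    """Split one text row into (ip, userinfo) using the table's delimiters."""
    ip, raw = line.split(field_sep)
    name, age = raw.split(item_sep)     # values are matched to fields by position
    return ip, UserInfo(name=name, age=int(age))


ip, userinfo = parse_struct_row("192.168.1.1#zhangsan:40")
print(ip, userinfo.name, userinfo.age)
```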