Data Warehouse - Hive Basics (5): Basic Hive Operations

1. Database operations

1.1 Creating a database

create database if not exists myhive; 

use myhive;

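Note: the default storage location for Hive tables is set by a property in hive-site.xml:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>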

1.2 Creating a database at a specified location

create database myhive2 location '/myhive2';
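
To make the relationship between the warehouse directory and a database's location concrete, here is a minimal Python sketch (a model, not Hive code). It assumes Hive's usual convention of placing a database created without a location clause in a `<name>.db` directory under hive.metastore.warehouse.dir; the function name `database_location` is hypothetical.

```python
import posixpath
from typing import Optional

# Value of hive.metastore.warehouse.dir used in this article.
WAREHOUSE_DIR = "/user/hive/warehouse"


def database_location(db_name: str, location: Optional[str] = None) -> str:
    """Return the HDFS directory a database would use (illustrative model only)."""
    if location is not None:
        # An explicit LOCATION clause wins over the warehouse default.
        return location
    # Assumed default layout: <warehouse dir>/<database name>.db
    return posixpath.join(WAREHOUSE_DIR, db_name.lower() + ".db")


print(database_location("myhive"))               # /user/hive/warehouse/myhive.db
print(database_location("myhive2", "/myhive2"))  # /myhive2
```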

1.3 Setting database key-value properties

A database can carry descriptive key-value properties, which are added when it is created:

create database foo with dbproperties ('owner'='itcast', 'date'='20190120');

View the database's key-value properties:

describe database extended foo;

Modify the database's key-value properties:

alter database foo set dbproperties ('owner'='itheima');

1.4 Viewing more detailed database information

desc database extended myhive2;

1.5 Dropping a database

Drop an empty database. If the database still contains tables, the statement fails with an error.

drop database myhive2;

Force-drop a database together with all the tables it contains:

drop database myhive cascade;
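
The difference between the two drop statements can be modeled in a few lines of Python. This is only a toy model of the rule described above, not Hive's implementation; `catalog`, `drop_database` and `DatabaseNotEmptyError` are hypothetical names.

```python
class DatabaseNotEmptyError(Exception):
    """Raised when a non-empty database is dropped without cascade."""


def drop_database(catalog, name, cascade=False):
    """Toy model: catalog maps a database name to the list of its tables."""
    tables = catalog[name]
    if tables and not cascade:
        # Mirrors the error a plain "drop database" gives when tables remain.
        raise DatabaseNotEmptyError(f"database {name} is not empty: {tables}")
    # With cascade, the tables are removed together with the database.
    del catalog[name]


catalog = {"myhive": ["stu", "stu2"], "myhive2": []}
drop_database(catalog, "myhive2")               # empty database: succeeds
drop_database(catalog, "myhive", cascade=True)  # non-empty: needs cascade
print(catalog)  # {}
```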

2. Table operations

create [external] table [if not exists] table_name (
col_name data_type [comment 'column description'],
col_name data_type [comment 'column description'], ...)
[comment 'table description']
[partitioned by (col_name data_type, ...)]
[clustered by (col_name, col_name, ...)
[sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location 'table path']

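Notes:

create table

Creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses that exception.

external

Lets the user create an external table and, at creation time, specify a path (LOCATION) that points to the actual data. For a managed (internal) table, Hive moves the data into the path that the data warehouse points to; for an external table, Hive only records where the data lives and does not change its location. When a table is dropped, a managed table's metadata and data are deleted together, whereas for an external table only the metadata is deleted and the data is kept.

comment

Adds a comment. With the default configuration, Chinese text cannot be used in comments.

partitioned by

Declares a partitioned table. A table can have one or more partitions, and each partition is stored in its own directory.

clustered by

For each table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data range. Bucketing is also organized on a chosen column.

sorted by

Specifies the sort columns and the sort order.

row format

Specifies the field delimiter used in the table's files.

stored as

Specifies the storage format of the table's files. Common formats are SEQUENCEFILE, TEXTFILE and RCFILE. If the data is plain text, use STORED AS TEXTFILE; if the data needs to be compressed, use STORED AS SEQUENCEFILE.

location

Specifies the storage path of the table's files.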

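3. Managed (internal) table operations

If the external keyword is not used when a table is created, the table is a managed table (also called an internal table).

Hive column data types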

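Category	Type	Description	Literal example
Primitive types	BOOLEAN	true/false	TRUE
	TINYINT	1-byte signed integer, -128 to 127	1Y
	SMALLINT	2-byte signed integer, -32768 to 32767	1S
	INT	4-byte signed integer	1
	BIGINT	8-byte signed integer	1L
	FLOAT	4-byte single-precision floating point	1.0
	DOUBLE	8-byte double-precision floating point	1.0
	DECIMAL	Arbitrary-precision signed decimal	1.0
	STRING	Variable-length string	"a", 'b'
	VARCHAR	Variable-length string	"a", 'b'
	CHAR	Fixed-length string	"a", 'b'
	BINARY	Byte array	(no literal form)
	TIMESTAMP	Timestamp (Hive supports up to nanosecond precision)	122327493795
	DATE	Date	'2016-03-29'
	INTERVAL	Time interval
Complex types	ARRAY	Ordered collection of elements of the same type	array(1,2)
	MAP	Key-value pairs; keys must be primitive types, values can be any type	map('a',1,'b',2)
	STRUCT	Collection of fields, which may have different types	struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
	UNION	A value that is one of a limited set of types	create_union(1,'a',63)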
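
The integer ranges in the table follow directly from the byte widths. The short Python sketch below derives them; the helper name `signed_range` is hypothetical and the code only illustrates the arithmetic, it is not part of Hive.

```python
# Byte widths of Hive's integer types, as listed in the table above.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) of a two's-complement signed integer of this width."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for type_name, width in INTEGER_WIDTHS.items():
    low, high = signed_range(width)
    print(f"{type_name}: {low} to {high}")
```

For TINYINT this gives -128 to 127 and for SMALLINT -32768 to 32767, matching the table.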

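Getting started with table creation:

-- Switch to the database created earlier
use myhive;

-- Create the student table
create table stu(id int,name string);

-- Insert one row (insert runs as a MapReduce job, so it is slow)
insert into stu values (1,"zhangsan");

-- Query the contents of the table
select * from stu;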

Create a table and specify the delimiter between fields

create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t';
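
Why the delimiter matters can be seen with a small Python sketch of how a text line is cut into fields. It is a simplified model, not Hive's actual reader; `split_fields` is a hypothetical helper. Hive's default field delimiter for text files is the control character \001 (Ctrl-A), so a tab-separated file read without the `fields terminated by '\t'` clause is not split at the tabs.

```python
# Hive's default field delimiter for text tables: Ctrl-A (\001).
HIVE_DEFAULT_DELIMITER = "\x01"


def split_fields(line, delimiter=HIVE_DEFAULT_DELIMITER):
    """Split one text line into raw field strings (simplified model)."""
    return line.rstrip("\n").split(delimiter)


line = "1\tzhangsan\n"

# With the declared tab delimiter the line yields two fields: id and name.
print(split_fields(line, "\t"))

# With the default delimiter the whole line stays in a single field.
print(split_fields(line))
```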

Create a table and specify where its files are stored

create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' location '/user/stu2';

Create a table from a query result

create table stu3 as select * from stu2; -- creates a new table by copying both the structure and the data

Create a table from the structure of an existing table

create table stu4 like stu;

Show detailed information about a table

desc formatted stu2;

Drop a table

drop table stu4;

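4. External table operations

About external tables

Because an external table loads data that lives at some other HDFS path, Hive does not consider itself the sole owner of that data. When the table is dropped, the data therefore stays in HDFS and is not deleted.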
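
The drop semantics can be summarized with a toy Python model. It is a sketch of the rule only: the dictionaries, the file paths and the function `drop_table` are all hypothetical and do not reflect how the metastore is actually implemented.

```python
def drop_table(metastore, hdfs_files, table_name):
    """Toy model of DROP TABLE: metadata always goes, data only if managed."""
    table = metastore.pop(table_name)  # the metadata is removed in both cases
    if not table["external"]:
        # Managed table: Hive owns the data, so the files are deleted too.
        prefix = table["location"] + "/"
        hdfs_files.difference_update({p for p in hdfs_files if p.startswith(prefix)})


# Hypothetical metadata and file listing, for illustration only.
metastore = {
    "stu": {"external": False, "location": "/user/hive/warehouse/myhive.db/stu"},
    "teacher": {"external": True, "location": "/data/teacher"},
}
hdfs_files = {
    "/user/hive/warehouse/myhive.db/stu/000000_0",
    "/data/teacher/part-0.csv",
}

drop_table(metastore, hdfs_files, "stu")
drop_table(metastore, hdfs_files, "teacher")
print(metastore)   # {}
print(hdfs_files)  # {'/data/teacher/part-0.csv'}
```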

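When to use managed and external tables

Website logs are collected every day and flow periodically into text files on HDFS. Heavy statistical analysis is performed on top of an external table (the raw log table). The intermediate and result tables used along the way are stored as managed tables, and data enters them via SELECT + INSERT.

Worked example

Create external tables for teachers and students, then load data into them.

Create the teacher table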

create external table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

Create the student table

create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';

Load data

load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data

load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

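Load data from HDFS into the table (the data must be uploaded to HDFS first)

# Change to the directory that holds the file
cd /export/servers/hivedatas

# Create the target directory on HDFS and upload the file
hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put techer.csv /hivedatas/

-- In Hive: load the data into the table
load data inpath '/hivedatas/techer.csv' into table teacher;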

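5. Partitioned table operations

One of the most common ideas in big data is divide and conquer: split a large file into many small files so that each operation only has to touch a small one. Hive supports the same idea: large data sets can be split by month or by day into smaller files that are stored in separate directories.

Syntax for creating a partitioned table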

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

Load data into a partitioned table

load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Load data into a table with multiple partition columns

load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');
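
Each partition clause maps to a chain of `column=value` directories under the table's directory. The Python sketch below builds that path; `partition_path` is a hypothetical helper, and the table directory shown assumes the default warehouse layout for the myhive database rather than anything stated in the statements above.

```python
import posixpath


def partition_path(table_dir, partition_spec):
    """Build the directory for a partition from ordered (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir, *parts)


# Assumed table directory under the default warehouse location.
table_dir = "/user/hive/warehouse/myhive.db/score2"
spec = [("year", "2018"), ("month", "06"), ("day", "01")]

print(partition_path(table_dir, spec))
# /user/hive/warehouse/myhive.db/score2/year=2018/month=06/day=01
```

The order of the pairs follows the order of the columns in the partitioned by clause, which is why the spec is a list rather than a dictionary.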

Query across partitions (using union all)

select * from score where month = '201806' union all select * from score where month = '201806';

Show partitions

show partitions score;

Add a partition

alter table score add partition(month='201805');

Drop a partition

alter table score drop partition(month = '201806');

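6. Partitioned table exercise

There is a file score.csv stored on the cluster in the directory /scoredatas/month=201806. A new file is generated every day and placed in the directory for the corresponding date. Other users share these files, so they must not be moved. Requirement: create a matching Hive table, load the data into it for statistical analysis, and make sure that dropping the table does not delete the data.

Data preparation: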

hdfs dfs -mkdir -p /scoredatas/month=201806 
hdfs dfs -put score.csv /scoredatas/month=201806/

Create an external partitioned table and specify the directory that holds the data files

create external table score4(s_id string, c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';

Repair the table (this builds the mapping between the table and its data files)

msck repair table score4;
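
Conceptually, the repair step has to recognize `column=value` directory names under the table location and turn them into partitions. The Python sketch below shows only that name-parsing idea on a list of paths; it is not how MSCK is implemented, and `discover_partitions` is a hypothetical function.

```python
def discover_partitions(table_dir, file_paths):
    """Derive partition specs from column=value directories under table_dir."""
    prefix = table_dir.rstrip("/") + "/"
    found = set()
    for path in file_paths:
        if not path.startswith(prefix):
            continue  # file does not belong to this table
        # Keep the directory components only (drop the file name).
        directories = path[len(prefix):].split("/")[:-1]
        spec = tuple(tuple(d.split("=", 1)) for d in directories if "=" in d)
        if spec:
            found.add(spec)
    return sorted(found)


paths = ["/scoredatas/month=201806/score.csv"]
print(discover_partitions("/scoredatas", paths))
# [(('month', '201806'),)]
```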

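7. Bucketed table operations

Bucketing splits the data into multiple files according to a specified column. It is the same idea as partitioning in MapReduce.

Enable bucketing in Hive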

set hive.enforce.bucketing=true;

Set the number of reducers

set mapreduce.job.reduces=3;

Create a bucketed table

create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

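Loading data into a bucketed table: neither hdfs dfs -put nor load data works properly for a bucketed table, so the data can only be loaded with insert overwrite.

Create an ordinary table, then use insert overwrite with a query to load the ordinary table's data into the bucketed table.

Create the ordinary table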

create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

Load data into the ordinary table

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

Load data into the bucketed table with insert overwrite

insert overwrite table course select * from course_common cluster by(c_id);
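
The reason a query is needed is that each row has to be routed to a bucket by hashing the bucketing column and taking the result modulo the number of buckets. The Python sketch below illustrates that idea with a Java-style string hash. This is an assumption made for illustration: the hash Hive really uses depends on the Hive version and the column type, and the c_id values shown are made up.

```python
def java_style_hash(text):
    """Illustrative 32-bit hash in the style of Java's String.hashCode."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Pick a bucket: non-negative hash modulo the number of buckets."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


# Hypothetical c_id values spread over the 3 buckets of the course table.
for c_id in ["01", "02", "03"]:
    print(c_id, "-> bucket", bucket_for(c_id, 3))
```

Rows with the same c_id always land in the same bucket, which is what makes bucketed tables useful for sampling and joins.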

Altering table structure

Renaming:

alter table old_table_name rename to new_table_name;

Rename table score4 to score5

alter table score4 rename to score5;

Adding/modifying columns:

Show the table structure

desc score5;

Add columns

alter table score5 add columns (mycol string, mysco int);

Change a column

alter table score5 change column mysco mysconew int;

Drop the table

drop table score5;

Loading data into Hive tables

Insert data directly into a partitioned table

create table score3 like score;

insert into table score3 partition(month ='201807') values ('001','002','100');

Load data with load

load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

Load data with a query

create table score4 like score; 

insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;
