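Working directly with MapReduce has a high learning cost: developers need to know the Java language.
Implementing complex query logic in MapReduce is too difficult.
Hive's interface uses SQL-like syntax, which enables rapid development (simple and easy to pick up).
It avoids writing MapReduce by hand, which lowers the learning cost for developers.
It supports user-defined functions, so functionality is easy to extend.
It is built on Hadoop and is well suited to storing and analyzing massive datasets.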
Type | Example |
---|---|
TINYINT | 10 |
INT | 10 |
SMALLINT | 10 |
BIGINT | 100L |
FLOAT | 1.342 |
DOUBLE | 1.234 |
BINARY | 1010 |
BOOLEAN | TRUE |
DECIMAL | 3.14 |
CHAR | 'book' or "book" |
STRING | 'book' or "book" |
VARCHAR | 'book' or "book" |
DATE | '2023-02-27' |
TIMESTAMP | '2023-02-27 00:00:00' |
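As a rough illustration of how some of the values in the table can be written in a query, the sketch below uses Hive's integer literal suffixes and `cast`. The column aliases are made up for this example, and it assumes a Hive version that accepts `select` without a `from` clause (0.13 or later).

```sql
-- Integer literals are INT by default; the Y, S and L suffixes
-- mark TINYINT, SMALLINT and BIGINT literals.
select 10Y  as tiny_value,
       10S  as small_value,
       10   as int_value,
       100L as big_value;

-- Dates, timestamps and decimals can be produced with an explicit cast.
select cast('2023-02-27' as date)               as d,
       cast('2023-02-27 00:00:00' as timestamp) as ts,
       cast(3.14 as decimal(10,2))              as dec_value;
```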
Type | Format | Definition | Example |
---|---|---|---|
ARRAY | ['apple','hive','orange'] | ARRAY<string> | a[0]='apple' |
MAP | {'a':'apple','o':'orange'} | MAP<string,string> | b['a']='apple' |
STRUCT | {'apple',2} | STRUCT<fruit:string,weight:int> | c.weight=2 |
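To make the access syntax in the table concrete, here is a minimal HiveQL sketch. The table name `fruit_demo` is hypothetical, and the column names a, b and c simply mirror the examples in the table.

```sql
-- Hypothetical table with one column of each complex type.
create table if not exists fruit_demo
(
    a array<string>,
    b map<string,string>,
    c struct<fruit:string,weight:int>
);

-- Array elements are addressed by zero-based index, map values by key,
-- and struct fields with dot notation.
select a[0]     as first_item,
       b['a']   as value_for_key_a,
       c.weight as weight
from fruit_demo;
```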
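A database appears in HDFS as a folder under the directory set by the hive.metastore.warehouse.dir property in the configuration file.
If no database is specified, the default database is used.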
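-- Create a database
create database if not exists db_name;
-- Switch to a database
use db_name;
-- List databases
show databases;
-- Show database details
describe database db_name;
-- Change the owner of a database
alter database db_name set owner user user_name;
-- Force-delete a database (cascade). Without cascade, the drop fails if the database still contains tables.
drop database if exists db_name cascade;
-- Show the current database
select current_database();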
create [temporary] [external] table [if not exists] [db_name.]table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type [comment col_comment], ...)]
[clustered by (col_name, col_name, ...)
  [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location hdfs_path]
[tblproperties (property_name=property_value, ...)]
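External tables are the counterpart of internal (managed) tables. An internal table means Hive takes full ownership of the table, including both the metadata and the data in HDFS. An external table means Hive manages only the metadata and does not take full ownership of the data in HDFS.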
create external table studentwb1
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/tmp/hivedata/student';
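One way to observe the difference between the two table kinds is sketched below, using the studentwb1 table just defined. It is only an illustration; run the `drop` only if losing the table definition is acceptable.

```sql
-- The "Table Type" field in the output reads EXTERNAL_TABLE for an
-- external table and MANAGED_TABLE for an internal one.
describe formatted studentwb1;

-- Dropping an external table removes only the metadata;
-- the files under /tmp/hivedata/student stay in HDFS.
-- Dropping an internal table would delete its data files as well.
drop table if exists studentwb1;
```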
A temporary table is visible only in the current session; when the session ends, the table is dropped.
create temporary table studentwb1
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n'
location '/tmp/hivedata/student';
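-- Load data from the local file system
load data local inpath '/opt/student.txt' into table table_name;
-- Load data from HDFS (for an internal table, the file is moved into the table directory)
load data inpath '/opt/student.txt' into table table_name; -- append
load data inpath '/opt/student.txt' overwrite into table table_name; -- overwrite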
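To see how the delimiters declared in the table definitions map onto a line of the data file, consider the sketch below. The sample line and the map key `home` are invented for illustration; the real contents of /opt/student.txt are not shown in these notes.

```sql
-- A hypothetical line of the input file:
--   1,xiaoming,reading-music,home:beijing-work:shanghai
-- ','  separates the four fields (id, name, likes, address)
-- '-'  separates array elements and map entries
-- ':'  separates each map key from its value

-- With the line above loaded into studentwb1, this would return
-- first_like = 'reading' and home_city = 'beijing'.
select id,
       name,
       likes[0]        as first_like,
       address['home'] as home_city
from studentwb1;
```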
-- Create a partitioned table
create table student_pt
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
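-- Load a file directly into a specified (static) partition
load data inpath "/tmp/hivedata/student/student2.txt" into table student_pt partition (age = 20);
-- Load data from another table into the partitioned table
-- Dynamic partitioning (can also be set in the configuration file)
set hive.exec.dynamic.partition=true; -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- allow every partition column to be dynamic
-- source_table is a placeholder for the table to copy from; the partition column (age) must come last in the select list
insert into table student_pt partition (age) select id,name,likes,address,age from source_table;
-- List partitions
show partitions table_name;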
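A few more partition operations on the student_pt table are sketched below as an illustration. The partition value 21 is arbitrary and chosen only for this example.

```sql
-- Filtering on the partition column lets Hive read only the
-- age=20 directory instead of scanning the whole table.
select id, name from student_pt where age = 20;

-- Partitions can also be added and dropped explicitly.
alter table student_pt add if not exists partition (age = 21);
alter table student_pt drop if exists partition (age = 21);
```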
-------------------------------------- Bucketed table ---------------------------
------------------------- Source data -------------------------------
create table employee_id(
name string,
employee_id int,
work_place array<string>,
gender_age struct<gender:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
-- Load data
load data local inpath '/opt/stufile/employee_id.txt' overwrite into table employee_id;
------------------------------- Create the bucketed table --------------------------------------
create table employee_id_buckets(
name string,
employee_id int,
work_place array<string>,
gender_age struct<gender:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
clustered by (employee_id) into 2 buckets -- bucket by employee_id into 2 buckets
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';
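-- Set the number of reduce tasks to 2 (matching the number of buckets)
set mapreduce.job.reduces=2;
set hive.enforce.bucketing=true; -- enforce bucketing on insert (only needed before Hive 2.0, where it is always on)
insert overwrite table employee_id_buckets select * from employee_id;
-- Sample about 20% of the data
select * from employee_id_buckets tablesample ( 20 percent );
-- Sample 10 rows (the row count applies to each input split)
select * from employee_id_buckets tablesample ( 10 rows );
-- Bucket sampling
select * from employee_id_buckets tablesample ( bucket 5 out of 6 on rand());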
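How bucket sampling works:
Assume the bucketed table has z buckets in total (2 in this example).
x: the bucket number at which sampling starts (in the statement above, sampling starts from bucket 5).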
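For comparison with the `rand()` example, the sketch below samples on the clustering column instead. In `tablesample (bucket x out of y on expr)`, Hive hashes `expr` into y groups and returns the rows that fall into group x, so roughly 1/y of the table comes back. This is only an illustration on the employee_id_buckets table defined earlier.

```sql
-- y equals the number of buckets (2): returns the rows of the first bucket.
select * from employee_id_buckets tablesample (bucket 1 out of 2 on employee_id);

-- y is a multiple of the bucket count: returns roughly half of the first bucket.
select * from employee_id_buckets tablesample (bucket 1 out of 4 on employee_id);
```

Sampling on the clustering column lets Hive use the existing bucket files, whereas `on rand()` assigns rows to groups at random and has to read the whole table.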
create table ctas_student as select * from student;
create table cte_table as
with
t1 as (select * from student2 where age>25),
t2 as (select * from t1 where name = 'xiaoming9'),
t3 as (select * from student2 where age<25)
select * from t2 union all select * from t3;
-- Query using with (a CTE)
with
t1 as (select * from student2 where age>25),
t2 as (select * from t1 where name = 'xiaoming9'),
t3 as (select * from student2 where age<25)
select * from t2 union all select * from t3;
The table created below has the same structure as the source table but contains no data.
create table student_like like student;
export and import can be used to migrate data between two Hive instances.
Export syntax:
export table tablename to 'export_target_path'
Import syntax:
import [external] table new_or_original_tablename from 'source_path' [location 'import_target_path']
Example:
-- Export data to the HDFS file system
export table table_name to '/outstudentpt';
-- Import (equivalent to creating a new table)
use bigdata;
import table studentpt from '/outstudentpt';
-- Truncate the table (remove all rows)
truncate table table_name;