User interface: Client
Metadata: Metastore
Hadoop
Driver
Through the interactive interfaces it exposes, Hive receives the user's SQL statements, uses its Driver together with the metadata in the Metastore to translate those statements into MapReduce jobs, submits the jobs to Hadoop for execution, and finally returns the results to the user interface.
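As a minimal sketch (it assumes the emp table created later in this document and the default MapReduce execution engine), a simple aggregate query is compiled into a MapReduce job and the CLI reports that job while it runs:

select deptno, count(*) from emp group by deptno;
-- the console then prints the generated job's progress, e.g.
-- Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1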
Because Hive uses HQL, a SQL-like query language, it is easy to mistake Hive for a database. Structurally, however, apart from the similar query language, Hive has little in common with a database.
|                | Hive                                   | Database                               |
| -------------- | -------------------------------------- | -------------------------------------- |
| Query language | HQL                                    | SQL                                    |
| Data storage   | HDFS                                   | Block devices or the local file system |
| Data updates   | Does not support modifying or adding individual rows | Supports modifying and adding rows |
| Execution      | MapReduce                              | Executor                               |
| Latency        | High                                   | Low                                    |
| Scalability    | High                                   | Low                                    |
| Data volume    | Large                                  | Small                                  |
| Indexes        | Bitmap indexes added in version 0.8    | Many index types supported             |
| Transactions   | Supported since version 0.14           | Supported                              |
| Hive data type | Java data type | Length / description                                                                 | Example |
| -------------- | -------------- | ------------------------------------------------------------------------------------ | ------- |
| TINYINT        | byte           | 1-byte signed integer                                                                | 20      |
| SMALLINT       | short          | 2-byte signed integer                                                                | 20      |
| INT            | int            | 4-byte signed integer                                                                | 20      |
| BIGINT         | long           | 8-byte signed integer                                                                | 20      |
| BOOLEAN        | boolean        | Boolean, true or false                                                               | TRUE, FALSE |
| FLOAT          | float          | Single-precision floating point                                                      | 3.14159 |
| DOUBLE         | double         | Double-precision floating point                                                      | 3.14159 |
| STRING         | string         | Character sequence; a character set can be specified; single or double quotes may be used | 'now is the time', "for all good men" |
| TIMESTAMP      |                | Time type                                                                            |         |
| BINARY         |                | Byte array                                                                           |         |
| Data type | Description | Syntax example |
| --------- | ----------- | -------------- |
| STRUCT | Like a struct in C; elements are accessed with dot notation. For example, if a column's type is STRUCT{first STRING, last STRING}, the first element is referenced as column.first. | struct(), e.g. struct<first:string, last:string> |
| MAP | A set of key-value pairs accessed with array notation. For example, if a column's type is MAP and it holds the pairs 'first'->'John' and 'last'->'Doe', the last element is referenced as column_name['last']. | map(), e.g. map<string, string> |
| ARRAY | A collection of elements of the same type, numbered starting from 0. For example, if the array value is ['John', 'Doe'], the second element is referenced as array_name[1]. | Array(), e.g. array<string> |
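A hedged sketch that ties the three collection types together (the table name, columns, delimiters, and sample accesses are illustrative, not taken from the text above):

create table test_complex(
    name     string,
    friends  array<string>,                      -- ARRAY access: friends[1]
    children map<string, int>,                   -- MAP access: children['xiaoming']
    address  struct<street:string, city:string>  -- STRUCT access: address.city
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

select friends[1], children['xiaoming'], address.city from test_complex;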
CREATE DATABASE [IF NOT EXISTS] database_name   --database name
[COMMENT database_comment]                      --description of the database
[LOCATION hdfs_path]                            --HDFS directory backing the database
[WITH DBPROPERTIES (property_name=property_value, ...)];  --database properties
hive (default)> create database db_hive;
hive (default)> create database if not exists db_hive;
hive (default)> create database db_hive2 location '/db_hive2.db';
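The COMMENT, LOCATION, and DBPROPERTIES clauses from the syntax above can be combined in one statement; a sketch (the database name, comment, and property value are illustrative):

create database if not exists db_hive3
comment 'demo database'
location '/db_hive3.db'
with dbproperties('createtime'='20170830');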
hive> show databases;
hive> show databases like 'db_hive*';
OK
db_hive
db_hive_1
hive> desc database db_hive;
OK
db_hive hdfs://hadoop102:9820/user/hive/warehouse/db_hive.db
WovJ
hive> desc database extended db_hive;
OK
db_hive hdfs://hadoop102:9820/user/hive/warehouse/db_hive.db
WovJ
hive (default)> use db_hive;
hive (default)> alter database db_hive set dbproperties('createtime'='20170830');
hive> desc database extended db_hive;
db_name comment location owner_name owner_type parameters
db_hive hdfs://hadoop102:9820/user/hive/warehouse/db_hive.db
WovJ USER {createtime=20170830}
hive>drop database db_hive2;
hive> drop database db_hive;
FAILED: SemanticException [Error 10072]: Database does not exist: db_hive
hive> drop database if exists db_hive2;
hive> drop database db_hive cascade;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name --table name; EXTERNAL creates an external table, otherwise a managed (internal) table
[(col_name data_type [COMMENT col_comment], ...)] --column names, types, and comments
[COMMENT table_comment] --table description
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] --partition columns (a partition is a subdirectory)
[CLUSTERED BY (col_name, col_name, ...) --bucketing columns (a bucket splits the data files)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] --sort columns (rarely used) and the number of buckets
[ROW FORMAT delimited fields terminated by delimiter] --field delimiter within each row
[collection items terminated by delimiter] --delimiter between collection elements
[map keys terminated by delimiter] --delimiter between a map key and its value
[lines terminated by delimiter] --line delimiter
[STORED AS file_format] --storage format of the table data (textfile, orc, parquet)
[LOCATION hdfs_path] --HDFS directory backing the table
[TBLPROPERTIES (property_name=property_value, ...)] --table properties
[AS select_statement]
The clauses are explained in the inline comments above. Managed (internal) vs. external tables:

|                          | Creation | Storage location | Dropping the table | Intended use |
| ------------------------ | -------- | ---------------- | ------------------ | ------------ |
| Managed (internal) table | create table table_name ... | Managed by Hive; defaults to /user/hive/warehouse | Deletes both the metadata and the data | Data owned and managed by Hive for long-term use |
| External table           | create external table table_name ... location path ... | Anywhere; the location keyword specifies the directory | Deletes only the metadata; the data stays on HDFS | Temporarily linking to data that lives outside Hive |
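A sketch of an external table at an explicit HDFS location (the table name and path are illustrative). Dropping it removes only the metadata; the files under /student remain on HDFS:

create external table if not exists student_ext(
    id   int,
    name string
)
row format delimited fields terminated by '\t'
location '/student';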
SELECT [ALL | DISTINCT] select_expr, select_expr, ... --which columns to return
FROM table_reference --which table to query
[WHERE where_condition] --row filter
[GROUP BY col_list] --grouping
[ORDER BY col_list] --global sort
[CLUSTER BY col_list --distribute and sort by the same column
 | [DISTRIBUTE BY col_list] --distribute rows across reducers
   [SORT BY col_list] --sort within each reducer
]
[LIMIT number] --limit the number of rows returned
dept:
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
emp:
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/datas/dept.txt' into table
dept;
load data local inpath '/opt/module/datas/emp.txt' into table emp;
select * from emp;
select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;
select empno,ename from emp;
select ename AS name,deptno dn from emp;
| Operator | Description            |
| -------- | ---------------------- |
| A+B      | A plus B               |
| A-B      | A minus B              |
| A*B      | A times B              |
| A/B      | A divided by B         |
| A%B      | A modulo B             |
| A&B      | Bitwise AND of A and B |
| A\|B     | Bitwise OR of A and B  |
| A^B      | Bitwise XOR of A and B |
| ~A       | Bitwise NOT of A       |
select sal + 1 from emp;
select count(*) cnt from emp;
select max(sal) max_sal from emp;
select min(sal) min_sal from emp;
select sum(sal) sum_sal from emp;
select avg(sal) avg_sal from emp;
select * from emp limit 5;   --return the first 5 rows
select * from emp limit 2,3; --skip the first 2 rows, then return the next 3
select * from emp where sal>1000;
| Operator | Supported types | Description |
| -------- | --------------- | ----------- |
| A=B | Primitive types | Returns TRUE if A equals B, otherwise FALSE |
| A<=>B | Primitive types | Returns TRUE if A and B are both NULL, FALSE if only one of them is NULL; otherwise behaves like = |
| A<>B, A!=B | Primitive types | Returns NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE |
| A<B | Primitive types | Returns NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE |
| A<=B | Primitive types | Returns NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE |
| A>B | Primitive types | Returns NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE |
| A>=B | Primitive types | Returns NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE |
| A [NOT] BETWEEN B AND C | Primitive types | Returns NULL if any of A, B, or C is NULL; TRUE if A is between B and C inclusive, otherwise FALSE. NOT reverses the result. |
| A IS NULL | All types | Returns TRUE if A is NULL, otherwise FALSE |
| A IS NOT NULL | All types | Returns TRUE if A is not NULL, otherwise FALSE |
| IN (value1, value2, ...) | All types | Returns TRUE if A equals any value in the list |
| A [NOT] LIKE B | STRING | B is a simple SQL pattern (wildcard expression); returns TRUE if A matches it, otherwise FALSE. 'x%' means A must start with 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' anywhere. NOT reverses the result. |
| A RLIKE B, A REGEXP B | STRING | B is a Java regular expression; returns TRUE if A matches it, otherwise FALSE. Matching uses the JDK regular-expression engine, so its rules apply; note that the pattern only needs to match a substring of A, not the whole string, so '[A]' matches any value containing the letter A. |
select * from emp where sal = 5000;
select * from emp where sal between 500 and 1000;
select * from emp where comm is null;
select * from emp where sal In (1500,5000);
select * from emp where ename like 'A%';
select * from emp where ename like '_A%';
select * from emp where ename rlike '[A]';
| Operator | Meaning     |
| -------- | ----------- |
| AND      | Logical AND |
| OR       | Logical OR  |
| NOT      | Logical NOT |
select * from emp where sal > 1000 and deptno = 30;
select * from emp where sal > 1000 or deptno =30;
select * from emp where deptno not in(30,20);
select t.deptno,avg(t.sal) avg_sal from emp t group by t.deptno;
select t.deptno,t.job,max(t.sal) max_sal from emp t
group by
t.deptno,t.job;
select deptno,avg(sal) from emp group by deptno;
select deptno,avg(sal) avg_sal from emp group by deptno
having
avg_sal > 2000;
Hive supports the usual SQL JOIN statements, but only equi-joins; non-equi joins are not supported.
select e.empno,e.ename,d.deptno,d.dname from emp e
join dept d on e.deptno = d.deptno;
select e.empno,e.ename,d.dname from emp e
inner join
dept d
on
e.deptno =d.deptno;
select e.empno,e.ename,d.dname from emp e
left outer join
dept d
on
e.deptno =d.deptno;
select e.empno,e.ename,d.deptno from emp e
join
dept d on e.deptno = d.deptno;
select e.empno,e.ename,d.deptno from emp e
join
dept d on e.deptno =d.deptno;
select e.empno,e.ename,d.deptno from emp e
left join
dept d on e.deptno = d.deptno;
select e.empno,e.ename,d.deptno from emp e
right join
dept d on e.deptno = d.deptno;
select e.empno,e.ename,d.deptno from emp e
full join
dept d on e.deptno = d.deptno;
1700 Beijing
1800 London
1900 Tokyo
create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/datas/location.txt'
into table location;
SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;
select empno,dname from emp ,dept;
select * from emp order by sal;
select * from emp order by sal desc;
select ename,sal * 12 twosal from emp order by twosal;
select ename,deptno,sal from emp order by deptno,sal;
set mapreduce.job.reduces = 3;
set mapreduce.job.reduces;
select * from emp sort by deptno desc;
insert overwrite local directory
'/opt/module/data/sortby-result'
select * from emp sort by deptno desc;
set mapreduce.job.reduces=3;
insert overwrite local directory '/opt/module/hive/datas/distribute-result'
select * from emp distribute by deptno sort by empno desc;
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;
dept_20200401.log
dept_20200402.log
dept_20200403.log
create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
50 TEST 2000
60 DEV 1900
load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table dept_partition
partition(day='20200401');
load data local inpath
'/opt/module/hive/datas/dept_20200402.log' into table dept_partition
partition(day='20200402');
load data local inpath
'/opt/module/hive/datas/dept_20200403.log' into table dept_partition
partition(day='20200403');
select * from dept_partition where day='20200401';
select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402'
union
select * from dept_partition where day='20200403';
select * from dept_partition where day='20200401'
or day='20200402' or day='20200403';
alter table dept_partition add partition(day='20200404');
alter table dept_partition add partition(day='20200405')
partition(day='20200406');
alter table dept_partition drop partition
(day='20200406');
alter table dept_partition drop partition
(day='20200404'), partition(day='20200405');
show partitions dept_partition;
desc formatted dept_partition;
create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');
select * from dept_partition2 where day='20200401' and
hour='12';
Method 1: upload the data to the partition directory with dfs commands, then repair the metastore.
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
dfs -put /opt/module/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
select * from dept_partition2 where day='20200401' and hour='13'; --returns nothing: the partition is not yet in the metastore
msck repair table dept_partition2;
select * from dept_partition2 where day='20200401' and hour='13'; --now returns the uploaded rows
Method 2: upload the data with dfs commands, then add the partition explicitly.
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
alter table dept_partition2 add partition(day='20200401',hour='14');
select * from dept_partition2 where day='20200401' and hour='14';
Method 3: create the directory, then load the data into the partition (load data registers the partition).
dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;
load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');
select * from dept_partition2 where day='20200401' and hour='15';
hive.exec.dynamic.partition=true                --enable dynamic partitioning
hive.exec.dynamic.partition.mode=nonstrict      --nonstrict allows all partition columns to be dynamic; strict requires at least one static partition
hive.exec.max.dynamic.partitions=1000           --maximum total number of dynamic partitions a statement may create
hive.exec.max.dynamic.partitions.pernode=100    --maximum number of dynamic partitions per node
hive.exec.max.created.files=100000              --maximum number of HDFS files the whole job may create
hive.error.on.empty.partition=false             --whether to raise an error when an empty partition is generated
create table dept_partition_dy(id int, name string)
partitioned by (loc int) row format delimited fields terminated by '\t';
set hive.exec.dynamic.partition.mode = nonstrict;
insert into table dept_partition_dy partition(loc) select
deptno, dname, loc from dept;
show partitions dept_partition_dy;
1001 ss1
1002 ss2
1003 ss3
1004 ss4
1005 ss5
1006 ss6
1007 ss7
1008 ss8
1009 ss9
1010 ss10
1011 ss11
1012 ss12
1013 ss13
1014 ss14
1015 ss15
1016 ss16
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
desc formatted stu_buck;
Num Buckets:
load data inpath '/student.txt' into table stu_buck;
select * from stu_buck;
insert into table stu_buck select * from student_insert;
create table dept_partition_bucket(
    deptno int, dname string, loc int
)
partitioned by (day string)
clustered by (deptno) into 3 buckets --the bucket count is required; 3 is an illustrative value
row format delimited fields terminated by '\t';
select * from stu_buck tablesample(bucket 1 out of 4 on id);
If the numerator x in "bucket x out of y" is larger than the denominator y, Hive rejects the query:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
|               | Partitioned table | Bucketed table |
| ------------- | ----------------- | -------------- |
| Physical form | A partition is a directory | A bucket is a file |
| Create syntax | Declared with the partitioned by clause; the partition column is a pseudo-column and its type must be specified | Declared with the clustered by clause; the bucket column is a real column, and the number of buckets must be specified |
| Number        | The number of partitions can grow over time | Fixed once specified; cannot grow |
| Purpose       | Avoids full-table scans: queries that filter on the partition column read only the matching directories | Data is hash-distributed by the bucket column, which makes sampling and joins on bucketed tables more efficient in MapReduce jobs |
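As the last row notes, bucketing mainly pays off for sampling and joins. A hedged sketch of a bucket map join, using the stu_buck table created above and a hypothetical second table score_buck that is bucketed on the same join key id:

set hive.optimize.bucketmapjoin = true;   --allow Hive to join matching buckets on the map side
select b.id, b.name, s.score
from stu_buck b
join score_buck s on b.id = s.id;         --score_buck: a hypothetical table also clustered by id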
show functions;
desc function upper;
desc function extended upper;
select comm,nvl(comm, -1) from emp;
select comm, nvl(comm,mgr) from emp;
| name | dept_id | sex |
| ---- | ------- | --- |
| 悟空 | A | 男 |
| 大海 | A | 男 |
| 宋宋 | B | 男 |
| 凤姐 | A | 女 |
| 婷姐 | B | 女 |
| 婷婷 | B | 女 |

Desired result (number of 男 and 女 per department):

dept_id 男 女
A 2 1
B 1 2
[WovJ@hadoop102 datas]$ vi emp_sex.txt
悟空 A 男
大海 A 男
宋宋 B 男
凤姐 A 女
婷姐 B 女
婷婷 B 女
create table emp_sex(
name string,
dept_id string,
sex string)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/hive/data/emp_sex.txt' into table
emp_sex;
select
dept_id,
sum(case sex when '男' then 1 else 0 end) male_count,
sum(case sex when '女' then 1 else 0 end) female_count
from emp_sex
group by dept_id;
| name | constellation | blood_type |
| ---- | ------------- | ---------- |
| 孙悟空 | 白羊座 | A |
| 大海 | 射手座 | A |
| 宋宋 | 白羊座 | B |
| 猪八戒 | 白羊座 | A |
| 凤姐 | 射手座 | A |
| 苍老师 | 白羊座 | B |

Desired result (people with the same constellation and blood type grouped together):

射手座,A 大海|凤姐
白羊座,A 孙悟空|猪八戒
白羊座,B 宋宋|苍老师
[WovJ@hadoop102 datas]$ vim person_info.txt
孙悟空 白羊座 A
大海 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A
苍老师 白羊座 B
create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/hive/data/person_info.txt" into table
person_info;
SELECT
t1.c_b,
CONCAT_WS("|",collect_set(t1.name))
FROM (
SELECT
NAME,
CONCAT_WS(',',constellation,blood_type) c_b
FROM person_info
)t1
GROUP BY t1.c_b;
| movie | category |
| ----- | -------- |
| 《疑犯追踪》 | 悬疑,动作,科幻,剧情 |
| 《Lie to me》 | 悬疑,警匪,动作,心理,剧情 |
| 《战狼 2》 | 战争,动作,灾难 |

Desired result (the category column exploded into one row per movie and category):
《疑犯追踪》 悬疑
《疑犯追踪》 动作
《疑犯追踪》 科幻
《疑犯追踪》 剧情
《Lie to me》 悬疑
《Lie to me》 警匪
《Lie to me》 动作
《Lie to me》 心理
《Lie to me》 剧情
《战狼 2》 战争
《战狼 2》 动作
《战狼 2》 灾难
[WovJ@hadoop102 datas]$ vi movie_info.txt
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》悬疑,警匪,动作,心理,剧情
《战狼 2》 战争,动作,灾难
create table movie_info(
movie string,
category string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/data/movie.txt" into table
movie_info;
SELECT
movie,
category_name
FROM
movie_info
lateral VIEW
explode(split(category,",")) movie_info_tmp AS category_name;
jack,2023-01-01,10
tony,2023-01-02,15
jack,2023-02-03,23
tony,2023-01-04,29
jack,2023-01-05,46
jack,2023-04-06,42
tony,2023-01-07,50
jack,2023-01-08,55
mart,2023-04-08,62
mart,2023-04-09,68
neil,2023-05-10,12
mart,2023-04-11,75
neil,2023-06-12,80
mart,2023-04-13,94
[WovJ@hadoop102 datas]$ vi business.txt
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath "/opt/module/data/business.txt" into table
business;
select name,count(*) over ()
from business
where substring(orderdate,1,7) = '2023-04'
group by name;
select name,orderdate,cost,sum(cost) over(partition by month(orderdate))
from business;
select name,orderdate,cost,
    sum(cost) over() as sample1, --sum over all rows
    sum(cost) over(partition by name) as sample2, --sum within each name group
    sum(cost) over(partition by name order by orderdate) as sample3, --running total within each name group
    sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, --same as sample3: aggregate from the first row to the current row
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --current row plus the previous row
    sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, --previous row, current row, and next row
    sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 --current row and all following rows
from business;
select name,orderdate,cost,
    lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1,
    lag(orderdate,2) over(partition by name order by orderdate) as time2
from business;
select * from (
select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
from business
) t
where sorted = 1;
| name | subject | score |
| ---- | ------- | ----- |
| 孙悟空 | 语文 | 87 |
| 孙悟空 | 数学 | 95 |
| 孙悟空 | 英语 | 68 |
| 大海 | 语文 | 94 |
| 大海 | 数学 | 56 |
| 大海 | 英语 | 84 |
| 宋宋 | 语文 | 64 |
| 宋宋 | 数学 | 86 |
| 宋宋 | 英语 | 84 |
| 婷婷 | 语文 | 65 |
| 婷婷 | 数学 | 85 |
| 婷婷 | 英语 | 78 |
[WovJ@hadoop102 datas]$ vi score.txt
create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/data/score.txt' into table score;
select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
hive.fetch.task.conversion
more
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing
latency.
Currently the query should be single sourced not having any subquery
and should not have any aggregations or distincts (which incurs RS),
lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and
virtual columns)
set hive.fetch.task.conversion=none;
select * from emp;
select ename from emp;
select ename from emp limit 3;
set hive.fetch.task.conversion=more;
select * from emp;
select ename from emp;
select ename from emp limit 3;
set hive.exec.mode.local.auto=true; --enable local MR mode
--maximum input size for local mode; local MR is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
--maximum number of input files for local mode; local MR is used when there are fewer files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
select count(*) from emp group by deptno;
set hive.exec.mode.local.auto=true;
select count(*) from emp group by deptno;
If MapJoin is not specified, or the MapJoin conditions are not met, the Hive resolver converts the join into a Common Join, i.e. the join is completed in the Reduce stage, which is prone to data skew. MapJoin instead loads the small table entirely into memory and performs the join on the map side, avoiding the reducer.
set hive.auto.convert.join = true; --enable automatic MapJoin conversion (default true)
set hive.mapjoin.smalltable.filesize = 25000000; --size threshold, in bytes, below which a table is treated as "small"
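A sketch of the effect: with auto conversion enabled, a join against the small dept table is executed on the map side; the explicit MAPJOIN hint in the second query forces the same plan when auto conversion is turned off:

set hive.auto.convert.join = true;
select e.empno, e.ename, d.dname
from emp e join dept d on e.deptno = d.deptno; --dept is below the 25 MB threshold, so it is loaded into memory

select /*+ MAPJOIN(d) */ e.empno, e.ename, d.dname
from emp e join dept d on e.deptno = d.deptno;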
set hive.map.aggr = true; --enable partial (map-side) aggregation for GROUP BY
set hive.groupby.mapaggr.checkinterval = 100000; --number of rows after which the map-side aggregation is checked
--Load balancing when the data is skewed (default false):
set hive.groupby.skewindata = true;
Partitioning splits a data set into directories according to some rule; when the data cannot be split any further by a rule, bucketing is used instead.
select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 1; number of
reducers: 1
set mapreduce.input.fileinputformat.split.maxsize=100;
select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 6; number of
reducers: 1
set hive.input.format=
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET hive.merge.mapfiles = true; --merge small files at the end of a map-only job
SET hive.merge.mapredfiles = true; --merge small files at the end of a map-reduce job
SET hive.merge.size.per.task = 268435456; --target size of the merged files (256 MB)
SET hive.merge.smallfiles.avgsize = 16777216; --if the average output file size is below this value (16 MB), start an extra job to merge the files
hive.exec.reducers.bytes.per.reducer=256000000 --amount of data processed by each reducer (256 MB)
hive.exec.reducers.max=1009 --maximum number of reducers per job
Number of reducers: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
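For example, with the two defaults above, a job that reads about 2,560,000,000 bytes of input gets N = min(1009, 2560000000 / 256000000) = min(1009, 10) = 10 reducers, unless the reducer count is set explicitly as in the next line.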
set mapreduce.job.reduces = 15;
set hive.exec.parallel=true; --allow independent stages of a job to run in parallel
set hive.exec.parallel.thread.number=16; --maximum degree of parallelism for one SQL statement (default 8)
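A sketch of a statement that benefits: the two branches of the union all below are independent stages, so with parallel execution enabled they can run at the same time when the cluster has spare capacity:

set hive.exec.parallel=true;
select deptno, avg(sal) from emp group by deptno
union all
select deptno, max(sal) from emp group by deptno;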
(Omitted.)
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
explain select * from emp;
Explain
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: emp
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: empno (type: int), ename (type: string), job
(type: string), mgr (type: int), hiredate (type: string), sal (type:
double), comm (type: double), deptno (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
ListSink
explain select deptno, avg(sal) avg_sal from emp group by
deptno;
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: emp
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: sal (type: double), deptno (type: int)
outputColumnNames: sal, deptno
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
Group By Operator
aggregations: sum(sal), count(sal)
keys: deptno (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 7020 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 1 Data size: 7020 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: double), _col2 (type:
bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0), count(VALUE._col1)
keys: KEY._col0 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: _col0 (type: int), (_col1 / _col2) (type: double)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE
Column stats: NONE
table:
input format:
org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
explain extended select * from emp;
explain extended select deptno, avg(sal) avg_sal from emp
group by deptno;
public class ETLUtil {
    /**
     * Data-cleaning method.
     * Rules:
     *  1. Drop records with fewer than 9 fields.
     *  2. Remove spaces from the video category field.
     *  3. Join the related-video ids with '&'.
     */
    public static String etlData(String srcData) {
        StringBuilder resultData = new StringBuilder();
        // 1. Split the record on tabs
        String[] datas = srcData.split("\t");
        // 2. Discard records with fewer than 9 fields
        if (datas.length < 9) {
            return null;
        }
        // 3. Remove spaces inside the video category field
        datas[3] = datas[3].replaceAll(" ", "");
        // 4. Re-join: the first 9 fields separated by '\t', the related-video ids by '&'
        for (int i = 0; i < datas.length; i++) {
            if (i < 9) {
                // 4.1 Last field of the record (no related videos follow)
                if (i == datas.length - 1) {
                    resultData.append(datas[i]);
                } else {
                    resultData.append(datas[i]).append("\t");
                }
            } else {
                // 4.2 Related-video ids
                if (i == datas.length - 1) {
                    resultData.append(datas[i]);
                } else {
                    resultData.append(datas[i]).append("&");
                }
            }
        }
        return resultData.toString();
    }
}