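When Hive runs a join on MapReduce, it can choose between two execution plans: a common join and a map join. A map join is an optimization of the common join: it skips the shuffle and reduce phases, which can greatly reduce the job's running time.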
I. Prerequisites
emp table
hive> select * from emp;
OK
369 SMITH CLERK 7902 1980-12-17 00:00:00 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1982-12-09 00:00:00 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500.0 0.0 30
7876 ADAMS CLERK 7788 1983-01-12 00:00:00 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 0.161 seconds, Fetched: 14 row(s)
dept table
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.185 seconds, Fetched: 4 row(s)
II. Implementation
1.common join
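Map tasks read the two tables and emit their rows as key-value pairs keyed on the join column:
emp: deptno, (e.empno, e.ename)
dept: deptno, (d.dname)
After the shuffle, the reducer merges the records that share a key, which yields the join result.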
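The flow described above can be sketched in plain Python. This is only an illustration of the reduce-side join idea, not Hive's actual implementation: the function names (`map_emp`, `map_dept`, `shuffle`, `reduce_join`) and the 0/1 table tags are assumptions made for the example, and only three rows of emp are used.

```python
from collections import defaultdict

# A few sample rows taken from the emp and dept tables above.
emp = [(7499, "ALLEN", 30), (7566, "JONES", 20), (7782, "CLARK", 10)]
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]

def map_emp(rows):
    # Emit deptno -> (tag 0, (empno, ename)); rows with a NULL key are dropped,
    # mirroring the "deptno is not null" filter in the plan.
    for empno, ename, deptno in rows:
        if deptno is not None:
            yield deptno, (0, (empno, ename))

def map_dept(rows):
    # Emit deptno -> (tag 1, (dname,)).
    for deptno, dname in rows:
        if deptno is not None:
            yield deptno, (1, (dname,))

def shuffle(*streams):
    # Group every emitted value by its key, as the shuffle phase would.
    groups = defaultdict(list)
    for stream in streams:
        for key, value in stream:
            groups[key].append(value)
    return groups

def reduce_join(groups):
    # For each key, pair every emp value with every dept value (inner join).
    result = []
    for deptno in sorted(groups):
        left = [v for tag, v in groups[deptno] if tag == 0]
        right = [v for tag, v in groups[deptno] if tag == 1]
        for empno, ename in left:
            for (dname,) in right:
                result.append((empno, ename, deptno, dname))
    return result

common_join_result = reduce_join(shuffle(map_emp(emp), map_dept(dept)))
for row in common_join_result:
    print(row)
```

Department 40 has no matching employee, so it produces no output row, which is what an inner join should do.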
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: deptno (type: int)
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
value expressions: empno (type: int), ename (type: string)
TableScan
alias: d
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: deptno (type: int)
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
value expressions: dname (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 deptno (type: int)
1 deptno (type: int)
outputColumnNames: _col0, _col1, _col7, _col12
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
1.map side
TableScan
alias: e
Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: deptno (type: int)// the key of the key-value pair
sort order: +
Map-reduce partition columns: deptno (type: int)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
value expressions: empno (type: int), ename (type: string)// the value of the key-value pair
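The "Map-reduce partition columns: deptno" line is what guarantees that rows from both tables with the same deptno reach the same reducer. A minimal sketch of that idea, assuming a simple hash-modulo partitioner (the behaviour of Hadoop's default HashPartitioner; Hive's exact hashing may differ) and a hypothetical reducer count of 3:

```python
def partition_for(key, num_reducers):
    # Mask to a non-negative value before taking the modulus,
    # the way a hash partitioner typically does.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

NUM_REDUCERS = 3  # hypothetical; the real value is decided by Hive/Hadoop

emp_keys = [20, 30, 30, 20, 10]   # deptno values emitted by the emp mappers
dept_keys = [10, 20, 30, 40]      # deptno values emitted by the dept mappers

emp_partitions = {k: partition_for(k, NUM_REDUCERS) for k in emp_keys}
dept_partitions = {k: partition_for(k, NUM_REDUCERS) for k in dept_keys}

# The same key from either table is always routed to the same reducer.
for key in sorted(emp_partitions):
    print(key, emp_partitions[key], dept_partitions[key])
```

Because the partition depends only on the key, the reducer that receives deptno 20 from emp also receives deptno 20 from dept, so it can join them locally.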
2.reduce-side join
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 deptno (type: int)// join condition: the join column from each of the two tables
1 deptno (type: int)
outputColumnNames: _col0, _col1, _col7, _col12// positional names of the output columns
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)// types of the output columns
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
3.Finish
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
2.map join
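Hive first starts a local task that reads the smaller table, dept, writes it out as hash table files, and uploads those files to the Hadoop cache. It then launches a map-only job: each row read from the large table is joined against the cached small table, until the whole large table has been read.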
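The procedure above amounts to a hash join with the small table held in memory. A minimal Python sketch, for illustration only: `build_hash_table` stands in for the local task and `map_join` for the map-only job, and both names are assumptions for this example rather than anything in Hive.

```python
from collections import defaultdict

# A few sample rows taken from the emp and dept tables above.
emp = [(7499, "ALLEN", 30), (7566, "JONES", 20), (7839, "KING", 10)]
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (30, "SALES"), (40, "OPERATIONS")]

def build_hash_table(small_rows):
    # Local task: load the small table into a hash table keyed on deptno.
    table = defaultdict(list)
    for deptno, dname in small_rows:
        if deptno is not None:
            table[deptno].append(dname)
    return table

def map_join(big_rows, hash_table):
    # Map task: stream the large table and probe the hash table per row.
    # No shuffle and no reducer are involved.
    for empno, ename, deptno in big_rows:
        for dname in hash_table.get(deptno, []):
            yield empno, ename, deptno, dname

map_join_result = list(map_join(emp, build_hash_table(dept)))
for row in map_join_result:
    print(row)
```

The output keeps the order in which the large table is read, since each row is joined as soon as the mapper sees it.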
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 deptno (type: int)
1 deptno (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 deptno (type: int)
1 deptno (type: int)
outputColumnNames: _col0, _col1, _col7, _col12
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.191 seconds, Fetched: 62 row(s)
1.Local task reads the small table
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
2.Write the hash table file
HashTable Sink Operator
keys:
0 deptno (type: int)
1 deptno (type: int)
3.Upload to the Hadoop cache (this step does not appear in the execution plan, but it is visible in the log)
2018-01-11 10:30:28 Uploaded 1 File to: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-11_22-30-12_222_6099353227386611286-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile32--.hashtable (373 bytes)
4.Run a map job that reads the large table and joins it against the cached small table
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: deptno is not null (type: boolean)
Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 deptno (type: int)
1 deptno (type: int)
outputColumnNames: _col0, _col1, _col7, _col12
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
5.Finish
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
III. Experimental results
hive> select e.empno,e.ename,e.deptno, d.dname
> from emp e join dept d on e.deptno=d.deptno;
Query ID = hadoop_20180111202424_2a1594f6-ef46-4a99-a85d-4a4cc82e1b9c
Total jobs = 1
18/01/12 00:08:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/hadoop/hadoop_20180111202424_2a1594f6-ef46-4a99-a85d-4a4cc82e1b9c.log
2018-01-12 12:08:21 Starting to launch local task to process map join; maximum memory = 518979584
2018-01-12 12:08:24 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-12_00-08-10_896_8719990077853360918-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
2018-01-12 12:08:24 Uploaded 1 File to: file:/tmp/hadoop/aedaa8e1-17a9-4211-86b1-79debe362aba/hive_2018-01-12_00-08-10_896_8719990077853360918-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (373 bytes)
2018-01-12 12:08:24 End of local task; Time Taken: 2.59 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1515720212312_0005, Tracking URL = http://hadoop:8088/proxy/application_1515720212312_0005/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1515720212312_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-01-12 00:08:39,954 Stage-3 map = 0%, reduce = 0%
2018-01-12 00:08:53,406 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 2.59 sec
MapReduce Total cumulative CPU time: 2 seconds 590 msec
Ended Job = job_1515720212312_0005
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 2.59 sec HDFS Read: 6800 HDFS Write: 309 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 590 msec
OK
369 SMITH 20 RESEARCH
7499 ALLEN 30 SALES
7521 WARD 30 SALES
7566 JONES 20 RESEARCH
7654 MARTIN 30 SALES
7698 BLAKE 30 SALES
7782 CLARK 10 ACCOUNTING
7788 SCOTT 20 RESEARCH
7839 KING 10 ACCOUNTING
7844 TURNER 30 SALES
7876 ADAMS 20 RESEARCH
7900 JAMES 30 SALES
7902 FORD 20 RESEARCH
7934 MILLER 10 ACCOUNTING
Time taken: 43.697 seconds, Fetched: 14 row(s)
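Which of the two plans is used is controlled by a Hive tuning parameter:
set hive.auto.convert.join = true;
If it is false, Hive uses a common join.
If it is true, Hive converts the join to a map join, provided the small table is small enough to be loaded into memory (the size limit is set by hive.mapjoin.smalltable.filesize).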
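As a rough mental model of this switch, the decision can be written as a small function. This is a deliberately simplified sketch, not Hive's planner: the function name `choose_join` is hypothetical, the comparison is an assumption, and the threshold constant is the documented default of hive.mapjoin.smalltable.filesize (25000000 bytes).

```python
# Default value of hive.mapjoin.smalltable.filesize, in bytes.
DEFAULT_SMALL_TABLE_BYTES = 25_000_000

def choose_join(auto_convert, small_table_bytes,
                threshold=DEFAULT_SMALL_TABLE_BYTES):
    # Simplified model: convert to a map join only when the switch is on
    # and the small table is under the size threshold.
    if auto_convert and small_table_bytes <= threshold:
        return "map join"
    return "common join"

# dept is reported as "Data size: 80" in the plans above.
print(choose_join(True, 80))
print(choose_join(False, 80))
print(choose_join(True, 10**9))
```

With the switch on, a table as small as dept is converted to a map join; with the switch off, or with a side table too large to hold in memory, the join stays a common join.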