Scenario 8: Data Pruning
1) Row pruning
i. Build on the strengths of partitioned and bucketed tables
Building and using bucketed tables: https://blog.csdn.net/qq_26803795/article/details/105205702
Partitioned tables and bucketed tables are each an optimization strategy in their own right.
A partitioned table suits columns that organize the data in a natural way, such as daily logs: each day's data is similar in shape and volume, so it can be organized by a date partition.
A bucketed table suits columns that cannot form sensible partitions, such as a user-id column; it also makes sampling convenient.
The two do not conflict and can of course be combined.
Partitioned table: filtering on a specified partition makes row pruning easy, because each partition is stored as a directory on HDFS (filtering on the partition column can dramatically improve scan efficiency).
Bucketed table: the bucket column's value is hashed and taken modulo the number of buckets, which decides which bucket a record lands in, so later filters on that column also gain a big advantage (no full-table scan needed).
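The two strategies above can be sketched together in DDL. This is a minimal illustration, not from the original article; the table name, columns, and bucket count are all made up:

```sql
-- Hypothetical daily-log table: partitioned by date, bucketed by user id.
CREATE TABLE user_log (
    user_id    BIGINT,
    event_type STRING,
    event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)              -- each dt value becomes an HDFS directory
CLUSTERED BY (user_id) INTO 32 BUCKETS  -- hash(user_id) % 32 picks the bucket file
STORED AS ORC;

-- Partition pruning: only the dt='20200406' directory is scanned.
SELECT count(*) FROM user_log WHERE dt = '20200406';

-- Bucket sampling: reads about 1 of the 32 bucket files instead of the full table.
SELECT * FROM user_log TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```

On older Hive versions you may also need `SET hive.enforce.bucketing = true;` before inserting, so that writes actually respect the bucket layout.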
For example, the following three statements all return the same result.
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
AND a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT *
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT *
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
WHERE
a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
Let's compare the execution plans of the three statements using EXPLAIN:
EXPLAIN
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
AND a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
filterExpr: ((salary >= 10000.0) and user_id is not null) (type: boolean)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((salary >= 10000.0) and user_id is not null) (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 user_id (type: bigint)
1 user_id (type: bigint)
outputColumnNames: _col2, _col13
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col2 (type: bigint), _col13 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: bigint)
sort order: ++
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
The second query:
EXPLAIN
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT *
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT *
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (salary >= 10000.0) (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user_id (type: bigint)
outputColumnNames: _col2
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col2 is not null (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col2 (type: bigint)
sort order: +
Map-reduce partition columns: _col2 (type: bigint)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (salary >= 10000.0) (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user_id (type: bigint)
outputColumnNames: _col2
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col2 is not null (type: boolean)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col2 (type: bigint)
sort order: +
Map-reduce partition columns: _col2 (type: bigint)
Statistics: Num rows: 2 Data size: 97 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col2 (type: bigint)
1 _col2 (type: bigint)
outputColumnNames: _col2, _col10
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col2 (type: bigint), _col10 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: bigint)
sort order: ++
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 106 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
The third query:
EXPLAIN
SELECT
a.user_id
,b.user_id
FROM datacube_salary_org AS a
JOIN datacube_salary_org AS b
ON a.user_id = b.user_id
WHERE
a.salary >= 10000.0
AND b.salary >= 10000.0
ORDER BY a.user_id, b.user_id
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: user_id is not null (type: boolean)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
value expressions: salary (type: decimal(10,2))
TableScan
alias: b
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: user_id is not null (type: boolean)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
value expressions: salary (type: decimal(10,2))
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 user_id (type: bigint)
1 user_id (type: bigint)
outputColumnNames: _col2, _col4, _col13, _col15
Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col2 (type: bigint), _col13 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: bigint)
sort order: ++
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
As you can see, the execution plans of the first and second queries are essentially the same. In the third plan, however, the filter conditions are applied only after the JOIN, which is bound to be slow on large data volumes:
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 user_id (type: bigint)
1 user_id (type: bigint)
outputColumnNames: _col2, _col4, _col13, _col15
Statistics: Num rows: 7 Data size: 374 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((_col4 >= 10000.0) and (_col15 >= 10000.0)) (type: boolean)
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col2 (type: bigint), _col13 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 53 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Also pay attention to the second query:
SELECT
tmp_a.user_id
,tmp_b.user_id
FROM
(
SELECT *
FROM datacube_salary_org AS a
WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
SELECT *
FROM datacube_salary_org AS b
WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
Note: although the subqueries use *, the execution engine actually optimizes this for us and reads only user_id!
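That said, it is safer not to depend on the optimizer and to make the pruning explicit. The rewrite below is the same second query with the projection narrowed by hand:

```sql
SELECT
 tmp_a.user_id
 ,tmp_b.user_id
FROM
(
 SELECT user_id          -- project only what the join and the output need
 FROM datacube_salary_org AS a
 WHERE a.salary >= 10000.0
) tmp_a
JOIN
(
 SELECT user_id
 FROM datacube_salary_org AS b
 WHERE b.salary >= 10000.0
) tmp_b
ON tmp_a.user_id = tmp_b.user_id
ORDER BY tmp_a.user_id, tmp_b.user_id
;
```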
2) Column pruning
i. Drop columns that are invalid or outside the scope of the computation
This one is easy to understand: list only the columns you need in the SELECT clause instead of writing SELECT *.
Let's look at the difference between the two approaches:
EXPLAIN
SELECT
a.user_id
,a.company_name
FROM datacube_salary_org AS a
WHERE
a.pt = '20200406'
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (pt = '20200406') (type: boolean)
Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: user_id (type: bigint), company_name (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 7 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Selecting all columns:
EXPLAIN
SELECT
a.*
FROM datacube_salary_org AS a
WHERE
a.pt = '20200406'
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 984 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
predicate: (pt = '20200406') (type: boolean)
Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string), salary (type: decimal(10,2)), create_time (type: date), update_time (type: date), '20200406' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 7 Data size: 644 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
ii. Use columnar storage
Using a columnar format as the table's storage format also helps us filter down to the needed columns much faster.
Columnar storage formats currently supported by Hive: PARQUET, RCFILE (not recommended), and ORC.
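For example, a columnar copy of an existing table can be created with CTAS. A sketch, where the new table name is made up (note that a CTAS table in Hive is unpartitioned, so the partition column pt becomes an ordinary column in the copy):

```sql
-- Hypothetical ORC copy of the source table used throughout this article.
CREATE TABLE datacube_salary_orc
STORED AS ORC
AS
SELECT * FROM datacube_salary_org;
```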
————————————————
Copyright notice: this is an original article by CSDN blogger 「高达一号」, licensed under CC 4.0 BY-SA. When reprinting, please include the original source link and this notice.
Original link: https://blog.csdn.net/u010003835/article/details/105495151