1 Hive: Parameter Tuning
1.1 hive.fetch.task.conversion
1.2 hive.exec.mode.local.auto
1.3 hive.mapred.mode
1.4 hive.mapred.reduce.tasks.speculative.execution
1.5 hive.optimize.cp
1.6 hive.optimize.ppd
2 Tuning the Number of Map and Reduce Tasks in the MapReduce Stage
2.1 Tuning the Number of Map Tasks
2.2 Tuning the Number of Reduce Tasks
Default Value: minimal in Hive 0.10.0 through 0.13.1, more in Hive 0.14.0 and later
Added In: Hive 0.10.0 with HIVE-2925; default changed in Hive 0.14.0 with HIVE-7397
Some select queries can be converted to a single FETCH task, minimizing latency. Currently the query must be single-sourced, with no subquery, and must not contain aggregations or DISTINCT (which incur an RS, i.e. ReduceSinkOperator, requiring a MapReduce task), lateral views, or joins.
Supported values are none, minimal and more.
0. none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)
1. minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only
2. more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
"more" can take any kind of expressions in the SELECT clause, including UDFs.
(UDTFs and lateral views are not yet supported – see HIVE-5718.)
1.1.1 none mode
hive> set hive.fetch.task.conversion;
hive> select * from bigdata.emp;
Query ID = work_20201216094245_d44ea4d3-0a5b-4302-93dd-4ef9a5252517
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608016084001_0020, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0020/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -kill job_1608016084001_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-16 09:43:02,342 Stage-1 map = 0%, reduce = 0%
2020-12-16 09:43:10,641 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec
MapReduce Total cumulative CPU time: 2 seconds 280 msec
Ended Job = job_1608016084001_0020
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.28 sec HDFS Read: 4413 HDFS Write: 451 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 280 msec
Time taken: 27.002 seconds, Fetched: 14 row(s)
1.1.2 minimal mode
hive> set hive.fetch.task.conversion;
hive> select * from bigdata.emp where dept_no = '20';
Query ID = work_20201216094750_3df492b8-bbd8-4e41-b378-bba5fe1b3dc7
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1608016084001_0022, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0022/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -kill job_1608016084001_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-16 09:48:07,799 Stage-1 map = 0%, reduce = 0%
2020-12-16 09:48:17,119 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.33 sec
MapReduce Total cumulative CPU time: 4 seconds 330 msec
Ended Job = job_1608016084001_0022
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.33 sec HDFS Read: 4952 HDFS Write: 216 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 330 msec
Time taken: 27.728 seconds, Fetched: 5 row(s)
hive> select * from bigdata.emp;
Time taken: 0.144 seconds, Fetched: 14 row(s)
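CREATE TABLE IF NOT EXISTS bigdata.emp_partition(
emp_no String,
emp_name String
)
PARTITIONED BY (dept_no String);
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- This property defaults to strict, which forbids making every partition column dynamic: at least one
-- partition column must be given a static value, and the static columns must come first.
-- With nonstrict, all partition columns may be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;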
hive> load data local inpath '/home/work/data/hive/emp.txt' overwrite into table bigdata.emp_partition;
hive> select * from bigdata.emp_partition;
Time taken: 0.204 seconds, Fetched: 14 row(s)
hive> set hive.fetch.task.conversion;
hive> select * from bigdata.emp_partition where dept_no = '20';
Time taken: 0.172 seconds, Fetched: 5 row(s)
1.1.3 more mode
hive> set hive.fetch.task.conversion=more;
hive> select * from bigdata.emp where dept_no = '20';
Time taken: 0.153 seconds, Fetched: 5 row(s)
hive> select * from bigdata.emp_partition where dept_no = '20';
Time taken: 0.224 seconds, Fetched: 5 row(s)
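The behaviour seen in sections 1.1.1 through 1.1.3 can be condensed into a small decision function. The Java sketch below is only a toy model of the rules quoted from the Hive documentation earlier, not Hive's optimizer code. The class `FetchConversionSketch`, the enum `Level`, and the `QueryShape` fields are hypothetical names introduced for illustration.

```java
/** Toy model of hive.fetch.task.conversion; not Hive's real implementation. */
final class FetchConversionSketch {

    /** The three supported values of hive.fetch.task.conversion. */
    enum Level { NONE, MINIMAL, MORE }

    /** Hypothetical summary of the query features that matter for the decision. */
    static final class QueryShape {
        final boolean selectStarOnly;        // projection is exactly SELECT *
        final boolean hasFilter;             // query has a WHERE clause
        final boolean filterOnPartitionOnly; // every filtered column is a partition column
        final boolean needsMapReduce;        // aggregation, DISTINCT, join, lateral view or subquery

        QueryShape(boolean selectStarOnly, boolean hasFilter,
                   boolean filterOnPartitionOnly, boolean needsMapReduce) {
            this.selectStarOnly = selectStarOnly;
            this.hasFilter = hasFilter;
            this.filterOnPartitionOnly = filterOnPartitionOnly;
            this.needsMapReduce = needsMapReduce;
        }
    }

    /** Returns true when the modelled query would be served by a single FETCH task. */
    static boolean usesFetchTask(Level level, QueryShape q) {
        if (level == Level.NONE || q.needsMapReduce) {
            return false;
        }
        if (level == Level.MORE) {
            return true; // any plain SELECT / FILTER / LIMIT combination qualifies
        }
        // MINIMAL: SELECT * only, and any filter must touch partition columns only
        return q.selectStarOnly && (!q.hasFilter || q.filterOnPartitionOnly);
    }

    public static void main(String[] args) {
        // select * from bigdata.emp
        QueryShape scan = new QueryShape(true, false, false, false);
        // select * from bigdata.emp where dept_no = '20' (dept_no is an ordinary column)
        QueryShape columnFilter = new QueryShape(true, true, false, false);
        // select * from bigdata.emp_partition where dept_no = '20' (dept_no is the partition column)
        QueryShape partitionFilter = new QueryShape(true, true, true, false);

        System.out.println(usesFetchTask(Level.NONE, scan));               // false
        System.out.println(usesFetchTask(Level.MINIMAL, scan));            // true
        System.out.println(usesFetchTask(Level.MINIMAL, columnFilter));    // false
        System.out.println(usesFetchTask(Level.MINIMAL, partitionFilter)); // true
        System.out.println(usesFetchTask(Level.MORE, columnFilter));       // true
    }
}
```

Under this model the results line up with the timings above: with none every query launches a job, with minimal only the unfiltered scan and the partition-column filter avoid MapReduce, and with more the filter on an ordinary column is served by a fetch task as well.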
hive> set hive.exec.mode.local.auto;
hive> select count(1) from bigdata.emp;
Query ID = work_20201216102827_ff3113d0-5c91-4a4f-a330-f9ce782d0e62
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1608016084001_0024, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0024/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job -kill job_1608016084001_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-16 10:28:43,829 Stage-1 map = 0%, reduce = 0%
2020-12-16 10:28:54,149 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.28 sec
2020-12-16 10:29:00,337 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.24 sec
MapReduce Total cumulative CPU time: 6 seconds 240 msec
Ended Job = job_1608016084001_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.24 sec HDFS Read: 8334 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 240 msec
Time taken: 34.122 seconds, Fetched: 1 row(s)
hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto;
hive> select count(1) from bigdata.emp;
Automatically selecting local only mode for query
Query ID = work_20201216103030_6c88b989-8348-4521-aa41-c23dee70931e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
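The "Automatically selecting local only mode for query" line means Hive judged the job small enough to run inside the client process instead of submitting it to the cluster. The sketch below is a rough Java model of that check. It assumes the three criteria described in the Hive documentation: total input size no larger than hive.exec.mode.local.auto.inputbytes.max, number of input files no larger than hive.exec.mode.local.auto.input.files.max, and at most one reduce task. The class and method names are hypothetical, and the threshold values used in `main` are assumptions that should be checked against your Hive version.

```java
/** Rough model of Hive's automatic local-mode decision; names are illustrative only. */
final class LocalModeSketch {

    /**
     * Returns true when a job is small enough for local mode under the assumed rules:
     * input size and file count within their limits, and no more than one reducer.
     */
    static boolean eligibleForLocalMode(long inputBytes, int inputFiles, int reducers,
                                        long maxInputBytes, int maxInputFiles) {
        return inputBytes <= maxInputBytes
                && inputFiles <= maxInputFiles
                && reducers <= 1;
    }

    public static void main(String[] args) {
        long maxBytes = 134_217_728L; // assumed hive.exec.mode.local.auto.inputbytes.max (128 MB)
        int maxFiles = 4;             // assumed hive.exec.mode.local.auto.input.files.max

        // A tiny table in one file with a single reducer, similar in spirit to count(1) on emp
        System.out.println(eligibleForLocalMode(4_413L, 1, 1, maxBytes, maxFiles));         // true
        // A 1 GB input exceeds the size limit
        System.out.println(eligibleForLocalMode(1_000_000_000L, 1, 1, maxBytes, maxFiles)); // false
        // Many small files exceed the file-count limit
        System.out.println(eligibleForLocalMode(4_413L, 10, 1, maxBytes, maxFiles));        // false
    }
}
```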
Default Value:
Hive 0.x: nonstrict
Hive 1.x: nonstrict
Hive 2.x: strict (HIVE-12413)
Added In: Hive 0.3.0
The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run. For example, full table scans are prevented (see HIVE-10454) and ORDER BY requires a LIMIT clause.
hive> set hive.mapred.mode;
-- ORDER BY on a non-partitioned table
hive> select * from bigdata.emp order by emp_no;
Automatically selecting local only mode for query
Query ID = work_20201216104456_5f7b9b48-11d8-4268-9101-0e93e77d9e28
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
7369 SMITH 20
7499 ALLEN 30
7521 WARD 30
7566 JONES 20
7654 MARTIN 30
7698 BLAKE 30
7782 CLARK 10
7788 SCOTT 20
7839 KING 10
7844 TURNER 30
7876 ADAMS 20
7900 JAMES 30
7902 FORD 20
7934 MILLER 10
Time taken: 3.617 seconds, Fetched: 14 row(s)
-- Partitioned table queried without a filter on the partition column
hive> select * from bigdata.emp_partition where emp_no='7782';
7782 CLARK 10
Time taken: 0.154 seconds, Fetched: 1 row(s)
hive> set hive.mapred.mode;
-- ORDER BY on a non-partitioned table
hive> select * from bigdata.emp order by emp_no;
FAILED: SemanticException 1:35 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'emp_no'
-- Partitioned table whose FILTER does not use the partition column
hive> select * from bigdata.emp_partition where emp_no='20';
FAILED: SemanticException [Error 10056]: Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.no.partition.filter to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "emp_partition" Table "emp_partition"
-- Cartesian product test
hive> select * from bigdata.emp a join bigdata.emp b;
FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.cartesian.product to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
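Each of the three error messages above names a finer-grained switch: hive.strict.checks.orderby.no.limit, hive.strict.checks.no.partition.filter, and hive.strict.checks.cartesian.product. As a way to keep the rules straight, here is a toy Java model that reports which of those checks a query would trip in strict mode. It is built only from the messages shown above; the class, the `Query` fields, and the method name are hypothetical, and this is not how Hive's semantic analyzer is written.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the strict-mode checks named in the error messages; illustrative only. */
final class StrictModeSketch {

    /** Hypothetical summary of the query features the three checks look at. */
    static final class Query {
        final boolean hasOrderBy;
        final boolean hasLimit;
        final boolean partitionedTable;
        final boolean hasPartitionFilter;
        final boolean cartesianJoin; // join with no join condition

        Query(boolean hasOrderBy, boolean hasLimit, boolean partitionedTable,
              boolean hasPartitionFilter, boolean cartesianJoin) {
            this.hasOrderBy = hasOrderBy;
            this.hasLimit = hasLimit;
            this.partitionedTable = partitionedTable;
            this.hasPartitionFilter = hasPartitionFilter;
            this.cartesianJoin = cartesianJoin;
        }
    }

    /** Returns the names of the checks the query violates; empty when not in strict mode. */
    static List<String> violations(boolean strict, Query q) {
        List<String> failed = new ArrayList<>();
        if (!strict) {
            return failed;
        }
        if (q.hasOrderBy && !q.hasLimit) {
            failed.add("hive.strict.checks.orderby.no.limit");
        }
        if (q.partitionedTable && !q.hasPartitionFilter) {
            failed.add("hive.strict.checks.no.partition.filter");
        }
        if (q.cartesianJoin) {
            failed.add("hive.strict.checks.cartesian.product");
        }
        return failed;
    }

    public static void main(String[] args) {
        // select * from bigdata.emp order by emp_no
        System.out.println(violations(true, new Query(true, false, false, false, false)));
        // select * from bigdata.emp_partition where emp_no='20'
        System.out.println(violations(true, new Query(false, false, true, false, false)));
        // select * from bigdata.emp a join bigdata.emp b
        System.out.println(violations(true, new Query(false, false, false, false, true)));
    }
}
```

Adding a LIMIT to the ORDER BY query, or a dept_no predicate to the partitioned-table query, makes the corresponding entry disappear from the returned list.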
Default Value: true
Added In: Hive 0.5.0
Whether speculative execution for reducers should be turned on.
Default Value: true
Added In: Hive 0.4.0 with HIVE-626
Removed In: Hive 0.13.0 with HIVE-4113
Whether to enable column pruner. (This configuration property was removed in release 0.13.0.)
Default Value: true
Added In: Hive 0.4.0 with HIVE-279, default changed to true in Hive 0.4.0 with HIVE-626
Whether to enable predicate pushdown (PPD).
Note: Turn on hive.optimize.index.filter as well to use file-format-specific indexes with PPD.
hive> set mapreduce.input.fileinputformat.split.maxsize;
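For plain FileInputFormat-based jobs, the number of map tasks follows from the split size, which Hadoop computes as max(minSize, min(maxSize, blockSize)). Lowering mapreduce.input.fileinputformat.split.maxsize therefore yields more, smaller splits. The Java sketch below reproduces that arithmetic for a single non-empty file. It is an approximation under stated assumptions: the class and method names are hypothetical, the 1.1 slack factor mirrors Hadoop's FileInputFormat, the sizes in `main` are made-up examples, and Hive's combining input formats may merge small files, so real mapper counts can differ.

```java
/** Approximates FileInputFormat-style split arithmetic for one file; illustrative only. */
final class SplitSizeSketch {

    /** A trailing chunk up to 10% larger than the split size is kept as one split. */
    private static final double SPLIT_SLOP = 1.1;

    /** Split size = max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts the splits produced for a single non-empty, splittable file. */
    static int countSplits(long fileSize, long splitSize) {
        if (fileSize <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly shorter, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed HDFS block size

        long defaultSplit = computeSplitSize(blockSize, 1L, 256_000_000L);
        long smallerSplit = computeSplitSize(blockSize, 1L, 64 * mb);

        System.out.println(countSplits(300 * mb, defaultSplit)); // 3
        System.out.println(countSplits(300 * mb, smallerSplit)); // 5
        System.out.println(countSplits(140 * mb, defaultSplit)); // 1
    }
}
```

In this model, halving the maximum split size from the block size to 64 MB raises the split count for a 300 MB file from 3 to 5, while a 140 MB file still fits in one split because it is within the 10% slack.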
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
int reducers = Utilities.estimateNumberOfReducers(conf, inputSummary, work.getMapWork(),
long bytesPerReducer = conf.getLongVar(HiveConf.ConfVars.BYTESPERREDUCER);
int maxReducers = conf.getIntVar(HiveConf.ConfVars.MAXREDUCERS);
estimateReducers(totalInputFileSize, bytesPerReducer, maxReducers, powersOfTwo){
// bytesPerReducer is set by hive.exec.reducers.bytes.per.reducer; the default is 256000000L
// maxReducers is set by hive.exec.reducers.max; the default is 1009
// bytes is the total input size in bytes, raised to at least bytesPerReducer
double bytes = Math.max(totalInputFileSize, bytesPerReducer);
int reducers = (int) Math.ceil(bytes / bytesPerReducer);
reducers = Math.max(1, reducers);
reducers = Math.min(maxReducers, reducers);
// In short: number of reduce tasks = min(ceil(total input bytes / hive.exec.reducers.bytes.per.reducer), hive.exec.reducers.max), and never fewer than 1
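To make the formula concrete, the following self-contained Java sketch repeats the arithmetic from the excerpt above and applies it to three input sizes. It assumes the defaults mentioned in the comments (256000000 bytes per reducer and at most 1009 reducers) and leaves out the powersOfTwo adjustment, whose logic is not shown in the excerpt. The class name `ReducerEstimateSketch` is hypothetical.

```java
/** Stand-alone version of the reducer estimate shown above, without the powersOfTwo step. */
final class ReducerEstimateSketch {

    static int estimateReducers(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
        // Treat inputs smaller than one reducer's share as exactly one share
        double bytes = Math.max(totalInputFileSize, bytesPerReducer);
        // Round up so that a partial share still gets its own reducer
        int reducers = (int) Math.ceil(bytes / bytesPerReducer);
        reducers = Math.max(1, reducers);
        // Never exceed the configured upper bound
        reducers = Math.min(maxReducers, reducers);
        return reducers;
    }

    public static void main(String[] args) {
        long bytesPerReducer = 256_000_000L; // assumed hive.exec.reducers.bytes.per.reducer
        int maxReducers = 1009;              // assumed hive.exec.reducers.max

        System.out.println(estimateReducers(4_413L, bytesPerReducer, maxReducers));             // 1
        System.out.println(estimateReducers(1_000_000_000L, bytesPerReducer, maxReducers));     // 4
        System.out.println(estimateReducers(1_000_000_000_000L, bytesPerReducer, maxReducers)); // 1009
    }
}
```

A tiny table yields a single reducer, 1 GB of input yields ceil(3.90625) = 4, and 1 TB would need 3907 reducers by size but is capped at 1009.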