Hive Tuning Case Studies

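Hive Optimization
Core idea: optimize Hive SQL as if it were a MapReduce program.
The following kinds of SQL are not converted into MapReduce jobs (Hive serves them with a simple fetch task instead):
a select that only reads columns of the table itself
a where clause that only filters on columns of the table itself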
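
The two cases above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the behavior is normally governed by the `hive.fetch.task.conversion` property, which these notes do not mention, so the setting and its value are an assumption about your deployment. The column names come from the `student` table used in the EXPLAIN output in these notes; the literal `20` is purely illustrative.

```sql
-- Assumption: this Hive version supports hive.fetch.task.conversion.
-- 'more' allows simple SELECT / filter / LIMIT queries to run as a fetch task.
set hive.fetch.task.conversion=more;

-- Reads only columns of the table itself: expected to run without MapReduce.
select sno, sname from student;

-- Filters only on columns of the table itself: also expected to avoid MapReduce.
select sno, sname from student where sage > 20;

-- An aggregate such as count() still needs a MapReduce job,
-- as the Stage-1 "Map Reduce" plan in these notes shows.
select count(sno) from student;
```

Setting the property to `none` should force even the first two queries through MapReduce, which makes the difference easy to observe.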

Explain: display the execution plan
EXPLAIN [EXTENDED] query
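
A short sketch of both forms of the statement, using a hypothetical filter query on the `student` table (the predicate `sage > 20` is an illustrative assumption). In the Hive version used in these notes, the EXTENDED form additionally prints the abstract syntax tree, the input paths, and the table properties; other versions may format the output differently.

```sql
-- Plain form: prints STAGE DEPENDENCIES and STAGE PLANS only.
explain select sno, sname from student where sage > 20;

-- Extended form: same plan plus extra detail such as paths and table properties.
explain extended select sno, sname from student where sage > 20;
```

If the query qualifies for fetch-only execution, one would expect the plan to contain a Fetch Operator stage and no Map Reduce stage.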

0: jdbc:hive2://node1:10000> explain select count(student.sno) from student;
+--------------------------------------------------------------------------------------------------+--+
|                                             Explain                                              |
+--------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                              |
|   Stage-1 is a root stage                                                                        |
|   Stage-0 depends on stages: Stage-1                                                             |
|                                                                                                  |
| STAGE PLANS:                                                                                     |
|   Stage: Stage-1                                                                                 |
|     Map Reduce                                                                                   |
|       Map Operator Tree:                                                                         |
|           TableScan                                                                              |
|             alias: student                                                                       |
|             Statistics: Num rows: 131 Data size: 526 Basic stats: COMPLETE Column stats: NONE    |
|             Select Operator                                                                      |
|               expressions: sno (type: int)                                                       |
|               outputColumnNames: sno                                                             |
|               Statistics: Num rows: 131 Data size: 526 Basic stats: COMPLETE Column stats: NONE  |
|               Group By Operator                                                                  |
|                 aggregations: count(sno)                                                         |
|                 mode: hash                                                                       |
|                 outputColumnNames: _col0                                                         |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE    |
|                 Reduce Output Operator                                                           |
|                   sort order:                                                                    |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE  |
|                   value expressions: _col0 (type: bigint)                                        |
|       Reduce Operator Tree:                                                                      |
|         Group By Operator                                                                        |
|           aggregations: count(VALUE._col0)                                                       |
|           mode: mergepartial                                                                     |
|           outputColumnNames: _col0                                                               |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE          |
|           File Output Operator                                                                   |
|             compressed: false                                                                    |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE        |
|             table:                                                                               |
|                 input format: org.apache.hadoop.mapred.TextInputFormat                           |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat        |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                        |
|                                                                                                  |
|   Stage: Stage-0                                                                                 |
|     Fetch Operator                                                                               |
|       limit: -1                                                                                  |
|       Processor Tree:                                                                            |
|         ListSink                                                                                 |
|                                                                                                  |
+--------------------------------------------------------------------------------------------------+--+
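
In the plan above, the map side already runs a Group By Operator with `mode: hash`, and the reducer merges the partial counts with `mode: mergepartial`. That pattern corresponds to map-side aggregation. The sketch below assumes the `hive.map.aggr` property (not mentioned in these notes) is what controls it in your Hive version, and shows one way to compare the two plans.

```sql
-- Assumption: hive.map.aggr controls map-side partial aggregation (commonly true by default).
set hive.map.aggr=true;
explain select count(student.sno) from student;

-- With map-side aggregation disabled, the map-side Group By Operator
-- is expected to disappear, leaving all counting to the reducer.
set hive.map.aggr=false;
explain select count(student.sno) from student;
```

Comparing the two outputs is a quick way to see how much data the map phase would hand to the reducer in each case.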


0: jdbc:hive2://node1:10000> explain extended select count(student.sno) from student;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                                         Explain                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| ABSTRACT SYNTAX TREE:                                                                                                                                                                                                                                    |
|                                                                                                                                                                                                                                                          |
| TOK_QUERY                                                                                                                                                                                                                                                |
|    TOK_FROM                                                                                                                                                                                                                                              |
|       TOK_TABREF                                                                                                                                                                                                                                         |
|          TOK_TABNAME                                                                                                                                                                                                                                     |
|             student                                                                                                                                                                                                                                      |
|    TOK_INSERT                                                                                                                                                                                                                                            |
|       TOK_DESTINATION                                                                                                                                                                                                                                    |
|          TOK_DIR                                                                                                                                                                                                                                         |
|             TOK_TMP_FILE                                                                                                                                                                                                                                 |
|       TOK_SELECT                                                                                                                                                                                                                                         |
|          TOK_SELEXPR                                                                                                                                                                                                                                     |
|             TOK_FUNCTION                                                                                                                                                                                                                                 |
|                count                                                                                                                                                                                                                                     |
|                .                                                                                                                                                                                                                                         |
|                   TOK_TABLE_OR_COL                                                                                                                                                                                                                       |
|                      student                                                                                                                                                                                                                             |
|                   sno                                                                                                                                                                                                                                    |
|                                                                                                                                                                                                                                                          |
|                                                                                                                                                                                                                                                          |
| STAGE DEPENDENCIES:                                                                                                                                                                                                                                      |
|   Stage-1 is a root stage                                                                                                                                                                                                                                |
|   Stage-0 depends on stages: Stage-1                                                                                                                                                                                                                     |
|                                                                                                                                                                                                                                                          |
| STAGE PLANS:                                                                                                                                                                                                                                             |
|   Stage: Stage-1                                                                                                                                                                                                                                         |
|     Map Reduce                                                                                                                                                                                                                                           |
|       Map Operator Tree:                                                                                                                                                                                                                                 |
|           TableScan                                                                                                                                                                                                                                      |
|             alias: student                                                                                                                                                                                                                               |
|             Statistics: Num rows: 131 Data size: 526 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                            |
|             GatherStats: false                                                                                                                                                                                                                           |
|             Select Operator                                                                                                                                                                                                                              |
|               expressions: sno (type: int)                                                                                                                                                                                                               |
|               outputColumnNames: sno                                                                                                                                                                                                                     |
|               Statistics: Num rows: 131 Data size: 526 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                          |
|               Group By Operator                                                                                                                                                                                                                          |
|                 aggregations: count(sno)                                                                                                                                                                                                                 |
|                 mode: hash                                                                                                                                                                                                                               |
|                 outputColumnNames: _col0                                                                                                                                                                                                                 |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                            |
|                 Reduce Output Operator                                                                                                                                                                                                                   |
|                   sort order:                                                                                                                                                                                                                            |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                          |
|                   tag: -1                                                                                                                                                                                                                                |
|                   value expressions: _col0 (type: bigint)                                                                                                                                                                                                |
|                   auto parallelism: false                                                                                                                                                                                                                |
|       Path -> Alias:                                                                                                                                                                                                                                     |
|         hdfs://node1:9000/user/hive/warehouse/erzhen.db/student [student]                                                                                                                                                                                |
|       Path -> Partition:                                                                                                                                                                                                                                 |
|         hdfs://node1:9000/user/hive/warehouse/erzhen.db/student                                                                                                                                                                                          |
|           Partition                                                                                                                                                                                                                                      |
|             base file name: student                                                                                                                                                                                                                      |
|             input format: org.apache.hadoop.mapred.TextInputFormat                                                                                                                                                                                       |
|             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                                                                                                                                    |
|             properties:                                                                                                                                                                                                                                  |
|               COLUMN_STATS_ACCURATE true                                                                                                                                                                                                                 |
|               bucket_count -1                                                                                                                                                                                                                            |
|               columns sno,sname,sex,sage,sdept                                                                                                                                                                                                           |
|               columns.comments                                                                                                                                                                                                                           |
|               columns.types int:string:string:int:string                                                                                                                                                                                                 |
|               field.delim ,                                                                                                                                                                                                                              |
|               file.inputformat org.apache.hadoop.mapred.TextInputFormat                                                                                                                                                                                  |
|               file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                                                                                                                               |
|               location hdfs://node1:9000/user/hive/warehouse/erzhen.db/student                                                                                                                                                                           |
|               name erzhen.student                                                                                                                                                                                                                        |
|               numFiles 1                                                                                                                                                                                                                                 |
|               serialization.ddl struct student { i32 sno, string sname, string sex, i32 sage, string sdept}                                                                                                                                              |
|               serialization.format ,                                                                                                                                                                                                                     |
|               serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                       |
|               totalSize 526                                                                                                                                                                                                                              |
|               transient_lastDdlTime 1532869025                                                                                                                                                                                                           |
|             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                                    |
|                                                                                                                                                                                                                                                          |
|               input format: org.apache.hadoop.mapred.TextInputFormat                                                                                                                                                                                     |
|               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                                                                                                                                  |
|               properties:                                                                                                                                                                                                                                |
|                 COLUMN_STATS_ACCURATE true                                                                                                                                                                                                               |
|                 bucket_count -1                                                                                                                                                                                                                          |
|                 columns sno,sname,sex,sage,sdept                                                                                                                                                                                                         |
|                 columns.comments                                                                                                                                                                                                                         |
|                 columns.types int:string:string:int:string                                                                                                                                                                                               |
|                 field.delim ,                                                                                                                                                                                                                            |
|                 file.inputformat org.apache.hadoop.mapred.TextInputFormat                                                                                                                                                                                |
|                 file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                                                                                                                             |
|                 location hdfs://node1:9000/user/hive/warehouse/erzhen.db/student                                                                                                                                                                         |
|                 name erzhen.student                                                                                                                                                                                                                      |
|                 numFiles 1                                                                                                                                                                                                                               |
|                 serialization.ddl struct student { i32 sno, string sname, string sex, i32 sage, string sdept}                                                                                                                                            |
|                 serialization.format ,                                                                                                                                                                                                                   |
|                 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                     |
|                 totalSize 526                                                                                                                                                                                                                            |
|                 transient_lastDdlTime 1532869025                                                                                                                                                                                                         |
|               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                                  |
|               name: erzhen.student                                                                                                                                                                                                                       |
|             name: erzhen.student                                                                                                                                                                                                                         |
|       Truncated Path -> Alias:                                                                                                                                                                                                                           |
|         /erzhen.db/student [student]                                                                                                                                                                                                                     |
|       Needs Tagging: false                                                                                                                                                                                                                               |
|       Reduce Operator Tree:                                                                                                                                                                                                                              |
|         Group By Operator                                                                                                                                                                                                                                |
|           aggregations: count(VALUE._col0)                                                                                                                                                                                                               |
|           mode: mergepartial                                                                                                                                                                                                                             |
|           outputColumnNames: _col0                                                                                                                                                                                                                       |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                                  |
|           File Output Operator                                                                                                                                                                                                                           |
|             compressed: false                                                                                                                                                                                                                            |
|             GlobalTableId: 0                                                                                                                                                                                                                             |
|             directory: hdfs://node1:9000/tmp/hive/root/00b6fa00-8dc7-433d-b1af-bae934c06990/hive_2018-11-26_12-28-22_186_7180346854191458345-1/-mr-10000/.hive-staging_hive_2018-11-26_12-28-22_186_7180346854191458345-1/-ext-10001                     |
|             NumFilesPerFileSink: 1                                                                                                                                                                                                                       |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE                                                                                                                                                                |
|             Stats Publishing Key Prefix: hdfs://node1:9000/tmp/hive/root/00b6fa00-8dc7-433d-b1af-bae934c06990/hive_2018-11-26_12-28-22_186_7180346854191458345-1/-mr-10000/.hive-staging_hive_2018-11-26_12-28-22_186_7180346854191458345-1/-ext-10001/  |
|             table:                                                                                                                                                                                                                                       |
|                 input format: org.apache.hadoop.mapred.TextInputFormat                                                                                                                                                                                   |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                                                                                                                                                                |
|                 properties:                                                                                                                                                                                                                              |
|                   columns _col0                                                                                                                                                                                                                          |
|                   columns.types bigint                                                                                                                                                                                                                   |
|                   escape.delim \                                                                                                                                                                                                                         |
|                   hive.serialization.extend.additional.nesting.levels true                                                                                                                                                                               |
|                   serialization.format 1                                                                                                                                                                                                                 |
|                   serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                   |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                                                                                                                                                                |
|             TotalFiles: 1                                                                                                                                                                                                                                |
|             GatherStats: false                                                                                                                                                                                                                           |
|             MultiFileSpray: false                                                                                                                                                                                                                        |
|                                                                                                                                                                                                                                                          |
|   Stage: Stage-0                                                                                                                                                                                                                                         |
|     Fetch Operator                                                                                                                                                                                                                                       |
|       limit: -1                                                                                                                                                                                                                                          |
|       Processor Tree:                                                                                                                                                                                                                                    |
|         ListSink                                                                                                                                                                                                                                         |
|                                                                                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
134 rows selected (0.309 seconds)

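How Hive runs queries:
Local mode
Cluster mode

Local mode
Enable local mode:

set hive.exec.mode.local.auto=true;

Note:
hive.exec.mode.local.auto.inputbytes.max defaults to 128 MB.
It is the maximum total input size allowed for local mode; if the input is larger than this value, the query still runs in cluster mode!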
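
A minimal sketch of how the local-mode settings might be combined in one session. The byte value simply restates the 128 MB default; hive.exec.mode.local.auto.input.files.max is not mentioned above and is included here as an assumption about what else limits local mode.

```sql
-- Let Hive decide per query whether the input is small enough to run locally
set hive.exec.mode.local.auto=true;
-- Upper bound on total input size for local mode (134217728 bytes = 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Assumption: upper bound on the number of input files for local mode
set hive.exec.mode.local.auto.input.files.max=4;

-- A small query such as the one explained earlier is a candidate for local mode
select count(student.sno) from student;
```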

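Parallel execution
Enable parallel execution with the following parameter:

set hive.exec.parallel=true;

Note: hive.exec.parallel.thread.number
(the maximum number of jobs that may run in parallel within one SQL query)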
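
As an illustration, a query whose branches do not depend on each other is the kind of case where parallel execution can help. The sketch below reuses the student table from the explain output; the thread number of 8 is an assumed value, not a recommendation.

```sql
set hive.exec.parallel=true;
-- Cap on concurrently running jobs (8 is an assumed value)
set hive.exec.parallel.thread.number=8;

-- The two aggregations are independent, so their stages may be scheduled in parallel
select u.metric, u.cnt
from (
  select 'rows' as metric, count(sno) as cnt from student
  union all
  select 'distinct_sno' as metric, count(distinct sno) as cnt from student
) u;
```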

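Strict mode
Enable strict mode with the following parameter:

set hive.mapred.mode=strict;

(Default: nonstrict in Hive 0.x and 1.x)

Query restrictions:
1. A query on a partitioned table must filter on a partition column in its where clause;
2. An order by must be accompanied by a limit;
3. Queries that produce a Cartesian product are rejected.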
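
The three restrictions can be sketched as follows. access_log, its partition column dt and its column uid are hypothetical names used only for illustration; the rejected statements are left commented out.

```sql
set hive.mapred.mode=strict;

-- 1. Rejected: no filter on the partition column dt (hypothetical table)
-- select uid from access_log;
-- Accepted: the partition column appears in where
select uid from access_log where dt = '2018-11-26';

-- 2. Rejected: order by without limit
-- select sno from student order by sno;
-- Accepted
select sno from student order by sno limit 10;

-- 3. Rejected: join without a join condition (Cartesian product)
-- select a.sno, b.sno from student a join student b;
```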

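Sorting in Hive
Order By - produces a total order over the query result; only a single reducer does the work
(use with caution on large data sets; in strict mode it must be combined with limit)
Sort By - sorts the data within each individual reducer
Distribute By - sends rows with the same column value to the same reducer; it does not sort by itself and is usually combined with Sort By
Cluster By - equivalent to Distribute By + Sort By on the same column
(Cluster By does not accept asc or desc;
use distribute by column sort by column asc|desc instead)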
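
A short sketch of the four clauses side by side. The table score with columns sno and grade is hypothetical.

```sql
-- Rows with the same sno go to the same reducer;
-- within each reducer the rows are sorted by sno, then by grade descending
select sno, grade from score distribute by sno sort by sno, grade desc;

-- Shorthand for distribute by sno sort by sno (ascending only)
select sno, grade from score cluster by sno;

-- Total order through a single reducer; limit keeps it usable in strict mode
select sno, grade from score order by grade desc limit 10;
```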

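Hive Join
When joining, put the small table (the driving table) on the left side of the join
Map Join: the join is completed on the map side
Two ways to get a map join:
1. In SQL, by adding a MapJoin marker (mapjoin hint) to the statement
Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;
2. By enabling automatic MapJoin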

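Hive Join: automatic mapjoin
Enable automatic mapjoin with the following setting:

set hive.auto.convert.join = true;

(When this parameter is true, Hive checks the size of the tables in the join automatically; a table that is small enough is loaded into memory, i.e. a map join is used for the small table)

Related parameters:

hive.mapjoin.smalltable.filesize

(Threshold that separates small tables from large ones; a table smaller than this value is loaded into memory)

hive.ignore.mapjoin.hint

(Default: true; whether to ignore the mapjoin hint, i.e. the MAPJOIN marker)

hive.auto.convert.join.noconditionaltask

(Default: true; when ordinary joins are converted to map joins, whether several map joins are merged into one)

hive.auto.convert.join.noconditionaltask.size

(Maximum combined size of the small tables when several map joins are merged into one)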
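
Put together, a session that relies on automatic conversion might look like the sketch below. The byte values are assumptions chosen for illustration, and smallTable and bigTable are the placeholder names from the hint syntax above.

```sql
set hive.auto.convert.join=true;
-- Tables below this size in bytes are treated as small (assumed value)
set hive.mapjoin.smalltable.filesize=25000000;
set hive.auto.convert.join.noconditionaltask=true;
-- Combined size limit in bytes for merged map joins (assumed value)
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- No hint is written; the join is converted only if smallTable fits the thresholds
select smallTable.key, bigTable.value
from smallTable join bigTable on smallTable.key = bigTable.key;
```

Running explain on the join is one way to check whether the plan contains a map join rather than a reduce-side join.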

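Map-side aggregation
Enable aggregation on the map side with the following parameter:

set hive.map.aggr=true;

Related parameters:

hive.groupby.mapaggr.checkinterval

Number of rows the map-side group by processes before it checks whether aggregation is worthwhile (default: 100000)

hive.map.aggr.hash.min.reduction

Minimum reduction ratio for aggregation (100000 rows are aggregated as a trial; if rows after aggregation / 100000 is greater than this setting, 0.5, map-side aggregation is not used)

hive.map.aggr.hash.percentmemory

Maximum fraction of memory that map-side aggregation may use

hive.map.aggr.hash.force.flush.memory.threshold

Maximum memory the hash table may use during map-side aggregation; exceeding this value triggers a flush

hive.groupby.skewindata

Whether to optimize for data skew produced by Group By; default false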
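
The min.reduction check is easier to follow with numbers. The sketch below restates the two values quoted above and walks through the ratio in comments; the group counts are made up for the arithmetic.

```sql
set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.map.aggr.hash.min.reduction=0.5;
-- Worked example of the check:
--   100000 rows collapse to 30000 groups: 30000 / 100000 = 0.3, not above 0.5,
--   so map-side aggregation stays on
--   100000 rows collapse to 80000 groups: 80000 / 100000 = 0.8, above 0.5,
--   so map-side aggregation is not used

-- Optional: ask Hive to compensate for skewed group by keys
set hive.groupby.skewindata=true;

select sno, count(*) as cnt from student group by sno;
```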

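Controlling the number of maps and reduces in Hive
Parameters related to the number of maps

mapred.max.split.size

Maximum size of a split, i.e. the most data a single map processes

mapred.min.split.size.per.node

Minimum split size on a node

mapred.min.split.size.per.rack

Minimum split size on a rack

Parameters related to the number of reduces

mapred.reduce.tasks

Forces the number of reduce tasks

hive.exec.reducers.bytes.per.reducer

Amount of data each reduce task processes

hive.exec.reducers.max

Maximum number of reduces for one job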
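
A rough sketch of how these parameters interact. All values are illustrative, and the estimate in the comments, min(reducers.max, ceil(input bytes / bytes.per.reducer)), is an assumption about how Hive sizes the reduce phase when the count is not forced.

```sql
-- -1 leaves the reducer count to Hive's own estimate
set mapred.reduce.tasks=-1;
-- 268435456 bytes = 256 MB per reducer (illustrative)
set hive.exec.reducers.bytes.per.reducer=268435456;
-- Upper bound on reducers for one job (illustrative)
set hive.exec.reducers.max=1009;
-- Assumed estimate for 10 GB of input:
--   ceil(10737418240 / 268435456) = 40, and min(1009, 40) = 40 reducers

-- Larger splits mean fewer maps (bytes, illustrative)
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=128000000;
set mapred.min.split.size.per.rack=128000000;
```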

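Hive - JVM reuse
When it helps:
1. Too many small files
2. Too many tasks

Set it with set mapred.job.reuse.jvm.num.tasks=n;
(n is the number of tasks that each JVM, i.e. each task slot, may run one after another)

Drawback: once this is enabled, the task slots stay occupied whether or not a task is running in them; they are released only when every task, i.e. the whole job, has finished!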

HiveQL conditional functions

1. IF( Test Condition, True Value, False Value )
The IF function evaluates the "Test Condition"; if it is true, it returns the "True Value". Otherwise, it returns the "False Value".
Example: IF(1=1, 'working', 'not working') returns 'working'

2. COALESCE( value1, value2, ... )
The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, it returns NULL.
Example: COALESCE(NULL,NULL,5,NULL,4) returns 5

3. CASE Statement
The syntax for the case statement is:
CASE  [ expression ]
  WHEN condition1 THEN result1
  WHEN condition2 THEN result2
  ...
  WHEN conditionn THEN resultn
  ELSE result
END
Here expression is optional. It is the value that you compare against the list of conditions (i.e. condition1, condition2, ... conditionn).

All the conditions must be of the same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement returns the result and does not evaluate the remaining conditions.

All the results must be of the same datatype. This is the value returned once a condition is found to be true.

If no condition is found to be true, the case statement returns the value in the ELSE clause. If the ELSE clause is omitted and no condition is found to be true, the case statement returns NULL.

Example: 

CASE Fruit
  WHEN 'APPLE' THEN 'The owner is APPLE'
  WHEN 'ORANGE' THEN 'The owner is ORANGE'
  ELSE 'It is another Fruit'
END

The other form of CASE is:

CASE 
  WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
  WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
  ELSE 'It is another Fruit'
END
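
To show the three constructs in one place, here is a small sketch against the student table from the explain output. Only the int column sno is known from that output; the thresholds and labels are arbitrary.

```sql
select
  sno,
  -- IF: two-way choice on a single test
  if(sno is null, 'missing', 'present') as sno_state,
  -- COALESCE: first non-NULL value, so -1 stands in for a NULL sno
  coalesce(sno, -1) as sno_or_default,
  -- CASE: conditions are checked in order and the first true one wins
  case
    when sno < 100 then 'low'
    when sno < 1000 then 'middle'
    else 'high'
  end as sno_band
from student;
```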

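Task: from the first 1000 rows of a Hive table, pick out the distinct (refid, imsi) pairs.
Wrong:

select distinct refid,imsi from HIVE_D_MT_UU_H_SPARK limit 1000; 

This reads the whole table, deduplicates it, and shows the first 1000 distinct (refid, imsi) rows.

Correct (Hive requires an alias on a subquery in the from clause, here t):

select distinct refid,imsi from (select * from HIVE_D_MT_UU_H_SPARK limit 1000) t;

Tuned:

CREATE TABLE TEMP_HIVE_D_MT_UU_H_SPARK AS 
select * from HIVE_D_MT_UU_H_SPARK limit 1000; 
select distinct refid,imsi from TEMP_HIVE_D_MT_UU_H_SPARK;

The fastest Hive query is one that never goes through MapReduce. A simple select is the fastest; nested queries and the like are slower. This is different from a relational database.
The tuned version runs faster.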
