优化连接查询最简单的方式是使用compute stats命令收集所有参与关联表的统计信息,让impala根据每个表的大小、列的非重复值个数等相关信息自动优化查询。
如果参与关联的表的统计信息不可用,使用impala自动的连接顺序效率很低,可以在select关键字后使用straight_join关键字手动指定连接顺序,指定了该关键字之后,impala会使用表在查询中出现的先后顺序作为关联顺序进行处理。
使用straight_join关键字需要手动指定连接表的先后顺序:
(1)指定最大的表为第一张表。
(2)指定最小的一张表作为下一张表。
(3)接着指定剩下的表中最小的表作为下一张表。如果有四张表分别为BIG
, MEDIUM
, SMALL
, 和TINY
, 指定的顺序应该为BIG
, TINY
, SMALL
, MEDIUM
.
Impala查询优化器根据表的绝对大小和相对大小而选择不同的关联技术:
(1)默认的方式为Broadcast joins,当大表连接小表时,小表的内容会被发送到所有执行查询的节点上
(2)另一种为partitioned join,用于大小差不多的大表关联,使用此方式,可以保证关联操作可以并行执行,每个表的一部分数据被发送到不同的节点上,最后各个节点分别对传送过来的数据并行处理。具体使用哪种方式依赖于compute stats的统计信息。
可以使用特定的查询执行explain语句,来确定表的连接策略,如果通过基准测试发现某种策略优于另外一种策略,那么可以通过Hint的方式手动指定需要的连接方式。
1.当统计信息不可用时如何处理join
如果只有某些表的统计信息不可用,impala会根据存在统计信息的表重新生成连接顺序,有统计信息的表会被放在连接顺序的最左端,并根据表的基数和规模降序排列,没有统计信息的表会被作为空表对待,总是放在连接顺序的最右边。
2.使用straight_join覆盖连接顺序
如果关联查询由于统计信息过期或者数据分布等问题导致效率低下,可以通过straight_join关键字改变连接顺序,指定顺序后不会再使用impala自动生成的连接顺序。
3.案例
[localhost:21000] > create table big stored as parquet as select * from raw_data;
+----------------------------+
| summary |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
Returned 1 row(s) in 671.56s
[localhost:21000] > desc big;
+-----------+---------+---------+
| name | type | comment|
+-----------+---------+---------+
| id | int | |
| val | int | |
| zfill | string | |
| name | string | |
| assertion | boolean | |
+-----------+---------+---------+
Returned 5 row(s) in 0.01s
[localhost:21000] > create table medium stored as parquet as select * from big limit 200 * floor(1e6);
+---------------------------+
| summary |
+---------------------------+
| Inserted 200000000 row(s) |
+---------------------------+
Returned 1 row(s) in 138.31s
[localhost:21000] > create table small stored as parquet as select id,val,name from big where assertion = true limit 1 * floor(1e6);
+-------------------------+
| summary |
+-------------------------+
| Inserted 1000000 row(s) |
+-------------------------+
Returned 1 row(s) in 6.32s
实际运行查询之前使用explain查看连接信息,启用执行计划的详细输出,可以看到更多的性能相关的输出信息,红色字体显示。信息提示参与关联的表没有统计信息,impala不能为每个执行阶段估计出结果集的大小,使用Broadcast方式向每个节点发送一个表的完整副本。
[localhost:21000] > set explain_level=verbose;
EXPLAIN_LEVEL set to verbose
[localhost:21000] > explain select count(*) from big join medium where big.id = medium.id;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=2.10GB VCores=2 |
| |
| PLAN FRAGMENT 0 |
| PARTITION: UNPARTITIONED |
| |
| 6:AGGREGATE (merge finalize) |
| | output: SUM(COUNT(*)) |
| | cardinality: 1 |
| | per-host memory: unavailable |
| | tuple ids: 2 |
| | |
| 5:EXCHANGE |
| cardinality: 1 |
| per-host memory: unavailable |
| tuple ids: 2 |
| |
| PLAN FRAGMENT 1 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 5 |
| UNPARTITIONED |
| |
| 3:AGGREGATE |
| | output: COUNT(*) |
| | cardinality: 1 |
| | per-host memory: 10.00MB |
| | tuple ids: 2 |
| | |
| 2:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | big.id = medium.id |
| | cardinality: unavailable |
| | per-host memory: 2.00GB |
| | tuple ids: 0 1 |
| | |
| |----4:EXCHANGE |
| | cardinality: unavailable |
| | per-host memory: 0B |
| | tuple ids: 1 |
| | |
| 0:SCAN HDFS |
| table=join_order.big #partitions=1/1 size=23.12GB |
| table stats: unavailable |
| column stats: unavailable |
| cardinality: unavailable |
| per-host memory: 88.00MB |
| tuple ids: 0 |
| |
| PLAN FRAGMENT 2 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 4 |
| UNPARTITIONED |
| |
| 1:SCAN HDFS |
| table=join_order.medium #partitions=1/1 size=4.62GB |
| table stats: unavailable |
| column stats: unavailable |
| cardinality: unavailable |
| per-host memory: 88.00MB |
| tuple ids: 1 |
+----------------------------------------------------------+
Returned 64 row(s) in 0.04s
为每张表执行compute stats收集统计信息:
[localhost:21000] > compute stats small;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 4.26s
[localhost:21000] > compute stats medium;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 5 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 42.11s
[localhost:21000] > compute stats big;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 5 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 165.44s
收集完统计信息之后,impala会根据统计信息选择更有效的连接顺序,具体选择哪种方式仍然是根据表的大小和行数的差别来确定。
[localhost:21000] > explain select count(*) from medium join big where big.id = medium.id;
Query: explain select count(*) from medium join big where big.id = medium.id
+-----------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=937.23MB VCores=2 |
| |
| PLAN FRAGMENT 0 |
| PARTITION: UNPARTITIONED |
| |
| 6:AGGREGATE (merge finalize) |
| | output: SUM(COUNT(*)) |
| | cardinality: 1 |
| | per-host memory: unavailable |
| | tuple ids: 2 |
| | |
| 5:EXCHANGE |
| cardinality: 1 |
| per-host memory: unavailable |
| tuple ids: 2 |
| |
| PLAN FRAGMENT 1 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 5 |
| UNPARTITIONED |
| |
| 3:AGGREGATE |
| | output: COUNT(*) |
| | cardinality: 1 |
| | per-host memory: 10.00MB |
| | tuple ids: 2 |
| | |
| 2:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | big.id = medium.id |
| | cardinality: 1443004441 |
| | per-host memory: 839.23MB |
| | tuple ids: 1 0 |
| | |
| |----4:EXCHANGE |
| | cardinality: 200000000 |
| | per-host memory: 0B |
| | tuple ids: 0 |
| | |
| 1:SCAN HDFS |
| table=join_order.big #partitions=1/1 size=23.12GB |
| table stats: 1000000000 rows total |
| column stats: all |
| cardinality: 1000000000 |
| per-host memory: 88.00MB |
| tuple ids: 1 |
| |
| PLAN FRAGMENT 2 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 4 |
| UNPARTITIONED |
| |
| 0:SCAN HDFS |
| table=join_order.medium #partitions=1/1 size=4.62GB |
| table stats: 200000000 rows total |
| column stats: all |
| cardinality: 200000000 |
| per-host memory: 88.00MB |
| tuple ids: 0 |
+-----------------------------------------------------------+
Returned 64 row(s) in 0.04s
[localhost:21000] > explain select count(*) from small join big where big.id = small.id;
Query: explain select count(*) from small join big where big.id = small.id
+-----------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=101.15MB VCores=2 |
| |
| PLAN FRAGMENT 0 |
| PARTITION: UNPARTITIONED |
| |
| 6:AGGREGATE (merge finalize) |
| | output: SUM(COUNT(*)) |
| | cardinality: 1 |
| | per-host memory: unavailable |
| | tuple ids: 2 |
| | |
| 5:EXCHANGE |
| cardinality: 1 |
| per-host memory: unavailable |
| tuple ids: 2 |
| |
| PLAN FRAGMENT 1 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 5 |
| UNPARTITIONED |
| |
| 3:AGGREGATE |
| | output: COUNT(*) |
| | cardinality: 1 |
| | per-host memory: 10.00MB |
| | tuple ids: 2 |
| | |
| 2:HASH JOIN |
| | join op: INNER JOIN (BROADCAST) |
| | hash predicates: |
| | big.id = small.id |
| | cardinality: 1000000000 |
| | per-host memory: 3.15MB |
| | tuple ids: 1 0 |
| | |
| |----4:EXCHANGE |
| | cardinality: 1000000 |
| | per-host memory: 0B |
| | tuple ids: 0 |
| | |
| 1:SCAN HDFS |
| table=join_order.big #partitions=1/1 size=23.12GB |
| table stats: 1000000000 rows total |
| column stats: all |
| cardinality: 1000000000 |
| per-host memory: 88.00MB |
| tuple ids: 1 |
| |
| PLAN FRAGMENT 2 |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 4 |
| UNPARTITIONED |
| |
| 0:SCAN HDFS |
| table=join_order.small #partitions=1/1 size=17.93MB |
| table stats: 1000000 rows total |
| column stats: all |
| cardinality: 1000000 |
| per-host memory: 32.00MB |
| tuple ids: 0 |
+-----------------------------------------------------------+
Returned 64 row(s) in 0.03s
而实际执行查询时发现无论表的连接顺序如何,执行的时间差不多,因为样本数据的ID列和VAL列都包含很多的重复值
[localhost:21000] > select count(*) from big join small on (big.id = small.id);
Query: select count(*) from big join small on (big.id = small.id)
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
Returned 1 row(s) in 21.68s
[localhost:21000] > select count(*) from small join big on (big.id = small.id);
Query: select count(*) from small join big on (big.id = small.id)
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
Returned 1 row(s) in 20.45s
[localhost:21000] > select count(*) from big join small on (big.val = small.val);
+------------+
| count(*) |
+------------+
| 2000948962 |
+------------+
Returned 1 row(s) in 108.85s
[localhost:21000] > select count(*) from small join big on (big.val = small.val);
+------------+
| count(*) |
+------------+
| 2000948962 |
+------------+
Returned 1 row(s) in 100.76s
1.表统计信息
show table stats parquet_snappy;
compute stats parquet_snappy;
show table stats parquet_snappy;
2.列统计信息
show column stats parquet_snappy;
compute stats parquet_snappy;
show column stats parquet_snappy;
3.分区表的表统计信息和列统计信息
show partitions 与show table stats 显示信息一样
show partitions year_month_day; +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... | 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | Total | | | -1 | 5 | 12.58MB | 0B | | |... +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show table stats year_month_day; +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+... | 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | Total | | | -1 | 5 | 12.58MB | 0B | | |... +-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day; +-----------+---------+------------------+--------+----------+----------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +-----------+---------+------------------+--------+----------+----------+ | id | INT | -1 | -1 | 4 | 4 | | val | INT | -1 | -1 | 4 | 4 | | zfill | STRING | -1 | -1 | -1 | -1 | | name | STRING | -1 | -1 | -1 | -1 | | assertion | BOOLEAN | -1 | -1 | 1 | 1 | | year | INT | 1 | 0 | 4 | 4 | | month | INT | 1 | 0 | 4 | 4 | | day | INT | 5 | 0 | 4 | 4 | +-----------+---------+------------------+--------+----------+----------+
compute stats year_month_day; +-----------------------------------------+ | summary | +-----------------------------------------+ | Updated 5 partition(s) and 5 column(s). | +-----------------------------------------+
show table stats year_month_day; +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... | year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |... +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+... | 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |... | 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |... | Total | | | 469290 | 5 | 12.58MB | 0B | | |... +-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day; +-----------+---------+------------------+--------+----------+-------------------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | +-----------+---------+------------------+--------+----------+-------------------+ | id | INT | 511129 | -1 | 4 | 4 | | val | INT | 364853 | -1 | 4 | 4 | | zfill | STRING | 311430 | -1 | 6 | 6 | | name | STRING | 471975 | -1 | 22 | 13.00160026550293 | | assertion | BOOLEAN | 2 | -1 | 1 | 1 | | year | INT | 1 | 0 | 4 | 4 | | month | INT | 1 | 0 | 4 | 4 | | day | INT | 5 | 0 | 4 | 4 | +-----------+---------+------------------+--------+----------+-------------------+
|
4 COMPUTE STATS
与COMPUTE INCREMENTAL STATS
COMPUTE STATS 和 COMPUTE INCREMENTAL STATS,只能使用其中的一个,不可同时使用。COMPUTE STATS收集表级和分区级的行统计与列统计信息,使用时会消耗CPU,对于非常大的表而言,会耗费很长的时间。提高COMPUTE STATS的效率,需要做到下面几点:
(1)限制统计列的数量。从2.12版本开始有此特点。
(2)设置MT_DOP查询选项,使用更多的线程进行统计信息,注意:对大表收集统计信息时,如果设置较高的MT_DOP值会对同时间运行的其他查询产生负面影响。此特点从2.8开始引入。
(3)通过实验推断或者取样特征进一步提高统计信息的效率。(属于实验)
Compute stats需要周期地运行,比如每周,或者当表的内容发生重大改变的时候。取样的特点是通过处理表的一部分数据,使得compute stats更有效率,推断特点的目的是通过估计新的或者修改的分区的行统计来减少需要重新compute stats的频率。取样和推断的特点默认是关闭的,可以全局开启也可以针对某个表开启,设置--enable_stats_extrapolation参数全局开启,同过针对某个表设置impala.enable.stats.extrapolation=true属性进行开启,表级别的设置会覆盖全局设置。
对于2.1.0或者更高版本,可以使用COMPUTE INCREMENTAL STATS 和DROP INCREMENTAL STATS命令,指的是增量统计,针对分区表。如果对分区表使用此命令,默认情况下impala只处理没有增量统计的分区,即仅处理新加入的分区。
对于一个有大量分区和许多列的表,每个分区的每个列大约400byte的元数据增加内存负载,当必须要缓存到catalogd主机和充当coordinator 的impalad主机时,如果所有表的元数据超过2G,那么服务会宕机。COMPUTE INCREMENTAL STATS比COMPUTE STATS耗时。
5使用alter table手动设置表和列的统计信息
--创建表
create table analysis_data stored as parquet as select * from raw_data;
Inserted 1000000000 rows in 181.98s
--收集统计信息
compute stats analysis_data;
--插入数据
insert into analysis_data select * from smaller_table_we_forgot_before;
Inserted 1000000 rows in 15.32s
-- 共 1001000000行. 设置统计信息.
alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
准入机制:为高并发查询避免内存不足提供了有利的保障。
准入机制功能可以让我们在集群侧对并发执行的查询的数目和使用的内存设置一个上限。那些超多限制的查询不会被取消,而是被放在队列中等待执行。一旦其他的查询执行结束释放了相关资源,队列中的查询任务就可以继续执行了。
1.使用cloudera manager配置
可以使用cloudera manager管理控制台配置资源池、管理等待队列、设置并发查询的个数限制以及如何捕获到是否超过了限制等。
2.手动配置
通过修改配置文件fair-scheduler.xml 和llama-site.xml,并修改impala进程启动参数。
对于一个只使用单个资源池的简单配置,可以不配置fair-scheduler.xml 和llama-site.xml,只需要配置命令行参数。
(1)--default_pool_max_queued
(2)--default_pool_max_requests
(3) --default_pool_mem_limit
(4) –-disable_admission_control
(5)--disable_pool_max_requests
(6)--disable_pool_mem_limits
(7)-- fair_scheduler_allocation_path
(8) --llama_site_path
(9) --queue_wait_timeout_ms
对于使用多个资源池的配置,需要修改fair-scheduler.xml 和llama-site.xml
fair-scheduler.xml
|
llama-site.xml:
|
Explain语句提供了一个查询执行的逻辑步骤,包括怎样将查询分不到多个节点上,各个节点之前怎样交换中间结果以及产生最终结果等,可以通过这些信息初步判断查询执行是否高效。
[impalad-host:21000] > explain select count(*) from customer_address; +----------------------------------------------------------+ | Explain String | +----------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=42.00MB VCores=1 | | | | 03:AGGREGATE [MERGE FINALIZE] | | | output: sum(count(*)) | | | | | 02:EXCHANGE [PARTITION=UNPARTITIONED] | | | | | 01:AGGREGATE | | | output: count(*) | | | | | 00:SCAN HDFS [default.customer_address] | | partitions=1/1 size=5.25MB | +----------------------------------------------------------+
|
1.选择合适的数据格式
2.避免数据处理过程中产生过多小文件
使用insert…select在表表之间拷贝数据。避免对海量数据或者影响性能的关键表使用insert…values插入数据,因为每条这样的insert语句都会产生单个的小文件。
如果在数据处理过程中产生了上千个小文件,需要使用insert…select来讲数据复制到另外一张表,在复制的过程中也解决了小文件过多的问题。
3.选择合适的分区粒度。
如果一个包含上千个分区的parquet表,每个分区的数据都小于1G,就需要采用更大的分区粒度,只有分区的粒度使文件的大小合适,才能充分利用HDFS的IO批处理性能和Impala的分布式查询。
4.使用compute stats收集连接查询中海量数据表或者影响性能的关键表的统计信息
5.最小化向客户端传输结果的开销
使用聚集、过滤、limit子句、避免结果集输出样式。
6.在实际运行一个查询之前,使用explain查看执行计划是否以高效合理的方式运行
7.在运行一个查询之后,使用profile命令查看IO,内存消耗,网络带宽占用,CPU使用率等信息是否在期望的范围之内。