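Generally speaking, the most effective way to speed up a query is:
Hive 3.0 introduces materialized views, together with automatic query rewriting based on them (implemented on top of Apache Calcite). Notably, 3.0 lets you choose where a materialized view is stored: natively in Hive, or in another system (such as Druid) through user-defined storage handlers. Hive 3.0 also provides control over the lifecycle of a materialized view (such as refreshing its data).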
According to Wikipedia, a SQL view is the result set of a stored query on the data. Say you have many different tables that you query constantly, always with the same joins, filters and aggregations. With a view, you can simplify access to those datasets while giving them more meaning for the end user. It avoids repeating the same complex queries and eases schema evolution.
For example, an application needs access to a products dataset with the product owner and the total number of orders for each product. Such a query has to join the User and Order tables with the Product table. A view hides the complexity of the schema from end users by exposing a single table with its own dedicated ACLs.
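A minimal sketch of such a view in HiveQL is shown here. The table and column names (product, users, orders and their keys) are hypothetical and only serve to illustrate the join described above.

```sql
-- Hypothetical schema, for illustration only:
--   product(id, name, owner_id), users(id, name), orders(id, product_id)
CREATE VIEW IF NOT EXISTS product_summary AS
SELECT p.id        AS product_id,
       p.name      AS product_name,
       u.name      AS owner_name,
       COUNT(o.id) AS order_count   -- products without orders get a count of 0
FROM product p
JOIN users u       ON u.id = p.owner_id
LEFT JOIN orders o ON o.product_id = p.id
GROUP BY p.id, p.name, u.name;
```

End users can then run SELECT * FROM product_summary without knowing about the three underlying tables.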
However, such views in Hive used to be purely virtual, so every query against them re-ran the underlying joins and aggregations, which could be huge and slow. Alternatively, you could create an intermediate table to store the results of your query, but that requires changing your access patterns and leaves you with the challenge of keeping the data in the table fresh.
We can identify four main types of optimization:
The goal of Materialized views (MV) is to improve the speed of queries while requiring zero maintenance operations.
The main features are:
Materialized view creation
Basic supported syntax:
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[DISABLE REWRITE]
[COMMENT materialized_view_comment]
[PARTITIONED ON (col_name, ...)]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
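Notes
(1) When a materialized view is created, the results of its query are persisted automatically. Creation is atomic: the materialized view is not visible to any user until the query has finished populating it.
(2) By default, the materialized view can be used by the optimizer for query rewriting (this can be turned off at creation time with the DISABLE REWRITE option).
(3) The SerDe and storage format are optional and can be configured by the user; the defaults are taken from hive.materializedview.serde and hive.materializedview.fileformat.
(4) A materialized view can be stored in an external system (such as Druid) through a custom storage handler, for example: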
CREATE MATERIALIZED VIEW druid_wiki_mv
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
Materialized views currently support the DROP and SHOW operations; other operations will be added later.
-- Drops a materialized view
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
-- Shows materialized views (with optional filters)
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards'];
-- Shows information about a specific materialized view
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
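As a concrete illustration of the statements above, the following sketch applies them to the depts_agg materialized view and the hive3_test database that appear in the walkthrough later in this article. The wildcard pattern is only an example.

```sql
-- List materialized views in hive3_test whose names start with "depts"
SHOW MATERIALIZED VIEWS IN hive3_test 'depts*';

-- Inspect the definition, storage details and properties of one materialized view
DESCRIBE FORMATTED hive3_test.depts_agg;

-- Remove the materialized view once it is no longer needed
DROP MATERIALIZED VIEW hive3_test.depts_agg;
```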
Materialized view-based query rewriting
SET hive.materializedview.rewriting=true;
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
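The two switches work at different levels: the SET statement controls rewriting for the session, while ALTER controls whether one particular materialized view may be used. A small sketch, assuming the depts table and the depts_agg materialized view from the walkthrough later in this article:

```sql
-- Session-level switch: allow the optimizer to rewrite queries using materialized views
SET hive.materializedview.rewriting=true;

-- Exclude one materialized view from rewriting; it can still be queried directly
ALTER MATERIALIZED VIEW depts_agg DISABLE REWRITE;

-- With rewriting disabled for depts_agg, this plan is expected to scan depts itself
EXPLAIN SELECT deptno, count(1) AS deptno_cnt FROM depts GROUP BY deptno;

-- Make the materialized view eligible for rewriting again
ALTER MATERIALIZED VIEW depts_agg ENABLE REWRITE;
```

Comparing the EXPLAIN output before and after the ALTER statements is a quick way to confirm whether a query is being rewritten.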
Rewriting against materialized views is implemented on top of Calcite. Examples of the supported rewritings can be found in:
Materialized Views
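When the source data changes (new data is inserted or existing data is modified), the materialized view must be updated as well to stay consistent. Currently the user has to trigger the rebuild explicitly: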
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
Incremental rebuild
Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view.
By default, Hive will attempt to rebuild a materialized view incrementally, falling back to a full rebuild if that is not possible. The current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.
To execute incremental maintenance, the following conditions should be met:
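The difference between the two cases can be sketched as follows, assuming the transactional depts table and the depts_agg materialized view from the walkthrough later in this article. The inserted row is made up, and whether Hive actually chooses the incremental path still depends on the conditions for incremental maintenance.

```sql
-- INSERT-only change to the source table: eligible for an incremental rebuild
INSERT INTO depts VALUES (1003, 'wangwu', 1);
ALTER MATERIALIZED VIEW depts_agg REBUILD;

-- UPDATE (or DELETE) on the source table: the next rebuild is a full rebuild
UPDATE depts SET locationid = 2 WHERE deptno = 1001;
ALTER MATERIALIZED VIEW depts_agg REBUILD;
```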
By default, once a materialized view's contents are stale, the materialized view will not be used for automatic query rewriting.
However, on some occasions it may be fine to accept stale data, e.g., if the materialized view uses non-transactional tables, so Hive cannot verify whether its contents are outdated, but we still want to use automatic rewriting. For those occasions, we can run a rebuild operation periodically, e.g., every 5 minutes, and define the required freshness of the materialized view data using the hive.materializedview.rewriting.time.window configuration parameter, for instance:
SET hive.materializedview.rewriting.time.window=10min;
The parameter value can also be overridden for a specific materialized view by setting it as a table property when the materialization is created.
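A per-view override might look like the sketch below. The table property key 'rewriting.time.window' is an assumption based on how Hive names this property internally and should be checked against your Hive version; the view name depts_agg_10min is hypothetical.

```sql
-- Assumption: the per-view property key is 'rewriting.time.window'
CREATE MATERIALIZED VIEW depts_agg_10min
TBLPROPERTIES ('rewriting.time.window'='10min')
AS
SELECT deptno, count(1) AS deptno_cnt
FROM depts
GROUP BY deptno;
```

With such a property in place, this one view would tolerate up to 10 minutes of staleness regardless of the session-wide setting.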
<property>
  <name>hive.materializedview.rewriting</name>
  <value>true</value>
  <description>Whether to try to rewrite queries using the materialized views enabled for rewriting</description>
</property>
<property>
  <name>hive.materializedview.rewriting.strategy</name>
  <value>heuristic</value>
  <description>
    Expects one of [heuristic, costbased].
    The strategy that should be used to cost and select the materialized view rewriting.
    heuristic: Always try to select the plan using the materialized view if rewriting produced one, choosing the plan with lower cost among possible plans containing a materialized view
    costbased: Fully cost-based strategy, always use plan with lower cost, independently on whether it uses a materialized view or not
  </description>
</property>
<property>
  <name>hive.materializedview.rewriting.time.window</name>
  <value>0min</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is min if not specified.
    Time window, specified in seconds, after which outdated materialized views become invalid for automatic query rewriting.
    For instance, if more time than the value assigned to the property has passed since the materialized view was created or rebuilt, and one of its source tables has changed since, the materialized view will not be considered for rewriting. Default value 0 means that the materialized view cannot be outdated to be used automatically in query rewriting. Value -1 means to skip this check.
  </description>
</property>
<property>
  <name>hive.materializedview.rewriting.incremental</name>
  <value>false</value>
  <description>
    Whether to try to execute incremental rewritings based on outdated materializations and
    current content of tables. Default value of true effectively amounts to enabling incremental
    rebuild for the materializations too.
  </description>
</property>
<property>
  <name>hive.materializedview.rebuild.incremental</name>
  <value>true</value>
  <description>
    Whether to try to execute incremental rebuild for the materialized views. Incremental rebuild
    tries to modify the original materialization contents to reflect the latest changes to the
    materialized view source tables, instead of rebuilding the contents fully. Incremental rebuild
    is based on the materialized view algebraic incremental rewriting.
  </description>
</property>
<property>
  <name>hive.materializedview.fileformat</name>
  <value>ORC</value>
  <description>
    Expects one of [none, textfile, sequencefile, rcfile, orc].
    Default file format for CREATE MATERIALIZED VIEW statement
  </description>
</property>
<property>
  <name>hive.materializedview.serde</name>
  <value>org.apache.hadoop.hive.ql.io.orc.OrcSerde</value>
  <description>Default SerDe used for materialized views</description>
</property>
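The properties listed above can also be changed for a single session instead of cluster-wide. A brief sketch, with values chosen only for illustration:

```sql
-- Use the fully cost-based strategy for this session only
SET hive.materializedview.rewriting.strategy=costbased;

-- Tolerate materialized views that are up to 10 minutes stale
SET hive.materializedview.rewriting.time.window=10min;

-- Print the effective value of a property
SET hive.materializedview.rewriting.strategy;
```

Session-level settings are convenient for experimenting with EXPLAIN before changing the cluster configuration.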
(1) Create a transactional table, depts
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
CREATE TABLE depts (
deptno INT,
deptname VARCHAR(256),
locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
(2) Load data
hive> INSERT OVERWRITE TABLE depts
> select
> id,name,1 as loc from student;
Query ID = didi_20181128204405_c06c8983-a363-458b-b1f8-443deeb514c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
....
hive> select * from depts;
OK
1001 zhangsan 1
1002 lisi 1
Time taken: 0.195 seconds, Fetched: 2 row(s)
...
(3) Create an aggregate materialized view on depts
hive> CREATE MATERIALIZED VIEW depts_agg
> AS
> SELECT deptno, count(1) as deptno_cnt from depts group by deptno;
Query ID = didi_20181128204706_be53ca94-f594-49a2-beda-7cec2b2f2c71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1543385586294_0004, Tracking URL = http://localhost:8088/proxy/application_1543385586294_0004/
Kill Command = /..../software/hadoop/hadoop-2.7.4/bin/mapred job -kill job_1543385586294_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
.....
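Note
As the log shows, unlike an ordinary CREATE TABLE, executing CREATE MATERIALIZED VIEW launches an MR job to build the materialized view (no other engine such as Spark was specified here, so MR is used by default).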
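(4) Query the source table depts
Because the query matches the materialized view, it is rewritten to read from the materialized view and runs faster (no MR job is launched; it is just a plain table scan).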
hive> SELECT deptno, count(1) as deptno_cnt from depts group by deptno;
OK
1001 1
1002 1
Time taken: 0.414 seconds, Fetched: 2 row(s)
The execution plan shows the details.
The query is automatically rewritten to a TableScan with alias hive3_test.depts_agg:
hive> explain SELECT deptno, count(1) as deptno_cnt from depts group by deptno;
OK
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: hive3_test.depts_agg
Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: deptno (type: int), deptno_cnt (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
ListSink
Time taken: 0.275 seconds, Fetched: 17 row(s)
Many improvements are planned: