1. Use Tez as the execution engine instead of MapReduce.
set hive.execution.engine=tez;
2. Use ORCFile. Storing tables in the ORC format significantly speeds up queries, especially joins across multiple tables.
CREATE TABLE A_ORC (
  customerID int,
  name string,
  age int,
  address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
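An existing table can be moved to ORC by copying its rows into the new table. A minimal sketch, assuming a hypothetical source table A (not defined above) that has the same four columns as A_ORC:

```sql
-- Assumption: A is an existing table (for example, stored as text)
-- with the columns customerID, name, age, address.
INSERT OVERWRITE TABLE A_ORC
SELECT customerID, name, age, address
FROM A;
```

Queries that previously read A would then read A_ORC instead.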
3. Use vectorization. It improves the performance of scans, aggregations, filters, and joins by processing rows in batches of 1024 instead of one row at a time.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
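Whether a given query is actually vectorized can be checked in its plan. A sketch, assuming the A_ORC table from item 2; the aggregate query itself is only an illustration:

```sql
-- Enable vectorization for this session, then inspect the plan.
set hive.vectorized.execution.enabled = true;

EXPLAIN
SELECT age, count(*) AS cnt
FROM A_ORC
GROUP BY age;
```

When vectorization applies, the plan for the affected tasks should contain a line such as "Execution mode: vectorized". The exact plan layout depends on the Hive version.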
4. Use cost-based optimization (CBO). Hive then chooses the execution plan based on the estimated cost of the query.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Run the "analyze" command to collect the table statistics that the CBO relies on.
analyze table tbl_student compute statistics;
analyze table tbl_student compute statistics for columns birthday, race;
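To confirm that the statistics were collected, the table and column metadata can be inspected. A sketch, assuming a Hive version that shows column statistics in DESCRIBE FORMATTED output (0.14 or later, as far as I know):

```sql
-- Table-level statistics (such as numRows and totalSize)
-- are listed under "Table Parameters".
DESCRIBE FORMATTED tbl_student;

-- Column-level statistics (min, max, null count, distinct count)
-- for a single column.
DESCRIBE FORMATTED tbl_student birthday;
```

If the values are missing, the CBO has nothing to base its cost estimates on for that table.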
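5. Optimize the SQL. For example, the query below joins clicks against an aggregate of itself to get the latest click of each session: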
SELECT clicks.*
FROM clicks
inner join (
  select sessionID, max(timestamp) as max_ts
  from clicks
  group by sessionID
) latest
ON clicks.sessionID = latest.sessionID and clicks.timestamp = latest.max_ts;
Replace it with the query below, which uses a window function and avoids the self-join:
SELECT *
FROM (
  SELECT *, RANK() over (partition by sessionID order by timestamp desc) as rank
  FROM clicks
) ranked_clicks
WHERE ranked_clicks.rank=1;
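RANK() gives tied rows the same rank, so a session with several clicks sharing the latest timestamp returns all of them, just as the join version does. If exactly one row per session is wanted, ROW_NUMBER() can be used instead. A sketch under two assumptions: the alias names rn and numbered_clicks are made up for illustration, and timestamp is quoted in backticks because some Hive versions treat it as a reserved keyword:

```sql
SELECT *
FROM (
  SELECT c.*,
         -- Numbers rows 1, 2, 3, ... within each session, newest first.
         -- Rows with equal timestamps are ordered arbitrarily.
         ROW_NUMBER() OVER (PARTITION BY sessionID ORDER BY `timestamp` DESC) AS rn
  FROM clicks c
) numbered_clicks
WHERE numbered_clicks.rn = 1;
```

Note that the result carries the extra rn column in addition to the columns of clicks.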