Common Hive Commands: A Brief Introduction to ANALYZE TABLE

Last time I covered another common Hive command, MSCK REPAIR TABLE. This time let's look at Hive's ANALYZE TABLE command, and a later post will cover Impala's COMPUTE STATS command. All of these commands collect statistics about tables, which are used to speed up queries.

What does Hive's ANALYZE TABLE command do?

The Hive documentation describes the purpose of statistics as follows:

Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them. Statistics may sometimes meet the purpose of the users' queries. Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans. Some examples are getting the quantile of the users' age distribution, the top 10 apps that are used by people, and the number of distinct sessions.

In short, its main purpose is to optimize queries and make them run faster.

The main statistics Hive collects
1. Table and Partition Statistics:
  • Number of rows
  • Number of files
  • Size in bytes
2. Column Statistics (added in Hive 0.10.0):
  • Per-column information such as min/max values, number of NULLs, and number of distinct values
3. Top K Statistics
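As a side note, Hive can also gather table-level statistics automatically when data is written through INSERT, controlled by a configuration flag. ANALYZE TABLE is mainly needed for data loaded by other means (e.g. LOAD DATA or files dropped into an external table's directory). A minimal illustration:

```sql
-- When enabled (the default), table stats such as numRows are
-- collected automatically during INSERT OVERWRITE / INSERT INTO.
SET hive.stats.autogather=true;
```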
Syntax

ANALYZE TABLE t [PARTITION p] COMPUTE STATISTICS [FOR COLUMNS c, ...] [NOSCAN];
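For large tables a full analysis can be expensive, because it has to scan the data. Hive also accepts a NOSCAN modifier, which gathers only the file-level statistics (number of files and size in bytes) without reading the rows. A sketch, reusing the table from the example below:

```sql
-- NOSCAN only collects numFiles and totalSize;
-- numRows and rawDataSize are not computed.
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily
  PARTITION(date_time='20190228') COMPUTE STATISTICS NOSCAN;
```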

Example 1: computing table and partition statistics
# Run the statement
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228') COMPUTE STATISTICS;
# Check the result
DESCRIBE extended dw_wy_srvcc_customized_drilldown_table_daily PARTITION(date_time='20190228');

The output includes numRows=12552, totalSize=1564884, rawDataSize=1552332:

Partition(values:[20190228], dbName:coordinator_test, tableName:dw_wy_srvcc_customized_drilldown_table_daily, createTime:1551344149, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:level, type:bigint, comment:null), FieldSchema(name:pi_name, type:string, comment:null), FieldSchema(name:dimension_type, type:string, comment:null), FieldSchema(name:demension_item, type:string, comment:null), FieldSchema(name:numerator_value, type:double, comment:null), FieldSchema(name:denominator_value, type:double, comment:null), FieldSchema(name:pi_value, type:double, comment:null), FieldSchema(name:degrad_weight, type:double, comment:null), FieldSchema(name:date_time, type:string, comment:null)], location:hdfs://dlbdn3:8020/user/hive/warehouse/coordinator_test.db/dw_wy_srvcc_customized_drilldown_table_daily/date_time=20190228, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{numFiles=20, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1551408396, numRows=12552, totalSize=1564884, rawDataSize=1552332})

Example 2: computing column statistics
# Run the statement
analyze table dw_wy_srvcc_customized_drilldown_table_daily compute statistics for columns level;
# Check the result
describe formatted dw_wy_srvcc_customized_drilldown_table_daily level;
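If you want statistics for every column rather than just one, recent Hive versions let you omit the column list entirely. A sketch:

```sql
-- With no column names, stats are gathered for all columns of the table.
ANALYZE TABLE dw_wy_srvcc_customized_drilldown_table_daily
  COMPUTE STATISTICS FOR COLUMNS;
```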
I won't generate a large test dataset here, but if you're interested, try it yourself and compare query speed before and after running ANALYZE TABLE; the difference is significant. Also, if you run Impala alongside Hive, using Impala's COMPUTE STATS is an even better choice.
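COMPUTE STATS will be covered in detail in the next post, but as a quick taste, here is what the equivalent looks like in impala-shell (reusing the table from the examples above):

```sql
-- One statement computes both table and column statistics.
COMPUTE STATS dw_wy_srvcc_customized_drilldown_table_daily;
-- Verify the collected statistics.
SHOW TABLE STATS dw_wy_srvcc_customized_drilldown_table_daily;
SHOW COLUMN STATS dw_wy_srvcc_customized_drilldown_table_daily;
```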
