A Brief Analysis of HiBench's Three SQL Test Cases for Hive (scan/join/aggregation), and Upgrading HiBench's Hive Version

I. Preface

This post analyzes the SQL behind HiBench's Hive test cases (scan, join, and aggregation), based on HiBench 6.0.

Hadoop version: v2.7.1

Spark version: v2.1

II. Questions to Answer

1. Do I need to install Hive manually to run HiBench's SQL test cases?

No. During the build, HiBench automatically downloads the Hive release specified in its pom.xml.

2. Which Hive version does HiBench 6.0 use?

It is 0.14.0. The Hive URL in HiBench/hadoopbench/sql/pom.xml reads as follows:

 
    <repo>http://archive.apache.org</repo>
    <file>dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz</file>
 

Concatenated, that is http://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz. So if you want to benchmark a different Hive version with HiBench, simply change this Hive URL.

3. HiBench's Spark SQL bench code, which loads the SQL script file and runs it

package com.intel.hibench.sparkbench.sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/*
 * ported from HiBench's hive bench
 */
object ScalaSparkSQLBench{
  def main(args: Array[String]){
    if (args.length < 2){
      System.err.println(
        "Usage: ScalaSparkSQLBench <workload_name> <sql_file>"
      )
      System.exit(1)
    }
    val workload_name = args(0)
    val sql_file = args(1)
    val sparkConf = new SparkConf().setAppName(workload_name)
    val sc = new SparkContext(sparkConf)
    val hc = new HiveContext(sc)

    // Read the whole SQL script and run each semicolon-separated statement below.
    val _sql = scala.io.Source.fromFile(sql_file).mkString
    _sql.split(';').foreach { x =>
      if (x.trim.nonEmpty)
        hc.sql(x)
    }

    sc.stop()
  }
}
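As an aside, the class above is written against the Spark 1.x API; on Spark 2.x (the version used in this post) HiveContext still compiles but is deprecated in favor of a Hive-enabled SparkSession. Below is a minimal sketch of the same load-and-run loop in 2.x style — the name SparkSQLBenchSketch is mine, not HiBench's:

import org.apache.spark.sql.SparkSession

// Hedged sketch, not HiBench's shipped code: the same bench loop on Spark 2.x,
// where a Hive-enabled SparkSession replaces the deprecated HiveContext.
object SparkSQLBenchSketch {
  def main(args: Array[String]): Unit = {
    val workloadName = args(0)
    val sqlFile      = args(1)
    val spark = SparkSession.builder()
      .appName(workloadName)
      .enableHiveSupport() // picks up hive-site.xml if present
      .getOrCreate()
    // Run each semicolon-separated statement from the script file.
    scala.io.Source.fromFile(sqlFile).mkString
      .split(';').map(_.trim).filter(_.nonEmpty)
      .foreach(spark.sql)
    spark.stop()
  }
}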

4. What exactly are the SQL statements behind HiBench's Hive test cases (scan/join/aggregation)?

a) scan test case: SELECTs the entire contents of the generated input table uservisits and inserts them into the output table uservisits_copy. It runs as a single job with one map-only stage and no reduce tasks (a plain INSERT ... SELECT * needs no shuffle).

run.sh ultimately calls hive -f HiBench/report/scan/hadoop/rankings_uservisits_scan.hive, whose contents are:


DROP TABLE IF EXISTS uservisits;
CREATE EXTERNAL TABLE uservisits (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Scan/Input/uservisits';
DROP TABLE IF EXISTS uservisits_copy;
CREATE EXTERNAL TABLE uservisits_copy (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Scan/Output/uservisits_copy';
INSERT OVERWRITE TABLE uservisits_copy SELECT * FROM uservisits;
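Under the Spark SQL variant of the workload, ScalaSparkSQLBench replays these same statements, and the core of the scan is just a full-table copy. A hedged one-liner in the DataFrame API (assuming a Hive-enabled SparkSession named spark, as in the sketch above):

// Full-table read plus overwrite-insert: no join, no aggregation, hence no shuffle.
spark.table("uservisits")
  .write.mode("overwrite")
  .insertInto("uservisits_copy")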


b) join test case: joins the input tables rankings and uservisits into a new output table rankings_uservisits_join. For each sourceIP whose visits fall within calendar year 1999, it computes the average pageRank and the total adRevenue, sorted by totalRevenue in descending order.


DROP TABLE IF EXISTS rankings;
CREATE EXTERNAL TABLE rankings (pageURL STRING, pageRank INT, avgDuration INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Input/rankings';
DROP TABLE IF EXISTS uservisits_copy;
CREATE EXTERNAL TABLE uservisits_copy (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Input/uservisits';
DROP TABLE IF EXISTS rankings_uservisits_join;
CREATE EXTERNAL TABLE rankings_uservisits_join ( sourceIP STRING, avgPageRank DOUBLE, totalRevenue DOUBLE) STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Join/Output/rankings_uservisits_join';
INSERT OVERWRITE TABLE rankings_uservisits_join SELECT sourceIP, avg(pageRank), sum(adRevenue) as totalRevenue FROM rankings R JOIN (SELECT sourceIP, destURL, adRevenue FROM uservisits_copy UV WHERE (datediff(UV.visitDate, '1999-01-01')>=0 AND datediff(UV.visitDate, '2000-01-01')<=0)) NUV ON (R.pageURL = NUV.destURL) group by sourceIP order by totalRevenue DESC;
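The final INSERT statement is dense, so here is a hedged DataFrame rendering of the same query (again assuming a Hive-enabled session named spark): restrict visits to calendar year 1999, join to rankings on URL, aggregate per sourceIP, sort by revenue.

import org.apache.spark.sql.functions.{avg, col, datediff, lit, sum}

// Visits between 1999-01-01 and 2000-01-01, mirroring the datediff predicates above.
val visits99 = spark.table("uservisits_copy")
  .where(datediff(col("visitDate"), lit("1999-01-01")) >= 0 &&
         datediff(col("visitDate"), lit("2000-01-01")) <= 0)
  .select("sourceIP", "destURL", "adRevenue")

val result = spark.table("rankings")
  .join(visits99, col("pageURL") === col("destURL")) // R.pageURL = NUV.destURL
  .groupBy("sourceIP")
  .agg(avg("pageRank").as("avgPageRank"), sum("adRevenue").as("totalRevenue"))
  .orderBy(col("totalRevenue").desc)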

 

After the job finishes, inspect the HDFS output directory:

$ hdfs dfs -ls /HiBench/Join/Output/
Found 1 items
drwxr-xr-x   - user supergroup          0 2018-08-21 17:08 /HiBench/Join/Output/rankings_uservisits_join
[user@cluster1 ~]$ hdfs dfs -ls /HiBench/Join/Output/rankings_uservisits_join/
Found 1 items
-rw-r--r--   1 user supergroup 1000523704 2018-08-21 17:08 /HiBench/Join/Output/rankings_uservisits_join/000000_0
[user@cluster1 ~]$ hdfs dfs -du -s -h /HiBench/Join/Output/rankings_uservisits_join/
954.2 M  /HiBench/Join/Output/rankings_uservisits_join


c) aggregation test case: computes, for each sourceIP in the input table uservisits, the total adRevenue (sumAdRevenue), and writes the result to the output table uservisits_aggre. It is one job with a complete map phase and reduce phase (the GROUP BY forces a shuffle).


DROP TABLE IF EXISTS uservisits;
CREATE EXTERNAL TABLE uservisits (sourceIP STRING,destURL STRING,visitDate STRING,adRevenue DOUBLE,userAgent STRING,countryCode STRING,languageCode STRING,searchWord STRING,duration INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Aggregation/Input/uservisits';
DROP TABLE IF EXISTS uservisits_aggre;
CREATE EXTERNAL TABLE uservisits_aggre ( sourceIP STRING, sumAdRevenue DOUBLE) STORED AS  SEQUENCEFILE LOCATION 'hdfs://cluster1.serversolution.sh.hxt:9000/HiBench/Aggregation/Output/uservisits_aggre';
INSERT OVERWRITE TABLE uservisits_aggre SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;
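In DataFrame form (same hedged assumptions as in the sketches above) the aggregation is a single groupBy, which matches the one-map-phase-plus-one-reduce-phase shape of the MR job:

import org.apache.spark.sql.functions.sum

// GROUP BY sourceIP is the only wide dependency, i.e. exactly one shuffle.
spark.table("uservisits")
  .groupBy("sourceIP")
  .agg(sum("adRevenue").as("sumAdRevenue"))
  .write.mode("overwrite")
  .insertInto("uservisits_aggre")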

After the job finishes, inspect the HDFS directories:

$ hdfs dfs -du -s -h /HiBench/Aggregation/Input/
89.4 G  /HiBench/Aggregation/Input

$hdfs dfs -du -s -h /HiBench/Aggregation/Output
16.4 G  /HiBench/Aggregation/Output

$hdfs dfs -ls /HiBench/Aggregation/Output/uservisits_aggre
Found 69 items

 255501195 2018-08-21 17:27 /HiBench/Aggregation/Output/uservisits_aggre/000000_0

...

 255501195 2018-08-21 17:27 /HiBench/Aggregation/Output/uservisits_aggre/000068_0


III. Upgrading the Hive Version in HiBench from 0.14.0 to 1.2.1

Why upgrade? The Hive 0.14.0 source tree has no SparkCompiler, so how can HiBench run Spark SQL test cases at all? Because HiBench's Spark SQL test cases don't use Hive, don't use Hive, don't use Hive (it bears repeating three times): the ScalaSparkSQLBench shown above submits statements through Spark SQL's HiveContext, never through Hive's own query compiler. Note also that the SparkCompiler found in newer Hive releases implements Hive on Spark, which is a different mechanism from Spark SQL.

TaskCompilerFactory.java in v0.14.0:

  /**
   * Returns the appropriate compiler to translate the operator tree
   * into executable units.
   */
  public static TaskCompiler getCompiler(HiveConf conf, ParseContext parseContext) {
    // Only "tez" is special-cased; everything else falls back to MapReduce.
    if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
      return new TezCompiler();
    } else {
      return new MapReduceCompiler();
    }
  }

TaskCompilerFactory.java in v1.2.1:

  public static TaskCompiler getCompiler(HiveConf conf, ParseContext parseContext) {
    if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
      return new TezCompiler();
    } else if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("spark")) {
      return new SparkCompiler();
    } else {
      return new MapReduceCompiler();
    }
  }

How to upgrade HiBench's Hive from v0.14.0 to v1.2.1:

1. Modify the Maven configuration file HiBench/hadoopbench/sql/pom.xml:

 
    <repo>http://archive.apache.org</repo>
    <file>dist/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz</file>
 

2. Rebuild HiBench: sudo mvn -Phadoopbench -Dspark=2.1 -Dscala=2.11 clean package

For a tree that has already been built once, if the build fails again and Maven suggests resuming with mvn [goal] -rf :xxx (for example :mahout), you can resume from that module:
sudo mvn -Phadoopbench -Dspark=2.1 -Dscala=2.11 clean package -rf :mahout


3. Update HiBench's configuration file HiBench/conf/hibench.conf:
hibench.hive.home               ${hibench.home}/hadoopbench/sql/target/${hibench.hive.release}
hibench.hive.release            apache-hive-1.2.1-bin
hibench.hivebench.template.dir  ${hibench.home}/hadoopbench/sql/hive_template

IV. Related Source Code

The HiBench/autogen/src/main/java/HiBench directory contains the related data-generator sources: HiveData.java, DataGen.java, and DataOptions.java.

V. Runtime Errors

1. Running HiBench's scan test case:

~/HiBench/bin/workloads/sql/scan/prepare/prepare.sh fails with the following error:

19/01/04 13:47:12 INFO mapreduce.Job: Task Id : attempt_1546410718870_0023_r_000019_0, Status : FAILED
Error: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.io.SequenceFile.CompressionType.block
        at java.lang.Enum.valueOf(Enum.java:238)
        at org.apache.hadoop.io.SequenceFile$CompressionType.valueOf(SequenceFile.java:222)
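The root cause: SequenceFile.CompressionType is a Java enum whose constants are uppercase (NONE, RECORD, BLOCK), and Enum.valueOf is case-sensitive, so the lowercase value block from mapred-site.xml cannot be resolved. A two-line REPL sketch (with hadoop-common on the classpath) reproduces it:

import org.apache.hadoop.io.SequenceFile.CompressionType

CompressionType.valueOf("BLOCK") // fine: returns CompressionType.BLOCK
CompressionType.valueOf("block") // throws IllegalArgumentException, exactly as in the log above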

Fix: comment out the file output compression settings below in Hadoop/etc/hadoop/mapred-site.xml (or, if you want to keep compression, change the value block to the uppercase BLOCK):

         
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>block</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

