Using TPC-DS for Big Data Benchmarking

  • Choosing a big data benchmark
  • Introduction to TPC-DS
  • Building hive-testbench
  • Running the benchmark queries

Choosing a Big Data Benchmark


    When choosing a big data benchmark, an enterprise should first consider the benchmark's relevance to its own business.

  1. Relevance to its own business
        This describes whether the application scenario the benchmark models resembles the enterprise's actual business scenario; a benchmark built around a social-network application, for instance, has essentially no relevance to a banking system. However good the results, an irrelevant benchmark is meaningless in practice. Relevance also covers whether the benchmark's data model represents the direction data warehousing is heading; development against a star schema, for example, is more effective than development against a traditional relational model.
  2. Realistic synthetic data
        This describes whether the benchmark simulates a real application scenario and whether the data it generates resembles real data.
  3. Scalable workload definition
        This describes whether the benchmark suits computer systems of different sizes. Many benchmarks use a scale factor to determine the volume of simulated data, and adjusting the scale factor yields workloads of different sizes.
  4. Understandable metrics
        This measures whether the benchmark is easy for users to understand; a benchmark that users cannot understand also earns less trust.
  5. Objectivity and fairness
        As the saying goes, in a competition no one may be both player and referee. A benchmark is the referee of the contest and should therefore be defined by a neutral third party. Experience confirms this: the most popular benchmarks in every field were designed by third-party organizations, and the past 20-plus years have shown the TPC family to be the most widely accepted set of benchmarks in the database field. Beyond that, third-party auditing is an important means of guaranteeing that results are objective and fair.
  6. Robustness
        A benchmark must be robust enough that it cannot easily be gamed ("hacked"); this matters greatly for the fairness of results. With TPC-D, the predecessor of TPC-H, materializing views let Oracle outperform Microsoft SQL Server by a factor of 100, which was plainly unfair, so the TPC declared materialized views illegal in TPC-H tests. Yet unless they are specialists, ordinary users can hardly tell whether views were materialized during a test. TPC-DS is much more robust in this respect: its SQL is complex and plentiful, gaming it is comparatively difficult, and hacking only a few queries gains little overall.
  7. SQL standard compliance
        SQL is the standard ANSI defined to unify programming differences among database vendors, with published versions including SQL-86, SQL-92, SQL:1999, and SQL:2003. These standards have been widely adopted by mainstream commercial databases (e.g., Oracle, DB2, SQL Server) as well as open-source databases (e.g., MySQL, mSQL, and PostgreSQL), and have greatly advanced the entire database industry. Big data is an emerging field whose growth cannot simply discard existing applications: without full support for the SQL standard, migrating existing systems is very difficult and the learning curve lengthens.
  8. Generality/portability
        Generality describes whether a given benchmark can be implemented on different database systems and architectures. A benchmark should not prescribe implementation details; it only needs to define the test specification. As long as a DBMS follows the specification and produces correct results, the test is valid, whether it runs on Map/Reduce, Spark, or another technology, and whether the underlying storage is HDFS, HBase, or something else.

Introduction to TPC-DS


    TPC-DS is a decision-support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It provides a representative evaluation of the System Under Test (SUT) as a decision support system.
The benchmark exercises the following characteristics of such systems:

  1. Examine large volumes of data
  2. Answer real-world business questions
  3. Execute queries of varying operational requirements and complexity (e.g., ad hoc queries, reporting, iterative OLAP, data mining)
  4. Generate intense CPU and I/O load
  5. Periodically synchronize with OLTP data sources through database maintenance
  6. Address the big data problem space, on relational databases (RDBMS) or on Hadoop/Spark-based systems

    The benchmark results measure query response time in single-user mode, query throughput in multi-user mode, and data-maintenance performance under a fairly complex multi-user decision-support workload.
    TPC-DS uses multidimensional schemas such as star and snowflake. It contains 7 fact tables and 17 dimension tables, with an average of 18 columns per table. Its workload comprises 99 SQL queries that cover the core of SQL:1999 and SQL:2003 plus OLAP extensions. The suite includes complex applications such as statistics over large data sets, report generation, online queries, and data mining, and the test data and its values are skewed, just like real data. TPC-DS can fairly be called a test suite that comes very close to real scenarios, and a rather demanding one.
     These characteristics closely match big data analytics and mining applications. Big data technologies such as Hadoop likewise perform large-scale analysis and deep mining over massive data, including interactive online queries and statistical reporting, while real-world big data is of lower quality, with distributions that are realistic and non-uniform. TPC-DS has therefore become the best suite for objectively comparing different Hadoop distributions and SQL-on-Hadoop engines. Its main characteristics:

  • 99 test queries in total, following the SQL'99 and SQL:2003 syntax standards; the SQL is fairly complex
  • Large volumes of data to analyze, with queries that answer real business questions
  • Queries covering a variety of business models (analytical reporting, iterative online analytics, data mining, and so on)
  • Nearly all queries impose heavy I/O load and CPU computation

hive-testbench

    hive-testbench is a data generator and a set of queries that let you experiment with Apache Hive at scale. The testbench lets you explore Hive's baseline performance on large data sets and provides an easy way to see the impact of Hive tuning parameters and advanced settings.

Prerequisites

  • A Hadoop 2.2+ cluster or sandbox
  • Apache Hive
  • Between 15 minutes and 2 days to generate data (depending on the chosen scale and available hardware)
  • If you plan to generate 1 TB or more of data, Apache Hive 13+ is strongly recommended for data generation
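A quick sanity check of the first two prerequisites, assuming hadoop and hive are already on the PATH:

    hadoop version   # should report Hadoop 2.2 or later
    hive --version   # Hive 13+ strongly recommended at 1 TB+ scale factors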

hive-testbench GitHub repository
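To get the testbench, clone the repository and work from its root directory (the URL below assumes the Hortonworks GitHub mirror; substitute your own fork if it differs):

    git clone https://github.com/hortonworks/hive-testbench.git
    cd hive-testbench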

Generating Benchmark Data with hive-testbench


All of these steps should be performed on the Hadoop cluster.


  1. Prepare the environment

    1. Install the build dependencies for hive-testbench:
    yum -y install gcc gcc-c++
    
    2. Install Maven:
      1. wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
      2. tar zxf apache-maven-3.6.0-bin.tar.gz
      3. mv apache-maven-3.6.0 /usr/local/maven3
      4. Add the environment variables to /etc/profile:
      export M2_HOME=/usr/local/maven3
      export PATH=$PATH:$M2_HOME/bin
      5. source /etc/profile
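      To confirm that the installation took effect:

      mvn -version   # should print Apache Maven 3.6.0 and the JDK it found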
  2. Decide which benchmark to use (TPC-DS or TPC-H)
    hive-testbench provides data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can choose either or both. More information about these benchmarks is available on the TPC website.

  3. Build and package the data generator for the chosen benchmark
    For TPC-DS, ./tpcds-build.sh downloads, compiles, and packages the TPC-DS data generator. For TPC-H, ./tpch-build.sh downloads, compiles, and packages the TPC-H data generator.

  4. Decide how much data to generate
    You need to decide on a "Scale Factor", which represents how much data you will generate. The Scale Factor translates roughly to gigabytes, so a Scale Factor of 100 is about 100 GB, and 1 TB corresponds to a Scale Factor of 1000. Decide how much data you want and keep that number in mind for the next step. If you have a cluster of 4-10 nodes, or just want to experiment at a smaller scale, Scale 1000 (1 TB) is a good starting point. If you have a large cluster, you may want to choose Scale 10000 (10 TB) or more. The notion of Scale Factor is similar in TPC-DS and TPC-H.
    If you want to generate a lot of data, you should use Hive 13 or later. Hive 13 introduced an optimization that allows substantially more partitioned data to be handled; with more than a few hundred gigabytes of generated data, Hive 12 and earlier tend to crash, and tuning around the problem is difficult. You can generate text or RCFile data in Hive 13 and then use it across multiple versions of Hive.

  5. Generate and load the data
    The scripts tpcds-setup.sh and tpch-setup.sh generate and load the data for TPC-DS and TPC-H respectively. General usage:

    tpcds-setup.sh scale_factor [directory]
    tpch-setup.sh scale_factor [directory]

    Some examples
    1 TB of TPC-DS data:

    ./tpcds-setup.sh 1000
    

    1 TB of TPC-H data:

    ./tpch-setup.sh 1000  
    

    100 TB of TPC-DS data:

    ./tpcds-setup.sh 100000  
    

    30 TB of TPC-DS data in text format:

    FORMAT=textfile ./tpcds-setup.sh 30000
    

    30 TB of TPC-DS data in RCFile format:

    FORMAT=rcfile ./tpcds-setup.sh 30000
    

    Check the setup scripts for the other parameters; an important one is BUCKET_DATA.
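    As a rough sketch, these parameters are passed as environment variables, the same way FORMAT is used above; whether BUCKET_DATA and other variables are honored depends on your version of tpcds-setup.sh, so read the script before relying on them:

    # Generate 1 TB of ORC data with bucketing enabled (variable names assumed
    # to match those defined near the top of tpcds-setup.sh)
    FORMAT=orc BUCKET_DATA=true ./tpcds-setup.sh 1000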

  6. Troubleshooting during setup

    1. Running ./tpcds-build.sh fails with:
    cd target/tools; cat ../../patches/all/*.patch | patch -p0
    /bin/sh: patch: command not found
    Solution: install patch:
    yum -y install patch
    
    2. Running ./tpcds-setup.sh 2 /extwarehouse/tpcds fails while generating data.
    Enable debug output:
    export DEBUG_SCRIPT=X
    Then fix the Hive connection (the same applies to tpch-setup.sh) by editing the connection string in tpcds-setup.sh:
    HIVE="beeline -n hive -u 'jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default' "
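    Before re-running the setup script, you can smoke-test the modified connection string directly (the JDBC URL is the one configured above):

    beeline -n hive -u 'jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default' -e 'show databases;'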
    
  7. At this point the data has been generated and loaded into the Hive database
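    You can verify this from the Hive side; the database name below follows hive-testbench's tpcds_text_<scale> naming convention (here for the scale-2 example above), which may vary by version:

    hive -e 'show databases;'                   # look for the tpcds_* databases
    hive -e 'use tpcds_text_2; show tables;'    # the 24 TPC-DS tables should appear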

Running the Benchmark Queries


In this environment we are benchmarking Impala + Parquet and Impala + Kudu, so we need to generate Parquet data, load that data into Kudu, and then run the benchmark SQL.

  1. Create the SQL script alltables_parquet.sql to build the Parquet-format tables. Its contents:
drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};
set parquet_file_size=512M;
set COMPRESSION_CODEC=snappy;
drop table if exists call_center;
create table ${VAR:DB}.call_center
stored as parquet
as select * from ${VAR:HIVE_DB}.call_center;
drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_page;
drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_returns;
drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.catalog_sales;
drop table if exists customer_address;
create table ${VAR:DB}.customer_address
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_address;
drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.customer_demographics;
drop table if exists customer;
create table ${VAR:DB}.customer
stored as parquet
as select * from ${VAR:HIVE_DB}.customer;
drop table if exists date_dim;
create table ${VAR:DB}.date_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.date_dim;
drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
stored as parquet
as select * from ${VAR:HIVE_DB}.household_demographics;
drop table if exists income_band;
create table ${VAR:DB}.income_band
stored as parquet
as select * from ${VAR:HIVE_DB}.income_band;
drop table if exists inventory;
create table ${VAR:DB}.inventory
stored as parquet
as select * from ${VAR:HIVE_DB}.inventory;
drop table if exists item;
create table ${VAR:DB}.item
stored as parquet
as select * from ${VAR:HIVE_DB}.item;
drop table if exists promotion;
create table ${VAR:DB}.promotion
stored as parquet
as select * from ${VAR:HIVE_DB}.promotion;
drop table if exists reason;
create table ${VAR:DB}.reason
stored as parquet
as select * from ${VAR:HIVE_DB}.reason;
drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
stored as parquet
as select * from ${VAR:HIVE_DB}.ship_mode;
drop table if exists store_returns;
create table ${VAR:DB}.store_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.store_returns;
drop table if exists store_sales;
create table ${VAR:DB}.store_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.store_sales;
drop table if exists store;
create table ${VAR:DB}.store
stored as parquet
as select * from ${VAR:HIVE_DB}.store;
drop table if exists time_dim;
create table ${VAR:DB}.time_dim
stored as parquet
as select * from ${VAR:HIVE_DB}.time_dim;
drop table if exists warehouse;
create table ${VAR:DB}.warehouse
stored as parquet
as select * from ${VAR:HIVE_DB}.warehouse;
drop table if exists web_page;
create table ${VAR:DB}.web_page
stored as parquet
as select * from ${VAR:HIVE_DB}.web_page;
drop table if exists web_returns;
create table ${VAR:DB}.web_returns
stored as parquet
as select * from ${VAR:HIVE_DB}.web_returns;
drop table if exists web_sales;
create table ${VAR:DB}.web_sales
stored as parquet
as select * from ${VAR:HIVE_DB}.web_sales;
drop table if exists web_site;
create table ${VAR:DB}.web_site
stored as parquet
as select * from ${VAR:HIVE_DB}.web_site;

Note: the script uses ${VAR:variable_name} to pass the Hive database and the Impala database in dynamically.

  2. On an Impala Daemon node, run the following command to generate the Impala benchmark data
impala-shell -i bi-master:25003 --var=DB=tpcds_parquet_2 --var=HIVE_DB=tpcds_text_2 -f /usr/local/sql/alltables_parquet.sql
  3. Create analyze.sql, which computes statistics on the Impala tables
use ${VAR:DB};
compute stats call_center ;
compute stats catalog_page ;
compute stats catalog_returns ;
compute stats catalog_sales ;
compute stats customer_address ;
compute stats customer_demographics ;
compute stats customer ;
compute stats date_dim ;
compute stats household_demographics ;
compute stats income_band ;
compute stats inventory ;
compute stats item ;
compute stats promotion ;
compute stats reason ;
compute stats ship_mode ;
compute stats store_returns ;
compute stats store_sales ;
compute stats store ;
compute stats time_dim ;
compute stats warehouse ;
compute stats web_page ;
compute stats web_returns ;
compute stats web_sales ;
compute stats web_site ;
  4. Run the following command to compute statistics on the Impala tables
impala-shell -i bi-master:25003 --var=DB=tpcds_parquet_2 -f /usr/local/sql/analyze.sql
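To spot-check that the statistics were actually computed (the daemon address and database name are this environment's values), note that the row counts reported by SHOW TABLE STATS change from -1 to real values once COMPUTE STATS has run:

impala-shell -i bi-master:25003 -d tpcds_parquet_2 -q "show table stats store_sales;"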

At this point the environment preparation for the Impala benchmark is complete.

TPC-DS Testing

  1. Prepare the 99 TPC-DS SQL queries
  2. Write a batch runner script run_all_queries.sh that writes the results to log files
#!/bin/bash
# Run every TPC-DS query under ./queries against Impala, one log file per query.

impala_daemon=bi-master:25003
database_name=tpcds_parquet_2
current_path=`pwd`
queries_dir=${current_path}/queries

# Start from a clean log directory
rm -rf logs
mkdir logs

for t in `ls ${queries_dir}`
do
    echo "current query will be ${queries_dir}/${t}"
    # Capture stdout and stderr so both results and timings land in the log
    impala-shell --database=$database_name -i $impala_daemon -f ${queries_dir}/${t} &>logs/${t}.log
done
echo "all queries finished, please check the logs for results!"

Just change impala_daemon and database_name in the script to match your own environment.

  3. Run the script; once it finishes you can inspect the results and runtimes in the logs directory
./run_all_queries.sh
  4. The log files in the logs directory show each query's result and execution time
[root@bi-master logs]# grep Fetch *.log
query11.sql.log:Fetched 100 row(s) in 3.95s
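A small helper to gather every query's runtime into one list, slowest first; a sketch that assumes each log ends with a "Fetched N row(s) in X.XXs" line like the one above:

#!/bin/bash
# Print "<seconds> <logfile>" for each query log, sorted by runtime descending
for f in logs/*.log; do
    t=$(grep -o 'in [0-9.]*s' "$f" | tail -1 | tr -d 'ins ')
    echo "${t} ${f}"
done | sort -rn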
  5. The script alltables_kudu.sql writes the Parquet data into Kudu. The partitioning here is fairly arbitrary; if the data volume is large and you want precise results, give some thought to how the Kudu partitions should be designed (a sketch follows the load command below).
drop database if exists ${VAR:DB} cascade;
create database ${VAR:DB};
use ${VAR:DB};

drop table if exists call_center;
create table ${VAR:DB}.call_center
PRIMARY KEY (cc_call_center_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.call_center;

drop table if exists catalog_page;
create table ${VAR:DB}.catalog_page
PRIMARY KEY (cp_catalog_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_page;

drop table if exists catalog_returns;
create table ${VAR:DB}.catalog_returns
PRIMARY KEY (cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk)
PARTITION BY HASH(cr_returned_date_sk,cr_returned_time_sk,cr_item_sk,cr_refunded_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_returns;

drop table if exists catalog_sales;
create table ${VAR:DB}.catalog_sales
PRIMARY KEY (cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk)
PARTITION BY HASH(cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.catalog_sales;

drop table if exists customer_address;
create table ${VAR:DB}.customer_address
PRIMARY KEY (ca_address_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_address;

drop table if exists customer_demographics;
create table ${VAR:DB}.customer_demographics
PRIMARY KEY (cd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer_demographics;

drop table if exists customer;
create table ${VAR:DB}.customer
PRIMARY KEY (c_customer_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.customer;

drop table if exists date_dim;
create table ${VAR:DB}.date_dim
PRIMARY KEY (d_date_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.date_dim;

drop table if exists household_demographics;
create table ${VAR:DB}.household_demographics
PRIMARY KEY (hd_demo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.household_demographics;

drop table if exists income_band;
create table ${VAR:DB}.income_band
PRIMARY KEY (ib_income_band_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.income_band;

drop table if exists inventory;
create table ${VAR:DB}.inventory
PRIMARY KEY (inv_date_sk,inv_item_sk,inv_warehouse_sk)
PARTITION BY HASH(inv_date_sk,inv_item_sk,inv_warehouse_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.inventory;

drop table if exists item;
create table ${VAR:DB}.item
PRIMARY KEY (i_item_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.item;

drop table if exists promotion;
create table ${VAR:DB}.promotion
PRIMARY KEY (p_promo_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.promotion;

drop table if exists reason;
create table ${VAR:DB}.reason
PRIMARY KEY (r_reason_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.reason;

drop table if exists ship_mode;
create table ${VAR:DB}.ship_mode
PRIMARY KEY (sm_ship_mode_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.ship_mode;

drop table if exists store_returns;
create table ${VAR:DB}.store_returns
PRIMARY KEY (sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk)
PARTITION BY HASH(sr_returned_date_sk,sr_return_time_sk,sr_item_sk,sr_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_returns;

drop table if exists store_sales;
create table ${VAR:DB}.store_sales
PRIMARY KEY (ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk)
PARTITION BY HASH(ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store_sales;

drop table if exists store;
create table ${VAR:DB}.store
PRIMARY KEY (s_store_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.store;

drop table if exists time_dim;
create table ${VAR:DB}.time_dim
PRIMARY KEY (t_time_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.time_dim;

drop table if exists warehouse;
create table ${VAR:DB}.warehouse
PRIMARY KEY (w_warehouse_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.warehouse;

drop table if exists web_page;
create table ${VAR:DB}.web_page
PRIMARY KEY (wp_web_page_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_page;

drop table if exists web_returns;
create table ${VAR:DB}.web_returns
PRIMARY KEY (wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk)
PARTITION BY HASH(wr_returned_date_sk,wr_returned_time_sk,wr_item_sk,wr_refunded_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_returns;

drop table if exists web_sales;
create table ${VAR:DB}.web_sales
PRIMARY KEY (ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk)
PARTITION BY HASH(ws_sold_date_sk,ws_sold_time_sk,ws_ship_date_sk,ws_item_sk,ws_bill_customer_sk) PARTITIONS 5
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_sales;

drop table if exists web_site;
create table ${VAR:DB}.web_site
PRIMARY KEY (web_site_sk)
STORED AS KUDU
as select * from ${VAR:HIVE_DB}.web_site;

Run the command:

impala-shell -i bi-master:25003 --var=DB=tpcds_kudu_2 --var=HIVE_DB=tpcds_parquet_2 -f /usr/local/sql/alltables_kudu.sql
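As noted in step 5, the hash partitioning in alltables_kudu.sql is fairly arbitrary. For larger volumes, a more deliberate layout with more tablets usually helps. The following is only a sketch (the table name store_sales_alt, the partition count, and the column choice are illustrative, using the standard TPC-DS column names):

impala-shell -i bi-master:25003 -q "
create table tpcds_kudu_2.store_sales_alt
primary key (ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk)
partition by hash (ss_item_sk, ss_customer_sk) partitions 32
stored as kudu
as select * from tpcds_parquet_2.store_sales;"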

Analyze the tables:

impala-shell -i bi-master:25003 --var=DB=tpcds_kudu_2 -f /usr/local/sql/analyze.sql

Modify run_all_queries.sh so that database_name points at the Kudu database, then run it to get the query results.
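For example (values from this environment):

# Point the batch runner at the Kudu database, then rerun it
sed -i 's/^database_name=.*/database_name=tpcds_kudu_2/' run_all_queries.sh
./run_all_queries.sh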
