TPC-H测试数据表生成,以及在Impala中的使用

一、下载TPCH生成工具

官方网址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp

本文中安装的是:

TPC-H 	2.17.1 	pdf 	Download TPCH_Tools.zip

可通过wget、curl来获取安装包。

【注】获取包地址,需要填写Email信息进行申请,无伤大雅,照做。请确保,您的邮箱可以收到自动发送的包含连接地址的邮件,

这个地址才是真正的下载地址。

TPC-H测试数据表生成,以及在Impala中的使用_第1张图片

二、安装TPCH生成工具

下载完成后,您可能看到这样的文件:

c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip

这个就是TCPH的安装工具,so,使用unzip工具解压之。

unzip c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip

而后,得到两个文件夹:

第一个,无用,忽略之。进入第二个。

[root@node20 tpch_2_17_0]# cd dbgen/
[root@node20 dbgen]# ls
answers  bm_utils.c  build.o          dbgen      driver.o  dsstypes.h   makefile        permute.o      qgen         queries    rnd.c    rng64.h       speed_seed.o  tpcd.h       update_release.sh
bcd2.c   bm_utils.o  check_answers    dbgen.dsp  dss.ddl   HISTORY      makefile.suite  PORTING.NOTES  qgen.c       README     rnd.h    rng64.o       tests         tpch.dsw     variants
bcd2.h   BUGS        column_split.sh  dists.dss  dss.h     load_stub.c  permute.c       print.c        qgen.o       reference  rnd.o    shared.h      text.c        tpch.sln     varsub.c
bcd2.o   build.c     config.h         driver.c   dss.ri    load_stub.o  permute.h       print.o        qgen.vcproj  release.h  rng64.c  speed_seed.c  text.o        tpch.vcproj  varsub.o
[root@node20 dbgen]# ./dbgen -help
你会看到相应的参数信息。详情,自己看。

三、生成指定的数据库文件

1. 编辑Makefile文件,具体的参数配置可参考如下示例。

[root@node20 dbgen]# cp makefile.suite makefile
[root@node20 dbgen]# vim makefile
CC      = gcc
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
CFLAGS  = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64
LDFLAGS = -O

执行make

2.执行dbgen,生成orders数据表,大小为5G。

[root@node20 dbgen]# ./dbgen -vf -s 5 -T O
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for order table/
Preloading text ... 100%
done.

[root@node20 dbgen]# ls -lh *.tbl
-rw-r--r-- 1 root root 830M May 11 14:01 orders.tbl

四、加载数据表到Impala中

实际上就是,将tpch-tool生成的

*.tbl
文件加载到hadoop系统中,然后,通过impala-shell来创建相应的表。


1.clone脚本工程,用于自动完成加载工作。

参考网址:https://github.com/kj-ki/tpc-h-impala

2.将利用TPCH工具生成的数据表文件,移动到该工程目录的data/文件夹下。

[root@node20 data]# mv ../../../tpch_2_17_0/dbgen/*.tbl ./
[root@node20 data]# ls
lineitem.tbl  orders.tbl  tpch_prepare_data.sh

3.加载数据到Hadoop系统中。

[root@node20 data]# ./tpch_prepare_data.sh 
mkdir: `/tpch': File exists
mkdir: `/tpch/customer': File exists
mkdir: `/tpch/lineitem': File exists
mkdir: `/tpch/nation': File exists
mkdir: `/tpch/orders': File exists
mkdir: `/tpch/part': File exists
mkdir: `/tpch/partsupp': File exists
mkdir: `/tpch/region': File exists
mkdir: `/tpch/supplier': File exists
copyFromLocal: `customer.tbl': No such file or directory
copyFromLocal: `/tpch/lineitem/lineitem.tbl': File exists
copyFromLocal: `nation.tbl': No such file or directory
copyFromLocal: `part.tbl': No such file or directory
copyFromLocal: `partsupp.tbl': No such file or directory
copyFromLocal: `region.tbl': No such file or directory
copyFromLocal: `supplier.tbl': No such file or directory

这里,之前已经执行过该脚本,导入 lineitem数据表,所以相应的文件夹已经存在,无碍。

查看导入情况:

[root@node20 data]# hadoop fs -ls /tpch
Found 8 items
drwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/customer
drwxr-xr-x   - root supergroup          0 2016-05-06 15:42 /tpch/lineitem
drwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/nation
drwxr-xr-x   - root supergroup          0 2016-05-11 14:14 /tpch/orders
drwxr-xr-x   - root supergroup          0 2016-05-06 12:10 /tpch/part
drwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/partsupp
drwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/region
drwxr-xr-x   - root supergroup          0 2016-05-06 12:11 /tpch/supplier
[root@node20 data]# hadoop fs -ls /tpch/orders
Found 1 items
-rw-r--r--   3 root supergroup  870187306 2016-05-11 14:14 /tpch/orders/orders.tbl
[root@node20 data]# hadoop fs -ls /tpch/lineitem
Found 2 items
drwxr-xr-x   - root supergroup          0 2016-05-06 15:42 /tpch/lineitem/lineitem
-rw-r--r--   3 root supergroup 7775727688 2016-05-06 12:12 /tpch/lineitem/lineitem.tbl

【注】如果你遇到如下异常情况,请确保impala所需的环境变量已经导入。执行source ${IMPALA_HOME}/bin/impala-config.sh

[root@node20 data]# ./tpch_prepare_data.sh 
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p]  ...
^C-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p]  ... 

4. 生成表文本文件的数据表。

各类表的参考描述如下:

The official TPC-H specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

直接执行如下语句,会一直卡住。

[root@node20 tpc-h-impala]# ./tpch_benchmark.sh 

***********************************************
*          TPC-H benchmark on Impala          *
***********************************************
                                               
See benchmark.log for more details of query errors.

Executing Trial #1 of 1 trial(s)...
Running query: q1_pricing_summary_report

SO,直接查看该文件,可知自动导入表到impala是通过tpch_prepare/文件夹中的.hive文件来完成的。
分析可知,直接打开impala-shell,手动创建外部表。

[node20:21000] > create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';
Query: create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders'

Fetched 0 row(s) in 0.04s
[node20:21000] > select count(*) from orders;
Query: select count(*) from orders
+----------+
| count(*) |
+----------+
| 7500000  |
+----------+
Fetched 1 row(s) in 5.03s

其它表的插入过程类似,在.hive文件中检索相应的创建表的sql语句。

5.生成parquet格式的数据表。

[node20:21000] > create database tpch_parquet;
Query: create database tpch_parquet

Fetched 0 row(s) in 0.10s
[node20:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
| tpch_parquet     |                                              |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
[node20:21000] > create table tpch_parquet.orders_pq stored as parquet as select * from orders;
Query: create table tpch_parquet.orders_pq stored as parquet as select * from orders
+-------------------------+
| summary                 |
+-------------------------+
| Inserted 7500000 row(s) |
+-------------------------+
Fetched 1 row(s) in 10.62s
[node20:21000] > desc tpch_parquet.orders_pq;
Query: describe tpch_parquet.orders_pq
+-----------------+--------+---------+
| name            | type   | comment |
+-----------------+--------+---------+
| o_orderkey      | int    |         |
| o_custkey       | int    |         |
| o_orderstatus   | string |         |
| o_totalprice    | double |         |
| o_orderdate     | string |         |
| o_orderpriority | string |         |
| o_clerk         | string |         |
| o_shippriority  | int    |         |
| o_comment       | string |         |
+-----------------+--------+---------+
Fetched 9 row(s) in 0.01s
[node20:21000] > show table stats tpch_parquet.orders_pq;
Query: show table stats tpch_parquet.orders_pq
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
| #Rows | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                        |
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
| -1    | 3      | 269.18MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/tpch_parquet.db/orders_pq |
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
Fetched 1 row(s) in 0.01s


你可能感兴趣的:(Intel大数据组实习)