一、下载TPCH生成工具
官方网址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp
本文中安装的是:
TPC-H 2.17.1 pdf Download TPCH_Tools.zip
可通过wget、curl来获取安装包。
【注】获取包地址,需要填写Email信息进行申请,无伤大雅,照做。请确保,您的邮箱可以收到自动发送的包含连接地址的邮件,
这个地址才是真正的下载地址。
二、安装TPCH生成工具
下载完成后,您可能看到这样的文件:
c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip
unzip c81a6b51-9f9c-4a5b-8979-9bf85688767b-tpc-h-tool.zip
第一个,无用,忽略之。进入第二个。
[root@node20 tpch_2_17_0]# cd dbgen/
[root@node20 dbgen]# ls
answers bm_utils.c build.o dbgen driver.o dsstypes.h makefile permute.o qgen queries rnd.c rng64.h speed_seed.o tpcd.h update_release.sh
bcd2.c bm_utils.o check_answers dbgen.dsp dss.ddl HISTORY makefile.suite PORTING.NOTES qgen.c README rnd.h rng64.o tests tpch.dsw variants
bcd2.h BUGS column_split.sh dists.dss dss.h load_stub.c permute.c print.c qgen.o reference rnd.o shared.h text.c tpch.sln varsub.c
bcd2.o build.c config.h driver.c dss.ri load_stub.o permute.h print.o qgen.vcproj release.h rng64.c speed_seed.c text.o tpch.vcproj varsub.o
[root@node20 dbgen]# ./dbgen -help
你会看到相应的参数信息。详情,自己看。
三、生成指定的数据库文件
1. 编辑Makefile文件,具体的参数配置可参考如下示例。
[root@node20 dbgen]# cp makefile.suite makefile
[root@node20 dbgen]# vim makefile
CC = gcc
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
CFLAGS = -g -DDBNAME=\"dss\" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -DRNG_TEST -D_FILE_OFFSET_BITS=64
LDFLAGS = -O
执行make。
2.执行dbgen,生成orders数据表,大小为5G。
[root@node20 dbgen]# ./dbgen -vf -s 5 -T O
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for order table/
Preloading text ... 100%
done.
[root@node20 dbgen]# ls -lh *.tbl
-rw-r--r-- 1 root root 830M May 11 14:01 orders.tbl
实际上就是,将tpch-tool生成的
*.tbl
文件加载到hadoop系统中,然后,通过impala-shell来创建相应的表。
1.clone脚本工程,用于自动完成加载工作。
参考网址:https://github.com/kj-ki/tpc-h-impala
2.将利用TPCH工具生成的数据表文件,移动到该工程目录的data/文件夹下。
[root@node20 data]# mv ../../../tpch_2_17_0/dbgen/*.tbl ./
[root@node20 data]# ls
lineitem.tbl orders.tbl tpch_prepare_data.sh
3.加载数据到Hadoop系统中。
[root@node20 data]# ./tpch_prepare_data.sh
mkdir: `/tpch': File exists
mkdir: `/tpch/customer': File exists
mkdir: `/tpch/lineitem': File exists
mkdir: `/tpch/nation': File exists
mkdir: `/tpch/orders': File exists
mkdir: `/tpch/part': File exists
mkdir: `/tpch/partsupp': File exists
mkdir: `/tpch/region': File exists
mkdir: `/tpch/supplier': File exists
copyFromLocal: `customer.tbl': No such file or directory
copyFromLocal: `/tpch/lineitem/lineitem.tbl': File exists
copyFromLocal: `nation.tbl': No such file or directory
copyFromLocal: `part.tbl': No such file or directory
copyFromLocal: `partsupp.tbl': No such file or directory
copyFromLocal: `region.tbl': No such file or directory
copyFromLocal: `supplier.tbl': No such file or directory
查看导入情况:
[root@node20 data]# hadoop fs -ls /tpch
Found 8 items
drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/customer
drwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitem
drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/nation
drwxr-xr-x - root supergroup 0 2016-05-11 14:14 /tpch/orders
drwxr-xr-x - root supergroup 0 2016-05-06 12:10 /tpch/part
drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/partsupp
drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/region
drwxr-xr-x - root supergroup 0 2016-05-06 12:11 /tpch/supplier
[root@node20 data]# hadoop fs -ls /tpch/orders
Found 1 items
-rw-r--r-- 3 root supergroup 870187306 2016-05-11 14:14 /tpch/orders/orders.tbl
[root@node20 data]# hadoop fs -ls /tpch/lineitem
Found 2 items
drwxr-xr-x - root supergroup 0 2016-05-06 15:42 /tpch/lineitem/lineitem
-rw-r--r-- 3 root supergroup 7775727688 2016-05-06 12:12 /tpch/lineitem/lineitem.tbl
【注】如果你遇到如下异常情况,请确保impala所需的环境变量已经导入。执行source ${IMPALA_HOME}/bin/impala-config.sh
[root@node20 data]# ./tpch_prepare_data.sh
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
-mkdir: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -mkdir [-p] ...
^C-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
-copyFromLocal: java.net.UnknownHostException: node1.hadoop.com
Usage: hadoop fs [generic options] -copyFromLocal [-f] [-p] ...
各类表的参考描述如下:
The official TPC-H specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
直接执行如下语句,会一直卡住。
[root@node20 tpc-h-impala]# ./tpch_benchmark.sh
***********************************************
* TPC-H benchmark on Impala *
***********************************************
See benchmark.log for more details of query errors.
Executing Trial #1 of 1 trial(s)...
Running query: q1_pricing_summary_report
SO,直接查看该文件,可知自动导入表到impala是通过tpch_prepare/文件夹中的.hive文件来完成的。
分析可知,直接打开impala-shell,手动创建外部表。
[node20:21000] > create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';
Query: create external table orders (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE STRING, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders'
Fetched 0 row(s) in 0.04s
[node20:21000] > select count(*) from orders;
Query: select count(*) from orders
+----------+
| count(*) |
+----------+
| 7500000 |
+----------+
Fetched 1 row(s) in 5.03s
5.生成parquet格式的数据表。
[node20:21000] > create database tpch_parquet;
Query: create database tpch_parquet
Fetched 0 row(s) in 0.10s
[node20:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default | Default Hive database |
| tpch_parquet | |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
[node20:21000] > create table tpch_parquet.orders_pq stored as parquet as select * from orders;
Query: create table tpch_parquet.orders_pq stored as parquet as select * from orders
+-------------------------+
| summary |
+-------------------------+
| Inserted 7500000 row(s) |
+-------------------------+
Fetched 1 row(s) in 10.62s
[node20:21000] > desc tpch_parquet.orders_pq;
Query: describe tpch_parquet.orders_pq
+-----------------+--------+---------+
| name | type | comment |
+-----------------+--------+---------+
| o_orderkey | int | |
| o_custkey | int | |
| o_orderstatus | string | |
| o_totalprice | double | |
| o_orderdate | string | |
| o_orderpriority | string | |
| o_clerk | string | |
| o_shippriority | int | |
| o_comment | string | |
+-----------------+--------+---------+
Fetched 9 row(s) in 0.01s
[node20:21000] > show table stats tpch_parquet.orders_pq;
Query: show table stats tpch_parquet.orders_pq
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
| -1 | 3 | 269.18MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://localhost:20500/test-warehouse/tpch_parquet.db/orders_pq |
+-------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------+
Fetched 1 row(s) in 0.01s