There are two parts to this Unix-shell project.
Part-1: Building tpcds-gen-1.1.jar (version 1.1 is the latest at present).
Part-2: Generating the TPC-DS flat data and creating the TPC-DS tables.
Part-1
YOU DO NOT NEED TO RUN PART-1 IF YOU ALREADY HAVE tpcds-gen-1.1.jar IN generator/target.
Precondition:
a). gcc, mvn, java and unzip are installed and on the PATH.
b). Internet access.
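The precondition checks above can be sketched as a small helper; `require_cmd` is a hypothetical function, not part of the project:

```shell
#!/bin/sh
# Minimal sketch (not part of the project): report which Part-1
# prerequisites are missing from PATH before running build.sh.
require_cmd() {
    # command -v is POSIX and works for binaries, functions and builtins.
    command -v "$1" > /dev/null 2>&1
}

missing=""
for tool in gcc mvn java unzip; do
    require_cmd "$tool" || missing="$missing $tool"
done

if [ -n "$missing" ]; then
    echo "Missing build prerequisites:$missing"
else
    echo "All build prerequisites found."
fi
```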
To build the TPC-DS generator:
1). Change to the generator directory and run build.sh:
$ cd generator; ./build.sh
You will get tpcds-gen-1.1.jar in the target sub-directory. Leave it there; it will be used in Part-2.
Part-2
Precondition:
a). hadoop, hdfs installed and can be accessed.
b). At least one of transwarp, impala-shell or hive is installed and on the PATH.
To generate the TPC-DS data and create the TPC-DS tables in a given storage format (flat text, orc, parquet):
1). Change to the bin directory and check the variables in tpcds-env.sh:
$ cd bin; cat tpcds-env.sh
The variables you should take care of:
a) TPCDS_SCALE -> Mandatory, TPC-DS scale in GB.
b) EXEC_ENGINE -> Mandatory, query engine; only transwarp, impala-shell and hive are supported at present.
c) TRANS_HOST -> Optional, IP or hostname of the Inceptor server when the query engine is transwarp.
d) TEXT_DB -> Optional, name of the flat (text) database; tpcds_text_"$TPCDS_SCALE" by default.
e) TBL_FORMAT -> Mandatory, storage format of the tables you need.
f) FORMAT_DB -> Optional, name of the non-flat database; tpcds_"$TBL_FORMAT"_"$TPCDS_SCALE" by default.
g) DELETE_MODE -> Mandatory, whether to delete the flat tables and their files in HDFS afterwards; true deletes them, false keeps them.
h) LOCATION_HDFS -> Mandatory, HDFS directory for the flat files.
Check and confirm them, and modify where needed.
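Because tpcds-env.sh assigns most variables with the ${VAR:-default} pattern, exporting a variable before the file is sourced keeps your value; a small demonstration of that expansion (the values below are examples, not project defaults):

```shell
#!/bin/sh
# ${VAR:-default} keeps an existing non-empty value and falls back to
# the default only when the variable is unset or empty - the mechanism
# tpcds-env.sh relies on for its overridable settings.
export TPCDS_SCALE=100               # pre-set before the file is sourced
SCALE=${TPCDS_SCALE:-1000}           # keeps 100
unset EXEC_ENGINE
ENGINE=${EXEC_ENGINE:-impala-shell}  # falls back to the default
echo "scale=$SCALE engine=$ENGINE"   # prints: scale=100 engine=impala-shell
```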
tpcds-env.sh
# PROJ_HOME
PROJ_BIN=$(dirname "${BASH_SOURCE-$0}")
PROJ_HOME=$(cd "$PROJ_BIN"/..; pwd)

# TPC-DS scale in GB
export TPCDS_SCALE=${TPCDS_SCALE:-1000}              # 1 TB of data
# Query engine; only transwarp, impala-shell and hive are supported at present.
export EXEC_ENGINE=${EXEC_ENGINE:-impala-shell}      # generate data with impala-shell
# Inceptor server when the query engine is transwarp.
export TRANS_HOST=${TRANS_HOST:-localhost}

export TEXT_DB=tpcds_text_"$TPCDS_SCALE"
# Table format we need; only orc, flat and parquet are supported at present.
export TBL_FORMAT=parquet                            # table format is parquet
export FORMAT_DB=tpcds_"$TBL_FORMAT"_"$TPCDS_SCALE"
# Whether to delete the text files in HDFS afterwards.
export DELETE_MODE=${DELETE_MODE:-true}

case $EXEC_ENGINE in
    transwarp)
        which transwarp > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "transwarp not found, data generation will exit soon" && exit 1
        elif [ 'X'$TRANS_HOST == 'X' ]; then
            echo "Inceptor server does not exist while the query engine is transwarp!" && exit 1
        fi
        # Directory for flat files in HDFS.
        export LOCATION_HDFS=/user/transwarp/tpcds
        ;;
    impala-shell)
        which impala-shell > /dev/null 2>&1
        [ $? -ne 0 ] && echo "impala-shell not found, data generation will exit soon."
        # Directory for flat files in HDFS.
        export LOCATION_HDFS=/user/impala/tpcds
        ;;
    hive)
        which hive > /dev/null 2>&1
        [ $? -ne 0 ] && echo "hive not found, data generation will exit soon."
        # Directory for flat files in HDFS.
        export LOCATION_HDFS=/user/hive/tpcds
        ;;
    spark-sql)
        which spark-sql > /dev/null 2>&1
        [ $? -ne 0 ] && echo "spark-sql not found, data generation will exit soon."
        # Directory for flat files in HDFS.
        export LOCATION_HDFS=/user/spark/tpcds
        ;;
    *)
        echo "Invalid engine, only transwarp, impala-shell and hive are supported at present."
        exit 1
        ;;
esac
2) Generate raw data
$ ./gen-data.sh
Some options will override the variables in tpcds-env.sh; it is highly recommended NOT to do this.
-s | --scale, scale (in GB).
-l | --location, HDFS directory for flat files.
-h | --help, Show this help message.
This script generates the flat data and puts it into HDFS.
gen-data.sh
source ./tpcds-env.sh

echo "********************************************************************"
echo "*****       Generate data by run mapreduce routine            *****"
echo "****        hadoop jar tpcds-gen.jar -d XXX -s XXX             ****"
echo "********************************************************************"

function usage {
    echo "Usage: $0
    -s | --scale, scale (in GB).
    -l | --location, HDFS directory for flat files.
    -h | --help, Show this help message."
}

while [ $# -gt 0 ]; do
    case "$1" in
        -s | --scale)
            shift; TPCDS_SCALE=$1; shift ;;
        -l | --location)
            shift; LOCATION_HDFS=$1; shift ;;
        -h | --help)
            HELP=true; shift ;;
        *)
            echo "Invalid args: $1"; exit 1 ;;
    esac
done

[ "$HELP" == "true" ] && usage && exit 1

if [ ! -f $PROJ_HOME/generator/target/tpcds-gen-1.1.jar ]; then
    echo "tpcds-gen-1.1.jar not found. Build the data generator with \
build.sh first or make sure tpcds-env.sh is modified correctly."
    exit 1
fi

which hadoop > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "Script must be run where hadoop is installed"
    exit 1
fi
which hdfs > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "Script must be run where hdfs is installed"
    exit 1
fi

# Ensure arguments exist.
if [ X"$TPCDS_SCALE" = "X" ]; then
    usage && exit 1
fi
if [ X"$LOCATION_HDFS" = "X" ]; then
    usage && exit 1
fi

# Sanity checking.
if [ $TPCDS_SCALE -lt 1 ]; then
    echo "Scale factor cannot be less than 1"
    exit 1
fi

if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
    read -p "You are generating ${TPCDS_SCALE}g tpcds data and then store it at HDFS directory ${LOCATION_HDFS}, disk usage of HDFS will be ${TPCDS_SCALE}g, is that OK [Yes|No]? " CONFIRM
    [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and your HDFS storage and run again." && exit 1
fi

hdfs dfs -mkdir -p ${LOCATION_HDFS}
# TODO: How to test the directory is writable for current user gracefully?
hdfs dfs -put $PROJ_HOME/bin/tpcds-env.sh ${LOCATION_HDFS} > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "${LOCATION_HDFS} is not writable for current user." && exit 1
else
    hdfs dfs -rm ${LOCATION_HDFS}/tpcds-env.sh > /dev/null 2>&1
fi

hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE} > /dev/null 2>&1 || \
    (cd $PROJ_HOME/generator; hadoop jar target/tpcds-gen-*.jar -d ${LOCATION_HDFS}/${TPCDS_SCALE}/ -s ${TPCDS_SCALE})
hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE}
When running the script, you may hit a permission error: the corresponding HDFS directory is not writable. A colleague's suggested fix was:
hdfs dfs -chmod 777 <directory>
but for some reason it did not work.
I found two approaches online (http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs).
Approach 1 (did not work for me):
"I solved this problem temporarily by disabling the dfs permission, by adding the below property code to conf/hdfs-site.xml"
dfs.permissions false
Approach 2: change an environment variable:
export HADOOP_USER_NAME=hdfs
This solved it perfectly.
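The workaround can be sketched as below. It works because, with simple (non-Kerberos) authentication, the HDFS client takes its identity from HADOOP_USER_NAME, so subsequent hdfs commands act as the hdfs superuser; the mkdir path is only an example, and the call is guarded so the sketch is harmless on machines without hadoop:

```shell
#!/bin/sh
# Sketch of the HADOOP_USER_NAME workaround, assuming simple (non-Kerberos)
# authentication. The target path is an example.
export HADOOP_USER_NAME=hdfs

if command -v hdfs > /dev/null 2>&1; then
    hdfs dfs -mkdir -p /user/impala/tpcds   # now runs as user 'hdfs'
else
    echo "hdfs client not installed; identity would be: $HADOOP_USER_NAME"
fi
```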
A small lesson learned: services like HDFS must not be shut down with a casual ctrl+z; use the bundled stop scripts or the admin UI. While tinkering I had unknowingly started several HDFS instances in the background; the node nearly froze, they could not be killed cleanly, and I rashly rebooted the node, after which its HDFS could no longer mount the original three disks. A PhD colleague from SJTU spent an hour and a half on it, and then the whole cluster would not come up; he only got it fixed the next morning. (I was a terrible teammate.)
3) Create text table
$ ./create-text-table.sh
Some options will override the variables in tpcds-env.sh; it is highly recommended NOT to do this.
-l | --location, HDFS directory of files generated by gen-data.sh.
-t | --textdb, The flat database's name.
-e | --engine, Query engine, one of [transwarp | impala-shell | hive].
-h | --help, show this help text.
This script will help you create flat tables against the data generated by gen-data.sh.
create-text-table.sh
source ./tpcds-env.sh

echo "*****************************************************************"
echo "**** Create text table by run Transwarp | Hive | Impala DDL ****"
echo "****              Transwarp -f table-ddl.sql                ****"
echo "*****************************************************************"

function usage {
    echo "Usage: $0
    -s | --scale, scale (in GB).
    -l | --location, HDFS directory of files generated by gen-data.sh.
    -t | --textdb, The flat database's name.
    -e | --engine, Query engine, one of [transwarp | impala-shell | hive].
    -i | --inceptor, Inceptor server for the transwarp engine.
    -h | --help, show this help text."
}

while [ $# -gt 0 ]; do
    case "$1" in
        -s | --scale)
            shift; TPCDS_SCALE=$1; shift ;;
        -l | --location)
            shift; LOCATION_HDFS=$1; shift ;;
        -t | --textdb)
            shift; TEXT_DB=$1; shift ;;
        -e | --engine)
            shift; EXEC_ENGINE=$1; shift ;;
        -i | --inceptor)
            shift; INCEPTOR_SERVER=$1; shift ;;
        -h | --help)
            HELP=true; shift ;;
        *)
            echo "Invalid args: $1"; exit 1 ;;
    esac
done

[ X"$HELP" == X"true" ] && usage && exit 1

# Tables in the TPC-DS schema.
LIST="date_dim time_dim item customer customer_demographics household_demographics \
customer_address store promotion warehouse ship_mode reason income_band call_center \
web_page catalog_page inventory store_sales store_returns web_sales web_returns \
web_site catalog_sales catalog_returns"

if [ X"$LOCATION_HDFS" == "X" ]; then
    usage && exit 1
fi
if [ X"$TPCDS_SCALE" == "X" ]; then
    usage && exit 1
fi

hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE} > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "Flat files in hdfs do not exist, run gen-data.sh first."
    exit 1
fi

# Generate the flat tables.
if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
    read -p "You are creating tpcds flat tables based on the data stored at HDFS \
directory ${LOCATION_HDFS}/${TPCDS_SCALE}, and the database name is $TEXT_DB, \
is that OK [Yes|No]? " CONFIRM
    [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and run again." && exit 1
fi

cmd=$EXEC_ENGINE
[ $EXEC_ENGINE = "transwarp" ] && {
    INCEPTOR_SERVER=${INCEPTOR_SERVER:-localhost}
    echo "connect to inceptor server: $INCEPTOR_SERVER"
    cmd="transwarp -t -h $INCEPTOR_SERVER"
}

$cmd -e "create database if not exists $TEXT_DB"

# Temporary workaround since hive cannot substitute the DB and LOCATION vars.
tmp_dir=$PROJ_HOME/ddl/text_tmp
[ -d $tmp_dir ] && rm -rf $tmp_dir
mkdir -p $tmp_dir
cp -rf $PROJ_HOME/ddl/text/* $tmp_dir

for t in ${LIST}; do
    echo "Creating table $t..."
    sql_file=$tmp_dir/${t}.sql
    LOCATION=${LOCATION_HDFS}/${TPCDS_SCALE}/${t}
    location_expr=`echo $LOCATION | sed s#/#'\\\/'#g`
    sed -i "s#\${DB}#$TEXT_DB#g" $sql_file
    sed -i "s#\${LOCATION}#$location_expr#g" $sql_file
    # transwarp -h option is not needed here.
    $cmd -i $PROJ_HOME/settings/load-flat.sql -f $sql_file \
        -d DB=$TEXT_DB -d LOCATION=${LOCATION_HDFS}/${TPCDS_SCALE}/${t} > /dev/null
done
4) Create tables in another storage format
$ ./create-none-text-table.sh
Some options will override the variables in tpcds-env.sh; it is highly recommended NOT to do this.
-s | --sourcedb, Source database (the flat database's name).
-t | --targetdb, Target database (the non-flat database's name).
-f | --format, Storage format of the tables you need.
-h | --host, Inceptor server host, localhost by default.
-d | --delete, DELETE_MODE, true or false.
-l | --location, LOCATION_HDFS of the flat files; cannot be empty when DELETE_MODE is true.
--help, Show this help message.
This script creates non-flat tables based on the flat tables created by create-text-table.sh. The flat tables and their files in HDFS are removed if DELETE_MODE is true.
Everything is integrated in all-in-one.sh, which runs gen-data, create-text-table and create-none-text-table in sequence.
It is highly recommended to check and modify tpcds-env.sh rather than override its variables with command-line options.
create-none-text-table.sh
#!/bin/bash
#
# (c) Copyright 2013-2015 Transwarp, Inc.
#

source ./tpcds-env.sh

echo "********************************************************************"
echo "*****  Tables stored as other file-format is implemented as:  *****"
echo "****  create table t stored as XXX as select * from TEXT_DB.t  ****"
echo "********************************************************************"

function usage {
    echo "Usage: $0
    -s | --sourcedb, TEXT_DB
    -t | --targetdb, FORMAT_DB
    -f | --format, TBL_FORMAT
    -h | --host, Inceptor server host, localhost by default.
    -d | --delete, DELETE_MODE, true or false
    -l | --location, LOCATION_HDFS of flat file, can not be empty when DELETE_MODE is true
    --help, show this help message."
}

while [ $# -gt 0 ]; do
    case "$1" in
        -s | --sourcedb)
            shift; TEXT_DB=$1; shift ;;
        -t | --targetdb)
            shift; FORMAT_DB=$1; shift ;;
        -f | --format)
            shift; TBL_FORMAT=$1; shift ;;
        -d | --delete)
            shift; DELETE_MODE=$1; shift ;;
        -l | --location)
            shift; LOCATION_HDFS=$1; shift ;;
        -h | --host)
            shift; TRANS_HOST=$1; shift ;;
        --help)
            HELP=true; shift ;;
        *)
            echo "Invalid args: $1"; exit 1 ;;
    esac
done

[ "$HELP" == "true" ] && usage && exit 0

if [ 'X'$TEXT_DB == 'X' ]; then
    usage
    exit 1
fi
if [ 'X'$FORMAT_DB == 'X' ]; then
    usage
    exit 1
fi
if [ 'X'$TBL_FORMAT == "Xflat" ]; then
    echo "This script is designed to generate none flat tables!"
    exit 1
fi
if [ 'X'$DELETE_MODE == "Xtrue" ] && [ X"$LOCATION_HDFS" == "X" ]; then
    echo "LOCATION_HDFS is needed when DELETE_MODE is true."
    usage
    exit 1
fi

# Tables in the TPC-DS schema.
LIST="date_dim time_dim item customer customer_demographics household_demographics \
customer_address store promotion warehouse ship_mode reason income_band call_center \
web_page catalog_page inventory store_sales store_returns web_sales web_returns \
web_site catalog_sales catalog_returns"

case $TBL_FORMAT in
    orc)
        # Do nothing at present.
        ;;
    flat)
        # Do nothing at present.
        ;;
    parquet)
        if [ ! $EXEC_ENGINE == "impala-shell" ]; then
            echo "parquet is only supported by impala at present. Data generation will exit soon."
            exit 1
        fi
        ;;
    *)
        echo "Invalid format, only orc, flat, parquet are supported at present."
        exit 1
        ;;
esac

[ ! 'X'$DELETE_MODE == 'Xtrue' ] && [ ! 'X'$DELETE_MODE == 'Xfalse' ] && \
    echo "Invalid DELETE_MODE, true or false is expected while got $DELETE_MODE" && exit 1

if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
    if [ 'X'$DELETE_MODE == "Xtrue" ]; then
        read -p "You are creating ${TBL_FORMAT} tables and the database name is \
$FORMAT_DB. Afterwards, all tables in $TEXT_DB are dropped and related files in \
HDFS are removed, is that OK [Yes|No]? " CONFIRM
    elif [ 'X'$DELETE_MODE == "Xfalse" ]; then
        read -p "You are creating ${TBL_FORMAT} tables and the database name is \
$FORMAT_DB. Afterwards, all tables in $TEXT_DB and related files in HDFS are \
preserved, is that OK [Yes|No]? " CONFIRM
    fi
    [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and run again." && exit 1
fi

if [ 'X'$EXEC_ENGINE == "Xtranswarp" ]; then
    echo "Creating database..."
    transwarp -h $TRANS_HOST -e "create database if not exists $FORMAT_DB;" > /dev/null
    for t in $LIST; do
        echo "Creating table $t..."
        transwarp -h $TRANS_HOST --database $FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null
        [ 'X'$DELETE_MODE == "Xtrue" ] && transwarp -h $TRANS_HOST --database $TEXT_DB -e "drop table if exists $t;" > /dev/null
    done
    [ 'X'$DELETE_MODE == "Xtrue" ] && transwarp -h $TRANS_HOST -e "drop database if exists $TEXT_DB;" > /dev/null
elif [ 'X'$EXEC_ENGINE == "Ximpala-shell" ]; then
    echo "Creating database..."
    impala-shell -q "create database if not exists $FORMAT_DB" > /dev/null
    for t in $LIST; do
        echo "Creating table $t..."
        impala-shell --database $FORMAT_DB -q "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null
        [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell --database $TEXT_DB -q "drop table if exists $t;" > /dev/null
    done
    [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell -q "drop database if exists $TEXT_DB;" > /dev/null
elif [ 'X'$EXEC_ENGINE == "Xhive" ] || [ 'X'$EXEC_ENGINE == "Xspark-sql" ]; then
    echo "Creating database..."
    hive -e "create database if not exists $FORMAT_DB;" > /dev/null
    for t in $LIST; do
        echo "Creating table $t..."
        $EXEC_ENGINE --database $FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null
        [ 'X'$DELETE_MODE == "Xtrue" ] && $EXEC_ENGINE --database $TEXT_DB -e "drop table if exists $t;" > /dev/null
    done
    [ 'X'$DELETE_MODE == "Xtrue" ] && $EXEC_ENGINE -e "drop database if exists $TEXT_DB;" > /dev/null
fi

if [ $DELETE_MODE == "true" ]; then
    hdfs dfs -rm -r -f $LOCATION_HDFS
fi
The latest Impala version seems to have changed its syntax slightly; if the SQL above fails, try this instead:
elif [ 'X'$EXEC_ENGINE == "Ximpala-shell" ]; then
    echo "Creating database..."
    impala-shell -q "create database if not exists $FORMAT_DB" > /dev/null
    for t in $LIST; do
        echo "Creating table $t..."
        impala-shell --database $FORMAT_DB -q "create table if not exists $t like $TEXT_DB.$t stored as $TBL_FORMAT; insert overwrite $t select * from $TEXT_DB.$t;" > /dev/null
        [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell --database $TEXT_DB -q "drop table if exists $t;" > /dev/null
    done
    [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell -q "drop database if exists $TEXT_DB;" > /dev/null
The data volume is quite large; by rough estimate the Impala Daemon memory limit for the group needs to be set to 12 GB or more.
First, create a config file containing use [db.name];
e.g.: use tpcds_parquet_1000;
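The step above can be sketched as follows; the file name tpcds.config is an example, not a name fixed by the test script:

```shell
#!/bin/sh
# Write the per-run config file described above. Both the database name
# and the file name below are examples - use whatever the test script
# on your cluster expects.
DB=tpcds_parquet_1000
CONFIG=./tpcds.config

echo "use $DB;" > "$CONFIG"
cat "$CONFIG"
```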
tpcds-test-impala.sh
## print basic configuration
echo "
Configuration:
----------------------------------------------
Database: $DB
Perf:     $query_perf
Logs:     $query_log
Output:   $query_out
----------------------------------------------
" | tee -a $query_perf

## find the impala-shell ENGINE
which $ENGINE > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "ERROR: Cannot find '$ENGINE' in your system path." && exit -1;
fi

## print instant performance results
printf "Performance results:
--------------------
%3s %8s %8s %8s %8s %8s %8s %s
" no success t_shell t_impala latest ln_run ln_avg info. \
    | tee -a $query_perf

## do the dirty work
if [ $QUERY_FILENO -eq -1 ]; then
    runs_total=99
    for i in {1..99}; do
        run_query $i
    done
elif [ $QUERY_FILENO -ge 0 ] && [ $QUERY_FILENO -le 99 ]; then
    runs_total=1
    run_query $QUERY_FILENO
else
    echo "ERROR: invalid query file no. (1~99)"
    exit -1;
fi

## print final coverage stats
printf "
Performance statistics:
-----------------------
Total queries: %s
Coverage:      %s
Time shell:    %ss
Time impala:   %ss\n
" $runs_total \
    $(echo "scale=1; $runs_success/$runs_total*100" | bc)% \
    $runs_time_shell \
    $runs_time_impala \
    | tee -a $query_perf

exit 0;
## EOF
The 99 test SQL files (query 1-99) are shared here: http://pan.baidu.com/s/1eSjUPm6
Later the team lead gave me a set of brutal SQL queries; every so often one hits a syntax incompatibility, or Impala's memory and CPU cannot keep up. In that case you can kill the statement with ctrl+c and move straight on to the next one.
Attached are the query descriptions (from the TPC Benchmark DS Standard Specification, Version 1.3.1, Appendix B):

B.1 query1.tpl - Find customers who have returned items more than 20% more often than the average customer returns for a store in a given state for a given year.
Qualification Substitution Parameters: YEAR.01=2000, STATE.01=TN, AGGFIELD.01=SR_RETURN_AMT

B.2 query2.tpl - Report the increase of weekly web and catalog sales from one year to the next year for each week. That is, compute the increase of Monday, Tuesday, ... Sunday sales from one year to the following.
Qualification Substitution Parameters: YEAR.01=2001

B.3 query3.tpl - Report the total extended sales price per item brand of a specific manufacturer for all sales in a specific month of the year.
Qualification Substitution Parameters: MONTH.01=11, MANUFACT=128, AGGC=s_ext_sales_price

B.4 query4.tpl - Find customers who spend more money via catalog than in stores. Identify preferred customers and their country of origin.
Qualification Substitution Parameters: YEAR.01=2001, SELECTCONE.01=t_s_secyear.customer_id, t_s_secyear.customer_first_name, t_s_secyear.customer_last_name, t_s_secyear.c_preferred_cust_flag, t_s_secyear.c_birth_country, t_s_secyear.c_login, t_s_secyear.c_email_address

B.5 query5.tpl - Report sales, profit, return amount, and net loss in the store, catalog, and web channels for a 14-day window. Rollup results by sales channel and channel specific sales method (store for store sales, catalog page for catalog sales and web site for web sales).
Qualification Substitution Parameters: SALES_DATE.01=2000-08-23, YEAR.01=2000

B.6 query6.tpl - List all the states with at least 10 customers who during a given month bought items with the price tag at least 20% higher than the average price of items in the same category.
Qualification Substitution Parameters: MONTH.01=1, YEAR.01=2001

B.7 query7.tpl - Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event. Restrict the results to a specific gender, marital and educational status.
Qualification Substitution Parameters: YEAR.01=2000, ES.01=College, MS.01=S, GEN.01=M

B.8 query8.tpl - Compute the net profit of stores located in 400 Metropolitan areas with more than 10 preferred customers.
Qualification Substitution Parameters: 400 zip code values ZIP.01=24128 through ZIP.400=35576 (the full list of 400 zip codes is omitted here), QOY.01=2, YEAR.01=1998

B.9 query9.tpl - Categorize store sales transactions into 5 buckets according to the number of items sold. Each bucket contains the average discount amount, sales price, list price, tax, net paid, paid price including tax, or net profit.
Qualification Substitution Parameters: AGGCTHEN.01=ss_ext_discount_amt, AGGCELSE.01=ss_net_paid, RC.01=74129, RC.02=122840, RC.03=56580, RC.04=10097, RC.05=165306

B.10 query10.tpl - Count the customers with the same gender, marital status, education status, purchase estimate, credit rating, dependent count, employed dependent count and college dependent count who live in certain counties and who have purchased from both stores and another sales channel during a three month time period of a given year.
Qualification Substitution Parameters: YEAR.01=2002, MONTH.01=1, COUNTY.01=Rush County, COUNTY.02=Toole County, COUNTY.03=Jefferson County, COUNTY.04=Dona Ana County, COUNTY.05=La Porte County

B.11 query11.tpl - Find customers whose increase in spending was larger over the web than in stores this year compared to last year.
Qualification Substitution Parameters: YEAR.01=2001, SELECTONE=t_s_secyear.customer_preferred_cust_flag

B.12 query12.tpl - Compute the revenue ratios across item classes: for each item in a list of given categories, sold through the web channel during a 30 day time period, compute the ratio of sales of that item to the sum of all of the sales in that item's class.
Qualification Substitution Parameters: CATEGORY.01=Sports, CATEGORY.02=Books, CATEGORY.03=Home, SDATE.01=1999-02-22, YEAR.01=1999

B.13 query13.tpl - Calculate the average sales quantity, average sales price, average wholesale cost, total wholesale cost for store sales of different customer types (e.g., based on marital status, education status) including their household demographics, sales price and different combinations of state and sales profit for a given year.
Qualification Substitution Parameters: YEAR.01=2001, STATE.01=TX, STATE.02=OH, STATE.03=TX, STATE.04=OR, STATE.05=NM, STATE.06=KY, STATE.07=VA, STATE.08=TX, STATE.09=MS, ES.01=Advanced Degree, ES.02=College, ES.03=2 yr Degree, MS.01=M, MS.02=S, MS.03=W

B.14 query14.tpl - This query contains multiple iterations. Iteration 1: first identify items in the same brand, class and category that are sold in all three sales channels in two consecutive years. Then compute the average sales (quantity * list price) across all sales of all three sales channels in the same three years (average sales). Finally, compute the total sales and the total number of sales rolled up for each channel, brand, class and category. Only consider cross channel sales that had sales larger than the average sale. Iteration 2: based on the previous query, compare December store sales.
Qualification Substitution Parameters: DAY.01=11, YEAR.01=1999

B.15 query15.tpl - Report the total catalog sales for customers in selected geographical regions or who made large purchases for a given year and quarter.
Qualification Substitution Parameters: QOY.01=2, YEAR.01=2001

B.16 query16.tpl - Report number of orders, total shipping costs and profits from catalog sales of particular counties and states for a given 60 day period for non-returned sales filled from an alternate warehouse.
Qualification Substitution Parameters: COUNTY_E.01=Williamson County, COUNTY_D.01=Williamson County, COUNTY_C.01=Williamson County, COUNTY_B.01=Williamson County, COUNTY_A.01=Williamson County, STATE.01=GA, MONTH.01=2, YEAR.01=2002

B.17 query17.tpl - Analyze, for each state, all items that were sold in stores in a particular quarter and returned in the next three quarters and then re-purchased by the customer through the catalog channel in the three following quarters.
Qualification Substitution Parameters: YEAR.01=2001

B.18 query18.tpl - Compute, for each county, the average quantity, list price, coupon amount, sales price, net profit, age, and number of dependents for all items purchased through catalog sales in a given year by customers who were born in a given list of six months and living in a given list of seven states and who also belong to a given gender and education demographic.
Qualification Substitution Parameters: MONTH.01=1, MONTH.02=6, MONTH.03=8, MONTH.04=9, MONTH.05=12, MONTH.06=2, STATE.01=MS, STATE.02=IN, STATE.03=ND, STATE.04=OK, STATE.05=NM, STATE.06=VA, STATE.07=MS, ES.01=Unknown, GEN.01=F, YEAR.01=1998

B.19 query19.tpl - Select the top revenue generating products bought by out of zip code customers for a given year, month and manager.
Qualification Substitution Parameters: MANAGER.01=8, MONTH.01=11, YEAR.01=1998

B.20 query20.tpl - Compute the total revenue and the ratio of total revenue to revenue by item class for specified item categories and time periods.
Qualification Substitution Parameters: CATEGORY.01=Sports, CATEGORY.02=Books, CATEGORY.03=Home, SDATE.01=1999-02-22, YEAR.01=1999

B.21 query21.tpl - For all items whose price was changed on a given date, compute the percentage change in inventory between the 30-day period BEFORE the price change and the 30-day period AFTER the change. Group this information by warehouse.
Qualification Substitution Parameters: SALES_DATE.01=2000-03-11, YEAR.01=2000

B.22 query22.tpl - For each product name, brand, class, category, calculate the average quantity on hand. Rollup data by product name, brand, class and category.
Qualification Substitution Parameters: DMS.01=1200

B.23 query23.tpl - This query contains multiple, related iterations: find frequently sold items that were sold more than 4 times per day in four consecutive years. Compute the maximum store sales made by any given customer in a period of four consecutive years (same as above). Compute the best store customers as those that are in the 5th percentile of sales. Finally, compute the total sales of sales in March made by our best customers buying our most frequent items.
Qualification Substitution Parameters: MONTH.01=2, YEAR.01=2000, TOPPERCENT=50

B.24 query24.tpl - This query contains multiple, related iterations. Iteration 1: calculate the total specified monetary value of items in a specific color for store sales transactions by customer name and store, in a specific market, from customers who currently live in their birth countries and in the neighborhood of the store, and list only those customers for whom the total specified monetary value is greater than 5% of the average value. Iteration 2: the same calculation for items in a specific color and specific size.
Qualification Substitution Parameters: MARKET=8, COLOR.1=pale, COLOR.2=chiffon, AMOUNTONE=ss_net_paid

B.25 query25.tpl - Get all items that were sold in stores in a particular month and year, returned in the next three quarters, and re-purchased by the customer through the catalog channel in the six following months. For these items, compute the sum of net profit of store sales, net loss of store returns and net profit of catalog sales. Group this information by item and store.
Qualification Substitution Parameters: MONTH.01=4, YEAR.01=2001, AGG.01=sum

B.26 query26.tpl - Computes the average quantity, list price, discount, sales price for promotional items sold through the catalog channel where the promotion was not offered by mail or in an event, for a given gender, marital status and educational status.
Qualification Substitution Parameters: YEAR.01=2000, ES.01=College, MS.01=S, GEN.01=M

B.27 query27.tpl - For all items sold in stores located in six states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic.
Qualification Substitution Parameters: STATE_F.01=TN, STATE_E.01=TN, STATE_D.01=TN, STATE_C.01=TN, STATE_B.01=TN, STATE_A.01=TN, ES.01=College, MS.01=S, GEN.01=M, YEAR.01=2002

B.28 query28.tpl - Calculate the average list price, number of non empty (null) list prices and number of distinct list prices of six different sales buckets of the store sales channel. Each bucket is defined by a range of distinct items and information about list price, coupon amount and wholesale cost.
Qualification Substitution Parameters: WHOLESALECOST.01=57, WHOLESALECOST.02=31, WHOLESALECOST.03=79, WHOLESALECOST.04=38, WHOLESALECOST.05=17, WHOLESALECOST.06=7, COUPONAMT.01=459, COUPONAMT.02=2323, COUPONAMT.03=12214, COUPONAMT.04=6071, COUPONAMT.05=836, COUPONAMT.06=7326, LISTPRICE.01=8, LISTPRICE.02=90, LISTPRICE.03=142, LISTPRICE.04=135, LISTPRICE.05=122, LISTPRICE.06=154

B.29 query29.tpl - Get all items that were sold in stores in a specific month and year and which were returned in the next six months of the same year and re-purchased by the returning customer afterwards through the catalog sales channel in the following three years. For those items, compute the total quantity sold through the store, the quantity returned and the quantity purchased through the catalog. Group this information by item and store.
Qualification Substitution Parameters: MONTH.01=9, YEAR.01=1999, AGG.01=29

B.30 query30.tpl - Find customers and their detailed customer data who have returned items, which they bought on the web, for an amount that is 20% higher than the average amount a customer returns in a given state in a given time period across all items. Order the output by customer data.
Qualification Substitution Parameters: YEAR.01=2002, STATE.01=GA

B.31 query31.tpl - List the top five counties where the percentage growth in web sales is consistently higher compared to the percentage growth in store sales in the first three consecutive quarters for a given year.
Qualification Substitution Parameters: YEAR.01=2000, AGG.01=ss1.ca_county

B.32 query32.tpl - Compute the total discounted amount for a particular manufacturer in a particular 90 day period for catalog sales whose discounts exceeded the average discount by at least 30%.
Qualification Substitution Parameters:
CSDATE.01 = 2000-01-27 YEAR.01 = 2000 IMID.01 = 977 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 127 of 156B.33 query33.tpl What is the monthly sales figure based on extended price for a specific month in a specific year, for manufacturers in a specific category in a given time zone. Group sales by manufacturer identifier and sort output by sales amount, by channel, and give Total sales. Qualification Substitution Parameters: ? ? ? ? CATEGORY.01 = Electronics GMT.01 = -5 MONTH.01 = 5 YEAR.01 = 1998 B.34 query34.tpl Display all customers with specific buy potentials and whose dependent count to vehicle count ratio is larger than 1.2, who in three consecutive years made purchases with between 15 and 20 items in the beginning or the end of each month in stores located in 8 counties. Qualification Substitution Parameters: ? ? ? ? ? ? ? ? ? ? ? COUNTY_H.01 = Williamson County COUNTY_G.01 = Williamson County COUNTY_F.01 = Williamson County COUNTY_E.01 = Williamson County COUNTY_D.01 = Williamson County COUNTY_C.01 = Williamson County COUNTY_B.01 = Williamson County COUNTY_A.01 = Williamson County YEAR.01 = 1999 BPTWO.01 = unknown BPONE.01 = >10000 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 128 of 156B.35 query35.tpl For each of the customers living in the same state, having the same gender and marital status who have purchased from stores and from either the catalog or the web during a given year, display the following: ? state, gender, marital status, count of customers ? min, max, avg, count distinct of the customer’s dependent count ? min, max, avg, count distinct of the customer’s employed dependent count ? min, max, avg, count distinct of the customer’s dependents in college count Display / calculate the “count of customers” multiple times to emulate a potential reporting tool scenario. 
Qualification Substitution Parameters: YEAR.01 = 2002 AGGONE = min AGGTWO = max AGGTHREE = avg B.36 query36.tpl Compute store sales gross profit margin ranking for items in a given year for a given list of states.\ Qualification Substitution Parameters: ? ? ? ? ? ? ? ? ? STATE_H.01 = TN STATE_G.01 = TN STATE_F.01 = TN STATE_E.01 = TN STATE_D.01 = TN STATE_C.01 = TN STATE_B.01 = TN STATE_A.01 = TN YEAR.01 = 2001 B.37 query37.tpl List all items and current prices sold through the catalog channel from certain manufacturers in a given $30 price range and consistently had a quantity between 100 and 500 on hand in a 60-day period. Qualification Substitution Parameters: ? ? ? ? ? ? PRICE.01 = 68 MANUFACT_ID.01 = 677 MANUFACT_ID.02 = 940 MANUFACT_ID.03 = 694 MANUFACT_ID.04 = 808 INVDATE.01 = 2000-02-01 B.38 query38.tpl Display count of customers with purchases from all 3 channels in a given year. Qualification Substitution Parameters: ? DMS.01 = 1200 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 129 of 156B.39 query39.tpl This query contains multiple, related iterations: Iteration 1: Calculate the coefficient of variation and mean of every item and warehouse of two consecutive months Iteration 2: Find items that had a coefficient of variation in the first months of 1.5 or large Qualification Substitution Parameters: ? YEAR.01 = 2001 ? MONTH.01 = 1 B.40 query40.tpl Compute the impact of an item price change on the sales by computing the total sales for items in a 30 day period before and after the price change. Group the items by location of warehouse where they were delivered from. Qualification Substitution Parameters ? ? SALES_DATE.01 = 2000-03-11 YEAR.01 = 2000 B.41 query41.tpl How many items do we carry with specific combinations of color, units, size and category. Qualification Substitution Parameters ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 
MANUFACT.01 = 738 SIZE.01 = medium SIZE.02 = extra large SIZE.03 = N/A SIZE.04 = small SIZE.05 = petite SIZE.06 = large UNIT.01 = Ounce UNIT.02 = Oz UNIT.03 = Bunch UNIT.04 = Ton UNIT.05 = N/A UNIT.06 = Dozen UNIT.07 = Box UNIT.08 = Pound UNIT.09 = Pallet UNIT.10 = Gross UNIT.11 = Cup UNIT.12 = Dram UNIT.13 = Each UNIT.14 = Tbl UNIT.15 = Lb UNIT.16 = Bundle COLOR.01 = powder COLOR.02 = khaki TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 130 of 156? ? ? ? ? ? ? ? ? ? ? ? ? ? COLOR.03 = brown COLOR.04 = honeydew COLOR.05 = floral COLOR.06 = deep COLOR.07 = light COLOR.08 = cornflower COLOR.09 = midnight COLOR.10 = snow COLOR.11 = cyan COLOR.12 = papaya COLOR.13 = orange COLOR.14 = frosted COLOR.15 = forest COLOR.16 = ghost B.42 query42.tpl For each item and a specific year and month calculate the sum of the extended sales price of store transactions. Qualification Substitution Parameters: ? ? MONTH.01 = 11 YEAR.01 = 2000 B.43 query43.tpl Report the sum of all sales from Sunday to Saturday for stores in a given data range by stores. Qualification Substitution Parameters: ? ? YEAR.01 = 2000 GMT.01 = -5 B.44 query44.tpl List the best and worst performing products measured by net profit. Qualification Substitution Parameters: ? ? NULLCOLSS.01 = ss_addr_sk STORE.01 = 4 B.45 query45.tpl Report the total web sales for customers in specific zip codes, cities, counties or states, or specific items for a given year and quarter. . Qualification Substitution Parameters: ? ? ? QOY.01 = 2 YEAR.01 = 2001 GBOBC = ca_city TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 131 of 156B.46 query46.tpl Compute the per-customer coupon amount and net profit of all "out of town" customers buying from stores located in 5 cities on weekends in three consecutive years. The customers need to fit the profile of having a specific dependent count and vehicle count. 
For all these customers print the city they lived in at the time of purchase, the city in which the store is located, the coupon amount and net profit Qualification Substitution Parameters: ? ? ? ? ? ? ? ? CITY_E.01 = Fairview CITY_D.01 = Fairview CITY_C.01 = Fairview CITY_B.01 = Midway CITY_A.01 = Fairview VEHCNT.01 = 3 YEAR.01 = 1999 DEPCNT.01 = 4 B.47 query47.tpl Find the item brands and categories for each store and company, the monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and store. Report deviation of sales from the previous and the following monthly sales. Qualification Substitution Parameters ? ? ? YEAR.01 = 1999 SELECTONE = v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name SELECTTWO = ,v1.d_year, v1.d_moy B.48 query48.tpl Calculate the total sales by different types of customers (e.g., based on marital status, education status), sales price and different combinations of state and sales profit. Qualification Substitution Parameters: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? MS.1=M MS.2=D MS.3=S ES.1=4 yr Degree ES.2=2 yr Degree ES.3=College STATE.1=CO STATE.2=OH STATE.3=TX STATE.4=OR STATE.5=MN STATE.6=KY STATE.7=VA STATE.8=CA STATE.9=MS TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 132 of 156B.49 Query49.tpl Report the worst return ratios (sales to returns) of all items for each channel by quantity and currency sorted by ratio. Quantity ratio is defined as total number of sales to total number of returns. Currency ratio is defined as sum of return amount to sum of net paid. Qualification Substitution Parameters: ? ? MONTH.01 = 12 YEAR.01 = 2001 B.50 query50.tpl For each store count the number of items in a specified month that were returned after 30, 60, 90, 120 and more than 120 days from the day of purchase. Qualification Substitution Parameters: ? ? 
MONTH.01 = 8 YEAR.01 = 2001 B.51 query51.tpl Compute the count of store sales resulting from promotions, the count of all store sales and their ratio for specific categories in a particular time zone and for a given year and month. Qualification Substitution Parameters: ? DMS.01 = 1200 B.52 query52.tpl Report the total of extended sales price for all items of a specific brand in a specific year and month. Qualification Substitution Parameters ? ? MONTH.01=11 YEAR.01=2000 B.53 query53.tpl Find the ID, quarterly sales and yearly sales of those manufacturers who produce items with specific characteristics and whose average monthly sales are larger than 10% of their monthly sales. Qualification Substitution Parameters: ? DMS.01 = 1200 B.54 query54.tpl Find all customers who purchased items of a given category and class on the web or through catalog in a given month and year that was followed by an in-store purchase at a store near their residence in the three consecutive months. Calculate a histogram of the revenue by these customers in $50 segments showing the number of customers in each of these revenue generated segments. Qualification Substitution Parameters: ? ? ? CLASS.01 = maternity CATEGORY.01 = Women MONTH.01 = 12 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 133 of 156? YEAR.01 = 1998 B.55 query55.tpl For a given year, month and store manager calculate the total store sales of any combination all brands. Qualification Substitution Parameters: ? ? ? MANAGER.01 = 28 MONTH.01 = 11 YEAR.01 = 1999 B.56 query56.tpl Compute the monthly sales amount for a specific month in a specific year, for items with three specific colors across all sales channels. Only consider sales of customers residing in a specific time zone. Group sales by item and sort output by sales amount. Qualification Substitution Parameters: ? ? ? ? ? ? 
COLOR.01 = slate COLOR.02 = blanched COLOR.03 = burnished GMT.01 = -5 MONTH.01 = 2 YEAR.01 = 2001 B.57 query57.tpl Find the item brands and categories for each call center and their monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and call center. Report the sales deviation from the previous and following month. Qualification Substitution Parameters: ? ? ? YEAR.01 = 1999 SELECTONE = v1.i_category, v1.i_brand, v1.cc_name SELECTTWO = ,v1.d_year, v1.d_moy B.58 query58.tpl Retrieve the items generating the highest revenue and which had a revenue that was approximately equivalent across all of store, catalog and web within the week ending a given date. Qualification Substitution Parameters: ? SALES_DATE.01 = 2000-01-03 B.59 query59.tpl Report the increase of weekly store sales from one year to the next year for each store and day of the week. Qualification Substitution Parameters: ? DMS.01 = 1212 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 134 of 156B.60 query60.tpl What is the monthly sales amount for a specific month in a specific year, for items in a specific category, purchased by customers residing in a specific time zone. Group sales by item and sort output by sales amount. Qualification Substitution Parameters: ? ? ? ? CATEGORY.01 = Music GMT.01 = -5 MONTH.01 = 9 YEAR=1998 B.61 query61.tpl Find the ratio of items sold with and without promotions in a given month and year. Only items in certain categories sold to customers living in a specific time zone are considered. Qualification Substitution Parameters: ? ? ? ? GMT.01 = -5 CATEGORY.01 = Jewelry MONTH.01 = 11 YEAR.01 = 1998 B.62 query62.tpl For web sales, create a report showing the counts of orders shipped within 30 days, from 31 to 60 days, from 61 to 90 days, from 91 to 120 days and over 120 days within a given year, grouped by warehouse, shipping mode and web site. 
Qualification Substitution Parameters: ? DMS.01 = 1200 B.63 query63.tpl For a given year calculate the monthly sales of items of specific categories, classes and brands that were sold in stores and group the results by store manager. Additionally, for every month and manager print the yearly average sales of those items. Qualification Substitution Parameters: ? DMS.01 = 1200 B.64 query64.tpl Find those stores that sold more cross-sales items from one year to another. Cross-sale items are items that are sold over the Internet, by catalog and in store. Qualification Substitution Parameters: ? ? ? ? ? ? ? YEAR.01 = 1999 PRICE.01 = 64 COLOR.01 = purple COLOR.02 = burlywood COLOR.03 = indian COLOR.04 = spring COLOR.05 = floral TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 135 of 156? COLOR.06 = medium B.65 query65.tpl In a given period, for each store, report the list of items with revenue less than 10% the average revenue for all the items in that store. Qualification Substitution Parameters: ? DMS.01 = 1176 B.66 query66.tpl Compute web and catalog sales and profits by warehouse. Report results by month for a given year during a given 8-hour period. Qualification Substitution Parameters ? ? ? ? ? ? ? ? SALESTWO.01 = cs_sales_price SALESONE.01 = ws_ext_sales_price NETTWO.01 = cs_net_paid_inc_tax NETONE.01 = ws_net_paid SMC.01 = DHL SMC.02 = BARIAN TIMEONE.01 = 30838 YEAR.01 = 2001 B.67 query67.tpl Find top stores for each category based on store sales in a specific year. Qualification Substitution Parameters: ? DMS.01 = 1200 B.68 query68.tpl Compute the per customer extended sales price, extended list price and extended tax for "out of town" shoppers buying from stores located in two cities in the first two days of each month of three consecutive years. Only consider customers with specific dependent and vehicle counts. Qualification Substitution Parameters: ? ? ? ? ? 
CITY_B.01 = Midway CITY_A.01 = Fairview VEHCNT.01 = 3 YEAR.01 = 1999 DEPCNT.01 = 4 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 136 of 156B.69 query69.tpl Count the customers with the same gender, marital status, education status, education status, purchase estimate and credit rating who live in certain states and who have purchased from stores but neither form the catalog nor from the web during a two month time period of a given year. Qualification Substitution Parameters: ? ? ? ? ? STATE.01 = KY STATE.02 = GA STATE.03 = NM YEAR.01 = 2001 MONTH.01 = 4 B.70 query70.tpl Compute store sales net profit ranking by state and county for a given year and determine the five most profitable states. Qualification Substitution Parameters: ? DMS.01 = 1200 B.71 query71.tpl Select the top revenue generating products, sold during breakfast or dinner time for one month managed by a given manager across all three sales channels. Qualification Substitution Parameters: ? ? ? MANAGER.01 = 1 MONTH.01 = 11 YEAR.01 = 1999 B.72 query72.tpl For each item, warehouse and week combination count the number of sales with and without promotion. Qualification Substitution Parameters: ? ? ? BP.01 = >10000 MS.01 = D YEAR.01 = 1999 B.73 query73.tpl Count the number of customers with specific buy potentials and whose dependent count to vehicle count ratio is larger than 1 and who in three consecutive years bought in stores located in 4 counties between 1 and 5 items in one purchase. Only purchases in the first 2 days of the months are considered. Qualification Substitution Parameters: ? ? ? ? ? ? 
COUNTY_D.01 = Orange County COUNTY_C.01 = Bronx County COUNTY_B.01 = Franklin Parish COUNTY_A.01 = Williamson CountyYEAR.01 = 1999 BPTWO.01 = unknown BPONE.01 = >10000 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 137 of 156B.74 query74.tpl Display customers with both store and web sales in consecutive years for whom the increase in web sales exceeds the increase in store sales for a specified year. Qualification Substitution Parameters: ? ? ? YEAR.01 = 2001 AGGONE.01 = sum ORDERC.01 = 1 1 1 B.75 query75.tpl For two consecutive years track the sales of items by brand, class and category. Qualification Substitution Parameters: ? ? CATEGORY.01 = Books YEAR.02 = 2002 B.76 query76.tpl Computes the average quantity, list price, discount, sales price for promotional items sold through the web channel where the promotion is not offered by mail or in an event for given gender, marital status and educational status. Qualification Substitution Parameters: ? ? ? NULLCOLCS01 = cs_ship_addr_sk NULLCOLWS.01 = ws_ship_customer_sk NULLCOLSS.01 = ss_store_sk B.77 query77.tpl Report the total sales, returns and profit for all three sales channels for a given 30 day period. Roll up the results by channel and a unique channel location identifier. Qualification Substitution Parameters: ? SALES_DATE.01 = 2000-08-23 B.78 query78.tpl Report the top customer / item combinations having the highest ratio of store channel sales to all other channel sales (minimum 2 to 1 ratio), for combinations with at least one store sale and one other channel sale. Order the output by highest ratio. Qualification Substitution Parameters: ? YEAR.01 = 2000 B.79 query79.tpl Compute the per customer coupon amount and net profit of Monday shoppers. Only purchases of three consecutive years made on Mondays in large stores by customers with a certain dependent count and with a large vehicle count are considered. Qualification Substitution Parameters: ? 
VEHCNT.01 = 2 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 138 of 156? ? YEAR.01 = 1999 DEPCNT.01 = 6 B.80 query80.tpl Report extended sales, extended net profit and returns in the store, catalog, and web channels for a 30 day window for items with prices larger than $50 not promoted on television, rollup results by sales channel and channel specific sales means (store for store sales, catalog page for catalog sales and web site for web sales) Qualification Substitution Parameters: ? SALES_DATE.01 = 2000-08-23 B.81 query81.tpl Find customers and their detailed customer data who have returned items bought from the catalog more than 20 percent the average customer returns for customers in a given state in a given time period. Order output by customer data. Qualification Substitution Parameters: ? ? YEAR.01 = 2000 STATE.01 = GA B.82 query82.tpl ? Find customers who tend to spend more money (net-paid) on-line than in stores. Qualification Substitution Parameters ? ? ? ? ? ? MANUFACT_ID.01 = 129 MANUFACT_ID.02 = 270 MANUFACT_ID.03 = 821 MANUFACT_ID.04 = 423 INVDATE.01 = 2000-05-25 PRICE.01 = 62 B.83 query83.tpl Retrieve the items with the highest number of returns where the number of returns was approximately equivalent across all store, catalog and web channels (within a tolerance of +/- 10%), within the week ending a given date. Qualification Substitution Parameters ? ? ? RETURNED_DATE_THREE.01 = 2000-11-17 RETURNED_DATE_TWO.01 = 2000-09-27 RETURNED_DATE_ONE.01 = 2000-06-30 B.84 query84.tpl List all customers living in a specified city, with an income between 2 values. Qualification Substitution Parameters ? ? 
INCOME.01 = 38128 CITY.01 = Edgewood TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 139 of 156B.85 query85.tpl For all web return reason calculate the average sales, average refunded cash and average return fee by different combinations of customer and sales types (e.g., based on marital status, education status, state and sales profit). Qualification Substitution Parameters: ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? YEAR.01 = 2000 STATE.01 = IN STATE.02 = OH STATE.03 = NJ STATE.04 = WI STATE.05 = CT STATE.06 = KY STATE.07 = LA STATE.08 = IA STATE.09 = AR ES.01 = Advanced Degree ES.02 = College ES.03 = 2 yr Degree MS.01 = M MS.02 = S MS.03 = W B.86 query86.tpl Rollup the web sales for a given year by category and class, and rank the sales among peers within the parent, for each group compute sum of sales, location with the hierarchy and rank within the group. Qualification Substitution Parameters: ? DMS.01 = 1200 B.87 query87.tpl Count how many customers have ordered on the same day items on the web and the catalog and on the same day have bought items in a store. Qualification Substitution Parameters: ? DMS.01 = 1200 B.88 query88.tpl How many items do we sell between pacific times of a day in certain stores to customers with one dependent count and 2 or less vehicles registered or 2 dependents with 4 or fewer vehicles registered or 3 dependents and five or less vehicles registered. In one row break the counts into sells from 8:30 to 9, 9 to 9:30, 9:30 to 10 ... 12 to 12:30 Qualification Substitution Parameters: ? ? ? ? STORE.01=Unknown HOUR.01=4 HOUR.02=2 HOUR.03=0 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 140 of 156B.89 query89.tpl Within a year list all month and combination of item categories, classes and brands that have had monthly sales larger than 0.1 percent of the total yearly sales. Qualification Substitution Parameters: ? ? ? ? ? ? ? ? ? ? ? ? ? 
CLASS_F.01 = dresses CAT_F.01 = Women CLASS_E.01 = birdal CAT_E.01 = Jewelry CLASS_D.01 = shirts CAT_D.01 = Men CLASS_C.01 = football CAT_C.01 = Sports CLASS_B.01 = stereo CAT_B.01 = Electronics CLASS_A.01 = computers CAT_A.01 = Books YEAR.01 = 1999 B.90 query90.tpl What is the ratio between the number of items sold over the internet in the morning (8 to 9am) to the number of items sold in the evening (7 to 8pm) of customers with a specified number of dependents. Consider only websites with a high amount of content. Qualification Substitution Parameters: ? ? ? HOUR_PM.01 = 19 HOUR_AM.01 = 8 DEPCNT.01 = 6 B.91 query91.tpl Display total returns of catalog sales by call center and manager in a particular month for male customers of unknown education or female customers with advanced degrees with a specified buy potential and from a particular time zone. Qualification Substitution Parameters: ? ? ? ? YEAR.01 = 1998 MONTH.01 = 11 BUY_POTENTIAL.01 = Unknown GMT.01 = -7 B.92 query92.tpl Compute the total discount on web sales of items from a given manufacturer over a particular 90 day period for sales whose discount exceeded 30% over the average discount of items from that manufacturer in that period of time. Qualification Substitution Parameters: ? ? IMID.01 = 350 WSDATE.01 = 2000-01-27 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 141 of 156B.93 query93.tpl For a given merchandise return reason, report on customers’ total cost of purchases minus the cost of returned items. Qualification Substitution Parameters: ? Reason= reason 28 B.94 query94.tpl Produce a count of web sales and total shipping cost and net profit in a given 60 day period to customers in a given state from a named web site for non returned orders shipped from more than one warehouse. Qualification Substitution Parameters: ? ? ? 
YEAR.01 = 1999 MONTH.01 = 2 STATE.01 = IL B.95 query95.tpl Produce a count of web sales and total shipping cost and net profit in a given 60 day period to customers in a given state from a named web site for returned orders shipped from more than one warehouse. Qualification Substitution Parameters: ? ? ? STATE.01=IL MONTH.01=2 YEAR.01=1999 B.96 query96.tpl Compute a count of sales from a named store to customers with a given number of dependents made in a specified half hour period of the day. Qualification Substitution Parameters: ? ? HOUR.01 = 20 DEPCNT.01 = 7 B.97 query97.tpl Generate counts of promotional sales and total sales, and their ratio from the web channel for a particular item category and month to customers in a given time zone. Qualification Substitution Parameters: ? YEAR.01 = 2000 B.98 query98.tpl Report on items sold in a given 30 day period, belonging to the specified category. Qualification Substitution Parameters ? ? YEAR.01 = 1999 SDATE.01 = 1999-02-22 TPC BenchmarkTM DS - Standard Specification, Version 1.3.1 Page 142 of 156? ? ? CATEGORY.01 = Sports CATEGORY.02 = Books CATEGORY.03 = Home B.99 query99.tpl For catalog sales, create a report showing the counts of orders shipped within 30 days, from 31 to 60 days, from 61 to 90 days, from 91 to 120 days and over 120 days within a given year, grouped by warehouse, call center and shipping mode. Qualification Substitution Parameters ? DMS.01 = 1200 TPC BenchmarkTM DS - StanView Code 附上测试结果:
Configuration:
----------------------------------------------
Database: tpcds_parquet_1000
Perf:     ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_perf.csv
Logs:     ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_log
Output:   ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_out
----------------------------------------------
Performance results:
--------------------
no  success  t_shell  t_impala  latest   ln_run  ln_avg  info.
1   yes      22       21.69     30.90    27      85.4    ""
2   yes      168      167.47    144.21   20      309.9   ""
3   yes      11       10.93     386.26   18      113.8   ""
4   yes      1453     1439.48   466.00   11      502.5   ""
5   yes      598      594.16    96.51    12      172.0   ""
6   yes      44       42.99     21.12    14      51.9    ""
7   yes      51       50.55     27.28    13      100.4   ""
8   yes      25       24.64     13.24    14      17.7    ""
9   yes      42       41.91     109.18   14      63.6    ""
10  yes      50       50.37     23.26    12      29.2    ""
11  yes      2578     2565.05   366.42   14      477.8   ""
12  yes      49       49.05     3.72     12      10.0    ""
13  yes      113      112.65    143.29   10      104.7   ""
14  yes      2162     1125.06   1184.38  8       1201.4  ""
15  yes      16       16.14     12.49    9       24.4    ""
16  yes      688      687.43    783.21   8       1032.4  ""
17  yes      116      115.34    114.32   8       134.7   ""
18  yes      82       81.67     42.15    8       134.5   ""
19  yes      42       41.21     12.61    8       37.3    ""
20  yes      12       11.68     5.78     8       12.1    ""
21  yes      14       13.66     45.11    8       76.5    ""
22  yes      175      174.55    208.11   7       202.7   ""
23  yes      2807     1577.84   1599.90  7       1670.1  ""
24  yes      4583     2404.18   2316.71  8       3377.1  ""
25  yes      108      107.57    53.82    8       97.7    ""
26  yes      16       15.15     16.14    8       58.3    ""
27  yes      69       68.52     21.70    8       51.2    ""
28  yes      105      104.73    89.09    10      82.7    ""
29  yes      638      638.10    106.90   7       387.9   ""
30  no       15       0         0        0       0       ""
31  yes      53       52.43     29.72    8       44.8    ""
32  yes      17       16.51     7.11     8       19.0    ""
33  yes      27       25.79     7.11     8       20.7    ""
34  yes      38       38.08     30.71    8       39.9    ""
35  yes      179      177.88    4415.05  7       2393.7  ""
36  yes      53       52.99     28.83    8       61.7    ""
37  yes      566      565.82    176.65   8       282.8   ""
38  yes      534      534.00    232.56   8       359.2   ""
39  yes      55       20.23     86.13    8       159.3   ""
40  yes      95       95.28     33.84    8       54.4    ""
41  yes      2        1.45      3.13     8       2.8     ""
42  yes      51       50.69     5.09     8       12.1    ""
43  yes      262      262.55    27.17    8       54.6    ""
44  yes      37       37.09     44.12    8       36.8    ""
45  yes      16       15.41     10.28    8       21.2    ""
46  yes      35       34.25     19.64    8       35.5    ""
47  yes      4080     4079.43   425.56   9       748.9   ""
48  yes      30       28.93     13.95    8       25.3    ""
49  yes      70       69.17     38.57    8       65.4    ""
50  yes      250      249.56    174.81   8       218.1   ""
51  yes      1320     1319.70   4343.36  7       2383.1  ""
52  yes      52       51.81     9.03     8       20.2    ""
53  yes      299      298.55    14.86    8       43.0    ""
54  yes      89       88.43     116.01   4       109.1   ""
55  yes      41       41.20     5.26     7       13.8    ""
56  yes      26       25.82     10.11    7       33.5    ""
57  yes      537      537.05    119.84   7       257.2   ""
58  yes      16       16.28     5.30     7       18.5    ""
59  yes      192      192.58    198.41   7       190.0   ""
60  yes      28       28.34     86.52    6       59.4    ""
61  yes      42       42.52     13.59    7       39.1    ""
62  yes      30       29.73     82.71    7       97.8    ""
63  yes      245      244.46    17.54    7       52.2    ""
64  yes      814      813.07    540.83   7       646.0   ""
65  yes      217      216.55    178.38   7       171.2   ""
66  yes      29       28.18     17.79    7       51.1    ""
67  yes      4235     4224.15   3636.85  1       3636.8  ""
68  yes      92       91.37     51.49    7       114.8   ""
69  yes      17       16.69     4.02     7       20.0    ""
70  yes      118      116.66    118.80   7       106.3   ""
71  yes      118      117.37    19.08    7       58.0    ""
72  yes      2672     2671.46   1274.81  6       1415.3  ""
73  yes      20       19.20     7.23     7       19.7    ""
74  yes      816      813.32    212.46   6       349.2   ""
75  yes      284      283.13    377.51   7       373.8   ""
76  yes      51       50.62     155.08   7       104.3   ""
77  yes      92       90.96     16.97    7       52.3    ""
78  yes      409      408.80    440.73   7       453.0   ""
79  yes      44       43.78     66.02    7       54.6    ""
80  yes      433      431.42    110.24   7       240.4   ""
81  yes      35       35.02     46.16    7       52.4    ""
82  yes      1120     1119.60   172.77   7       319.6   ""
83  yes      7        6.64      7.63     7       10.6    ""
84  yes      224      223.43    47.99    7       177.0   ""
85  yes      90       89.21     51.84    7       83.6    ""
86  yes      36       35.86     31.84    7       53.3    ""
87  yes      537      536.78    247.84   7       375.0   ""
88  yes      112      111.68    73.37    7       72.8    ""
89  yes      332      331.64    25.53    7       59.2    ""
90  yes      7        6.51      14.37    7       70.1    ""
91  yes      12       11.65     4.09     7       5.2     ""
92  yes      10       9.42      4.67     7       9.7     ""
93  yes      564      562.82    515.55   7       551.7   ""
94  yes      365      364.96    319.71   6       408.5   ""
95  yes      461      460.17    183.60   6       230.5   ""
96  yes      11       10.89     13.70    7       25.6    ""
97  yes      214      213.05    221.09   7       260.1   ""
98  yes      53       53.18     47.57    6       53.6    ""
99  yes      58       57.78     151.64   6       90.9    ""
Performance statistics:
-----------------------
Total queries: 99
Coverage: 90.0%
Time shell: 40928s
Time impala: 36354.80s
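The summary block above (total queries, success count, total time) can be recomputed directly from the query_perf.csv file the harness writes. A minimal sketch, where the CSV column layout (no, success, t_shell) is an assumption for illustration; the real file's columns may differ:

```shell
# Recompute totals from a perf CSV. The three data rows are sample values
# only, used so the command can be demonstrated end to end.
cat > /tmp/query_perf.csv <<'EOF'
no,success,t_shell
1,yes,22
2,yes,168
30,no,15
EOF

awk -F, 'NR > 1 { total += $3; if ($2 == "yes") ok++ }
         END { printf "Total queries: %d\nSuccess: %d\nTime shell: %ds\n", NR - 1, ok, total }' \
    /tmp/query_perf.csv
```

Point the script at the real query_perf.csv (and adjust the column indexes) to cross-check the reported statistics.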
As a side note, here is how to build a runnable distribution of the latest Spark from the Apache GitHub repository (https://github.com/apache/spark).
The current official release of Spark is 1.6.1, while the version on GitHub is 2.0.0, which is claimed to perform considerably better than the former.
见:https://issues.apache.org/jira/browse/SPARK-14070
https://github.com/apache/spark/pull/11891
Ran on a production table in Facebook (note that the data was in DWRF file format which is similar to ORC)
Best case : when there was no matching rows for the predicate in the query (everything is filtered out)
CPU time Wall time Total wall time across all tasks ================================================================ Without the change 541_515 sec 25.0 mins 165.8 hours With change 407 sec 1.5 mins 15 mins
Average case: A subset of rows in the data match the query predicate
                    CPU time      Wall time   Total wall time across all tasks
================================================================
Without the change  624_630 sec   31.0 mins   199.0 h
With change         14_769 sec    5.3 mins    7.7 h
First, clone the source: git clone https://github.com/apache/spark
The steps are largely the same as the official instructions: http://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution
A few things to watch out for:
In 2.0.0, make-distribution.sh has moved into dev/. I copied the 1.6.1 script over instead and paid dearly for it: the 2.0.0 build no longer produces the dependency jars under assembly/. From the commit message on GitHub:
[SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin Closes #11796 from vanzin/SPARK-13579.
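Accordingly, the distribution is built with the script under dev/. A sketch of the invocation, using example profiles from the official build guide (adjust the Hadoop version and profiles to match your cluster; building takes a long time):

```shell
# Build a runnable Spark 2.0 distribution tarball.
# Note: the script now lives under dev/, not in the repository root.
cd spark
./dev/make-distribution.sh --name custom-spark --tgz \
    -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
```

The resulting tarball lands in the repository root and can be unpacked on the cluster like any binary release.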
2.0.0 no longer ships the docker directory, yet pom.xml still references it, so the following two sections need to be commented out of pom.xml:
<module>external/docker-integration-tests</module>

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>docker-client</artifactId>
  <classifier>shaded</classifier>
  <version>3.6.6</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>
While compiling, maven downloads artifacts into .m2, and the default mirror can be slow enough that the build fails.
Workarounds:
1. Configure a proxy in $MAVEN_HOME/conf/settings.xml (untested).
2. Configure a mirror in $MAVEN_HOME/conf/settings.xml. A few mirrors that are somewhat faster:
<mirror>
  <id>ui</id>
  <mirrorOf>central</mirrorOf>
  <name>Human Readable Name for this Mirror.</name>
  <url>http://uk.maven.org/maven2/</url>
</mirror>
<mirror>
  <id>jboss-public-repository-group</id>
  <mirrorOf>central</mirrorOf>
  <name>JBoss Public Repository Group</name>
  <url>http://repository.jboss.org/nexus/content/groups/public</url>
</mirror>
<mirror>
  <id>JBossJBPM</id>
  <mirrorOf>central</mirrorOf>
  <name>JBossJBPM Repository</name>
  <url>https://repository.jboss.org/nexus/content/repositories/releases/</url>
</mirror>
With the freshly built spark-sql, show databases; and show tables; both work, but any select fails with the following error log:
spark-sql> select count(*) from date_dim; 16/05/13 14:33:17 ERROR SparkSQLDriver: Failed in [select count(*) from date_dim] java.io.FileNotFoundException: Path is not a file: /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:70) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1222) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210) at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1260) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:220) at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:216) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:216) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:208) at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1$$anonfun$apply$2.apply(ListingFileCatalog.scala:103) at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1$$anonfun$apply$2.apply(ListingFileCatalog.scala:91) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1.apply(ListingFileCatalog.scala:91) at 
org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1.apply(ListingFileCatalog.scala:79) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.ListingFileCatalog.listLeafFiles(ListingFileCatalog.scala:79) at org.apache.spark.sql.execution.datasources.ListingFileCatalog.refresh(ListingFileCatalog.scala:68) at org.apache.spark.sql.execution.datasources.ListingFileCatalog.(ListingFileCatalog.scala:50) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:314) at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$14.apply(HiveMetastoreCatalog.scala:320) at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$14.apply(HiveMetastoreCatalog.scala:311) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:311) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$ParquetConversions$$convertToParquetRelation(HiveMetastoreCatalog.scala:354) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:377) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:362) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at scala.collection.AbstractIterator.to(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:284) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:362) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:335) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:62) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:554) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:671) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:325) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:240) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:727) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122) at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:70) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080) at org.apache.hadoop.ipc.Client.call(Client.java:1468) at 
org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1220) ... 85 more
It looks like Spark is trying to read the file /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging, but no such path exists on the node.
If the file really were missing under the warehouse, then Hive SQL should not be able to select from tpcds_parquet_1000.db either, yet Hive works fine. Refreshing the tables in both Hive and Spark SQL did not help.
If anyone knows how to solve this, please leave me a reply. Thanks!
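One possible workaround, untested here and offered only as a guess: _impala_insert_staging is a scratch directory that Impala leaves behind after an INSERT, and Spark 2.0's file listing appears to choke on it as if it were a data file. Deleting those leftover directories from every table directory might sidestep the error; the warehouse path below is the one from the error log:

```shell
# Hypothetical cleanup: remove Impala's leftover staging directories
# from every table in the database (the glob is expanded by HDFS itself).
hdfs dfs -rm -r -f "/user/hive/warehouse/tpcds_parquet_1000.db/*/_impala_insert_staging"
```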
As a fallback, I used Spark itself to regenerate the parquet tables from the text tables. First, add the environment variables:
export SPARK_HOME=/usr/lib/spark
export PATH=$SPARK_HOME/bin:$PATH
Change the setting in tpcds-env.sh:
export EXEC_ENGINE=spark-sql
Add a new branch to create-none-text-table.sh:
elif [ 'X'$EXEC_ENGINE == "Xspark-sql" ]; then
    echo "Creating database..."
    spark-sql -e "create database if not exists spark_$FORMAT_DB" > /dev/null
    for t in $LIST; do
        echo "Creating table $t..."
        spark-sql --database spark_$FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null
        [ 'X'$DELETE_MODE == "Xtrue" ] && spark-sql --database $TEXT_DB -e "drop table if exists $t;" > /dev/null
    done
    [ 'X'$DELETE_MODE == "Xtrue" ] && spark-sql -e "drop database if exists $TEXT_DB;" > /dev/null
http://www.cnblogs.com/xiaoyesoso/p/5522671.html
Performance analysis from my team lead:
Spark 2.0 performance: whole-stage code generation clearly speeds up simple map-join + group-by queries. Handling multiple references to a WITH ... AS subquery by sharing one RDD is more efficient than Inceptor's approach of creating a temporary table.
In fact, at 1 TB the data volume is so large that Spark 2.0's performance is quite unstable, so we have switched to a 2 GB dataset for these tests.
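For reference, both flags can be flipped per run on the spark-sql command line. A sketch, where the database name and query file are illustrative placeholders rather than names taken from the test harness:

```shell
# One timed run with both features enabled; repeat with both set to false.
spark-sql --database tpcds_parquet_2 \
    --conf spark.sql.parquet.enableVectorizedReader=true \
    --conf spark.sql.codegen.wholeStage=true \
    -f q1.sql
```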
Each query was tested twice, with spark.sql.parquet.enableVectorizedReader and spark.sql.codegen.wholeStage both set to true and then both set to false:
query file | parquet(true), sec | parquet(false), sec | diff (false/true) |
q1 | 3.681 | 5.459 | 1.4830209182 |
q2 | 6.105 | 6.395 | 1.0475020475 |
q3 | 1.804 | 2.598 | 1.4401330377 |
q4 | 17.394 | 20.32 | 1.1682189261 |
q5 | 9.129 | 10.234 | 1.1210428305 |
q6 | 4.395 | 4.522 | 1.0288964733 |
q7 | 2.466 | 4.614 | 1.8710462287 |
q8 | 3.407 | 4.12 | 1.209275022 |
q9 | 2.664 | 3.693 | 1.3862612613 |
q10 | 7.416 | 10.071 | 1.3580097087 |
q11 | 16.779 | 17.998 | 1.0726503367 |
q12 | 1.743 | 2.181 | 1.2512908778 |
q13 | 5.694 | 9.54 | 1.6754478398 |
q14 | 18.547 | 24.803 | 1.3373052246 |
q15 | 1.97 | 2.648 | 1.3441624365 |
q16 | 8.041 | 11.884 | 1.4779256311 |
q17 | 3.373 | 4.361 | 1.2929143196 |
q18 | 4.992 | 7.964 | 1.5953525641 |
q19 | 2.071 | 3.069 | 1.4818928054 |
q20 | 2.163 | 2.47 | 1.1419325012 |
q21 | 1.837 | 7.818 | 4.2558519325 |
q22 | 8.291 | 24.881 | 3.0009649017 |
q23 | 15 | 18.27 | 1.218 |
q24 | 4.712 | 7.517 | 1.5952886248 |
q25 | 2.651 | 4.469 | 1.6857789513 |
q26 | 1.831 | 3.723 | 2.0333151283 |
q27 | 1.965 | 5.177 | 2.634605598 |
q28 | 0.437 | 1.247 | 2.8535469108 |
q29 | 3.017 | 3.325 | 1.1020881671 |
q30 | 3.6 | 3.626 | 1.0072222222 |
q31 | 4.734 | 5.504 | 1.1626531474 |
q32 | 1.139 | 1.748 | 1.5346795435 |
q33 | 3.352 | 3.914 | 1.1676610979 |
q34 | 2.202 | 3.62 | 1.6439600363 |
q35 | 7.863 | 8.843 | 1.1246343635 |
q36 | 2.516 | 4.882 | 1.940381558 |
q37 | 2.499 | 6.099 | 2.4405762305 |
q38 | 6.946 | 6.961 | 1.0021595163 |
q39 | 4.443 | 10.752 | 2.4199864956 |
q40 | 2.257 | 3.066 | 1.3584404076 |
q41 | 0.705 | 0.751 | 1.065248227 |
q42 | 0.896 | 2.032 | 2.2678571429 |
q43 | 1.509 | 3.063 | 2.0298210736 |
q44 | 2.304 | 2.725 | 1.1827256944 |
q45 | 2.295 | 3.551 | 1.5472766885 |
q46 | 2.409 | 5.601 | 2.3250311333 |
q47 | 12.89 | 19.439 | 1.50806827 |
q48 | 2.194 | 3.884 | 1.7702825889 |
q49 | 2.813 | 3.917 | 1.392463562 |
q50 | 2.959 | 5.148 | 1.7397769517 |
q51 | 12.645 | 11.957 | 0.9455911427 |
q52 | 0.904 | 1.565 | 1.7311946903 |
q53 | 1.262 | 2.448 | 1.93977813 |
q54 | 3.423 | 1.2 | 0.3505696757 |
q55 | 1.009 | 1.812 | 1.7958374628 |
q56 | 2.681 | 3.66 | 1.3651622529 |
q57 | 10.104 | 15.117 | 1.4961401425 |
q58 | 2.873 | 3.187 | 1.1092934215 |
q59 | 3.532 | 6.851 | 1.9396942242 |
q60 | 5.727 | 6.019 | 1.0509865549 |
q61 | 2.515 | 5.371 | 2.1355864811 |
q62 | 1.419 | 3.157 | 2.2248062016 |
q63 | 1.287 | 2.541 | 1.9743589744 |
q64 | 13.089 | 16.151 | 1.2339368936 |
q65 | 2.848 | 3.774 | 1.3251404494 |
q66 | 4.581 | 5.338 | 1.1652477625 |
q67 | 6.499 | 9.739 | 1.4985382367 |
q68 | 2.488 | 4.854 | 1.9509646302 |
q69 | 7.432 | 9.675 | 1.301803014 |
q70 | 4.726 | 6.919 | 1.464028777 |
q71 | 1.992 | 2.946 | 1.4789156627 |
q72 | 7.706 | 30.718 | 3.9862444848 |
q73 | 1.76 | 3.235 | 1.8380681818 |
q74 | 16.598 | 14.268 | 0.8596216412 |
q75 | 15.759 | 17.251 | 1.0946760581 |
q76 | 1.536 | 2.119 | 1.3795572917 |
q77 | 3.243 | 4.307 | 1.3280912735 |
q78 | 12.951 | 14.2 | 1.0964404293 |
q79 | 2.305 | 4.379 | 1.8997830803 |
q80 | 5.504 | 7.527 | 1.3675508721 |
q81 | 4.069 | 3.842 | 0.9442123372 |
q82 | 1.708 | 7.241 | 4.2394613583 |
q83 | 2.398 | 2.764 | 1.1526271893 |
q84 | 3.261 | 4.353 | 1.3348666053 |
q85 | 7.732 | 10.609 | 1.3720900155 |
q86 | 1.464 | 2.347 | 1.6031420765 |
q87 | 5.561 | 5.575 | 1.0025175328 |
q88 | 2.971 | 6.429 | 2.1639178728 |
q89 | 2.104 | 3.087 | 1.4672053232 |
q90 | 0.946 | 2.907 | 3.0729386892 |
q91 | 1.684 | 3.048 | 1.809976247 |
q92 | 1.419 | 2.044 | 1.4404510218 |
q93 | 2.647 | 4.198 | 1.5859463544 |
q94 | 7.805 | 12.464 | 1.596925048 |
q95 | 7.681 | 12.642 | 1.6458794428 |
q96 | 0.978 | 1.573 | 1.6083844581 |
q97 | 2.669 | 3.126 | 1.171225178 |
q98 | 1.899 | 2.606 | 1.3723012112 |
q99 | 1.432 | 4.597 | 3.2101955307 |
all | 456.926 | 662.234 | 1.4493 |