TPC-DS Performance Testing of Impala & Spark 2.0.0 SQL

1. tpcds_test_gentable

 

This Unix shell project has two parts.
Part 1: building tpcds-gen-<version>.jar; version 1.1 is currently the latest.
Part 2: generating the TPC-DS flat data and creating the TPC-DS tables.

Part 1
YOU DO NOT NEED TO RUN PART 1 IF tpcds-gen-<version>.jar IS ALREADY IN generator/target.

Preconditions:
a). gcc, mvn, java, unzip installed and accessible.
b). Internet access.

To build the TPC-DS generator:
1). Change to the generator directory and run build.sh:
$ cd generator; ./build.sh
You will get tpcds-gen-<version>.jar in the target sub-directory. Leave it there; it will be used in Part 2.

Part 2
Preconditions:
a). hadoop and hdfs installed and accessible.
b). At least one of transwarp, impala-shell, and hive installed and accessible.

To generate the TPC-DS data and create TPC-DS tables in a given storage format (flat text, ORC, or Parquet):
1). Change to the bin directory and check the variables in tpcds-env.sh:
$ cd bin; cat tpcds-env.sh
The variables you should take care of:
a) TPCDS_SCALE -> Mandatory, TPC-DS scale in GB.
b) EXEC_ENGINE -> Mandatory, query engine; currently only transwarp, impala-shell, and hive are supported.
c) TRANS_HOST -> Optional, IP or hostname of the Inceptor server when the query engine is transwarp.
d) TEXT_DB -> Optional, the flat database's name, tpcds_text_"$TPCDS_SCALE" by default.
e) TBL_FORMAT -> Mandatory, storage format of the tables you need.
f) FORMAT_DB -> Optional, the non-flat database's name, tpcds_$TBL_FORMAT_"$TPCDS_SCALE" by default.
g) DELETE_MODE -> Mandatory, whether to delete the flat tables and the related files in HDFS; true deletes them, false keeps them.
h) LOCATION_HDFS -> Mandatory, HDFS directory for the flat files.
Check and confirm these, and modify them if needed.

tpcds-env.sh

# PROJ_HOME
PROJ_BIN=$(dirname "${BASH_SOURCE-$0}")
PROJ_HOME=$(cd "$PROJ_BIN"/..; pwd)

# TPC-DS scale in GB
export TPCDS_SCALE=${TPCDS_SCALE:-1000}              # 1 TB of data
# Query engine; currently only transwarp, impala-shell, and hive are supported.
export EXEC_ENGINE=${EXEC_ENGINE:-impala-shell}      # generate the data with impala-shell
# Inceptor server when the query engine is transwarp
export TRANS_HOST=${TRANS_HOST:-localhost}

export TEXT_DB=tpcds_text_"$TPCDS_SCALE"
# Table format we need; currently only orc, flat, and parquet are supported.
export TBL_FORMAT=parquet                            # table format: parquet
export FORMAT_DB=tpcds_"$TBL_FORMAT"_"$TPCDS_SCALE"
# Whether to delete the text files in HDFS.
export DELETE_MODE=${DELETE_MODE:-true}

case $EXEC_ENGINE in 
  transwarp)
    which transwarp > /dev/null 2>&1
    if [ $? -ne 0 ]; then
      echo "transwarp not found, data generation will exit soon" && exit 1
    elif [ 'X'$TRANS_HOST == 'X' ]; then
      echo "Inceptor server not set while query engine is transwarp!" && exit 1
    fi
    # directory for flat files in HDFS
    export LOCATION_HDFS=/user/transwarp/tpcds
  ;;
  impala-shell)
    which impala-shell > /dev/null 2>&1
    [ $? -ne 0 ] && echo "impala-shell not found, data generation will exit soon."
    # directory for flat files in HDFS
    export LOCATION_HDFS=/user/impala/tpcds
  ;;
  hive)
    which hive > /dev/null 2>&1
    [ $? -ne 0 ] && echo "hive not found, data generation will exit soon."
    # directory for flat files in HDFS
    export LOCATION_HDFS=/user/hive/tpcds
  ;;
  spark-sql)
    which spark-sql > /dev/null 2>&1
    [ $? -ne 0 ] && echo "spark-sql not found, data generation will exit soon."
    # directory for flat files in HDFS
    export LOCATION_HDFS=/user/spark/tpcds
  ;;
  *)
    echo "Invalid engine; only transwarp, impala-shell, hive, and spark-sql are currently supported."
    exit 1
  ;;
esac
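
Most of these defaults (the ones written in the ${VAR:-default} form) can be overridden from the calling shell, since tpcds-env.sh only assigns them when they are unset. A minimal sketch, assuming a small 100 GB trial run through hive (values are illustrative):

# illustrative override: 100 GB scale, hive engine, keep the flat files
export TPCDS_SCALE=100
export EXEC_ENGINE=hive
export DELETE_MODE=false
cd bin && ./gen-data.sh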

2) Generate raw data
$ ./gen-data.sh
A few options override the variables in tpcds-env.sh; it is strongly recommended that you do not use them.
-s | --scale, scale (in GB).
-l | --location, HDFS directory for the flat files.
-h | --help, Show this help message.
This script generates the flat data and puts it into HDFS; an example invocation follows.
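
If you do override on the command line, a typical invocation looks like this (the location is illustrative):

./gen-data.sh -s 1000 -l /user/impala/tpcds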

gen-data.sh

#!/bin/bash
source ./tpcds-env.sh

echo "********************************************************************"
echo "*****          Generate data by run mapreduce routine          *****"
echo "****           hadoop jar tpcds-gen.jar -d XXX -s XXX           ****"
echo "********************************************************************"

function usage {
  echo "Usage: $0 
  -s | --scale, scale(in GB)
  -l | --location, HDFS directory for flat files.
  -h | --help, Show this help message."
}

while [ $# -gt 0 ]; do
  case "$1" in
    -s | --scale)
      shift
      TPCDS_SCALE=$1
      shift
      ;;
    -l | --location)
      shift
      LOCATION_HDFS=$1
      shift
      ;;
    -h | --help)
      HELP=true
      shift
      ;;
    *)
      echo "Invalid args: $1"
      exit 1
      ;;
  esac
done

[ "$HELP" == "true" ] && usage && exit 1

if [ ! -f $PROJ_HOME/generator/target/tpcds-gen-1.1.jar ]; then
  echo "tpcds-gen-1.1.jar not found, Build the data generator with\
 build.sh first or make sure tpcds-env.sh is modified correctly."
  exit 1
fi

which hadoop > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "Script must be run where hadoop is installed"
  exit 1
fi

which hdfs > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "Script must be run where hdfs is installed"
  exit 1
fi

# Ensure arguments exist.
if [ X"$TPCDS_SCALE" = "X" ]; then
  usage && exit 1
fi
if [ X"$LOCATION_HDFS" = "X" ]; then
  usage && exit 1
fi

# Sanity checking.
if [ $TPCDS_SCALE -lt 1 ]; then
  echo "Scale factor cannot be less than 1"
  exit 1
fi

if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
  read -p "You are generating ${TPCDS_SCALE}g tpcds data and then store it at HDFS directory ${LOCATION_HDFS}, disk usage of HDFS will be ${TPCDS_SCALE}g, is that OK [Yes|No]? " CONFIRM 
  [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and your HDFS storage and run again." && exit 1
fi

hdfs dfs -mkdir -p ${LOCATION_HDFS}
# TODO: How to test the directory is writable for current user gracefully?
hdfs dfs -put $PROJ_HOME/bin/tpcds-env.sh ${LOCATION_HDFS} > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "${LOCATION_HDFS} is not writable for current user." && exit 1
else 
  hdfs dfs -rm ${LOCATION_HDFS}/tpcds-env.sh > /dev/null 2>&1
fi
hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE} > /dev/null 2>&1 || (cd $PROJ_HOME/generator; hadoop jar target/tpcds-gen-*.jar -d ${LOCATION_HDFS}/${TPCDS_SCALE}/ -s ${TPCDS_SCALE})
hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE}

When running the scripts, you may find that the target HDFS directory is not writable for the current user. A colleague's suggested fix was:

    hdfs dfs -chmod 777 <target directory>

But for some reason that did not work for me.

Searching online turned up http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs

Method 1 (did not work for me):

"I solved this problem temporarily by disabling the dfs permissions", by adding the property below to conf/hdfs-site.xml:

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
 

Method 2:

Change an environment variable:

export HADOOP_USER_NAME=hdfs

Solved it perfectly.
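
Why this works: under Hadoop's default simple authentication the client-side user name is trusted, and HADOOP_USER_NAME overrides it, so subsequent commands run as the hdfs superuser. A minimal sketch for preparing the directory and handing it back to the service user (directory and user names are illustrative):

export HADOOP_USER_NAME=hdfs                         # act as the HDFS superuser
hdfs dfs -mkdir -p /user/impala/tpcds                # create the target directory
hdfs dfs -chown -R impala:impala /user/impala/tpcds  # hand it back to the service user
hdfs dfs -ls /user/impala                            # verify owner and permissions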

 

A small lesson learned: a daemon like HDFS cannot simply be suspended with Ctrl+Z; shut it down with the bundled scripts or the operations console. While fiddling around I unknowingly started several HDFS instances in the background; the node nearly froze, I could not kill them cleanly, and I rashly rebooted the node, after which its HDFS could no longer mount its original three disks. A PhD colleague from SJTU spent an hour and a half helping me, and then the whole cluster would not come up; he only got it fixed the next morning. (I am a terrible teammate.)
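
For reference, a hedged sketch of the graceful way to stop and restart HDFS with the bundled scripts (paths assume a plain tarball install; managed clusters use their own service manager):

$HADOOP_HOME/sbin/stop-dfs.sh       # stop the whole HDFS daemon set from the NameNode host
$HADOOP_HOME/sbin/start-dfs.sh      # bring it back up
jps | grep -E 'NameNode|DataNode'   # confirm only one NameNode/DataNode JVM per host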


3) Create text table
$ ./create-text-table.sh
A few options override the variables in tpcds-env.sh; it is strongly recommended that you do not use them.
-s | --scale, scale (in GB).
-l | --location, HDFS directory of the files generated by gen-data.sh.
-t | --textdb, The flat database's name.
-e | --engine, Query engine, one of [transwarp | impala-shell | hive].
-i | --inceptor, Inceptor server host when the engine is transwarp.
-h | --help, Show this help message.
This script creates the flat tables on top of the data generated by gen-data.sh.

create-text-table.sh

#!/bin/bash
source ./tpcds-env.sh

echo "*****************************************************************"
echo "**** Create text table by run Transwarp | Hive | Impala DDL  ****"
echo "****              Transwarp -f table-ddl.sql                 ****"
echo "*****************************************************************"


function usage {
  echo "Usage: $0 
  -s | --scale, scale(in GB). 
  -l | --location, HDFS directory of files generated by gen-data.sh.
  -t | --textdb, The flat database's name.
  -e | --engine, Query engine, one of [transwarp | impala-shell | hive].
  -i | --inceptor, inceptor server for transwarp engine.
  -h | --help, show this help text."
}

while [ $# -gt 0 ]; do
  case "$1" in
    -s | --scale)
      shift
      TPCDS_SCALE=$1
      shift
      ;;
    -l | --location)
      shift
      LOCATION_HDFS=$1
      shift
      ;;
    -t | --textdb)
      shift
      TEXT_DB=$1
      shift
      ;;
    -e | --engine)
      shift
      EXEC_ENGINE=$1
      shift
      ;;
    -i | --inceptor)
      shift
      INCEPTOR_SERVER=$1
      shift
      ;;
    -h | --help)
      HELP=true
      shift
      ;;
    *)
      echo "Invalid args: $1"
      exit 1
      ;;
  esac
done



[ X"$HELP" == X"true" ] && usage && exit 1

# Tables in the TPC-DS schema.
LIST="date_dim time_dim item customer customer_demographics 
      household_demographics customer_address store promotion 
      warehouse ship_mode reason income_band call_center 
      web_page catalog_page inventory store_sales store_returns
      web_sales web_returns web_site catalog_sales catalog_returns"

if [ X"$LOCATION_HDFS" == "X" ]; then
  usage && exit 1
fi
if [ X"$TPCDS_SCALE" == "X" ]; then
  usage && exit 1
fi
hdfs dfs -ls ${LOCATION_HDFS}/${TPCDS_SCALE} > /dev/null 2>&1
if [ $? -ne 0 ]; then 
  echo "Flat files in hdfs does not exist, run gen-data.sh first."
  exit 1
fi

# Generate the flat tables.
if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
  read -p "You are creating tpcds flat tables base on the data stored at HDFS \
directory ${LOCATION_HDFS}/${TPCDS_SCALE}, and the database name is $TEXT_DB, \
is that OK [Yes|No]? " CONFIRM 
  [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and run again." && exit 1
fi

cmd=$EXEC_ENGINE
[ $EXEC_ENGINE = "transwarp" ] && {
   INCEPTOR_SERVER=${INCEPTOR_SERVER:-localhost}
   echo "connect to inceptor server: $INCEPTOR_SERVER"
   cmd="transwarp -t -h $INCEPTOR_SERVER"
}

$cmd -e "create database if not exists $TEXT_DB" 

# temporary workaround since hive cannot substitute the DB and LOCATION vars
tmp_dir=$PROJ_HOME/ddl/text_tmp
[ -d $tmp_dir ] && {
 rm -rf $tmp_dir 
}
mkdir -p $tmp_dir
cp -rf  $PROJ_HOME/ddl/text/* $tmp_dir
for t in ${LIST}; do
  echo "Creating table $t..."
  sql_file=$tmp_dir/${t}.sql
  LOCATION=${LOCATION_HDFS}/${TPCDS_SCALE}/${t}
  location_expr=`echo $LOCATION|sed s#/#'\\\/'#g`
  sed -i "s#\${DB}#$TEXT_DB#g" $sql_file
  sed -i "s#\${LOCATION}#$location_expr#g" $sql_file
  # transwarp -h option is not needed here.
  $cmd -i $PROJ_HOME/settings/load-flat.sql -f $sql_file -d DB=$TEXT_DB -d LOCATION=${LOCATION_HDFS}/${TPCDS_SCALE}/${t} > /dev/null
done
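
When the script finishes, it is worth sanity-checking the flat tables before converting them. A minimal sketch, assuming the impala-shell engine and the default database name of a 1000 GB run:

impala-shell -q "show tables in tpcds_text_1000"
impala-shell -q "select count(*) from tpcds_text_1000.store_sales"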

4) Create other-format tables
$ ./create-none-text-table.sh
A few options override the variables in tpcds-env.sh; it is strongly recommended that you do not use them.
-s | --sourcedb, Source database, i.e. the flat database's name.
-t | --targetdb, Target database, i.e. the non-flat database's name.
-f | --format, Storage format of the tables you need.
-h | --host, Inceptor server host, localhost by default.
-d | --delete, DELETE_MODE, true or false.
-l | --location, LOCATION_HDFS of the flat files; cannot be empty when DELETE_MODE is true.
--help, Show this help message.
This script creates non-flat tables based on the flat tables created by create-text-table.sh. The flat tables and the related files in HDFS are removed if DELETE_MODE is true.

All of the above is integrated in all-in-one.sh, which does everything: gen-data, create-text-table, and create-none-text-table.

It is strongly recommended to check and modify tpcds-env.sh rather than overriding the variables with command-line options; a non-interactive end-to-end sketch follows.
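
A hedged sketch of a fully non-interactive end-to-end run; INTEGRATE_MODE is the flag the scripts check to skip the Yes/No prompts (this assumes all-in-one.sh takes no extra arguments):

cd bin
export INTEGRATE_MODE=true   # skip the interactive confirmations
./all-in-one.sh              # gen-data + create-text-table + create-none-text-table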
create-none-text-table.sh

 

#!/bin/bash
#
# (c) Copyright 2013-2015 Transwarp, Inc.
#
source ./tpcds-env.sh

echo "********************************************************************"
echo "*****  Tables stored as other file-format is implemented as:   *****"
echo "****  create table t stored as XXX as select * from TEXT_DB.t   ****"
echo "********************************************************************"

function usage {
  echo "Usage: $0 
  -s | --sourcedb, TEXT_DB
  -t | --targetdb, FORMAT_DB 
  -f | --format, TBL_FORMAT
  -h | --host, Inceptor server host, localhost by default.
  -d | --delete, DELETE_MODE, true or false
  -l | --location, LOCATION_HDFS of flat file, can not be empty when DELETE_MODE is true
  --help, show this help message."
}

while [ $# -gt 0 ]; do
  case "$1" in
    -s | --sourcedb)
      shift
      TEXT_DB=$1
      shift
      ;;
    -t | --targetdb)
      shift
      FORMAT_DB=$1
      shift
      ;;
    -f | --format)
      shift
      TBL_FORMAT=$1
      shift
      ;;
    -d | --delete)
      shift
      DELETE_MODE=$1
      shift
      ;;
    -l | --location)
      shift
      LOCATION_HDFS=$1
      shift
      ;;
    -h | --host)
      shift
      TRANS_HOST=$1
      shift
      ;;
    --help)
      HELP=true
      shift
      ;;
    *)
      echo "Invalid args: $1"
      exit 1
      ;;
  esac
done 

[ "$HELP" == "true" ] && usage && exit 0

if [ 'X'$TEXT_DB == 'X' ]; then
  usage
  exit 1
fi
if [ 'X'$FORMAT_DB == 'X' ]; then
  usage
  exit 1
fi

if [ 'X'$TBL_FORMAT == "Xflat" ]; then
  echo "This script is designed to generate none flat tables!"
  exit 1
fi

if [ 'X'$DELETE_MODE == "Xtrue" ] && [ X"$LOCATION_HDFS" == "X" ]; then 
  echo "LOCATION_HDFS is needed when DELETE_MODE is true."
  usage
  exit 1
fi

# Tables in the TPC-DS schema.
LIST="date_dim time_dim item customer customer_demographics 
      household_demographics customer_address store promotion 
      warehouse ship_mode reason income_band call_center 
      web_page catalog_page inventory store_sales store_returns
      web_sales web_returns web_site catalog_sales catalog_returns"

case $TBL_FORMAT in
  orc)
    # Nothing to do currently.
  ;;
  flat)
    # Nothing to do currently.
  ;;
  parquet)
    if [ ! $EXEC_ENGINE == "impala-shell" ]; then
      echo "parquet is currently only supported by impala. Data generation will exit soon." 
      exit 1
    fi
  ;;
  *)
    echo "Invalid format, only orc, flat and parquet are currently supported."
    exit 1
  ;;
esac

[ ! 'X'$DELETE_MODE == 'Xtrue' ] && [ ! 'X'$DELETE_MODE == 'Xfalse' ] && echo "Invalid DELETE_MODE, true or false expected but got $DELETE_MODE" && exit 1

if [ 'X'$INTEGRATE_MODE != "Xtrue" ]; then
  if [ 'X'$DELETE_MODE == "Xtrue" ]; then
    read -p "You are creating ${TBL_FORMAT} tables and the database name is \
$FORMAT_DB. After all, all tables in $TEXT_DB is droped and related files in \
HDFS are removed, is that OK [Yes|No]? " CONFIRM 
  elif [ 'X'$DELETE_MODE == "Xfalse" ]; then
    read -p "You are creating ${TBL_FORMAT} tables and the database name is \
$FORMAT_DB. After all, all tables in $TEXT_DB and related files in HDFS are \
reserved, is that OK [Yes|No]? " CONFIRM
  fi
  [ 'X'$CONFIRM != "XYes" ] && echo "Your answer is not Yes, check tpcds-env.sh and run again." && exit 1
fi

if [ 'X'$EXEC_ENGINE == "Xtranswarp" ]; then
  echo "Creating database..."
  transwarp -h $TRANS_HOST -e "create database if not exists $FORMAT_DB;" > /dev/null 
  for t in $LIST; do
    echo "Creating table $t..."
    transwarp -h $TRANS_HOST --database $FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null 
    [ 'X'$DELETE_MODE == "Xtrue" ] && transwarp -h $TRANS_HOST --database $TEXT_DB -e "drop table if exists $t;" > /dev/null 
  done
  [ 'X'$DELETE_MODE == "Xtrue" ] && transwarp -h $TRANS_HOST -e "drop database if exists $TEXT_DB;" > /dev/null

elif [ 'X'$EXEC_ENGINE == "Ximpala-shell" ]; then
  echo "Creating database..."
  impala-shell -q "create database if not exists $FORMAT_DB" > /dev/null 
  for t in $LIST; do
    echo "Creating table $t..." 
    impala-shell --database $FORMAT_DB -q "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null 
    [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell --database $TEXT_DB -q "drop table if exists $t;" > /dev/null 
  done
  [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell -q "drop database if exists $TEXT_DB;" > /dev/null 

elif [ 'X'$EXEC_ENGINE == "Xhive" ] || [ 'X'$EXEC_ENGINE == "Xspark-sql" ]; then
  echo "Creating database..."
  hive -e "create database if not exists $FORMAT_DB;" > /dev/null 
  for t in $LIST; do
    echo "Creating table $t..."
  $EXEC_ENGINE --database $FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null 
    [ 'X'$DELETE_MODE == "Xtrue" ] && $EXEC_ENGINE --database $TEXT_DB -e "drop table if exists $t;" > /dev/null 
  done
  [ 'X'$DELETE_MODE == "Xtrue" ] && $EXEC_ENGINE -e "drop database if exists $TEXT_DB;" > /dev/null 
fi

if [ $DELETE_MODE == "true" ]; then
  hdfs dfs -rm -r -f $LOCATION_HDFS
fi
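
Once the conversion finishes, a quick hedged check that the converted tables landed correctly (only meaningful with DELETE_MODE=false, since otherwise the text database is already gone; database names are illustrative):

# compare row counts between the flat and parquet copies of the big fact tables
for t in store_sales catalog_sales web_sales; do
  impala-shell -B -q "select '$t', count(*) from tpcds_parquet_1000.$t"
done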

The latest Impala releases seem to have changed the syntax slightly; if the SQL above fails, try this instead:

elif [ 'X'$EXEC_ENGINE == "Ximpala-shell" ]; then
  echo "Creating database..."
  impala-shell -q "create database if not exists $FORMAT_DB" > /dev/null
  for t in $LIST; do
    echo "Creating table $t..." 
    impala-shell --database $FORMAT_DB -q "create table if not exists $t like $TEXT_DB.$t stored as $TBL_FORMAT; insert overwrite $t select * from $TEXT_DB.$t;" > /dev/null
    [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell --database $TEXT_DB -q "drop table if exists $t;" > /dev/null
  done
  [ 'X'$DELETE_MODE == "Xtrue" ] && impala-shell -q "drop database if exists $TEXT_DB;" > /dev/null
The data volume is fairly large; at a rough guess the Impala Daemon memory limit for the group needs to be set to at least 12 GB.
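
One hedged way to raise the limit per session is Impala's mem_limit query option; daemon-wide it is the --mem_limit startup flag on impalad (the "Impala Daemon Memory Limit" setting in Cloudera Manager):

# per-session: cap query memory from inside impala-shell
impala-shell -q "set mem_limit=12g; select count(*) from tpcds_parquet_1000.store_sales;"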

2. tpcds-test-impala

First, create a config file containing use [db.name];

e.g.: use tpcds_parquet_1000;
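
For example, a one-liner to produce such a file (the file name config is an assumption; use whatever name the test script expects):

echo "use tpcds_parquet_1000;" > config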

 

tpcds-test-impala.sh (excerpt; variable setup and the run_query function are omitted)

 

## print basic configuration
echo "
Configuration:
----------------------------------------------
Database: $DB
Perf: $query_perf
Logs: $query_log
Output: $query_out
----------------------------------------------
" | tee -a $query_perf

## find the impala-shell ENGINE
which $ENGINE > /dev/null 2>&1
if [ $? -ne 0 ]; then
  echo "ERROR: Cannot find '$ENGINE' in your system path." & exit -1;
fi

## print instant performance results
printf "Performance results:
--------------------
%3s %8s %8s %8s %8s %8s %8s %s
" no success t_shell t_impala latest ln_run ln_avg info. \
  | tee -a $query_perf

## do the dirty work
if [ $QUERY_FILENO -eq -1 ]; then
  runs_total=99
  for i in {1..99}; do
    run_query $i
  done
elif [ $QUERY_FILENO -ge 0 ] && [ $QUERY_FILENO -le 99 ]; then
  runs_total=1
  run_query $QUERY_FILENO
else
  echo "ERROR: invalid query file no. (1~99)"
  exit -1;
fi

## print final coverage stats
printf "
Performance statistics:
-----------------------
Total queries: %s
Coverage: %s
Time shell: %ss
Time impala: %ss\n
" $runs_total \
  $(echo "scale=1; $runs_success/$runs_total*100" | bc)% \
  $runs_time_shell \
  $runs_time_impala \
  | tee -a $query_perf

exit 0;

## EOF

The 99 test SQL files are shared here: http://pan.baidu.com/s/1eSjUPm6

Later my team lead gave me an even nastier set of SQL; every so often a statement is syntactically incompatible, or a query just hangs without ever driving up Impala's memory and CPU. In that case you can kill the statement with Ctrl+C and move straight on to the next one.
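
To avoid babysitting Ctrl+C, one hedged option is to bound each query's runtime with coreutils timeout, so a hung statement fails on its own and the loop moves on (the file layout is illustrative):

# give each query at most 30 minutes, then move on to the next one
for i in $(seq 1 99); do
  timeout 1800 impala-shell -d tpcds_parquet_1000 -f query${i}.sql \
    || echo "query${i} failed or timed out"
done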

The query descriptions follow (from Appendix B of the TPC-DS specification):
B.1 query1.tpl
Find customers who have returned items more than 20% more often than the average customer returns for a store in a given state for a given year.
Qualification Substitution Parameters:
YEAR.01=2000
STATE.01=TN
AGGFIELD.01=SR_RETURN_AMT

B.2 query2.tpl
Report the increase of weekly web and catalog sales from one year to the next year for each week. That is, compute the increase of Monday, Tuesday, ... Sunday sales from one year to the following.
Qualification Substitution Parameters:
YEAR.01=2001

B.3 query3.tpl
Report the total extended sales price per item brand of a specific manufacturer for all sales in a specific month of the year.
Qualification Substitution Parameters:
MONTH.01=11
MANUFACT=128
AGGC=s_ext_sales_price

B.4 query4.tpl
Find customers who spend more money via catalog than in stores. Identify preferred customers and their country of origin.
Qualification Substitution Parameters:
YEAR.01=2001
SELECTONE.01=t_s_secyear.customer_id,t_s_secyear.customer_first_name,t_s_secyear.customer_last_name,t_s_secyear.c_preferred_cust_flag,t_s_secyear.c_birth_country,t_s_secyear.c_login,t_s_secyear.c_email_address

B.5 query5.tpl
Report sales, profit, return amount, and net loss in the store, catalog, and web channels for a 14-day window. Rollup results by sales channel and channel specific sales method (store for store sales, catalog page for catalog sales and web site for web sales).
Qualification Substitution Parameters:
SALES_DATE.01=2000-08-23
YEAR.01=2000

B.6 query6.tpl
List all the states with at least 10 customers who during a given month bought items with the price tag at least 20% higher than the average price of items in the same category.
Qualification Substitution Parameters:
MONTH.01=1
YEAR.01=2001

B.7 query7.tpl
Compute the average quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event. Restrict the results to a specific gender, marital and educational status.
Qualification Substitution Parameters:
YEAR.01=2000
ES.01=College
MS.01=S
GEN.01=M

B.8 query8.tpl
Compute the net profit of stores located in 400 Metropolitan areas with more than 10 preferred customers.
Qualification Substitution Parameters:
QOY.01=2
YEAR.01=1998
ZIP.01 through ZIP.400 (reassembled in order from the page-split columns):
ZIP.01=24128  ZIP.02=76232  ZIP.03=65084  ZIP.04=87816  ZIP.05=83926  ZIP.06=77556  ZIP.07=20548  ZIP.08=26231
ZIP.09=43848  ZIP.10=15126  ZIP.11=91137  ZIP.12=61265  ZIP.13=98294  ZIP.14=25782  ZIP.15=17920  ZIP.16=18426
ZIP.17=98235  ZIP.18=40081  ZIP.19=84093  ZIP.20=28577  ZIP.21=55565  ZIP.22=17183  ZIP.23=54601  ZIP.24=67897
ZIP.25=22752  ZIP.26=86284  ZIP.27=18376  ZIP.28=38607  ZIP.29=45200  ZIP.30=21756  ZIP.31=29741  ZIP.32=96765
ZIP.33=23932  ZIP.34=89360  ZIP.35=29839  ZIP.36=25989  ZIP.37=28898  ZIP.38=91068  ZIP.39=72550  ZIP.40=10390
ZIP.41=18845  ZIP.42=47770  ZIP.43=82636  ZIP.44=41367  ZIP.45=76638  ZIP.46=86198  ZIP.47=81312  ZIP.48=37126
ZIP.49=39192  ZIP.50=88424  ZIP.51=72175  ZIP.52=81426  ZIP.53=53672  ZIP.54=10445  ZIP.55=42666  ZIP.56=66864
ZIP.57=66708  ZIP.58=41248  ZIP.59=48583  ZIP.60=82276  ZIP.61=18842  ZIP.62=78890  ZIP.63=49448  ZIP.64=14089
ZIP.65=38122  ZIP.66=34425  ZIP.67=79077  ZIP.68=19849  ZIP.69=43285  ZIP.70=39861  ZIP.71=66162  ZIP.72=77610
ZIP.73=13695  ZIP.74=99543  ZIP.75=83444  ZIP.76=83041  ZIP.77=12305  ZIP.78=57665  ZIP.79=68341  ZIP.80=25003
ZIP.81=57834  ZIP.82=62878  ZIP.83=49130  ZIP.84=81096  ZIP.85=18840  ZIP.86=27700  ZIP.87=23470  ZIP.88=50412
ZIP.89=21195  ZIP.90=16021  ZIP.91=76107  ZIP.92=71954  ZIP.93=68309  ZIP.94=18119  ZIP.95=98359  ZIP.96=64544
ZIP.97=10336  ZIP.98=86379  ZIP.99=27068  ZIP.100=39736  ZIP.101=98569  ZIP.102=28915  ZIP.103=24206  ZIP.104=56529
ZIP.105=57647  ZIP.106=54917  ZIP.107=42961  ZIP.108=91110  ZIP.109=63981  ZIP.110=14922  ZIP.111=36420  ZIP.112=23006
ZIP.113=67467  ZIP.114=32754  ZIP.115=30903  ZIP.116=20260  ZIP.117=31671  ZIP.118=51798  ZIP.119=72325  ZIP.120=85816
ZIP.121=68621  ZIP.122=13955  ZIP.123=36446  ZIP.124=41766  ZIP.125=68806  ZIP.126=16725  ZIP.127=15146  ZIP.128=22744
ZIP.129=35850  ZIP.130=88086  ZIP.131=51649  ZIP.132=18270  ZIP.133=52867  ZIP.134=39972  ZIP.135=96976  ZIP.136=63792
ZIP.137=11376  ZIP.138=94898  ZIP.139=13595  ZIP.140=10516  ZIP.141=90225  ZIP.142=58943  ZIP.143=39371  ZIP.144=94945
ZIP.145=28587  ZIP.146=96576  ZIP.147=57855  ZIP.148=28488  ZIP.149=26105  ZIP.150=83933  ZIP.151=25858  ZIP.152=34322
ZIP.153=44438  ZIP.154=73171  ZIP.155=30122  ZIP.156=34102  ZIP.157=22685  ZIP.158=71256  ZIP.159=78451  ZIP.160=54364
ZIP.161=13354  ZIP.162=45375  ZIP.163=40558  ZIP.164=56458  ZIP.165=28286  ZIP.166=45266  ZIP.167=47305  ZIP.168=69399
ZIP.169=83921  ZIP.170=26233  ZIP.171=11101  ZIP.172=15371  ZIP.173=69913  ZIP.174=35942  ZIP.175=15882  ZIP.176=25631
ZIP.177=24610  ZIP.178=44165  ZIP.179=99076  ZIP.180=33786  ZIP.181=70738  ZIP.182=26653  ZIP.183=14328  ZIP.184=72305
ZIP.185=62496  ZIP.186=22152  ZIP.187=10144  ZIP.188=64147  ZIP.189=48425  ZIP.190=14663  ZIP.191=21076  ZIP.192=18799
ZIP.193=30450  ZIP.194=63089  ZIP.195=81019  ZIP.196=68893  ZIP.197=24996  ZIP.198=51200  ZIP.199=51211  ZIP.200=45692
ZIP.201=92712  ZIP.202=70466  ZIP.203=79994  ZIP.204=22437  ZIP.205=25280  ZIP.206=38935  ZIP.207=71791  ZIP.208=73134
ZIP.209=56571  ZIP.210=14060  ZIP.211=19505  ZIP.212=72425  ZIP.213=56575  ZIP.214=74351  ZIP.215=68786  ZIP.216=51650
ZIP.217=20004  ZIP.218=18383  ZIP.219=76614  ZIP.220=11634  ZIP.221=18906  ZIP.222=15765  ZIP.223=41368  ZIP.224=73241
ZIP.225=76698  ZIP.226=78567  ZIP.227=97189  ZIP.228=28545  ZIP.229=76231  ZIP.230=75691  ZIP.231=22246  ZIP.232=51061
ZIP.233=90578  ZIP.234=56691  ZIP.235=68014  ZIP.236=51103  ZIP.237=94167  ZIP.238=57047  ZIP.239=14867  ZIP.240=73520
ZIP.241=15734  ZIP.242=63435  ZIP.243=25733  ZIP.244=35474  ZIP.245=24676  ZIP.246=94627  ZIP.247=53535  ZIP.248=17879
ZIP.249=15559  ZIP.250=53268  ZIP.251=59166  ZIP.252=11928  ZIP.253=59402  ZIP.254=33282  ZIP.255=45721  ZIP.256=43933
ZIP.257=68101  ZIP.258=33515  ZIP.259=36634  ZIP.260=71286  ZIP.261=19736  ZIP.262=58058  ZIP.263=55253  ZIP.264=67473
ZIP.265=41918  ZIP.266=19515  ZIP.267=36495  ZIP.268=19430  ZIP.269=22351  ZIP.270=77191  ZIP.271=91393  ZIP.272=49156
ZIP.273=50298  ZIP.274=87501  ZIP.275=18652  ZIP.276=53179  ZIP.277=18767  ZIP.278=63193  ZIP.279=23968  ZIP.280=65164
ZIP.281=68880  ZIP.282=21286  ZIP.283=72823  ZIP.284=58470  ZIP.285=67301  ZIP.286=13394  ZIP.287=31016  ZIP.288=70372
ZIP.289=67030  ZIP.290=40604  ZIP.291=24317  ZIP.292=45748  ZIP.293=39127  ZIP.294=26065  ZIP.295=77721  ZIP.296=31029
ZIP.297=31880  ZIP.298=60576  ZIP.299=24671  ZIP.300=45549  ZIP.301=13376  ZIP.302=50016  ZIP.303=33123  ZIP.304=19769
ZIP.305=22927  ZIP.306=97789  ZIP.307=46081  ZIP.308=72151  ZIP.309=15723  ZIP.310=46136  ZIP.311=51949  ZIP.312=68100
ZIP.313=96888  ZIP.314=64528  ZIP.315=14171  ZIP.316=79777  ZIP.317=28709  ZIP.318=11489  ZIP.319=25103  ZIP.320=32213
ZIP.321=78668  ZIP.322=22245  ZIP.323=15798  ZIP.324=27156  ZIP.325=37930  ZIP.326=62971  ZIP.327=21337  ZIP.328=51622
ZIP.329=67853  ZIP.330=10567  ZIP.331=38415  ZIP.332=15455  ZIP.333=58263  ZIP.334=42029  ZIP.335=60279  ZIP.336=37125
ZIP.337=56240  ZIP.338=88190  ZIP.339=50308  ZIP.340=26859  ZIP.341=64457  ZIP.342=89091  ZIP.343=82136  ZIP.344=62377
ZIP.345=36233  ZIP.346=63837  ZIP.347=58078  ZIP.348=17043  ZIP.349=30010  ZIP.350=60099  ZIP.351=28810  ZIP.352=98025
ZIP.353=29178  ZIP.354=87343  ZIP.355=73273  ZIP.356=30469  ZIP.357=64034  ZIP.358=39516  ZIP.359=86057  ZIP.360=21309
ZIP.361=90257  ZIP.362=67875  ZIP.363=40162  ZIP.364=11356  ZIP.365=73650  ZIP.366=61810  ZIP.367=72013  ZIP.368=30431
ZIP.369=22461  ZIP.370=19512  ZIP.371=13375  ZIP.372=55307  ZIP.373=30625  ZIP.374=83849  ZIP.375=68908  ZIP.376=26689
ZIP.377=96451  ZIP.378=38193  ZIP.379=46820  ZIP.380=88885  ZIP.381=84935  ZIP.382=69035  ZIP.383=83144  ZIP.384=47537
ZIP.385=56616  ZIP.386=94983  ZIP.387=48033  ZIP.388=69952  ZIP.389=25486  ZIP.390=61547  ZIP.391=27385  ZIP.392=61860
ZIP.393=58048  ZIP.394=56910  ZIP.395=16807  ZIP.396=17871  ZIP.397=35258  ZIP.398=31387  ZIP.399=35458  ZIP.400=35576

B.9 query9.tpl
Categorize store sales transactions into 5 buckets according to the number of items sold. Each bucket contains the average discount amount, sales price, list price, tax, net paid, paid price including tax, or net profit.
Qualification Substitution Parameters:
AGGCTHEN.01=ss_ext_discount_amt
AGGCELSE.01=ss_net_paid
RC.01=74129
RC.02=122840
RC.03=56580
RC.04=10097
RC.05=165306

B.10 query10.tpl
Count the customers with the same gender, marital status, education status, purchase estimate, credit rating, dependent count, employed dependent count and college dependent count who live in certain counties and who have purchased from both stores and another sales channel during a three month time period of a given year.
Qualification Substitution Parameters:
YEAR.01 = 2002
MONTH.01 = 1
COUNTY.01 = Rush County
COUNTY.02 = Toole County
COUNTY.03 = Jefferson County
COUNTY.04 = Dona Ana County
COUNTY.05 = La Porte County

B.11 query11.tpl
Find customers whose increase in spending was larger over the web than in stores this year compared to last year.
Qualification Substitution Parameters:
YEAR.01 = 2001
SELECTONE = t_s_secyear.customer_preferred_cust_flag

B.12 query12.tpl
Compute the revenue ratios across item classes: For each item in a list of given categories, during a 30 day time period, sold through the web channel compute the ratio of sales of that item to the sum of all of the sales in that item's class.
Qualification Substitution Parameters:
CATEGORY.01 = Sports
CATEGORY.02 = Books
CATEGORY.03 = Home
SDATE.01 = 1999-02-22
YEAR.01 = 1999

B.13 query13.tpl
Calculate the average sales quantity, average sales price, average wholesale cost, total wholesale cost for store sales of different customer types (e.g., based on marital status, education status) including their household demographics, sales price and different combinations of state and sales profit for a given year.
Qualification Substitution Parameters:
YEAR.01 = 2001
STATE.01 = TX
STATE.02 = OH
STATE.03 = TX
STATE.04 = OR
STATE.05 = NM
STATE.06 = KY
STATE.07 = VA
STATE.08 = TX
STATE.09 = MS
ES.01 = Advanced Degree
ES.02 = College
ES.03 = 2 yr Degree
MS.01 = M
MS.02 = S
MS.03 = W

B.14 query14.tpl
This query contains multiple iterations:
Iteration 1: First identify items in the same brand, class and category that are sold in all three sales channels in two consecutive years. Then compute the average sales (quantity*list price) across all sales of all three sales channels in the same three years (average sales). Finally, compute the total sales and the total number of sales rolled up for each channel, brand, class and category. Only consider sales of cross channel sales that had sales larger than the average sale.
Iteration 2: Based on the previous query compare December store sales.
Qualification Substitution Parameters:
DAY.01 = 11
YEAR.01 = 1999

B.15 query15.tpl
Report the total catalog sales for customers in selected geographical regions or who made large purchases for a given year and quarter.
Qualification Substitution Parameters:
QOY.01 = 2
YEAR.01 = 2001

B.16 query16.tpl
Report number of orders, total shipping costs and profits from catalog sales of particular counties and states for a given 60 day period for non-returned sales filled from an alternate warehouse.
Qualification Substitution Parameters:
COUNTY_E.01 = Williamson County
COUNTY_D.01 = Williamson County
COUNTY_C.01 = Williamson County
COUNTY_B.01 = Williamson County
COUNTY_A.01 = Williamson County
STATE.01 = GA
MONTH.01 = 2
YEAR.01 = 2002

B.17 query17.tpl
Analyze, for each state, all items that were sold in stores in a particular quarter and returned in the next three quarters and then re-purchased by the customer through the catalog channel in the three following quarters.
Qualification Substitution Parameters:
YEAR.01 = 2001

B.18 query18.tpl
Compute, for each county, the average quantity, list price, coupon amount, sales price, net profit, age, and number of dependents for all items purchased through catalog sales in a given year by customers who were born in a given list of six months and living in a given list of seven states and who also belong to a given gender and education demographic.
Qualification Substitution Parameters:
MONTH.01 = 1
MONTH.02 = 6
MONTH.03 = 8
MONTH.04 = 9
MONTH.05 = 12
MONTH.06 = 2
STATE.01 = MS
STATE.02 = IN
STATE.03 = ND
STATE.04 = OK
STATE.05 = NM
STATE.06 = VA
STATE.07 = MS
ES.01 = Unknown
GEN.01 = F
YEAR.01 = 1998

B.19 query19.tpl
Select the top revenue generating products bought by out of zip code customers for a given year, month and manager.
Qualification Substitution Parameters:
MANAGER.01 = 8
MONTH.01 = 11
YEAR.01 = 1998

B.20 query20.tpl
Compute the total revenue and the ratio of total revenue to revenue by item class for specified item categories and time periods.
Qualification Substitution Parameters:
CATEGORY.01 = Sports
CATEGORY.02 = Books
CATEGORY.03 = Home
SDATE.01 = 1999-02-22
YEAR.01 = 1999

B.21 query21.tpl
For all items whose price was changed on a given date, compute the percentage change in inventory between the 30-day period BEFORE the price change and the 30-day period AFTER the change. Group this information by warehouse.
Qualification Substitution Parameters:
SALES_DATE.01 = 2000-03-11
YEAR.01 = 2000

B.22 query22.tpl
For each product name, brand, class, category, calculate the average quantity on hand. Rollup data by product name, brand, class and category.
Qualification Substitution Parameters:
DMS.01 = 1200

B.23 query23.tpl
This query contains multiple, related iterations:
Find frequently sold items that are items that were sold more than 4 times per day in four consecutive years. Compute the maximum store sales made by any given customer in a period of four consecutive years (same as above). Compute the best store customers as those that are in the 5th percentile of sales. Finally, compute the total sales of sales in March made by our best customers buying our most frequent items.
Qualification Substitution Parameters:
MONTH.01 = 2
YEAR.01 = 2000
TOPPERCENT=50

B.24 query24.tpl
This query contains multiple, related iterations:
Iteration 1: Calculate the total specified monetary value of items in a specific color for store sales transactions by customer name and store, in a specific market, from customers who currently live in their birth countries and in the neighborhood of the store, and list only those customers for whom the total specified monetary value is greater than 5% of the average value.
Iteration 2: Calculate the total specified monetary value of items in a specific color and specific size for store sales transactions by customer name and store, in a specific market, from customers who currently live in their birth countries and in the neighborhood of the store, and list only those customers for whom the total specified monetary value is greater than 5% of the average value.
Qualification Substitution Parameters:
MARKET = 8
COLOR.1 = pale
COLOR.2 = chiffon
AMOUNTONE = ss_net_paid

B.25 query25.tpl
Get all items that were
- sold in stores in a particular month and year and
- returned in the next three quarters and
- re-purchased by the customer through the catalog channel in the six following months.
For these items, compute the sum of net profit of store sales, net loss of store loss and net profit of catalog sales. Group this information by item and store.
Qualification Substitution Parameters:
MONTH.01 = 4
YEAR.01 = 2001
AGG.01 = sum

B.26 query26.tpl
Computes the average quantity, list price, discount, sales price for promotional items sold through the catalog channel where the promotion was not offered by mail or in an event for given gender, marital status and educational status.
Qualification Substitution Parameters:
YEAR.01 = 2000
ES.01 = College
MS.01 = S
GEN.01 = M

B.27 query27.tpl
For all items sold in stores located in six states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic.
Qualification Substitution Parameters:
STATE_F.01 = TN
STATE_E.01 = TN
STATE_D.01 = TN
STATE_C.01 = TN
STATE_B.01 = TN
STATE_A.01 = TN
ES.01 = College
MS.01 = S
GEN.01 = M
YEAR.01 = 2002

B.28 query28.tpl
Calculate the average list price, number of non empty (null) list prices and number of distinct list prices of six different sales buckets of the store sales channel. Each bucket is defined by a range of distinct items and information about list price, coupon amount and wholesale cost.
Qualification Substitution Parameters:
WHOLESALECOST.01=57
WHOLESALECOST.02=31
WHOLESALECOST.03=79
WHOLESALECOST.04=38
WHOLESALECOST.05=17
WHOLESALECOST.06=7
COUPONAMT.01=459
COUPONAMT.02=2323
COUPONAMT.03=12214
COUPONAMT.04=6071
COUPONAMT.05=836
COUPONAMT.06=7326
LISTPRICE.01=8
LISTPRICE.02=90
LISTPRICE.03=142
LISTPRICE.04=135
LISTPRICE.05=122
LISTPRICE.06=154

B.29 query29.tpl
Get all items that were sold in stores in a specific month and year and which were returned in the next six months of the same year and re-purchased by the returning customer afterwards through the catalog sales channel in the following three years.
For these items, compute the total quantity sold through the store, the quantity returned and the quantity purchased through the catalog. Group this information by item and store.
Qualification Substitution Parameters:
MONTH.01 = 9
YEAR.01 = 1999
AGG.01 = 29

B.30 query30.tpl
Find customers and their detailed customer data who have returned items, which they bought on the web, for an amount that is 20% higher than the average amount a customer returns in a given state in a given time period across all items. Order the output by customer data.
Qualification Substitution Parameters:
YEAR.01 = 2002
STATE.01 = GA

B.31 query31.tpl
List the top five counties where the percentage growth in web sales is consistently higher compared to the percentage growth in store sales in the first three consecutive quarters for a given year.
Qualification Substitution Parameters:
YEAR.01 = 2000
AGG.01 = ss1.ca_county

B.32 query32.tpl
Compute the total discounted amount for a particular manufacturer in a particular 90 day period for catalog sales whose discounts exceeded the average discount by at least 30%.
Qualification Substitution Parameters:
CSDATE.01 = 2000-01-27
YEAR.01 = 2000
IMID.01 = 977

B.33 query33.tpl
What is the monthly sales figure based on extended price for a specific month in a specific year, for manufacturers in a specific category in a given time zone. Group sales by manufacturer identifier and sort output by sales amount, by channel, and give Total sales.
Qualification Substitution Parameters:
CATEGORY.01 = Electronics
GMT.01 = -5
MONTH.01 = 5
YEAR.01 = 1998

B.34 query34.tpl
Display all customers with specific buy potentials and whose dependent count to vehicle count ratio is larger than 1.2, who in three consecutive years made purchases with between 15 and 20 items in the beginning or the end of each month in stores located in 8 counties.
Qualification Substitution Parameters:
COUNTY_H.01 = Williamson County
COUNTY_G.01 = Williamson County
COUNTY_F.01 = Williamson County
COUNTY_E.01 = Williamson County
COUNTY_D.01 = Williamson County
COUNTY_C.01 = Williamson County
COUNTY_B.01 = Williamson County
COUNTY_A.01 = Williamson County
YEAR.01 = 1999
BPTWO.01 = unknown
BPONE.01 = >10000

B.35 query35.tpl
For each of the customers living in the same state, having the same gender and marital status who have purchased from stores and from either the catalog or the web during a given year, display the following:
- state, gender, marital status, count of customers
- min, max, avg, count distinct of the customer's dependent count
- min, max, avg, count distinct of the customer's employed dependent count
- min, max, avg, count distinct of the customer's dependents in college count
Display / calculate the "count of customers" multiple times to emulate a potential reporting tool scenario.
Qualification Substitution Parameters:
YEAR.01 = 2002
AGGONE = min
AGGTWO = max
AGGTHREE = avg

B.36 query36.tpl
Compute store sales gross profit margin ranking for items in a given year for a given list of states.
Qualification Substitution Parameters:
STATE_H.01 = TN
STATE_G.01 = TN
STATE_F.01 = TN
STATE_E.01 = TN
STATE_D.01 = TN
STATE_C.01 = TN
STATE_B.01 = TN
STATE_A.01 = TN
YEAR.01 = 2001

B.37 query37.tpl
List all items and current prices sold through the catalog channel from certain manufacturers in a given $30 price range and consistently had a quantity between 100 and 500 on hand in a 60-day period.
Qualification Substitution Parameters:
PRICE.01 = 68
MANUFACT_ID.01 = 677
MANUFACT_ID.02 = 940
MANUFACT_ID.03 = 694
MANUFACT_ID.04 = 808
INVDATE.01 = 2000-02-01

B.38 query38.tpl
Display count of customers with purchases from all 3 channels in a given year.
Qualification Substitution Parameters:
DMS.01 = 1200

B.39 query39.tpl
This query contains multiple, related iterations:
Iteration 1: Calculate the coefficient of variation and mean of every item and warehouse of two consecutive months
Iteration 2: Find items that had a coefficient of variation in the first month of 1.5 or larger
Qualification Substitution Parameters:
YEAR.01 = 2001
MONTH.01 = 1

B.40 query40.tpl
Compute the impact of an item price change on the sales by computing the total sales for items in a 30 day period before and after the price change. Group the items by location of warehouse where they were delivered from.
Qualification Substitution Parameters:
SALES_DATE.01 = 2000-03-11
YEAR.01 = 2000

B.41 query41.tpl
How many items do we carry with specific combinations of color, units, size and category.
Qualification Substitution Parameters:
MANUFACT.01 = 738
SIZE.01 = medium
SIZE.02 = extra large
SIZE.03 = N/A
SIZE.04 = small
SIZE.05 = petite
SIZE.06 = large
UNIT.01 = Ounce
UNIT.02 = Oz
UNIT.03 = Bunch
UNIT.04 = Ton
UNIT.05 = N/A
UNIT.06 = Dozen
UNIT.07 = Box
UNIT.08 = Pound
UNIT.09 = Pallet
UNIT.10 = Gross
UNIT.11 = Cup
UNIT.12 = Dram
UNIT.13 = Each
UNIT.14 = Tbl
UNIT.15 = Lb
UNIT.16 = Bundle
COLOR.01 = powder
COLOR.02 = khaki
COLOR.03 = brown
COLOR.04 = honeydew
COLOR.05 = floral
COLOR.06 = deep
COLOR.07 = light
COLOR.08 = cornflower
COLOR.09 = midnight
COLOR.10 = snow
COLOR.11 = cyan
COLOR.12 = papaya
COLOR.13 = orange
COLOR.14 = frosted
COLOR.15 = forest
COLOR.16 = ghost

B.42 query42.tpl
For each item and a specific year and month calculate the sum of the extended sales price of store transactions.
Qualification Substitution Parameters:
MONTH.01 = 11
YEAR.01 = 2000

B.43 query43.tpl
Report the sum of all sales from Sunday to Saturday for stores in a given date range by stores.
Qualification Substitution Parameters:
YEAR.01 = 2000
GMT.01 = -5

B.44 query44.tpl
List the best and worst performing products measured by net profit.
Qualification Substitution Parameters:
NULLCOLSS.01 = ss_addr_sk
STORE.01 = 4

B.45 query45.tpl
Report the total web sales for customers in specific zip codes, cities, counties or states, or specific items for a given year and quarter.
Qualification Substitution Parameters:
QOY.01 = 2
YEAR.01 = 2001
GBOBC = ca_city

B.46 query46.tpl
Compute the per-customer coupon amount and net profit of all "out of town" customers buying from stores located in 5 cities on weekends in three consecutive years. The customers need to fit the profile of having a specific dependent count and vehicle count. For all these customers print the city they lived in at the time of purchase, the city in which the store is located, the coupon amount and net profit.
Qualification Substitution Parameters:
CITY_E.01 = Fairview
CITY_D.01 = Fairview
CITY_C.01 = Fairview
CITY_B.01 = Midway
CITY_A.01 = Fairview
VEHCNT.01 = 3
YEAR.01 = 1999
DEPCNT.01 = 4

B.47 query47.tpl
Find the item brands and categories for each store and company, the monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and store. Report deviation of sales from the previous and the following monthly sales.
Qualification Substitution Parameters:
YEAR.01 = 1999
SELECTONE = v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name
SELECTTWO = ,v1.d_year, v1.d_moy

B.48 query48.tpl
Calculate the total sales by different types of customers (e.g., based on marital status, education status), sales price and different combinations of state and sales profit.
Qualification Substitution Parameters:
MS.1=M
MS.2=D
MS.3=S
ES.1=4 yr Degree
ES.2=2 yr Degree
ES.3=College
STATE.1=CO
STATE.2=OH
STATE.3=TX
STATE.4=OR
STATE.5=MN
STATE.6=KY
STATE.7=VA
STATE.8=CA
STATE.9=MS

B.49 query49.tpl
Report the worst return ratios (sales to returns) of all items for each channel by quantity and currency sorted by ratio. Quantity ratio is defined as total number of sales to total number of returns. Currency ratio is defined as sum of return amount to sum of net paid.
Qualification Substitution Parameters:
MONTH.01 = 12
YEAR.01 = 2001

B.50 query50.tpl
For each store count the number of items in a specified month that were returned after 30, 60, 90, 120 and more than 120 days from the day of purchase.
Qualification Substitution Parameters:
MONTH.01 = 8
YEAR.01 = 2001

B.51 query51.tpl
Compute the count of store sales resulting from promotions, the count of all store sales and their ratio for specific categories in a particular time zone and for a given year and month.
Qualification Substitution Parameters:
DMS.01 = 1200

B.52 query52.tpl
Report the total of extended sales price for all items of a specific brand in a specific year and month.
Qualification Substitution Parameters:
MONTH.01=11
YEAR.01=2000

B.53 query53.tpl
Find the ID, quarterly sales and yearly sales of those manufacturers who produce items with specific characteristics and whose average monthly sales are larger than 10% of their monthly sales.
Qualification Substitution Parameters:
DMS.01 = 1200

B.54 query54.tpl
Find all customers who purchased items of a given category and class on the web or through catalog in a given month and year that was followed by an in-store purchase at a store near their residence in the three consecutive months. Calculate a histogram of the revenue by these customers in $50 segments showing the number of customers in each of these revenue generated segments.
Qualification Substitution Parameters:
CLASS.01 = maternity
CATEGORY.01 = Women
MONTH.01 = 12
YEAR.01 = 1998

B.55 query55.tpl
For a given year, month and store manager calculate the total store sales of any combination of all brands.
Qualification Substitution Parameters:
MANAGER.01 = 28
MONTH.01 = 11
YEAR.01 = 1999

B.56 query56.tpl
Compute the monthly sales amount for a specific month in a specific year, for items with three specific colors across all sales channels. Only consider sales of customers residing in a specific time zone. Group sales by item and sort output by sales amount.
Qualification Substitution Parameters:
COLOR.01 = slate
COLOR.02 = blanched
COLOR.03 = burnished
GMT.01 = -5
MONTH.01 = 2
YEAR.01 = 2001

B.57 query57.tpl
Find the item brands and categories for each call center and their monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and call center. Report the sales deviation from the previous and following month.
Qualification Substitution Parameters:
YEAR.01 = 1999
SELECTONE = v1.i_category, v1.i_brand, v1.cc_name
SELECTTWO = ,v1.d_year, v1.d_moy

B.58 query58.tpl
Retrieve the items generating the highest revenue and which had a revenue that was approximately equivalent across all of store, catalog and web within the week ending a given date.
Qualification Substitution Parameters:
SALES_DATE.01 = 2000-01-03

B.59 query59.tpl
Report the increase of weekly store sales from one year to the next year for each store and day of the week.
Qualification Substitution Parameters:
DMS.01 = 1212

B.60 query60.tpl
What is the monthly sales amount for a specific month in a specific year, for items in a specific category, purchased by customers residing in a specific time zone. Group sales by item and sort output by sales amount.
Qualification Substitution Parameters:
CATEGORY.01 = Music
GMT.01 = -5
MONTH.01 = 9
YEAR=1998

B.61 query61.tpl
Find the ratio of items sold with and without promotions in a given month and year. Only items in certain categories sold to customers living in a specific time zone are considered.
Qualification Substitution Parameters:
GMT.01 = -5
CATEGORY.01 = Jewelry
MONTH.01 = 11
YEAR.01 = 1998

B.62 query62.tpl
For web sales, create a report showing the counts of orders shipped within 30 days, from 31 to 60 days, from 61 to 90 days, from 91 to 120 days and over 120 days within a given year, grouped by warehouse, shipping mode and web site.
Qualification Substitution Parameters:
DMS.01 = 1200

B.63 query63.tpl
For a given year calculate the monthly sales of items of specific categories, classes and brands that were sold in stores and group the results by store manager. Additionally, for every month and manager print the yearly average sales of those items.
Qualification Substitution Parameters:
DMS.01 = 1200

B.64 query64.tpl
Find those stores that sold more cross-sales items from one year to another. Cross-sale items are items that are sold over the Internet, by catalog and in store.
Qualification Substitution Parameters:
YEAR.01 = 1999
PRICE.01 = 64
COLOR.01 = purple
COLOR.02 = burlywood
COLOR.03 = indian
COLOR.04 = spring
COLOR.05 = floral
COLOR.06 = medium

B.65 query65.tpl
In a given period, for each store, report the list of items with revenue less than 10% the average revenue for all the items in that store.
Qualification Substitution Parameters:
DMS.01 = 1176

B.66 query66.tpl
Compute web and catalog sales and profits by warehouse. Report results by month for a given year during a given 8-hour period.
Qualification Substitution Parameters:
SALESTWO.01 = cs_sales_price
SALESONE.01 = ws_ext_sales_price
NETTWO.01 = cs_net_paid_inc_tax
NETONE.01 = ws_net_paid
SMC.01 = DHL
SMC.02 = BARIAN
TIMEONE.01 = 30838
YEAR.01 = 2001

B.67 query67.tpl
Find top stores for each category based on store sales in a specific year.
Qualification Substitution Parameters:
DMS.01 = 1200

B.68 query68.tpl
Compute the per customer extended sales price, extended list price and extended tax for "out of town" shoppers buying from stores located in two cities in the first two days of each month of three consecutive years. Only consider customers with specific dependent and vehicle counts.
Qualification Substitution Parameters:
CITY_B.01 = Midway
CITY_A.01 = Fairview
VEHCNT.01 = 3
YEAR.01 = 1999
DEPCNT.01 = 4

B.69 query69.tpl
Count the customers with the same gender, marital status, education status, purchase estimate and credit rating who live in certain states and who have purchased from stores but neither from the catalog nor from the web during a two month time period of a given year.
Qualification Substitution Parameters:
STATE.01 = KY
STATE.02 = GA
STATE.03 = NM
YEAR.01 = 2001
MONTH.01 = 4

B.70 query70.tpl
Compute store sales net profit ranking by state and county for a given year and determine the five most profitable states.
Qualification Substitution Parameters:
DMS.01 = 1200

B.71 query71.tpl
Select the top revenue generating products, sold during breakfast or dinner time for one month managed by a given manager across all three sales channels.
Qualification Substitution Parameters:
MANAGER.01 = 1
MONTH.01 = 11
YEAR.01 = 1999

B.72 query72.tpl
For each item, warehouse and week combination count the number of sales with and without promotion.
Qualification Substitution Parameters:
BP.01 = >10000
MS.01 = D
YEAR.01 = 1999

B.73 query73.tpl
Count the number of customers with specific buy potentials and whose dependent count to vehicle count ratio is larger than 1 and who in three consecutive years bought in stores located in 4 counties between 1 and 5 items in one purchase. Only purchases in the first 2 days of the months are considered.
Qualification Substitution Parameters:
COUNTY_D.01 = Orange County
COUNTY_C.01 = Bronx County
COUNTY_B.01 = Franklin Parish
COUNTY_A.01 = Williamson County
YEAR.01 = 1999
BPTWO.01 = unknown
BPONE.01 = >10000

B.74 query74.tpl
Display customers with both store and web sales in consecutive years for whom the increase in web sales exceeds the increase in store sales for a specified year.
Qualification Substitution Parameters:
YEAR.01 = 2001
AGGONE.01 = sum
ORDERC.01 = 1 1 1

B.75 query75.tpl
For two consecutive years track the sales of items by brand, class and category.
Qualification Substitution Parameters:
CATEGORY.01 = Books
YEAR.02 = 2002

B.76 query76.tpl
Computes the average quantity, list price, discount, sales price for promotional items sold through the web channel where the promotion is not offered by mail or in an event for given gender, marital status and educational status.
Qualification Substitution Parameters:
NULLCOLCS.01 = cs_ship_addr_sk
NULLCOLWS.01 = ws_ship_customer_sk
NULLCOLSS.01 = ss_store_sk

B.77 query77.tpl
Report the total sales, returns and profit for all three sales channels for a given 30 day period. Roll up the results by channel and a unique channel location identifier.
Qualification Substitution Parameters:
SALES_DATE.01 = 2000-08-23

B.78 query78.tpl
Report the top customer / item combinations having the highest ratio of store channel sales to all other channel sales (minimum 2 to 1 ratio), for combinations with at least one store sale and one other channel sale. Order the output by highest ratio.
Qualification Substitution Parameters:
YEAR.01 = 2000

B.79 query79.tpl
Compute the per customer coupon amount and net profit of Monday shoppers. Only purchases of three consecutive years made on Mondays in large stores by customers with a certain dependent count and with a large vehicle count are considered.
Qualification Substitution Parameters:
VEHCNT.01 = 2
YEAR.01 = 1999
DEPCNT.01 = 6

B.80 query80.tpl
Report extended sales, extended net profit and returns in the store, catalog, and web channels for a 30 day
window for items with prices larger than $50 not promoted on television, rollup results by sales channel and
channel specific sales means (store for store sales, catalog page for catalog sales and web site for web sales)
Qualification Substitution Parameters:
?
SALES_DATE.01 = 2000-08-23
B.81 query81.tpl
Find customers and their detailed customer data who have returned items bought from the catalog more than 20
percent the average customer returns for customers in a given state in a given time period. Order output by
customer data.
Qualification Substitution Parameters:
?
?
YEAR.01 = 2000
STATE.01 = GA
B.82 query82.tpl
?
Find customers who tend to spend more money (net-paid) on-line than in stores.
Qualification Substitution Parameters
?
?
?
?
?
?
MANUFACT_ID.01 = 129
MANUFACT_ID.02 = 270
MANUFACT_ID.03 = 821
MANUFACT_ID.04 = 423
INVDATE.01 = 2000-05-25
PRICE.01 = 62
B.83 query83.tpl
Retrieve the items with the highest number of returns where the number of returns was approximately
equivalent across all store, catalog and web channels (within a tolerance of +/- 10%), within the week ending a
given date.
Qualification Substitution Parameters
?
?
?
RETURNED_DATE_THREE.01 = 2000-11-17
RETURNED_DATE_TWO.01 = 2000-09-27
RETURNED_DATE_ONE.01 = 2000-06-30
B.84 query84.tpl
List all customers living in a specified city, with an income between 2 values.
Qualification Substitution Parameters
INCOME.01 = 38128
CITY.01 = Edgewood
B.85 query85.tpl
For all web return reasons calculate the average sales, average refunded cash and average return fee by different
combinations of customer and sales types (e.g., based on marital status, education status, state and sales profit).
Qualification Substitution Parameters:
YEAR.01 = 2000
STATE.01 = IN
STATE.02 = OH
STATE.03 = NJ
STATE.04 = WI
STATE.05 = CT
STATE.06 = KY
STATE.07 = LA
STATE.08 = IA
STATE.09 = AR
ES.01 = Advanced Degree
ES.02 = College
ES.03 = 2 yr Degree
MS.01 = M
MS.02 = S
MS.03 = W
B.86 query86.tpl
Rollup the web sales for a given year by category and class, and rank the sales among peers within the parent,
for each group compute sum of sales, location with the hierarchy and rank within the group.
Qualification Substitution Parameters:
DMS.01 = 1200
B.87 query87.tpl
Count how many customers have ordered on the same day items on the web and the catalog and on the same
day have bought items in a store.
Qualification Substitution Parameters:
DMS.01 = 1200
B.88 query88.tpl
How many items do we sell between specific times of a day in certain stores to customers with one dependent
and 2 or fewer vehicles registered, or 2 dependents and 4 or fewer vehicles registered, or 3 dependents and 5 or
fewer vehicles registered? In one row, break the counts into sales from 8:30 to 9, 9 to 9:30, 9:30 to 10 ... 12
to 12:30.
Qualification Substitution Parameters:
STORE.01=Unknown
HOUR.01=4
HOUR.02=2
HOUR.03=0
B.89 query89.tpl
Within a year list all month and combination of item categories, classes and brands that have had monthly sales
larger than 0.1 percent of the total yearly sales.
Qualification Substitution Parameters:
CLASS_F.01 = dresses
CAT_F.01 = Women
CLASS_E.01 = birdal
CAT_E.01 = Jewelry
CLASS_D.01 = shirts
CAT_D.01 = Men
CLASS_C.01 = football
CAT_C.01 = Sports
CLASS_B.01 = stereo
CAT_B.01 = Electronics
CLASS_A.01 = computers
CAT_A.01 = Books
YEAR.01 = 1999
B.90 query90.tpl
What is the ratio between the number of items sold over the internet in the morning (8 to 9am) to the number of
items sold in the evening (7 to 8pm) of customers with a specified number of dependents. Consider only
websites with a high amount of content.
Qualification Substitution Parameters:
HOUR_PM.01 = 19
HOUR_AM.01 = 8
DEPCNT.01 = 6
B.91 query91.tpl
Display total returns of catalog sales by call center and manager in a particular month for male customers of
unknown education or female customers with advanced degrees with a specified buy potential and from a
particular time zone.
Qualification Substitution Parameters:
YEAR.01 = 1998
MONTH.01 = 11
BUY_POTENTIAL.01 = Unknown
GMT.01 = -7
B.92 query92.tpl
Compute the total discount on web sales of items from a given manufacturer over a particular 90 day period for
sales whose discount exceeded 30% over the average discount of items from that manufacturer in that period of
time.
Qualification Substitution Parameters:
IMID.01 = 350
WSDATE.01 = 2000-01-27
B.93 query93.tpl
For a given merchandise return reason, report on customers’ total cost of purchases minus the cost of returned
items.
Qualification Substitution Parameters:
Reason = reason 28
B.94 query94.tpl
Produce a count of web sales and total shipping cost and net profit in a given 60 day period to customers in a
given state from a named web site for non returned orders shipped from more than one warehouse.
Qualification Substitution Parameters:
YEAR.01 = 1999
MONTH.01 = 2
STATE.01 = IL
B.95 query95.tpl
Produce a count of web sales and total shipping cost and net profit in a given 60 day period to customers in a
given state from a named web site for returned orders shipped from more than one warehouse.
Qualification Substitution Parameters:
STATE.01=IL
MONTH.01=2
YEAR.01=1999
B.96 query96.tpl
Compute a count of sales from a named store to customers with a given number of dependents made in a
specified half hour period of the day.
Qualification Substitution Parameters:
HOUR.01 = 20
DEPCNT.01 = 7
B.97 query97.tpl
Generate counts of promotional sales and total sales, and their ratio from the web channel for a particular item
category and month to customers in a given time zone.
Qualification Substitution Parameters:
YEAR.01 = 2000
B.98 query98.tpl
Report on items sold in a given 30 day period, belonging to the specified category.
Qualification Substitution Parameters
YEAR.01 = 1999
SDATE.01 = 1999-02-22
CATEGORY.01 = Sports
CATEGORY.02 = Books
CATEGORY.03 = Home
B.99 query99.tpl
For catalog sales, create a report showing the counts of orders shipped within 30 days, from 31 to 60 days, from
61 to 90 days, from 91 to 120 days and over 120 days within a given year, grouped by warehouse, call center
and shipping mode.
Qualification Substitution Parameters
DMS.01 = 1200
Test results attached below:
Configuration:
----------------------------------------------
Database: tpcds_parquet_1000
Perf: ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_perf.csv
Logs: ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_log
Output: ./logs/impala/test_impala_tpcds_parquet_1000_160510-105135/query_out
----------------------------------------------

Performance results (time columns in seconds):
--------------------
 no  success  t_shell t_impala   latest   ln_run   ln_avg info.
  1      yes       22    21.69    30.90       27     85.4 ""
  2      yes      168   167.47   144.21       20    309.9 ""
  3      yes       11    10.93   386.26       18    113.8 ""
  4      yes     1453  1439.48   466.00       11    502.5 ""
  5      yes      598   594.16    96.51       12    172.0 ""
  6      yes       44    42.99    21.12       14     51.9 ""
  7      yes       51    50.55    27.28       13    100.4 ""
  8      yes       25    24.64    13.24       14     17.7 ""
  9      yes       42    41.91   109.18       14     63.6 ""
 10      yes       50    50.37    23.26       12     29.2 ""
 11      yes     2578  2565.05   366.42       14    477.8 ""
 12      yes       49    49.05     3.72       12     10.0 ""
 13      yes      113   112.65   143.29       10    104.7 ""
 14      yes     2162  1125.06  1184.38        8   1201.4 ""
 15      yes       16    16.14    12.49        9     24.4 ""
 16      yes      688   687.43   783.21        8   1032.4 ""
 17      yes      116   115.34   114.32        8    134.7 ""
 18      yes       82    81.67    42.15        8    134.5 ""
 19      yes       42    41.21    12.61        8     37.3 ""
 20      yes       12    11.68     5.78        8     12.1 ""
 21      yes       14    13.66    45.11        8     76.5 ""
 22      yes      175   174.55   208.11        7    202.7 ""
 23      yes     2807  1577.84  1599.90        7   1670.1 ""
 24      yes     4583  2404.18  2316.71        8   3377.1 ""
 25      yes      108   107.57    53.82        8     97.7 ""
 26      yes       16    15.15    16.14        8     58.3 ""
 27      yes       69    68.52    21.70        8     51.2 ""
 28      yes      105   104.73    89.09       10     82.7 ""
 29      yes      638   638.10   106.90        7    387.9 ""
 30       no       15        0        0        0        0 ""
 31      yes       53    52.43    29.72        8     44.8 ""
 32      yes       17    16.51     7.11        8     19.0 ""
 33      yes       27    25.79     7.11        8     20.7 ""
 34      yes       38    38.08    30.71        8     39.9 ""
 35      yes      179   177.88  4415.05        7   2393.7 ""
 36      yes       53    52.99    28.83        8     61.7 ""
 37      yes      566   565.82   176.65        8    282.8 ""
 38      yes      534   534.00   232.56        8    359.2 ""
 39      yes       55    20.23    86.13        8    159.3 ""
 40      yes       95    95.28    33.84        8     54.4 ""
 41      yes        2     1.45     3.13        8      2.8 ""
 42      yes       51    50.69     5.09        8     12.1 ""
 43      yes      262   262.55    27.17        8     54.6 ""
 44      yes       37    37.09    44.12        8     36.8 ""
 45      yes       16    15.41    10.28        8     21.2 ""
 46      yes       35    34.25    19.64        8     35.5 ""
 47      yes     4080  4079.43   425.56        9    748.9 ""
 48      yes       30    28.93    13.95        8     25.3 ""
 49      yes       70    69.17    38.57        8     65.4 ""
 50      yes      250   249.56   174.81        8    218.1 ""
 51      yes     1320  1319.70  4343.36        7   2383.1 ""
 52      yes       52    51.81     9.03        8     20.2 ""
 53      yes      299   298.55    14.86        8     43.0 ""
 54      yes       89    88.43   116.01        4    109.1 ""
 55      yes       41    41.20     5.26        7     13.8 ""
 56      yes       26    25.82    10.11        7     33.5 ""
 57      yes      537   537.05   119.84        7    257.2 ""
 58      yes       16    16.28     5.30        7     18.5 ""
 59      yes      192   192.58   198.41        7    190.0 ""
 60      yes       28    28.34    86.52        6     59.4 ""
 61      yes       42    42.52    13.59        7     39.1 ""
 62      yes       30    29.73    82.71        7     97.8 ""
 63      yes      245   244.46    17.54        7     52.2 ""
 64      yes      814   813.07   540.83        7    646.0 ""
 65      yes      217   216.55   178.38        7    171.2 ""
 66      yes       29    28.18    17.79        7     51.1 ""
 67      yes     4235  4224.15  3636.85        1   3636.8 ""
 68      yes       92    91.37    51.49        7    114.8 ""
 69      yes       17    16.69     4.02        7     20.0 ""
 70      yes      118   116.66   118.80        7    106.3 ""
 71      yes      118   117.37    19.08        7     58.0 ""
 72      yes     2672  2671.46  1274.81        6   1415.3 ""
 73      yes       20    19.20     7.23        7     19.7 ""
 74      yes      816   813.32   212.46        6    349.2 ""
 75      yes      284   283.13   377.51        7    373.8 ""
 76      yes       51    50.62   155.08        7    104.3 ""
 77      yes       92    90.96    16.97        7     52.3 ""
 78      yes      409   408.80   440.73        7    453.0 ""
 79      yes       44    43.78    66.02        7     54.6 ""
 80      yes      433   431.42   110.24        7    240.4 ""
 81      yes       35    35.02    46.16        7     52.4 ""
 82      yes     1120  1119.60   172.77        7    319.6 ""
 83      yes        7     6.64     7.63        7     10.6 ""
 84      yes      224   223.43    47.99        7    177.0 ""
 85      yes       90    89.21    51.84        7     83.6 ""
 86      yes       36    35.86    31.84        7     53.3 ""
 87      yes      537   536.78   247.84        7    375.0 ""
 88      yes      112   111.68    73.37        7     72.8 ""
 89      yes      332   331.64    25.53        7     59.2 ""
 90      yes        7     6.51    14.37        7     70.1 ""
 91      yes       12    11.65     4.09        7      5.2 ""
 92      yes       10     9.42     4.67        7      9.7 ""
 93      yes      564   562.82   515.55        7    551.7 ""
 94      yes      365   364.96   319.71        6    408.5 ""
 95      yes      461   460.17   183.60        6    230.5 ""
 96      yes       11    10.89    13.70        7     25.6 ""
 97      yes      214   213.05   221.09        7    260.1 ""
 98      yes       53    53.18    47.57        6     53.6 ""
 99      yes       58    57.78   151.64        6     90.9 ""

Performance statistics:
-----------------------
Total queries: 99
Coverage: 90.0%
Time shell: 40928s
Time impala: 36354.80s

3. Build GitHub Spark Runnable Distribution

 

 

As a side note, here is how to build a runnable distribution of the latest Spark from the Apache GitHub repo (https://github.com/apache/spark).

The current official release of Spark is 1.6.1, while GitHub master is at 2.0.0, which reportedly performs far better than its predecessor.

See: https://issues.apache.org/jira/browse/SPARK-14070

  https://github.com/apache/spark/pull/11891

  

Ran on a production table in Facebook (note that the data was in DWRF file format which is similar to ORC)

 

Best case : when there was no matching rows for the predicate in the query (everything is filtered out)

 

                     CPU time       Wall time    Total wall time across all tasks
================================================================
Without the change   541_515 sec    25.0 mins    165.8 hours
With change          407 sec        1.5 mins     15 mins

 

Average case: A subset of rows in the data match the query predicate

 

                     CPU time       Wall time    Total wall time across all tasks
================================================================
Without the change   624_630 sec    31.0 mins    199.0 h
With change          14_769 sec     5.3 mins     7.7 h

 

 

 

 

First, git clone https://github.com/apache/spark.

The steps are largely the same as the official guide: http://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution

Two pitfalls to note. First, in 2.0.0 make-distribution.sh has moved into dev/; I copied the 1.6.1 version into the repo root instead and paid dearly for it.
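A minimal invocation sketch from the repo root (the -P profiles are illustrative; pick the ones matching your Hadoop version, as the official guide above describes):

# 2.0.0: note the dev/ prefix; in 1.6.1 the script sat in the repo root
./dev/make-distribution.sh --name custom-spark --tgz \
    -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn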

Second, the 2.0.0 make-distribution.sh no longer builds the dependency assembly jar under assembly/. From the commit message on GitHub:

 

[SPARK-13579][BUILD] Stop building the main Spark assembly.

This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin 

Closes #11796 from vanzin/SPARK-13579.

 

There is no docker folder in my 2.0.0 checkout, but pom.xml still contains configuration for it, so the following two sections of pom.xml need to be commented out. The first is the module entry:

<module>external/docker-integration-tests</module>

The second is the docker-client dependency (the XML tags below are reconstructed, since the original paste had the markup stripped; the group/artifact pairs are as captured):

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>docker-client</artifactId>
  <classifier>shaded</classifier>
  <version>3.6.6</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>

 

 

During compilation Maven downloads dependencies into ~/.m2, and the default mirror can be so slow that the build fails outright.

Workarounds:
1. Set up a proxy in $MAVEN_HOME/conf/settings.xml (I have not tried this).
2. Configure a mirror in $MAVEN_HOME/conf/settings.xml. A few mirrors that are somewhat faster:

  

<mirror>
  <id>ui</id>
  <mirrorOf>central</mirrorOf>
  <name>Human Readable Name for this Mirror.</name>
  <url>http://uk.maven.org/maven2/</url>
</mirror>
<mirror>
  <id>jboss-public-repository-group</id>
  <mirrorOf>central</mirrorOf>
  <name>JBoss Public Repository Group</name>
  <url>http://repository.jboss.org/nexus/content/groups/public</url>
</mirror>
<mirror>
  <id>JBossJBPM</id>
  <mirrorOf>central</mirrorOf>
  <name>JBossJBPM Repository</name>
  <url>https://repository.jboss.org/nexus/content/repositories/releases/</url>
</mirror>
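These <mirror> entries belong inside the <mirrors> element of settings.xml; a minimal skeleton, assuming you are starting from an empty settings file:

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <!-- paste the <mirror> entries from above here -->
  </mirrors>
</settings>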

 

 

 

 

4. tpcds-test-sparkSQL

One open problem: in theory, spark-sql should be able to query the parquet tables that Impala generated.

show databases; and show tables; both work, but any select fails with the following error log:

 

spark-sql> select count(*) from date_dim;
16/05/13 14:33:17 ERROR SparkSQLDriver: Failed in [select count(*) from date_dim]
java.io.FileNotFoundException: Path is not a file: /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:70)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1222)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210)
    at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1260)
    at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:220)
    at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:216)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:216)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:208)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1$$anonfun$apply$2.apply(ListingFileCatalog.scala:103)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1$$anonfun$apply$2.apply(ListingFileCatalog.scala:91)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1.apply(ListingFileCatalog.scala:91)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog$$anonfun$1.apply(ListingFileCatalog.scala:79)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog.listLeafFiles(ListingFileCatalog.scala:79)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog.refresh(ListingFileCatalog.scala:68)
    at org.apache.spark.sql.execution.datasources.ListingFileCatalog.(ListingFileCatalog.scala:50)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:314)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$14.apply(HiveMetastoreCatalog.scala:320)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$14.apply(HiveMetastoreCatalog.scala:311)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:311)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$ParquetConversions$$convertToParquetRelation(HiveMetastoreCatalog.scala:354)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:377)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$1.applyOrElse(HiveMetastoreCatalog.scala:362)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:284)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:284)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:284)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:362)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:335)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:554)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:671)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:325)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:240)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is not a file: /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:70)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1934)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1875)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1855)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1827)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:566)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:88)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

    at org.apache.hadoop.ipc.Client.call(Client.java:1468)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:254)
    at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1220)
    ... 85 more

 

From the log, Spark is trying to open /user/hive/warehouse/tpcds_parquet_1000.db/date_dim/_impala_insert_staging as if it were a data file; the NameNode answers "Path is not a file" because that path is a directory (the staging directory Impala creates under each table during inserts), not a parquet file.

In theory, if Hive could not read that path either, Hive SQL should also fail to select from tpcds_parquet_1000.db, yet Hive works fine. Refreshing the tables in both Hive and Spark SQL did not help either.

If anyone knows how to solve this properly, please leave me a reply. Thanks!
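One workaround that seems worth trying (untested here): _impala_insert_staging is just Impala's scratch directory for in-flight inserts, so when no Impala INSERT is running it should be safe to remove the empty staging directories that Spark's file listing trips over:

# illustrative commands: confirm the staging dirs are empty, then remove them
hdfs dfs -ls /user/hive/warehouse/tpcds_parquet_1000.db/*/_impala_insert_staging
hdfs dfs -rm -r /user/hive/warehouse/tpcds_parquet_1000.db/*/_impala_insert_staging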

As a fallback, I used Spark itself to regenerate the parquet tables from the text tables.

First, add the environment variables:

export SPARK_HOME=/usr/lib/spark
export PATH=$SPARK_HOME/bin:$PATH

Then change the tpcds-env.sh configuration:

export EXEC_ENGINE=spark-sql

And add the new branch below to create-none-text-table.sh:

 

elif [ "X$EXEC_ENGINE" == "Xspark-sql" ]; then
  echo "Creating database..."
  spark-sql -e "create database if not exists spark_$FORMAT_DB" > /dev/null
  for t in $LIST; do
    echo "Creating table $t..."
    # CTAS from the flat text table into the target storage format
    spark-sql --database spark_$FORMAT_DB -e "create table if not exists $t stored as $TBL_FORMAT as select * from $TEXT_DB.$t;" > /dev/null
    # optionally drop the source text table as we go
    [ "X$DELETE_MODE" == "Xtrue" ] && spark-sql --database $TEXT_DB -e "drop table if exists $t;" > /dev/null
  done
  # once all tables are converted, drop the (now empty) text database
  [ "X$DELETE_MODE" == "Xtrue" ] && spark-sql -e "drop database if exists $TEXT_DB;" > /dev/null
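After the script finishes, a quick sanity check (assuming TBL_FORMAT=parquet and TPCDS_SCALE=1000, so the new database is named spark_tpcds_parquet_1000):

spark-sql --database spark_tpcds_parquet_1000 -e "select count(*) from date_dim;"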

 

 

 

5. Spark-SQL test

During testing I hit an NPE related to serialization. After some digging, it looks like a Spark 2.0 bug that makes the default configuration unusable; a workaround is described here:

 

http://www.cnblogs.com/xiaoyesoso/p/5522671.html

 

My team lead's performance analysis of Spark 2.0: whole-stage code generation brings an obvious improvement to simple map-join + group-by queries, and handling multiple references to a WITH ... AS block via a shared RDD is more efficient than Inceptor's approach of creating a temporary table.
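To illustrate the WITH ... AS point, a made-up query (not one of the 99 templates; the columns are from the TPC-DS store_sales table, and the database name is illustrative): the CTE below is referenced twice, and per the analysis above Spark 2.0 can back both references with one shared RDD instead of materializing a temporary table the way Inceptor does.

spark-sql --database spark_tpcds_parquet_1000 -e "
WITH store_totals AS (
  SELECT ss_store_sk, sum(ss_net_paid) AS net_paid
  FROM store_sales
  GROUP BY ss_store_sk
)
-- both references to store_totals below share one computation
SELECT a.ss_store_sk, a.net_paid
FROM store_totals a JOIN store_totals b ON a.ss_store_sk = b.ss_store_sk + 1
WHERE a.net_paid > b.net_paid;"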

 

In fact, at 1 TB the data volume is so large that Spark 2.0's performance is quite unstable, so we are currently switching to a 2 GB dataset for testing.

We ran every query twice, once with spark.sql.parquet.enableVectorizedReader and spark.sql.codegen.wholeStage both set to true and once with both set to false (a launch sketch follows the table):

query  parquet(true) [s]  parquet(false) [s]  ratio(false/true)
q1 3.681 5.459 1.4830209182
q2 6.105 6.395 1.0475020475
q3 1.804 2.598 1.4401330377
q4 17.394 20.32 1.1682189261
q5 9.129 10.234 1.1210428305
q6 4.395 4.522 1.0288964733
q7 2.466 4.614 1.8710462287
q8 3.407 4.12 1.209275022
q9 2.664 3.693 1.3862612613
q10 7.416 10.071 1.3580097087
q11 16.779 17.998 1.0726503367
q12 1.743 2.181 1.2512908778
q13 5.694 9.54 1.6754478398
q14 18.547 24.803 1.3373052246
q15 1.97 2.648 1.3441624365
q16 8.041 11.884 1.4779256311
q17 3.373 4.361 1.2929143196
q18 4.992 7.964 1.5953525641
q19 2.071 3.069 1.4818928054
q20 2.163 2.47 1.1419325012
q21 1.837 7.818 4.2558519325
q22 8.291 24.881 3.0009649017
q23 15 18.27 1.218
q24 4.712 7.517 1.5952886248
q25 2.651 4.469 1.6857789513
q26 1.831 3.723 2.0333151283
q27 1.965 5.177 2.634605598
q28 0.437 1.247 2.8535469108
q29 3.017 3.325 1.1020881671
q30 3.6 3.626 1.0072222222
q31 4.734 5.504 1.1626531474
q32 1.139 1.748 1.5346795435
q33 3.352 3.914 1.1676610979
q34 2.202 3.62 1.6439600363
q35 7.863 8.843 1.1246343635
q36 2.516 4.882 1.940381558
q37 2.499 6.099 2.4405762305
q38 6.946 6.961 1.0021595163
q39 4.443 10.752 2.4199864956
q40 2.257 3.066 1.3584404076
q41 0.705 0.751 1.065248227
q42 0.896 2.032 2.2678571429
q43 1.509 3.063 2.0298210736
q44 2.304 2.725 1.1827256944
q45 2.295 3.551 1.5472766885
q46 2.409 5.601 2.3250311333
q47 12.89 19.439 1.50806827
q48 2.194 3.884 1.7702825889
q49 2.813 3.917 1.392463562
q50 2.959 5.148 1.7397769517
q51 12.645 11.957 0.9455911427
q52 0.904 1.565 1.7311946903
q53 1.262 2.448 1.93977813
q54 3.423 1.2 0.3505696757
q55 1.009 1.812 1.7958374628
q56 2.681 3.66 1.3651622529
q57 10.104 15.117 1.4961401425
q58 2.873 3.187 1.1092934215
q59 3.532 6.851 1.9396942242
q60 5.727 6.019 1.0509865549
q61 2.515 5.371 2.1355864811
q62 1.419 3.157 2.2248062016
q63 1.287 2.541 1.9743589744
q64 13.089 16.151 1.2339368936
q65 2.848 3.774 1.3251404494
q66 4.581 5.338 1.1652477625
q67 6.499 9.739 1.4985382367
q68 2.488 4.854 1.9509646302
q69 7.432 9.675 1.301803014
q70 4.726 6.919 1.464028777
q71 1.992 2.946 1.4789156627
q72 7.706 30.718 3.9862444848
q73 1.76 3.235 1.8380681818
q74 16.598 14.268 0.8596216412
q75 15.759 17.251 1.0946760581
q76 1.536 2.119 1.3795572917
q77 3.243 4.307 1.3280912735
q78 12.951 14.2 1.0964404293
q79 2.305 4.379 1.8997830803
q80 5.504 7.527 1.3675508721
q81 4.069 3.842 0.9442123372
q82 1.708 7.241 4.2394613583
q83 2.398 2.764 1.1526271893
q84 3.261 4.353 1.3348666053
q85 7.732 10.609 1.3720900155
q86 1.464 2.347 1.6031420765
q87 5.561 5.575 1.0025175328
q88 2.971 6.429 2.1639178728
q89 2.104 3.087 1.4672053232
q90 0.946 2.907 3.0729386892
q91 1.684 3.048 1.809976247
q92 1.419 2.044 1.4404510218
q93 2.647 4.198 1.5859463544
q94 7.805 12.464 1.596925048
q95 7.681 12.642 1.6458794428
q96 0.978 1.573 1.6083844581
q97 2.669 3.126 1.171225178
q98 1.899 2.606 1.3723012112
q99 1.432 4.597 3.2101955307
all 456.926 662.234  
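
A sketch of how the two runs can be launched (both keys are standard Spark 2.0 SQL options; the database name and query file are illustrative):

# "parquet(true)" run: both features on, the Spark 2.0 defaults
spark-sql --conf spark.sql.codegen.wholeStage=true \
          --conf spark.sql.parquet.enableVectorizedReader=true \
          --database spark_tpcds_parquet_2 -f query01.sql

# "parquet(false)" run: both features off
spark-sql --conf spark.sql.codegen.wholeStage=false \
          --conf spark.sql.parquet.enableVectorizedReader=false \
          --database spark_tpcds_parquet_2 -f query01.sql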
