BigDataBench 5.0 (released June 2019)
Prerequisites: Hadoop, JDK, g++, gcc, and GSL installed
http://125.39.136.212:8090/BigDataBench/BigDataBench_V5.0_BigData_MicroBenchmark
http://125.39.136.212:8090/BigDataBench/BigDataBench_V5.0_BigData_ComponentBenchmark
(A GitLab account is required to download.)
If you don't want to register, you can download from here (extraction code: jusb). Pick the package that matches your system and needs (I'm on Ubuntu 18.04, so I downloaded the tar.gz package).
I extracted the archives to /opt/module:
$ tar -zxvf BigDataBench_V5.0_BigData_MicroBenchmark.tar.gz -C /opt/module/
$ tar -zxvf BigDataBench_V5.0_BigData_ComponentBenchmark.tar.gz -C /opt/module/
Rename BigDataBench_V5.0_BigData_MicroBenchmark and BigDataBench_V5.0_BigData_ComponentBenchmark (the names are too long):
mv BigDataBench_V5.0_BigData_MicroBenchmark BigDataBench5.0_MicroBenchmark
mv BigDataBench_V5.0_BigData_ComponentBenchmark BigDataBench5.0_ComponentBenchmark
Tip: it's best not to run ./prepare.sh directly, as it fails with compilation errors. (Feel free to try; if it succeeds for you, skip the steps below.)
Reason: many of the bundled files are too old, so they need to be updated manually and rebuilt with make.
Download and install the GSL library on Ubuntu (GSL is an open-source scientific computing library, C version).
Installation guide: https://blog.csdn.net/weixin_34566605/article/details/103001334
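If you'd rather not build GSL from source, it can also be installed from the Ubuntu package archive; a minimal sketch (package names are those of Ubuntu 18.04, and the BigDataBench makefiles may still rebuild the bundled gsl-1.15 either way):
$ sudo apt-get update
$ sudo apt-get install libgsl-dev gsl-bin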
Go into BigDataGeneratorSuite/Text_datagen and you'll find an archive gsl-1.15.tar.gz (the files inside are somewhat old and need to be recompiled).
cd ./BigDataGeneratorSuite/Text_datagen
tar -xf gsl-1.15.tar.gz   # extract gsl-1.15.tar.gz
cd gsl-1.15/
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ ./autogen.sh   # regenerates up-to-date config.guess, config.sub, etc. (the bundled ones are from 2009 and outdated)
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ ./configure
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ make
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ sudo make install   # root privileges required
Compile:
nano@nano10:/opt/module/BigDataBench5.0_MicroBenchmark/BigDataGeneratorSuite/Text_datagen$ make
An error occurs:
gen_random_text.cpp:43:23: warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
char* alpha_temp2="/final.other";
^~~~~~~~~~~~~~
gen_random_text.cpp:44:32: error: ‘strlen’ was not declared in this scope
char* alphafile = new char[strlen(alpha_temp1)+strlen(modeldirname)+strlen(alpha_temp2)+1];
Causes of the errors:
1. warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
This warning appears because, in C and C++, when the two sides of an assignment have different types, the compiler performs an implicit conversion so that the assignment can go through. Here the string literal on the right-hand side is converted to a plain char* pointer, which effectively allows a const constant to be modified. What happens then depends on both the compiler and the operating system: some compilers accept it, some raise an error, and even if it compiles, the process may be killed by the OS when the literal is actually written to. Assigning a string literal directly to a char* is considered deprecated by developers, and is only still tolerated for compatibility with older code. To suppress the warning, add #pragma GCC diagnostic ignored "-Wwrite-strings" at the top of the program.
2. error: ‘strlen’ was not declared in this scope
The compiler does not pull in cstring by default, so add the header #include <cstring> at the top of the file.
Solution:
First run make clean, then add the following at the top of both gen_random_text.cpp and pgen_random_text.cpp:
#include <cstring>
#pragma GCC diagnostic ignored "-Wwrite-strings"
Then run make again.
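If you prefer to script the fix, here is a minimal sketch (assumes GNU sed and that you are inside the Text_datagen directory):
$ make clean
$ sed -i '1i #pragma GCC diagnostic ignored "-Wwrite-strings"' gen_random_text.cpp pgen_random_text.cpp
$ sed -i '1i #include <cstring>' gen_random_text.cpp pgen_random_text.cpp
$ make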
References:
https://www.youtube.com/watch?v=PZQaN9wTIsQ
https://blog.csdn.net/VVVLeHr/article/details/86697346
OK, if nothing went wrong, the build completes successfully: compiling Text data generate is done.
nano@nano10:/opt/module/BigDataBench5.0_MicroBenchmark/BigDataGeneratorSuite/Graph_datagen$ make
Errors occur.
- Error 1
Cause: Snap.o needs to be recompiled. Go into the Snap-core directory, run make again to generate a fresh Snap.o object file, and then move it up to the parent directory.
If there are errors about the incompatibility of Snap when executing the make command, users need to recompile snap-core and update Snap.o:
$ cd snap-core
$ make
$ mv Snap.o ../
And then execute the make command under the directory BigDataGeneratorSuite/Graph_datagen again:
$ cd ../
$ make
Solution: go into the snap-core directory and run make there to rebuild its files.
Reference: http://www.benchcouncil.org/BigDataBench/files/BigDataBench5.0-User-Manual.pdf
- Error 2
Cause: the file Makefile.config is missing.
Solution: find one, or write your own Makefile.config.
Download: https://github.com/AthenaHe/BeanchMark/blob/master/Makefile.config
- Error 3
Cause:
(1) ../glib-core/ds.h:280:5: warning: this 'for' clause does not guard... [-Wmisleading-indentation]
../glib-core/linalg.cpp:985:9: warning: this 'else' clause does not guard... [-Wmisleading-indentation]
Newer compilers are stricter: even when a for (or else) body consists of a single statement, it should be wrapped in braces, otherwise -Wmisleading-indentation fires.
Solution:
Open ../glib-core/ds.h, go to line 280, and add braces {} around the body of the for loop; open ../glib-core/linalg.cpp, go to line 985, and add braces {} around the body of the else clause.
(2) ../glib-core/bd.cpp:13:21: note: forward declaration of 'struct __exception'
int _matherr(struct __exception* e)
Solution:
Add the struct definition to bd.cpp:
struct __exception {
    int type;        /* Exception type */
    char *name;      /* Name of function causing exception */
    double arg1;     /* 1st argument to function */
    double arg2;     /* 2nd argument to function */
    double retval;   /* Function return value */
};
5. Compile the Table data generator (Table_datagen)
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite$ cd Table_datagen/personal_generator/
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Table_datagen/personal_generator$ make
Note: some scripts do not set the MAHOUT_HOME path and will fail when run. Manually add the path of the Mahout that ships with BigDataBench in the script, or configure it as an environment variable.
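For example, to configure it as an environment variable, a sketch (the absolute path below is an assumption based on the layout used later in this post, where the bundled Mahout sits next to the Hadoop workload directories):
$ export MAHOUT_HOME=/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/apache-mahout-0.10.2-compile
$ export PATH=$PATH:$MAHOUT_HOME/bin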
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Sort$ ./genData-sort.sh 1
# Generating command: ./genData-sort.sh <size>
# size: the input data size, GB
a=$1
let L=$a*10000000
#-----------------generating input data---------------
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/terasort/terasort-${a}G
$HADOOP_HOME/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen $L /hadoop/terasort/terasort-${a}G
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/terasort/terasort-out
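teragen writes fixed 100-byte rows, so L = size × 10,000,000 rows comes out to roughly size GB. A quick way to sanity-check the generated input (a sketch):
$ $HADOOP_HOME/bin/hadoop fs -du -s -h /hadoop/terasort/terasort-1G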
Run the workload
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Sort$ ./run-terasort.sh 1
# Running command: ./run-terasort.sh <size>
# size: the input data size, GB
a=$1
#-----------------running hadoop terasort-------------
$HADOOP_HOME/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /hadoop/terasort/terasort-${a}G /hadoop/terasort/terasort-out
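To check that the output really is sorted, the same examples jar ships a teravalidate job; a sketch (the report path /hadoop/terasort/terasort-validate is my own choice):
$ $HADOOP_HOME/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teravalidate /hadoop/terasort/terasort-out /hadoop/terasort/terasort-validate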
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/wordcount$./genData-wc.sh 1
# Generating command: ./genData-wc.sh <size>
# size: the input data size, GB
#----------------------------genenrate-data----------------------------#
curdir=`pwd`
a=$1
let L=a*2
cd ../../BigDataGeneratorSuite/Text_datagen/
rm -fr ./gen_data/$a"GB"-wordcountHP
./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-wordcountHP
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/wd/$a"GB"-wordcountHP
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/wd
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-wordcountHP /hadoop/wd
Run the workload
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/wordcount$./run-wordcount.sh 1
# Running command: ./run-wordcount.sh <size>
# size: the input data size, GB
a=$1
#-----------------------------run-workload-----------------------------#
echo "running wordcount"
cd $curdir
cd ./externals/shell/industryPack/hadoop/workloads/wordcount
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/wd/wordcountHP-result
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /hadoop/wd/$a"GB"-wordcountHP /hadoop/wd/wordcountHP-result
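To spot-check the word counts afterwards, a sketch (part file names depend on the reducer count):
$ ${HADOOP_HOME}/bin/hadoop fs -cat /hadoop/wd/wordcountHP-result/part-r-* | head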
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Grep$./genData-grep.sh 1
# Generating command: ./genData-grep.sh <size>
# size: the input data size, GB
#----------------------------genenrate-data----------------------------#
curdir=`pwd`
a=$1
let L=a*2
cd ../../BigDataGeneratorSuite/Text_datagen/
rm -fr ./gen_data/$a"GB"-grepHP
./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-grepHP
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/grep/$a"GB"-grepHP
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/grep/
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-grepHP /hadoop/grep/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Grep$./run-grep.sh 1
# Running command: ./run-grep.sh <size>
# size: the input data size, GB
a=$1
#-----------------------------run-workload-----------------------------#
echo "running grep"
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/grep/grepHP-result
# search pattern passed to the examples grep job; adjust as needed
${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /hadoop/grep/$a"GB"-grepHP /hadoop/grep/grepHP-result "dfs[a-z.]+"
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MD5$ ./genData-md5.sh 1
#!/bin/bash
# Generating command: ./genData-md5.sh <size>
# size: the input data size, GB
#----------------------------genenrate-data----------------------------#
a=$1
let L=a*2
cd ../../BigDataGeneratorSuite/Text_datagen/
rm -fr ./gen_data/$a"GB"-md5HP
./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-md5HP
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/md5/$a"GB"-md5HP
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/md5/
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-md5HP /hadoop/md5/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MD5$./run-md5.sh 1
#!/bin/bash
# Running command: ./run-md5.sh <size>
# size: the input data size, GB
a=$1
#-----------------------------run-workload-----------------------------#
echo "running wordcount"
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/md5/md5HP-result
$HADOOP_HOME/bin/hadoop jar md5/DwarfMD5.jar DwarfMD5 /hadoop/md5/$a"GB"-md5HP /hadoop/md5/md5HP-result
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MatrixMult$ ./genData-matMult.sh 0.2 10 10 10
#!/bin/bash
# Generating command: ./genData-matMult.sh <sparsity> <row_i> <col_i> <col_j>
# sparsity: the percentage of zero elements, ranges from 0 to 1.
# row_i: the row number of matrix A
# col_i: the column number of matrix A
# col_j: the column number of matrix B
#----------------------------genenrate-data----------------------------#
sparsity=$1
row_i=$2
col_i=$3
col_j=$4
cd genData-Matrix
rm -f data-kmeans
make
sh generate-matrix.sh int $row_i $col_i $sparsity
mv data-kmeans mat1
sh generate-matrix.sh int $col_i $col_j $sparsity
mv data-kmeans mat2
$HADOOP_HOME/bin/hadoop fs -rmr /hadoop/matMult/
$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/matMult/
$HADOOP_HOME/bin/hadoop fs -put mat* /hadoop/matMult/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MatrixMult$ ./run-matMult.sh 0.2 10 10 10
#!/bin/bash
# Running command: ./run-matMult.sh <sparsity> <row_i> <col_i> <col_j>
# sparsity: the percentage of zero elements, ranges from 0 to 1.
# row_i: the row number of matrix A
# col_i: the column number of matrix A
# col_j: the column number of matrix B
sparsity=$1
row_i=$2
col_i=$3
col_j=$4
MAHOUT_HOME=../apache-mahout-0.10.2-compile
#-----------------------------run-workload-----------------------------#
echo "running matMult"
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/matMult/mat*-seq /hadoop/matMult/mat-out
$MAHOUT_HOME/bin/mahout seqdirectory --input /hadoop/matMult/mat1 --output /hadoop/matMult/mat1-seq
$MAHOUT_HOME/bin/mahout seqdirectory --input /hadoop/matMult/mat2 --output /hadoop/matMult/mat2-seq
${MAHOUT_HOME}/bin/mahout matrixmult \
--numRowsA $row_i \
--numColsA $col_i \
--numRowsB $col_i \
--numColsB $col_j \
--inputPathA /hadoop/matMult/mat1-seq \
--inputPathB /hadoop/matMult/mat2-seq \
--outputPath /hadoop/matMult/mat-out
echo "hadoop matrix mulitiply end"
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/CC$ ./genData-cc.sh 5
#!/bin/bash
# Generating command: ./genData-cc.sh <log_vertex>
# log_vertex: indicates the vertex of the generated data, means vertex = 2^log_vertex
#----------------------------genenrate-data----------------------------#
curdir=`pwd`
I=$1
cd ../../BigDataGeneratorSuite/Graph_datagen
dir=/hadoop/cc
rm -fr ./gen_data/Google_genGraph_$I.txt
./gen_kronecker_graph -o:./gen_data/Google_genGraph_$I.txt -m:"0.8305 0.5573; 0.4638 0.3021" -i:$I
head ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_parameters_$I
sed 1,4d ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_genGraph_$I.tmp
mv ./gen_data/Google_genGraph_$I.tmp ./gen_data/Google_genGraph_$I.txt
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/cc/Google_genGraph_$I.txt
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/cc
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/Google_genGraph_$I.txt /hadoop/cc
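For a stochastic Kronecker generator, the graph has 2^I vertices and, in expectation, about (sum of the initiator matrix entries)^I edges; with the matrix above that sum is 2.1537, so I=5 gives roughly 2.1537^5 ≈ 46 edges. A quick sanity check on the generated file (a sketch):
$ wc -l ./gen_data/Google_genGraph_5.txt   # one edge per line once the header lines have been stripped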
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/CC$ ./run-cc.sh 5
#!/bin/bash
# Running command: ./run-cc.sh <log_vertex>
# log_vertex: indicates the vertex of the input data, means vertex = 2^log_vertex
reducers=12
I=$1
#--------------------------------------------run----------------------------#
${HADOOP_HOME}/bin/hadoop fs -rmr concmpt_curbm concmpt_tempbm concmpt_nextbm concmpt_output
${HADOOP_HOME}/bin/hadoop jar pegasus-2.0.jar pegasus.ConCmpt -D mapred.input.format.class=org.apache.hadoop.mapred.lib.NLineInputFormat -D mapred.line.input.format.linespermap=2500000 /hadoop/cc/Google_genGraph_$I.txt concmpt_curbm concmpt_tempbm concmpt_nextbm concmpt_output $I $reducers new makesym
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Kmeans$ ./genData-kmeans.sh 5
#!/bin/bash
##
# Parameter $I indicates that the vertex of the generated graph is 2^$I
# Generating command: ./genData-kmeans.sh <I>
##
I=$1
cd ../../BigDataGeneratorSuite/Graph_datagen
rm -fr ./gen_data/Facebook_genGragh_$I.txt
./gen_kronecker_graph -o:./gen_data/Facebook_genGragh_$I.txt -m:"0.9999 0.5887; 0.6254 0.3676" -i:$I
head -4 ./gen_data/Facebook_genGragh_$I.txt > ./gen_data/Facebook_parameters_$I
sed 1,4d ./gen_data/Facebook_genGragh_$I.txt > ./gen_data/Facebook_genGragh_$I.tmp
mv ./gen_data/Facebook_genGragh_$I.tmp ./gen_data/Facebook_genGragh_$I.txt
sed 's/[[:space:]][[:space:]]*/ /g' ./gen_data/Facebook_genGragh_$I.txt >./gen_data/testdata
${HADOOP_HOME}/bin/hadoop fs -rmr testdata
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/testdata testdata
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Kmeans$ ./run-Kmeans.sh 0.4 0.1 0.1 5
#!/bin/bash
##
# Kmeans running command:
# ./run-Kmeans.sh <t1> <t2> <cd> <x>
# t1: T1 threshold value (0-1), such as 0.4
# t2: T2 threshold value (0-1), such as 0.1
# cd: The convergence delta value (0-1), such as 0.1
# x: The max iteration number
##
#-----------------------------------run--------------------------------------------#
MAHOUT_HOME=../apache-mahout-0.10.2-compile
$HADOOP_HOME/bin/hadoop dfs -rmr output
${MAHOUT_HOME}/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
-i testdata \
-o output \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 $1 \
-t2 $2 \
-cd $3 \
-x $4
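To inspect the resulting clusters, Mahout's clusterdump utility can be used; a sketch (clusters-*-final is the directory the k-means driver writes on convergence):
$ ${MAHOUT_HOME}/bin/mahout clusterdump -i output/clusters-*-final -p output/clusteredPoints -o clusters.txt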
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/PageRank$ ./genData-pagerank.sh 5
#!/bin/bash
##
# Parameter $I indicates that the vertex of the generated graph is 2^$I
# Generating command: ./genData-pagerank.sh <I>
##
#----------------------------genenrate-data----------------------------#
I=$1
cd ../../BigDataGeneratorSuite/Graph_datagen
dir=/hadoop/pagerank
rm -fr ./gen_data/Google_genGraph_$I.txt
./gen_kronecker_graph -o:./gen_data/Google_genGraph_$I.txt -m:"0.8305 0.5573; 0.4638 0.3021" -i:$I
head ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_parameters_$I
sed 1,4d ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_genGraph_$I.tmp
mv ./gen_data/Google_genGraph_$I.tmp ./gen_data/Google_genGraph_$I.txt
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/pagerank/Google_genGraph_$I.txt
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/pagerank
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/Google_genGraph_$I.txt /hadoop/pagerank
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/PageRank$./run-pagerank.sh 5
#!/bin/bash
##
# Parameter $I indicates that the vertex of the generated graph is 2^$I
# Running command: ./run-pagerank.sh <I>
##
I=$1
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/pagerank/prtemp /hadoop/pagerank/output pr_distr pr_minmax pr_vector
${HADOOP_HOME}/bin/hadoop jar pegasus-2.0.jar pegasus.PagerankNaive /hadoop/pagerank/Google_genGraph_$I.txt /hadoop/pagerank/prtemp /hadoop/pagerank/output 450 4 3 nosym new
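A quick look at the result, a sketch (assuming PagerankNaive writes its final rank vector to the third path argument, /hadoop/pagerank/output, with the usual part-file naming):
$ ${HADOOP_HOME}/bin/hadoop fs -cat /hadoop/pagerank/output/part-* | head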
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/randSample$ ./genData-randSample.sh 1
# Generating command: ./genData-randSample.sh <size>
# size: the input data size, GB
#----------------------------genenrate-data----------------------------#
curdir=`pwd`
a=$1
let L=a*2
cd ../../BigDataGeneratorSuite/Text_datagen/
rm -fr ./gen_data/$a"GB"-randsampleHP
./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-randsampleHP
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/randsample/$a"GB"-randsampleHP
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/randsample
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-randsampleHP /hadoop/randsample
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/randSample$ ./run-randSample.sh 1 0.3
# Running command: ./run-randSample.sh <size> <sample_ratio>
# size: the input data size, GB
# sample_ratio: the sampling ratio, ranges from 0 to 1.
curdir=`pwd`
a=$1
#-----------------------------run-workload-----------------------------#
echo "running randsample"
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/randsample/randsampleHP-result
${HADOOP_HOME}/bin/hadoop jar RandSample/out/artifacts/RandSample_jar/RandSample.jar RandSample /hadoop/randsample/$a"GB"-randsampleHP /hadoop/randsample/randsampleHP-result $2
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/CF$ ./genData-cf.sh 1
#Command for generating data:
# ./genData-cf.sh #GB
#
a=`expr $1 \* 1024`
#-----------------generating input data---------------
rm -rf genData-CF/als_input.txt
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/cf/cf-${1}G
cd genData-CF
make
./ALS-DataGen $a
$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/cf/
$HADOOP_HOME/bin/hadoop fs -put als_input.txt /hadoop/cf/cf-${1}G
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/cf/cf-out /hadoop/cf/temp
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/CF$ ./run-cf.sh 1
Note: the script does not set the MAHOUT_HOME path and will fail when run. Manually add the path of the Mahout bundled with BigDataBench in the script, or configure it as an environment variable.
#!/bin/bash
#Run command: ./run-cf.sh <size> <numFeatures> <numIterations> <lambda>
# size: the input data size, GB
# numFeatures: the number of features
# numIterations: the number of iterations
# lambda: regularization parameter
MAHOUT_HOME=../apache-mahout-0.10.2-compile
#-----------------running hadoop cf-------------
${MAHOUT_HOME}/bin/mahout parallelALS \
-i /hadoop/cf/cf-${1}G \
-o /hadoop/cf/cf-out \
--numFeatures $2 \
--numIterations $3 \
--lambda $4 \
--tempDir /hadoop/cf/temp
#-----------------killing monitor script--------------
echo "hadoop cf end"
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Bayes$ ./genData-bayes.sh 1
#!/bin/bash
#
#The parameter $1 indicates the data size (GB) to generate
#Command for generate data: ./genData-bayes.sh #GB
#
#------------generate-data---------------------
a=$1
cd ../../BigDataGeneratorSuite/Text_datagen/
rm -rf ./gen_data/data-naivebayes
let L=a*2
./gen_text_data.sh amazonMR1 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR1
./gen_text_data.sh amazonMR2 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR2
./gen_text_data.sh amazonMR3 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR3
./gen_text_data.sh amazonMR4 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR4
./gen_text_data.sh amazonMR5 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR5
#-------------------------------------put-data----------------------------#
${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/Bayes/*
${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/Bayes
${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/data-naivebayes /hadoop/Bayes/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Bayes$ ./run-bayes.sh
Note: the script does not set the MAHOUT_HOME path and will fail when run. Manually add the path of the Mahout bundled with BigDataBench in the script, or configure it as an environment variable.
#!/bin/bash
# Run command: ./run-bayes.sh
MAHOUT_HOME=../apache-mahout-0.10.2-compile
#--------------------------------------run-workload-----------------------#
#Generates input dataset for training & testing classifier
dir=/hadoop/Bayes
echo "Creating sequence files from naivebayes-naivebayes data"
${MAHOUT_HOME}/bin/mahout seqdirectory \
-i /hadoop/Bayes/data-naivebayes \
-o /hadoop/Bayes/naivebayes-seq -ow
echo "Converting sequence files to vectors"
${MAHOUT_HOME}/bin/mahout seq2sparse \
-i /hadoop/Bayes/naivebayes-seq \
-o /hadoop/Bayes/naivebayes-vectors -lnorm -nv -wt tfidf
echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
${MAHOUT_HOME}/bin/mahout split \
-i /hadoop/Bayes/naivebayes-vectors/tfidf-vectors \
--trainingOutput /hadoop/Bayes/naivebayes-train-vectors \
--testOutput /hadoop/Bayes/naivebayes-test-vectors \
--randomSelectionPct 70 --overwrite --sequenceFiles -xm sequential
#Trains the classifier
echo "Training Naive Bayes model"
${MAHOUT_HOME}/bin/mahout trainnb \
-i /hadoop/Bayes/naivebayes-train-vectors \
-o /hadoop/Bayes/model \
-li /hadoop/Bayes/labelindex \
-ow #$c
#------------------------------------------run------------------------------#
hadoop dfs -rmr /hadoop/Bayes/naivebayes-testing
${MAHOUT_HOME}/bin/mahout testnb \
-i /hadoop/Bayes/naivebayes-test-vectors \
-m /hadoop/Bayes/model \
-l /hadoop/Bayes/labelindex \
-ow -o /hadoop/Bayes/naivebayes-testing #$c
Generate the data
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/bin$ ./genData_Index.sh
#!/bin/bash
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
echo "========== preparing nutchindex data =========="
# configure
DIR=`cd $bin/../; pwd`
. "${DIR}/conf/hibench-config.sh"
. "${DIR}/conf/configure.sh"
# compress
if [ $COMPRESS -eq 1 ]; then
COMPRESS_OPT="-c ${COMPRESS_CODEC}"
fi
rm -rf $TMPLOGFILE
hadoop dfs -rmr /Nutch
hadoop dfs -mkdir /Nutch
# generate data
OPTION="-t nutch \
-b ${NUTCH_BASE_HDFS} \
-n ${NUTCH_INPUT} \
-m ${NUM_MAPS} \
-r ${NUM_REDS} \
-p ${PAGES} \
-o sequence"
$HADOOP_EXECUTABLE jar ../conf/datatools.jar HiBench.DataGen ${OPTION} ${COMPRESS_OPT}
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/bin$ ./run_Index.sh
#!/bin/bash
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
echo "========== running nutchindex data =========="
# configure
DIR=`cd $bin/../; pwd`
. "${DIR}/conf/hibench-config.sh"
. "${DIR}/conf/configure.sh"
export NUTCH_HOME=$BigdataBench_Home/SearchEngine/Index/nutch-1.2-hadoop1
cd $NUTCH_HOME
export NUTCH_CONF_DIR=$HADOOP_CONF_DIR:$NUTCH_HOME/conf
hadoop dfs -rmr /Nutch/Output
hadoop dfs -rmr $INPUT_HDFS/indexes
# run bench
../nutch-1.2-hadoop1/bin/nutch index $COMPRESS_OPTS $OUTPUT_HDFS $INPUT_HDFS/crawldb $INPUT_HDFS/linkdb $INPUT_HDFS/segments/*
The image dataset can be obtained from http://www.image-net.org/challenges/LSVRC/2014/
Upload the dataset to HDFS:
hdfs://192.168.1.101:9000/hadoop/sift/data/image1G.hib
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/SIFT$ ./run-sift.sh 1
#!/bin/bash
##
#Running command: ./run-sift.sh <imgsize>
# imgsize: the total size of the images, GB
##
#----------------check whether opencv is installed--------#
isopencv=`pkg-config --modversion opencv`
strB="Package opencv was not found"
result=$(echo $isopencv | grep "${strB}")
echo $result
if [[ $result != "" ]];then
echo "no opencv"
exit 1
fi
if [[ $1 -ge 10 ]];then
a=10
else
a=1
fi
#-----------------generating input data---------------
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/sift/data/image${a}G.*
$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/sift/data/
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/sift/sift-out
$HADOOP_HOME/bin/hadoop fs -put hadoop-SIFT/data/image${a}G.* /hadoop/sift/data/
#-----------------running hadoop sift-------------
$HADOOP_HOME/bin/hadoop jar hadoop-SIFT/hipi-SIFT/tools/sift/build/libs/sift.jar /hadoop/sift/data/image${a}G.hib /hadoop/sift/sift-out
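If the OpenCV check at the top of the script fails, on Ubuntu 18.04 OpenCV can be installed from the package archive so that pkg-config can find it; a sketch:
$ sudo apt-get install libopencv-dev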