Installing, configuring, and using the big data benchmark suite BigDataBench 5.0

BigDataBench 5.0 (released June 2019)
Prerequisites: Hadoop, JDK, g++, gcc, and GSL installed

1. Download the BigDataBench packages

http://125.39.136.212:8090/BigDataBench/BigDataBench_V5.0_BigData_MicroBenchmark
http://125.39.136.212:8090/BigDataBench/BigDataBench_V5.0_BigData_ComponentBenchmark
(a GitLab account is required to download)
If you don't want to register, you can download from here (extraction code: jusb). Choose the package that suits your system and needs (I am on Ubuntu 18.04, so I downloaded the tar.gz packages).

2. Extract the packages

I extracted them to /opt/module:

$ tar -zxvf BigDataBench_V5.0_BigData_MicroBenchmark.tar.gz -C /opt/module/
$ tar -zxvf BigDataBench_V5.0_BigData_ComponentBenchmark.tar.gz -C /opt/module/

Rename BigDataBench_V5.0_BigData_MicroBenchmark and BigDataBench_V5.0_BigData_ComponentBenchmark (the names are too long):

mv BigDataBench_V5.0_BigData_MicroBenchmark BigDataBench5.0_MicroBenchmark
mv BigDataBench_V5.0_BigData_ComponentBenchmark BigDataBench5.0_ComponentBenchmark

After extraction, take a look at the BigDataBench directory structure (screenshots of the two directory listings omitted).

Tip: it is best not to run ./prepare.sh directly; it fails with compilation errors (you can try it first, and if it succeeds you can skip the steps below).
Reason: many of the bundled files are too old, so they need to be updated manually and rebuilt with make.

3. Enter the installation directory and compile the text data generator (Text data generate)

Download and install the GSL library on Ubuntu (GSL is an open-source scientific computing library written in C).
Installation tutorial for reference: https://blog.csdn.net/weixin_34566605/article/details/103001334
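
If you only need a system-wide GSL, the Ubuntu package also works (I have not verified it against BigDataBench, and the steps below rebuild the bundled gsl-1.15 either way):

	sudo apt-get install libgsl-dev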

Enter BigDataGeneratorSuite/Text_datagen; there is an archive gsl-1.15.tar.gz (these are the outdated files that need to be recompiled).
$ cd ./BigDataGeneratorSuite/Text_datagen
$ tar -xf gsl-1.15.tar.gz
$ cd gsl-1.15/
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ ./autogen.sh      # regenerates config.guess, config.sub, etc.; the bundled ones are from 2009 and outdated
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ ./configure
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ make
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Text_datagen/gsl-1.15$ sudo make install   # root privileges required

Compile:

nano@nano10:/opt/module/BigDataBench5.0_MicroBenchmark/BigDataGeneratorSuite/Text_datagen$ make

An error occurs:
gen_random_text.cpp:43:23: warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
 char* alpha_temp2="/final.other";
                   ^~~~~~~~~~~~~~
gen_random_text.cpp:44:32: error: ‘strlen’ was not declared in this scope
 char* alphafile = new char[strlen(alpha_temp1)+strlen(modeldirname)+strlen(alpha_temp2)+1];
Causes:
1. warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
This warning appears because, in C and C++, when the two sides of an assignment have different types the compiler performs an implicit conversion so that the assignment can go through. Here the string literal on the right is converted to a non-const pointer, which effectively allows a const constant to be modified through it. Whether this builds and runs depends on both the compiler and the operating system: some compilers accept it, some raise an error, and even if it compiles the process may still be killed at runtime. Assigning a string literal directly to a char* is considered deprecated and is only still tolerated for compatibility with old code. To silence the warning, add #pragma GCC diagnostic ignored "-Wwrite-strings" at the top of the program.
2. error: ‘strlen’ was not declared in this scope
The compiler does not include cstring by default, so the header has to be added: put #include <cstring> at the top of the program.
Solution:
Run make clean first, then add the following at the top of gen_random_text.cpp and pgen_random_text.cpp:
#include <cstring>
#pragma GCC diagnostic ignored "-Wwrite-strings"

Then run make again.
References:
https://www.youtube.com/watch?v=PZQaN9wTIsQ
https://blog.csdn.net/VVVLeHr/article/details/86697346

OK. If nothing goes wrong, the build finishes successfully and compiling the text data generator (Text data generate) is done.

4. Compile the graph data generator (Graph data generate)

nano@nano10:/opt/module/BigDataBench5.0_MicroBenchmark/BigDataGeneratorSuite/Graph_datagen$ make

Errors occur:

  1. Error 1

    Cause: Snap.o needs to be recompiled. Enter the snap-core directory and run make again to produce a new Snap.o object file, then move it up into the parent directory.
    If there are errors about the incompatibility of Snap when executing the make command, recompile snap-core and update Snap.o:
    $ cd snap-core
    $ make
    $ mv Snap.o ../
    Then execute the make command under BigDataGeneratorSuite/Graph_datagen again:
    $ cd ../
    $ make
    Solution: enter the snap-core directory and rebuild the files there with make.
    Reference: http://www.benchcouncil.org/BigDataBench/files/BigDataBench5.0-User-Manual.pdf
  2. Error 2
    Cause: the Makefile.config file is missing.
    Solution: obtain one, or write your own Makefile.config.
    Download: https://github.com/AthenaHe/BeanchMark/blob/master/Makefile.config
  3. Error 3

    Causes:
    (1) ../glib-core/ds.h:280:5: warning: this ‘for’ clause does not guard... [-Wmisleading-indentation]
    ../glib-core/linalg.cpp:985:9: warning: this ‘else’ clause does not guard... [-Wmisleading-indentation]
    Newer compilers are stricter: even when the for body is only a single statement, it should be wrapped in braces.
    Solution:
    Open ../glib-core/ds.h, go to line 280, and add braces {} around the for body; open ../glib-core/linalg.cpp, go to line 985, and add braces {} around the else body.
    (2) ../glib-core/bd.cpp:13:21: note: forward declaration of ‘struct __exception’
    int _matherr(struct __exception* e)
    Solution:
    Add the struct definition in bd.cpp:
    struct __exception {
        int type;       /* Exception type */
        char *name;     /* Name of function causing exception */
        double arg1;    /* 1st argument to function */
        double arg2;    /* 2nd argument to function */
        double retval;  /* Function return value */
    };
5. Compile the table data generator (Table data generate)

nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite$ cd Table_datagen/personal_generator/
nano@nano1:/opt/module/BigDataBench5.0/BigDataGeneratorSuite/Table_datagen/personal_generator$ make

6. Generating data and running the workloads

Note: some of the scripts do not set MAHOUT_HOME and will fail when executed. Manually add the path of the Mahout bundled with BigDataBench to the script, or configure it as an environment variable (see the sketch below).
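
One way to do this (a sketch; the relative path is the one used by the MatrixMult, Kmeans, and CF scripts in this suite, while the absolute path is just my layout and must be adjusted):

	# Option 1: set it near the top of the affected script (relative to the workload directory)
	MAHOUT_HOME=../apache-mahout-0.10.2-compile

	# Option 2: export it once in the shell environment
	export MAHOUT_HOME=/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/apache-mahout-0.10.2-compile
	export PATH=$PATH:$MAHOUT_HOME/bin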

  • terasort
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Sort$ ./genData-sort.sh 1
	# Generating command: ./genData-sort.sh <size>
	#       size: the input data size, GB
	a=$1
	let L=$a*10000000
	#-----------------generating input data---------------
	$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/terasort/terasort-${a}G
	$HADOOP_HOME/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen $L /hadoop/terasort/terasort-${a}G
	$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/terasort/terasort-out
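
Optionally, sanity-check the generated size; teragen writes 100-byte rows, which is why the script uses let L=$a*10000000 (10,000,000 rows per GB):

	$HADOOP_HOME/bin/hadoop fs -du -s -h /hadoop/terasort/terasort-1G
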
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Sort$ ./run-terasort.sh 1
	# Running command: ./run-terasort.sh <size>
	#       size: the input data size, GB
	a=$1
	#-----------------running hadoop terasort-------------
	$HADOOP_HOME/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort /hadoop/terasort/terasort-${a}G /hadoop/terasort/terasort-out
  • Wordcount
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/wordcount$./genData-wc.sh 1
	# Generating command: ./genData-wc.sh <size>
	#       size: the input data size, GB
	#----------------------------genenrate-data----------------------------#
	curdir=`pwd`
	a=$1
	    let L=a*2
	    cd ../../BigDataGeneratorSuite/Text_datagen/
	    rm -fr ./gen_data/$a"GB"-wordcountHP
	    ./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-wordcountHP
	    ${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/wd/$a"GB"-wordcountHP
	    ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/wd
	    ${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-wordcountHP /hadoop/wd
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/wordcount$./run-wordcount.sh 1
	# Running command: ./run-wordcount.sh <size>
	#       size: the input data size, GB
	a=$1
	#-----------------------------run-workload-----------------------------#
	echo "running wordcount"
	cd $curdir
	cd ./externals/shell/industryPack/hadoop/workloads/wordcount
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/wd/wordcountHP-result
	${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  wordcount  /hadoop/wd/$a"GB"-wordcountHP  /hadoop/wd/wordcountHP-result
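
To peek at the result afterwards (optional; the part file pattern assumes the default MapReduce output naming):

	${HADOOP_HOME}/bin/hadoop fs -cat /hadoop/wd/wordcountHP-result/part-r-* | head
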
  • Grep
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Grep$./genData-grep.sh 1
	# Generating command: ./genData-grep.sh <size>
	#       size: the input data size, GB
	#----------------------------genenrate-data----------------------------#
	curdir=`pwd`
	a=$1
	    let L=a*2
	    cd ../../BigDataGeneratorSuite/Text_datagen/
	    rm -fr ./gen_data/$a"GB"-grepHP
	    ./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-grepHP
	    ${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/grep/$a"GB"-grepHP
	    ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/grep/
	    ${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-grepHP /hadoop/grep/
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0/Hadoop/Grep$./run-grep.sh 1
	# Running command: ./run-wordcount.sh 
	#       size: the input data size, GB
	a=$1
	#-----------------------------run-workload-----------------------------#
	echo "running wordcount"
	cd $curdir
	cd ./externals/shell/industryPack/hadoop/workloads/wordcount
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/wd/wordcountHP-result
	${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar  wordcount  /hadoop/wd/$a"GB"-wordcountHP  /hadoop/wd/wordcountHP-result
  • MD5
Generate 1 GB of data
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MD5$ ./genData-md5.sh 1
#!/bin/bash
# Generating command: ./genData-md5.sh <size>
#       size: the input data size, GB
#----------------------------genenrate-data----------------------------#
a=$1
    let L=a*2
    cd ../../BigDataGeneratorSuite/Text_datagen/
    rm -fr ./gen_data/$a"GB"-md5HP
    ./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-md5HP
    ${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/md5/$a"GB"-md5HP
    ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/md5/
    ${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-md5HP /hadoop/md5/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MD5$./run-md5.sh 1
	#!/bin/bash
	  
	# Running command: ./run-md5.sh <size>
	#       size: the input data size, GB
	a=$1
	#-----------------------------run-workload-----------------------------#
	echo "running wordcount"
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/md5/md5HP-result
	$HADOOP_HOME/bin/hadoop jar md5/DwarfMD5.jar DwarfMD5 /hadoop/md5/$a"GB"-md5HP /hadoop/md5/md5HP-result
  • MatrixMult (matrix multiplication)
    The Mahout path is missing here and needs to be added: MAHOUT_HOME=../apache-mahout-0.10.2-compile
	nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MatrixMult$ ./genData-matMult.sh 0.2 10 10 10
	#!/bin/bash
	# Generating command: ./genData-matMult.sh <sparsity> <row_i> <col_i> <col_j>
	#       sparsity: the percentage of zero elements, ranges from 0 to 1.
	#       row_i: the row number of matrix A
	#       col_i: the column number of matrix A
	#       col_j: the column number of matrix B
	#----------------------------genenrate-data----------------------------#
	sparsity=$1
	row_i=$2
	col_i=$3
	col_j=$4
	cd genData-Matrix
	rm -f data-kmeans
	make
	sh generate-matrix.sh int $row_i $col_i $sparsity
	mv data-kmeans mat1
	sh generate-matrix.sh int $col_i $col_j $sparsity
	mv data-kmeans mat2
	$HADOOP_HOME/bin/hadoop fs -rmr /hadoop/matMult/
	$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/matMult/
	$HADOOP_HOME/bin/hadoop fs -put mat* /hadoop/matMult/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/MatrixMult$ ./run-matMult.sh 0.2 10 10 10
	#!/bin/bash  
	# Running command: ./run-matMult.sh <sparsity> <row_i> <col_i> <col_j>
	#       sparsity: the percentage of zero elements, ranges from 0 to 1.
	#       row_i: the row number of matrix A
	#       col_i: the column number of matrix A
	#       col_j: the column number of matrix B
	sparsity=$1
	row_i=$2
	col_i=$3
	col_j=$4
	MAHOUT_HOME=../apache-mahout-0.10.2-compile
	#-----------------------------run-workload-----------------------------#
	echo "running matMult"
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/matMult/mat*-seq /hadoop/matMult/mat-out
	
	$MAHOUT_HOME/bin/mahout seqdirectory --input /hadoop/matMult/mat1 --output /hadoop/matMult/mat1-seq
	$MAHOUT_HOME/bin/mahout seqdirectory --input /hadoop/matMult/mat2 --output /hadoop/matMult/mat2-seq
	${MAHOUT_HOME}/bin/mahout matrixmult \
	        --numRowsA $row_i \
	        --numColsA $col_i \
	        --numRowsB $col_i \
	        --numColsB $col_j \
	        --inputPathA /hadoop/matMult/mat1-seq \
	        --inputPathB /hadoop/matMult/mat2-seq \
	        --outputPath /hadoop/matMult/mat-out
	
	echo "hadoop matrix mulitiply end"

  • CC(Connected Component)
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/CC$ ./genData-cc.sh 5
	#!/bin/bash
	# Generating command: ./genData-cc.sh <log_vertex>
	#       log_vertex: indicates the vertex of the generated data, means vertex = 2^log_vertex
	#----------------------------genenrate-data----------------------------#
	curdir=`pwd`
	I=$1
	cd ../../BigDataGeneratorSuite/Graph_datagen
	dir=/hadoop/cc
	rm -fr ./gen_data/Google_genGraph_$I.txt
	./gen_kronecker_graph  -o:./gen_data/Google_genGraph_$I.txt -m:"0.8305 0.5573; 0.4638 0.3021" -i:$I
	head ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_parameters_$I
	sed 1,4d ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_genGraph_$I.tmp
	mv ./gen_data/Google_genGraph_$I.tmp ./gen_data/Google_genGraph_$I.txt
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/cc/Google_genGraph_$I.txt
	${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/cc
	${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/Google_genGraph_$I.txt /hadoop/cc
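
Optionally sanity-check the generated graph before running the workload; with -i:5 the Kronecker generator produces a graph on 2^5 = 32 vertices, so the edge list is tiny:

	wc -l ./gen_data/Google_genGraph_5.txt      # one edge per line after the 4 header lines were stripped
	${HADOOP_HOME}/bin/hadoop fs -ls /hadoop/cc
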
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/CC$ ./run-cc.sh 5
	#!/bin/bash  
	# Running command: ./run-cc.sh <log_vertex>
	#       log_vertex: indicates the vertex of the input data, means vertex = 2^log_vertex
	
	reducers=12
	I=$1
	
	#--------------------------------------------run----------------------------#
	${HADOOP_HOME}/bin/hadoop fs -rmr concmpt_curbm concmpt_tempbm concmpt_nextbm concmpt_output
	${HADOOP_HOME}/bin/hadoop jar pegasus-2.0.jar pegasus.ConCmpt -D mapred.input.format.class=org.apache.hadoop.mapred.lib.NLineInputFormat -D mapred.line.input.format.linespermap=2500000 /hadoop/cc/Google_genGraph_$I.txt concmpt_curbm concmpt_tempbm concmpt_nextbm concmpt_output $I $reducers new makesym
  • Kmeans
    Add MAHOUT_HOME=../apache-mahout-0.10.2-compile to the script (same as above).
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Kmeans$ ./genData-kmeans.sh 5
	##!/bin/bash
	##
	# Parameter $I indicates that the vertex of the generated graph is 2^$I
	# Generating command: ./genData-kmeans.sh <I>
	##
	I=$1
	cd ../../BigDataGeneratorSuite/Graph_datagen
	rm -fr ./gen_data/Facebook_genGragh_$I.txt
	./gen_kronecker_graph  -o:./gen_data/Facebook_genGragh_$I.txt -m:"0.9999 0.5887; 0.6254 0.3676" -i:$I
	head -4 ./gen_data/Facebook_genGragh_$I.txt > ./gen_data/Facebook_parameters_$I
	sed 1,4d ./gen_data/Facebook_genGragh_$I.txt > ./gen_data/Facebook_genGragh_$I.tmp
	mv ./gen_data/Facebook_genGragh_$I.tmp ./gen_data/Facebook_genGragh_$I.txt
	sed 's/[[:space:]][[:space:]]*/ /g' ./gen_data/Facebook_genGragh_$I.txt >./gen_data/testdata
	${HADOOP_HOME}/bin/hadoop fs -rmr testdata
	${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/testdata
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Kmeans$ ./run-Kmeans.sh 0.4 0.1 0.1 5
	##!/bin/bash
	##
	# Kmeans running command:
	# ./run-Kmeans.sh <t1> <t2> <cd> <x>
	#       t1: T1 threshold value (0-1), such as 0.4
	#       t2: T2 threshold value (0-1), such as 0.1
	#       cd: The convergence delta value (0-1), such as 0.1
	#       x: The max iteration number
	##
	#-----------------------------------run--------------------------------------------#
	MAHOUT_HOME=../apache-mahout-0.10.2-compile
	$HADOOP_HOME/bin/hadoop dfs -rmr output
	${MAHOUT_HOME}/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
	        -i testdata \
	        -o output \
	        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
	        -t1 $1 \
	        -t2 $2 \
	        -cd $3 \
	        -x $4 \
  • Pagerank
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/PageRank$ ./genData-pagerank.sh 5
	#!/bin/bash
	##
	# Parameter $I indicates that the vertex of the generated graph is 2^$I
	# Generating command: ./genData-pagerank.sh <I>
	##
	#----------------------------genenrate-data----------------------------#
	I=$1
	cd ../../BigDataGeneratorSuite/Graph_datagen
	dir=/hadoop/pagerank
	rm -fr ./gen_data/Google_genGraph_$I.txt
	./gen_kronecker_graph  -o:./gen_data/Google_genGraph_$I.txt -m:"0.8305 0.5573; 0.4638 0.3021" -i:$I
	head ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_parameters_$I
	sed 1,4d ./gen_data/Google_genGraph_$I.txt > ./gen_data/Google_genGraph_$I.tmp
	mv ./gen_data/Google_genGraph_$I.tmp ./gen_data/Google_genGraph_$I.txt
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/pagerank/Google_genGraph_$I.txt
	${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/pagerank
	${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/Google_genGraph_$I.txt /hadoop/pagerank
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/PageRank$./run-pagerank.sh 5
	#!/bin/bash  
	##
	# Parameter $I indicates that the vertex of the generated graph is 2^$I
	# Running command: ./run-pagerank.sh <I>
	##
	I=$1
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/pagerank/prtemp /hadoop/pagerank/output pr_distr pr_minmax pr_vector
	${HADOOP_HOME}/bin/hadoop jar pegasus-2.0.jar pegasus.PagerankNaive /hadoop/pagerank/Google_genGraph_$I.txt /hadoop/pagerank/prtemp /hadoop/pagerank/output 450 4 3 nosym new
  • RandSample
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/randSample$ ./genData-randSample.sh 1
	# Generating command: ./genData-randSample.sh <size>
	#       size: the input data size, GB
	#----------------------------genenrate-data----------------------------#
	curdir=`pwd`
	a=$1
	    let L=a*2
	    cd ../../BigDataGeneratorSuite/Text_datagen/
	    rm -fr ./gen_data/$a"GB"-randsampleHP
	    ./gen_text_data.sh lda_wiki1w $L 8000 10000 ./gen_data/$a"GB"-randsampleHP
	    ${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/randsample/$a"GB"-randsampleHP
	    ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/randsample
	    ${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/$a"GB"-randsampleHP /hadoop/randsample
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_MicroBenchmark/Hadoop/randSample$ ./run-randSample.sh 1 0.3
	# Running command: ./run-randSample.sh <size> <sample_ratio>
	#       size: the input data size, GB
	#       sample_ratio: the sampling ratio, ranges from 0 to 1.
	curdir=`pwd`
	a=$1
	#-----------------------------run-workload-----------------------------#
	echo "running randsample"
	${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/randsample/randsampleHP-result
	${HADOOP_HOME}/bin/hadoop jar RandSample/out/artifacts/RandSample_jar/RandSample.jar RandSample /hadoop/randsample/$a"GB"-randsampleHP  /hadoop/randsample/randsampleHP-result $2
  • CF (Collaborative Filtering)
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/CF$ ./genData-cf.sh 1
	#Command for generating data:
	#       ./genData-cf.sh <size>  #GB
	#
	
	a=`expr $1 \* 1024`
	
	#-----------------generating input data---------------
	rm -rf genData-CF/als_input.txt
	$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/cf/cf-${1}G
	cd genData-CF
	make
	./ALS-DataGen $a
	$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/cf/
	$HADOOP_HOME/bin/hadoop fs -put als_input.txt /hadoop/cf/cf-${1}G
	$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/cf/cf-out /hadoop/cf/temp
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/CF$ ./run-cf.sh 1
Note: the script does not set MAHOUT_HOME and will fail as shipped; manually add the path of the Mahout bundled with BigDataBench to the script, or configure it as an environment variable. The script also expects four arguments; see the example after the listing.
	#!/bin/bash 
	#Run command: ./run-cf.sh <size> <numFeatures> <numIterations> <lambda>
	#       size: the input data size, GB
	#       numFeatures: the number of features
	#       numIterations: the number of iterations
	#       lambda: regularization parameter
	MAHOUT_HOME=../apache-mahout-0.10.2-compile
	#-----------------running hadoop cf-------------
	${MAHOUT_HOME}/bin/mahout parallelALS \
	        -i /hadoop/cf/cf-${1}G \
	        -o /hadoop/cf/cf-out \
	        --numFeatures $2 \
	        --numIterations $3 \
	        --lambda $4 \
	        --tempDir /hadoop/cf/temp
	#-----------------killing monitor script--------------
	echo "hadoop cf end"
  • Bayes
Generate 1 GB of data
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Bayes$ ./genData-bayes.sh 1
	#!/bin/bash	  
	#
	#The parameter $1 indicates the data size (GB) to generate
	#Command for generate data: ./genData-bayes.sh <size>  #GB
	#	
	#------------generate-data---------------------
	a=$1
	cd ../../BigDataGeneratorSuite/Text_datagen/
	rm -rf ./gen_data/data-naivebayes
	    let L=a*2
	    ./gen_text_data.sh amazonMR1 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR1
	    ./gen_text_data.sh amazonMR2 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR2
	    ./gen_text_data.sh amazonMR3 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR3
	    ./gen_text_data.sh amazonMR4 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR4
	    ./gen_text_data.sh amazonMR5 $L 1900 11500 ./gen_data/data-naivebayes/amazonMR5
	#-------------------------------------put-data----------------------------#
	    ${HADOOP_HOME}/bin/hadoop fs -rmr /hadoop/Bayes/*
	    ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /hadoop/Bayes
	    ${HADOOP_HOME}/bin/hadoop fs -put ./gen_data/data-naivebayes /hadoop/Bayes/
Run the workload
nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Bayes$ ./run-bayes.sh
Note: the script does not set MAHOUT_HOME and will fail as shipped; manually add the path of the Mahout bundled with BigDataBench to the script, or configure it as an environment variable.
	#!/bin/bash  
	# Run command: ./run-bayes.sh
	MAHOUT_HOME=../apache-mahout-0.10.2-compile
	#--------------------------------------run-workload-----------------------#
	#Generates input dataset for training & testing classifier
	dir=/hadoop/Bayes
	echo "Creating sequence files from naivebayes-naivebayes data"
	 ${MAHOUT_HOME}/bin/mahout seqdirectory \
	  -i /hadoop/Bayes/data-naivebayes \
	  -o /hadoop/Bayes/naivebayes-seq -ow
	echo "Converting sequence files to vectors"
	 ${MAHOUT_HOME}/bin/mahout seq2sparse \
	  -i /hadoop/Bayes/naivebayes-seq \
	  -o /hadoop/Bayes/naivebayes-vectors  -lnorm -nv  -wt tfidf
	echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
	 ${MAHOUT_HOME}/bin/mahout split \
	  -i /hadoop/Bayes/naivebayes-vectors/tfidf-vectors \
	  --trainingOutput /hadoop/Bayes/naivebayes-train-vectors \
	  --testOutput /hadoop/Bayes/naivebayes-test-vectors  \
	  --randomSelectionPct 70 --overwrite --sequenceFiles -xm sequential
	#Trains the classifier
	echo "Training Naive Bayes model"
	 ${MAHOUT_HOME}/bin/mahout trainnb \
	  -i /hadoop/Bayes/naivebayes-train-vectors  \
	  -o /hadoop/Bayes/model \
	  -li /hadoop/Bayes/labelindex \
	  -ow #$c
	#------------------------------------------run------------------------------#
	hadoop dfs -rmr /hadoop/Bayes/naivebayes-testing
	${MAHOUT_HOME}/bin/mahout testnb \
	 -i /hadoop/Bayes/naivebayes-test-vectors \
	 -m /hadoop/Bayes/model \
	 -l /hadoop/Bayes/labelindex \
	 -ow -o /hadoop/Bayes/naivebayes-testing #$c


  • Index
    First edit the configuration files:
    nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/conf$ vim hibench-config.sh
    nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/conf$ vim configure.sh
    (screenshots of the edited files omitted)
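
The variables that genData_Index.sh and run_Index.sh below rely on look roughly like this; the names come from those scripts, but the values are assumptions for my environment and must be adjusted:

	# conf/hibench-config.sh (excerpt)
	HADOOP_HOME=/opt/module/hadoop
	HADOOP_EXECUTABLE=$HADOOP_HOME/bin/hadoop
	HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

	# conf/configure.sh (excerpt)
	COMPRESS=0                       # set to 1 to compress with COMPRESS_CODEC
	PAGES=1000                       # number of pages to generate
	NUM_MAPS=12
	NUM_REDS=6
	NUTCH_BASE_HDFS=/Nutch
	NUTCH_INPUT=$NUTCH_BASE_HDFS/Input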
Generate data
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/bin$ ./genData_Index.sh 
	#!/bin/bash
	bin=`dirname "$0"`
	bin=`cd "$bin"; pwd`
	
	echo "========== preparing nutchindex data =========="
	
	# configure
	DIR=`cd $bin/../; pwd`
	. "${DIR}/conf/hibench-config.sh"
	. "${DIR}/conf/configure.sh"
	
	# compress
	if [ $COMPRESS -eq 1 ]; then
	    COMPRESS_OPT="-c ${COMPRESS_CODEC}"
	fi
	
	rm -rf $TMPLOGFILE
	hadoop dfs -rmr /Nutch
	hadoop dfs -mkdir /Nutch
	# generate data
	OPTION="-t nutch \
	        -b ${NUTCH_BASE_HDFS} \
	        -n ${NUTCH_INPUT} \
	        -m ${NUM_MAPS} \
	        -r ${NUM_REDS} \
	        -p ${PAGES} \
	        -o sequence"
	
	$HADOOP_EXECUTABLE jar  ../conf/datatools.jar HiBench.DataGen ${OPTION} ${COMPRESS_OPT}
Run the workload
	nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/Index/bin$ ./run_Index.sh
	#!/bin/bash
	bin=`dirname "$0"`
	bin=`cd "$bin"; pwd`
	
	echo "========== running nutchindex data =========="
	# configure
	DIR=`cd $bin/../; pwd`
	. "${DIR}/conf/hibench-config.sh"
	. "${DIR}/conf/configure.sh"
	
	export NUTCH_HOME=$BigdataBench_Home/SearchEngine/Index/nutch-1.2-hadoop1
	cd $NUTCH_HOME
	export NUTCH_CONF_DIR=$HADOOP_CONF_DIR:$NUTCH_HOME/conf
	
	hadoop dfs -rmr /Nutch/Output
	hadoop dfs -rmr $INPUT_HDFS/indexes
	# run bench
	../nutch-1.2-hadoop1/bin/nutch index $COMPRESS_OPTS $OUTPUT_HDFS $INPUT_HDFS/crawldb $INPUT_HDFS/linkdb $INPUT_HDFS/segments/*
	

An error occurred here:

Cause:
Solution:

  • SIFT
    About SIFT (a local feature extraction algorithm): Scale-Invariant Feature Transform (SIFT) is a computer vision algorithm for detecting and describing local features in images. It searches for extrema in scale space and extracts their position, scale, and rotation-invariant descriptors. The algorithm was published by David Lowe in 1999 and refined in 2004. Applications include object recognition, robotic mapping and navigation, image stitching, 3D model reconstruction, gesture recognition, image tracking, and motion matching. The algorithm was patented, with the University of British Columbia as the patent holder.
    Dataset download address:
http://www.image-net.org/challenges/LSVRC/2014/

Upload the dataset to HDFS, for example:

 hdfs://192.168.1.101:9000/hadoop/sift/data/image1G.hib

Run the workload

nano@nano1:/opt/module/BigDataBench5.0_ComponentBenchmark/Hadoop/SIFT$ ./run-sift.sh 1
#!/bin/bash
##
#Running command: ./run-sift.sh <imgsize>
#       imgsize: the total size of the images, GB
##
#----------------check whether opencv is installed--------#  
isopencv=`pkg-config --modversion opencv`
strB="Package opencv was not found"
result=$(echo $isopencv | grep "${strB}")
echo $result
if [[ $result != "" ]];then
echo "no opencv"
exit 1
fi
if [[ $1 -ge 10 ]];then
a=10
else
a=1
fi
#-----------------generating input data---------------
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/sift/data/image${a}G.*
$HADOOP_HOME/bin/hadoop fs -mkdir -p /hadoop/sift/data/
$HADOOP_HOME/bin/hadoop dfs -rmr /hadoop/sift/sift-out
$HADOOP_HOME/bin/hadoop fs -put hadoop-SIFT/data/image${a}G.* /hadoop/sift/data/
#-----------------running hadoop sift-------------
$HADOOP_HOME/bin/hadoop jar hadoop-SIFT/hipi-SIFT/tools/sift/build/libs/sift.jar /hadoop/sift/data/image${a}G.hib /hadoop/sift/sift-out
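
run-sift.sh exits early if pkg-config cannot find an opencv module. On Ubuntu 18.04 one way to satisfy that check (an assumption; any OpenCV install that ships a pkg-config file will do) is:

	sudo apt-get install libopencv-dev
	pkg-config --modversion opencv      # should now print a version instead of "was not found"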
