1. Hadoop Benchmarks
Hadoop ships with several benchmark suites, packaged in a handful of jar files. This article focuses on testing with the Cloudera (CDH) distribution.
[hsu@server01 ~]$ ls /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop* | egrep "examples|test"
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar
(1) Hadoop Test
Invoking hadoop-test.jar without any arguments lists all of the available test programs:
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
dfsthroughput: measure hdfs throughput
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode.
testarrayfile: A test for flat files of binary key/value pairs.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testrpc: A test for rpc.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testsetfile: A test for flat files of binary key/value pairs.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
These programs exercise Hadoop from a number of angles; TestDFSIO, mrbench, and nnbench are three of the most widely used.
(2) TestDFSIO write
TestDFSIO measures the I/O performance of HDFS. It uses a MapReduce job to perform reads and writes concurrently: each map task reads or writes one file, the map output collects statistics about the file just handled, and the reduce tasks aggregate those statistics to produce the summary. TestDFSIO's usage is as follows:
TestDFSIO
Usage: TestDFSIO [genericOptions] -read | -write | -append | -clean [-nrFiles N] [-fileSize Size[B|KB|MB|GB|TB]] [-resFile resultFileName] [-bufferSize Bytes] [-rootDir]
The following example writes ten 1000 MB files to HDFS:
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
15/01/13 15:14:17 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:14:17 INFO fs.TestDFSIO: nrFiles = 10
15/01/13 15:14:17 INFO fs.TestDFSIO: nrBytes (MB) = 1000.0
15/01/13 15:14:17 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:14:17 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:14:18 INFO fs.TestDFSIO: creating control file: 1048576000 bytes, 10 files
15/01/13 15:14:19 INFO fs.TestDFSIO: created control files for: 10 files
15/01/13 15:15:23 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
15/01/13 15:15:23 INFO fs.TestDFSIO: Date & time: Tue Jan 13 15:15:23 CST 2015
15/01/13 15:15:23 INFO fs.TestDFSIO: Number of files: 10
15/01/13 15:15:23 INFO fs.TestDFSIO: Total MBytes processed: 10000.0
15/01/13 15:15:23 INFO fs.TestDFSIO: Throughput mb/sec: 29.67623230554649
15/01/13 15:15:23 INFO fs.TestDFSIO: Average IO rate mb/sec: 29.899526596069336
15/01/13 15:15:23 INFO fs.TestDFSIO: IO rate std deviation: 2.6268824639446526
15/01/13 15:15:23 INFO fs.TestDFSIO: Test exec time sec: 64.203
15/01/13 15:15:23 INFO fs.TestDFSIO:
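Throughput mb/sec is the total megabytes processed divided by the sum of the per-task I/O times, while Average IO rate mb/sec is the arithmetic mean of the individual task rates; a large standard deviation usually points at slow or unevenly loaded nodes. TestDFSIO also appends each summary to a local result file (by default TestDFSIO_results.log in the current directory; the name can be changed with -resFile), so accumulated runs can be compared later:
$ cat TestDFSIO_results.log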
(3) TestDFSIO read
The following example reads the ten 1000 MB files back from HDFS (they must already exist from a previous -write run):
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
15/01/13 15:42:35 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:42:35 INFO fs.TestDFSIO: nrFiles = 10
15/01/13 15:42:35 INFO fs.TestDFSIO: nrBytes (MB) = 1000.0
15/01/13 15:42:35 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:42:35 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:42:36 INFO fs.TestDFSIO: creating control file: 1048576000 bytes, 10 files
15/01/13 15:42:37 INFO fs.TestDFSIO: created control files for: 10 files
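The read test operates on the files the write test left under the benchmark directory; assuming the default baseDir shown above, they can be inspected with:
$ hadoop fs -ls /benchmarks/TestDFSIO/io_data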
(4) Cleaning up the test data
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -clean
15/01/13 15:46:51 INFO fs.TestDFSIO: TestDFSIO.1.7
15/01/13 15:46:51 INFO fs.TestDFSIO: nrFiles = 1
15/01/13 15:46:51 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
15/01/13 15:46:51 INFO fs.TestDFSIO: bufferSize = 1000000
15/01/13 15:46:51 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
15/01/13 15:46:52 INFO fs.TestDFSIO: Cleaning up test files
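-clean removes the whole /benchmarks/TestDFSIO directory from HDFS; listing the parent directory afterwards should confirm it is gone:
$ hadoop fs -ls /benchmarks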
(5) nnbench
nnbench stresses the NameNode by generating a large volume of HDFS-related requests. It can simulate creating, reading, renaming, and deleting files on HDFS.
The following example uses 12 mappers and 6 reducers to create 1000 files:
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
NameNode Benchmark 0.4
15/01/13 15:53:33 INFO hdfs.NNBench: Test Inputs:
15/01/13 15:53:33 INFO hdfs.NNBench: Test Operation: create_write
15/01/13 15:53:33 INFO hdfs.NNBench: Start time: 2015-01-13 15:55:33,585
15/01/13 15:53:33 INFO hdfs.NNBench: Number of maps: 12
15/01/13 15:53:33 INFO hdfs.NNBench: Number of reduces: 6
15/01/13 15:53:33 INFO hdfs.NNBench: Block Size: 1
15/01/13 15:53:33 INFO hdfs.NNBench: Bytes to write: 0
15/01/13 15:53:33 INFO hdfs.NNBench: Bytes per checksum: 1
15/01/13 15:53:33 INFO hdfs.NNBench: Number of files: 1000
15/01/13 15:53:33 INFO hdfs.NNBench: Replication factor: 3
15/01/13 15:53:33 INFO hdfs.NNBench: Base dir: /benchmarks/NNBench-server01
15/01/13 15:53:33 INFO hdfs.NNBench: Read file after open: true
15/01/13 15:53:34 INFO hdfs.NNBench: Deleting data directory
15/01/13 15:53:34 INFO hdfs.NNBench: Creating 12 control files
15/01/13 15:56:06 INFO hdfs.NNBench: -------------- NNBench -------------- :
15/01/13 15:56:06 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
15/01/13 15:56:06 INFO hdfs.NNBench: Date & time: 2015-01-13 15:56:06,539
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench: Test Operation: create_write
15/01/13 15:56:06 INFO hdfs.NNBench: Start time: 2015-01-13 15:55:33,585
15/01/13 15:56:06 INFO hdfs.NNBench: Maps to run: 12
15/01/13 15:56:06 INFO hdfs.NNBench: Reduces to run: 6
15/01/13 15:56:06 INFO hdfs.NNBench: Block Size (bytes): 1
15/01/13 15:56:06 INFO hdfs.NNBench: Bytes to write: 0
15/01/13 15:56:06 INFO hdfs.NNBench: Bytes per checksum: 1
15/01/13 15:56:06 INFO hdfs.NNBench: Number of files: 1000
15/01/13 15:56:06 INFO hdfs.NNBench: Replication factor: 3
15/01/13 15:56:06 INFO hdfs.NNBench: Successful file operations: 0
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench: # maps that missed the barrier: 0
15/01/13 15:56:06 INFO hdfs.NNBench: # exceptions: 0
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench: TPS: Create/Write/Close: 0
15/01/13 15:56:06 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 0.0
15/01/13 15:56:06 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: NaN
15/01/13 15:56:06 INFO hdfs.NNBench: Avg Lat (ms): Close: NaN
15/01/13 15:56:06 INFO hdfs.NNBench:
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: AL Total #1: 0
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: AL Total #2: 0
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 0
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 0.0
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: Late maps: 0
15/01/13 15:56:06 INFO hdfs.NNBench: RAW DATA: # of exceptions: 0
15/01/13 15:56:06 INFO hdfs.NNBench:
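Besides create_write, nnbench also supports the open_read, rename, and delete operations. As a sketch, a follow-up read pass over the same files would keep the parameters above and change only the operation:
$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench -operation open_read -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`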
(6) mrbench
mrbench runs a small job many times over, checking whether small jobs on the cluster complete repeatably and efficiently. mrbench's usage is as follows:
MRBenchmark.0.0.2
Usage: mrbench [-baseDir <base DFS path>] [-jar <local path to job jar file>] [-numRuns <number of times to run the job>] [-maps <number of maps>] [-reduces <number of reduces>] [-inputLines <number of input lines>] [-inputType <ascending|descending|random>] [-verbose]
The following example runs a small job 50 times:
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 50
MRBenchmark.0.0.2
15/01/13 16:17:19 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
15/01/13 16:17:20 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_331064064.txt
15/01/13 16:17:20 INFO mapred.MRBench: Running job 0: input=hdfs://server01:8020/benchmarks/MRBench/mr_input output=hdfs://server01:8020/benchmarks/MRBench/mr_output/output_556018847
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 26748
The result above shows an average job completion time of roughly 26.7 seconds.
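The map, reduce, and input parameters can be varied as well; a heavier run might look like the following sketch (the values here are arbitrary illustrations, not recommendations):
$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 20 -maps 10 -reduces 5 -inputLines 100 -inputType random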
(7) Hadoop Examples
Beyond the tests discussed above, Hadoop also ships with a set of example programs such as WordCount and TeraSort, packaged in hadoop-examples*.jar.
[hsu@server01 ~]$ ls /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples*
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples-2.5.0-mr1-cdh5.2.0.jar
/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
Executing the following command lists all of the example programs:
[hsu@server01 ~]$ sudo hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
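As a quick smoke test of the examples jar, WordCount can be run against any text already in HDFS (the input and output paths below are placeholders):
$ hadoop jar /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount /tmp/wc-input /tmp/wc-output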
(8) TeraSort
A complete TeraSort benchmark runs in three steps:
1. Generate random input data with TeraGen
2. Run TeraSort on that input data
3. Validate the sorted output with TeraValidate
The input data does not have to be regenerated for every run; once it has been generated, subsequent tests can skip the first step. A combined sketch of all three steps follows the TeraGen usage below.
TeraGen's usage is as follows:
$ hadoop jar hadoop-*examples*.jar teragen <number of 10-byte rows> <output dir>
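Putting the three steps together, a minimal end-to-end sketch with hypothetical HDFS paths and a small 1 GB dataset (100,000,000 ten-byte rows) would be:
$ hadoop jar hadoop-*examples*.jar teragen 100000000 /benchmarks/terasort-input
$ hadoop jar hadoop-*examples*.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output
$ hadoop jar hadoop-*examples*.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-report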