Benchmark performance testing tools: TestDFSIO/TeraSort

TestDFSIO

    # Usage
    hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]

TestDFSIO launches one map task per file.

Write test: generate 10 files of 100 MB each

    pwd
    /home/mr/yarn/share/hadoop/mapreduce
    hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100

Results on the vmaxspark1 cluster:

    15/06/29 22:59:34 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
    15/06/29 22:59:34 INFO fs.TestDFSIO: Date & time: Mon Jun 29 22:59:34 CST 2015
    15/06/29 22:59:34 INFO fs.TestDFSIO: Number of files: 10
    15/06/29 22:59:34 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
    15/06/29 22:59:34 INFO fs.TestDFSIO: Throughput mb/sec: 2.2699105201272967
    15/06/29 22:59:34 INFO fs.TestDFSIO: Average IO rate mb/sec: 11.470916748046875
    15/06/29 22:59:34 INFO fs.TestDFSIO: IO rate std deviation: 15.038400232638908
    15/06/29 22:59:34 INFO fs.TestDFSIO: Test exec time sec: 80.936
    15/06/29 22:59:34 INFO fs.TestDFSIO:
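
When several runs need to be compared, the summary lines above can be pulled into a dictionary. The following is only a minimal sketch, assuming the log output has been captured as text; `parse_dfsio_summary` is a hypothetical helper written for this note, not something shipped with Hadoop.

```python
import re

# Summary lines copied from the write test above.
LOG = """\
15/06/29 22:59:34 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
15/06/29 22:59:34 INFO fs.TestDFSIO: Date & time: Mon Jun 29 22:59:34 CST 2015
15/06/29 22:59:34 INFO fs.TestDFSIO: Number of files: 10
15/06/29 22:59:34 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
15/06/29 22:59:34 INFO fs.TestDFSIO: Throughput mb/sec: 2.2699105201272967
15/06/29 22:59:34 INFO fs.TestDFSIO: Average IO rate mb/sec: 11.470916748046875
15/06/29 22:59:34 INFO fs.TestDFSIO: IO rate std deviation: 15.038400232638908
15/06/29 22:59:34 INFO fs.TestDFSIO: Test exec time sec: 80.936
"""


def parse_dfsio_summary(text):
    """Return {metric name: float} for the numeric 'name: value' lines."""
    metrics = {}
    for line in text.splitlines():
        # Keep only what follows the logger prefix "fs.TestDFSIO: ".
        _, sep, rest = line.partition("fs.TestDFSIO: ")
        if not sep:
            continue
        # Non-numeric lines (banner, date) do not match and are skipped.
        match = re.fullmatch(r"(.+?):\s*(-?\d+(?:\.\d+)?)", rest.strip())
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


summary = parse_dfsio_summary(LOG)
for name, value in summary.items():
    print(f"{name} = {value}")
```

The banner and date lines are deliberately ignored because their values are not numbers.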

Read test: read 10 files of 100 MB each

    hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100
    15/06/29 23:02:28 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
    15/06/29 23:02:28 INFO fs.TestDFSIO: Date & time: Mon Jun 29 23:02:28 CST 2015
    15/06/29 23:02:28 INFO fs.TestDFSIO: Number of files: 10
    15/06/29 23:02:28 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
    15/06/29 23:02:28 INFO fs.TestDFSIO: Throughput mb/sec: 1540.8320493066255
    15/06/29 23:02:28 INFO fs.TestDFSIO: Average IO rate mb/sec: 1566.176025390625
    15/06/29 23:02:28 INFO fs.TestDFSIO: IO rate std deviation: 207.60517212156435
    15/06/29 23:02:28 INFO fs.TestDFSIO: Test exec time sec: 19.235

Clean up the test data: by default the test files are stored under /benchmarks/TestDFSIO on HDFS

    hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -clean

Notes:

Throughput mb/sec for a TestDFSIO job using N map tasks is defined as follows. The index 1 <= i <= N denotes the individual map tasks:

$$\mathrm{Throughput}(N) = \frac{\sum_{i=1}^{N} \mathit{filesize}_i}{\sum_{i=1}^{N} \mathit{time}_i}$$

Average IO rate mb/sec is defined as:

$$\text{Average IO rate}(N) = \frac{\sum_{i=1}^{N} \mathit{rate}_i}{N} = \frac{\sum_{i=1}^{N} \frac{\mathit{filesize}_i}{\mathit{time}_i}}{N}$$

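From this definition, Throughput is effectively a per-map-task figure, so with 10 map tasks the actual concurrent throughput would be about 10 * 2.27 = 22.7 MB/s. (I am not sure about this and do not fully understand it; the estimate assumes all 10 map tasks run at the same time for the whole test.)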
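
To see why the two metrics can differ as much as they do in the write test (2.27 vs 11.47 MB/s), the two definitions can be coded directly. This is a sketch only: the per-task sizes and times below are made-up illustration values, since the per-task numbers of the real run are not shown in the log. The last two lines use the real summary figures from the write test.

```python
def throughput(sizes_mb, times_s):
    """Total data divided by the SUM of per-task I/O times."""
    return sum(sizes_mb) / sum(times_s)


def average_io_rate(sizes_mb, times_s):
    """Mean of the per-task rates filesize_i / time_i."""
    rates = [size / time for size, time in zip(sizes_mb, times_s)]
    return sum(rates) / len(rates)


# Hypothetical run: 10 tasks of 100 MB; nine take 50 s, one takes 2 s.
sizes = [100.0] * 10
times = [50.0] * 9 + [2.0]

print(round(throughput(sizes, times), 2))       # about 2.21 MB/s
print(round(average_io_rate(sizes, times), 2))  # 6.8 MB/s

# Figures taken from the real write test above.
total_mb, reported_throughput, exec_time_s = 1000.0, 2.2699105201272967, 80.936
summed_task_time_s = total_mb / reported_throughput  # sum of time_i, roughly 440 s
wall_clock_mb_per_s = total_mb / exec_time_s         # roughly 12.4 MB/s
print(round(summed_task_time_s, 1), round(wall_clock_mb_per_s, 1))
```

A single fast task pulls the average IO rate up sharply while barely moving Throughput, which also fits the large IO rate standard deviation reported. Dividing the total data by the test execution time gives a wall-clock figure that includes job startup overhead, so it is a lower bound rather than a replacement for the 10 * 2.27 estimate.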

The HDFS replication factor also has a large effect on the results. It can be adjusted by setting the dfs.replication property.

    hadoop jar ./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar TestDFSIO -D dfs.replication=2 -write -nrFiles 10 -fileSize 100
    15/06/29 23:32:47 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
    15/06/29 23:32:47 INFO fs.TestDFSIO: Date & time: Mon Jun 29 23:32:47 CST 2015
    15/06/29 23:32:47 INFO fs.TestDFSIO: Number of files: 10
    15/06/29 23:32:47 INFO fs.TestDFSIO: Total MBytes processed: 1000.0
    15/06/29 23:32:47 INFO fs.TestDFSIO: Throughput mb/sec: 4.046895424175344
    15/06/29 23:32:47 INFO fs.TestDFSIO: Average IO rate mb/sec: 9.856432914733887
    15/06/29 23:32:47 INFO fs.TestDFSIO: IO rate std deviation: 14.509421322080607
    15/06/29 23:32:47 INFO fs.TestDFSIO: Test exec time sec: 138.103
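
To compare several replication factors under otherwise identical settings, the command line can be generated rather than retyped. The sketch below only builds and prints the commands; it does not run Hadoop. `dfsio_write_cmd` is a hypothetical helper, and the jar name and flags are the ones used in this note.

```python
import shlex

# Jar name as used in the commands above; adjust for your installation.
JAR = "./hadoop-mapreduce-client-jobclient-2.3.0-cdh5.0.2-tests.jar"


def dfsio_write_cmd(replication, nr_files=10, file_size_mb=100):
    """Build the argv list for one TestDFSIO write run."""
    return [
        "hadoop", "jar", JAR, "TestDFSIO",
        "-D", f"dfs.replication={replication}",
        "-write", "-nrFiles", str(nr_files), "-fileSize", str(file_size_mb),
    ]


for replication in (1, 2, 3):
    print(shlex.join(dfsio_write_cmd(replication)))

# Ratio of the two write throughputs reported in this note
# (dfs.replication=2 run vs the earlier run with the cluster default).
print(round(4.046895424175344 / 2.2699105201272967, 2))  # about 1.78
```

Running `TestDFSIO -clean` between runs keeps leftover files from one run out of the next. The cluster's default replication factor is not stated in this note, so the ratio only says that the replication=2 run reported higher Throughput than the default run.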


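TeraSort Benchmark
Input data: TeraGen

    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teragen 1000 /test/input100M

The number after teragen is a row count. Each row is 100 bytes, so generating 1 TB of data requires 1T/100 = 10000000000 rows (a 1 followed by ten zeros). Note that the 1000 rows in the command above amount to only 100 KB; 100 MB would take 1000000 rows.
The HDFS block size can be adjusted with the dfs.block.size property, e.g. teragen -D dfs.block.size=536870912 ... (512 MB). In Hadoop 2.x, dfs.block.size is the deprecated name of dfs.blocksize.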
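
The row-count arithmetic is easy to get wrong by a factor of ten, so a small sketch of it may help. `teragen_rows` is a hypothetical helper; sizes are in decimal units (1 TB = 10^12 bytes), matching the 1T/100 calculation above.

```python
ROW_BYTES = 100  # each TeraGen row is 100 bytes


def teragen_rows(target_bytes):
    """Number of rows to pass to teragen for a given data size in bytes."""
    return target_bytes // ROW_BYTES


print(teragen_rows(10**12))       # 1 TB   -> 10000000000
print(teragen_rows(10 * 10**9))   # 10 GB  -> 100000000
print(teragen_rows(100 * 10**6))  # 100 MB -> 1000000

# Block size values are given in bytes: 512 MB as a power-of-two size.
print(512 * 1024 * 1024)          # 536870912
```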

Run TeraSort

    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar terasort /test/input100M /test/output100M

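Validate the results: TeraValidate (the first argument is the TeraSort output directory, the second is the directory where the validation report is written)

    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teravalidate /test/output1TB /test/validate1TB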

Test results on vmaxspark1

    [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar teragen 100000000 /gx/tera/input10G
    15/07/02 20:47:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
    15/07/02 20:47:47 INFO terasort.TeraSort: Generating 100000000 using 2
    15/07/02 20:47:47 INFO mapreduce.JobSubmitter: number of splits:2
    15/07/02 20:47:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1435833769205_0028
    15/07/02 20:47:48 INFO impl.YarnClientImpl: Submitted application application_1435833769205_0028
    15/07/02 20:47:48 INFO mapreduce.Job: The url to track the job: http://vmaxspark3:8088/proxy/application_1435833769205_0028/
    15/07/02 20:47:48 INFO mapreduce.Job: Running job: job_1435833769205_0028
    15/07/02 20:47:54 INFO mapreduce.Job: Job job_1435833769205_0028 running in uber mode : false
    15/07/02 20:47:54 INFO mapreduce.Job: map 0% reduce 0%
    ...
    15/07/02 21:08:20 INFO mapreduce.Job: map 99% reduce 0%
    15/07/02 21:08:41 INFO mapreduce.Job: map 100% reduce 0%
    15/07/02 21:08:45 INFO mapreduce.Job: Job job_1435833769205_0028 completed successfully
    15/07/02 21:08:45 INFO mapreduce.Job: Counters: 31
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=191996
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=170
            HDFS: Number of bytes written=10000000000
            HDFS: Number of read operations=8
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=4
        Job Counters
            Launched map tasks=2
            Other local map tasks=2
            Total time spent by all maps in occupied slots (ms)=7422519
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=2474173
            Total vcore-seconds taken by all map tasks=2474173
            Total megabyte-seconds taken by all map tasks=3800329728
        Map-Reduce Framework
            Map input records=100000000
            Map output records=100000000
            Input split bytes=170
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=39813
            CPU time spent (ms)=347490
            Physical memory (bytes) snapshot=975953920
            Virtual memory (bytes) snapshot=3772518400
            Total committed heap usage (bytes)=1979842560
        org.apache.hadoop.examples.terasort.TeraGen$Counters
            CHECKSUM=214760662691937609
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=10000000000
    [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-examples-2.3.0-cdh5.0.2.jar terasort /gx/tera/input10G /gx/tera/output10G
    15/07/02 21:54:06 INFO terasort.TeraSort: starting
    15/07/02 21:54:07 INFO input.FileInputFormat: Total input paths to process : 2
    Spent 125ms computing base-splits.
    Spent 4ms computing TeraScheduler splits.
    Computing input splits took 130ms
    Sampling 10 splits of 76
    Making 1 from 100000 sampled records
    Computing parititions took 409ms
    Spent 542ms computing partitions.
    15/07/02 21:54:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
    15/07/02 21:54:07 INFO mapreduce.JobSubmitter: number of splits:76
    15/07/02 21:54:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1435833769205_0038
    15/07/02 21:54:08 INFO impl.YarnClientImpl: Submitted application application_1435833769205_0038
    15/07/02 21:54:08 INFO mapreduce.Job: The url to track the job: http://vmaxspark3:8088/proxy/application_1435833769205_0038/
    15/07/02 21:54:08 INFO mapreduce.Job: Running job: job_1435833769205_0038
    15/07/02 21:54:14 INFO mapreduce.Job: Job job_1435833769205_0038 running in uber mode : false
    15/07/02 21:54:30 INFO mapreduce.Job: map 3% reduce 0%
    ...
    15/07/02 21:55:16 INFO mapreduce.Job: map 84% reduce 25%
    ...
    15/07/02 21:57:36 INFO mapreduce.Job: map 100% reduce 76%
    15/07/02 21:58:25 INFO mapreduce.Job: map 100% reduce 100%
    15/07/02 21:58:25 INFO mapreduce.Job: Job job_1435833769205_0038 completed successfully
    15/07/02 21:58:25 INFO mapreduce.Job: Counters: 50
        File System Counters
            FILE: Number of bytes read=8746650762
            FILE: Number of bytes written=13195291236
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=10000008436
            HDFS: Number of bytes written=10000000000
            HDFS: Number of read operations=231
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
        Job Counters
            Launched map tasks=76
            Launched reduce tasks=1
            Data-local map tasks=64
            Rack-local map tasks=12
            Total time spent by all maps in occupied slots (ms)=8175426
            Total time spent by all reduces in occupied slots (ms)=906864
            Total time spent by all map tasks (ms)=2725142
            Total time spent by all reduce tasks (ms)=226716
            Total vcore-seconds taken by all map tasks=2725142
            Total vcore-seconds taken by all reduce tasks=226716
            Total megabyte-seconds taken by all map tasks=4185818112
            Total megabyte-seconds taken by all reduce tasks=464314368
        Map-Reduce Framework
            Map input records=100000000
            Map output records=100000000
            Map output bytes=10200000000
            Map output materialized bytes=4406714080
            Input split bytes=8436
            Combine input records=0
            Combine output records=0
            Reduce input groups=100000000
            Reduce shuffle bytes=4406714080
            Reduce input records=100000000
            Reduce output records=100000000
            Spilled Records=299321120
            Shuffled Maps =76
            Failed Shuffles=0
            Merged Map outputs=76
            GC time elapsed (ms)=466626
            CPU time spent (ms)=2438220
            Physical memory (bytes) snapshot=47665377280
            Virtual memory (bytes) snapshot=145613328384
            Total committed heap usage (bytes)=79307472896
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters
            Bytes Read=10000000000
        File Output Format Counters
            Bytes Written=10000000000
    15/07/02 21:58:25 INFO terasort.TeraSort: done
    [mr@vmaxspark1 mapreduce]$
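
The logs above do not print an elapsed time directly, but it can be read off the first and last timestamps of each job. The following sketch does that arithmetic for the two runs; `elapsed_seconds` is a hypothetical helper, and MB here means 10^6 bytes.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # e.g. "15/07/02 20:47:47"
BYTES_WRITTEN = 10_000_000_000         # "HDFS: Number of bytes written" in both jobs


def elapsed_seconds(start, end):
    """Seconds between two log timestamps in the Hadoop console format."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# First and last timestamps of each job, copied from the logs above.
teragen_s = elapsed_seconds("15/07/02 20:47:47", "15/07/02 21:08:45")
terasort_s = elapsed_seconds("15/07/02 21:54:06", "15/07/02 21:58:25")

for name, seconds in (("teragen", teragen_s), ("terasort", terasort_s)):
    mb_per_s = BYTES_WRITTEN / 1e6 / seconds
    print(f"{name}: {seconds:.0f} s, {mb_per_s:.1f} MB/s")
# teragen: 1258 s, about 7.9 MB/s; terasort: 259 s, about 38.6 MB/s
```

These are wall-clock figures from job submission to completion, so they include scheduling and startup overhead. The teragen run used only 2 map tasks for 10 GB, which likely explains why generating the data took longer than sorting it.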




Available benchmark and testing tools:

    [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-*test*
    An example program must be given as the first argument.
    Valid program names are:
      DFSCIOTest: Distributed i/o benchmark of libhdfs.
      DistributedFSCheck: Distributed checkup of the file system consistency.
      JHLogAnalyzer: Job History Log analyzer.
      MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
      SliveTest: HDFS Stress Test and Live Data Verification.
      TestDFSIO: Distributed i/o benchmark.
      fail: a job that always fails
      filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
      largesorter: Large-Sort tester
      loadgen: Generic map/reduce load generator
      mapredtest: A map/reduce test check.
      minicluster: Single process HDFS and MR cluster.
      mrbench: A map/reduce benchmark that can create many small jobs
      nnbench: A benchmark that stresses the namenode.
      sleep: A job that sleeps at each map and reduce task.
      testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
      testfilesystem: A test for FileSystem read/write.
      testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
      testsequencefile: A test for flat files of binary key value pairs.
      testsequencefileinputformat: A test for sequence file input format.
      testtextinputformat: A test for text input format.
      threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
    [mr@vmaxspark1 mapreduce]$ hadoop jar ./hadoop-mapreduce-*example*
    An example program must be given as the first argument.
    Valid program names are:
      aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
      aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
      bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
      dbcount: An example job that count the pageview counts from a database.
      distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
      grep: A map/reduce program that counts the matches of a regex in the input.
      join: A job that effects a join over sorted, equally partitioned datasets
      multifilewc: A job that counts words from several files.
      pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
      pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
      randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
      randomwriter: A map/reduce program that writes 10GB of random data per node.
      secondarysort: An example defining a secondary sort to the reduce.
      sort: A map/reduce program that sorts the data written by the random writer.
      sudoku: A sudoku solver.
      teragen: Generate data for the terasort
      terasort: Run the terasort
      teravalidate: Checking results of terasort
      wordcount: A map/reduce program that counts the words in the input files.
      wordmean: A map/reduce program that counts the average length of the words in the input files.
      wordmedian: A map/reduce program that counts the median length of the words in the input files.
      wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.








