In this blog post I introduce some of the benchmarking and testing tools in the Apache Hadoop distribution. Namely, I'll look at TeraSort, NNBench and MRBench. These are popular choices to benchmark a Hadoop cluster.
Before we start, here is the cluster on which the tests will run:
- Three VMware virtual machines (nodes) running on OS X Mountain Lion
- Node1: 2 processors, 2GB memory, used as NameNode and DataNode
- Node2: 1 processor, 1GB memory, used as Secondary NameNode and DataNode
- Node3: 1 processor, 1GB memory, used as DataNode
Now let's start the benchmark tests.
TeraSort benchmark test
A full TeraSort benchmark run consists of the following three steps:
- Generating the input data via TeraGen.
- Running the actual TeraSort on the input data.
- Validating the sorted output data via TeraValidate.
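Now let's generate the input data. The first argument to teragen is the number of rows to write; each row is 100 bytes, so 1000 rows produce 100,000 bytes: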
[root@n1 lib]# hadoop jar hadoop-examples.jar teragen 1000 /user/root/terasort-input
13/07/12 21:37:00 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Generating 1000 using 2 maps with step of 500
13/07/12 21:37:09 INFO mapred.JobClient: Running job: job_201307122107_0001
13/07/12 21:37:10 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 21:37:35 INFO mapred.JobClient: map 50% reduce 0%
13/07/12 21:38:28 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 21:39:03 INFO mapred.JobClient: Job complete: job_201307122107_0001
13/07/12 21:39:05 INFO mapred.JobClient: Counters: 24
13/07/12 21:39:06 INFO mapred.JobClient: File System Counters
13/07/12 21:39:06 INFO mapred.JobClient: FILE: Number of bytes read=0
13/07/12 21:39:06 INFO mapred.JobClient: FILE: Number of bytes written=309768
13/07/12 21:39:06 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/12 21:39:06 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/12 21:39:06 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/12 21:39:06 INFO mapred.JobClient: HDFS: Number of bytes read=164
13/07/12 21:39:06 INFO mapred.JobClient: HDFS: Number of bytes written=100000
13/07/12 21:39:06 INFO mapred.JobClient: HDFS: Number of read operations=3
13/07/12 21:39:06 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/12 21:39:06 INFO mapred.JobClient: HDFS: Number of write operations=2
13/07/12 21:39:06 INFO mapred.JobClient: Job Counters
13/07/12 21:39:06 INFO mapred.JobClient: Launched map tasks=2
13/07/12 21:39:06 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=93872
13/07/12 21:39:06 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/07/12 21:39:06 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/12 21:39:06 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 21:39:06 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 21:39:06 INFO mapred.JobClient: Map input records=1000
13/07/12 21:39:06 INFO mapred.JobClient: Map output records=1000
13/07/12 21:39:06 INFO mapred.JobClient: Input split bytes=164
13/07/12 21:39:06 INFO mapred.JobClient: Spilled Records=0
13/07/12 21:39:06 INFO mapred.JobClient: CPU time spent (ms)=1360
13/07/12 21:39:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=178167808
13/07/12 21:39:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2249502720
13/07/12 21:39:06 INFO mapred.JobClient: Total committed heap usage (bytes)=48758784
13/07/12 21:39:06 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/07/12 21:39:06 INFO mapred.JobClient: BYTES_READ=1000
Check the data generated:
[root@n1 lib]# hadoop fs -ls ./terasort-input
Found 4 items
-rw-r--r-- 3 root supergroup 0 2013-07-12 21:38 terasort-input/_SUCCESS
drwxr-xr-x - root supergroup 0 2013-07-12 21:37 terasort-input/_logs
-rw-r--r-- 3 root supergroup 50000 2013-07-12 21:37 terasort-input/part-00000
-rw-r--r-- 3 root supergroup 50000 2013-07-12 21:38 terasort-input/part-00001
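The two 50000-byte part files follow from TeraGen's fixed row size: 1000 rows of 100 bytes each, split across 2 maps. As a small sketch of the sizing arithmetic, the helper below turns a target data size into the row count to pass to teragen. The function name `teragen_rows` is made up for this example and is not part of Hadoop.

```shell
#!/usr/bin/env bash
# TeraGen writes fixed-size rows of 100 bytes each, so the row count
# passed on the command line is the target size divided by 100.
ROW_BYTES=100

# Print the number of TeraGen rows needed for a target size in bytes.
# (Hypothetical helper, not a Hadoop command.)
teragen_rows() {
  local target_bytes=$1
  echo $(( target_bytes / ROW_BYTES ))
}

teragen_rows 100000          # the run above: prints 1000
teragen_rows 1000000000000   # 10^12 bytes (1 TB): prints 10000000000
```

For a real benchmark you would pass the second, much larger value to teragen; the 1000-row run shown here is only a smoke test.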
Run the terasort test:
[root@n1 lib]# hadoop jar hadoop-examples.jar terasort terasort-input terasort-output
13/07/12 21:53:19 INFO terasort.TeraSort: starting
13/07/12 21:53:21 INFO mapred.FileInputFormat: Total input paths to process : 2
13/07/12 21:53:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/07/12 21:53:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
Making 1 from 1000 records
Step size is 1000.0
13/07/12 21:53:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/12 21:53:26 INFO mapred.JobClient: Running job: job_201307122107_0002
13/07/12 21:53:27 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 21:53:46 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 21:53:57 INFO mapred.JobClient: map 100% reduce 100%
13/07/12 21:54:01 INFO mapred.JobClient: Job complete: job_201307122107_0002
13/07/12 21:54:01 INFO mapred.JobClient: Counters: 33
13/07/12 21:54:01 INFO mapred.JobClient: File System Counters
13/07/12 21:54:01 INFO mapred.JobClient: FILE: Number of bytes read=23088
13/07/12 21:54:01 INFO mapred.JobClient: FILE: Number of bytes written=520103
13/07/12 21:54:01 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/12 21:54:01 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/12 21:54:01 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/12 21:54:01 INFO mapred.JobClient: HDFS: Number of bytes read=100230
13/07/12 21:54:01 INFO mapred.JobClient: HDFS: Number of bytes written=100000
13/07/12 21:54:01 INFO mapred.JobClient: HDFS: Number of read operations=4
13/07/12 21:54:01 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/12 21:54:01 INFO mapred.JobClient: HDFS: Number of write operations=1
13/07/12 21:54:01 INFO mapred.JobClient: Job Counters
13/07/12 21:54:01 INFO mapred.JobClient: Launched map tasks=2
13/07/12 21:54:01 INFO mapred.JobClient: Launched reduce tasks=1
13/07/12 21:54:01 INFO mapred.JobClient: Data-local map tasks=2
13/07/12 21:54:01 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=26310
13/07/12 21:54:01 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=8722
13/07/12 21:54:01 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/12 21:54:01 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 21:54:01 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 21:54:01 INFO mapred.JobClient: Map input records=1000
13/07/12 21:54:01 INFO mapred.JobClient: Map output records=1000
13/07/12 21:54:01 INFO mapred.JobClient: Map output bytes=100000
13/07/12 21:54:01 INFO mapred.JobClient: Input split bytes=230
13/07/12 21:54:01 INFO mapred.JobClient: Combine input records=0
13/07/12 21:54:01 INFO mapred.JobClient: Combine output records=0
13/07/12 21:54:01 INFO mapred.JobClient: Reduce input groups=1000
13/07/12 21:54:01 INFO mapred.JobClient: Reduce shuffle bytes=22876
13/07/12 21:54:01 INFO mapred.JobClient: Reduce input records=1000
13/07/12 21:54:01 INFO mapred.JobClient: Reduce output records=1000
13/07/12 21:54:01 INFO mapred.JobClient: Spilled Records=2000
13/07/12 21:54:01 INFO mapred.JobClient: CPU time spent (ms)=3780
13/07/12 21:54:01 INFO mapred.JobClient: Physical memory (bytes) snapshot=408850432
13/07/12 21:54:01 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1962823680
13/07/12 21:54:01 INFO mapred.JobClient: Total committed heap usage (bytes)=147070976
13/07/12 21:54:01 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/07/12 21:54:01 INFO mapred.JobClient: BYTES_READ=100000
13/07/12 21:54:01 INFO terasort.TeraSort: done
Validate job output with teravalidate:
[root@n1 lib]# hadoop jar hadoop-examples.jar teravalidate terasort-output terasort-validate
13/07/12 21:56:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/12 21:56:04 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/12 21:56:10 INFO mapred.JobClient: Running job: job_201307122107_0003
13/07/12 21:56:11 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 21:56:23 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 21:56:31 INFO mapred.JobClient: map 100% reduce 100%
13/07/12 21:56:34 INFO mapred.JobClient: Job complete: job_201307122107_0003
13/07/12 21:56:34 INFO mapred.JobClient: Counters: 33
13/07/12 21:56:34 INFO mapred.JobClient: File System Counters
13/07/12 21:56:34 INFO mapred.JobClient: FILE: Number of bytes read=69
13/07/12 21:56:34 INFO mapred.JobClient: FILE: Number of bytes written=310607
13/07/12 21:56:34 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/12 21:56:34 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/12 21:56:34 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/12 21:56:34 INFO mapred.JobClient: HDFS: Number of bytes read=100116
13/07/12 21:56:34 INFO mapred.JobClient: HDFS: Number of bytes written=0
13/07/12 21:56:34 INFO mapred.JobClient: HDFS: Number of read operations=3
13/07/12 21:56:34 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/12 21:56:34 INFO mapred.JobClient: HDFS: Number of write operations=2
13/07/12 21:56:34 INFO mapred.JobClient: Job Counters
13/07/12 21:56:34 INFO mapred.JobClient: Launched map tasks=1
13/07/12 21:56:34 INFO mapred.JobClient: Launched reduce tasks=1
13/07/12 21:56:34 INFO mapred.JobClient: Data-local map tasks=1
13/07/12 21:56:34 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=14493
13/07/12 21:56:34 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=6647
13/07/12 21:56:34 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/12 21:56:34 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 21:56:34 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 21:56:34 INFO mapred.JobClient: Map input records=1000
13/07/12 21:56:34 INFO mapred.JobClient: Map output records=2
13/07/12 21:56:34 INFO mapred.JobClient: Map output bytes=54
13/07/12 21:56:34 INFO mapred.JobClient: Input split bytes=116
13/07/12 21:56:34 INFO mapred.JobClient: Combine input records=0
13/07/12 21:56:34 INFO mapred.JobClient: Combine output records=0
13/07/12 21:56:34 INFO mapred.JobClient: Reduce input groups=2
13/07/12 21:56:34 INFO mapred.JobClient: Reduce shuffle bytes=65
13/07/12 21:56:34 INFO mapred.JobClient: Reduce input records=2
13/07/12 21:56:34 INFO mapred.JobClient: Reduce output records=0
13/07/12 21:56:34 INFO mapred.JobClient: Spilled Records=4
13/07/12 21:56:34 INFO mapred.JobClient: CPU time spent (ms)=1640
13/07/12 21:56:34 INFO mapred.JobClient: Physical memory (bytes) snapshot=250499072
13/07/12 21:56:34 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1310330880
13/07/12 21:56:34 INFO mapred.JobClient: Total committed heap usage (bytes)=81399808
13/07/12 21:56:34 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/07/12 21:56:34 INFO mapred.JobClient: BYTES_READ=100000
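The three steps can also be chained in a small wrapper so that a failed step stops the run. The following is only a sketch: the function names, the `EXAMPLES_JAR` and `DRY_RUN` variables and the directory layout are assumptions for illustration. By default it only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch: chain TeraGen, TeraSort and TeraValidate.
# EXAMPLES_JAR, DRY_RUN and the function names are assumptions for this example.
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-examples.jar}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = execute

# Run a command, or just print it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: terasort_suite <rows> <base-dir>
# Each step only starts if the previous one succeeded.
terasort_suite() {
  local rows=$1 base=$2
  run hadoop jar "$EXAMPLES_JAR" teragen "$rows" "$base/terasort-input" &&
  run hadoop jar "$EXAMPLES_JAR" terasort "$base/terasort-input" "$base/terasort-output" &&
  run hadoop jar "$EXAMPLES_JAR" teravalidate "$base/terasort-output" "$base/terasort-validate"
}

terasort_suite 1000 /user/root
```

Setting `DRY_RUN=0` would execute the commands for real; the output directories must not already exist, just as with the manual runs above.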
Hadoop provides a very convenient way to access statistics about a job from the command line:
$ hadoop job -history all terasort-output
You can also see the detailed results in the Hadoop JobTracker web UI.
NameNode benchmark (nnbench)
NNBench is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small "payloads" for the sole purpose of putting a high HDFS management stress on the NameNode. The benchmark can simulate requests for creating, reading, renaming and deleting files on HDFS.
The syntax of NNBench is as follows:
[root@n1 lib]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench
NameNode Benchmark 0.4
Usage: nnbench <options>
Options:
-operation <Available operations are create_write open_read rename delete. This option is mandatory>
* NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations.
-maps <number of maps. default is 1. This is not mandatory>
-reduces <number of reduces. default is 1. This is not mandatory>
-startTime <time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time>. default is launch time + 2 mins. This is not mandatory
-blockSize <Block size in bytes. default is 1. This is not mandatory>
-bytesToWrite <Bytes to write. default is 0. This is not mandatory>
-bytesPerChecksum <Bytes per checksum for the files. default is 1. This is not mandatory>
-numberOfFiles <number of files to create. default is 1. This is not mandatory>
-replicationFactorPerFile <Replication factor for the files. default is 1. This is not mandatory>
-baseDir <base DFS path. default is /becnhmarks/NNBench. This is not mandatory>
-readFileAfterOpen <true or false. if true, it reads the file and reports the average time to read. This is valid with the open_read operation. default is false. This is not mandatory>
-help: Display the help statement
To run the NameNode benchmark with 6 mappers and 3 reducers:
[root@n1 lib]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-test.jar nnbench -operation create_write -maps 6 -reduces 3 -blockSize 1 -bytesToWrite 0 -numberOfFiles 100 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /benchmarks/NNBench-`hostname -s`
NameNode Benchmark 0.4
13/07/12 22:13:42 INFO hdfs.NNBench: Test Inputs:
13/07/12 22:13:42 INFO hdfs.NNBench: Test Operation: create_write
13/07/12 22:13:42 INFO hdfs.NNBench: Start time: 2013-07-12 22:15:42,26
13/07/12 22:13:42 INFO hdfs.NNBench: Number of maps: 6
13/07/12 22:13:42 INFO hdfs.NNBench: Number of reduces: 3
13/07/12 22:13:42 INFO hdfs.NNBench: Block Size: 1
13/07/12 22:13:42 INFO hdfs.NNBench: Bytes to write: 0
13/07/12 22:13:42 INFO hdfs.NNBench: Bytes per checksum: 1
13/07/12 22:13:42 INFO hdfs.NNBench: Number of files: 100
13/07/12 22:13:42 INFO hdfs.NNBench: Replication factor: 3
13/07/12 22:13:42 INFO hdfs.NNBench: Base dir: /benchmarks/NNBench-n1
13/07/12 22:13:42 INFO hdfs.NNBench: Read file after open: true
13/07/12 22:13:43 INFO hdfs.NNBench: Deleting data directory
13/07/12 22:13:43 INFO hdfs.NNBench: Creating 6 control files
13/07/12 22:13:43 WARN conf.Configuration: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
13/07/12 22:13:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/12 22:13:44 INFO mapred.FileInputFormat: Total input paths to process : 6
13/07/12 22:13:44 WARN conf.Configuration: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
13/07/12 22:13:44 INFO mapred.JobClient: Running job: job_201307122107_0005
13/07/12 22:13:45 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 22:14:03 INFO mapred.JobClient: map 33% reduce 0%
13/07/12 22:14:05 INFO mapred.JobClient: map 67% reduce 0%
13/07/12 22:15:57 INFO mapred.JobClient: map 83% reduce 0%
13/07/12 22:15:58 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 22:16:07 INFO mapred.JobClient: map 100% reduce 67%
13/07/12 22:16:09 INFO mapred.JobClient: map 100% reduce 100%
13/07/12 22:16:11 INFO mapred.JobClient: Job complete: job_201307122107_0005
13/07/12 22:16:11 INFO mapred.JobClient: Counters: 33
13/07/12 22:16:11 INFO mapred.JobClient: File System Counters
13/07/12 22:16:11 INFO mapred.JobClient: FILE: Number of bytes read=359
13/07/12 22:16:11 INFO mapred.JobClient: FILE: Number of bytes written=1448711
13/07/12 22:16:11 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/12 22:16:11 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/12 22:16:11 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/12 22:16:11 INFO mapred.JobClient: HDFS: Number of bytes read=1530
13/07/12 22:16:11 INFO mapred.JobClient: HDFS: Number of bytes written=182
13/07/12 22:16:11 INFO mapred.JobClient: HDFS: Number of read operations=21
13/07/12 22:16:11 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/12 22:16:11 INFO mapred.JobClient: HDFS: Number of write operations=4006
13/07/12 22:16:11 INFO mapred.JobClient: Job Counters
13/07/12 22:16:11 INFO mapred.JobClient: Launched map tasks=6
13/07/12 22:16:11 INFO mapred.JobClient: Launched reduce tasks=3
13/07/12 22:16:11 INFO mapred.JobClient: Data-local map tasks=6
13/07/12 22:16:11 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=498450
13/07/12 22:16:11 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=24054
13/07/12 22:16:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/12 22:16:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 22:16:11 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 22:16:11 INFO mapred.JobClient: Map input records=6
13/07/12 22:16:11 INFO mapred.JobClient: Map output records=44
13/07/12 22:16:11 INFO mapred.JobClient: Map output bytes=974
13/07/12 22:16:11 INFO mapred.JobClient: Input split bytes=786
13/07/12 22:16:11 INFO mapred.JobClient: Combine input records=0
13/07/12 22:16:11 INFO mapred.JobClient: Combine output records=0
13/07/12 22:16:11 INFO mapred.JobClient: Reduce input groups=8
13/07/12 22:16:11 INFO mapred.JobClient: Reduce shuffle bytes=1227
13/07/12 22:16:11 INFO mapred.JobClient: Reduce input records=44
13/07/12 22:16:11 INFO mapred.JobClient: Reduce output records=8
13/07/12 22:16:11 INFO mapred.JobClient: Spilled Records=88
13/07/12 22:16:11 INFO mapred.JobClient: CPU time spent (ms)=16050
13/07/12 22:16:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=1233637376
13/07/12 22:16:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8789716992
13/07/12 22:16:11 INFO mapred.JobClient: Total committed heap usage (bytes)=525942784
13/07/12 22:16:11 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/07/12 22:16:11 INFO mapred.JobClient: BYTES_READ=228
13/07/12 22:16:11 INFO hdfs.NNBench: -------------- NNBench -------------- :
13/07/12 22:16:11 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
13/07/12 22:16:11 INFO hdfs.NNBench: Date & time: 2013-07-12 22:16:11,562
13/07/12 22:16:11 INFO hdfs.NNBench:
13/07/12 22:16:11 INFO hdfs.NNBench: Test Operation: create_write
13/07/12 22:16:11 INFO hdfs.NNBench: Start time: 2013-07-12 22:15:42,26
13/07/12 22:16:11 INFO hdfs.NNBench: Maps to run: 6
13/07/12 22:16:11 INFO hdfs.NNBench: Reduces to run: 3
13/07/12 22:16:11 INFO hdfs.NNBench: Block Size (bytes): 1
13/07/12 22:16:11 INFO hdfs.NNBench: Bytes to write: 0
13/07/12 22:16:11 INFO hdfs.NNBench: Bytes per checksum: 1
13/07/12 22:16:11 INFO hdfs.NNBench: Number of files: 100
13/07/12 22:16:11 INFO hdfs.NNBench: Replication factor: 3
13/07/12 22:16:11 INFO hdfs.NNBench: Successful file operations: 0
13/07/12 22:16:11 INFO hdfs.NNBench:
13/07/12 22:16:11 INFO hdfs.NNBench: # maps that missed the barrier: 0
13/07/12 22:16:11 INFO hdfs.NNBench: # exceptions: 0
13/07/12 22:16:11 INFO hdfs.NNBench:
13/07/12 22:16:11 INFO hdfs.NNBench: TPS: Create/Write/Close: 0
13/07/12 22:16:11 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: 0.0
13/07/12 22:16:11 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: NaN
13/07/12 22:16:11 INFO hdfs.NNBench: Avg Lat (ms): Close: NaN
13/07/12 22:16:11 INFO hdfs.NNBench:
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: AL Total #1: 0
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: AL Total #2: 0
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 0
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 0.0
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: Late maps: 0
13/07/12 22:16:11 INFO hdfs.NNBench: RAW DATA: # of exceptions: 0
13/07/12 22:16:11 INFO hdfs.NNBench:
Note the trick used here: the output directory is based on the machine's short hostname (`hostname -s`). This is a simple way to ensure that one box does not accidentally write into the output directory of another machine running nnbench at the same time.
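The hostname trick combines naturally with the rule from the usage text that create_write must run before the other operations. Below is a rough sketch that only prints the nnbench commands for a given operation and host. The function names and the `TEST_JAR` variable are made up for this example; the options and their values are the ones used in the run above.

```shell
#!/usr/bin/env bash
# Sketch: build nnbench command lines with a per-host base directory.
# TEST_JAR and the function names are assumptions for this example.
TEST_JAR=${TEST_JAR:-/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-test.jar}

# Per-host base directory so that parallel runs on different boxes do not collide.
# Falls back to the local short hostname when no host is given.
nnbench_base_dir() {
  local host=${1:-$(hostname -s)}
  echo "/benchmarks/NNBench-$host"
}

# Usage: nnbench_command <operation> [host]
# Prints the command instead of running it.
nnbench_command() {
  local op=$1 base_dir
  base_dir=$(nnbench_base_dir "$2")
  echo "hadoop jar $TEST_JAR nnbench -operation $op -maps 6 -reduces 3 -numberOfFiles 100 -replicationFactorPerFile 3 -baseDir $base_dir"
}

# create_write first: open_read expects the files it created.
nnbench_command create_write n1
nnbench_command open_read n1
```

Printing the commands first makes it easy to check the base directory before any load is put on the NameNode.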
MapReduce benchmark (mrbench)
MRBench loops a small job a number of times. As such, it complements the "large-scale" TeraSort benchmark suite, because MRBench checks whether small job runs are responsive and run efficiently on your cluster. It focuses on the MapReduce layer; its impact on the HDFS layer is very limited.
The default parameters of mrbench are:
-baseDir: /benchmarks/MRBench [*** see my note above ***]
-numRuns: 1
-maps: 2
-reduces: 1
-inputLines: 1
-inputType: ascending
Run mrbench with default parameters:
[root@n1 lib]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench
MRBenchmark.0.0.2
13/07/12 22:04:42 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
13/07/12 22:04:42 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-1751865361.txt
13/07/12 22:04:43 INFO mapred.MRBench: Running job 0: input=hdfs://n1.example.com:8020/benchmarks/MRBench/mr_input output=hdfs://n1.example.com:8020/benchmarks/MRBench/mr_output/output_-1484101927
13/07/12 22:04:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/12 22:04:44 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/12 22:04:47 INFO mapred.JobClient: Running job: job_201307122107_0004
13/07/12 22:04:49 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 22:05:41 INFO mapred.JobClient: map 50% reduce 0%
13/07/12 22:05:48 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 22:05:58 INFO mapred.JobClient: map 100% reduce 100%
13/07/12 22:06:00 INFO mapred.JobClient: Job complete: job_201307122107_0004
13/07/12 22:06:00 INFO mapred.JobClient: Counters: 33
13/07/12 22:06:00 INFO mapred.JobClient: File System Counters
13/07/12 22:06:00 INFO mapred.JobClient: FILE: Number of bytes read=27
13/07/12 22:06:00 INFO mapred.JobClient: FILE: Number of bytes written=468313
13/07/12 22:06:00 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/12 22:06:00 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/12 22:06:00 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/12 22:06:00 INFO mapred.JobClient: HDFS: Number of bytes read=261
13/07/12 22:06:00 INFO mapred.JobClient: HDFS: Number of bytes written=3
13/07/12 22:06:00 INFO mapred.JobClient: HDFS: Number of read operations=5
13/07/12 22:06:00 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/12 22:06:00 INFO mapred.JobClient: HDFS: Number of write operations=2
13/07/12 22:06:00 INFO mapred.JobClient: Job Counters
13/07/12 22:06:00 INFO mapred.JobClient: Launched map tasks=2
13/07/12 22:06:00 INFO mapred.JobClient: Launched reduce tasks=1
13/07/12 22:06:00 INFO mapred.JobClient: Data-local map tasks=2
13/07/12 22:06:00 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=50958
13/07/12 22:06:00 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=7753
13/07/12 22:06:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/12 22:06:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 22:06:00 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 22:06:00 INFO mapred.JobClient: Map input records=1
13/07/12 22:06:00 INFO mapred.JobClient: Map output records=1
13/07/12 22:06:00 INFO mapred.JobClient: Map output bytes=5
13/07/12 22:06:00 INFO mapred.JobClient: Input split bytes=258
13/07/12 22:06:00 INFO mapred.JobClient: Combine input records=0
13/07/12 22:06:00 INFO mapred.JobClient: Combine output records=0
13/07/12 22:06:00 INFO mapred.JobClient: Reduce input groups=1
13/07/12 22:06:00 INFO mapred.JobClient: Reduce shuffle bytes=39
13/07/12 22:06:00 INFO mapred.JobClient: Reduce input records=1
13/07/12 22:06:00 INFO mapred.JobClient: Reduce output records=1
13/07/12 22:06:00 INFO mapred.JobClient: Spilled Records=2
13/07/12 22:06:00 INFO mapred.JobClient: CPU time spent (ms)=2920
13/07/12 22:06:00 INFO mapred.JobClient: Physical memory (bytes) snapshot=398467072
13/07/12 22:06:00 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3889000448
13/07/12 22:06:00 INFO mapred.JobClient: Total committed heap usage (bytes)=204607488
13/07/12 22:06:00 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
13/07/12 22:06:00 INFO mapred.JobClient: BYTES_READ=2
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 77797
This means that the average finish time of the executed jobs was about 78 seconds (77797 ms).
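If you run mrbench repeatedly, it can be handy to pull the AvgTime figure out of the output automatically. The sketch below assumes the summary row is the only line consisting of exactly four integers (DataLines, Maps, Reduces, AvgTime in milliseconds); the function name `mrbench_avg_seconds` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: extract AvgTime from mrbench's summary and print it in seconds.
# Assumes the summary row is the only line made of exactly four integers.
# (Hypothetical helper, not part of Hadoop.)
mrbench_avg_seconds() {
  awk 'NF == 4 && $0 ~ /^[0-9 \t]+$/ { printf "%.1f\n", $4 / 1000 }'
}

# The summary from the run above; prints 77.8
printf 'DataLines Maps Reduces AvgTime (milliseconds)\n1 2 1 77797\n' | mrbench_avg_seconds
```

In practice you would pipe the real mrbench output into the function; the JobClient log lines are skipped because they contain non-numeric fields.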