Hadoop TeraSort Benchmark Experiment


Author: zhankunlin
Date: 2011-4-1
Key words: Hadoop, TeraSort

 

<1> Introduction to TeraSort

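The 1 TB sort is commonly used to measure the data-processing capability of distributed data-processing frameworks. TeraSort is a sorting job shipped with Hadoop. In 2008, Hadoop took first place in the 1 TB sort benchmark with a time of 209 seconds.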

<2> Related Resources

Testing the scalability of Hadoop MapReduce:  http://cloud.csdn.net/a/20100901/278934.html
Implementing Hadoop Map/Reduce TeraSort with MPI:  http://emonkey.blog.sohu.com/166546157.html
Analysis of the TeraSort algorithm in Hadoop:  http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/
Hadoop's 1 TB sort, terasort:  http://hi.baidu.com/dtzw/blog/item/cffc8e1830f908b94bedbc12.html
Sort Benchmark:  http://sortbenchmark.org/
Trie tree:  http://www.cnblogs.com/cherish_yimi/archive/2009/10/12/1581666.html

<3> Experiment

(0) Source code location
    /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/src/examples/org/apache/hadoop/examples/terasort

(1) First, run teragen to generate the data

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop jar hadoop-0.20.1-examples.jar teragen 1000000 terasort/1000000-input

Inspect the generated data:

[root@gd86 hadoop-0.20.1]# /local/zkl/hadoop/hadoop-0.20.1/hadoop-0.20.1/bin/hadoop fs -ls  /user/root/terasort/1000000-input
Found 3 items
drwxr-xr-x   - root supergroup          0 2011-03-31 16:21 /user/root/terasort/1000000-input/_logs
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00000  
-rw-r--r--   3 root supergroup   50000000 2011-03-31 16:21 /user/root/terasort/1000000-input/part-00001

Two files are generated, each 50000000 B = 50 MB.

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10 terasort/1000000-input
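This generates two 500 B files, which add up to 1000 B = 1 KB.

Each generated row is 100 B. The argument 10 means 10 rows, 1000 B in total; 1,000,000 rows therefore give 100,000,000 B = 100 MB.

teragen uses two map tasks to generate the data, and each map writes one file. The two files total 100 MB, so each is 50 MB.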
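The size arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of teragen: the function name `teragen_sizes` is hypothetical, the 100-byte row size comes from the text above, and the default of two map tasks simply matches this run. It also assumes the rows divide evenly among the maps.

```python
ROW_BYTES = 100  # each teragen row is 100 bytes, as stated in the text


def teragen_sizes(rows, num_maps=2):
    """Return (total_bytes, bytes_per_file) for a teragen run.

    Hypothetical helper: assumes the rows are split evenly across the
    maps and that each map writes exactly one output file.
    """
    total_bytes = rows * ROW_BYTES
    bytes_per_file = total_bytes // num_maps
    return total_bytes, bytes_per_file


if __name__ == "__main__":
    for rows in (10, 1_000_000, 10_000_000):
        total, per_file = teragen_sizes(rows)
        print(f"{rows} rows -> {total} B total, {per_file} B per file")
```

For 1,000,000 rows this gives 100,000,000 B in total and 50,000,000 B per file, which matches the `fs -ls` listing shown earlier.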

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input

This produces 1 GB of data. Since the HDFS block size is 64 MB, the data is split into 16 blocks, so running terasort will launch 16 map tasks.
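The block count follows from a ceiling division per file. A minimal sketch, assuming that a 64 MB block means 64 × 1024 × 1024 bytes, that teragen wrote two equal files of 500,000,000 B, and that terasort creates one map task per block (which agrees with the `Launched map tasks=16` counter in the terasort log). The helper name `blocks_for_file` is hypothetical.

```python
BLOCK_BYTES = 64 * 1024 * 1024  # assumed: 64 MB HDFS block = 67,108,864 B


def blocks_for_file(file_bytes, block_bytes=BLOCK_BYTES):
    """Number of HDFS blocks needed for one file (ceiling division)."""
    return -(-file_bytes // block_bytes)


# Two teragen output files of 500,000,000 B each (assumed even split).
file_sizes = [500_000_000, 500_000_000]
total_blocks = sum(blocks_for_file(size) for size in file_sizes)
print(total_blocks)  # 16
```

Under these assumptions, a single 1,000,000,000 B file would need only 15 blocks. The count is 16 here because each of the two files ends with its own partially filled block.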

[root@gd86 hadoop-0.20.1]# bin/hadoop jar hadoop-0.20.1-examples.jar teragen 10000000 terasort/1G-input
Generating 10000000 using 2 maps with step of 5000000
11/04/01 17:02:46 INFO mapred.JobClient: Running job: job_201103311423_0005
11/04/01 17:02:47 INFO mapred.JobClient:  map 0% reduce 0%
11/04/01 17:03:00 INFO mapred.JobClient:  map 19% reduce 0%
11/04/01 17:03:01 INFO mapred.JobClient:  map 41% reduce 0%
11/04/01 17:03:03 INFO mapred.JobClient:  map 52% reduce 0%
11/04/01 17:03:04 INFO mapred.JobClient:  map 63% reduce 0%
11/04/01 17:03:06 INFO mapred.JobClient:  map 74% reduce 0%
11/04/01 17:03:10 INFO mapred.JobClient:  map 91% reduce 0%
11/04/01 17:03:12 INFO mapred.JobClient:  map 100% reduce 0%
11/04/01 17:03:14 INFO mapred.JobClient: Job complete: job_201103311423_0005
11/04/01 17:03:14 INFO mapred.JobClient: Counters: 6
11/04/01 17:03:14 INFO mapred.JobClient:   Job Counters
11/04/01 17:03:14 INFO mapred.JobClient:     Launched map tasks=2
11/04/01 17:03:14 INFO mapred.JobClient:   FileSystemCounters
11/04/01 17:03:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/04/01 17:03:14 INFO mapred.JobClient:   Map-Reduce Framework
11/04/01 17:03:14 INFO mapred.JobClient:     Map input records=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Spilled Records=0
11/04/01 17:03:14 INFO mapred.JobClient:     Map input bytes=10000000
11/04/01 17:03:14 INFO mapred.JobClient:     Map output records=10000000


(2) Run terasort to sort the data

Running the terasort program launches 16 map tasks.

root@gd38 hadoop-0.20.1# bin/hadoop jar hadoop-0.20.1-examples.jar terasort terasort/1G-input terasort/1G-output

11/03/31 17:12:49 INFO terasort.TeraSort: starting
11/03/31 17:12:49 INFO mapred.FileInputFormat: Total input paths to process : 2
11/03/31 17:13:05 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/03/31 17:13:05 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/03/31 17:13:05 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/03/31 17:13:06 INFO mapred.JobClient: Running job: job_201103311423_0006
11/03/31 17:13:07 INFO mapred.JobClient:  map 0% reduce 0%
11/03/31 17:13:20 INFO mapred.JobClient:  map 12% reduce 0%
11/03/31 17:13:21 INFO mapred.JobClient:  map 37% reduce 0%
11/03/31 17:13:29 INFO mapred.JobClient:  map 50% reduce 2%
11/03/31 17:13:30 INFO mapred.JobClient:  map 75% reduce 2%
11/03/31 17:13:32 INFO mapred.JobClient:  map 75% reduce 12%
11/03/31 17:13:36 INFO mapred.JobClient:  map 87% reduce 12%
11/03/31 17:13:38 INFO mapred.JobClient:  map 100% reduce 12%
11/03/31 17:13:41 INFO mapred.JobClient:  map 100% reduce 25%
11/03/31 17:13:44 INFO mapred.JobClient:  map 100% reduce 31%
11/03/31 17:13:53 INFO mapred.JobClient:  map 100% reduce 33%
11/03/31 17:14:02 INFO mapred.JobClient:  map 100% reduce 68%
11/03/31 17:14:05 INFO mapred.JobClient:  map 100% reduce 71%
11/03/31 17:14:08 INFO mapred.JobClient:  map 100% reduce 75%
11/03/31 17:14:11 INFO mapred.JobClient:  map 100% reduce 79%
11/03/31 17:14:14 INFO mapred.JobClient:  map 100% reduce 82%
11/03/31 17:14:17 INFO mapred.JobClient:  map 100% reduce 86%
11/03/31 17:14:20 INFO mapred.JobClient:  map 100% reduce 90%
11/03/31 17:14:23 INFO mapred.JobClient:  map 100% reduce 93%
11/03/31 17:14:26 INFO mapred.JobClient:  map 100% reduce 97%
11/03/31 17:14:32 INFO mapred.JobClient:  map 100% reduce 100%
11/03/31 17:14:34 INFO mapred.JobClient: Job complete: job_201103311423_0006
11/03/31 17:14:34 INFO mapred.JobClient: Counters: 18
11/03/31 17:14:34 INFO mapred.JobClient:   Job Counters
11/03/31 17:14:34 INFO mapred.JobClient:     Launched reduce tasks=1
11/03/31 17:14:34 INFO mapred.JobClient:     Launched map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:     Data-local map tasks=16
11/03/31 17:14:34 INFO mapred.JobClient:   FileSystemCounters
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
11/03/31 17:14:34 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
11/03/31 17:14:34 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:   Map-Reduce Framework
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input groups=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine output records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map input records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Spilled Records=33355441
11/03/31 17:14:34 INFO mapred.JobClient:     Map output bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Map input bytes=1000000000
11/03/31 17:14:34 INFO mapred.JobClient:     Combine input records=0
11/03/31 17:14:34 INFO mapred.JobClient:     Map output records=10000000
11/03/31 17:14:34 INFO mapred.JobClient:     Reduce input records=10000000
11/03/31 17:14:34 INFO terasort.TeraSort: done
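
The elapsed time of the sort can be read off the first and last timestamps in the log above. A small sketch, assuming the `yy/MM/dd HH:mm:ss` prefix seen in these lines. `elapsed_seconds` is a hypothetical helper, not a Hadoop API, and the figure includes the client-side setup between `starting` and job submission.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # e.g. "11/03/31 17:12:49"
TIMESTAMP_LENGTH = 17  # length of the timestamp prefix on each log line


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps that prefix two log lines."""
    start = datetime.strptime(first_line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)
    end = datetime.strptime(last_line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)
    return (end - start).total_seconds()


first = "11/03/31 17:12:49 INFO terasort.TeraSort: starting"
last = "11/03/31 17:14:34 INFO terasort.TeraSort: done"
seconds = elapsed_seconds(first, last)
print(seconds)  # 105.0
print(1_000_000_000 / seconds / 1e6)  # sorted MB per second for this run
```

For this run the two timestamps are 105 seconds apart.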

After the sort completes, the output data is still 1 GB:

root@gd38 hadoop-0.20.1# bin/hadoop fs -ls terasort/1G-output
Found 2 items
drwxr-xr-x   - root supergroup          0 2011-03-31 17:13 /user/root/terasort/1G-output/_logs
-rw-r--r--   1 root supergroup 1000000000 2011-03-31 17:13 /user/root/terasort/1G-output/part-00000