6.824 Note1: MapReduce (2004)









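A MapReduce user expresses a computation as two operations, Map() and Reduce().

  • Map() takes an input file and emits a set of intermediate key/value pairs;
  • The MapReduce framework gathers, from the pairs produced by Map(), all values that share the same key and passes them to Reduce();
  • Reduce() takes a key and the set of values for that key, merges those values, and emits a key/value pair.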


map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
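
The word-count pseudocode above can be turned into a small runnable sketch. The names `map_fn`, `reduce_fn` and `run_sequential` are made up for this illustration, and the "framework" here is just a single-process loop that groups values by key; it is not how the real library is implemented.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    # key: document name (unused here), value: document contents
    for word in re.findall(r"[A-Za-z]+", value):
        yield word.lower(), "1"


def reduce_fn(key: str, values: Iterable[str]) -> str:
    # key: a word, values: the counts emitted for that word (as strings)
    return str(sum(int(v) for v in values))


def run_sequential(inputs: Dict[str, str]) -> Dict[str, str]:
    # Hypothetical stand-in for the framework: run every map call,
    # group intermediate values by key, then call reduce once per key.
    intermediate = defaultdict(list)
    for name, contents in inputs.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}


counts = run_sequential({
    "a.txt": "the quick fox",
    "b.txt": "the lazy dog the end",
})
```

Counts are kept as strings only to mirror the pseudocode, where both intermediate and output values are strings.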


3.1 Execution Overview
[Image 1 of the original post (not preserved)]
  1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

  2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

    Number of tasks (M + R) > number of workers

  3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.


  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.


  5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

    Reduce phase: fetch the contents of all intermediate files for this key region, sort them to build the key/values groups, call reduce(), and write the output file;

  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
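
Steps 4 and 5 above (partitioning map output into R regions, then sorting and grouping on the reduce side) can be sketched in a few lines. This is only an in-memory illustration: the paper's default partitioning function is hash(key) mod R, and `zlib.crc32` is used here as an assumed stand-in hash because Python's built-in `hash()` for strings differs between processes. The function names and the value of `R` are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter
from typing import Iterator, List, Tuple

R = 3  # number of reduce tasks; value chosen only for this example

Pair = Tuple[str, str]


def partition(key: str, r: int = R) -> int:
    # hash(key) mod R, with crc32 as a deterministic stand-in hash
    return zlib.crc32(key.encode("utf-8")) % r


def split_into_regions(pairs: List[Pair], r: int = R) -> List[List[Pair]]:
    # Map side (step 4): place each buffered pair into one of R regions.
    regions: List[List[Pair]] = [[] for _ in range(r)]
    for k, v in pairs:
        regions[partition(k, r)].append((k, v))
    return regions


def group_sorted(region: List[Pair]) -> Iterator[Tuple[str, List[str]]]:
    # Reduce side (step 5): sort by key so equal keys are adjacent, then group.
    ordered = sorted(region, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [v for _, v in group]


# Output of two map tasks, each partitioned on its own "local disk".
regions_1 = split_into_regions([("the", "1"), ("fox", "1"), ("the", "1")])
regions_2 = split_into_regions([("dog", "1"), ("the", "1")])

# Reduce task i reads region i from every map worker.
reduce_inputs = [regions_1[i] + regions_2[i] for i in range(R)]
grouped = [dict(group_sorted(region)) for region in reduce_inputs]
```

Because the partition depends only on the key, every occurrence of a key ends up at the same reduce task no matter which map task produced it.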

3.2 Master Data Structures


3.3 Fault Tolerance
  • Worker Failure
    The master pings every worker periodically; a worker that does not respond within the timeout is marked as failed.


  • Master Failure
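
The worker-failure detection described above (periodic pings plus a timeout) can be sketched as follows. The class names, the `PING_TIMEOUT` value and the explicit `now` argument are assumptions made for this illustration. The rescheduling rule in `check_workers` comes from the MapReduce paper rather than from these notes: in-progress tasks and completed map tasks on a failed worker go back to idle, while completed reduce tasks do not.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

PING_TIMEOUT = 10.0  # seconds; assumed value for the example


@dataclass
class Task:
    kind: str             # "map" or "reduce"
    state: str = "idle"   # "idle", "in-progress" or "completed"
    worker: str = ""      # worker the task is (or was last) assigned to


@dataclass
class Master:
    last_reply: Dict[str, float] = field(default_factory=dict)
    tasks: List[Task] = field(default_factory=list)
    failed: Set[str] = field(default_factory=set)

    def record_ping_reply(self, worker: str, now: float) -> None:
        self.last_reply[worker] = now

    def check_workers(self, now: float) -> None:
        for worker, seen in self.last_reply.items():
            if worker in self.failed or now - seen <= PING_TIMEOUT:
                continue
            self.failed.add(worker)
            for t in self.tasks:
                if t.worker != worker:
                    continue
                # Completed map output lives on the failed worker's local
                # disk, so it must be redone; completed reduce output is
                # already in the global file system.
                redo_map = t.kind == "map" and t.state == "completed"
                if t.state == "in-progress" or redo_map:
                    t.state, t.worker = "idle", ""


m = Master()
m.tasks = [
    Task("map", "completed", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "in-progress", "w2"),
]
m.record_ping_reply("w1", now=0.0)
m.record_ping_reply("w2", now=8.0)
m.check_workers(now=12.0)  # w1 has been silent for 12 s, w2 for 4 s
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.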
3.4 Other Notes
  • Locality: input data is managed by GFS with 3 replicas; when scheduling map tasks, the master takes the location of the input files into account;

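  • Backup Tasks: what often dominates the total running time of a MapReduce job is the "straggler". When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as completed as soon as either the primary or the backup execution finishes.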

  • Does reduce run only after all maps have finished? Yes: no Reduce calls until all Maps are finished;

  • Load balance: many more tasks than workers, so fast workers do more. Faster machines execute more tasks and slower machines fewer, which improves dynamic load balancing across the cluster.

  • What if the master gives two workers the same Map() task?
    perhaps the master incorrectly thinks one worker died.
    it will tell Reduce workers about only one of them.

  • What if the master gives two workers the same Reduce() task?
    they will both try to write the same output file on GFS!
    atomic GFS rename prevents mixing; one complete file will be visible.

  • What if a worker computes incorrect output, due to broken h/w or s/w?
    too bad! MR assumes "fail-stop" CPUs and software.
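
The "atomic rename" point for duplicate Reduce() tasks can be illustrated with a local file system standing in for GFS. This is a sketch under that assumption: `commit_reduce_output` and the file names are hypothetical, and `os.replace` is only atomic when both paths are on the same file system.

```python
import os
import tempfile


def commit_reduce_output(final_path: str, data: str) -> None:
    # Each execution writes to its own private temp file first...
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="mr-tmp-")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # ...and then renames it into place in one step, so readers see either
    # no file or one complete file, never a mix of two writers.
    os.replace(tmp_path, final_path)


out_dir = tempfile.mkdtemp()
final = os.path.join(out_dir, "mr-out-0")

commit_reduce_output(final, "the 3\n")  # first execution of the reduce task
commit_reduce_output(final, "the 3\n")  # duplicate execution of the same task

with open(final) as f:
    result = f.read()
leftovers = [n for n in os.listdir(out_dir) if n.startswith("mr-tmp-")]
```

Since Reduce is assumed to be deterministic, it does not matter which execution's rename lands last; the visible file has the same contents either way.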


MapReduce single-handedly made big cluster computation popular.

  • Not the most efficient or flexible.
  • Scales well.
  • Easy to program -- failures and data movement are hidden.
    These were good trade-offs in practice.

[2017.9 梦工厂]

