Case Study: MapReduce

1. Preparation:

Read the MapReduce paper (2004)

2. Main Content

MapReduce is a good illustration of 6.824's themes (distributed systems), and MR is also the topic of Lab 1.

2.1 MapReduce Overview

  • Context: multi-hour computations over multi-terabyte datasets.
    E.g., experimental analysis of the structure of the crawled web. Such programs are often not written by distributed-systems enthusiasts, and distribution can be very painful, e.g., coping with failures.

  • Overall goal: let non-specialist programmers process huge datasets easily and with reasonable efficiency.
    The programmer defines the Map and Reduce functions, which are usually fairly simple sequential code.
    MR runs those functions over huge inputs on 1,000 machines, hiding all the details of distribution.

2.2 The MapReduce Abstraction

  Input1 -> Map -> a,1  b,1  c,1
  Input2 -> Map ->      b,1
  Input3 -> Map -> a,1       c,1
                    |    |    |
                    |    |    -> Reduce -> c,2
                    |    ------> Reduce -> b,2
  1. The input is divided into "splits".
  2. MR calls Map() on each split, producing sets of intermediate (k2, v2) pairs.
  3. MR gathers all the v2's for a given k2 and passes them to a Reduce() call.
  4. The final output is the set of <k2, v3> pairs produced by the Reduce() calls.
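
Expressed as Go types (Go is the language of Lab 1), the abstraction is just two function signatures. A minimal sketch; the type names below are illustrative, not from the paper:

  // A single intermediate key/value pair emitted by Map.
  type KeyValue struct {
      Key   string
      Value string
  }

  // Map: (k1, v1) -> a list of intermediate (k2, v2) pairs.
  type MapFn func(key, value string) []KeyValue

  // Reduce: (k2, all v2's gathered for that k2) -> v3.
  type ReduceFn func(key string, values []string) string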

Example: word count

  • Input is thousands of text files.
  • Map(k, v)
    split v into words
    for each word w
      emit(w, "1")
  • Reduce(k, v)
    emit(len(v))
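
Here is the same word count as runnable Go, in the spirit of Lab 1. This is a sketch: the sequential driver in main stands in for the real framework's split/shuffle machinery, and the function bodies mirror the pseudocode above.

  package main

  import (
      "fmt"
      "strconv"
      "strings"
      "unicode"
  )

  // KeyValue is an intermediate pair emitted by Map.
  type KeyValue struct {
      Key   string
      Value string
  }

  // Map emits (word, "1") for every word in the file contents.
  func Map(filename string, contents string) []KeyValue {
      words := strings.FieldsFunc(contents, func(r rune) bool {
          return !unicode.IsLetter(r)
      })
      kva := []KeyValue{}
      for _, w := range words {
          kva = append(kva, KeyValue{Key: w, Value: "1"})
      }
      return kva
  }

  // Reduce gets every value emitted for one word; the count is
  // just the length of that list, as in emit(len(v)).
  func Reduce(key string, values []string) string {
      return strconv.Itoa(len(values))
  }

  func main() {
      // Toy sequential run: one split, then a shuffle, then one Reduce per key.
      kva := Map("doc.txt", "a b c b a c c")
      byKey := map[string][]string{}
      for _, kv := range kva {
          byKey[kv.Key] = append(byKey[kv.Key], kv.Value)
      }
      for k, vs := range byKey {
          fmt.Println(k, Reduce(k, vs))
      }
  }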

2.3 Advantages of the MapReduce Model

  1. The model is easy to program; it hides many painful details:
    • concurrency: the result is the same as a sequential execution
    • starting the software (s/w) on the servers
    • data movement
    • failures
  2. The model scales well:
    Nx computers get you Nx Map() and Reduce() throughput.
    Map() calls don't wait for each other and don't share data, so they can run in parallel; the same holds for Reduce() calls.
    So you can get more throughput simply by buying more computers, rather than hand-optimizing each application for parallelism.
    Computers are much cheaper than programmers.

2.4 Some Questions about MapReduce

  1. What is likely to limit performance?
    We care because that is what is worth optimizing.
    CPU? Memory? Disk? Network?
    In 2004 the answer was "network cross-section bandwidth":
    the network's total internal capacity is much smaller than the sum of the hosts' link speeds.
    It is hard to build a network that moves data 1000x faster than a single computer can, so the design improves performance by minimizing data movement over the network.

  2. How does MR get fault tolerance?
    E.g., what happens if a server crashes in the middle of an MR job?
    Hiding failures from the programmer is a huge part of what makes MR easy to use.
    Why not just restart the whole job from the beginning?
    MR re-runs only the failed Map() and Reduce() calls:
    Map() and Reduce() are pure functions: they don't modify their inputs, don't keep state, don't share memory, and there is no Map-to-Map or Reduce-to-Reduce interaction, so re-running a failed call produces the same result.
    Compared with other parallel programming models, being restricted to pure functions is MR's main limitation, but it is also key to MR's simplicity.

3. More Details

[Figure: MapReduce execution overview, cf. the paper's Figure 1]

master: hands out tasks to workers; remembers where intermediate output lives
the input is a set of splits
the input is stored in GFS, with 3 copies of each split
every computer runs both GFS and MR workers
there are many more input splits than workers
the master gives a Map task to each server, handing out new tasks as old ones finish
each Map worker hashes intermediate keys into R partitions on its local disk (see the sketch below)
no Reduce calls are made until all Maps are finished
the master tells the Reducers to fetch their intermediate data partitions from the Map workers
Reduce workers write the final output to GFS
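
A sketch of the partitioning step just described. The ihash helper below mirrors the FNV-based helper suggested in Lab 1, and the mr-M-R file naming is the lab's convention, used here only for illustration:

  package main

  import (
      "fmt"
      "hash/fnv"
  )

  // ihash maps a key to a non-negative int; a Map worker uses
  // ihash(key) % R to choose the Reduce partition for that key.
  func ihash(key string) int {
      h := fnv.New32a()
      h.Write([]byte(key))
      return int(h.Sum32() & 0x7fffffff)
  }

  func main() {
      const R = 4 // number of Reduce partitions
      for _, key := range []string{"a", "b", "c"} {
          r := ihash(key) % R
          // Map task m appends the pair to local file mr-m-r, so that
          // Reduce task r can later fetch every mr-*-r file.
          fmt.Printf("key %q -> partition %d (file mr-0-%d)\n", key, r, r)
      }
  }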

How does the detailed design help network performance?
Map input is read from local disks, not over network.
Intermediate data goes over network just once.
Stored on local disk, not GFS.
Intermediate data partitioned into files holding many keys.
Big network transfers are more efficient.

How do they get good load balance?
Critical to scaling – otherwise Nx servers -> no gain.
But time to process a split or partition isn’t uniform.
Different sizes and contents, and different server hardware.
Solution: many more splits than workers.
Master hands out new splits to workers who finish previous tasks.
So no split is so big it dominates completion time (hopefully).
So faster servers do more work than slower ones, and all finish at about the same time (a sketch of this dynamic hand-out follows).
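
A minimal sketch of the dynamic hand-out, assuming a channel as the master's task queue (real MR assigns tasks to workers over RPC; the names here are made up). Fast workers drain more splits, so all workers finish at about the same time:

  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  func main() {
      const nSplits, nWorkers = 20, 3

      // The master's queue: many more splits than workers.
      tasks := make(chan int, nSplits)
      for s := 0; s < nSplits; s++ {
          tasks <- s
      }
      close(tasks)

      var wg sync.WaitGroup
      for w := 0; w < nWorkers; w++ {
          wg.Add(1)
          go func(worker int) {
              defer wg.Done()
              // Each worker pulls the next split as soon as it is free,
              // so a fast worker ends up processing more splits.
              for split := range tasks {
                  time.Sleep(time.Duration(worker+1) * 10 * time.Millisecond) // uneven speeds
                  fmt.Printf("worker %d finished split %d\n", worker, split)
              }
          }(w)
      }
      wg.Wait()
  }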

How does MR cope with worker crashes?
* Map worker crashes:
master re-runs, spreads tasks over other GFS replicas of input.
even if worker had finished, since still need intermediate data on disk.
some Reduce workers may already have read failed worker’s intermediate data.
here we depend on functional and deterministic Map()!
how does the master know the worker crashed? (pings; sketched in code after the failure lists below)
master need not re-run Map if Reduces have fetched all intermediate data
though then a Reduce crash would have to wait for Maps to re-run
* Reduce worker crashes before producing output.
master re-starts its tasks on another worker.
* Reduce worker crashes in the middle of writing its output.
GFS has atomic rename that prevents output from being visible until complete.
so it’s safe for the master to re-run the Reduce tasks somewhere else.
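
The write-to-temp-then-atomically-rename pattern can be sketched in plain Go; Lab 1 recommends the same trick, with os.Rename standing in for GFS's atomic rename:

  package main

  import (
      "fmt"
      "os"
  )

  // writeReduceOutput writes a Reduce task's output to a temporary
  // file and then atomically renames it into place. A crash mid-write
  // never leaves a partial mr-out file visible, so the master can
  // safely re-run the task elsewhere; the last rename simply wins.
  func writeReduceOutput(task int, data string) error {
      tmp, err := os.CreateTemp(".", "mr-tmp-*")
      if err != nil {
          return err
      }
      if _, err := tmp.WriteString(data); err != nil {
          tmp.Close()
          return err
      }
      if err := tmp.Close(); err != nil {
          return err
      }
      return os.Rename(tmp.Name(), fmt.Sprintf("mr-out-%d", task))
  }

  func main() {
      if err := writeReduceOutput(0, "a 2\nb 2\nc 2\n"); err != nil {
          fmt.Println("write failed:", err)
      }
  }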

Other failures/problems:
* What if the master accidentally starts two Map() workers on same input?
it will tell Reduce workers about only one of them.
* What if two Reduce() workers for the same partition of intermediate data?
they will both try to write the same output file on GFS!
the atomic GFS rename means whichever one finishes second wins.
* What if a single worker is very slow – a “straggler”?
perhaps due to flaky hardware.
master starts a second copy of last few tasks.
* What if a worker computes incorrect output, due to broken h/w or s/w?
too bad! MR assumes “fail-stop” CPUs and software.
* What if the master crashes?
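
On the ping mechanism mentioned earlier: in the paper, the master pings every worker periodically and marks a worker failed after a timeout. A toy sketch of the equivalent bookkeeping (modeled with worker heartbeats rather than master-initiated pings; all names are hypothetical, and a real master would do this over RPC):

  package main

  import (
      "fmt"
      "sync"
      "time"
  )

  type master struct {
      mu       sync.Mutex
      lastPing map[string]time.Time // worker id -> last heartbeat
  }

  // Ping records that a worker is still alive.
  func (m *master) Ping(worker string) {
      m.mu.Lock()
      defer m.mu.Unlock()
      m.lastPing[worker] = time.Now()
  }

  // sweep presumes any worker silent longer than timeout has crashed;
  // the master would then re-queue that worker's in-progress tasks
  // (and, for a Map worker, its completed tasks too).
  func (m *master) sweep(timeout time.Duration) {
      m.mu.Lock()
      defer m.mu.Unlock()
      for w, t := range m.lastPing {
          if time.Since(t) > timeout {
              fmt.Printf("worker %s presumed crashed; re-run its tasks\n", w)
              delete(m.lastPing, w)
          }
      }
  }

  func main() {
      m := &master{lastPing: map[string]time.Time{}}
      m.Ping("worker-1")
      time.Sleep(50 * time.Millisecond)
      m.sweep(10 * time.Millisecond) // worker-1 has gone silent
  }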

For what applications doesn’t MapReduce work well?
Not everything fits the map/shuffle/reduce pattern.
Small data, since overheads are high. E.g. not web site back-end.
Small updates to big data, e.g. add a few documents to a big index
Unpredictable reads (neither Map nor Reduce can choose input)
Multiple shuffles, e.g. page-rank (can use multiple MR but not very efficient)
More flexible systems allow these, but more complex model.

Conclusion
MapReduce single-handedly made big cluster computation popular.
- Not the most efficient or flexible.
+ Scales well.
+ Easy to program – failures and data movement are hidden.
These were good trade-offs in practice.
We’ll see some more advanced successors later in the course.
Have fun with the lab!
