Understanding the Phases of Hadoop MapReduce

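Hadoop's MapReduce is a classic distributed parallel computing framework, but I had always been a little fuzzy on what each phase actually does. I spent some time reading the explanations on Stack Overflow and am writing them down here.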

Stack Overflow link: https://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce

The example below makes it clear at a glance.


[Figure: map_reduce.jpg]

The figure above shows the classic word count example:

  • Map phase: a map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Here it turns the tokenized text into (word, count) pairs, emitting (word, 1) for each occurrence of a word.

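  • Combine phase: a combiner shrinks the output of each mapper, which saves the time spent moving data from one node to another. In other words, it merges the results of each map task locally, as the figure shows, acting as a pre-aggregation step on the mapper's own node.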

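  • Shuffle & sort phase: the shuffle moves the (combined) map output to the reduce task responsible for each key. A partitioner assigns every (key, value) pair to one partition per reduce task; the default HashPartitioner takes the hash of the key modulo the number of reducers, so all pairs with the same key land in the same partition. Within each partition the pairs are then sorted by key, which is why the figure shows the keys in string order. The sorting makes it easy for the runtime to tell where one key ends and the next begins: while going through the sorted list, whenever the current key differs from the previous one, a new reduce call starts for the new key group. Note that shuffle and sort are not optional when a job has reducers; they are skipped only when the number of reduce tasks is set to 0 (a map-only job). The step that is optional in the configuration is the combiner.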

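  • Reduce phase: processes the shuffled and sorted output, one key group at a time, and writes the final result. In word count, it sums the counts for each word.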
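
The four phases above can be mimicked in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code: the function names, the NUM_REDUCERS constant, and the toy partition function (a byte sum instead of Hadoop's key hash) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumption for this sketch; in Hadoop this is a job setting


def map_phase(line):
    """Map: emit one (word, 1) pair per word in an input split."""
    return [(word, 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate a single mapper's output before any data moves."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def partition(key, num_reducers):
    """Toy stand-in for a hash partitioner: the same key always gets the same index."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Shuffle: route each pair to its partition. Sort: order each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def reduce_phase(sorted_pairs):
    """Reduce: sum the values of each key group; sorted input keeps groups contiguous."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def word_count(splits, num_reducers=NUM_REDUCERS):
    """Run map -> combine -> shuffle & sort -> reduce over a list of input splits."""
    mapper_outputs = [combine_phase(map_phase(split)) for split in splits]
    result = {}
    for part in shuffle_and_sort(mapper_outputs, num_reducers):
        result.update(reduce_phase(part))
    return result


if __name__ == "__main__":
    print(word_count(["deer bear river", "car car river", "deer car bear"]))
```

For the three sample splits, the combiner turns the second mapper's two (car, 1) pairs into a single (car, 2) before the shuffle, and the final counts are car 3, and bear, deer and river 2 each. Because every pair with the same key goes to the same partition, each reducer can sum its keys without talking to the others.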
