Tuple MapReduce: beyond the classic MapReduce

By Pere Ferrera Bertran (Datasalt)

It’s been some years now since Google published the paper “MapReduce: Simplified Data Processing on Large Clusters” in 2004. In that paper, Google presented MapReduce, a programming model and associated implementation for solving parallel computation problems over large-scale data. The model is based on the “map” and “reduce” primitives present in LISP and other functional languages.

Today, Hadoop, the “de facto” open-source implementation of MapReduce, is used by a wide variety of companies, institutions and universities. The massive usage of this programming model has led to the creation of multiple tools associated with it (what has come to be known as the Hadoop ecosystem) and even of specialized companies, like Cloudera, dedicated to training programmers to use it. Part of the success of such tools and companies lies in the now-evident difficulty and sharp learning curve of MapReduce, as originally defined, when applied to practical problems.

In this post we’ll review the MapReduce model proposed by Google in 2004 and propose an alternative called Tuple MapReduce. We’ll see that this new model is a generalization of the first and explain what advantages it has to offer. We’ll provide a practical example and conclude by discussing when an implementation of Tuple MapReduce is advisable.

Motivation

Using MapReduce to solve problems like the typical “word count” is advisable and even intuitive. However, for many real-world problems it can be excessively complex to code a solution based on MapReduce (for a reference on this matter, see this post about the shortcomings of the Hadoop MapReduce API).

MapReduce is currently seen as a low-level paradigm on top of which more intuitive, easier-to-use high-level tools must be built. However, another way of addressing MapReduce’s shortcomings is to reformulate it.

If MapReduce were formulated differently, many problems would be easier to code. The high-level tools that could arise from it would also be easier to code.

This is the motivation that has led us to pose and formulate Tuple MapReduce.

MapReduce

The MapReduce model proposed by Google (and the one that Hadoop implements) can be conceptualized as:

The map function processes a certain (key, value) pair and emits a certain number of (key, value) pairs. The reduce function processes values grouped by the same key and emits another set of (key, value) pairs as output.

The original MapReduce model, then, works with (key, value) pairs and specifies that the values received by the reduce function are those grouped under the same key.
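
In the types notation of the original paper, these two functions can be summarized by the following signatures:

map:    (k1, v1)        →  list(k2, v2)
reduce: (k2, list(v2))  →  list(v2)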

Tuple MapReduce

Now we will show an extended MapReduce model, Tuple MapReduce, which we can formalize as:

In this case, the map function processes a tuple as input and emits a certain number of tuples as output. These tuples are made up of “n” fields, out of which “s” fields are used to sort and “g” fields are used to group by: the emitted tuples are sorted by the sort fields, and tuples sharing the same values in the group fields are delivered together to a single reduce call.

In the reduce function, for each group, we receive a group tuple with the “g” grouping fields and the sorted list of tuples belonging to that group. Finally, we emit a certain number of tuples as output.

Tuple MapReduce extends the idea of MapReduce in order to be able to work with an arbitrary number of fields. It specifies how to sort and group in order to receive the tuples in the reduce function in a certain order. This formulation simplifies programming in several use cases.
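
Using the same arrow notation, the Tuple MapReduce signatures can be sketched as follows. This is a reconstruction from the prose above, not a formal specification:

map:    tuple  →  list(tuple(f1, …, fn))
reduce: (groupTuple(f1, …, fg), sortedList(tuple))  →  list(tuple)

In the example below, the “g” group-by fields form a prefix of the “s” sort fields, so sorting by the sort fields both keeps each group’s tuples contiguous and fixes the order in which the reduce function iterates over them.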

Example use case

In Google’s MapReduce paper we find a pseudo-code example of the typical “word count”. However, as mentioned above, many real-world problems are difficult to express in MapReduce. These problems usually involve compound records (tables) and/or data that needs to be grouped and sorted in some specific way. Let’s look at an example.

Imagine we have a log of daily unique visits for each URL, with records of the form [“url”, “date”, “visits”]. From that log, we want to calculate, for each URL, the cumulative number of unique visits up to each date. In pseudo-code, the user could write a program that is something like:

map(Tuple tuple):
    EmitIntermediate(tuple);

reduce(Tuple groupTuple, Iterator tuples):
    int count = 0;
    for each tuple in tuples:
        count += tuple.get("visits");
        Emit(NewTuple(tuple.get("url"), tuple.get("date"), count));

The user would need to configure the Tuple MapReduce implementation in the following manner:

configureSortBy(TupleFields("url", "date"));
configureGroupBy(TupleFields("url"));

The map function is quite simple: it just emits each record as it is processed. Because the user doesn’t need to create a key and a value, emitting a compound record is extremely easy. And because the process has been configured to sort by (“url”, “date”) and group by (“url”), each URL’s group is guaranteed to receive its records sorted by date.

The reduce function keeps a running visit counter for each URL. For each record in the group, we add that day’s visits and emit the accumulated count for that URL and date.
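
To make the dataflow concrete, below is a minimal, single-machine sketch in Python that simulates the sort-by (“url”, “date”) / group-by (“url”) behavior together with the reduce above. It only illustrates the semantics; it is not the API of any real Tuple MapReduce implementation, and all names are illustrative:

from itertools import groupby

# Input records of the form ("url", "date", "visits"), in arbitrary order.
records = [
    ("example.com/a", "2012-01-02", 5),
    ("example.com/a", "2012-01-01", 3),
    ("example.com/b", "2012-01-01", 7),
    ("example.com/a", "2012-01-03", 2),
]

def map_fn(record):
    # The map just emits each record unchanged.
    yield record

def reduce_fn(url, tuples):
    # Tuples arrive sorted by date; emit the running total per (url, date).
    count = 0
    for _, date, visits in tuples:
        count += visits
        yield (url, date, count)

# Simulate the framework: sort by ("url", "date"), then group by ("url").
intermediate = [t for r in records for t in map_fn(r)]
intermediate.sort(key=lambda t: (t[0], t[1]))                 # configureSortBy
for url, group in groupby(intermediate, key=lambda t: t[0]):  # configureGroupBy
    for row in reduce_fn(url, group):
        print(row)

Running it prints the cumulative counts (3, 8, 10 for example.com/a and 7 for example.com/b), which is exactly what the pseudo-code above would compute on a cluster.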

Other applications similar to this example would include: calculating increments in visits between consecutive days, calculating moving averages (e.g. average number of visits within the 30 days prior to each date), calculating unique visits in different time periods (year, month, week, …). All of these applications are quite common nowadays, and yet they are quite difficult to code in a simple and scalable way using MapReduce. In this blog post you’ll find an example of such complexity.

Generalization

If we think about it, Tuple MapReduce is a more general model than MapReduce. MapReduce can be seen as the particular case of Tuple MapReduce in which we only work with tuples of two fields and limit both grouping and sorting to a single field (the first one, which would be the “key” in the original MapReduce).
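
As an illustration (our own, in the same in-memory simulation style as the example above), the classic word count fits this restricted form directly: two-field tuples, grouped and sorted by the first field:

from itertools import groupby

words = ["to", "be", "or", "not", "to", "be"]

# map: emit two-field tuples (word, count) -- the classic (key, value) pair.
intermediate = [(word, 1) for word in words]

# The framework would sort and group by the first field, i.e. the "key".
intermediate.sort(key=lambda t: t[0])
for word, group in groupby(intermediate, key=lambda t: t[0]):
    print((word, sum(count for _, count in group)))  # ("be", 2), ("not", 1), ("or", 1), ("to", 2)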

Therefore we want to emphasize that Tuple MapReduce allows us to do the same things as MapReduce, while greatly simplifying the way that we code and understand it.

Implementation

Tuple MapReduce can be implemented on the same foundations as any current MapReduce architecture. It does not, in itself, involve changes in how the system is distributed or coordinated, but only in the way the user interfaces with the system.

Tuple-Join MapReduce

A quite common pattern in parallel data processing is joining multiple heterogeneous data sources. This problem is not inherently solved by the original formulation of MapReduce. Tuple MapReduce, however, can be generalized naturally to allow joins between multiple data sources. Taking two data sources as an example, we can formalize a join in Tuple MapReduce in the following simplified manner:

The group tuple “g” must be a prefix of all the tuples emitted from all data sources. In the reducer, for each group tuple, we receive several lists of tuples: one list per data source. These lists could be ordered, for instance, according to the order in which the different data sources were defined.
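
In the same informal notation as before (again a reconstruction from this description rather than a formal definition), a two-source join looks like:

map1:   tuple  →  list(tuple(g1, …, gk, …))          (source 1)
map2:   tuple  →  list(tuple(g1, …, gk, …))          (source 2)
reduce: (groupTuple(g1, …, gk), list1(tuple), list2(tuple))  →  list(tuple)

That is, both map functions emit tuples whose leading fields are the shared group fields, and each reduce call receives one list of tuples per data source for that group.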

Conclusion

In this post we have presented a new MapReduce model, Tuple MapReduce, and shown its benefits. We have generalized it to allow joins between different data sources (Tuple-Join MapReduce). We have noted that it allows the same things to be done as the MapReduce we already know, while being much simpler to learn and use.

We believe that an implementation of Tuple MapReduce would be advisable and that it could act as a replacement for the original MapReduce. Rather than competing with the high-level tools that have been built on top of MapReduce, such an implementation would be comparable in efficiency to current implementations of MapReduce itself.

At Datasalt we are working on an implementation of Tuple MapReduce for Hadoop that we will open-source for the community in the near future.

