MapReduce: Simplified Data Processing on Large Clusters (Chinese Translation, Part 2)

1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
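The map-then-reduce pattern described above can be illustrated with a minimal, single-process sketch. This is not the library described in the paper: the names map_reduce, word_count_map, and word_count_reduce are hypothetical, and the sketch ignores parallelization, data distribution, and fault tolerance entirely. It only shows how a user-supplied map function emits intermediate key/value pairs and how a user-supplied reduce function combines all values that share a key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(
    records: Iterable[str],
    map_fn: Callable[[str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: run map_fn on every record,
    group the intermediate values by key, then run reduce_fn per key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase: each logical record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: combine all values that share the same key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line: str) -> Iterable[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.split():
        yield word, 1


def word_count_reduce(word: str, counts: List[int]) -> int:
    """Sum the partial counts collected for one word."""
    return sum(counts)


counts = map_reduce(
    ["the quick fox", "the lazy dog"],
    word_count_map,
    word_count_reduce,
)
```

Because map_fn looks only at one record and reduce_fn looks only at one key's values, the calls are independent of one another, which is the property that makes parallel execution and re-execution after a failure possible in a distributed setting.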
The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.


Introduction (translation)

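Over the past five years, the authors and many colleagues at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web request logs, in order to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most of these computations are conceptually straightforward. However, the input data is usually large, and the computation has to be distributed across hundreds or thousands of machines to finish in a reasonable amount of time. The questions of how to parallelize the computation, distribute the data, and handle failures combine to bury the originally simple computation under large amounts of complex code written to deal with them.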


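In response to this complexity, we designed a new abstraction that lets us express the simple computations we want to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing inside a library. The abstraction is inspired by the map and reduce primitives found in Lisp and many other functional languages. We realized that most of our computations apply a map operation to each logical record of the input to compute a set of intermediate key/value pairs, and then apply a reduce operation to all values that share the same key in order to combine the derived data appropriately. Using a functional model with user-specified map and reduce operations lets us parallelize large computations easily and use re-execution as the primary mechanism for fault tolerance.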


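The major contributions of this work are a simple yet powerful interface that enables automatic parallelization and distribution of large-scale computations, together with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 presents performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google, including our experience using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.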

