The database heavyweights at databasecolumn (among them Michael Stonebraker, who led the original Berkeley work behind PostgreSQL) recently published a critique of the currently white-hot MapReduce technology, sparking fierce discussion. I have found time to translate some of it here, so we can study it together.
Translator's note: a Tanenbaum vs. Linus style discussion like this naturally produces very heated debate. But frankly, judging by how the Tanenbaum vs. Linus debate actually unfolded, Linux increasingly absorbed and applied, in its own way, the lessons of OS researchers such as Tanenbaum (rather than rejecting them); so one hopes the MapReduce vs. DBMS discussion will likewise enlighten those who come after, rather than simply entrench two opposing camps.
Original article: http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
Note: the authors are David J. DeWitt and Michael Stonebraker.
On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.
For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework.
As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

1. A giant step backward in the programming paradigm for large-scale data-intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well-known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMSs
5. Incompatible with all of the tools DBMS users have come to depend on
First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.
The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.
The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
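The map/split pipeline just described can be sketched in a few lines of Python (our illustration, not Google's code); the bucket count M, the word-count map function, and the CRC32-based split are all illustrative choices:

```python
import zlib

M = 4  # number of buckets / reduce instances (illustrative)

def map_fn(record):
    # User-written map: filter/transform a record into (key, data) pairs.
    # Word count is the classic example.
    for word in record.split():
        yield (word, 1)

def split(key):
    # The split function: typically a hash; any deterministic function of
    # the key will do. CRC32 is stable across processes, unlike Python's
    # built-in hash() for strings.
    return zlib.crc32(key.encode()) % M

def run_map_instance(input_records):
    # One map instance: emits its output records into M disjoint buckets
    # (files on local disk, in the real framework).
    buckets = [[] for _ in range(M)]
    for record in input_records:
        for key, data in map_fn(record):
            buckets[split(key)].append((key, data))
    return buckets

buckets = run_map_instance(["the quick fox", "the lazy dog"])
# Both occurrences of "the" land in the same bucket, because the split
# function depends only on the key.
```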
In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.
The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.
The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.
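The reduce phase just described can be sketched the same way (again our illustration, not the framework's code); here the grouping is done by sorting, and the user's reduce function is a simple sum:

```python
from itertools import groupby

def reduce_fn(key, values):
    # User-written reduce: an arbitrary computation over one key's records;
    # a sum stands in for any aggregate here.
    return (key, sum(values))

def run_reduce_instance(j, map_outputs):
    # Reduce instance R_j pulls bucket j (the files F_{i,j}) from each of
    # the N map outputs, groups the records on their keys by sorting, and
    # feeds each group to the reduce program.
    records = [kv for buckets in map_outputs for kv in buckets[j]]
    records.sort(key=lambda kv: kv[0])
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(records, key=lambda kv: kv[0])]

# Two map instances (N=2), two buckets each (M=2):
map_outputs = [[[("a", 1)], [("b", 1)]],
               [[("a", 1)], [("c", 1)]]]
answer = run_reduce_instance(0, map_outputs)
# answer == [("a", 2)]: both "a" records reach the same reduce instance,
# no matter which map instance produced them.
```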
To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
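The analogy can be made concrete with the standard library's sqlite3 (our example, not the authors'): the GROUP BY attribute plays the role of the map key, and the aggregate function plays the role of reduce:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("ann", "shoe", 10000), ("bob", "shoe", 20000),
                  ("eve", "toy", 30000)])

# "map" = route rows by dept; "reduce" = AVG over each group.
rows = conn.execute(
    "SELECT dept, AVG(salary) FROM emp GROUP BY dept ORDER BY dept"
).fetchall()
# rows == [("shoe", 15000.0), ("toy", 30000.0)]
```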
We now turn to the five concerns we have with this computing paradigm.
As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968:

1. Schemas are good.
2. Separation of the schema from the application is good.
3. High-level access languages are good.
MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.
The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.
It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.
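The system-catalog point can be illustrated with sqlite3 (our example): the schema lives with the data and any user can query it, with no need to hunt down application source code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")

# The catalog is queryable like any other data: PRAGMA table_info reports
# each column's name and declared type for a table.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(emp)")]
# columns == [("name", "TEXT"), ("dept", "TEXT"), ("salary", "REAL")]
```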
During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

1. By stating what you want, rather than presenting an algorithm for how to get it (the relational view), or
2. By presenting an algorithm for data access (the Codasyl view).
The result is now ancient history, but the entire world saw the value of high-level languages, and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.
MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.
Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}) different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.
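A tiny sketch of that self-describing tuple format (our names, not BigTable's actual API) shows the consequence: two rows of the same table can expose different column sets, and nothing enforces a common schema:

```python
# Each tuple carries its own column names alongside its values.
table = [
    ("row1", {"name": "ann", "salary": 10000}),
    ("row2", {"name": "bob", "dept": "shoe"}),  # different columns, same table
]

schemas = [set(columns) for _, columns in table]
# The two tuples expose different column sets -- there is no single schema
# an application could rely on.
```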
2. MapReduce is a poor implementation
All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.
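A toy illustration of the point (ours, not the authors'): a hash index over salary answers the query by touching only the matching records, where a brute-force scan examines every record:

```python
from collections import defaultdict

# 10,000 employee records; exactly 100 of them have salary 10,000.
employees = [{"name": f"e{i}", "salary": 1000 * (i % 100)}
             for i in range(10_000)]

# Brute-force sequential scan: examines all 10,000 records.
scan_hits = [e for e in employees if e["salary"] == 10_000]

# Hash index on salary: built once; a lookup then touches only the hits.
index = defaultdict(list)
for e in employees:
    index[e["salary"]].append(e)
index_hits = index[10_000]

# Same answer, but the indexed lookup examined two orders of magnitude
# fewer records (100 instead of 10,000).
assert scan_hits == index_hits
```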
MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.
One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.
In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.
There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.
One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database System: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.
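A small simulation of the skew effect (ours, not from the cited paper): with the same total work, one hot key drags the whole job down to the speed of its slowest reduce instance:

```python
M = 4  # reduce instances

def job_time(key_counts):
    # All records with the same key go to one reduce instance (round-robin
    # assignment here as a stand-in for a hash split); the job finishes
    # only when the slowest instance does.
    loads = [0] * M
    for i, count in enumerate(key_counts.values()):
        loads[i % M] += count
    return max(loads)

uniform = {f"k{i}": 100 for i in range(8)}                # 800 records, spread evenly
skewed = {"hot": 730, **{f"k{i}": 10 for i in range(7)}}  # 800 records, one hot key

# job_time(uniform) == 200 but job_time(skewed) == 740:
# identical total work, ~3.7x slower under skew.
```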
There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.
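The file arithmetic in the example above, spelled out:

```python
N = 1_000  # map instances
M = 500    # reduce instances

total_files = N * M       # 500,000 local files materialized by the map phase
files_per_reduce = N      # each reduce instance pulls one file per map node
total_transfers = M * N   # 500,000 cross-node "pull" transfers in all
```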
Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.
3. MapReduce is not novel at all
The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.
Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.
While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.
4. MapReduce is missing features
All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:

1. Bulk loader -- to transform input data in files into a desired format and load it into a DBMS
2. Indexing -- as noted above
3. Updates -- to change the data in the data base
4. Transactions -- to support parallel update and recovery from failures during update
5. Integrity constraints -- to help keep garbage out of the data base
6. Referential integrity -- again, to help keep garbage out of the data base
7. Views -- so the schema can change without having to rewrite the application program
In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.
5. MapReduce is incompatible with the DBMS tools
A modern SQL DBMS has available all of the following classes of tools:

1. Report writers that prepare reports for human visualization
2. Business intelligence tools that enable ad-hoc querying of large data warehouses
3. Data mining tools that allow a user to discover structure in large data sets
4. Replication tools that allow a user to replicate data from one DBMS to another
5. Database design tools that assist the user in constructing a data base
MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.
It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.
We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally, we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.
[1] "MapReduce: Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.
[2] "The Gamma Database Machine Project," DeWitt, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.
[3] "Gamma - A High Performance Dataflow Database Machine," DeWitt, D., R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna, Proceedings of the 1986 VLDB Conference, 1986.
[4] "Prototyping Bubba, A Highly Parallel Database System," Boral, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.
[6] "Parallel Database System: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray, CACM, Vol. 35, No. 6, June 1992.
[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and Robert H. Gerber, Proceedings of the 1985 VLDB Conference, 1985.
[8] "The Case for Shared-Nothing," Michael Stonebraker, Data Engineering Bulletin, Vol. 9, No. 1, 1986.
[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton, Proceedings of the 1995 SIGMOD Conference, 1995.
[10] "Pig", Chris Olston, http://research.yahoo.com/project/90
[11] "Application of Hash to Data Base Machine and Its Architecture," Masaru Kitsuregawa, Hidehiko Tanaka, Tohru Moto-Oka, New Generation Comput. 1(1): 63-74 (1983)