Lesson 2 : Happy MapReduce

0x00 Preface

This post is about MapReduce: its basic concepts, architecture, programming model, and an example.

0x01 Introduction

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

This is very important!

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Why MapReduce?

Challenge:

Input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. Issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

Solution:

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.

Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages.

0x02 Architecture

Figure below shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce function, the following sequence of actions occurs.

(Figure: overall execution flow of a MapReduce operation)
  1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

  2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

  3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

  5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
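The seven steps above can be condensed into a single-process sketch. This is only an illustration of the data flow (no RPCs, no failures, no parallelism); the function and variable names here are my own, not the library's:

```python
from collections import defaultdict
import zlib

def hash_partition(key, R):
    # A stable hash, so every worker agrees on which region a key belongs to.
    return zlib.crc32(key.encode()) % R

def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
    # Steps 3-4: run each map task and partition its output into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for key, value in inputs:
        for k, v in map_fn(key, value):
            regions[hash_partition(k, R)][k].append(v)
    # Steps 5-6: each "reduce task" sorts its region by key, then reduces
    # each key's grouped values.
    output = {}
    for region in regions:
        for k in sorted(region):
            output[k] = reduce_fn(k, iter(region[k]))
    return output

# A tiny inverted index built with the skeleton above:
docs = [("doc1", "a b"), ("doc2", "b c")]
index = run_mapreduce(
    docs,
    map_fn=lambda name, text: [(w, name) for w in text.split()],
    reduce_fn=lambda k, vals: sorted(set(vals)),
)
assert index == {"a": ["doc1"], "b": ["doc1", "doc2"], "c": ["doc2"]}
```

Note how all values for a key land in the same region, so a single Reduce sees every occurrence of that key.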

0x03 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Here is what it looks like:

  Input1 -> Map -> a,1 b,1 c,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                        |   -> Reduce -> c,2
                        -----> Reduce -> b,2
                    ---------> Reduce -> a,2

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

Each Map() or Reduce() call is a "task".

Example

Here is the classic simple example, word count. A user can write code like this:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
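The pseudocode above translates almost line for line into Python. This is a sketch: `emit_intermediate` and `emit` are my own names standing in for the MR library calls, and the "shuffle" is just a sorted dict:

```python
from collections import defaultdict

intermediate = defaultdict(list)  # what the library's shuffle builds
output = []                       # the final output file, as a list

def emit_intermediate(key, value):
    intermediate[key].append(value)

def map_fn(key, value):
    # key: document name; value: document contents
    for w in value.split():
        emit_intermediate(w, "1")

def reduce_fn(key, values):
    # key: a word; values: an iterator over counts (as strings)
    result = sum(int(v) for v in values)
    output.append((key, str(result)))

map_fn("doc1", "the quick fox the")
for k in sorted(intermediate):        # the library sorts by intermediate key
    reduce_fn(k, iter(intermediate[k]))
assert output == [("fox", "1"), ("quick", "1"), ("the", "2")]
```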

0x04 Some Exciting Topics

1. What does MapReduce do well?

Scales well:

N computers get you Nx throughput.

Assuming M and R are >= N (i.e. lots of input files and output keys), Map()s can run in parallel, since they don't interact. The same goes for Reduce()s.

So you can get more throughput by buying more computers, rather than building special-purpose efficient parallelizations of each application. Computers are cheaper than programmers!
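Because Map() calls don't interact, the library is free to run them all at once. A toy illustration using a Python thread pool (the real system distributes map tasks across machines, not threads, but the independence property is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def map_task(doc):
    # A pure function: its output depends only on its own input split.
    name, text = doc
    return [(w, 1) for w in text.split()]

splits = [("s1", "a b"), ("s2", "c d"), ("s3", "e f")]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(map_task, splits))

# The tasks share no state, so running them concurrently gives the
# same combined output as running them one by one.
assert sorted(kv for part in results for kv in part) == \
       [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 1), ("f", 1)]
```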

Hides many painful details

  • starting software on servers
  • tracking which tasks are done
  • data movement
  • recovering from failures

2. Fault Tolerance

Questions:

  • What if a worker server running a map or reduce task crashes during a MR job?
  • What if the master server crashes during a MR job?
  • What if a reduce worker crashes in the middle of writing its output?
  • Why not re-start the whole job from the beginning?

See, fault tolerance is an important part of the MapReduce design. Let's try to answer some of the questions above.

Worker server crashes

The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed.

Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
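A minimal sketch of this ping-and-reset logic. The timeout value and the data structures are my own assumptions, not the paper's:

```python
PING_TIMEOUT = 10.0  # seconds without a ping before a worker is marked failed

last_ping = {}   # worker id -> time of last ping response
task_state = {}  # task id -> (state, worker); state is "idle" / "in-progress" / "completed"

def record_ping(worker, now):
    last_ping[worker] = now

def check_workers(now):
    for worker, t in last_ping.items():
        if now - t > PING_TIMEOUT:
            for task, (state, owner) in task_state.items():
                # Completed map tasks are reset too: their output sat on
                # the failed worker's local disk and is now unreachable.
                if owner == worker and state in ("in-progress", "completed"):
                    task_state[task] = ("idle", None)

record_ping("w1", now=0.0)
task_state["map-0"] = ("completed", "w1")
task_state["map-1"] = ("in-progress", "w1")
check_workers(now=11.0)   # w1 has missed its pings
assert task_state["map-0"] == ("idle", None)
assert task_state["map-1"] == ("idle", None)
```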

MR re-runs just the failed Map()s and Reduce()s. MR requires them to be pure functions:

  • they don't keep state across calls,
  • they don't read or write files other than expected MR inputs/outputs,
  • there's no hidden communication among tasks.

So re-execution yields the same output.

The requirement for pure functions is a major limitation of MR compared to other parallel programming schemes. But it's critical to MR's simplicity.

Details of map worker crashes recovery

  1. The master sees the worker no longer responds to pings; the crashed worker's intermediate Map output is lost, but it is likely needed by every Reduce task!
  2. The master re-runs those map tasks, spreading them over other GFS replicas of the input.
  3. Some Reduce workers may already have read the failed worker's intermediate data. Here we depend on Map() being functional and deterministic!
  4. The master need not re-run a Map if all Reduces have already fetched its intermediate data, though a later Reduce crash would then force re-execution of that Map.

Details of reduce worker crashes recovery

If a reduce task has not yet started to write its output, the master simply restarts it on another machine.

If a reduce worker crashes in the middle of writing its output? GFS has an atomic rename that prevents the output from being visible until it is complete. So it's safe for the master to re-run the Reduce task somewhere else.
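The same write-to-temp-then-rename pattern works on a local POSIX filesystem, where rename within one filesystem is atomic. A sketch (the `mr-out-R` naming follows the paper's convention; the helper name is mine):

```python
import os
import tempfile

def write_reduce_output(partition, lines):
    # Write to a private temp file first: a crash mid-write leaves only
    # the temp file behind, never a partial final output.
    final = f"mr-out-{partition}"
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    # Atomic commit: the final name appears only once the file is complete.
    os.rename(tmp, final)

write_reduce_output(0, ["a 2", "b 2"])
with open("mr-out-0") as f:
    assert f.read() == "a 2\nb 2\n"
os.remove("mr-out-0")
```

If two workers race on the same reduce task, both rename onto the same final name, so exactly one complete file wins; readers never see a mix.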

3. What will likely limit the performance?

We care since that's the thing to optimize. CPU? memory? disk? network?

In my opinion, disk and network are the more important limits.

A MR job has to write intermediate results to local disk and then ship them across the network, and both steps cost a lot of time.

In 2004 authors were limited by "network cross-section bandwidth". [diagram: servers, tree of network switches]

Note that all intermediate data goes over the network during the Map->Reduce shuffle. The paper's root switch carried 100 to 200 gigabits/second for 1800 machines, i.e. roughly 55 megabits/second/machine. That is small: much less than disk or RAM speed.
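The arithmetic behind the 55 megabits/second/machine figure, using the lower end of the paper's numbers:

```python
# Back-of-the-envelope: each machine's fair share of root-switch bandwidth.
root_switch_bits_per_sec = 100e9   # 100 gigabits/second (paper's lower bound)
machines = 1800
per_machine = root_switch_bits_per_sec / machines
assert int(per_machine / 1e6) == 55  # about 55 megabits/second/machine
```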

So they cared about minimizing movement of data over the network. (Datacenter networks are much faster today.)

4. How do they get good load balance?

Critical to scaling -- bad for N-1 servers to wait for 1 to finish. But some tasks likely take longer than others.

Solution: many more tasks than workers. Master hands out new tasks to workers who finish previous tasks. So no task is so big it dominates completion time (hopefully).

So faster servers do more work than slower ones, and they all finish at about the same time.
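The "many more tasks than workers" strategy amounts to a shared work queue that fast workers drain faster. A toy simulation (not the real scheduler; the worker speeds are invented):

```python
from collections import deque

tasks = deque(range(10))          # 10 tasks, only 2 workers
done = {"fast": [], "slow": []}
speed = {"fast": 3, "slow": 1}    # tasks finished per scheduling round

while tasks:
    for worker in ("fast", "slow"):
        for _ in range(speed[worker]):
            if tasks:
                done[worker].append(tasks.popleft())

# The fast worker ends up doing ~3x the tasks, so neither one
# sits idle while the other grinds through a huge fixed share.
assert len(done["fast"]) + len(done["slow"]) == 10
assert len(done["fast"]) > len(done["slow"])
```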

5. For what applications doesn't MapReduce work well?

Not everything fits the map/shuffle/reduce pattern.

  • Small data, since overheads are high. E.g. not web site back-end.
  • Small updates to big data, e.g. add a few documents to a big index
  • Unpredictable reads (neither Map nor Reduce can choose input)
  • Multiple shuffles, e.g. page-rank (can use multiple MR but not very efficient)

6. Other failures/problems

What if the master gives two workers the same Map() task?

Perhaps the master incorrectly thinks one worker died. It will tell Reduce workers about only one of them.

What if the master gives two workers the same Reduce() task?

They will both try to write the same output file on GFS! Atomic GFS rename prevents mixing; one complete file will be visible.

What if a single worker is very slow -- a "straggler"?

Perhaps due to flaky hardware. The master starts a second copy of the last few remaining tasks and uses whichever copy finishes first.

0xFF Summary

MapReduce single-handedly made big cluster computation popular.

  • Not the most efficient or flexible.
  • Scales well.
  • Easy to program -- failures and data movement are hidden.
