In MapReduce, a YARN application is called a Job. The implementation of the Application Master provided by the MapReduce framework is called MRAppMaster.
This is the timeline of a MapReduce Job execution:

Notice that the Reduce Phase may start before the end of the Map Phase; hence the two phases can interleave.
We now focus our discussion on the Map Phase. A key decision is how many MapTasks the Application Master needs to start for the current job.
Let’s take a step back. When a client submits an application, several kinds of information are provided to the YARN infrastructure. In particular:

- the map() implementation
- the reduce() implementation

The number of files inside the input directory is used for deciding the number of Map Tasks of a job.
The Application Master will launch one MapTask for each map split. Typically, there is a map split for each input file. If the input file is too big (bigger than the HDFS block size) then we have two or more map splits associated with the same input file. This is the pseudocode used inside the getSplits() method of the FileInputFormat class:
num_splits = 0
for each input file f:
    remaining = f.length
    while remaining / split_size > split_slope:
        num_splits += 1
        remaining -= split_size
    if remaining > 0:
        num_splits += 1    # the leftover bytes become the last split
where:
split_slope = 1.1
split_size =~ dfs.blocksize
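As a sanity check, the split-counting logic can be modeled as a small runnable function. This is a simplified sketch: the real FileInputFormat also honors minimum/maximum split-size settings, which are ignored here, and the leftover bytes of each file (up to 10% larger than a split) count as a final split.

```python
SPLIT_SLOPE = 1.1  # tolerate a last split up to 10% larger than split_size

def count_splits(file_lengths, split_size):
    """Simplified model of FileInputFormat.getSplits(): return how many
    map splits (and hence MapTasks) the given input files produce."""
    num_splits = 0
    for remaining in file_lengths:
        # Carve off full-size splits while the leftover is still more
        # than 10% larger than one split.
        while remaining / split_size > SPLIT_SLOPE:
            num_splits += 1
            remaining -= split_size
        if remaining > 0:
            num_splits += 1  # the leftover becomes the last split
    return num_splits

# With a 128 MB block size, a 300 MB file yields 3 splits (128+128+44 MB),
# while a 130 MB file yields a single split, since 130/128 < 1.1.
print(count_splits([300], 128))  # -> 3
print(count_splits([130], 128))  # -> 1
```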
Notice that the configuration parameter mapreduce.job.maps is ignored in MRv2 (in the past it was just a hint).
The MapReduce Application Master asks the Resource Manager for the Containers needed by the Job: one MapTask container request for each MapTask (map split).

A container request for a MapTask tries to exploit data locality of the map split: the Application Master asks for a container on a node that stores a replica of the map split data or, failing that, on the same rack. This is just a hint to the Resource Scheduler: the scheduler is free to ignore data locality if the suggested assignment is in conflict with the Resource Scheduler’s goal.
When a Container is assigned to the Application Master, the MapTask is launched.
This is a possible execution scenario of the Map Phase:
Let’s now focus on a single Map Task. This is the Map Task execution timeline:

During the INIT phase, we:

- create a task context (TaskAttemptContext.class)
- create an instance of the user Mapper.class
- setup the input (e.g., InputFormat.class, InputSplit.class, RecordReader.class)
- setup the output (NewOutputCollector.class)
- create a mapper context (MapContext.class, Mapper.Context.class)
- initialize the input, e.g., create a SplitLineReader.class object and an HdfsDataInputStream.class object
The EXECUTION phase is performed by the run method of the Mapper class. The user can override it, but by default it will start by calling the setup method: by default this function does nothing useful, but it can be overridden by the user in order to set up the Task (e.g., initialize class variables). After the setup, map() is invoked for each (key, value) tuple contained in the map split. Therefore, map() receives a key, a value, and a mapper context. Using the context, a map stores its output to a buffer.
Notice that the map split is fetched chunk by chunk (e.g., 64KB) and each chunk is split into several (key, value) tuples (e.g., using SplitLineReader.class). This is done inside the Mapper.Context.nextKeyValue method.
When the map split has been completely processed, the run function calls the cleanup method: by default, no action is performed, but the user may decide to override it.
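The setup/map/cleanup control flow described above can be sketched in Python; the ListContext class below is a hypothetical in-memory stand-in for Hadoop's Mapper.Context (the real classes are Java):

```python
class Mapper:
    """Python model of the control flow of Hadoop's Mapper.run()."""
    def setup(self, context):    # override for per-task initialization
        pass
    def cleanup(self, context):  # override for per-task teardown
        pass
    def map(self, key, value, context):
        context.write(key, value)  # identity mapping by default

    def run(self, context):
        self.setup(context)
        while context.next_key_value():  # fetch the next (key, value) tuple
            self.map(context.current_key, context.current_value, context)
        self.cleanup(context)

class ListContext:
    """Hypothetical stand-in for Mapper.Context, backed by a list."""
    def __init__(self, tuples):
        self._tuples = iter(tuples)
        self.current_key = self.current_value = None
        self.output = []
    def next_key_value(self):
        try:
            self.current_key, self.current_value = next(self._tuples)
            return True
        except StopIteration:
            return False
    def write(self, key, value):
        self.output.append((key, value))

ctx = ListContext([(0, "line one"), (9, "line two")])
Mapper().run(ctx)
print(ctx.output)  # -> [(0, 'line one'), (9, 'line two')]
```

Overriding map() (and, when needed, setup/cleanup) in a subclass is exactly the contract a user job fulfills.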
As seen in the EXECUTION phase, map will write (using Mapper.Context.write()) its output into a circular in-memory buffer (MapTask.MapOutputBuffer). The size of this buffer is fixed and determined by the configuration parameter mapreduce.task.io.sort.mb (default: 100MB).
Whenever this circular buffer is almost full (mapreduce.map.sort.spill.percent: 80% by default), the SPILLING phase is performed (in parallel, using a separate thread). Notice that if the spilling thread is too slow and the buffer becomes 100% full, then map() cannot be executed and thus has to wait.
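The interplay between the two parameters is simple arithmetic; a minimal sketch assuming the default values quoted above:

```python
SORT_MB = 100          # mapreduce.task.io.sort.mb (default)
SPILL_PERCENT = 0.80   # mapreduce.map.sort.spill.percent (default)

def should_spill(used_bytes):
    """Spilling starts once the buffer holds >= 80% of its capacity,
    i.e. 80 MB with the defaults above."""
    return used_bytes >= SORT_MB * 1024 * 1024 * SPILL_PERCENT

print(should_spill(79 * 1024 * 1024))  # -> False
print(should_spill(80 * 1024 * 1024))  # -> True
```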
The SPILLING thread performs the following actions:

1. it creates a SpillRecord and an FSOutputStream (local filesystem)
2. it in-memory sorts the used chunk of the buffer: the output tuples are sorted by (partitionIdx, key) using a quicksort algorithm
3. the sorted output is split into partitions: one partition for each ReduceTask of the job
4. partitions are sequentially written into the local file

The number of ReduceTasks for the job is decided by the configuration parameter mapreduce.job.reduces.
The partitionIdx of an output tuple is the index of a partition. It is computed inside Mapper.Context.write():
partitionIdx = (key.hashCode() & Integer.MAX_VALUE) % numReducers
It is stored as metadata in the circular buffer alongside the output tuple. The user can customize the partitioner by setting the configuration parameter mapreduce.job.partitioner.class.
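A sketch of the same computation: 0x7FFFFFFF is Java's Integer.MAX_VALUE, and the bitwise AND clears the sign bit, so even negative hash codes map to a valid partition index.

```python
def default_partition(key_hash, num_reducers):
    """Python model of the default Hadoop partitioning expression
    (key.hashCode() & Integer.MAX_VALUE) % numReducers."""
    return (key_hash & 0x7FFFFFFF) % num_reducers

# Works even for negative hash codes (Java hashCode() can be negative):
print(default_partition(123456, 4))  # -> 0
print(default_partition(-1, 4))      # -> 3
```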
If the user specifies a combiner then the SPILLING thread, before writing the tuples to the file (4), executes the combiner on the tuples contained in each partition. Basically, we:

- create an instance of the user Reducer.class (the one specified for the combiner!)
- create a Reducer.Context: the output will be stored on the local filesystem
- execute Reduce.run(): see the Reduce Task description

The combiner typically uses the same implementation as the standard reduce() function and thus can be seen as a local reducer.
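The "local reducer" idea can be sketched as follows (the function names are hypothetical; the real combiner runs inside the Java SPILLING thread):

```python
from collections import defaultdict

def run_combiner(partition, reduce_fn):
    """Group a partition's (key, value) tuples by key and apply the
    user reduce function locally, shrinking what is spilled to disk."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    return sorted((k, reduce_fn(k, vs)) for k, vs in groups.items())

# Word-count style: the combiner reuses the job's summing reduce logic,
# so three ("a", 1) tuples collapse to a single ("a", 3) before the spill.
partition = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(run_combiner(partition, lambda k, vs: sum(vs)))  # -> [('a', 3), ('b', 1)]
```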
At the end of the EXECUTION phase, the SPILLING thread is triggered for the last time. In more detail, we:
Notice that for each time the buffer was almost full, we get one spill file (SpillRecord + output file). Each spill file contains several partitions (segments).
[…]
Ref: http://ercoppa.github.io/HadoopInternals/