MapReduce

MapReduce: What and Why

  • MapReduce is a programming model for data processing 

  • The power of MapReduce lies in its ability to scale to hundreds or thousands of computers, each with several processor cores

  • MapReduce is designed to efficiently process large volumes of data by connecting many commodity computers together to work in parallel 

  • A theoretical single machine with 1,000 CPUs would cost far more than 1,000 single-CPU machines or 250 quad-core machines

  • MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster


Isolated Tasks

  • MapReduce divides the workload into multiple independent tasks and schedules them across the cluster nodes

  • The work performed by each task is done in isolation from the work of the other tasks

  • The amount of communication that can be performed by tasks is deliberately limited, mainly for scalability and fault-tolerance reasons

  • The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale

Data Distribution

  • In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in

  • An underlying distributed file system (e.g., GFS) splits large data files into chunks that are managed by different nodes in the cluster

  • Even though the file chunks are distributed across several machines, they form a single namespace

[Figure 1]

MapReduce: A Bird's-Eye View

  • In MapReduce, chunks are processed in isolation by tasks called Mappers

  • The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers 

  • The process of bringing together IOs into a set of Reducers is known as the shuffling process

  • The Reducers produce the final outputs (FOs)

  • Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase

[Figure 2]

Keys and Values

  • A MapReduce programmer has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program

  • In MapReduce data elements are always structured as key-value (i.e., (K, V)) pairs

  • The map and reduce functions receive and emit (K, V) pairs, as sketched below
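
As a concrete illustration (not part of the original notes), here is a minimal sketch of a Mapper/Reducer pair in the Hadoop Java API; the class names LineMapper and LineReducer are hypothetical, and the input pairs assume the default TextInputFormat, which supplies (byte offset, line of text) pairs:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical pass-through example: the map function receives one (K, V)
    // pair per call and may emit any number of intermediate (K, V) pairs.
    class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, offset);   // emit an intermediate (K, V) pair
        }
    }

    // The reduce function receives one key together with all of its values.
    class LineReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text line, Iterable<LongWritable> offsets, Context context)
                throws IOException, InterruptedException {
            for (LongWritable off : offsets) {
                context.write(line, off);  // emit a final (K, V) pair
            }
        }
    }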

[Figure 3]

Partitions

  • In MapReduce, intermediate output values are not usually reduced together

  • All values with the same key are presented to a single Reducer together

  • More specifically, a different subset of the intermediate key space is assigned to each Reducer. These subsets are known as partitions (see the sketch below)
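
For reference, Hadoop's stock partitioning strategy hashes the key modulo the number of Reducers; the sketch below mirrors the logic of the built-in HashPartitioner (the class name HashingPartitioner is ours):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors Hadoop's default HashPartitioner: all values with the same key
    // hash to the same partition index, and hence reach the same Reducer.
    public class HashingPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // mask the sign bit so the index is always in [0, numReduceTasks)
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }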

[Figure 4]

Hadoop MapReduce

[Figure 5]

SORT EXAMPLE

MAP

[Figure 6]

REDUCE

[Figure 7]
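
The two figures above illustrate the sort example. The underlying trick is standard: the framework already sorts intermediate keys during shuffling, so identity map and reduce functions suffice. A minimal sketch, assuming the records arrive as (IntWritable key, Text value) pairs, e.g., from a SequenceFile:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Identity Mapper: emits each record unchanged; the framework sorts all
    // intermediate keys on their way to the Reducers.
    class SortMapper extends Mapper<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void map(IntWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Identity Reducer: writes the already-sorted keys back out.
    class SortReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

Note that with several Reducers each output partition is sorted only internally; a range-based partitioner such as Hadoop's TotalOrderPartitioner is needed for a single globally sorted order.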

WORD COUNT

MAP

[Figure 8]
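
The figure above shows the map side of word count. For reference, the map function in the stock Hadoop WordCount example looks essentially like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // For every word in the input line, emit the intermediate pair (word, 1).
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }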

REDUCE

[Figure 9]
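
And the reduce side: since all counts for a given word reach the same Reducer together, reduce only has to sum them (again modeled on the stock Hadoop example):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // All counts for one word arrive together; summing them yields the total.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(word, result);   // emit the final pair (word, total)
        }
    }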

Hadoop MapReduce: A Closer Look

[Figure 10]
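
To connect the figure's components to code, here is a sketch of a driver that wires the word-count Mapper and Reducer sketched above into a job; the paths come from the command line and the class name WordCountDriver is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configures the job and submits it to the cluster.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);   // mapper sketched above
            job.setReducerClass(IntSumReducer.class);    // reducer sketched above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }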


Input Files

  • Input files are where the data for a MapReduce task is initially stored

  • The input files typically reside in a distributed file system (e.g., HDFS)

  • The format of input files is arbitrary:

                    Line-based log files

                    Binary files

                    Multi-line input records

                    Or something else entirely

[Figure 11]

 

InputFormat

  • How the input files are split up and read is defined by the InputFormat

  • InputFormat is a class that does the following:

            Selects the files that should be used for input

            Defines the InputSplits that break a file into tasks

            Provides a factory for RecordReader objects that read the file

[Figure 12]
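
In the Hadoop API, the InputFormat is chosen in the job driver; a minimal sketch, assuming the default TextInputFormat (which selects the files under the given path, defines their splits, and supplies a line-oriented RecordReader):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // TextInputFormat reads line-based files and hands each Mapper
            // (byte offset, line of text) pairs.
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
        }
    }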

Input Splits

  • An input split describes a unit of work that comprises a single map task in a MapReduce program

  • By default, the InputFormat breaks a file up into 64 MB splits (the sketch after the figure below shows how to tune this)

  • By dividing the file into splits, we allow several map tasks to operate on a single file in parallel

  • If the file is very large, this can improve performance significantly through parallelism

  • Each map task corresponds to a single input split

[Figure 13]
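
When the default split size does not fit the workload, it can be tuned; the FileInputFormat helpers below exist in the Hadoop mapreduce API, and the 64 MB/128 MB bounds are purely illustrative:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // One map task runs per split: smaller splits mean more, shorter
            // map tasks; larger splits mean fewer, longer ones.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
        }
    }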


RecordReader

  • The input split defines a slice of work but does not describe how to access it

  • The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers

  • The RecordReader is invoked repeatedly on the input until the entire split is consumed

  • Each invocation of the RecordReader leads to another call of the map function defined by the programmer, as the sketch below shows
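
This loop is visible in Hadoop's own Mapper.run() method, reproduced here in slightly simplified form; the Context delegates nextKeyValue() to the split's RecordReader:

    // Slightly simplified from org.apache.hadoop.mapreduce.Mapper: every
    // record the RecordReader produces triggers exactly one map() call.
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }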


Mapper and Reducer

  • The Mapper performs the user-defined work of the first phase of the MapReduce program

  • A new instance of Mapper is created for each split

  • The Reducer performs the user-defined work of the second phase of the MapReduce program

  • A new instance of Reducer is created for each partition

  • For each key in the partition assigned to a Reducer, the Reducer is called once, as the run() sketch below illustrates
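
The once-per-key contract shows up directly in Hadoop's Reducer.run(), again slightly simplified:

    // Simplified from org.apache.hadoop.mapreduce.Reducer: nextKey() advances
    // to the next distinct key in the partition, and reduce() is called once
    // per key with an Iterable over all of that key's values.
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }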


