MapReduce Workflow Explanation and Source Code Analysis

MapReduce Workflow Explanation

I. Introduction

From the user's point of view, you only need to write the map operation and the reduce operation.

MapReduce takes a relatively long time to compute over data; it is a batch-processing framework, not a low-latency one.

The whole process is divided into map and reduce: map processes the raw input data, and reduce processes the output of map.

II. How It Works

1. The map phase

block: a physical concept; the default block size is 128 MB.

split: the amount of data a single map task processes; by default the split size equals the block size (it can be tuned, as shown in the sketch after this list).

maptask: the map task; one split corresponds to one map task.

A map task processes split data, i.e. the original, unprocessed input.

kvbuffer: the in-memory destination for a map task's intermediate output. It is a circular (ring) buffer with a default size of 100 MB; a spill threshold is configured, 80% by default, and once the threshold is reached the buffer is spilled.

spill: writing the intermediate results in the circular buffer out to disk; for a large enough input this produces many small spill files of roughly 80 MB each.

partition: the number of partitions is exactly the same as the number of reduce tasks.

During a spill, the partition (i.e. the target reduce task) of each key is computed ahead of time.

sort: records are sorted first by partition, then quick-sorted by key within each partition.

merge: the many small spill files are merged into a single large file using merge sort, ordered first by partition and then by key.
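If the default "one split per block" behavior is not what you want, the minimum and maximum split sizes can be adjusted from the driver before the job is submitted. A minimal sketch, assuming a Job object named job like the one created in WordCountJob below (the 64 MB / 256 MB values are arbitrary examples, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// splitSize = max(minSize, min(maxSize, blockSize)), so raising minSize enlarges splits
// and lowering maxSize shrinks them
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // splits never smaller than 64 MB
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // splits never larger than 256 MB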

2. The reduce phase

fetch: pull the intermediate data a reducer needs from the map task nodes.

reducetask: all temporary records with the same key must be pulled to the same reduce task for computation; one reduce task may receive several different keys, but every record for a given key must land in the same reduce task.

output: because each run produces different results, and to avoid problems with overly large output files, the results of each run are written to HDFS by default (a quick way to inspect them is shown right after this list).
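Once a job finishes, the reducer output can be inspected directly on HDFS. A minimal example, assuming the output directory pattern used in WordCountJob below; the exact directory name (it contains a timestamp) and the number of part-r-* files depend on your run and on the number of reduce tasks:

hdfs dfs -ls /shsxt/java/
hdfs dfs -cat /shsxt/java/武动乾坤_result<timestamp>/part-r-00000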

3. Setting up a MapReduce cluster

The setup below assumes an existing Hadoop HA (high availability) environment.

Edit the mapred-site.xml file

cp mapred-site.xml.template mapred-site.xml

vim mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Edit the yarn-site.xml file

vim yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>mr_shsxt</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>node03</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>node01</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>node01:2181,node02:2181,node03:2181</value>
</property>

Copy the configuration files (mapred-site.xml and yarn-site.xml) to the other nodes

scp mapred-site.xml yarn-site.xml root@node02:`pwd`

scp mapred-site.xml yarn-site.xml root@node03:`pwd`

Start ZooKeeper first

zkServer.sh start

zkServer.sh status

Start DFS and YARN

start-all.sh   (equivalent to running start-dfs.sh followed by start-yarn.sh)

On the standby ResourceManager node, run:

yarn-daemon.sh start resourcemanager
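To verify that ResourceManager HA came up correctly, query the state of each ResourceManager; rm1 and rm2 are the ids configured in yarn-site.xml above, so one should report active and the other standby:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2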

MapReduce Source Code Analysis

1. Submitting the job from the client

1.1 Creating the job class

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
* The main (driver) class of the job
* @author Administrator
*
*/
public class WordCountJob {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// obtain the configuration object
Configuration configuration = new Configuration(true);
// create the job object for this run
Job job = Job.getInstance(configuration);
// set the main class of the job jar
job.setJarByClass(WordCountJob.class);
// set the job name
job.setJobName("shsxt-WorkCount");
// set the number of reduce tasks
job.setNumReduceTasks(3);
// set the input path (the file to process)
FileInputFormat.setInputPaths(job, new Path("/shsxt/java/武动乾坤.txt"));
// optionally set the split size
// CombineTextInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 100);
// CombineTextInputFormat.setMinInputSplitSize(job, 1024 * 1024 * 1);
// set the output path (where the results will be written)
FileOutputFormat.setOutputPath(job,
  new Path("/shsxt/java/武动乾坤_result" + System.currentTimeMillis()));
// define the map output key/value types
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// set the Mapper class
job.setMapperClass(WordCountMapper.class);
// set the Reducer class
job.setReducerClass(WordCountReducer.class);
// submit the job and wait for completion
job.waitForCompletion(true);
}
// MapTask

}
  • The input file must already be uploaded to HDFS and the path must be correct; the job name set here must match the name of the exported jar package.
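To run the job on the cluster, the usual workflow is to export the three classes into a jar and submit it with the hadoop command. A minimal example; the jar file name is hypothetical and must match whatever you exported, and the main class may need its package prefix:

hadoop jar shsxt-WorkCount.jar WordCountJob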

1.2 Creating the mapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
private Text text = new Text();
private IntWritable one = new IntWritable(1);
@Override
protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// split the line into words on non-word characters
String[] ss = value.toString().split("\\W+");
for (String string : ss) {
// skip the empty token that split() produces when the line starts with a delimiter
if (string.isEmpty()) {
continue;
}
text.set(string);
context.write(text, one);
}

}

}

1.3 Creating the reducer class

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Text text = new Text("ly");   // leftover field, not used below
private int count = 1000;             // not used; shadowed by the local count in reduce()
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
// iterate over all values for this key
Iterator<IntWritable> iterator = values.iterator();
// create a counter
int count = 0;
// sum the values
while (iterator.hasNext()) {
count += iterator.next().get();
}
context.write(key, new IntWritable(count));
}
}
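Because this reducer only sums its input values, it can also be reused as a combiner so that each map task pre-aggregates its output before the shuffle, which cuts down the data transferred to the reducers. A minimal, optional addition to WordCountJob (this line is not in the original job class):

// run WordCountReducer on each map task's local output before the shuffle
job.setCombinerClass(WordCountReducer.class);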

2. Map source code analysis

1. split (input splitting)

Walking through job.waitForCompletion(true) in the Job class

/**
*Submit the job to the cluster and wait for it to finish
*/
job.waitForCompletion(true);
// JobState is an enum with only two states: public static enum JobState {DEFINE, RUNNING};
// if the state is DEFINE, the job is submitted directly
   if (state == JobState.DEFINE) {
     //(------------------------------submit())
       submit();
  }
//Monitor a job and print status in real-time as progress is made and tasks fail.
// monitor the current job's status in real time
if (verbose) {
     monitorAndPrintJob();

Walking through submit() in the Job class

//(------------------------------submit())
// check that the job is still in the DEFINE state
ensureState(JobState.DEFINE);
// switch to the new API
setUseNewAPI();
// create the Cluster object from the configuration
connect();
//(---------------------- connect)
if (cluster == null) {
     cluster =
       ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                  public Cluster run()
                         throws IOException, InterruptedException,
                                ClassNotFoundException {
                    return new Cluster(getConfiguration());
                  }
                });
  }
// obtain the job submitter from the cluster's file system and client
final JobSubmitter submitter =
       getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
     // submit the job as the current user via ugi.doAs
    public JobStatus run() throws IOException, InterruptedException,
     ClassNotFoundException {
         // actually submit the job
         // go to (----------------- submitter)
       return submitter.submitJobInternal(Job.this, cluster);
    }
  });
// the job is now in the RUNNING state
state = JobState.RUNNING;

Walking through submitter.submitJobInternal

//Internal method for submitting jobs to the system
//validate the job's output specs
   checkSpecs(job);
// generate a new job id and set it on the job
JobID jobId = submitClient.getNewJobID();
   job.setJobID(jobId);
// set the job's submission directory
Path submitJobDir = new Path(jobStagingArea, jobId.toString());
//Create the splits for the job
//(---------------------------writeSplits)
int maps = writeSplits(job, submitJobDir);
 List<InputSplit> splits = input.getSplits(job);
// convert the list of splits into an array
   T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
   // sort the splits into order based on size, so that the biggest go first
   Arrays.sort(array, new SplitComparator());
// write the split files
   JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
       jobSubmitDir.getFileSystem(conf), array);
// return the number of splits
   return array.length;
// finally submit the job (the number of splits, and therefore of map tasks, is now fixed)
status = submitClient.submitJob(
         jobId, submitJobDir.toString(), job.getCredentials());

Walking through writeSplits

// get the configuration
Configuration conf = job.getConfiguration();
//Create an object for the given class and initialize it from conf
// create the input format class via ReflectionUtils and initialize it with conf
//JobContext job = org.apache.hadoop.mapreduce.task.JobContextImpl
//input = org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class
InputFormat<?> input =
     ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
// compute the splits through the input format object
// getSplits = org.apache.hadoop.mapreduce.lib.input.FileInputFormat
//Generate the list of files and make them into FileSplits
List<InputSplit> splits = input.getSplits(job);
// determine the minimum and maximum split sizes
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
// create the list of splits
List<InputSplit> splits = new ArrayList<InputSplit>();
// get the files to process
List<FileStatus> files = listStatus(job);
// iterate over the files
for (FileStatus file: files) {
    // get the file path
     Path path = file.getPath();
    // get the file length
     long length = file.getLen();
    // check whether the file is empty
     if (length != 0) {
        // the blocks that make up the current file
       BlockLocation[] blkLocations;
       if (file instanceof LocatedFileStatus) {
         blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
         FileSystem fs = path.getFileSystem(job.getConfiguration());
         blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
       // check whether the file can be split; compressed files (without a splittable codec) cannot
       if (isSplitable(job, path)) {
         //Get the block size of the file
         long blockSize = file.getBlockSize();
        // compute the final split size
        //(-----------------------------computeSplitSize)
         long splitSize = computeSplitSize(blockSize, minSize, maxSize);
         // bytes remaining to be assigned to splits
         long bytesRemaining = length;
           // keep cutting while the remainder is more than 1.1 times splitSize (SPLIT_SLOP);
           // with a 128 MB split size the last split is at most 140.8 MB, and a standalone tail split is always just over 12.8 MB
         while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
             // the block index for this split (0 for the first split)
           int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
             // add the split to the list, together with the hosts that hold the block
           splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
             // subtract this split from the remaining length
           bytesRemaining -= splitSize;
        }
// the last split has size in (0, 1.1] * splitSize; add it to the list
         if (bytesRemaining != 0) {
           int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
           splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
        }
      } else {
           // not splitable
         splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                     blkLocations[0].getCachedHosts()));
      }
    } else {
       //Create empty hosts array for zero length files
       // for a zero-length file, add a single split with an empty hosts array
       splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
// Save the number of input files for metrics/loadgen
   job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
   sw.stop();
   if (LOG.isDebugEnabled()) {
     LOG.debug("Total # of splits generated by getSplits: " + splits.size()
         + ", TimeTaken: " + sw.elapsedMillis());
  }
// return the list of splits
   return splits;
}
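A quick worked example of the SPLIT_SLOP (1.1) rule in the loop above, written as a standalone sketch rather than Hadoop code; the 300 MB and 130 MB file sizes are made up for illustration:

// assuming splitSize = 128 MB, as with the default block size
long splitSize = 128L * 1024 * 1024;
double SPLIT_SLOP = 1.1;
long bytesRemaining = 300L * 1024 * 1024;   // a 300 MB file
int numSplits = 0;
while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
    bytesRemaining -= splitSize;            // cut a full 128 MB split
    numSplits++;
}
if (bytesRemaining != 0) {
    numSplits++;                            // the 44 MB tail becomes the last split
}
// numSplits == 3 here; a 130 MB file would instead become a single 130 MB split,
// because 130 / 128 = 1.02 <= 1.1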

Walking through computeSplitSize

// how the split size is computed:
// to make splitSize larger than blockSize, set minSize larger than blockSize;
// to make splitSize smaller than blockSize, set maxSize smaller than blockSize
return Math.max(minSize, Math.min(maxSize, blockSize));
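Plugging in the defaults (minSize = 1, maxSize = Long.MAX_VALUE, blockSize = 128 MB) gives a split size equal to the block size; a minimal sketch of the two tuning directions, with arbitrary example values:

long blockSize = 128L * 1024 * 1024;
// default: splitSize == blockSize == 128 MB
long defaultSplit = Math.max(1L, Math.min(Long.MAX_VALUE, blockSize));
// raise minSize above blockSize to get 256 MB splits
long biggerSplit  = Math.max(256L * 1024 * 1024, Math.min(Long.MAX_VALUE, blockSize));
// lower maxSize below blockSize to get 64 MB splits
long smallerSplit = Math.max(1L, Math.min(64L * 1024 * 1024, blockSize));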

2. MapTask

// package: org.apache.hadoop.mapred (MapTask.run)
 public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
   throws IOException, ClassNotFoundException, InterruptedException {
   this.umbilical = umbilical;
   if (isMapTask()) {
     // If there are no reducers then there won't be any sort. Hence the map
     // phase will govern the entire attempt's progress.
      // if there are no reducers, only the map phase runs (no sort phase)
     if (conf.getNumReduceTasks() == 0) {
       mapPhase = getProgress().addPhase("map", 1.0f);
    } else {
       // If there are reducers then the entire attempt's progress will be
       // split between the map phase (67%) and the sort phase (33%).
          // register the sort (spill) phase
       mapPhase = getProgress().addPhase("map", 0.667f);
       sortPhase  = getProgress().addPhase("sort", 0.333f);
    }
  }
   TaskReporter reporter = startReporter(umbilical);
// check whether the new API is used
   boolean useNewApi = job.getUseNewMapper();
     // initialize task-level data
   initialize(job, getJobID(), reporter, useNewApi);

   // check if it is a cleanupJobTask
   if (jobCleanup) {
     runJobCleanupTask(umbilical, reporter);
     return;
  }
   if (jobSetup) {
     runJobSetupTask(umbilical, reporter);
     return;
  }
   if (taskCleanup) {
     runTaskCleanupTask(umbilical, reporter);
     return;
  }
//(------------------------------- runNewMapper)
   if (useNewApi) {
     runNewMapper(job, splitMetaInfo, umbilical, reporter);
  } else {
     runOldMapper(job, splitMetaInfo, umbilical, reporter);
  }
   done(umbilical, reporter);
}

Walking through runNewMapper

// create the task attempt context object
taskContext =
     new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
                                                                 getTaskID(),
                                                                 reporter);
// make a mapper
// create the mapper via reflection; in this job it is our WordCountMapper class
   org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
       ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
// make the input format
// by default this is org.apache.hadoop.mapreduce.lib.input.TextInputFormat
   org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
       ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
// the record reader; NewTrackingRecordReader wraps the real reader (decorator-style)
input = new NewTrackingRecordReader<INKEY,INVALUE>(split, inputFormat, reporter, taskContext);
// the real internal component that reads the data
//(-------------------------createRecordReader)
this.real = inputFormat.createRecordReader(split, taskContext);
// the output record writer
org.apache.hadoop.mapreduce.RecordWriter output = null;
   // get an output object
   if (job.getNumReduceTasks() == 0) {
     output =
       new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
  } else {
     output = new NewOutputCollector(taskContext, job, umbilical, reporter);
  }
// create a sorting collector
collector = createSortingCollector(job, reporter);
// the collector class is org.apache.hadoop.mapred.MapOutputBuffer.class
MapOutputCollector<KEY, VALUE> collector =
         ReflectionUtils.newInstance(subclazz, job);
// initialize the collector
//(------------------------------ collector.init(context))
collector.init(context);
// create the partitioner, by default org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class
//(-------------------HashPartitioner.class)
partitions = jobContext.getNumReduceTasks();
     if (partitions > 1) {
       partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
         ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
    } else {
       partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
         @Override
         public int getPartition(K key, V value, int numPartitions) {
           return partitions - 1;
        }
// create the map context object
mapContext =
     new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job,
                                                          getTaskID(),
                                                          input, output,
                                                          committer,
                                                          reporter, split);
// wrap the map context for the mapper
mapperContext =
         new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
             mapContext);
try {
   // initialize the line record reader
   //(----------------------------------input.initialize)
     input.initialize(split, mapperContext);
   //(------------------mapper.run)
     mapper.run(mapperContext);
     mapPhase.complete();
     setPhase(TaskStatus.Phase.SORT);
   // update the task status
     statusUpdate(umbilical);
   // close the record reader
     input.close();
     input = null;
     output.close(mapperContext);
   // the output has been flushed; release the writer
     output = null;
  } finally {
     closeQuietly(input);
     closeQuietly(output, mapperContext);
  }

Walking through HashPartitioner.class

public class HashPartitioner<K, V> extends Partitioner<K, V> {

 /** Use {@link Object#hashCode()} to partition. */
 public int getPartition(K key, V value,
                         int numReduceTasks) {
     // mask the sign bit so the result is non-negative, then take the remainder modulo numReduceTasks
   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

}
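If the default hash-based routing is not appropriate, a custom partitioner can be plugged in with job.setPartitionerClass(...). A minimal sketch with an arbitrary routing rule (the class name and the rule are made up for illustration); like HashPartitioner, it must return a value in [0, numReduceTasks):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (key.getLength() == 0) {
      return 0;
    }
    // send words starting with a-m to one bucket, everything else to another
    char c = Character.toLowerCase(key.toString().charAt(0));
    int bucket = (c >= 'a' && c <= 'm') ? 0 : 1;
    return bucket % numReduceTasks;
  }
}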

Walking through collector.init(context)

  public void init(MapOutputCollector.Context context
                  ) throws IOException, ClassNotFoundException {
     job = context.getJobConf();
     reporter = context.getReporter();
     mapTask = context.getMapTask();
     mapOutputFile = mapTask.getMapOutputFile();
     sortPhase = mapTask.getSortPhase();
     spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);
     partitions = job.getNumReduceTasks();
     rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

     //sanity checks
     final float spillper =
       job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
     final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);
     indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                        INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
      // validate the spill threshold (must be in (0, 1])
     if (spillper > (float)1.0 || spillper <= (float)0.0) {
       throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +
           "\": " + spillper);
    }
     if ((sortmb & 0x7FF) != sortmb) {
       throw new IOException(
           "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);
    }
     sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
           QuickSort.class, IndexedSorter.class), job);
     // buffers and accounting
     int maxMemUsage = sortmb << 20;
     maxMemUsage -= maxMemUsage % METASIZE;
      // the circular (ring) buffer, 100 MB by default
     kvbuffer = new byte[maxMemUsage];
     bufvoid = kvbuffer.length;
     kvmeta = ByteBuffer.wrap(kvbuffer)
        .order(ByteOrder.nativeOrder())
        .asIntBuffer();
     setEquator(0);
     bufstart = bufend = bufindex = equator;
     kvstart = kvend = kvindex;

     maxRec = kvmeta.capacity() / NMETA;
     softLimit = (int)(kvbuffer.length * spillper);
     bufferRemaining = softLimit;
     LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);
     LOG.info("soft limit at " + softLimit);
     LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);
     LOG.info("kvstart = " + kvstart + "; length = " + maxRec);

     // k/v serialization
     comparator = job.getOutputKeyComparator();
      // get the map output key and value classes
     keyClass = (Class<K>)job.getMapOutputKeyClass();
     valClass = (Class<V>)job.getMapOutputValueClass();
     serializationFactory = new SerializationFactory(job);
     keySerializer = serializationFactory.getSerializer(keyClass);
     keySerializer.open(bb);
     valSerializer = serializationFactory.getSerializer(valClass);
     valSerializer.open(bb);

     // output counters
     mapOutputByteCounter = reporter.getCounter(TaskCounter.MAP_OUTPUT_BYTES);
     mapOutputRecordCounter =
       reporter.getCounter(TaskCounter.MAP_OUTPUT_RECORDS);
     fileOutputByteCounter = reporter
        .getCounter(TaskCounter.MAP_OUTPUT_MATERIALIZED_BYTES);

     // compression
     if (job.getCompressMapOutput()) {
       Classextends CompressionCodec> codecClass =
         job.getMapOutputCompressorClass(DefaultCodec.class);
       codec = ReflectionUtils.newInstance(codecClass, job);
    } else {
       codec = null;
    }

     // combiner
     final Counters.Counter combineInputCounter =
       reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
     combinerRunner = CombinerRunner.create(job, getTaskID(),
                                            combineInputCounter,
                                            reporter, null);
     if (combinerRunner != null) {
       final Counters.Counter combineOutputCounter =
         reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
       combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
    } else {
       combineCollector = null;
    }
     spillInProgress = false;
     minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3);
      // the spill thread
     spillThread.setDaemon(true);
     spillThread.setName("SpillThread");
     spillLock.lock();
     try {
       spillThread.start();
       while (!spillThreadRunning) {
         spillDone.await();
      }
    } catch (InterruptedException e) {
       throw new IOException("Spill thread failed to initialize", e);
    } finally {
       spillLock.unlock();
    }
     if (sortSpillException != null) {
       throw new IOException("Spill thread failed to initialize",sortSpillException);
    }
  }
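The 100 MB buffer (mapreduce.task.io.sort.mb) and the 0.8 spill threshold (mapreduce.map.sort.spill.percent) read in this init() method can be overridden per job. A minimal sketch in the driver, assuming the configuration object from WordCountJob; the values are examples, not recommendations:

// enlarge the in-memory sort buffer to 200 MB and only spill at 90% full
configuration.setInt("mapreduce.task.io.sort.mb", 200);
configuration.setFloat("mapreduce.map.sort.spill.percent", 0.90f);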

Walking through mapper.run

 public void run(Context context) throws IOException, InterruptedException {
   // task setup
    setup(context);
   try {
       // loop over the input records (mapContext --> MapContextImpl --> LineRecordReader.nextKeyValue())
    //(-------------------------context.nextKeyValue())
       while (context.nextKeyValue()) {
       // context.getCurrentKey() --> the offset of the current line
      // context.getCurrentValue() --> the contents of the current line
       map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
       // task cleanup
     cleanup(context);
  }
}

Walking through context.nextKeyValue()

 public boolean nextKeyValue() throws IOException {
   if (key == null) {
       // create the key object (a LongWritable)
     key = new LongWritable();
  }
   // the key is the byte offset at which the current line starts
   key.set(pos);
   if (value == null) {
     // create the value object (a Text)
     value = new Text();
  }
   int newSize = 0;
   // We always read one extra line, which lies outside the upper
   // split limit i.e. (end - 1)
   // except for the last split, we always read one extra line
   // getFilePosition() checks whether the current read position is still inside the split
   // in.needAdditionalRecordAfterSplit(): even past the end, the first line of the next split may still need to be read
   while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
     if (pos == 0) {
       // at the very beginning of the file, skip a possible UTF-8 byte order mark
       newSize = skipUtfByteOrderMark();
    } else {
       // read one line and store it in value
       newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
       // advance the position to the start of the next line
       pos += newSize;
    }

     if ((newSize == 0) || (newSize < maxLineLength)) {
       break;
    }

     // line too long. try again
     LOG.info("Skipped line of size " + newSize + " at pos " +
              (pos - newSize));
  }
   if (newSize == 0) {
     key = null;
     value = null;
     return false;
  } else {
     return true;
  }
}

Walking through createRecordReader

  @Override
 public RecordReader<LongWritable, Text>
   createRecordReader(InputSplit split,
                      TaskAttemptContext context) {
   String delimiter = context.getConfiguration().get(
       "textinputformat.record.delimiter");
   byte[] recordDelimiterBytes = null;
   if (null != delimiter)
     recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
      // return a line record reader (LineRecordReader)
   return new LineRecordReader(recordDelimiterBytes);
}
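As createRecordReader shows, the record delimiter is taken from the textinputformat.record.delimiter property, so records do not have to be newline-terminated. A minimal sketch, assuming the configuration object from WordCountJob and a '|'-delimited input file (an arbitrary example):

// treat '|' instead of the newline as the end-of-record marker
configuration.set("textinputformat.record.delimiter", "|");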

Walking through input.initialize

 public void initialize(InputSplit genericSplit,
                           TaskAttemptContext context) throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Configuration job = context.getConfiguration();
      // the maximum number of bytes to read per line; the default is Integer.MAX_VALUE
      this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
      //The position of the first byte in the file to process
      // i.e. the offset of the first byte of this split within the file
      start = split.getStart();
      // the position just past the last byte of this split
      end = start + split.getLength();
      // the path of the file this split belongs to
      final Path file = split.getPath();
       // open the file and seek to the start of the split
       final FileSystem fs = file.getFileSystem(job);
       fileIn = fs.open(file);
       
       CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
       if (null!=codec) {
         isCompressedInput = true;
         decompressor = CodecPool.getDecompressor(codec);
         if (codec instanceof SplittableCompressionCodec) {
           final SplitCompressionInputStream cIn =
            ((SplittableCompressionCodec)codec).createInputStream(
               fileIn, decompressor, start, end,
               SplittableCompressionCodec.READ_MODE.BYBLOCK);
           in = new CompressedSplitLineReader(cIn, job,
               this.recordDelimiterBytes);
           start = cIn.getAdjustedStart();
           end = cIn.getAdjustedEnd();
           filePosition = cIn;
        } else {
           in = new SplitLineReader(codec.createInputStream(fileIn,
               decompressor), job, this.recordDelimiterBytes);
           filePosition = fileIn;
        }
      } else {
         fileIn.seek(start);
         in = new UncompressedSplitLineReader(
             fileIn, job, this.recordDelimiterBytes, split.getLength());
         filePosition = fileIn;
      }
       // If this is not the first split, we always throw away first record
       // because we always (except the last split) read one extra line in
       // next() method.
       // if this is not the first split, skip the first (partial) line when reading;
       // every split except the last one reads one extra line into the next split
       if (start != 0) {
         start += in.readLine(new Text(), 0, maxBytesToConsume(start));
      }
       this.pos = start;
    }
     

3. Reduce source code analysis

1. ReduceTask

// package: org.apache.hadoop.mapred (ReduceTask.run)
public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
   throws IOException, InterruptedException, ClassNotFoundException {
   job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

   if (isMapOrReduce()) {
     copyPhase = getProgress().addPhase("copy");
     sortPhase  = getProgress().addPhase("sort");
     reducePhase = getProgress().addPhase("reduce");
  }
   // start thread that will handle communication with parent
   TaskReporter reporter = startReporter(umbilical);
   
   boolean useNewApi = job.getUseNewReducer();
   //(------------------initialize(job, getJobID(), reporter, useNewApi))
   initialize(job, getJobID(), reporter, useNewApi);

   // check if it is a cleanupJobTask
   if (jobCleanup) {
     runJobCleanupTask(umbilical, reporter);
     return;
  }
   if (jobSetup) {
     runJobSetupTask(umbilical, reporter);
     return;
  }
   if (taskCleanup) {
     runTaskCleanupTask(umbilical, reporter);
     return;
  }
   
   // Initialize the codec
   codec = initCodec();
   //(---------------------------------RawKeyValueIterator)
   // the raw key/value iterator
   RawKeyValueIterator rIter = null;
   // the shuffle plugin; Shuffle.class is used by default
   // package: org.apache.hadoop.mapreduce.task.reduce.Shuffle.class
   ShuffleConsumerPlugin shuffleConsumerPlugin = null;
   
   Class combinerClass = conf.getCombinerClass();
   CombineOutputCollector combineCollector =
    (null != combinerClass) ?
    new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;

   Class<? extends ShuffleConsumerPlugin> clazz =
         job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);

   shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
   LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);

   ShuffleConsumerPlugin.Context shuffleContext =
     new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,
                 super.lDirAlloc, reporter, codec,
                 combinerClass, combineCollector,
                 spilledRecordsCounter, reduceCombineInputCounter,
                 shuffledMapsCounter,
                 reduceShuffleBytes, failedShuffleCounter,
                 mergedMapOutputsCounter,
                 taskStatus, copyPhase, sortPhase, this,
                 mapOutputFile, localMapFiles);
   shuffleConsumerPlugin.init(shuffleContext);
//(---------------------------shuffleConsumerPlugin.run())
   rIter = shuffleConsumerPlugin.run();

   // free up the data structures
   mapOutputFilesOnDisk.clear();
   
   sortPhase.complete();                         // sort is complete
   setPhase(TaskStatus.Phase.REDUCE);
   statusUpdate(umbilical);
   // the map output key type
   Class keyClass = job.getMapOutputKeyClass();
   // the map output value type
   Class valueClass = job.getMapOutputValueClass();
   // the grouping comparator
   //(----------------------job.getOutputValueGroupingComparator())
   RawComparator comparator = job.getOutputValueGroupingComparator();

   if (useNewApi) {
     //(-------------------------------- runNewReducer)
     runNewReducer(job, umbilical, reporter, rIter, comparator,
                   keyClass, valueClass);
  } else {
     runOldReducer(job, umbilical, reporter, rIter, comparator,
                   keyClass, valueClass);
  }

   shuffleConsumerPlugin.close();
   done(umbilical, reporter);
}

Walking through initialize(job, getJobID(), reporter, useNewApi)

jobContext = new JobContextImpl(job, id, reporter);
taskContext = new TaskAttemptContextImpl(job, taskId, reporter);
outputFormat =ReflectionUtils.newInstance(taskContext.getOutputFormatClass(), job);
committer = outputFormat.getOutputCommitter(taskContext);

Walking through RawKeyValueIterator and shuffleConsumerPlugin.run()

 public RawKeyValueIterator run() throws IOException, InterruptedException {
   // Scale the maximum events we fetch per RPC call to mitigate OOM issues
   // on the ApplicationMaster when a thundering herd of reducers fetch events
  // cap the number of events fetched per RPC to avoid OOM (out-of-memory) problems
   int eventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH,
       MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());
   int maxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);

   // Start the map-completion events fetcher thread
   final EventFetcher<K,V> eventFetcher =
     new EventFetcher<K,V>(reduceId, umbilical, scheduler, this,
         maxEventsToFetch);
   eventFetcher.start();
   
   // Start the map-output fetcher threads
   boolean isLocal = localMapFiles != null;
  // if the map files are local, start 1 fetcher; otherwise start 5 fetchers by default
   final int numFetchers = isLocal ? 1 :
     jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
   Fetcher<K,V>[] fetchers = new Fetcher[numFetchers];
   if (isLocal) {
     fetchers[0] = new LocalFetcher<K, V>(jobConf, reduceId, scheduler,
         merger, reporter, metrics, this, reduceTask.getShuffleSecret(),
         localMapFiles);
     fetchers[0].start();
  } else {
     for (int i=0; i < numFetchers; ++i) {
       fetchers[i] = new Fetcher<K,V>(jobConf, reduceId, scheduler, merger,
                                      reporter, metrics, this,
                                      reduceTask.getShuffleSecret());
       // start() invokes run(), which in turn calls copyFromHost(host)
         fetchers[i].start();
    }
  }
   
   // Wait for shuffle to complete successfully
   while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
     reporter.progress();
     
     synchronized (this) {
       if (throwable != null) {
         throw new ShuffleError("error in shuffle in " + throwingThreadName,
                                throwable);
      }
    }
  }

   // Stop the event-fetcher thread
   eventFetcher.shutDown();
   
   // Stop the map-output fetcher threads
   for (Fetcher<K,V> fetcher : fetchers) {
     fetcher.shutDown();
  }
   
   // stop the scheduler
   scheduler.close();

   copyPhase.complete(); // copy is already complete
   taskStatus.setPhase(TaskStatus.Phase.SORT);
   reduceTask.statusUpdate(umbilical);

   // Finish the on-going merges...
   RawKeyValueIterator kvIter = null;
   try {
     kvIter = merger.close();
  } catch (Throwable e) {
     throw new ShuffleError("Error while doing final merge " , e);
  }

   // Sanity check
   synchronized (this) {
     if (throwable != null) {
       throw new ShuffleError("error in shuffle in " + throwingThreadName,
                              throwable);
    }
  }
   
   return kvIter;
}
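The default of 5 fetcher threads seen above comes from MRJobConfig.SHUFFLE_PARALLEL_COPIES, i.e. the mapreduce.reduce.shuffle.parallelcopies property, and can be raised for jobs with many map tasks. A minimal sketch, assuming the configuration object from WordCountJob; the value is an example, not a recommendation:

// let each reducer copy map output from up to 10 hosts in parallel
configuration.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);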

Walking through kvIter = merger.close()

kvIter = merger.close();
//(||||||||||||||||||||||||||||||||||||||||||||||||||||)
final RawComparator<K> comparator =
  (RawComparator<K>)job.getOutputKeyComparator();
return finalMerge(jobConf, rfs, memory, disk);
//(||||||||||||||||||||||||||||||||||||||||||||||||||||)
return Merger.merge(job, fs, keyClass, valueClass,
   finalSegments, finalSegments.size(), tmpDir,
   comparator, reporter, spilledRecordsCounter, null, null);
//(||||||||||||||||||||||||||||||||||||||||||||||||||||)
return merge(conf, fs, keyClass, valueClass, segments, mergeFactor, tmpDir,
   comparator, reporter, false, readsCounter, writesCounter,mergePhase);
//(||||||||||||||||||||||||||||||||||||||||||||||||||||)
return new MergeQueue<K, V>(conf, fs, segments, comparator, reporter,
   sortSegments,
   TaskType.REDUCE).merge(keyClass, valueClass,
                       mergeFactor, tmpDir,
                       readsCounter, writesCounter,
                       mergePhase);
//(||||||||||||||||||||||||||||||||||||||||||||||||||||)
// kvIter is ultimately an org.apache.hadoop.mapred.Merger.MergeQueue
org.apache.hadoop.mapred.Merger.MergeQueue

Walking through job.getOutputValueGroupingComparator()

// get the configured grouping comparator
public RawComparator getOutputValueGroupingComparator() {
   Class<? extends RawComparator> theClass = getClass(
     JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
   // if no grouping comparator is configured, fall back to the key comparator
   if (theClass == null) {
     return getOutputKeyComparator();
  }
   
   return ReflectionUtils.newInstance(theClass, this);
}
// if no key comparator is set either, the comparator registered for the key class is used (mapreduce.map.output.key.class)
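Registering your own comparator here (job.setGroupingComparatorClass(...)) lets several distinct keys be fed into a single reduce() call, which is the basis of the secondary-sort pattern. A minimal sketch; the class name and the grouping rule are made up for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class WordGroupingComparator extends WritableComparator {
  protected WordGroupingComparator() {
    super(Text.class, true);   // create Text instances so compare() receives deserialized keys
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // group words by their first character only, so e.g. "hadoop" and "hdfs"
    // are handed to the same reduce() call
    String x = a.toString();
    String y = b.toString();
    char cx = x.isEmpty() ? '\0' : x.charAt(0);
    char cy = y.isEmpty() ? '\0' : y.charAt(0);
    return Character.compare(cx, cy);
  }
}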

Walking through runNewReducer

 private <INKEY,INVALUE,OUTKEY,OUTVALUE> void runNewReducer(JobConf job,
                    final TaskUmbilicalProtocol umbilical,
                    final TaskReporter reporter,
                    RawKeyValueIterator rIter,
                    RawComparator<INKEY> comparator,
                    Class<INKEY> keyClass,
                    Class<INVALUE> valueClass
                    ) throws IOException,InterruptedException,
                             ClassNotFoundException {
   // wrap value iterator to report progress.
   final RawKeyValueIterator rawIter = rIter;
   rIter = new RawKeyValueIterator() {
     public void close() throws IOException {
       rawIter.close();
    }
     public DataInputBuffer getKey() throws IOException {
       return rawIter.getKey();
    }
     public Progress getProgress() {
       return rawIter.getProgress();
    }
     public DataInputBuffer getValue() throws IOException {
       return rawIter.getValue();
    }
     public boolean next() throws IOException {
       boolean ret = rawIter.next();
       reporter.setProgress(rawIter.getProgress().getProgress());
       return ret;
    }
  };
   // make a task context so we can get the classes
   org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
     new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
         getTaskID(), reporter);
   // make a reducer
   // in our job this is WordCountReducer.class
   org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
    (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
       ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
  // create the record writer
  // the real writer is an org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.LineRecordWriter
  // inside NewTrackingRecordWriter: this.real = new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
   org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
     new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(this, taskContext);
   job.setBoolean("mapred.skip.on", isSkipping());
   job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    //(---------------------------- reducerContext)                            
   org.apache.hadoop.mapreduce.Reducer.Context
        reducerContext = createReduceContext(reducer, job, getTaskID(),
                                              rIter, reduceInputKeyCounter,
                                              reduceInputValueCounter,
                                              trackedRW,
                                              committer,
                                              reporter, comparator, keyClass,
                                              valueClass);
   try {
     //(----------------------reducer.run)
     reducer.run(reducerContext);
  } finally {
     trackedRW.close(reducerContext);
  }
}
 

Walking through reducerContext

//org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer
//org.apache.hadoop.mapreduce.task.ReduceContextImpl
reducerContext = new WrappedReducer-->new ReduceContextImpl

Walking through reducer.run

WrappedReducer.nextKey();
ReduceContextImpl.nextKey();
// deserialize the current key
key = keyDeserializer.deserialize(key);
// deserialize the current value
value = valueDeserializer.deserialize(value);
// check whether another record exists
hasMore = input.next();
// if there is a next record
if (hasMore) {
   // read the next key
     nextKey = input.getKey();
   // check whether the next key is the same as the current key
     nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                    currentRawKey.getLength(),
                                    nextKey.getData(),
                                    nextKey.getPosition(),
                                    nextKey.getLength() - nextKey.getPosition()
                                        ) == 0;
  } else {
     nextKeyIsSame = false;
  }
   inputValueCounter.increment(1);
// context.getValues() returns an iterator over the values of the current key (return iterable;)
