First steps in MapReduce source code analysis
Reduce
ReduceTask.run();
The run() method in the ReduceTask class is invoked by YarnChild.
In run(): RawKeyValueIterator rIter = null; — this is the iterator that feeds the reducer. The values parameter in the Reducer's reduce(key, Iterable values, context) method also receives its key/value pairs one at a time through this iterator.
Also in run():
ShuffleConsumerPlugin shuffleConsumerPlugin = null;
Class<? extends ShuffleConsumerPlugin> clazz =
      job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);
shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);
This plugin carries out the shuffle, and a custom shuffle implementation can be plugged in.
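A minimal sketch of plugging in a custom implementation (MyShuffle is a hypothetical subclass of ShuffleConsumerPlugin; the config key is the same one ReduceTask reads above):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.ShuffleConsumerPlugin;
import org.apache.hadoop.mapreduce.MRConfig;

JobConf job = new JobConf();
// MRConfig.SHUFFLE_CONSUMER_PLUGIN = "mapreduce.job.reduce.shuffle.consumer.plugin.class"
job.setClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN,
             MyShuffle.class,            // hypothetical: must extend ShuffleConsumerPlugin
             ShuffleConsumerPlugin.class);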
The run() method of Shuffle.class:
Start the map-completion events fetcher thread: eventFetcher.start(); — this thread starts pulling map-completion events so outputs can be fetched as map tasks finish
Start the map-output fetcher threads
Wait for shuffle to complete successfully
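Condensed from Shuffle.run() (Hadoop 2.x; the exact code varies across versions, so treat this as a sketch of the flow rather than the literal source):

eventFetcher.start();                        // thread pulling map-completion events
for (Fetcher<K,V> fetcher : fetchers) {
  fetcher.start();                           // threads copying map outputs over HTTP
}
while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
  reporter.progress();                       // keep the task alive until all copies finish
}
eventFetcher.shutDown();
for (Fetcher<K,V> fetcher : fetchers) {
  fetcher.shutDown();
}
RawKeyValueIterator kvIter = merger.close(); // final merge; this becomes rIter in ReduceTask.run()
return kvIter;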
Class keyClass = job.getMapOutputKeyClass();
Class valueClass = job.getMapOutputValueClass();
RawComparator comparator = job.getOutputValueGroupingComparator();
if (useNewApi) {
runNewReducer(job, umbilical, reporter, rIter, comparator,
keyClass, valueClass);
}
RawComparator comparator = job.getOutputValueGroupingComparator(); — the grouping comparator, customizable via setGroupingComparatorClass(Class<? extends RawComparator>). It defines how records are grouped: by default, records with equal keys form one group. The comparator decides whether two consecutive keys count as the same (the nextKeyIsSame flag below) and therefore whether their values are handled by a single reduce() call.
Its implementation:
public RawComparator getOutputValueGroupingComparator() {
  Class<? extends RawComparator> theClass = getClass(
    JobContext.GROUP_COMPARATOR_CLASS, null, RawComparator.class);
  if (theClass == null) {
    return getOutputKeyComparator(); // no grouping comparator set: fall back to the sort comparator
  }
  return ReflectionUtils.newInstance(theClass, this);
}
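For illustration, a hypothetical grouping comparator that groups Text keys by their first character only, so e.g. "apple" and "avocado" would reach the same reduce() call:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstCharGroupingComparator extends WritableComparator {
  protected FirstCharGroupingComparator() {
    super(Text.class, true); // true: instantiate keys so compare() gets real objects
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // equal first characters => same group => same reduce() call
    return Character.compare(((Text) a).toString().charAt(0),
                             ((Text) b).toString().charAt(0));
  }
}

// registered on the job:
// job.setGroupingComparatorClass(FirstCharGroupingComparator.class);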
runNewReducer(job, umbilical, reporter, rIter, comparator,
keyClass, valueClass);
The method in ReduceTask:
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   )
// wrap value iterator to report progress.
// make a task context so we can get the classes
// make a reducer
org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
  (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
    ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
  new NewTrackingRecordWriter<OUTKEY,OUTVALUE>(this, taskContext);
job.setBoolean("mapred.skip.on", isSkipping());
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.Reducer.Context
reducerContext = createReduceContext(reducer, job, getTaskID(),
rIter, reduceInputKeyCounter,
reduceInputValueCounter,
trackedRW, // the record writer that supplies the write() method
committer,
reporter, comparator, keyClass,
valueClass);
try {
reducer.run(reducerContext);
reducerContext is an instance of ReduceContextImpl:
public ReduceContextImpl(Configuration conf, TaskAttemptID taskid,
RawKeyValueIterator input,
Counter inputKeyCounter,
Counter inputValueCounter,
RecordWriter<KEYOUT,VALUEOUT> output,
OutputCommitter committer,
StatusReporter reporter,
RawComparator comparator,
Class keyClass,
Class valueClass
)
The key methods of this ReduceContextImpl object: nextKeyValue(), getCurrentKey(), getCurrentValue().
nextKeyValue()
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!hasMore) {
key = null;
value = null;
return false;
}
firstValue = !nextKeyIsSame;
DataInputBuffer nextKey = input.getKey(); // input is the RawKeyValueIterator; like a pointer, it sits on one key/value pair in the merged stream at a time
currentRawKey.set(nextKey.getData(), nextKey.getPosition(),
nextKey.getLength() - nextKey.getPosition());
buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
key = keyDeserializer.deserialize(key); // deserialize the key
DataInputBuffer nextVal = input.getValue();
buffer.reset(nextVal.getData(), nextVal.getPosition(), nextVal.getLength()
- nextVal.getPosition());
value = valueDeserializer.deserialize(value); // deserialize and return the value
currentKeyLength = nextKey.getLength() - nextKey.getPosition();
currentValueLength = nextVal.getLength() - nextVal.getPosition();
if (isMarked) {
backupStore.write(nextKey, nextVal);
}
hasMore = input.next();
if (hasMore) {
nextKey = input.getKey(); // peek at the following key so it can be compared with the current one
nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
currentRawKey.getLength(),
nextKey.getData(),
nextKey.getPosition(),
nextKey.getLength() - nextKey.getPosition()
) == 0;
} else {
nextKeyIsSame = false;
}
inputValueCounter.increment(1);
return true;
} This method fetches the current key/value pair and returns a boolean indicating whether one was read.
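A short trace helps; assume the sorted, merged input stream is (a,1) (a,2) (b,3):

// call 1: key=a, value=1, hasMore=true,  nextKeyIsSame=true   (the next record is also key a)
// call 2: key=a, value=2, hasMore=true,  nextKeyIsSame=false  (the next record is key b)
// call 3: key=b, value=3, hasMore=false, nextKeyIsSame=false
// call 4: returns false, setting key and value to null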
getCurrentKey()
getCurrentValue()
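Both are trivial accessors in ReduceContextImpl; they simply return the fields that nextKeyValue() populated:

public KEYIN getCurrentKey() {
  return key;
}

public VALUEIN getCurrentValue() {
  return value;
}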
The Reducer's run() method:
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKey()) {
reduce(context.getCurrentKey(), context.getValues(), context);
// If a back up store is used, reset it
Iterator<VALUEIN> iter = context.getValues().iterator();
if (iter instanceof ReduceContext.ValueIterator) {
  ((ReduceContext.ValueIterator<VALUEIN>)iter).resetBackupStore();
}
}
}
} finally {
cleanup(context);
}
}
The reduce() called in the loop is the Reducer's method, which we override with our own implementation.
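Since run() is public, a Reducer subclass can also replace this loop entirely. A hypothetical sketch (TopOneReducer is illustrative, not from the Hadoop source) that processes only the first key group:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopOneReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      if (context.nextKey()) { // only the first group, instead of while (context.nextKey())
        reduce(context.getCurrentKey(), context.getValues(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}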
Our own custom reduce method, for example:
protected void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
// <“hello”, [1,1,1,1,1,1,1,1,1]>
// get an iterator and traverse the values
Iterator<LongWritable> iterator = values.iterator();
long sum = 0L;
while (iterator.hasNext()) {
LongWritable num = iterator.next();
sum += num.get();
}
// wrap the total in a LongWritable and write it to the job output (HDFS)
outValue.set(sum);
context.write(key, outValue);
}
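Made self-contained, the class around this snippet might look like the following (SumReducer and the word-count framing are assumptions consistent with the snippet):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable outValue = new LongWritable();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0L;
    for (LongWritable num : values) { // same traversal as the explicit iterator above
      sum += num.get();
    }
    outValue.set(sum);                // wrap the total and emit <word, count>
    context.write(key, outValue);
  }
}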
The Iterable values parameter is what context.getValues() returns: private ValueIterable iterable = new ValueIterable();
The iterator inside ValueIterable is implemented as follows:
hasNext()
return firstValue || nextKeyIsSame; — true while either this is the first value of the group or the next key equals the current one. One reduce() call consumes one group of equal keys; a different key may be handed to a different reduce() call.
nextKeyIsSame is the flag computed in nextKeyValue() above:
nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
currentRawKey.getLength(),
nextKey.getData(),
nextKey.getPosition(),
nextKey.getLength() - nextKey.getPosition()
) == 0;
So whether two records share the same reduce() call is decided entirely by the comparator.
next()
calls nextKeyValue(); — this fetches the next key/value pair and returns a boolean, so each further call advances to the next value
return value;
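One consequence worth noting: since hasNext() depends only on firstValue and nextKeyIsSame over the single underlying RawKeyValueIterator, the values Iterable streams each group once. An illustrative sketch inside a reduce() over LongWritable values (re-iterating would require the mark()/reset() backup store seen in nextKeyValue()):

long sum = 0L;
for (LongWritable v : values) { // first pass consumes the underlying iterator
  sum += v.get();
}
for (LongWritable v : values) { // second pass sees nothing: nextKeyIsSame is already
  sum -= v.get();               // false at the group boundary, so hasNext() is false
}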