This section analyzes the last phase of the ReduceTask: reduce. After the copy and sort phases, the input for reduce is ready. It is supplied through Reducer.Context, which wraps the iterator built during the sort phase and can iterate over KV pairs held in memory as well as on disk. Two nested loops deserve attention here: (1) the loop over KEYs, implemented by the Reducer class (see its run method below), and (2) the loop over VALUEs inside the user-defined reduce function. The user-defined reduce function consumes the value iterator; when the iterator is exhausted, it is time to move on to the next KEY. The output of the reduce function goes directly to its destination, such as HDFS, and the exact location is configurable. Let us first look at how the run method of Reducer implements the KEY loop:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKey()) { // loop over KEYs
    reduce(context.getCurrentKey(), context.getValues(), context); // call the user-defined reduce
  }
  cleanup(context);
}
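run therefore defines the whole task lifecycle: setup is called once, reduce once per KEY, and cleanup once at the end. As a minimal sketch of how this lifecycle is typically used (my own example, not part of the Hadoop source quoted here; the property name wordcount.threshold is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private int threshold; // read once per task in setup()
  private long emitted;  // task-level state, reported in cleanup()

  @Override
  protected void setup(Context context) {
    // "wordcount.threshold" is a hypothetical user-defined property
    threshold = context.getConfiguration().getInt("wordcount.threshold", 0);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    if (sum >= threshold) { // emit only keys at or above the threshold
      context.write(key, new IntWritable(sum));
      emitted++;
    }
  }

  @Override
  protected void cleanup(Context context) {
    // runs exactly once, after the last KEY has been processed
    System.err.println("keys emitted: " + emitted);
  }
}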
The logic of nextKey is as follows:
/** Start processing next unique key. */
public boolean nextKey() throws IOException, InterruptedException {
  // when a new KEY is about to be read, nextKeyIsSame is false; this loop
  // drains any values of the previous KEY that reduce() did not consume
  while (hasMore && nextKeyIsSame) {
    nextKeyValue();
  }
  if (hasMore) {
    if (inputKeyCounter != null) {
      inputKeyCounter.increment(1);
    }
    return nextKeyValue(); // for the new KEY, read ahead one KV pair
  } else {
    return false;
  }
}
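The while loop at the top of nextKey matters: if the user-defined reduce function returns before consuming every value of its KEY, the leftover values are skipped here before the next KEY starts. A minimal sketch that relies on this behavior (my own example, written as a reduce method inside a Reducer<Text, IntWritable, Text, IntWritable>):

// This reduce consumes only the first value of each KEY; the remaining
// values are drained by nextKey() before the next KEY is processed.
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  context.write(key, values.iterator().next());
}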
nextKeyValue, which performs the read-ahead of a KV pair, is implemented as follows:
public boolean nextKeyValue() throws IOException, InterruptedException {
  if (!hasMore) {
    key = null;
    value = null;
    return false;
  }
  // when a new KEY is being read, nextKeyIsSame is false, so firstValue
  // becomes true; for subsequent records of the same KEY it becomes false
  firstValue = !nextKeyIsSame;
  // read the KEY into the buffer and deserialize it
  DataInputBuffer next = input.getKey();
  currentRawKey.set(next.getData(), next.getPosition(),
                    next.getLength() - next.getPosition());
  buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
  key = keyDeserializer.deserialize(key);
  // read the VALUE into the buffer and deserialize it
  next = input.getValue();
  buffer.reset(next.getData(), next.getPosition(), next.getLength());
  value = valueDeserializer.deserialize(value);
  // read ahead one more record and compare its KEY with the current one
  // to set nextKeyIsSame
  hasMore = input.next();
  if (hasMore) {
    next = input.getKey();
    nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                       currentRawKey.getLength(),
                                       next.getData(),
                                       next.getPosition(),
                                       next.getLength() - next.getPosition()
                                       ) == 0;
  } else {
    nextKeyIsSame = false;
  }
  inputValueCounter.increment(1);
  return true;
}
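Note that the comparator used here to compute nextKeyIsSame is the grouping comparator (job.getOutputValueGroupingComparator), which defaults to the key class's own comparator but can be replaced with Job.setGroupingComparatorClass. A sketch of a custom grouping comparator (my own example; it assumes keys of the form "word#docId" and groups by the part before '#', so all records of one word reach a single reduce call even though their full keys differ):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstFieldComparator extends WritableComparator {
  public FirstFieldComparator() {
    super(Text.class, true); // create Text instances so compare() receives objects
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // compare only the part before '#' (assumed key format: "word#docId")
    String fa = a.toString().split("#", 2)[0];
    String fb = b.toString().split("#", 2)[0];
    return fa.compareTo(fb);
  }
}

// wired into the job:
// job.setGroupingComparatorClass(FirstFieldComparator.class);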
With the KEY check done, if there is still data, control enters the user-defined reduce function. Take WordCount as an example. Because the function iterates over all VALUEs of one KEY, it receives an Iterable as its second parameter, which wraps org.apache.hadoop.mapreduce.ReduceContext.ValueIterator:
protected void reduce(Text key, Iterable<IntWritable> values,
    Context context) throws IOException, InterruptedException {
  Iterator<IntWritable> iterator = values.iterator(); // obtain the value iterator
  int sum = 0;
  while (iterator.hasNext()) {    // is there another VALUE?
    sum += iterator.next().get(); // user-defined aggregation
  }
  context.write(key, new IntWritable(sum)); // emit the result
}
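For completeness, a minimal driver that wires this reducer into a job might look like the following sketch (the input/output paths come from the command line, and the mapper class WordCountMapper is an assumption, not shown here). FileOutputFormat.setOutputPath is where the customizable output destination mentioned at the start of this section is configured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);   // assumed to emit <Text, IntWritable>
    job.setReducerClass(WordCountReducer.class); // the reduce() shown above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}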
During the VALUE iteration, every time a VALUE is read, the read-ahead in nextKeyValue compares the next record's KEY with the current one to set nextKeyIsSame. When a KEY has multiple VALUEs, the moment nextKeyIsSame becomes false the iterator is exhausted, which means the next KEY must be processed:
protected class ValueIterator implements Iterator<VALUEIN> {
  @Override
  public boolean hasNext() {
    return firstValue || nextKeyIsSame;
  }

  @Override
  public VALUEIN next() {
    // the first record of a KEY is already in memory: return it directly;
    // note how firstValue changes state here
    if (firstValue) {
      firstValue = false;
      return value;
    }
    // if this isn't the first record and the next key is different, they
    // can't advance it here.
    if (!nextKeyIsSame) {
      throw new NoSuchElementException("iterate past last value");
    }
    // read the next KV pair; see the analysis of nextKeyValue above
    try {
      nextKeyValue();
      return value;
    } catch (IOException ie) {
      throw new RuntimeException("next value iterator failed", ie);
    } catch (InterruptedException ie) {
      // this is bad, but we can't modify the exception list of java.util
      throw new RuntimeException("next value iterator interrupted", ie);
    }
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException("remove not implemented");
  }
}
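One consequence of this design is visible in nextKeyValue above: the key and value are deserialized into the same objects on every call (keyDeserializer.deserialize(key) and valueDeserializer.deserialize(value) reuse the existing instances), so next() keeps returning the same object with new contents. A reduce function that wants to keep values beyond one iteration step must copy them; a sketch of the pitfall and the fix (my own example, assuming java.util.List/ArrayList are imported):

// The iterator reuses one IntWritable instance, so storing the reference
// itself would leave every stored element holding the last value read.
// Copy the contents instead.
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  List<IntWritable> kept = new ArrayList<>();
  for (IntWritable v : values) {
    kept.add(new IntWritable(v.get())); // copy the value, not the reference
    // equivalently: kept.add(WritableUtils.clone(v, context.getConfiguration()));
  }
  for (IntWritable v : kept) { // the copies remain distinct and usable
    context.write(key, v);
  }
}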
When the reduce phase writes its output, a destination such as HDFS is written to directly: HDFS acts as the server and the reduce task as the client, and the write goes through FSDataOutputStream just like any other HDFS write. We will not analyze that path further here.