Why every value taken from the Iterable in a Hadoop reduce ends up the same

Version:

$ hadoop version
Hadoop 0.20.2-cdh3u4
Subversion git://ubuntu-slave01/var/lib/jenkins/workspace/CDH3u4-Full-RC/build/cdh3/hadoop20/0.20.2-cdh3u4/source -r 214dd731e3bdb687cb55988d3f47dd9e248c5690
Compiled by jenkins on Mon May  7 13:01:39 PDT 2012
From source with checksum a60c9795e41a3248b212344fb131c12c

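Problem description:

When Hadoop runs a Reducer, objects taken from the Iterable<VALUEIN> do not keep their values once the iteration moves on. The code is as follows: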

protected void reduce(Text key, Iterable<VectorWritable> values, Context context)
	        throws IOException, InterruptedException {
	Map<String, VectorWritable> map = new HashMap<String, VectorWritable>();
	for (VectorWritable vw : values) {
		NamedVector nv = (NamedVector) vw.get();
		Item itemi = Item.toInstance(nv.getName());
		map.put(itemi.getItemID(), vw); // stores a reference to vw, not a copy
	}
}

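The keys in the Map differ, but every value in it turns out to be the same.

Cause:

Within a single reduce call, every iteration returns the same in-memory instance. The source: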

public class ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
    extends TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  private RawKeyValueIterator input;
  private Counter inputKeyCounter;
  private Counter inputValueCounter;
  private RawComparator<KEYIN> comparator;
  private KEYIN key;                                  // current key
  private VALUEIN value;                              // this is the instance that gets reused
.....................
}

Each call to next() only refills value (through nextKeyValue()) and returns it:

@Override
public VALUEIN next() {
	// if this is the first record, we don't need to advance
	if (firstValue) {
		firstValue = false;
		return value;
	}
	// if this isn't the first record and the next key is different, they
	// can't advance it here.
	if (!nextKeyIsSame) {
		throw new NoSuchElementException("iterate past last value");
	}
	// otherwise, go to the next key/value pair
	try {
		nextKeyValue();
		return value;
	} catch (IOException ie) {
		throw new RuntimeException("next value iterator failed", ie);
	} catch (InterruptedException ie) {
		// this is bad, but we can't modify the exception list of java.util
		throw new RuntimeException("next value iterator interrupted", ie);        
	}
}

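That is what causes the problem above: the Map ends up holding many references to one object, which carries whatever was read last.

Solution: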
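The aliasing can be reproduced without a cluster. The sketch below is plain Java, not Hadoop code: Box, reusingIterable and collectByReference are hypothetical names made up for illustration. Box stands in for a mutable Writable such as VectorWritable, and the iterator refills one shared object on every step, roughly the way the value field above is reused.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class ReuseDemo {
    /** Hypothetical mutable holder standing in for a Writable value. */
    static final class Box {
        String name;
        @Override public String toString() { return name; }
    }

    /** Hypothetical iterable whose iterator refills ONE shared Box per step. */
    static Iterable<Box> reusingIterable(final String... names) {
        return new Iterable<Box>() {
            @Override public Iterator<Box> iterator() {
                return new Iterator<Box>() {
                    private final Box shared = new Box(); // the single reused instance
                    private int i = 0;
                    @Override public boolean hasNext() { return i < names.length; }
                    @Override public Box next() {
                        if (i >= names.length) {
                            throw new NoSuchElementException("iterate past last value");
                        }
                        shared.name = names[i++]; // overwrite in place
                        return shared;            // always the same reference
                    }
                    @Override public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    /** Stores each returned object by reference, as the buggy reduce() does. */
    static Map<String, Box> collectByReference(String... names) {
        Map<String, Box> map = new LinkedHashMap<String, Box>();
        for (Box b : reusingIterable(names)) {
            map.put(b.name, b); // key is a fresh String, value is the shared Box
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference("a", "b", "c")); // prints {a=c, b=c, c=c}
    }
}
```

The keys stay distinct because each String is captured at put time, while all three values point at the one Box that was last overwritten with "c".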

protected void reduce(Text key, Iterable<VectorWritable> values, Context context)
	        throws IOException, InterruptedException {
	Map<String, VectorWritable> map = new HashMap<String, VectorWritable>();
	for (VectorWritable vectorWritable : values) {
		VectorWritable vw = WritableUtils.clone(vectorWritable, context.getConfiguration());
		NamedVector nv = (NamedVector) vw.get();
		Item itemi = Item.toInstance(nv.getName());
		map.put(itemi.getItemID(), vw);
	}
}

Cloning each value before storing it is enough.
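To show why a clone fixes it, here is a minimal sketch in plain Java. It assumes, as far as I understand WritableUtils.clone, that the copy is made by creating a new instance and passing the original's serialized fields into it. MiniWritable, Box, copyOf and collectCopies are hypothetical stand-ins written for this example and are not Hadoop APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class CopyDemo {
    /** Hypothetical miniature of a Writable-style interface. */
    interface MiniWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical mutable value type. */
    static final class Box implements MiniWritable {
        String name = "";
        @Override public void write(DataOutput out) throws IOException { out.writeUTF(name); }
        @Override public void readFields(DataInput in) throws IOException { name = in.readUTF(); }
        @Override public String toString() { return name; }
    }

    /** Deep copy: serialize the original, then read the bytes into a new Box. */
    static Box copyOf(Box orig) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            orig.write(new DataOutputStream(buf));
            Box fresh = new Box();
            fresh.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return fresh;
        } catch (IOException e) {
            throw new IllegalStateException("in-memory copy failed", e);
        }
    }

    /** Simulates a framework overwriting one shared Box while we keep copies. */
    static Map<String, Box> collectCopies(String... names) {
        Box shared = new Box();
        Map<String, Box> map = new LinkedHashMap<String, Box>();
        for (String n : names) {
            shared.name = n;            // the reused instance is overwritten
            map.put(n, copyOf(shared)); // but the map keeps an independent copy
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(collectCopies("a", "b", "c")); // prints {a=a, b=b, c=c}
    }
}
```

Each copy costs an allocation plus a serialization pass, so when only a few fields are needed it may be cheaper to pull those out into immutable values and skip the clone.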
