For an introduction to MinHash, see http://rdc.taobao.com/team/jm/archives/2434
Initialization
Configuration conf = getConf();
conf.setInt(MinhashOptionCreator.MIN_CLUSTER_SIZE, minClusterSize);
conf.setInt(MinhashOptionCreator.MIN_VECTOR_SIZE, minVectorSize);
conf.set(MinhashOptionCreator.HASH_TYPE, hashType);
conf.setInt(MinhashOptionCreator.NUM_HASH_FUNCTIONS, numHashFunctions);
conf.setInt(MinhashOptionCreator.KEY_GROUPS, keyGroups);
conf.setBoolean(MinhashOptionCreator.DEBUG_OUTPUT, debugOutput);
Set the Hadoop job parameters:
job.setMapperClass(MinHashMapper.class);
job.setReducerClass(MinHashReducer.class);
The setup() method first reads the option parameters from the configuration, then obtains the hash functions according to hashType.
In the tf-idf sequence file, the key is a string identifying the document and the value is a vector: each vector element's index is a feature index, and its value is that feature's tf-idf weight. You need to understand this layout to understand the mapper.
Get the features:
Vector featureVector = features.get();
for (int i = 0; i < numHashFunctions; i++) {
  minHashValues[i] = Integer.MAX_VALUE;
}
Compute this document's MinHash signature:
for (int i = 0; i < numHashFunctions; i++) {
  for (Vector.Element ele : featureVector) {
    int value = (int) ele.get();
    bytesToHash[0] = (byte) (value >> 24);
    bytesToHash[1] = (byte) (value >> 16);
    bytesToHash[2] = (byte) (value >> 8);
    bytesToHash[3] = (byte) value;
    int hashIndex = hashFunction[i].hash(bytesToHash);
    // if our new hash value is less than the old one, replace the old one
    if (minHashValues[i] > hashIndex) {
      minHashValues[i] = hashIndex;
    }
  }
}
The mapper then writes its output:
for (int i = 0; i < numHashFunctions; i++) {
  StringBuilder clusterIdBuilder = new StringBuilder();
  for (int j = 0; j < keyGroups; j++) {
    clusterIdBuilder.append(minHashValues[(i + j) % numHashFunctions]).append('-');
  }
  // remove the last dash
  clusterIdBuilder.deleteCharAt(clusterIdBuilder.length() - 1);
  Text cluster = new Text(clusterIdBuilder.toString());
  Writable point;
  if (debugOutput) {
    point = new VectorWritable(featureVector.clone());
  } else {
    point = new Text(item.toString());
  }
  context.write(cluster, point);
}
The key point here is the meaning of keyGroups: it concatenates several consecutive MinHash values into one cluster key, so two items only fall into the same cluster when multiple hash values agree at once. This reduces spurious collisions between two items and makes a match more trustworthy; see:
http://mail-archives.apache.org/mod_mbox/mahout-user/201111.mbox/%[email protected]%3E
The mapper's output key is the concatenated keyGroups string described above, and the value is the document id (or, with debugOutput enabled, the feature vector).
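To make the keying scheme concrete, here is a small self-contained sketch (not Mahout code): a toy linear hash family stands in for Mahout's HashFactory, and cluster keys are built by concatenating keyGroups consecutive signature values with '-', just as the mapper above does. All coefficients and document feature sets are invented for illustration.

```java
import java.util.*;

public class MinHashDemo {
  static final int NUM_HASH = 4;   // stands in for numHashFunctions
  static final int KEY_GROUPS = 2; // stands in for keyGroups

  // Toy hash family h_i(x) = (a_i * x + b_i) mod p; the real job uses
  // the hash functions produced from hashType, this is only a stand-in.
  static int hash(int i, int value) {
    int[] a = {3, 5, 7, 11};
    int[] b = {1, 2, 3, 4};
    long p = 2147483647L; // a large prime
    return (int) ((a[i] * (long) value + b[i]) % p);
  }

  // MinHash signature: for each hash function, the minimum hash
  // value over all features of the document.
  static int[] signature(int[] features) {
    int[] sig = new int[NUM_HASH];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (int i = 0; i < NUM_HASH; i++) {
      for (int f : features) {
        sig[i] = Math.min(sig[i], hash(i, f));
      }
    }
    return sig;
  }

  // Build cluster keys as the mapper does: concatenate KEY_GROUPS
  // consecutive signature values (wrapping around) with '-'.
  static List<String> clusterKeys(int[] sig) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < NUM_HASH; i++) {
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < KEY_GROUPS; j++) {
        if (j > 0) sb.append('-');
        sb.append(sig[(i + j) % NUM_HASH]);
      }
      keys.add(sb.toString());
    }
    return keys;
  }

  public static void main(String[] args) {
    int[] doc1 = {1, 2, 3, 4, 5};
    int[] doc2 = {1, 2, 3, 4, 6};  // very similar to doc1
    int[] doc3 = {100, 200, 300};  // unrelated

    List<String> shared12 = clusterKeys(signature(doc1));
    shared12.retainAll(new HashSet<>(clusterKeys(signature(doc2))));
    List<String> shared13 = clusterKeys(signature(doc1));
    shared13.retainAll(new HashSet<>(clusterKeys(signature(doc3))));
    System.out.println("doc1/doc2 shared keys: " + shared12.size());
    System.out.println("doc1/doc3 shared keys: " + shared13.size());
  }
}
```

With these toy inputs the two similar documents share cluster keys while the unrelated one shares none, which is exactly why documents landing under the same key become cluster candidates.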
MinHashReducer
@Override
protected void reduce(Text cluster, Iterable<Writable> points, Context context)
    throws IOException, InterruptedException {
  Collection<Writable> pointList = Lists.newArrayList();
  for (Writable point : points) {
    if (debugOutput) {
      Vector pointVector = ((VectorWritable) point).get().clone();
      Writable writablePointVector = new VectorWritable(pointVector);
      pointList.add(writablePointVector);
    } else {
      Writable pointText = new Text(point.toString());
      pointList.add(pointText);
    }
  }
  if (pointList.size() >= minClusterSize) {
    context.getCounter(Clusters.ACCEPTED).increment(1);
    for (Writable point : pointList) {
      context.write(cluster, point);
    }
  } else {
    context.getCounter(Clusters.DISCARDED).increment(1);
  }
}
As the reducer shows, MinHash's output is not the final clustering result: each document is emitted under one key per hash function, so the same document can appear in many clusters, and you still have to post-process the output yourself to obtain the final result.
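As one possible post-processing step, the sketch below (plain Java, not part of Mahout; the method name and record format are invented for illustration) groups the reducer's (clusterId, docId) records by cluster key and emits deduplicated candidate pairs, so that a pair of documents sharing several cluster keys is reported only once:

```java
import java.util.*;

public class MinHashPostProcess {
  // Turn (clusterId, docId) records into a deduplicated set of
  // candidate document pairs. In practice the records would be read
  // from the job's output files; here they are passed in directly.
  public static Set<String> candidatePairs(List<String[]> records) {
    Map<String, List<String>> byCluster = new HashMap<>();
    for (String[] rec : records) {
      byCluster.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
    }
    Set<String> pairs = new TreeSet<>();
    for (List<String> docs : byCluster.values()) {
      for (int i = 0; i < docs.size(); i++) {
        for (int j = i + 1; j < docs.size(); j++) {
          String a = docs.get(i), b = docs.get(j);
          // normalize order so the same pair found under
          // different cluster keys collapses to one entry
          pairs.add(a.compareTo(b) < 0 ? a + "," + b : b + "," + a);
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<String[]> records = Arrays.asList(
        new String[]{"4-7", "docA"}, new String[]{"4-7", "docB"},
        new String[]{"7-10", "docB"}, new String[]{"7-10", "docA"},
        new String[]{"301-502", "docC"});
    // docA/docB share two cluster keys but yield one pair
    System.out.println(candidatePairs(records));
  }
}
```

Depending on the application, these candidate pairs could then be verified with an exact similarity measure, or merged transitively into final clusters.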