Generating the co-occurrence matrix is the second major step of RecommenderJob; the first step, which generates the preference matrix, is analyzed in a previous post (link). Before reading on, you may find it worthwhile to look at Mahout's brief introduction to RowSimilarityJob (link, in English). There is also one more link I found quite helpful: a mailing-list discussion between a user and the developers (link).
This step takes the preference matrix produced by PreparePreferenceMatrixJob in the previous step as its input and processes it further into the co-occurrence matrix. The work is again split into three sub-steps, each carried out by one mapper and one reducer. Each sub-step is analyzed below.
Sub-step 1:
The mapper, VectorNormMapper, re-groups the vectors of the preference matrix: it turns the item-keyed vectors (itemId, VectorWritable<userId, pref>) back into user-keyed vectors (userId, VectorWritable<itemId, pref>). A sub-step of PreparePreferenceMatrixJob had already produced vectors in the (userId, VectorWritable<itemId, pref>) format, only to convert them into (itemId, VectorWritable<userId, pref>), and here they get converted back again, which at first looks redundant. A plausible explanation is that RowSimilarityJob is a generic job that computes similarities between the rows of an arbitrary matrix, so it expects its input keyed by row (here, itemId) and performs its own transposition internally.
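Stripped of the Hadoop plumbing, this re-grouping is just a matrix transpose. Below is a minimal standalone sketch of the idea in plain Java (my own illustration with ordinary maps and a made-up class name; the real job uses Mahout's sparse vectors, emits one single-entry partial vector per cell, and leaves the merging to the reducer):

import java.util.HashMap;
import java.util.Map;

public class TransposeSketch {

  // re-key (itemId -> {userId: pref}) into (userId -> {itemId: pref})
  static Map<Integer, Map<Integer, Double>> transpose(Map<Integer, Map<Integer, Double>> itemRows) {
    Map<Integer, Map<Integer, Double>> userRows = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> itemRow : itemRows.entrySet()) {
      int itemId = itemRow.getKey();
      for (Map.Entry<Integer, Double> cell : itemRow.getValue().entrySet()) {
        // every (itemId, userId, pref) cell becomes a (userId, itemId, pref) cell
        userRows.computeIfAbsent(cell.getKey(), k -> new HashMap<>()).put(itemId, cell.getValue());
      }
    }
    return userRows;
  }

  public static void main(String[] args) {
    Map<Integer, Map<Integer, Double>> itemRows = new HashMap<>();
    itemRows.put(5, Map.of(2, 3.0, 7, 4.0)); // item 5 was rated by users 2 and 7
    itemRows.put(9, Map.of(2, 5.0));         // item 9 was rated by user 2
    System.out.println(transpose(itemRows)); // {2={5=3.0, 9=5.0}, 7={5=4.0}} (order may vary)
  }
}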
Note that the mapper produces output not only in map() but also in its cleanup() method: there it emits the per-row norms, the non-zero-entry counts, and the maximum values as three extra vectors keyed by special marker IDs (the "dirty trick" in the code below), so that they travel through the same shuffle as the regular data.
protected void map(IntWritable row, VectorWritable vectorWritable, Context ctx)
    throws IOException, InterruptedException {
  // input: (itemId, VectorWritable<userId, pref>)
  Vector rowVector = similarity.normalize(vectorWritable.get());

  int numNonZeroEntries = 0;
  double maxValue = Double.MIN_VALUE;

  Iterator<Vector.Element> nonZeroElements = rowVector.iterateNonZero();
  while (nonZeroElements.hasNext()) {
    Vector.Element element = nonZeroElements.next();
    // emit a single-entry partial column vector: (userId, {itemId: pref})
    RandomAccessSparseVector partialColumnVector = new RandomAccessSparseVector(Integer.MAX_VALUE);
    partialColumnVector.setQuick(row.get(), element.get());
    ctx.write(new IntWritable(element.index()), new VectorWritable(partialColumnVector));

    numNonZeroEntries++;
    if (maxValue < element.get()) {
      maxValue = element.get();
    }
  }

  if (threshold != NO_THRESHOLD) {
    nonZeroEntries.setQuick(row.get(), numNonZeroEntries);
    maxValues.setQuick(row.get(), maxValue);
  }
  norms.setQuick(row.get(), similarity.norm(rowVector));

  ctx.getCounter(Counters.ROWS).increment(1);
}

@Override
protected void cleanup(Context ctx) throws IOException, InterruptedException {
  super.cleanup(ctx);
  // dirty trick: ship the per-row statistics through the shuffle under marker keys
  ctx.write(new IntWritable(NORM_VECTOR_MARKER), new VectorWritable(norms));
  ctx.write(new IntWritable(NUM_NON_ZERO_ENTRIES_VECTOR_MARKER), new VectorWritable(nonZeroEntries));
  ctx.write(new IntWritable(MAXVALUE_VECTOR_MARKER), new VectorWritable(maxValues));
}
The reducer, MergeVectorsReducer, merges the mapper output (userId, VectorWritable<itemId, pref>): all partial vectors sharing the same userId are merged into one vector and written out. Before the reducer there is also a combiner step that performs the same merge on the map side. When writing its output, the reducer distinguishes between the different kinds of records: as the mapper code above shows, several kinds of data are emitted, and the reducer routes the marker rows (norms, max values, non-zero-entry counts) to separate side files while ordinary user rows go to the regular output.
protected void reduce(IntWritable row, Iterable<VectorWritable> partialVectors, Context ctx)
    throws IOException, InterruptedException {
  Vector partialVector = Vectors.merge(partialVectors);

  if (row.get() == NORM_VECTOR_MARKER) {
    Vectors.write(partialVector, normsPath, ctx.getConfiguration());
  } else if (row.get() == MAXVALUE_VECTOR_MARKER) {
    Vectors.write(partialVector, maxValuesPath, ctx.getConfiguration());
  } else if (row.get() == NUM_NON_ZERO_ENTRIES_VECTOR_MARKER) {
    Vectors.write(partialVector, numNonZeroEntriesPath, ctx.getConfiguration(), true);
  } else {
    // an ordinary user row: (userId, VectorWritable<itemId, pref>)
    ctx.write(row, new VectorWritable(partialVector));
  }
}
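The merge itself is straightforward: as far as I can tell, Vectors.merge just folds the non-zero entries of all partial vectors into a single vector, and the entries cannot collide because each map() call emits a given (userId, itemId) cell at most once. A conceptual stand-in (my own sketch, not Mahout's actual implementation):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

  // conceptual stand-in for Vectors.merge: union the entries of all partial vectors
  static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
    Map<Integer, Double> merged = new HashMap<>();
    for (Map<Integer, Double> partial : partials) {
      merged.putAll(partial); // entries are disjoint, so nothing is overwritten
    }
    return merged;
  }

  public static void main(String[] args) {
    // two partial vectors for the same userId, emitted by different map() calls
    System.out.println(merge(List.of(Map.of(5, 3.0), Map.of(9, 5.0)))); // {5=3.0, 9=5.0} (order may vary)
  }
}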
Sub-step 2:
The mapper, CooccurrencesMapper, processes the previous reducer's output (userId, VectorWritable<itemId, pref>). For each user vector it loops over the elements and pairs them with one another, emitting (itemId_index, Vector<itemId_index, value>). The value is computed differently depending on the chosen recommendation strategy; with a count-based strategy such as CountbasedMeasure, the value is 1 whenever the same user has seen both items. A concrete input/output example follows the code below.
protected void map(IntWritable column, VectorWritable occurrenceVector, Context ctx)
    throws IOException, InterruptedException {
  // input: (userId, VectorWritable<itemId, pref>)
  Vector.Element[] occurrences = Vectors.toArray(occurrenceVector);
  Arrays.sort(occurrences, BY_INDEX);

  int cooccurrences = 0;
  int prunedCooccurrences = 0;
  for (int n = 0; n < occurrences.length; n++) {
    Vector.Element occurrenceA = occurrences[n];
    Vector dots = new RandomAccessSparseVector(Integer.MAX_VALUE);
    // the inner loop starts at m = n, so the element is first paired with itself —
    // a self-to-self relation (handled later by SimilarityReducer, see below)
    for (int m = n; m < occurrences.length; m++) {
      Vector.Element occurrenceB = occurrences[m];
      if (threshold == NO_THRESHOLD || consider(occurrenceA, occurrenceB)) {
        // for CountbasedMeasure, aggregate() always returns 1
        dots.setQuick(occurrenceB.index(), similarity.aggregate(occurrenceA.get(), occurrenceB.get()));
        cooccurrences++;
      } else {
        prunedCooccurrences++;
      }
    }
    // emit itemA together with every item it co-occurs with for this user
    ctx.write(new IntWritable(occurrenceA.index()), new VectorWritable(dots));
  }
  ctx.getCounter(Counters.COOCCURRENCES).increment(cooccurrences);
  ctx.getCounter(Counters.PRUNED_COOCCURRENCES).increment(prunedCooccurrences);
}
The input looks like this (each column is one user vector, listing the rows, i.e. items, that user touched):

column1: row1, row2, row3
column2: row1, row3
column3: row2

With CountbasedMeasure, the mapper output will be:

for column1: (row1, {row1:1, row2:1, row3:1}), (row2, {row2:1, row3:1}), (row3, {row3:1})
for column2: (row1, {row1:1, row3:1}), (row3, {row3:1})
for column3: (row2, {row2:1})
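You can convince yourself of this output by replaying the mapper's double loop on the example. A standalone sketch (plain Java with a made-up class name; integer 1s stand in for CountbasedMeasure's aggregate(), which always returns 1):

import java.util.Map;
import java.util.TreeMap;

public class CooccurrencePairsSketch {
  public static void main(String[] args) {
    // each array is one user (column) vector: the rows (items) that user touched
    int[][] columns = { {1, 2, 3}, {1, 3}, {2} };
    for (int c = 0; c < columns.length; c++) {
      int[] items = columns[c]; // already sorted, as after Arrays.sort(occurrences, BY_INDEX)
      Map<Integer, Map<Integer, Integer>> emitted = new TreeMap<>();
      for (int n = 0; n < items.length; n++) {
        for (int m = n; m < items.length; m++) { // m starts at n: self-pair included
          emitted.computeIfAbsent(items[n], k -> new TreeMap<>())
                 .put(items[m], 1); // CountbasedMeasure.aggregate(...) always returns 1
        }
      }
      System.out.println("for column" + (c + 1) + ": " + emitted);
    }
  }
}

Running this prints exactly the three "for columnN" lines above, one map per itemA.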
The reducer, SimilarityReducer, then aggregates these partial outputs: for each item it sums up the partial dot-product vectors, computes the pairwise similarity between that item and every item co-occurring with it, and so produces the final co-occurrence (similarity) matrix. This is also where the question raised in the mapper code is answered: when excludeSelfSimilarity is set, the self-to-self pairs emitted by CooccurrencesMapper are zeroed out.
protected void reduce(IntWritable row, Iterable<VectorWritable> partialDots, Context ctx)
    throws IOException, InterruptedException {
  // input: (itemIdA, VectorWritable<itemIdB, 1>)
  Iterator<VectorWritable> partialDotsIterator = partialDots.iterator();
  Vector dots = partialDotsIterator.next().get();
  while (partialDotsIterator.hasNext()) {
    Vector toAdd = partialDotsIterator.next().get();
    Iterator<Vector.Element> nonZeroElements = toAdd.iterateNonZero();
    while (nonZeroElements.hasNext()) {
      Vector.Element nonZeroElement = nonZeroElements.next();
      // accumulate the scores of each itemId that co-occurs with row
      dots.setQuick(nonZeroElement.index(),
          dots.getQuick(nonZeroElement.index()) + nonZeroElement.get());
    }
  }

  Vector similarities = dots.like();
  double normA = norms.getQuick(row.get());
  Iterator<Vector.Element> dotsWith = dots.iterateNonZero();
  while (dotsWith.hasNext()) {
    Vector.Element b = dotsWith.next();
    double similarityValue =
        similarity.similarity(b.get(), normA, norms.getQuick(b.index()), numberOfColumns);
    if (similarityValue >= treshold) { // sic: "treshold" is the field name in the source
      similarities.set(b.index(), similarityValue);
    }
  }
  if (excludeSelfSimilarity) {
    // drop the self-to-self pairs produced by CooccurrencesMapper
    similarities.setQuick(row.get(), 0);
  }
  ctx.write(row, new VectorWritable(similarities));
}
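As a final piece of intuition: similarity.similarity(dot, normA, normB, numberOfColumns) turns the accumulated dot product and the two norms precomputed in sub-step 1 into the actual similarity value, and the formula it applies depends on the configured VectorSimilarityMeasure. As one illustration (my own sketch, not Mahout code), a cosine-style measure would compute:

public class CosineSketch {

  // cosine(a, b) = dot(a, b) / (||a|| * ||b||); normA and normB would come from
  // the norms side file written by the first sub-step
  static double cosine(double dot, double normA, double normB) {
    return dot / (normA * normB);
  }

  public static void main(String[] args) {
    // hypothetical numbers: items A and B have accumulated dot product 2.0,
    // and their norms are 1.8 and 2.2
    System.out.println(cosine(2.0, 1.8, 2.2)); // ~0.505
  }
}

For the pure CountbasedMeasure case, as far as I can tell, the similarity is essentially the accumulated co-occurrence count itself.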