Creating the dictionary and tf-vectors
The implementing class is DictionaryVectorizer.
It is driven by the createTermFrequencyVectors method, whose parameters are:
input, output, tfVectorsFolderName, baseConf -- these are self-explanatory
minSupport -- the minimum number of occurrences a term must have before it is placed into the sparse vectors; the default is 2
maxNGramSize -- the maximum n-gram size, which comes into play when computing the LLR (log-likelihood ratio): with 3, 3-grams, 2-grams and 1-grams are produced; with 2, 2-grams and 1-grams; the default is 1
minLLRValue -- the minimum LLR value; n-grams scoring above it are treated as words that genuinely occur together; the default is 1, and it is best left at the default
normPower -- normalization is a way to offset the effect of document length; in statistics this is the p-norm (see section 8.4 of Mahout in Action). The default is 0; it is worth changing, for example to 2 (see the sketch after this list)
logNormalize -- whether to apply log normalization; it is not yet clear to me what this is for
chunkSizeInMegabytes -- choosing this value takes some care and has a large impact on performance; the details are still unclear to me and deserve a dedicated look later
The remaining parameters are self-explanatory.
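To make normPower concrete, the following toy snippet shows what p-norm normalization does to a term-frequency vector. PNormDemo is a made-up class for illustration only; Mahout most likely performs the equivalent internally via its Vector.normalize(power).

// Toy illustration of p-norm normalization (the role of normPower); not Mahout code.
public class PNormDemo {
  public static void main(String[] args) {
    double[] tf = {3.0, 1.0, 2.0};  // raw term frequencies of one document
    double power = 2.0;             // e.g. normPower = 2 -> Euclidean (L2) norm
    double norm = 0.0;
    for (double v : tf) {
      norm += Math.pow(Math.abs(v), power);
    }
    norm = Math.pow(norm, 1.0 / power);
    // Each component is divided by the p-norm, so long and short documents
    // end up on a comparable scale.
    for (int i = 0; i < tf.length; i++) {
      System.out.printf("%.4f%n", tf[i] / norm);
    }
  }
}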
dictionary job
Path dictionaryJobPath = new Path(output, DICTIONARY_JOB_FOLDER);
int[] maxTermDimension = new int[1];
List<Path> dictionaryChunks;
if (maxNGramSize == 1) {
  startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
  dictionaryChunks = createDictionaryChunks(dictionaryJobPath, output, baseConf,
      chunkSizeInMegabytes, maxTermDimension);
} else {
  CollocDriver.generateAllGrams(input, dictionaryJobPath, baseConf, maxNGramSize,
      minSupport, minLLRValue, numReducers);
  dictionaryChunks = createDictionaryChunks(
      new Path(new Path(output, DICTIONARY_JOB_FOLDER), CollocDriver.NGRAM_OUTPUT_DIRECTORY),
      output, baseConf, chunkSizeInMegabytes, maxTermDimension);
}
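Either branch ends in createDictionaryChunks, which splits the dictionary into chunk files under the output folder. As a side note, these chunks appear to be plain SequenceFiles mapping each term (Text) to its integer id (IntWritable); assuming that layout, a minimal sketch for dumping one chunk (DumpDictionaryChunk and the chunk path are made up for illustration) looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: print the term -> id pairs of one dictionary chunk,
// assuming Text/IntWritable SequenceFile entries.
public class DumpDictionaryChunk {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path chunk = new Path(args[0]);  // e.g. a dictionary.file-* chunk under the output folder
    FileSystem fs = chunk.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk, conf);
    try {
      Text term = new Text();
      IntWritable termId = new IntWritable();
      while (reader.next(term, termId)) {
        System.out.println(termId.get() + "\t" + term);
      }
    } finally {
      reader.close();
    }
  }
}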
startWordCounting
The Mapper is the TermCountMapper class, and both the Combiner and the Reducer are the TermCountReducer class. Let's look inside these two classes to see what they actually do.
First, TermCountMapper:
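Before diving into the classes, here is a rough sketch of the job wiring that startWordCounting presumably performs. This is a hedged reconstruction, not the actual Mahout source: WordCountJobSketch and its run method are made up for illustration, the real method also pushes minSupport into the Configuration, and the package path of TermCountMapper/TermCountReducer differs across Mahout versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Package path as of Mahout 0.5+; older releases keep these classes elsewhere.
import org.apache.mahout.vectorizer.term.TermCountMapper;
import org.apache.mahout.vectorizer.term.TermCountReducer;

// Hedged sketch of the word-count job setup; simplified relative to the real method.
public final class WordCountJobSketch {
  private WordCountJobSketch() {}

  public static boolean run(Path input, Path output, Configuration baseConf) throws Exception {
    Configuration conf = new Configuration(baseConf);
    // The real startWordCounting also passes minSupport to TermCountReducer via the Configuration.
    Job job = new Job(conf, "WordCount sketch over " + input);
    job.setJarByClass(WordCountJobSketch.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, output);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(TermCountMapper.class);
    job.setCombinerClass(TermCountReducer.class);
    job.setReducerClass(TermCountReducer.class);
    return job.waitForCompletion(true);
  }
}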
@Override
protected void map(Text key, StringTuple value, final Context context)
    throws IOException, InterruptedException {
  OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
  for (String word : value.getEntries()) {
    if (wordCount.containsKey(word)) {
      wordCount.put(word, wordCount.get(word) + 1);
    } else {
      wordCount.put(word, 1);
    }
  }
  wordCount.forEachPair(new ObjectLongProcedure<String>() {
    @Override
    public boolean apply(String first, long second) {
      try {
        context.write(new Text(first), new LongWritable(second));
      } catch (IOException e) {
        context.getCounter("Exception", "Output IO Exception").increment(1);
      } catch (InterruptedException e) {
        context.getCounter("Exception", "Interrupted Exception").increment(1);
      }
      return true;
    }
  });
}
The output key is the token, and the value is that token's count within the document.
Next, TermCountReducer:
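To trace that logic with concrete data, here is a standalone toy example (not Mahout code; the class name is made up) that reproduces the in-mapper aggregation for one document, so the mapper emits one (token, count) pair per distinct token instead of one pair per occurrence:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy trace of the per-document counting done inside TermCountMapper.
public class LocalTermCountDemo {
  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("mahout", "hadoop", "mahout", "vector");
    Map<String, Long> wordCount = new HashMap<String, Long>();
    for (String word : tokens) {
      Long current = wordCount.get(word);
      wordCount.put(word, current == null ? 1L : current + 1);
    }
    // Prints {mahout=2, hadoop=1, vector=1}; the mapper would emit these as
    // (Text token, LongWritable count) pairs.
    System.out.println(wordCount);
  }
}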
@Override
protected void reduce(Text key, Iterable<LongWritable> values, Context context)
    throws IOException, InterruptedException {
  long sum = 0;
  for (LongWritable value : values) {
    sum += value.get();
  }
  if (sum >= minSupport) {
    context.write(key, new LongWritable(sum));
  }
}
It sums the per-document counts for each token and only emits tokens whose total count reaches minSupport. With that, the word-count step is complete.