A few notes from my recent work on a web crawler, written down here.
For computing document similarity, the most common approach is cosine similarity. A representative introduction is:
"The Beauty of Mathematics" series No. 12 from Google's China Blackboard blog -- cosine similarity and news classification (the link is to a repost of the original article, since Google's Blackboard blog has been blocked).
Vectorizing documents and then computing their similarity with the cosine rule is mainly useful for clustering pages gathered by a crawler and for document classification; it is a fairly simple classification algorithm:
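Before the document-specific code, here is a minimal sketch of the underlying formula on plain vectors: cos(theta) = (A . B) / (|A| * |B|). The class and method names below are my own, not part of the article's code.

```java
public class CosineDemo {

    // Cosine of the angle between two equal-length vectors:
    // dot product divided by the product of the Euclidean norms.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // numerator: A . B
            na  += a[i] * a[i];   // squared norm of A
            nb  += b[i] * b[i];   // squared norm of B
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Identical vectors give a value of (approximately) 1,
        // orthogonal vectors give exactly 0.
        System.out.println(cosine(new double[]{1, 2, 3}, new double[]{1, 2, 3})); // ≈ 1.0
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));       // 0.0
    }
}
```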
/**
 * Compute the similarity between two documents.
 *
 * @param doci the document to compare
 * @param docj the sample document
 * @return the cosine similarity
 */
public double calculateSimilarity(Document doci, Document docj) {
    Map<String, Integer> ifreq = doci.documentFreq(); // term frequencies of doci
    Map<String, Integer> jfreq = docj.documentFreq();
    double ijSum = 0;
    Iterator<Entry<String, Integer>> it = ifreq.entrySet().iterator();
    while (it.hasNext()) {
        Map.Entry<String, Integer> entry = it.next();
        if (jfreq.containsKey(entry.getKey())) {
            double iw = weight(entry.getValue());
            double jw = weight(jfreq.get(entry.getKey()));
            ijSum += (iw * jw); // dot product over the shared terms
        }
    }
    double iPowSum = powSum(doci);
    double jPowSum = powSum(docj);
    return ijSum / (iPowSum * jPowSum);
}

/**
 * Euclidean norm of a document's weight vector.
 *
 * @param document the document
 * @return the square root of the sum of squared term weights
 */
public double powSum(Document document) {
    Map<String, Integer> mapfreq = document.documentFreq();
    Collection<Integer> freqs = mapfreq.values();
    double sum = 0;
    for (int f : freqs) {
        double dw = weight(f);
        sum += Math.pow(dw, 2);
    }
    return Math.sqrt(sum);
}

/**
 * Compute a term's feature weight from its frequency.
 *
 * @param wordfreq the term frequency
 * @return the weight (square root of the frequency)
 */
public double weight(float wordfreq) {
    return Math.sqrt(wordfreq);
}
The closer the cosine value of two documents is to 1, the more similar they are. When the cosine value is exactly 1, the documents coincide.
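The method above can be restated as a condensed, self-contained sketch that works directly on term-frequency maps rather than the Document interface; the class and method names here are my own:

```java
import java.util.HashMap;
import java.util.Map;

public class MapSimilarityDemo {

    // Term weight: square root of the term frequency, as in the article's weight().
    static double weight(int freq) { return Math.sqrt(freq); }

    // Euclidean norm of the weight vector, as in the article's powSum().
    static double norm(Map<String, Integer> freq) {
        double sum = 0;
        for (int f : freq.values()) sum += weight(f) * weight(f);
        return Math.sqrt(sum);
    }

    // Cosine similarity over the terms the two maps share.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bf = b.get(e.getKey());
            if (bf != null) dot += weight(e.getValue()) * weight(bf);
        }
        return dot / (norm(a) * norm(b));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<String, Integer>();
        d1.put("cosine", 2);
        d1.put("similarity", 1);
        Map<String, Integer> d2 = new HashMap<String, Integer>(d1); // identical frequencies
        Map<String, Integer> d3 = new HashMap<String, Integer>();
        d3.put("crawler", 3);                                       // no terms in common

        System.out.println(similarity(d1, d2)); // ≈ 1.0
        System.out.println(similarity(d1, d3)); // 0.0
    }
}
```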
The other Java classes:
public interface Document {
    /**
     * Segment the document content into terms.
     *
     * @return a {@link Map} from term to frequency
     */
    public Map<String, Integer> segment();

    /**
     * Get the document's term frequencies (cached).
     */
    public Map<String, Integer> documentFreq();
}
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.wltea.analyzer.IKSegmentation; // IK Analyzer; package path may vary by version
import org.wltea.analyzer.Lexeme;

public class DocumentImpl implements Document {

    private String content;
    private IKSegmentation ikSegmentation;
    private Logger logger = Logger.getLogger("DocumentImplLogger");
    private Map<String, Integer> dfreq;

    public DocumentImpl(String cont) {
        this.content = cont;
    }

    public Map<String, Integer> documentFreq() {
        if (dfreq == null || dfreq.isEmpty()) {
            dfreq = segment(); // segment lazily and cache the result
        }
        return dfreq;
    }

    public Map<String, Integer> segment() {
        if (this.content == null || content.isEmpty()) {
            logger.log(Level.WARNING, "document content can not be empty");
            return null;
        }
        if (ikSegmentation == null)
            ikSegmentation = new IKSegmentation(new StringReader(content), true);
        else
            ikSegmentation.reset(new StringReader(content));
        Lexeme lexeme = null;
        Map<String, Integer> mapfreq = new HashMap<String, Integer>();
        try {
            // Count the frequency of each segmented term.
            while ((lexeme = ikSegmentation.next()) != null) {
                if (!mapfreq.containsKey(lexeme.getLexemeText())) {
                    mapfreq.put(lexeme.getLexemeText(), 1);
                    continue;
                }
                int freq = mapfreq.get(lexeme.getLexemeText());
                mapfreq.put(lexeme.getLexemeText(), ++freq);
            }
        } catch (IOException e) {
            logger.log(Level.SEVERE, "", e);
            return null;
        }
        return mapfreq;
    }
}
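For readers who want to try the similarity code without pulling in the IK Analyzer dependency, here is a hypothetical stand-in (my own, not from the article) that provides the same two methods as the Document interface but tokenizes on whitespace instead of doing Chinese segmentation:

```java
import java.util.HashMap;
import java.util.Map;

// A toy replacement for the IK-Analyzer-backed implementation:
// splits on whitespace, so it only makes sense for space-delimited text.
public class WhitespaceDocument {

    private final String content;
    private Map<String, Integer> dfreq;

    public WhitespaceDocument(String content) {
        this.content = content;
    }

    public Map<String, Integer> documentFreq() {
        if (dfreq == null) dfreq = segment(); // segment lazily and cache
        return dfreq;
    }

    public Map<String, Integer> segment() {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : content.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            Integer old = freq.get(token);          // count each token
            freq.put(token, old == null ? 1 : old + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        WhitespaceDocument doc = new WhitespaceDocument("to be or not to be");
        System.out.println(doc.documentFreq().get("to"));  // 2
        System.out.println(doc.documentFreq().get("not")); // 1
    }
}
```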
Results:
similarity of 1.txt and 2.txt:  0.32460869971007195
similarity of 1.txt and 3.txt:  0.21837417258281094
similarity of 1.txt and 94.txt: 0.1805190131222515
similarity of 1.txt and 77.txt: 0.14018416797440844
similarity of txt6 and 77.txt:  0.1979109275388269
These documents are included in the attachment.
If you know a better way to compute document similarity, I'd welcome your advice.
My email: