Improving the Vector Space Model

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style=""><span style="font-size: small;">Disclaimer: this post only introduces (or, if you like, popularizes) the vector space model; it contains no new theoretical work.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style="">In two earlier posts, </span><span lang="EN-US"><a href="http://blog.csdn.net/Felomeng/archive/2009/03/25/4024078.aspx" target="_blank">A Brief Introduction to the Vector Space Model (VSM) for Document Similarity</a></span><span class="title"><span style=""> and </span></span><span lang="EN-US"><a href="http://blog.csdn.net/Felomeng/archive/2009/03/25/4023990.aspx" target="_blank">Implementing Vector Space Model Document Similarity (C#)</a></span><span class="title"><span style="">, I introduced the basic VSM and an implementation of it.</span></span></span></p>
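<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span class="title"><span style="font-size: small;"><span style="">Using raw term counts (the number of times a term occurs in the current document), I implemented a naive version of the vector space model. It works reasonably well, but it leaves a lot of room for improvement.</span></span></span></p>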
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style="">Raw term counts cause problems when a very long document is compared with a very short one. Suppose document I contains 10000 words and term a occurs 10 times, while document II contains 100 words and a occurs 5 times. In the similarity calculation, a in document I then has more influence on the result than a in document II. That is clearly unreasonable, because a makes up only 0.1% of document I but 5% of document II. To address this kind of problem, we introduce two concepts: term frequency (TF) and inverse document frequency (IDF).</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span lang="EN-US"><span style="font-family: Calibri;">TF = f/m</span></span><span style="">, where f is the number of times the current term occurs in the current document and m is the count of the most frequent term in that document. TF therefore lies between 0 and 1. This normalization reduces the error caused by skewed raw counts within a document, such as those produced by differences in document length.</span></span></span></p>
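<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">The TF definition above can be sketched in a few lines of Python. This is only an illustration, not the C# code from the earlier post: the function name term_frequencies and the assumption that a document is already split into a list of tokens are mine.</span></p>

```python
from collections import Counter


def term_frequencies(tokens):
    """Return TF = f / m for every distinct term in one document.

    f is the term's count in the document and m is the count of the
    most frequent term, so every value lies in (0, 1].
    """
    counts = Counter(tokens)
    if not counts:
        return {}  # an empty document has no terms to weight
    m = max(counts.values())  # count of the most frequent term
    return {term: f / m for term, f in counts.items()}


# Hypothetical tokenized document: "a" occurs 3 times, "b" twice, "c" once.
doc = ["a", "b", "a", "c", "a", "b"]
tf = term_frequencies(doc)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Here the most frequent term always gets TF = 1, and every other term is scaled relative to it, regardless of how long the document is.</span></p>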
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style="font-family: Calibri;"><span class="title"><span lang="EN-US">IDF = </span></span><span lang="EN-US">log<sub>2</sub>(<em>n</em>/<em>n<sub>j</sub></em>) + 1</span></span><span style="">, where </span><span lang="EN-US"><span style="font-family: Calibri;"><em>n</em></span></span><span style=""> is the total number of documents in the corpus and </span><span lang="EN-US"><span style="font-family: Calibri;"><em>n<sub>j</sub></em></span></span><span style=""> is the number of documents that contain the current term. A term that appears in every document gets the minimum IDF of 1, and rarer terms get larger values. This reduces the similarity error caused by terms being unevenly distributed across the corpus.</span></span></p>
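<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">A minimal Python sketch of the IDF formula follows. The function name inverse_document_frequencies and the toy corpus are assumptions made for illustration; the corpus is taken to be a list of already tokenized documents.</span></p>

```python
import math


def inverse_document_frequencies(documents):
    """Return IDF = log2(n / n_j) + 1 for every term in the corpus.

    n is the number of documents and n_j is the number of documents
    that contain the term at least once.
    """
    n = len(documents)
    doc_counts = {}
    for tokens in documents:
        # set() so a term is counted once per document, not per occurrence
        for term in set(tokens):
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log2(n / n_j) + 1 for term, n_j in doc_counts.items()}


# Hypothetical corpus of four tokenized documents.
corpus = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
idf = inverse_document_frequencies(corpus)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">In this toy corpus, a occurs in all four documents and gets IDF 1, b occurs in two and gets IDF 2, and c and d occur in one document each and get IDF 3.</span></p>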
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style=""><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/Felomeng/EntryImages/20090409/tfidf.JPG" alt="" width="457" height="117"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style="">Finally, multiply the two to get </span><span lang="EN-US"><span style="font-family: Calibri;">T = TF * IDF</span></span><span style="">. Substituting this weight for the raw term count used in </span><span lang="EN-US"><a href="http://blog.csdn.net/Felomeng/archive/2009/03/25/4024078.aspx" target="_blank">A Brief Introduction to the Vector Space Model (VSM) for Document Similarity</a></span><span class="title"><span style=""> gives the vector space model commonly used in practice.</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style=""><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/Felomeng/EntryImages/20090409/cos.JPG" alt="" width="402" height="202"></span></span></span></p>
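<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Putting the pieces together, the sketch below builds a T = TF * IDF vector for each document and compares two documents. It assumes cosine similarity as the similarity measure and a corpus of tokenized documents; the names tfidf_vectors and cosine_similarity are hypothetical, and the code is not the downloadable C# implementation.</span></p>

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: TF * IDF} vector per document."""
    n = len(documents)
    # n_j: number of documents containing each term
    doc_counts = Counter(term for tokens in documents for term in set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        m = max(counts.values(), default=1)  # count of the most frequent term
        vectors.append({
            term: (f / m) * (math.log2(n / doc_counts[term]) + 1)
            for term, f in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical corpus: the first two documents share all their terms,
# the third shares none with them.
corpus = [
    ["vector", "space", "model"],
    ["vector", "space", "model", "model"],
    ["term", "frequency"],
]
vectors = tfidf_vectors(corpus)
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Storing each vector as a dictionary keeps only the terms that actually occur in a document, which matters once the vocabulary grows large. Documents with no terms in common, like the first and third above, get a similarity of 0.</span></p>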
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style="">I have also optimized and improved the source code of the <a href="http://download.csdn.net/source/1143450">original vector space model</a> (mainly by trading space for time); it can be downloaded <a href="http://download.csdn.net/source/1191463">here</a>.</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>
