To address this problem, reference [3] proposed a fast web page deduplication method based on feature codes. It treats deduplication approximately as a retrieval problem, turning each document into a query. Unlike a general retrieval system, which returns every page related to the given page rather than only identical ones, deduplication must retrieve pages that are completely identical, so an index has to be built over page features. Analysis of web pages shows that the Chinese period almost never appears in navigation text, so each position where a period occurs is used as a feature code extraction point: the L Chinese characters on either side of the period are taken as one feature code of the document. If 5 characters are taken on each side, the code is equivalent to a 10-gram; over a space of 6763 characters (the Chinese characters of GB2312), the theoretical probability that a random 10-gram repeats is (1/6763)^10, roughly 5 × 10^-39, which is vanishingly small.
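A minimal sketch of this feature code extraction, written in Python for illustration. It assumes the page text has already been stripped of HTML; the window width of 5 characters matches the 10-gram described above, while the MD5 hashing of each code and the overlap threshold in similar() are illustrative assumptions, not details fixed by the paper.

```python
import hashlib

def extract_feature_codes(text, window=5):
    """Take `window` characters on each side of every Chinese period
    as one feature code of the document (a 2*window-gram)."""
    codes = []
    for i, ch in enumerate(text):
        if ch == "。":
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            if left and right:          # skip periods too close to the edges
                codes.append(left + right)
    return codes

def fingerprint(text, window=5):
    """Hash each feature code so the index stores fixed-length keys."""
    return {hashlib.md5(c.encode("utf-8")).hexdigest()
            for c in extract_feature_codes(text, window)}

def similar(a, b, threshold=0.8):
    """Treat two pages as duplicates when their feature code sets
    overlap heavily (threshold is a hypothetical tuning parameter)."""
    if not a or not b:
        return False
    return len(a & b) / min(len(a), len(b)) >= threshold
```

In practice the hashed codes would be stored in an inverted index, so a new page only needs to be compared against pages that share at least one feature code with it.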
|          | Education | Science & Technology | Current Affairs | Sports | Total  |
| a = 1.05 | 53/4      | 95/11                | 72/2            | 80/3   | 300/20 |
| a = 1.10 | 53/1      | 95/4                 | 72/4            | 80/0   | 300/9  |
| a = 1.12 | 53/2      | 95/2                 | 72/6            | 80/0   | 300/10 |
| a = 1.14 | 53/3      | 95/3                 | 72/3            | 80/0   | 300/9  |
|                     | Education | Science & Technology | Current Affairs | Sports | Total  |
| New method          | 53/1      | 95/4                 | 72/4            | 80/0   | 300/9  |
| Feature code method | 53/5      | 95/5                 | 72/12           | 80/4   | 300/26 |
[2] Besancon R., Rajman M., Chappelier J. C. Textual similarities based on a distributional approach. In: Proceedings of the Tenth International Workshop on Database and Expert Systems Applications, Sept. 1999, pp. 180-184.
[4] Zheng De-Quan, Hu Yi, Yu Hao, et al. Research of specific information recognition in multi-carrier data streams. Journal of Software, 2003, 14(9): 1538-1543.
Content-Based Deletion Algorithm for Large-Scale Duplicated Web Pages
Peng Yuan, Zhao Tiejun, Zheng Dequan, Yu Hao
(Machine Translation Laboratory, HIT, Harbin 150001, Heilongjiang, China)
Abstract: This paper proposes a deletion and merging algorithm for large-scale duplicated web pages that combines feature codes with file length. The algorithm effectively improves the efficiency of the pure feature code method. Experiments show that the new method achieves high accuracy, and it may also be helpful to other research fields.
Key words: feature code; search engine; file length
[1] About the author: Peng Yuan (1983- ), male, M.S. candidate; his research focuses on Internet information retrieval. Email: [email protected]