boilerpipe (Boilerplate Removal and Fulltext Extraction from HTML pages): Source Code Analysis

boilerpipe (1.1.0) is an open-source Java library: http://code.google.com/p/boilerpipe/

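Usage example:
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
So let's start the analysis with ArticleExtractor. The class uses the singleton design pattern, and INSTANCE returns its only instance. The actual processing goes through the following steps.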

HTML Parser
The HTML Parser is based upon CyberNeko 1.9.13. It is called internally from within the Extractors.
The parser takes an HTML document and transforms it into a TextDocument, consisting of one or more TextBlocks. It knows about specific HTML elements (SCRIPT, OPTION etc.) that are ignored automatically.
Each TextBlock stores a portion of text from the HTML document. Initially (after parsing) almost every TextBlock represents a text section from the HTML document, except for a few inline elements that do not separate per definition (for example '<A>' anchor tags).
The TextBlock objects also store shallow text statistics for the block's content such as the number of words and the number of words in anchor text.

Extractors
Extractors consist of one or more pipelined Filters. They are used to get the content of a webpage. Several different Extractors exist, ranging from a generic DefaultExtractor to extractors specific for news article extraction (ArticleExtractor).
ArticleExtractor.process() contains exactly such a filter pipeline. The design is very extensible: the whole process is split into small steps that are implemented separately and then assembled, like building blocks, into a processing flow. To extend or change the process, you only need to add or replace one of the blocks.
This also makes multi-language support convenient. For example, the filters used here come from the english package:
import de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter;
import de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter;
To extend to another language such as Korean, you would add a korean package under filters, implement these filters there, and then change the imports to get Korean support.

 TerminatingBlocksFinder.INSTANCE.process(doc)
                | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
                | NumWordsRulesClassifier.INSTANCE.process(doc)
                | IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
                | KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
                | ExpandTitleToContentFilter.INSTANCE.process(doc);
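The chain above joins the filters with Java's non-short-circuit `|` operator, so every stage runs even when an earlier one already returned true, and the combined result says whether any stage changed the document. Below is a minimal sketch of that idea. The `Filter` interface, the list-of-strings "document" and the two toy filters are hypothetical stand-ins, not boilerpipe classes.

```java
import java.util.List;

// Hypothetical stand-in for a boilerpipe filter: returns true if it changed the document.
interface Filter {
    boolean process(List<String> doc);
}

class PipelineSketch {
    // Toy filter: trims every entry and reports whether any entry changed.
    static final Filter TRIM = doc -> {
        boolean changed = false;
        for (int i = 0; i < doc.size(); i++) {
            String trimmed = doc.get(i).trim();
            if (!trimmed.equals(doc.get(i))) {
                doc.set(i, trimmed);
                changed = true;
            }
        }
        return changed;
    };

    // Toy filter: removes empty entries; removeIf reports whether anything was removed.
    static final Filter DROP_EMPTY = doc -> doc.removeIf(String::isEmpty);

    static boolean run(List<String> doc) {
        // '|' (unlike '||') always evaluates both operands, left to right,
        // so DROP_EMPTY still runs when TRIM has already returned true.
        return TRIM.process(doc) | DROP_EMPTY.process(doc);
    }
}
```

With `||` instead of `|`, the second filter would be skipped whenever the first one reported a change, which is why the operator choice matters in this style of pipeline.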
The following sections look at each stage of this processing flow in detail.

TerminatingBlocksFinder
Finds blocks which are potentially indicating the end of an article text and marks them with {@link DefaultLabels#INDICATES_END_OF_TEXT}. This can be used in conjunction with a downstream {@link IgnoreBlocksAfterContentFilter}. (In other words, IgnoreBlocksAfterContentFilter is meant to run downstream of this filter.)

The principle is simple: for a block with tb.getNumWords() < 20, check whether it satisfies any of the following conditions:
text.startsWith("Comments")
                        || N_COMMENTS.matcher(text).find() //N_COMMENTS = Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)")
                        || text.contains("What you think...")
                        || text.contains("add your comment")
                        || text.contains("Add your comment")
                        || text.contains("Add Your Comment")
                        || text.contains("Add Comment")
                        || text.contains("Reader views")
                        || text.contains("Have your say")
                        || text.contains("Have Your Say")
                        || text.contains("Reader Comments")
                        || text.equals("Thanks for your comments - this feedback is now closed")
                        || text.startsWith("© Reuters")
                        || text.startsWith("Please rate this")
If it does, the block is taken to mark the end of the article and is labeled with tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
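The check above can be packaged as a standalone predicate, which makes it easy to try on sample strings. This is a sketch only: the class and method names are hypothetical, and the block's text and word count are passed in directly instead of being read from a TextBlock.

```java
import java.util.regex.Pattern;

class TerminatingBlockSketch {
    // Same pattern as in the listing: a number followed by "Comments" or "users responded in".
    private static final Pattern N_COMMENTS =
            Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)");

    /** Returns true if a short block looks like the start of a comment or footer area. */
    static boolean indicatesEndOfText(String text, int numWords) {
        if (numWords >= 20) {
            return false; // only short blocks are considered
        }
        return text.startsWith("Comments")
                || N_COMMENTS.matcher(text).find()
                || text.contains("What you think...")
                || text.contains("add your comment")
                || text.contains("Add your comment")
                || text.contains("Add Your Comment")
                || text.contains("Add Comment")
                || text.contains("Reader views")
                || text.contains("Have your say")
                || text.contains("Have Your Say")
                || text.contains("Reader Comments")
                || text.equals("Thanks for your comments - this feedback is now closed")
                || text.startsWith("\u00A9 Reuters") // the copyright sign, written as an escape
                || text.startsWith("Please rate this");
    }
}
```

Note that the phrase list is English-only and partly site-specific, which is one reason the filter lives in the english package.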

DocumentTitleMatchClassifier
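This one is simple: it uses the content of '<title>' to locate the title within the page. It builds a list of potentialTitles from the '<title>' content and matches blocks against it; a block that matches is labeled DefaultLabels.TITLE.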

NumWordsRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
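This module implements a classifier that separates content from non-content. For how the classifier was built, see section 4.3 of the paper cited above.
The classifier uses the decision tree algorithm, trained on annotated Google News pages. The trained tree is then pruned; in the paper's words: "Applying reduced-error pruning we were able to simplify the decision tree to only use 6 dimensions (2 features each for current, previous and next block) without a significant loss in accuracy."
Finally, the tree's decision process is written out as pseudocode. This is the biggest advantage of decision trees: the decision rules are human-readable, so they can be expressed in any language.
This module implements Algorithm 2, Classifier based on Number of Words: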

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

With the classifier in place, the remaining work is to classify and label every block.
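The tree above translates almost line for line into nested conditionals. The sketch below does that as a plain static method; the class name and the parameter order are hypothetical, and the caller has to decide what to pass for the missing neighbour of the first and last block.

```java
class NumWordsTreeSketch {
    /** Returns true for CONTENT and false for BOILERPLATE, following the tree above. */
    static boolean isContent(int prevNumWords, double prevLinkDensity,
                             int currNumWords, double currLinkDensity,
                             int nextNumWords) {
        if (currLinkDensity <= 0.333333) {
            if (prevLinkDensity <= 0.555556) {
                if (currNumWords <= 16) {
                    if (nextNumWords <= 15) {
                        // short block between short neighbours: decided by the previous block
                        return prevNumWords > 4;
                    }
                    return true; // next block is long
                }
                return true; // current block is long
            }
            // previous block is link-heavy
            if (currNumWords <= 40) {
                return nextNumWords > 17;
            }
            return true;
        }
        return false; // current block is link-heavy
    }
}
```

Only five of the six possible features appear in the pruned tree: the link density of the next block is never consulted.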

IgnoreBlocksAfterContentFilter
Marks all blocks as "non-content" that occur after blocks that have been marked {@link DefaultLabels#INDICATES_END_OF_TEXT}. These marks are ignored unless a minimum number of words in content blocks occur before this mark (default: 60). This can be used in conjunction with an upstream {@link TerminatingBlocksFinder}.

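This module is the downstream counterpart of TerminatingBlocksFinder, so it has to run after it. It is very simple: find DefaultLabels#INDICATES_END_OF_TEXT and mark everything after it as BOILERPLATE.
The exception: if the content before the mark has fewer than the minimum number of words (default: 60), the mark is ignored and the filter keeps collecting text.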
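A minimal sketch of this behaviour follows, assuming a simplified `Block` type with just a word count, a content flag and an end-of-text mark. These names are hypothetical, and the library's own word counting may differ from the plain sum used here.

```java
import java.util.List;

class IgnoreAfterEndSketch {
    // Hypothetical, simplified block: only the fields this filter needs.
    static final class Block {
        final int numWords;
        boolean content;
        final boolean endOfTextMark;

        Block(int numWords, boolean content, boolean endOfTextMark) {
            this.numWords = numWords;
            this.content = content;
            this.endOfTextMark = endOfTextMark;
        }
    }

    /** Marks blocks from an accepted end-of-text mark onwards as non-content. */
    static boolean process(List<Block> blocks, int minNumWords) {
        int contentWords = 0;
        boolean foundEnd = false;
        boolean changed = false;
        for (Block b : blocks) {
            if (b.content) {
                contentWords += b.numWords;
            }
            // A mark only counts once enough content words have been seen.
            if (b.endOfTextMark && contentWords >= minNumWords) {
                foundEnd = true;
            }
            if (foundEnd && b.content) {
                b.content = false;
                changed = true;
            }
        }
        return changed;
    }
}
```

In this sketch the marked block itself is also turned into non-content, and a mark that arrives too early is simply skipped, so a later mark can still end the article.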

BlockProximityFusion
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.
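This module merges blocks. With a maximum distance of 1, the main criterion is that the offsets of the two blocks differ by at most 2, meaning that at most one block may lie between them.
When contentOnly is requested, it fuses only if both blocks are marked as content.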
int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
if (diffBlocks <= maxBlocksDistance)

So where does a block's offset come from? Look at the code that constructs the blocks:
BoilerpipeHTMLContentHandler.flushBlock()
TextBlock tb = new TextBlock(textBuffer.toString().trim(), currentContainedTextElements, numWords, numLinkedWords, numWordsInWrappedLines, numWrappedLines, offsetBlocks);
offsetBlocks++;

TextBlock constructor
this.offsetBlocksStart = offsetBlocks;
this.offsetBlocksEnd = offsetBlocks;
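So initially the block offsets simply increase one by one, and as long as no fusion has happened, offsetBlocksStart and offsetBlocksEnd are equal.
As the comment says, the merge criterion is therefore meaningful only after an upstream filter has removed some blocks. If nothing has been removed, every pair of adjacent blocks satisfies the distance condition.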
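To make the offset arithmetic concrete, here is a small sketch of distance-based fusion. It follows the behaviour described in this article (the current block must be content, and in content-only mode the previous block must be content too); it is not a copy of the library code. The `Block` type and its fields are hypothetical, and `mergeNext` here simply concatenates text and extends the end offset.

```java
import java.util.ArrayList;
import java.util.List;

class ProximityFusionSketch {
    // Hypothetical, simplified block. A fresh block has offsetStart == offsetEnd.
    static final class Block {
        String text;
        int offsetStart;
        int offsetEnd;
        boolean content;

        Block(String text, int offset, boolean content) {
            this.text = text;
            this.offsetStart = offset;
            this.offsetEnd = offset;
            this.content = content;
        }

        void mergeNext(Block next) {
            text = text + "\n" + next.text;
            offsetEnd = next.offsetEnd;   // the fused block now spans both offsets
            content |= next.content;
        }
    }

    /** Fuses each content block into its predecessor when they are close enough. Mutates the blocks. */
    static List<Block> fuse(List<Block> blocks, int maxBlocksDistance, boolean contentOnly) {
        List<Block> out = new ArrayList<>();
        Block prev = null;
        for (Block b : blocks) {
            if (prev == null) {
                prev = b;
                out.add(b);
                continue;
            }
            // Number of (removed) blocks that used to sit between prev and b.
            int diffBlocks = b.offsetStart - prev.offsetEnd - 1;
            boolean fuse = b.content
                    && diffBlocks <= maxBlocksDistance
                    && (!contentOnly || prev.content);
            if (fuse) {
                prev.mergeNext(b);
            } else {
                prev = b;
                out.add(b);
            }
        }
        return out;
    }
}
```

For blocks with consecutive offsets 0, 1, 2 the value of diffBlocks is always 0, so the distance test passes for every neighbouring pair; it only starts to discriminate once blocks in between have been removed.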

After reading this code I was puzzled: in the paper, fusion is based on text density, whereas here it is based only on block offsets, which is a weaker criterion. The paper says:
There, adjacent text fragments of similar text density (interpreted as "similar class") are iteratively fused until the blocks' densities (and therefore the text classes) are distinctive enough.
What I understand even less is how ArticleExtractor uses this module:
                  BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
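BlockProximityFusion is called twice, once upstream and once downstream of BoilerplateBlockFilter (explained in the next section). The call to BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc) I can understand: after the non-content blocks have been deleted, the remaining blocks are fused, for example two blocks that were originally separated by an advertisement. Still, fusing by offset rather than by text density seems weaker to me.
The call to BlockProximityFusion.MAX_DISTANCE_1.process(doc), however, I cannot make sense of; perhaps I have misread the code. The only explanation I can find is that it is meant to fuse some blocks not marked as content into content. Oddly, the fusion here is unconditional (no block has been deleted yet, so the offset check is vacuous): as long as the current block is content, it is fused with the previous block. And why check only the current block? If the previous block is content, shouldn't the pair be fused as well? To me the logic here looks unreasonable.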

BoilerplateBlockFilter
Removes {@link TextBlock}s which have explicitly been marked as "not content"
Nothing much to say: it walks through every block and deletes those not marked as "content".

KeepLargestFulltextBlockFilter
Keeps the largest {@link TextBlock} only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as {@link DefaultLabels#MIGHT_BE_CONTENT}
Easy to understand: the largest text block is kept as the main text, and the others are labeled DefaultLabels#MIGHT_BE_CONTENT.
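A minimal sketch of this selection, assuming a simplified `Block` type where the MIGHT_BE_CONTENT label is modelled as a boolean flag (both are hypothetical). The sketch searches only among blocks that are currently content, which is an assumption that fits the filter's position after BoilerplateBlockFilter in the pipeline.

```java
import java.util.List;

class KeepLargestSketch {
    // Hypothetical, simplified block; mightBeContent stands in for the MIGHT_BE_CONTENT label.
    static final class Block {
        final int numWords;
        boolean content;
        boolean mightBeContent;

        Block(int numWords, boolean content) {
            this.numWords = numWords;
            this.content = content;
        }
    }

    /** Keeps only the largest content block; all others become "might be content". */
    static boolean process(List<Block> blocks) {
        if (blocks.size() < 2) {
            return false; // nothing to choose between
        }
        Block largest = null;
        for (Block b : blocks) {
            // Strict '>' means the first block wins when word counts are tied.
            if (b.content && (largest == null || b.numWords > largest.numWords)) {
                largest = b;
            }
        }
        for (Block b : blocks) {
            if (b != largest) {
                b.content = false;
                b.mightBeContent = true;
            }
        }
        return true;
    }
}
```

The flag left on the discarded blocks is what allows a later filter to promote some of them back to content.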

ExpandTitleToContentFilter
Marks all {@link TextBlock}s "content" which are between the headline and the part that has already been marked content, if they are marked {@link DefaultLabels#MIGHT_BE_CONTENT}. This filter is quite specific to the news domain.
The logic: find the block labeled DefaultLabels.TITLE and the block where the content starts, then re-mark every block between them that is labeled MIGHT_BE_CONTENT as content.
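The logic described above can be sketched as follows. The `Block` type and its three boolean flags are hypothetical simplifications of the labels, and the handling of edge cases (no title, title after the content) is an assumption of this sketch.

```java
import java.util.List;

class ExpandTitleSketch {
    // Hypothetical, simplified block; the flags stand in for TITLE, content and MIGHT_BE_CONTENT.
    static final class Block {
        final boolean title;
        boolean content;
        final boolean mightBeContent;

        Block(boolean title, boolean content, boolean mightBeContent) {
            this.title = title;
            this.content = content;
            this.mightBeContent = mightBeContent;
        }
    }

    /** Promotes MIGHT_BE_CONTENT blocks between the title and the first content block. */
    static boolean process(List<Block> blocks) {
        int title = -1;
        int contentStart = -1;
        for (int i = 0; i < blocks.size(); i++) {
            Block b = blocks.get(i);
            if (contentStart == -1 && b.title) {
                title = i; // last title block seen before the content starts
            }
            if (contentStart == -1 && b.content) {
                contentStart = i;
            }
        }
        if (title == -1 || contentStart <= title) {
            return false; // no title, no content, or the title does not precede the content
        }
        boolean changed = false;
        for (Block b : blocks.subList(title, contentStart)) {
            if (b.mightBeContent && !b.content) {
                b.content = true;
                changed = true;
            }
        }
        return changed;
    }
}
```

Blocks in the range that do not carry the flag, such as an advertisement between headline and body, are left untouched.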

TextDocument.getContent()
The final step is to output the extracted content as text: iterate over every block marked as content, append its text, and return the result.

DefaultExtractor
Besides ArticleExtractor (aimed at news), the other commonly used extractor is DefaultExtractor:
SimpleBlockFusionProcessor.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | DensityRulesClassifier.INSTANCE.process(doc);
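This one is relatively simple, with only three steps. The second step is odd: no upstream filter has marked anything as content yet, so as far as I can tell this step does nothing.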

SimpleBlockFusionProcessor
Merges two subsequent blocks if their text densities are equal.
It iterates over the blocks and merges two consecutive blocks when their text densities are equal.
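A minimal sketch of density-based merging follows. It assumes that text density is the number of words in wrapped lines divided by the number of wrapped lines, which matches the constructor arguments shown earlier but is an assumption here; the `Block` type and its field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleFusionSketch {
    // Hypothetical, simplified block carrying only what the density needs.
    static final class Block {
        String text;
        int wordsInWrappedLines;
        int wrappedLines;

        Block(String text, int wordsInWrappedLines, int wrappedLines) {
            this.text = text;
            this.wordsInWrappedLines = wordsInWrappedLines;
            this.wrappedLines = wrappedLines;
        }

        // Assumed definition: words per wrapped line, 0 for a block without lines.
        double textDensity() {
            return wrappedLines == 0 ? 0.0 : (double) wordsInWrappedLines / wrappedLines;
        }

        void mergeNext(Block next) {
            text = text + "\n" + next.text;
            wordsInWrappedLines += next.wordsInWrappedLines;
            wrappedLines += next.wrappedLines;
        }
    }

    /** Merges each block into its predecessor when both have exactly the same text density. */
    static List<Block> fuse(List<Block> blocks) {
        List<Block> out = new ArrayList<>();
        Block prev = null;
        for (Block b : blocks) {
            if (prev != null && prev.textDensity() == b.textDensity()) {
                prev.mergeNext(b); // mutates prev; b is dropped from the result
            } else {
                prev = b;
                out.add(b);
            }
        }
        return out;
    }
}
```

Exact equality is a strict test, so in practice it mostly fires for very short blocks whose densities coincide, such as a run of one-line entries.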

DensityRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
As with NumWordsRulesClassifier, this implements the paper's Algorithm 1, Densitometric Classifier:
curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_textDensity <= 9
| | | next_textDensity <= 10
| | | | prev_textDensity <= 4: BOILERPLATE
| | | | prev_textDensity > 4: CONTENT
| | | next_textDensity > 10: CONTENT
| | curr_textDensity > 9
| | | next_textDensity = 0: BOILERPLATE
| | | next_textDensity > 0: CONTENT
| prev_linkDensity > 0.555556
| | next_textDensity <= 11: BOILERPLATE
| | next_textDensity > 11: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE
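As with the word-count tree, the densitometric tree above maps directly onto nested conditionals. The class name and parameter order below are hypothetical, and densities are assumed to be non-negative, so "= 0" and "> 0" cover all cases.

```java
class DensityTreeSketch {
    /** Returns true for CONTENT and false for BOILERPLATE, following the tree above. */
    static boolean isContent(double prevLinkDensity, double prevTextDensity,
                             double currLinkDensity, double currTextDensity,
                             double nextTextDensity) {
        if (currLinkDensity <= 0.333333) {
            if (prevLinkDensity <= 0.555556) {
                if (currTextDensity <= 9) {
                    if (nextTextDensity <= 10) {
                        // sparse block between sparse neighbours: decided by the previous block
                        return prevTextDensity > 4;
                    }
                    return true; // next block is dense
                }
                // current block is dense: boilerplate only if nothing follows
                return nextTextDensity > 0;
            }
            // previous block is link-heavy
            return nextTextDensity > 11;
        }
        return false; // current block is link-heavy
    }
}
```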

If you are interested, you can study the other extractors, or design an extractor that suits your own needs.
