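[Translation] How to extract the main text content from a web page

Original: Extracting meaningful content from raw HTML

Parsing an HTML page is not hard: open-source libraries such as Beautiful Soup give you a compact, intuitive interface for processing web pages in the language you prefer. But that is only the first step. The more interesting question is: how do you extract the useful information (the main content) from a page?

I have spent the past few days trying to answer this question, and here is what I found.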

Arc90 Readability

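My favorite approach is an algorithm called Arc90 Readability. It was developed by Arc90 Labs to make web pages more comfortable to read (for example on mobile devices). It is available as a Chrome plugin, and the whole project is hosted on Google Code. More interesting than the plugin is the algorithm itself, which Nirmal Patel implemented in Python and published as source code.

The algorithm builds two lists from HTML id names and HTML class names: one with names that suggest real content (positive) and one with names that suggest boilerplate (negative). A tag whose id or class is positive earns extra points; a tag whose id or class is negative loses points. Once every tag is scored, we only have to render the tags with the highest scores to get the content we want. For example: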

<div id="post"><h1>My post</h1><p>...</p></div>
<div class="footer">Contact</div>

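The first div tag has a positive id (id="post"), so it most likely holds the real content (the post). The tag on the second line, however, is marked as footer, which is a negative name, so we can assume it does not hold the content we are after. This leads to the following procedure: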

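1. Find all p tags (paragraphs) in the HTML source.
2. For each paragraph:
   - Add the paragraph's parent tag to a list.
   - Initialize the parent's score to 0.
   - If the parent has a positive attribute, add points.
   - If the parent has a negative attribute, subtract points.
   - Optional: apply additional criteria, such as a minimum length for the tag.
3. Find the parent tag with the highest score.
4. Render the parent tag with the highest score.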
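
The scoring step in this procedure can be sketched with the standard library alone. This is a minimal illustration, not the Readability source: the name lists are shortened assumptions, the weights (+25 and -50) mirror the ones used in the implementation in this post, and `score_attributes` is a hypothetical helper name.

```python
import re

# Assumed, shortened name lists; real pages need longer ones.
POSITIVE = re.compile("post|entry|content|text|body")
NEGATIVE = re.compile("comment|meta|footer|cloud|head")


def score_attributes(tag_id="", tag_class=""):
    """Return the bonus or penalty a parent tag earns from its id and class."""
    score = 0
    for value in (tag_id, tag_class):
        if NEGATIVE.search(value):
            score -= 50
        elif POSITIVE.search(value):
            score += 25
    return score


print(score_attributes(tag_id="post"))        # 25
print(score_attributes(tag_class="footer"))   # -50
print(score_attributes(tag_id="sidebar"))     # 0
```

A negative name costs twice as much as a positive name earns, so a tag that carries both (say id="content" and class="comment-list") still ends up below zero.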

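Here is a simple implementation I wrote, based on Nirmal Patel's code. The main difference is that I clean up the HTML before running the algorithm, so the result is plain text without scripts or images, which is exactly the page content we want.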

import re

from bs4 import BeautifulSoup
from bs4 import Comment

# Substrings of id/class names that hint at boilerplate or at real content.
NEGATIVE = re.compile("comment|meta|footer|foot|cloud|head")
POSITIVE = re.compile("post|hentry|entry|content|text|body")
# Two consecutive <br> tags, which many pages use instead of <p>.
BR = re.compile(r"<br */? *>[ \r\n]*<br */? *>")


def attribute_text(tag, name):
    # bs4 returns "class" as a list and "id" as a string; normalise to one string.
    value = tag.get(name, "")
    return " ".join(value) if isinstance(value, list) else value


def extract_content_with_Arc90(html):

    # Turn double line breaks into paragraph boundaries before parsing.
    soup = BeautifulSoup(re.sub(BR, "</p><p>", html), "html.parser")
    soup = simplify_html_before(soup)

    parents = []
    for paragraph in soup.find_all("p"):

        parent = paragraph.parent

        # Compare by identity: bs4 tags with equal markup compare as equal.
        if not any(parent is known for known in parents):
            parents.append(parent)
            parent.score = 0

            for name in ("class", "id"):
                value = attribute_text(parent, name)
                if NEGATIVE.search(value):
                    parent.score -= 50
                elif POSITIVE.search(value):
                    parent.score += 25

        if len(paragraph.decode_contents()) > 10:
            parent.score += 1

        # you can add more rules here!

    # A page without paragraphs has no candidate parent.
    if not parents:
        return ""

    topParent = max(parents, key=lambda x: x.score)
    simplify_html_after(topParent)
    return topParent.get_text()
 
def simplify_html_after(soup):

    # Drop all attributes and remove elements that render to nothing.
    for element in soup.find_all(True):
        element.attrs = {}
        if len(element.decode_contents().strip()) == 0:
            element.extract()
    return soup
 
def simplify_html_before(soup):

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # you can add more rules here!

    # Replace list and inline tags with their plain text.
    # (Explicit loops: in Python 3, map() is lazy and would do nothing here.)
    for name in ("li", "em", "tt", "b"):
        for tag in soup.find_all(name):
            tag.replace_with(tag.get_text().strip())

    replace_by_paragraph(soup, "blockquote")
    replace_by_paragraph(soup, "quote")

    # Delete tags that never hold article text.
    for name in ("code", "style", "script", "link"):
        for tag in soup.find_all(name):
            tag.extract()

    delete_if_no_text(soup, "td")
    delete_if_no_text(soup, "tr")
    delete_if_no_text(soup, "div")

    delete_by_min_size(soup, "td", 10, 2)
    delete_by_min_size(soup, "tr", 10, 2)
    delete_by_min_size(soup, "div", 10, 2)
    delete_by_min_size(soup, "table", 10, 2)
    delete_by_min_size(soup, "p", 50, 2)

    return soup
 
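def delete_if_no_text(soup, tag):

    # Remove elements whose rendered contents are empty.
    for p in soup.find_all(tag):
        if len(p.decode_contents().strip()) == 0:
            p.extract()


def delete_by_min_size(soup, tag, length, children):

    # Remove short elements that also have few direct children.
    for p in soup.find_all(tag):
        if len(p.get_text()) < length and len(p.contents) <= children:
            p.extract()


def replace_by_paragraph(soup, tag):

    # Rename the tag to <p> so it takes part in the paragraph scoring.
    for t in soup.find_all(tag):
        t.name = "p"
        t.attrs = {}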

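Whitespace rendering

The idea behind this method is simple: replace every tag in the HTML source (everything between < and >) with whitespace. When you render the page again, the text blocks are still blocks, while everything else becomes scattered fragments separated by long runs of spaces. All that is left to do is to find the text blocks and throw the rest away.
I wrote a simple implementation.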

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

import requests
from bs4 import BeautifulSoup
from bs4 import Comment

if __name__ == "__main__":

    html_string = requests.get('http://www.zdnet.com/windows-8-microsofts-new-coke-moment-7000014779/').text

    soup = BeautifulSoup(html_string, "html.parser")

    # Remove tags whose contents are never article text.
    # (Explicit loops: in Python 3, map() is lazy and would do nothing here.)
    for name in ("code", "script", "pre", "style", "embed"):
        for tag in soup.find_all(name):
            tag.extract()

    # Remove HTML comments.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Replace every character from "<" to ">" (inclusive) with a space.
    white_string = ""
    isIn = False

    for character in soup.prettify():

        if character == "<":
            isIn = True

        if isIn:
            white_string += " "
        else:
            white_string += character

        if character == ">":
            isIn = False

    # Long runs of spaces mark the gaps between text blocks.
    for string in white_string.split("           "):    # tune here!

        p = string.strip()
        p = re.sub(' +', ' ', p)
        p = re.sub('\n+', ' ', p)

        if len(p.strip()) > 50:
            print(p.strip())

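The catch is that this method is not very general. You have to tune the length of the whitespace run used to split the string, and sites with a lot of markup produce more whitespace than plain ones. Apart from that, it is a fairly simple and effective method.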
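
The tuning problem is easier to see when the gap length is a parameter. The sketch below is a variant under my own assumptions, not the script above: it blanks tags with a regular expression instead of a character loop, it counts newlines as part of a gap, and `extract_blocks`, `min_gap` and `min_length` are hypothetical names.

```python
import re


def extract_blocks(html, min_gap=10, min_length=50):
    """Blank out tags, then split on runs of at least min_gap whitespace characters."""
    # Replace each tag with spaces of the same length so gaps keep their size.
    blanked = re.sub(r"<[^>]*>", lambda m: " " * len(m.group(0)), html)
    blocks = []
    for chunk in re.split(r"\s{%d,}" % min_gap, blanked):
        # Collapse the remaining whitespace inside a block to single spaces.
        text = re.sub(r"\s+", " ", chunk).strip()
        if len(text) > min_length:
            blocks.append(text)
    return blocks


sample = (
    '<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>'
    '<div id="post"><p>Parsing HTML is easy, but finding the main content '
    'of a page is the more interesting problem.</p></div>'
)

# A small gap separates the navigation links from the paragraph.
print(extract_blocks(sample, min_gap=10))
# A gap that is too large glues the navigation onto the paragraph.
print(extract_blocks(sample, min_gap=40))
```

With `min_gap=10` only the paragraph survives the length filter; with `min_gap=40` no gap in this sample is long enough to split on, so "Home" and "About" end up in the same block as the article text.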

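Open-source libraries
As the saying goes, don't reinvent the wheel. There are in fact many open-source libraries that solve the content extraction problem. My favorite is Boilerpipe. A web service (demo) is available at http://boilerpipe-web.appspot.com/ and the Java implementation at https://code.google.com/p/boilerpipe/. Its code is considerably more complex than the two simple algorithms above, but it works really well, and if you just use it as a black box it is a very good solution.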

Best regards,
Thomas Uhrig
