Parsing web pages in Python with HTMLParser and BeautifulSoup

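HTMLParser is the HTML parsing library that ships with Python, and it is simple to use: subclass the base class HTMLParser, then override whichever of the three methods handle_starttag, handle_data, and handle_endtag you need.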
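
To make the role of the three methods concrete, here is a minimal sketch that overrides all of them and simply records each callback in order. The class name EventLogger and the sample markup are made up for illustration, and the try/except import is an assumption so that the snippet runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class EventLogger(HTMLParser):
    """Hypothetical helper that records every callback the parser fires."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; attrs is a list of (name, value) tuples.
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        # Called for the text between tags; whitespace-only chunks are skipped.
        if data.strip():
            self.events.append(("data", data.strip()))

    def handle_endtag(self, tag):
        # Called for each closing tag.
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<p>Hello <a href="www.google.com">google.com</a></p>')
logger.close()
for event in logger.events:
    print(event)
```

The events arrive in document order, so a subclass can keep state between calls, for example remembering in handle_starttag that it is inside a link and collecting the link text in handle_data.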

The following example extracts the links from a page:

#!/usr/bin/env python
#coding=utf-8

try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3 moved the module

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples. The parser lowercases
        # tag and attribute names, so 'a' and 'href' also match <A HREF=...>.
        if tag == 'a':
            for (variable, value) in attrs:
                if variable == "href":
                    self.links.append(value)

if __name__ == "__main__":
    html_code = """
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)
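
The handler above compares against lowercase 'a' and 'href' even though the sample markup mixes cases. The short sketch below shows what handle_starttag actually receives: tag and attribute names are lowercased, attribute values keep their case, and an attribute without a value arrives as None. The class name AttrRecorder and the extra Class and download attributes are made up for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class AttrRecorder(HTMLParser):
    """Hypothetical helper that stores each start tag with its attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<A HREF="www.google.com" Class="Nav" download>google.com</A>')
recorder.close()

tag, attrs = recorder.seen[0]
print(tag)    # tag name, lowercased by the parser
print(attrs)  # list of (name, value) tuples

# Turning the tuples into a dict gives a lookup without an explicit loop.
href = dict(attrs).get("href")
print(href)
```

Using dict(attrs).get("href") is one alternative to the for loop in the example; it returns None when the tag has no href instead of requiring a separate check.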

===========================================================================

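BeautifulSoup is a third-party library, but it is more powerful and needs less code. For documentation, see http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html (note that this link covers BeautifulSoup 3, while the code here imports the newer bs4 package).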

The same link extraction, written with BeautifulSoup:

#!/usr/bin/env python
#coding=utf-8

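from bs4 import BeautifulSoup
try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

def downloadpage(url):
    # Fetch the raw HTML of a page.
    fp = urlopen(url)
    data = fp.read()
    fp.close()
    return data

def parsehtml(data):
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(data, "html.parser")
    # href=True skips <a> tags that have no href, which would otherwise
    # raise KeyError on the lookup below.
    for x in soup.find_all('a', href=True):
        print(x['href'])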

if __name__ == "__main__":
    #parsehtml(downloadpage("http://www.baidu.com") )
    parsehtml("""
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """)
    


