HTMLParser is Python's built-in HTML parsing library (the HTMLParser module in Python 2, renamed html.parser in Python 3), and it is simple to use: subclass the HTMLParser base class and override the three methods handle_starttag, handle_data, and handle_endtag as needed.
Below is an example that extracts the links from a web page:
#!/usr/bin/env python
# coding=utf-8
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == 'a':
            for (name, value) in attrs:
                if name == 'href':
                    self.links.append(value)


if __name__ == "__main__":
    html_code = """
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)
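The example above only needs handle_starttag, but handle_data and handle_endtag plug in the same way. As a minimal sketch (the TitleParser class and its attribute names are illustrative, not part of the original example), the following parser combines all three hooks to collect the text inside the <title> tag:

from html.parser import HTMLParser


class TitleParser(HTMLParser):
    # Illustrative parser: collects the text between <title> and </title>
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # are we currently inside <title>?
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_data(self, data):
        # handle_data receives the text found between tags
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False


if __name__ == "__main__":
    tp = TitleParser()
    tp.feed('<html><head><title>Hello</title></head><body></body></html>')
    tp.close()
    print(tp.title)   # Hello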
BeautifulSoup is a third-party library, but it is more powerful and needs less code. See the documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html
The same link-extraction task, written with BeautifulSoup:
#!/usr/bin/env python
# coding=utf-8
import urllib.request

from bs4 import BeautifulSoup


def downloadpage(url):
    # Fetch the raw HTML of a page
    with urllib.request.urlopen(url) as fp:
        return fp.read()


def parsehtml(data):
    # Print the href attribute of every <a> tag
    soup = BeautifulSoup(data, 'html.parser')
    for x in soup.find_all('a'):
        print(x.attrs['href'])


if __name__ == "__main__":
    # parsehtml(downloadpage("http://www.baidu.com"))
    parsehtml("""
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """)
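Because BeautifulSoup builds a full document tree, pulling out more than the href (the link text, or only tags that actually carry an href) stays a one-liner. A small sketch using the same html.parser backend (the helper name extract_links is just for illustration):

from bs4 import BeautifulSoup


def extract_links(data):
    # Illustrative helper: return (text, href) pairs for <a> tags that have an href
    soup = BeautifulSoup(data, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.find_all('a', href=True)]


if __name__ == "__main__":
    print(extract_links("""
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """))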