
Sources: "Text Processing in Python" and the official Python documentation

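The HTMLParser module contains the class HTMLParser. On its own this class is not very useful, because it does nothing when an event occurs. To use HTMLParser.HTMLParser() you must subclass it and write methods that handle the events you are interested in.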

Here is the introduction to HTMLParser from the official Python site, which gives a clearer picture of how it is used.

The HTMLParser module defines a class HTMLParser which serves as the basis for parsing HTML and XHTML. Unlike the parser in htmllib, this parser is not based on sgmllib.

class HTMLParser.HTMLParser

When HTML data is fed to the parser (by calling feed()), it calls the corresponding handler method whenever it encounters a start tag (such as <html>, <head>, <p>) or an end tag (such as </html>, </head>, </p>). Users should subclass HTMLParser and override these handler methods to implement the behavior they need.


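Unlike htmllib, HTMLParser does not check that end tags match start tags, and it does not call the end-tag handler for elements which are closed implicitly by closing an outer element. An element is closed implicitly when it never receives its own end tag and ends only because an enclosing element is closed. For example, in <body><p>text</body> the p element is closed implicitly by </body>, and handle_endtag is never called for p.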


This reading of "closed implicitly" is my own interpretation; a quick Google search did not turn up a satisfactory answer.
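A small experiment can make the point concrete. The sketch below records every handler call for a fragment in which p is never closed explicitly. The class name EventRecorder and the sample markup are illustrative only, and the import falls back to html.parser on the assumption that the code may be run on Python 3, where the module was renamed.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name

class EventRecorder(HTMLParser):
    """Hypothetical helper: collects (event, tag) pairs instead of printing."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))

recorder = EventRecorder()
# <p> has no </p> of its own; it ends only because </body> closes its parent
recorder.feed('<body><p>text</body>')
recorder.close()
print(recorder.events)
```

The recorded list contains a start event for body and for p, but an end event only for body: the parser reports exactly the tags that appear in the text and does not synthesize a closing event for p.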

A simple HTMLParser usage example:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html


Text Processing in Python

If it is important to keep track of the structural position of the current event within the document, you will need to maintain a data structure with this information. If you are certain that the document you are processing is well-formed XHTML, a stack suffices. For example:



#!/usr/bin/env python
import sys
import HTMLParser

html = """<html><head><title>Advice</title></head><body>
<p>The <a href="">IETF admonishes:
   <i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""

# Tags that are currently open, outermost first
tagstack = []

class ShowStructure(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs): tagstack.append(tag)
    def handle_endtag(self, tag): tagstack.pop()
    def handle_data(self, data):
        # Print the path of open tags before each non-blank text chunk
        if data.strip():
            for tag in tagstack: sys.stdout.write('/'+tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())

ShowStructure().feed(html)


% ./
/html/head/title >> Advice
/html/body/p >> The
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> Be strict in what you
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .

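If the data to be processed is not so well-formed, a more elaborate stack is needed. We can define a new object that removes the most recent start tag corresponding to an end tag, and that also prevents unclosed <p> and <blockquote> tags from nesting inside one another. More could be done along these lines for a real application, but the TagStack below is a good starting point.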

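class TagStack:
    def __init__(self, lst=None):
        # Avoid a shared mutable default argument
        self.lst = lst if lst is not None else []
    def __getitem__(self, pos): return self.lst[pos]
    def append(self, tag):
        # A new paragraph-level tag discards any unclosed one
        if tag.lower() in ('p','blockquote'):
            self.lst = [t for t in self.lst
                          if t not in ('p','blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            self.lst.reverse()
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()
tagstack = TagStack()
# handle_endtag must now call tagstack.pop(tag) instead of tagstack.pop()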
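To see how such a lenient stack behaves on markup that is not well-formed, here is a self-contained sketch. It is not the book's code: the names LenientTagStack and ShowLenientStructure are hypothetical, the stack lives on the parser instance rather than in a global, pop searches from the end of the list instead of reversing it (which has the same effect), and it raises ValueError on the assumption that HTMLParseError may be unavailable, as is the case in recent Python 3 releases.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name
except ImportError:
    from html.parser import HTMLParser  # Python 3 module name

class LenientTagStack(object):
    """Hypothetical variant of TagStack for illustration."""
    def __init__(self):
        self.lst = []
    def append(self, tag):
        # A new paragraph-level tag discards any unclosed one
        if tag in ('p', 'blockquote'):
            self.lst = [t for t in self.lst if t not in ('p', 'blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # Remove the innermost occurrence of tag, wherever it sits
        for pos in range(len(self.lst) - 1, -1, -1):
            if self.lst[pos] == tag:
                del self.lst[pos]
                return
        raise ValueError("Tag not on stack: %s" % tag)
    def path(self):
        return ''.join('/' + t for t in self.lst)

class ShowLenientStructure(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = LenientTagStack()
        self.lines = []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        self.stack.pop(tag)
    def handle_data(self, data):
        # Record the path of open tags for each non-blank text chunk
        if data.strip():
            self.lines.append('%s >> %s' % (self.stack.path(), data.strip()))

parser = ShowLenientStructure()
# Neither <p> is ever closed explicitly
parser.feed('<html><body><p>First<p>Second <b>bold</b></body></html>')
parser.close()
for line in parser.lines:
    print(line)
```

The second <p> replaces the first on the stack, so both paragraphs are reported at /html/body/p rather than /html/body/p/p, and </body> removes body even though p is still above it. With a plain list and an argument-less pop(), </body> would have removed p instead and the stack would no longer reflect the document.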



A brief note on the pop method, since as a Python beginner I found it confusing at first:
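One way to see what "pop by tag from nearest pos" means is to run the steps on a plain list. This is only a sketch: pop_nearest is a hypothetical stand-alone function, not part of TagStack or of the HTMLParser module. Because list.index() returns the first match, the list is reversed first so that the first match is the innermost open tag, and it is reversed back afterwards.

```python
def pop_nearest(stack, tag):
    """Hypothetical helper: remove the innermost occurrence of tag from stack."""
    stack.reverse()              # innermost tag now comes first
    try:
        pos = stack.index(tag)   # first match in reversed order = innermost
        del stack[pos]
    finally:
        stack.reverse()          # restore outermost-first order either way

# Two <b> tags are open, and <em> was opened last but never closed
stack = ['html', 'body', 'b', 'i', 'b', 'em']
pop_nearest(stack, 'b')
print(stack)
```

The inner b is removed while the outer b and the still-open em stay in place. A plain stack.pop() would have removed em, the last item, regardless of which end tag was actually seen. If the tag is not on the stack, index() raises ValueError and the list is left in its original order.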
