lxml manual: http://lxml.de/index.html

1. The following example comes from the blog post "Parsing HTML with lxml" (用lxml解析HTML)

from lxml import etree

The text to parse:

html = '''

<html>
　　<head>
　　　　<meta name="content-type" content="text/html; charset=utf-8" />
　　　　<title>友情链接查询 - 站长工具</title>
　　　　<!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
　　　　<meta name="Keywords" content="友情链接查询" />
　　　　<meta name="Description" content="友情链接查询" />

　　</head>
　　<body>
　　　　<h1 class="heading">Top News</h1>
　　　　<p style="font-size: 200%">World News only on this page</p>
　　　　Ah, and here's some more text, by the way.
　　　　<p>... and this is a parsed fragment ...</p>

　　　　<a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年发展基金会</a> 
　　　　<a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王国</a> 
　　　　<a href="http://www.4399.com/flash/35538.htm" target="_blank">奥拉星</a> 
　　　　<a href="http://game.3533.com/game/" target="_blank">手机游戏</a>
　　　　<a href="http://game.3533.com/tupian/" target="_blank">手机壁纸</a>
　　　　<a href="http://www.4399.com/" target="_blank">4399小游戏</a> 
　　　　<a href="http://www.91wan.com/" target="_blank">91wan游戏</a>

　　</body>
</html>

'''

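A note before using lxml: make sure the HTML has been decoded from UTF-8 into a unicode string first, e.g. code = html.decode('utf-8', 'ignore'); otherwise non-ASCII text such as Chinese may be parsed incorrectly. When lxml receives a raw byte string it has to detect the encoding itself, from a charset declaration in the document or by guessing, and a wrong guess yields garbled text. Passing an already decoded string removes that guesswork.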
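The decode-then-parse step can be sketched on its own as follows. This is a minimal illustration in Python 3 syntax (the listings in this post use Python 2), and the `raw` bytes are a made-up sample standing in for data read from a file or socket.

```python
from lxml import etree

# Hypothetical sample: raw UTF-8 bytes as they might arrive from a file or socket.
raw = '<html><head><title>友情链接查询</title></head><body><p>ok</p></body></html>'.encode('utf-8')

# Decode first (silently dropping undecodable bytes), then give lxml a text string.
text = raw.decode('utf-8', 'ignore')
page = etree.HTML(text)

title = page.xpath('//title')[0].text
print(title)
```

The 'ignore' error handler discards bytes that are not valid UTF-8, so it trades completeness for robustness; 'replace' is an alternative when lost characters should stay visible.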

page = etree.HTML(html.decode('utf-8'))
hrefs = page.xpath(u"//a")  # finds every <a> tag in the whole HTML document
for href in hrefs:
    print href.attrib['href']
for href in hrefs:
    print href.text

http://www.cydf.org.cn/
http://www.4399.com/flash/32979.htm
http://www.4399.com/flash/35538.htm
http://game.3533.com/game/
http://game.3533.com/tupian/
http://www.4399.com/
http://www.91wan.com/
青少年发展基金会
洛克王国
奥拉星
手机游戏
手机壁纸
4399小游戏
91wan游戏
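XPath can also return the attribute values and text nodes directly, without looping over the elements. The sketch below uses Python 3 syntax and a shortened, made-up copy of the link list; `@href` and `text()` are standard XPath 1.0 node tests.

```python
from lxml import etree

# Shortened sample of the link list (assumption: two links are enough to illustrate).
html = '''
<html><body>
  <a href="http://www.4399.com/" target="_blank">4399小游戏</a>
  <a href="http://www.91wan.com/" target="_blank">91wan游戏</a>
</body></html>
'''

page = etree.HTML(html)

# @href selects attribute values; text() selects text nodes.
urls = page.xpath('//a/@href')
names = page.xpath('//a/text()')

# Pair each link text with its URL.
links = list(zip(names, urls))
for name, url in links:
    print(name, url)
```

Pairing the two lists with zip only lines up when every link has exactly one text node; for messier markup, iterating over the elements as in the listing above is safer.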

Types of the objects that appeared while parsing the HTML above:

print type(hrefs)
print type(href)
print type(href.text)
print type(href.attrib)

<type 'list'>
<type 'lxml.etree._Element'>
<type 'unicode'>
<type 'lxml.etree._Attrib'>
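The type that xpath() returns depends on the expression, not only on the document. The following sketch (Python 3 syntax, with a small made-up page) shows the common cases; under Python 3 the string results are str rather than unicode.

```python
from lxml import etree

page = etree.HTML('<html><head><title>Top News</title></head>'
                  '<body><a href="/x">x</a><a href="/y">y</a></body></html>')

elements = page.xpath('//a')            # node-set of elements: a list of _Element
hrefs = page.xpath('//a/@href')         # node-set of attributes: a list of strings
count = page.xpath('count(//a)')        # XPath number: a float
title = page.xpath('string(//title)')   # XPath string: a str
has_links = page.xpath('boolean(//a)')  # XPath boolean: a bool

for value in (elements, hrefs, count, title, has_links):
    print(type(value).__name__, value)
```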

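To filter, put the condition in square brackets and prefix the attribute name with @, as in [@style='...']. Other attributes work the same way: @name, @id, @value, @href, @src, @class, and so on.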

p = page.xpath(u"/html/body/p[@style='font-size: 200%']")
# "/" separates the levels of the hierarchy; the leading "/" denotes the document root.
print p[0].values()
print p[0].text

['font-size: 200%']
World News only on this page

Or:

p = page.xpath(u"//p[@style='font-size: 200%']")
print p[0].values()
print p[0].text

['font-size: 200%']
World News only on this page
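A few more attribute predicates, sketched in Python 3 syntax on a made-up page: matching on a value, testing that an attribute is absent with not(), and combining conditions with "and". The element contents ("one", "two", the meta values) are placeholders, not taken from the original page.

```python
from lxml import etree

html = '''
<html><head>
  <meta name="Keywords" content="link check" />
  <meta name="Description" content="a link checking tool" />
</head><body>
  <a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">one</a>
  <a href="http://www.4399.com/" target="_blank">two</a>
</body></html>
'''
page = etree.HTML(html)

# Links whose rel attribute equals 'nofollow'.
nofollow = page.xpath("//a[@rel='nofollow']")
# The content attribute of the meta tag named 'Keywords'.
keywords = page.xpath("//meta[@name='Keywords']/@content")
# Links that have no rel attribute at all.
no_rel = page.xpath("//a[not(@rel)]")
# Links that satisfy two conditions at once.
both = page.xpath("//a[@rel='nofollow' and @target='_blank']")

print([a.text for a in nofollow], keywords, [a.text for a in no_rel], len(both))
```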

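Positional predicates: note that the index starts at 1, not 0. //a[3] selects every <a> that is the third <a> child of its parent; in this document all links are children of <body>, so it returns a single element.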

hrefs = page.xpath(u"//a[3]")  # the index is 1-based
print hrefs[0].attrib

{'href': 'http://www.4399.com/flash/35538.htm', 'target': '_blank'}
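The position in //a[n] is counted among siblings under the same parent, which matters as soon as links live in more than one container. The sketch below (Python 3 syntax, made-up markup with two <div> blocks) contrasts that with (//a)[n], which indexes into the whole result.

```python
from lxml import etree

html = ('<html><body>'
        '<div><a href="/a1">a1</a><a href="/a2">a2</a></div>'
        '<div><a href="/b1">b1</a><a href="/b2">b2</a></div>'
        '</body></html>')
page = etree.HTML(html)

# First <a> within each parent: one hit per <div>.
first_per_parent = [a.text for a in page.xpath('//a[1]')]
# First <a> of the whole document: parentheses apply the index to the full node-set.
first_overall = [a.text for a in page.xpath('(//a)[1]')]
# last() works the same way in both forms.
last_per_parent = [a.text for a in page.xpath('//a[last()]')]
last_overall = [a.text for a in page.xpath('(//a)[last()]')]

print(first_per_parent, first_overall, last_per_parent, last_overall)
```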

The asterisk * matches any element name:

metas = page.xpath(u"/html/*/meta")
for meta in metas:
    print meta.attrib
for meta in metas:
    print meta.attrib['name']

{'content': 'text/html; charset=utf-8', 'name': 'content-type'}
{'content': u'\u53cb\u60c5\u94fe\u63a5\u67e5\u8be2', 'name': 'Keywords'}
{'content': u'\u53cb\u60c5\u94fe\u63a5\u67e5\u8be2', 'name': 'Description'}
content-type
Keywords
Description
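The wildcard is also handy for listing children whose names are not known in advance, or for finding any element that carries a given attribute. A small sketch in Python 3 syntax on a made-up page:

```python
from lxml import etree

page = etree.HTML('<html><head><meta name="a" content="1"/><title>t</title></head>'
                  '<body><h1>x</h1><p>y</p></body></html>')

# Every direct child of <body>, whatever its tag name.
body_children = [el.tag for el in page.xpath('/html/body/*')]
# <meta> elements exactly two levels below <html>, under any parent.
meta_names = [m.get('name') for m in page.xpath('/html/*/meta')]
# Any element anywhere that has a content attribute.
with_content = [el.tag for el in page.xpath('//*[@content]')]

print(body_children, meta_names, with_content)
```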

2. The following example comes from the blog post "python lxml xpath usage examples" (python lxml xpath 使用实例)

import lxml.html

html=''' 
<html> 
<body> 
<bookstore position="cn"> 
    <book category="A"> 
        <title lang="en">Everyday Italian</title> 
        <author>Giada De Laurentiis</author> 
        <year>2005</year> 
        <price>30.00</price> 
    </book> 
    <book category="B"> 
        <title lang="en">Harry Potter</title> 
        <author>J K. Rowling</author> 
        <year>2005</year> 
        <price>29.99</price> 
    </book> 
</bookstore> 
<bookstore position="pk"> 
    <book category="A"> 
        <title lang="en">Learning XML</title> 
        <author>Erik T. Ray</author> 
        <year>2003</year> 
        <price>39.95</price> 
    </book> 
</bookstore> 
<bookstore position="jp"> 
    <book category="C"> 
        <title lang="en">XQuery Kick Start</title> 
        <author>James McGovern</author> 
        <author>Per Bothner</author> 
        <author>Kurt Cagle</author> 
        <author>James Linn</author> 
        <author>Vaidyanathan Nagarajan</author> 
        <year>2003</year> 
        <price>49.99</price> 
    </book> 
</bookstore> 
</body> 
</html> 
'''

doc = lxml.html.document_fromstring(html)

print "Total number of books: %d" % len(doc.xpath('/html/body/bookstore/book'))

Total number of books: 4

print "Books published in 2005: %d" % len(doc.xpath('/html/body/bookstore/book[year=2005]'))

Books published in 2005: 2

print "Books published in 2003 are in: %s" % " ".join([i.get("position") for i in doc.xpath('/html/body/bookstore/book[year=2003]/parent::*')])
# get("position") returns the value of the position attribute.
# parent::* selects the parent element, whatever its tag name.

Books published in 2003 are in: pk jp

price = doc.xpath("//bookstore/book[title='Harry Potter']/price")  
print(price[0].text)

29.99
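The same queries can be written a little more compactly. The sketch below is in Python 3 syntax and, as an assumption for brevity, parses a trimmed copy of the data as XML with etree.fromstring under a made-up <stores> root instead of going through the HTML parser. It shows that ".." is shorthand for parent::*, that an attribute can be selected in the same expression, and that child values can be compared numerically.

```python
from lxml import etree

# Trimmed, hypothetical copy of the bookstore data wrapped in a made-up root element.
xml = '''
<stores>
  <bookstore position="cn">
    <book><title>Everyday Italian</title><year>2005</year><price>30.00</price></book>
    <book><title>Harry Potter</title><year>2005</year><price>29.99</price></book>
  </bookstore>
  <bookstore position="pk">
    <book><title>Learning XML</title><year>2003</year><price>39.95</price></book>
  </bookstore>
</stores>
'''
root = etree.fromstring(xml)

# Step to the parent and read its attribute in one expression.
positions_2003 = root.xpath('//book[year=2003]/parent::*/@position')
# ".." is the abbreviated form of the parent step.
positions_short = root.xpath('//book[year=2003]/../@position')
# Numeric comparison on a child element's text.
expensive = root.xpath('//book[price>35]/title/text()')
# Filter on the parent's attribute, then descend.
cn_titles = root.xpath("//bookstore[@position='cn']/book/title/text()")

print(positions_2003, positions_short, expensive, cn_titles)
```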

3. Parsing a live web page

import requests

r = requests.get('https://www.python.org')

doc = lxml.html.document_fromstring(r.content)

ps = doc.xpath('/html/body/div/div/nav/ul/li/a')

for p in ps:
    print p.text

Python
PSF
Docs
PyPI
Jobs
Community
