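The Python lxml module

lxml: a Python module for parsing XML and HTML. With it you can query documents using XPath syntax.

XPath (the syntax) works like a path: it locates the data you want inside an HTML or XML document.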

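1. What is XPath?
		XPath is a syntax for selecting elements in HTML or XML. In lxml, a path expression passed to xpath() returns a list: if it matches tags (elements), the list holds Element objects; if it ends in an attribute or text(), the list holds strings.
2. Some terms used in XML and HTML.
		Element
		Tag
		Attribute
		Content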
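
As a rough illustration of the terms above and of the two kinds of return values, here is a minimal sketch. The `book`/`title` document is made up for this example; `etree.fromstring`, `.tag`, `.attrib`, `.text`, and `.xpath()` are standard lxml APIs.

```python
from lxml import etree

# A tiny, made-up document used only for illustration.
root = etree.fromstring('<book id="b1"><title lang="en">XPath</title></book>')

# Matching an element yields Element objects.
title = root.xpath("/book/title")[0]
print(title.tag)           # the tag name
print(dict(title.attrib))  # the attributes as a dict
print(title.text)          # the text content

# Matching an attribute or text() yields strings instead.
langs = root.xpath("/book/title/@lang")
texts = root.xpath("/book/title/text()")
print(langs, texts)
```

The strings returned by lxml are subclasses of `str`, so they can be used anywhere an ordinary string is expected.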
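3. XPath syntax
	(1) Selecting nodes

| Expression | Meaning |
| --- | --- |
| . | the current node |
| .. | the parent node |
| / | start from the root node |
| // | anywhere in the document |
| nodename | select the tag (element) with that name |
| @attribute | select the value of the named attribute |
| text() | select the text content |

	(2) Predicates: in grammar, a predicate is the part that qualifies the subject. In XPath, a predicate in square brackets restricts which nodes are selected.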
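a. Restricting by position (positions start at 1)
[n]: select the n-th one ----> //body/div[3] ----> selects the third div under every body tag in the page
[last()]: select the last one ----> //body/div[last()] ----> selects the last div under every body tag
[last()-1]: select the second-to-last one ----> //body/div[last()-1] ----> selects the second-to-last div under every body tag
[position()>1]: select positions greater than 1 ----> //dl/dd[position()>1] ----> selects every dd under a dl whose position is greater than 1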
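
The node-selection expressions and positional predicates can be tried out with a sketch like the following. The markup and the `id` values are invented for illustration; only `etree.HTML` and `.xpath()` come from lxml.

```python
from lxml import etree

# Made-up markup for illustration only.
doc = etree.HTML("""
<body>
  <div id="a"><p>one</p></div>
  <div id="b"><p>two</p></div>
  <div id="c"><p>three</p></div>
</body>
""")

# // searches anywhere in the document; @id selects the attribute value.
all_ids = doc.xpath("//div/@id")
# A leading / walks an absolute path from the root.
all_text = doc.xpath("/html/body/div/p/text()")

# Positional predicates (1-based).
third = doc.xpath("//body/div[3]/@id")
last = doc.xpath("//body/div[last()]/@id")
second_to_last = doc.xpath("//body/div[last()-1]/@id")
after_first = doc.xpath("//body/div[position()>1]/@id")

# . is the current node and .. is its parent.
second_div = doc.xpath("//div[2]")[0]
own_text = second_div.xpath("./p/text()")
sibling_ids = second_div.xpath("../div/@id")

print(all_ids, all_text)
print(third, last, second_to_last, after_first)
print(own_text, sibling_ids)
```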
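b. Restricting by attribute
[@class='value']: select nodes whose class attribute equals value ----> //div[@class="container"] ----> selects every div tag with class="container"
[contains(@href,'baidu')]: select tags whose href attribute value contains baidu ----> //a[contains(@href,"1203")] ----> selects every a tag whose href contains 1203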
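
A small sketch of attribute predicates, using invented markup and URLs. Note that XPath string literals need straight quotes, and that `@class="container"` is an exact comparison against the whole attribute value, so a div with several classes only matches through `contains()`.

```python
from lxml import etree

# Made-up markup; the URLs are placeholders for illustration.
doc = etree.HTML("""
<div class="container"><a href="https://www.baidu.com/s?wd=1203">search</a></div>
<div class="container wide"><a href="https://example.com/">other</a></div>
<div class="sidebar"><a href="https://example.com/1203">archive</a></div>
""")

# Exact match on the whole class attribute value.
exact = doc.xpath('//div[@class="container"]')
# Substring match on the class attribute value.
partial = doc.xpath('//div[contains(@class, "container")]')

# Substring match on href.
baidu_links = doc.xpath('//a[contains(@href, "baidu")]/text()')
links_1203 = doc.xpath('//a[contains(@href, "1203")]/text()')

print(len(exact), len(partial))
print(baidu_links, links_1203)
```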
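c. Restricting by the content of a child tag
//book[price>35] ----> selects book tags whose price child tag has content greater than 35

	(3) Wildcards: *
@* ----> any attribute
* ----> any element node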
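
The child-content predicate and the two wildcards can be sketched as follows. The bookstore data is hypothetical and exists only to give the expressions something to match.

```python
from lxml import etree

# Hypothetical bookstore data for illustration.
root = etree.fromstring("""
<bookstore>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="kids"><title>Harry Potter</title><price>29.99</price></book>
</bookstore>
""")

# The price child is converted to a number before the comparison.
expensive = root.xpath("//book[price>35]/title/text()")

# * matches every child element of the first book.
child_tags = [el.tag for el in root.xpath("/bookstore/book[1]/*")]

# @* matches every attribute of every book.
attrs = root.xpath("//book/@*")

print(expensive, child_tags, attrs)
```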

A deeper lxml exercise

from lxml import etree

text = '''<div> <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div>
'''

# etree.HTML parses the string into an Element object, which provides the xpath() method
html = etree.HTML(text)
# Get the list of href values and the list of link titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")


# Assemble the dictionaries by pairing the two lists position by position
for i, href in enumerate(href_list):
    # i is the index (0, 1, 2); enumerate stays correct even if two hrefs are identical
    item = {}
    item["href"] = href
    item["title"] = title_list[i]
    print(item)

Output:

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

If we replace the content of text with:

text = ''' <div> <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''

Output:

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

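Every pairing is now wrong, which is not what we want. Because the first a tag has no href, href_list has only two entries while title_list has three, so the positions no longer line up. The next part solves this: first group by a tag (here, each li), then extract the data within each group.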
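
Before moving on to the fix, the mismatch can be made visible with a small check. This is only a sketch on a shortened copy of the markup (three li tags instead of five); the variable names follow the code above.

```python
from lxml import etree

# Shortened copy of the markup above: the first link has no href.
text = '''<ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>'''

html = etree.HTML(text)
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

# The flat lists lose the information about which li each value came from.
print(len(href_list), len(title_list))
print(href_list)
print(title_list)
```

Two hrefs against three titles means any index-based pairing shifts the hrefs up by one position.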

from lxml import etree

text = ''' <div> <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul> </div> '''

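# Group by li tag
html = etree.HTML(text)
# The result is a list of Element objects, each of which can call xpath() again
# // means anywhere in the document
li_list = html.xpath("//li[@class='item-1']")

# Extract the data inside each group
for li in li_list:
    item = {}
    # . is the current node (this li)
    # @href selects the value of the href attribute
    # text() selects the text content
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    # An empty list means the li has no such value, so fall back to None
    item["href"] = hrefs[0] if hrefs else None
    item["title"] = titles[0] if titles else None
    print(item)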

Output:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}
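
The "first result or None" pattern repeats for every field, so it may be worth wrapping in a small helper. The sketch below assumes a hypothetical function named `first_or_none`, which is not part of lxml, and uses a shortened copy of the markup.

```python
from lxml import etree


def first_or_none(node, expr):
    """Return the first XPath result of expr evaluated on node, or None.

    Hypothetical helper for illustration; it is not an lxml API.
    """
    results = node.xpath(expr)
    return results[0] if results else None


html = etree.HTML(
    '<ul><li class="item-1"><a>first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li></ul>'
)

# Group by li first, then extract each field relative to that li.
items = [
    {
        "href": first_or_none(li, "./a/@href"),
        "title": first_or_none(li, "./a/text()"),
    }
    for li in html.xpath("//li[@class='item-1']")
]
print(items)
```

Each expression is evaluated once per li, and a missing value becomes None instead of shifting the remaining data.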
