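lxml: a Python module for parsing XML and HTML. With this module you can use XPath syntax.
1. What is XPath?
XPath is a syntax for selecting elements in HTML or XML documents. In lxml, xpath() returns a list: an expression that matches tags (elements) yields Element objects, while one that ends in an attribute or text() yields strings.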
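To make the two kinds of result concrete, here is a minimal sketch. The sample markup and the variable names are assumptions for illustration, not taken from the article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.HTML('<div><p class="note">hello</p><p>world</p></div>')

elements = doc.xpath("//p")        # matches elements -> list of Element objects
texts = doc.xpath("//p/text()")    # matches text nodes -> list of strings
classes = doc.xpath("//p/@class")  # matches attribute values -> list of strings

print([e.tag for e in elements])
print(texts)
print(classes)
```

Note that the result is a list even when only one node matches, so a single value is usually read with `[0]` after checking that the list is not empty.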
2. Some terms used in XML and HTML.
Element
Tag
Attribute
Content
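As a rough illustration of how these terms map onto an lxml Element, the sketch below parses one hypothetical `<a>` fragment (an assumption, chosen to resemble the examples later in the article) and reads its tag, attributes, and content.

```python
from lxml import etree

# Hypothetical fragment used only to label the terms.
a = etree.fromstring('<a href="link1.html">first item</a>')

tag_name = a.tag             # the tag name of the element
attributes = dict(a.attrib)  # all attributes as a plain dict
href = a.get("href")         # one attribute value
content = a.text             # the text content inside the element

print(tag_name, attributes, href, content)
```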
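3. XPath syntax
(1) Selecting nodes
| . | the current node |
| .. | the parent node |
| / | start from the root node |
| // | anywhere in the document |
| nodename | select a tag or element by name |
| @attribute-name | select the value of the named attribute |
| text() | select the text content |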
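The expressions in the table can be tried one by one. The following is a small sketch on a hypothetical document (the markup and variable names are assumptions for illustration):

```python
from lxml import etree

# Hypothetical document; etree.HTML wraps it in <html><body>.
doc = etree.HTML('<div id="box"><ul><li><a href="x.html">X</a></li></ul></div>')

root_tag = doc.xpath("/html")[0].tag                # /  : start from the root
a = doc.xpath("//a")[0]                             # // : anywhere in the document
current_tag = a.xpath(".")[0].tag                   # .  : the current node
parent_tag = a.xpath("..")[0].tag                   # .. : the parent node
li_tags = [e.tag for e in doc.xpath("//ul/li")]     # nodename : select by tag name
href = a.xpath("@href")                             # @attribute-name : attribute value
label = a.xpath("text()")                           # text() : text content

print(root_tag, current_tag, parent_tag, li_tags, href, label)
```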
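(2) Predicates: in grammar, a predicate is the part that qualifies the subject; in XPath, it narrows down the nodes selected.
a. Restricting by position (positions start at 1)
[number]: selects the n-th one ----> //body/div[3] ----> selects the third div tag under every body tag in the page
[last()]: selects the last one ----> //body/div[last()] ----> selects the last div tag under every body tag
[last()-1]: selects the second-to-last one ----> //body/div[last()-1] ----> selects the second-to-last div tag under every body tag
[position()>1]: selects positions greater than 1 ----> //dl/dd[position()>1] ----> every dd tag under a dl whose position is greater than 1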
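The position predicates above can be checked with a short sketch. The `<dl>` markup here is a hypothetical sample, not from the article.

```python
from lxml import etree

# Hypothetical list with four <dd> children.
doc = etree.HTML("<dl><dd>a</dd><dd>b</dd><dd>c</dd><dd>d</dd></dl>")

third = doc.xpath("//dl/dd[3]/text()")                   # third dd
last = doc.xpath("//dl/dd[last()]/text()")               # last dd
second_last = doc.xpath("//dl/dd[last()-1]/text()")      # second-to-last dd
after_first = doc.xpath("//dl/dd[position()>1]/text()")  # every dd except the first

print(third, last, second_last, after_first)
```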
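b. Restricting by attribute
[@class='value']: selects nodes whose class attribute equals the value ----> //div[@class="container"] ----> selects every div tag with class="container"
[contains(@href,'baidu')]: selects tags whose href attribute value contains baidu ----> for example, //a[contains(@href,"1203")] selects a tags whose href contains 1203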
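A minimal sketch of both attribute predicates follows. The markup, including the `container wide` class, is a hypothetical sample added to show how the equality test behaves.

```python
from lxml import etree

# Hypothetical markup: three divs with different class values.
doc = etree.HTML('''
<div class="container"><a href="https://www.baidu.com/s">search</a></div>
<div class="container wide"><a href="/about">about</a></div>
<div class="sidebar"><a href="/local">local</a></div>
''')

# Exact comparison against the whole attribute value.
containers = doc.xpath('//div[@class="container"]')
# Substring test on the attribute value.
baidu_links = doc.xpath('//a[contains(@href, "baidu")]/@href')

print(len(containers), baidu_links)
```

Because `@class="container"` compares the full attribute string, the div with `class="container wide"` is not matched; `contains(@class, "container")` would be one way to match it as well.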
c. Restricting by the content of a child tag
//book[price>35] ----> selects book tags whose price child tag has content greater than 35.
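As a sketch of the `//book[price>35]` expression, assume a small bookstore document like the one below (the titles and prices are made up for illustration):

```python
from lxml import etree

# Hypothetical XML; titles and prices are invented sample data.
xml = '''<bookstore>
  <book><title>A</title><price>29.99</price></book>
  <book><title>B</title><price>39.95</price></book>
  <book><title>C</title><price>49.00</price></book>
</bookstore>'''
store = etree.fromstring(xml)

# The price text is converted to a number for the comparison.
expensive_titles = store.xpath("//book[price>35]/title/text()")

print(expensive_titles)
```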
(3) Wildcard: *
@*: any attribute
*: any element node
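The two wildcards can be illustrated with a short sketch; the `<ul>` fragment and its attributes are assumptions chosen for the example.

```python
from lxml import etree

# Hypothetical fragment with mixed children and attributes.
doc = etree.fromstring('<ul id="menu"><li class="a" data-n="1">x</li><p>y</p></ul>')

child_tags = [e.tag for e in doc.xpath("/ul/*")]        # * : every child element of ul
li_attr_values = sorted(doc.xpath("//li/@*"))           # @* : every attribute value on li
with_any_attr = [e.tag for e in doc.xpath("//*[@*]")]   # elements that have at least one attribute

print(child_tags, li_attr_values, with_any_attr)
```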
from lxml import etree
text = '''<div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div>
'''
# etree.HTML parses the string into an Element object, which has an xpath() method
html = etree.HTML(text)
# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# Assemble them into dicts
for href in href_list:
    # print("1----------" + str(href_list.index(href)))
    item = {}
    item["href"] = href
    # href_list.index(href) gives the index: 0, 1, 2
    item["title"] = title_list[href_list.index(href)]
    print(item)
Output:
{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}
If the content of text is changed to the following, where the first a tag has no href:
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
Output:
{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}
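Every pairing is now wrong, which is not what we want. The first a tag has no href, so href_list has only two entries while title_list has three, and pairing them by index shifts the titles. The fix is to group by some tag first (here, li) and then extract the data inside each group: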
from lxml import etree
text = ''' <div> <ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''
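# Group by the li tag
html = etree.HTML(text)
# The result is a list of Element objects, each of which can call xpath() again
# // | anywhere in the document
li_list = html.xpath("//li[@class='item-1']")
# Extract the data within each group
for li in li_list:
    item = {}
    # . | the current node
    # @attribute-name | select the value of the named attribute
    # text() | select the text content
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    # Fall back to None when the attribute or text is missing
    item["href"] = hrefs[0] if hrefs else None
    item["title"] = titles[0] if titles else None
    print(item)
Output: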
{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}
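The "first result or None" pattern used above can be factored into a small helper. This is only a sketch: `first` is a hypothetical helper name, not part of lxml, and the text is a shortened version of the sample above.

```python
from lxml import etree

def first(node, path, default=None):
    """Return the first XPath result for path relative to node, or default if nothing matches."""
    results = node.xpath(path)
    return results[0] if results else default

# Shortened sample: the first <a> has no href.
text = '''<div><ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div>'''
html = etree.HTML(text)

# Group by li first, then extract both fields inside each group.
items = [
    {"href": first(li, "./a/@href"), "title": first(li, "./a/text()")}
    for li in html.xpath("//li[@class='item-1']")
]
for item in items:
    print(item)
```

Keeping the fallback in one place means each field is evaluated once per li, and a missing href stays attached to its own title instead of shifting the pairing.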