pip install pyquery
# Install with conda
conda install pyquery
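Before parsing HTML text, you first need to initialize it as a PyQuery object. There are several ways to do this, such as passing in a string, a URL, or a file name.
Initialize a PyQuery object by passing the HTML content directly as an argument: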
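html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''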
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
To load a web page, pass its URL through the url parameter:
from pyquery import PyQuery as pq
doc = pq(url='https://www.baidu.com/',encoding='utf-8')
print(doc('title'))
If encoding='utf-8' is omitted, the output may be garbled.
Output:
<title>百度一下,你就知道</title>
This is equivalent to the following:
from pyquery import PyQuery as pq
import requests
doc = pq(requests.get('https://www.baidu.com').content.decode())
print(doc('title'))
Output:
<title>百度一下,你就知道</title>
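You can also pass the name of a local file by specifying the filename parameter.
This raised an error for me:
'gbk' codec can't decode byte 0xa2 in position 51: illegal multibyte sequence
I have not resolved this yet; if you know how to fix it, please leave a comment.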
from pyquery import PyQuery as pq
doc = pq(filename='C:/Users/123/Desktop/超星查题.html')
print(doc('button'))
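html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''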
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))
To get each node's text, iterate over the nodes and call the text method on each one:
for item in doc('#container .list li').items():
    print(item.text())
Output:
first item
second item
third item
fourth item
fifth item
To look up nodes beneath a given node, use the find method, which takes a CSS selector as its argument:
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
find searches all descendants of a node. To search only the direct children, use the children method instead:
lis = items.children()
print(type(lis))
print(lis)
Output:
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
To keep only the children that match a condition, for example those whose class includes active, pass the CSS selector .active to the children method:
lis = items.children('.active')
print(lis)
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
Use the parent method to get the parent of a node:
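html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''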
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
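The parent here is the node's direct parent only; the method does not continue upward to the parent's parent, that is, to the ancestors.
To get the ancestors as well, use the parents method: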
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
The parents method returns all ancestors of the node.
To pick out a particular ancestor, pass a CSS selector to parents; it then returns only the ancestors that match the selector:
parent = items.parents('.wrap')
print(parent)
Output:
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
To get the sibling nodes of a node, use the siblings method:
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())
Output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
To pick out particular siblings, pass a CSS selector to siblings; only the siblings that match the selector are returned:
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))
Output:
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
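As the examples above show, a pyquery selection may contain several nodes or a single node, but in both cases the result is a PyQuery object rather than a plain Python list.
A single-node result can be printed directly or converted to a string: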
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(str(li))
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
A result with several nodes has to be iterated over. For example, to visit each li node in turn, call the items method:
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))
Output:
<class 'generator'>
<li class="item-0">first item</li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link2.html">second item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
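As you can see, calling items returns a generator. Iterating over it yields the li nodes one at a time, and each of them is itself a PyQuery object, so it supports all the methods described above, such as searching its children or looking up one of its ancestors.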