1. Installation
pip install pyquery
2. Importing
from pyquery import PyQuery as pq
3. Introduction
pyquery is a jQuery-like HTML parsing library for Python; its usage is similar to bs4.
4. Usage
4.1 Initialization:
from pyquery import PyQuery as pq

doc = pq(html)  # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")  # parse a local HTML file
4.2 Basic CSS selectors
from pyquery import PyQuery as pq

html = ''''''
doc = pq(html)
print(doc("#wrap .s_from link"))
Output:
asdadasdad12312 asdadasdad12312 asdadasdad12312
# selects an element by id, . selects elements by class, link selects link tags, and a space between selectors means "nested inside".
4.3 Finding child elements
from pyquery import PyQuery as pq

html = ''''''
# find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("Type: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)
Output:
Type: asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
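The output shows that the result is a PyQuery object, and that both find and children return nested elements. find searches all descendants, while children looks only at direct children.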
4.4 Finding parent elements
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
items = doc(".s_from")
print(items)
# find the parent element
parent_href = items.parent()
print(parent_href)
Output:
-
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
hello nihao
-
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
parent returns the enclosing element together with everything it contains. The related parents method returns all ancestor nodes.
4.5 Finding sibling elements
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
items = doc("link.active1.a123")
print(items)
# find sibling elements
siblings_href = items.siblings()
print(siblings_href)
Output:
asdadasdad12312 asdadasdad12312 asdadasdad12312
As the output shows, siblings returns the other elements at the same level.
Conclusion: child, parent, and sibling lookups all return PyQuery objects, so the results can be queried again.
4.6 Iterating over results
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it)
Output:
asdadasdad12312 asdadasdad12312 asdadasdad12312
4.7 Getting attributes
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))
    print(it.attr.href)
Output:
http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com
4.8 Getting text
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())
Output:
asdadasdad12312 asdadasdad12312 asdadasdad12312
4.9 Getting HTML
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())
Output:
asdadasdad12312 asdadasdad12312 asdadasdad12312
5. Common DOM operations
5.1 addClass removeClass
Add or remove a class.
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Added: %s" % it.addClass('active1'))
    print("Removed: %s" % it.removeClass('active1'))
Output:
Added: asdadasdad12312
Removed: asdadasdad12312
Added: asdadasdad12312
Removed: asdadasdad12312
Added: asdadasdad12312
Removed: asdadasdad12312
Note that a class that is already present is not added a second time.
5.2 attr css
attr gets or sets an attribute; css adds an entry to the style attribute.
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modified: %s" % it.attr('class', 'active'))
    print("Added: %s" % it.css('font-size', '14px'))
Output:
Modified: asdadasdad12312
Added: asdadasdad12312
Modified: asdadasdad12312
Added: asdadasdad12312
Modified: asdadasdad12312
Added: asdadasdad12312
attr and css modify the object in place.
5.3 remove
remove deletes elements.
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())
Output:
Text before removal:
hello nihao asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
Text after removal:
hello nihao
For other DOM methods, see the pyquery API documentation.
6. Pseudo-class selectors
from pyquery import PyQuery as pq

html = '''hello nihao'''
doc = pq(html)
its = doc("link:first-child")
print('First element: %s' % its)
its = doc("link:last-child")
print('Last element: %s' % its)
its = doc("link:nth-child(2)")
print('Second element: %s' % its)
its = doc("link:gt(0)")  # indices start at 0
print("Elements after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-numbered elements: %s" % its)
its = doc("link:contains('hello')")
print("Elements whose text contains hello: %s" % its)
Output:
First element: helloasdadasdad12312
Last element: asdadasdad12312
Second element: asdadasdad12312
Elements after index 0: asdadasdad12312 asdadasdad12312
Odd-numbered elements: helloasdadasdad12312 asdadasdad12312
Elements whose text contains hello: helloasdadasdad12312
For more CSS selectors, consult a CSS selector reference.