pip install pyquery

2. Importing

from pyquery import PyQuery as pq

3. Overview

 pyquery is a jQuery-like library for parsing HTML in Python; it is used in much the same way as bs4.

4. Usage

  4.1 Initialization:

from pyquery import PyQuery as pq
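doc = pq(html)  # parse an HTML string
doc = pq(url="http://news.baidu.com/")  # fetch and parse a web page
doc = pq(filename="./a.html")  # parse a local HTML file (a bare path string would be parsed as markup)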

  4.2 Basic CSS selectors

from pyquery import PyQuery as pq
html = '''
    
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
print(doc("#wrap .s_from link"))

  Output:

asdadasdad12312
            asdadasdad12312
            asdadasdad12312

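  # selects a tag by id, . selects by class, and link selects link tags; a space between selectors means the element on the right is nested inside the one on the left.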

  4.3 Finding child elements

from pyquery import PyQuery as pq
html = '''
    
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
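'''
# find child elements
doc = pq(html)
items = doc("#wrap")
print(items)
print("Type: %s" % type(items))
link = items.find('.s_from')
print(link)
link = items.children()
print(link)

  Output:

    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
Type: <class 'pyquery.pyquery.PyQuery'>
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312

  The output shows that the result is a PyQuery object and that both find and children return nested tags. The difference is that find searches descendants at any depth, while children returns only direct children.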

  4.4 Finding parent elements

from pyquery import PyQuery as pq
html = '''
    
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
items = doc(".s_from")
print(items)
# find the parent element
parent_href = items.parent()
print(parent_href)

  Output:

    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312

  parent returns the immediately enclosing tag together with its contents; the similar parents method returns all enclosing (ancestor) nodes.

  4.5 Finding sibling elements

from pyquery import PyQuery as pq
html = '''
    
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
items = doc("link.active1.a123")
print(items)
# find sibling elements
siblings_href = items.siblings()
print(siblings_href)

  Output:

asdadasdad12312
            
asdadasdad12312
            asdadasdad12312

  The output shows that siblings returns the other tags at the same level.

  Summary: child, parent, and sibling lookups all return PyQuery objects, so each result can be queried again.

  4.6 Iterating over results

from pyquery import PyQuery as pq
html = '''
    
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it)

  Output:

asdadasdad12312
            
asdadasdad12312
            
asdadasdad12312

  4.7 Getting attributes

from pyquery import PyQuery as pq
html = '''
    
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.attr('href'))  # call form
    print(it.attr.href)     # attribute form, same value

  Output:

http://asda.com
http://asda.com
http://asda1.com
http://asda1.com
http://asda2.com
http://asda2.com

  4.8 Getting text

from pyquery import PyQuery as pq
html = '''
    
hello nihao
    asdasd asdadasdad12312 asdadasdad12312 asdadasdad12312
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.text())

  Output:

asdadasdad12312
asdadasdad12312
asdadasdad12312

  4.9 Getting HTML

from pyquery import PyQuery as pq
html = '''
    
hello nihao
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print(it.html())

  Output:

asdadasdad12312
asdadasdad12312
asdadasdad12312

 

5. Common DOM operations

  5.1 addClass removeClass

  Add a class to a tag, or remove one from it.

from pyquery import PyQuery as pq
html = '''
    
hello nihao
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Added: %s" % it.addClass('active1'))
    print("Removed: %s" % it.removeClass('active1'))

  Output:

Added: asdadasdad12312

Removed: asdadasdad12312

Added: asdadasdad12312

Removed: asdadasdad12312

Added: asdadasdad12312

Removed: asdadasdad12312

  Note that a class which is already present is not added a second time.

  5.2 attr css

  attr gets or sets an attribute; css adds a property to the style attribute.

from pyquery import PyQuery as pq
html = '''
    
hello nihao
'''
doc = pq(html)
its = doc("link").items()
for it in its:
    print("Modified: %s" % it.attr('class', 'active'))
    print("Added: %s" % it.css('font-size', '14px'))

  Output:

Modified: asdadasdad12312

Added: asdadasdad12312

Modified: asdadasdad12312

Added: asdadasdad12312

Modified: asdadasdad12312

Added: asdadasdad12312

  attr and css modify the object itself, in place.

  5.3 remove

  remove deletes tags.

from pyquery import PyQuery as pq
html = '''
    
hello nihao
'''
doc = pq(html)
its = doc("div")
print('Text before removal:\n%s' % its.text())
it = its.remove('ul')
print('Text after removal:\n%s' % it.text())

  Output:

Text before removal:
hello nihao
asdasd
asdadasdad12312
asdadasdad12312
asdadasdad12312
Text after removal:
hello nihao

For other DOM methods, refer to the pyquery documentation.

6. Pseudo-class selectors

 

from pyquery import PyQuery as pq
html = '''
    
hello nihao
'''
doc = pq(html)
its = doc("link:first-child")
print('First tag: %s' % its)
its = doc("link:last-child")
print('Last tag: %s' % its)
its = doc("link:nth-child(2)")
print('Second tag: %s' % its)
its = doc("link:gt(0)")  # index is zero-based
print("Tags after index 0: %s" % its)
its = doc("link:nth-child(2n-1)")
print("Odd-numbered tags: %s" % its)
its = doc("link:contains('hello')")
print("Tags whose text contains hello: %s" % its)

  Output:

First tag: helloasdadasdad12312

Last tag: asdadasdad12312

Second tag: asdadasdad12312

Tags after index 0: asdadasdad12312
            asdadasdad12312

Odd-numbered tags: helloasdadasdad12312
            asdadasdad12312

Tags whose text contains hello: helloasdadasdad12312