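[Web Scraping from Scratch] PyQuery Explained

Recap

The previous post introduced the BeautifulSoup library, which lets us scrape data without writing complicated regular expressions. Still, you may find BeautifulSoup awkward to use, with syntax that is verbose and hard to remember. This post introduces PyQuery, a flexible and powerful HTML parsing library.

What is PyQuery

If you are familiar with jQuery syntax, PyQuery is an excellent choice for scraping, because most of the jQuery API carries over almost unchanged.

Installing PyQuery

    pip install pyquery

Using PyQuery

The examples below all use the following string: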

html = '''

'''

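(1) Initializing from a string

    from pyquery import PyQuery as pq
    doc = pq(html)  # build a PyQuery object directly from the string
    print(doc('li'))  # the argument is a CSS selector: prefix a class with '.', an id with '#', and write a tag name as-is
    print(doc('.item-0')[0].text)  # text of the first element whose class is item-0

Output (the five li elements, shown by their contents, followed by the text of the first .item-0 element):

    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>
    first item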

(2) Initializing from a URL

    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')  # pass a URL; PyQuery requests the page and parses the returned HTML
    print(doc('head'))

Output (the title text inside head):

    百度一下，你就知道
    

(3) Initializing from a file

    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')  # name of a local file
    print(doc('li'))
    

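Basic CSS selectors

    from pyquery import PyQuery as pq
    doc = pq(html)
    # '#' selects by id, '.' selects by class, and a bare name selects a tag;
    # selectors separated by spaces go from the outer element to the elements nested inside it
    print(doc('#container .list li'))

Output (the five li elements, shown by their contents):

    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>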
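(1) Finding child elements

    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    print(type(items))
    print(items)
    lis = items.find('li')  # find() searches all descendants of .list, not only direct children
    print(type(lis))
    print(lis)

Output (each element is shown by its contents; the first group is the .list element, the second is the li elements found inside it):

    <class 'pyquery.pyquery.PyQuery'>
    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>
    <class 'pyquery.pyquery.PyQuery'>
    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>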

(2) Direct children

    lis = items.children('.active')  # the selector argument filters the children and is optional
    print(type(lis))
    print(lis)
    

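(3) Parent element

    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    container = items.parent()  # the direct parent of .list
    print(type(container))
    print(container)

Output (the parent element, shown by its contents):

    <class 'pyquery.pyquery.PyQuery'>
    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>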

parents() returns all ancestor nodes:

    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.list')
    parents = items.parents()
    print(type(parents))
    print(parents)

Output (two ancestors are printed, each with all of its contents, so the list appears twice):

    <class 'pyquery.pyquery.PyQuery'>
    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>
    first item
    <a href="link2.html">second item</a>
    <a href="link3.html">third item</a>
    <a href="link4.html">fourth item</a>
    <a href="link5.html">fifth item</a>

A CSS selector can also be passed in to filter the ancestors:

    parent = items.parents('.wrap')
    print(parent)

This prints only the first of the two results above.

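Sibling elements

    from pyquery import PyQuery as pq
    doc = pq(html)
    # no space between .item-0 and .active: this matches elements that carry both classes, and only one li qualifies
    li = doc('.list .item-0.active')
    print(li.siblings())

The output is the other four sibling li elements.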

Iteration

    from pyquery import PyQuery as pq
    doc = pq(html)
    lis = doc('li').items()
    print(type(lis))  # items() returns a generator
    for li in lis:
        print(li)  # each li is itself a PyQuery object and can be queried further
    

Getting information

(1) Getting attribute values

    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.attr('href'))
    print(a.attr.href)  # attribute-style access, equivalent to the call above

Output:

    <a href="link3.html">third item</a>
    link3.html
    link3.html
    

(2) Getting text

    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    print(a.text())

Output:

    <a href="link3.html">third item</a>
    third item
    

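(3) Getting HTML

    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    print(li.html())  # html() returns the inner HTML of the element

Output:

    <li class="item-0 active"><a href="link3.html">third item</a></li>
    <a href="link3.html">third item</a>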

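DOM manipulation

(1) addClass and removeClass

    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')
    print(li)
    li.addClass('active')
    print(li)

Output:

    <li class="item-0 active"><a href="link3.html">third item</a></li>
    <li class="item-0"><a href="link3.html">third item</a></li>
    <li class="item-0 active"><a href="link3.html">third item</a></li>

DOM manipulation essentially means operating on attributes, CSS, classes, and so on.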
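(2) Adding attributes with attr and styles with css

    from pyquery import PyQuery as pq
    doc = pq(html)
    li = doc('.item-0.active')
    print(li)
    li.attr('name', 'link')  # add a new attribute name/value pair
    print(li)
    li.css('font-size', '14px')  # add a new CSS rule to the style attribute
    print(li)

Output:

    <li class="item-0 active"><a href="link3.html">third item</a></li>
    <li class="item-0 active" name="link"><a href="link3.html">third item</a></li>
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html">third item</a></li>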
(3) Removing elements

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()  # remove the p element so that only "Hello, World" remains
    print(wrap.text())

Output:

    Hello, World This is a paragraph.
    Hello, World
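(4) Pseudo-class selectors

    from pyquery import PyQuery as pq
    doc = pq(html)  # html here is the list sample from the beginning of the article
    li = doc('li:first-child')  # the first li
    print(li)
    li = doc('li:last-child')  # the last li
    print(li)
    li = doc('li:nth-child(2)')  # the second li
    print(li)
    li = doc('li:gt(2)')  # the li elements whose index is greater than 2 (indexes start at 0)
    print(li)
    li = doc('li:nth-child(2n)')  # the even-numbered li elements
    print(li)
    li = doc('li:contains(second)')  # the li elements whose text contains "second"
    print(li)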
    

(5) Other

• Other DOM methods: http://pyquery.readthedocs.io/en/latest/api.html
• Other CSS selectors: http://www.w3school.com.cn/css/index.asp
• PyQuery official documentation: http://pyquery.readthedocs.io/
