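Introduction to the pyquery parsing library

PyQuery is a powerful library for parsing web pages, and in several respects it is more convenient to use than BeautifulSoup. It is a Python implementation modeled closely on jQuery, so its syntax is almost identical to jQuery's.

Basic usage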

html_doc = """
The Dormouse's story
    
        

    The Dormouse's story

    Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

    test for class story

"""
from pyquery import PyQuery as pq

doc = pq(html_doc)
print(f"type of doc: {type(doc)}")
select_list = doc("#container .list")
print(f"type of select_list: {type(select_list)}")
print("select_list: ", select_list)

# output
"""
type of doc: <class 'pyquery.pyquery.PyQuery'>
type of select_list: <class 'pyquery.pyquery.PyQuery'>
select_list:

    The Dormouse's story

    Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

    test for class story

"""

Every selection returns a pyquery.pyquery.PyQuery object, so calls can be nested or chained.
Next, we can use the find() method to get all the a tags inside select_list:

print(select_list.find('a'))

# output
"""
Elsie,
                Lacie and
                Tillie;
                and they lived at the bottom of a well.
"""

As this shows, nesting calls layer by layer makes it easy to drill down to exactly the content we want.

Finding child elements:

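print(select_list.children('a')) # look for direct children matching 'a'

# (prints an empty line: no direct child of select_list is an 'a' tag)

print(select_list.children()) # all direct children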

"""

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

test for class story

"""

Unlike find(), children() only looks at direct children, whereas find() matches descendants at any depth below the selected element.

Parent elements can be looked up in a similar way:

print(select_list.parent()) # the direct parent element
print(select_list.parents()) # all ancestor elements
print(select_list.parents("#container")) # a CSS selector can be passed to filter the ancestors

Sibling elements:

html = '''

'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li)
print(li.siblings())
print(li.siblings('.active'))

# output (markup stripped; only the text of each li is shown)
"""
third item
second item
first item
fourth item
fifth item
fourth item
"""

Iterating over elements

    html = '''
    
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('li'))
    lis = doc('li').items()
    print(f"type(lis): {type(lis)}")
    for li in lis:
        print(li) # each element can also be processed further on its own
    
    # output (markup stripped; only the text of each li is shown)
    """
    first item
    second item
    third item
    fourth item
    fifth item
    type(lis): <class 'generator'>
    first item
    second item
    third item
    fourth item
    fifth item
    """

Getting attributes and text

    html = '''
    
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    a = doc('.item-0.active a')
    print(a)
    # third item
    print(a.attr('href')) # get an attribute: the link target
    # link3.html
    print(a.attr.href) # same attribute, using attribute-style access
    # link3.html
    print(a.text()) # get the text content
    # third item
    print(a.html()) # get the inner HTML
    # third item
    

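DOM manipulation

The Document Object Model (DOM) is used to work with the content of HTML and XML documents. It represents a document as a hierarchical tree of nodes, and operating on those nodes lets you add, remove, modify and look up content.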

    html = '''
    
    '''
    from pyquery import PyQuery as pq
    doc = pq(html)
    
    li = doc('.item-0.active')
    print(li)
    # third item
    li.removeClass('active') # remove the "active" class
    print(li)
    # third item
    li.addClass('active') # add the "active" class back
    print(li)
    # third item
    li.attr('name', 'add_new_name_attr') # add a new name attribute
    print(li)
    # third item
    li.css('font-size', '16px') # add a new style attribute
    print(li)
    # third item
    print("------------------------------------------------------------------------")
    print("")
    print(li.find('span,a').remove())
    print(li)
    # ------------------------------------------------------------------------
    #
    # third item
    #

Initializing from a URL or a local file

    # initialize from a URL
    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))
    
    # initialize from a local file
    from pyquery import PyQuery as pq
    doc = pq(filename='demo.html')
    print(doc('.list')) # class selector; a bare 'list' would only match <list> tags
    
