The Beautiful Soup Library in Detail

Installation

pip install lxml
pip install beautifulsoup4

Verify the installation

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup('<p>Hello</p>', 'lxml')

In [3]: print(soup.p.string)
Hello
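As an extra check, you can print the installed version; the bs4 package exposes it as bs4.__version__:

import bs4
print(bs4.__version__)   # e.g. '4.12.x' -- the exact value depends on what pip installed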

An Introduction to Beautiful Soup

Parsers supported by Beautiful Soup

Beautiful Soup supports several parsers, including Python's built-in html.parser and the third-party lxml and html5lib parsers. Comparing them overall, the lxml parser is the better choice.

To use it, simply pass 'lxml' as the second argument when initializing Beautiful Soup:

from bs4 import BeautifulSoup

html = '''
<html><head><title>Beautiful Soup test</title></head>
<body>
<p class="first" name="first_p">first content</p>
<p class="second">second content
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # indent the markup for pretty printing
print(soup.title.string)  # get the text content of the title node

Note: the html string above is deliberately incomplete; some of the tags are not closed.

Output:

<html>
 <head>
  <title>
   Beautiful Soup test
  </title>
 </head>
 <body>
  <p class="first" name="first_p">
   first content
  </p>
  <p class="second">
   second content
  </p>
 </body>
</html>
Beautiful Soup test

Beautiful Soup automatically completes the unclosed html tags.
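How the missing tags get filled in depends on the parser. As a minimal sketch (the exact output can vary slightly between parser versions), the built-in html.parser only closes the dangling tag, while lxml also wraps the fragment in html and body tags:

from bs4 import BeautifulSoup

broken = '<p>Hello'                          # unclosed tag
print(BeautifulSoup(broken, 'html.parser'))  # <p>Hello</p>
print(BeautifulSoup(broken, 'lxml'))         # <html><body><p>Hello</p></body></html>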

Node selectors

from bs4 import BeautifulSoup

html = '''
<html><head><title>Beautiful Soup test</title></head>
<body>
<p class="first" name="first_p">first content</p>
<p class="second">second content
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title)         # <title>Beautiful Soup test</title>
print(type(soup.title))   # <class 'bs4.element.Tag'>
print(soup.title.string)  # Beautiful Soup test
print(soup.head)          # <head><title>Beautiful Soup test</title></head>
print(soup.p)             # <p class="first" name="first_p">first content</p>

Node name

In [3]: print(soup.title.name)
title

All attributes of a node

In [4]: print(soup.p.attrs)
{'class': ['first'], 'name': 'first_p'}

A specific attribute of a node

In [5]: print(soup.p.attrs['name'])
first_p

Shorthand for a specific attribute

In [6]: print(soup.p['name'])
first_p
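Note that multi-valued attributes such as class come back as a list, while ordinary attributes come back as plain strings:

print(soup.p['class'])   # ['first']  -- class is multi-valued, so a list is returned
print(soup.p['name'])    # first_p    -- ordinary attributes are plain strings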

Text content of a node

In [7]: print(soup.p.string)
first content
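string only works cleanly when a node contains a single text child. If the node contains further tags it returns None, and get_text() is the safer choice; here is a minimal sketch using a throwaway fragment (not the html above):

from bs4 import BeautifulSoup

p = BeautifulSoup('<p>first <b>bold</b> text</p>', 'lxml').p
print(p.string)      # None, because this <p> has several children
print(p.get_text())  # first bold text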

Nested selection

In [8]: print(soup.head.title)
<title>Beautiful Soup test</title>

In [9]: print(type(soup.head.title))
<class 'bs4.element.Tag'>

In [10]: print(soup.head.title.string)
Beautiful Soup test

Relational selection

In [11]: print(soup.body.children)
<list_iterator object at 0x...>

In [12]: for i, child in enumerate(soup.body.children):
    ...:     print(i, child)
    ...:
0 

1 <p class="first" name="first_p">first content</p>
2 

3 <p class="second">second content</p>

  • children: all direct child nodes
  • descendants: all descendant nodes
  • parent: the direct parent node
  • parents: all ancestor nodes
  • next_sibling: the next sibling node
  • previous_sibling: the previous sibling node
  • next_siblings: all following sibling nodes
  • previous_siblings: all preceding sibling nodes
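A small sketch of a few of these attributes, reusing the soup built in the node-selector example above (note that next_sibling can be a whitespace text node rather than the next tag, depending on the markup):

print(soup.p.parent.name)                 # body
print(repr(soup.p.next_sibling))          # '\n' -- the newline between the two <p> tags
print(soup.p.next_sibling.next_sibling)   # the second <p> node
for parent in soup.p.parents:
    print(parent.name)                    # body, html, [document]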

Method selectors

find_all

Prepare the sample data

In [13]: from bs4 import BeautifulSoup
    ...:
    ...: html = '''
    ...: <div class="panel">
    ...:     <div class="panel-heading">
    ...:         <h4>Hello</h4>
    ...:     </div>
    ...:     <div class="panel-body">
    ...:         <ul class="list" id="list-1">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:             <li class="element">Jay</li>
    ...:         </ul>
    ...:         <ul class="list list-small" id="list-2">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:         </ul>
    ...:     </div>
    ...: </div>
    ...: '''
    ...:
    ...: soup = BeautifulSoup(html, 'lxml')

All ul nodes

In [16]: soup.find_all(name='ul')
Out[16]:
[<ul class="list" id="list-1">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>, <ul class="list list-small" id="list-2">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 </ul>]

Because each ul returned is a Tag object, it can be queried again and iterated over:

In [17]: type(soup.find_all(name='ul')[0])
Out[17]: bs4.element.Tag

In [18]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Then iterate over each li to get its text:

In [19]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:     for li in ul.find_all(name='li'):
    ...:         print(li.string)
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

Query by attributes:

In [26]: soup.find_all(attrs={'id': 'list-1'})
Out[26]:
[<ul class="list" id="list-1">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>]
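Common attributes can also be passed as keyword arguments; because class is a reserved word in Python, it is spelled class_ in that form:

soup.find_all(id='list-1')                  # same query as above
soup.find_all(class_='element')             # note the trailing underscore
soup.find_all(attrs={'class': 'element'})   # equivalent to the previous line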

text

Match nodes by their text content:

In [28]: import re

# returns a list of the text of every node matching the regex
In [29]: soup.find_all(text=re.compile('ar'))
Out[29]: ['Bar', 'Bar']
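In Beautiful Soup 4.4.0 and later the same text filter is also accepted under the keyword string, which is the preferred spelling in newer releases:

soup.find_all(string=re.compile('ar'))   # same result on Beautiful Soup >= 4.4.0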
    

find

Returns the first matching element:

In [30]: soup.find(text=re.compile('ar'))
Out[30]: 'Bar'

In [31]: soup.find('li')
Out[31]: <li class="element">Foo</li>

Besides find, there are several other methods of the same family:

  • find_parents() and find_parent()
  • find_next_siblings() and find_next_sibling()
  • find_previous_siblings() and find_previous_sibling()
  • find_all_next() and find_next()
  • find_all_previous() and find_previous()
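A small sketch of two of them on the same soup (the rest follow the same pattern):

li = soup.find('li')
print(li.find_parent('ul')['id'])    # list-1 -- the enclosing <ul>
print(li.find_next_sibling('li'))    # <li class="element">Bar</li>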

CSS selectors

Just call the select() method and pass in the corresponding CSS selector:

In [32]: soup.select('.panel .panel-heading')
Out[32]:
[<div class="panel-heading">
 <h4>Hello</h4>
 </div>]

In [33]: soup.select('ul li')
Out[33]:
[<li class="element">Foo</li>,
 <li class="element">Bar</li>,
 <li class="element">Jay</li>,
 <li class="element">Foo</li>,
 <li class="element">Bar</li>]

In [34]: soup.select('#list-2 .element')
Out[34]: [<li class="element">Foo</li>, <li class="element">Bar</li>]

In [35]: soup.select('ul')[0]
Out[35]:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
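If only the first match is needed, recent versions also provide select_one(), and the usual CSS combinators work inside select():

print(soup.select_one('#list-1 li'))          # first match only: <li class="element">Foo</li>
print(soup.select('ul#list-1 > li.element'))  # direct li children of ul#list-1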

Nested selection

In [36]: for ul in soup.select('ul'):
    ...:     print(ul.select('li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

In [37]: for ul in soup.select('ul'):
    ...:     print(ul['id'])
    ...:     print(ul.attrs['id'])
    ...:
list-1
list-1
list-2
list-2

Getting text

In [39]: for li in soup.select('li'):
    ...:     print('Get Text:', li.get_text())
    ...:     print('String:', li.string)
    ...:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
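get_text() also accepts a separator and a strip flag, which helps when a node contains nested tags and surrounding whitespace:

for ul in soup.select('ul'):
    print(ul.get_text(',', strip=True))   # Foo,Bar,Jay for the first list, Foo,Bar for the second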
    
