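BeautifulSoup is an excellent Python library for extracting the data you are interested in from HTML or XML files, and it lets you choose which parser to use. Because beautifulsoup3 is no longer maintained, new projects should use beautifulsoup4 (the latest version at the time of writing is 4.5.0). It can be installed directly with pip install beautifulsoup4 and is then imported with from bs4 import BeautifulSoup. The examples below give a brief tour of what BeautifulSoup4 can do; for more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.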
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('hello world!', 'lxml')  # lxml automatically adds and completes the missing tags
<html><body><p>hello world!</p></body></html>
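How much markup gets filled in depends on the parser. The sketch below assumes only the built-in html.parser (so lxml is not needed) and uses made-up sample strings. It suggests that html.parser closes an unclosed tag but, unlike lxml in the example above, does not wrap a fragment in <html> and <body>.

```python
from bs4 import BeautifulSoup

# html.parser ships with Python; it repairs unclosed tags but
# does not add <html> or <body> wrappers around a fragment.
fragment = BeautifulSoup('<p>hello world!', 'html.parser')
closed = str(fragment)            # the <p> tag gets closed

plain = BeautifulSoup('hello world!', 'html.parser')
unwrapped = str(plain)            # bare text stays bare text

print(closed)
print(unwrapped)
print(plain.html is None)         # no <html> element was created
```

If the surrounding code relies on soup.html or soup.body always existing, the choice of parser therefore matters.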
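>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""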
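>>> soup = BeautifulSoup(html_doc, 'html.parser')  # lxml or another parser can be used instead
>>> print(soup.prettify())  # print the document tree with indentation
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>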
>>> soup.title  # access a specific tag
<title>The Dormouse's story</title>
>>> soup.title.name  # tag name
'title'
>>> soup.title.text  # text inside the tag
"The Dormouse's story"
>>> soup.title.string
"The Dormouse's story"
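>>> soup.title.parent  # the enclosing (parent) tag
<head><title>The Dormouse's story</title></head>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.b
<b>The Dormouse's story</b>
>>> soup.body.b
<b>The Dormouse's story</b>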
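For the title tag, .text and .string return the same value, but the two are not interchangeable. The sketch below uses a made-up one-line document and html.parser to show where they differ: .string only yields a value when a tag has a single string inside it, while .text joins all nested strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Once <b>upon</b> a time</p>', 'html.parser')

# <b> has exactly one string child, so .string returns it.
single = soup.b.string

# <p> has several children (text, <b>, text), so .string is None ...
mixed = soup.p.string

# ... while .text (equivalent to get_text()) joins every nested string.
joined = soup.p.text

print(repr(single), repr(mixed), repr(joined))
```

When a tag may contain nested markup, .text or get_text() is usually the safer choice.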
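>>> soup.name  # the BeautifulSoup object itself can be treated as a tag
'[document]'
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>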
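>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']  # a tag attribute
['title']
>>> soup.p.get('class')  # attributes can also be read this way
['title']
>>> soup.p.text
"The Dormouse's story"
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.a.attrs  # all attributes of the tag
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}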
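The attribute lookups above return a list for class but a plain string for id. The sketch below, on a hypothetical tag with two class names, illustrates that difference and shows how .get() can avoid a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second class name "big" is added for illustration.
markup = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
tag = BeautifulSoup(markup, 'html.parser').a

classes = tag['class']              # multi-valued attribute -> list of strings
link_id = tag['id']                 # single-valued attribute -> plain string
missing = tag.get('title')          # absent attribute -> None, no KeyError
fallback = tag.get('title', 'n/a')  # or a default of your choosing
has_href = tag.has_attr('href')     # explicit membership test

print(classes, link_id, missing, fallback, has_href)
```

Indexing with tag['title'] would raise KeyError here, so .get() is the more forgiving choice for optional attributes.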
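>>> soup.find_all('a')  # find all <a> tags
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find_all(['a', 'b'])  # find <a> and <b> tags at the same time
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> import re
>>> soup.find_all(href=re.compile("elsie"))  # find tags whose href contains a given keyword
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> soup.find_all('a', id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]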
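find_all() accepts several other kinds of filters besides tag names, attribute values and regular expressions. The sketch below is a small illustration on a variant of the sample links; the "cousin" class, the example.org address and the helper is_link2_or_3 are hypothetical and exist only to make the filters return different results.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.org/tillie" class="cousin" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

def is_link2_or_3(tag):
    """Function filter: called once per tag, keep it when this returns True."""
    return tag.name == 'a' and tag.get('id') in {'link2', 'link3'}

# class_ (with trailing underscore) filters on the CSS class.
by_class = [a.text for a in soup.find_all('a', class_='sister')]
# limit stops the search after the given number of matches.
first_only = [a.text for a in soup.find_all('a', limit=1)]
# A compiled regex is matched against the attribute value.
by_regex = [a.text for a in soup.find_all('a', href=re.compile(r'example\.org'))]
by_func = [a.text for a in soup.find_all(is_link2_or_3)]
# find() returns None when nothing matches; find_all() returns an empty list.
not_found = soup.find('a', id='link9')

print(by_class, first_only, by_regex, by_func, not_found)
```

Because find() can return None, check its result before accessing attributes on it.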
>>> for link in soup.find_all('a'):
...     print(link.text, ':', link.get('href'))
Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>> print(soup.get_text())  # return all the text in the document
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
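get_text() keeps the whitespace found in the markup, which is why the output above is spread over several lines. The sketch below, on a made-up fragment, shows two ways to get tidier text: passing a separator with strip=True, or iterating over stripped_strings.

```python
from bs4 import BeautifulSoup

# Made-up fragment with irregular spacing around the names.
soup = BeautifulSoup('<p> Elsie, <b>Lacie</b> and <i> Tillie </i></p>',
                     'html.parser')

raw = soup.get_text()                      # whitespace kept as written
joined = soup.get_text('|', strip=True)    # strip each piece, join with '|'
pieces = list(soup.stripped_strings)       # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```

The separator form is convenient for a single line of output, while stripped_strings is easier to post-process item by item.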
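>>> soup.a['id'] = 'test_link1'  # change the value of a tag attribute
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>> soup.a.string.replace_with('test_Elsie')  # change the text of a tag
'Elsie'
>>> soup.a.string
'test_Elsie'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>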
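Attributes can be added and deleted in the same dictionary-like way as they are modified. The sketch below works on a single standalone tag; the rel attribute and its value are hypothetical additions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    'html.parser')
tag = soup.a

tag['id'] = 'test_link1'      # overwrite an existing attribute
tag['rel'] = 'nofollow'       # add a new attribute (illustrative value)
del tag['class']              # remove an attribute entirely

# replace_with() swaps the text and returns the string that was replaced.
old = tag.string.replace_with('test_Elsie')

print(tag.attrs)
print(old, '->', tag.string)
```

All of these changes happen in the parsed tree only; the original markup string is left untouched.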
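>>> for child in soup.body.children:  # iterate over the direct children
...     print(child)
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
>>> for string in soup.strings:  # iterate over all text strings (output omitted)
...     print(string)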
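>>> test_doc = '<html><head></head><body><p></p><p></p></body></html>'  # assumed sample markup: any small nested document works here
>>> s = BeautifulSoup(test_doc, 'lxml')
>>> for child in s.html.children:  # iterate over the direct children
...     print(child)
<head></head>
<body><p></p><p></p></body>
>>> for child in s.html.descendants:  # iterate over all descendants
...     print(child)
<head></head>
<body><p></p><p></p></body>
<p></p>
<p></p>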
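The difference between .children and .descendants is easier to see on a document that also contains text. The sketch below uses a made-up fragment with html.parser. Note that .descendants yields text nodes as well as tags, so the example filters on Tag when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<div><p>a<b>b</b></p><p>c</p></div>', 'html.parser')
div = soup.div

# .children: only the nodes directly inside <div>.
children = [c.name for c in div.children]

# .descendants: every nested node, depth first, text nodes included.
descendants = list(div.descendants)

# Keep tags only by filtering out the text nodes.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]

print(children)
print(len(descendants))
print(tag_names)
```

Both attributes are generators, so wrap them in list() if the result needs to be indexed or traversed more than once.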