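Reference: https://cuiqingcai.com/1319.html
Beautiful Soup's main job is to extract data from web pages. It provides a handful of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox: it parses a document and hands you the data you want to scrape, so a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case, you only have to state the original encoding.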
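As a small illustration of stating the original encoding, the sketch below passes from_encoding to the constructor. The GBK-encoded sample bytes are made up for this example, and "html.parser" is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: bytes encoded as GBK, with no charset declaration.
raw = "<html><head><title>编码测试</title></head></html>".encode("gbk")

# from_encoding tells Beautiful Soup which encoding the bytes use.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
title_text = soup.title.string

print(title_text)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

If the bytes do declare their encoding, or it is easy to detect, from_encoding can simply be left out.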
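Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.
In Python 3 the module is bs4.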
from bs4 import BeautifulSoup
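html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the BeautifulSoup object. Naming the parser avoids the "no parser was explicitly specified" warning.
# 'lxml' has to be installed separately; the built-in 'html.parser' also works.
soup = BeautifulSoup(html, 'lxml')
# Print the contents of the soup object, pretty-printed
print(soup.prettify())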
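Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
(1) Tag: an HTML tag. Accessing a tag by name, as in soup.a, returns only the first matching tag.
print(soup.title)
print(soup.head)
print(soup.a)
print(soup.p)
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
# Check the object type
print(type(soup.a))
<class 'bs4.element.Tag'>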
Attributes
name: the soup object itself is a special case whose name is [document]; for any tag inside it, name is the tag's own name.
print(soup.name)
print(soup.head.name)
[document]
head
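attrs: attributes can be read, modified, and deleted
# Read attributes
print(soup.p.attrs)
print(soup.p['class'])
print(soup.p.get('class'))
# Modify an attribute
soup.p['class'] = 'newClass'
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>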
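The attribute operations above can be tried on a standalone snippet. This is a minimal sketch: the snippet and the variable names are made up for illustration, and "html.parser" is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title intro" name="dromouse"><b>The Dormouse\'s story</b></p>'
tag = BeautifulSoup(snippet, "html.parser").p

classes = tag["class"]             # class is multi-valued, so this is a list
name_attr = tag["name"]            # ordinary attributes come back as str
missing = tag.get("id")            # .get() returns None instead of raising KeyError
fallback = tag.get("id", "no-id")  # or a default of your choice

tag["id"] = "first"                # add or change an attribute
del tag["name"]                    # delete an attribute
remaining = dict(tag.attrs)

print(classes, name_attr, missing, fallback, remaining)
```

Indexing with tag['id'] raises KeyError when the attribute is absent, so .get() is the safer choice for attributes that may be missing.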
(2) NavigableString: a navigable string
# Get the text inside a tag
print(soup.p.string)
print(type(soup.p.string))
The Dormouse's story
<class 'bs4.element.NavigableString'>
(3) BeautifulSoup: this object represents the whole document
print(type(soup.name))
print(soup.name)
print(soup.attrs)
<class 'str'>
[document]
{}  # empty dict
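(4) Comment: a special kind of NavigableString. It holds the text of an HTML comment, with the comment markers already removed.
print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>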
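Because .string prints a comment without its markers, comment text can be mistaken for visible text. One way to guard against that, sketched here on a made-up snippet, is to check the type before using the string.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

snippet = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(snippet, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    if isinstance(s, Comment):
        # Comment content, not text the page actually displays: skip it.
        continue
    visible.append(str(s))

print(visible)
```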
(1) Direct children
tag.contents returns a tag's children as a list, so a single child can be picked out by index.
print(soup.head.contents)
print(soup.head.contents[0])
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
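tag.children returns a list iterator rather than a list; loop over it to visit every child.
print(soup.head.children)
for child in soup.body.children:
    print(child)
<list_iterator object at 0x...>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>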
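The children of a tag include the whitespace strings between tags, not only the tags. A minimal sketch on a made-up snippet, filtering the children down to Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
body = BeautifulSoup(snippet, "html.parser").body

all_children = list(body.children)  # line-break strings and <p> tags, interleaved
tag_children = [c for c in body.children if isinstance(c, Tag)]
texts = [c.get_text() for c in tag_children]

print(len(all_children), texts)
```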
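(2) All descendants
tag.descendants iterates recursively over all of a tag's descendants. Like children, it has to be looped over to get at the contents.
for child in soup.descendants:
    print(child)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
 Elsie 
,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
Lacie
 and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.
<p class="story">...</p>
...
Every node is printed: first the whole html tag, then head, and so on, peeling off one layer at a time.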
(3) Node content
tag.string: if a tag has exactly one child and that child is a NavigableString, .string returns it. If the only child is another tag, .string passes through and returns that child's .string.
print(soup.head.string)
print(soup.title.string)
The Dormouse's story
The Dormouse's story
If a tag has more than one child, it is ambiguous which child .string should refer to, so .string is None.
print(soup.html.string)
None
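When .string comes back as None because a tag has several children, get_text() is one way to still collect all of the text. A minimal sketch on made-up snippets:

```python
from bs4 import BeautifulSoup

p = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser").p
head = BeautifulSoup("<head><title>T</title></head>", "html.parser").head

single = p.b.string      # one string child: returns it
nested = head.string     # one tag child: falls through to title.string
multiple = p.string      # three children: None
joined = p.get_text()    # concatenates every string under the tag

print(single, nested, multiple, joined)
```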
(4) Multiple strings
tag.strings yields every string under a tag; loop over it to read them.
for string in soup.strings:
    print(repr(string))
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
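tag.stripped_strings: the strings often contain a lot of spaces and blank lines. .stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest.
for string in soup.stripped_strings:
    print(repr(string))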
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
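The difference between the two iterators is easy to see on a small made-up snippet that contains only indentation and padded text:

```python
from bs4 import BeautifulSoup

snippet = "<div>\n  <p> Alice </p>\n  <p>Bob</p>\n</div>"
soup = BeautifulSoup(snippet, "html.parser")

raw = list(soup.strings)             # includes the indentation between tags
clean = list(soup.stripped_strings)  # whitespace-only strings dropped, the rest stripped

print(raw)
print(clean)
```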
(5) Parent
tag.parent returns the parent of a tag or of a string.
p = soup.p
print(p.parent.name)
content = soup.head.title.string
print(content.parent.name)
body
title
(6) All ancestors
tag.parents walks up the tree and yields every ancestor of an element.
content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
title
head
html
[document]
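A common use of .parents is to work out where in the document a node sits. A minimal sketch, on a made-up snippet, that collects the ancestor names of a deeply nested tag:

```python
from bs4 import BeautifulSoup

snippet = "<html><body><div><p><b>deep</b></p></div></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# Innermost ancestor first, ending with the BeautifulSoup object itself.
path = [parent.name for parent in soup.b.parents]
print(path)
```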
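(7) Siblings
tag.next_sibling returns the next sibling of a node, or None if there is none.
tag.previous_sibling returns the previous sibling of a node, or None if there is none.
Note: in real documents next_sibling and previous_sibling are often strings or whitespace, because the whitespace and line breaks between tags are nodes too.
print(soup.p.next_sibling)
print(soup.p.previous_sibling)
print(soup.p.next_sibling.next_sibling)
# whitespace: the line break after the first <p>
# whitespace: the line break between <body> and the first <p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
# the sibling after the next sibling is the first visible node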
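To step over the whitespace nodes and land directly on the next tag, one option is find_next_sibling() with a tag name. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = "<body>\n<p id='a'>first</p>\n<p id='b'>second</p>\n</body>"
soup = BeautifulSoup(snippet, "html.parser")
first = soup.find("p", id="a")

raw_next = first.next_sibling            # the line break between the two <p> tags
raw_prev = first.previous_sibling        # the line break after <body>
tag_next = first.find_next_sibling("p")  # skips the whitespace string

print(repr(raw_next), repr(raw_prev), tag_next)
```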
(8) All siblings
tag.next_siblings and tag.previous_siblings iterate over the siblings that follow or precede the current node.
for sibling in soup.a.next_siblings:
    print(repr(sibling))
',\n'
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
' and\n'
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
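(9) Next and previous elements
tag.next_element and tag.previous_element point to the node parsed immediately after or before this one, regardless of nesting level.
print(soup.head.next_element)
<title>The Dormouse's story</title>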
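(10) All next and previous elements
tag.next_elements and tag.previous_elements are iterators that move forward or backward through the document in the order it was parsed.
for element in soup.a.next_elements:
    print(repr(element))
' Elsie '
',\n'
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
'Lacie'
' and\n'
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'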
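(1) find_all(name, attrs, recursive, text, **kwargs): searches the tags beneath the current tag and returns every one that matches the filter.
name argument: finds every tag whose name is name; string objects are skipped automatically.
print(soup.find_all('b'))
print(soup.find_all('a'))
[<b>The Dormouse's story</b>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]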
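# A regular expression: matched against tag names
import re
for tag in soup.find_all(re.compile('^b')):  # tags whose name starts with b
    print(tag.name)
body
b
# A list: matches any tag whose name is in the list
print(soup.find_all(['a','b']))
[<b>The Dormouse's story</b>, <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
# True: matches every tag, but no strings
for tag in soup.find_all(True):
    print(tag.name)
html
head
title
body
p
b
p
a
a
a
p
# A function: takes a tag and returns True when it matches
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]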
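Keyword arguments: any named argument that is not one of the built-in search parameters is treated as a filter on the tag attribute of that name. Passing an argument called id, for example, makes Beautiful Soup filter on each tag's id attribute.
import re
print(soup.find_all(id = 'link2'))
print(soup.find_all(href = re.compile('elsie')))
print(soup.find_all(href = re.compile('elsie'),id = 'link1'))
print(soup.find_all('a',class_ = 'sister'))  # class is a Python keyword, so filter on it with a trailing underscore: class_
[<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]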
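Some attribute names, such as HTML5 data-* attributes, cannot be written as Python keyword arguments. For those, the attrs argument accepts a dictionary. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

snippet = '<div data-foo="value">foo!</div><div data-foo="other">bar</div>'
soup = BeautifulSoup(snippet, "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError, so pass a dict instead.
matches = soup.find_all(attrs={"data-foo": "value"})
texts = [tag.get_text() for tag in matches]

print(texts)
```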
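text argument: searches the strings in the document instead of the tags. Like name, text accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the same argument is also available under the name string.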
print(soup.find_all(text='Elsie'))
print(soup.find_all(text=['Tillie','Elsie','Lacie']))
print(soup.find_all(text=re.compile('Dormouse')))
[]
['Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
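The sketch below uses the newer spelling, string=, which assumes Beautiful Soup 4.4.0 or later. On its own it returns the matching strings; combined with a tag name it returns the tags whose .string matches. The snippet is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

snippet = '<p><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# Strings only: every NavigableString that matches the pattern.
strings = soup.find_all(string=re.compile("ill"))

# Tag name plus string: <a> tags whose .string equals "Lacie".
tags = soup.find_all("a", string="Lacie")

print(strings, tags)
```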
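limit argument: find_all() returns every result, which can be slow on a large document tree. If you do not need all of them, use limit to cap the number of results. It works like the LIMIT keyword in SQL: once that many results have been found, the search stops and returns them.
print(soup.find_all('a',limit=2))
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>]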
recursive argument: find_all() on a tag examines all of that tag's descendants. To search only its direct children, pass recursive=False.
print(soup.html.find_all('title'))
print(soup.html.find_all('title',recursive = False))
print(soup.html.find_all('head',recursive = False))
[<title>The Dormouse's story</title>]
[]
[<head><title>The Dormouse's story</title></head>]
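(2) find(name, attrs, recursive, text, **kwargs): the difference from find_all() is that find_all() returns a list, even when it holds a single element, whereas find() returns the first match directly, or None when nothing matches.
(3) find_parents() and find_parent(): search the ancestors of the current node, using the same filters as an ordinary tag search.
(4) find_next_siblings() and find_next_sibling(): the first returns all following siblings that match; the second returns only the first following sibling that matches.
(5) find_previous_siblings() and find_previous_sibling(): the first returns all preceding siblings that match; the second returns only the first preceding sibling that matches.
(6) find_all_next() and find_next(): iterate over the tags and strings after the current tag via .next_elements; the first returns all matches, the second only the first.
(7) find_all_previous() and find_previous(): iterate over the tags and strings before the current node via .previous_elements; the first returns all matches, the second only the first.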
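The find family listed above can be tried side by side. This is a minimal sketch on a made-up snippet; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

snippet = (
    '<div id="box"><p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p></div>"
)
soup = BeautifulSoup(snippet, "html.parser")

first_a = soup.find("a")        # first match, a Tag
missing = soup.find("table")    # no match: None rather than an empty list
link2 = soup.find("a", id="link2")

parent_div = link2.find_parent("div")
later = [a["id"] for a in link2.find_next_siblings("a")]
earlier = link2.find_previous_sibling("a")["id"]
after_first = [a["id"] for a in first_a.find_all_next("a")]

print(first_a["id"], missing, parent_div["id"], later, earlier, after_first)
```

Since find() can return None, it is worth checking the result before indexing into it or reading its attributes.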
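CSS selectors are used through soup.select(), which returns a list.
(1) Select by tag name
print(soup.select('title'))
print(soup.select('a'))
print(soup.select('b'))
[<title>The Dormouse's story</title>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
[<b>The Dormouse's story</b>]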
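(2) Select by class: prefix the class name with a dot (.).
print(soup.select('.sister'))
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
(3) Select by id: prefix the id with #.
print(soup.select('#link1'))
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]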
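(4) Combined selectors
print(soup.select('p #link1'))
# Direct child selector
print(soup.select('head > title'))
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]
[<title>The Dormouse's story</title>]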
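(5) Select by attribute: put the attribute in square brackets. The attribute and the tag name describe the same node, so there must be no space between them; otherwise nothing matches.
print(soup.select('a[class="sister"]'))
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('p a[href="http://example.com/elsie"]'))
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]
[<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>]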
(6) The results can also be looped over, with get_text() used to read the text of each element.
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())
The Dormouse's story
The Dormouse's story
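When only the first match is needed, select_one() saves indexing into the list. The sketch below also combines a class selector, a child combinator, and an attribute prefix selector. The snippet is made up, and CSS selection in current bs4 releases relies on the soupsieve package, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
    '<a href="/about" id="about">About</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# <a class="sister"> tags that are direct children of <p class="story">
ids = [a["id"] for a in soup.select("p.story > a.sister")]

first = soup.select_one("#link2")    # a single Tag instead of a list
missing = soup.select_one("#nope")   # None when nothing matches

# Attribute prefix selector: href values that start with "http://"
absolute = [a["id"] for a in soup.select('a[href^="http://"]')]

print(ids, first.get_text(), missing, absolute)
```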