Beautiful Soup is a Python library for extracting data from HTML and XML files.
pip install beautifulsoup4
lxml parser: pip install lxml
html5lib parser: pip install html5lib
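To parse a document, pass an open file handle or a string to the BeautifulSoup constructor:
from bs4 import BeautifulSoup
with open("index.html") as fp:  # close the file once it has been parsed
    soup = BeautifulSoup(fp)
soup = BeautifulSoup("data")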
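First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
Then Beautiful Soup picks the most suitable installed parser to parse the document; if a parser is specified explicitly, Beautiful Soup uses that one instead.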
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("data", "lxml")  # lxml wraps bare text in <html><body><p>
print(soup.prettify())  # print the parse tree with standard indentation
<html>
 <body>
  <p>
   data
  </p>
 </body>
</html>
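Because the parser decides how a fragment is repaired, naming it explicitly keeps results reproducible across machines. The following is a minimal sketch, assuming only Python's built-in html.parser is available; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = "<p>data</p>"

# Naming the parser makes the result independent of which
# optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(markup, "html.parser")

# html.parser does not add <html> or <body> wrappers around a fragment.
print(soup.p.string)      # data
print(soup.html is None)  # True
```

With lxml or html5lib the same fragment would be wrapped in a full document, which is why output can differ between installations.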
Tags are reached from the BeautifulSoup object by attribute access: the name after the dot is the tag name, so a <b> tag is soup.b and a <p> tag is soup.p.
A Tag object corresponds to a tag in the original XML or HTML document:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag))
--- Output ---
<class 'bs4.element.Tag'>
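Attribute-style access returns only the first matching tag, and returns None when no such tag exists. A small sketch of that behavior follows; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>first</p><p>second <b class="boldest">bold</b></p>', "html.parser"
)

first_p = soup.p      # only the first <p> is returned
missing = soup.table  # there is no <table> in this document

print(first_p)                  # <p>first</p>
print(missing)                  # None
print(len(soup.find_all("p")))  # 2
```

When every match is needed, find_all() is the usual choice, as the last line shows.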
Every tag has a name, which is read through .name:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)
Output:
b
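Changing a tag's name changes the markup generated from the soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
print(tag)
Result: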
<blockquote class="boldest">Extremely bold</blockquote>
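The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Attributes are read like dictionary keys, for example tag['class'], and the whole attribute dictionary is available as .attrs:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name = "blockquote"
print(tag.attrs)  # read the tag's attributes
print(tag)
tag['class'] = 'verybold'  # modify an attribute
tag['id'] = 123  # add an attribute
print(tag)
del tag['class']  # delete an attribute
print(tag)
Output: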
{'class': ['boldest']}
<blockquote class="boldest">Extremely bold</blockquote>
<blockquote class="verybold" id="123">Extremely bold</blockquote>
<blockquote id="123">Extremely bold</blockquote>
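Indexing a tag with a missing attribute raises KeyError, so a safe lookup is often useful. The sketch below uses tag.get() and tag.has_attr() from the Tag API; the attribute names id and title are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# tag["id"] would raise KeyError here; .get() returns None or a default.
tag_id = tag.get("id")
tag_title = tag.get("title", "untitled")
has_class = tag.has_attr("class")

print(tag_id)     # None
print(tag_title)  # untitled
print(has_class)  # True
```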
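For multi-valued attributes such as class, Beautiful Soup returns a list:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
tag = css_soup.p
print(tag['class'])
Output: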
['body', 'strikeout']
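An attribute that only looks multi-valued, but is not defined as such in any HTML standard, is returned as a single string:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p id="body strikeout"></p>', 'html.parser')
tag = css_soup.p
print(tag['id'])
Output: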
body strikeout
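When a list is assigned to an attribute, its values are joined with spaces when the tag is written out:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p rel="body"></p>', 'html.parser')
tag = css_soup.p
print(tag['rel'])
tag['rel'] = ['body', 'strikeout']
print(tag['rel'])
print(tag)
Output: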
body
['body', 'strikeout']
<p rel="body strikeout"></p>
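When the document is parsed as XML, no attribute is treated as multi-valued:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p rel="body strikeout"></p>', 'xml')
tag = css_soup.p
print(tag['rel'])
Output: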
body strikeout
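Since the return type depends on the attribute, code that handles arbitrary attributes may prefer a value that is always a list. The sketch below assumes a Beautiful Soup 4 release recent enough to provide Tag.get_attribute_list(); the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout" id="main content"></p>', "html.parser"
)
tag = soup.p

# get_attribute_list() wraps single-valued attributes in a list.
class_values = tag.get_attribute_list("class")
id_values = tag.get_attribute_list("id")

print(tag["class"])  # ['body', 'strikeout']
print(tag["id"])     # main content
print(class_values)  # ['body', 'strikeout']
print(id_values)     # ['main content']
```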
Beautiful Soup uses the NavigableString class to wrap the strings inside a tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
print(type(tag.string))
Output:
Extremely bold
<class 'bs4.element.NavigableString'>
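In Python 2, unicode() converts a NavigableString object directly into a Unicode string (this function does not exist in Python 3):
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
print(type(tag.string))
unicode_string = unicode(tag.string)  # Python 2 only
print(unicode_string)
print(type(unicode_string))
--- Output ---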
Extremely bold
Extremely bold
In Python 3, use str() instead:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
print(type(tag.string))
unicode_string = str(tag.string)
print(unicode_string)
print(type(unicode_string))
--- Output ---
Extremely bold
<class 'bs4.element.NavigableString'>
Extremely bold
<class 'str'>
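The conversion matters because a NavigableString still knows its place in the parse tree, while a plain str does not. Here is a short sketch of the difference; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
navigable = soup.b.string

# NavigableString subclasses str, so string methods work on it directly...
print(isinstance(navigable, str))  # True
print(navigable.upper())           # EXTREMELY BOLD

# ...but it also remembers where it sits in the tree.
print(navigable.parent.name)       # b

# str() yields a plain string with no link back to the document.
plain = str(navigable)
print(type(plain) is str)          # True
print(hasattr(plain, "parent"))    # False
```

Converting to a plain string is a reasonable habit when the text is kept after the soup itself is no longer needed.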
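A string inside a tag cannot be edited in place, but it can be swapped for another string with the replace_with() method:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag)
tag.string.replace_with("change string")
print(tag)
Output: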
<b class="boldest">Extremely bold</b>
<b class="boldest">change string</b>
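Two related operations are worth a quick sketch: assigning to .string, and calling replace_with() on a whole tag rather than on its string. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b class="boldest">world</b></p>', "html.parser")

# Assigning to .string replaces the tag's contents with a single string.
soup.b.string = "there"
print(soup)  # <p>Hello <b class="boldest">there</b></p>

# replace_with() also works on a whole tag, swapping it for plain text.
soup.b.replace_with("everyone")
print(soup)  # <p>Hello everyone</p>
```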
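The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object: it supports most of the methods described under navigating and searching the tree. Because it is sometimes convenient to check a .name attribute, the BeautifulSoup object has a special .name whose value is "[document]".
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.name)
Output: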
[document]
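Comment is a special type of NavigableString that holds the text of an HTML comment:
from bs4 import BeautifulSoup
annotate = "<b><!--This is a comment--></b>"  # <b> contains only a comment (placeholder text)
text = "<b>This is text</b>"
soup_annotate = BeautifulSoup(annotate, "html.parser")
soup_text = BeautifulSoup(text, "html.parser")
comment_annotate = soup_annotate.b.string
comment_text = soup_text.b.string
print(type(comment_annotate))
print(type(comment_text))
Result: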
<class 'bs4.element.Comment'>
<class 'bs4.element.NavigableString'>
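Telling comments apart from ordinary strings is mostly useful for finding or stripping them. Below is a minimal sketch using find_all() with a string filter; the markup is made up, and the lambda is just one way to write the test.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<div><!-- hidden note --><p>Visible text</p></div>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every Comment node in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c).strip() for c in comments])  # ['hidden note']

# extract() removes each comment from the tree.
for comment in comments:
    comment.extract()
print(soup)  # <div><p>Visible text</p></div>
```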