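Beautiful Soup is an HTML/XML parser, used mainly to parse HTML/XML documents and extract data from them.
It is based on the HTML DOM: it loads the whole document and builds the complete DOM tree, so its time and memory overhead is considerably higher and its performance is lower than lxml's.
BeautifulSoup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and the lxml XML parser.
Although BeautifulSoup4 is simple and easy to pick up, its matching speed is still well behind regular expressions and XPath, so when matching speed is the main concern it is generally not the recommended choice and regular expressions are preferred.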
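As a small illustration of the CSS selector support mentioned above, the following sketch runs `select()` and `select_one()` on an inline HTML snippet. The snippet and its example.com URLs are hypothetical stand-ins, not the real demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, loosely modeled on the demo page used later.
html = """
<html><body>
<p class="title"><b>Demo</b></p>
<p class="course">Courses:
<a class="py1" href="http://example.com/basic" id="link1">Basic Python</a> and
<a class="py2" href="http://example.com/advanced" id="link2">Advanced Python</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # standard-library parser

# select() returns a list of every tag that matches the CSS selector.
course_links = soup.select("p.course a")
link_texts = [a.get_text() for a in course_links]

# select_one() returns only the first match, or None when nothing matches.
second_link = soup.select_one("#link2")

print(link_texts)
print(second_link["href"])
```

Selectors are often shorter than chained attribute access when the target is identified by class or id.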
Basic elements of the BeautifulSoup class:
…
the name of <p>…</p> is 'p'; format: <tag>.name
# Import the bs4 library
from bs4 import BeautifulSoup
import requests  # used to fetch the page
r = requests.get('https://python123.io/ws/demo.html')  # demo URL
demo = r.text  # the fetched page content
demo
'This is a python demo page \r\n\r\nThe demo python introduces several python courses.
\r\nPython is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\nBasic Python and Advanced Python.
\r\n'
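# Parse the HTML page
soup = BeautifulSoup(demo, 'html.parser')  # arguments: the fetched page data; the parser bs4 should use
# Print the parsed HTML page with indentation that shows its hierarchy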
print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
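1) Tag: any tag in the HTML can be accessed with soup.<tag>.
When the HTML document contains several identical <tag> elements, soup.<tag> returns the first one.
soup.a  # access the a tag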
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
soup.title
<title>This is a python demo page</title>
2) Tag name: every <tag> has its own name, a string obtained with <tag>.name.
soup.a.name
'a'
soup.a.parent.name
'p'
soup.p.parent.name
'body'
3) Tag attributes: a <tag> can have zero or more attributes, exposed as a dict through <tag>.attrs.
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(type(tag.attrs))
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
<class 'dict'>
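Because attributes are held in a dict, a tag can also be indexed directly. The sketch below uses a hypothetical one-tag document to show direct indexing, the safer `.get()`, and the fact that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for illustration.
soup = BeautifulSoup(
    '<a class="py1 big" href="http://example.com/" id="link1">Basic Python</a>',
    "html.parser",
)
tag = soup.a

href = tag["href"]           # same as tag.attrs["href"]; raises KeyError if missing
missing = tag.get("title")   # .get() returns None instead of raising
classes = tag["class"]       # class is multi-valued, so bs4 returns a list
has_id = tag.has_attr("id")  # membership test without fetching the value

print(href, missing, classes, has_id)
```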
4) NavigableString: the non-attribute string inside a tag; format: soup.<tag>.string
print(soup.a.string)
print(type(soup.a.string))
Basic Python
<class 'bs4.element.NavigableString'>
5) Comment: the comment part of a string inside a tag; Comment is a special type (written as <!-- … -->)
print(type(soup.p.string))
<class 'bs4.element.NavigableString'>
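The demo page contains no HTML comment, so the difference between the two string types is easier to see on a made-up snippet. A minimal sketch, assuming the inline HTML below:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical snippet: one tag holds a comment, the other holds plain text.
soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comment_node = soup.b.string
text_node = soup.p.string

# .string strips the <!-- --> markers, so the two values look alike when printed;
# checking the type is the reliable way to tell them apart.
is_comment = isinstance(comment_node, Comment)
text_is_comment = isinstance(text_node, Comment)

print(comment_node, type(comment_node))
print(text_node, type(text_node))
```

Comment is a subclass of NavigableString, so code that should ignore comments needs to test for Comment explicitly.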
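6) prettify(): returns the parsed HTML as an indented string, for the whole document or for a single tag
print(soup.prettify())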
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
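7) bs4 converts any HTML input to Unicode internally and writes it out as UTF-8 by default.
Python 3.x strings are Unicode and its default source encoding is UTF-8, so non-ASCII text parses without trouble.
newsoup = BeautifulSoup('<a>中文</a>', 'html.parser')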
print(newsoup.prettify())
<a>
中文
</a>
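The same idea can be sketched with input that is not UTF-8 to begin with. The example below assumes a hypothetical GBK-encoded byte string and passes `from_encoding` so bs4 knows how to decode it.

```python
from bs4 import BeautifulSoup

# The snippet as raw GBK bytes, as a server might send it (hypothetical input).
raw = "<a>中文</a>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes into Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.a.string        # a Unicode string, independent of the input encoding
utf8_bytes = soup.encode()  # encode() defaults to UTF-8

print(text)
print(utf8_bytes)
```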
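Basic HTML format: <tag>…</tag>
Paired tags nest inside one another, which gives the document its tree structure.
.contents stores all of a tag's child nodes in a list.
Downward traversal of the tag tree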
import requests
from bs4 import BeautifulSoup
r=requests.get('http://python123.io/ws/demo.html')
demo=r.text
soup=BeautifulSoup(demo,'html.parser')
print(soup.contents)  # child nodes of the whole tag tree
[<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>]
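print(soup.body.contents)  # list of the child nodes under the body tag
# Note: soup.body.content (without the "s") prints None, because bs4 treats an unknown attribute as a search for a <content> tag.
print(soup.head)  # returns the head tag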
<head><title>This is a python demo page</title></head>
for child in soup.body.children:  # iterate over the child nodes
    print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
for child in soup.body.descendants:  # iterate over all descendant nodes
    print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
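The downward traversals return text nodes (including the newlines between tags) as well as tags. A minimal sketch of filtering them down to tags only, using a shortened hypothetical version of the demo page body:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical stand-in for the demo page body.
html = (
    '<body>\n<p class="title"><b>Title</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic</a> and '
    '<a id="link2">Advanced</a>.</p>\n</body>'
)
soup = BeautifulSoup(html, "html.parser")

# .contents is a list and includes the newline text nodes between the tags.
all_children = soup.body.contents

# .children and .descendants are iterators; isinstance() keeps only real tags.
tag_children = [c.name for c in soup.body.children if isinstance(c, Tag)]
tag_descendants = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children), tag_children, tag_descendants)
```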
Upward traversal of the tag tree
soup.title.parent
<head><title>This is a python demo page</title></head>
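soup.parent  # the BeautifulSoup object is the root of the tree, so its parent is None and nothing is echoed
for parent in soup.a.parents:  # walk up through the ancestors
    if parent is None:
        print(parent)
    else:
        print(parent.name)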
p
body
html
[document]
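The upward walk can be collected into a path in one line, and `find_parent()` jumps straight to a named ancestor. A small sketch on a hypothetical four-level document:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with one chain of nested tags.
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object,
# whose name is '[document]'.
path = [parent.name for parent in soup.a.parents]

# find_parent() returns the nearest ancestor with the given tag name.
body = soup.a.find_parent("body")

# The BeautifulSoup object itself has no parent.
root_parent = soup.parent

print(path, body.name, root_parent)
```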
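Sibling (parallel) traversal of the tag tree
Note: sibling traversal happens between nodes that share the same parent, and a sibling can be a text node (NavigableString) rather than a tag.
print(soup.a.next_sibling)  # the node right after the a tag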
and
print(soup.a.next_sibling.next_sibling)  # the node after that one
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
print(soup.a.previous_sibling)  # the node right before the a tag
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
print(soup.a.previous_sibling.previous_sibling)  # the node before that one
None
for sibling in soup.a.next_siblings:  # iterate over the following sibling nodes
    print(sibling)
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
for sibling in soup.a.previous_siblings:  # iterate over the preceding sibling nodes (note the plural)
    print(sibling)
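Since siblings can be text nodes, code that only wants tags has to skip the text. A minimal sketch, assuming a hypothetical one-paragraph snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: two links separated by plain text.
html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling           # the text node ' and ', not a tag
next_tag = first.find_next_sibling("a")  # skips text and returns the next <a>

# Filtering the iterator with isinstance() keeps only real tags.
tags_after = [s for s in first.next_siblings if isinstance(s, Tag)]

# previous_siblings (plural) iterates over nodes; the singular returns one node.
nodes_before = list(first.previous_siblings)

print(repr(next_node), next_tag["id"], len(tags_after), nodes_before)
```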
<tag>(…) is equivalent to <tag>.find_all(…)
import requests
from bs4 import BeautifulSoup
r = requests.get('http://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
# name: string to match against tag names
soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
soup.find_all(['a', 'p'])
[<p class="title"><b>The demo python introduces several python courses.</b></p>,
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
# attrs: string to match against tag attribute values; a specific attribute can be named
soup.find_all("p","course")
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
soup.find_all(id="link")  # only an exact match is returned
[]
# recursive: whether to search all descendants; defaults to True
soup.find_all('p',recursive=False)
[]
# string: string to match against the text inside <tag>…</tag>
soup.find_all(string="Basic Python")  # only an exact match is returned
['Basic Python']
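Plain strings only match exactly, as the empty result for id="link" shows. One way to get partial matches is to pass a compiled regular expression instead. A sketch on a hypothetical snippet shaped like the demo page:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with the same ids and link texts as the demo page.
html = (
    '<p class="course">Python courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(id="link")                    # exact match only: nothing found
partial = soup.find_all(id=re.compile("link"))      # regex search: link1 and link2
strings = soup.find_all(string=re.compile("Python"))  # text nodes containing 'Python'
first_only = soup.find_all("a", limit=1)            # limit caps the number of results

print(exact, [t["id"] for t in partial], strings, first_only)
```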
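# Import the libraries
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    """Fetch a page and return its text, or an empty string on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    """Append [rank, name, province, score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip whitespace text nodes
            tds = tr('td')  # shorthand for tr.find_all('td')
            # tds[2] is assumed to be the province column shown in the output below
            ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])

def printUnivList(ulist, num):
    # chr(12288) is the full-width space, used as fill so Chinese names line up
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    print(tplt.format("排名", "学校名称", "省市", "总分", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))

def main():
    uinfo = []  # list that collects the extracted rows
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)  # function names are case-sensitive
    printUnivList(uinfo, 30)  # show the top 30 universities

main()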
排名 学校名称 省市 总分
1 清华大学 北京 94.6
2 北京大学 北京 76.5
3 浙江大学 浙江 72.9
4 上海交通大学 上海 72.1
5 复旦大学 上海 65.6
6 中国科学技术大学 安徽 60.9
7 华中科技大学 湖北 58.9
7 南京大学 江苏 58.9
9 中山大学 广东 58.2
10 哈尔滨工业大学 黑龙江 56.7
11 北京航空航天大学 北京 56.3
12 武汉大学 湖北 56.2
13 同济大学 上海 55.7
14 西安交通大学 陕西 55.0
15 四川大学 四川 54.4
16 北京理工大学 北京 54.0
17 东南大学 江苏 53.6
18 南开大学 天津 52.8
19 天津大学 天津 52.3
20 华南理工大学 广东 52.0
21 中南大学 湖南 50.3
22 北京师范大学 北京 49.7
23 山东大学 山东 49.1
23 厦门大学 福建 49.1
25 吉林大学 吉林 48.9
26 大连理工大学 辽宁 48.6
27 电子科技大学 四川 48.4
28 湖南大学 湖南 48.1
29 苏州大学 江苏 47.3
30 西北工业大学 陕西 46.7
Suc30
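The row-extraction step can be tried without network access by feeding it an inline table. The sketch below is an offline approximation, not the program above: `extract_rows`, `SAMPLE`, and the made-up rows are hypothetical, and the column order (rank, name, province, score) is an assumption based on the printed header.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up rows in the assumed column order: rank, name, province, score.
SAMPLE = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>Province X</td><td>94.6</td></tr>
<tr><td>2</td><td>University B</td><td>Province Y</td><td>76.5</td></tr>
</tbody></table>
"""

def extract_rows(html):
    """Return one list of cell texts per <tr> found under <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # no table body: return an empty result instead of failing
        return []
    rows = []
    for tr in tbody.children:
        if isinstance(tr, Tag):  # skip the newline text nodes between rows
            rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows

rows = extract_rows(SAMPLE)
for row in rows:
    print("\t".join(row))
```

Using `get_text(strip=True)` instead of `.string` avoids a None value when a cell contains nested tags.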