pip install beautifulsoup4
A quick test that the Beautiful Soup library installed correctly:
Fetch the source code of the page https://python123.io/ws/demo.html
>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> r.text
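'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'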
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
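Using the Beautiful Soup library:
from bs4 import BeautifulSoup
soup=BeautifulSoup('<p>data</p>','html.parser')
# BeautifulSoup is a class
# '<p>data</p>' is the HTML data to parse; 'html.parser' is the parser
An HTML file can be viewed as a tag tree.
Beautiful Soup is a library for parsing, traversing, and maintaining that "tag tree".
The Beautiful Soup library is also called beautifulsoup4 or bs4. It can be imported in either of two ways:
from bs4 import BeautifulSoup
import bs4
HTML document <-> tag tree <-> BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of one HTML/XML document.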
Parsers available to the Beautiful Soup library:
Parser | Usage | Requirement |
---|---|---|
bs4's HTML parser | BeautifulSoup(mk,'html.parser') | install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk,'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk,'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk,'html5lib') | pip install html5lib |
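Since lxml and html5lib are optional installs, one possible way to cope with a missing parser is to fall back to the built-in one. The sketch below is only an illustration: `make_soup` is a hypothetical helper name, and the choice to prefer lxml is an assumption rather than anything these notes prescribe.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if installed, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser needs no extra package.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>demo</b></p>")
print(soup.b.string)
```

Different parsers may repair incomplete markup differently (lxml, for example, wraps a fragment in html and body tags), so it is safer to pick one parser and keep it for a whole project.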
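Basic elements of the BeautifulSoup class:
Basic element | Description |
---|---|
Tag | A tag, the most basic unit of information organization; <> and </> mark its start and end |
Name | The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the string in <>…</>. Format: <tag>.string |
Comment | The comment part of a string inside a tag; a special type (a subclass of NavigableString) |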
>>> soup.title  # get the title tag of the HTML document
<title>This is a python demo page</title>
>>> tag=soup.a  # get the first a tag of the HTML document
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.name  # get the name of the a tag
'a'
>>> soup.a.parent.name  # get the name of the a tag's parent tag
'p'
>>> soup.a.parent.parent.name  # get the name of the p tag's parent tag
'body'
>>> tag =soup.a
>>> tag.attrs  # get the attributes of the a tag
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']  # get the value stored under the class key
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>  # tag attributes are a dict; a tag without attributes has an empty dict
>>> type(tag)
<class 'bs4.element.Tag'>
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string  # get the string inside the a tag
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string  # get the string inside the p tag
'The demo python introduces several python courses.'  # the <b> tag nested in <p> is not part of the result
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>  # so a NavigableString can reach across nested tags
>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
# the text inside <!-- --> is a comment
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)  # <b> holds a comment, but fetching its string gives no hint of that
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
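Because .string returns a comment's text without any warning, code that only wants visible text has to check the type itself. A minimal sketch of such a check follows; `visible_string` is a hypothetical helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_string(tag):
    """Hypothetical helper: return tag.string as str, or None for comments."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_string(newsoup.b))  # the <b> tag only holds a comment
print(visible_string(newsoup.p))
```

The isinstance test works because Comment is a subclass of NavigableString, so the two can only be told apart by type, not by content.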
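HTML is text with a tree structure.
There are three ways to traverse it:
1. Downward traversal, from the root node to the leaf nodes
2. Upward traversal, from the leaf nodes to the root node
3. Sibling (parallel) traversal
Downward traversal of the tag tree:
Attribute | Description |
---|---|
.contents | List of child nodes; all of <tag>'s children are stored in a list |
.children | Iterator over child nodes; like .contents, used to loop over the children |
.descendants | Iterator over all descendant nodes, used for loop traversal |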
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents  # includes string nodes as well as tag nodes
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> for child in soup.body.children:  # loop over the child nodes
...     print(child)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>> for descendant in soup.body.descendants:  # loop over all descendant nodes
...     print(descendant)
...
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
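Upward traversal of the tag tree:
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | Iterator over the node's ancestor tags, used to loop over the ancestors |
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
# soup itself has no parent, so the code includes a branch for the case where parent is None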
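Sibling (parallel) traversal of the tag tree:
Attribute | Description |
---|---|
.next_sibling | The next sibling node in HTML text order |
.previous_sibling | The previous sibling node in HTML text order |
.next_siblings | Iterator over all following sibling nodes in HTML text order |
.previous_siblings | Iterator over all preceding sibling nodes in HTML text order |
Condition for sibling traversal: it only happens among nodes that share the same parent node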
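A small sketch of sibling traversal, using a shortened paragraph made up for this illustration (the `html` string and its ids are assumptions modelled on the demo page). It also shows that a sibling is not necessarily a tag: the text between two tags is a sibling node too.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="course">Courses: '
    '<a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The node right after the first <a> is the string ' and ', not a tag.
print(repr(first.next_sibling))
# Stepping once more reaches the second <a> tag.
print(first.next_sibling.next_sibling)

# Keep only the siblings that are real tags.
following_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
preceding = list(first.previous_siblings)
```

Filtering with isinstance is the usual guard when only tags are wanted from a sibling loop.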
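Displaying HTML in a more "friendly" way
>>> soup.prettify()  # adds newline characters
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'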
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> print(soup.a.prettify())  # prettify a single tag on its own
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
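Information markup:
Three forms of information markup: XML, JSON, YAML
XML (very close to HTML): eXtensible Markup Language; information is built mainly from tags
<name>…</name> # tag with content
<name /> # tag without content
<!-- --> # comment
JSON: JavaScript Object Notation, the object-oriented way of representing information in the JavaScript language
Information is built from typed key-value pairs key:value (strings, numbers, etc.)
"city" : "Nanjing"
"city" : ["Nanjing","Shanghai"] # multiple values
"city" : {"newName" : "Nanjing" ,
"oldName" : "Jinling"
} # nested key-value pairs use {}
YAML: YAML Ain't Markup Language
Untyped key-value pairs key:value (strings)
name : Nanjing
# indentation expresses ownership
name:
  oldName: Jinling
  newName: Nanjing
# - expresses a parallel (list) relationship
name:
- Nanjing
- Jinling
# | expresses a whole block of data
# # marks a comment
text: |
  Nanjing, abbreviated "Ning" and historically known as Jinling and Jiankang, is the capital of Jiangsu Province, a sub-provincial city, a megacity, and the core city of the Nanjing metropolitan area. The State Council has designated it an important central city of eastern China, a major national base for scientific research and education, and a comprehensive transportation hub. As of 2018, the city administered 11 districts, with a total area of 6,587 square kilometers and a built-up area of 971.62 square kilometers. In 2019, it had 8.500 million permanent residents and an urban population of 7.072 million, an urbanization rate of 83.2%.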
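To make the comparison concrete, here is a minimal sketch that reads the same nested city record from JSON and from XML using only the standard library. The two literal strings are illustrative data modelled on the examples above. YAML is left out because parsing it requires a third-party package.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"city": {"newName": "Nanjing", "oldName": "Jinling"}}'
xml_text = "<city><newName>Nanjing</newName><oldName>Jinling</oldName></city>"

# JSON: typed key-value pairs map directly onto Python dicts.
city_from_json = json.loads(json_text)["city"]

# XML: tags carry the structure, so the dict has to be built by hand.
root = ET.fromstring(xml_text)
city_from_xml = {child.tag: child.text for child in root}

print(city_from_json == city_from_xml)  # True
```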
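Method one: fully parse the markup form of the information, then extract the key information. This requires a markup parser, for example tag-tree traversal with the bs4 library. (Parsing is accurate, but extraction is tedious and slow.)
Method two: ignore the markup form and search for the key information directly, using text search functions on the information. (Extraction is concise and fast, but the accuracy of the result depends on the content.)
Combined method: combine formal parsing with searching to extract the key information. This requires both a markup parser and text search functions.
Example: extract all URL links in an HTML document
1) Find all <a> tags
2) Parse the format of each <a> tag and extract the link that follows href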
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
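For contrast, method two (searching the raw text without parsing it) could look like the sketch below. The regular expression and the shortened `html` string are assumptions made for illustration. The pattern only handles double-quoted href values.

```python
import re

html = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)

# Method two: skip the tag tree and search the text for href="...".
links = re.findall(r'href="([^"]+)"', html)
print(links)
```

This is quick, but it would also pick up href attributes on tags other than a, which is the content-dependent accuracy the notes mention.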
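<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list that stores the search results
name: search string for tag names
attrs: search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: search string for the string region inside <>...</>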
name
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re  # regular expression library
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b
attrs
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
recursive
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
string
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string="Basic Python")
['Basic Python']
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']
<tag>(…) is equivalent to <tag>.find_all(…)
That is, soup(…) is equivalent to soup.find_all(…)
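A short sketch of the call shorthand, on a made-up fragment (the `html` string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Calling the soup (or any tag) like a function runs find_all on it.
same = soup("a") == soup.find_all("a")
ids = [a["id"] for a in soup.p("a")]
only_first = soup("a", id="link1")

print(same, ids)
```

The shorthand accepts the same arguments as find_all, so keyword filters such as id work unchanged.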
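Extended methods
Method | Description |
---|---|
<>.find() | Search and return only one result; same parameters as find_all() |
<>.find_parents() | Search among the ancestor nodes; returns a list; same parameters as find_all() |
<>.find_parent() | Return one result from the ancestor nodes (a single element, or None if nothing matches); same parameters as find_all() |
<>.find_next_siblings() | Search among the following sibling nodes; returns a list; same parameters as find_all() |
<>.find_next_sibling() | Return one result from the following sibling nodes (a single element, or None if nothing matches); same parameters as find_all() |
<>.find_previous_siblings() | Search among the preceding sibling nodes; returns a list; same parameters as find_all() |
<>.find_previous_sibling() | Return one result from the preceding sibling nodes (a single element, or None if nothing matches); same parameters as find_all() |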
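The methods in the table can be tried on a small fragment. The sketch below uses a made-up `html` string modelled on the demo page, so the tag names and ids are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

html = (
    '<body><p class="title"><b>Title</b></p>'
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                    # first match only
parents = first_link.find_parents(["p", "body"])  # nearest ancestor first
parent_p = first_link.find_parent("p")
next_link = first_link.find_next_sibling("a")  # skips the text ' and '
prev_p = parent_p.find_previous_sibling("p")
missing = soup.find("table")                   # no match gives None

print([t.name for t in parents], next_link["id"], prev_p["class"], missing)
```

The single-result variants return None rather than an empty list when nothing matches, so their result should be checked before it is used.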
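Input: the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    """Append [rank, name, total score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        # .children also yields whitespace strings; keep only real tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    """Print the first num entries as centered, tab-separated columns."""
    print("{:^10}\t{:^6}\t{:^10}".format("Rank", "University", "Total score"))
    for u in ulist[:num]:
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    ufo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"
    html = getHTMLText(url)
    fillUnivList(ufo, html)
    printUnivList(ufo, 20)  # 20 univs

main()
Rank University Total score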
1 清华大学 94.6
2 北京大学 76.5
3 浙江大学 72.9
4 上海交通大学 72.1
5 复旦大学 65.6
6 中国科学技术大学 60.9
7 华中科技大学 58.9
7 南京大学 58.9
9 中山大学 58.2
10 哈尔滨工业大学 56.7
11 北京航空航天大学 56.3
12 武汉大学 56.2
13 同济大学 55.7
14 西安交通大学 55.0
15 四川大学 54.4
16 北京理工大学 54.0
17 东南大学 53.6
18 南开大学 52.8
19 天津大学 52.3
20 华南理工大学 52.0
Page link: http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html
Inspect the page's HTML source:
Each university's information sits in a <tr> tag pair, and each individual item sits in a <td> tag pair
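The tr/td structure can be tried offline on a small fragment. The table below is a hypothetical stand-in for the real page: the two rows reuse values from the output above, while the column layout (rank, name, province, total score) is an assumption inferred from the indexes the program uses.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the ranking table (layout assumed).
html = """<table><tbody>
<tr><td>1</td><td>清华大学</td><td>北京</td><td>94.6</td></tr>
<tr><td>2</td><td>北京大学</td><td>北京</td><td>76.5</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("tbody").children:
    # The newlines between rows are string nodes, so skip non-tags.
    if isinstance(tr, bs4.element.Tag):
        tds = tr("td")  # same as tr.find_all("td")
        rows.append([tds[0].string, tds[1].string, tds[3].string])

print(rows)
```

Without the isinstance check, the loop would hit the newline strings between rows and fail when trying to call them like tags.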