This series is copied from the Beautiful Soup tutorial; the original is here:
http://beautifulsoup.readthedocs.io/zh_CN/latest/
Good work starts with good tools, so let's install beautifulsoup4 first:
# pip install beautifulsoup4    (on CentOS)
Collecting beautifulsoup4
Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
100% |████████████████████████████████| 92kB 669kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3
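A quick way to confirm the install worked is to import the package and parse a trivial snippet. This is only a sanity-check sketch; the `<p>ok</p>` markup is made up for the test, and the version printed will be whatever pip installed on your machine.

```python
# Sanity check: import the package and parse a trivial snippet.
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # whatever version pip installed, e.g. 4.5.3 in the log above

soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails, pip most likely installed the package for a different Python interpreter than the one you are running.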
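Installing a parser:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed in several ways; here we use pip: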
# pip install lxml
Collecting lxml
Downloading lxml-3.7.3-cp35-cp35m-manylinux1_x86_64.whl (7.1MB)
100% |████████████████████████████████| 7.1MB 83kB/s
Installing collected packages: lxml
Successfully installed lxml-3.7.3
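Once installation is complete, here is how to use it:
Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.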
from bs4 import BeautifulSoup
with open("index.html") as fp:                            # from a file handle
    soup = BeautifulSoup(fp, "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string
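To make the two input styles concrete, here is a self-contained sketch. It writes a throwaway file first so that it does not depend on an `index.html` already existing; the temporary path and the sample markup are assumptions made for the example.

```python
import os
import tempfile
from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Write a throwaway file so the example does not depend on an existing index.html.
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(markup)

# 1) From an open file handle; the with-block closes the file afterwards.
with open(path, encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "html.parser")

# 2) From a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.p.string)
print(soup_from_string.p.string)
```

Both routes should produce the same tree, since the file handle is simply read and parsed like the string.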
First, the document is converted to Unicode, and HTML entities are converted to the corresponding Unicode characters:
BeautifulSoup("Sacr&eacute; bleu!", "html.parser")
Sacré bleu!
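The conversion can be observed from both directions: an entity inside a `str`, and raw bytes in a non-UTF-8 encoding. The sketch below assumes Latin-1 input and names that encoding explicitly through `from_encoding` rather than relying on automatic detection.

```python
from bs4 import BeautifulSoup

# An HTML entity in a str is decoded to the matching Unicode character.
entity_soup = BeautifulSoup("Sacr&eacute; bleu!", "html.parser")
entity_text = entity_soup.get_text()

# Bytes are decoded to str first; from_encoding states the encoding explicitly
# instead of leaving it to Beautiful Soup's detection.
raw = "Sacré bleu!".encode("latin-1")
bytes_soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
bytes_text = bytes_soup.get_text()

print(entity_text)
print(bytes_text)
```

Either way the text you get back is an ordinary Python `str`.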
Next, Beautiful Soup picks the most suitable parser available to parse the document; if you specify a parser explicitly, Beautiful Soup uses that one instead.
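Naming the parser explicitly keeps results consistent across machines, but the requested parser may not be installed everywhere. One possible approach is a small helper that tries parsers in order of preference. The helper name `make_soup` and the preference order are hypothetical, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Keep in mind that different parsers can build different trees from invalid markup, so a fallback like this is safest when the input is reasonably well-formed.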
The examples that follow use this HTML string:
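html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""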
Parsing this string with BeautifulSoup yields a BeautifulSoup object, which can be printed as a neatly indented tree:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
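<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>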
Here are a few ways to navigate the parsed structure:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
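[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]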
>>> soup.find(id="link2")
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
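The same lookups can be run as a standalone script rather than in the interactive shell. The sketch below uses a shortened copy of the three-sisters document and adds two details worth knowing: `class_` is spelled with a trailing underscore because `class` is a Python keyword, and `find` returns `None` when nothing matches. The id `no-such-id` is made up to show the miss.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access returns the first matching tag.
title_text = soup.title.string

# find_all accepts a tag name plus attribute filters.
sister_names = [a.string for a in soup.find_all("a", class_="sister")]

# find returns the first match, or None when nothing matches.
link2 = soup.find(id="link2")
missing = soup.find(id="no-such-id")

print(title_text, sister_names, link2["href"], missing)
```

Because `find` can return `None`, it is worth checking the result before indexing into it when scraping pages you do not control.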
Extract the URLs of all <a> tags in the document:
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
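Real pages often contain `<a>` tags without an `href`, as well as relative links. A minimal sketch for handling both follows; the sample markup and `base_url` are invented for the example. `link.get('href')` would return `None` for the tag that has no `href`, so here the filter `href=True` skips such tags up front.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page = """
<a href="http://example.com/elsie">Elsie</a>
<a href="/lacie">Lacie</a>
<a name="anchor-only">no href here</a>
"""
base_url = "http://example.com/index.html"  # hypothetical address of the page

soup = BeautifulSoup(page, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# so link["href"] cannot raise KeyError.
links = [urljoin(base_url, link["href"]) for link in soup.find_all("a", href=True)]
print(links)
```

`urljoin` leaves absolute URLs untouched and resolves relative ones against the page address.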
Extract all the text from the document:
>>> print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
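By default `get_text()` concatenates every string exactly as it appears, whitespace included. When cleaner output is wanted, a separator and `strip=True` can be passed. The sketch below uses a tiny made-up snippet so the effect is easy to see.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

plain = soup.get_text()                  # strings concatenated as-is
joined = soup.get_text("|", strip=True)  # strip each piece, then join with "|"
pieces = list(soup.stripped_strings)     # the same stripped pieces as a list

print(plain)   # Hello, world!
print(joined)  # Hello,|world|!
print(pieces)  # ['Hello,', 'world', '!']
```

The separator makes the boundaries between text nodes visible, which helps when adjacent tags would otherwise run together.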