Beautiful Soup 4 Study Notes (1): Installation

This series is copied from the Beautiful Soup tutorial. Original link:
http://beautifulsoup.readthedocs.io/zh_CN/latest/

To do good work, you must first sharpen your tools. Let's start by installing beautifulsoup4 (shown here on CentOS):

# pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
    100% |████████████████████████████████| 92kB 669kB/s 
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3

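Installing a parser:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. How you install lxml depends on your operating system; here it is installed with pip: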

# pip install lxml
Collecting lxml
  Downloading lxml-3.7.3-cp35-cp35m-manylinux1_x86_64.whl (7.1MB)
    100% |████████████████████████████████| 7.1MB 83kB/s 
Installing collected packages: lxml
Successfully installed lxml-3.7.3
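
To confirm that both packages are visible to the interpreter pip installed them into, a quick check along these lines can help. This is a minimal sketch using only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "bs4" is the import name of the beautifulsoup4 package.
for name in ("bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If lxml shows up as missing, the built-in html.parser remains available as a fallback.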

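Once installation is complete, how do you use it?
Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.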

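from bs4 import BeautifulSoup

# Pass an open file handle; the with block closes the file afterwards
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Or pass a string of markup directly
soup = BeautifulSoup("<html>data</html>", "html.parser")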

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!", "html.parser")
Sacré bleu!
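
The same conversion applies when the input is raw bytes rather than a string. The sketch below assumes the bytes are Latin-1 encoded and passes that hint through the from_encoding argument; the sample text and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The same phrase as bytes in Latin-1, where "é" is the single byte 0xE9.
raw = "Sacré bleu!".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
decoded = soup.get_text()

# An HTML entity in a str input ends up as the same Unicode character.
from_entity = BeautifulSoup("Sacr&eacute; bleu!", "html.parser").get_text()

print(decoded == from_entity)
```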

Next, Beautiful Soup picks the most suitable available parser to parse the document. If you specify a parser explicitly, Beautiful Soup uses that one instead.
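
Naming the parser explicitly keeps results consistent across machines that have different parsers installed. A minimal sketch, assuming only the built-in html.parser is available; the sample markup is made up.

```python
from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python.
# Passing "lxml" instead would require the lxml package to be installed.
soup = BeautifulSoup(markup, "html.parser")

first_paragraph = soup.p.string
print(first_paragraph)
```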

The examples below use the following string of HTML:

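html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""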

Parsing this markup with BeautifulSoup yields a BeautifulSoup object, which can print the document as a nested structure with standard indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

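print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>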

Here are a few ways to navigate the structured data:

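>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id="link2")
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>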

Extracting the links from all <a> tags in the document:

>>> for link in soup.find_all('a'):
...     print(link.get('href')) 
...

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
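
Note that link.get('href') returns None for an anchor with no href attribute, whereas link['href'] would raise a KeyError. The sketch below uses made-up markup to show that behaviour, along with one way to skip such anchors via find_all('a', href=True).

```python
from bs4 import BeautifulSoup

markup = """
<a href="http://example.com/elsie">Elsie</a>
<a name="top">Top</a>
<a href="http://example.com/lacie">Lacie</a>
"""
soup = BeautifulSoup(markup, "html.parser")

# .get() returns None when the attribute is absent.
all_hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True matches only tags that actually carry an href attribute.
real_hrefs = [link["href"] for link in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```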

Extracting all the text from the document:

>>> print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
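
get_text() also accepts a separator and a strip flag, which help when the raw output contains a lot of stray whitespace. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = '<p>Once upon a time <a href="http://example.com/elsie">Elsie</a> lived.</p>'
soup = BeautifulSoup(markup, "html.parser")

# Default: text nodes are concatenated exactly as they appear.
plain = soup.get_text()

# With a separator and strip=True, each text node is stripped and then joined.
joined = soup.get_text("|", strip=True)

print(plain)
print(joined)
```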
