Beautiful Soup 4 Study Notes (1): Installation

This series is copied from the Beautiful Soup documentation. The original is at:
http://beautifulsoup.readthedocs.io/zh_CN/latest/

To do a good job, you must first sharpen your tools. Let's install beautifulsoup4 (shown here on CentOS):

# pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.3-py3-none-any.whl (85kB)
    100% |████████████████████████████████| 92kB 669kB/s 
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.3

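Installing a parser:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system there are several ways to install lxml; pip is used here: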

# pip install lxml
Collecting lxml
  Downloading lxml-3.7.3-cp35-cp35m-manylinux1_x86_64.whl (7.1MB)
    100% |████████████████████████████████| 7.1MB 83kB/s 
Installing collected packages: lxml
Successfully installed lxml-3.7.3
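
To confirm that both packages can be imported after installation, a quick check along these lines may help. It is only a minimal sketch: the helper name `installed_version` is made up for illustration and is not part of Beautiful Soup or lxml.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


# bs4 is the import name of the beautifulsoup4 package; lxml is optional.
for name in ("bs4", "lxml"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```

If `lxml` is reported as not installed, Beautiful Soup can still be used with the standard-library parser `html.parser`.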

How to use it once installed:
Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.

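from bs4 import BeautifulSoup

# Pass an open file handle; naming the parser avoids a "no parser specified" warning
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Or pass a string
soup = BeautifulSoup("<html>data</html>", "html.parser")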

First, the document is converted to Unicode, and HTML entities are converted to the Unicode characters they represent:

BeautifulSoup("Sacr&eacute; bleu!", "html.parser")
Sacré bleu!
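
The conversion described above can be checked with a short script. This is a sketch under a few assumptions: the parser is `html.parser`, and the byte string is UTF-8, which is stated explicitly through `from_encoding` so that Beautiful Soup does not have to guess. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# An HTML entity in a str input is decoded to the character it names.
entity_soup = BeautifulSoup("Sacr&eacute; bleu!", "html.parser")
entity_text = entity_soup.get_text()

# Raw bytes are decoded to str; from_encoding tells Beautiful Soup which
# encoding to use instead of letting it detect one.
raw = "Sacré bleu!".encode("utf-8")
bytes_soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
bytes_text = bytes_soup.get_text()

print(entity_text)
print(bytes_text)
print(entity_text == bytes_text)
```

Both inputs should end up as the same Unicode string, whether the accented character arrived as an entity or as encoded bytes.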

Next, Beautiful Soup picks the most suitable parser available to parse the document. If you specify a parser explicitly, Beautiful Soup uses that one instead.
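
Naming the parser explicitly makes the result reproducible across machines. The sketch below tries lxml first and falls back to the standard-library parser. The helper `make_soup` and its `parsers` argument are hypothetical names for illustration, while `FeatureNotFound` is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Returns a (soup, parser_name) pair.
    """
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not available; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello, <b>world</b></p>")
print(used)           # which parser was actually used
print(soup.b.string)  # text inside the <b> tag
```

Different parsers can build different trees from invalid markup, so a fallback like this is best kept to documents that are reasonably well formed.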

Here is a string containing some HTML:

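html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""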

Parsing this string with BeautifulSoup gives a BeautifulSoup object, which can print the document with standard indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

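print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>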

A few ways to navigate the parsed data structure:

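>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id="link2")
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>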
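
The lookups above all succeed because the elements exist. The following sketch shows what happens when a tag or attribute is missing, using a small stand-in document (the `html` string and the id `link9` are made up for illustration).

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>Story</b></p><a href="/one" id="link1">One</a>'
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id="link9")
print(missing is None)

# Tag.get() returns None for a missing attribute instead of raising
# KeyError the way link["title"] would.
link = soup.find("a")
print(link.get("href"))
print(link.get("title"))

# class is a multi-valued attribute, so it comes back as a list.
print(soup.p["class"])
```

Checking for `None` before chaining further attribute access avoids an `AttributeError` on pages where the expected element is absent.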

Find the links of all <a> tags in the document:

>>> for link in soup.find_all('a'):
...     print(link.get('href')) 
...

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
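
Real pages often mix absolute and relative links, and some `<a>` tags carry no `href` at all. As a hedged sketch, the loop above can be extended to skip tags without `href` and to resolve relative links against a base address. `BASE_URL` and the sample `html` are assumptions for illustration; `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://example.com/stories/"  # assumed address of the page
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="lacie" id="link2">Lacie</a>
<a id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = {
    a.get_text(): urljoin(BASE_URL, a["href"])
    for a in soup.find_all("a", href=True)
}
print(links)
```

Here the third tag is skipped because it has no `href`, and the relative link `lacie` is resolved against `BASE_URL`.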

Extract all the text from the document:

>>> print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
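
The plain `get_text()` call keeps the newlines and spacing of the source markup. When cleaner output is wanted, the `separator` and `strip` arguments of `get_text()` and the `stripped_strings` generator may be useful. The snippet below is a small sketch on a made-up fragment, not the full document above.

```python
from bs4 import BeautifulSoup

html = "<p>Elsie,\n<b>Lacie</b> and\n<i>Tillie</i>;</p>"
soup = BeautifulSoup(html, "html.parser")

# The first argument is the separator placed between text fragments;
# strip=True trims each fragment and drops the empty ones.
compact = soup.get_text(" ", strip=True)
print(compact)

# stripped_strings yields the same trimmed fragments one at a time.
pieces = list(soup.stripped_strings)
print(pieces)
```

Note that the separator is inserted between every pair of fragments, so punctuation that sits in its own text node ends up preceded by a space.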
