The Beautiful Soup Library for Python

Beautiful Soup 4 has been moved into the bs4 package, so import it like this:

from bs4 import BeautifulSoup

Creating a BeautifulSoup object

soup = BeautifulSoup(html, 'lxml')

You can also create the object from a local HTML file, for example:

soup = BeautifulSoup(open('index.html'), 'lxml')
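The one-liner above leaves the file handle open. A minimal sketch of the same idea with the file closed explicitly is shown below. The temporary file, its contents, and the choice of the built-in 'html.parser' (used here only so the sketch runs without lxml installed) are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a tiny HTML file so the example is self-contained.
# The file name and its contents are made up for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>Demo page</title></head><body></body></html>")

    # Open the file in a with-block so it is closed once parsing is done.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

# The parsed tree stays usable after the file has been closed.
page_title = soup.title.string
print(page_title)
```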

Pretty-printing the output:

print(soup.prettify())

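html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""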

The html_doc string above is the sample document; every search example that follows runs against it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

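BeautifulSoup filters

Filters can be applied to a tag's name, to its attributes, to string content, or to any combination of these.

1. String
Pass a string to find content that matches the string exactly.

2. Regular expression
Pass a compiled pattern, re.compile(pattern), to match content against the regular expression.

3. List
Pass a list such as ['div','a','b'] to get everything that matches any item in the list.

4. True
Matches any value.

5. Function
You can also pass a function of your own.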
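The five filter types can be compared side by side. The sketch below runs each one against a small made-up document; the helper names `names` and `has_class_but_no_id` and the use of 'html.parser' are assumptions for this illustration. A function filter receives a tag and returns True when the tag should be kept.

```python
import re

from bs4 import BeautifulSoup

# A small, made-up document used only for this comparison.
html = (
    "<html><head><title>T</title></head>"
    '<body><p class="title"><b>Bold</b></p>'
    '<a id="link1" href="#">One</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return just the tag names, in document order."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    """Function filter: keep tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("b"))                     # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))         # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))                # any name in the list
by_true = names(soup.find_all(True))                      # every tag
by_function = names(soup.find_all(has_class_but_no_id))   # custom predicate

print(by_string, by_regex, by_list, by_true, by_function, sep="\n")
```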

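The find_all method

find_all searches every descendant of the current tag and checks each one against the filter.
It returns a list of all the tags that match.

Syntax:

find_all( name , attrs , recursive , string , limit , **kwargs )

The name parameter

Finds all tags whose name is name.
The value of name can be any type of filter: a string, a regular expression, a list, a function, or True.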

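a. Passing a string to name

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]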

b. Passing a regular expression

# Find all tags whose names start with "b"; both <body> and <b> match
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Find all tags whose names contain "t":
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

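c. Passing a list

# Find all <a> tags and all <b> tags in the document:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]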

d. Passing True
True matches any value. The code below finds every tag in the document:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

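Keyword arguments

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name.
Accepted filter types: string, regular expression, list, and True.

1. Passing a string

# Find every tag whose id attribute has the value link2
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]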

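2. Passing a regular expression

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

3. True

# Find every tag that has an id attribute, whatever its value:
soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]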

Matching several attributes at once

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

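To search on the class attribute, use class_ instead, because class is a reserved word in Python:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]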

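HTML5 data-* attributes cannot be used as keyword arguments, because a name containing a hyphen is not a valid Python identifier:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

To search for tags with such attributes, pass a dictionary through the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]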
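The attribute filters above can be tried together in one runnable piece. This is a small sketch on a made-up fragment (the two div elements and the variable names are assumptions for illustration), using 'html.parser' so that lxml is not required. It also shows that class_ matches a tag if any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first div has two classes and a data-* attribute.
html = (
    '<div data-foo="value" class="box main" id="d1">foo!</div>'
    '<div class="box">bar</div>'
)
soup = BeautifulSoup(html, "html.parser")

# data-* attributes have to go through the attrs dictionary.
by_data = soup.find_all(attrs={"data-foo": "value"})

# class_ matches when any one of the tag's classes equals "box".
by_class = soup.find_all("div", class_="box")

# Several entries in attrs must all match the same tag.
by_both = soup.find_all("div", attrs={"data-foo": "value", "class": "main"})

data_texts = [tag.get_text() for tag in by_data]
class_texts = [tag.get_text() for tag in by_class]
both_texts = [tag.get_text() for tag in by_both]
print(data_texts, class_texts, both_texts)
```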

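The string parameter

Matches string content in the document and returns a list of strings.
string accepts a string, a regular expression, a list, True, or a function. In versions before Beautiful Soup 4.4.0 this parameter was called text.

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

string can be combined with the other parameters:

# Find the <a> tags whose text is exactly "Elsie":
soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]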
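The return type changes depending on how string is used: on its own it yields text nodes, while combined with a tag name it yields tags. The sketch below illustrates this on a made-up fragment; the variable names are assumptions, 'html.parser' is used so lxml is not needed, and the `string` keyword assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment with two links separated by a comma.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# string alone: the result is a list of text nodes, not tags.
matched_strings = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# name plus string: the result is a list of tags whose text matches.
matched_tags = soup.find_all("a", string="Lacie")
matched_ids = [tag["id"] for tag in matched_tags]

print(matched_strings)
print(matched_ids)
```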

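The limit parameter

Sets an upper bound on the number of matches. Once the number of results reaches limit, the search stops and returns what it has found.

# Three tags in the document match, but only two are returned
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]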

The recursive parameter

By default, find_all() searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.

A simple document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Search results with and without recursive=False:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

The <title> tag is under the <html> tag, but it is not a direct child; the <head> tag is. When all descendants are searched, <title> is found. With recursive=False only direct children are searched, so <title> is not found.

find_all() is just about the most commonly used search method in Beautiful Soup, so it has a shorthand: calling a BeautifulSoup object or a tag like a function is the same as calling find_all() on it. The following two lines are equivalent:

soup.find_all("a")
soup("a")

These two lines are also equivalent:

soup.title.find_all(string=True)
soup.title(string=True)

The find() method

find( name , attrs , recursive , string , **kwargs )

Differences between find() and find_all():

find_all() returns every tag that matches; find() returns only the first one.
find_all() returns a list; find() returns the result itself.
When nothing matches, find_all() returns an empty list and find() returns None.

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
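Because find() returns None when nothing matches, chaining an attribute access straight onto its result can raise AttributeError. The sketch below shows the different return values and one way to guard against the None case. The one-line document and the variable names are assumptions for illustration, and 'html.parser' is used so lxml is not required.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_title = soup.find("title")               # a single Tag
title_list = soup.find_all("title", limit=1)   # a list holding that Tag

missing_tag = soup.find("table")               # None: nothing matched
missing_list = soup.find_all("table")          # empty list: nothing matched

# <title> is a grandchild of <html>, so a direct-children search misses it.
direct_only = soup.html.find_all("title", recursive=False)

# Guard against None before touching the result of find().
table_text = missing_tag.get_text() if missing_tag is not None else ""

print(first_title.string, len(title_list), missing_tag, missing_list, direct_only)
```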

Tag attributes: attrs

.attrs returns all of a tag's attributes as a dictionary:

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}

Getting the value of a single attribute:

print(soup.p['class'])
# ['title']
print(soup.p.get('class'))
# ['title']
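The two access styles differ when the attribute is absent: square brackets raise KeyError, while get() returns None or a default you supply. A minimal sketch, assuming a one-tag document and the built-in 'html.parser'; the variable names and the 'no-id' default are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">text</p>', "html.parser")
p = soup.p

all_attrs = p.attrs                    # every attribute, as a dict
class_value = p["class"]               # class is multi-valued, so this is a list
name_value = p.get("name")             # a plain string

# A missing attribute: get() returns None or the given default.
missing_default = p.get("id")
missing_custom = p.get("id", "no-id")

# Square brackets raise KeyError for a missing attribute.
try:
    p["id"]
except KeyError:
    missing_raises = True
else:
    missing_raises = False

print(all_attrs, class_value, name_value, missing_default, missing_custom, missing_raises)
```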

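The select() method

Filters elements using CSS selector syntax and returns a list of tags.

CSS selector syntax:
A tag name is written with no decoration; a class name (the value inside the quotes of class="className") is prefixed with a dot; an id (the value inside the quotes of id="idName") is prefixed with #.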

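Selecting by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Selecting by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Selecting by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]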

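Selecting by attribute

A selector can also include attribute conditions, written inside square brackets:

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Selecting by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]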

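Conditions that apply to the same tag are written without a space between them;

conditions that apply to different tags are separated by a space.

(A tag that satisfies both condition 1 and condition 2)
Select the tag whose name is a and whose id is link2:

soup.select('a#link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Selecting a tag nested inside another tag
Find the tag with id link1 inside a p tag; the two parts must be separated by a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]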

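Selecting the direct children of a tag

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Querying with several CSS selectors at once (tags that satisfy condition 1 or condition 2):

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]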
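When only the first match is needed, select_one() plays the same role for CSS selectors that find() plays for find_all(): it returns a single tag, or None when nothing matches. The sketch below combines it with the selector forms described above. The trimmed-down document and the variable names are assumptions for illustration, and 'html.parser' is used so lxml is not required.

```python
from bs4 import BeautifulSoup

# A trimmed-down, two-link version of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + class on the same tag (no space), child combinator between tags.
first_link = soup.select_one("p.story > a.sister")

# No <a> is a direct child of <body>, so select_one() returns None.
no_match = soup.select_one("body > a")

# Class and attribute-existence conditions on the same tag.
link_ids = [a["id"] for a in soup.select("a.sister[href]")]

# Comma-separated selectors: a tag matching either one is returned.
either = [a.get_text() for a in soup.select("#link1,#link2")]

print(first_link["id"], no_match, link_ids, either)
```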

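To remove or replace the &nbsp; (non-breaking space) entities in content extracted from a page's HTML source with Beautiful Soup, match on \xa0, which is the character they are parsed into:

>>> soup = BeautifulSoup('a&nbsp;b', 'lxml')
>>> soup.prettify()
'<html>\n <body>\n  <p>\n   a\xa0b\n  </p>\n </body>\n</html>'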
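Since each &nbsp; arrives as the single character \xa0, an ordinary str.replace() on the extracted text is enough to clean it up. A minimal sketch, assuming a made-up one-paragraph input and the built-in 'html.parser'; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up input containing three non-breaking spaces.
soup = BeautifulSoup("<p>a&nbsp;b&nbsp;&nbsp;c</p>", "html.parser")

raw_text = soup.p.get_text()               # entities are already decoded to \xa0
cleaned = raw_text.replace("\xa0", " ")    # swap each one for a normal space
removed = raw_text.replace("\xa0", "")     # or drop them entirely

print(repr(raw_text))
print(repr(cleaned))
print(repr(removed))
```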

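.text and .string

After locating a tag with find(), you can get the text inside it through either the .text attribute or the .string attribute.

The two often return the same result, but they are not the same thing.

For example, given HTML like this:

1. <td>some text</td>
2. <td></td>
3. <td><p>more text</p></td>
4. <td>even <p>more text</p></td>

Results of the .string attribute:

1. some text
2. None
3. more text
4. None

Results of the .text attribute:

1. some text
2. (empty string)
3. more text
4. even more text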

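Differences between .text and .string:

Row 1: the td has no child tag and contains text. Both return the same result, the text.

Row 2: the td has no child tag and no text. .string returns None; .text returns an empty string.

Row 3: the td has exactly one child tag and the text appears only inside that child. Both return the same result, the text inside the child tag.

Row 4: the key difference. The td has a child tag, and the parent td and the child p each contain a piece of text. The two results differ sharply:

.string returns None, because there are two or more pieces of text and .string cannot tell which one to return.

.text returns the two pieces of text joined together.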
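The four cases can be checked directly. The sketch below wraps the four cells in a made-up table and collects both attributes for each one; the table wrapper, the variable names, and the use of 'html.parser' are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# The four cells discussed above, wrapped in a minimal table.
html = """
<table><tr>
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is None unless the tag has exactly one piece of text to return.
string_results = [td.string for td in cells]

# .text joins every piece of text found under the tag.
text_results = [td.text for td in cells]

for number, (s, t) in enumerate(zip(string_results, text_results), start=1):
    print(number, repr(s), repr(t))
```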

Demo: extracting web page content with BeautifulSoup

# python3
# -*- coding: utf-8 -*-
# Filename: BeautifulSoup_demo.py

"""
Practice extracting web page content with BeautifulSoup

@author: v1coder
"""

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':''}


# Scrape the novel 《一念永恒》 from biqukan.com
def biqukan_com():
    url = 'https://www.biqukan.com/1_1094/5403177.html'
    req = requests.get(url)
    # text_data = req.text  also works
    content_data = req.content
    soup = BeautifulSoup(content_data, 'lxml')
    # print(soup.prettify())
    texts = soup.find_all('div', id="content")
    # texts = soup.find_all('div', class_="showtxt")  # also works
    print(texts[0].text.replace('\xa0'*8,'\n'))

#biqukan_com()


# Scrape the title of each episode of 吐槽大会 (Roast), season 3
def tucaodahui_title():
    url = 'https://v.qq.com/detail/8/80844.html'
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'lxml')
    titles = soup.find_all('strong', itemprop="episodeNumber")
    for title in titles:
        print(title.text)

#tucaodahui_title()


# Scrape a plain-text article
def jianshu():
    url = 'https://www.jianshu.com/p/713415f82576'
    data = requests.get(url, headers=headers).text
    soup = BeautifulSoup(data, 'lxml')
    texts = soup.find_all('div', class_="show-content-free")
    print(texts[0].text)

#jianshu()


# Douban Movie TOP250
def douban_TOP250():
    url = 'https://movie.douban.com/top250'
    data = requests.get(url, headers=headers).text
    soup = BeautifulSoup(data, 'lxml')
    comments = soup.find_all('span', class_="inq")
    titles = soup.find_all('img', width="100")
    for num in range(len(titles)):
        # titles[num].get('alt'): get() takes an attribute name and returns its value
        title = str(num+1) + '.' + '《' + titles[num].get('alt') + '》'
        comment = ' ：' + comments[num].text
        print(title)
        print(comment)
        print()

#douban_TOP250()


# Latest movies on dytt8.net (电影天堂)
def dytt():
    url = 'https://www.dytt8.net/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    text = soup.find('div', class_="co_content8")
    dates = text.find_all('font')  # the dates
    # for date in dates[1:]:
    #     print(date.text)
    names = text.select("td a")  # the movie names
    num = 1
    for name in names[2::2]:
        print(name.text)
        print(dates[num].text)
        print()
        num += 1

#dytt()


# Image site 1 (mmjpg.com): print the image URLs
def mmjpg():
    url = 'http://www.mmjpg.com/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('img', width="220")
    for url_tag in url_tags:
        pic_url = url_tag['src']  # read an attribute with ['src'] or get('src')
        print(pic_url)

#mmjpg()


# Image site 2 (haopic.me): print the image URLs
def haopic_me():
    url = 'http://www.haopic.me/tag/meizitu'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('div', class_="post")
    for url_tag in url_tags:
        pic_url = url_tag.find('img')['src']
        print(pic_url)

#haopic_me()


# Image site 3 (mzitu.com): print the image URLs
def mzitu_com():
    url = 'https://www.mzitu.com/'
    data = requests.get(url, headers=headers).content
    soup = BeautifulSoup(data, 'lxml')
    url_tags = soup.find_all('img', class_='lazy')
    for url_tag in url_tags:
        pic_url = url_tag.get('data-original')
        print(pic_url)

#mzitu_com()

Acknowledgments:

Beautiful Soup usage | 静觅
Beautiful Soup 4.4.0 documentation
Parsing web pages with BeautifulSoup

2018-12-27
