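Enter BeautifulSoup

What is BeautifulSoup? It is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide ways to navigate, search, and modify a document.

Installing BeautifulSoup

It can be installed with pip; the package name is beautifulsoup4.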

pip install -i https://pypi.douban.com/simple beautifulsoup4

Installing a parser

Besides the HTML parser in the Python standard library, BeautifulSoup supports several third-party parsers, such as lxml. lxml can be installed as follows:

pip install -i https://pypi.douban.com/simple lxml

Another option is html5lib, a parser implemented in pure Python. It can be installed as follows:

pip install -i https://pypi.douban.com/simple html5lib

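The main parsers and their trade-offs are: Python's built-in html.parser, which needs no extra dependency and has decent speed; lxml, which is very fast but depends on an external C library; and html5lib, which is the most lenient and parses pages the way a web browser does, but is very slow.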

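How to use it

Pass a document to the BeautifulSoup constructor to get a document object. The document can be a string or an open file handle.

from bs4 import BeautifulSoup

# Parse from an open file handle
with open("index.html") as fp:
    soup = BeautifulSoup(fp)

# Parse from a string
soup = BeautifulSoup("<html>data</html>")

If a parser is specified explicitly, BeautifulSoup uses it to parse the document; otherwise it picks the best parser among those installed.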
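As a minimal sketch of choosing the parser explicitly, the snippet below passes the parser name as the second argument to the constructor. It assumes the standard-library "html.parser" so that no extra package is needed, and the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
markup = "<html><body><p>data</p></body></html>"

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(markup, "html.parser")

print(soup.p.get_text())  # data
```

Swapping "html.parser" for "lxml" or "html5lib" selects one of the third-party parsers, provided it is installed.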

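Kinds of objects

BeautifulSoup transforms a complex HTML document into a tree of Python objects. Every node is one of four kinds of object: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

A Tag has many methods and attributes, which are covered under navigating and searching the tree. For now, the two most important features of a tag are its name and its attributes.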

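Name

Every tag has a name, accessible as .name:

tag.name
# 'b'

Changing a tag's name is reflected in any HTML markup generated from the current BeautifulSoup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

tag.name = "ironman"
tag
# <ironman class="boldest">Extremely bold</ironman>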

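Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute named class whose value is boldest. Attributes are accessed the same way as dictionary entries:

tag['class'] # ['boldest']

The whole dictionary is available as .attrs:

tag.attrs # {'class': ['boldest']}

Attributes can be added, removed, and modified, again just like dictionary entries:

In [13]: tag['class']
Out[13]: ['boldest']
In [14]: tag['class'] = 'verybold'
In [15]: tag['id'] = 1
In [16]: tag
Out[16]: <b class="verybold" id="1">Extremely bold</b>
In [17]: del tag['class']
In [18]: del tag['id']
In [19]: tag
Out[19]: <b>Extremely bold</b>
In [20]: tag['class']
KeyError: 'class'
In [21]: tag.get('class')  # returns None, so nothing is displayed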
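The dictionary-style operations above can be sketched as a standalone script. This is an illustration only: the attribute values "intro", "bold", and "big" are made up, and "html.parser" is assumed so that the snippet runs without a third-party parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = "intro"               # add an attribute
tag["class"] = ["bold", "big"]    # replace a multi-valued attribute with a list
missing = tag.get("title", "n/a") # .get() with a default avoids KeyError
print(missing)                    # n/a

del tag["class"]                  # remove an attribute
print(tag.attrs)                  # {'id': 'intro'}
print(tag)                        # <b id="intro">Extremely bold</b>
```

Using .get() rather than square brackets is the safer choice whenever an attribute may be absent.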

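Multi-valued attributes

HTML defines a number of attributes that can hold multiple values. The most common one is class (a tag can have more than one CSS class). BeautifulSoup returns the value of a multi-valued attribute as a list:

In [22]: css_soup = BeautifulSoup('<p class="body strikeout"></p>')
In [23]: css_soup.p['class']
Out[23]: ['body', 'strikeout']

If an attribute looks as if it has more than one value but is not defined as multi-valued in any version of the HTML standard, BeautifulSoup returns it as a single string:

In [25]: id_soup = BeautifulSoup('<p id="my id"></p>')
In [26]: id_soup.p['id']
Out[26]: 'my id'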
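The difference between the two cases can be checked side by side. The sketch below assumes "html.parser" and uses get_attribute_list(), a Tag method that always returns a list whether or not the attribute is multi-valued.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

print(p["class"])                  # ['body', 'strikeout']
print(p["id"])                     # my id
print(p.get_attribute_list("id"))  # ['my id']

# A list assigned to a multi-valued attribute is joined with spaces on output.
p["class"] = ["body", "bold"]
print(p)
```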

NavigableString

Strings are usually contained inside tags. BeautifulSoup wraps the text inside a tag in the NavigableString class:

In [27]: tag
Out[27]: <b>Extremely bold</b>
In [28]: tag.string
Out[28]: 'Extremely bold'
In [29]: type(tag.string)
Out[29]: bs4.element.NavigableString

A NavigableString supports most, but not all, of the features described under navigating and searching the tree.
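A short sketch of what working with a NavigableString looks like in practice, assuming "html.parser" and a made-up replacement text. It shows that the object is a str subclass, that str() produces a plain copy, and that the text is changed by replacing the node rather than editing it in place.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
text = soup.b.string

print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True: it subclasses str

# str() makes a plain copy that holds no reference to the parse tree.
plain = str(text)
print(type(plain).__name__)               # str

# A string cannot be edited in place, but it can be replaced.
text.replace_with("No longer bold")
print(soup.b)                             # <b>No longer bold</b>
```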

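BeautifulSoup

The BeautifulSoup object represents the document as a whole. For most purposes it can be treated as a Tag object; it supports most of the methods described under navigating and searching the tree.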

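Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain, such as comments:

In [30]: markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
In [31]: soup = BeautifulSoup(markup, 'lxml')
In [32]: comment = soup.b.string
In [34]: type(comment)
Out[34]: bs4.element.Comment

A Comment object is a special type of NavigableString. The 'lxml' argument in soup = BeautifulSoup(markup, 'lxml') specifies which parser to use.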
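Because Comment is a subclass of NavigableString, an isinstance() check is enough to tell comments apart from ordinary text. The following sketch assumes "html.parser" and a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every string node under <p>, then split comments from ordinary text.
nodes = [n for n in soup.p.descendants if isinstance(n, NavigableString)]
comments = [n for n in nodes if isinstance(n, Comment)]
visible = [n for n in nodes if not isinstance(n, Comment)]

print(comments)  # [' hidden note ']
print(visible)   # ['Visible text']
```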

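Searching the tree

BeautifulSoup defines many search methods. This section concentrates on two of them: find_all() and select().

The "Alice" document serves as the example:

In [40]: html_doc = """
    ...: <html><head><title>The Dormouse's story</title></head>
    ...: <body>
    ...: <p class="title"><b>The Dormouse's story</b></p>
    ...:
    ...: <p class="story">Once upon a time there were three little sisters; and their names were
    ...: <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    ...: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    ...: <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    ...: and they lived at the bottom of a well.</p>
    ...:
    ...: <p class="story">...</p>
    ...: """
In [41]: from bs4 import BeautifulSoup
In [42]: soup = BeautifulSoup(html_doc, 'lxml')

Methods such as find_all() locate the parts of the document you are looking for.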

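Filters

Before looking at find_all() in detail, here are the kinds of filter that run through the whole search API. A filter can be applied to a tag's name, to its attributes, to the text of a string, or to a combination of these.

Strings

The simplest filter is a string. Pass a string to a search method and BeautifulSoup matches it exactly against tag names. The following example finds all <b> tags in the document:

In [43]: soup.find_all('b')
Out[43]: [<b>The Dormouse's story</b>]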

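Regular expressions

If you pass a compiled regular expression, BeautifulSoup filters against it using its search() method. The example below finds all tags whose names start with the letter b, so both <body> and <b> are found:

In [44]: import re
In [45]: for tag in soup.find_all(re.compile('^b')):
    ...:     print(tag.name)
    ...:
# body
# b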
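To show how anchoring changes the result, the sketch below runs an anchored and an unanchored pattern over a small made-up document, assuming "html.parser". Without the ^ anchor, a pattern matches anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored pattern: only names that start with "b".
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
print(starts_with_b)  # ['body', 'b']

# Unanchored pattern: any name that contains "t".
contains_t = [t.name for t in soup.find_all(re.compile("t"))]
print(contains_t)     # ['html', 'title']
```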

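Lists

If you pass a list, BeautifulSoup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document:

In [46]: soup.find_all(['a','b'])
Out[46]:
[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

The value True matches everything it can. The following code finds every tag in the document, but none of the text strings:

In [47]: for tag in soup.find_all(True):
    ...:     print(tag.name)
    ...:
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p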

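Functions

If none of the other filters fits, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.

The following function returns True if a tag defines the class attribute but not the id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Passing this function to find_all() picks up all the <p> tags:

In [66]: soup.find_all(has_class_but_no_id)
Out[66]:
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

The result contains only <p> tags. The <a> tags are left out because they define id as well as class, and <html> and <title> are left out because they do not define class.

The following code finds the tags that are surrounded by string objects:

In [68]: from bs4 import NavigableString
    ...: def surrounded_by_strings(tag):
    ...:     return (isinstance(tag.next_element, NavigableString)
    ...:             and isinstance(tag.previous_element, NavigableString))
In [70]: for tag in soup.find_all(surrounded_by_strings):
    ...:     print(tag.name)
    ...:
# p
# a
# a
# a
# p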
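As another illustration of a function filter, the sketch below finds <a> tags that have no href attribute. The function name is_link_without_href and the markup are hypothetical, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a class="sister">No link</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

def is_link_without_href(tag):
    """Match <a> tags that lack an href attribute (hypothetical filter)."""
    return tag.name == "a" and not tag.has_attr("href")

broken = [tag.get_text() for tag in soup.find_all(is_link_without_href)]
print(broken)  # ['No link']
```

A function filter is the right tool when the condition combines the tag name with the presence or absence of attributes, which none of the simpler filters can express.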

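find_all()

find_all(name, attrs, recursive, text, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all those that match the filters.

The name argument

Pass a value for name and BeautifulSoup considers only tags with a matching name; text strings are ignored.

For example:

In [72]: soup.find_all("title")
Out[72]: [<title>The Dormouse's story</title>]

The value of name can be any type of filter: a string, a regular expression, a list, a function, or True.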

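Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name. If you pass a value for an argument called id, BeautifulSoup filters against each tag's id attribute:

In [73]: soup.find_all(id="link2")
Out[73]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass a value for href, BeautifulSoup filters against each tag's href attribute:

In [76]: soup.find_all(href=re.compile('elsie'))
Out[76]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

An attribute can be filtered on a string, a regular expression, a list, or True.

The following example finds every tag that has an id attribute, whatever its value:

In [77]: soup.find_all(id=True)
Out[77]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing several keyword arguments filters on several attributes at once:

In [78]: soup.find_all(href=re.compile("elsie"), id='link1')
Out[78]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]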

Some attributes cannot be used as keyword arguments, such as the HTML5 data-* attributes, because their names are not valid Python identifiers:

In [79]: data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
In [80]: data_soup.find_all(data-foo="value")
SyntaxError: keyword can't be an expression

These attributes can still be searched by passing a dictionary as the attrs argument of find_all():

In [81]: data_soup.find_all(attrs={'data-foo':'value'})
Out[81]: [<div data-foo="value">foo!</div>]
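The attrs dictionary also accepts several keys at once and the value True. The sketch below illustrates this with made-up data-role and data-id attributes, assuming "html.parser".

```python
from bs4 import BeautifulSoup

html = """
<div data-role="card" data-id="1">first</div>
<div data-role="card" data-id="2">second</div>
<div data-role="banner">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keys in attrs may contain characters that keyword arguments cannot.
cards = soup.find_all("div", attrs={"data-role": "card"})
print([div.get_text() for div in cards])   # ['first', 'second']

# Several keys must all match.
second = soup.find_all(attrs={"data-role": "card", "data-id": "2"})
print([div.get_text() for div in second])  # ['second']

# True means "the attribute is present, whatever its value".
with_id = soup.find_all(attrs={"data-id": True})
print(len(with_id))                        # 2
```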

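Searching by CSS class

Use the class_ argument to search for tags with a given CSS class. The trailing underscore is needed because class is a reserved word in Python:

In [82]: soup.find_all('a', class_="sister")
Out[82]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like the other keyword arguments, class_ accepts different kinds of filter: a string, a regular expression, a function, or True.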
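Because class is multi-valued, a class_ filter is tried against each class name of a tag. A small sketch, assuming "html.parser" and made-up class names:

```python
import re
from bs4 import BeautifulSoup

html = '<p class="body strikeout">x</p><p class="footer">y</p>'
soup = BeautifulSoup(html, "html.parser")

# A single class name matches even when the tag has several classes.
by_name = soup.find_all("p", class_="strikeout")
print([p.get_text() for p in by_name])     # ['x']

# A regular expression is tried against each class name in turn.
by_pattern = soup.find_all("p", class_=re.compile("^foot"))
print([p.get_text() for p in by_pattern])  # ['y']
```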

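The text argument

The text argument searches for strings instead of tags. Like name, it accepts a string, a regular expression, a list, True, or a function. Beautiful Soup 4.4.0 and later also accept this argument under the name string, which newer releases prefer.

In [83]: soup.find_all(text='Elsie')
Out[83]: ['Elsie']
In [84]: soup.find_all(text=['Tillie','Elsie','Lacie'])
Out[84]: ['Elsie', 'Lacie', 'Tillie']
In [85]: soup.find_all(text=re.compile('Dormouse'))
Out[85]: ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

In [87]: soup.find_all(text=is_the_only_string_within_a_tag)
Out[87]: ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

Although text is meant for finding strings, it can be combined with arguments that find tags. BeautifulSoup then returns the tags whose .string matches the value of text. The following code finds the <a> tags whose .string is "Elsie":

In [88]: soup.find_all('a', text="Elsie")
Out[88]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]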

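The limit argument

find_all() returns every result that matches, which can be slow on a large document. If you do not need all the results, pass a number for limit. The search stops and returns as soon as it has found that many results.

In [89]: soup.find_all('a')
Out[89]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [90]: soup.find_all('a',limit=2)
Out[90]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]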
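When only the first match is needed, the related find() method is a convenient alternative to limit=1. The sketch below compares the two on a made-up list, assuming "html.parser".

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)
print([li.get_text() for li in first_two])  # ['one', 'two']

# find() behaves like find_all(..., limit=1) but returns the element itself,
# or None when nothing matches, instead of a list.
first = soup.find("li")
print(first.get_text())                     # one
print(soup.find("table"))                   # None
```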

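The recursive argument

When you call find_all() on a tag, BeautifulSoup examines all of that tag's descendants. To consider only direct children, pass recursive=False.

Take the following document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Here is the difference the recursive argument makes. The <title> tag sits inside <head>, so it is a descendant of <html> but not a direct child:

In [95]: soup.html.find_all('title')
Out[95]: [<title>The Dormouse's story</title>]
In [96]: soup.html.find_all('title', recursive=False)
Out[96]: []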
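The same contrast can be reproduced on a smaller fragment. The markup and the id "outer" below are made up for illustration, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Default: every descendant is searched.
all_p = outer.find_all("p")
print([p.get_text() for p in all_p])     # ['direct', 'nested']

# recursive=False: only direct children are considered.
direct_p = outer.find_all("p", recursive=False)
print([p.get_text() for p in direct_p])  # ['direct']
```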

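Calling a tag is like calling find_all()

Because find_all() is the most commonly used method in the BeautifulSoup search API, there is a shortcut for it. Calling a BeautifulSoup object or a Tag object as though it were a function is the same as calling find_all() on that object.

These two lines are equivalent:

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(text=True)
soup.title(text=True)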

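CSS selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the select() method of a Tag or of the BeautifulSoup object itself to find tags using CSS selector syntax:

In [1]: from bs4 import BeautifulSoup
In [2]: html_doc = """
   ...: <html><head><title>The Dormouse's story</title></head>
   ...: <body>
   ...: <p class="title"><b>The Dormouse's story</b></p>
   ...:
   ...: <p class="story">Once upon a time there were three little sisters; and their names were
   ...: <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
   ...: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
   ...: <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
   ...: and they lived at the bottom of a well.</p>
   ...:
   ...: <p class="story">...</p>
   ...: """
In [3]: soup = BeautifulSoup(html_doc, 'lxml')
In [4]: soup.select('title')
Out[4]: [<title>The Dormouse's story</title>]
In [5]: soup.select('p:nth-of-type(3)')
Out[5]: [<p class="story">...</p>]

Find tags beneath other tags:

In [6]: soup.select('body a')
Out[6]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [7]: soup.select('html head title')
Out[7]: [<title>The Dormouse's story</title>]

Find tags directly beneath other tags:

In [8]: soup.select('head > title')
Out[8]: [<title>The Dormouse's story</title>]
In [9]: soup.select('p > a')
Out[9]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the siblings of a tag:

In [14]: soup.select('#link1 ~ .sister')
Out[14]:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [15]: soup.select('#link1 + .sister')
Out[15]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

In [16]: soup.select('.sister')
Out[16]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [17]: soup.select('[class~=sister]')
Out[17]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

In [18]: soup.select('#link1')
Out[18]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [19]: soup.select('a#link2')
Out[19]: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

In [20]: soup.select('a[href]')
Out[20]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

In [21]: soup.select('a[href="http://example.com/elsie"]')
Out[21]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
In [22]: soup.select('a[href^="http://example.com/"]')
Out[22]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [23]: soup.select('a[href$="tillie"]')
Out[23]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [24]: soup.select('a[href*=".com/el"]')
Out[24]: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

For more on CSS selectors, consult a CSS selector reference.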
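In practice the selected tags are usually read for their attributes or text. The sketch below does this on a made-up menu fragment, assuming "html.parser"; it also uses select_one(), the companion method that returns only the first match.

```python
from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li class="item active"><a href="/home">Home</a></li>
  <li class="item"><a href="/docs">Docs</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match.
hrefs = [a["href"] for a in soup.select("#menu > li.item > a")]
print(hrefs)                          # ['/home', '/docs']

# select_one() returns the first match, or None if there is none.
active = soup.select_one("li.active a")
print(active.get_text())              # Home
print(soup.select_one("li.missing"))  # None
```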

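Scraping the posters of the Douban Movie Top 250

Next, let's put this into practice by scraping the posters from the Douban Movie Top 250 list. Start by visiting the Douban Movie Top 250 page.

Every film on the page has a poster on its left. The goal is to download these posters to the local disk.

Browsing through a few pages shows that each page's URL is related to its page number.

The value of the start parameter in the URL equals (page number - 1) * 25. Next, inspect the page elements.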
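The page-number rule can be sketched as a small helper. The URL pattern is the one this article's scraper uses; the names PAGE_SIZE, TOTAL, and page_url are hypothetical and introduced only for this illustration.

```python
# URL pattern used by the scraper in this article.
BASE_URL = "https://movie.douban.com/top250?start={page_id}&filter="
PAGE_SIZE = 25   # films shown per page
TOTAL = 250      # films in the whole list

def page_url(page_number):
    """Return the URL of a 1-based page number."""
    return BASE_URL.format(page_id=(page_number - 1) * PAGE_SIZE)

page_count = TOTAL // PAGE_SIZE
urls = [page_url(n) for n in range(1, page_count + 1)]

print(page_count)  # 10
print(urls[0])     # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```

With 25 films per page, the whole list spans 10 pages, with start running from 0 to 225.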

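Each poster's URL sits inside a div element whose class is pic. BeautifulSoup's select() method can fetch the img elements, and the src attribute of each gives the image URL. The code is as follows: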

import os
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
from fake_useragent import UserAgent

BASE_URL = "https://movie.douban.com/top250?start={page_id}&filter="
user_agent = UserAgent()


def get_content(url):
    """
    Send a request to the given URL.
    :param url: URL to request
    :return: response body as bytes
    """
    headers = {
        'User-Agent': user_agent.random
    }
    request = Request(url, headers=headers)
    response = urlopen(request)
    return response.read()


def parse_html(html):
    """
    Parse an HTML page and yield the poster URLs it contains.
    :param html: HTML document string
    :return: generator of image URLs
    """
    soup = BeautifulSoup(html, 'lxml')
    items = soup.select(".pic img")
    for item in items:
        yield item.get("src")


def save_image(image_url, file_name, folder='images'):
    """
    Download an image and save it to disk.
    :param image_url: URL of the image
    :param file_name: file name, without extension
    :param folder: directory to save into
    """
    image = get_content(image_url)
    if not os.path.exists(folder):
        os.mkdir(folder)
    file_name = "{}/{}.jpg".format(folder, file_name)
    with open(file_name, 'wb') as f_obj:
        f_obj.write(image)
    print("Saved:", file_name)


def main():
    total = 0
    # 250 films at 25 per page gives 10 pages (start = 0, 25, ..., 225)
    for i in range(10):
        url = BASE_URL.format(page_id=i * 25)
        html = get_content(url).decode('utf-8')
        for image in parse_html(html):
            print("Downloading:", image)
            save_image(image, total)
            total += 1


if __name__ == '__main__':
    main()

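While running, the script prints each poster's URL as it is downloaded and the file path once it has been saved.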

Finally, all the posters can be stitched together into a single image:

import math
import os

import PIL.Image as Image


def merge_images(folder, size=1000):
    """
    Stitch the images in folder into one square image.
    :param folder: directory containing the images
    :param size: side length of the final image in pixels
    """
    images_count = len(os.listdir(folder))
    # Side length of each tile so that all images fit on the canvas
    each_size = int(math.sqrt(size * size / images_count)) - 1
    lines_count = int(size / each_size)
    final_image = Image.new('RGB', (size, size), 'white')
    row, column = 0, 0
    for i in range(images_count):
        image_path = "{}/{}.jpg".format(folder, i)
        try:
            image = Image.open(image_path)
        except IOError as e:
            print(image_path, "could not be opened")
        else:
            image = image.resize((each_size, each_size))
            final_image.paste(image, (row * each_size, column * each_size))
            column += 1
            if column == lines_count:
                column = 0
                row += 1
    final_image.save("{}/{}.jpg".format(folder, "final_image"))
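The layout arithmetic in merge_images() can be checked on its own, without any image files. The helper names grid_layout and tile_origin below are hypothetical; the formulas are the ones used in the function above.

```python
import math

def grid_layout(images_count, size=1000):
    """Return (tile side in pixels, tiles per line) for a size x size canvas."""
    each_size = int(math.sqrt(size * size / images_count)) - 1
    lines_count = int(size / each_size)
    return each_size, lines_count

def tile_origin(index, each_size, lines_count):
    """Top-left pixel of the index-th tile, filling one line before the next."""
    line, offset = divmod(index, lines_count)
    return line * each_size, offset * each_size

each_size, lines_count = grid_layout(250)
print(each_size, lines_count)                   # 62 16
print(tile_origin(0, each_size, lines_count))   # (0, 0)
print(tile_origin(17, each_size, lines_count))  # (62, 62)
```

For 250 posters on a 1000 x 1000 canvas, each tile is 62 pixels wide and 16 tiles fit per line, so 16 lines of 62 pixels cover 992 pixels and leave a narrow white margin.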

The stitched image is saved as final_image.jpg in the same folder as the posters.

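Summary

This article introduced parts of the BeautifulSoup library, which parses an HTML or XML document into a document tree.

Each node in the tree is one of BeautifulSoup's four kinds of object. Use find_all() or select() to locate the nodes you need, then read the information you want from each node's attributes or methods.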

