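What is BeautifulSoup
BeautifulSoup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. With it, we no longer need to write regular expressions to extract information from a web page.
Just as a Java crawler typically pairs HttpClient with Jsoup, in Python we can pair requests with BeautifulSoup.
Official BeautifulSoup documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
First we need to install BeautifulSoup, together with a parser for it to use; this article uses lxml.
pip install bs4     # installs BeautifulSoup
pip install lxml    # installs the lxml parser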
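A quick way to confirm that both packages are importable is to parse a tiny document. The sketch below is not part of the original tutorial: the helper name `make_soup` is hypothetical, and the fallback to Python's built-in `html.parser` is an assumption about what you would want when lxml is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in parser.

    make_soup is a hypothetical helper name used only for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='greeting'>hello</p>")
print(soup.p.string)    # the text inside the first <p>
print(soup.p["class"])  # class is a multi-valued attribute, so this is a list
```

If the two prints run without an ImportError, BeautifulSoup is installed; whether lxml was actually used depends on your environment.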
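Next, let's parse an HTML snippet from the official documentation to get a feel for how the library is used. Each method is explained in the comments of the code; uncomment the lines you want to try.
demo1
"""
Using BeautifulSoup
"""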
from bs4 import BeautifulSoup
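html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""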
# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, "lxml")
# Formatted output (pretty-printed as HTML)
# print(soup)
# print(soup.prettify())
# Get a tag, its name, and its text
# tag = soup.title
# name = tag.name
# text = tag.string
# print(tag)
# print(name)
# print(text)
# tag = soup.p
# print(tag)
# Get all <p> tags whose class is "story"
# tags = soup.find_all("p", attrs={"class": "story"})
# print(len(tags))
# tag = soup.find(class_="title")
# print(tag)
# tag = soup.title
# print(tag.parent.name)
# Get the value of an attribute
# tag = soup.p
# value = tag.get("class")
# print(value)
# value = soup.p["class"]
# print(value)
# value = soup.a.get("id")
# print(value)
# Get all attributes of the first <a> tag
# attrs = soup.a.attrs
# print(attrs)
# Walk every descendant of <body>
# tag = soup.body
# for c in tag.descendants:
#     print(c)
#     print("*"*30)
# Walk every ancestor of <title>
# for p in soup.title.parents:
#     print(p.name)
# Get the first <a> tag; its next_sibling is the text after it,
# so next_sibling is called twice to reach the following <a> tag
# tag = soup.a
# print(tag.next_sibling.next_sibling)
for i in range(10):
    print(i)
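With BeautifulSoup doing the parsing, we no longer have to write regular expressions to filter the content. We can call methods to get exactly what we want, which is considerably more convenient than using regex.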
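To make that comparison concrete, here is a minimal sketch that pulls the three links out of the sample document without any regular expression. The markup is a shortened copy of the sample, and `html.parser` is used only so the snippet runs without lxml (an assumption for portability; `"lxml"` should behave the same here).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used in demo1.
html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag; class_ avoids the reserved word "class".
links = [(a.string, a["href"]) for a in soup.find_all("a", class_="sister")]

# select takes a CSS selector and matches the same three tags in this document.
ids = [a["id"] for a in soup.select("p.story > a.sister")]

print(links)
print(ids)
```

Both calls express the filter ("every `a` with class `sister`") directly, where a regex would have to encode the attribute order and quoting of the raw markup.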
demo2
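Scrape the poster images of the Maoyan TOP 100 movies and save them into folders.
The approach is the same as before: whatever technology or method we use, we first analyze the page to find the information we want, then filter that information out by whichever means is convenient, and finally download it.
"""
Scrape the Maoyan movie ranking board
"""
import os

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"
}
# Remember the directory the script was started from
root_dir = os.getcwd()
for j in range(10):
    # Create one folder per page under the root directory (page_1, page_2, ...)
    page_dir = os.path.join(root_dir, f"page_{j+1}")
    os.makedirs(page_dir, exist_ok=True)
    # Switch into that folder so the images are saved there
    os.chdir(page_dir)
    response = requests.get(f"https://maoyan.com/board/4?offset={j*10}", headers=headers)
    if response.status_code == 200:
        # Parse the page
        soup = BeautifulSoup(response.text, "lxml")
        img_tags = soup.find_all("img", attrs={"class": "board-img"})
        for img_tag in img_tags:
            name = img_tag.get("alt")
            src = img_tag.get("data-src")
            # Download the poster and save it under the movie's name
            resp = requests.get(src, headers=headers)
            with open(f"{name}.png", "wb") as f:
                f.write(resp.content)
            print(f"{name} {src} saved successfully")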
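Two details of the script above are easy to get wrong: the `offset` query parameter advances by 10 per page, and the movie title is used directly as a file name. The following sketch isolates both. The helper names `page_urls` and `safe_filename` are hypothetical, and the set of replaced characters is an assumption based on the characters Windows does not allow in file names.

```python
import re

BASE_URL = "https://maoyan.com/board/4"  # URL taken from the script above


def page_urls(pages=10, page_size=10):
    """Build one URL per result page; offset is 0, 10, 20, ..."""
    return [f"{BASE_URL}?offset={page * page_size}" for page in range(pages)]


def safe_filename(title, suffix=".png"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return f"{cleaned or 'untitled'}{suffix}"


urls = page_urls()
print(urls[0])
print(urls[-1])
print(safe_filename("Face/Off: 1997"))
```

Keeping these two steps as small pure functions makes them easy to check without touching the network.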
demo3
Scrape the ranking of the best universities in China.
"""
Scrape the ranking from the "Best Chinese Universities" site (zuihaodaxue.com)
"""
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"
}
response = requests.get("http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html", headers=headers)
# Set the encoding explicitly so the Chinese text is decoded correctly
response.encoding = "utf-8"
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "lxml")
    tr_tags = soup.find_all("tr", attrs={"class": "alt"})
    for tr_tag in tr_tags:
        # The first four cells are read as rank, name, location, and score
        rank = tr_tag.contents[0].string
        name = tr_tag.contents[1].string
        addr = tr_tag.contents[2].string
        score = tr_tag.contents[3].string
        print(f"{rank} {name} {addr} {score}")
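Indexing `contents` by position only works when there is no whitespace between the cells, because whitespace text nodes are children too. The sketch below shows an alternative on a made-up table row. The row's values are placeholders rather than real ranking data, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up row for illustration; the values are placeholders, not real data.
html = """
<table>
  <tr class="alt">
    <td>1</td>
    <td><div align="left">Example University</div></td>
    <td>Beijing</td>
    <td>90.0</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", class_="alt")

# contents also holds the whitespace between the cells,
# so it has more than four entries for this markup.
print(len(row.contents))

# find_all("td") returns only the cell tags, whatever the whitespace.
cells = [td.get_text(strip=True) for td in row.find_all("td")]
rank, name, addr, score = cells
print(rank, name, addr, score)
```

`get_text(strip=True)` also returns the text when a cell wraps it in another tag, as the second cell does here.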