Disclaimer: everything in this article was gathered from the internet.
Environment: Ubuntu 19.04, Python 3.x
Python packages: requests, bs4 (BeautifulSoup), lxml (third-party); re, urllib, os (standard library, no installation needed)
Installation: $ pip install [package-name]
PS: some machines do not come with python-pip installed; if you hit an error, install python-pip as the system prompts.
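For example, on Ubuntu the third-party packages can be installed roughly like this (the exact package name, such as python3-pip, may differ with the distribution, so treat this as a sketch):
$ sudo apt install python3-pip
$ pip3 install requests beautifulsoup4 lxml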
The scraping process:
1) Simulate a browser and send a request to the target page
2) Receive the response
3) Parse the response and turn it into the page's source code
4) Find the parts of that code we need
5) Process them
Code implementation:
Here we take scraping the images on the Fabiaoqing page (https://www.fabiaoqing.com/biaoqing) as an example.
Steps 1 and 2)
url = 'https://www.fabiaoqing.com/biaoqing'  # target URL
response = requests.get(url)  # send the request and receive the response
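If the site rejects plain requests or returns an empty page, a common workaround is to send a browser-like User-Agent header; a minimal sketch, where the header value is only a placeholder:
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA string; use a real browser string if needed
response = requests.get(url, headers=headers)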
Step 3)
soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')  # parse the response; soup now holds the page's source code
Step 4)
gowl = str(soup.findAll('img'))  # find all <img> tags; gowl is the text that contains the image URLs
# Since we only want the memes, gowl has to be filtered once more. Looking at the meme URLs on the target page,
# we can see they end in either .jpg or .gif, but they all share the prefix http://w...sinaimg.cn/bmiddle/ followed by an id of roughly fixed length.
# In the patterns below, each . matches any single character, so:
picUrls = re.findall('http://w...sinaimg.cn/bmiddle/.................................jpg',gowl)
picUrlst = re.findall('http://w...sinaimg.cn/bmiddle/.................................gif',gowl)
# picUrls and picUrlst together make up the target memes
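As an alternative to the regex filter, the same URLs can usually be read straight from the tag attributes with BeautifulSoup. A rough sketch, where the attribute name 'data-original' is an assumption and should be checked against the page's actual HTML:
picUrlsAll = []
for img in soup.findAll('img'):
    src = img.get('data-original') or img.get('src')  # attribute name is an assumption; inspect the page to confirm
    if src and 'sinaimg.cn/bmiddle/' in src:
        picUrlsAll.append(src)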
Step 5) Then download the images
imgName = 0
for url in picUrls:
    r = requests.get(url)  # request each .jpg URL
    img = r.content
    with open(str(imgName)+".jpg", 'wb') as f:
        f.write(img)
    imgName = imgName + 1
imgNamet = len(picUrls)  # continue numbering after the .jpg files
for urls in picUrlst:
    rt = requests.get(urls)  # request each .gif URL
    imgt = rt.content
    with open(str(imgNamet)+".gif", 'wb') as f:
        f.write(imgt)
    imgNamet = imgNamet + 1
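If you would rather keep the downloads in their own folder instead of the current working directory, a small variation is possible; the folder name 'biaoqing' here is just an example (os is already in the package list above):
import os

os.makedirs('biaoqing', exist_ok=True)  # create the folder if it does not exist yet
for imgName, url in enumerate(picUrls):
    img = requests.get(url).content
    with open(os.path.join('biaoqing', str(imgName) + '.jpg'), 'wb') as f:
        f.write(img)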
Finally,
the complete code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import re
import urllib  # imported to match the package list above; not actually used below
import lxml    # only needs to be installed; BeautifulSoup loads it by name as the parser
import os      # not used below, but handy if you want to save into a folder
url = "https://www.fabiaoqing.com/biaoqing"
response = requests.get(url)  # send the request
soup = BeautifulSoup(response.content.decode('utf-8'), 'lxml')  # parse
gowl = str(soup.findAll('img'))  # find all <img> tags
#print(gowl)  # all <img> tags
picUrls = re.findall('http://w...sinaimg.cn/bmiddle/.................................jpg',gowl)
picUrlst = re.findall('http://w...sinaimg.cn/bmiddle/.................................gif',gowl)
#print(picUrls, picUrlst)  # all images that match the pattern
#file_obj = open('goel', 'r+')  # optionally write the URLs to a file; 'goel' is a file you created beforehand
#file_obj.write(str(picUrls))
imgName = 0
for url in picUrls:
    r = requests.get(url)  # request each .jpg URL
    img = r.content
    with open(str(imgName)+".jpg", 'wb') as f:
        f.write(img)
    imgName = imgName + 1
imgNamet = len(picUrls)  # continue numbering after the .jpg files
for urls in picUrlst:
    rt = requests.get(urls)  # request each .gif URL
    imgt = rt.content
    with open(str(imgNamet)+".gif", 'wb') as f:
        f.write(imgt)
    imgNamet = imgNamet + 1
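One way the download part could be shortened, as a sketch only and assuming the memes still sit under http://w...sinaimg.cn/bmiddle/ as above: a single pattern with an alternation group grabs both endings at once, and enumerate() replaces the manual counters.
picAll = re.findall(r'http://w...sinaimg\.cn/bmiddle/\S+?\.(?:jpg|gif)', gowl)
for i, picUrl in enumerate(picAll):
    ext = picUrl[-4:]  # '.jpg' or '.gif'
    with open(str(i) + ext, 'wb') as f:
        f.write(requests.get(picUrl).content)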
PS: this is the first scraper I have ever written, and I hope it is of some help to whoever is reading. If there are mistakes in the article or the code, feel free to point them out. Suggestions from the experts on how to simplify the code are also very welcome.