A Web-Scraping Case Study on the Importance of Exception Handling and Multithreading

A recent project required scraping avatar images from https://www.woyaogexing.com/touxiang/. The scraper uses no framework: Python's requests library fetches the pages and pyquery parses them. The page structure is simple, so I won't describe it in detail here.

At first the script downloaded too fast and crashed with an exception after a short while, so I added a pause of a few seconds after each image. Before heading home one evening I left the script running, expecting a few thousand images by morning; instead it had thrown an exception and died after only 100 or so. What I needed was a program that keeps going no matter what, which is exactly what exception handling is for. I rewrote the scraper so that every statement that might raise an exception is wrapped in a try/except block, and the program could then run indefinitely.

Since every download is followed by a delay, multithreading was the obvious next step, but I wasn't familiar with Python's threading, so I fell back on multiprocessing: I simply launched the script eight times in parallel. That brought the rate up to 50-odd images per minute, easily tens of thousands of images in a night. Here is the scraper script:

# -*- coding: utf-8 -*-
import os
import random
import time

import requests
from PIL import Image
from pyquery import PyQuery as pq

class all_in_one:

    def __init__(self):
        self.image_name = 1
        self.image_suffix = '.jpeg'
        self.has_download_img = set()
        self.download_url = set()
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
        # Raw string so the backslashes are not treated as escape sequences
        self.base = r'D:\pycharmWorkplace\head_images_spider\imgs'
        os.makedirs(self.base, exist_ok=True)  # make sure the target directory exists
        for i in range(2, 1000):
            self.download_url.add('https://www.woyaogexing.com/touxiang/index_' + str(i) + '.html')
        print('Initialization complete')

    def download_img(self, url):
        download_url = 'http:' + url  # the page uses protocol-relative image URLs
        if download_url in self.has_download_img:
            return
        filename = os.path.join(self.base, str(self.image_name) + self.image_suffix)
        print('Downloading ' + filename)
        t = random.randint(5, 10)
        time.sleep(t)  # pause t seconds to avoid hammering the server
        try:
            # A timeout keeps one dead connection from hanging the whole run
            data = requests.get(download_url, headers=self.headers, timeout=10).content
        except Exception as e:
            print(e)
            return
        with open(filename, 'wb') as f:
            f.write(data)
        if self.is_valid_image(filename):
            print(filename + ' downloaded successfully!')
            self.image_name += 1
            self.has_download_img.add(download_url)
        else:
            for i in range(5):  # retry at most 5 times, then give up
                print('Download failed, retry attempt ' + str(i + 1))
                t = random.randint(30, 50)
                time.sleep(t)  # back off for longer before retrying
                try:
                    data = requests.get(download_url, headers=self.headers, timeout=10).content
                except Exception as e:
                    print(e)
                    return
                with open(filename, 'wb') as f:
                    f.write(data)
                if self.is_valid_image(filename):
                    print(filename + ' downloaded successfully!')
                    self.image_name += 1
                    self.has_download_img.add(download_url)
                    break

    # Check whether the downloaded image file is intact: a truncated or
    # partial download raises OSError when Pillow tries to load it
    def is_valid_image(self, filename):
        valid = True
        try:
            Image.open(filename).load()
        except OSError:
            valid = False
        return valid

    def download_one_page(self, url):
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
        except Exception as e:
            print(e)
            time.sleep(10)
            return

        print('Fetched page ' + url + ' successfully...')
        t = random.randint(20, 60)
        time.sleep(t)  # throttle between page requests
        doc = pq(response.text)
        for item in doc('img.lazy').items():
            src = item.attr('src')
            if src:  # skip images whose src attribute is missing
                self.download_img(url=src)

    def download_page(self):
        for url in self.download_url:
            self.download_one_page(url)



if __name__ == '__main__':
    aio = all_in_one()
    aio.download_page()
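
One thing worth noting about the script above: the download/write/validate/retry sequence appears twice in download_img. As a side note, it can be collapsed into a single helper; the sketch below is one possible refactoring (the method name fetch_valid_image is mine, not from the original script, and unlike the original a network error here only costs one attempt instead of aborting the whole image):

    def fetch_valid_image(self, download_url, filename, retries=5):
        # Try the download up to `retries` times, validating the file
        # after each attempt; return True on the first intact image.
        for attempt in range(retries):
            t = random.randint(30, 50)
            time.sleep(t)  # back off before each attempt
            try:
                data = requests.get(download_url, headers=self.headers, timeout=10).content
            except Exception as e:
                print(e)
                continue  # a network error just costs this one attempt
            with open(filename, 'wb') as f:
                f.write(data)
            if self.is_valid_image(filename):
                return True
        return False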

This experience drove home how much exception handling matters in a robust program, and how much throughput running work in parallel can buy.
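
For the record, the eight-copies-of-the-script workaround can also be written with threads inside a single process. Below is a minimal threaded downloader sketch, not the script I actually ran: it uses the standard-library concurrent.futures.ThreadPoolExecutor, and it names each file after the last segment of the image URL so that concurrent threads never collide on a filename (the sequential image_name counter in the class above would not be thread-safe).

# -*- coding: utf-8 -*-
# Minimal threaded downloader sketch, assuming a pre-collected URL list.
import os
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}
BASE = r'D:\pycharmWorkplace\head_images_spider\imgs'

def fetch(url):
    # Name the file after the last path segment of the URL so that
    # concurrent threads never write to the same filename.
    filename = os.path.join(BASE, url.rsplit('/', 1)[-1])
    try:
        data = requests.get(url, headers=HEADERS, timeout=10).content
    except Exception as e:
        print(e)  # log and swallow, so one failure never kills the pool
        return
    with open(filename, 'wb') as f:
        f.write(data)

if __name__ == '__main__':
    os.makedirs(BASE, exist_ok=True)
    image_urls = []  # fill with the 'http:' + src values collected by the parser
    with ThreadPoolExecutor(max_workers=8) as pool:  # 8 threads, like my 8 processes
        pool.map(fetch, image_urls)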
