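Fetching Onmyoji (阴阳师) Wallpapers

An Onmyoji wallpaper crawler

Writing a crawler is not hard in itself; the hard part is dealing with anti-crawling measures and working around them.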

Target URL

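1. Goal: download all 375 wallpaper images from the site and save them locally
2. Problems
(1) Requesting too quickly gets the IP blocked
(2) If the crawler dies after 100 images have been downloaded, how do we recover, and how does the next run skip those first 100?
(3) How do we guarantee data integrity?
(4) How do we improve efficiency?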
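
For problem (1), the usual first step is simply to slow down. The following is a minimal sketch of one way to space out requests, not something taken from the original script: `Throttle`, `min_interval`, and `jitter` are hypothetical names, and the delay this particular site tolerates is unknown, so the values are placeholders.

```python
import random
import time


class Throttle:
    """Block until at least `min_interval` seconds (plus random jitter)
    have passed since the previous call to wait()."""

    def __init__(self, min_interval=1.0, jitter=0.5):
        self.min_interval = min_interval
        self.jitter = jitter
        self._last = None  # monotonic timestamp of the previous request

    def wait(self):
        # Pick a slightly different delay each time so requests are not
        # perfectly periodic.
        delay = self.min_interval + random.uniform(0, self.jitter)
        if self._last is not None:
            remaining = delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


if __name__ == '__main__':
    throttle = Throttle(min_interval=0.2, jitter=0.1)
    start = time.monotonic()
    for i in range(3):
        throttle.wait()  # call this right before each requests.get(...)
        print(i, round(time.monotonic() - start, 2))
```

Each process would hold its own `Throttle`, so with several worker processes the combined request rate is roughly the per-process rate times the number of workers.
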
3. Collect every wallpaper URL, save it to a MySQL database, and give each URL a status value
import re

import requests
from public.operation_db import save_batch_data


class Spider(object):

    def __init__(self):
        self.url = 'https://yys.163.com/media/picture.html'

    def get_data(self):
        """
        Collect the wallpaper URLs and save them to MySQL.
        """
        resp = requests.get(self.url, timeout=10)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        # Each wallpaper URL sits in a data-src attribute on the page
        url_list = re.findall(r'data-src="(.*?)"', resp.text)
        # One single-element row per URL for the batch insert
        data = [[url] for url in url_list]
        sql = 'insert into yys(url) values(%s)'
        # Insert all rows in one batch
        save_batch_data(sql, data)


if __name__ == '__main__':
    s = Spider()
    s.get_data()
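
The post does not show the `yys` table or the `public.operation_db` helpers. Judging from the queries, the table has at least `id`, `url`, and `status` columns, with `status = 0` meaning "not downloaded yet". The sketch below illustrates that idea with the standard-library `sqlite3` module standing in for MySQL so that it runs without a server; the column types, the `UNIQUE` constraint, and the sample HTML are all assumptions.

```python
import re
import sqlite3

# Made-up HTML standing in for the real page.
SAMPLE_HTML = '''
<img data-src="https://example.com/a.jpg">
<img data-src="https://example.com/b.jpg">
<img data-src="https://example.com/a.jpg">
'''


def create_table(conn):
    # status: 0 = not downloaded yet, 1 = downloaded.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS yys ('
        ' id INTEGER PRIMARY KEY AUTOINCREMENT,'
        ' url TEXT NOT NULL UNIQUE,'
        ' status INTEGER NOT NULL DEFAULT 0)'
    )


def save_urls(conn, html):
    urls = re.findall(r'data-src="(.*?)"', html)
    # INSERT OR IGNORE skips URLs that are already stored, so running the
    # collector twice is harmless. (MySQL spells this INSERT IGNORE.)
    conn.executemany('INSERT OR IGNORE INTO yys(url) VALUES (?)',
                     [(u,) for u in urls])
    conn.commit()


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    create_table(conn)
    save_urls(conn, SAMPLE_HTML)
    save_urls(conn, SAMPLE_HTML)  # the second run adds nothing
    for row in conn.execute('SELECT id, url, status FROM yys ORDER BY id'):
        print(row)
```

Because every row starts at `status = 0`, a restarted crawler only needs to select rows that are still 0 to skip everything already downloaded, which is how the status column addresses problem (2).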


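4. Fetch the URLs from the database one at a time, updating the status value after each download to mark the image as done; use multiple processes to speed up downloading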
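import multiprocessing
import os

import requests
from retrying import retry
from public.operation_db import select_data, update_data


def get_url():
    """
    Fetch one not-yet-downloaded row from the database.
    Returns (id, url), or None when nothing is left.
    """
    sql = 'select id, url from yys where status=0 limit 1'
    result = select_data(sql)
    if not result:
        return None
    return result[0][0], result[0][1]


@retry(stop_max_attempt_number=3, wait_fixed=3000)
def download(title, url):
    """
    Save the wallpaper image to disk; retried up to 3 times, 3 s apart.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    os.makedirs('./img', exist_ok=True)
    path = './img/' + str(title) + '.jpg'
    with open(path, 'wb') as f:
        f.write(response.content)
    print(path)


def run():
    """
    Main loop: download pending images until none are left.
    """
    while True:
        row = get_url()
        if row is None:
            break
        title, url = row
        try:
            download(title, url)
            status = 1
        except Exception as e:
            print(e)
            # -1 marks a failed download (an assumed convention) so the
            # loop moves on instead of picking the same row forever
            status = -1
        update_data('update yys set status=%d where id=%d' % (status, title))


if __name__ == '__main__':
    # Two worker processes. Both may occasionally pick the same row before
    # either updates its status; that image is then downloaded twice.
    processes = [multiprocessing.Process(target=run) for _ in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()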
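
With two processes running the same "select one pending row, then update it" loop, both can select the same row before either marks it as done. One common way to avoid that is to claim a row with a conditional update and check how many rows it changed. The sketch below shows the idea using the standard-library `sqlite3` module as a stand-in for MySQL; `claim_next`, `mark_done`, and the extra status value 2 ("claimed") are assumptions, not part of the original code.

```python
import sqlite3

PENDING, DONE, CLAIMED = 0, 1, 2  # CLAIMED is an assumed extra state


def claim_next(conn):
    """Return (id, url) of a row this worker now owns, or None if none left."""
    while True:
        row = conn.execute(
            'SELECT id, url FROM yys WHERE status = ? ORDER BY id LIMIT 1',
            (PENDING,)
        ).fetchone()
        if row is None:
            return None
        # Conditional update: it only changes the row if it is still pending.
        cur = conn.execute(
            'UPDATE yys SET status = ? WHERE id = ? AND status = ?',
            (CLAIMED, row[0], PENDING))
        conn.commit()
        if cur.rowcount == 1:
            return row
        # Another worker claimed it first; look for the next pending row.


def mark_done(conn, row_id):
    conn.execute('UPDATE yys SET status = ? WHERE id = ?', (DONE, row_id))
    conn.commit()


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE yys (id INTEGER PRIMARY KEY, url TEXT,'
                 ' status INTEGER NOT NULL DEFAULT 0)')
    conn.executemany('INSERT INTO yys(url) VALUES (?)', [('u1',), ('u2',)])
    conn.commit()
    while True:
        job = claim_next(conn)
        if job is None:
            break
        print('downloading', job)  # the real download would go here
        mark_done(conn, job[0])
```

A worker that crashes after claiming a row leaves it in the claimed state, so a restart would need to reset such rows to 0 first. On MySQL with InnoDB, a `SELECT ... FOR UPDATE` inside a transaction is another option for the same problem.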


5. Reference blog posts

https://blog.csdn.net/gklcsdn/article/details/102700926

https://blog.csdn.net/gklcsdn/article/details/102641267

https://blog.csdn.net/gklcsdn/article/details/102879328

6. Source code

GitHub
