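Note: "all images on the site" here means the site's actual content images; ad slots, recommendation banners and the like are excluded.
I'm a beginner just getting started, so veterans can skip this.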

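Step 1: Find the pattern in the site's pagination URLs

Pick the category you want to crawl (to get every image, skip this and the site lists all photos; adapt the details to the site you are working with).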

QQ截图20190620144258.png

The URL shown in the address bar

QQ截图20190620144349.png

Now look at how the URL changes from page to page

QQ截图20190620144417.png

The URL shown in the address bar

From this we can see that the pagination parameter is page/<page number>.
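As a minimal sketch, the pattern can be written as a URL template. The tag URL is the one used in the full code later in this post; the names `LIST_URL` and `page_url` are only for illustration.

```python
# Listing URL template; the page number goes right after "page/".
LIST_URL = 'http://www.jitaotu.com/tag/meitui/page/{}'


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    return LIST_URL.format(page)


# Build the first three listing URLs without sending any request.
for n in range(1, 4):
    print(page_url(n))
```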

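Step 2: Get the total number of pages and request the URLs

There are several ways to find out how many pages there are:
1. The most direct: read it off the pagination bar in the browser.
2. Count how many galleries a full page holds. If a full page holds 15, a page with fewer than 15 is the last one (although the last page may also happen to hold exactly 15).
3. Like method 1, except that a program reads the total page count.
4. Ignore the page count and catch HTTP errors: once a GET request returns 404, the crawl is finished.
5. Look for a "next page" link on each page; if there is none, the crawl is finished (but some pages with little data never have that link at all, which is awkward).
This post uses method 3.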
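For comparison, method 4 can be sketched roughly as follows. This is not the code used in this post: `crawl_until_404`, `get_status` and `max_pages` are names I made up, and the status lookup is passed in as a function so the loop can be tried without touching the network.

```python
def crawl_until_404(get_status, max_pages=1000):
    """Visit pages 1, 2, 3, ... until get_status(page) returns 404.

    get_status is any callable that takes a page number and returns an
    HTTP status code. max_pages is a safety cap so that a site which
    never answers 404 cannot keep the loop running forever.
    """
    visited = []
    for page in range(1, max_pages + 1):
        if get_status(page) == 404:
            break
        visited.append(page)
    return visited


# Stand-in for a real request: pretend the site has exactly 5 pages.
def fake_status(page):
    return 200 if page <= 5 else 404


print(crawl_until_404(fake_status))
```

With requests, `get_status` could be something like `lambda p: requests.get(url.format(p)).status_code`, where `url` is the listing URL template.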
The total page count can be read from the pagination bar in the screenshot.

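Reading the page count with a program
As the screenshot above shows, the pagination links sit inside one div. Normally it holds "Previous 1 2 3 4 5 ... 101 Next", nine items in total, so the second-to-last item is the total page count.
Use XPath to locate the tags.
Right-click the element in the browser's developer tools and choose Copy > Copy XPath.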

QQ截图20190620150800.png

Then use a Chrome XPath extension

and paste in the XPath you just copied

QQ截图20190620150845.png

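The result is the total page count, 101. But the copied XPath points at one specific link, while what we want is "the second-to-last link", so we should fetch the whole list of links rather than one fixed node: the number of pagination links changes from page to page, but the total page count is always the last numbered link, just before "Next". (This is based on the site I tested; adjust to whatever your site actually does.)
Getting the list of pagination links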

QQ截图20190620150859.png

Call the method that looks up the page count; the second-to-last value in the list it gets back is the total page count.

QQ截图20190620152157.png
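The code in this post takes `page[-2]`, which relies on the last link always being "Next". A slightly more defensive sketch (my own assumption, not what the screenshots show) is to take the largest numeric label from the list. The function name `total_pages` and the English labels in the example are made up for illustration.

```python
def total_pages(labels):
    """Pick the total page count out of the pagination link texts.

    labels is the kind of list an XPath ending in a/text() returns,
    for example ['Previous', '1', '2', '...', '101', 'Next'].
    """
    # Keep only the labels that are plain numbers and convert them to int.
    numbers = [int(t) for t in labels if t.strip().isdecimal()]
    if not numbers:
        raise ValueError('no numeric pagination label found')
    return max(numbers)


print(total_pages(['Previous', '1', '2', '3', '4', '5', '...', '101', 'Next']))
```

Returning an int also means the result can be compared directly with the page counter in the crawl loop.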

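Add a check so the loop exits once the page number exceeds the total page count.

With the page count sorted out, we can start crawling images.

Each listing page holds 10 galleries; use XPath to get the links to these 10 galleries and request them.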

QQ截图20190620152830.png
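To show the idea of pulling every gallery link out of a listing with one path expression, here is a small sketch. Note the assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages, and the `SAMPLE` markup and its hrefs are invented stand-ins for the real listing page.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the listing markup.
SAMPLE = """
<section>
  <div>
    <ul>
      <li><div><a href="/xinggan/10001.html">set 1</a></div><div>caption</div></li>
      <li><div><a href="/xinggan/10002.html">set 2</a></div><div>caption</div></li>
    </ul>
  </div>
</section>
"""


def gallery_links(markup):
    """Return the href of the link inside the first div of every list item."""
    root = ET.fromstring(markup.strip())
    # Same shape as the article's XPath: .../div/ul/li/div[1]/a/@href
    return [a.get('href') for a in root.findall('./div/ul/li/div[1]/a')]


print(gallery_links(SAMPLE))
```

Real pages are rarely well-formed XML, which is why the full code below parses them with `lxml.etree.HTML` instead.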

There are four steps in total.
Step 1: visit the first page and read the total number of pages.

QQ截图20190620183133.png

Step 2: collect the links to the 10 gallery detail pages on the listing page.

QQ截图20190620183139.png

Step 3: request a gallery's first detail page to find out how many images it contains.

![QQ截图20190620183154.png](https://upload-images.jianshu.io/upload_images/18295040-d11db768fd21ecf0.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Step 4: request every detail page and save its image.

QQ截图20190620183154.png
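Steps 3 and 4 come down to turning one gallery link plus an image count into a list of detail-page URLs. The full code below slices the ID out with `imgde[-10:-5]`, which only works for five-digit IDs; the sketch here splits on `/` and `.` instead. The helper name `detail_urls` and the example href are assumptions made for illustration.

```python
DETAIL_URL = 'http://www.jitaotu.com/xinggan/{}'


def detail_urls(gallery_href, image_count):
    """Build one detail-page URL per image in a gallery.

    Assumes the gallery link ends in '<id>.html', e.g. '.../68913.html'.
    """
    # Take the file name, then drop the '.html' suffix to get the ID.
    gallery_id = gallery_href.rsplit('/', 1)[-1].rsplit('.', 1)[0]
    return [DETAIL_URL.format('{}_{}.html'.format(gallery_id, i))
            for i in range(1, image_count + 1)]


for url in detail_urls('http://www.jitaotu.com/xinggan/68913.html', 3):
    print(url)
```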

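This could really be merged into two steps; written this way I make two extra requests. I was too lazy to change it, so feel free to do that yourself if you're interested.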

QQ截图20190620183420.png

I've rambled on for a while, but none of that is the point. The useful part starts here: multithreading

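Depending on your connection, downloading one image is no problem, but what about 100 images, or a million?
The slowest part of the whole HTTP workflow is requesting each image URL and saving the file, because image files are large.
Suppose a gallery has 70 images and saving one takes 3 seconds: done one after another, that is 70 × 3 = 210 seconds. With one thread per image, the downloads wait on the network at the same time, so 100 images take roughly as long as one, about 3 seconds, as long as bandwidth and the server keep up. On to the code.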
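The arithmetic above can be checked with a small experiment. This is only a sketch: `fake_download` simulates the wait with `time.sleep` rather than fetching anything, and all function names are made up for the demonstration.

```python
import threading
import time


def fake_download(seconds):
    """Stand-in for saving one image: just wait, like a slow network call."""
    time.sleep(seconds)


def run_sequential(count, seconds):
    """Download one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(count):
        fake_download(seconds)
    return time.perf_counter() - start


def run_threaded(count, seconds):
    """Start one thread per download and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(seconds,))
               for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: %.2fs' % run_sequential(10, 0.1))
    print('threaded:   %.2fs' % run_threaded(10, 0.1))
```

The sequential run should take about count × seconds, while the threaded run should take about as long as a single download, because the threads spend their time waiting rather than computing.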

Create the queue

QQ截图20190620184013.png

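Build the detail-page URLs, put them on the queue, and start the threads. (This could be done in a single for loop, but when I tried it I kept enqueueing too many items, so filling the queue and starting the threads are separate loops; you can try a single loop yourself.)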

QQ截图20190620184041.png

When a thread finishes, it marks its queue item as done

QQ截图20190620184121.png
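The queue steps above can be sketched as one small helper. This differs from the code in this post in two ways, both my own choices: it uses a fixed number of worker threads instead of one thread per URL, and it calls `task_done()` in a `finally` block so that `join()` cannot hang when a download fails. `process_all`, `handle` and `workers` are illustrative names.

```python
import threading
from queue import Queue


def process_all(urls, handle, workers=4):
    """Run handle(url) for every url using a small pool of threads."""
    q = Queue()
    for url in urls:
        q.put(url)

    def worker():
        while True:
            url = q.get()
            if url is None:          # sentinel: no more work for this thread
                q.task_done()
                return
            try:
                handle(url)
            except Exception as exc:
                print('skipped', url, exc)
            finally:
                q.task_done()        # always mark the item as finished

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for _ in threads:
        q.put(None)                  # one sentinel per worker
    q.join()                         # wait until every item is marked done
    for t in threads:
        t.join()


# Usage with a stand-in handler that only records what it was given.
seen = []
lock = threading.Lock()


def record(url):
    with lock:
        seen.append(url)


process_all(['a', 'b', 'c'], record)
print(sorted(seen))
```

In the crawler, `handle` would be the function that requests a detail page and saves its image.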

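(Because Mac and Windows use different path styles, I didn't write code to create the folder automatically. Before running, create an imgs folder next to the script; the multithreaded version saves to imgsa, so create that one for it.)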
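If you would rather not create the folder by hand, a sketch like the following should work on both Mac and Windows, since `os.path.join` picks the separator and `os.makedirs(..., exist_ok=True)` does nothing when the folder already exists. The helper name `ensure_dir` is made up, and the demo runs in a temporary directory so it leaves nothing behind.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the folder if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demo in a temporary directory; in the crawler you would pass the
# image folder instead, e.g. os.path.join('.', 'imgs').
with tempfile.TemporaryDirectory() as base:
    target = ensure_dir(os.path.join(base, 'imgs'))
    ensure_dir(target)               # calling it a second time is harmless
    print(os.path.isdir(target))
```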
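Speed comparison

Multithreading only helps if the site doesn't rate-limit requests. Small sites and the less reputable ones generally have no such limit, so you know what I mean!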

QQ截图20190620190027.png

Crawling the same gallery

QQ截图20190620190256.png

Single-threaded version

import requests
from lxml import etree
import random
import threading
from  time import sleep
from queue import Queue
class ImgSpider() :
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.deatil = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                        }
        self.url_queue = Queue()

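    def pages(self):
        # Total number of listing pages
        response = requests.get(self.urls.format(2))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        # The second-to-last link text is the page count; return it as an int
        # so it can be compared with the page counter in run()
        return int(page[-2])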

    def html_list(self, page):
        # Links to the galleries on one listing page
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')
        return page

    def detail_page(self, imgde):
        # Number of images in the gallery
        response = requests.get(self.deatil.format(imgde))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]
    def detail_list(self, imgde, page):
        # Build the detail-page URLs
        # Slice the gallery ID out of the link
        urls = imgde[-10:-5]
        print('Visiting the detail pages and saving the images')
        for i in range(int(page)):
            print(self.deatil.format(urls+'_'+str(i+1)+'.html'))
            response = requests.get(self.deatil.format(urls+'_'+str(i+1)+'.html'))
            strs = response.content.decode()
            html = etree.HTML(strs)
            imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
            # Save the image
            self.save_img(imgs)

    def save_img(self, imgs):
        print(imgs[0])
        response = requests.get(imgs[0], headers=self.headers)
        strs = response.content
        # Random file name: join the sampled characters into a string,
        # otherwise str() writes the list repr into the file name
        s = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 5))
        a = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 5))
        with open("./imgs/" + a + s + ".jpg", "wb") as f:
            f.write(strs)
        print("Image saved")
        return
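    def run(self):
        page = 1
        # Get the total number of pages (as an int, so the comparison below works)
        pageall = int(self.pages())
        print('Total pages: ' + str(pageall))
        while True:
            print('Visiting page ' + str(page))
            # Visit the listing page and get the links to its 10 galleries
            html_list = self.html_list(page)
            # Visit each gallery's detail pages
            s = 1
            for htmls in html_list:
                print('Visiting page ' + str(page) + ', gallery ' + str(s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch the image URLs
                print('Page ' + str(page) + ', gallery ' + str(s) + ' has ' + str(imgdetalpage) + ' images')
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            if page > pageall:
                print('Crawl finished, leaving the loop')
                return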

if __name__ == '__main__':
    Imgs = ImgSpider()
    Imgs.run()

Multithreaded version

import requests
from lxml import etree
import random
import threading
from  time import sleep
from queue import Queue
class ImgSpider() :
    def __init__(self):
        self.urls = 'http://www.jitaotu.com/tag/meitui/page/{}'
        self.deatil = 'http://www.jitaotu.com/xinggan/{}'
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Cookie': 'UM_distinctid=16b7398b425391-0679d7e790c7ad-3e385b04-1fa400-16b7398b426663;Hm_lvt_7a498bb678e31981e74e8d7923b10a80=1561012516;CNZZDATA1270446221 = 1356308073 - 1561011117 - null % 7C1561021918;Hm_lpvt_7a498bb678e31981e74e8d7923b10a80 = 1561022022',
            'Host': 'www.jitaotu.com',
            'If-None-Match': '"5b2b5dc3-2f7a7"',
            'Proxy-Connection': 'keep-alive',
            'Referer': 'http://www.jitaotu.com/cosplay/68913_2.html',
            'Upgrade-Insecure-Requests': '1',
             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                        }
        self.url_queue = Queue()

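    def pages(self):
        # Total number of listing pages
        response = requests.get(self.urls.format(2))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/nav/div/a/text()')
        # Return an int so it can be compared with the page counter in run()
        return int(page[-2])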

    def html_list(self, page):
        print(self.urls.format(page))
        response = requests.get(self.urls.format(page))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath('/html/body/section[1]/div/ul/li/div[1]/a/@href')
        return page

    def detail_page(self, imgde):
        response = requests.get(self.deatil.format(imgde))
        strs = response.content.decode()
        html = etree.HTML(strs)
        page = html.xpath("//*[@id='imagecx']/div[4]/a[@class='page-numbers']/text()")
        return page[-1]
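    def detail_list(self, imgde, page):
        # Slice the gallery ID out of the link
        urls = imgde[-10:-5]
        print('Queueing the detail pages and saving the images')
        # Put every detail-page URL on the queue
        for i in range(int(page)):
            print(self.deatil.format(urls + '_' + str(i + 1) + '.html'))
            urlss = self.deatil.format(urls + '_' + str(i + 1) + '.html')
            self.url_queue.put(urlss)
        # Start one thread per queued URL
        for i in range(int(page)):
            t_url = threading.Thread(target=self.More_list)
            # t_url.setDaemon(True)
            t_url.start()
        # Block until every queued URL has been marked done
        self.url_queue.join()
        print('Gallery finished, moving on to the next one')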
    def More_list(self):
        urls = self.url_queue.get()
        try:
            response = requests.get(urls)
            strs = response.content.decode()
            html = etree.HTML(strs)
            imgs = html.xpath('//*[@id="imagecx"]/div[3]/p/a/img/@src')
            # Save the image
            self.save_img(imgs)
        except Exception as e:
            print('Request failed or timed out, skipping:', urls, e)
        finally:
            # Always mark the item done, otherwise url_queue.join() hangs
            self.url_queue.task_done()
    def save_img(self, imgs):
        print(imgs[0])
        response = requests.get(imgs[0], headers=self.headers)
        strs = response.content
        # Random file name: join the sampled characters into a string
        s = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 5))
        a = ''.join(random.sample('zyxwvutsrqponmlkjihgfedcba1234567890', 5))
        with open("./imgsa/" + a + s + ".jpg", "wb") as f:
            f.write(strs)
        print("Image saved")
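    def run(self):
        page = 1
        # Get the total number of pages (as an int, so the comparison below works)
        pageall = int(self.pages())
        print('Total pages: ' + str(pageall))
        while True:
            print('Visiting page ' + str(page))
            # Visit the listing page and get the links to its 10 galleries
            html_list = self.html_list(page)
            # Visit each gallery's detail pages
            s = 1
            for htmls in html_list:
                print('Visiting page ' + str(page) + ', gallery ' + str(s))
                imgdetalpage = self.detail_page(htmls)
                # Walk the detail pages and fetch the image URLs
                print('Page ' + str(page) + ', gallery ' + str(s) + ' has ' + str(imgdetalpage) + ' images')
                self.detail_list(htmls, imgdetalpage)
                s += 1
            page += 1
            if page > pageall:
                print('Crawl finished, leaving the loop')
                return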

if __name__ == '__main__':
    Imgs = ImgSpider()
    Imgs.run()

If anything is unclear, feel free to ask me. I'm a beginner too, so let's compare notes. QQ: 1341485724
