Python manga crawler: I reject my humanity, Bilibili! Scraping Kaguya-sama: Love Is War and other manga

Today we are going to scrape Kaguya-sama: Love Is War (《辉夜大小姐想让我告白》) from this site, www.90mh.com (the poor rely on technology, the rich rely on coins; you know what I mean, enough said).
There are really just two steps: 1. find the links to every chapter on the index page; 2. for each chapter, find all of its images.

If you just want the source code, skip straight to the end.

First, we locate the link of each chapter:
[Image 1]

    # Get each chapter's link and name
    hrefs = re.findall('<li>\n.*?\n.*?<a href="(.*?)".*?>(.*?)</a>', r.text)
    for href in hrefs:
        # Build the full chapter URL
        chapter_url = 'http://www.90mh.com' + href[0]
        name = href[1]
        chapter_path = root_path + '\\' + name
        print(chapter_path)  # 辉夜大小姐想让我告白\周刊13话
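Here `r` is the response for the comic's index page; one simple way to get it is a minimal fetch with `requests`, reusing the URL and headers from the full source at the end (just a sketch):

    import requests

    # Manga index page and headers, taken from the full source below
    url = 'http://www.90mh.com/manhua/zongzhijiushifeichangkeai/'
    headers = {
        'Host': 'www.90mh.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    r = requests.get(url, headers=headers, timeout=30)  # the HTML we search with re.findall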
Then we open one of the chapters and find all of its images:
[Image 2]

    # Extract the chapter's image list and image path from the embedded JavaScript
    chapter_imges = re.search(r'chapterImages = (\[.*?\])', chapter_page.text, re.S)
    chapter_src = re.search(r'chapterPath = "(.*?)"', chapter_page.text).group(1)
    ''' ...... '''
    pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
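To make this step concrete, here is a rough synchronous sketch (assuming `chapter_page` is the response for one chapter page and `chapter_path` is that chapter's folder; the asynchronous version in the full source at the end does the same thing):

    import os
    import re
    import requests
    from ast import literal_eval

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

    # The image list and base path are embedded in the chapter page's JavaScript
    chapter_imges = re.search(r'chapterImages = (\[.*?\])', chapter_page.text, re.S).group(1)
    chapter_src = re.search(r'chapterPath = "(.*?)"', chapter_page.text).group(1)
    chapter_imges = literal_eval(chapter_imges)  # string form of a JS array -> Python list

    for i, img in enumerate(chapter_imges):
        pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + img
        pic_path = chapter_path + '\\' + str(i).zfill(2) + '.jpg'
        if not os.path.exists(pic_path):  # skip images that were already downloaded
            pic = requests.get(pic_url, headers=headers, timeout=30).content
            with open(pic_path, 'wb') as f:
                f.write(pic)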
    

Final result:
[Image 3]

Success!

Of course, different sites are structured differently, so the scraping details vary a bit. Take 动漫之家 (Dmzj), for example; my approach there was adapted from here.

But there are really only a handful of patterns, and they can be worked out. I have scraped four or five sites so far, all successfully, so give it a try yourself.

Source code:
The script below uses coroutines, which makes it tens of times faster than the plain sequential approach, but it is a bit more work to write, and some requests occasionally time out, so you may need to run it a few times. Since it skips files that already exist, a re-run only fetches what is still missing. I also used a proxy here; configure your own and change the proxy IP address.

    import requests
    import re
    import time
    import os
    from ast import literal_eval
    import asyncio
    import aiohttp
    import aiofiles
    
    
    async def get_image(session,href_url,name):
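        # Download every image of one chapter, then chain to the next pending chapter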
        # Build the full chapter URL
        chapter_url = 'http://www.90mh.com' + href_url
        chapter_path = root_path + '\\' + name
        print(chapter_path)
    
        # Create the chapter folder
        if not os.path.exists(chapter_path):
            os.mkdir(chapter_path)
        try:
            async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
                r = await response.text()
        except:
            async with session.get(chapter_url, headers=headers, proxy=proxy, timeout=30) as response:
                r = await response.text()
        # Extract the chapter's image list and image path from the embedded JavaScript
        chapter_imges = re.search(r'chapterImages = (\[.*?\])', r, re.S)
        chapter_src = re.search(r'chapterPath = "(.*?)"', r).group(1)
    
    
        chapter_imges = chapter_imges.group(1)
        # Convert the string form of the JS array into a Python list
        chapter_imges = literal_eval(chapter_imges)
    
        tasks = []
        for i in range(len(chapter_imges)):
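            # Zero-pad single-digit indices so the image files sort in reading order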
            if i < 10:
                pic_path = chapter_path + '\\' + str(0) + str(i) + '.jpg'
            else:
                pic_path = chapter_path + '\\' + str(i) + '.jpg'
            print(pic_path)
            if not os.path.exists(pic_path):
                pic_url = 'https://js1.zzszs.com.cn/' + chapter_src + chapter_imges[i]
                tasks.append(asyncio.create_task(get_photo(session, pic_url, pic_path)))
        if tasks:
            await asyncio.wait(tasks)
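        # When this chapter is done, pull the next chapter off the shared list,
        # so that roughly group_size chapters are always in flight.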
        if hrefs:
            href = hrefs.pop()
            task = [asyncio.create_task(get_image(session, href[0], href[1]))]
            await asyncio.wait(task)
    
    
    async def get_photo(session,pic_url,pic_path):
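        # Download a single image and write it to disk, retrying once on failure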
        try:
            async with session.get(pic_url, headers=pic_headers, timeout=30) as p:
                pic = await p.content.read()
        except:
            async with session.get(pic_url, headers=pic_headers, timeout=50) as p:
                pic = await p.content.read()
        async with aiofiles.open(pic_path, 'wb') as f:
            await f.write(pic)
    
    
    
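    # Maximum number of chapters being downloaded at the same time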
    group_size = 5
    ip = '127.0.0.1:7890'
    proxy = 'http://' + ip
    proxies = {
        'http': 'http://' + ip,
        'https': 'https://' + ip
    }
    # Manga index page
    url = 'http://www.90mh.com/manhua/zongzhijiushifeichangkeai/'
    host = 'www.90mh.com'
    headers = {
        'Host': 'www.90mh.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    pic_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    root_path = '总之就是非常可爱'
    
    async def main():
        # Create the root folder
        if not os.path.exists(root_path):
            os.mkdir(root_path)
        async with aiohttp.ClientSession() as session:
            try:
            async with session.get(url, headers=headers, proxy=proxy, timeout=30) as response:
                    r = await response.text()
            except:
                async with session.get(url, headers=headers, proxy=proxy, timeout=50) as response:
                    r = await response.text()
    
            # Get each chapter's link and name
            global hrefs
            hrefs = re.findall('<li>\n.*?\n.*?<a href="(.*?)".*?>(.*?)</a>', r)

            tasks = []
            if len(hrefs) < group_size:
                num = len(hrefs)
            else:
                num = group_size
            for i in range(num):
                href = hrefs.pop()
                tasks.append(asyncio.create_task(get_image(session, href[0], href[1])))
            await asyncio.wait(tasks)


    if __name__ == '__main__':
        asyncio.run(main())