(4) Embedding pyppeteer in Scrapy (scrapy + asyncio)

The conventional pyppeteer middleware

In the conventional pyppeteer middleware, pyppeteer is an asyncio-based asynchronous library but is invoked synchronously. Its asynchronous advantage is therefore lost: Scrapy's reactor is blocked while each page renders, and the effective concurrency drops to 1. See the GitHub project (https://github.com/Python3WebSpider/ScrapyPyppeteer.git) for the full code.

import asyncio
import logging
from logging import getLogger

import pyppeteer
import websockets
from pyppeteer.errors import TimeoutError  # raised when page.goto exceeds its timeout
from scrapy.http import HtmlResponse


class PyppeteerMiddleware():
    def render(self, url, timeout=8.0, **kwargs):
        async def async_render(url, **kwargs):
            page = None
            try:
                page = await self.browser.newPage()
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                content = await page.content()
                return content, response.status
            except TimeoutError:
                return None, 500
            finally:
                # page may be unbound if newPage() itself failed
                if page and not page.isClosed():
                    await page.close()

        # Drive the coroutine to completion synchronously; this is the call
        # that blocks Scrapy's reactor and collapses concurrency to 1
        content, status = self.loop.run_until_complete(async_render(url, **kwargs))
        return content, status
    
    def process_request(self, request, spider):
        if request.meta.get('render') == 'pyppeteer':
            try:
                html, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8',
                                    status=status)
            except websockets.exceptions.ConnectionClosed:
                pass
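
The snippet assumes that self.browser and self.loop have been created elsewhere; a minimal sketch of that initialization, loosely following the referenced ScrapyPyppeteer project (names and defaults here are assumptions, not the original code):

import asyncio

import pyppeteer


class PyppeteerMiddleware():
    def __init__(self):
        # Assumed setup (not shown in the original excerpt): one shared
        # event loop and one headless Chromium per middleware instance
        self.loop = asyncio.get_event_loop()
        self.browser = self.loop.run_until_complete(
            pyppeteer.launch(headless=True))

    @classmethod
    def from_crawler(cls, crawler):
        return cls()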
    

The asynchronous pyppeteer middleware

Making the pyppeteer middleware asynchronous takes two steps:

  1. In the process_request method, call the pyppeteer request function as a coroutine, and use Deferred.fromFuture to wrap the resulting asyncio Future in a Twisted Deferred that Scrapy can consume:
import asyncio

from twisted.internet.defer import Deferred
from scrapy.http import HtmlResponse


def as_deferred(f):
    """Wrap an asyncio coroutine or Future in a Twisted Deferred"""
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PuppeteerMiddleware:
    async def _process_request(self, request, spider):
        """Handle the request using Puppeteer"""

        page = await self.browser.newPage()

        ...  # elided in the original: navigate to request.url and build response/body
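        # A plausible body for the elided part, following the referenced
        # scrapy-puppeteer project (an assumption, not the original excerpt):
        #     response = await page.goto(request.url,
        #                                {'waitUntil': 'domcontentloaded'})
        #     content = await page.content()
        #     body = str.encode(content)
        #     await page.close()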

        return HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=body,
            encoding='utf-8',
            request=request
        )

    def process_request(self, request, spider):
        """Check if the Request should be handled by Puppeteer"""

        if request.meta.get('render') == 'pyppeteer':
            return as_deferred(self._process_request(request, spider))
        
        return None
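
To wire this up, the middleware is registered in settings.py and each request opts in through its meta, matching the 'render' flag checked above; a minimal sketch, where the module path and priority are assumptions for illustration:

# settings.py (module path and priority are hypothetical)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PuppeteerMiddleware': 800,
}

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Only requests carrying this meta flag are rendered by pyppeteer
        yield scrapy.Request('https://example.com',
                             meta={'render': 'pyppeteer'},
                             callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}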

  2. Since Scrapy is built on Twisted while pyppeteer is built on asyncio, the two reactors must be made to interoperate. Twisted's solution is asyncioreactor, which runs Twisted on top of the asyncio event loop. It must be installed before importing scrapy or doing anything else that touches the reactor, so install it before importing execute:
import asyncio
from twisted.internet import asyncioreactor

asyncioreactor.install(asyncio.get_event_loop())

'''
The three lines above must come before any scrapy import;
otherwise the asyncio integration will not work
'''

from scrapy.cmdline import execute


execute("scrapy crawl spider_name".split())
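
As an aside, Scrapy 2.0+ can also select the reactor through the TWISTED_REACTOR setting instead of patching the entry point; a minimal sketch:

# settings.py (Scrapy >= 2.0)
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'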

See the GitHub project (https://github.com/clemfromspace/scrapy-puppeteer.git) for a complete implementation.

With this in place, rendered requests once again respect Scrapy's concurrency settings.
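
For example, the usual concurrency knobs now apply to pyppeteer-rendered requests as well (the values below are illustrative):

# settings.py: illustrative values
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8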

References

  1. https://github.com/Python3WebSpider/ScrapyPyppeteer.git
  2. https://github.com/clemfromspace/scrapy-puppeteer.git
  3. https://medium.com/@yashrsharma44/using-asyncio-in-twisted-5c2457e23618
