Scrapy crawler error 10054: An existing connection was forcibly closed by the remote host

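Possible causes:
1. A network problem. Check whether the fault lies with your own machine or with the target website.
2. Pages are being requested too frequently per unit of time.
3. The website detected non-human (automated) behavior and dropped the connection.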

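Solution approach:

The most effective safeguard is exception handling (try/except), so that one dropped connection does not abort the whole crawl. Note that in Scrapy a download error is raised asynchronously, after the request has been yielded, so it is handled in the request's errback rather than by a try/except around the yield.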
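In plain Python code, error 10054 typically surfaces as ConnectionResetError, although HTTP libraries may wrap it in their own exception types. The try/except idea can be sketched as a small retry helper with a growing wait between attempts. The names call_with_retry, retries, and base_delay are illustrative assumptions, not part of Scrapy or of the original post, and the sleep parameter is injectable only so the helper can be tried without real waiting.

```python
import time


def call_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); if the connection is reset, wait and try again.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last error once `retries` attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionResetError as exc:  # WinError 10054 is reported as this class
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * (2 ** attempt)
            print("Connection reset ({}); retrying in {}s".format(exc, delay))
            sleep(delay)
```

Inside a Scrapy project this helper is usually unnecessary, since Scrapy's built-in retry middleware already retries many connection failures; the sketch is meant for scripts that fetch pages directly.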

1. Check whether the network is at fault; if it is, switch to a more stable connection.
2. Set a download delay.
Add the following to settings.py:

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
DOWNLOAD_TIMEOUT = 60

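DOWNLOAD_DELAY = 3
makes Scrapy wait 3 seconds between consecutive requests to the same website.
DOWNLOAD_TIMEOUT = 60
sets the download timeout to 60 seconds. Some pages load very slowly; with this setting, a download that has not finished after 60 seconds is aborted with a timeout error.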
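A fixed delay can be combined with Scrapy's AutoThrottle and retry settings, which the comment in the settings block above alludes to. The fragment below is a sketch for settings.py: the setting names are Scrapy's own, but every value is an assumption that should be tuned for the target site.

```python
# settings.py (fragment): all values below are illustrative assumptions

# Randomize each wait to 0.5x-1.5x of DOWNLOAD_DELAY so requests look less regular
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True

# Let Scrapy adapt the delay to the latency it observes from the server
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 60

# Limit parallel requests per domain to lower the request rate further
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Retry failed downloads (including dropped connections) a few times
RETRY_ENABLED = True
RETRY_TIMES = 3
```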
3. Set the User-Agent (UA).
There are several ways to set the UA:
1) Set it directly in the spider:

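import scrapy

# COOKIES is a list of cookie dicts defined elsewhere in the project (not shown here).


class KespiderSpider(scrapy.Spider):
    name = 'kespider'
    allowed_domains = ['rong.jingdata.com']
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    cookies = COOKIES[0]
    industry_list = ["E_COMMERCE", "SOCIAL_NETWORK", ]
    phase_list = ["ANGEL", ]

    def start_requests(self):
        for industry in self.industry_list:
            for phase in self.phase_list:
                for i in range(15, 16):  # page numbers; only page 15 here
                    url = "https://rong.jingdata.com/n/api/column/0/company?phase={}&industry={}&sortField=HOT_SCORE&p={}".format(
                        phase, industry, i)
                    self.logger.info(url)
                    # Download errors such as 10054 happen after the yield, so they
                    # are handled in the errback, not by a try/except around it.
                    yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies, callback=self.parse,
                                         errback=self.on_error, dont_filter=True)

    def on_error(self, failure):
        # failure.value is the underlying exception; log it and keep crawling
        self.logger.error("Request failed: %s (%r)", failure.request.url, failure.value)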

2) Set it in settings.py:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = '******(+http://www.yourdomain.com)'

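3) Use a downloader middleware. Scrapy's source code ships with many built-in middlewares, in the scrapy/downloadermiddlewares directory:
[Figure 1: listing of Scrapy's built-in downloader middleware directory]
Copy useragent.py into your project directory (the package that contains settings.py):
[Figure 2: project tree with useragent.py copied in]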

Open useragent.py. By default it reads the UA from the USER_AGENT setting in settings.py:

"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

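To use a random UA, first install the fake-useragent library from the command line with pip install fake-useragent.
Then change the code in useragent.py as follows; copying and pasting it directly is recommended.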

"""Set User-Agent header per spider or use a default value from settings"""

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    """This middleware sets a User-Agent chosen by fake-useragent on each request"""

    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = UserAgent()
        # RANDOM_UA_TYPE names a fake-useragent attribute such as "random" or "chrome"
        self.user_agent_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        # Must return a middleware instance, not the crawler itself
        return cls(crawler)

    def process_request(self, request, spider):
        def get_user_agent():
            return getattr(self.user_agent, self.user_agent_type)
        # setdefault only applies when no User-Agent is set yet, so Scrapy's built-in
        # UserAgentMiddleware must be disabled in settings.py for this to take effect
        request.headers.setdefault("User-Agent", get_user_agent())
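
If installing fake-useragent is not an option, the same idea can be sketched with a fixed list and random.choice. This is an alternative to the code above, not the author's method: StaticRandomUserAgentMiddleware and USER_AGENTS are hypothetical names, and the UA strings are only sample values.

```python
import random

# Sample desktop browser UA strings (illustrative; extend with your own)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
]


class StaticRandomUserAgentMiddleware:
    """Downloader middleware sketch: pick one UA per request from a fixed list."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware; return an instance
        return cls()

    def process_request(self, request, spider):
        # Assign directly so the random value replaces any header already set
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue processing the request
```

Because this sketch assigns the header instead of using setdefault, it also replaces a User-Agent passed in a request's own headers, which may or may not be what you want.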


Finally, do not forget to enable the middleware in settings.py:

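DOWNLOADER_MIDDLEWARES = {
    # 'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
    'dangdang.useragent.RandomUserAgentMiddleware': 544,
    # Disable the built-in middleware: it runs before priority 544 and would
    # otherwise set User-Agent first, leaving the random one unused
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}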

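Extra:
To pause and resume a crawl, run the spider with the JOBDIR setting on the command line and Scrapy saves the crawl state automatically. This is a built-in Scrapy feature and works whether or not you use scrapy_redis: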

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Note: replace somespider with the name of your own spider.

That's all!
