What the parameters in a Scrapy project's settings file do

Project (bot) name

BOT_NAME = 'qidianwang'

Spider module paths

SPIDER_MODULES = ['qidianwang.spiders']
NEWSPIDER_MODULE = 'qidianwang.spiders'

Crawl responsibly by identifying yourself (and your website) on the user-agent

Sets the User-Agent sent with requests (it can be changed to a real browser's string to mimic a browser)

USER_AGENT = 'qidianwang (+http://www.yourdomain.com)'

Obey robots.txt rules

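Whether to obey robots.txt (the generated project template sets True, meaning obey; Scrapy's built-in fallback default is False)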

ROBOTSTXT_OBEY = False

Configure maximum concurrent requests performed by Scrapy (default: 16)

Maximum number of concurrent requests Scrapy will issue (default 16)

CONCURRENT_REQUESTS = 32

Configure a delay for requests for the same website (default: 0)

See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

See also autothrottle settings and docs

Download delay in seconds (default 0)

DOWNLOAD_DELAY = 0
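As a rough illustration of what a non-zero DOWNLOAD_DELAY means in practice: when RANDOMIZE_DOWNLOAD_DELAY is left at its default of True, Scrapy waits a random interval between 0.5 and 1.5 times DOWNLOAD_DELAY between requests to the same site. The helper `sample_delay` below is a hypothetical name that only mimics that rule; it is not Scrapy's own code.

```python
import random


def sample_delay(download_delay, randomize=True, rng=random):
    """Return one wait time in seconds under the 0.5x-1.5x randomization rule.

    Hypothetical helper for illustration only.
    """
    if not randomize or download_delay == 0:
        # No randomization (or no delay at all): the value is used as-is.
        return float(download_delay)
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 2, each wait falls somewhere between 1 and 3 seconds.
delays = [sample_delay(2.0) for _ in range(5)]
print(delays)
```

With DOWNLOAD_DELAY = 0, as in this project, no wait is inserted at all.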

The download delay setting will honor only one of:

Maximum number of concurrent requests allowed per domain (default 8)

CONCURRENT_REQUESTS_PER_DOMAIN = 16

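Maximum number of concurrent requests allowed per IP (default 0, meaning no per-IP limit)

1. When it is non-zero, CONCURRENT_REQUESTS_PER_IP takes precedence over CONCURRENT_REQUESTS_PER_DOMAIN, and the per-domain setting is ignored.

2. When it is non-zero, DOWNLOAD_DELAY is enforced per IP rather than per website.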

CONCURRENT_REQUESTS_PER_IP = 16
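The precedence between the two concurrency limits can be sketched as a small decision function. `effective_limit` is a hypothetical name used only for illustration; Scrapy implements this inside its downloader, not through a function like this.

```python
def effective_limit(per_domain, per_ip):
    """Return (grouping, limit) describing which concurrency cap applies.

    Hypothetical helper: a non-zero per-IP limit wins and the per-domain
    limit is ignored; otherwise requests are capped per domain.
    """
    if per_ip:
        return ("ip", per_ip)
    return ("domain", per_domain)


# The values used in this settings file: the per-IP limit is in effect.
print(effective_limit(per_domain=16, per_ip=16))
# With the default per-IP value of 0, the per-domain limit applies.
print(effective_limit(per_domain=16, per_ip=0))
```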

Disable cookies (enabled by default)

Whether to send cookies; the default is True (cookies are sent)

COOKIES_ENABLED = False

COOKIES_DEBUG defaults to False, meaning cookies sent and received are not logged

COOKIES_DEBUG = True

Disable Telnet Console (enabled by default)

An extension that lets you inspect the state of the running crawler over Telnet; it is enabled (True) by default

TELNETCONSOLE_ENABLED = False

Override the default request headers:

Default request headers

DEFAULT_REQUEST_HEADERS = {

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

'Accept-Language': 'en',

'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',

}
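These headers act as defaults: a header that a request already carries is kept, and only the missing ones are filled in. A minimal sketch of that merge, using plain dicts and a hypothetical `apply_defaults` helper (real Scrapy headers are case-insensitive, which a plain dict is not):

```python
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}


def apply_defaults(request_headers, defaults=DEFAULT_REQUEST_HEADERS):
    """Fill in default headers without overwriting ones the request sets.

    Hypothetical helper for illustration only.
    """
    merged = dict(request_headers)
    for name, value in defaults.items():
        merged.setdefault(name, value)
    return merged


# A request that sets its own Accept-Language keeps it; Accept is filled in.
merged = apply_defaults({"Accept-Language": "zh-CN"})
print(merged)
```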

Enable or disable spider middlewares

See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

Spider middlewares

SPIDER_MIDDLEWARES = {

'qidianwang.middlewares.QidianwangSpiderMiddleware': 543,

}

Enable or disable downloader middlewares

See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

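Downloader middlewares. A custom downloader middleware must be activated here. The smaller the number, the higher the priority: the middleware sits closer to the engine, so its process_request runs earlier and its process_response runs later.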

DOWNLOADER_MIDDLEWARES = {
'qidianwang.middlewares.QidianUserAgentDownloadmiddlerware': 543,
# 'qidianwang.middlewares.QidianProxyDownloadMiddlerware':544,
# 'qidianwang.middlewares.SeleniumDownlaodMiddlerware':543,
}
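To make the ordering concrete, here is a simplified sketch of how the numbers translate into call order. It ignores Scrapy's built-in middlewares (which are merged into the same ordering) and uses a hypothetical `call_order` helper; the `DisabledExample` entry is also made up, to show that a value of None switches a middleware off.

```python
DOWNLOADER_MIDDLEWARES = {
    "qidianwang.middlewares.QidianUserAgentDownloadmiddlerware": 543,
    "qidianwang.middlewares.QidianProxyDownloadMiddlerware": 544,
    "qidianwang.middlewares.DisabledExample": None,  # hypothetical entry
}


def call_order(middlewares):
    """Return (process_request order, process_response order).

    Hypothetical helper: entries set to None are dropped, requests pass
    through in ascending order, responses in descending order.
    """
    enabled = {path: order for path, order in middlewares.items()
               if order is not None}
    request_order = sorted(enabled, key=enabled.get)
    response_order = list(reversed(request_order))
    return request_order, response_order


request_order, response_order = call_order(DOWNLOADER_MIDDLEWARES)
print(request_order)
print(response_order)
```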

Enable or disable extensions

See https://doc.scrapy.org/en/latest/topics/extensions.html

EXTENSIONS: add extensions here (setting a value to None disables that extension)

EXTENSIONS = {

'scrapy.extensions.telnet.TelnetConsole': None,

}

Configure item pipelines

See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

Activate item pipelines here; the smaller the number, the higher the priority (the pipeline runs earlier)

ITEM_PIPELINES = {
'qidianwang.pipelines.QidianwangPipeline': 300,
'scrapy_redis.pipelines.RedisPipeline': 400,
}
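The effect of the pipeline numbers can be sketched without Scrapy itself. Both pipeline classes and the `run_pipelines` helper below are hypothetical; they only imitate the convention that every enabled pipeline exposes `process_item(item, spider)` and that items pass through pipelines in ascending order.

```python
class StripTitlePipeline:
    """Hypothetical pipeline: trims whitespace from the title field."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item


class AddLengthPipeline:
    """Hypothetical pipeline: records the length of the cleaned title."""

    def process_item(self, item, spider):
        item["title_length"] = len(item["title"])
        return item


# Same shape as ITEM_PIPELINES, but keyed by class for this sketch.
PIPELINES = {AddLengthPipeline: 400, StripTitlePipeline: 300}


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through each pipeline, lowest number first."""
    for pipeline_cls in sorted(pipelines, key=pipelines.get):
        item = pipeline_cls().process_item(item, spider)
    return item


result = run_pipelines({"title": "  Example  "}, PIPELINES)
print(result)
```

Because the pipeline numbered 300 runs before the one numbered 400, the length is computed on the already trimmed title.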

Dynamic download delay (the AutoThrottle extension, disabled by default)

To use it, step 1 is to turn it on: AUTOTHROTTLE_ENABLED = True

Enable and configure the AutoThrottle extension (disabled by default)

See https://doc.scrapy.org/en/latest/topics/autothrottle.html

AUTOTHROTTLE_ENABLED = True

The initial download delay

Initial download delay, in seconds

AUTOTHROTTLE_START_DELAY = 5

The maximum download delay to be set in case of high latencies

Maximum download delay, in seconds

AUTOTHROTTLE_MAX_DELAY = 60

The average number of requests Scrapy should be sending in parallel to each remote server

Target number of parallel requests per remote server; AutoThrottle adjusts the delay to approach this value

AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Enable showing throttling stats for every response received:

Whether to enable AutoThrottle's debug mode, which logs throttling stats for every response

AUTOTHROTTLE_DEBUG = False
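How these AutoThrottle values interact can be sketched by following the adjustment rule described in the AutoThrottle documentation: the target delay is the response latency divided by the target concurrency, the next delay is the average of the previous delay and that target, non-200 responses may not lower the delay, and the result is kept between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. The function name `next_delay` is hypothetical and this is a simplified sketch, not the extension's actual code.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the download delay to use after one response (seconds).

    Hypothetical helper; min_delay plays the role of DOWNLOAD_DELAY and
    max_delay the role of AUTOTHROTTLE_MAX_DELAY.
    """
    target_delay = latency / target_concurrency
    # Move halfway from the current delay towards the target delay.
    new_delay = (current_delay + target_delay) / 2.0
    # If the target is higher than the average, jump straight to it.
    new_delay = max(target_delay, new_delay)
    # Keep the result inside the configured bounds.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to decrease the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


# Start at AUTOTHROTTLE_START_DELAY = 5 and observe fast 0.5 s responses.
delay = 5.0
for latency in (0.5, 0.5, 0.5):
    delay = next_delay(delay, latency)
print(delay)
```

The delay shrinks step by step towards the observed latency rather than dropping to it at once, which is why a conservative start delay costs little on a responsive site.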

==========================================================

Enable and configure HTTP caching (disabled by default)

See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

An HTTP response cache (disabled by default, that is, HTTPCACHE_ENABLED = False)

HTTPCACHE_ENABLED = True

Cache expiration time in seconds; 0 means cached responses never expire

HTTPCACHE_EXPIRATION_SECS = 0

Directory where the cache is stored

HTTPCACHE_DIR = 'httpcache'

HTTP status codes whose responses should not be cached; for example, to skip caching of 400 responses: HTTPCACHE_IGNORE_HTTP_CODES = ['400']

HTTPCACHE_IGNORE_HTTP_CODES = []

Storage backend used for the cache

HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
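The meaning of HTTPCACHE_EXPIRATION_SECS = 0 is easiest to see as a freshness check. The sketch below uses a hypothetical `is_expired` helper; the filesystem storage applies a comparable test against the time a response was stored, but this is an illustration rather than its code.

```python
import time


def is_expired(stored_at, expiration_secs, now=None):
    """Return True if a cached response stored at `stored_at` is too old.

    Hypothetical helper: an expiration of 0 disables expiry entirely.
    """
    now = time.time() if now is None else now
    return 0 < expiration_secs < now - stored_at


# With expiration 0, even a very old entry is still served from cache.
print(is_expired(stored_at=0, expiration_secs=0, now=10**9))
# With a 60 s expiration, an entry stored 100 s ago is stale.
print(is_expired(stored_at=100, expiration_secs=60, now=200))
```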

Save log output to a local file

LOG_FILE = 'qdlogfile.log'

LOG_LEVEL = 'DEBUG'
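LOG_LEVEL is the minimum severity that gets logged, and Scrapy relies on Python's standard logging levels. A small sketch of which levels each choice lets through, using a hypothetical `visible_levels` helper:

```python
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def visible_levels(log_level):
    """Return the level names that are logged for a given LOG_LEVEL.

    Hypothetical helper built on the stdlib logging constants.
    """
    threshold = getattr(logging, log_level)
    return [name for name in LEVELS if getattr(logging, name) >= threshold]


# 'DEBUG' (used in this file) logs everything; 'WARNING' is much quieter.
print(visible_levels("DEBUG"))
print(visible_levels("WARNING"))
```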
