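This post is obsolete. I believe the PyInstaller tutorial linked here will be more useful to you; it also solves the "spider not found" problem and really does package Scrapy successfully: https://blog.csdn.net/La_vie_est_belle/article/details/96321995
This is my project directory:
As everyone knows, the dist and build folders and the crawl.spec file are generated by the packaging step.
The idea is: write a script that runs our Scrapy project, then use PyInstaller to turn that script into an exe file.
At first I wanted to run the project with the following script: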
from scrapy import cmdline
cmdline.execute("scrapy crawl SpiderName".split())
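It turned out not to work (so think twice before packaging with this script, though you can of course try it; you might get it to work).
Then I remembered that the Scrapy documentation has a section on how to run spiders from a script: https://doc.scrapy.org/en/latest/topics/practices.html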
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
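I used the code above and put it in crawl.py (changing the spider name, of course).
Then package it with PyInstaller:
In the project folder, Shift+right-click, choose "Open command window here", and enter: pyinstaller crawl.py (note: without -F)
This generates the dist and build folders and the crawl.spec file.
Open dist->crawl->crawl.exe and the window flashes and closes immediately (the workaround is to drag the exe into a cmd window and run it there).
In cmd we can see the error: the VERSION file under the scrapy folder is missing.
Next, create a folder named scrapy in the same directory as crawl.exe.
Then go to your installed scrapy package folder and copy the two files VERSION and mime.types into the scrapy folder you just created (if you copy only VERSION, you will later be told that mime.types cannot be found, so I cover both here).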
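The manual copy step can also be scripted. The following is only a minimal sketch, not part of the original project: the helper name `copy_scrapy_data_files` is made up for illustration, and it assumes the two file names mentioned above are the only ones your Scrapy version needs.

```python
import os
import shutil

# The two data files this post found missing at runtime.
SCRAPY_DATA_FILES = ("VERSION", "mime.types")


def copy_scrapy_data_files(scrapy_pkg_dir, exe_dir, names=SCRAPY_DATA_FILES):
    """Copy Scrapy's data files into <exe_dir>/scrapy and return the new paths.

    Hypothetical helper: scrapy_pkg_dir is the installed scrapy package folder,
    exe_dir is the folder that contains crawl.exe (dist/crawl in this post).
    """
    target_dir = os.path.join(exe_dir, "scrapy")
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    copied = []
    for name in names:
        destination = os.path.join(target_dir, name)
        shutil.copyfile(os.path.join(scrapy_pkg_dir, name), destination)
        copied.append(destination)
    return copied
```

With Scrapy installed, `os.path.dirname(scrapy.__file__)` gives the package folder, so a call such as `copy_scrapy_data_files(os.path.dirname(scrapy.__file__), os.path.join("dist", "crawl"))` should reproduce the manual step.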
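After that, run crawl.exe again and it reports missing modules. Whichever module is missing, import it in crawl.py; in the end it felt as if I had to import nearly the whole Scrapy framework (this took me quite a while).
These are the modules that were reported missing: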
import robotparser
import scrapy.spiderloader
import scrapy.statscollectors
import scrapy.logformatter
import scrapy.dupefilters
import scrapy.squeues
import scrapy.extensions.spiderstate
import scrapy.extensions.corestats
import scrapy.extensions.telnet
import scrapy.extensions.logstats
import scrapy.extensions.memusage
import scrapy.extensions.memdebug
import scrapy.extensions.feedexport
import scrapy.extensions.closespider
import scrapy.extensions.debug
import scrapy.extensions.httpcache
import scrapy.extensions.statsmailer
import scrapy.extensions.throttle
import scrapy.core.scheduler
import scrapy.core.engine
import scrapy.core.scraper
import scrapy.core.spidermw
import scrapy.core.downloader
import scrapy.downloadermiddlewares.stats
import scrapy.downloadermiddlewares.httpcache
import scrapy.downloadermiddlewares.cookies
import scrapy.downloadermiddlewares.useragent
import scrapy.downloadermiddlewares.httpproxy
import scrapy.downloadermiddlewares.ajaxcrawl
import scrapy.downloadermiddlewares.chunked
import scrapy.downloadermiddlewares.decompression
import scrapy.downloadermiddlewares.defaultheaders
import scrapy.downloadermiddlewares.downloadtimeout
import scrapy.downloadermiddlewares.httpauth
import scrapy.downloadermiddlewares.httpcompression
import scrapy.downloadermiddlewares.redirect
import scrapy.downloadermiddlewares.retry
import scrapy.downloadermiddlewares.robotstxt
import scrapy.spidermiddlewares.depth
import scrapy.spidermiddlewares.httperror
import scrapy.spidermiddlewares.offsite
import scrapy.spidermiddlewares.referer
import scrapy.spidermiddlewares.urllength
import scrapy.pipelines
import scrapy.core.downloader.handlers.http
import scrapy.core.downloader.contextfactory
import scrapy.pipelines.images # the project uses the images pipeline
import openpyxl # the project uses the openpyxl library
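Note the last two lines, the ones with comments: my example project uses openpyxl and the images pipeline. If your project uses neither, you do not need to import them; your project may of course need other modules instead.
The lines above without comments are presumably the ones that must be imported.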
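A likely reason these modules go missing is that Scrapy refers to its components by dotted string paths in its settings, which PyInstaller's import analysis cannot follow. As a rough sketch of how such a list could be derived instead of discovered one error at a time, the hypothetical helper below scans a settings-like dict for strings that look like class paths. The function name, the regular expression, and the sample dict are illustrative assumptions, not Scrapy API.

```python
import re

# A dotted Python path whose last part looks like a class name (starts uppercase).
_CLASS_PATH = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)*\.[A-Z]\w*$")


def hidden_modules(settings):
    """Return the sorted module names behind every class path found in settings."""
    modules = set()

    def visit(value):
        if isinstance(value, dict):
            # Component dicts map "path.to.Class" -> order, so check keys too.
            for key, item in value.items():
                visit(key)
                visit(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                visit(item)
        elif isinstance(value, str) and _CLASS_PATH.match(value):
            modules.add(value.rpartition(".")[0])

    visit(settings)
    return sorted(modules)


# Illustrative sample only; a real run would pass the project's full settings.
SAMPLE_SETTINGS = {
    "BOT_NAME": "Scrape_Books",
    "SPIDER_MODULES": ["Scrape_Books.spiders"],
    "SCHEDULER": "scrapy.core.scheduler.Scheduler",
    "EXTENSIONS_BASE": {
        "scrapy.extensions.corestats.CoreStats": 0,
        "scrapy.extensions.telnet.TelnetConsole": 0,
    },
}

print(hidden_modules(SAMPLE_SETTINGS))
# → ['scrapy.core.scheduler', 'scrapy.extensions.corestats', 'scrapy.extensions.telnet']
```

Each returned name corresponds to one `import` line of the kind listed above. The heuristic will miss anything that is not referenced by a class path, so treat its output as a starting point.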
So the whole crawl.py file looks like this (I added a few comments to make the parts easier to tell apart):
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# These imports are required
import robotparser  # Python 2 module; on Python 3 it is urllib.robotparser
import scrapy.spiderloader
import scrapy.statscollectors
import scrapy.logformatter
import scrapy.dupefilters
import scrapy.squeues
import scrapy.extensions.spiderstate
import scrapy.extensions.corestats
import scrapy.extensions.telnet
import scrapy.extensions.logstats
import scrapy.extensions.memusage
import scrapy.extensions.memdebug
import scrapy.extensions.feedexport
import scrapy.extensions.closespider
import scrapy.extensions.debug
import scrapy.extensions.httpcache
import scrapy.extensions.statsmailer
import scrapy.extensions.throttle
import scrapy.core.scheduler
import scrapy.core.engine
import scrapy.core.scraper
import scrapy.core.spidermw
import scrapy.core.downloader
import scrapy.downloadermiddlewares.stats
import scrapy.downloadermiddlewares.httpcache
import scrapy.downloadermiddlewares.cookies
import scrapy.downloadermiddlewares.useragent
import scrapy.downloadermiddlewares.httpproxy
import scrapy.downloadermiddlewares.ajaxcrawl
import scrapy.downloadermiddlewares.chunked
import scrapy.downloadermiddlewares.decompression
import scrapy.downloadermiddlewares.defaultheaders
import scrapy.downloadermiddlewares.downloadtimeout
import scrapy.downloadermiddlewares.httpauth
import scrapy.downloadermiddlewares.httpcompression
import scrapy.downloadermiddlewares.redirect
import scrapy.downloadermiddlewares.retry
import scrapy.downloadermiddlewares.robotstxt
import scrapy.spidermiddlewares.depth
import scrapy.spidermiddlewares.httperror
import scrapy.spidermiddlewares.offsite
import scrapy.spidermiddlewares.referer
import scrapy.spidermiddlewares.urllength
import scrapy.pipelines
import scrapy.core.downloader.handlers.http
import scrapy.core.downloader.contextfactory
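# The next two lines are needed only because my project uses them
import scrapy.pipelines.images # the project uses the images pipeline
import openpyxl # the project uses the openpyxl library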
process = CrawlerProcess(get_project_settings())
# 'Books' is the name of one of the spiders of the project.
process.crawl('Books')
process.start() # the script will block here until the crawling is finished
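Because my example project uses an image resource, blank.png (a white image substituted when a book's picture cannot be scraped), that image also has to be placed in the same folder as crawl.exe.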
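The pipeline opens 'blank.png' by a relative path, so it is only found when the working directory happens to be the exe's folder. One possible way to make the lookup independent of the working directory is sketched below. The helper names `app_dir` and `resource_path` are hypothetical; the sketch relies on PyInstaller setting `sys.frozen` in the packaged program and falls back to the current directory otherwise.

```python
import os
import sys


def app_dir():
    """Folder containing crawl.exe when frozen by PyInstaller, else the current directory."""
    if getattr(sys, "frozen", False):
        # In a frozen program sys.executable is the exe itself.
        return os.path.dirname(os.path.abspath(sys.executable))
    return os.getcwd()


def resource_path(name):
    """Absolute path of a resource file stored next to the executable."""
    return os.path.join(app_dir(), name)
```

In the pipeline this would mean opening `resource_path('blank.png')` instead of `'blank.png'`; whether that is worth the change depends on how the exe is launched.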
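OK, that is about it.
Repackage with PyInstaller (in the cmd window, cd to your own project path first, and again do not use pyinstaller -F crawl.py): pyinstaller crawl.py
When it finishes, open dist->crawl->crawl.exe and it runs.
It is a hassle, but it finally works.
In one sentence: import whatever is missing.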
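As an alternative to adding import lines to crawl.py, PyInstaller also accepts a `--hidden-import` option on the command line. This is not what the post did, and I have not verified it against this project; the sketch below only shows how such a command could be assembled. The function name `pyinstaller_command` is made up for illustration.

```python
def pyinstaller_command(script, hidden_modules):
    """Build a pyinstaller command line with one --hidden-import per module."""
    command = ["pyinstaller", script]
    for module in hidden_modules:
        command += ["--hidden-import", module]
    return command


# Example with two of the modules named in this post.
print(" ".join(pyinstaller_command("crawl.py", ["scrapy.squeues", "openpyxl"])))
# → pyinstaller crawl.py --hidden-import scrapy.squeues --hidden-import openpyxl
```

The resulting list can be passed to `subprocess.call` from the project folder. The effect should be the same as the explicit imports, with crawl.py kept short.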
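The example code can be downloaded from this link: http://download.csdn.net/download/la_vie_est_belle/10198706 (the minimum download cost cannot be set to 0 points, so if you have no points, make do with the code below; or send me your email address and, if I have time to check the blog, I will send it to you)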
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field
class BooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Product information
    upc = Field()  # UPC code
    title = Field()  # title
    price = Field()  # price
    stock = Field()  # stock
    link = Field()  # page link
    rating = Field()  # rating
    pro_des = Field()  # product description
    category = Field()  # book category
    # Images
    images = Field()
    image_urls = Field()
middlewares.py is unchanged
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
import openpyxl as opxl
from scrapy.exceptions import DropItem
from openpyxl.drawing.image import Image
from openpyxl.styles import Font, Border, Side
from scrapy.pipelines.images import ImagesPipeline
class DuplicatePipeline(object):
    # Drop duplicate items, keyed on the UPC
    def __init__(self):
        self.books_set = set()

    def process_item(self, item, spider):
        if item['upc'] in self.books_set:
            raise DropItem("Item %s already exists; dropped" % item['upc'])
        else:
            self.books_set.add(item['upc'])
            return item
class DetailsPipeline(object):
    # Replace the rating word with a number
    rating_convert = {
        'One': 1,
        'Two': 2,
        'Three': 3,
        'Four': 4,
        'Five': 5
    }

    def process_item(self, item, spider):
        item['rating'] = self.rating_convert[item['rating']]
        return item
class ExcelPipeline(object):
    wb = None
    ws = None
    row_index = 2
    # outline does not seem to work???
    top = Side(border_style="thin", color='000000')
    left = Side(border_style="thin", color='000000')
    right = Side(border_style="thin", color='000000')
    bottom = Side(border_style="thin", color='000000')

    def open_spider(self, spider):
        self.wb = opxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'Books'
        # Header row: image, title, price, UPC, stock, page link, rating, category, description
        self.ws.append(['图片', '标题', '价格', 'UPC', '库存', '页面链接', '评分', '类别', '产品描述'])
        # Style the cells of the first row
        for row in self.ws.iter_rows():
            for cell in row:
                cell.font = Font(name=u'微软雅黑', bold=True)
                cell.border = Border(left=self.left, right=self.right, top=self.top, bottom=self.bottom)

    def close_spider(self, spider):
        # Style every cell except those in the first row
        di_index = 2  # row number used to set the row height
        for row in self.ws.iter_rows(min_row=2):
            self.ws.row_dimensions[di_index].height = 30
            di_index += 1
            for cell in row:
                cell.font = Font(name=u'微软雅黑')
                cell.border = Border(left=self.left, right=self.right, top=self.top, bottom=self.bottom)
        self.wb.save('books.xlsx')

    def process_item(self, item, spider):
        self.ws.append(['', item['title'], item['price'], item['upc'], item['stock'], item['link'], item['rating'],
                        item['category'], item['pro_des']])
        # If the thumbnail does not exist, use the blank image instead
        try:
            img = Image('images/%s/%s_small.jpg' % (item['category'], item['title']))
            self.ws.add_image(img, 'A%s' % self.row_index)
        except Exception:
            print(u'++++++ product image not found, inserting the blank image instead ++++++')
            img = Image('blank.png')
            self.ws.add_image(img, 'A%s' % self.row_index)
        self.row_index += 1
        return item
class ForImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        return '%s/%s.jpg' % (item['category'], item['title'])  # remember to include .jpg

    def thumb_path(self, request, thumb_id, response=None, info=None):
        item = request.meta['item']
        return '%s/%s_small.jpg' % (item['category'], item['title'])  # remember to include .jpg
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for Scrape_Books project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Scrape_Books'
SPIDER_MODULES = ['Scrape_Books.spiders']
NEWSPIDER_MODULE = 'Scrape_Books.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Scrape_Books.middlewares.ScrapeBooksSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Scrape_Books.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Scrape_Books.pipelines.DuplicatePipeline': 297,
    'Scrape_Books.pipelines.ForImagesPipeline': 298,
    'Scrape_Books.pipelines.DetailsPipeline': 299,
    'Scrape_Books.pipelines.ExcelPipeline': 300,
}
IMAGES_STORE = 'images'
IMAGES_THUMBS = {
    'small': (38, 36)
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Books.py (the spider file)
# -*- coding: utf-8 -*-
import scrapy
from Scrape_Books.items import BooksItem
class BooksSpider(scrapy.Spider):
    name = 'Books'
    allowed_domains = ['books.toscrape.com']
    #website = raw_input('Please input the website: ')
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Collect the link to each book's detail page and follow it
        for page_link in response.css('h3 > a::attr(href)').extract():
            yield response.follow(page_link, callback=self.parse_books)
        next_link = response.css('ul.pager li.next > a::attr(href)').extract_first()
        if next_link:  # the last page has no "next" link
            yield response.follow(next_link, callback=self.parse)

    def parse_books(self, response):
        # Extract the book details from the response
        info = BooksItem()
        info['upc'] = response.css('table tr:first-child td::text').extract_first()
        info['title'] = response.css('h1::text').extract_first()
        info['price'] = response.css('p.price_color::text').extract_first()
        info['stock'] = response.css('p.instock::text').re_first(r'\d+')
        info['link'] = response.url
        # The class attribute looks like "star-rating Three"; drop the 12-character prefix
        info['rating'] = response.css('p.star-rating::attr(class)').extract_first()[12:]
        info['pro_des'] = response.css('div#product_description + p::text').extract_first()
        info['category'] = response.css('ul.breadcrumb li:nth-child(3) a::text').extract_first()
        info['image_urls'] = [response.urljoin(response.css('img::attr(src)').extract_first())]
        yield info
The resource file blank.png is simply a blank 38x36-pixel image.
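If you do not have such an image at hand, one can be generated. The sketch below writes a plain white 38x36 PNG using only the standard library; the function names are made up for illustration, and the choice of an 8-bit RGB image is an assumption, since the post only says the image is blank (white).

```python
import struct
import zlib


def _png_chunk(tag, data):
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def blank_png_bytes(width=38, height=36):
    """Return the bytes of a solid white 8-bit RGB PNG of the given size."""
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), default methods.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline is a filter byte (0 = none) followed by white pixels.
    scanline = b"\x00" + b"\xff" * (width * 3)
    pixels = zlib.compress(scanline * height)
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", header)
            + _png_chunk(b"IDAT", pixels)
            + _png_chunk(b"IEND", b""))


def write_blank_png(path="blank.png"):
    """Write the blank image to path."""
    with open(path, "wb") as handle:
        handle.write(blank_png_bytes())
```

Calling `write_blank_png()` in the folder that holds crawl.exe creates the file. The default size matches the 'small' thumbnail size in settings.py.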
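If you have any questions, feel free to discuss them in the comments below.
I have set up a Python QQ group; everyone is welcome to join and learn from each other: 820934083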