While using Scrapy with proxies to crawl a website, I ran into a few errors that I want to share.
The first error:
Connection to the other side was lost in a non-clean fashion: Connection lost.
When I searched for this error, the standard suggestion was to add a user-agent in settings.py.
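For example, a browser-like default user agent can be set like this (the UA string is just an illustrative value):

# settings.py: give requests a browser-like default User-Agent (example value)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'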
But bugs come in all shapes and sizes. Back to the point: since I was using a proxy, and the request headers might be at fault, I went looking for the code that sets them up.
I found this snippet:
if 'proxy' not in request.meta or self.current_proxy.is_expiring:
    print(request.meta)
    self.update_proxy()
    request.meta['proxy'] = self.current_proxy.proxy
Something about this smelled fishy. Sure enough, when I debugged with print statements, the code hung right here, and then, uh...
OK, a quick introduction to request.meta:
meta is a dictionary carried by every request, and its main job is to pass data along with it. Request-level details such as the proxy IP ride in meta, much like the user-agent and cookies ride in the request headers; knowing that much is enough here. The point is that my meta had no valid 'proxy' entry, which is what triggered the error, so I changed the assignment to
request.meta['REMOTE_ADDR'] = self.current_proxy.proxy
With that, the program was running again.
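To make the context clearer, here is a minimal sketch of the downloader middleware this check lives in. Only current_proxy, update_proxy, proxy, is_expiring and the meta check come from the snippets above; the ProxyModel stand-in, the proxy pool and the selection strategy are my own assumptions:

import random

class ProxyModel(object):
    # Minimal stand-in for the proxy object referenced above (assumed shape).
    def __init__(self, proxy):
        self.proxy = proxy          # e.g. 'http://1.2.3.4:8080'
        self.is_expiring = False    # real code would derive this from an expiry timestamp
        self.blacked = False        # set to True once the proxy is blacklisted

PROXIES = [ProxyModel('http://1.2.3.4:8080'), ProxyModel('http://5.6.7.8:8080')]  # dummy pool

class IPProxyDownloadMiddleware(object):
    def __init__(self):
        self.current_proxy = None

    def update_proxy(self):
        # Swap in a fresh proxy from the pool (illustrative strategy).
        self.current_proxy = random.choice(PROXIES)

    def process_request(self, request, spider):
        # meta is the per-request dict Scrapy uses to pass data between components;
        # assign a proxy when none is set yet or the current one is expiring.
        if self.current_proxy is None or 'proxy' not in request.meta or self.current_proxy.is_expiring:
            self.update_proxy()
            request.meta['proxy'] = self.current_proxy.proxy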
The second error:
Traceback (most recent call last):
File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "g:\python_learn\python_setup\lib\site-packages\scrapy\core\downloader\middleware.py", line 56, in process_response
(six.get_method_self(method).__class__.__name__, type(response)))
scrapy.exceptions._InvalidOutput: Middleware IPProxyDownloadMiddleware.process_response must return Response or Request, got
2019-09-10 11:42:58 [scrapy.core.scraper] ERROR: Error downloading
Traceback (most recent call last):
File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "g:\python_learn\python_setup\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "g:\python_learn\python_setup\lib\site-packages\twisted\internet\defer.py", line 1362, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://xt.meituan.com/meishi/pn1/>
This bug was honestly just me being careless.
def process_response(self, request, response, spider):
    print(response.status)
    # A non-200 status or a captcha redirect means the current proxy is blocked.
    if response.status != 200 or 'captcha' in response.url:
        print(response.status)
        if not self.current_proxy.blacked:
            self.current_proxy.blacked = True
        self.update_proxy()
        print('proxy %s is no longer usable' % self.current_proxy.proxy)
        request.meta['proxy'] = self.current_proxy.proxy
        print(request)
        # Returning the request makes Scrapy re-schedule it with the new proxy.
        return request
    # Everything looks fine: hand the response on down the chain.
    return response
The real cause was simply that my original return was in the wrong place, so process_response ended up returning something that was neither a Response nor a Request; I'll have to be more careful next time. The version above is correct. Its main job is to check whether the current proxy IP can still crawl this site.
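For completeness, a middleware like this only takes effect once it is registered in settings.py; the module path here is hypothetical:

# settings.py: enable the downloader middleware (module path is hypothetical)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IPProxyDownloadMiddleware': 543,  # 543 is an ordinary priority
}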
The third error:
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')
This happened because the string values were landing in the SQL statement without quotes, so MySQL saw broken syntax. In other words, this statement:
insert into meishi(name,phone,address) values(%s,%s,%s)
After adding the quotes, the strings inserted fine:
"insert into meishi(name,phone,address) values('" + name[x] + "','" + phone[x] + "','" + address[x] + "')"
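For reference, pymysql can also do the quoting itself if the values are passed as parameters to execute(), which sidesteps both the 1064 error and SQL injection. A minimal sketch, with the connection details and sample data as placeholders:

import pymysql

# Placeholder connection details and sample data for illustration.
conn = pymysql.connect(host='localhost', user='root', password='secret',
                       database='test', charset='utf8mb4')
cursor = conn.cursor()

name = ['Some Restaurant']
phone = ['010-12345678']
address = ['1 Example Road']
x = 0

# pymysql quotes and escapes each %s value itself when they are passed separately.
cursor.execute("insert into meishi(name,phone,address) values(%s,%s,%s)",
               (name[x], phone[x], address[x]))
conn.commit()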
Day 3, 2019/9/10. Keep going, keep going, keep going.