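Sometimes every request carries cookies and headers, yet no amount of packet capturing reveals where they come from. Neither scrapy nor requests can execute JavaScript, so they can only crawl static pages. scrapy-splash can crawl dynamic pages, but you have to run a separate service for it. In this situation I decided to use selenium, which supports Chrome, Firefox, and other browsers.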
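import urllib.parse

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

def __init__(self):
    chrome_options = Options()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--hide-scrollbars')
    # Do not show the browser window
    # chrome_options.add_argument('--headless')
    self.browser = webdriver.Chrome(executable_path='/opt/webdriver/chrome/chromedriver',
                                    chrome_options=chrome_options)
    self.browser.set_page_load_timeout(30)

# Override the start_requests method
def start_requests(self):
    cookies = self.convert_cookies(self.get_cookies())
    for form_data in self.form_data_list:
        yield scrapy.FormRequest(self.start_url, method="POST", cookies=cookies, formdata=form_data,
                                 dont_filter=True)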
# Get cookies through webdriver
def get_cookies(self):
    self.browser.get(self.cookies_url)
    cookies = []
    try:
        # Wait until the search button is clickable, i.e. the page has finished setting up
        WebDriverWait(self.browser, 100).until(
            expected_conditions.element_to_be_clickable((By.XPATH, "//a[@class='searchbutton']")))
        cookies = self.browser.get_cookies()
    except Exception as e:
        self.logger.info("Failed to get cookies: %s", e)
    finally:
        # Close the browser
        self.browser.quit()
    return cookies

# Convert the cookie list returned by webdriver into a name -> value dict
def convert_cookies(self, cookies):
    newcookies = {}
    for cookie in cookies:
        newcookies[cookie['name']] = cookie['value']
    return newcookies
# Convert form data to a dict
def fromData2Dict(self, formData):
    # urlencode turns spaces into '+', so convert them back here
    params = urllib.parse.unquote(formData).replace('+', ' ').split("&")
    nums = len(params)
    form_data = {}
    for i in range(0, nums):
        param = params[i].split("=", 1)
        key = param[0]
        value = param[1]
        form_data[key] = value
    return form_data
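The two conversion helpers above can be tried on their own, without a browser or a spider. The sketch below is one possible standalone version, not the post's code: the function names `cookies_to_dict` and `form_data_to_dict` and all sample values are made up for illustration. It also swaps the manual split for the standard library's `parse_qsl`, which splits on `&` and `=` first and decodes each field afterwards, so a value containing an encoded `&` or `=` is not cut apart.

```python
from urllib.parse import parse_qsl


def cookies_to_dict(cookies):
    """Turn a list of cookie records (dicts with 'name' and 'value') into a name -> value dict."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def form_data_to_dict(form_data):
    """Parse a urlencoded body; parse_qsl decodes %XX escapes and '+' field by field."""
    return dict(parse_qsl(form_data, keep_blank_values=True))


if __name__ == "__main__":
    # Hypothetical sample data, shaped like the records browser.get_cookies() returns
    sample_cookies = [
        {"name": "JSESSIONID", "value": "abc123", "domain": "example.com"},
        {"name": "token", "value": "xyz", "domain": "example.com"},
    ]
    print(cookies_to_dict(sample_cookies))
    print(form_data_to_dict("keyword=hello+world&page=1&note=a%26b&empty="))
```

Note that `keep_blank_values=True` keeps fields such as `empty=` in the result; without it, `parse_qsl` drops them.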
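Enable headless mode so that no window is shown (problem encountered: page elements could no longer be found):
chrome_options.add_argument('--headless')
Disable the sandbox:
chrome_options.add_argument('--no-sandbox')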
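Summary of the problems I ran into:
1. Everything ran fine on macOS, but on Linux it kept failing with an error saying the DevToolsActivePort file could not be found. Many blogs, both Chinese and foreign, recommend disabling the sandbox, but that made no difference.
For example: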
- https://github.com/heroku/heroku-buildpack-google-chrome/issues/46
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-setuid-sandbox')
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
With headless mode added it does run, but the page element cannot be found:
2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[2019-01-08 16:43:00] 140734813173184 POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
[2019-01-08 16:43:00] 140734813173184 http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
2019-01-08 16:43:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
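The post does not establish why the element lookup fails in headless mode. One commonly suggested cause is that headless Chrome starts with a small default window, so the page may lay out differently; this is an assumption here, not something the logs above confirm. The sketch below only builds the list of command-line switches as plain strings, so it runs without selenium. The helper name `build_chrome_args` and the 1920x1080 size are illustrative choices.

```python
def build_chrome_args(headless=True, window_size=(1920, 1080)):
    """Return Chrome command-line switches as a list of strings."""
    args = ["--disable-gpu", "--hide-scrollbars"]
    if headless:
        args.append("--headless")
        # Assumption: an explicit desktop-sized viewport keeps the layout close
        # to what a visible browser window would render.
        args.append("--window-size={},{}".format(*window_size))
    return args


if __name__ == "__main__":
    for arg in build_chrome_args():
        print(arg)
```

Each returned string would then be passed to `chrome_options.add_argument(...)` in the spider's initializer.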
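While reading someone else's blog I realized that the Linux server has no graphical interface, and I learned about Xvfb: Xvfb performs all graphical operations in memory and does not need any display device. So I tried installing it to see whether it would solve the problem: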
yum install Xvfb
The same error kept appearing, so I decided to try an older version of Chrome. First I checked the Linux version information:
[root@localhost google]# uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
I uninstalled the current google-chrome (version: google-chrome-stable-71.0.3578.98) and installed an older google-chrome (version: google-chrome-stable-62.0.3202.94). The chromedriver version changed from 2.45.615279 to 2.33.506092.
In the end it still failed:
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 3.10.0-327.el7.x86_64 x86_64)
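However, the error was different from the earlier one, so it felt like one step closer to success.
After more searching, I installed pyvirtualdisplay: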
pip install pyvirtualdisplay
Use it in the code:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(800, 800))
display.start()
driver = webdriver.Chrome()
The effort paid off: the code ran perfectly.
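Since the same spider ran on macOS without a virtual display, one option is to start the display only where it is needed. The following is a minimal sketch under that assumption; the helper `needs_virtual_display` is hypothetical and simply checks for Linux with no `DISPLAY` environment variable, which is a heuristic rather than a guaranteed test.

```python
import os
import sys


def needs_virtual_display(platform=None, environ=None):
    """Heuristic: True on Linux when no X display is configured."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    return platform.startswith("linux") and not environ.get("DISPLAY")


if __name__ == "__main__":
    print("virtual display needed:", needs_virtual_display())
```

In the spider, `Display(visible=0, size=(800, 800)).start()` would be called only when this returns True, and the display's `stop()` method would be called after the browser quits so that the Xvfb process does not linger.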
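2. Defining the scrapy initializer: in my local Python 3.7 environment, defining __init__(self) directly works, but the Linux Python 3.6 environment raises an error, even though both environments should be using the same scrapy version, 1.5.1. The form that works on Linux with Python 3.6: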
def __init__(self, *args, **kwargs):
    super(SpdSpider, self).__init__(*args, **kwargs)
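The post does not say what the error was. One plausible explanation, offered here as an assumption, is that the spider was constructed with extra arguments on the server (scrapy forwards spider arguments to the constructor as keyword arguments), which an `__init__(self)` that accepts nothing cannot receive. The sketch below reproduces that effect with `BaseSpider`, a hypothetical stand-in for scrapy's spider base class, so it runs without scrapy installed.

```python
class BaseSpider:
    """Hypothetical stand-in for a spider base class: keeps a name and extra keyword arguments."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)


class BrokenSpider(BaseSpider):
    def __init__(self):
        # Accepts no arguments and never calls the base initializer
        self.ready = True


class FixedSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        # Forward everything to the base class, then do the spider's own setup
        super().__init__(*args, **kwargs)
        self.ready = True


if __name__ == "__main__":
    spider = FixedSpider(name="spd", category="books")
    print(spider.name, spider.category)
    try:
        BrokenSpider(category="books")
    except TypeError as exc:
        print("BrokenSpider failed:", exc)
```

Forwarding `*args` and `**kwargs` keeps the subclass working no matter how the framework chooses to construct it.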
References
- Version correspondence between chromedriver and chrome
- Using pyvirtualdisplay to fix selenium + chromedriver crawler errors
- DevToolsActivePort file doesn't exist
- Unknown error: Chrome failed to start: exited abnormally