Using Python + Selenium

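Sometimes every request carries cookies and headers, yet no amount of packet capture reveals where they come from. Neither scrapy nor requests can execute JavaScript, so they can only crawl static pages. scrapy-splash can crawl dynamic pages, but you have to run a separate service for it. In that situation I decided to use selenium instead; it supports Chrome, Firefox, and other browsers.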

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--hide-scrollbars')
        # Do not show a browser window
        # chrome_options.add_argument('--headless')
        self.browser = webdriver.Chrome(executable_path='/opt/webdriver/chrome/chromedriver',
                                        chrome_options=chrome_options)
        self.browser.set_page_load_timeout(30)

    # Override start_requests
    def start_requests(self):
        cookies = self.convert_cookies(self.get_cookies())
        for form_data in self.form_data_list:
            yield scrapy.FormRequest(self.start_url, method="POST", cookies=cookies, formdata=form_data,
                                     dont_filter=True)

    # Get cookies through webdriver
    def get_cookies(self):
        self.browser.get(self.cookies_url)
        cookies = []
        try:
            # Wait (up to 100 s) until the search button is clickable, i.e. the page's JS has run
            WebDriverWait(self.browser, 100).until(
                expected_conditions.element_to_be_clickable((By.XPATH, "//a[@class='searchbutton']")))
            cookies = self.browser.get_cookies()
        except Exception as e:
            self.logger.error("Failed to get cookies: %s", e)
        finally:
            # Close the browser
            self.browser.quit()
        return cookies

    def convert_cookies(self, cookies):
        newcookies = {}
        for cookie in cookies:
            newcookies[cookie['name']] = cookie['value']
        return newcookies

    # Convert URL-encoded form data to a dict
    def fromData2Dict(self, formData):
        # parse_qsl splits on "&" and "=" before decoding, and decodes "+" as a space
        return dict(urllib.parse.parse_qsl(formData, keep_blank_values=True))
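
The two conversion helpers above can be tried without a browser. The following is a minimal standalone sketch, not the spider's actual code: the function name form_data_to_dict and all sample data are hypothetical, and it assumes the form body is standard URL-encoded text, which is why it relies on parse_qsl from the standard library.

```python
from urllib.parse import parse_qsl


def convert_cookies(cookies):
    """Flatten Selenium-style cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def form_data_to_dict(form_data):
    """Parse a URL-encoded form body into a dict, keeping empty values."""
    return dict(parse_qsl(form_data, keep_blank_values=True))


# Hypothetical sample data, shaped like the output of browser.get_cookies()
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "token", "value": "xyz", "domain": "example.com", "path": "/"},
]

print(convert_cookies(sample_cookies))
print(form_data_to_dict("keyword=hello+world&page=1&filter="))
```

The flattened mapping is the shape that the cookies argument of scrapy.FormRequest accepts, and the parsed form dict is the shape that formdata accepts.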

Enable headless mode so that no window is shown (problem encountered: page elements could no longer be found):

chrome_options.add_argument('--headless')

Disable the sandbox:

chrome_options.add_argument('--no-sandbox')
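
Since these switches get toggled repeatedly while debugging, it can help to collect them in one place. The sketch below is only an illustration: build_chrome_args is a hypothetical helper, and the optional window_size parameter (Chrome's --window-size switch) is an assumption on my part, something commonly tried when headless mode cannot find elements, not a fix confirmed by these notes.

```python
def build_chrome_args(headless=False, disable_sandbox=False, window_size=None):
    """Return the Chrome command-line switches to pass to add_argument()."""
    args = ["--disable-gpu", "--hide-scrollbars"]
    if headless:
        args.append("--headless")
    if disable_sandbox:
        args.append("--no-sandbox")
    if window_size is not None:
        # Assumption: an explicit window size, e.g. (1920, 1080)
        width, height = window_size
        args.append("--window-size={},{}".format(width, height))
    return args


# Usage with Selenium (not executed here):
#     for arg in build_chrome_args(headless=True, disable_sandbox=True):
#         chrome_options.add_argument(arg)
print(build_chrome_args(headless=True, disable_sandbox=True))
```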

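Summary of problems encountered:
1. The code ran fine on macOS, but on Linux it kept failing with "DevToolsActivePort file doesn't exist". Many blog posts, both international and Chinese, say to disable the sandbox, but that did not help.
For example: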

  • https://github.com/heroku/heroku-buildpack-google-chrome/issues/46
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-setuid-sandbox')
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
  (Driver info: chromedriver=2.45.615279 (12b89733300bd268cff3b78fc76cb8f3a7cc44e5),platform=Linux 3.10.0-327.el7.x86_64 x86_64)

Adding headless mode let the code run, but page elements could not be found:

2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[2019-01-08 16:43:00] 140734813173184 POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
2019-01-08 16:43:00 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:56931/session/cd22b1e86a32e3f65f5b2fb0a0795a49/element {"using": "xpath", "value": "//a[@class='searchbutton']", "sessionId": "cd22b1e86a32e3f65f5b2fb0a0795a49"}
[2019-01-08 16:43:00] 140734813173184 http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358
2019-01-08 16:43:00 [urllib3.connectionpool] DEBUG: http://127.0.0.1:56931 "POST /session/cd22b1e86a32e3f65f5b2fb0a0795a49/element HTTP/1.1" 200 358

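While reading other people's blogs, I realized that the Linux server has no graphical interface, and I learned about Xvfb: it performs all graphical operations in memory and needs no display device. So I tried installing it to see whether that would solve the problem: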

yum install Xvfb

The error was the same as before, so I decided to try an older Chrome version. First I checked the Linux version information:

[root@localhost google]# uname -a
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I uninstalled the current google-chrome (version: google-chrome-stable-71.0.3578.98) and installed an older google-chrome (version: google-chrome-stable-62.0.3202.94). The chromedriver version changed from 2.45.615279 to 2.33.506092.
It still failed in the end:

File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
  (Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 3.10.0-327.el7.x86_64 x86_64)

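This error was different from the earlier one, though, so it felt like a step closer to success.
After more searching, I installed pyvirtualdisplay: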

pip install pyvirtualdisplay

Use it in the code:

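from pyvirtualdisplay import Display
from selenium import webdriver

# Start an invisible virtual display (backed by Xvfb) before launching Chrome
display = Display(visible=0, size=(800, 800))
display.start()
driver = webdriver.Chrome()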

The effort paid off: the code ran perfectly.
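
A plausible reading of this outcome, which these notes do not confirm, is that a non-headless Chrome on a server without an X display has nowhere to draw. As a rough sketch under that assumption, the check below decides whether a virtual display is worth starting; needs_virtual_display is a hypothetical helper, and it only looks at the platform name and the DISPLAY environment variable.

```python
import os
import sys


def needs_virtual_display(environ=None, platform=None):
    """Return True on Linux when no X display is advertised via DISPLAY."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    return platform.startswith("linux") and not environ.get("DISPLAY")


if needs_virtual_display():
    print("No X display found: start pyvirtualdisplay (or Xvfb) before Chrome")
else:
    print("A display is available, or this is not Linux")
```

With a check like this, the same spider could skip the virtual display on a desktop machine and start it only on the server.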
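2. Defining an initializer in a scrapy spider: in my local Python 3.7 environment, defining it as __init__(self) worked, but the Linux Python 3.6 environment raised an error, even though both used the same scrapy version (1.5.1). The form that works on Linux with Python 3.6: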

def __init__(self, *args, **kwargs):
    super(SpdSpider, self).__init__(*args, **kwargs)
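
One possible explanation, which these notes do not confirm, is that the spider was started differently in the two environments. Scrapy builds spiders through from_crawler, which forwards any spider arguments to __init__, so an initializer that accepts nothing fails as soon as an argument is supplied. The sketch below imitates that with stand-in classes; BaseSpider, NarrowSpider, and FlexibleSpider are hypothetical and much simpler than the real scrapy.Spider.

```python
class BaseSpider:
    """Hypothetical stand-in for scrapy.Spider, just enough for this demo."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, *args, **kwargs):
        # Forward whatever arguments were supplied to the initializer
        return cls(*args, **kwargs)


class NarrowSpider(BaseSpider):
    def __init__(self):  # accepts no extra arguments
        super().__init__(name="narrow")


class FlexibleSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)


try:
    NarrowSpider.from_crawler(category="books")
    narrow_ok = True
except TypeError:
    # __init__() got an unexpected keyword argument 'category'
    narrow_ok = False

flexible = FlexibleSpider.from_crawler(name="flexible", category="books")
print(narrow_ok, flexible.name, flexible.category)
```

Accepting *args and **kwargs and passing them to the parent initializer keeps the spider working whether or not arguments are supplied.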

