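Selenium is mainly used for automated testing, but it also works well for scraping data: because it drives a real browser, data can be extracted seamlessly, and many of the blocking measures aimed at simple HTTP clients are not a concern. This post records how I installed Selenium and Chrome on CentOS and the problems I ran into along the way.
Environment: CentOS 7.4, Selenium 3.13.0, Google Chrome, Gecko Driver. Google Chrome is used as the example here; the process for Gecko Driver is similar. Users behind the Great Firewall are advised to use Gecko Driver; those who have their own proxy can consider the Google Chrome driver.
1. Create the google-chrome.repo file.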
vi /etc/yum.repos.d/google-chrome.repo
Enter the following content in the file:
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub
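A typo in the repo file is easy to miss until yum complains, so a quick sanity check can help. The following is a minimal sketch that parses the content shown above with Python's standard configparser; the helper name missing_repo_keys and the list of keys it checks are assumptions made for illustration (they are simply the keys used in this file, not an official yum requirement).

```python
import configparser

# The repo definition from this post, embedded so the sketch is self-contained.
REPO_TEXT = """\
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub
"""

# Hypothetical checklist: the keys this particular file is expected to contain.
REQUIRED_KEYS = ("name", "baseurl", "enabled", "gpgcheck", "gpgkey")


def missing_repo_keys(text, section="google-chrome"):
    """Return the expected keys that are absent from the given repo section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(section, key)]


print(missing_repo_keys(REPO_TEXT))  # an empty list means nothing is missing
```

To check the real file instead, the text could be read from /etc/yum.repos.d/google-chrome.repo and passed to the same function.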
2. Run the yum update
yum update
If all goes well, yum lists the packages that need to be updated.
In my case, the following problem came up:
---> Package fontpackages-filesystem.noarch 0:1.44-8.el7 will be installed
---> Package google-chrome-stable.x86_64 0:68.0.3440.84-1 will be installed
--> Processing Dependency: libappindicator3.so.1()(64bit) for package: google-chrome-stable-68.0.3440.84-1.x86_64
---> Package graphite2.x86_64 0:1.3.10-1.el7_3 will be installed
---> Package lcms2.x86_64 0:2.6-3.el7 will be installed
---> Package libXxf86vm.x86_64 0:1.1.4-1.el7 will be installed
---> Package libgusb.x86_64 0:0.2.9-1.el7 will be installed
---> Package libsoup.x86_64 0:2.56.0-4.el7_4 will be installed
--> Processing Dependency: glib-networking(x86-64) >= 2.38.0 for package: libsoup-2.56.0-4.el7_4.x86_64
---> Package libxshmfence.x86_64 0:1.2-1.el7 will be installed
---> Package mesa-libgbm.x86_64 0:17.0.1-6.20170307.el7 will be installed
---> Package mesa-libglapi.x86_64 0:17.0.1-6.20170307.el7 will be installed
---> Package stix-fonts.noarch 0:1.1.0-5.el7 will be installed
--> Running transaction check
---> Package glib-networking.x86_64 0:2.50.0-1.el7 will be installed
--> Processing Dependency: gsettings-desktop-schemas for package: glib-networking-2.50.0-1.el7.x86_64
---> Package google-chrome-stable.x86_64 0:68.0.3440.84-1 will be installed
--> Processing Dependency: libappindicator3.so.1()(64bit) for package: google-chrome-stable-68.0.3440.84-1.x86_64
--> Running transaction check
---> Package google-chrome-stable.x86_64 0:68.0.3440.84-1 will be installed
--> Processing Dependency: libappindicator3.so.1()(64bit) for package: google-chrome-stable-68.0.3440.84-1.x86_64
---> Package gsettings-desktop-schemas.x86_64 0:3.22.0-1.el7 will be installed
--> Finished Dependency Resolution
Error: Package: google-chrome-stable-68.0.3440.84-1.x86_64 (google-chrome)
Requires: libappindicator3.so.1()(64bit)
You could try using --skip-broken to work around the
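The dependency resolution output is long, and the one line that matters is the "Requires:" entry. As a rough sketch (the function name missing_requirements is made up for this example), the unmet capability can be pulled out of the yum output like this:

```python
import re

# Two lines copied from the yum output above.
YUM_OUTPUT = """\
Error: Package: google-chrome-stable-68.0.3440.84-1.x86_64 (google-chrome)
Requires: libappindicator3.so.1()(64bit)
"""


def missing_requirements(output):
    """Collect the capabilities named on 'Requires:' lines, without duplicates."""
    found = []
    for line in output.splitlines():
        match = re.match(r"\s*Requires:\s*(\S+)", line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


print(missing_requirements(YUM_OUTPUT))
```

Here the only unmet capability is libappindicator3.so.1()(64bit), which is what the following steps set out to install.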
I first ran the following command:
yum --enablerepo=extras install epel-release
The output was:
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
Package epel-release-7-9.noarch already installed and latest version
Nothing to do
This shows that epel-release was already installed. So I went on to install:
yum install libappindicator-gtk3
But the same error was still reported, so I simply reinstalled epel-release, which solved the problem:
yum --enablerepo=extras reinstall epel-release
yum install libappindicator-gtk3
This resolved the error from yum update.
3. Install Google Chrome
yum install google-chrome-stable
Unfortunately, another problem appeared. The error output was:
Total size: 52 M
Installed size: 187 M
Is this ok [y/d/N]: y
Downloading packages:
warning: /var/cache/yum/x86_64/7/google-chrome/packages/google-chrome-stable-68.0.3440.84-1.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 7fac5991: NOKEY
Retrieving key from https://dl-ssl.google.com/linux/linux_signing_key.pub
GPG key retrieval failed: [Errno 14] curl#7 - "Failed to connect to 2404:6800:4008:c00::5d: Network is unreachable"
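The address in the last line of this log is worth a closer look. The sketch below (the helper failed_address is a name invented for this example) extracts the address from the curl message and uses the standard ipaddress module to show which IP version it is. It assumes the message contains a literal IP address; a host name would make ip_address raise ValueError.

```python
import ipaddress
import re

# The error line from the log above.
ERROR_LINE = (
    'GPG key retrieval failed: [Errno 14] curl#7 - '
    '"Failed to connect to 2404:6800:4008:c00::5d: Network is unreachable"'
)


def failed_address(message):
    """Extract the address curl could not reach, or None if the text has none."""
    match = re.search(r"Failed to connect to (\S+): ", message)
    return ipaddress.ip_address(match.group(1)) if match else None


address = failed_address(ERROR_LINE)
print(address, "is IPv%d" % address.version)
```

The address turns out to be IPv6, which suggests (though the log alone cannot prove it) that this machine had no usable IPv6 route to the key server.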
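Judging from the error, the connection to the key server was blocked somewhere on the network (curl could not reach its IPv6 address). So I simply downloaded the package and installed it locally: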
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
yum -y localinstall google-chrome-stable_current_x86_64.rpm
The installation then completed.
4. Check chromedriver
Run chromedriver from the command line:
chromedriver
The output was:
Starting ChromeDriver 2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706) on port 9515
Only local connections are allowed.
This shows that chromedriver started correctly and that the installation succeeded.
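When several machines are set up this way, it can be handy to read the driver version and port out of that startup line instead of checking it by eye. This is only a sketch: parse_banner is a hypothetical helper, and the pattern is written against the banner format shown above, which other ChromeDriver releases may not follow.

```python
import re

# The first line printed by chromedriver in the output above.
BANNER = (
    "Starting ChromeDriver 2.41.578700 "
    "(2f1ed5f9343c13f73144538f15c00b370eda6706) on port 9515"
)


def parse_banner(line):
    """Return (version, port) from the ChromeDriver startup line, or None."""
    match = re.match(r"Starting ChromeDriver (\S+) \([^)]*\) on port (\d+)", line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_banner(BANNER))  # ('2.41.578700', 9515)
```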
5. Install Selenium
Selenium is a standard Python package, so it can be installed directly with pip.
pip install selenium
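On a machine with more than one Python (the tracebacks in this post come from an Anaconda install), pip may belong to a different interpreter than the one that runs the spider. A small check, run with the same interpreter as the spider, can confirm the package is visible. The helper name is_installed is an assumption for this sketch.

```python
import importlib.util


def is_installed(package):
    """Report whether a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package) is not None


print("selenium installed:", is_installed("selenium"))
```

If this prints False right after a successful pip install, the pip command and the python command most likely point at different environments.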
6. Start the local spider program
When the program started, the following error appeared:
File "xxx-sy.py", line 384, in <module>
browser = webdriver.Chrome(chrome_options=chrome_options)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
desired_capabilities=desired_capabilities)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
self.start_session(capabilities, browser_profile)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 251, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
self.error_handler.check_response(response)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 3.10.0-693.5.2.el7.x86_64 x86_64)
So I switched to the command line to test whether the google-chrome command itself works:
google-chrome
The command printed:
[29574:29574:0803/145908.944672:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.
The error message shows that, because Chrome is being run as root, the --no-sandbox argument must be added when starting it.
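Since the sandbox is only a problem for root, one option is to add the flag only when the script really runs as root. The sketch below is an assumption about how that could be done, not something from the original setup; sandbox_arguments is a made-up helper name, and os.geteuid is available on Unix systems such as CentOS only.

```python
import os


def sandbox_arguments(euid=None):
    """Return ['--no-sandbox'] when the effective user is root, else []."""
    if euid is None:
        euid = os.geteuid()  # Unix-only call; fine on CentOS
    return ["--no-sandbox"] if euid == 0 else []


# Prints ['--no-sandbox'] when run as root and [] otherwise.
print(sandbox_arguments())
```

The returned list can be fed to add_argument one item at a time when building the Chrome options.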
During startup, another error then appeared:
Fri, 03 Aug 2018 15:04:51 xx-xx.py[line:183] INFO category, style, create dir:/export/xx/spider/xx/
Traceback (most recent call last):
File "taobao-sexy.py", line 428, in <module>
total_images = spide_one_action(page_url, folder_path)
File "taobao-sexy.py", line 264, in spide_one_action
brw= create_brw()
File "taobao-sexy.py", line 309, in create_brw
brw = webdriver.Chrome(chrome_options=chrome_options)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
desired_capabilities=desired_capabilities)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
self.start_session(capabilities, browser_profile)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 251, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
self.error_handler.check_response(response)
File "/export/home/anaconda3/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
(Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 3.10.0-693.5.2.el7.x86_64 x86_64)
After some investigation, it turned out that the following Chrome argument has to be set at startup:
--disable-dev-shm-usage
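This flag tells Chrome to keep its shared memory files in a temporary directory instead of /dev/shm, which helps on systems where /dev/shm is very small. The post does not record how large /dev/shm was on this machine, so the following is just a sketch for checking it; the 64 MiB threshold is an illustrative value (Docker's default /dev/shm size), not a limit defined by Chrome, and shm_is_small is a hypothetical helper.

```python
import os
import shutil

# Illustrative threshold only: 64 MiB is Docker's default /dev/shm size.
SMALL_SHM_BYTES = 64 * 1024 * 1024


def shm_is_small(total_bytes, threshold=SMALL_SHM_BYTES):
    """Return True when the shared memory area is at or below the threshold."""
    return total_bytes <= threshold


if os.path.isdir("/dev/shm"):
    total = shutil.disk_usage("/dev/shm").total
    print("/dev/shm: %d MiB, small: %s" % (total // 2**20, shm_is_small(total)))
```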
7. The complete Chrome startup options are as follows:
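from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # no display is available on the server
chrome_options.add_argument('--no-sandbox')  # required when Chrome runs as root
chrome_options.add_argument('--disable-gpu')  # turn off GPU hardware acceleration
chrome_options.add_argument('--disable-dev-shm-usage')  # use a temp dir instead of /dev/shm
browser = webdriver.Chrome(chrome_options=chrome_options)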
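A spider that crashes halfway can leave headless Chrome processes behind on the server. One possible way to guard against that is to wrap the options above in a context manager that always calls quit(). This is a sketch rather than the code used in the original spider: chrome_browser and CHROME_ARGUMENTS are names invented here, and the URL in the usage comment is a placeholder.

```python
from contextlib import contextmanager

# The four arguments collected in this post.
CHROME_ARGUMENTS = (
    "--headless",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
)


@contextmanager
def chrome_browser(arguments=CHROME_ARGUMENTS):
    """Yield a Chrome WebDriver and always quit it afterwards."""
    # Imported here so this module can be loaded even without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    # Selenium 3.13.0, as used in this post, accepts chrome_options=;
    # later releases prefer the options= keyword instead.
    browser = webdriver.Chrome(chrome_options=options)
    try:
        yield browser
    finally:
        browser.quit()  # also stops the chromedriver process


# Usage (placeholder URL):
# with chrome_browser() as browser:
#     browser.get("https://example.com")
#     print(browser.title)
```

Keeping the arguments in one tuple also makes it easy to drop --no-sandbox when the spider is later run as a non-root user.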
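With that, the Selenium-driven Chrome is ready to use, and the crawling can happily begin.
There were still quite a few problems along the way that took repeated investigation to solve; they are recorded here for future reference.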