2020-05-28 Learning Python Web Scraping Series (5): Environment Setup for Scraping Dynamic Pages with the selenium Module

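First, install chromedriver.

Reference: https://blog.csdn.net/tymatlab/article/details/78649727

Method 1: Download the chromedriver binary directly and add it to the PATH

1. Download chromedriver. My Chrome browser is version 83, so I downloaded the matching driver.
Download URL: https://npm.taobao.org/mirrors/chromedriver/83.0.4103.39/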

$ mv /Users/chelsea/Downloads/chromedriver /usr/local/bin/
$ export PATH=$PATH:/usr/local/bin

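Note that each chromedriver release supports only specific Chrome browser versions. Downloading directly from the official site can also be difficult, so I recommend the mirror at https://npm.taobao.org/mirrors/chromedriver/. After unzipping, add the driver's directory to the system environment variables. The exact steps differ between operating systems; searching for your system name plus the keyword should turn up a suitable tutorial.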
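To confirm that the driver is really reachable through PATH and that its major version matches the browser (83 in this post), something like the following could be used. This is only a minimal sketch: `major_version` and `chromedriver_major` are hypothetical helper names, and it assumes the driver prints a line such as `ChromeDriver 83.0.4103.39 (...)` when called with `--version`.

```python
import re
import shutil
import subprocess


def major_version(version_text):
    """Extract the major version number from text like 'ChromeDriver 83.0.4103.39 (...)'."""
    match = re.search(r'(\d+)\.\d+\.\d+', version_text)
    return int(match.group(1)) if match else None


def chromedriver_major():
    """Return the major version of the chromedriver found on PATH, or None if it is missing."""
    path = shutil.which('chromedriver')  # None means PATH is not set up correctly
    if path is None:
        return None
    result = subprocess.run([path, '--version'], capture_output=True, text=True)
    return major_version(result.stdout)
```

If `chromedriver_major()` returns `None`, the driver is not on PATH; if it returns a number different from the Chrome major version, a different driver release is needed.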

1. Load the required modules

import urllib
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

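2. Construct the URL from the keywords

Enter a few keywords by hand and work out how the URL is built.
The base PubMed link is 'https://pubmed.ncbi.nlm.nih.gov/'. After entering and searching for 'pulmonary arterial hypertension','congeital heart disease', the URL becomes 'https://pubmed.ncbi.nlm.nih.gov/?term=%27pulmonary+arterial+hypertension%27%2C%27congeital+heart+disease%27'. So the URL has two parts: a fixed part, 'https://pubmed.ncbi.nlm.nih.gov/?term=', and the search keywords joined by the string %2C (the URL-encoded comma). In the browser's version the quotes also appear as %27 and the spaces as +. For the keywords passed in, we can therefore do the following to build the URL that the search returns.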

keyword = '%2C'.join(['pulmonary arterial hypertension','congeital heart disease'])
start_url = 'https://pubmed.ncbi.nlm.nih.gov/?term='
url = start_url + keyword
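The join above leaves the spaces unencoded and relies on the browser to encode them. If the goal is to reproduce exactly the URL observed in the address bar, the standard library's `urllib.parse.quote_plus` can do the encoding. A minimal sketch, where `build_search_url` is a hypothetical helper name and wrapping each keyword in single quotes simply mirrors what was typed into the search box:

```python
from urllib.parse import quote_plus


def build_search_url(keywords, base='https://pubmed.ncbi.nlm.nih.gov/?term='):
    """Build a PubMed search URL from a list of keyword phrases."""
    # Wrap each phrase in single quotes and separate the phrases with commas,
    # the same way the query was typed by hand.
    query = ','.join("'{}'".format(k) for k in keywords)
    # quote_plus turns ' into %27, the comma into %2C and spaces into +
    return base + quote_plus(query)


url = build_search_url(['pulmonary arterial hypertension', 'congeital heart disease'])
```

The resulting `url` can be passed to `browser.get(url)` in the same way as the hand-joined version.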

3. Create the browser object

browser = webdriver.Chrome()
browser.get(url)

Once the three steps above run successfully, a Google Chrome window pops up automatically.



Method 2: Install chromedriver with brew

1. Install chromedriver with brew

$ brew install chromedriver

2. After the installation finishes, run the following again:

from selenium import webdriver
driver = webdriver.Chrome()

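However, this method did not work for me. I tried to resolve the homebrew/cask problem shown in the message below and hit several pitfalls, but still could not get it to succeed.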

(base) Cheng-MacBook-Pro:MacOS chelsea$ brew install chromedriver
Updating Homebrew...
Error: No available formula with the name "chromedriver"
It was migrated from homebrew/core to homebrew/cask.
You can access it again by running:
  brew tap homebrew/cask
And then you can install it by running:
  brew cask install chromedriver

My notes follow. Although I did not solve the problem, I still learned a few things along the way.

(base) Cheng-MacBook-Pro:~ chelsea$ brew tap homebrew/cask
Updating Homebrew...
==> Tapping homebrew/cask
Cloning into '/usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask'...
fatal: unable to access 'https://github.com/Homebrew/homebrew-cask/': LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
Error: Failure while executing; `git clone https://github.com/Homebrew/homebrew-cask /usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask` exited with 128.
(base) Cheng-MacBook-Pro:~ chelsea$ ping github.com
PING github.com (13.229.188.59): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
Request timeout for icmp_seq 4
Request timeout for icmp_seq 5
Request timeout for icmp_seq 6
Request timeout for icmp_seq 7
Request timeout for icmp_seq 8
Request timeout for icmp_seq 9
Request timeout for icmp_seq 10
^C
--- github.com ping statistics ---
12 packets transmitted, 0 packets received, 100.0% packet loss
(base) Cheng-MacBook-Pro:~ chelsea$ sudo vim /private/etc/hosts
(base) Cheng-MacBook-Pro:~ chelsea$ ping github.com
PING github.com (192.30.253.112): 56 data bytes
76 bytes from 120.80.173.241: Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 51a9   0 0000  01  01 85b9 192.168.100.15  192.30.253.112

Request timeout for icmp_seq 0
76 bytes from 120.80.173.241: Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 1681   0 0000  01  01 c0e1 192.168.100.15  192.30.253.112

Request timeout for icmp_seq 1
76 bytes from 120.80.173.241: Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 1ea6   0 0000  01  01 b8bc 192.168.100.15  192.30.253.112

Request timeout for icmp_seq 2
76 bytes from 120.80.173.241: Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 81c5   0 0000  01  01 559d 192.168.100.15  192.30.253.112

Request timeout for icmp_seq 3
76 bytes from 120.80.173.241: Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 5400 43a2   0 0000  01  01 93c0 192.168.100.15  192.30.253.112

^C
--- github.com ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss

Reference material
https://www.jianshu.com/p/dd996cdcc3f7
Alternative IPs

151.101.185.194 github.global.ssl.fastly.net
192.30.253.112 github.com
151.101.184.133 assets-cdn.github.com
151.101.184.133 avatars0.githubusercontent.com
151.101.112.133 avatars1.githubusercontent.com
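`ping` only tests ICMP, while `git clone` over HTTPS needs a TCP connection to port 443, so a hosts entry can be checked more directly by attempting that connection. A minimal sketch using only the standard library; `can_connect` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice:

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host name (honouring /etc/hosts) and connects
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False
```

For example, `can_connect('github.com')` could be run before and after editing /private/etc/hosts to see whether the new entry helps.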

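My attempt at fixing homebrew-cask: manually download the homebrew-cask repository (written in Ruby) from GitHub and put it in the location where the clone would have gone.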

(base) Cheng-MacBook-Pro:~ chelsea$ cd /usr/local/Homebrew/Library/Taps/homebrew/
(base) Cheng-MacBook-Pro:homebrew chelsea$ ls
homebrew-core
(base) Cheng-MacBook-Pro:homebrew chelsea$ mv /Users/chelsea/Downloads/homebrew-cask /usr/local/Homebrew/Library/Taps/homebrew/
(base) Cheng-MacBook-Pro:homebrew chelsea$ ls
homebrew-cask homebrew-core

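I started with Method 2 and, after hitting a pile of pitfalls, still could not get chromedriver configured. Along the way I discovered that the selenium module also supports other browsers, such as Safari, which ships with the Mac, and I found a tutorial for setting it up. It did not work well for me either, and I am not sure whether my setup was at fault: Safari could only be opened once, and unless that window was closed it could not be started again (in short, no support for multiple instances), which is not smart at all. In the end I completed the setup with Method 1, and the Chrome browser does not have this problem.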

https://blog.csdn.net/xqhadoop/article/details/77892796

Enable safaridriver

(base) Cheng-MacBook-Pro:~ chelsea$ safaridriver --enable
Password:
(base) Cheng-MacBook-Pro:~ chelsea$
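The "can only be opened once" behaviour described above may come from an earlier session that was never closed. As far as I know, safaridriver accepts only one WebDriver session at a time, though I have not verified that this was the cause here. A minimal sketch that always closes the session, even when an error occurs; `browser_session` is a hypothetical helper name and `factory` is any callable that returns a driver, such as `webdriver.Safari` or `webdriver.Chrome`:

```python
from contextlib import contextmanager


@contextmanager
def browser_session(factory):
    """Create a browser with factory() and guarantee quit() is called afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        # quit() ends the WebDriver session so that a new one can be started later
        browser.quit()
```

Usage would look like `with browser_session(webdriver.Safari) as browser: browser.get(url)`, after `from selenium import webdriver`.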

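So why did I not use Method 1 from the start? I actually did, but I got it wrong, mainly in the environment setup.
At first I could not download the driver from the official site and did not know a mirror existed, so I downloaded it through a remote server. While doing that I followed other tutorials (mostly written for Windows) and set the environment variable incorrectly, so the driver could not be invoked. In any case, after stepping into all of these pits I finally understood it.
Tinkering is what being young is for, but it is still best to search widely and learn from the experience and lessons of those who came before!!!
I also found a fairly good Chinese-language document: http://selenium-python-zh.readthedocs.io/en/latest/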
