In the earlier post on crawling Google Images with a Python crawler, we covered scraping images from Google's dynamically rendered image-search results page. That approach required manually copying the relevant element markup via the browser's inspect tool. Clearly, we would rather have the script do that work too, with us only telling it which elements we need.
For this we turn to a tool suite: Selenium.

Selenium is a tool suite, not merely a Python package; implementations also exist for Java, Ruby, and other languages. It lets you remotely control a browser and simulate a user's interaction with it.
Here is how to install it for Python:

```shell
pip install selenium
```
If installing from source, download the release archive and run:

```shell
python setup.py install
```
Next, install the driver for your browser. For most browsers, Selenium needs a separate driver executable in order to talk to the browser. On Windows, for example, place the driver in a directory such as `C:\WebDriver\bin` and add that directory to the `Path` environment variable.
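Whether the driver is actually discoverable from `Path` can be checked from Python before starting Selenium. A minimal sketch using only the standard library (the driver name `chromedriver` here is just the usual name for Chrome's driver):

```python
import shutil

def driver_on_path(executable_name: str) -> bool:
    """Return True if the given driver executable can be found on PATH."""
    return shutil.which(executable_name) is not None

# Example: check for the Chrome driver before instantiating webdriver.Chrome()
if not driver_on_path("chromedriver"):
    print("chromedriver not found on PATH; Selenium will fail to start Chrome")
```

This saves a confusing `WebDriverException` later: if the check fails, the fix is to install the driver or extend `Path`, not to debug the script.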
WebDriver is an API and protocol, with bindings for different languages, that handles the communication between Selenium and the browser and thereby controls the browser's behavior. Interfaces exist for almost every major browser.
Browser support as of 2020-07-19:
Browser | Maintainer | Versions Supported |
---|---|---|
Chrome | Chromium | All versions |
Firefox | Mozilla | 54 and newer |
Internet Explorer | Selenium | 6 and newer |
Opera | Opera Chromium / Presto | 10.5 and newer |
Safari | Apple | 10 and newer |
`By`, as the name suggests, specifies the criterion by which we locate elements:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
```
The element-finding functions:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")
# get the first div element
div = driver.find_element(By.TAG_NAME, "div")
# get all paragraph elements
ps = driver.find_elements(By.TAG_NAME, "p")
for p in ps:
    print(p.text)
```
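What `find_elements(By.TAG_NAME, ...)` does conceptually — collect every element with a given tag from the DOM — can be illustrated with the standard library's `html.parser`. This is only an analogy for readers without a browser at hand; Selenium itself queries the live browser DOM through the driver:

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text content of every element with the target tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.texts = []
        self._in_tag = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._in_tag = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._in_tag = False

    def handle_data(self, data):
        if self._in_tag:
            self.texts.append(data)

parser = TagTextCollector("p")
parser.feed("<div><p>first</p><p>second</p></div>")
print(parser.texts)  # → ['first', 'second']
```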
Switching to the currently focused (active) element and reading its attributes:
```python
# obtain an active element by simulating user input
driver = webdriver.Chrome()
driver.get("https://www.google.com")
driver.find_element(By.CSS_SELECTOR, '[name="q"]').send_keys("webElement")
# switch to the active element and read one of its attributes
attr = driver.switch_to.active_element.get_attribute("title")
```
Simulating keyboard input — entering text, including special key presses:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
# navigate to the page
driver.get("https://www.google.com")
# type "webdriver" followed by the Enter key
driver.find_element(By.NAME, "q").send_keys("webdriver" + Keys.ENTER)
```
Modifier keys such as Shift, Ctrl, and Alt can be simulated too, e.g. Ctrl+A to select the whole page:
```python
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
# navigate to the page
driver.get("https://www.google.com")
# select everything on the page
ActionChains(driver).key_down(Keys.CONTROL).send_keys("a").key_up(Keys.CONTROL).perform()
```
`key_down` and `key_up` are used together when the input alternates between upper and lower case: toggle Shift with `key_down`/`key_up`, chaining the whole input sequence through `ActionChains`.
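The `key_down`/`key_up` sequencing for mixed-case input can be modeled as a plain function that emits the sequence of actions an `ActionChains` chain would perform. This is a simplified illustration of the idea, not Selenium's API:

```python
def shift_key_events(text):
    """Model the key events needed to type text, holding Shift across uppercase runs."""
    events = []
    shift_held = False
    for ch in text:
        need_shift = ch.isupper()
        if need_shift and not shift_held:
            events.append("key_down(SHIFT)")
            shift_held = True
        elif not need_shift and shift_held:
            events.append("key_up(SHIFT)")
            shift_held = False
        events.append(f"send_keys({ch.lower()})")
    if shift_held:
        events.append("key_up(SHIFT)")
    return events

print(shift_key_events("aBc"))
# → ['send_keys(a)', 'key_down(SHIFT)', 'send_keys(b)', 'key_up(SHIFT)', 'send_keys(c)']
```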
Clearing the content of an input field:
```python
text = driver.find_element(By.TAG_NAME, "input")
text.send_keys("pokemon")
# clear the field
text.clear()
```
Mouse actions are likewise simulated through `ActionChains`:
```python
# reveal a hidden element, e.g. an item in a drop-down menu
menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")
actions = ActionChains(driver)
# move the mouse over the menu element
actions.move_to_element(menu)
# click the submenu item
actions.click(hidden_submenu)
actions.perform()
```
Drag and drop is done with `ActionChains`' `drag_and_drop`, dragging one element onto another:
```python
source = driver.find_element(By.ID, "source")
target = driver.find_element(By.ID, "target")
ActionChains(driver).drag_and_drop(source, target).perform()
```
Configuring an HTTP proxy:
```python
from selenium import webdriver

PROXY = ""

webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY,
    "proxyType": "MANUAL",
}

with webdriver.Firefox() as driver:
    # Open URL
    driver.get("https://selenium.dev")
```
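For comparison, the same manual-proxy idea expressed with plain `urllib` (no browser involved); the proxy address here is a placeholder assumption:

```python
from urllib import request

PROXY = "127.0.0.1:8080"  # placeholder proxy address, an assumption for illustration

# route http/https traffic through the proxy,
# analogous to the "proxyType": "MANUAL" capability above
proxy_handler = request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = request.build_opener(proxy_handler)
# opener.open(...) would now tunnel requests through PROXY
```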
Setting the browser's page load strategy:

- `normal`: wait until the page has fully loaded
- `eager`: return once the initial document is loaded, i.e. when the DOMContentLoaded event fires, without waiting for content that document then loads (so only static content is available, not dynamically loaded content)
- `none`: only load the initial document

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'normal'  # or 'eager', 'none'
driver = webdriver.Chrome(options=options)
# Navigate to url
driver.get("http://www.google.com")
driver.quit()
```
Now for the main task: crawling Google Images entirely from a Python script. First we navigate to the url and simulate the user's search input, implemented with `ActionChains`:
```python
# search on google
# navigate to url
self.driver.get(self.url)
# locate input field
search_input = self.driver.find_element(By.NAME, 'q')
# emulate user input and enter to search
webdriver.ActionChains(self.driver).move_to_element(search_input).send_keys("pokemon" + Keys.ENTER).perform()
```
Then click through to the image search results:
```python
# navigate to google image
# find navigation buttons ('图片' is the "Images" link text on the Chinese UI)
self.driver.find_element(By.LINK_TEXT, '图片').click()
```
The extraction step could also be done with BeautifulSoup, as in the earlier post, by reading the relevant html from the inspected page. In most cases BeautifulSoup is the more efficient choice, since WebDriver has to traverse all the DOM elements. WebDriver's advantage, however, is that it can scroll the page automatically, so all the search results can be reached. Here is the Selenium implementation:
```python
# load more images as many as possible
# scroll down by having the driver execute JavaScript
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
# find the "show more results" button ('显示更多搜索结果' on the Chinese UI)
show_more_button = self.driver.find_element(By.CSS_SELECTOR, "input[value='显示更多搜索结果']")
try:
    while True:
        # act according to the status message shown by the page
        message = self.driver.find_element(By.CSS_SELECTOR, 'div.OuJzKb.Bqq24e').get_attribute('textContent')
        # print(message)
        if message == '正在加载更多内容,请稍候':  # "loading more, please wait"
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        elif message == '新内容已成功加载。向下滚动即可查看更多内容。':  # "new content loaded"
            # scrolling to bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
            # when the "show more" button appears, click it
            if show_more_button.is_displayed():
                show_more_button.click()
        elif message == '看来您已经看完了所有内容':  # "you have reached the end"
            # no more images to load, stop
            break
        elif message == '无法加载更多内容,点击即可重试。':  # "failed to load, click to retry" (untested)
            show_more_button.click()
        else:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
except Exception as err:
    print(err)

# find all image elements in google image result page
imgs = self.driver.find_elements(By.CSS_SELECTOR, "img.rg_i.Q4LuWd")
```
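The decision logic inside that loop can be factored into a pure function, which makes it easy to reason about and test offline. The message strings are Google's Chinese UI texts quoted from the code above; this refactoring is a sketch, not part of the original script:

```python
def next_action(message, button_visible):
    """Map a status message from Google's Chinese UI to the next crawl action."""
    if message == '看来您已经看完了所有内容':          # "you have reached the end"
        return 'stop'
    if message == '无法加载更多内容,点击即可重试。':   # "failed to load, click to retry"
        return 'retry_click'
    if message == '新内容已成功加载。向下滚动即可查看更多内容。':  # "new content loaded"
        return 'scroll_and_click' if button_visible else 'scroll'
    # default (including "loading more, please wait"): keep scrolling
    return 'scroll'

print(next_action('看来您已经看完了所有内容', False))  # → stop
```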
The download loop is the same as in the BeautifulSoup version:
```python
img_count = 0
for img in imgs:
    try:
        # one image per second
        time.sleep(1)
        print('\ndownloading image ' + str(img_count) + ': ')
        img_url = img.get_attribute("src")
        path = os.path.join(imgs_dir, str(img_count) + "_img.jpg")
        request.urlretrieve(url=img_url, filename=path, reporthook=progress_callback, data=None)
        img_count = img_count + 1
    except error.HTTPError as http_err:
        print(http_err)
    except Exception as err:
        print(err)
```
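`urlretrieve` and its `reporthook` can be exercised without hitting Google at all, by retrieving a local `file://` URL. A self-contained sketch:

```python
import os
import tempfile
from pathlib import Path
from urllib import request

calls = []

def hook(count_of_blocks, block_size, total_size):
    """Record each progress callback, as progress_callback does with a bar."""
    calls.append((count_of_blocks, block_size, total_size))

# create a small local file and retrieve it through a file:// URL
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.bin"
    src.write_bytes(b"x" * 1024)
    dst = os.path.join(tmp, "copy.bin")
    request.urlretrieve(src.as_uri(), filename=dst, reporthook=hook)
    print(len(Path(dst).read_bytes()))  # → 1024
```

The hook receives the block count, block size, and total size on each call, which is exactly the signature `progress_callback` implements below.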
Finally, the complete code:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from urllib import error
from urllib import request
import os
import time
import sys

# default url
# replace for yours
url = "https://www.google.com"
explorer = "Chrome"
# directory
imgs_dir = "./images"


# report hook with three parameters passed
# count_of_blocks  The number of blocks transferred
# block_size       The size of block
# total_size       Total size of the file
def progress_callback(count_of_blocks, block_size, total_size):
    # determine current progress
    progress = int(50 * (count_of_blocks * block_size) / total_size)
    if progress > 50:
        progress = 50
    # update progress bar
    sys.stdout.write("\r[%s%s] %d%%" % ('█' * progress, ' ' * (50 - progress), progress * 2))
    sys.stdout.flush()


class CrawlSelenium:

    def __init__(self, explorer="Chrome", url="https://www.google.com"):
        self.url = url
        self.explorer = explorer

    def set_loading_strategy(self, strategy="normal"):
        self.options = Options()
        self.options.page_load_strategy = strategy

    def crawl(self):
        # instantiate driver according to corresponding explorer
        if self.explorer == "Chrome":
            self.driver = webdriver.Chrome(options=self.options)
        if self.explorer == "Opera":
            self.driver = webdriver.Opera(options=self.options)
        if self.explorer == "Firefox":
            self.driver = webdriver.Firefox(options=self.options)
        if self.explorer == "Edge":
            self.driver = webdriver.Edge(options=self.options)

        # search on google
        # navigate to url
        self.driver.get(self.url)
        # locate input field
        search_input = self.driver.find_element(By.NAME, 'q')
        # emulate user input and enter to search
        webdriver.ActionChains(self.driver).move_to_element(search_input).send_keys("pokemon" + Keys.ENTER).perform()

        # navigate to google image
        # find navigation buttons ('图片' is the "Images" link on the Chinese UI)
        self.driver.find_element(By.LINK_TEXT, '图片').click()

        # load more images as many as possible
        # scrolling to bottom
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        # get the "show more results" button
        show_more_button = self.driver.find_element(By.CSS_SELECTOR, "input[value='显示更多搜索结果']")
        try:
            while True:
                # act according to the status message shown by the page
                message = self.driver.find_element(By.CSS_SELECTOR, 'div.OuJzKb.Bqq24e').get_attribute('textContent')
                # print(message)
                if message == '正在加载更多内容,请稍候':  # "loading more, please wait"
                    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
                elif message == '新内容已成功加载。向下滚动即可查看更多内容。':  # "new content loaded"
                    # scrolling to bottom
                    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
                    if show_more_button.is_displayed():
                        show_more_button.click()
                elif message == '看来您已经看完了所有内容':  # "you have reached the end"
                    break
                elif message == '无法加载更多内容,点击即可重试。':  # "failed to load, click to retry"
                    show_more_button.click()
                else:
                    self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        except Exception as err:
            print(err)

        # find all image elements in google image result page
        imgs = self.driver.find_elements(By.CSS_SELECTOR, "img.rg_i.Q4LuWd")

        img_count = 0
        for img in imgs:
            try:
                # one image per second
                time.sleep(1)
                print('\ndownloading image ' + str(img_count) + ': ')
                img_url = img.get_attribute("src")
                path = os.path.join(imgs_dir, str(img_count) + "_img.jpg")
                request.urlretrieve(url=img_url, filename=path, reporthook=progress_callback, data=None)
                img_count = img_count + 1
            except error.HTTPError as http_err:
                print(http_err)
            except Exception as err:
                print(err)


def main():
    # setting
    crawl_s = CrawlSelenium(explorer, url)
    crawl_s.set_loading_strategy("normal")
    # make directory
    if not os.path.exists(imgs_dir):
        os.mkdir(imgs_dir)
    # crawling
    crawl_s.crawl()


if __name__ == "__main__":
    main()
```
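The bar arithmetic in `progress_callback` is easy to verify in isolation; the following is a pure-function restatement of the same formula (50 ticks, each worth 2%):

```python
def progress_fraction(count_of_blocks, block_size, total_size):
    """Same formula as progress_callback: number of filled ticks, capped at 50."""
    progress = int(50 * (count_of_blocks * block_size) / total_size)
    return min(progress, 50)

# halfway through a 1000-byte download: 25 ticks = 50%
print(progress_fraction(5, 100, 1000))  # → 25
```

The cap matters because the last block is usually only partially filled, so `count_of_blocks * block_size` can overshoot `total_size`.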