Driving a real browser with Selenium to imitate human operation is the most versatile way for crawler developers to fetch web data, but automated crawlers are frequently detected as being Selenium-driven. For example, a while ago Selenium got only a blank page when opening Weipu (VIP) advanced search, and Dongchedi's anti-Selenium measures are also quite strong.
The root cause is that a browser opened by Selenium exposes a different fingerprint from one opened manually. The best-known signal is window.navigator.webdriver: in a Selenium-driven browser it evaluates to true, while in a normal browser it is undefined (false in newer Chrome versions). You can compare these and other fingerprint fields at https://bot.sannysoft.com/.
from selenium import webdriver

options = webdriver.ChromeOptions()
# This step matters: drop the "enable-automation" switch so sites cannot tell
# that Chrome is being controlled by automated test software.
options.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(options=options)

# Inject a script that runs before any page script and hides navigator.webdriver.
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
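To confirm the override took effect, evaluate the property from the driver itself. A minimal sketch, assuming the `driver` created above; JavaScript undefined comes back to Python as None, so the printed value should change from True (stock Selenium) to None:

driver.get('https://example.com')  # any page works; the script is injected before it loads
# Prints None once navigator.webdriver is hidden; an unpatched session prints True.
print(driver.execute_script('return window.navigator.webdriver'))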
However, since a browser exposes many fingerprint features beyond this one, the limitations of this approach are obvious.
import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
driver = Chrome('./chromedriver', options=chrome_options)

# Read the stealth script and inject it so it runs before any page script loads.
with open('/Users/kingname/test_pyppeteer/stealth.min.js') as f:
    js = f.read()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": js
})
The stealth.min.js file comes from the puppeteer ecosystem: developers wrote a plugin framework for puppeteer called puppeteer-extra, and one of its plugins, puppeteer-extra-plugin-stealth, is dedicated to hiding the fingerprint features of the automated browser.
Python developers therefore need to extract those fingerprint-hiding scripts into a single js file, and have Selenium or Pyppeteer execute the contents of that file before opening any page.
The author of puppeteer-extra-plugin-stealth also wrote another tool, extract-stealth-evasions, which is exactly what generates the stealth.min.js file.
undetected_chromedriver can keep the browser's automation features from being detected, and it automatically downloads a driver matching the installed browser version, so you no longer have to download the corresponding chromedriver yourself.
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get('https://example.com')  # replace with the target URL
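A quick way to see the effect is to point the undetected driver at the fingerprint test page mentioned at the beginning and keep a screenshot of its results table. A minimal sketch; the sleep time and file name are arbitrary:

import time
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://bot.sannysoft.com/')
time.sleep(5)  # give the page's fingerprint checks a moment to finish
driver.save_screenshot('sannysoft_check.png')
driver.quit()

The next snippet shows the older undetected_chromedriver API for pinning or patching a specific chromedriver: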
import undetected_chromedriver as uc

# Specify the chromedriver version to download and patch
# (this did not work correctly until 1.2.1).
uc.TARGET_VERSION = 78

# Or specify your own chromedriver binary to patch.
uc.install(
    executable_path='c:/users/user1/chromedriver.exe',
)
from selenium.webdriver import Chrome, ChromeOptions

opts = ChromeOptions()
# Route all browser traffic through a local SOCKS5 proxy.
opts.add_argument('--proxy-server=socks5://127.0.0.1:9050')
driver = Chrome(options=opts)
driver.get('https://example.com')  # replace with the target URL
Sometimes a Chrome version mismatch error is still reported. Since undetected_chromedriver downloads a driver matching the browser version automatically, a mismatch should not normally occur; it turned out that undetected_chromedriver only supports Chrome 96 and above, so the fix was to upgrade the Chrome browser.
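If the versions still disagree, undetected_chromedriver v3 also accepts a `version_main` argument that pins the major chromedriver version it downloads and patches. A minimal sketch, where 108 is only an assumed example; substitute the major version of your locally installed Chrome:

import undetected_chromedriver as uc

# version_main pins the chromedriver major version to download and patch;
# 108 is an assumed example, use the major version of your own Chrome.
driver = uc.Chrome(version_main=108)
driver.get('https://example.com')  # replace with the target URL
driver.quit()

The longer example below deals with a different problem: Chrome's --proxy-server switch cannot carry a username and password, so a proxy that requires authentication is configured through a small Chrome extension generated on the fly, which fixes the proxy settings and answers the authentication challenge.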
import os
import shutil
import tempfile

import undetected_chromedriver as webdriver


class ProxyExtension:
    """A throwaway Chrome extension (Manifest V2) that fixes the proxy server
    and answers the proxy's authentication challenge."""

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {"scripts": ["background.js"]},
        "minimum_chrome_version": "76.0.0"
    }
    """

    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: %d
            },
            bypassList: ["localhost"]
        }
    };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        { urls: ["<all_urls>"] },
        ['blocking']
    );
    """

    def __init__(self, host, port, user, password):
        # Write manifest.json and background.js into a temporary directory
        # that Chrome can load as an unpacked extension.
        self._dir = os.path.normpath(tempfile.mkdtemp())

        manifest_file = os.path.join(self._dir, "manifest.json")
        with open(manifest_file, mode="w") as f:
            f.write(self.manifest_json)

        background_js = self.background_js % (host, port, user, password)
        background_file = os.path.join(self._dir, "background.js")
        with open(background_file, mode="w") as f:
            f.write(background_js)

    @property
    def directory(self):
        return self._dir

    def __del__(self):
        # Clean up the temporary extension directory when the object is dropped.
        shutil.rmtree(self._dir)


if __name__ == "__main__":
    proxy = ("64.32.16.8", 8080, "username", "password")  # your proxy with auth, this one is obviously fake
    proxy_extension = ProxyExtension(*proxy)

    options = webdriver.ChromeOptions()
    options.add_argument(f"--load-extension={proxy_extension.directory}")
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com")  # replace with the target URL
    driver.quit()
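Before calling driver.quit() you can check that traffic really flows through the authenticated proxy by loading a page that echoes the requesting IP address. A minimal sketch, using httpbin.org purely as an illustration; it should report the proxy's address rather than your own:

# Run this before driver.quit(): httpbin.org/ip returns the IP address the
# request came from, so the proxy's address should appear instead of yours.
driver.get("https://httpbin.org/ip")
print(driver.page_source)

Because background.js answers chrome.webRequest.onAuthRequired with the credentials, the proxy's login dialog never pops up, which matters because that native dialog cannot be handled from Selenium.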