@Date : 2018-09-03
@Author : lmingzhi ([email protected])
[TOC]
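0. Preface
1. Installing mitmproxy on CentOS
1.1. Using the precompiled Linux binaries
step0. References
step1. Download link
step2. Implementation
1.2. Installing mitmproxy with conda (an alternative)
1.3. Certificate setup on CentOS Linux 7
step0. Quoted from 《Python3网络爬虫开发实战》 by Cui Qingcai (崔庆才)
step1. Official mitmproxy documentation
step2. Stack Overflow: converting a pem certificate to crt
step3. Installing certificates on CentOS
step4. Implementation
2. Injecting a JS script with mitmproxy
2.1. References
2.2. Implementation
3. Installing google-chrome
3.1. Install commands
3.2. Install log
4. Installing chromedriver
4.1. Installing the required tools
4.2. Downloading chromedriver
4.3. Patching the chromedriver binary
5. Anti-scraping detection test
5.0. Connecting locally to Jupyter Notebook on the server
5.0.1. References
5.0.2. Implementation
5.0.2.1. Remote server
5.0.2.2. Local machine setup
5.1. Test site
5.2. Test code
5.2.0. Importing modules
5.2.1. test1: without JS injection
5.2.2. Resetting the browser
5.2.3. With JS injection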
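0. Preface
Scrapers and anti-scraping measures have always been adversaries. During a recent data-collection job I ran into a question: how do you keep Chrome driven by Selenium from being detected by anti-scraping checks?
I had long heard that a browser controlled by Selenium can be detected, but I never ran into it myself until now: site A happened to run anti-scraping detection at login.
Hence this tutorial. Its goal:
Pass the test page at https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html so that every row of the result table turns green.
The main steps are:
Install mitmproxy
Set up the mitmproxy certificate
Write a mitmproxy script that injects JS code
Install Google Chrome
Download chromedriver and patch the binary
Test against the target site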
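1. Installing mitmproxy on CentOS
1.1. Using the precompiled Linux binaries
step0. References: 《Python3网络爬虫开发实战》 by Cui Qingcai (崔庆才), section 1.7.2, "Installing mitmproxy"
step1. Download link
step2. Implementation
I used mitmproxy 2.0.2.
# Download the precompiled binary package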
$ wget https://path/to/mitmproxy-2.0.2-linux.tar.gz
$ tar -zxvf mitmproxy-2.0.2-linux.tar.gz
$ sudo mv mitmproxy mitmdump mitmweb /usr/bin
1.2. Installing mitmproxy with conda (an alternative)
System: CentOS Linux 7
# Install gcc and g++
$ yum install gcc
$ yum install gcc-c++
# List the existing conda environments
$ conda env list
# conda environments:
#
py35 /root/miniconda3/envs/py35
root * /root/miniconda3
# Create a conda environment named py36
$ conda create --name py36
Fetching package metadata .............
Solving package specifications:
Package plan for installation in environment /root/miniconda3/envs/py36:
Proceed ([y]/n)? y
#
# To activate this environment, use:
# > source activate py36
#
# To deactivate an active environment, use:
# > source deactivate
#
# Activate the py36 environment and install mitmproxy
$ source activate py36
(py36) $ pip install mitmproxy
# Messages at the end of the install
Found existing installation: pyasn1 0.2.3
Cannot uninstall 'pyasn1'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
# The install above failed; pip has to be downgraded first
(py36) $ pip install --upgrade --force-reinstall pip==9.0.3
# Install mitmproxy
(py36) $ pip install mitmproxy --disable-pip-version-check
# Check that it starts
(py36) $ mitmproxy
# Press `Ctrl+C` to exit
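1.3. Certificate setup on CentOS Linux 7
step0. mitmproxy certificate setup
Quoted from 《Python3网络爬虫开发实战》 by Cui Qingcai (崔庆才)
step1. Official mitmproxy documentation
step2. Stack Overflow: converting a pem certificate to crt
step3. Installing certificates on CentOS
step4. Implementation
System: CentOS Linux 7.4.1708
# a. Check the mitmproxy version and system information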
$ mitmproxy --version
Mitmproxy version: 2.0.2 (release version) Precompiled Binary
Python version: 3.5.2
Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
SSL version: OpenSSL 1.1.0e 16 Feb 2017
Linux distro: CentOS Linux 7.4.1708 Core
# b. Install ca-certificates
$ yum install ca-certificates
# c. Convert the certificate, pem → crt
$ cd ~/.mitmproxy/
$ openssl x509 -in mitmproxy-ca-cert.pem -inform PEM -out mitmproxy-ca-cert.crt
$ ll
total 28
-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.cer
-rw-r--r-- 1 root root 1318 Aug 31 18:52 mitmproxy-ca-cert.crt
-rw-r--r-- 1 root root 1140 Aug 31 16:19 mitmproxy-ca-cert.p12
-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.pem
-rw-r--r-- 1 root root 2529 Aug 31 16:19 mitmproxy-ca.p12
-rw-r--r-- 1 root root 3022 Aug 31 16:19 mitmproxy-ca.pem
-rw-r--r-- 1 root root 770 Aug 31 16:19 mitmproxy-dhparam.pem
# Import the certificate into the system trust store
$ update-ca-trust force-enable
$ cp mitmproxy-ca-cert.crt /etc/pki/ca-trust/source/anchors/
$ update-ca-trust extract
# Directory that holds the additional (external) CA certificates
$ ll /etc/pki/ca-trust/source/anchors/
total 4
-rw-r--r-- 1 root root 1318 Aug 31 19:41 mitmproxy-ca-cert.crt
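To double-check that the file copied into the anchors directory is really the same certificate as the one mitmproxy generated, the two PEM files can be compared by fingerprint. The following is a minimal sketch using only the standard library; the helper names `pem_fingerprint` and `same_certificate` are hypothetical and not part of mitmproxy or CentOS.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 hex digest of the DER bytes encoded in a PEM certificate."""
    # PEM_cert_to_DER_cert strips the BEGIN/END lines and base64-decodes the body.
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der_bytes).hexdigest()


def same_certificate(pem_a: str, pem_b: str) -> bool:
    """True when both PEM texts encode exactly the same certificate bytes."""
    return pem_fingerprint(pem_a) == pem_fingerprint(pem_b)
```

In practice one would read `~/.mitmproxy/mitmproxy-ca-cert.pem` and `/etc/pki/ca-trust/source/anchors/mitmproxy-ca-cert.crt` as text and pass both to `same_certificate`. This only compares bytes; it does not validate the certificate itself.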
2. Injecting a JS script with mitmproxy
2.1. References
References:
Note: the JavaScript below is injected so that Chrome driven by Selenium is not caught by anti-scraping detection when it visits a site.
mitmproxy listens on port 8080 by default.
2.2. Implementation
# Edit the injection script
(py36) $ vim indject_js_proxy.py
indject_js_proxy.py
from mitmproxy import ctx

injected_javascript = '''
// Overwrite the `languages` property to use a custom getter.
Object.defineProperty(navigator, "languages", {
  get: function() {
    return ["zh-CN", "zh", "zh-TW", "en-US", "en"];
  }
});

// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'plugins', {
  get: () => [1, 2, 3, 4, 5],
});

// Pass the Webdriver test.
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});

// Pass the Chrome test.
// We can mock this in as much depth as we need for the test.
window.navigator.chrome = {
  runtime: {},
  // etc.
};

// Pass the Permissions test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications' ?
    Promise.resolve({ state: Notification.permission }) :
    originalQuery(parameters)
);
'''


def response(flow):
    # Only process responses with status code 200.
    if not flow.response.status_code == 200:
        return
    # Inject a script tag containing the JavaScript.
    html = flow.response.text
    # Assumed target: insert the script right after the opening <head> tag.
    html = html.replace('<head>', '<head><script>%s</script>' % injected_javascript)
    flow.response.text = str(html)
    ctx.log.info('>>>> JS injected successfully <<<<')
    # If the URL starts with target, replace the page content with the current URL
    # target = 'https://target-url.com'
    # if flow.request.url.startswith(target):
    #     flow.response.text = flow.request.url
Start mitmproxy together with the JS injection script
# Start the script
(py36) $ mitmdump -s indject_js_proxy.py
Loading script indject_js_proxy.py
Proxy server listening at http://*:8080
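The injection step itself can be pulled out into a small pure function, which is easier to test than a running proxy. The sketch below is an assumption on my part rather than the script used above: `inject_script` and `is_html` are hypothetical helper names, and the regular expression is one possible way to also handle `<head>` tags written in upper case or carrying attributes.

```python
import re

# Matches an opening <head> tag, with or without attributes, in any letter case.
# It does not match <header>, because <head must be followed by whitespace or '>'.
HEAD_TAG = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_script(html: str, javascript: str) -> str:
    """Insert a <script> element right after the first opening <head> tag.

    The HTML is returned unchanged when no <head> tag is found.
    """
    match = HEAD_TAG.search(html)
    if match is None:
        return html
    script_tag = "<script>%s</script>" % javascript
    return html[:match.end()] + script_tag + html[match.end():]


def is_html(content_type: str) -> bool:
    """True for Content-Type values such as 'text/html; charset=utf-8'."""
    return content_type.split(";")[0].strip().lower() == "text/html"
```

Inside the mitmproxy `response(flow)` hook, one could call `inject_script(flow.response.text, injected_javascript)` only when `is_html(flow.response.headers.get("content-type", ""))` is true, so that images, JSON and other non-HTML responses pass through untouched.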
3. Installing google-chrome
3.1. Install commands
Installing Google Chrome takes only two commands:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum install google-chrome-stable_current_x86_64.rpm
3.2. Install log
# a. Download
$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
--2018-08-31 16:37:05-- https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Resolving dl.google.com (dl.google.com)... 203.208.40.68, 203.208.40.64, 203.208.40.67, ...
Connecting to dl.google.com (dl.google.com)|203.208.40.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54333686 (52M) [application/x-rpm]
Saving to: ‘google-chrome-stable_current_x86_64.rpm’
100%[ =======================================================================================================>] 54,333,686 6.63MB/s in 6.9s
2018-08-31 16:37:12 (7.56 MB/s) - ‘google-chrome-stable_current_x86_64.rpm’ saved [54333686/54333686]
# b. Install the browser and its dependencies
$ sudo yum install google-chrome-stable_current_x86_64.rpm
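4. Installing chromedriver
4.1. Installing the required tools
# Install zip and unzip
$ yum install zip unzip
# Install the hex editor hexedit
$ yum install hexedit
4.2. Downloading chromedriver
# a. Download (saved under the versioned file name used in the next step)
$ wget -O chromedriver_linux64_2.40.zip https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip
# b. Unpack the archive
$ unzip chromedriver_linux64_2.40.zip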
Archive: chromedriver_linux64_2.40.zip
inflating: chromedriver
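4.3. Patching the chromedriver binary
# Open chromedriver in hexedit
$ hexedit chromedriver
# Steps
1. Press Tab to switch to the ASCII (string) column
2. Press Ctrl+S and search for var key = '$cdc_asdjflasutopfhvcZLmcfl_' (for version 2.40)
3. Overwrite '$cdc_asdjflasutopfhvcZLmcfl_' with any other string of the same length
4. Press Ctrl+X to save and exit
# Move chromedriver to /usr/bin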
$ mv chromedriver /usr/bin
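The same patch can also be scripted instead of done by hand in a hex editor. The sketch below is only an illustration of the idea: `random_key` and `patch_key` are hypothetical helpers, and the default key is the one quoted above for version 2.40 (other versions may use a different string). The one hard requirement is that the replacement has exactly the same length, otherwise offsets inside the binary shift and the file breaks.

```python
import random
import string

# Key quoted in the text for chromedriver 2.40; other versions may differ.
DEFAULT_KEY = b"$cdc_asdjflasutopfhvcZLmcfl_"


def random_key(length: int, seed=None) -> bytes:
    """Build a random ASCII-letter string of the requested length."""
    rng = random.Random(seed)
    letters = string.ascii_letters
    return "".join(rng.choice(letters) for _ in range(length)).encode("ascii")


def patch_key(data: bytes, new_key: bytes, old_key: bytes = DEFAULT_KEY) -> bytes:
    """Replace every occurrence of old_key with new_key without changing the size."""
    if len(new_key) != len(old_key):
        raise ValueError("replacement must have the same length as the original key")
    return data.replace(old_key, new_key)
```

To apply it, one would read the `chromedriver` file in binary mode, call `patch_key(data, random_key(len(DEFAULT_KEY)))`, and write the result to a new file, keeping a backup of the original binary.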
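5. Anti-scraping detection test
5.0. Connecting locally to Jupyter Notebook on the server
The code will be deployed on a remote server, so the tests have to run directly on that server. Since I prefer Jupyter Notebook, here is how to connect to a remote Jupyter Notebook along the way.
5.0.1. References
5.0.2. Implementation
5.0.2.1. Remote server
The remote server only has miniconda3, so Jupyter has to be installed as well.
$ conda install jupyter
# Start Jupyter Notebook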
$ jupyter notebook
[I 11:39:00.645 NotebookApp] Writing notebook server cookie secret to /run/user/0/jupyter/notebook_cookie_secret
[C 11:39:00.665 NotebookApp] Running as root is not recommended. Use --allow-root to bypass.
# The message says --allow-root is required, because I am logged in as root
$ jupyter notebook --allow-root
[I 11:39:18.954 NotebookApp] Serving notebooks from local directory: /data/scrapy_lmz/tg_scrapy_eleme_mt
[I 11:39:18.954 NotebookApp] 0 active kernels
[I 11:39:18.954 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e
[I 11:39:18.954 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:39:18.955 NotebookApp] No web browser found: could not locate runnable browser.
[C 11:39:18.955 NotebookApp]
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e
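Jupyter Notebook can also be started with the command below, which runs it in the background and writes its log to /tmp/ipynb.logs.
# Run Jupyter Notebook in the background on the remote server (2>&1 also captures the log lines written to stderr)
$ nohup jupyter notebook --allow-root > /tmp/ipynb.logs 2>&1 &
# View the log
$ tail -f /tmp/ipynb.logs
# List the Jupyter Notebook processes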
$ ps aux | grep jupyter
root 29106 0.0 0.1 475080 61068 pts/1 Sl 14:05 0:04 /root/miniconda3/bin/python /root/miniconda3/bin/jupyter-notebook --allow-root
root 29151 0.0 0.1 771448 55556 ? Ssl 14:08 0:02 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-afbd8de5-efed-4b18-94ac-d27866219a2c.json
root 30714 0.0 0.1 744156 44668 ? Ssl 17:24 0:00 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-b6c35685-bd4e-4f18-ba3c-b7680706ad1b.json
root 31028 0.0 0.0 112708 976 pts/1 S+ 18:18 0:00 grep --color=auto jupyter
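# Stop the Jupyter process
$ kill 29106
# Or kill all Jupyter processes at once
$ ps aux|grep jupyter|awk '{print $2}'|xargs kill -9
5.0.2.2. Local machine setup
# Port forwarding
$ ssh -L8008:localhost:8888 [email protected]
# Enter the remote server's password
# Open the following address in a local browser to reach the remote Jupyter Notebook; note that the local port is 8008
http://localhost:8008/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e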
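Because the token URL printed by the remote server uses port 8888 while the tunnel listens on 8008 locally, the port has to be swapped by hand each time. A small helper can do that; this is just a convenience sketch, and `forwarded_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def forwarded_url(remote_url: str, local_port: int) -> str:
    """Swap the port of a Jupyter token URL for the locally forwarded port."""
    parts = urlsplit(remote_url)
    # Keep the host name, replace only the port.
    netloc = "%s:%d" % (parts.hostname, local_port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, passing the URL from the Jupyter log together with `8008` yields the address to open in the local browser, with the token left untouched.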
5.1. Test site
Result when browsing normally:
5.2. Test code
5.2.0. Importing modules
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time, sys
import json, io, re, os, random
from datetime import datetime, timedelta
# Import the logging module
import logging
# create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# create console handler and set level to info
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
# create formatter
formatter = logging.Formatter('%(asctime)s-%(levelname)s-%(message)s')
# add formatter to ch
ch.setFormatter(formatter)
# add ch to logger
logger.addHandler(ch)
# fh = logging.FileHandler(r'/tmp/crawl_logs.log')
# fh.setFormatter(formatter)
# logger.addHandler(fh)
5.2.1. test1: no JS injection, i.e. without the mitmproxy proxy
# mitmproxy port 8080
proxy_host = 'localhost'
proxy_port = 8080
options = webdriver.ChromeOptions()
options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
options.add_argument('--headless')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')
options.add_argument("--no-sandbox")
options.add_argument('--disable-gpu')
# test1: no JS injection, so the mitmproxy proxy is not used
# Specify the proxy.
# options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))
# Launch the browser
driver = webdriver.Chrome(chrome_options=options)
url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'
# Visit the test site
driver.get(url)
# Take a screenshot
driver.save_screenshot('test1.png')
Screenshot: test1.png
5.2.2. Resetting the browser
driver.close()
# Check the background processes
!ps aux|grep chrome
root 30727 0.0 0.0 206520 5400 ? Sl 17:28 0:00 chromedriver --port=28011
root 30844 0.0 0.0 113172 1208 pts/3 Ss+ 17:32 0:00 /usr/bin/sh -c ps aux|grep chrome
root 30846 0.0 0.0 112708 944 pts/3 S+ 17:32 0:00 grep chrome
###################################################################
# Kill the leftover `chromedriver --port` process
import subprocess
cmd = "ps aux|grep chrome|awk '{print $2}'|xargs kill -9"
p = subprocess.Popen(args=cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True)
z = p.wait()
logging.warning('kill chrome process%s' % str(z))
###################################################################
# Check the background processes again
!ps aux|grep chrome
root 30727 0.0 0.0 0 0 ? Z 17:28 0:00 [chromedriver]
root 30853 0.0 0.0 113172 1208 pts/3 Ss+ 17:33 0:00 /usr/bin/sh -c ps aux|grep chrome
root 30855 0.0 0.0 112708 944 pts/3 S+ 17:33 0:00 grep chrome
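The `grep chrome | xargs kill -9` pipeline also matches the `sh` and `grep` processes of the pipeline itself. A slightly more selective approach is to parse the `ps aux` output and keep only processes whose executable is really Chrome or chromedriver. The sketch below assumes the standard `ps aux` column layout; `chrome_pids` is a hypothetical helper.

```python
def chrome_pids(ps_output: str, names=("chromedriver", "chrome")) -> list:
    """Extract PIDs of Chrome-related processes from `ps aux` text.

    Header lines, zombie processes and unrelated commands (such as the
    `sh -c ps aux|grep chrome` pipeline itself) are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        stat, command = fields[7], fields[10]
        if stat.startswith("Z"):
            continue  # zombies are already dead and cannot be killed
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable in names:
            pids.append(int(fields[1]))
    return pids
```

The returned PIDs could then be passed to `os.kill(pid, signal.SIGTERM)`. Calling `driver.quit()` instead of `driver.close()` normally ends both the browser and its chromedriver process, so this kind of cleanup is best treated as a fallback.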
5.2.3. With JS injection: restart the browser behind mitmproxy so that the JS code is injected
# mitmproxy port 8080
proxy_host = 'localhost'
proxy_port = 8080
options = webdriver.ChromeOptions()
options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
options.add_argument('--headless')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')
options.add_argument("--no-sandbox")
options.add_argument('--disable-gpu')
# test2: inject the JS code by routing traffic through the mitmproxy proxy
# Specify the proxy.
options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))
# Launch the browser
driver = webdriver.Chrome(chrome_options=options)
url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'
driver.get(url)
# Take a screenshot
driver.save_screenshot('test2.png')
Screenshot: test2.png
The chrome item does not seem to pass yet. My target site already gets past its anti-scraping detection, though, so this item still needs a closer look.
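The two test runs differ only in whether the `--proxy-server` flag is set, so the option list can be built by one helper to avoid copy-paste mistakes between them. This is a small refactoring sketch, not the code that produced the screenshots; `build_chrome_args` is a hypothetical name, and the arguments are the ones used in the tests above.

```python
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
)


def build_chrome_args(proxy_host=None, proxy_port=None):
    """Return the Chrome command-line arguments for one test run.

    The proxy flag is added only when both host and port are given,
    which corresponds to the run with JS injection through mitmproxy.
    """
    args = [
        'lang=zh-CN,zh,zh-TW,en-US,en',
        '--headless',
        'user-agent=%s' % USER_AGENT,
        '--no-sandbox',
        '--disable-gpu',
    ]
    if proxy_host and proxy_port:
        args.append('--proxy-server=%s:%s' % (proxy_host, proxy_port))
    return args
```

Each returned string would be passed to `options.add_argument(...)` on a `webdriver.ChromeOptions()` instance: `build_chrome_args()` for test1 and `build_chrome_args('localhost', 8080)` for the run with injection.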