Installing Selenium with Python notebooks: Selenium ChromeDriver and anti-scraping detection

@Date : 2018-09-03

@Author : lmingzhi ([email protected])

[TOC]

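0. Preface

1. Installing mitmproxy on CentOS

1.1. Using the precompiled Linux binary package

step0. References

step1. Download link

step2. Implementation

1.2. Installing mitmproxy with conda (an alternative)

1.3. Certificate configuration on CentOS Linux 7

step0. Quoted from "Python3网络爬虫开发实战" by Cui Qingcai (崔庆才)

step1. Official mitmproxy documentation

step2. Stack Overflow: converting a PEM certificate to CRT

step3. Installing certificates on CentOS

step4. Implementation

2. Injecting JavaScript with mitmproxy

2.1. References

2.2. Implementation

3. Installing Google Chrome

3.1. Install commands

3.2. Install log

4. Installing chromedriver

4.1. Installing the required tools

4.2. Downloading chromedriver

4.3. Modifying the chromedriver binary

5. Testing against anti-scraping detection

5.0. Connecting to the server's Jupyter notebook from a local machine

5.0.1. References

5.0.2. Implementation

5.0.2.1. Remote server

5.0.2.2. Local machine setup

5.1. Test site

5.2. Test code

5.2.0. Importing modules

5.2.1. test1: without JS injection

5.2.2. Resetting the browser

5.2.3. With JS injection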

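0. Preface

Crawlers and anti-crawler defenses have always been adversaries. During a recent data-collection job I ran into a question: how can Chrome driven by Selenium avoid being caught by anti-scraping detection?

I had long heard that a browser controlled by Selenium can be detected, but I never looked into it until now, when site A turned out to run anti-scraping detection at login.

Hence this write-up. The goal of this tutorial:

Pass the test page https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html with every row of its table green.

The main steps are:

Install mitmproxy

Set up the mitmproxy certificate

Write a mitmproxy script that injects JavaScript

Install Google Chrome

Download chromedriver and modify its binary

Test against the target site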

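1. Installing mitmproxy on CentOS

1.1. Using the precompiled Linux binary package

step0. Reference: "Python3网络爬虫开发实战" by Cui Qingcai (崔庆才), section 1.7.2, "Installing mitmproxy"

step1. Download link

step2. Implementation

I used mitmproxy 2.0.2.

# Download the precompiled binary package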

$ wget https://path/to/mitmproxy-2.0.2-linux.tar.gz

$ tar -zxvf mitmproxy-2.0.2-linux.tar.gz

$ sudo mv mitmproxy mitmdump mitmweb /usr/bin

1.2. Installing mitmproxy with conda (an alternative)

System: CentOS Linux 7

# Install gcc and g++

$ yum install gcc

$ yum install gcc-c++

# List the existing conda environments

$ conda env list

# conda environments:

#

py35 /root/miniconda3/envs/py35

root * /root/miniconda3

# Create a conda environment named py36

$ conda create --name py36

Fetching package metadata .............

Solving package specifications:

Package plan for installation in environment /root/miniconda3/envs/py36:

Proceed ([y]/n)? y

#

# To activate this environment, use:

# > source activate py36

#

# To deactivate an active environment, use:

# > source deactivate

#

# Activate the py36 environment and install mitmproxy

$ source activate py36

(py36) $ pip install mitmproxy

# Messages at the end of the install

Found existing installation: pyasn1 0.2.3

Cannot uninstall 'pyasn1'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

You are using pip version 10.0.1, however version 18.0 is available.

You should consider upgrading via the 'pip install --upgrade pip' command.

# The install above failed; downgrade pip first

(py36) $ pip install --upgrade --force-reinstall pip==9.0.3

# Install mitmproxy again

(py36) $ pip install mitmproxy --disable-pip-version-check

# Check that it starts

(py36) $ mitmproxy

# Press `Ctrl+C` to exit

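1.3. Certificate configuration on CentOS Linux 7

step0. mitmproxy certificate configuration

Quoted from "Python3网络爬虫开发实战" by Cui Qingcai (崔庆才)

step1. Official mitmproxy documentation

step2. Stack Overflow: converting a PEM certificate to CRT

step3. Installing certificates on CentOS

step4. Implementation

System: CentOS Linux 7.4.1708

# a. Check the mitmproxy version and system information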

$ mitmproxy --version

Mitmproxy version: 2.0.2 (release version) Precompiled Binary

Python version: 3.5.2

Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core

SSL version: OpenSSL 1.1.0e 16 Feb 2017

Linux distro: CentOS Linux 7.4.1708 Core

# b. Install ca-certificates

$ yum install ca-certificates

# c. Convert the certificate, pem → crt (the files in ~/.mitmproxy/ are generated the first time mitmproxy runs)

$ cd ~/.mitmproxy/

$ openssl x509 -in mitmproxy-ca-cert.pem -inform PEM -out mitmproxy-ca-cert.crt

$ ll

total 28

-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.cer

-rw-r--r-- 1 root root 1318 Aug 31 18:52 mitmproxy-ca-cert.crt

-rw-r--r-- 1 root root 1140 Aug 31 16:19 mitmproxy-ca-cert.p12

-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.pem

-rw-r--r-- 1 root root 2529 Aug 31 16:19 mitmproxy-ca.p12

-rw-r--r-- 1 root root 3022 Aug 31 16:19 mitmproxy-ca.pem

-rw-r--r-- 1 root root 770 Aug 31 16:19 mitmproxy-dhparam.pem

# d. Import the certificate into the system trust store

$ update-ca-trust force-enable

$ cp mitmproxy-ca-cert.crt /etc/pki/ca-trust/source/anchors/

$ update-ca-trust extract

# Directory that holds the additional CA certificates

$ ll /etc/pki/ca-trust/source/anchors/

total 4

-rw-r--r-- 1 root root 1318 Aug 31 19:41 mitmproxy-ca-cert.crt

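2. Injecting JavaScript with mitmproxy

2.1. References

References:

Note: the JavaScript below is injected so that Chrome driven by Selenium is not flagged by anti-scraping detection when it visits a site.

mitmproxy listens on port 8080 by default.

2.2. Implementation

# Edit the injection script

(py36) $ vim indject_js_proxy.py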

indject_js_proxy.py

from mitmproxy import ctx

injected_javascript = '''
// Overwrite the `languages` property to use a custom getter.
Object.defineProperty(navigator, "languages", {
  get: function() {
    return ["zh-CN", "zh", "zh-TW", "en-US", "en"];
  }
});

// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'plugins', {
  get: () => [1, 2, 3, 4, 5],
});

// Pass the Webdriver test.
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});

// Pass the Chrome test.
// We can mock this in as much depth as we need for the test.
window.navigator.chrome = {
  runtime: {},
  // etc.
};

// Pass the Permissions test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications' ?
    Promise.resolve({ state: Notification.permission }) :
    originalQuery(parameters)
);
'''

def response(flow):
    # Only process 200 responses of HTML content.
    if flow.response.status_code != 200:
        return
    if 'text/html' not in flow.response.headers.get('Content-Type', ''):
        return

    # Inject a script tag containing the JavaScript.
    # This relies on the page containing a literal <head> tag.
    html = flow.response.text
    html = html.replace('<head>', '<head><script>%s</script>' % injected_javascript)
    flow.response.text = str(html)
    ctx.log.info('>>>> JavaScript injected successfully <<<<')

    # If the URL starts with target, replace the page content with the current URL
    # target = 'https://target-url.com'
    # if flow.request.url.startswith(target):
    #     flow.response.text = flow.request.url

Start mitmproxy together with the JS injection script:

# Launch the script

(py36) $ mitmdump -s indject_js_proxy.py

Loading script indject_js_proxy.py

Proxy server listening at http://*:8080
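The string replacement at the heart of the response hook can be tried outside mitmproxy. The following is a minimal sketch, not the script used above: `inject_script` is a hypothetical helper name, and the regular expression (which also accepts `<HEAD>` or a `<head>` tag with attributes) is an assumption added for illustration.

```python
import re


def inject_script(html: str, javascript: str) -> str:
    """Insert a <script> element right after the opening <head> tag.

    The HTML is returned unchanged when no <head> tag is found.
    """
    script_tag = "<script>%s</script>" % javascript
    # Match <head> with optional attributes, case-insensitively.
    # "<header" does not match because "<head" must be followed by
    # whitespace or ">".
    pattern = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
    match = pattern.search(html)
    if match is None:
        return html
    return html[:match.end()] + script_tag + html[match.end():]


page = "<html><HEAD lang='en'><title>t</title></head><body></body></html>"
patched = inject_script(page, "window.x = 1;")
print(patched)
```

Slicing around the match, rather than calling `re.sub`, avoids any special handling of backslashes that the JavaScript might contain.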

3. Installing Google Chrome

3.1. Install commands

Installing Google Chrome takes only two commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

sudo yum install google-chrome-stable_current_x86_64.rpm

3.2. Install log

# a. Download

$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

--2018-08-31 16:37:05-- https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

Resolving dl.google.com (dl.google.com)... 203.208.40.68, 203.208.40.64, 203.208.40.67, ...

Connecting to dl.google.com (dl.google.com)|203.208.40.68|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 54333686 (52M) [application/x-rpm]

Saving to: ‘google-chrome-stable_current_x86_64.rpm’

100%[ =======================================================================================================>] 54,333,686 6.63MB/s in 6.9s

2018-08-31 16:37:12 (7.56 MB/s) - ‘google-chrome-stable_current_x86_64.rpm’ saved [54333686/54333686]

# b. Install the browser and its dependencies

$ sudo yum install google-chrome-stable_current_x86_64.rpm

4. Installing chromedriver

4.1. Installing the required tools

# Install zip and unzip

$ yum install zip unzip

# Install the hex editor hexedit

$ yum install hexedit

4.2. Downloading chromedriver

# a. Download

$ wget https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip

# b. Unpack the archive

$ unzip chromedriver_linux64.zip

Archive: chromedriver_linux64.zip

inflating: chromedriver

4.3. Modifying the chromedriver binary

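# Edit chromedriver

$ hexedit chromedriver

# Steps

1. Press Tab to switch to the string (ASCII) column

2. Press Ctrl+S and search for var key = '$cdc_asdjflasutopfhvcZLmcfl_' (in version 2.40)

3. Replace '$cdc_asdjflasutopfhvcZLmcfl_' with any value of the same length

4. Press Ctrl+X to save and exit

# Move chromedriver to /usr/bin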

$ mv chromedriver /usr/bin
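The hexedit steps amount to a same-length byte replacement, which can also be scripted. Below is a rough sketch under stated assumptions: `patch_marker` is a hypothetical helper, the replacement string is an arbitrary example, and the sample bytes are a stand-in rather than a real chromedriver binary.

```python
def patch_marker(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new` without changing the length."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same length as the original")
    if old not in data:
        raise ValueError("marker not found; this build may use a different name")
    return data.replace(old, new)


OLD = b"$cdc_asdjflasutopfhvcZLmcfl_"  # marker quoted above for chromedriver 2.40
NEW = b"$abc_asdjflasutopfhvcZLmcfl_"  # hypothetical replacement of equal length

sample = b"\x00var key = '" + OLD + b"';\x00"  # stand-in bytes, not a real binary
patched = patch_marker(sample, OLD, NEW)
print(patched)

# To patch a real file (the paths are assumptions), write to a copy:
# with open("chromedriver", "rb") as f:
#     data = f.read()
# with open("chromedriver.patched", "wb") as f:
#     f.write(patch_marker(data, OLD, NEW))
```

Keeping the length identical matters because offsets inside the executable must not shift; the length check makes that requirement explicit.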

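5. Testing against anti-scraping detection

5.0. Connecting to the server's Jupyter notebook from a local machine

The code will be deployed on a remote server, so the tests have to run directly on that server. Since I prefer Jupyter notebook, here is also how to connect to a remote Jupyter notebook.

5.0.1. References

5.0.2. Implementation

5.0.2.1. Remote server

The remote server only has miniconda3 installed, so Jupyter has to be installed as well.

$ conda install jupyter

# Start jupyter notebook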

$ jupyter notebook

[I 11:39:00.645 NotebookApp] Writing notebook server cookie secret to /run/user/0/jupyter/notebook_cookie_secret

[C 11:39:00.665 NotebookApp] Running as root is not recommended. Use --allow-root to bypass.

# The message says to add --allow-root, because I am logged in as root

$ jupyter notebook --allow-root

[I 11:39:18.954 NotebookApp] Serving notebooks from local directory: /data/scrapy_lmz/tg_scrapy_eleme_mt

[I 11:39:18.954 NotebookApp] 0 active kernels

[I 11:39:18.954 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e

[I 11:39:18.954 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

[W 11:39:18.955 NotebookApp] No web browser found: could not locate runnable browser.

[C 11:39:18.955 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time,

to login with a token:

http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e

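Jupyter notebook can also be started with the following command, which runs it in the background and writes its log to /tmp/ipynb.logs.

# Run jupyter notebook in the background on the remote server

$ nohup jupyter notebook --allow-root > /tmp/ipynb.logs &

# Follow the log

$ tail -f /tmp/ipynb.logs

# List the jupyter notebook processes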

$ ps aux | grep jupyter

root 29106 0.0 0.1 475080 61068 pts/1 Sl 14:05 0:04 /root/miniconda3/bin/python /root/miniconda3/bin/jupyter-notebook --allow-root

root 29151 0.0 0.1 771448 55556 ? Ssl 14:08 0:02 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-afbd8de5-efed-4b18-94ac-d27866219a2c.json

root 30714 0.0 0.1 744156 44668 ? Ssl 17:24 0:00 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-b6c35685-bd4e-4f18-ba3c-b7680706ad1b.json

root 31028 0.0 0.0 112708 976 pts/1 S+ 18:18 0:00 grep --color=auto jupyter

# Stop the jupyter process

$ kill 29106

# Or kill all jupyter processes at once

$ ps aux|grep jupyter|awk '{print $2}'|xargs kill -9

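5.0.2.2. Local machine setup

# Port forwarding

$ ssh -L8008:localhost:8888 [email protected]

# Enter the remote server's password

# Then open the following address in a local browser to reach the remote jupyter notebook; note that the local port is 8008

http://localhost:8008/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e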
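The URL printed by the remote server uses port 8888, while the tunnel exposes it on local port 8008, so only the port needs to change. A small sketch of that rewrite follows; `to_local_url` is a hypothetical helper name, not part of Jupyter or ssh.

```python
from urllib.parse import urlsplit, urlunsplit


def to_local_url(remote_url: str, local_port: int) -> str:
    """Swap the port of a Jupyter token URL for the locally forwarded port."""
    parts = urlsplit(remote_url)
    netloc = "%s:%d" % (parts.hostname, local_port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


remote = "http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e"
print(to_local_url(remote, 8008))
```

The token in the query string is carried over untouched, which is what lets the local browser log in to the remote server.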

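5.1. Test site

https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

Result in a normal browser:

5.2. Test code

5.2.0. Importing modules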

# -*- coding: utf-8 -*-

from selenium import webdriver

from selenium.webdriver.common.keys import Keys

import time, sys

import json, io, re, os, random

from datetime import datetime, timedelta

# Import the logging module

import logging

# create logger

logger = logging.getLogger()

logger.setLevel(logging.INFO)

# create console handler and set level to info

ch = logging.StreamHandler()

ch.setLevel(logging.INFO)

# create formatter

formatter = logging.Formatter('%(asctime)s-%(levelname)s-%(message)s')

# add formatter to ch

ch.setFormatter(formatter)

# add ch to logger

logger.addHandler(ch)

# fh = logging.FileHandler(r'/tmp/crawl_logs.log')

# fh.setFormatter(formatter)

# logger.addHandler(fh)
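In a notebook, re-running the cell above attaches another handler to the same logger each time, so every message is then printed more than once. One way to guard against that is sketched here, assuming a named logger instead of the root logger; `build_logger` and the name "crawler" are hypothetical.

```python
import logging


def build_logger(name: str = "crawler", level: int = logging.INFO) -> logging.Logger:
    """Return a console logger, attaching its handler only on the first call."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # skip when the cell is executed again
        handler = logging.StreamHandler()
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter("%(asctime)s-%(levelname)s-%(message)s"))
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger = build_logger()  # a second call does not add a second handler
logger.info("logger ready")
```

The format string is the same one used above, so the output looks identical; only the duplicate-handler behaviour differs.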

5.2.1. test1: without JS injection (not going through the mitmproxy proxy)

# mitmproxy port 8080

proxy_host = 'localhost'

proxy_port = 8080

options = webdriver.ChromeOptions()

options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')

options.add_argument('--headless')

options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

options.add_argument("--no-sandbox")

options.add_argument('--disable-gpu')

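# test1: no JS injection, i.e. do not go through the mitmproxy proxy

# Specify the proxy.

# options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))

# Start the browser (`options=` replaces the deprecated `chrome_options=` keyword)

driver = webdriver.Chrome(options=options)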

url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'

# Visit the test site

driver.get(url)

# Take a screenshot

driver.save_screenshot('test1.png')

Screenshot: test1.png

5.2.2. Resetting the browser

driver.close()

# List the background processes

!ps aux|grep chrome

root 30727 0.0 0.0 206520 5400 ? Sl 17:28 0:00 chromedriver --port=28011

root 30844 0.0 0.0 113172 1208 pts/3 Ss+ 17:32 0:00 /usr/bin/sh -c ps aux|grep chrome

root 30846 0.0 0.0 112708 944 pts/3 S+ 17:32 0:00 grep chrome

###################################################################

# Kill the chromedriver --port process

import subprocess

cmd = "ps aux|grep chrome|awk '{print $2}'|xargs kill -9"

p = subprocess.Popen(args=cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True)

z = p.wait()

logging.warning('kill chrome process%s' % str(z))

###################################################################

# List the background processes again

!ps aux|grep chrome

root 30727 0.0 0.0 0 0 ? Z 17:28 0:00 [chromedriver]

root 30853 0.0 0.0 113172 1208 pts/3 Ss+ 17:33 0:00 /usr/bin/sh -c ps aux|grep chrome

root 30855 0.0 0.0 112708 944 pts/3 S+ 17:33 0:00 grep chrome
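The `ps aux|grep chrome|awk ...` pipeline also picks up the `grep` process and the shell that runs it. A sketch of the same filtering in Python, which skips those lines, is shown below; `pids_matching` is a hypothetical helper, and the sample text is the first `ps` listing from this section.

```python
def pids_matching(ps_output: str, keyword: str) -> list:
    """Return PIDs from `ps aux` output whose command contains `keyword`."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        command = fields[10]
        if keyword in command and "grep" not in command:
            pids.append(int(fields[1]))
    return pids


sample = """\
root 30727 0.0 0.0 206520 5400 ? Sl 17:28 0:00 chromedriver --port=28011
root 30844 0.0 0.0 113172 1208 pts/3 Ss+ 17:32 0:00 /usr/bin/sh -c ps aux|grep chrome
root 30846 0.0 0.0 112708 944 pts/3 S+ 17:32 0:00 grep chrome
"""
print(pids_matching(sample, "chrome"))
```

In Selenium, `driver.quit()` also ends the chromedriver process, whereas `driver.close()` only closes the current window, which is why chromedriver is still listed after the `close()` call above.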

5.2.3. With JS injection: restart, go through mitmproxy, and inject the JS code

# mitmproxy port 8080

proxy_host = 'localhost'

proxy_port = 8080

options = webdriver.ChromeOptions()

options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')

options.add_argument('--headless')

options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')

options.add_argument("--no-sandbox")

options.add_argument('--disable-gpu')

# test2: inject the JS code, i.e. go through the mitmproxy proxy

# Specify the proxy.

options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))

# Start the browser (`options=` replaces the deprecated `chrome_options=` keyword)

driver = webdriver.Chrome(options=options)

url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'

driver.get(url)

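# Take a screenshot

driver.save_screenshot('test2.png')

Screenshot: test2.png

The chrome item does not seem to pass, but my target site already gets past its anti-scraping detection, so this item still needs a closer look. One possible cause is that the injected script sets window.navigator.chrome, whereas checks of this kind usually look at window.chrome.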
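The two tests differ only in whether the proxy switch is added, so the shared switches could be collected in one place. This is a sketch rather than the code used above; `chrome_arguments` and `USER_AGENT` are hypothetical names, and the switch values are copied from the tests.

```python
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
)


def chrome_arguments(proxy_host=None, proxy_port=None):
    """Return the Chrome switches shared by both tests; add the proxy only when given."""
    args = [
        "lang=zh-CN,zh,zh-TW,en-US,en",
        "--headless",
        "user-agent=" + USER_AGENT,
        "--no-sandbox",
        "--disable-gpu",
    ]
    if proxy_host and proxy_port:
        args.append("--proxy-server=%s:%s" % (proxy_host, proxy_port))
    return args


print(chrome_arguments("localhost", 8080)[-1])

# With Selenium installed, the list could be applied like this:
# options = webdriver.ChromeOptions()
# for arg in chrome_arguments("localhost", 8080):
#     options.add_argument(arg)
```

Calling `chrome_arguments()` with no parameters gives the test1 setup, and passing the mitmproxy host and port gives the test2 setup.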
