import urllib.request

urllib.request.urlopen("http://www.baidu.com")
This library works together with browser drivers to crawl dynamically rendered pages.
(1) ChromeDriver
To use it, first download chromedriver.exe, put it in the same directory as chrome.exe (the default installation path), and then add that directory to PATH.

import selenium
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
driver.page_source

The only drawback of this approach is that a browser window pops up, which we usually do not want, so we can run in headless mode to hide the browser UI (create an Options() object and pass the headless flag via add_argument).
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
base_url = "http://www.baidu.com/"
# directory where the corresponding chromedriver is placed
driver = webdriver.Chrome(executable_path=(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'), chrome_options=chrome_options)
driver.get(base_url + "/")
start_time = time.time()
print('this is start_time ', start_time)
driver.find_element_by_id("kw").send_keys("selenium webdriver")
driver.find_element_by_id("su").click()
driver.save_screenshot('screen.png')
driver.close()
end_time = time.time()
print('this is end_time ', end_time)
(2) PhantomJS
This is another way to run without a browser window. PhantomJS is no longer maintained and all sorts of oddities can show up while using it, but it is still worth a brief introduction.
As with ChromeDriver, we first download PhantomJS and then put it on PATH so that it can be invoked later.

import selenium
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com")
driver.page_source
This library (lxml) is what we use for XPath-based parsing.
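Since lxml only gets a passing mention here, below is a minimal sketch of querying a fragment with XPath directly; the HTML snippet and the "msg" class name are made up purely for illustration.

from lxml import etree

# parse an HTML fragment and query it with an XPath expression
html = etree.HTML('<div><p class="msg">hello</p><p class="msg">world</p></div>')
print(html.xpath('//p[@class="msg"]/text()'))   # ['hello', 'world']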
When installing with pip, note that the package name is beautifulsoup4 (the fourth major version), and that the library depends on lxml, so install lxml first.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html></html>', 'lxml')

Like BeautifulSoup, PyQuery is also a web page parsing library, but its syntax is a bit simpler (it is modeled on jQuery).

from pyquery import PyQuery as pq

page = pq('<html>hello world</html>')
result = page('html').text()
result
This library lets Python talk to MySQL.

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='test')
cursor = conn.cursor()
result = cursor.execute('select * from user where id = 1')
print(cursor.fetchone())

import pymongo

client = pymongo.MongoClient('localhost')
db = client['newtestdb']
db['table'].insert({'name': 'Bob'})
db['table'].find_one({'name': 'Bob'})

import redis

r = redis.Redis('localhost', 6379)
r.set("name", "Bob")
r.get('name')
Flask may come in handy later when we work with proxies.

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello():
    return "hello world"

if __name__ == '__main__':
    app.run(debug=True)

Django may be useful for maintaining a distributed crawler.
A web-based notebook.
(1) What is a crawler
A crawler is an automated tool that requests web pages and extracts data from them.
(2) The basic workflow of a crawler
1. Send the request:
Use an HTTP library to send a request to the target site, i.e. a Request (optionally with extra header information), then wait for the server to respond.
2. Get the response content
If the server responds normally, we get a Response whose body is the content of the page we want; it may be HTML, JSON, binary data (images, video), and so on.
3. Parse the content
HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted into a JSON object; binary data can be saved or processed further.
4. Save the data
The data can be saved in many forms: plain text, a database, or a file in some specific format (a minimal end-to-end sketch follows).
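As a quick illustration of steps 1 through 4, here is a minimal sketch that requests a page, pulls out its title with a regular expression, and saves it to a text file; the URL and the output file name are arbitrary choices, not anything prescribed by these notes.

import re
import requests

# 1. send the request
res = requests.get("http://www.baidu.com")
res.encoding = 'utf-8'

# 2. get the response content (HTML text)
html = res.text

# 3. parse the content: grab the <title> text with a regex
match = re.search(r'<title>(.*?)</title>', html, re.S)
title = match.group(1) if match else ''

# 4. save the data as plain text
with open('title.txt', 'w', encoding='utf-8') as f:
    f.write(title)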
(3) The basic elements of a request
1. Request method
2. Request URL
3. Request headers
4. Request body (POST requests only)
(4) The basic elements of a response
1. Status code
2. Response headers
3. Response body
(5) Example code:
1. Requesting page data

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("http://www.baidu.com", headers=headers)
print(res.status_code)
print(res.headers)
print(res.text)

Here we used res.text, which returns the body as text; if the response is binary data (an image, for example), we should use res.content instead.
2. Requesting binary data

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("https://ss2.bdstatic.com/lfoZeXSm1A5BphGlnYG/icon/95486.png", headers=headers)
print(res.content)
with open(r'E:\桌面\1.png', 'wb') as f:
    f.write(res.content)
(6) Parsing methods
1. Process the raw text directly
2. Convert it to a JSON object
3. Regular expressions
4. BeautifulSoup
5. PyQuery
6. XPath
(7) Why the response differs from what the browser shows
When our script requests a page, that single request returns only the raw page source. The source usually references many remote JS and CSS files that our script neither fetches nor executes, whereas a browser issues further requests for those resources and uses them to load and render the page. The page we see in the browser is therefore the result of many requests plus rendering, so it will naturally differ from what one request returns.
(8) How to handle JS-rendered pages
The essence of every solution is to simulate the browser's loading and rendering, then hand back the rendered page.
1. Analyze the Ajax requests (see the sketch after this list)
2. selenium + webdriver (recommended)
3. Splash
4. PyV8, Ghost.py
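For option 1, the idea is to find the XHR endpoint that the page calls (visible in the browser's developer tools) and request it directly; the endpoint, parameters and JSON layout below are hypothetical placeholders that only show the pattern.

import requests

# hypothetical JSON endpoint discovered in the Network/XHR tab of the developer tools
ajax_url = 'https://example.com/api/list'
params = {'page': 1, 'size': 20}
headers = {'X-Requested-With': 'XMLHttpRequest'}

res = requests.get(ajax_url, params=params, headers=headers)
data = res.json()                       # the endpoint returns JSON rather than rendered HTML
for item in data.get('items', []):      # 'items' is an assumed field name
    print(item)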
(9) How to store data (a short sketch follows this list)
1. Plain text
2. Relational databases
3. Non-relational databases
4. Binary files
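A minimal sketch of the plain-text and "specific file format" options, assuming we already hold a list of scraped records (the records and file names here are invented).

import json

items = [{'name': 'Tom', 'score': 90}, {'name': 'Bob', 'score': 85}]  # pretend these were scraped

# plain text, one record per line
with open('result.txt', 'w', encoding='utf-8') as f:
    for item in items:
        f.write('{name}\t{score}\n'.format(**item))

# a specific format: JSON, keeping the original structure
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)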
urllib is a request library built into Python.
urllib.request ----------> the request module
urllib.error ------------> the exception handling module
urllib.parse ------------> the URL parsing module
urllib.robotparser ------> the robots.txt parsing module (see the sketch below)
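These notes never come back to urllib.robotparser, so here is a minimal sketch of what it does; the robots.txt URL is just an example.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
# ask whether a given user agent is allowed to fetch a URL
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))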
(1) Function prototype
urllib.request.urlopen(url,data,timeout...)
(2) Example 1: a GET request

import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode('utf-8'))
(3) Example 2: a POST request

import urllib.request
import urllib.parse
from pprint import pprint

data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
res = urllib.request.urlopen('https://httpbin.org/post', data=data)
pprint(res.read().decode('utf-8'))
(4) Example 3: setting a timeout

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(res.read().decode('utf-8'))
(5) Example: getting the status code, response headers and response body

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get")
print(res.status)
print(res.getheaders())
print(res.getheader('Server'))
# read() returns bytes, so decode('utf-8') is needed to turn the body into a string
print(res.read().decode('utf-8'))
(6) The Request object

from urllib import request, parse
from pprint import pprint

url = "https://httpbin.org/post"
headers = {
    'User-Agent': 'hello world',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Tom',
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
res = request.urlopen(req)
pprint(res.read().decode('utf-8'))
(1) Proxies

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
res = opener.open('https://www.taobao.com')
print(res.read())
(2) Cookies
1. Getting cookies

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)
2. Saving cookies to a text file
Format 1:

import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Format 2:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
3. Using the cookies stored in a file

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
(3) Exception handling
1. Example 1: URLError

from urllib import request
from urllib import error

try:
    request.urlopen("http://httpbin.org/xss")
except error.URLError as e:
    print(e.reason)
2. Example 2: HTTPError

from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/xss')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
3. Example 3: checking the exception type

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
(4) URL parsing utilities
1. urlparse

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
2. urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
3. urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
4. urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
This library is built on top of urllib3 and smooths over urllib's rather clunky API: things like setting cookies or using a proxy take only a few simple lines, which is very convenient.
(1) Getting response information

import requests

res = requests.get("http://www.baidu.com")
print(res.status_code)
print(res.text)
print(res.cookies)
(2) The various request methods

import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.head("http://httpbin.org/get")
requests.delete("http://httpbin.org/delete")
requests.options("http://httpbin.org/get")
(3) GET requests with parameters

import requests

params = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456'
}
res = requests.get('http://httpbin.org/get', params=params)
print(res.text)
(4) Parsing JSON

import requests

res = requests.get("http://httpbin.org/get")
print(res.json())
(5) Getting binary data

import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
(6) Adding headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)
(7) POST requests

import requests

data = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456',
}
res = requests.post('http://httpbin.org/post', data=data)
print(res.text)
(8) Response attributes

import requests

data = {
    'id': 1,
    'user': 'Tom',
    'pass': '123456',
}
res = requests.post('http://httpbin.org/post', data=data)
print(res.text)
print(res.status_code)
print(res.headers)
print(res.cookies)
print(res.history)
print(res.url)
(9) Response status codes
Every status code has one or more names; we can refer to a code by name when checking the result.
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),
Example:

import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')
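The one-liner above is a bit contorted; a more typical way to use the named codes might look like the following sketch (same URL, same idea).

import requests

response = requests.get('http://www.jianshu.com/hello.html')
if response.status_code == requests.codes.not_found:
    print('404 Not Found')
elif response.status_code == requests.codes.ok:
    print('Request Successfully')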
(1) File upload

import requests

files = {'file': open('E:\\1.png', 'rb')}
res = requests.post('http://httpbin.org/post', files=files)
print(res.text)
(2) Getting cookies

import requests

res = requests.get("http://www.baidu.com")
for key, value in res.cookies.items():
    print(key + "=" + value)
(3) Session persistence
This usage matters a lot: it is bound to come up whenever we simulate a login, and it also shows up constantly when writing scripts for CTF challenges, so it deserves a slightly more detailed explanation.
One thing to be clear about with requests.get: every call is effectively a freshly opened browser, so a cookie set during one requests.get does not carry over to the next request. Look at the following example.
Example:

import requests

# set a cookie here
requests.get('http://httpbin.org/cookies/set/number/123456789')
# issue another request to see whether the cookie we just set is carried along
res = requests.get('http://httpbin.org/cookies')
print(res.text)
Output:

{ "cookies": {} }

As the analysis above predicted, the cookie set by the first request did not take effect in the second request. What can we do? The Session() object solves exactly this problem.
Example:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
res = s.get('http://httpbin.org/cookies')
print(res.text)

Output:

{ "cookies": { "number": "123456789" } }
(4) Certificate verification
When we visit an HTTPS site, the browser first verifies the site's certificate; if the certificate was not issued by a trusted authority, the browser shows a warning page instead of the site, and a crawler will raise an exception. If we want the crawler to ignore the certificate problem and continue, we have to configure it to do so.
1. Skipping certificate verification

import requests

response = requests.get('https://www.heimidy.cc/', verify=False)
print(response.status_code)

This still produces a warning, which we can silence by importing urllib3 and calling its disable_warnings() method.

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.heimidy.cc/', verify=False)
print(response.status_code)
2. Verifying against a locally specified certificate

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)
(5) Proxy settings
Besides the common HTTP and HTTPS proxies, we can also use SOCKS proxies, which requires pip-installing the requests[socks] extra.

import requests

proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "https://127.0.0.1:1080"
}
res = requests.get("https://www.google.com", proxies=proxies)
print(res.status_code)
One thing I have not resolved: using a SOCKS proxy to reach Google fails for me with the error below.
Example:

import requests

proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080"
}
res = requests.get("https://www.google.com", proxies=proxies, verify=False)
print(res.status_code)

Output:

SSLError: SOCKSHTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

I tried a few workarounds without success; this needs further investigation.
(6) Timeout settings

import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
(7) Basic authentication
Example 1:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

Example 2:

import requests

r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)
(8) Exception handling

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')
A regular expression is a logical formula for operating on strings: from a set of predefined special characters, and combinations of them, we build a rule string that expresses a filtering rule to apply to other strings. In Python this is done with the re library.
re.match(pattern, string, flags=0)
(1) Ordinary matching
span() returns the range of the match and group() returns the matched text.
Example:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match(r'^\w{5}\s\d{3}\s\d{4}\s\w{10}.*Demo$', content)
print(res.span())
print(res.group())
(2) Generic matching

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match(r'^Hello.*Demo$', content)
print(res.span())
print(res.group())
(3) Matching specific content
To capture a specific part of the match, wrap it in parentheses.

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^Hello\s(\d+)\s.*Demo$', content)
print(res.span(1))
print(res.group(1))
(4) Greedy vs. non-greedy matching
Greedy mode means that .* matches as many characters as it can. Look at the following example.
Example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^He.*(\d+).*Demo$', content)
print(res.span(1))
print(res.group(1))

Output:

(12, 13)
7
We meant to capture 1234567, but we only got 7, because the greedy .* before the group swallowed 123456. To fix this we can add ? after .* to make it non-greedy.
Example:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match(r'^He.*?(\d+).*Demo$', content)
print(res.span(1))
print(res.group(1))

Output:

(6, 13)
1234567
(5) Match flags
Match flags take care of details such as whether matching is case-insensitive and whether . can match a newline (a case-insensitive sketch appears after this example).
Example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
res = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
print(res.span(1))
print(res.group(1))

Output:

(6, 13)
1234567
As you can see, .* normally cannot match a newline, but once we pass the re.S flag the match works fine.
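The other detail mentioned above, case-insensitive matching, works the same way through the re.I flag; a minimal sketch with a made-up string:

import re

content = 'hello WORLD'
# re.I makes the match case-insensitive
res = re.match(r'^Hello\s+world$', content, re.I)
print(res.group())   # hello WORLD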
(6) Escaping special characters
If the target string contains characters that are special in regular expressions, we have to escape them.

import re

content = 'price is $5.00'
res = re.match(r'price is \$5\.00', content)
print(res.group())
re.match has one drawback: it only matches from the beginning of the string, so if the pattern does not match the very first characters, nothing in the middle can be matched either. That is why we have another weapon, re.search, which scans the whole string and returns the first successful match.
Example:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.search(r'Hello.*?(\d+).*?Demo', content)
print(res.group(1))

Output:

1234567

Because this behavior greatly reduces the effort of writing patterns, prefer search over match whenever search will do.
Matching practice:
Example:

import re

html = '''...'''   # an HTML song list: several <li> items, most of which contain <a href="/N.mp3" singer="...">song title</a>
res = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
print(res.group(1), res.group(2))

Output:

任贤齐 沧海一声笑
Unlike the previous two functions, re.findall scans the whole string and returns a list of every matching substring.
Matching practice 1:
Example:

import re

html = '''...'''   # the same HTML song list as above
results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
for result in results:
    print(result[0], result[1], result[2])

Output:

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
/2.mp3 任贤齐 沧海一声笑
/3.mp3 齐秦 往事随风
/4.mp3 beyond 光辉岁月
/5.mp3 陈慧琳 记事本
/6.mp3 邓丽君 但愿人长久
Matching practice 2:
Example:

import re

html = '''...'''   # the same HTML song list as above; the first <li> has no <a> around its song name
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
for result in results:
    print(result[1])

Output:

一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
re.sub replaces every match of the pattern in the string and returns the resulting string.
Example 1:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.sub(r'\d+', 'K0rz3n', content)
print(res)

Output:

Extra stings Hello K0rz3n World_This is a Regex Demo Extra stings
Sometimes the replacement needs to keep the original matched text; for that we use a backreference.
Example 2:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)

Output:

Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings
re.compile turns a pattern string into a compiled pattern object so that it can be reused conveniently later.
Example:

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile('Hello.*Demo', re.S)
res = re.match(pattern, content)
print(res.group(0))

import requests
import re

content = requests.get('http://book.douban.com/').text
pattern = re.compile(r'...', re.S)   # a pattern that captures each book's link, title, author and publication date on the Douban Books homepage
results = re.findall(pattern, content)
for result in results:
    print(result)
BeautifulSoup is a handy web page parsing library that lets us extract information from pages without writing regular expressions.
Example:

html = """..."""   # the classic "The Dormouse's story" snippet: a <title>, a <p class="title" name="dromouse"> heading, and a story <p> containing three <a> links (Elsie, Lacie, Tillie)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output: the prettified HTML, followed by the title text "The Dormouse's story".
(1) Selecting elements
Elements are selected with soup.<tag> attribute access; if several tags match, only the first one is returned.
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(soup.title)
print(soup.p)

Output: the <head> tag, the <title> tag, and the first <p> tag.
(2) Getting attributes
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
(3) Getting content
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:

The Dormouse's story
(4) Nested selection
Example:

html = """..."""   # the same "Dormouse's story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:

The Dormouse's story
(5) Getting children and descendants
1. contents
This returns the tag's direct children as a list.
Example:

html = """..."""   # a "story" snippet: a <title> plus a single <p class="story"> whose children are text nodes and three <a> links (Elsie, Lacie, Tillie)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output: a list containing the text nodes and the three <a> tags.
2. children
This returns the children as an iterator.
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output: the iterator object, followed by the index and content of each child (text nodes and the three <a> tags).
3. descendants
This returns all descendants; unlike children above, it also walks down into grandchildren.
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output: every descendant node, i.e. each <a> tag followed by the text it contains.
(6) Parent and ancestor nodes
1. parent
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output: the <p class="story"> paragraph that contains the first <a> tag.
2. parents
This yields every ancestor node (here printed as an enumerated list).

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))
(7) Sibling nodes
Example:

html = """..."""   # the same "story" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

Output: the following siblings (text nodes plus the Lacie and Tillie <a> tags), then the single preceding text sibling.
The tag selectors above are fast, but they are coarse and rarely flexible enough in practice, so we need more powerful selectors.
find_all( name , attrs , recursive , text , **kwargs )
Documents can be searched by tag name, attributes, or text.
(1) name
Example 1:

html = '''...'''   # a "panel" snippet: a "Hello" heading plus two <ul class="list"> lists (id="list-1" name="elements" with items Foo, Bar, Jay; id="list-2" with items Foo, Bar)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
soup.find_all('ul')

Output: a list containing the two <ul> tags.
If we also want the tags nested inside, we can call find_all() again on each <ul> we just found.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('ul'):
    print(i.find_all('li'))

Output: the list of <li> tags inside each <ul>.
(2) attrs
Pass in the attribute key/value pair you want to match and the element is located.
Example 1:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output: both calls return the first <ul> (id="list-1", name="elements").
Alternatively, if that feels verbose, we can simply pass attribute=value keyword arguments to locate elements.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='panel-heading'))

Output: the <ul id="list-1"> list and the panel-heading <div>.
Note:
class is a Python keyword, so we cannot pass class directly as an argument name without causing a conflict; that is why the argument is written class_.
(3) text
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

Output:

['Foo', 'Foo']
(4) Others
find( name , attrs , recursive , text , **kwargs )
find_all() returns all matching elements, while find() returns a single element.
find_parents() / find_parent()
find_parents() returns all ancestors; find_parent() returns the direct parent.
find_next_siblings() / find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() / find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() / find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first one.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.
(a short sketch of a few of these variants follows)
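A minimal sketch of a few of these variants, using a small two-paragraph snippet invented for the purpose:

from bs4 import BeautifulSoup

html = '<div><p id="p1">first <b>one</b></p><p id="p2">second</p></div>'
soup = BeautifulSoup(html, 'lxml')

b = soup.find('b')                                   # a single element, unlike find_all()
print(b.find_parent('p')['id'])                      # p1: the nearest matching ancestor
print(soup.find('p').find_next_sibling('p')['id'])   # p2: the first following sibling
print(b.find_previous(text=True))                    # 'first ': the first text node before <b>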
(1) Basic usage
Pass a CSS selector directly to select() to make a selection.
Example 1:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel-heading'))
print(soup.select('#list-1'))
print(soup.select('li'))

Output: the panel-heading <div>, the <ul id="list-1"> list, and all five <li> elements, each wrapped in a list.
Example 2:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output: the list of <li> tags inside each <ul>.
(2) Getting attributes
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:

list-1
list-1
list-2
list-2
(3) Getting content
Example:

html = '''...'''   # the same "panel" snippet as above
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
Jay
Foo
Bar
PyQuery is another powerful page parsing library; its syntax is ported straight from jQuery, which makes it an excellent choice for developers who already know jQuery.
(1) Initializing from a string
Example:

html = '''...'''   # a list snippet: <div class="wrap"><div id="container"><ul class="list"> with five <li> items, "first item" through "fifth item"; the third has class "item-0 active" and wraps an <a href="link3.html">
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li'))

Output: the five <li> elements.
(2) Initializing from a URL
Example:

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

Output: the <head> of the Baidu homepage (its title is 百度一下,你就知道).
(3) Initializing from a file
Example:

from pyquery import PyQuery as pq

doc = pq(filename='demo.html')
print(doc('li'))
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('#container .list li'))

Output: the five <li> elements inside #container .list.
(1) Child elements
Example 1:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list').find('li')
print(li)

Output: the five <li> elements.
Besides find(), we can also use the children() method.
Example 2:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
lis = items.children('.active')
print(lis)

Output: only the <li> children that carry the active class.
(2) Parent elements
Example 1:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)

Output: the <div id="container"> that directly wraps the list.
parent() returns the direct parent, whereas parents() returns all ancestors.
Example 2:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
parents = items.parents('.wrap')
print(parents)

Output: the ancestor <div class="wrap">.
(3) Sibling nodes
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

Output: the sibling <li> that also carries the active class.
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
lis = doc('li').items()
for i in lis:
    print(i)

Output: each of the five <li> elements, printed in turn.
(1) Getting attributes
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.list .item-0.active a')
print(a.attr.href)
print(a.attr('href'))

Output:

link3.html
link3.html
(2) Getting text
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.item-0.active a')
print(a.text())

Output:

third item
(3) Getting HTML
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li.html())

Output: the inner HTML of the active <li>, i.e. the link around "third item".
(1) addClass and removeClass
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

Output: the <li> printed with, then without, then again with the active class.
(2) attr and css
Example:

html = '''...'''   # the same list snippet as above
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.css('font-size', '14px')
print(li)

Output: the <li> printed as-is, then with a name="link" attribute, then with an inline font-size style added.
(3) remove
Example:

html = '''<div class="wrap">Hello, World<p>This is a paragraph.</p></div>'''
from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())

Output:

Hello, World This is a paragraph.
Hello, World
(4) Others
https://pyquery.readthedocs.io/en/latest/api.html
This library drives real browser engines (and PhantomJS) for automated testing; for crawling, its main use is to solve the problem that pages rendered by JavaScript cannot be scraped directly.
Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    input = browser.find_element_by_id('kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()
from selenium import webdriver

browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.baidu.com")
print(browser.page_source)
browser.close()
(1) Finding a single element
Example 1:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element_by_id('q')
input_second = browser.find_element_by_css_selector('#q')
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first, input_second, input_third)
browser.close()

Output: three WebElement objects, all pointing at the same search box.
Note:
There are several other single-element lookup methods as well:
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Example 2:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
print(input_first)
browser.close()

Output: the WebElement for the search box.
(2) Finding multiple elements
Example 1:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

Example 2:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
print(lis)
browser.close()
Note:
Besides the methods above, these are the common ways to find multiple elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
(3) Interacting with elements
We can call interaction methods on the elements we find.
Example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
input = browser.find_element_by_id('q')
input.send_keys('iphone')
time.sleep(1)
input.clear()
input.send_keys('ipad')
button = browser.find_element_by_class_name('btn-search')
button.click()
Note:
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webelement
(4) Action chains
Actions are appended to an ActionChains object and executed in order; this is the standard way to simulate mouse and keyboard operations with selenium.
Example:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()
Note:
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains
ActionChains method list
click(on_element=None) - click the left mouse button
click_and_hold(on_element=None) - click and hold the left mouse button
context_click(on_element=None) - click the right mouse button
double_click(on_element=None) - double-click the left mouse button
drag_and_drop(source, target) - drag to an element and release
drag_and_drop_by_offset(source, xoffset, yoffset) - drag to an offset and release
key_down(value, element=None) - press a key
key_up(value, element=None) - release a key
move_by_offset(xoffset, yoffset) - move the mouse from its current position by an offset
move_to_element(to_element) - move the mouse onto an element
move_to_element_with_offset(to_element, xoffset, yoffset) - move to a position offset from an element's top-left corner
perform() - perform all actions queued in the chain
release(on_element=None) - release the left mouse button on an element
send_keys(keys_to_send) - send keys to the element that currently has focus
send_keys_to_element(element, keys_to_send) - send keys to the given element
(a short sketch stringing a few of these together follows this list)
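A minimal sketch that strings a few of these methods together; the element ids used here are placeholders and would need to match whatever page you actually drive.

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')

menu = browser.find_element_by_id('s-usersetting-top')   # placeholder id of an element to hover over
box = browser.find_element_by_id('kw')                   # placeholder id of an input box

actions = ActionChains(browser)
actions.move_to_element(menu)        # hover the mouse over the first element
actions.click(box)                   # click the input box
actions.send_keys('selenium')        # type into the element that now has focus
actions.perform()                    # run the queued actions in order
browser.close()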
(5) Executing JavaScript
When there is no ready-made API for something, we can run JavaScript ourselves, for example to drag a scroll bar to the bottom of the page.
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')
browser.close()
(1) Getting attributes
Example:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))
browser.close()
(2) Getting text
Example:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)
browser.close()
(3) Getting the ID, location, tag name and size
Example:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)
browser.close()
If the page contains a frame or iframe, we must switch into that frame before we can operate on anything inside it.
Example:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)
browser.close()
(1) Implicit waits
Implicit waits are aimed at pages that load content via Ajax: when webdriver looks for an element (or elements) that is not immediately present, an implicit wait keeps polling the DOM for the configured amount of time before giving up; the default is 0.
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)
browser.close()

Output: the WebElement, once it has been found.
(2) Explicit waits
An explicit wait specifies a condition and a maximum wait time; an exception is raised only if the condition still does not hold when the time runs out.
Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)
browser.close()

Output: the two WebElements, once their conditions are met.
Note:
Common wait conditions:
title_is - the title is exactly the given text
title_contains - the title contains the given text
presence_of_element_located - the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
visibility_of_element_located - the element is visible; takes a locator tuple
visibility_of - the element is visible; takes an element object
presence_of_all_elements_located - all matching elements are present
text_to_be_present_in_element - the element's text contains the given string
text_to_be_present_in_element_value - the element's value attribute contains the given string
frame_to_be_available_and_switch_to_it - the frame is available; switch into it
invisibility_of_element_located - the element is invisible
element_to_be_clickable - the element is clickable
staleness_of - the element is no longer attached to the DOM; useful for checking whether the page has refreshed
element_to_be_selected - the element can be selected; takes an element object
element_located_to_be_selected - the element can be selected; takes a locator tuple
element_selection_state_to_be - takes an element object and a state; True when they match
element_located_selection_state_to_be - takes a locator tuple and a state; True when they match
alert_is_present - an alert is present
(a small sketch using title_contains follows the documentation link below)
Official documentation:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions
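A small sketch using one more of these conditions, title_contains; the page and the expected title fragment are only examples.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
wait = WebDriverWait(browser, 10)
# block until the page title contains the given text (or raise TimeoutException)
wait.until(EC.title_contains('百度'))
print(browser.title)
browser.close()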
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("http://www.baidu.com")
browser.get("http://www.taobao.com")
browser.get("http://www.zhihu.com")
browser.back()
browser.forward()
browser.close()
Example:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
print(browser.get_cookies())
browser.add_cookie({'name': 'Tom', 'pass': '123456', 'value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
browser.close()
Example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])
browser.get('http://www.taobao.com')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('http://httpbin.org')
browser.close()

Output:

['CDwindow-3FCC47842DFF6841B4C86EE72CB7DB93', 'CDwindow-CCFA4494DE4B6C99494BE87524153E4E']
Example:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()

Output:

No Element
Note:
Official documentation: http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions