Questions:
1. urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
   explanation of its parameters
2. urlparse(urlstring[, scheme[, allow_fragments]])
   explanation of its parameters
3. JS rendering
The attachment contains the author's annotations and explanations of some of the content and methods.
Brief approach: when fetching a target resource, some sites use anti-scraping mechanisms. To bypass them, disguise yourself as a normal user: construct the request identity via headers (User-Agent, data, Referer, status), carry cookies, switch proxies, and so on. For example:
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Referer': 'http://www.zhihu.com/articles'}
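A minimal sketch of sending a disguised request with these headers (the target URL is illustrative):
import urllib.request
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Referer': 'http://www.zhihu.com/articles'
}
# attach the headers when building the Request so the server sees a "normal" browser
req = urllib.request.Request('http://www.zhihu.com/articles', headers=headers)
response = urllib.request.urlopen(req)
print(response.status)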
A Request is built from:
URL
request method (GET, HEAD, PUT, DELETE, POST, OPTIONS)
request headers
request body
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
which is equivalent to (and preferable to):
response = urllib.request.urlopen('url', data=..., timeout=...)
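A self-contained sketch of this equivalence (using httpbin.org/post, as the later examples do):
from urllib import request, parse
url = 'http://httpbin.org/post'
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf-8')
# via an explicit Request object
req = request.Request(url=url, data=data, method='POST')
print(request.urlopen(req).status)
# equivalent direct call; passing data makes it a POST
print(request.urlopen(url, data=data).status)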
Saving cookies to a file for repeated logins (see the Cookies section below)
Timeout setting (timeout) and error reporting (the urllib.error exception-handling module), both aimed at crawler authors
Urllib in Detail
urllib is Python's built-in HTTP request library, made up of:
urllib.request      the request module
urllib.error        the exception-handling module
urllib.parse        the URL-parsing module
urllib.robotparser  the robots.txt-parsing module (see the sketch below)
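urllib.robotparser is listed here but not demonstrated below, so here is a minimal sketch (assuming the site's robots.txt is reachable):
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # fetch and parse robots.txt
# can_fetch() reports whether the given user agent may crawl the URL
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))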
Changes compared with Python 2
In Python 2: urllib2.urlopen('url')
In Python 3 this becomes: urllib.request.urlopen('url')
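For code that must run under both versions, a common compatibility pattern (a minimal sketch):
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
response = urlopen('http://www.baidu.com')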
Usage of urlopen
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
Example 1: GET request
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
Example 2: POST request
import urllib.request
import urllib.parse
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
Example 3: timeout setting
import urllib.request
response = urllib.request.urlopen('http://baidu.com', timeout=1)
print(response.read())
Example 4: timeout error
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TimeOut')
Response
Response type:
import urllib.request
response = urllib.request.urlopen('url')
print(type(response))
Status code and response headers:
import urllib.request
response = urllib.request.urlopen('url')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
Example 2:
import urllib.request
response = urllib.request.urlopen('https://python.org')
print(response.read().decode('utf-8'))
Request: use it to send more complex requests, for example ones carrying custom headers
Example 1:
import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Example 2:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'xxxxxxx',
    'Host': 'httpbin.org'
}
params = {
    'name': 'feng'
}
data = bytes(parse.urlencode(params), encoding='utf-8')  # encode the form parameters to submit
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Example 3:
from urllib import request, parse
url = 'http://httpbin.org/post'
params = {'name': 'feng'}
data = bytes(parse.urlencode(params), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'xxxxxxx')  # headers can also be added after constructing the Request
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Handler usage: proxies
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
Using Cookies
Example 1: getting cookies
import http.cookiejar, urllib.request
# declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# use urllib.request's HTTPCookieProcessor to create a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
response = opener.open('url')
for item in cookie:
    print(item.name + '=' + item.value)
Example 2: saving cookies to a txt file in Mozilla (Firefox) format
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
The official documentation explains:
ignore_discard: save even cookies set to be discarded.
ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.
So ignore_discard saves a cookie even when it is marked to be discarded, and ignore_expires saves cookies even when they have expired; in both cases the file is overwritten if it already exists. Here we set both to True. After running, the cookies are saved to cookie.txt; open the file to inspect its contents.
Example 3: another cookie-saving format, LWP 2.0
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)
Example 4: loading a cookie file
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
# load the cookies from the file into the CookieJar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
Exception Handling
Example 1:
from urllib import request, error
try:
    response = request.urlopen('http://aaaaaaa.com/sss.html')
except error.URLError as e:
    print(e.reason)
Example 2:
from urllib import request, error
try:
    response = request.urlopen('http://aaalckjvz.com/aaa.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
Example 3:
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
URL Parsing
urlparse and urlunparse
from urllib.parse import urlparse
result = urlparse('url')
print(type(result), result)
Its parameters:
result = urlparse('url', scheme='https')  # scheme supplies a default protocol when the URL omits one (no 'http://' prefix needed)
result = urlparse('url', scheme='http')
result = urlparse('url', allow_fragments=False)  # if the URL has query parameters, the fragment is folded into the query
result = urlparse('url', allow_fragments=False)  # if it has none, the fragment is folded into the path
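A concrete sketch of both parameters (the URLs are illustrative):
from urllib.parse import urlparse
# scheme= only supplies a default when the URL itself has no protocol
print(urlparse('www.baidu.com/index.html', scheme='https'))
# -> ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', ...)
# with a query present, the fragment is folded into the query
print(urlparse('http://www.baidu.com/index.html?a=1#comment', allow_fragments=False))
# -> query='a=1#comment', fragment=''
# with no query, the fragment is folded into the path
print(urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False))
# -> path='/index.html#comment', fragment=''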
urlunparse
Used to assemble a URL from its six parts.
Example:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=1', 'comment']
print(urlunparse(data))  # http://www.baidu.com/index.html;user?a=1#comment
urljoin: a method for joining (combining) URLs
The pieces should belong to the same site; if the second argument is an absolute URL, its scheme and host override the base's.
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://www.baidu.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))
print(urljoin('www.baidu.com#comment', '?category=2'))
urlencode: another URL-building helper; it converts a dict into GET request parameters
from urllib.parse import urlencode
params = {
    'name': 'feng',
    'age': 18
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=feng&age=18