Questions: 1. urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

Explanation of its parameters

     2. urlparse(urlstring[, scheme[, allow_fragments]])

Explanation of its parameters

    3. JS rendering

The attachment contains the author's annotations and explanations of some of the content and methods.


Brief outline: when fetching the resources you want, some sites have anti-crawling mechanisms. To bypass them, disguise the request as one coming from a normal user: construct the request identity with headers (User-Agent, data, Referer, status), carry cookies, switch proxies, and so on.

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Referer': 'http://www.zhihu.com/articles'}
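A minimal sketch of putting such headers to use (the zhihu URL is only an illustrative target; any page works):

import urllib.request

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Referer': 'http://www.zhihu.com/articles'}
req = urllib.request.Request('http://www.zhihu.com/articles', headers=headers)  # attach the disguised identity
response = urllib.request.urlopen(req)
print(response.status)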



Components of a request:

URL

request method (GET, HEAD, PUT, DELETE, POST, OPTIONS)

request headers

request body



req = request.Request(url=url, data=data, headers=headers, method='POST')

response = request.urlopen(req)

is equivalent to (and more capable than)

response = urllib.request.urlopen('url', timeout=...)  # timeout setting




Saving cookies to a file, to avoid logging in repeatedly


Timeout setting (timeout) and error reporting (the urllib.error exception-handling module) (mainly of interest to crawler authors)


Urllib in detail

urllib is Python's built-in HTTP request library

urllib.request        request module

urllib.error          exception-handling module

urllib.parse          URL-parsing module

urllib.robotparser    robots.txt-parsing module

Changes compared with Python 2

In Python 2 you would write, for example, urllib2.urlopen('url')

In Python 3 this becomes urllib.request.urlopen('url')

 

Usage of urlopen

urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,cadefault=False,context=None)
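A minimal sketch annotating the commonly used parameters (the URL, timeout value, and SSL context here are only placeholders):

import ssl
import urllib.request

context = ssl.create_default_context()  # SSL options for HTTPS; the older cafile/capath/cadefault arguments serve a similar purpose
response = urllib.request.urlopen(
    'https://www.python.org',
    data=None,        # request body as bytes; supplying it turns the request into a POST
    timeout=5,        # seconds to wait for blocking operations before raising an error
    context=context
)
print(response.status)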

 

Example 1: GET request

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

 

Example 2: POST request

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

 

Example 3: timeout setting

import urllib.request

response = urllib.request.urlopen('http://baidu.com', timeout=1)
print(response.read())

 

Example 4: timeout error

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TimeOut')

Response

Response type:

import urllib.request

response = urllib.request.urlopen('url')
print(type(response))

 

Status code and response headers

import urllib.request

response = urllib.request.urlopen('url')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

 

Example 2:

import urllib.request

response = urllib.request.urlopen('https://python.org')
print(response.read().decode('utf-8'))

 

 

 

Request: used when you need to send a more complex request, for example one carrying custom headers

Example 1:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

 

Example 2:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'xxxxxxx',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'feng'
}
data = bytes(parse.urlencode(dict), encoding='utf-8')  # encode the submitted parameters
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

 

Example 3 (adding the header with add_header instead of the headers argument):

from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {'name': 'feng'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'xxxxxxx')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

 

 

Handler usage: proxies

 

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
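As a side note, the opener can also be installed globally so that ordinary urlopen calls go through the proxy (this reuses the proxy_handler defined above and assumes that proxy address is actually reachable):

import urllib.request

opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # from here on, urllib.request.urlopen() uses this opener
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read())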

 

 

 

Using cookies

Example 1: getting cookies

import http.cookiejar, urllib.request

# create a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# use urllib.request's HTTPCookieProcessor to build a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
response = opener.open('url')
for item in cookie:
    print(item.name + '=' + item.value)

 

Example 2: saving cookies to a txt file in the Mozilla/Firefox format

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)

The official explanation is as follows:

ignore_discard: save even cookies set to be discarded.

ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.

So ignore_discard means the cookies are saved even if they are marked to be discarded, and ignore_expires means they are saved even if they have already expired; if the file already exists it is overwritten. Here we set both to True. After running, the cookies are saved to cookie.txt; let's take a look at its contents.
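The saved file looks roughly like this (the cookie names, values, and expiry times below are made up for illustration; only the Netscape-format layout is the point):

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3718194834	BAIDUID	XXXXXXXXXXXX:FG=1
.baidu.com	TRUE	/	FALSE	3718194834	BIDUPSID	XXXXXXXXXXXX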

 

 

Example 3: another way to save cookies, in the LWP 2.0 format

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('url')
cookie.save(ignore_discard=True, ignore_expires=True)

 

Example 4: reading a cookie file

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
# load the cookies from the file into the CookieJar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

 

Exception handling

Example 1:

from urllib import request, error

try:
    response = request.urlopen('http://aaaaaaa.com/sss.html')
except error.URLError as e:
    print(e.reason)

 

Example 2 (HTTPError is a subclass of URLError, so it has to be caught first):

from urllib import request, error

try:
    response = request.urlopen('http://aaalckjvz.com/aaa.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

Example 3:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

 

 

URL parsing

urlparse and urlunparse

from urllib.parse import urlparse

result = urlparse('url')
print(type(result), result)

 

Its parameters:

result = urlparse('url', scheme='https')  # scheme supplies a default protocol, so the 'http://' prefix can be left off the URL; it is ignored if the URL already carries one
result = urlparse('url', scheme='http')
result = urlparse('url', allow_fragments=False)  # when the URL has query parameters, the fragment is folded into the query
result = urlparse('url', allow_fragments=False)  # when the URL has no query parameters, the fragment is folded into the path
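A concrete run, assuming a URL with all six components, looks like this:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.path)  # '/index.html#comment' -- the fragment stays attached to the path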

 

urlunparse

Used to assemble a URL from its components.

Example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=1', 'comment']
print(urlunparse(data))
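The six elements map to scheme, netloc, path, params, query, and fragment, so for the data above this should print:

http://www.baidu.com/index.html;user?a=1#comment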

 


urljoin: a method for joining URLs, or rather combining them

If the second argument is itself a full URL, its fields (scheme, host, etc.) take priority over the base URL's; otherwise the base URL fills in whatever parts are missing.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.badiu.com', 'https://www.baidu.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))
print(urljoin('www.baidu.com#comment', '?category=2'))
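The expected output, which shows the second argument's fields taking priority whenever they are present:

http://www.baidu.com/FAQ.html
https://www.baidu.com/FAQ.html
http://www.baidu.com/FAQ.html
www.baidu.com?category=2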

 

 

urlencode: yet another way to build a URL; it converts a dict into GET request parameters

from urllib.parse import urlencode

params = {
    'name': 'feng',
    'age': 18
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
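This should print something like the following (parameter order follows the dict):

http://www.baidu.com?name=feng&age=18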