urllib is Python 3's built-in HTTP request library. It consists of four basic modules: request, error, parse, and robotparser, used respectively for sending requests, handling exceptions, parsing and manipulating URLs, and interpreting the robots protocol.
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
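A minimal sketch of calling urlopen; the URL and timeout below are only examples:
from urllib.request import urlopen

# Send a simple GET request; the URL is just an example target
response = urlopen('https://www.python.org', timeout=10)
print(response.status)                  # HTTP status code
print(response.getheaders())            # response headers
print(response.read().decode('utf-8'))  # response body as text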
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
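A short sketch of building a Request with custom headers and POST data; the URL, headers, and form fields here are only placeholders:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/5.0', 'Host': 'httpbin.org'}
form = {'name': 'alice'}
data = bytes(parse.urlencode(form), encoding='utf-8')

# Wrap everything in a Request object and pass it to urlopen
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
For pages protected by HTTP Basic Auth, HTTPBasicAuthHandler can be combined with build_opener as follows: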
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Register the credentials for this URL (realm=None matches any realm)
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
A proxy can be configured in a similar way with ProxyHandler:
from urllib.request import ProxyHandler, build_opener

# Route both http and https traffic through a local proxy on port 9743
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))
Cookies can be captured and written to a file with http.cookiejar together with HTTPCookieProcessor:
import http.cookiejar, urllib.request

# cookie = http.cookiejar.LWPCookieJar(filename='cookies.txt')  # save cookies in LWP format instead
cookie = http.cookiejar.MozillaCookieJar(filename='cookies.txt')  # save cookies in the Mozilla browser format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Since cookies can be saved to a file, they can of course also be loaded back from that file and reused.
import http.cookiejar, urllib.request

# Load the previously saved cookies and attach them to a new opener
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
The error module defines the exceptions raised by the request module: whenever a request goes wrong, the request module raises an exception defined in error.
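A brief sketch of catching these exceptions; the URL below is only a placeholder for a page that may not exist:
from urllib import request, error

try:
    response = request.urlopen('https://www.example.com/nonexistent-page')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError; catch the more specific one first
    print(e.code, e.reason, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Succeeded')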
The parse module defines an interface for handling URLs: it can extract, combine, and join the various parts of a URL. Commonly used functions include urlparse, urlunparse, urljoin, and urlencode.
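A minimal sketch of urlparse, urljoin, and urlencode; the URLs and query parameters are only examples:
from urllib.parse import urlparse, urljoin, urlencode

# Split a URL into scheme, netloc, path, params, query and fragment
result = urlparse('https://www.example.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path, result.query, result.fragment)

# Resolve a relative link against a base URL
print(urljoin('https://www.example.com/docs/', 'about.html'))

# Serialize a dict of parameters into a query string
params = {'name': 'alice', 'age': 25}
print('https://www.example.com/?' + urlencode(params))
The robotparser module, in turn, interprets a site's robots.txt rules. A typical robots.txt looks like this: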
User-agent: *
Disallow: /
Allow: /public/
The robots.txt above tells all crawlers that only the /public/ directory may be crawled.
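A small sketch of checking such rules programmatically with robotparser; the site and paths are only placeholders:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Point the parser at the site's robots.txt and download it
rp.set_url('https://www.example.com/robots.txt')
rp.read()

# can_fetch() reports whether the given user agent may crawl a URL
print(rp.can_fetch('*', 'https://www.example.com/public/page.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/page.html'))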