Python Web Scraping: Using the urllib Library

This article contains the author's notes based on the Python 静觅 tutorial and is intended for learning purposes only.

The urllib library consists of four modules:

  • urllib.request: the request module
  • urllib.error: the exception-handling module
  • urllib.parse: the URL-parsing module
  • urllib.robotparser (rarely used): the robots.txt-parsing module

This article mainly covers the first three; a minimal robotparser sketch follows this list for completeness.
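The sketch below is not from the original tutorial; it only shows the basic RobotFileParser calls, and the target site is just an example:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()  # fetch and parse robots.txt
# can_fetch(useragent, url) reports whether the rules allow crawling the URL
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))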

1. Using urllib.request

import urllib.request

# urlopen sends a GET request and returns an http.client.HTTPResponse object
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))

Result: the HTML source of the Baidu homepage is printed (screenshot omitted).

A POST request is made by passing the data parameter as URL-encoded bytes:
import urllib.parse
import urllib.request

# supplying data switches urlopen from GET to POST
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

Result: httpbin.org echoes the posted form data back in its JSON response (screenshot omitted).

The timeout parameter limits how long the request may take, in seconds:
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read().decode('utf-8'))

Result: the JSON response from httpbin.org/get is printed (screenshot omitted).

If the server does not respond within the timeout, a URLError is raised and can be caught:
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # a timeout surfaces as a socket.timeout wrapped in URLError
    if isinstance(e.reason, socket.timeout):
        print("TIMEOUT")

Result:
>>> TIMEOUT

2. Status Code and Response Headers

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # all response headers as (name, value) pairs
print(response.getheader('Server'))  # a single header by name

Result: the status code, the full header list, and the value of the Server header are printed (screenshot omitted).

A Request object can also be constructed first and passed to urlopen:
import urllib.request

request = urllib.request.Request('http://www.python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

This likewise fetches the content of the Python website.

Headers and POST data can be supplied when constructing the Request:

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Host': 'httpbin.org'
}
# use a name that does not shadow the built-in dict
params = {'name': 'Germey'}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

Result: httpbin.org echoes back both the form data and the request headers (screenshot omitted).
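As a small variation (not from the original tutorial), headers can also be attached one at a time with Request.add_header():

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = bytes(urllib.parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = urllib.request.Request(url=url, data=data, method='POST')
# add_header(key, value) sets a single request header on the Request object
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))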

3. Proxies

Requests can be routed through a proxy by building an opener with a ProxyHandler:

import urllib.request

# the proxy address below is only an example and is unlikely to still be alive
proxy_handler = urllib.request.ProxyHandler({'http': 'http://61.152.81.193:9100'})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read().decode('utf-8'))

Result: the request is routed through the proxy, so the origin reported by httpbin.org is the proxy's IP (screenshot omitted).
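If every subsequent call to urlopen should go through the proxy, the opener can be installed globally. A minimal sketch (the proxy address is again just a placeholder):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://61.152.81.193:9100'})
opener = urllib.request.build_opener(proxy_handler)
# install_opener makes this opener the default used by urllib.request.urlopen
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read().decode('utf-8'))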

4. Working with Cookies

① Reading cookie information

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
# HTTPCookieProcessor stores cookies from responses into the CookieJar
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

Result: each cookie set by Baidu is printed as a name=value pair (screenshot omitted).

② LWPCookieJar

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# ignore_discard/ignore_expires also save session and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

After running, the cookies are saved to cookie.txt in libwww-perl (LWP) format (screenshot of cookie.txt omitted).

③ MozillaCookieJar

import http.cookiejar, urllib.request

filename = 'cookie1.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
After running, the cookies are saved to cookie1.txt in the Mozilla/Netscape format (screenshot omitted).

④ Loading cookie.txt and requesting a page with it

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
# load the cookies saved earlier; the jar type must match the file format
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

5. Exception Handling with urllib.error

from urllib import request, error

try:
    response = request.urlopen('http://www.51ashin.cn/index.html')
    print('hello')
except error.URLError as e:
    print(e.reason)

Result:
>>> Not Found

from urllib import request, error

try:
    response = request.urlopen('http://www.51ashin.cn/index.html')
    print('hello')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Successful!')

The HTTPError branch (HTTPError is a subclass of URLError) matches first and prints the detailed error information:

Not Found
404
Date: Sat, 22 Jul 2017 08:54:46 GMT
Server: Apache/2.2.15 (CentOS)
Content-Length: 288
Connection: close
Content-Type: text/html; charset=iso-8859-1

Note: because HTTPError is a subclass of URLError, writing the URLError clause first makes it catch the exception, so only "Not Found" is printed.
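A minimal sketch of that ordering pitfall (same URL as above):

from urllib import request, error

try:
    response = request.urlopen('http://www.51ashin.cn/index.html')
except error.URLError as e:       # matches first: HTTPError is a subclass of URLError
    print(e.reason)               # only 'Not Found' is printed
except error.HTTPError as e:      # never reached
    print(e.reason, e.code, e.headers, sep='\n')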

As in section 1, a timeout can be identified by inspecting e.reason:

from urllib import request, error
import socket

try:
    response = request.urlopen('http://www.51ashin.cn/index.html', timeout=0.001)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIMEOUT')

Result:

>>> <class 'socket.timeout'>
>>> TIMEOUT

6. urllib.parse

① urllib.parse.urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

When the URL itself omits the scheme, the scheme argument supplies a default:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;users?id=5#comment', scheme='https')
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='users', query='id=5', fragment='comment')

When the URL already specifies a scheme, the scheme argument has no effect:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;users?id=5#comment', scheme='https')
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='users', query='id=5', fragment='comment')

With allow_fragments=False, the fragment is not split out; it is merged into the query string (or, if there is no query, into the path):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;users?id=5#comment', allow_fragments=False)
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='users', query='id=5#comment', fragment='')

If the URL has no query string, the fragment is merged into the path instead:
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(type(result), result, sep='\n')

Result:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
② urllib.parse.urlunparse

from urllib.parse import urlunparse

# the six fields: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
print(urlunparse(data))

Output:
>>> http://www.baidu.com/index.html;user?id=5#comment
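Since urlunparse is the inverse of urlparse, a parsed URL can be reassembled losslessly. A small round-trip sketch (not from the original tutorial):

from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
# urlunparse accepts any six-element iterable of the fields above
print(urlunparse(urlparse(url)) == url)  # True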

③ urllib.parse.urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'http://www.sina.com.cn'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Result:

http://www.baidu.com/FAQ.html
http://www.sina.com.cn
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
④ urllib.parse.urlencode (converts a dict into URL query parameters)

from urllib.parse import urlencode

params = {'name': 'sam', 'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Result: http://www.baidu.com?name=sam&age=22
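The encoded URL can be passed straight to urlopen to issue a GET request. A short sketch reusing httpbin.org from the earlier examples:

import urllib.request
from urllib.parse import urlencode

params = {'name': 'sam', 'age': 22}
# httpbin.org/get echoes the query parameters back in its JSON response
url = 'http://httpbin.org/get?' + urlencode(params)
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))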
