Python Web Scraping: Basic Usage of the urllib Library

Table of Contents

What is urllib
urlopen
Sending a GET request and reading the response body
Sending a POST request
Checking whether an error is a timeout
The response object
Response type
Status code and response headers
Request (passing headers)
Same result as example 1
Method 1
Method 2
Handler
Getting cookies
Saving cookies
Loading cookies
Exception handling
URL parsing
urlparse
urlunparse
urlencode

What is urllib

urllib is Python's built-in HTTP request library and the most basic library for sending requests. It consists of four modules:

urllib.request: the request module
urllib.error: the exception handling module
urllib.parse: the URL parsing module
urllib.robotparser: the robots.txt parsing module
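As a quick orientation, the minimal sketch below touches each of the four modules once; the URLs are only illustrative targets and the exact output depends on the sites queried.

import urllib.request, urllib.error, urllib.parse, urllib.robotparser

# urllib.parse: build a query string from a dict
query = urllib.parse.urlencode({'q': 'python'})

# urllib.robotparser: check whether a path may be crawled
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.python.org/'))

# urllib.request + urllib.error: send the request and handle failures
try:
    response = urllib.request.urlopen('http://httpbin.org/get?' + query, timeout=10)
    print(response.status)
except urllib.error.URLError as e:
    print(e.reason)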

urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the address to request

data: data for a POST request

timeout: timeout in seconds

Example 1

import urllib.request

response = urllib.request.urlopen('http://www.python.org')

print(response.read().decode('utf-8'))  # read() returns the response body

This sends a GET request and retrieves the content of the response body.

Example 2

import urllib.parse

import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')  # encode the dict as the POST body

response=urllib.request.urlopen('http://httpbin.org/post',data=data)

print(response.read())

Passing the data argument sends the request as a POST.

Example 3

import socket

import urllib.error

import urllib.request

try:

    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # timeout: give up after 0.1 seconds

except urllib.error.URLError as e:  # catch the exception

    if isinstance(e.reason, socket.timeout):  # check whether the underlying error is a timeout

        print('TIME OUT')

This checks whether the error type is a timeout.

The response object

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(type(response))

Output:

<class 'http.client.HTTPResponse'>

The HTTPResponse object carries the status code and the response headers.

Status code and response headers


import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(response.status)

print(response.getheaders())  # all response headers, as a list of tuples

print(response.getheader('Server'))  # a single header value

Output:

200

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48990'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 18 Jan 2019 02:47:58 GMT'), ('Via', '1.1 varnish'), ('Age', '2096'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2126-IAD, cache-lax8625-LAX'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '37, 179'), ('X-Timer', 'S1547779678.133551,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

nginx

Request (passing headers)

Example 4

import urllib.request

request = urllib.request.Request('https://python.org')

response = urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

The result is the same as example 1.

A Request object makes it easier to specify the request method, add header fields, and attach extra data.

Method 1

from urllib import request,parse

url = 'http://httpbin.org/post'

headers = {

    'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',

    'Host':'httpbin.org'

}

dict = {

    'name':'Germey'

}

data = bytes(parse.urlencode(dict),encoding='utf8')

req = request.Request(url=url,data=data,headers=headers,method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

Output:

{

  "args": {},

  "data": "",

  "files": {},

  "form": {

    "name": "Germey"

  },

  "headers": {

    "Accept-Encoding": "identity",

    "Connection": "close",

    "Content-Length": "11",

    "Content-Type": "application/x-www-form-urlencoded",

    "Host": "httpbin.org",

    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"

  },

  "json": null,

  "origin": "183.200.46.48",

  "url": "http://httpbin.org/post"

}

Purpose: pass custom headers along with the request.

Advantage: the logic is clearly structured.

The form data must be URL-encoded and converted to bytes.

Method 2

from urllib import request,parse

url = 'http://httpbin.org/post'

dict = {

    'name':'Germey'

}

data = bytes(parse.urlencode(dict),encoding='utf8')

req = request.Request(url=url,data=data,method='POST')

req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

The output is the same as method 1.

Use case: when there are several header key-value pairs, add them in a for loop, as sketched below.
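A minimal sketch of that pattern, reusing the headers from method 1; the URL and values are the same illustrative ones as above.

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')

# add each header key-value pair in a loop
for key, value in headers.items():
    req.add_header(key, value)

response = request.urlopen(req)
print(response.read().decode('utf-8'))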

Handler

Purpose: switch to a proxy IP so the crawler is less likely to be blocked.

Example

import urllib.request

proxy_handler = urllib.request.ProxyHandler({

    'http':'http://127.0.0.2',

    'https':'https://127.0.0.2'

    })

opener = urllib.request.build_opener(proxy_handler)

response = opener.open('http://baidu.com')

print(response.read())

Handlers are also used to manage cookies. A cookie is a text file stored on the client side that records the user's identity.

In web scraping, cookies are the mechanism for keeping a login session alive.

Getting cookies

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:

    print(item.name + '=' + item.value)

Output:

BAIDUID=79AFFAB6229EB9D1A62E3B973D8BB693:FG=1

BIDUPSID=79AFFAB6229EB9D1A62E3B973D8BB693

H_PS_PSSID=1444_21103_28328_28131_26350_28267_27244

PSTM=1547805103

delPer=0

BDSVRTM=0

BD_HOME=0

Saving cookies

Save the cookies to a file in the current directory.

Format 1

import http.cookiejar,urllib.request

filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)  # Mozilla/Firefox cookie file format

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True,ignore_expires=True)

Format 2

import http.cookiejar,urllib.request

filename = "cookie.txt"

cookie = http.cookiejar.LWPCookieJar(filename)  # LWP format

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True,ignore_expires=True)

Loading cookies

Load the cookies with the same CookieJar class that was used to save them.

import http.cookiejar,urllib.request

cookie = http.cookiejar.LWPCookieJar()

cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))

Exception handling

Catching exceptions keeps the program running when a request fails.

from urllib import request,error

try:

    response = request.urlopen('https://blog.csdn.net/qiao39gs/1234')

except error.URLError as e:

    print(e.reason)

Output:

Not Found
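HTTPError is a subclass of URLError, so a fuller pattern (a sketch, reusing the same test URL) catches it first to also read the status code and headers:

from urllib import request, error

try:
    response = request.urlopen('https://blog.csdn.net/qiao39gs/1234')
except error.HTTPError as e:
    # HTTPError carries the status code and headers in addition to the reason
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # URLError covers lower-level failures such as DNS errors or timeouts
    print(e.reason)
else:
    print('Request succeeded')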

URL parsing

urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Purpose: split a URL into its components.

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')

print(type(result),result)

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

scheme: protocol

netloc: domain name

path: path

params: parameters

query: query string (the part after ?)

fragment: anchor (the part after #)

If the URL itself contains no protocol, the scheme parameter supplies a default one; if the URL already has a protocol, the parameter has no effect, as the sketch below shows.
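A minimal illustration of that behavior; the URLs are just examples.

from urllib.parse import urlparse

# no scheme in the URL: the scheme argument fills it in
print(urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https'))
# ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

# the URL already has a scheme: the argument is ignored
print(urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https'))
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')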

allow_fragments: when set to False, the fragment is not split out but is kept as part of the preceding component (the query, or the path if there is no query).

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment',allow_fragments=False)

print(result)

Output:

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

urlunparse

Purpose: assemble a URL from its components.

from urllib.parse import urlunparse

data = ['http','ww.baidu.com','index.html','user','a=5','comment']

print(urlunparse(data))

Output:

http://ww.baidu.com/index.html;user?a=5#comment

urlencode

Purpose: convert a dict into GET request parameters (a query string).

from urllib.parse import urlencode

params = {

    'name':'germey',

    'age':22

}

base_url = 'http://www.baidu.com?'

url = base_url + urlencode(params)

print(url)

Output:

http://www.baidu.com?name=germey&age=22 

 
