When scraping the Baidu homepage, a call to the get function with only a url argument already gets a successful response, with status code [200]:
>>> import requests
>>> url='https://www.baidu.com/'
>>> r=requests.get(url)
>>> r
<Response [200]>
>>> r.text
'\r\n ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93 å\x85³äº\x8eç\x99¾åº¦ About Baidu
©2017 Baidu 使ç\x94¨ç\x99¾åº¦å\x89\x8då¿\x85读 æ\x84\x8fè§\x81å\x8f\x8dé¦\x88 京ICPè¯\x81030173å\x8f·
\r\n'
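The garbled text above is mojibake: Baidu's homepage is UTF-8, but (most likely because the response does not declare a charset) requests falls back to ISO-8859-1 when building r.text. Setting r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding) before reading r.text repairs it. A minimal offline sketch of the same mistaken decode and its fix:

```python
# 'ç\x99¾åº¦' is what the UTF-8 text "百度" looks like when its
# bytes are mis-decoded as ISO-8859-1 (latin-1), as in r.text above.
mojibake = 'ç\x99¾åº¦'

# Re-encode with the wrong codec to recover the raw bytes,
# then decode those bytes with the correct one.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)  # 百度
```

In practice you rarely need the round trip: assigning the correct codec name to r.encoding makes requests redo the decode for you.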
However, when scraping some other sites, the request may fail to get a proper response, as shown below.
>>> import requests
>>> url='https://bj.lianjia.com/zufang/'
>>> r=requests.get(url)
>>> r
<Response [403]>
>>> r.text
'\r\n\r\n403 Forbidden \r\n\r\n403 Forbidden
\r\nYou don\'t have permission to access the URL on this server. Sorry for the inconvenience.
\r\nPlease report this message and include the following information to us.
\r\nThank you very much!
\r\n\r\n\r\nURL: \r\nhttps://bj.lianjia.com/zufang/ \r\n \r\n\r\nServer: \r\nproxy17-online.mars.ljnode.com \r\n \r\n\r\nDate: \r\n2019/07/03 01:00:14 \r\n \r\n
\r\n
Powered by Lianjia\r\n\r\n'
>>> from bs4 import BeautifulSoup
>>> bs=BeautifulSoup(r.text,'html.parser')
>>> bs
403 Forbidden
403 Forbidden
You don't have permission to access the URL on this server. Sorry for the inconvenience.
Please report this message and include the following information to us.
Thank you very much!
URL:
https://bj.lianjia.com/zufang/
Server:
proxy17-online.mars.ljnode.com
Date:
2019/07/03 01:00:14
Powered by Lianjia
As the output shows, requests.get() did not return status [200]: the site refused access (403 Forbidden). The fix is to add a headers argument to requests.get(), which mimics a normal browser visit and gets past the site's anti-scraping check. The headers values can be found as follows:
The headers values come from the browser's developer tools. On the page you want to scrape, right-click and choose Inspect, switch to the Network tab in the panel that opens, and press F5 to refresh so the requests are captured. Select a request, open its Headers pane, and copy the User-Agent line into your code. Rewrite it as a dictionary, assign it to a variable named headers, and add headers=headers after the url argument of get(). (If User-Agent alone still does not get a proper response, try adding Cookie and other fields to the dictionary; for simplicity you can include every field listed under Request Headers, with both keys and values as strings.) For example:
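The steps above can be sketched as follows. The User-Agent string is just an example copied from a browser, and requests.Request(...).prepare() is used so that, without sending anything over the network, we can confirm the header really is attached to the outgoing request:

```python
import requests

url = 'https://bj.lianjia.com/zufang/'

# Dictionary built from the browser's Network -> Headers panel;
# keys and values are both plain strings.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'),
    # 'Cookie': '...',  # add further Request Headers fields if UA alone fails
}

# Prepare (but do not send) the request to inspect what would go on the wire.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```

Calling requests.get(url, headers=headers) sends exactly this prepared request.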
>>> headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=headers)
>>> r
<Response [200]>
>>> r=requests.get(url,headers)    # passed positionally, the dict binds to params, not headers
>>> r
<Response [403]>
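The reason the positional call fails is visible in the signature requests.get(url, params=None, **kwargs): a second positional argument binds to params, so the dictionary is serialized into the URL's query string instead of being sent as headers. An offline sketch (example.com and the UA value are stand-ins):

```python
import requests

headers = {'User-Agent': 'test-agent'}  # hypothetical UA value

# Passed positionally, the dict lands in `params` and becomes a query string.
p = requests.Request('GET', 'https://example.com/', params=headers).prepare()
print(p.url)                      # the dict reappears as ?User-Agent=test-agent
print('User-Agent' in p.headers)  # False: no User-Agent header was set
```

Since no User-Agent header goes out, the site still sees a bare client and returns 403.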
>>> r=requests.get(url,h=headers)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
r=requests.get(url,h=headers)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'h'
>>> r=requests.get(url,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'})
>>> r
<Response [200]>
>>> r=requests.get(url,'User-Agent'='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,User-Agent='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: invalid syntax
>>> r=requests.get(url,header=headers)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
r=requests.get(url,header=headers)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,header=head)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
r=requests.get(url,header=head)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=head)
>>> r
<Response [200]>
These attempts also show that the argument must be passed exactly in the form headers=<dictionary>; no other form works. Misspelled keywords such as header or h raise a TypeError, the 'key'=value and key:value forms are syntax errors, and a bare positional dictionary is taken as params rather than headers. (The name of the variable holding the dictionary is irrelevant; only the keyword headers matters.)
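For repeated requests to the same site, a requests.Session can carry the headers dictionary so that headers=headers need not be repeated on every call; a sketch (the UA string is a shortened placeholder, and no request is actually sent here):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # shortened example UA string

session = requests.Session()
session.headers.update(headers)  # merged into every request this session sends

# Every subsequent session.get(url) call carries this User-Agent automatically.
print(session.headers['User-Agent'])
```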
Reference: Python web scraping — setting request headers (Headers) in Requests