What to do when requests.get() does not return status code [200] while scraping?

When scraping the Baidu homepage, you can see that get() succeeds with nothing but a url argument, responding with status code [200]:

>>> import requests
>>> url='https://www.baidu.com/'
>>> r=requests.get(url)
>>> r
<Response [200]>
>>> r.text
'\r\n ... ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93 ... \r\n'
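The garbled title in r.text is only a decoding artifact: the page arrived fine with code [200], but requests guessed the wrong text encoding. As an aside (my addition, using only documented requests attributes), the status code and the encoding can be checked in code instead of reading the repr:

import requests

r = requests.get('https://www.baidu.com/')
print(r.status_code)                # 200 when the request succeeded
print(r.ok)                         # True for any status code below 400
print(r.encoding)                   # encoding guessed from the response headers
r.encoding = r.apparent_encoding    # re-guess from the body so Chinese text in r.text is readable
print(r.text[:300])                 # the first part of the HTML, decoded with the better guess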

However, when scraping some other sites, the request may fail to get a proper response, as shown below.

>>> import requests
>>> url='https://bj.lianjia.com/zufang/'
>>> r=requests.get(url)
>>> r
<Response [403]>
>>> r.text
'... 403 Forbidden ...\r\nYou don\'t have permission to access the URL on this server. Sorry for the inconvenience.\r\nPlease report this message and include the following information to us.\r\nThank you very much!\r\n ... URL: https://bj.lianjia.com/zufang/\r\nServer: proxy17-online.mars.ljnode.com\r\nDate: 2019/07/03 01:00:14\r\n ... Powered by Lianjia ...'
>>> from bs4 import BeautifulSoup
>>> bs=BeautifulSoup(r.text)
>>> bs
...
403 Forbidden

You don't have permission to access the URL on this server. Sorry for the inconvenience.
Please report this message and include the following information to us.
Thank you very much!

URL: https://bj.lianjia.com/zufang/
Server: proxy17-online.mars.ljnode.com
Date: 2019/07/03 01:00:14

Powered by Lianjia
...
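Before fixing anything, it helps to make the scraper notice the rejection itself rather than relying on reading the repr; a minimal sketch (my addition, using only standard requests calls):

import requests

r = requests.get('https://bj.lianjia.com/zufang/')
print(r.status_code)              # 403: the server refused the plain request
try:
    r.raise_for_status()          # raises HTTPError for any 4xx/5xx response
except requests.exceptions.HTTPError as err:
    print('request rejected:', err)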

As you can see, requests.get() does not return status [200]: the site refuses access. The fix is to add a headers argument to requests.get() so the request imitates a normal visit to the site and gets past its anti-scraping check. Here is how to find the headers values:

The headers values come from the browser's developer tools. On the page you want to scrape, right-click and choose Inspect, select the Network tab in the panel that opens, and press F5 to refresh so the request list appears. Select a request, find the User-Agent line under Headers, copy and paste it into your code, rewrite it in dictionary format, and assign it to a variable named headers; then add headers=headers after the url argument of get(). (If User-Agent alone still does not get a proper response, try including Cookie and other fields in the dictionary. For simplicity you can include every field under Request Headers; the keys and values of the dictionary are all strings.) As shown below:

[Figure 1: the User-Agent line under Request Headers in the browser's Network panel]
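If User-Agent by itself is not enough, the dictionary can carry more of the fields listed under Request Headers, as the note above suggests. A sketch of what that looks like (the User-Agent string is the one used later in this post; the Accept, Referer, and Cookie values are placeholders to be replaced with your own browser's):

import requests

url = 'https://bj.lianjia.com/zufang/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',  # placeholder
    'Referer': 'https://bj.lianjia.com/',                                         # placeholder
    'Cookie': 'paste the Cookie line from your own browser here',                 # placeholder
}
r = requests.get(url, headers=headers)   # keys and values are all strings
print(r.status_code)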

>>> headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=headers)
>>> r
<Response [200]>
>>> r=requests.get(url,headers)
>>> r
<Response [403]>
>>> r=requests.get(url,h=headers)
Traceback (most recent call last):
  File "", line 1, in 
    r=requests.get(url,h=headers)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'h'
>>> r=requests.get(url,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'})
>>> r
<Response [200]>
 

>>> r=requests.get(url,'User-Agent'='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,User-Agent='Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: keyword can't be an expression
>>> r=requests.get(url,'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
SyntaxError: invalid syntax
>>> r=requests.get(url,header=headers)
Traceback (most recent call last):
  File "", line 1, in 
    r=requests.get(url,header=headers)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,header=head)
Traceback (most recent call last):
  File "", line 1, in 
    r=requests.get(url,header=head)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'header'
>>> head={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
>>> r=requests.get(url,headers=head)
>>> r
<Response [200]>

You can also see that the argument must be passed in the form headers=<dictionary>; none of the other forms work. Passing the dictionary positionally hands it to the params argument of requests.get() (the request goes through, but still without a User-Agent header, so the site returns 403 again), and any other keyword name, such as header or h, raises a TypeError.
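Putting the pieces together, here is a consolidated sketch of the whole workflow (my summary of the steps above; the 'html.parser' argument is the standard-library parser and was not specified in the original session):

import requests
from bs4 import BeautifulSoup

url = 'https://bj.lianjia.com/zufang/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

r = requests.get(url, headers=headers)   # the dictionary must be passed as headers=...
r.raise_for_status()                     # stop right away if the site still answers 403
r.encoding = r.apparent_encoding         # avoid the mojibake seen in the Baidu example
soup = BeautifulSoup(r.text, 'html.parser')
print(r.status_code, soup.title)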

Reference: Python——爬虫【Requests设置请求头Headers】 (a Chinese post on setting request headers with Requests)

 
