python requests 编码格式_Python学习--- requests库中文编码问题

为什么会有ISO-8859-1这样的字符集编码

requests会从服务器返回的响应头的 Content-Type 去获取字符集编码,如果content-type有charset字段那么requests才能正确识别编码,否则就使用默认的 ISO-8859-1. 一般那些不规范的页面往往有这样的问题.

\requests\utils.pydef get_encoding_from_headers(headers):

"""Returns encodings from given HTTP Header Dict.

:param headers: dictionary to extract encoding from.

:rtype: str

"""

content_type = headers.get('content-type')

if not content_type:

return None

content_type, params = cgi.parse_header(content_type)

if 'charset' in params:

return params['charset'].strip("'\"")

if 'text' in content_type:

return 'ISO-8859-1'

如何获取正确的编码

requests的返回结果对象里有个apparent_encoding函数, apparent_encoding通过调用chardet.detect()来识别文本编码. 但是需要注意的是,这有些消耗计算资源.

\requests\models.py

@property

def apparent_encoding(self):

"""The apparent encoding, provided by the chardet library."""

return chardet.detect(self.content)['encoding']

requests的text() 跟 content() 有什么区别?

requests在获取网络资源后,我们可以通过两种模式查看内容。 一个是r.text,另一个是r.content,那他们之间有什么区别呢?

分析requests的源代码发现,r.text返回的是处理过的Unicode型的数据,而使用r.content返回的是bytes型的原始数据。也就是说,r.content相对于r.text来说节省了计算资源,r.content是把内容bytes返回. 而r.text是decode成Unicode. 如果headers没有charset字符集的化,text()会调用chardet来计算字符集,这又是消耗cpu的事情.

通过看requests代码来分析text() content()的区别.# r.text

@property

def text(self):

"""Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using

``chardet``.

The encoding of the response content is determined based solely on HTTP

headers, following RFC 2616 to the letter. If you can take advantage of

non-HTTP knowledge to make a better guess at the encoding, you should

set ``r.encoding`` appropriately before accessing this property.

"""

# Try charset from content-type

content = None

encoding = self.encoding

if not self.content:

return str('')

# Fallback to auto-detected encoding.

if self.encoding is None:

encoding = self.apparent_encoding

# Decode unicode from given encoding.

try:

content = str(self.content, encoding, errors='replace')

except (LookupError, TypeError):

# A LookupError is raised if the encoding was not found which could

# indicate a misspelling or similar mistake.

#

# A TypeError can be raised if encoding is None

#

# So we try blindly encoding.

content = str(self.content, errors='replace')

return content

# Content

@property

def content(self):

"""Content of the response, in bytes."""

if self._content is False:

# Read the contents.

if self._content_consumed:

raise RuntimeError(

'The content for this response was already consumed')

if self.status_code == 0 or self.raw is None:

self._content = None

else:

self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

self._content_consumed = True

# don't need to release the connection; that's been handled by urllib3

# since we exhausted the data.

return self._content

requests中文乱码解决方法

方法一: 直接encode成utf-8格式.r.content.decode(r.encoding).encode('utf-8')

r.encoding = 'utf-8'

方法二:如果headers头部没有charset,那么就从html的meta中抽取.

[python 学习] requests 库的使用

1.get请求 # -*- coding: utf-8 -*- import requests URL_IP = "http://b.com/index.php" pyload = ...

关于requests库中文编码问题

转自:代码分析Python requests库中文编码问题 Python reqeusts在作为代理爬虫节点抓取不同字符集网站时遇到的一些问题总结. 简单说就是中文乱码的问题.   如果单纯的抓取微博 ...

【转】使用Python的Requests库进行web接口测试

你可能感兴趣的:(python,requests,编码格式)