Fixing UnicodeEncodeError: 'ascii' codec can't encode character

Fixing the Python error UnicodeEncodeError: 'ascii' codec can't encode character '\uff1f' in position 22: ordinal not in range(128)

Problem description:

Environment: Python 3.7, run from PyCharm.
Opening a URL with urllib.request.urlopen raises the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1345, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/local/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1240, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/local/lib/python3.7/http/client.py", line 1107, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\uff1f' in position 22: ordinal not in range(128)

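Checking the URL showed that it contains Chinese (full-width) characters that cannot be represented in ASCII; the '\uff1f' in the error message is the full-width question mark. http.client encodes the request line as ASCII before sending it, so the URL has to be percent-encoded first.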
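The check described above can be sketched in a few lines. The helper name find_non_ascii and the example.com URL are made up for illustration and are not from the original post. Note that the position reported in the error message is probably an offset into the HTTP request line rather than into the full URL, so it may not match the index printed here.

```python
def find_non_ascii(url):
    """Return (index, character) pairs for every non-ASCII character in url."""
    return [(i, ch) for i, ch in enumerate(url) if ord(ch) > 127]


# Hypothetical URL containing a full-width question mark (U+FF1F)
url = "http://example.com/search\uff1fq=python"

for index, ch in find_non_ascii(url):
    # Print the offset, the code point, and the character itself
    print(index, hex(ord(ch)), ch)
```

An empty result means the URL is already pure ASCII and the error has a different cause.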

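Solution:

In Python 3, the urllib.parse module parses URLs and can also be used to rebuild them. Here its quote function does the job: it percent-encodes the characters that are not allowed to appear in the URL as-is.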

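import ssl
import string
from urllib import parse, request

# Skip SSL certificate verification (only do this for hosts you trust)
context = ssl._create_unverified_context()

# Assumes check_url.txt is UTF-8 encoded, one URL per line
with open('./check_url.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Drop the trailing newline; otherwise it would end up in the URL
        url = line.strip()
        if not url:
            continue
        # Percent-encode the URL; characters listed in 'safe' are left unchanged
        url = parse.quote(url, safe=string.printable)
        # Open the URL
        response = request.urlopen(url, context=context)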
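
As a possible alternative to safe=string.printable (this is not the method from the original post), the URL can be split first and only the path, query and fragment quoted. The sketch below is a minimal version under a few assumptions: the helper name encode_url is hypothetical, the query uses the usual key=value&key=value form, and the host name is already ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path, query and fragment of url, leaving the rest as-is."""
    parts = urlsplit(url.strip())
    # '%' stays safe so that already-encoded sequences are not encoded twice
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


# Hypothetical URL with Chinese characters in both the path and the query
print(encode_url("http://example.com/搜索?q=中文\n"))
```

Compared with quoting the whole string, this also encodes spaces and other printable characters that are not valid in a URL, at the cost of a little more code.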

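Notes:

  1. In Python 3.7 the default string encoding is already UTF-8, so the Python 2.7 workaround of changing the default encoding (for example with sys.setdefaultencoding) does not apply here.
  2. parse.quote must be given the safe parameter. By default only '/' is treated as safe, so everything except letters, digits and '_.-~' is percent-encoded, including the ':' after the scheme and the '?', '=' and '&' in the query string, and the resulting URL can no longer be opened.
  3. About string.printable: it is the string of all printable ASCII characters, made up of the digits 0~9 (codes 48~57), the 26 uppercase letters (65~90), the 26 lowercase letters (97~122), punctuation and operator symbols, and whitespace (space, tab, line feed, carriage return, vertical tab, form feed). Because whitespace is included, spaces and newlines are not encoded, so strip each line before quoting it.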
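
To make notes 2 and 3 concrete, the following small sketch compares quote with its default safe value and with safe=string.printable. The example.com URL is made up for illustration; it contains the same full-width question mark as the error message.

```python
import string
from urllib import parse

# Hypothetical URL containing a full-width question mark (U+FF1F)
url = "http://example.com/search\uff1fq=python"

# Default safe='/': ':' and '=' are percent-encoded as well, breaking the URL
default_quoted = parse.quote(url)
print(default_quoted)  # http%3A//example.com/search%EF%BC%9Fq%3Dpython

# safe=string.printable: only the non-ASCII character is percent-encoded
printable_quoted = parse.quote(url, safe=string.printable)
print(printable_quoted)  # http://example.com/search%EF%BC%9Fq=python
```

Since '%' is itself part of string.printable, sequences that are already percent-encoded pass through unchanged instead of being encoded a second time.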

Solution reference: https://blog.csdn.net/qq_25406563/article/details/81253347
