A while ago I came across a forum thread in which people compared the 2003 and 2013 TV adaptations of 《天龙八部》. When the ending themes of the two versions were compared, the thread brought up 《宽恕》, the ending theme of the 2003 version. According to the post, the song is sung by 王菲 (Faye Wong), with lyrics by 林夕 and music by 赵季平 (who also wrote the opening theme of the TV series 《关西无极刀》).
In Python 3.x, a str object can be encoded into a bytes sequence by calling its encode method, and a bytes object in turn has a decode method that performs the reverse operation on a byte sequence. An example:
>>> song = '海阔天空'
>>> song_bytes = song.encode('utf-8') # encode the string with UTF-8
>>> song_bytes
b'\xe6\xb5\xb7\xe9\x98\x94\xe5\xa4\xa9\xe7\xa9\xba'
>>> song_bytes.decode('utf-8')
'海阔天空'
In this example, "海阔天空" is encoded as UTF-8, producing the byte sequence \xe6\xb5\xb7\xe9\x98\x94\xe5\xa4\xa9\xe7\xa9\xba. So if we create a text file "song.txt", type "海阔天空" into it, save it as UTF-8, and then open it with any editor that can show the file's binary contents, we will see exactly these bytes: E6B5B7E99894E5A4A9E7A9BA.
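We can verify this without leaving the interpreter (my own quick experiment; song.txt here is just a scratch file written to the current directory):
>>> with open('song.txt', 'w', encoding='utf-8') as f:
...     f.write('海阔天空')
...
4
>>> with open('song.txt', 'rb') as f:
...     raw = f.read()
...
>>> raw
b'\xe6\xb5\xb7\xe9\x98\x94\xe5\xa4\xa9\xe7\xa9\xba'
>>> ''.join('{:02X}'.format(b) for b in raw)
'E6B5B7E99894E5A4A9E7A9BA'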
Similarly:
>>> song = '宽恕'
>>> song_bytes = song.encode('utf-8')
>>> song_bytes
b'\xe5\xae\xbd\xe6\x81\x95'
That is, encoding the song title "宽恕" in UTF-8 gives the byte sequence:
E5AEBDE68195
So, to go the other way, i.e. from a string such as "%E5%AE%BD%E6%81%95" back to the title, two steps are needed: (1) turn the hex digits (with the percent signs stripped) into the corresponding byte sequence, and (2) decode that byte sequence as UTF-8. Step 2 is trivial. Step 1 can be done with the fromhex method of the bytes class.
bytes.fromhex takes a string of hexadecimal digits (e.g. 'E5 AE BD E6 81 95') and converts it into the corresponding bytes object (here b'\xe5\xae\xbd\xe6\x81\x95'): every two hex digits in the former become one byte of the latter, and all whitespace is ignored. Concretely:
>>> strange_file_name = "%E5%AE%BD%E6%81%95"
>>> strange_file_name = strange_file_name.replace('%', '')
>>> strange_file_name
'E5AEBDE68195'
>>> strange_file_name_bytes = bytes.fromhex(_)
>>> strange_file_name_bytes
b'\xe5\xae\xbd\xe6\x81\x95'
>>> _.decode('utf-8')
'宽恕'
Wrapping this idea up as a function gives a first version of percent_decode:
from re import compile as re_compile

# one or more consecutive %XX escapes, e.g. '%E5%AE%BD%E6%81%95'
_percent_pat = re_compile(r'(?:%[A-Fa-f0-9]{2})+')

def percent_decode(string):
    for substr in _percent_pat.findall(string):
        # strip the '%' signs, turn the hex digits into bytes, then decode as UTF-8
        substr_dec = bytes.fromhex(
            substr.replace('%', '')).decode('utf-8')
        string = string.replace(substr, substr_dec)
    return string
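A quick sanity check of this first version (the second string is the path of the Wikipedia article on percent-encoding):
>>> percent_decode('%E5%AE%BD%E6%81%95')
'宽恕'
>>> percent_decode('/wiki/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81')
'/wiki/百分号编码'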
Python's urllib.parse module already provides the opposite direction. Its urlencode function builds a URL query string from the data in its query argument by running each key and value through quote_plus. For example, when we log in to a forum with a username and password, or search for a keyword on some site, urlencode is what produces the final query URL:
>>> from urllib.parse import urlencode
>>> query_filter = {'song': '宽恕', 'artist': '王菲'}
>>> query_parms = urlencode(query_filter)
>>> query_parms
'artist=%E7%8E%8B%E8%8F%B2&song=%E5%AE%BD%E6%81%95'
>>> query_url = 'http://www.example.com/query?{}'.format(query_parms)
>>> query_url
'http://www.example.com/query?artist=%E7%8E%8B%E8%8F%B2&song=%E5%AE%BD%E6%81%95'
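urlencode also accepts a sequence of key/value pairs (which keeps the order stable), and because it goes through quote_plus, spaces end up as '+' (my own examples):
>>> urlencode([('artist', 'Beyond'), ('song', '海阔天空')])
'artist=Beyond&song=%E6%B5%B7%E9%98%94%E5%A4%A9%E7%A9%BA'
>>> urlencode({'q': '海阔 天空'})
'q=%E6%B5%B7%E9%98%94+%E5%A4%A9%E7%A9%BA'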
As the percent_decode implementation above shows, str and bytes already encapsulate the encoding and decoding work (str provides encode, bytes provides decode), so all my percent_decode really does is turn the percent-containing str into the corresponding bytes sequence and hand the result to bytes' decode method. It is easy to guess that the encoding direction is the mirror image: first use str's encode method to get a bytes sequence, then turn that sequence into a string containing percent signs. I did not know how to implement that step "elegantly" (by which I mean: concise, compact code that reuses existing functions, like fromhex above), so let's look at how Python itself does it.
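For reference, the bare-bones version of that encoding direction is a one-liner: encode to bytes, then print every byte as %XX (unlike the quote function shown below, this escapes even the unreserved characters):
>>> ''.join('%{:02X}'.format(b) for b in '宽恕'.encode('utf-8'))
'%E5%AE%BD%E6%81%95'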
The following code is taken from the urllib.parse module of Python 3.3.3 (the comments beginning with "##" are my own notes on how I read this code):
## In percent-encoding, the following ASCII characters are kept as they are.
## They are the so-called "unreserved characters".
## Extra characters can be declared safe through the safe parameter of quote and quote_plus.
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b'abcdefghijklmnopqrstuvwxyz'
b'0123456789'
b'_.-')
_ALWAYS_SAFE_BYTES = bytes(_ALWAYS_SAFE)
_safe_quoters = {}
## The encoding half of percent-encoding turns a bytes sequence into the corresponding
## string containing percent signs, e.g. b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95.
## This class encapsulates exactly that, one byte at a time, working like a dictionary
## lookup (just not written with square brackets by its caller). For example, with
## quoter = Quoter(safe).__getitem__ (as quote_from_bytes does below), quoter(0xE5)
## yields '%E5', while an "unreserved" byte is returned as its character:
## quoter(ord('a')) yields 'a'.
class Quoter(collections.defaultdict):
"""A mapping from bytes (in range(0,256)) to strings.
String values are percent-encoded byte values, unless the key < 128, and
in the "safe" set (either the specified safe set, or default set).
"""
# Keeps a cache internally, using defaultdict, for efficiency (lookups
# of cached keys don't call Python code at all).
def __init__(self, safe):
"""safe: bytes object."""
self.safe = _ALWAYS_SAFE.union(safe)
def __repr__(self):
# Without this, will just display as a defaultdict
return "" % dict(self)
def __missing__(self, b):
# Handle a cache miss. Store quoted string in cache and return.
        ## self.safe is the union of _ALWAYS_SAFE (the set of unreserved characters) and
        ## any extra characters passed via the safe parameter of quote / quote_plus.
        ## A byte that is in self.safe is returned as its character; any other byte is
        ## returned as a '%XX' sequence, where XX is the byte value in hexadecimal.
res = chr(b) if b in self.safe else '%{:02X}'.format(b)
self[b] = res
return res
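## A quick illustration of the class above (my own example; Quoter is an internal
## helper, but on Python 3.3 it can be imported from urllib.parse and poked at):
##   >>> from urllib.parse import Quoter
##   >>> quoter = Quoter(b'/').__getitem__   # exactly how quote_from_bytes uses it below
##   >>> quoter(0xE5)                        # a byte outside the safe set
##   '%E5'
##   >>> quoter(ord('a'))                    # an unreserved byte is returned as itself
##   'a'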
def quote(string, safe='/', encoding=None, errors=None):
"""quote('abc def') -> 'abc%20def'
Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.
RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.
By default, the quote function is intended for quoting the path
section of a URL. Thus, it will not encode '/'. This character
is reserved, but in typical usage the quote function is being
called on a path where the existing slash characters are used as
reserved characters.
string and safe may be either str or bytes objects. encoding must
not be specified if string is a str.
The optional encoding and errors parameters specify how to deal with
non-ASCII characters, as accepted by the str.encode method.
By default, encoding='utf-8' (characters are encoded with UTF-8), and
errors='strict' (unsupported characters raise a UnicodeEncodeError).
"""
if isinstance(string, str):
if not string:
return string
if encoding is None:
encoding = 'utf-8'
if errors is None:
errors = 'strict'
        ## If the input is a str, encode it into a byte sequence first
string = string.encode(encoding, errors)
else:
if encoding is not None:
raise TypeError("quote() doesn't support 'encoding' for bytes")
if errors is not None:
raise TypeError("quote() doesn't support 'errors' for bytes")
    ## Hand the byte sequence produced above to quote_from_bytes, which turns it into
    ## the corresponding percent-encoded string, i.e.
    ## b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95
return quote_from_bytes(string, safe)
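## A few quick calls to quote (my own examples):
##   >>> quote('/wiki/宽恕')                 # '/' is in the default safe set
##   '/wiki/%E5%AE%BD%E6%81%95'
##   >>> quote('/wiki/宽恕', safe='')        # now '/' is escaped too
##   '%2Fwiki%2F%E5%AE%BD%E6%81%95'
##   >>> quote(b'\xe5\xae\xbd\xe6\x81\x95')  # bytes input: the encode step is skipped
##   '%E5%AE%BD%E6%81%95'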
## This function first protects the space characters in the string (by appending a space
## to the safe set, so that spaces are not turned into %20), then calls quote to do the
## actual percent-encoding, and finally replaces every space in the result with a plus sign.
def quote_plus(string, safe='', encoding=None, errors=None):
"""Like quote(), but also replace ' ' with '+', as required for quoting
HTML form values. Plus signs in the original string are escaped unless
they are included in safe. It also does not have safe default to '/'.
"""
# Check if ' ' in string, where string may either be a str or bytes. If
# there are no spaces, the regular quote will produce the right answer.
if ((isinstance(string, str) and ' ' not in string) or
(isinstance(string, bytes) and b' ' not in string)):
return quote(string, safe, encoding, errors)
if isinstance(safe, str):
space = ' '
else:
space = b' '
string = quote(string, safe + space, encoding, errors)
return string.replace(' ', '+')
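## quote vs. quote_plus on the same input (my own examples):
##   >>> quote('黑 客 帝 国.rmvb')           # spaces become %20
##   '%E9%BB%91%20%E5%AE%A2%20%E5%B8%9D%20%E5%9B%BD.rmvb'
##   >>> quote_plus('黑 客 帝 国.rmvb')      # spaces become '+'
##   '%E9%BB%91+%E5%AE%A2+%E5%B8%9D+%E5%9B%BD.rmvb'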
## Uses the Quoter class above to perform the actual conversion,
## i.e. b'\xe5\xae\xbd\xe6\x81\x95' -> %E5%AE%BD%E6%81%95
def quote_from_bytes(bs, safe='/'):
"""Like quote(), but accepts a bytes object rather than a str, and does
not perform string-to-bytes encoding. It always returns an ASCII string.
quote_from_bytes(b'abc def\x3f') -> 'abc%20def%3f'
"""
if not isinstance(bs, (bytes, bytearray)):
raise TypeError("quote_from_bytes() expected bytes")
if not bs:
return ''
if isinstance(safe, str):
# Normalize 'safe' by converting to bytes and removing non-ASCII chars
safe = safe.encode('ascii', 'ignore')
else:
safe = bytes([c for c in safe if c < 128])
    ## If every byte of bs is one that must be kept as-is, stripping them all away leaves
    ## an empty bytes object; in that case nothing needs escaping and a plain decode is
    ## enough to change the type. For example, if bs is b'Beyond', we simply return
    ## b'Beyond'.decode(), i.e. the string 'Beyond'.
if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
return bs.decode()
    ## Build a Quoter-backed lookup that provides, for example:
    ## quoter(0xE4) -> '%E4'
    ## quoter(ord('A')) -> 'A'
try:
quoter = _safe_quoters[safe]
except KeyError:
_safe_quoters[safe] = quoter = Quoter(safe).__getitem__
    ## Run every byte of bs through the lookup and join the pieces into the result string.
return ''.join([quoter(char) for char in bs])
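To see quote_from_bytes at work on the byte sequences from earlier (my own quick check):
>>> from urllib.parse import quote_from_bytes
>>> quote_from_bytes(b'\xe5\xae\xbd\xe6\x81\x95')
'%E5%AE%BD%E6%81%95'
>>> quote_from_bytes(b'Beyond')   # every byte is safe, so the fast path returns it unchanged
'Beyond'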
Imitating the implementation above, here are my own simplified versions of quote and quote_plus:
_unreserved_chars = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                              b'abcdefghijklmnopqrstuvwxyz'
                              b'0123456789'
                              b'_.-')
# A simple implementation of "urllib.parse.quote"
def percent_encode(string, safe = '/', encoding = 'utf-8', errors = 'strict'):
if not string:
return string
string = string.encode(encoding, errors)
bytes_unchanged = _unreserved_chars.union(
safe.encode('ascii', 'ignore'))
    ## Here I use a lambda to play the role of the Quoter class above.
process_byte = lambda byte: chr(byte) if byte in bytes_unchanged \
else '%{:02X}'.format(byte)
return ''.join((process_byte(b) for b in string))
# A simple implementation of "urllib.parse.quote_plus"
def percent_encode_plus(string, safe = '', encoding = 'utf-8',
errors = 'strict'):
safe += ' '
string = percent_encode(string, safe, encoding, errors)
return string.replace(' ', '+')
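A few spot checks against urllib.parse (my own examples):
>>> percent_encode('宽恕')
'%E5%AE%BD%E6%81%95'
>>> percent_encode_plus('Beyond 海阔天空')
'Beyond+%E6%B5%B7%E9%98%94%E5%A4%A9%E7%A9%BA'
>>> from urllib.parse import quote, quote_plus
>>> percent_encode('黑 客 帝 国.rmvb') == quote('黑 客 帝 国.rmvb')
True
>>> percent_encode_plus('黑 客 帝 国.rmvb') == quote_plus('黑 客 帝 国.rmvb')
True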
In fact, both the online URL encoding/decoding sites mentioned earlier and the related functions in urllib.parse support encodings other than UTF-8 (the urllib.parse functions expose this through their encoding parameter), while my percent_decode has no such parameter. Let me add one:
def percent_decode(string, encoding = 'utf-8'):
for substr in _percent_pat.findall(string):
substr_dec = bytes.fromhex(
substr.replace('%', '')).decode(encoding)
string = string.replace(substr, substr_dec)
return string
It looks dead simple! I even thought this version was more concise than Python's own unquote implementation, so off I went, full of confidence, to test it:
>>> from re import compile as re_compile
>>> _percent_pat = re_compile(r'(?:%[A-Fa-f0-9]{2})+')
>>> def percent_decode(string, encoding = 'utf-8'):
for substr in _percent_pat.findall(string):
substr_dec = bytes.fromhex(
substr.replace('%', '')).decode(encoding)
string = string.replace(substr, substr_dec)
return string
>>> song = 'Beyond-海阔天空'
>>> from urllib.parse import quote, unquote
>>> song_pct_enc = quote(song, encoding = 'utf-8')
>>> song_pct_enc
'Beyond-%E6%B5%B7%E9%98%94%E5%A4%A9%E7%A9%BA'
>>> percent_decode(_, 'utf-8')
'Beyond-海阔天空'
>>> unquote(song_pct_enc)
'Beyond-海阔天空'
>>> song_pct_enc_utf16 = quote(song, encoding = 'utf-16')
>>> song_pct_enc_utf16
'%FF%FEB%00e%00y%00o%00n%00d%00-%00wm%14%96%29Yzz'
>>> percent_decode(_, 'utf-16')
Traceback (most recent call last):
File "", line 1, in
percent_decode(_, 'utf-16')
File "", line 4, in percent_decode
substr.replace('%', '')).decode(encoding)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
>>> unquote(song_pct_enc_utf16, 'utf-16')
'Beyond-海阔天空'
Trouble! My percent_decode fails on UTF-16. Looking closer, the UTF-16 encoding of song contains %00, and
b'\x00'.decode('utf-16')
raises a UnicodeDecodeError. (b'\x00\x00'.decode('utf-16') works fine.)
So where does this %00 come from?
>>> 'B'.encode('utf-16')
b'\xff\xfeB\x00' # little-endian, with the BOM FF FE
>>> 'B'.encode('utf-16-le')
b'B\x00' # little-endian, no BOM
>>> 'B'.encode('utf-16-be')
b'\x00B' # big-endian
With this in front of us, the mistake is obvious. Take b'B\x00', the UTF-16 encoding of the character 'B' in little-endian order without a BOM: only the two bytes b'B' and b'\x00' together form a complete code unit that decodes back to 'B':
>>> b'B\x00'.decode('utf-16-le')
'B'
whereas decoding either of the two bytes on its own fails:
>>> b'B'.decode('utf-16-le')
Traceback (most recent call last):
File "", line 1, in
b'B'.decode('utf-16-le')
File "D:\Program Files\Python33\lib\encodings\utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x42 in position 0: truncated data
>>> b'\x00'.decode('utf-16-le')
Traceback (most recent call last):
File "", line 1, in
b'\x00'.decode('utf-16-le')
File "D:\Program Files\Python33\lib\encodings\utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 0: truncated data
The former fails with a "truncated data" error: the sequence being decoded has been cut short. The latter is even more hopeless; in fact b'\x00' by itself does not correspond to any character in UTF-16. Even the NUL character '\0', whose ASCII value is 0, encodes to two bytes:
>>> '\0'.encode('utf-16-le')
b'\x00\x00'
So the strategy my percent_decode uses above, decoding each run of %XX escapes as soon as it is found and splicing the result back into the string, is wrong. The alternative is to postpone the decoding until the very end:
>>> quote('B', encoding = 'utf-16-le')
'B%00'
When we want to decode a string such as 'B%00', we should first turn it into the byte sequence b'B\x00' and only then call bytes' decode method on the complete sequence; that way the error above cannot happen. This is exactly how unquote works in Python 3.3.3: a helper named unquote_to_bytes rebuilds the raw byte sequence, and the decoding happens only once, at the end.
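Since unquote_to_bytes is a public function of urllib.parse, the "rebuild the bytes first, decode once at the end" idea can be tried out directly (my own example, reusing the UTF-16-LE string from earlier):
>>> from urllib.parse import quote, unquote_to_bytes
>>> quoted = quote('Beyond-海阔天空', encoding = 'utf-16-le')
>>> quoted
'B%00e%00y%00o%00n%00d%00-%00wm%14%96%29Yzz'
>>> unquote_to_bytes(quoted)   # step 1: rebuild the complete byte sequence
b'B\x00e\x00y\x00o\x00n\x00d\x00-\x00wm\x14\x96)Yzz'
>>> _.decode('utf-16-le')      # step 2: decode it in one go
'Beyond-海阔天空'
Here is how both functions are implemented in Python 3.3.3 (the comments beginning with "##" are again my own notes):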
import re
_asciire = re.compile('([\x00-\x7f]+)')
_hexdig = '0123456789ABCDEFabcdef'
## Builds the mapping
##   b'00' -> b'\x00'
##   b'01' -> b'\x01'
##   ...
##   b'FF' -> b'\xff'
## i.e. from the two-hex-digit representation of a byte to the byte itself;
## essentially a per-byte lookup-table version of bytes.fromhex.
_hextobyte = {(a + b).encode(): bytes([int(a + b, 16)])
for a in _hexdig for b in _hexdig}
def unquote_to_bytes(string):
"""unquote_to_bytes('abc%20def') -> b'abc def'."""
# Note: strings are encoded as UTF-8. This is only an issue if it contains
# unescaped non-ASCII characters, which URIs should not.
if not string:
# Is it a string-like object?
        ## The next line looks useless at first sight. I think it is only here as a check:
        ## the empty bytes object is returned only if string has a split attribute, i.e. it
        ## is a string-like object; anything else raises an AttributeError at this point.
string.split
return b''
    ## If string is a str, convert it into a byte sequence first.
    ## I think 'ascii' would work as the encoding here too; after all, a properly
    ## percent-encoded string should contain nothing but ASCII characters.
    ## The Python documentation does say:
    ##   "The source character set is defined by the encoding declaration; it is UTF-8
    ##    if no encoding declaration is given in the source file"
    ## i.e. in a script without an encoding declaration, Python 3.x treats string
    ## literals as UTF-8, so using UTF-8 here is reasonable as well.
if isinstance(string, str):
string = string.encode('utf-8')
    ## Split on the byte b'%', producing a list of bytes objects.
bits = string.split(b'%')
if len(bits) == 1:
return string
res = [bits[0]]
append = res.append
for item in bits[1:]:
try:
            ## This is effectively res.append(_hextobyte[item[:2]]).
            ## Take 'B%00', the UTF-16-LE percent-encoding of 'B', again:
            ## after the encode above, string is b'B%00', so bits is [b'B', b'00']
            ## and res starts out as [b'B']. Looking b'00' up in _hextobyte
            ## turns it into b'\x00', which is appended here, so that joining
            ## res at the end yields b'B\x00'.
append(_hextobyte[item[:2]])
            ## Whatever follows the two hex digits is appended untouched.
            ## For example, the UTF-16-BE percent-encoding of 'B' is '%00B':
            ## the lookup above turns b'00' into b'\x00', and the remaining b'B'
            ## just needs to be appended to res.
append(item[2:])
except KeyError:
append(b'%')
append(item)
    ## After every b'%XX' has been mapped to its byte b'\xXX', join the pieces back into one byte sequence.
return b''.join(res)
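## Tracing the split above for the UTF-16-LE bytes of '海阔天空' (my own walk-through):
##   >>> b'wm%14%96%29Yzz'.split(b'%')
##   [b'wm', b'14', b'96', b'29Yzz']
## b'wm' is kept as-is; for each later chunk the first two bytes are looked up in
## _hextobyte (b'14' -> b'\x14', b'96' -> b'\x96', b'29' -> b')') and whatever follows
## them is appended unchanged, so res joins back to b'wm\x14\x96)Yzz'.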
def unquote(string, encoding='utf-8', errors='replace'):
"""Replace %xx escapes by their single-character equivalent. The optional
encoding and errors parameters specify how to decode percent-encoded
sequences into Unicode characters, as accepted by the bytes.decode()
method.
By default, percent-encoded sequences are decoded with UTF-8, and invalid
sequences are replaced by a placeholder character.
unquote('abc%20def') -> 'abc def'.
"""
if '%' not in string:
string.split
return string
if encoding is None:
encoding = 'utf-8'
if errors is None:
errors = 'replace'
    ## I don't think this line matters much either (it only makes a difference when
    ## string itself contains unescaped non-ASCII characters: those runs bypass the
    ## encode-then-decode below and are passed through unchanged).
bits = _asciire.split(string)
res = [bits[0]]
append = res.append
for i in range(1, len(bits), 2):
        ## Decode the byte sequence returned by unquote_to_bytes.
append(unquote_to_bytes(bits[i]).decode(encoding, errors))
append(bits[i + 1])
return ''.join(res)
That, then, is how the unquote function is implemented in Python 3.3.3.
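A quick round-trip check with encodings other than UTF-8 (my own examples; on Python 3.3.3 both comparisons should print True):
>>> from urllib.parse import quote, unquote
>>> s = 'Beyond-海阔天空'
>>> unquote(quote(s, encoding = 'utf-16'), encoding = 'utf-16') == s
True
>>> unquote(quote(s, encoding = 'gbk'), encoding = 'gbk') == s
True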
Later, while writing code on my Debian 7.3 box, which has Python 3.3.2 installed, I was surprised to find that between Python 3.3.2 and Python 3.3.3 the implementation of unquote in urllib.parse changed completely. In fact, the unquote in Python 3.3.2 is buggy: when I quote a Chinese string with some encoding type (say UTF-16) and then unquote it again, the result no longer matches the original. I diffed the parse.py of urllib.parse from Python 3.3.2 on Debian 7 against the parse.py of Python 3.3.3 on Windows 7 using the diff tool that ships with TortoiseSVN, and the biggest difference between the two is exactly this reimplementation of unquote and unquote_to_bytes.
The following code is taken from the urllib.parse module of Python 3.3.2:
def unquote_to_bytes(string):
"""unquote_to_bytes('abc%20def') -> b'abc def'."""
# Note: strings are encoded as UTF-8. This is only an issue if it contains
# unescaped non-ASCII characters, which URIs should not.
if not string:
# Is it a string-like object?
string.split
return b''
if isinstance(string, str):
string = string.encode('utf-8')
res = string.split(b'%')
if len(res) == 1:
return string
string = res[0]
for item in res[1:]:
try:
string += bytes([int(item[:2], 16)]) + item[2:]
except ValueError:
string += b'%' + item
return string
def unquote(string, encoding='utf-8', errors='replace'):
"""Replace %xx escapes by their single-character equivalent. The optional
encoding and errors parameters specify how to decode percent-encoded
sequences into Unicode characters, as accepted by the bytes.decode()
method.
By default, percent-encoded sequences are decoded with UTF-8, and invalid
sequences are replaced by a placeholder character.
unquote('abc%20def') -> 'abc def'.
"""
if string == '':
return string
res = string.split('%')
if len(res) == 1:
return string
if encoding is None:
encoding = 'utf-8'
if errors is None:
errors = 'replace'
# pct_sequence: contiguous sequence of percent-encoded bytes, decoded
pct_sequence = b''
string = res[0]
for item in res[1:]:
try:
if not item:
raise ValueError
pct_sequence += bytes.fromhex(item[:2])
rest = item[2:]
if not rest:
# This segment was just a single percent-encoded character.
# May be part of a sequence of code units, so delay decoding.
# (Stored in pct_sequence).
continue
except ValueError:
rest = '%' + item
# Encountered non-percent-encoded characters. Flush the current
# pct_sequence.
string += pct_sequence.decode(encoding, errors) + rest
pct_sequence = b''
if pct_sequence:
# Flush the final pct_sequence
string += pct_sequence.decode(encoding, errors)
return string
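Tracing this 3.3.2 code on the simplest UTF-16 input shows where it goes wrong. For 'B%00' (which is what quote('B', encoding = 'utf-16-le') produces), the literal 'B' in res[0] stays a str while only b'\x00' is collected into pct_sequence, so the final flush decodes that lone byte by itself:
>>> b'\x00'.decode('utf-16-le', 'replace') == '\ufffd'   # the lone byte becomes U+FFFD
True
>>> b'B\x00'.decode('utf-16-le')   # whereas the complete code unit decodes fine
'B'
As far as I can tell, that is why unquote on 3.3.2 gives back 'B\ufffd' instead of 'B' here, and why my quote/unquote round trips with UTF-16 failed on that version.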
After working out how these functions operate, I put my own imitations into a script named uriparse.py:
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# By mayadong7349 2014-01-19 19:39
from re import compile as re_compile
_percent_pat = re_compile(b'((?:%[A-Fa-f0-9]{2})+)')
_unreserved_chars = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b'abcdefghijklmnopqrstuvwxyz'
b'0123456789'
b'_.-')
# A simple implementation of "urllib.parse.unquote"
def percent_decode(string, encoding = 'utf-8', errors = 'replace'):
str_bytes = string.encode('utf-8')
hex_to_byte = lambda match_ret: \
bytes.fromhex(
match_ret.group(0).replace(b'%', b'').decode('utf-8'))
str_bytes = _percent_pat.sub(hex_to_byte, str_bytes)
string = str_bytes.decode(encoding, errors)
return string
# A simple implementation of "urllib.parse.unquote_plus"
def percent_decode_plus(string, encoding = 'utf-8', errors = 'replace'):
return percent_decode(string.replace('+', '%20'), encoding, errors)
# A simple implementation of "urllib.parse.quote"
def percent_encode(string, safe = '/', encoding = 'utf-8', errors = 'strict'):
if not string:
return string
string = string.encode(encoding, errors)
bytes_unchanged = _unreserved_chars.union(
safe.encode('ascii', 'ignore'))
process_byte = lambda byte: chr(byte) if byte in bytes_unchanged \
else '%{:02X}'.format(byte)
return ''.join((process_byte(b) for b in string))
# A simple implementation of "urllib.parse.quote_plus"
def percent_encode_plus(string, safe = '', encoding = 'utf-8',
errors = 'strict'):
safe += ' '
string = percent_encode(string, safe, encoding, errors)
return string.replace(' ', '+')
if __name__ == '__main__':
import unittest
import urllib.parse
class TestURIParse(unittest.TestCase):
def setUp(self):
pass
def tearDown(self):
pass
def doTest(self, str_, str_with_space, encoding_list):
for en in encoding_list:
# print('Test encoding:', en)
str_enc = percent_encode(str_, encoding = en)
self.assertEqual(
str_enc, urllib.parse.quote(str_, encoding = en))
str_with_space_enc = percent_encode_plus(
str_with_space, encoding = en)
self.assertEqual(
str_with_space_enc,
urllib.parse.quote_plus(str_with_space, encoding = en))
# print('Test decoding:', en)
self.assertEqual(percent_decode(str_enc, encoding = en),
urllib.parse.unquote(str_enc, encoding = en))
self.assertEqual(
percent_decode(str_with_space_enc, encoding = en),
urllib.parse.unquote(str_with_space_enc, encoding = en))
self.assertEqual(
percent_decode_plus(str_with_space_enc, encoding = en),
urllib.parse.unquote_plus(
str_with_space_enc, encoding = en))
def testChinese(self):
fn = 'Beyond-海阔天空'
fn_with_space = 'Beyond 海 阔 天 空'
encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be',
'gb18030')
self.doTest(fn, fn_with_space, encoding_list)
def testReservedChars(self):
reserved_chars = "!*'();:@&=+$,/?#[]"
encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be',
'gb18030')
self.doTest(reserved_chars, reserved_chars, encoding_list)
def testEmptyString(self):
self.doTest('', '', ('utf-8', 'utf-16-be', 'utf-32-le'))
def testURL(self):
url = 'http://www.baidu.com/'
url_with_space = 'http://www.baidu.com/黑 客 帝 国.rmvb'
encoding_list = ('utf-8', 'gb2312', 'gbk', 'utf-16', 'utf-16-le',
'utf-32', 'utf-32-le', 'gb18030')
self.doTest(url, url_with_space, encoding_list)
def testRealURL(self):
wiki_page = 'http://zh.wikipedia.org/wiki/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81'
self.assertEqual(percent_decode(wiki_page),
urllib.parse.unquote(wiki_page))
unittest.main()
As for UTF-16:
>>> '海阔天空'.encode('utf-16-le')
b'wm\x14\x96)Yzz'
>>> quote('海阔天空', encoding = 'utf-16-le')
'wm%14%96%29Yzz'
>>> for ch in '海阔天空':
... print(repr(quote(ch, encoding = 'utf-16-le')))
...
'wm'
'%14%96'
'%29Y'
'zz'
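The odd-looking 'wm' and 'zz' pieces are not a bug: both bytes of those UTF-16-LE code units happen to be printable ASCII letters, which are in the safe set, so quote leaves them untouched:
>>> hex(ord('海')), hex(ord('空'))
('0x6d77', '0x7a7a')
>>> '海'.encode('utf-16-le')   # 0x77 is 'w', 0x6d is 'm' (low byte first)
b'wm'
>>> '空'.encode('utf-16-le')   # 0x7a is 'z', twice
b'zz'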
Further reading 3:
字符集和字符编码(Charset & Encoding)
字符编码笔记:ASCII,Unicode和UTF-8
python 中文乱码 问题深入分析