Python 2: unicode and str

Unicode

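Unicode is a character standard that assigns every character a numeric code point; UTF-8 is one of the encodings that turn those code points into bytes.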
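
To make the distinction concrete, here is a minimal sketch showing that a character has a single code point but different byte representations under different encodings. The variable names are illustrative only, and the snippet is written so that it should run unchanged on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# One character, one code point, several possible byte encodings.
ch = u'\u554a'  # the first character of the sample string used in this post

code_point = ord(ch)                   # the number Unicode assigns to the character
utf8_bytes = ch.encode('utf-8')        # the same character encoded as UTF-8
utf16_bytes = ch.encode('utf-16-be')   # the same character encoded as big-endian UTF-16

print(hex(code_point))
print(len(utf8_bytes))    # number of bytes in the UTF-8 form
print(len(utf16_bytes))   # number of bytes in the UTF-16 form
```

The code point stays the same in every case; only the bytes produced by each encoding differ.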

Encoding in Python 2

In [1]: a = '啊哈哈'
In [2]: a
Out[2]: '\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'
In [4]: type(a)
Out[4]: str
In [5]: len(a)
Out[5]: 9
In [6]: b = u'姚赫赫'
In [7]: type(b)
Out[7]: unicode
In [8]: len(b)
Out[8]: 3
In [9]: a.decode('utf-8')
Out[9]: u'\u554a\u54c8\u54c8'
In [10]: b
Out[10]: u'\u59da\u8d6b\u8d6b'

In [11]: b.encode('utf-8')
Out[11]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'

In [12]: c = '姚赫赫'

In [13]: c
Out[13]: '\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'

In [14]: import sys

In [15]: sys.getdefaultencoding()
Out[15]: 'ascii'

In [16]: b + c
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-16-c6b7c7e5694f> in <module>()
----> 1 b + c

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

In [17]: import sys

In [18]: relaod(sys)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-...> in <module>()
----> 1 relaod(sys)

NameError: name 'relaod' is not defined

In [19]: reload(sys)
Out[19]: <module 'sys' (built-in)>

In [20]: sys.setdefaultencoding('utf-8')

In [21]: b + c
Out[21]: u'\u59da\u8d6b\u8d6b\u59da\u8d6b\u8d6b'

In [22]: type(b + c)
Out[22]: unicode

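In Python 2, a = '啊哈哈' has type str: it is the encoded byte sequence, so len(a) is the number of bytes (9 here, because each of these characters takes 3 bytes in UTF-8). b has type unicode, which stores text, so len(b) is the number of characters. In b + c, Python 2 implicitly decodes the str operand with the default encoding. With 'ascii' that fails on byte 0xe5; once the default is set to 'utf-8' the decode succeeds and the result is a unicode object.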
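
The same behaviour can be checked in a script. The sketch below is an illustration, not part of the session above: it uses explicit byte escapes and u'' literals so that it should behave the same on Python 2 and on Python 3 (where the two types are called bytes and str). It also decodes explicitly before concatenating, which is one way to avoid depending on the default encoding. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
raw = b'\xe5\xa7\x9a\xe8\xb5\xab\xe8\xb5\xab'  # UTF-8 bytes (a Python 2 str)
text = u'\u59da\u8d6b\u8d6b'                    # the same text as a unicode object

byte_count = len(raw)    # len() of the byte string counts bytes
char_count = len(text)   # len() of the unicode object counts characters

# Decode explicitly rather than relying on the implicit default-encoding decode
combined = text + raw.decode('utf-8')

print('%d bytes, %d characters' % (byte_count, char_count))
print(len(combined))
```

Decoding at the point where bytes enter the program keeps everything after that in unicode, so no implicit conversion is triggered.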

Converting between the two

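str -> decode('utf-8') -> unicode
unicode -> encode('utf-8') -> str
When writing to a file, a str can be written as-is, but a unicode object must be encoded first; otherwise Python 2 tries to encode it with the default encoding (ASCII unless changed) and fails on non-ASCII characters.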
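
A minimal sketch of the round trip and of both ways to write to a file follows. The file name demo.txt and the temporary directory are hypothetical, used only for this demonstration. io.open is part of the standard library; using it here is a suggestion rather than something the session above relies on. The snippet is intended to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u'\u59da\u8d6b\u8d6b'        # unicode text
encoded = text.encode('utf-8')      # unicode -> str (bytes)
decoded = encoded.decode('utf-8')   # str (bytes) -> unicode

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Binary mode: the already-encoded bytes are written directly
with open(path, 'wb') as f:
    f.write(encoded)

# io.open with an encoding accepts unicode and encodes it on write
with io.open(path, 'a', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to see what ended up on disk
with open(path, 'rb') as f:
    contents = f.read()

print(len(contents))
```

Both writes put the same UTF-8 bytes on disk; the only difference is whether the encoding step is done by hand or by the file object.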
