The \ufeff problem

https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string gives the most thorough explanation; it is excerpted below:

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. UTF-8 does not need it, because UTF-8 has no byte-order ambiguity; it serves only as a signature (usually on Windows).
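To make the signature concrete, here is a minimal Python 3 sketch (the excerpt above is Python 2). The helper name `decode_utf8_any` is made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def decode_utf8_any(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM signature if one is present.

    Hypothetical helper: 'utf-8-sig' skips a leading EF BB BF and otherwise
    behaves like plain 'utf-8'.
    """
    return data.decode("utf-8-sig")


# The UTF-8 signature is the three bytes EF BB BF.
with_bom = codecs.BOM_UTF8 + "ABC".encode("utf-8")
without_bom = "ABC".encode("utf-8")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
print(repr(with_bom.decode("utf-8")))

# 'utf-8-sig' accepts input with or without the signature.
print(repr(decode_utf8_any(with_bom)))
print(repr(decode_utf8_any(without_bom)))
```

The same codec name can be passed when reading files, for example `open(path, encoding="utf-8-sig")`, which is usually the simplest way to avoid a stray `\ufeff` at the start of the first line.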

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

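Note that the utf-16 codec relies on the BOM to determine byte order. Without one, Python cannot tell from the data whether it is big- or little-endian, and a plain decode falls back to the platform's native byte order.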
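When the encoding of incoming bytes is not known in advance, one option is to look at the BOM first and choose the codec from it. The following Python 3 sketch assumes the input is either UTF-8 or UTF-16; `decode_with_bom` and its `default` parameter are hypothetical names, while the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# (BOM bytes, codec that consumes that BOM while decoding)
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes using the codec implied by a leading BOM.

    Hypothetical helper: if no BOM is found, fall back to `default`,
    which the caller must choose because the bytes carry no hint.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs strip the BOM, so no U+FEFF is left in the result.
            return data.decode(encoding)
    return data.decode(default)


print(repr(decode_with_bom("ABC".encode("utf-16"))))      # BOM, native order
print(repr(decode_with_bom("ABC".encode("utf-8-sig"))))   # UTF-8 signature
print(repr(decode_with_bom(b"ABC")))                      # no BOM, default
```

Passing the BOM-aware codec names (`utf-8-sig`, `utf-16`) rather than `utf-8` or `utf-16le` is what keeps `\ufeff` out of the decoded string.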
