Reading a UTF-8 encoded file in Python produces \xef\xbb\xbf

I was reading a file saved as UTF-8 in Python. Its first line is empty, so I used line.strip() == '' to test for blank lines, but the check failed on that line.

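After line.strip(), the value printed as '', so why did it not compare equal to ''? It turned out that len(line.strip()) was 3, so the string was clearly not empty. I then passed the result to repr() to see the escaped form and found that it contained \xef\xbb\xbf. What does this value mean?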

EF BB BF is the UTF-8 encoding of U+FEFF, known as the byte order mark (BOM). Placed at the start of a file, it marks the file as UTF-8 encoded.
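To make the symptom concrete, here is a minimal sketch, assuming Python 3. The bytes literal stands in for the first line of the file as Python 2.7 would return it from a plain open() without decoding; the variable names are illustrative and not from the original code.

```python
import codecs

# Hypothetical first line of a BOM-prefixed file whose first line is blank,
# as raw bytes (no decoding applied).
first_line = b'\xef\xbb\xbf\n'

# strip() removes the trailing newline, but the BOM bytes are not whitespace.
stripped = first_line.strip()
print(repr(stripped))               # b'\xef\xbb\xbf'
print(len(stripped))                # 3
print(stripped == codecs.BOM_UTF8)  # True

# Decoding as plain utf-8 keeps the BOM as the single character U+FEFF,
# which str.strip() does not treat as whitespace either.
decoded = first_line.decode('utf-8').strip()
print(repr(decoded))                # '\ufeff'
```

Either way, the stripped first line is not empty, which is why the comparison with '' fails.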

For how to handle it, see the first answer to "Reading Unicode file data with BOM chars in Python", quoted below:

There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:

# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'

# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'

In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it.

So I now read the file with the utf-8-sig codec. In Python 2.7 the code looks like this:

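import codecs

with codecs.open(file_path, 'r', 'utf-8-sig') as fh:
    for line in fh:
        if line.strip() == u'':  # a leading BOM is already removed, so a blank first line matches
            continue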
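In Python 3 the same idea should work with the built-in open() and its encoding parameter, without codecs.open. The following is a sketch under that assumption: the helper name read_non_blank_lines and the temporary sample file are made up for illustration and are not part of the original code.

```python
import codecs
import os
import tempfile


def read_non_blank_lines(path):
    """Return the stripped, non-blank lines of a UTF-8 file, with or without a BOM."""
    # utf-8-sig drops a leading BOM if present and otherwise behaves like utf-8.
    with open(path, 'r', encoding='utf-8-sig') as fh:
        return [line.strip() for line in fh if line.strip() != '']


# Build a throwaway sample file: BOM, a blank first line, then one line of text.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as out:
    out.write(codecs.BOM_UTF8 + b'\nhello\n')

result = read_non_blank_lines(sample_path)
os.remove(sample_path)
print(result)  # ['hello']
```

The blank first line is skipped because the BOM is removed during decoding, so the blank-line check from the start of this post behaves as expected.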
