Python raises UnicodeDecodeError: 'gbk' codec can't decode byte. This is a common error for people working on NLP: it happens when the file being read contains Chinese text, but open() falls back to the system default encoding (GBK on Chinese Windows) instead of the file's actual encoding, which is usually UTF-8.
Core idea:
Change
with open(file) as f:
to
with open(file, 'r', encoding='utf-8') as f:
For example:
import json

def load_data(filename):
    """Load a JSON-lines file; 'labels' is a predefined list of all label values."""
    D = []
    with open(filename, 'r', encoding='utf-8') as f:  # read explicitly as UTF-8
        for i, l in enumerate(f):
            l = json.loads(l)
            text, label = l['sentence'], l['label']
            D.append((text, labels.index(label)))
    return D
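A quick usage sketch: the function expects a JSON-lines file in which every line has 'sentence' and 'label' keys, plus a module-level labels list. The label set and file name below are placeholders for illustration, not from the original post.

# Hypothetical usage: 'labels' and 'train.json' are illustrative placeholders.
labels = ['positive', 'negative']        # the full label set must be defined before load_data is called
train_data = load_data('train.json')     # each line of train.json: {"sentence": "...", "label": "..."}
print(train_data[:3])                    # [(text, label_index), ...]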
That is, change the original
with open(filename) as f:
to
with open(filename, 'r', encoding='utf-8') as f:
and the error goes away.
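If you are not sure which encoding a file actually uses, one option (not from the original post) is to guess it from the raw bytes with the third-party chardet package. This is only a heuristic sketch; 'data.txt' is a placeholder path.

import chardet

# Guess the encoding from the raw bytes.
with open('data.txt', 'rb') as f:
    guess = chardet.detect(f.read())
print(guess)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Reopen in text mode with the detected encoding.
with open('data.txt', 'r', encoding=guess['encoding']) as f:
    text = f.read()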
There is another case, this time with pandas: when no encoding is passed, pd.read_csv decodes the file as UTF-8 by default, but CSV files whose header row contains Chinese characters are often saved in GB2312, so the read fails. Passing the file's actual encoding fixes it:
import pandas as pd

data = pd.read_csv(filename, encoding='gb2312')
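If you do not know in advance whether a given CSV is UTF-8 or GB2312, a simple fallback pattern is to try UTF-8 first and retry on failure. This is only a sketch; read_csv_any and 'data.csv' are illustrative names. Since GBK is a superset of GB2312, encoding='gbk' is a slightly more permissive alternative.

import pandas as pd

def read_csv_any(filename):
    """Try UTF-8 first, then fall back to GB2312 for files saved by Chinese Excel and similar tools."""
    try:
        return pd.read_csv(filename, encoding='utf-8')
    except UnicodeDecodeError:
        return pd.read_csv(filename, encoding='gb2312')

data = read_csv_any('data.csv')  # 'data.csv' is a placeholder path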