out = open('train_data.txt', 'w')
for sentence in sentences:
    out.write(sentence.encode("utf-8")+"\n")
print("done!")
Error: TypeError: can't concat str to bytes
Changed the write line to:
out.write(sentence.encode("utf-8")+b"\n")
This error goes away. Reason: encode() returns bytes, which cannot be concatenated with a str, so '\n' gets a b prefix to make it a bytes literal.
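The type mismatch can be reproduced in isolation. The following is a minimal sketch; the sample string in `sentence` is a hypothetical stand-in for one item of `sentences`, not data from the original program.

```python
# Hypothetical sample standing in for one element of `sentences`.
sentence = "你好"
encoded = sentence.encode("utf-8")  # encode() returns bytes, not str

# bytes + str is rejected by Python 3.
try:
    encoded + "\n"
    concat_error = None
except TypeError as exc:
    concat_error = exc

# bytes + bytes works, which is why the b prefix removes the first error.
line = encoded + b"\n"

print(type(encoded).__name__, type(line).__name__)
print(concat_error)
```

The result is still bytes, so it only suits a file opened in binary mode ('wb').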
New error: TypeError: write() argument must be str, not bytes
Changed the write line to:
out.write(str(sentence.encode("utf-8")+b"\n"))
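Reason: the file is opened in text mode, so write() requires a str argument, and the bytes were wrapped in str().
Follow-up:
The Chinese text in the file came out as escaped bytes instead of readable characters: \xc2\xa0\n\xef\xbc\x88\xe8\x8b\xb1\xe5\x9b\xbd\xe5\x8f\x91\xe9\x9f\xb3\xef\xbc\x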
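The escaped output has a simple cause: calling str() on a bytes object returns its printable representation (the b'...' form), it does not decode it. A small sketch of the difference, using a hypothetical two-character sample:

```python
# Hypothetical sample; its UTF-8 bytes also appear in the output above.
sentence = "英国"
data = sentence.encode("utf-8") + b"\n"

# str() on bytes gives the repr, escape sequences included.
as_repr = str(data)

# decode() turns the bytes back into real text.
as_text = data.decode("utf-8")

print(as_repr)   # b'\xe8\x8b\xb1\xe5\x9b\xbd\n'
print(as_text)   # 英国
```

So str(...) silences the TypeError but writes the escape sequences literally, which matches what ended up in the file.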
Changed the program to:
# Open in text mode with an explicit encoding; the with block closes the file.
with open('train_data.txt', 'w', encoding='utf-8') as out:
    for sentence in sentences:
        out.write(sentence + "\n")
print("done!")
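All the errors are gone!
Reason:
On Windows, a file opened in text mode without an encoding argument uses the system locale encoding, which is gbk on a Chinese-locale machine. When writing, Python then tries to encode the text with gbk. But the text from the network stream has already been decoded into Unicode (str) and can contain characters that gbk cannot represent, so the write fails. The fix is to set the target file's encoding explicitly by passing encoding='utf-8' to open(), and to write the str directly without calling encode().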
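As a rough check of this explanation, the sketch below encodes a sample string with gbk, then writes and re-reads a file with an explicit UTF-8 encoding. The sample `sentences` list and the temporary directory are assumptions made for illustration; the original program's data is not shown.

```python
import os
import tempfile

# Hypothetical data; the second item contains a no-break space (\xa0),
# the character behind the \xc2\xa0 bytes seen in the earlier output.
sentences = ["英国发音", "no\xa0break"]

# gbk has no mapping for \xa0, so encoding with it fails.
try:
    sentences[1].encode("gbk")
    gbk_error = None
except UnicodeEncodeError as exc:
    gbk_error = exc

# With an explicit UTF-8 encoding the same text round-trips cleanly.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_data.txt")
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(sentence + "\n")
    with open(path, "r", encoding="utf-8") as src:
        read_back = [line.rstrip("\n") for line in src]

print(gbk_error)
print(read_back == sentences)
```

Passing the same encoding when reading the file back avoids the mirror-image problem on the decoding side.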
References:
https://blog.csdn.net/dawei_01/article/details/79569466
https://stackoverflow.com/questions/40740150/python-3-cant-concat-bytes-to-str-for-a-list
https://www.imooc.com/qadetail/227268