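The GloVe word vectors are distributed as a UTF-8 encoded text file. In Python 3, open() without an encoding argument falls back to the system default, which is GBK on a Chinese-locale Windows machine, so reading the file fails: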
glove = open('glove.6B.100d.txt', 'r')  # no encoding given, so the locale default (GBK here) is used
word = list()
word_vector = list()
line = glove.readline()  # read one line at a time; returns a str
while line:
    line = list(line.split())
    word.append(line[0])
    word_vector.append(line[1:])
    line = glove.readline()
Result:
File "F:/data set/NLP/experiment1.py", line 9, in <module>
line = glove.readline()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 5456: illegal multibyte sequence
Value of line when the error occurred (the last line that was read successfully):
['political', '-0.33926', '0.068714', '-0.31557', '-0.24849', '0.44435', '0.15167', '-0.31527', '-1.1698', '-0.10753', '0.52095', '-0.77615', '0.16561', '0.72414', '-0.016989', '-0.43988', '0.17367', '-0.10719', '-0.52538', '-0.07708', '-0.28964', '0.52395', '0.29934', '0.70362', '-0.72564', '-0.42393', '-0.48204', '0.033616', '-0.29511', '0.34794', '-0.27514', '0.3467', '0.51157', '-0.30432', '-0.043146', '-0.71941', '-0.17902', '0.28824', '0.13239', '-0.60676', '0.26591', '-1.5263', '-0.49898', '0.56189', '-0.60347', '-0.4829', '-0.92018', '0.24844', '-0.31727', '-0.58208', '0.16869', '0.16816', '-0.42411', '0.119', '1.1859', '0.55422', '-2.7024', '0.48945', '-0.28438', '1.5228', '0.77069', '-0.46766', '0.21007', '-0.8434', '-0.40481', '1.6652', '-0.12326', '0.32125', '-0.12691', '0.59459', '0.29502', '-0.024563', '-0.42846', '-0.51083', '-0.45647', '-0.66782', '-0.1642', '-0.56383', '-0.24997', '-0.81554', '0.24945', '0.52835', '0.34749', '0.719', '-0.074859', '-1.353', '0.14949', '-0.48989', '0.44484', '0.17209', '-1.838', '0.1503', '0.29288', '-0.30107', '0.4089', '-0.39897', '-0.11257', '0.23602', '-0.73818', '0.49146', '0.88707']
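In other words, partway through the file the reader hits bytes that are not a valid GBK sequence, so decoding the file as GBK fails.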
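To see why this happens, here is a minimal sketch that reproduces the same kind of error without the GloVe file. The sample string is made up for illustration (an en dash followed by a space); I have not checked which character in glove.6B.100d.txt actually triggers the error. The snippet also prints the encoding Python reports as the locale default, which is what open() normally uses when no encoding is passed.

```python
import locale

# The encoding Python reports as the locale default. On a Chinese-locale
# Windows machine this is typically a GBK code page.
print(locale.getpreferredencoding(False))

# Hypothetical sample: an en dash (U+2013) followed by a space.
# In UTF-8 the en dash is the three bytes E2 80 93.
sample = "\u2013 ".encode("utf-8")

# Decoding with the encoding the bytes were written in works.
decoded_utf8 = sample.decode("utf-8")

# Decoding the same bytes as GBK fails: GBK groups the bytes into pairs,
# and 0x93 followed by a space is not a valid GBK character.
try:
    sample.decode("gbk")
    gbk_failed = False
except UnicodeDecodeError as err:
    gbk_failed = True
    print(err)
```

The printed error message should look much like the one in the traceback above, only with a different position.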
Solutions:
1. Read the file with the UTF-8 encoding
import numpy as np
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as glove:  # read as UTF-8
    for line in glove:  # iterate over the file one line at a time
        line = list(line.split())
        c = np.array(line).dtype  # the vector components are still strings here and must be converted to float later
2. Read the file in binary mode, then decode each line as UTF-8
glove = open('glove.6B.100d.txt', 'rb')  # open in binary mode ('rb')
word = list()
word_vector = list()
line = glove.readline().decode('utf-8')  # readline() returns bytes; decode them as UTF-8
while line:
    line = list(line.split())
    word.append(line[0])
    word_vector.append(line[1:])
    line = glove.readline().decode('utf-8')
glove.close()
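With either solution the vector components are still strings after split(), as the comment in solution 1 points out. Here is one possible way to do the conversion to float. The helper name parse_glove_line and the shortened sample line are my own, for illustration only.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (word, list of float components)."""
    tokens = line.split()
    word = tokens[0]
    # float() parses each numeric string, e.g. '-0.33926' -> -0.33926
    vector = [float(x) for x in tokens[1:]]
    return word, vector


# Hypothetical shortened line; a real line in glove.6B.100d.txt has 100 components.
sample_line = "political -0.33926 0.068714 -0.31557\n"
word, vector = parse_glove_line(sample_line)
print(word, vector)
```

If a NumPy array is needed, the list can be passed to np.array(vector, dtype=np.float32) afterwards.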