word2vec error when looking up word vectors: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

The error comes up when loading the word2vec model:

    import word2vec

    model_path = "model/Hanlp_cut_news.bin"
    w2v_dict = word2vec.load(model_path)
    print(w2v_dict["奥运"])
    Traceback (most recent call last):
      File "/home/iiip/PycharmProjects/smp_yinglish/demo1/data_preprocess.py", line 10, in <module>
        w2v_dict = word2vec.load(model_path)
      File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/io.py", line 18, in load
        return word2vec.WordVectors.from_binary(fname, *args, **kwargs)
      File "/home/iiip/.local/lib/python3.5/site-packages/word2vec/wordvectors.py", line 202, in from_binary
        vocab[i] = word.decode(encoding)
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I first checked the encoding of my segmented text file; it is UTF-8, so nothing wrong there:

    $ file Hanlp_cut_news.txt
    Hanlp_cut_news.txt: UTF-8 Unicode text, with very long lines, with no line terminators
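
`file` only inspects the beginning of the file, so as an extra sanity check the whole corpus can be decoded from Python. This is just a sketch (the file name is taken from above; the 1 MB chunk size is arbitrary), and it reports an approximate offset if any byte sequence is not valid UTF-8:

    import codecs

    # Decode the whole corpus incrementally; unlike `file`, this touches every byte.
    decoder = codecs.getincrementaldecoder('utf-8')()
    offset = 0
    try:
        with open('Hanlp_cut_news.txt', 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                decoder.decode(chunk)
                offset += len(chunk)
            decoder.decode(b'', final=True)  # flush: fails if the file ends mid-character
        print('the whole file decodes cleanly as UTF-8')
    except UnicodeDecodeError as e:
        print('invalid UTF-8 near byte offset', offset + e.start)

If this also comes back clean, the corpus itself is fine, and whatever is wrong must be introduced between the corpus and the .bin file.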

Then I checked the trained .bin file: `file` reports "data", meaning a binary file, which is also as expected:

    $ file Hanlp_cut_news.bin
    Hanlp_cut_news.bin: data

Going back to the error message and opening line 202 of wordvectors.py (the last frame in the traceback), I noticed:

                if include:
                    vocab[i] = word.decode(encoding)

To see what was going wrong, I temporarily changed the source to:

                if include:
                    try:
                        print(word)
                        print(word.decode(encoding))
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        vocab[i] = word

In the resulting output, the program stopped right after printing one particularly long byte string, so presumably one of the segmented tokens is either mis-encoded or abnormally long.

Copy that long byte string out and test it:

    line = '\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83\xe9\x9b\xb7\xe8\xa5\xbf\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf\xe7\xba\xb3\xe6\x91\xa9\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d'
    print(line)

The output is just a pile of mojibake…
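
To pin the problem down, the same value can be tested as raw bytes, which is how the word is actually stored in the .bin file. A small sketch (`raw` is simply the string above rewritten as a bytes literal and wrapped for readability):

    raw = (b'\xe9\x98\xbf\xe5\xb0\x94\xe6\xaf\x94\xe5\xb7\xb4\xe9\x87\x8c\xe5\xb8\x83'
           b'\xe9\x9b\xb7\xe8\xa5\xbf\xe6\xa0\xbc\xe7\xbd\x97\xe7\x91\x9f\xe6\x9b\xbc'
           b'\xe6\x89\x98\xe7\x93\xa6\xe6\xa2\x85\xe8\xa5\xbf\xe7\xba\xb3\xe6\x91\xa9'
           b'\xe5\xbe\xb7\xe7\xba\xb3\xe7\x89\xb9\xe9\x87\x8c\xe5\x9f\x83\xe8\xb5\xab'
           b'\xe5\xba\x93\xe6\x96\xaf\xe9\x98\xbf\xe6\x8b\x89\xe7\xbb\xb4\xe6\xb2\x99'
           b'\xe6\x8b\x89\xe6\x9b\xbc\xe5\x8d')

    print(len(raw))                               # 98 bytes
    print(raw.decode('utf-8', errors='replace'))  # readable prefix, U+FFFD at the very end
    raw.decode('utf-8')                           # raises the same UnicodeDecodeError

The lenient decode shows a long run of transliterated names that the segmenter glued into a single "word", and the strict decode reproduces the original error: the last two bytes `\xe5\x8d` at positions 96-97 are the first two bytes of a three-byte UTF-8 character whose final byte is missing. In other words, the token was cut off mid-character, very likely because it exceeded the word-length limit (MAX_STRING = 100 bytes) that the underlying word2vec C tool enforces when reading the training file.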

The fix for the original problem is to change the source to:

                if include:
                    # vocab[i] = word.decode(encoding)
                    try:
                        # print(word)
                        # print(word.decode(encoding))
                        vocab[i] = word.decode(encoding)
                    except UnicodeDecodeError:
                        # vocab[i] = word
                        vocab[i] = 'UNK'
                        print(word, 'UNK')
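
As an aside, if you would rather not edit an installed package at all, the broken entries can be listed by walking the .bin file directly. This is only a sketch that relies on the standard word2vec binary layout (a text header "vocab_size vector_size", then for each entry the word, a space, vector_size float32 values and a newline); the function name is mine:

    def scan_bad_words(path, encoding='utf-8'):
        """Print every vocabulary entry in a word2vec .bin file that is not valid UTF-8."""
        with open(path, 'rb') as f:
            vocab_size, vector_size = map(int, f.readline().split())
            for i in range(vocab_size):
                word = bytearray()
                while True:                      # read the word: bytes up to the next space
                    ch = f.read(1)
                    if ch == b' ' or ch == b'':
                        break
                    if ch != b'\n':              # skip the newline written after each vector
                        word.extend(ch)
                f.seek(4 * vector_size, 1)       # skip the float32 vector itself
                try:
                    word.decode(encoding)
                except UnicodeDecodeError:
                    print(i, bytes(word))

    scan_bad_words("model/Hanlp_cut_news.bin")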

The patch above simply skips the offending word by mapping it to 'UNK'. Of course, the cleaner solution would be to clean the data at segmentation time (it's just that my segmented corpus is huge, and re-running the whole segmentation isn't worth it).
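
For reference, here is a minimal sketch of that kind of cleaning (the file names and the 60-byte cutoff are placeholders; the cutoff only needs to stay well below the ~100-byte word limit mentioned above). It drops any token whose UTF-8 encoding is suspiciously long before training:

    MAX_TOKEN_BYTES = 60  # arbitrary threshold; genuine Chinese words are far shorter

    with open('Hanlp_cut_news.txt', encoding='utf-8') as src, \
         open('Hanlp_cut_news_clean.txt', 'w', encoding='utf-8') as dst:
        for line in src:  # the corpus may be one huge line; it is still handled
            kept = [tok for tok in line.split()
                    if len(tok.encode('utf-8')) <= MAX_TOKEN_BYTES]
            dst.write(' '.join(kept) + '\n')

Training on the cleaned file keeps over-long tokens out of the vocabulary in the first place, so they can never be truncated mid-character and the loader patch becomes unnecessary.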
