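A pitfall when parsing the Tencent_AILab_ChineseEmbedding.txt file

Download page for the word vectors provided by Tencent

Problem description

The file is huge, nearly 16 GB after decompression. My machine has limited resources, and I don't think I need that many words anyway, so I decided to write my own code to load the word vectors. That led to the snippet below: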

import numpy as np
from tqdm import tqdm
v = []
id2word = {}
word2id = {}
with open('Tencent_AILab_ChineseEmbedding.txt', 'r', encoding='utf-8') as f: # the word vectors are 200-dimensional
    _ = f.readline() # skip the first (header) line
    for i in tqdm(range(1000000)): # take the first 1,000,000 words
        parts = f.readline().split()
        v.append([float(x) for x in parts[1:]])
        id2word[len(id2word)+1] = parts[0] # ids start at 1
word2id = {i: j for j, i in id2word.items()}

The next step is to convert the list into a NumPy array. But if you do it directly, then congratulations: the line below is wrong.
word2vec = np.array(v, dtype=np.float32)
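Before running the conversion above, you need to run the snippet below. The reason is that one of the words in the vector file is a single space, and the string split method drops it by default. As a result, the 414th element of v is wrong: it has only 199 values, and its first value (-0.073052) was taken as the word.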

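# v is 0-indexed, so the 414th word is v[413]; put the missing first value back
if len(v[413]) != 200:
    print('debugger v')
    v[413].insert(0, -0.073052)
# the first value was stored as the word; the real word is a single space
if id2word[414] == '-0.073052':
    print('debugger id2word')
    id2word[414] = ' '
# fix the reverse mapping in the same way
if '-0.073052' in word2id.keys():
    print('debugger word2id')
    word2id.pop('-0.073052')
    word2id[' '] = 414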
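
The effect described above can be reproduced on a small scale. The following is only a sketch: the line is a made-up, shortened stand-in for the real entry (3 values instead of 200), and it assumes that values are separated by single spaces.

```python
# Made-up, shortened line: the token is a single space, followed by a
# separator space and 3 values (the real file has 200 values per line).
line = "  -0.073052 0.1 0.2\n"

# split() with no arguments treats any run of whitespace as one separator
# and discards leading whitespace, so the space token disappears.
parts = line.split()
print(parts)
print(len(parts))

# Splitting from the right on a single space, at most 3 times, leaves
# whatever remains at the front of the line as the token.
token, *values = line.rstrip("\n").rsplit(" ", 3)
print(repr(token), values)
```

With `split()` the first number ends up in the position of the word, which is the mismatch the patch above repairs by hand.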
Complete code
import numpy as np
from tqdm import tqdm
v = []
id2word = {}
word2id = {}
with open('Tencent_AILab_ChineseEmbedding.txt', 'r', encoding='utf-8') as f: # the word vectors are 200-dimensional
    _ = f.readline() # skip the first (header) line
    for i in tqdm(range(1000000)): # take the first 1,000,000 words
        parts = f.readline().split()
        v.append([float(x) for x in parts[1:]])
        id2word[len(id2word)+1] = parts[0] # ids start at 1
word2id = {i: j for j, i in id2word.items()}

if len(v[413]) != 200:
    print('debugger v')
    v[413].insert(0, -0.073052)
if id2word[414] == '-0.073052':
    print('debugger id2word')
    id2word[414] = ' '
if '-0.073052' in word2id.keys():
    print('debugger word2id')
    word2id.pop('-0.073052')
    word2id[' '] = 414
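word2vec = np.array(v, dtype=np.float32)
word_size = word2vec.shape[1]
# prepend an all-zero row so that row i corresponds to id i (ids start at 1);
# dtype is given explicitly so the result stays float32 instead of being upcast to float64
word2vec = np.concatenate([np.zeros((1, word_size), dtype=np.float32), word2vec])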
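
As an alternative to patching index 413 by hand, one possible approach is to split each line from the right, so that a token which is itself a space is kept. This is a minimal sketch and not the method used above: `load_vectors` is a hypothetical helper, and it assumes that every line has the form "token, one space, then `dim` values separated by single spaces" with no trailing whitespace. The example runs on a tiny in-memory stand-in for the file with `dim=3`.

```python
import io
import numpy as np

def load_vectors(f, max_words, dim=200):
    """Read up to max_words entries from an open text file f (hypothetical helper)."""
    f.readline()  # skip the header line
    id2word, rows = {}, []
    for line in f:
        if len(rows) >= max_words:
            break
        # Split from the right so the token may itself be (or contain) a space.
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the assumed layout
        id2word[len(rows) + 1] = parts[0]  # ids start at 1, as in the code above
        rows.append([float(x) for x in parts[1:]])
    word2id = {w: i for i, w in id2word.items()}
    return id2word, word2id, np.array(rows, dtype=np.float32)

# Tiny made-up stand-in for the real file; the second entry's token is a space.
sample = io.StringIO("3 3\n的 0.1 0.2 0.3\n  -0.073052 0.4 0.5\n了 0.6 0.7 0.8\n")
id2word, word2id, vecs = load_vectors(sample, max_words=10, dim=3)
print(id2word)
print(vecs.shape)
```

For the real file, the call would presumably look like `load_vectors(f, 1000000)` inside the same `with open(...)` block, followed by the same zero-row concatenation as above.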
