Generating an embedding from a GloVe word-vector file and loading it in PyTorch

First, download the GloVe file.

The format is a plain txt file: each line starts with the word, followed by its 100 float values, all separated by spaces. So we load the file and process it line by line:
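For illustration, a single line parses like this (the sample word and values below are made up, not taken from the real file):

# Hypothetical sample line: a word followed by 100 space-separated floats
line = "中国 " + " ".join(["0.01"] * 100)
line_list = line.split()
word = line_list[0]                             # "中国"
embed = [float(num) for num in line_list[1:]]   # 100 floats
assert len(embed) == 100

The full loading function is below.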

import os

def get_numpy_word_embed(word2ix):
    file = 'zhs_wiki_glove.vectors.100d.txt'
    path = '/home/socialbird/platform/aion-autonlp/Downloads'
    whole = os.path.join(path, file)
    words_embed = {}
    with open(whole, mode='r', encoding='utf-8') as f:
        for row, line in enumerate(f):
            line_list = line.split()
            word = line_list[0]
            embed = [float(num) for num in line_list[1:]]
            words_embed[word] = embed
            if row > 20000:  # only keep the first ~20000 words
                break
    # invert word2ix so we can look up each word by its id
    ix2word = {ix: w for w, ix in word2ix.items()}
    id2emb = {}
    for ix in range(len(word2ix)):
        if ix2word[ix] in words_embed:
            id2emb[ix] = words_embed[ix2word[ix]]
        else:
            # words missing from the GloVe file get a zero vector
            id2emb[ix] = [0.0] * 100
    data = [id2emb[ix] for ix in range(len(word2ix))]

    return data

Assuming we have already built our own vocabulary, we only need to load these words. So we build id2word from word2ix and store the embeddings by id. Finally we assemble a 2D list representing the embedding, with shape vocabulary size × 100 (100 being the embedding size).
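The function above assumes word2ix has already been built. As a minimal sketch (the tokens here are purely illustrative), such a dictionary could look like this:

# Minimal sketch: build word2ix from a tokenized corpus (tokens are hypothetical)
tokens = ["我", "喜欢", "自然", "语言", "处理", "我"]
word2ix = {}
for tok in tokens:
    if tok not in word2ix:
        word2ix[tok] = len(word2ix)
# => {'我': 0, '喜欢': 1, '自然': 2, '语言': 3, '处理': 4}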


import torch
import torch.nn as nn

numpy_embed = get_numpy_word_embed(word2ix)
# build the embedding layer from the pretrained matrix and move it to the GPU
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(numpy_embed)).to('cuda')

Finally, the from_pretrained method loads the embedding so that it becomes a layer of the network.
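Note that from_pretrained freezes the embedding weights by default; pass freeze=False if the vectors should be fine-tuned during training. A quick lookup sketch using the layer built above (the token ids are illustrative):

import torch
import torch.nn as nn

# trainable variant: the default is freeze=True
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(numpy_embed), freeze=False).to('cuda')

ids = torch.LongTensor([[0, 1, 2]]).to('cuda')  # a batch of token ids
out = embedding(ids)                            # shape: (1, 3, 100)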
