诸神缄默不语 - Personal CSDN Blog Post Index
This post describes how to use pretrained word representations (word2vec, etc.) in deep learning; the frameworks covered include numpy, PyTorch, and TensorFlow.
There are various forms; I will keep adding to this summary as I come across them.
Last updated: 2022.12.16
First published: 2022.12.15
numpy.ndarray
The padding vector is usually all zeros (which makes applying masks convenient).
The UNK vector can be the mean pooling of the other vectors, or simply a randomly initialized vector.
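A minimal sketch of both conventions (the vocabulary size, dimensionality, and variable names below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_vectors = rng.normal(size=(1000, 200)).astype(np.float32)  # stand-in for pretrained vectors

pad_vector = np.zeros(200, dtype=np.float32)  # padding: all zeros
unk_vector = vocab_vectors.mean(axis=0)       # UNK: mean pooling of the known vectors
# alternatively: unk_vector = rng.normal(size=200).astype(np.float32)  # random init

# reserve rows 0/1 for padding/UNK in the final matrix
embedding = np.vstack([pad_vector, unk_vector, vocab_vectors])
```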
A word-vector matrix stored as an np.ndarray can be saved with np.save and loaded with np.load (example adapted from the LADAN project):

```python
import pickle as pk
import numpy as np

# word-to-index mapping, saved with pickle
with open('data_and_config/data/w2id_thulac.pkl', 'rb') as f:
    word2id_dict = pk.load(f)

# embedding matrix, saved with np.save as a .npy file
emb_path = 'cail_thulac.npy'
word_embedding = np.load(emb_path).astype(np.float32)  # np.cast is deprecated; use astype
```
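For completeness, a sketch of how such a file pair could be produced in the first place (the toy vocabulary and zero matrix are stand-ins):

```python
import pickle as pk
import numpy as np

word2id_dict = {'BLANK': 0, 'UNK': 1, '法院': 2}  # toy vocabulary
word_embedding = np.zeros((len(word2id_dict), 200), dtype=np.float32)  # stand-in vectors

with open('data_and_config/data/w2id_thulac.pkl', 'wb') as f:
    pk.dump(word2id_dict, f)
np.save('cail_thulac.npy', word_embedding)  # np.load() reads this back
```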
The embedding matrix can also be represented as an np.ndarray or a torch.FloatTensor. A tensor constructed without initialization can be set to all zeros with functions such as zeros() or zero_().
1. Variant 1
```python
import json
import numpy as np
import torch
from tqdm import tqdm

word2vec = json.load(open('./word2vec.json', 'r'))
word2vec['UNK'] = np.random.randn(200).tolist()  # random vector for unknown words
word2vec['Padding'] = [0. for i in range(200)]   # all-zero padding vector

embedding = torch.FloatTensor(339503, 200).zero_()  # uninitialized tensor, zeroed out
idx2word = json.load(open('./idx2word.json', 'r'))
for i in tqdm(range(339503)):  # fill row i with the vector of the word with index i
    word = idx2word[str(i)]
    result = list(map(float, word2vec[word]))
    embedding[i] = torch.from_numpy(np.array(result))
```
2. Variant 2
```python
import json
import numpy as np
import torch

word2id = json.load(open('word2vec/word2id.json'))
word2vec = json.load(open('word2vec/word2vec.json'))
word2vec['UNK'] = np.random.randn(200).tolist()
word2vec['Padding'] = [0. for i in range(200)]

embedding = np.zeros((339503, 200))  # build the matrix in numpy first
for k in word2id:  # row word2id[k] gets the vector of word k
    embedding[word2id[k], :] = [float(factor) for factor in word2vec[k]]
embedding = torch.from_numpy(embedding)
```
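One detail worth noting with Variant 2: np.zeros() defaults to float64, so torch.from_numpy() returns a DoubleTensor. The weight.data.copy_() call further below casts automatically, but if you use the tensor directly you may want to convert it to float32:

```python
embedding = torch.from_numpy(embedding).float()  # cast float64 -> float32
```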
Looking up the embeddings in TensorFlow (example code adapted from the LADAN project):

```python
description_layer = tf.nn.embedding_lookup(word_embedding, num_input)
```

Here word_embedding is the word-vector matrix in np.ndarray format, and num_input is the text matrix that has already been converted to indices and padded.
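A self-contained toy version of the same lookup (vocabulary size, dimensionality, and indices invented for illustration):

```python
import numpy as np
import tensorflow as tf

word_embedding = np.random.rand(10, 4).astype(np.float32)  # 10-word vocab, 4-d vectors
num_input = tf.constant([[1, 3, 0, 0],                     # two index sequences,
                         [2, 5, 7, 0]])                    # already padded with 0

description_layer = tf.nn.embedding_lookup(word_embedding, num_input)
print(description_layer.shape)  # (2, 4, 4): batch x seq_len x embedding_dim
```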
Loading the matrix into torch.nn.Embedding (example code adapted from the NeurJudge project), in the __init__() of a torch.nn.Module subclass; setting requires_grad to False means the embedding weights are not updated during training:

```python
self.embs = nn.Embedding(339503, 200)   # vocab size x embedding dim
self.embs.weight.data.copy_(embedding)  # load the pretrained matrix built above
self.embs.weight.requires_grad = False  # freeze: do not update during training
```
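The same effect can be achieved with nn.Embedding.from_pretrained(), which freezes the weights by default (the random matrix below is a stand-in for the one built above):

```python
import torch
import torch.nn as nn

embedding = torch.randn(339503, 200)  # stand-in for the pretrained matrix
embs = nn.Embedding.from_pretrained(embedding, freeze=True)  # freeze=True is the default
```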
Converting tokenized text to index sequences. One way, using the same kind of vocabulary as the LADAN example above (fact is assumed to be a tokenized word list, max_length the truncation length, and BLANK the padding token):

```python
id_list = []
for j in range(int(min(len(fact), max_length))):  # iterate over words within the max length
    if fact[j] in word2id_dict:                   # in-vocabulary word
        id_list.append(int(word2id_dict[fact[j]]))
    else:                                         # out-of-vocabulary word -> UNK
        id_list.append(int(word2id_dict['UNK']))
while len(id_list) < 512:                         # pad to a fixed length with BLANK
    id_list.append(int(word2id_dict['BLANK']))
```
Another way, written as methods of a data-processing class (self.word2id is the word-to-index dict):

```python
def transform(self, word):
    """Convert a word to its index (OOV words map to UNK)."""
    if word not in self.word2id:
        return self.word2id['UNK']
    else:
        return self.word2id[word]

def seq2tensor(self, sents, max_len=350):
    """Convert a batch of sentences to an index tensor.
    Each element of sents is an already tokenized word list."""
    sent_len_max = max([len(s) for s in sents])
    sent_len_max = min(sent_len_max, max_len)  # cap the sequence length
    sent_tensor = torch.LongTensor(len(sents), sent_len_max).zero_()  # 0 = padding index
    sent_len = torch.LongTensor(len(sents)).zero_()
    for s_id, sent in enumerate(sents):
        sent_len[s_id] = len(sent)  # original (untruncated) length
        for w_id, word in enumerate(sent):
            if w_id >= sent_len_max:  # truncate overlong sentences
                break
            sent_tensor[s_id][w_id] = self.transform(word)
    return sent_tensor, sent_len
```
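A hypothetical end-to-end usage sketch (the Tokenizer wrapper class and toy vocabulary are invented for illustration; index 0 is assumed to be the padding index):

```python
import torch
import torch.nn as nn

class Tokenizer:
    """Toy wrapper around the two methods above."""
    def __init__(self, word2id):
        self.word2id = word2id

    def transform(self, word):
        return self.word2id.get(word, self.word2id['UNK'])

    def seq2tensor(self, sents, max_len=350):
        sent_len_max = min(max(len(s) for s in sents), max_len)
        sent_tensor = torch.LongTensor(len(sents), sent_len_max).zero_()
        sent_len = torch.LongTensor(len(sents)).zero_()
        for s_id, sent in enumerate(sents):
            sent_len[s_id] = len(sent)
            for w_id, word in enumerate(sent[:sent_len_max]):
                sent_tensor[s_id][w_id] = self.transform(word)
        return sent_tensor, sent_len

tok = Tokenizer({'Padding': 0, 'UNK': 1, '被告人': 2, '盗窃': 3})
sent_tensor, sent_len = tok.seq2tensor([['被告人', '盗窃'], ['被告人', '手机']])
print(sent_tensor)  # tensor([[2, 3], [2, 1]]) -- '手机' is OOV, mapped to UNK

embs = nn.Embedding(4, 200, padding_idx=0)  # pretrained weights would be loaded here
word_vectors = embs(sent_tensor)            # shape: (2, 2, 200)
```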