This article is based on Baidu's PaddlePaddle platform.
Project links:
『NLP打卡营』Practice Lesson 1: Word Vector Application Demo
VisualDL official documentation
Embedding source code
pip install --upgrade paddlenlp -i https://pypi.org/simple
Requirement already up-to-date: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.1.1)
Requirement already satisfied, skipping upgrade: jieba in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (0.42.1)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from paddlenlp) (2.9.0)
... (remaining pip output omitted)
from paddlenlp.embeddings import TokenEmbedding
# Initialize TokenEmbedding; if the pretrained embedding has not been downloaded yet, it will be downloaded and loaded automatically
token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")
# Inspect token_embedding details
print(token_embedding)
[2021-11-10 21:42:13,213] [ INFO] - Loading token embedding...
W1110 21:42:18.557029 1415 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1110 21:42:18.562464 1415 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2021-11-10 21:42:23,903] [ INFO] - Finish loading embedding vector.
[2021-11-10 21:42:23,906] [ INFO] - Token Embedding info:
Unknown index: 635963
Unknown token: [UNK]
Padding index: 635964
Padding token: [PAD]
Shape :[635965, 300]
Object type: TokenEmbedding(635965, 300, padding_idx=635964, sparse=False)
Unknown index: 635963
Unknown token: [UNK]
Padding index: 635964
Padding token: [PAD]
Parameter containing:
Tensor(shape=[635965, 300], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
[[-0.24200200, 0.13931701, 0.07378800, ..., 0.14103900,
0.05592300, -0.08004800],
[-0.08671700, 0.07770800, 0.09515300, ..., 0.11196400,
0.03082200, -0.12893000],
[-0.11436500, 0.12201900, 0.02833000, ..., 0.11068700,
0.03607300, -0.13763499],
...,
[ 0.02628800, -0.00008300, -0.00393500, ..., 0.00654000,
0.00024600, -0.00662600],
[-0.01989385, -0.02005955, 0.01555019, ..., 0.00248810,
-0.02033536, -0.01693229],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ]])
Here a 300-dimensional pretrained embedding is used (word2vec vectors trained on Baidu Encyclopedia).
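PaddleNLP ships many other pretrained embeddings besides the one used here. If I recall the API correctly, recent releases expose a list_embedding_name helper to enumerate them (treat this helper as an assumption if your version differs):
from paddlenlp.embeddings import list_embedding_name
# Prints the available embedding names, e.g. w2v.baidu_encyclopedia.target.word-word.dim300
print(list_embedding_name())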
TokenEmbedding.search()
Retrieves the word vector for a given token.
test_token_embedding = token_embedding.search("中国")
print(test_token_embedding)
[[ 0.260801 0.1047 0.129453 -0.257317 -0.16152 0.19567 -0.074868
0.361168 0.245882 -0.219141 -0.388083 0.235189 0.029316 0.154215
-0.354343 0.017746 0.009028 0.01197 -0.121429 0.096542 0.009255
0.039721 0.363704 -0.239497 -0.41168 0.16958 0.261758 0.022383
-0.053248 -0.000994 -0.209913 -0.208296 0.197332 -0.3426 -0.162112
0.134557 -0.250201 0.431298 0.303116 0.517221 0.243843 0.022219
-0.136554 -0.189223 0.148563 -0.042963 -0.456198 0.14546 -0.041207
0.049685 0.20294 0.147355 -0.206953 -0.302796 -0.111834 0.128183
0.289539 -0.298934 -0.096412 0.063079 0.324821 -0.144471 0.052456
0.088761 -0.040925 -0.103281 -0.216065 -0.200878 -0.100664 0.170614
-0.355546 -0.062115 -0.52595 -0.235442 0.300866 -0.521523 -0.070713
-0.331768 0.023021 0.309111 -0.125696 0.016723 -0.0321 -0.200611
0.057294 -0.128891 -0.392886 0.423002 0.282569 -0.212836 0.450132
0.067604 -0.124928 -0.294086 0.136479 0.091505 -0.061723 -0.577495
0.293856 -0.401198 0.302559 -0.467656 0.021708 -0.088507 0.088322
-0.015567 0.136594 0.112152 0.005394 0.133818 0.071278 -0.198807
0.043538 0.116647 -0.210486 -0.217972 -0.320675 0.293977 0.277564
0.09591 -0.359836 0.473573 0.083847 0.240604 0.441624 0.087959
0.064355 -0.108271 0.055709 0.380487 -0.045262 0.04014 -0.259215
-0.398335 0.52712 -0.181298 0.448978 -0.114245 -0.028225 -0.146037
0.347414 -0.076505 0.461865 -0.105099 0.131892 0.079946 0.32422
-0.258629 0.05225 0.566337 0.348371 0.124111 0.229154 0.075039
-0.139532 -0.08839 -0.026703 -0.222828 -0.106018 0.324477 0.128269
-0.045624 0.071815 -0.135702 0.261474 0.297334 -0.031481 0.18959
0.128716 0.090022 0.037609 -0.049669 0.092909 0.0564 -0.347994
-0.367187 -0.292187 0.021649 -0.102004 -0.398568 -0.278248 -0.082361
-0.161823 0.044846 0.212597 -0.013164 0.005527 -0.004024 0.176243
0.237274 -0.174856 -0.197214 0.150825 -0.164427 -0.244255 -0.14897
0.098907 -0.295891 -0.013408 -0.146875 -0.126049 0.033235 -0.133444
-0.003258 0.082053 -0.162569 0.283657 0.315608 -0.171281 -0.276051
0.258458 0.214045 -0.129798 -0.511728 0.198481 -0.35632 -0.186253
-0.203719 0.22004 -0.016474 0.080321 -0.463004 0.290794 -0.003445
0.061247 -0.069157 -0.022525 0.13514 0.001354 0.011079 0.014223
-0.079145 -0.41402 -0.404242 -0.301509 0.036712 0.037076 -0.061683
-0.202429 0.130216 0.054355 0.140883 -0.030627 -0.281293 -0.28059
-0.214048 -0.467033 0.203632 -0.541544 0.183898 -0.129535 -0.286422
-0.162222 0.262487 0.450505 0.11551 -0.247965 -0.15837 0.060613
-0.285358 0.498203 0.025008 -0.256397 0.207582 0.166383 0.669677
-0.067961 -0.049835 -0.444369 0.369306 0.134493 -0.080478 -0.304565
-0.091756 0.053657 0.114497 -0.076645 -0.123933 0.168645 0.018987
-0.260592 -0.019668 -0.063312 -0.094939 0.657352 0.247547 -0.161621
0.289043 -0.284084 0.205076 0.059885 0.055871 0.159309 0.062181
0.123634 0.282932 0.140399 -0.076253 -0.087103 0.07262 ]]
As shown above, each token in the embedding table maps to a 300-dimensional word vector.
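A quick way to confirm the dimensionality: search() returns a numpy array with one row per queried word, so the shape can be inspected directly (a small check of my own):
# Expected: (1, 300) -- one queried word, 300 dimensions
print(test_token_embedding.shape)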
cosine_sim()
Computes the cosine similarity between two word vectors, which can be used to judge how close two words are; semantically similar pairs should score higher, indicating better representational quality.
score1 = token_embedding.cosine_sim("女孩", "女人")
score2 = token_embedding.cosine_sim("女孩", "书籍")
print('score1:', score1)
print('score2:', score2)
score1: 0.7017183
score2: 0.19189896
As the results above show, the more semantically similar two words are, the shorter the distance between their vectors and the higher the corresponding cosine value.
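For intuition, cosine_sim() is equivalent to taking the dot product of the two vectors and dividing by the product of their norms. A minimal numpy sketch that recomputes score1 by hand:
import numpy as np

vec_a = token_embedding.search("女孩")[0]
vec_b = token_embedding.search("女人")[0]
# Cosine similarity = dot product of the vectors over the product of their norms
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cos)  # should match score1 (about 0.7017)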
VisualDL's High Dimensional component can be used to visualize the structure of the embedding space. First, upgrade VisualDL to the latest version:
pip install --upgrade visualdl
# Take the first 1000 tokens from the vocabulary
labels = token_embedding.vocab.to_tokens(list(range(0, 1000)))
# Fetch the embeddings of these 1000 tokens
test_token_embedding = token_embedding.search(labels)

# Use VisualDL's LogWriter to record the embeddings
from visualdl import LogWriter

with LogWriter(logdir='./token_hidi') as writer:
    writer.add_embeddings(tag='test', mat=[i for i in test_token_embedding], metadata=labels)
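Once the log is written, start the VisualDL service on that directory and open the High Dimensional tab in a browser to explore the 1000 word vectors (the port number below is an arbitrary choice):
visualdl --logdir ./token_hidi --port 8080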
paddlenlp.TokenEmbedding: builds the word-embedding layer
paddlenlp.seq2vec.BoWEncoder: builds the sentence-modeling layer
import paddle
import paddle.nn as nn
import paddlenlp
class BoWModel(nn.Layer):
    def __init__(self, embedder):
        super().__init__()
        self.embedder = embedder
        emb_dim = self.embedder.embedding_dim
        self.encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim)
        self.cos_sim_func = nn.CosineSimilarity(axis=-1)

    def get_cos_sim(self, text_a, text_b):
        text_a_embedding = self.forward(text_a)
        text_b_embedding = self.forward(text_b)
        cos_sim = self.cos_sim_func(text_a_embedding, text_b_embedding)
        return cos_sim

    def forward(self, text):
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, embedding_dim)
        summed = self.encoder(embedded_text)
        return summed
model = BoWModel(embedder=token_embedding)
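As a quick sanity check (the token ids below are arbitrary, chosen only for illustration): BoWEncoder simply sums the token embeddings over the sequence, so any batch of token ids yields one 300-dimensional vector per example.
dummy_ids = paddle.to_tensor([[10, 20, 30]])  # arbitrary token ids, batch size 1
print(model(dummy_ids).shape)  # expected: [1, 300]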
Construct a Tokenizer from the TokenEmbedding vocabulary
from data import Tokenizer

# Tokenizer (note: data is a hand-written local script, data.py)
tokenizer = Tokenizer()
tokenizer.set_vocab(vocab=token_embedding.vocab)
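Since data.py is a hand-written script that is not shown here, the following is only a minimal sketch of what its Tokenizer might look like, assuming jieba segmentation plus a vocabulary lookup (the method names set_vocab and text_to_ids are taken from the usage above; the real script may differ):
# Minimal sketch of a possible data.py -- an assumption, not the original script
import jieba

class Tokenizer(object):
    def __init__(self):
        self.vocab = None

    def set_vocab(self, vocab):
        self.vocab = vocab

    def text_to_ids(self, text):
        # Segment the text with jieba, then map each token to its vocabulary id;
        # paddlenlp's Vocab maps out-of-vocabulary tokens to the [UNK] id.
        return [self.vocab.to_indices(token) for token in jieba.cut(text)]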
text_pairs = {}
with open('text_pair.txt', 'r', encoding="utf8") as f:
    for line in f:
        text_a, text_b = line.strip().split("\t")
        if text_a not in text_pairs:
            text_pairs[text_a] = []
        text_pairs[text_a].append(text_b)
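For reference, the parsing code above implies that each line of text_pair.txt holds two tab-separated sentences. Using a pair that appears in the output below (<TAB> stands for a literal tab character):
多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解<TAB>多项式矩阵的左共轭积及其应用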
for text_a, text_b_list in text_pairs.items():
    # Look up the token ids for text_a
    text_a_ids = paddle.to_tensor([tokenizer.text_to_ids(text_a)])
    # For each sentence paired with text_a
    for text_b in text_b_list:
        # Look up the token ids for text_b
        text_b_ids = paddle.to_tensor([tokenizer.text_to_ids(text_b)])
        print("text_a: {}".format(text_a))
        print("text_b: {}".format(text_b))
        print("Similarity: {}".format(model.get_cos_sim(text_a_ids, text_b_ids).numpy()[0]))
        print()
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: 多项式矩阵的左共轭积及其应用
Similarity: 0.8861938714981079
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: 退化阻尼对高维可压缩欧拉方程组经典解的影响
Similarity: 0.7975839972496033
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: Burgers方程基于特征正交分解方法的数值解法研究
Similarity: 0.8188782930374146
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: 有界对称域上解析函数空间的若干性质
Similarity: 0.8041478395462036
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: 基于卷积神经网络的图像复杂度研究与应用
Similarity: 0.7444740533828735
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: Cartesian发射机中线性功率放大器的研究
Similarity: 0.7536822557449341
text_a: 多项式矩阵左共轭积对偶Sylvester共轭和数学算子完备参数解
text_b: CFRP加固WF型梁侧扭屈曲的几何非线性有限元分析
Similarity: 0.7572889924049377
... (further pairs omitted)
text_a: 互联网企业互动问答社区产品盈利模式经营策略商业价值
text_b: 基于创新的中国广告产业演化研究
Similarity: 0.7780816555023193
text_a: 互联网企业互动问答社区产品盈利模式经营策略商业价值
text_b: 高管性别结构、内部制衡与企业技术创新——基于我国创业板上市企业的实证研究
Similarity: 0.7984799146652222
text_a: 互联网企业互动问答社区产品盈利模式经营策略商业价值
text_b: 环境扫描对企业竞争优势的影响研究--以电子信息行业为例
Similarity: 0.7848146557807922
text_a: 互联网企业互动问答社区产品盈利模式经营策略商业价值
text_b: 高管团队特征对公司绩效的影响——以我国新三板教育行业公司为例
Similarity: 0.8023167252540588
text_a: 互联网企业互动问答社区产品盈利模式经营策略商业价值
text_b: 国有润滑油企业市场开发策略研究
Similarity: 0.8262609243392944
# Use VisualDL's LogWriter to record the sentence embeddings
import numpy as np
from visualdl import LogWriter

# Collect the sentences and their corresponding vectors
label_list = []
embedding_list = []
for text_a, text_b_list in text_pairs.items():
    text_a_ids = paddle.to_tensor([tokenizer.text_to_ids(text_a)])
    embedding_list.append(model(text_a_ids).flatten().numpy())
    label_list.append(text_a)
    for text_b in text_b_list:
        # Look up the token ids for text_b
        text_b_ids = paddle.to_tensor([tokenizer.text_to_ids(text_b)])
        embedding_list.append(model(text_b_ids).flatten().numpy())
        label_list.append(text_b)

with LogWriter(logdir='./sentence_hidi') as writer:
    writer.add_embeddings(tag='test', mat=embedding_list, metadata=label_list)
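As with the word embeddings, start VisualDL on this log directory and open the High Dimensional tab to browse the sentence vectors:
visualdl --logdir ./sentence_hidi --port 8080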
Dear readers, since you've made it this far, please give the author a like; your support is the greatest motivation to keep writing!
<(^-^)>
My knowledge is limited; if there are any mistakes, corrections are sincerely welcome.
This article is intended only for learning and exchange among fellow enthusiasts, not for any commercial use. If any copyright issue is involved, please contact the author promptly.