Character-level word2vec

The paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF", in its POS tagging experiments, encodes each word's characters with a convolutional neural network to capture character-level information.
(Figure 1 from the paper: the CNN character encoder; original screenshot omitted)
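The paper's character encoder is, in essence, a 1-D convolution over a word's character embeddings followed by max-pooling into a fixed-size vector. A minimal numpy sketch of that idea (the kernel width, filter count, padding, and tanh nonlinearity are my assumptions for illustration, not details from this post):

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 30        # character embedding dimension (as in the paper)
n_filters = 30  # number of CNN filters (assumed)
width = 3       # convolution window over characters (assumed)

# one filter bank shared across all words
filters = rng.normal(size=(width * dim, n_filters)) * 0.1

def char_cnn_encode(char_vecs):
    """char_vecs: (word_len, dim) character embeddings of one word.
    Returns a fixed-size (n_filters,) word vector via conv + max-pooling."""
    pad = np.zeros((width - 1, dim))
    padded = np.vstack([pad, char_vecs, pad])          # zero-pad both ends
    windows = np.stack([padded[i:i + width].ravel()    # sliding windows
                        for i in range(padded.shape[0] - width + 1)])
    conv = np.tanh(windows @ filters)                  # (n_windows, n_filters)
    return conv.max(axis=0)                            # max over positions

vec = char_cnn_encode(rng.normal(size=(5, dim)))       # a 5-character word
print(vec.shape)
```

Whatever the word length, max-pooling over character positions yields the same output size, which is what lets the encoder feed a fixed-width word representation into the downstream LSTM.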

The paper's experiments use character embeddings of dimension 30, initialized uniformly in the range [-sqrt(3/dim), sqrt(3/dim)]. So I first tried training character embeddings with word2vec.
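That initialization range is easy to reproduce directly. A small numpy sketch (uniform sampling in the stated range is the paper's scheme; the alphabet size of 42 here just matches the character set used below and is otherwise illustrative):

```python
import numpy as np

dim = 30                      # character embedding dimension
bound = np.sqrt(3.0 / dim)    # paper's range: [-sqrt(3/dim), sqrt(3/dim)]

rng = np.random.default_rng(0)
# 42 characters in the alphabet, each mapped to a 30-d embedding
char_embeddings = rng.uniform(-bound, bound, size=(42, dim))

print(char_embeddings.shape)
```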

# Train character-level vectors: treat each word as a "sentence" of characters
from gensim.models import Word2Vec

alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
text = open('text').read().replace('\n', ' ').lower()
filtered = ''.join(ch for ch in text if ch in alphabet)   # keep only allowed characters
words = [t for t in filtered.split(' ') if len(t) >= 2]   # drop single-character tokens
char_sequences = [list(w) for w in words]                 # e.g. 'cat' -> ['c', 'a', 't']
print(char_sequences)

# gensim >= 4 renamed `size` to `vector_size`; use size=30 on gensim 3.x
model = Word2Vec(char_sequences, vector_size=30, window=5, min_count=1)
model.save('char_embeddings.vec')

The preprocessing yields one character list per word (the original post showed a screenshot of this output here).
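The same preprocessing can be checked on a tiny string (the sample sentence is mine, not from the post; note that single-character words like "a" are dropped, though the letter still occurs inside other words):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '

text = 'The cat\nsat on a MAT.'.replace('\n', ' ').lower()
filtered = ''.join(ch for ch in text if ch in alphabet)
words = [t for t in filtered.split(' ') if len(t) >= 2]
char_sequences = [list(w) for w in words]
print(char_sequences)
# [['t','h','e'], ['c','a','t'], ['s','a','t'], ['o','n'], ['m','a','t','.']]
```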

A quick test of the trained model (in gensim >= 4 vectors must be accessed through model.wv; this also works on gensim 3.x):

print(model.wv['a'])                       # the 30-d vector for character 'a'
print(model.wv.most_similar('a', topn=5))  # nearest characters by cosine similarity
---------------------------------------
array([-0.01051879,  0.00305209,  0.00773612,  0.01362684,  0.01594807,
        0.01029609,  0.00346048,  0.00261297, -0.01034051,  0.00964036,
       -0.00509238,  0.0021358 , -0.00605083,  0.0087046 ,  0.00930654,
        0.01411205,  0.00340451, -0.0071094 , -0.00138468,  0.00443402,
        0.00809182, -0.00498053, -0.00288919,  0.01092559, -0.01460177,
       -0.00596451, -0.00200858, -0.01376272,  0.00229289,  0.01006972], dtype=float32)

[('w', 0.5829492211341858), ('c', 0.34324681758880615), ('k', 0.3245270252227783), ('u', 0.20812581479549408), ('i', 0.15292495489120483)]  
