BOW and TF-IDF only consider how often each word appears in a document and ignore the contextual relationships within language. To capture that context, Google's research team proposed Word2Vec, which represents each word by its surrounding context and converts it into a vector; this is known as a word embedding (Word Embedding). Unlike TF-IDF, which outputs sparse vectors, word embeddings produce a dense vector space.
Gensim provides Word2Vec support in gensim.models.word2vec. The Word2Vec algorithm includes the Skip-Gram and CBOW models, trained with either hierarchical softmax or negative sampling.
Word2Vec is not the only way to train word vectors in Gensim; Doc2Vec and FastText are also available.
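Doc2Vec and FastText share a very similar training interface. As a minimal sketch (not part of the original text), a FastText model could be trained on the same toy corpus like this:

from gensim.test.utils import common_texts
from gensim.models import FastText

# A minimal sketch: FastText takes the same kind of tokenized corpus as Word2Vec,
# but additionally learns character n-gram vectors, so it can also build vectors
# for out-of-vocabulary words.
ft_model = FastText(sentences=common_texts, vector_size=100, window=5, min_count=1)
print(ft_model.wv['computer'].shape)  # (100,)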
Initialize the model:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
The content of common_texts is as follows:
[
['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']
]
model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
Training is streamed, so sentences can be any iterable that reads the input data on the fly from disk or the network, without loading the entire corpus into RAM.
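For example, a corpus can be streamed from disk with a simple restartable iterable (a sketch; corpus.txt is a hypothetical file with one whitespace-tokenized sentence per line):

from gensim.models import Word2Vec

class MyCorpus:
    """Stream sentences from disk, one tokenized sentence per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()  # simple whitespace tokenization

# 'corpus.txt' is a hypothetical file; the whole file is never loaded into RAM,
# because Word2Vec just iterates over the corpus object several times.
model = Word2Vec(sentences=MyCorpus('corpus.txt'), vector_size=100, window=5, min_count=1)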
If you save the model, you can continue training it later:
model = Word2Vec.load("word2vec.model")
print(model.train([["hello", "world"]], total_examples=1, epochs=1))
(0, 2)
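The returned tuple (0, 2) means 0 effective words were trained out of 2 raw words, because 'hello' and 'world' are not in the model's vocabulary. To actually learn vectors for new words, the vocabulary must be extended first (a sketch; new_sentences is a hypothetical list of tokenized sentences):

# Extend the existing vocabulary, then continue training on the new text.
new_sentences = [["hello", "world"], ["hello", "gensim"]]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)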
The trained word vectors are stored in a KeyedVectors instance, available as model.wv:
vector = model.wv['computer'] # get numpy vector of a word
sims = model.wv.most_similar('computer', topn=10) # get other similar words
print(sims)
[('system', 0.21617142856121063),
('survey', 0.044689200818538666),
('interface', 0.01520337350666523),
('time', 0.0019510575802996755),
('trees', -0.03284314647316933),
('human', -0.0742427185177803),
('response', -0.09317588806152344),
('graph', -0.09575346857309341),
('eps', -0.10513805598020554),
('user', -0.16911622881889343)]
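Besides most_similar(), KeyedVectors supports a few other common queries (a brief sketch using the same toy model):

# Cosine similarity between two in-vocabulary words.
print(model.wv.similarity('computer', 'system'))

# Pick the word least like the others.
print(model.wv.doesnt_match(['computer', 'system', 'interface', 'trees']))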
The reason for separating the trained vectors into KeyedVectors is that if you no longer need the full model state (no further training), it can be discarded, keeping only the vectors and their keys. This results in a much smaller and faster object that can be memory-mapped for lightning-fast loading and for sharing the vectors in RAM across processes.
from gensim.models import KeyedVectors
# Store just the words + their trained embeddings.
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")
# Load back with memory-mapping = read-only, shared across processes.
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')
vector = wv['computer'] # Get numpy vector of a word
print(vector)
[-0.00515774 -0.00667028 -0.0077791 0.00831315 -0.00198292 -0.00685696
-0.0041556 0.00514562 -0.00286997 -0.00375075 0.0016219 -0.0027771
-0.00158482 0.0010748 -0.00297881 0.00852176 0.00391207 -0.00996176
0.00626142 -0.00675622 0.00076966 0.00440552 -0.00510486 -0.00211128
0.00809783 -0.00424503 -0.00763848 0.00926061 -0.00215612 -0.00472081
0.00857329 0.00428458 0.0043261 0.00928722 -0.00845554 0.00525685
0.00203994 0.0041895 0.00169839 0.00446543 0.00448759 0.0061063
-0.00320303 -0.00457706 -0.00042664 0.00253447 -0.00326412 0.00605948
0.00415534 0.00776685 0.00257002 0.00811904 -0.00138761 0.00808028
0.0037181 -0.00804967 -0.00393476 -0.0024726 0.00489447 -0.00087241
-0.00283173 0.00783599 0.00932561 -0.0016154 -0.00516075 -0.00470313
-0.00484746 -0.00960562 0.00137242 -0.00422615 0.00252744 0.00561612
-0.00406709 -0.00959937 0.00154715 -0.00670207 0.0024959 -0.00378173
0.00708048 0.00064041 0.00356198 -0.00273993 -0.00171105 0.00765502
0.00140809 -0.00585215 -0.00783678 0.00123304 0.00645651 0.00555797
-0.00897966 0.00859466 0.00404815 0.00747178 0.00974917 -0.0072917
-0.00904259 0.0058377 0.00939395 0.00350795]
You can use the trained model to perform various NLP tasks.
If you are done training a model (no more updates, only queries), you can switch to a KeyedVectors instance to discard the unneeded model state, use less RAM, and allow fast loading and memory sharing (mmap).
word_vectors = model.wv
del model
There is also a gensim.models.phrases module that can automatically detect phrases longer than one word, using collocation statistics. With phrases, you can learn a Word2Vec model in which the "words" are actually multiword expressions, such as new_york_times or financial_crisis.
from gensim.models import Phrases
# Train a bigram detector.
bigram_transformer = Phrases(common_texts)
# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.
model = Word2Vec(bigram_transformer[common_texts], min_count=1)
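Applying the transformer to a tokenized sentence simply returns the tokens with any detected bigrams joined by an underscore; on this tiny corpus, few (if any) bigrams will pass the default threshold, as the following sketch illustrates:

# Inspect what the bigram detector does to one sentence.
# With such a small corpus, most tokens are passed through unchanged.
print(bigram_transformer[['graph', 'minors', 'trees']])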
Gensim ships with several pre-trained models in its data repository:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300',
'conceptnet-numberbatch-17-06-300',
'word2vec-ruscorpora-300',
'word2vec-google-news-300',
'glove-wiki-gigaword-50',
'glove-wiki-gigaword-100',
'glove-wiki-gigaword-200',
'glove-wiki-gigaword-300',
'glove-twitter-25',
'glove-twitter-50',
'glove-twitter-100',
'glove-twitter-200',
'__testing_word2vec-matrix-synopsis']
By default, the downloaded data is stored in a gensim-data folder under your user (home) directory (on Windows, under the user folder on the C drive).
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')
[('facebook', 0.948005199432373),
('tweet', 0.9403423070907593),
('fb', 0.9342358708381653),
('instagram', 0.9104824066162109),
('chat', 0.8964964747428894),
('hashtag', 0.8885937333106995),
('tweets', 0.8878158330917358),
('tl', 0.8778461217880249),
('link', 0.8778210878372192),
('internet', 0.8753897547721863)]
Complete code:
from gensim.models import KeyedVectors
# glove_vectors = gensim.downloader.load('glove-twitter-25')  # load the pre-trained word vectors from gensim-data
# glove_vectors.save('glove-twitter-25.model')
glove_vectors = KeyedVectors.load('glove-twitter-25.model')
# Check the "most similar words" using the default "cosine similarity" measure.
result = glove_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print(result)
most_similar_key, similarity = result[0]  # look at the first match
print(f'{most_similar_key}:{similarity:4f}')
# Use a different similarity measure: "cosmul".
result = glove_vectors.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
print(result)
most_similar_key, similarity = result[0]  # look at the first match
print(f"{most_similar_key}:{similarity:.4f}")
[('meets', 0.8841924071311951), ('prince', 0.8321634531021118), ('queen', 0.8257461786270142), ('’s', 0.817409873008728), ('crow', 0.813499391078949), ('hunter', 0.8131038546562195), ('father', 0.8115833401679993), ('soldier', 0.81113600730896), ('mercy', 0.808239221572876), ('hero', 0.8082264065742493)]
meets:0.884192
[('meets', 1.0724927186965942), ('crow', 1.03579580783844), ('hedgehog', 1.0280965566635132), ('prince', 1.024889349937439), ('hunter', 1.022676706314087), ('mercy', 1.0204170942306519), ('queen', 1.0198343992233276), ('shepherd', 1.0195918083190918), ('soldier', 1.0193928480148315), ('widow', 1.0162571668624878)]
meets:1.0725
The results from this model do not look particularly good.
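Weak analogy results are not too surprising for 25-dimensional Twitter vectors. One option (a sketch, not run here) is to try a larger pre-trained model from the list shown earlier:

import gensim.downloader

# A sketch: larger GloVe vectors trained on Wikipedia + Gigaword typically give
# cleaner analogy results, at the cost of a bigger download.
glove_100 = gensim.downloader.load('glove-wiki-gigaword-100')
print(glove_100.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))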