Recently, a project at my company involved recommending apps to each user. While processing the app data, every app was turned into a vector, and in the end every user was represented by a single vector as well. The key technique behind all of this is Word2Vec! I had only heard of it in passing before, so here is a systematic write-up~
First, the Wikipedia definition:
Word2vec: a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes words as input and learns to guess the words in neighboring positions; under word2vec's bag-of-words assumption, the order of words does not matter.
After training, a word2vec model can map each word to a vector, which can be used to represent relationships between words. The vector is the hidden layer of the neural network [1].
Word2vec relies on skip-grams or continuous bag of words (CBOW) to build neural word embeddings. Word2vec was created by a research team at Google led by Tomas Mikolov. The algorithm has since been analyzed and explained by other researchers [2][3].
From this definition we can see that Word2vec is a shallow neural network trained by predicting neighboring words, and that the word vectors it produces are what we actually keep.
This naturally raises a question: why build word vectors at all? Why should words be represented as vectors?
The text-vectorization methods introduced below operate on text that has already been segmented into words (for Chinese this is done with a segmenter such as jieba, which we use later in this post).
Raw text cannot be fed into a model directly; once it has been converted into vectors, modeling and downstream analysis become possible. So how to turn text into vectors is a big topic~
Broadly speaking, the common approaches can be grouped as follows:
Method 1: frequency-based vectorization (bag-of-words, BoW)
As the results below show, this is essentially still a one-hot style representation, except that each entry is the word's frequency in the document rather than a plain 0/1 indicator of whether the word appears.
Python implementation:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
corpus=["I come to China to travel",
"This is a car polupar in China",
"I love tea and Apple ",
"The work is to write some papers in science"]
print (vectorizer.fit_transform(corpus))
(0, 4) 1
(0, 15) 2
(0, 3) 1
(0, 16) 1
(1, 3) 1
(1, 14) 1
(1, 6) 1
(1, 2) 1
(1, 9) 1
(1, 5) 1
(2, 7) 1
(2, 12) 1
(2, 0) 1
(2, 1) 1
(3, 15) 1
(3, 6) 1
(3, 5) 1
(3, 13) 1
(3, 17) 1
(3, 18) 1
(3, 11) 1
(3, 8) 1
(3, 10) 1
Each line above has the form (document index, feature index) followed by the count. The dense matrix, its dimensionality, and the vocabulary (which word each position corresponds to) are printed below:
print (vectorizer.fit_transform(corpus).toarray())
print('word vector dimensionality: ', len(vectorizer.fit_transform(corpus).toarray()[0]))
print (vectorizer.get_feature_names())
[[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
[0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0]
[1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1]]
word vector dimensionality: 19
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']
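To make this matrix easier to read, one option (not in the original post) is to label the columns with the feature names, for example with pandas:
import pandas as pd

X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
print(df)  # one row per document, one labeled column per word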
Method 2: Hash Trick-based vectorization
What is the hash trick, and why use it? The idea is to map each word, via a hash function, to one of a fixed and much smaller number of feature indices, and to accumulate the counts of all words that land in the same bucket; a signed variant uses a second hash to decide whether a word contributes +1 or -1, so that collisions tend to cancel out rather than pile up. The dimensionality therefore stays fixed no matter how large the vocabulary grows.
Comparing frequency-based vectorization with hash-trick vectorization:
Frequency-based vectorization suits the common case where the vocabulary is of manageable size; each dimension corresponds to a known word, so the features stay interpretable.
Hash-trick vectorization suits very large corpora and vocabularies where memory is a concern; the price is that you can no longer tell which word a given dimension stands for.
Python implementation:
Reduce the 19-dimensional representation above to 6 dimensions:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer2 = HashingVectorizer(n_features=6, norm=None)
print (vectorizer2.fit_transform(corpus))
(0, 1) 2.0
(0, 2) -1.0
(0, 4) 1.0
(0, 5) -1.0
(1, 0) 1.0
(1, 1) 1.0
(1, 2) -1.0
(1, 5) -1.0
(2, 0) 2.0
(2, 5) -2.0
(3, 0) 0.0
(3, 1) 4.0
(3, 2) -1.0
(3, 3) 1.0
(3, 5) -1.0
print (vectorizer2.fit_transform(corpus).toarray())
print('word vector dimensionality: ', len(vectorizer2.fit_transform(corpus).toarray()[0]))
[[ 0. 2. -1. 0. 1. -1.]
[ 1. 1. -1. 0. 0. -1.]
[ 2. 0. 0. 0. 0. -2.]
[ 0. 4. -1. 1. 0. -1.]]
word vector dimensionality: 6
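The negative entries above come from the signed hashing that sklearn's HashingVectorizer applies by default (alternate_sign=True): a second decision determines whether a token contributes +1 or -1 to its bucket. The following toy sketch (my own illustration using Python's built-in hash, not sklearn's actual MurmurHash-based implementation) shows the idea:
import numpy as np

def hashed_bow(tokens, n_features=6):
    """Toy signed hash trick: illustrates the idea only, not sklearn's exact hashing."""
    vec = np.zeros(n_features)
    for tok in tokens:
        h = hash(tok)                        # stand-in for a proper, stable hash function
        idx = h % n_features                 # which bucket this token falls into
        sign = 1.0 if (h // n_features) % 2 == 0 else -1.0   # second decision gives the sign
        vec[idx] += sign
    return vec

print(hashed_bow("I come to China to travel".lower().split()))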
Method 3: TF-IDF-based vectorization
TF-IDF was covered in an earlier post on this blog; for details see: 机器学习 | TF-IDF和TEXT-RANK的区别
The overall flow here is very similar to Method 1, except that each word's raw count is replaced by its TF-IDF score.
Why improve on raw counts? In some sentences a word like "to" occurs many times and thus has a high count, yet carries little meaning; TF-IDF addresses this by down-weighting words that appear in many documents.
Python implementation:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer()
corpus=["I come to China to travel",
"This is a car polupar in China",
"I love tea and Apple ",
"The work is to write some papers in science"]
re = tfidf2.fit_transform(corpus)
print (re)
(0, 16) 0.4424621378947393
(0, 3) 0.348842231691988
(0, 15) 0.697684463383976
(0, 4) 0.4424621378947393
(1, 5) 0.3574550433419527
(1, 9) 0.45338639737285463
(1, 2) 0.45338639737285463
(1, 6) 0.3574550433419527
(1, 14) 0.45338639737285463
(1, 3) 0.3574550433419527
(2, 1) 0.5
(2, 0) 0.5
(2, 12) 0.5
(2, 7) 0.5
(3, 10) 0.3565798233381452
(3, 8) 0.3565798233381452
(3, 11) 0.3565798233381452
(3, 18) 0.3565798233381452
(3, 17) 0.3565798233381452
(3, 13) 0.3565798233381452
(3, 5) 0.2811316284405006
(3, 6) 0.2811316284405006
(3, 15) 0.2811316284405006
The vector dimensionality, and which word each dimension represents:
tfidf2.get_feature_names()
['and',
'apple',
'car',
'china',
'come',
'in',
'is',
'love',
'papers',
'polupar',
'science',
'some',
'tea',
'the',
'this',
'to',
'travel',
'work',
'write']
print('word vector dimensionality: ', len(tfidf2.fit_transform(corpus).toarray()[0]))
tfidf2.fit_transform(corpus).toarray()
word vector dimensionality: 19
array([[0. , 0. , 0. , 0.34884223, 0.44246214,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.69768446, 0.44246214, 0. , 0. ],
[0. , 0. , 0.4533864 , 0.35745504, 0. ,
0.35745504, 0.35745504, 0. , 0. , 0.4533864 ,
0. , 0. , 0. , 0. , 0.4533864 ,
0. , 0. , 0. , 0. ],
[0.5 , 0.5 , 0. , 0. , 0. ,
0. , 0. , 0.5 , 0. , 0. ,
0. , 0. , 0.5 , 0. , 0. ,
0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ,
0.28113163, 0.28113163, 0. , 0.35657982, 0. ,
0.35657982, 0.35657982, 0. , 0.35657982, 0. ,
0.28113163, 0. , 0.35657982, 0.35657982]])
We can see that "to" now carries less weight than "this", even though its raw count is higher, because it appears in more of the documents!
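As a sanity check, here is a minimal sketch of how sklearn arrives at one of these numbers under its default settings (smooth_idf=True, l2 normalization); it reproduces the ~0.6977 weight of "to" in the first document:
import numpy as np

# sklearn's default smoothed idf: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = 4
idf = lambda df: np.log((1 + n_docs) / (1 + df)) + 1

# raw counts in the first document:   china=1, come=1, to=2, travel=1
# document frequencies in the corpus: china=2, come=1, to=2, travel=1
weights = np.array([1 * idf(2), 1 * idf(1), 2 * idf(2), 1 * idf(1)])
weights /= np.linalg.norm(weights)   # l2-normalize the document's vector
print(weights)                       # the third entry (~0.6977) is the weight of 'to'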
Method 4: Word2vec
In short, Word2vec consists of 2 model architectures plus 2 optimization methods for training, giving 4 variants in total; the details are covered in the mathematical background below!
The two model architectures for learning word vectors are CBOW and Skip-Gram; on top of them, Word2vec adds two optimization methods, Hierarchical Softmax and Negative Sampling, which yields the 4 Word2vec variants:
What these models do: they are trained in order to produce word vectors.
CBOW (Continuous Bag-of-Words): the model predicts the center word from its surrounding context words. At prediction time, the forward pass averages the context word vectors, feeds the result to the output layer, and the word with the highest softmax probability is the predicted center word (a minimal sketch of this forward pass follows).
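A minimal numpy sketch of that CBOW forward pass (toy sizes and random weights, purely illustrative; this is not gensim's implementation):
import numpy as np

V, D = 10, 4                     # toy vocabulary size and embedding dimensionality
W_in = np.random.randn(V, D)     # input-side word vectors (one row per word)
W_out = np.random.randn(D, V)    # output-side weights (one column per word)

context_ids = [2, 5, 7, 9]                       # indices of the surrounding context words
h = W_in[context_ids].mean(axis=0)               # CBOW: average the context word vectors
scores = h @ W_out                               # one score per candidate center word
probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the vocabulary
print("predicted center word id:", probs.argmax())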
Skip-Gram works the other way around from CBOW: given a word, the model outputs that word's context.
That is, the input is the word vector of one specific (center) word, and the output is the set of context words corresponding to it.
For example, with a context size of 4, the specific word "Learning" is the input, and its 8 surrounding context words are the outputs.
In the Skip-Gram network the input is a single word (its word vector), and the output layer has one neuron per word in the vocabulary, i.e. a softmax over the whole vocabulary.
Training input: the word vector of the specific center word.
Training output: the word vectors of its 8 context words.
At prediction time, the forward pass produces a probability for every word in the vocabulary, and the top-scoring words are taken as the predicted context (see the continuation of the sketch below).
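Continuing the toy sketch above, Skip-Gram differs only in that the hidden vector is a single word's vector rather than an average, and the top-scoring output words are read off as the predicted context:
center_id = 3
h = W_in[center_id]                              # a single word vector, no averaging
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()
print("top predicted context word ids:", probs.argsort()[::-1][:8])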
CBOW + Hierarchical Softmax: the gradient updates use stochastic gradient ascent.
Skip-Gram + Hierarchical Softmax: the gradient updates use stochastic gradient ascent.
Why introduce Negative Sampling when we already have Hierarchical Softmax? Because Hierarchical Softmax has limitations of its own.
Pros and cons of Hierarchical Softmax: it replaces the full softmax over the vocabulary with a walk down a Huffman tree, so each update costs O(log|V|) instead of O(|V|), and frequent words sit close to the root and are cheap to reach; but rare words have long Huffman paths, so updates involving them remain expensive.
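Conceptually (a sketch only, not gensim's code), the probability of a word under hierarchical softmax is a product of binary sigmoid decisions along that word's Huffman path:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_probability(h, path_thetas, path_codes):
    """h: projection/hidden vector; path_thetas: parameter vector of each internal node
    on the word's Huffman path; path_codes: the 0/1 branch taken at each of those nodes."""
    p = 1.0
    for theta, code in zip(path_thetas, path_codes):
        s = sigmoid(np.dot(theta, h))
        p *= s if code == 0 else (1.0 - s)   # one binary decision per internal node
    return p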
So let's look at Negative Sampling. As the name suggests, it is a sampling method: each true (context, word) pair is kept as a positive example, and instead of normalizing over the whole vocabulary, a handful of randomly drawn "negative" words serve as counter-examples. In the original Word2vec work the negatives are drawn from the unigram distribution raised to the 3/4 power (a short sketch of this appears after the pros and cons below).
Two questions then arise: how exactly are the negative samples drawn, and how are the gradient updates performed once we have them?
CBOW+Negative Sampling
Skip-Gram+Negative Sampling
Advantages: each update only touches the positive example and a small number of sampled negatives, so its cost does not grow with the vocabulary size; no Huffman tree is needed, and it works well on large corpora and for frequent words.
Drawbacks: the quality of the result depends on how many negatives are drawn and on the sampling distribution, and the trained model no longer defines a properly normalized probability over the vocabulary.
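Here is the promised sketch of the commonly used negative-sampling distribution (word frequency raised to the 3/4 power; the counts below are hypothetical):
import numpy as np

def negative_sampling_probs(word_counts, power=0.75):
    """Probability of drawing each word as a negative sample."""
    freq = np.asarray(word_counts, dtype=float) ** power
    return freq / freq.sum()

counts = [100, 10, 1]                                        # hypothetical word frequencies
probs = negative_sampling_probs(counts)
negatives = np.random.choice(len(counts), size=5, p=probs)   # draw 5 negative word ids
print(probs, negatives)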
import numpy as np
import gensim
from gensim.models import word2vec
import jieba
import jieba.analyse
# tell jieba these character names are single tokens, so segmentation does not split them
jieba.suggest_freq('沙瑞金', True)
jieba.suggest_freq('田国富', True)
jieba.suggest_freq('高育良', True)
jieba.suggest_freq('侯亮平', True)
jieba.suggest_freq('钟小艾', True)
jieba.suggest_freq('陈岩石', True)
jieba.suggest_freq('欧阳菁', True)
jieba.suggest_freq('易学习', True)
jieba.suggest_freq('王大路', True)
jieba.suggest_freq('蔡成功', True)
jieba.suggest_freq('孙连城', True)
jieba.suggest_freq('季昌明', True)
jieba.suggest_freq('丁义珍', True)
jieba.suggest_freq('郑西坡', True)
jieba.suggest_freq('赵东来', True)
jieba.suggest_freq('高小琴', True)
jieba.suggest_freq('赵瑞龙', True)
jieba.suggest_freq('林华华', True)
jieba.suggest_freq('陆亦可', True)
jieba.suggest_freq('刘新建', True)
jieba.suggest_freq('刘庆祝', True)
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/vx/np6lccw52hdfcz_2qswpfhch0000gn/T/jieba.cache
Loading model cost 1.243 seconds.
Prefix dict has been built succesfully.
with open('./in_the_name_of_people.txt', encoding='utf-8') as f:
    document = f.read()
    #document_decode = document.decode('GBK')
    document_cut = jieba.cut(document)
    #print ' '.join(jieba_cut)  # printing here would consume the generator, so result below would be empty
    result = ' '.join(document_cut)
    # result = result.encode('utf-8')
    with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
        f2.write(result)
f.close()   # redundant: the with blocks above already closed both files
f2.close()
# import modules & set up logging
import logging
import os
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')
model = word2vec.Word2Vec(sentences, hs=1,min_count=1,window=3,size=100)
'''
1. The model architecture defaults to CBOW (predict the center word from its context): sg=0
2. The optimization defaults to negative sampling (hs=0); setting hs=1 uses hierarchical softmax instead
3. The word-vector size defaults to size=100 (dimensionality of the word vectors)
'''
2019-08-18 17:20:52,719 : INFO : collecting all words and their counts
2019-08-18 17:20:52,722 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-08-18 17:20:52,815 : INFO : collected 17878 word types from a corpus of 161343 raw words and 2311 sentences
2019-08-18 17:20:52,816 : INFO : Loading a fresh vocabulary
2019-08-18 17:20:52,850 : INFO : effective_min_count=1 retains 17878 unique words (100% of original 17878, drops 0)
2019-08-18 17:20:52,852 : INFO : effective_min_count=1 leaves 161343 word corpus (100% of original 161343, drops 0)
2019-08-18 17:20:52,919 : INFO : deleting the raw counts dictionary of 17878 items
2019-08-18 17:20:52,923 : INFO : sample=0.001 downsamples 38 most-common words
2019-08-18 17:20:52,924 : INFO : downsampling leaves estimated 120578 word corpus (74.7% of prior 161343)
2019-08-18 17:20:52,944 : INFO : constructing a huffman tree from 17878 words
2019-08-18 17:20:53,601 : INFO : built huffman tree with maximum node depth 17
2019-08-18 17:20:53,645 : INFO : estimated required memory for 17878 words and 100 dimensions: 33968200 bytes
2019-08-18 17:20:53,647 : INFO : resetting layer weights
2019-08-18 17:20:54,001 : INFO : training model with 3 workers on 17878 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=3
2019-08-18 17:20:54,373 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-18 17:20:54,387 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-18 17:20:54,399 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-18 17:20:54,401 : INFO : EPOCH - 1 : training on 161343 raw words (120392 effective words) took 0.4s, 305531 effective words/s
2019-08-18 17:20:54,678 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-18 17:20:54,686 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-18 17:20:54,693 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-18 17:20:54,694 : INFO : EPOCH - 2 : training on 161343 raw words (120560 effective words) took 0.3s, 417119 effective words/s
2019-08-18 17:20:54,893 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-18 17:20:54,894 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-18 17:20:54,907 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-18 17:20:54,908 : INFO : EPOCH - 3 : training on 161343 raw words (120517 effective words) took 0.2s, 567584 effective words/s
2019-08-18 17:20:55,198 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-18 17:20:55,207 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-18 17:20:55,218 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-18 17:20:55,219 : INFO : EPOCH - 4 : training on 161343 raw words (120712 effective words) took 0.3s, 391368 effective words/s
2019-08-18 17:20:55,526 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-08-18 17:20:55,533 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-08-18 17:20:55,553 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-08-18 17:20:55,554 : INFO : EPOCH - 5 : training on 161343 raw words (120478 effective words) took 0.3s, 362484 effective words/s
2019-08-18 17:20:55,555 : INFO : training on a 806715 raw words (602659 effective words) took 1.6s, 388087 effective words/s
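Not part of the original run, but a typical next step is to persist the trained model so it can be reloaded later; gensim supports this directly (file name here is hypothetical):
model.save('./in_the_name_of_people.w2v')
model = word2vec.Word2Vec.load('./in_the_name_of_people.w2v')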
Find the words whose vectors are closest to a given word's vector:
req_count = 5
for key in model.wv.similar_by_word('李达康', topn=100):
    if len(key[0]) == 3:   # keep only three-character tokens, i.e. person names
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
2019-08-18 17:21:44,506 : INFO : precomputing L2-norms of word weight vectors
侯亮平 0.9604056477546692
欧阳菁 0.9600167274475098
蔡成功 0.9599809646606445
刘新建 0.9572819471359253
祁同伟 0.9565152525901794
/Users/apple/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(vec.dtype, np.int):
req_count = 5
for key in model.wv.similar_by_word('沙瑞金', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
高育良 0.9720388650894165
田国富 0.9549083709716797
易学习 0.9494497776031494
李达康 0.9454081058502197
侯亮平 0.9189556241035461
/Users/apple/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(vec.dtype, np.int):
Check how similar two word vectors are:
print (model.wv.similarity('沙瑞金', '高育良'))
print (model.wv.similarity('李达康', '王大路'))
0.9720388
0.9373346
/Users/apple/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(vec.dtype, np.int):
print (model.wv.similarity('沙瑞金', '刘庆祝'))
0.8436507
/Users/apple/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(vec.dtype, np.int):
Find the word that does not belong with the others:
print (model.wv.doesnt_match("沙瑞金 高育良 李达康 刘庆祝".split()))
刘庆祝
/Users/apple/anaconda3/lib/python3.6/site-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
if np.issubdtype(vec.dtype, np.int):
刘庆祝 is not the same kind of person as the other three!
Get the word vectors:
# total number of raw words in the training corpus
model.corpus_total_words
161343
# number of sentences (lines) in the training corpus
model.corpus_count
2311
# the vocabulary object built during training in this gensim version
model.vocabulary
model['李达康']
/Users/apple/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if __name__ == '__main__':
array([-0.0870866 , 0.05248798, -0.28147143, -0.32899868, -0.24419424,
-0.26717356, 0.68835247, 0.4199263 , 0.07673895, 0.34578642,
-0.18166232, -0.64018744, 0.0661103 , 1.3144252 , 0.23052616,
-0.9842175 , 0.16689244, -1.0376722 , -0.6779322 , -0.08552188,
0.8821609 , 0.85630375, 0.70850575, 0.02350087, -0.26186958,
-0.19465029, -0.5280784 , 0.02718589, -0.22725886, 0.584188 ,
-0.22170487, 0.17096068, 0.22743836, -0.58258903, -0.8521926 ,
0.01146634, 0.17366898, -0.20080233, 0.49060255, -0.0892161 ,
0.2798695 , -0.48753452, -0.26934424, -0.28810668, -0.50305516,
-0.52781904, -1.0276003 , -0.29357475, -0.5148399 , -0.99778444,
0.82347995, -0.17103711, 0.45900956, -0.25982574, -0.10443403,
-0.43294677, -0.03601839, 0.23268174, -0.0897947 , -0.30117008,
0.13093895, -0.04065455, 0.98853856, -0.19679072, 0.02730171,
-0.39002168, -0.86443186, -0.30278337, -0.35015163, 0.45706993,
-0.35796672, -0.5281926 , 0.4609695 , -0.16861178, -0.4281448 ,
-0.05549743, 0.30860028, -0.33855316, -0.8916333 , 0.77231795,
-0.45779762, 0.29819477, -0.05069054, 0.41183752, -0.25177717,
-0.20057783, 0.53893435, 0.13017803, 0.8262993 , 0.77265227,
-0.57259095, -0.02957028, -0.03229868, 0.4734169 , 0.02673261,
-0.56793886, 0.48301852, -0.14260153, -0.21643269, 0.4321306 ],
dtype=float32)
len(model['李达康'])
/Users/apple/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if __name__ == '__main__':
100
model['侯亮平']
/Users/apple/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if __name__ == '__main__':
array([-0.27619898, 0.27101442, -0.3888319 , -0.21565337, -0.1988687 ,
-0.21134071, 0.58008534, 0.6338025 , 0.26411813, 0.300347 ,
0.0545746 , -0.7266006 , -0.06810553, 1.4180936 , 0.04470716,
-1.2312315 , 0.2570867 , -1.356324 , -0.74197394, -0.03976419,
0.89614266, 0.73904985, 0.9443898 , 0.13467237, -0.09986281,
-0.27338284, -0.6192025 , -0.19986346, -0.3509883 , 0.8633056 ,
-0.1322346 , 0.02944488, 0.00851353, -0.8523627 , -0.69786495,
0.17855184, 0.27958298, -0.1690526 , 0.74027956, -0.09224971,
0.27419734, -0.6110898 , -0.45265457, -0.33315966, -0.5103257 ,
-0.63461596, -1.1950399 , 0.09368438, -0.29370093, -1.0550132 ,
0.93446714, -0.30718964, 0.6203983 , -0.26469257, -0.3890905 ,
-0.34891984, -0.02781189, 0.56555355, 0.03353672, -0.03311604,
-0.03772071, 0.28559205, 1.2120959 , -0.19666088, 0.21143027,
-0.7012241 , -1.0564705 , -0.24415188, -0.35654724, 0.54533786,
-0.70228875, -0.6307003 , 0.5166867 , -0.3769945 , -0.25609592,
-0.09554568, 0.2651889 , -0.56329715, -1.3013954 , 0.9396692 ,
-0.38046873, 0.25952345, -0.18691233, 0.3837758 , -0.557426 ,
-0.388514 , 0.68085045, 0.12305634, 1.1934747 , 0.73448956,
-0.6552626 , 0.00999391, 0.10919277, 0.717848 , 0.0193353 ,
-0.6280944 , 0.39228523, 0.05402936, -0.11338637, 0.58770233],
dtype=float32)
len(model['侯亮平'])
/Users/apple/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if __name__ == '__main__':
100
import numpy as np

def cos_sim(vector_a, vector_b):
    """
    Compute the cosine similarity between two vectors.
    :param vector_a: vector a
    :param vector_b: vector b
    :return: sim
    """
    vector_a = np.mat(vector_a)
    vector_b = np.mat(vector_b)
    num = float(vector_a * vector_b.T)                            # dot product of the two vectors
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)   # product of the two vector norms
    cos = num / denom
    sim = 0.5 + 0.5 * cos   # rescale from [-1, 1] to [0, 1]
    return sim
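A quick usage check, assuming the model trained above is still in memory. Note that cos_sim rescales the cosine to [0, 1], whereas gensim's similarity returns the raw cosine in [-1, 1], so the two numbers will not coincide:
v1 = model.wv['沙瑞金']
v2 = model.wv['高育良']
print(cos_sim(v1, v2))                          # rescaled cosine, in [0, 1]
print(model.wv.similarity('沙瑞金', '高育良'))    # raw cosine, in [-1, 1]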