Processing word vectors and finding similar words with gensim

About word vectors

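Word vectors are stored in one of three formats:

  1. txt
    Plain-text format, one entry per line, for example: word 0.001233 0.34219 …
  2. bin
    Google's serialized word2vec format, stored as binary.
  3. mmap
    Memory-mapped (shared-memory) format. In a word, it is fast: the model loads quickly.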
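
To make the txt layout concrete, here is a minimal sketch of a reader for lines of the form "word v1 v2 …". It is only an illustration, not how gensim parses the file: the function name parse_text_vectors and the sample data are made up, and it assumes the file may start with the usual word2vec "<vocab_size> <dimension>" header line.

```python
import io


def parse_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line_no, line in enumerate(lines):
        parts = line.rstrip().split(' ')
        # Assumption: an optional first line holds "<vocab_size> <dimension>"; skip it.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Hypothetical sample: two 3-dimensional vectors preceded by a header line.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = parse_text_vectors(sample)
print(vectors["queen"])
```

Every value is stored as decimal text, which is why the txt format is the largest and slowest to load of the three.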

Loading

Convert the bin format to mmap; a txt file can be converted the same way by passing binary=False.

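import time

from gensim.models import KeyedVectors

# Placeholder paths (hypothetical; point these at your own files).
word2vec_model_path_bin = 'vectors.bin'
word2vec_model_path_mmap = 'vectors.mmap'

word = '机器学习'  # "machine learning"

def bin2mmap():
    # Load the word2vec binary file; for a txt file pass binary=False.
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_model_path_bin, binary=True, unicode_errors='ignore')
    print(word2vec_model.get_vector(word))

    # topn defaults to 10; the literal 10 here is a second print() argument,
    # which is why the output ends with " 10".
    print(word2vec_model.similar_by_word(word), 10)
    t0 = time.time()
    for i in range(1, 100):  # 99 queries
        word2vec_model.similar_by_word(word)
    t1 = time.time()
    print("timeusage-word2vec:", (t1 - t0))

    # Replace the vectors with their L2-normalized versions
    # (init_sims is available in gensim 3.x, the version used here).
    word2vec_model.init_sims(replace=True)
    print(word2vec_model.get_vector(word))

    print(word2vec_model.similar_by_word(word), 10)
    t0 = time.time()
    for i in range(1, 100):  # 99 queries
        word2vec_model.similar_by_word(word)
    t1 = time.time()
    # Despite the label, this times queries on the normalized in-memory model.
    print("timeusage-mmap:", (t1 - t0))

    # Save in gensim's native format, which can be memory-mapped when loaded later.
    word2vec_model.save(word2vec_model_path_mmap)
    print("finish word2vec format to mmap ")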

The output is as follows:

Original vector: [-0.91519725 -1.3079238  -0.22899283  0.49597797  0.8235162  -0.00442344
 -0.31264892 -1.7862328   0.03999119 -1.1739256   2.0566266   1.471691
  1.0278571  -1.8369098  -0.77390254 -0.656341    1.8493638   0.9575242
 -0.2636744  -0.37055868  0.38282338 -0.49954703 -0.7511834   0.5515332
  0.03838996 -1.7573644  -0.0650226   0.31607443  0.63139683 -0.86080533
 -1.392348    0.43745396 -0.08675445 -0.96114326  0.78316736 -0.23862112
  0.48211917  0.15898947 -0.8290301   1.2457579  -1.4069376   1.9800359
  1.173904   -1.457771    0.6915998  -0.90467894 -0.98435926 -1.7640209
 -1.057895   -0.85384154  2.268948   -0.2041576  -0.16686307  2.161997
  0.03494044  1.3683997   1.9983184   0.4973347  -0.23408589  0.17055449
  1.4023277   0.0798879  -0.9979944  -0.26866376  0.5315857   2.1473117
  0.13713767  1.9981644  -1.6967313  -0.680935   -1.2813233   2.1992793
  0.5794937  -1.6094801  -1.5465956   0.16301139 -2.084712    1.003275
  0.5413395  -1.6723057   1.6716359   0.12574694 -0.10300653  1.6932416
  1.3340583  -0.44037423  1.1770108  -2.7784517  -0.69346255  1.4196216
  0.39841333  1.9590216   0.542251   -0.6345231   0.5592694   0.14699712
  1.0696484   1.262454   -1.7545087  -0.22811268]
[('机器学习方法', 0.9049049615859985), ('机器学习算法', 0.9041637182235718), ('用机器学习', 0.900841236114502), ('机器学习技术', 0.8847143650054932), ('监督学习', 0.8682323694229126), ('学习模型', 0.8643513917922974), ('无监督学习', 0.8610843420028687), ('分类算法', 0.8550550937652588), ('深度学习', 0.8413577079772949), ('无监督', 0.8409481048583984)] 10
timeusage-word2vec: 22.185466051101685

L2-normalized vector: [-0.07881102 -0.11263015 -0.01971942  0.0427105   0.07091603 -0.00038092
 -0.02692335 -0.15381911  0.00344379 -0.10109108  0.17710371  0.12673275
  0.08851257 -0.1581831  -0.06664361 -0.05651995  0.15925555  0.08245595
 -0.02270598 -0.03191018  0.03296633 -0.04301784 -0.06468718  0.04749456
  0.0033059  -0.15133314 -0.00559934  0.02721834  0.05437192 -0.07412713
 -0.11990023  0.03767078 -0.00747075 -0.0827676   0.06744143 -0.02054855
  0.04151706  0.01369117 -0.07139084  0.10727682 -0.1211566   0.1705082
  0.10108921 -0.12553404  0.05955622 -0.07790525 -0.08476681 -0.15190636
 -0.09109925 -0.07352746  0.19538751 -0.01758077 -0.0143692   0.18617757
  0.00300885  0.11783796  0.17208259  0.04282733 -0.020158    0.01468708
  0.12075962  0.00687944 -0.08594099 -0.02313563  0.04577681  0.18491295
  0.01180943  0.17206933 -0.1461118  -0.05863783 -0.11033949  0.18938807
  0.04990235 -0.13859828 -0.13318306  0.01403751 -0.17952226  0.08639573
  0.04661675 -0.14400843  0.14395075  0.01082853 -0.00887027  0.14581129
  0.11488069 -0.03792225  0.10135675 -0.23926274 -0.05971662  0.12224887
  0.03430884  0.1686986   0.04669524 -0.05464113  0.04816076  0.01265847
  0.09211138  0.10871458 -0.15108724 -0.01964363]
[('机器学习方法', 0.9049049615859985), ('机器学习算法', 0.9041637182235718), ('用机器学习', 0.900841236114502), ('机器学习技术', 0.8847143650054932), ('监督学习', 0.8682323694229126), ('学习模型', 0.8643513917922974), ('无监督学习', 0.8610843420028687), ('分类算法', 0.8550550937652588), ('深度学习', 0.8413577079772949), ('无监督', 0.8409481048583984)] 10
timeusage-mmap: 21.80446982383728
finish word2vec format to mmap 

Process finished with exit code 0
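
The run above only saves the model; the fast-loading benefit shows up when the saved file is opened again. In gensim that is done with KeyedVectors.load(path, mmap='r'). The sketch below illustrates the underlying idea with plain NumPy instead of gensim, so it can run without a model file: the array size, file name, and variable names are all made up for the example.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a vector matrix: 1000 vectors of dimension 100.
vectors = np.random.rand(1000, 100).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.npy")
    np.save(path, vectors)

    # mmap_mode="r" maps the file read-only instead of reading it all into memory;
    # rows are fetched from disk only when they are accessed.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    print(type(mapped).__name__, mapped.shape)

    first_row = np.array(mapped[0])  # copy one row into ordinary memory
    del mapped  # release the mapping before the temporary directory is removed
```

Because nothing is parsed or copied up front, opening the file takes roughly the same time regardless of its size, and several processes mapping the same file can share the pages the operating system has cached.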



Environment

gensim 3.7.1
python 3.7

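We took 19 GB of txt-format vectors and saved them in the mmap format, which occupies about 9.8 GB.
There are 20 million vectors in total, each with dimension 100.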
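
As a rough sanity check on that size, the raw matrix alone can be estimated as follows. This assumes the values are stored as 4-byte float32, which is gensim's default when loading word2vec files; the remainder of the 9.8 GB is presumably the vocabulary and other metadata, which this sketch does not try to account for.

```python
num_vectors = 20_000_000
dimension = 100
bytes_per_float = 4  # assumption: float32 storage

raw_bytes = num_vectors * dimension * bytes_per_float
print(f"{raw_bytes / 1e9:.1f} GB")     # 8.0 GB
print(f"{raw_bytes / 2**30:.2f} GiB")  # 7.45 GiB
```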

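Conclusions

  1. Word vectors, text vectors, image vectors: anything that has been vectorized can be queried with this model.
  2. The mmap format is recommended because it loads very quickly; it does not, however, noticeably speed up similarity computation.
  3. Calling init_sims is still recommended, that is, keeping the L2-normalized vectors.
  4. Building an index with annoy (https://pypi.org/project/annoy/1.0.5/) or a similar library to find the top-n neighbours of an arbitrary vector is a separate topic.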
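
To illustrate what L2 normalization buys for similarity queries, here is a minimal NumPy sketch (not gensim's implementation): once every vector has unit length, cosine similarity is just a dot product, so a top-n query is one matrix-vector product plus a sort. The toy data and the helper name top_n are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 4)).astype(np.float32)  # toy data: 5 vectors of dimension 4

# L2-normalize each row once, up front.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def top_n(query_index, n=3):
    """Return (index, cosine) pairs for the n rows most similar to the query row."""
    sims = unit @ unit[query_index]          # dot product of unit vectors = cosine similarity
    order = np.argsort(-sims)                # indices sorted by descending similarity
    order = order[order != query_index][:n]  # drop the query itself
    return [(int(i), float(sims[i])) for i in order]


print(top_n(0))
```

This is still a brute-force scan over every vector, which is consistent with the timings above being nearly identical before and after normalization; an approximate index such as annoy is what changes the cost per query.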

References

https://pypi.org/project/annoy/1.0.5/
https://radimrehurek.com/gensim/models/keyedvectors.html
