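Word vectors can be stored in three formats: bin, txt, and mmap.
The code below converts the bin format to mmap; to convert txt to mmap instead, pass binary=False.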
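import time

from gensim.models import KeyedVectors

# Placeholder paths (not given in the original post); point them at your own files.
word2vec_model_path_bin = 'word2vec.bin'
word2vec_model_path_mmap = 'word2vec.mmap'
word = '机器学习'


def bin2mmap():
    # Load vectors in word2vec binary format (use binary=False for a txt file).
    word2vec_model = KeyedVectors.load_word2vec_format(
        word2vec_model_path_bin, binary=True, unicode_errors='ignore')
    print(word2vec_model.get_vector(word))
    # topn defaults to 10; the trailing 10 is simply printed after the result list.
    print(word2vec_model.similar_by_word(word), 10)
    # Time 99 similarity queries on the raw vectors.
    t0 = time.time()
    for i in range(1, 100):
        word2vec_model.similar_by_word(word)
    t1 = time.time()
    print("timeusage-word2vec:", (t1 - t0))
    # L2-normalize in place, discarding the raw vectors (gensim 3.x API).
    word2vec_model.init_sims(replace=True)
    print(word2vec_model.get_vector(word))
    print(word2vec_model.similar_by_word(word), 10)
    # Time the same 99 queries after normalization (the model is not memory-mapped yet).
    t0 = time.time()
    for i in range(1, 100):
        word2vec_model.similar_by_word(word)
    t1 = time.time()
    print("timeusage-mmap:", (t1 - t0))
    # Save in gensim's native format, which can later be loaded with mmap='r'.
    word2vec_model.save(word2vec_model_path_mmap)
    print("finish word2vec format to mmap ")


if __name__ == '__main__':
    bin2mmap()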
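In gensim, a file written by save() can be reopened with KeyedVectors.load(path, mmap='r'), which maps the vector array from disk instead of reading it all into RAM. To illustrate what memory mapping buys, here is a minimal standard-library sketch. It is not gensim's implementation; the toy dimension, the file name vectors.f32, and the helpers write_matrix and read_row are all made up for this example.

```python
import mmap
import os
import struct
import tempfile

DIM = 4              # toy dimension; the vectors in this post have dimension 100
ROW_BYTES = DIM * 4  # one float32 takes 4 bytes


def write_matrix(path, rows):
    """Write the rows as one contiguous block of little-endian float32 values."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack("<%df" % DIM, *row))


def read_row(path, index):
    """Fetch a single row through a read-only memory map of the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * ROW_BYTES
            return struct.unpack("<%df" % DIM, mm[start:start + ROW_BYTES])


# Hypothetical toy data; the values are exactly representable in float32.
rows = [[0.5, 1.0, 1.5, 2.0], [2.5, 3.0, 3.5, 4.0], [-1.0, -2.0, -3.0, -4.0]]
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
write_matrix(path, rows)
print(read_row(path, 1))  # (2.5, 3.0, 3.5, 4.0)
```

Because a row is located by offset arithmetic, the operating system only needs to page in the part of the file that is actually read, and several processes can share the same mapped pages.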
The results are as follows:
Raw vector: [-0.91519725 -1.3079238 -0.22899283 0.49597797 0.8235162 -0.00442344
-0.31264892 -1.7862328 0.03999119 -1.1739256 2.0566266 1.471691
1.0278571 -1.8369098 -0.77390254 -0.656341 1.8493638 0.9575242
-0.2636744 -0.37055868 0.38282338 -0.49954703 -0.7511834 0.5515332
0.03838996 -1.7573644 -0.0650226 0.31607443 0.63139683 -0.86080533
-1.392348 0.43745396 -0.08675445 -0.96114326 0.78316736 -0.23862112
0.48211917 0.15898947 -0.8290301 1.2457579 -1.4069376 1.9800359
1.173904 -1.457771 0.6915998 -0.90467894 -0.98435926 -1.7640209
-1.057895 -0.85384154 2.268948 -0.2041576 -0.16686307 2.161997
0.03494044 1.3683997 1.9983184 0.4973347 -0.23408589 0.17055449
1.4023277 0.0798879 -0.9979944 -0.26866376 0.5315857 2.1473117
0.13713767 1.9981644 -1.6967313 -0.680935 -1.2813233 2.1992793
0.5794937 -1.6094801 -1.5465956 0.16301139 -2.084712 1.003275
0.5413395 -1.6723057 1.6716359 0.12574694 -0.10300653 1.6932416
1.3340583 -0.44037423 1.1770108 -2.7784517 -0.69346255 1.4196216
0.39841333 1.9590216 0.542251 -0.6345231 0.5592694 0.14699712
1.0696484 1.262454 -1.7545087 -0.22811268]
[('机器学习方法', 0.9049049615859985), ('机器学习算法', 0.9041637182235718), ('用机器学习', 0.900841236114502), ('机器学习技术', 0.8847143650054932), ('监督学习', 0.8682323694229126), ('学习模型', 0.8643513917922974), ('无监督学习', 0.8610843420028687), ('分类算法', 0.8550550937652588), ('深度学习', 0.8413577079772949), ('无监督', 0.8409481048583984)] 10
timeusage-word2vec: 22.185466051101685
L2-normalized vector: [-0.07881102 -0.11263015 -0.01971942 0.0427105 0.07091603 -0.00038092
-0.02692335 -0.15381911 0.00344379 -0.10109108 0.17710371 0.12673275
0.08851257 -0.1581831 -0.06664361 -0.05651995 0.15925555 0.08245595
-0.02270598 -0.03191018 0.03296633 -0.04301784 -0.06468718 0.04749456
0.0033059 -0.15133314 -0.00559934 0.02721834 0.05437192 -0.07412713
-0.11990023 0.03767078 -0.00747075 -0.0827676 0.06744143 -0.02054855
0.04151706 0.01369117 -0.07139084 0.10727682 -0.1211566 0.1705082
0.10108921 -0.12553404 0.05955622 -0.07790525 -0.08476681 -0.15190636
-0.09109925 -0.07352746 0.19538751 -0.01758077 -0.0143692 0.18617757
0.00300885 0.11783796 0.17208259 0.04282733 -0.020158 0.01468708
0.12075962 0.00687944 -0.08594099 -0.02313563 0.04577681 0.18491295
0.01180943 0.17206933 -0.1461118 -0.05863783 -0.11033949 0.18938807
0.04990235 -0.13859828 -0.13318306 0.01403751 -0.17952226 0.08639573
0.04661675 -0.14400843 0.14395075 0.01082853 -0.00887027 0.14581129
0.11488069 -0.03792225 0.10135675 -0.23926274 -0.05971662 0.12224887
0.03430884 0.1686986 0.04669524 -0.05464113 0.04816076 0.01265847
0.09211138 0.10871458 -0.15108724 -0.01964363]
[('机器学习方法', 0.9049049615859985), ('机器学习算法', 0.9041637182235718), ('用机器学习', 0.900841236114502), ('机器学习技术', 0.8847143650054932), ('监督学习', 0.8682323694229126), ('学习模型', 0.8643513917922974), ('无监督学习', 0.8610843420028687), ('分类算法', 0.8550550937652588), ('深度学习', 0.8413577079772949), ('无监督', 0.8409481048583984)] 10
timeusage-mmap: 21.80446982383728
finish word2vec format to mmap
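The two neighbor lists in the log are identical, which is what one would expect: the similarity scores are cosine similarities, and dividing a vector by its own L2 norm does not change its cosine with any other vector. In the log, the normalized components are the raw ones divided by the same constant (about 11.61 for the first two components). A small pure-Python sketch of this property, using made-up 3-dimensional vectors and the hypothetical helpers l2_normalize and cosine:

```python
import math


def l2_normalize(vec):
    """Scale vec so that its Euclidean (L2) norm becomes 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not taken from the model in this post.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

raw = cosine(a, b)
normed = cosine(l2_normalize(a), l2_normalize(b))
print(raw, normed)  # both are 11/15 up to floating-point rounding
```

This also fits the near-identical timings above: normalizing in place changes the stored values, not the amount of work per query.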
Environment: gensim 3.7.1, Python 3.7
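We took 19 GB of word vectors in txt format and stored them in the mmap format, which takes about 9.8 GB.
There are 20 million vectors in total, each with dimension 100.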
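As a rough sanity check on these sizes, the arithmetic below estimates the footprint of the vector matrix alone. It assumes the values are stored as 4-byte float32, which the post does not state explicitly; the constant names are made up for this sketch.

```python
NUM_VECTORS = 20_000_000  # from the post
DIM = 100                 # from the post
BYTES_PER_VALUE = 4       # assumption: float32 storage

matrix_bytes = NUM_VECTORS * DIM * BYTES_PER_VALUE
print(matrix_bytes)          # 8000000000
print(matrix_bytes / 1e9)    # 8.0 (decimal GB)
print(matrix_bytes / 2**30)  # roughly 7.45 (GiB)
```

Under that assumption the matrix accounts for about 8 GB of the 9.8 GB reported; the remainder is presumably the vocabulary and other saved metadata. The txt file is larger because each value is written out as decimal text rather than as 4 binary bytes.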
https://pypi.org/project/annoy/1.0.5/
https://radimrehurek.com/gensim/models/keyedvectors.html