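word2vec project page: https://code.google.com/p/word2vec/
Running and testing also require the text8 and questions-words.txt files. Corpus download: http://mattmahoney.net/dc/text8.zip
The corpus is UTF-8 encoded and stored as a single line. Training statistics for this corpus: training on 85026035 raw words (62529137 effective words) took 197.4s, 316692 effective words/s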
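-train The training data file
-output The output file for the results, i.e. the vector of each word
-cbow Whether to use the CBOW model: 0 selects skip-gram, 1 selects CBOW. Skip-gram is the default in the version described here (later revisions of word2vec.c default to CBOW, so check the help output of your build). CBOW trains faster; skip-gram tends to give somewhat better results
-size Dimensionality of the output word vectors
-window Size of the training window; 8 means each word considers the 8 words before it and the 8 words after it (the actual code also picks the effective window at random for each word, so it never exceeds the configured size, i.e. <= 5 with the default of 5)
-negative Whether to use negative sampling (NEG): 0 disables it, and a positive value is the number of negative samples drawn per training example
-hs Whether to use hierarchical softmax (HS): 0 disables it, 1 enables it
-sample The subsampling threshold: the more frequent a word is in the training data, the more likely its occurrences are to be randomly dropped (downsampled)
-binary Whether to store the output file in binary form: 0 writes plain text that can be opened and inspected, 1 writes the binary format used by vectors.bin
-alpha The learning rate
-min-count The minimum frequency, 5 by default; a word that occurs in the corpus fewer times than this threshold is discarded
-classes The number of word clusters; the source code shows that the clustering uses k-means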
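As a rough illustration of how -min-count, -sample and -window act on the training data, the sketch below re-implements the three steps in plain Python. The formulas follow my reading of word2vec.c; the helper names (`filter_vocab`, `keep_probability`, `context_positions`) and the toy counts are invented for this example and are not part of the tool.

```python
import math
import random


def filter_vocab(counts, min_count=5):
    """-min-count: drop every word seen fewer than min_count times."""
    return {w: c for w, c in counts.items() if c >= min_count}


def keep_probability(count, total_words, sample=1e-3):
    """-sample: probability that a single occurrence of a word is kept.

    Uses the expression (sqrt(f / sample) + 1) * sample / f with
    f = count / total_words, capped at 1 (rare words are never dropped).
    """
    if sample <= 0:
        return 1.0
    f = count / float(total_words)
    return min(1.0, (math.sqrt(f / sample) + 1.0) * sample / f)


def context_positions(center, sentence_len, window=5, rng=random):
    """-window: indices used as context for the word at index `center`.

    The effective half-width is drawn at random from 1..window for each
    center word, so close neighbours are used more often than distant ones.
    """
    reduced = rng.randint(1, window)
    lo = max(0, center - reduced)
    hi = min(sentence_len, center + reduced + 1)
    return [i for i in range(lo, hi) if i != center]


# Hypothetical word counts, only for demonstration.
counts = {"the": 1000000, "of": 600000, "anarchism": 300, "zyzzyva": 2}
vocab = filter_vocab(counts, min_count=5)
total = sum(vocab.values())
for word in sorted(vocab, key=vocab.get, reverse=True):
    print(word, round(keep_probability(vocab[word], total), 3))
print(context_positions(center=10, sentence_len=100, window=8))
```

Very frequent words get a small keep probability while rare words are always kept, which is the behaviour the -sample description refers to.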
# -*- coding: utf-8 -*-
"""
Purpose: try out gensim's word2vec
Date: 2016-05-02 18:00:00
"""
# Note: this is a Python 2 script (print statements) using the gensim API of that time.
from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus("data/text8")  # load the corpus
# Train with the default settings (window=5); the log below reports sg=0, i.e. CBOW rather than skip-gram
model = word2vec.Word2Vec(sentences, size=200)

# Similarity / relatedness of two words
y1 = model.similarity("woman", "man")
print u"Similarity between woman and man:", y1
print "--------\n"

# List the words most related to a given word
y2 = model.most_similar("good", topn=20)  # the 20 most related words
print u"Words most related to good:\n"
for item in y2:
    print item[0], item[1]
print "--------\n"

# Find analogies
print ' "boy" is to "father" as "girl" is to ...? \n'
y3 = model.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print item[0], item[1]
print "--------\n"

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.most_similar([x, b], [a])[0][0]
    print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
print "--------\n"

# Find the word that does not belong with the others
y4 = model.doesnt_match("breakfast cereal dinner lunch".split())
print u"Odd one out:", y4
print "--------\n"

# Save the model for later reuse
model.save("text8.model")
# Corresponding way to load it
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the C tool can parse
model.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding way to load it
# model_3 = word2vec.Word2Vec.load_word2vec_format("text8.model.bin", binary=True)

if __name__ == "__main__":
    pass
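To make the three kinds of query in the script above less of a black box, here is a small sketch on hand-written 5-dimensional vectors. It assumes that `similarity` is the cosine of two word vectors, that `most_similar(positive, negative)` ranks words by cosine against the sum of the unit-length positive vectors minus the negative ones, and that `doesnt_match` returns the word least similar to the mean of the group. That matches my understanding of gensim, but the toy vectors and the helper functions below are invented for illustration and are not gensim code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items()
              if w not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Toy vectors; axes are roughly (male, female, parent, meal, food item).
toy = {
    "boy":       [1.0, 0.0, 0.0, 0.0, 0.0],
    "girl":      [0.0, 1.0, 0.0, 0.0, 0.0],
    "father":    [1.0, 0.0, 1.0, 0.0, 0.0],
    "mother":    [0.0, 1.0, 1.0, 0.0, 0.0],
    "breakfast": [0.0, 0.0, 0.0, 1.0, 0.2],
    "lunch":     [0.0, 0.0, 0.0, 1.0, 0.1],
    "dinner":    [0.0, 0.0, 0.0, 1.0, 0.0],
    "cereal":    [0.0, 0.0, 0.0, 0.3, 1.0],
}

print(cosine(toy["father"], toy["mother"]))
print(most_similar(toy, ["girl", "father"], ["boy"], topn=1))
print(doesnt_match(toy, "breakfast cereal dinner lunch".split()))
```

On these toy vectors the analogy query puts "mother" first and the odd-one-out query returns "cereal", mirroring the shape of the real results shown in the output further down.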
2016-5-2 18:56:19,332 : INFO : collecting all words and their counts
2016-5-2 18:56:19,334 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-5-2 18:56:27,431 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2016-5-2 18:56:27,740 : INFO : min_count=5 retains 71290 unique words (drops 182564)
2016-5-2 18:56:27,740 : INFO : min_count leaves 16718844 word corpus (98% of original 17005207)
2016-5-2 18:56:27,914 : INFO : deleting the raw counts dictionary of 253854 items
2016-5-2 18:56:27,947 : INFO : sample=0.001 downsamples 38 most-common words
2016-5-2 18:56:27,947 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2016-5-2 18:56:27,947 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2016-5-2 18:56:28,176 : INFO : resetting layer weights
2016-5-2 18:56:29,074 : INFO : training model with 3 workers on 71290 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5
2016-5-2 18:56:29,074 : INFO : expecting 1701 sentences, matching count from corpus used for vocabulary survey
2016-5-2 18:56:30,086 : INFO : PROGRESS: at 0.86% examples, 531932 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:31,088 : INFO : PROGRESS: at 1.72% examples, 528872 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:32,108 : INFO : PROGRESS: at 2.68% examples, 549248 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:33,113 : INFO : PROGRESS: at 3.47% examples, 534255 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:34,135 : INFO : PROGRESS: at 4.43% examples, 545575 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:35,145 : INFO : PROGRESS: at 5.40% examples, 555220 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:36,147 : INFO : PROGRESS: at 6.34% examples, 560815 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:37,155 : INFO : PROGRESS: at 7.28% examples, 564712 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:56:38,172 : INFO : PROGRESS: at 8.24% examples, 568088 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:39,169 : INFO : PROGRESS: at 9.19% examples, 570872 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:40,191 : INFO : PROGRESS: at 10.16% examples, 573068 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:41,203 : INFO : PROGRESS: at 11.12% examples, 575184 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:42,217 : INFO : PROGRESS: at 12.09% examples, 577227 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:43,220 : INFO : PROGRESS: at 13.04% examples, 578418 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:44,235 : INFO : PROGRESS: at 14.00% examples, 579574 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:45,239 : INFO : PROGRESS: at 14.96% examples, 580577 words/s, in_qsize 6, out_qsize 2
2016-5-2 18:56:46,243 : INFO : PROGRESS: at 15.86% examples, 578374 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:47,252 : INFO : PROGRESS: at 16.70% examples, 574918 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:48,256 : INFO : PROGRESS: at 17.66% examples, 576221 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:49,258 : INFO : PROGRESS: at 18.61% examples, 577045 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:56:50,260 : INFO : PROGRESS: at 19.54% examples, 576947 words/s, in_qsize 4, out_qsize 1
2016-5-2 18:56:51,261 : INFO : PROGRESS: at 20.47% examples, 577120 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:52,284 : INFO : PROGRESS: at 21.43% examples, 577251 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:53,287 : INFO : PROGRESS: at 22.34% examples, 576556 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:54,308 : INFO : PROGRESS: at 23.20% examples, 574618 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:56:55,306 : INFO : PROGRESS: at 24.15% examples, 575304 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:56,329 : INFO : PROGRESS: at 25.09% examples, 575610 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:56:57,333 : INFO : PROGRESS: at 26.04% examples, 576358 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:56:58,340 : INFO : PROGRESS: at 26.97% examples, 576745 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:56:59,337 : INFO : PROGRESS: at 27.91% examples, 577161 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:00,338 : INFO : PROGRESS: at 28.84% examples, 577303 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:01,346 : INFO : PROGRESS: at 29.65% examples, 575087 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:02,353 : INFO : PROGRESS: at 30.55% examples, 574516 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:03,356 : INFO : PROGRESS: at 31.36% examples, 572590 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:04,371 : INFO : PROGRESS: at 32.10% examples, 569320 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:05,380 : INFO : PROGRESS: at 32.95% examples, 568088 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:06,389 : INFO : PROGRESS: at 33.78% examples, 566886 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:07,399 : INFO : PROGRESS: at 34.60% examples, 565345 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:08,418 : INFO : PROGRESS: at 35.51% examples, 564685 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:09,432 : INFO : PROGRESS: at 36.39% examples, 564093 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:10,441 : INFO : PROGRESS: at 37.21% examples, 562778 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:11,453 : INFO : PROGRESS: at 38.14% examples, 563163 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:12,449 : INFO : PROGRESS: at 38.98% examples, 562072 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:13,461 : INFO : PROGRESS: at 39.88% examples, 561949 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:14,464 : INFO : PROGRESS: at 40.75% examples, 561493 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:15,482 : INFO : PROGRESS: at 41.60% examples, 560419 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:16,503 : INFO : PROGRESS: at 42.40% examples, 558807 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:17,520 : INFO : PROGRESS: at 43.27% examples, 558287 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:18,534 : INFO : PROGRESS: at 44.13% examples, 557685 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:19,538 : INFO : PROGRESS: at 44.93% examples, 556591 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:20,540 : INFO : PROGRESS: at 45.83% examples, 556881 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:21,541 : INFO : PROGRESS: at 46.75% examples, 557341 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:22,553 : INFO : PROGRESS: at 47.69% examples, 557860 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:23,557 : INFO : PROGRESS: at 48.51% examples, 557066 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:24,564 : INFO : PROGRESS: at 49.42% examples, 557201 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:25,571 : INFO : PROGRESS: at 50.31% examples, 557231 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:57:26,585 : INFO : PROGRESS: at 51.26% examples, 557820 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:27,586 : INFO : PROGRESS: at 52.22% examples, 558455 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:57:28,588 : INFO : PROGRESS: at 53.16% examples, 558932 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:29,609 : INFO : PROGRESS: at 54.11% examples, 559389 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:30,616 : INFO : PROGRESS: at 55.01% examples, 559415 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:31,642 : INFO : PROGRESS: at 55.87% examples, 558596 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:32,647 : INFO : PROGRESS: at 56.78% examples, 558665 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:33,656 : INFO : PROGRESS: at 57.57% examples, 557526 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:34,660 : INFO : PROGRESS: at 58.39% examples, 556830 words/s, in_qsize 4, out_qsize 0
2016-5-2 18:57:35,664 : INFO : PROGRESS: at 59.31% examples, 557019 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:36,670 : INFO : PROGRESS: at 60.12% examples, 556187 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:37,683 : INFO : PROGRESS: at 60.94% examples, 555461 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:38,686 : INFO : PROGRESS: at 61.78% examples, 554836 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:39,705 : INFO : PROGRESS: at 62.54% examples, 553555 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:40,710 : INFO : PROGRESS: at 63.35% examples, 552863 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:41,719 : INFO : PROGRESS: at 64.12% examples, 551760 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:42,726 : INFO : PROGRESS: at 64.93% examples, 551152 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:43,741 : INFO : PROGRESS: at 65.74% examples, 550535 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:44,743 : INFO : PROGRESS: at 66.51% examples, 549746 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:45,743 : INFO : PROGRESS: at 67.23% examples, 548498 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:46,773 : INFO : PROGRESS: at 67.98% examples, 547297 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:47,786 : INFO : PROGRESS: at 68.81% examples, 546808 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:48,792 : INFO : PROGRESS: at 69.58% examples, 546028 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:49,798 : INFO : PROGRESS: at 70.37% examples, 545344 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:50,807 : INFO : PROGRESS: at 71.19% examples, 545012 words/s, in_qsize 6, out_qsize 1
2016-5-2 18:57:51,802 : INFO : PROGRESS: at 72.09% examples, 545184 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:52,806 : INFO : PROGRESS: at 72.98% examples, 545315 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:53,827 : INFO : PROGRESS: at 73.92% examples, 545714 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:54,827 : INFO : PROGRESS: at 74.86% examples, 546256 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:55,840 : INFO : PROGRESS: at 75.79% examples, 546379 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:56,851 : INFO : PROGRESS: at 76.73% examples, 546823 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:57:57,843 : INFO : PROGRESS: at 77.66% examples, 547189 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:58,847 : INFO : PROGRESS: at 78.50% examples, 546858 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:57:59,849 : INFO : PROGRESS: at 79.39% examples, 546959 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:00,854 : INFO : PROGRESS: at 80.27% examples, 546954 words/s, in_qsize 5, out_qsize 1
2016-5-2 18:58:01,856 : INFO : PROGRESS: at 81.22% examples, 547394 words/s, in_qsize 3, out_qsize 0
2016-5-2 18:58:02,875 : INFO : PROGRESS: at 82.13% examples, 547429 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:03,888 : INFO : PROGRESS: at 83.07% examples, 547815 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:04,880 : INFO : PROGRESS: at 84.00% examples, 548153 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:05,895 : INFO : PROGRESS: at 84.91% examples, 548428 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:06,888 : INFO : PROGRESS: at 85.77% examples, 548357 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:07,901 : INFO : PROGRESS: at 86.64% examples, 548365 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:08,897 : INFO : PROGRESS: at 87.50% examples, 548265 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:09,902 : INFO : PROGRESS: at 88.42% examples, 548504 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:10,916 : INFO : PROGRESS: at 89.18% examples, 547765 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:11,921 : INFO : PROGRESS: at 89.94% examples, 547006 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:12,923 : INFO : PROGRESS: at 90.81% examples, 546992 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:13,930 : INFO : PROGRESS: at 91.72% examples, 547225 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:14,935 : INFO : PROGRESS: at 92.59% examples, 547187 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:15,939 : INFO : PROGRESS: at 93.46% examples, 547133 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:16,944 : INFO : PROGRESS: at 94.18% examples, 546224 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:17,953 : INFO : PROGRESS: at 94.93% examples, 545497 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:18,959 : INFO : PROGRESS: at 95.70% examples, 544697 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:19,967 : INFO : PROGRESS: at 96.40% examples, 543702 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:20,974 : INFO : PROGRESS: at 97.26% examples, 543612 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:21,978 : INFO : PROGRESS: at 98.17% examples, 543801 words/s, in_qsize 5, out_qsize 0
2016-5-2 18:58:22,994 : INFO : PROGRESS: at 99.07% examples, 543908 words/s, in_qsize 4, out_qsize 2
2016-5-2 18:58:23,989 : INFO : PROGRESS: at 99.91% examples, 543692 words/s, in_qsize 6, out_qsize 0
2016-5-2 18:58:24,067 : INFO : worker thread finished; awaiting finish of 2 more threads
2016-5-2 18:58:24,083 : INFO : worker thread finished; awaiting finish of 1 more threads
2016-5-2 18:58:24,086 : INFO : worker thread finished; awaiting finish of 0 more threads
2016-5-2 18:58:24,086 : INFO : training on 85026035 raw words (62534095 effective words) took 115.0s, 543725 effective words/s
2016-5-2 18:58:24,086 : INFO : precomputing L2-norms of word weight vectors
Similarity between woman and man: 0.699695936218
--------
Words most related to good:
bad 0.721469461918
poor 0.567566931248
safe 0.534923613071
luck 0.518905758858
courage 0.510788619518
useful 0.498157411814
quick 0.497716665268
easy 0.497328162193
everyone 0.485905945301
pleasure 0.483758479357
true 0.482762247324
simple 0.480014979839
practical 0.479516804218
fair 0.479104012251
happy 0.476968646049
wrong 0.476797521114
reasonable 0.476701617241
you 0.475801795721
fun 0.472196519375
helpful 0.471719056368
--------
"boy" is to "father" as "girl" is to ...?
mother 0.76334130764
grandmother 0.690031766891
daughter 0.684129178524
--------
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
--------
Odd one out: cereal
--------
2016-5-2 18:58:24,185 : INFO : saving Word2Vec object under text8.model, separately None
2016-5-2 18:58:24,185 : INFO : storing numpy array 'syn1neg' to text8.model.syn1neg.npy
2016-5-2 18:58:24,235 : INFO : not storing attribute syn0norm
2016-5-2 18:58:24,235 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy
2016-5-2 18:58:24,278 : INFO : not storing attribute cum_table
2016-5-2 18:58:25,083 : INFO : storing 71290x200 projection weights into text8.model.bin
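The figures in the training log above can be cross-checked with a little arithmetic. The sketch below copies the numbers straight from the log. The memory breakdown (about 500 bytes of vocabulary bookkeeping per word plus two float32 weight matrices) is my assumption about how the estimate is formed, chosen because it reproduces the logged value; it is not taken from the source text.

```python
# Figures copied from the log above.
raw_words = 17005207          # words in the text8 corpus
word_types = 253854           # distinct words before filtering
retained_types = 71290        # distinct words kept by min_count=5
retained_words = 16718844     # corpus size after dropping rare words
downsampled_words = 12506280  # estimated corpus size after sample=0.001
trained_raw_words = 85026035  # raw words seen during training
effective_words = 62534095    # words actually used for training
seconds = 115.0
dims = 200

# Number of passes over the corpus implied by the raw word totals.
epochs = trained_raw_words // raw_words
dropped_types = word_types - retained_types
kept_after_min_count = 100.0 * retained_words / raw_words
kept_after_sampling = 100.0 * downsampled_words / retained_words
expected_effective = downsampled_words * epochs
throughput = effective_words / seconds

# Assumed breakdown: 500 bytes per vocabulary entry, plus the input vectors
# and the negative-sampling output weights, each vocab x dims float32 values.
memory = retained_types * 500 + 2 * retained_types * dims * 4

print("epochs:", epochs)
print("dropped word types:", dropped_types)
print("kept after min_count: %.1f%%" % kept_after_min_count)
print("kept after downsampling: %.1f%%" % kept_after_sampling)
print("expected effective words:", expected_effective)
print("effective words/s: %.0f" % throughput)
print("estimated memory in bytes:", memory)
```

The raw word total is exactly five times the corpus size, which suggests five passes over text8, and the estimated effective word count lands within a fraction of a percent of the logged one.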
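In the autumn of 2002, the Tianwang group at Peking University's Network and Distributed Systems Laboratory mobilized several dozen students from different majors to hand-select a brand-new, large-scale set of Chinese web page samples organized by a hierarchical model. It contains 11,678 training pages and 3,630 test pages, distributed across 11 top-level categories.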
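StandardAnalyzer (Chinese and English), ChineseAnalyzer (Chinese), CJKAnalyzer (Chinese and English), IKAnalyzer (Chinese and English; also compatible with Korean and Japanese), paoding (Chinese), MMAnalyzer (Chinese and English), MMSeg4j (Chinese and English), imdict (Chinese and English), NLTK (Chinese and English), Jieba (Chinese and English).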