Natural Language Processing: Building Word Vectors with Gensim (Simple Version)

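Table of contents

  • Natural Language Processing: Building Word Vectors with Gensim (Simple Version)
    • 1. Import the modules
    • 2. Two sentences
    • 3. Tokenization
    • 4. Build the model
      • min_count:
      • Size:
    • 5. Test the similarity of two words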


1. Import the modules

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)

2. Two sentences

raw_sentences=("the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep")

3. Tokenization

sentences=[s.split() for s in raw_sentences]
print(sentences)
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

4. Build the model

model=word2vec.Word2Vec(sentences,min_count=1)
2020-04-20 18:33:15,654:INFO:collecting all words and their counts
2020-04-20 18:33:15,655:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-20 18:33:15,656:INFO:collected 15 word types from a corpus of 16 raw words and 2 sentences
2020-04-20 18:33:15,657:INFO:Loading a fresh vocabulary
2020-04-20 18:33:15,658:INFO:effective_min_count=1 retains 15 unique words (100% of original 15, drops 0)
2020-04-20 18:33:15,659:INFO:effective_min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
2020-04-20 18:33:15,660:INFO:deleting the raw counts dictionary of 15 items
2020-04-20 18:33:15,660:INFO:sample=0.001 downsamples 15 most-common words
2020-04-20 18:33:15,661:INFO:downsampling leaves estimated 2 word corpus (13.7% of prior 16)
2020-04-20 18:33:15,663:INFO:estimated required memory for 15 words and 100 dimensions: 19500 bytes
2020-04-20 18:33:15,664:INFO:resetting layer weights
2020-04-20 18:33:15,665:INFO:training model with 3 workers on 15 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-20 18:33:15,672:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,676:INFO:EPOCH - 1 : training on 16 raw words (2 effective words) took 0.0s, 542 effective words/s
2020-04-20 18:33:15,680:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,681:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,682:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,684:INFO:EPOCH - 2 : training on 16 raw words (3 effective words) took 0.0s, 639 effective words/s
2020-04-20 18:33:15,688:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,692:INFO:EPOCH - 3 : training on 16 raw words (1 effective words) took 0.0s, 263 effective words/s
2020-04-20 18:33:15,697:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,699:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,700:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,701:INFO:EPOCH - 4 : training on 16 raw words (2 effective words) took 0.0s, 402 effective words/s
2020-04-20 18:33:15,705:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,708:INFO:EPOCH - 5 : training on 16 raw words (2 effective words) took 0.0s, 486 effective words/s
2020-04-20 18:33:15,709:INFO:training on a 80 raw words (10 effective words) took 0.0s, 234 effective words/s
2020-04-20 18:33:15,710:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

min_count:

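The minimum word frequency worth keeping depends on the size of the corpus. In a larger corpus, for example, we usually want to ignore words that appear only once or twice, and the min_count parameter controls this: words that occur fewer than min_count times are dropped from the vocabulary. Reasonable values generally fall between 0 and 100. Gensim's default is 5, which is why this two-sentence example sets min_count=1.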
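The effect of min_count can be illustrated without Gensim at all. The following is a minimal pure-Python sketch of frequency filtering on the two sentences from step 2; the helper name filter_vocab is hypothetical and this is not Gensim's actual implementation, only the same idea.

```python
from collections import Counter

raw_sentences = (
    "the quick brown fox jumps over the lazy dogs",
    "yoyoyo you go home now to sleep",
)
sentences = [s.split() for s in raw_sentences]


def filter_vocab(tokenized, min_count):
    """Return {word: count} for words that occur at least min_count times.

    Hypothetical helper for illustration; Gensim does its own vocabulary scan.
    """
    counts = Counter(word for sent in tokenized for word in sent)
    return {word: c for word, c in counts.items() if c >= min_count}


vocab_all = filter_vocab(sentences, min_count=1)   # keeps every word type
vocab_min2 = filter_vocab(sentences, min_count=2)  # keeps only repeated words

print(len(vocab_all), sum(vocab_all.values()))  # 15 word types, 16 raw words
print(vocab_min2)                               # {'the': 2}
```

The counts agree with the training log above ("collected 15 word types from a corpus of 16 raw words"). With min_count=2, only "the" would survive, so on a corpus this small the default of 5 would leave nothing to train on.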

Size:

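The size parameter sets the dimensionality of the word vectors (the width of the network's hidden layer), not the number of layers. The default in Word2Vec is 100. Larger values need more training data but can improve overall accuracy; reasonable values range from about 10 to a few hundred. In Gensim 4.0 and later this parameter is named vector_size.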
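The training log above reports "estimated required memory for 15 words and 100 dimensions: 19500 bytes", which gives a feel for how the vector size scales the model. The sketch below reproduces that figure under stated assumptions: roughly 500 bytes of bookkeeping per vocabulary word, one float32 input vector per word, and one float32 output-weight vector per word (negative sampling, hs=0, as in the log). The function name and this breakdown are assumptions for illustration, not a documented Gensim API.

```python
def estimate_memory_bytes(vocab_size, vector_size,
                          bytes_per_vocab_entry=500, float_bytes=4):
    """Rough memory estimate for a Word2Vec model (illustrative assumption).

    vocab_size:  number of retained word types
    vector_size: dimensionality of each word vector (the `size` parameter)
    """
    vocab = vocab_size * bytes_per_vocab_entry                # vocabulary bookkeeping
    input_vectors = vocab_size * vector_size * float_bytes    # the word vectors
    output_weights = vocab_size * vector_size * float_bytes   # negative-sampling weights
    return vocab + input_vectors + output_weights


print(estimate_memory_bytes(15, 100))  # 19500, matching the log above
print(estimate_memory_bytes(15, 300))  # 43500: the vector part grows linearly with size
```

Under these assumptions the vector-dependent part of the model grows linearly with both the vocabulary size and the chosen dimensionality.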

5. Test the similarity of two words

# model.similarity() is deprecated and removed in Gensim 4.0.0; call it on model.wv instead
model.wv.similarity('dogs','go')

-0.031395614
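The similarity score is the cosine of the angle between the two word vectors, so it ranges from -1 to 1. A minimal pure-Python sketch of that computation follows; the function name cosine_similarity and the toy vectors are illustrative and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen by hand to show the range of the score.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1.0

print(same_direction, orthogonal, opposite)
```

A score near 0, such as the one above, is what one would expect here: the log shows only about 10 effective words were trained on, so the vectors are probably still close to their random initialization and the number says little about the relation between "dogs" and "go".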
