Author: 2tong
Unlike traditional machine learning, deep learning provides feature extraction as well as classification.
The basic idea behind the word2vec model is to predict the words that appear in the surrounding context. For each input text we pick a context window and a center word, and use the center word to predict the probability of the other words appearing in the window. Because of this, word2vec can conveniently learn vector representations for newly appearing words from additional corpora, making it an efficient online learning algorithm.
The main idea of word2vec is to let a word and its context predict each other, which gives two corresponding algorithms:
Intuitively, Skip-Gram predicts the context given the input word.
Intuitively, CBOW predicts the input word given its context.
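As a concrete illustration, the choice between the two architectures is just a flag in gensim's Word2Vec. Below is a minimal sketch with a made-up toy corpus, using the same gensim 3.x style API as the training script later in this post: sg=0 trains CBOW (the default, and the mode used in the run below), while sg=1 trains Skip-Gram.

from gensim.models.word2vec import Word2Vec

toy_corpus = [
    ['we', 'choose', 'a', 'context', 'window', 'and', 'a', 'center', 'word'],
    ['the', 'center', 'word', 'predicts', 'the', 'words', 'in', 'the', 'window'],
]

# sg=0 -> CBOW: the surrounding context predicts the center word (gensim default)
cbow_model = Word2Vec(toy_corpus, size=10, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: the center word predicts its surrounding context
skipgram_model = Word2Vec(toy_corpus, size=10, window=2, min_count=1, sg=1)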
To avoid computing softmax probabilities over every word in the vocabulary, word2vec adopts a Huffman tree in place of the mapping from the hidden layer to the output softmax layer (hierarchical softmax).
Building the Huffman tree:
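The construction follows the standard Huffman procedure: every vocabulary word starts as a leaf whose weight is its frequency, the two lowest-weight nodes are repeatedly merged into a new parent whose weight is their sum, and the process stops when a single root remains, so frequent words sit near the root and need fewer binary decisions. The following is only an illustrative sketch of that merging step in plain Python with heapq and hypothetical word counts, not the actual word2vec implementation.

import heapq
import itertools

def build_huffman_tree(word_counts):
    """Repeatedly merge the two least-frequent nodes until one root remains."""
    counter = itertools.count()  # tie-breaker so heap tuples are always comparable
    heap = [(count, next(counter), word) for word, count in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(counter), (left, right)))
    return heap[0][2]  # nested tuples: leaves are words, internal nodes are pairs

# hypothetical counts: the most frequent word ends up closest to the root
print(build_huffman_tree({'the': 50, 'cat': 10, 'sat': 8, 'mat': 3}))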
Training a neural network means feeding in training samples and continually adjusting the neurons' weights so that predictions of the target keep improving. Every time the network is trained on one sample, its weights are adjusted once.
The size of the vocabulary therefore determines how large the weight matrices of our Skip-Gram network are, and all of these weights have to be tuned with hundreds of millions of training samples, which consumes a lot of computation and makes training very slow in practice.
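To make that concrete with the numbers from the training run below: the retained vocabulary there has 4324 words and the vectors are 100-dimensional, so each of the two weight matrices already holds 4324 × 100 = 432,400 weights, and with a full softmax every single (center word, context word) pair would update the entire output matrix.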
Negative Sampling was proposed to solve this problem. It is a technique for speeding up training while also improving the quality of the resulting word vectors. Instead of updating all of the weights for every training sample, negative sampling lets each training sample update only a small fraction of the weights, which reduces the amount of computation in gradient descent.
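The following is an illustrative numpy sketch of that idea, with a hypothetical vocabulary size, learning rate and word counts, using the frequency^(3/4) sampling heuristic from the original word2vec paper; it is not how gensim implements it internally. For one positive (center, context) pair, only the context word's output vector plus k sampled negative vectors are touched, instead of all output vectors in the vocabulary.

import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim, k = 1000, 100, 5            # hypothetical sizes; k negatives per pair
counts = rng.integers(1, 100, vocab_size)    # hypothetical word frequencies
sampling_probs = counts ** 0.75              # frequency^(3/4) sampling distribution
sampling_probs = sampling_probs / sampling_probs.sum()

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center_id, context_id, lr=0.025):
    """One SGD step for a single (center, context) pair with k sampled negatives."""
    negative_ids = rng.choice(vocab_size, size=k, p=sampling_probs)
    targets = np.concatenate(([context_id], negative_ids))  # 1 positive + k negatives
    labels = np.array([1.0] + [0.0] * k)

    v_center = W_in[center_id]                 # (dim,)
    v_targets = W_out[targets]                 # (k+1, dim)
    scores = sigmoid(v_targets @ v_center)     # (k+1,)

    grad = (scores - labels)[:, None]          # gradient of the logistic loss
    W_out[targets] -= lr * grad * v_center     # only k+1 output rows are updated
    W_in[center_id] -= lr * (grad * v_targets).sum(axis=0)

train_pair(center_id=3, context_id=7)

In the gensim training run below this corresponds to the negative=5 setting visible in the log.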
import logging
import numpy as np
import pandas as pd
from gensim.models.word2vec import Word2Vec
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

    # split data into 10 folds
    fold_num = 10
    data_file = '/home/2tong/data/train_set.csv'
    # all_data2fold is defined elsewhere in the project: it reads data_file and
    # returns fold_num folds, each a dict whose 'text' entry is a list of documents
    fold_data = all_data2fold(fold_num)

    # build train data for word2vec: use every fold except the last one
    fold_id = fold_num - 1
    train_texts = []
    for i in range(0, fold_id):
        data = fold_data[i]
        train_texts.extend(data['text'])
    logging.info('Total %d docs.' % len(train_texts))

    logging.info('Start training...')
    num_features = 100  # word vector dimensionality
    num_workers = 8     # number of threads to run in parallel

    # each document is a whitespace-separated string of tokens; split into word lists
    train_texts = list(map(lambda x: list(x.split()), train_texts))
    model = Word2Vec(train_texts, workers=num_workers, size=num_features)
    # L2-normalize the vectors in place (gensim 3.x API; drops the raw vectors)
    model.init_sims(replace=True)

    # save model
    model.save("./word2vec_model/word2vec.bin")

    # convert format: also store the vectors as a plain-text word2vec file
    model.wv.save_word2vec_format('./word2vec_model/word2vec.txt', binary=False)
2020-07-31 21:28:57,521 INFO: Fold lens [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
2020-07-31 21:28:57,543 INFO: Total 9000 docs.
2020-07-31 21:28:57,543 INFO: Start training...
2020-07-31 21:28:58,221 INFO: collecting all words and their counts
2020-07-31 21:28:58,221 INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-07-31 21:28:59,413 INFO: collected 5290 word types from a corpus of 8223375 raw words and 9000 sentences
2020-07-31 21:28:59,413 INFO: Loading a fresh vocabulary
2020-07-31 21:28:59,507 INFO: effective_min_count=5 retains 4324 unique words (81% of original 5290, drops 966)
2020-07-31 21:28:59,507 INFO: effective_min_count=5 leaves 8221380 word corpus (99% of original 8223375, drops 1995)
2020-07-31 21:28:59,519 INFO: deleting the raw counts dictionary of 5290 items
2020-07-31 21:28:59,520 INFO: sample=0.001 downsamples 61 most-common words
2020-07-31 21:28:59,520 INFO: downsampling leaves estimated 7098252 word corpus (86.3% of prior 8221380)
2020-07-31 21:28:59,528 INFO: estimated required memory for 4324 words and 100 dimensions: 5621200 bytes
2020-07-31 21:28:59,528 INFO: resetting layer weights
2020-07-31 21:29:00,375 INFO: training model with 8 workers on 4324 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-07-31 21:29:01,384 INFO: EPOCH 1 - PROGRESS: at 25.21% examples, 1784282 words/s, in_qsize 16, out_qsize 1
...
2020-07-31 21:29:04,402 INFO: EPOCH 1 - PROGRESS: at 91.77% examples, 1617927 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:04,699 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:04,732 INFO: EPOCH - 1 : training on 8223375 raw words (7061176 effective words) took 4.3s, 1623264 effective words/s
2020-07-31 21:29:05,743 INFO: EPOCH 2 - PROGRESS: at 21.42% examples, 1504059 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:08,767 INFO: EPOCH 2 - PROGRESS: at 84.64% examples, 1481574 words/s, in_qsize 16, out_qsize 0
2020-07-31 21:29:09,398 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:09,419 INFO: EPOCH - 2 : training on 8223375 raw words (7062608 effective words) took 4.7s, 1508810 effective words/s
2020-07-31 21:29:10,430 INFO: EPOCH 3 - PROGRESS: at 23.41% examples, 1645736 words/s, in_qsize 16, out_qsize 0
...
2020-07-31 21:29:13,886 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:13,909 INFO: EPOCH - 3 : training on 8223375 raw words (7062543 effective words) took 4.5s, 1574839 effective words/s
2020-07-31 21:29:14,923 INFO: EPOCH 4 - PROGRESS: at 18.72% examples, 1321168 words/s, in_qsize 15, out_qsize 0
...
2020-07-31 21:29:17,946 INFO: EPOCH 4 - PROGRESS: at 86.46% examples, 1514607 words/s, in_qsize 15, out_qsize 0
2020-07-31 21:29:18,515 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:18,544 INFO: EPOCH - 4 : training on 8223375 raw words (7060892 effective words) took 4.6s, 1524940 effective words/s
2020-07-31 21:29:19,559 INFO: EPOCH 5 - PROGRESS: at 21.06% examples, 1472794 words/s, in_qsize 13, out_qsize 2
...
2020-07-31 21:29:22,568 INFO: EPOCH 5 - PROGRESS: at 88.28% examples, 1552921 words/s, in_qsize 14, out_qsize 1
2020-07-31 21:29:23,043 INFO: worker thread finished; awaiting finish of 7 more threads
...
2020-07-31 21:29:23,063 INFO: EPOCH - 5 : training on 8223375 raw words (7061288 effective words) took 4.5s, 1564422 effective words/s
2020-07-31 21:29:23,064 INFO: training on a 41116875 raw words (35308507 effective words) took 22.7s, 1556223 effective words/s
2020-07-31 21:29:23,064 INFO: precomputing L2-norms of word weight vectors
2020-07-31 21:29:23,068 INFO: saving Word2Vec object under ./word2vec_model/word2vec.bin, separately None
2020-07-31 21:29:23,069 INFO: not storing attribute vectors_norm
2020-07-31 21:29:23,069 INFO: not storing attribute cum_table
2020-07-31 21:29:23,130 INFO: saved ./word2vec_model/word2vec.bin
2020-07-31 21:29:23,131 INFO: storing 4324x100 projection weights into ./word2vec_model/word2vec.txt
word2vec_model/
├── word2vec.bin
└── word2vec.txt
0 directories, 2 files
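The saved files can later be loaded back for downstream use, for example to look up a word's vector or its nearest neighbours. A minimal sketch, assuming the same gensim 3.x API as above; the query word is simply whatever happens to be first in the trained vocabulary, since the concrete tokens depend on the corpus.

from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

model = Word2Vec.load("./word2vec_model/word2vec.bin")

# pick an arbitrary token from the trained vocabulary as a demo query
some_word = list(model.wv.vocab)[0]
print(model.wv[some_word].shape)                  # (100,) vector for that token
print(model.wv.most_similar(some_word, topn=5))   # its 5 nearest neighbours

# the text-format file can also be loaded on its own, without the full model object
vectors = KeyedVectors.load_word2vec_format('./word2vec_model/word2vec.txt', binary=False)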