gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)
- sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in the word2vec module for such examples, and the training sketch after this parameter list. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use this if you plan to initialize it in some other way.
- corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (or none of them).
- size (int, optional) – Dimensionality of the word vectors.
- window (int, optional) – Maximum distance between the current and predicted word within a sentence.
- min_count (int, optional) – Ignores all words with a total frequency lower than this.
- workers (int, optional) – Use this many worker threads to train the model (faster training with multicore machines).
- sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
- hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
- negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
- ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
- cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when CBOW is used.
- alpha (float, optional) – The initial learning rate.
- min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
- seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
- max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
- max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.
- sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
- hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.
- iter (int, optional) – Number of iterations (epochs) over the corpus.
- trim_rule (function, optional) – Vocabulary trimming rule: specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_TRIM. The rule is only used to prune the vocabulary during vocabulary building and is not stored as part of the model.
- sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().
- batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
- compute_loss (bool, optional) – If True, computes and stores the loss value, which can be retrieved using get_latest_training_loss().
- callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.
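A minimal training sketch tying these parameters together, using the gensim 3.x-style names shown in the signature above (size, iter). The path corpus.txt is a placeholder for a whitespace-tokenized text file with one sentence per line, as expected by LineSentence:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from disk instead of holding the whole corpus in RAM.
sentences = LineSentence('corpus.txt')  # placeholder path: one sentence per line

model = Word2Vec(
    sentences=sentences,
    size=100,           # dimensionality of the word vectors
    window=5,           # maximum distance between current and predicted word
    min_count=5,        # ignore words with total frequency below this
    workers=4,          # worker threads for faster training on multicore machines
    sg=1,               # 1 = skip-gram, 0 = CBOW
    negative=5,         # number of "noise words" drawn for negative sampling
    iter=5,             # epochs over the corpus
    compute_loss=True,  # store the training loss
)

print(model.get_latest_training_loss())  # available because compute_loss=True
```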
Word2VecKeyedVectors
- This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways.
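For illustration, a trained model's embeddings can be queried through model.wv, which is a Word2VecKeyedVectors instance. The snippet below is a self-contained sketch with a tiny in-memory corpus; 'computer' and 'interface' are only example tokens:

```python
from gensim.models import Word2Vec

# Tiny toy corpus (a list of lists of tokens) so the snippet runs on its own;
# real corpora should be streamed as in the training sketch above.
toy_corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['user', 'interface', 'system'],
]
model = Word2Vec(toy_corpus, size=50, min_count=1, seed=1, workers=1)

vector = model.wv['computer']                        # numpy array of shape (50,)
print(model.wv.most_similar('computer', topn=2))     # [(word, cosine similarity), ...]
print(model.wv.similarity('computer', 'interface'))  # cosine similarity of two words
```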
Word2VecVocab
- This object represents the vocabulary of the model (sometimes called the Dictionary in gensim). Besides keeping track of all unique words, it also provides extra functionality, such as building a Huffman tree (frequent words are closer to the root) or discarding extremely rare words.
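A quick way to inspect that vocabulary, assuming the gensim 3.x attribute layout (model.wv.vocab as a dict of word to vocabulary entries) and continuing with the toy model above:

```python
# Number of unique words retained after min_count pruning.
print(len(model.wv.vocab))

# Raw corpus frequency of a word, if it survived pruning.
if 'computer' in model.wv.vocab:
    print(model.wv.vocab['computer'].count)
```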
Word2VecTrainables
- This object represents the inner shallow neural network used to train the embeddings. The semantics of the network differ slightly in the two available training modes (CBOW or SG), but you can think of it as a NN with a single projection and hidden layer that we train on the corpus. The weights are then used as the embeddings (which means that the size of the hidden layer is equal to the number of features, self.size).
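A small sketch of how those trained weights surface as embeddings, again assuming the gensim 3.x attribute layout (model.wv.vectors for the input weight matrix, model.trainables.layer1_size for the hidden layer width) and the toy model above:

```python
# One row per vocabulary word, `size` columns: these rows are the word vectors.
print(model.wv.vectors.shape)        # (len(model.wv.vocab), size)

# Width of the hidden/projection layer, equal to the `size` parameter.
print(model.trainables.layer1_size)
```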