Reposted from: SevenBlue
word2vec
Pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in this paper.
download link | source link
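To actually use these vectors in Python, a minimal loading sketch with gensim (a library not mentioned in the original post) could look like the following; the file name is an assumption based on the usual GoogleNews-vectors-negative300.bin.gz download, so adjust the path to your copy.

```python
# Minimal sketch: load the Google News word2vec vectors with gensim.
# The file name is an assumption; binary=True because these vectors
# ship in word2vec's binary format.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

print(wv["computer"].shape)                # (300,)
print(wv.most_similar("computer", topn=3)) # nearest neighbours by cosine similarity
```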
fastText
1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
download link | source link
1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
download link | source link
2 million word vectors trained on Common Crawl (600B tokens).
download link | source link
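These English fastText releases are distributed as plain-text .vec files in word2vec format, so gensim can read them directly. The sketch below assumes a file name such as wiki-news-300d-1M.vec (not stated in the original post); the text format stores one vector per word, so composing vectors for unseen words requires the binary fastText models instead (see the sketch in the Common Crawl + Wikipedia section further below).

```python
# Minimal sketch: load a fastText .vec text dump with gensim.
# The file name is an assumption; binary=False because .vec files are plain text.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)
print(wv.most_similar("king", topn=3))
```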
GloVe
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)
download link | source link
Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
download link | source link
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
download link | source link
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)
download link | source link
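GloVe files differ from the word2vec text format only in that they have no header line, so they can be parsed directly. A minimal sketch, assuming the glove.6B.300d.txt file from the 6B-token download:

```python
# Minimal sketch: read a GloVe text file ("word v1 v2 ... vd" per line, no header).
# The file name is an assumption matching the Wikipedia 2014 + Gigaword 5 release.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.300d.txt")
print(glove["computer"][:5])
```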
word2vec (Chinese)
Chinese Wikipedia dump, vector size 300, corpus size 1 GB, vocabulary size 50,101, segmented with the Jieba tokenizer
download link | source link
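Because these vectors were built on Jieba-segmented text, sentences should be segmented with Jieba before lookup. A minimal sketch, where both the file name and its format (word2vec text) are assumptions:

```python
# Minimal sketch: segment a Chinese sentence with Jieba, then look up each token.
# The vector file name and text format are assumptions, not from the original post.
import jieba
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("zh_wiki_word2vec_300.txt", binary=False)

tokens = list(jieba.cut("自然语言处理很有趣"))
vectors = [wv[t] for t in tokens if t in wv]  # skip out-of-vocabulary tokens
print(tokens, len(vectors))
```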
fastText (Chinese)
Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negatives. The Stanford word segmenter was used for tokenization.
download link | source link
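The binary release of this model keeps the character n-gram information, so the official fasttext package can compose vectors even for words that never appeared in training. A minimal sketch, assuming the file is named cc.zh.300.bin as in the Common Crawl + Wikipedia downloads:

```python
# Minimal sketch: load the binary fastText model and query a word vector.
# The file name cc.zh.300.bin is an assumption about the download.
import fasttext

model = fasttext.load_model("cc.zh.300.bin")
print(model.get_dimension())                  # 300
print(model.get_word_vector("机器学习")[:5])   # subword composition handles rare words
```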
References
https://github.com/Hironsan/awesome-embedding-models
http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
https://code.google.com/archive/p/word2vec/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
https://fasttext.cc/docs/en/english-vectors.html
https://arxiv.org/pdf/1310.4546.pdf