Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.
Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.
Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vectorspace. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.
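As a quick sketch of how such associations can be queried in practice (this example is not from the original article; it assumes the gensim library and the pretrained 'word2vec-google-news-300' model listed later in this document, and the exact neighbours returned depend on the model used):

import gensim.downloader

# Load pretrained vectors (a large download on first use)
vectors = gensim.downloader.load('word2vec-google-news-300')

# "man" is to "boy" as "woman" is to ...?
# Internally this ranks words by similarity to the vector (boy - man + woman).
print(vectors.most_similar(positive=['boy', 'woman'], negative=['man'], topn=3))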
The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.
Measuring cosine similarity, no similarity is expressed as a 90-degree angle, while total similarity of 1 is a 0-degree angle, i.e. complete overlap: Sweden equals Sweden, while Norway has a cosine similarity of 0.760124 with Sweden, the highest of any other country.
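The cosine-similarity calculation itself is straightforward; here is a minimal NumPy sketch, where the two vectors are made-up toy embeddings rather than the actual Sweden and Norway vectors:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction (0-degree angle); 0.0 means orthogonal (90-degree angle)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional vectors, for illustration only
sweden = np.array([0.9, 0.1, 0.3, 0.7])
norway = np.array([0.8, 0.2, 0.4, 0.6])

print(cosine_similarity(sweden, sweden))  # 1.0: a word is identical to itself
print(cosine_similarity(sweden, norway))  # close to 1.0: similar vectors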
A Beginner’s Guide to Word2Vec and Neural Word Embeddings
I find the concept of embeddings to be one of the most fascinating ideas in machine learning. If you’ve ever used Siri, Google Assistant, Alexa, Google Translate, or even a smartphone keyboard with next-word prediction, then chances are you’ve benefitted from this idea, which has become central to Natural Language Processing models. There has been quite a lot of development over the last couple of decades in using embeddings for neural models (recent developments include contextualized word embeddings, leading to cutting-edge models like BERT and GPT2).
Word2vec is a method to efficiently create word embeddings and has been around since 2013. But in addition to its utility as a word-embedding method, some of its concepts have been shown to be effective in creating recommendation engines and making sense of sequential data even in commercial, non-language tasks. Companies like Airbnb, Alibaba, Spotify, and Anghami have all benefitted from carving out this brilliant piece of machinery from the world of NLP and using it in production to empower a new breed of recommendation engines.
In this post, we’ll go over the concept of embedding, and the mechanics of generating embeddings with word2vec. But let’s start with an example to get familiar with using vectors to represent things. Did you know that a list of five numbers (a vector) can represent so much about your personality?
The Illustrated Word2vec
Introduction
This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
Pre-trained word and phrase vectors
We are publishing pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The archive is available here: GoogleNews-vectors-negative300.bin.gz.
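These vectors can be loaded, for example, with gensim (a sketch assuming the archive above has been downloaded to the working directory; the optional limit argument loads only the most frequent words to keep memory use manageable):

from gensim.models import KeyedVectors

# gensim reads the gzipped binary archive directly
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000
)
print(vectors.most_similar('Sweden', topn=5))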
word2vec Google
Abstract
In this paper, we define a measure of dependency between two random variables, based on the Jensen-Shannon (JS) divergence between their joint distribution and the product of their marginal distributions. Then, we show that word2vec’s skip-gram with negative sampling embedding algorithm finds the optimal low-dimensional approximation of this JS dependency measure between the words and their contexts. The gap between the optimal score and the low-dimensional approximation is demonstrated on a standard text corpus.
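For reference, the Jensen-Shannon divergence referred to above has the standard form (a general definition, not the paper's own notation):

\[
\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q),
\]

and the dependency measure applies it as \(\mathrm{JS}\big(P(w,c)\,\|\,P(w)\,P(c)\big)\), comparing the joint word–context distribution with the product of its marginals.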
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 167–171
Vancouver, Canada, July 30 - August 4, 2017.
© 2017 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17-2026
Abstract
User profiling in social networks can be significantly augmented by using available full-text items such as posts or statuses and ratings (in the form of likes) that users give them. In this work, we apply modern natural language processing techniques based on word embeddings to several problems related to user profiling in social networks. First, we present an approach to create user profiles that measure a user’s interest in various topics mined from the full texts of the items. As a result, we get a user profile that can be used, e.g., for cold start recommendations for items, targeted advertisement, and other purposes; our experiments show that the interests mining method performs on a level comparable with collaborative algorithms while at the same time being a cold start approach, i.e., it does not use the likes of an item being recommended.
Second, we study the problem of predicting a user’s demographic attributes such as age and gender based on his or her full-text items. We evaluate the efficiency of various age prediction algorithms based on word2vec word embeddings and conduct an extensive experimental evaluation, comparing these algorithms with each other and with classical baseline approaches.
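As an illustration of the general idea, here is a sketch of one common baseline: averaging the word2vec vectors of a user's texts into a profile vector and training a classifier on it. This is not necessarily the exact method evaluated in the paper, and the data below is invented:

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy data: tokenised posts per user and a hypothetical demographic label per user
user_posts = [
    ['love', 'this', 'new', 'phone', 'so', 'much'],
    ['great', 'match', 'last', 'night', 'what', 'a', 'goal'],
    ['reading', 'a', 'new', 'novel', 'this', 'weekend'],
    ['cooking', 'dinner', 'for', 'friends', 'tonight'],
]
labels = [0, 1, 0, 1]

# Train word2vec on the users' texts (tiny toy settings)
model = Word2Vec(user_posts, vector_size=20, window=3, min_count=1)

def profile_vector(tokens):
    # A user profile is the average of the word vectors in that user's posts
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([profile_vector(p) for p in user_posts])
clf = LogisticRegression().fit(X, labels)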
Word Embeddings for User Profiling in Online Social Networks
Abstract
With the rapid expansion of new information presented to us online on a daily basis, text classification becomes imperative in order to classify and maintain it. Word2vec offers a unique perspective to the text-mining community. By converting words and phrases into a vector representation, word2vec takes an entirely new approach to text classification. Based on the assumption that word2vec brings extra semantic features that help in text classification, our work demonstrates the effectiveness of word2vec by showing that tf-idf and word2vec combined can outperform tf-idf, because word2vec provides features that complement tf-idf (e.g. semantics that tf-idf can’t capture). Our results show that the combination of word2vec weighted by tf-idf and plain tf-idf does not outperform tf-idf consistently; however, the results are consistent enough to say that the combination of the two can outperform either individually.
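A sketch of this kind of feature combination on toy data follows; the idf-weighted average of word vectors stands in for the "word2vec weighted by tf-idf" features, and the exact setup in the paper may differ:

import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and labels, for illustration only
docs = [
    'the cat sat on the mat',
    'stock prices fell sharply today',
    'the dog chased the cat',
    'markets rallied after the announcement',
]
labels = [0, 1, 0, 1]

# Plain tf-idf document features
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs).toarray()

# word2vec trained on the same tokenised corpus
tokenized = [d.split() for d in docs]
w2v = Word2Vec(tokenized, vector_size=25, window=3, min_count=1)

def weighted_doc_vector(tokens):
    # idf-weighted average of the word vectors in one document
    vecs, weights = [], []
    for tok in tokens:
        if tok in w2v.wv and tok in tfidf.vocabulary_:
            vecs.append(w2v.wv[tok])
            weights.append(tfidf.idf_[tfidf.vocabulary_[tok]])
    return np.average(vecs, axis=0, weights=weights) if vecs else np.zeros(w2v.vector_size)

X_w2v = np.vstack([weighted_doc_vector(t) for t in tokenized])

# Concatenate both feature sets and train a linear SVM
X = np.hstack([X_tfidf, X_w2v])
clf = LinearSVC().fit(X, labels)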
Support vector machines and Word2vec for text classification with semantic features
Introduction
This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming and Pythonic interfaces.
The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.
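A minimal training sketch with this module (parameter names as in gensim 4.x; the corpus here is a toy one): sg selects the architecture, and hs together with negative selects the training objective.

from gensim.models import Word2Vec

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['graph', 'minors', 'trees'],
]

# sg=1 selects skip-gram (sg=0 would be CBOW);
# hs=0 with negative=5 uses negative sampling (hs=1, negative=0 would use hierarchical softmax)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, hs=0, negative=5)

print(model.wv.most_similar('computer', topn=3))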
Word2vec embeddings
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300',
'conceptnet-numberbatch-17-06-300',
'word2vec-ruscorpora-300',
'word2vec-google-news-300',
'glove-wiki-gigaword-50',
'glove-wiki-gigaword-100',
'glove-wiki-gigaword-200',
'glove-wiki-gigaword-300',
'glove-twitter-25',
'glove-twitter-50',
'glove-twitter-100',
'glove-twitter-200',
'__testing_word2vec-matrix-synopsis']
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')
[('facebook', 0.948005199432373),
('tweet', 0.9403423070907593),
('fb', 0.9342358708381653),
('instagram', 0.9104824066162109),
('chat', 0.8964964747428894),
('hashtag', 0.8885937333106995),
('tweets', 0.8878158330917358),
('tl', 0.8778461217880249),
('link', 0.8778210878372192),
('internet', 0.8753897547721863)]