Natural Language Processing: Word2Vec

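Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. Although Word2vec is not a deep neural network, it turns text into a numerical form that deep networks can understand. Deeplearning4j implements a distributed form of Word2vec for Java and Scala, which runs on Spark with GPUs.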

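Word2vec's applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns can be discerned.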

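Why? Because words, like the other data mentioned above, are simply discrete states, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete, co-occurring states.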
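To make "discrete, co-occurring states" concrete, here is a minimal sketch that counts how often two items appear near each other in a sequence. The function name `cooccurrence_counts`, the `window` parameter and the playlist data are all hypothetical, invented for illustration. Word2vec does not build a count table like this; it learns from nearby pairs directly. The sketch only shows that the input can be any sequence of tokens, not just words.

```python
from collections import Counter

def cooccurrence_counts(sequence, window=2):
    """Count how often two states appear within `window` positions of each other."""
    counts = Counter()
    for i, center in enumerate(sequence):
        # Look only to the right so each neighboring occurrence is counted once.
        for j in range(i + 1, min(i + window + 1, len(sequence))):
            # Sort the pair so (a, b) and (b, a) share one key.
            pair = tuple(sorted((center, sequence[j])))
            counts[pair] += 1
    return counts

# Hypothetical playlist of track identifiers (not words at all).
playlist = ["track_a", "track_b", "track_c", "track_a", "track_b"]
counts = cooccurrence_counts(playlist, window=1)
```

With `window=1`, only adjacent tracks count as co-occurring, so `track_a` and `track_b` are paired twice in this toy playlist.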

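The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words. It does so without human intervention.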

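Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word's meaning based on its past appearances. Those guesses can be used to establish a word's association with other words (for example, "man" is to "boy" what "woman" is to "girl"), or to cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in fields as diverse as scientific research, legal discovery, e-commerce and customer relationship management.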

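The output of the Word2vec neural network is a vocabulary in which each item has a vector attached to it. That vocabulary can be fed into a deep-learning network or simply queried to detect relationships between words.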

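Measured by cosine similarity, no similarity corresponds to a 90-degree angle (a cosine of 0), while total similarity of 1 corresponds to a 0-degree angle, complete overlap. Sweden equals Sweden, while Norway has a cosine similarity of 0.760124 with Sweden, the highest of any other country.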
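The relationship between cosine similarity and angle can be sketched in a few lines of plain Python. The helper names `cosine_similarity` and `angle_degrees` and the toy 2-dimensional vectors are assumptions made for illustration; they are not part of Word2vec or Deeplearning4j. Only the value 0.760124 comes from the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))

# Toy vectors, invented for illustration.
same = angle_degrees([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = angle_degrees([1.0, 0.0], [0.0, 1.0])  # no similarity

# The Norway/Sweden similarity quoted in the text, converted to an angle.
norway_angle = math.degrees(math.acos(0.760124))    # between 40 and 41 degrees
```

Identical vectors give an angle of 0 degrees and perpendicular vectors give 90 degrees, matching the two extremes described above.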

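When Word2vec is asked for the words most closely associated with "Sweden", ranked by proximity, the nations of Scandinavia and several wealthy, northern European, Germanic countries are among the top nine.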

Neural Word Embeddings

The vectors we use to represent words are called neural word embeddings, and representations are strange. One thing describes another, even though those two things are radically different. As Elvis Costello said: “Writing about music is like dancing about architecture.” Word2vec “vectorizes” about words, and by doing so it makes natural language computer-readable; we can start to perform powerful mathematical operations on words to detect their similarities.

So a neural word embedding represents a word with numbers. It’s a simple, yet unlikely, translation.

Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, word2vec trains words against other words that neighbor them in the input corpus.

It does so in one of two ways, either using context to predict a target word (a method known as continuous bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram. We use the latter method because it produces more accurate results on large datasets.
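The difference between the two methods shows up in how training examples are built from a sentence. The sketch below is a minimal illustration, not Deeplearning4j's implementation; the function names `skipgram_pairs` and `cbow_examples`, the `window` parameter and the sample sentence are all assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word is the input, each neighbor is a prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the middle word is the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
# sg starts with ("the", "quick"), ("quick", "the"), ("quick", "brown"), ...
# cb contains (["the", "brown"], "quick"): neighbors on the left, target on the right.
```

Skip-gram produces one example per (word, neighbor) pair, so it yields more training examples than CBOW from the same text, which produces one example per word.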

When the feature vector assigned to a word cannot be used to accurately predict that word’s context, the components of the vector are adjusted. Each word’s context in the corpus is the teacher sending error signals back to adjust the feature vector. The vectors of words judged similar by their context are nudged closer together by adjusting the numbers in the vector.
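The adjustment described above can be sketched as a single gradient step. The text does not say which training objective is used, so this example assumes a logistic objective in the style of negative sampling: the dot product of a word vector and a context vector is pushed up for pairs seen in the corpus and down for randomly sampled pairs. The function `sgd_step`, the learning rate and the toy vectors are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_step(word_vec, context_vec, label, lr=0.1):
    """One logistic update on a (word, context) pair.

    label is 1 if the pair was observed in the corpus, 0 for a sampled negative.
    Returns updated copies of both vectors.
    """
    # The error signal: how far the predicted probability is from the label.
    error = label - sigmoid(dot(word_vec, context_vec))
    # Move each vector along the other, scaled by the error.
    new_word = [w + lr * error * c for w, c in zip(word_vec, context_vec)]
    new_context = [c + lr * error * w for w, c in zip(word_vec, context_vec)]
    return new_word, new_context

# Toy 2-dimensional vectors, invented for illustration.
word = [0.1, -0.2]
context = [0.3, 0.4]
old_score = dot(word, context)
new_word, new_context = sgd_step(word, context, label=1)
new_score = dot(new_word, new_context)
```

After the update for an observed pair, `new_score` is higher than `old_score`: the two vectors have been nudged closer together, which is the effect the paragraph above describes.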

Just as Van Gogh’s painting of sunflowers is a two-dimensional mixture of oil on canvas that represents vegetable matter in a three-dimensional space in Paris in the late 1880s, so 500 numbers arranged in a vector can represent a word or group of words.

Those numbers locate each word as a point in 500-dimensional vector space. Spaces of more than three dimensions are difficult to visualize. (Geoff Hinton, teaching people to imagine 13-dimensional space, suggests that students first picture 3-dimensional space and then say to themselves: “Thirteen, thirteen, thirteen.”)
