Learn TensorFlow, the Word2Vec model, and the TSNE algorithm using rock bands

by Patrick Ferris

Learning the “TensorFlow way” to build a neural network can seem like a big hurdle to getting started with machine learning. In this tutorial, we’ll take it step by step and explain all of the critical components involved as we build a Bands2Vec model using Pitchfork data from Kaggle.

For the full code, check out the GitHub page.

The Word2Vec Model

Neural networks consume numbers and produce numbers. They’re very good at it. But give them some text, and they’ll throw a tantrum and do nothing remotely interesting.

If it is the neural network’s job to crunch the numbers and produce meaningful output, then it is our job to make sure that whatever we are feeding it is meaningful too. This quest for a meaningful representation of information gave birth to the Word2Vec model.

One approach to working with words is to form one-hot encoded vectors. Create a list of zeroes as long as the number of distinct words in our vocabulary, and have each word point to a unique index of this list. When we see a word, we set its index in the list to one.
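
As a minimal sketch of that idea, assuming a made-up four-word vocabulary and a hypothetical helper named `one_hot`, the encoding might look like this in numpy:

```python
import numpy as np

# Hypothetical toy vocabulary, used only for illustration.
vocabulary = ["good", "excellent", "duck", "blackhole"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of zeroes with a single one at the word's index."""
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    vector[word_to_index[word]] = 1.0
    return vector


good = one_hot("good")
excellent = one_hot("excellent")
duck = one_hot("duck")

# The dot product between any two different one-hot vectors is the same.
print(np.dot(good, excellent), np.dot(good, duck))
```

The vector length grows with the vocabulary, so a real vocabulary produces very long and very sparse vectors.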

While this approach works, it requires a lot of space and is completely devoid of meaning. ‘Good’ and ‘Excellent’ are as similar as ‘Duck’ and ‘Blackhole’. If only there were a way to vectorise words so that we preserved this contextual similarity…

Thankfully, there is a way!

Using a neural network, we can produce ‘embeddings’ of our words. These are vectors, one for each unique word, extracted from the weights of the connections within our network.

But the question remains: how do we make sure they’re meaningful? The answer: feed in pairs of words as a target word and a context word. Do this enough times, throwing in some bad examples too, and the neural network begins to learn which words appear together and how they form something close to a graph, like a social network of words interconnected by contexts. ‘Good’ goes to ‘helpful’ which goes to ‘caring’ and so on. Our task is to feed this data into the neural network.

One of the most common approaches is the Skipgram model, generating these target-context pairings based on moving a window across a dataset of text. But what if our data isn’t sentences, but we still have contextual meaning?
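
For comparison, here is a minimal sketch of the sliding-window pairing on ordinary text. The function name `skipgram_pairs`, the example sentence, and the window size are assumptions chosen for illustration, not taken from the original code.

```python
def skipgram_pairs(tokens, window_size=1):
    """Pair each target token with every token inside its window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window_size=1)
for target, context in pairs:
    print(target, "->", context)
```

A wider window produces more pairs per word and treats more distant words as context.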

In this tutorial, our words are artist names and our contexts are genres and mean review scores. We want artist A to be close to artist B if they share a genre and have a mean review score that is similar. So let’s get started.

Building our Dataset

Pitchfork is an online American music magazine covering mostly rock, independent, and new music. The data released to Kaggle was scraped from their website and contains information like reviews, genres, and dates linked to each artist.

Let’s create an artist class and dictionary to store all of the useful information we want.
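
The original listing is not reproduced here, so the following is only a sketch of what such a class and dictionary might look like. The `Artist` fields, the `rows` data, and the artist names are all hypothetical stand-ins, and the real Kaggle data needs its own loading step.

```python
from dataclasses import dataclass, field


@dataclass
class Artist:
    """Collects the genres and review scores seen for one artist."""
    name: str
    genres: list = field(default_factory=list)
    scores: list = field(default_factory=list)

    @property
    def mean_score(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


# Hypothetical (artist, genre, review score) rows, one per review.
rows = [
    ("artist a", "electronic", 8.0),
    ("artist a", "experimental", 7.8),
    ("artist b", "rock", 7.5),
]

artists = {}
for name, genre, score in rows:
    artist = artists.setdefault(name, Artist(name))
    artist.genres.append(genre)
    artist.scores.append(score)

for artist in artists.values():
    print(artist.name, artist.genres, round(artist.mean_score, 2))
```

Keying the dictionary by artist name means every review for the same artist lands on one object.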

Great! Now we want to manufacture our target-context pairings based on genre and mean review score. To do this, we’ll create two dictionaries: one for the different unique genres, and one for the scores (discretised to integers).

We’ll add all our artists to the corresponding genre and mean score in these dictionaries to use later when generating pairs of artists.
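
A rough sketch of those two lookups follows. The sample artists are invented, and rounding the mean score to the nearest integer is an assumption, since the text only says the scores are discretised to integers.

```python
from collections import defaultdict

# Hypothetical artists: name -> (genres, mean review score).
artists = {
    "artist a": (["rock", "electronic"], 7.9),
    "artist b": (["rock"], 7.2),
    "artist c": (["electronic"], 5.4),
}

genre_to_artists = defaultdict(set)
score_to_artists = defaultdict(set)

for name, (genres, mean_score) in artists.items():
    for genre in genres:
        genre_to_artists[genre].add(name)
    # Assumption: discretise by rounding to the nearest integer.
    score_to_artists[int(round(mean_score))].add(name)

print(dict(genre_to_artists))
print(dict(score_to_artists))
```

Any two artists that share a key in either dictionary are candidates for a target-context pair.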

One last step before we dive into the TensorFlow code: generating a batch! A batch is like a sample of data that our neural network will use for each epoch. An epoch is one sweep across the neural network in a training phase. We want to generate two numpy arrays, one for each side of our target-context pairings.
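
One way such a batch generator could be sketched is shown below. The name `generate_batch`, the grouping data, and the sampling strategy (pick a random group, then two different members of it) are assumptions for illustration rather than the original implementation.

```python
import random

import numpy as np

# Hypothetical groupings of artists that share a genre.
genre_to_artists = {
    "rock": ["artist a", "artist b", "artist c"],
    "electronic": ["artist c", "artist d"],
}
all_names = sorted({n for members in genre_to_artists.values() for n in members})
artist_to_id = {name: i for i, name in enumerate(all_names)}


def generate_batch(groups, artist_to_id, batch_size, rng):
    """Return (targets, contexts) arrays of artist ids drawn from shared groups."""
    targets = np.zeros(batch_size, dtype=np.int32)
    contexts = np.zeros((batch_size, 1), dtype=np.int32)
    # Only groups with at least two artists can yield a pair.
    usable = [members for members in groups.values() if len(members) >= 2]
    for i in range(batch_size):
        members = rng.choice(usable)
        target, context = rng.sample(members, 2)
        targets[i] = artist_to_id[target]
        contexts[i, 0] = artist_to_id[context]
    return targets, contexts


rng = random.Random(0)
targets, contexts = generate_batch(genre_to_artists, artist_to_id, 8, rng)
print(targets.shape, contexts.shape)
```

The contexts array has a second dimension of size one because loss functions such as TensorFlow's `tf.nn.nce_loss` expect labels shaped `[batch_size, num_true]`.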

TensorFlow

There are a myriad of TensorFlow tutorials and sources of knowledge out there. Any of these excellent articles will help you, as will the documentation. The following code is heavily based on the word2vec tutorial from the TensorFlow people themselves. Hopefully I can demystify some of it and boil it down to the essentials.

The first step is understanding the ‘graph’ representation. This is incredibly useful for the TensorBoard visualisations and for creating a mental image of the data flows within the neural network.

Take some time to read through the code and comments below. Before we feed data to a neural network, we have to initialise all of the parts we’re going to use. The placeholders are the inputs taking whatever we give the ‘feed_dict’. The variables are mutable parts of the graph that we will eventually tweak. The most important part of our model is the loss function. It’s the score of how well we did and the treasure map to how we can improve.
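
The sketch below shows one way these pieces could fit together. It assumes TensorFlow 2.x and uses the `tf.compat.v1` graph-mode API to keep the placeholder and `feed_dict` style described above. The sizes, the learning rate, and the random stand-in batch are illustrative assumptions, not the original settings.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

# Illustrative sizes (assumptions, not the article's values).
vocabulary_size = 50   # number of distinct artists
embedding_size = 64    # dimensions per artist vector
batch_size = 16
num_sampled = 8        # bad pairings sampled for the loss

graph = tf.Graph()
with graph.as_default():
    # Placeholders: inputs filled from the feed_dict at run time.
    train_inputs = tf1.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf1.placeholder(tf.int32, shape=[batch_size, 1])

    # Variables: the mutable parts of the graph that training adjusts.
    embeddings = tf.Variable(
        tf.random.uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    nce_weights = tf.Variable(
        tf.random.truncated_normal([vocabulary_size, embedding_size],
                                   stddev=1.0 / np.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Loss: how badly the current embeddings explain the pairings.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                       labels=train_labels, inputs=embed,
                       num_sampled=num_sampled, num_classes=vocabulary_size))
    optimizer = tf1.train.GradientDescentOptimizer(1.0).minimize(loss)
    init = tf1.global_variables_initializer()

# One training step on a random stand-in batch.
rng = np.random.default_rng(0)
batch_inputs = rng.integers(0, vocabulary_size, size=batch_size).astype(np.int32)
batch_labels = rng.integers(0, vocabulary_size, size=(batch_size, 1)).astype(np.int32)

with tf1.Session(graph=graph) as session:
    session.run(init)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    _, loss_value = session.run([optimizer, loss], feed_dict=feed_dict)
    final_embeddings = session.run(embeddings)

print(loss_value, final_embeddings.shape)
```

In a real run the single step would sit inside a loop that feeds a fresh batch of target-context pairs each time.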

Noise Contrastive Estimation (NCE) is a loss function. Usually we would use cross-entropy and softmax, but in the natural language processing world, all of our classes amount to every single unique word.

Computationally, this is bad. NCE changes the framing of the problem from probabilities of classes to whether or not a target-context pairing is correct (a binary classification). It takes a true pairing and then samples some bad pairings; the constant num_sampled controls how many. Our neural network learns to distinguish between these good and bad pairings. Ultimately, it learns the contexts! You can read more about NCE and how it works here.
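
To make the binary framing concrete, here is a simplified numpy sketch. It is closer to negative sampling than to full NCE, because it leaves out the correction for the noise distribution, and the function name and random vectors are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sampled_binary_loss(target_vec, true_context_vec, noise_context_vecs):
    """Binary loss: score the true pairing high and the sampled pairings low."""
    true_score = sigmoid(np.dot(target_vec, true_context_vec))
    noise_scores = sigmoid(noise_context_vecs @ target_vec)
    return -np.log(true_score) - np.sum(np.log(1.0 - noise_scores))


rng = np.random.default_rng(0)
embedding_size = 8
num_sampled = 4  # how many bad pairings are drawn

target = rng.normal(size=embedding_size)
noise = rng.normal(size=(num_sampled, embedding_size))

# A context that points the same way as the target gives a lower loss
# than one that points the opposite way, with the same noise samples.
loss_aligned = sampled_binary_loss(target, target, noise)
loss_opposed = sampled_binary_loss(target, -target, noise)
print(loss_aligned, loss_opposed)
```

The cost of each step depends on `num_sampled` rather than on the size of the whole vocabulary, which is the saving over a full softmax.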

Run the Neural Network

Now that everything is set up nicely, we just have to hit the big green ‘go’ button and twiddle our thumbs for a bit.

Visualization using TSNE

Okay, we’re not quite done. We now have context-rich, 64-dimensional vectors for our artists, but that’s perhaps too many dimensions to really visualize their usefulness.

Lucky for us, we can squash this information into two dimensions while retaining as many of the properties of the 64 dimensions as possible! The technique is t-distributed Stochastic Neighbor Embedding, or TSNE for short. This video does a great job of explaining the main idea behind TSNE, but I’ll try to give a broad overview.

TSNE is an approach to dimensionality reduction that retains the similarities (like Euclidean distance) of higher dimensions. To do this, it first builds a matrix of point-to-point similarities calculated using a normal distribution. To score a pair of points, centre the distribution on the first point; the similarity of the second point is the value of the distribution at its distance from that centre.
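
A simplified numpy sketch of that similarity matrix follows. It assumes a single shared `sigma` for every point, whereas TSNE proper tunes a separate width per point from a perplexity setting. The function names and the random data are illustrative.

```python
import numpy as np


def pairwise_sq_distances(points):
    """Squared Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def gaussian_similarities(points, sigma=1.0):
    """Normal-distribution value at each pairwise distance, normalised to sum to 1."""
    sq_dist = pairwise_sq_distances(points)
    sims = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(sims, 0.0)  # a point is not compared with itself
    return sims / sims.sum()


rng = np.random.default_rng(0)
high_dim_points = rng.normal(size=(5, 64))  # stand-in for artist embeddings
P = gaussian_similarities(high_dim_points, sigma=8.0)
print(np.round(P, 3))
```

Nearby points receive large entries in the matrix and distant points receive entries close to zero.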

Then we project the points randomly onto the lower dimension and repeat exactly the same process using a t-distribution. Now we have two matrices of point-to-point similarities. The algorithm then slowly moves the points in the lower dimension so that their similarity matrix looks like the one for the higher dimension, where the similarities were preserved. It repeats this step many times. Thankfully, scikit-learn has a function which can do the number crunching for us.
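
A minimal sketch of that call, using `sklearn.manifold.TSNE` on random stand-in data, is shown below. The perplexity, the random initialisation, and the seed are illustrative choices rather than the article's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the trained 64-dimensional artist embeddings.
embeddings = rng.normal(size=(100, 64))

# Reduce to two dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=30.0, init="random", random_state=0)
low_dim_embeddings = tsne.fit_transform(embeddings)

print(low_dim_embeddings.shape)
```

Each row of the result is an (x, y) position that can be passed straight to a scatter plot.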

The Results

The amazing aspect of these embeddings is that, just like vectors, they support mathematical operations. The classic example is King - Man + Woman = Queen, or at least something very close to it. Let’s try an example.

Take the low dimensional embeddings of Coil, a band with the genres ['electronic', 'experimental', 'rock'] and mean score 7.9. Now subtract the low dimensional embeddings of Elder Ones, a band with genres ['electronic'] and mean score 7.8. With this embedding difference, find the closest bands to it and print their names and genres.
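
The lookup can be sketched as below. The artist names and two-dimensional coordinates are toy stand-ins, and ranking by Euclidean distance is an assumption, since the text does not say which distance was used.

```python
import numpy as np

# Toy stand-ins for the low dimensional embeddings.
names = ["artist a", "artist b", "artist c", "artist d"]
embeddings = np.array([
    [3.0, 1.0],
    [1.0, 0.0],
    [2.0, 1.0],
    [-5.0, -5.0],
])


def closest(query, embeddings, names, k=3):
    """Return the k names whose embeddings are nearest to the query vector."""
    distances = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(distances)
    return [names[i] for i in order[:k]]


difference = embeddings[names.index("artist a")] - embeddings[names.index("artist b")]
nearest = closest(difference, embeddings, names, k=2)
print(nearest)
```

On the article's trained embeddings, this kind of lookup printed the following.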

Artist: black lips, Mean Score: 7.48, Genres: ['rock', 'rock', 'rock', 'rock', 'rock']
Artist: crookers, Mean Score: 5.5, Genres: ['electronic']
Artist: guided by voices, Mean Score: 7.23043478261, Genres: ['rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock', 'rock']

It worked! We’re getting rock and electronic bands with vaguely similar review scores. Below are the first three hundred bands plotted with labels. Hopefully you’ve found this project educational and inspiring. Go forth and build, explore, and play!

Translated from: https://www.freecodecamp.org/news/learn-tensorflow-the-word2vec-model-and-the-tsne-algorithm-using-rock-bands-97c99b5dcb3a/
