视频链接
Welcome back. You previously learned to encode a tweet as a vector of dimension V. You will now learn to encode a tweet or specifically represented as a vector of dimension 3. In doing so, you’ll have a much faster speed for your logistic regression classifier, because instead of learning V features, you only have to learn three features. Let’s take a look at how you can do this.
欢迎回来。你先前学习了把一条推特编码为一个V维的向量。你现在将学习把一条推特或特定的表征编码成一个3维的向量。这样的话,你的逻辑回归分类器的速度会快得多,因为你不需要学习V个特征,而只需要学习3个特征。让我们看看如何做到这一点。
You just saw that the frequency of a word in a class is simply the number of times that the word appears on the set of tweets belonging to that class and that this table is basically a dictionary mapping from word class pairs, to frequencies, or it just tells us how many times each word showed up in its corresponding class.
你只看到一个单词在这个类中的频率仅仅是这个词出现在一组推特属于那个类时的次数,这个表基本上是一个从单词类对到频率的字典,或者它只是告诉我们每个单词在对应的类中出现了多少次。
Now that you’ve built your frequencies dictionary, you can use it to extract useful features for sentiment analysis. What does a feature look like? Let’s look at the arbitrary tweet m. The first feature would be a bias unit equal to 1. The second is the sum of the positive frequencies for every unique word on tweet m. The third is the sum of negative frequencies for every unique word on the tweet. So to extract the features for this representation, you’d only have to sum frequencies of words.
现在已经构建了频率词字典,可以使用它抽取用于情感分析的有用特征。特征是怎么样的呢?我们来看任意的推特m。第一个特征是bias(偏差),单位等于1。第二个是在推文m上每一个唯一单词的正频率之和。第三个是在推文m上每一个卫衣单词的负频率之和。所以为了抽取这个表示的特征,你只需要对单词的频率求和。
Easy. For instance, take the following tweets. Now let’s look at the frequencies for the positive class from the last lecture. The only words from the vocabulary that don’t appear on these tweets are “happy” and “because”. Now let’s take a look at the second feature from the representation that you saw on the last slide. To get this value, you need to sum the frequencies of the words from the vocabulary that appear on the tweet. At the end, you get a value equal to eight.
简单的,例如,以下面的推特为例。让我们来看下上节课正频率的情况。词汇表中没有在这些推特中出现的单词是"happy"和"because"。现在让我们看看你们在上一张幻灯片上看到的第二个特征。为了获得这个值,你需要对出现在推特上的词汇表中单词的频率求和。最后,你得到值等于8。
Now let’s get the value of the third feature. It is the sum of negative frequencies of the words from the vocabulary that appear on the tweet. For this example, you should get 11 after summing up the underlined frequencies.
现在让我们得到第三个特征的值。它是出现在推特词汇表上的单词负频率的和。在这个例子中,把带下划线的频率加起来应该是11。
So far, this tweets, this representation would be equal to the vector 1, 8, 11.
到目前为止,这个推特,这个表示将等于向量[1, 8, 11]
You now know how to represent a tweet as a vector of dimension 3. In the next video you will learn to pre-process your tweets and as a result, you will use those pre-processed words as the words vocabulary.
你现在知道如何把一条推特表示为一个3维向量。在下个视频中你将会学习预处理你的推特,因此,你将会使用那些预处理好的单词作为词汇表。