Text Mining: Sentiment Analysis

In this tutorial, I will explore some text mining techniques for sentiment analysis. We'll look at how to prepare textual data. After that we will try two different classifiers to infer the tweets' sentiment. We will tune the hyperparameters of both classifiers with grid search. Finally, we evaluate the performance on a set of metrics like precision, recall and the F1 score.

For this project, we'll be working with the Twitter US Airline Sentiment data set on Kaggle. It contains the tweet’s text and one variable with three possible sentiment values. Let's start by importing the packages and configuring some settings.

import numpy as np 
import pandas as pd 
pd.set_option('display.max_colwidth', None)  # None shows the full text; recent pandas releases no longer accept -1
from time import time
import re
import string
import os
import emoji
from pprint import pprint
import collections
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
sns.set(font_scale=1.3)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import gensim
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')
np.random.seed(37)

Loading the data

We read in the comma-separated file we downloaded from Kaggle Datasets. We shuffle the data frame in case the classes are sorted: reindexing on a random permutation of the original index does that. In this notebook, we will work with the text variable and the airline_sentiment variable.

df = pd.read_csv('../input/Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]

Exploratory Data Analysis

Target variable

There are three class labels we will predict: negative, neutral or positive.

The class labels are imbalanced, as the chart below shows. We should keep this in mind during the model training phase. With the catplot function of the seaborn package (called factorplot in older seaborn releases), we can visualize the distribution of the target variable.

sns.catplot(x="airline_sentiment", data=df, kind="count", height=6, aspect=1.5, palette="PuBuGn_d")
plt.show()

Input variable

To analyze the text variable we create a class TextCounts. In this class we compute some basic statistics on the text variable.

  • count_words: number of words in the tweet
  • count_mentions: number of mentions of other Twitter accounts, which start with @
  • count_hashtags: number of tag words, preceded by #
  • count_capital_words: number of uppercase words, which are sometimes used to “shout” and express (negative) emotions
  • count_excl_quest_marks: number of question or exclamation marks
  • count_urls: number of links in the tweet, preceded by http(s)
  • count_emojis: number of emojis, which might be a good indicator of the sentiment

class TextCounts(BaseEstimator, TransformerMixin):
    
    def count_regex(self, pattern, tweet):
        return len(re.findall(pattern, tweet))
    
    def fit(self, X, y=None, **fit_params):
        # fit method is used when specific operations need to be done on the train data, but not on the test data
        return self
    
    def transform(self, X, **transform_params):
        count_words = X.apply(lambda x: self.count_regex(r'\w+', x)) 
        count_mentions = X.apply(lambda x: self.count_regex(r'@\w+', x))
        count_hashtags = X.apply(lambda x: self.count_regex(r'#\w+', x))
        count_capital_words = X.apply(lambda x: self.count_regex(r'\b[A-Z]{2,}\b', x))
        count_excl_quest_marks = X.apply(lambda x: self.count_regex(r'!|\?', x))
        count_urls = X.apply(lambda x: self.count_regex(r'http.?://[^\s]+[\s]?', x))
        # We will replace the emoji symbols with a description, which makes using a regex for counting easier
        # Moreover, it will result in having more words in the tweet
        count_emojis = X.apply(lambda x: emoji.demojize(x)).apply(lambda x: self.count_regex(r':[a-z_&]+:', x))
        
        df = pd.DataFrame({'count_words': count_words
                           , 'count_mentions': count_mentions
                           , 'count_hashtags': count_hashtags
                           , 'count_capital_words': count_capital_words
                           , 'count_excl_quest_marks': count_excl_quest_marks
                           , 'count_urls': count_urls
                           , 'count_emojis': count_emojis
                          })
        
        return df
tc = TextCounts()
df_eda = tc.fit_transform(df.text)
df_eda['airline_sentiment'] = df.airline_sentiment
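
To make the regular expressions in TextCounts concrete, here is a minimal sketch that applies the same patterns to a single string using only the standard library. The sample tweet is made up for illustration, and the emoji count is left out because it depends on the emoji package.

```python
import re

# Hypothetical sample tweet, invented for this illustration.
tweet = "@VirginAmerica WHY is my flight delayed AGAIN?! #fail http://t.co/abc123"

# Same patterns as in TextCounts.transform (emoji counting omitted).
patterns = {
    "count_words": r"\w+",
    "count_mentions": r"@\w+",
    "count_hashtags": r"#\w+",
    "count_capital_words": r"\b[A-Z]{2,}\b",
    "count_excl_quest_marks": r"!|\?",
    "count_urls": r"http.?://[^\s]+[\s]?",
}

# One count per feature: the number of non-overlapping matches in the tweet.
counts = {name: len(re.findall(pattern, tweet)) for name, pattern in patterns.items()}

for name, value in counts.items():
    print(name, value)
```

Note that the \w+ pattern also counts the mention, the hashtag and the pieces of the URL as words, so count_words is an upper bound on the number of ordinary words.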

It could be interesting to see how the TextCounts variables relate to the class variable. So we write a function show_dist that provides descriptive statistics and a plot per target class.

def show_dist(df, col):
    print('Descriptive stats for {}'.format(col))
    print('-'*(len(col)+22))
    print(df.groupby('airline_sentiment')[col].describe())
    bins = np.arange(df[col].min(), df[col].max() + 1)
    g = sns.FacetGrid(df, col='airline_sentiment', height=5, hue='airline_sentiment', palette="PuBuGn_d")  # height replaces the older size argument
    g = g.map(sns.distplot, col, kde=False, norm_hist=True, bins=bins)
    plt.show()

Below you can find the distribution of the number of words in a tweet per target class. For brevity, we will limit ourselves to this variable. The charts for all TextCounts variables are in the notebook on GitHub.

  • The number of words used in the tweets is rather low. The largest number of words is 36 and there are even tweets with only 2 words. So we’ll have to be careful during data cleaning not to remove too many words. On the other hand, text processing will be faster. Negative tweets contain more words than neutral or positive tweets.
  • All tweets have at least one mention. This is the result of extracting the tweets based on mentions in the Twitter data. There seems to be no difference in the number of mentions with regard to the sentiment.
  • Most of the tweets do not contain hashtags. So this variable will not be retained during model training. Again, there is no difference in the number of hashtags with regard to the sentiment.
  • Most of the tweets do not contain capitalized words and we do not see a difference in distribution between the sentiments.
  • The positive tweets seem to use a few more exclamation or question marks.
  • Most tweets do not contain a URL.
  • Most tweets do not use emojis.

Text Cleaning

Before we start using the tweets’ text we need to clean it. We’ll do this in the class CleanText. With this class we’ll perform the following actions:

  • remove the mentions, as we want to generalize to tweets of other airline companies too
  • remove the hashtag sign (#) but not the actual tag, as this may contain information
  • set all words to lowercase
  • remove all punctuation, including the question and exclamation marks
  • remove the URLs, as they do not contain useful information. We did not notice a difference in the number of URLs used between the sentiment classes
  • make sure to convert the emojis into one word
  • remove digits
  • remove stopwords
  • apply the PorterStemmer to keep the stem of the words

class CleanText(BaseEstimator, TransformerMixin):
    def remove_mentions(self, input_text):
        return re.sub(r'@\w+', '', input_text)
    
    def remove_urls(self, input_text):
        return re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
    
    def emoji_oneword(self, input_text):
        # By compressing the underscore, the emoji is kept as one word
        return input_text.replace('_','')
    
    def remove_punctuation(self, input_text):
        # Make translation table
        punct = string.punctuation
        trantab = str.maketrans(punct, len(punct)*' ')  # Every punctuation symbol will be replaced by a space
        return input_text.translate(trantab)
    def remove_digits(self, input_text):
        return re.sub(r'\d+', '', input_text)  # raw string avoids an invalid escape sequence warning
    
    def to_lower(self, input_text):
        return input_text.lower()
    
    def remove_stopwords(self, input_text):
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept via a whitelist
        whitelist = ["n't", "not", "no"]
        words = input_text.split() 
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
        return " ".join(clean_words) 
    
    def stemming(self, input_text):
        porter = PorterStemmer()
        words = input_text.split() 
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        clean_X = X.apply(self.remove_mentions).apply(self.remove_urls).apply(self.emoji_oneword).apply(self.remove_punctuation).apply(self.remove_digits).apply(self.to_lower).apply(self.remove_stopwords).apply(self.stemming)
        return clean_X
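
The order of the cleaning steps matters, so it can help to trace one tweet through them. The following is a minimal standard-library sketch, not the CleanText class itself: the sample tweet and the tiny stop word set are hypothetical, and the emoji and stemming steps are left out because they need the emoji and NLTK packages.

```python
import re
import string

# Hypothetical stop word set for illustration; the tutorial uses NLTK's English list.
STOPWORDS = {"the", "is", "my", "to", "a", "and", "was", "on"}
# Negations are kept even if they appear in a stop word list.
KEEP = {"not", "no"}


def clean(tweet):
    text = re.sub(r"@\w+", "", tweet)                  # remove mentions
    text = re.sub(r"http.?://[^\s]+[\s]?", "", text)   # remove URLs
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)                       # punctuation becomes spaces
    text = re.sub(r"\d+", "", text)                    # remove digits
    text = text.lower()                                # lowercase
    words = [w for w in text.split()
             if (w not in STOPWORDS or w in KEEP) and len(w) > 1]
    return " ".join(words)


cleaned = clean("@united my flight UA123 was NOT on time!! http://t.co/xyz #late")
print(cleaned)
```

Mentions and URLs are removed before punctuation because both patterns rely on the @, : and / characters still being present.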

To show what the cleaned text variable looks like, here’s a sample.

ct = CleanText()
sr_clean = ct.fit_transform(df.text)
sr_clean.sample(5)

glad rt bet bird wish flown south winter

point upc code check baggag tell luggag vacat day tri swimsuit

vx jfk la dirti plane not standard

tell mean work need estim time arriv pleas need laptop work thank

sure busi go els airlin travel name kathryn sotelo

One side effect of text cleaning is that some rows do not have any words left in their text. For the CountVectorizer and TfidfVectorizer this does not pose a problem. Yet, for the Word2Vec algorithm this causes an error. There are different strategies to deal with these missing values.

  • Remove the complete row, but in a production environment this is not desirable.
  • Impute the missing value with some placeholder text like [no_text].
  • When applying Word2Vec: use the average of all vectors.

Here we will impute with placeholder text.

empty_clean = sr_clean == ''
print('{} records have no words left after text cleaning'.format(sr_clean[empty_clean].count()))
sr_clean.loc[empty_clean] = '[no_text]'

Now that we have the cleaned text of the tweets, we can look at the most frequent words. Below we’ll show the top 20 words. The most frequent word is “flight”.

cv = CountVectorizer()
bow = cv.fit_transform(sr_clean)
word_freq = dict(zip(cv.get_feature_names_out(), np.asarray(bow.sum(axis=0)).ravel()))  # get_feature_names() on older scikit-learn releases
word_counter = collections.Counter(word_freq)
word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word', 'freq'])
fig, ax = plt.subplots(figsize=(12, 10))
sns.barplot(x="word", y="freq", data=word_counter_df, palette="PuBuGn_d", ax=ax)
plt.show();

Creating test data

To check the performance of the models we’ll need a test set. Evaluating on the training data would not be correct: you should not test on the same data used for training the model.

First, we combine the TextCounts variables with the CleanText variable. Initially, I made the mistake of executing TextCounts and CleanText inside the GridSearchCV. This took too long, because these transformations were then applied again on every fit of the grid search. It suffices to run them only once.

df_model = df_eda
df_model['clean_text'] = sr_clean
df_model.columns.tolist()

So df_model now contains several variables. But our vectorizers (see below) will only need the clean_text variable. The TextCounts variables can be added as they are. To select columns, I wrote the class ColumnExtractor below.

class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols
    def transform(self, X, **transform_params):
        return X[self.cols]
    def fit(self, X, y=None, **fit_params):
        return self
X_train, X_test, y_train, y_test = train_test_split(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment, test_size=0.1, random_state=37)

Hyperparameter tuning and cross-validation

As we will see below, the vectorizers and classifiers all have configurable parameters. To choose the best parameters, we need to test on a separate validation set that was not used during training. Yet, using only one validation set may not produce reliable validation results. By chance, you might get a good model performance on that particular validation set. If you split the data differently, you might end up with other results. To get a more accurate estimate, we perform cross-validation.

With cross-validation we split the data into a train and validation set many times. The evaluation metric is then averaged over the different folds. Luckily, GridSearchCV applies cross-validation out-of-the-box.
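
To illustrate the splitting, here is a minimal pure-Python sketch of k-fold index generation. It uses plain contiguous folds on a hypothetical data set of 10 rows; for classifiers, scikit-learn's GridSearchCV uses a stratified variant that also keeps the class proportions similar in each fold, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Every row ends up in the validation set exactly once, and the reported cross-validation score is the mean of the k validation scores.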

To find the best parameters for both a vectorizer and classifier, we create a Pipeline.

Evaluation metrics

By default, GridSearchCV uses the estimator's default scorer to compute best_score_. For both MultinomialNB and LogisticRegression, this default scoring metric is accuracy.

In our function grid_vect we additionally generate the classification_report on the test data. This provides metrics per target class, which might be more appropriate here given the class imbalance. These metrics are precision, recall and the F1 score.

  • Precision: Of all rows we predicted to be a certain class, how many did we correctly predict?
  • Recall: Of all rows of a certain class, how many did we correctly predict?
  • F1 score: Harmonic mean of Precision and Recall.

With the elements of the confusion matrix we can calculate Precision and Recall.
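
As a worked example, the sketch below computes the three metrics per class from a confusion matrix in plain Python. The matrix values are hypothetical and are not results from this tutorial; rows are the true classes and columns the predicted classes.

```python
labels = ["negative", "neutral", "positive"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = [
    [80, 15, 5],
    [20, 50, 10],
    [5, 5, 40],
]


def per_class_metrics(cm, labels):
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]                           # correctly predicted rows of this class
        predicted = sum(row[i] for row in cm)   # column total: all rows predicted as this class
        actual = sum(cm[i])                     # row total: all rows that truly are this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


metrics = per_class_metrics(cm, labels)
for label in labels:
    print(label, metrics[label])
```

Precision divides the diagonal element by its column total, recall divides it by its row total, and the F1 score combines the two.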

# Based on http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
def grid_vect(clf, parameters_clf, X_train, X_test, parameters_text=None, vect=None, is_w2v=False):
    
    textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
                      ,'count_mentions','count_urls','count_words']
    
    if is_w2v:
        w2vcols = []
        for i in range(SIZE):
            w2vcols.append(i)
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('w2v', ColumnExtractor(cols=w2vcols))]
                                , n_jobs=-1)
    else:
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text')), ('vect', vect)]))]
                                , n_jobs=-1)
    
    pipeline = Pipeline([
        ('features', features)
        , ('clf', clf)
    ])
    
    # Join the parameters dictionaries together
    parameters = dict()
    if parameters_text:
        parameters.update(parameters_text)
    parameters.update(parameters_clf)
    # Make sure you have scikit-learn version 0.19 or higher to use multiple scoring metrics
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
    
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best CV score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
        
    print("Test score with best_estimator_: %0.3f" % grid_search.best_estimator_.score(X_test, y_test))
    print("\n")
    print("Classification Report Test Data")
    print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
                        
    return grid_search

Parameter grids for GridSearchCV

In the grid search, we will investigate the performance of the classifier for every combination of the parameter values specified below.

# Parameter grid settings for the vectorizers (Count and TFIDF)
parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1,2)
}

# Parameter grid settings for MultinomialNB
parameters_mnb = {
    'clf__alpha': (0.25, 0.5, 0.75)
}

# Parameter grid settings for LogisticRegression
parameters_logreg = {
    'clf__C': (0.25, 0.5, 1.0),
    'clf__penalty': ('l1', 'l2')
}
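
It is worth knowing how much work these grids imply before starting the search. The sketch below counts the parameter combinations with itertools.product, using the grids above and the cv=5 setting from grid_vect; the helper n_candidates is a name introduced here for illustration.

```python
from itertools import product

parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1, 2),
}
parameters_mnb = {'clf__alpha': (0.25, 0.5, 0.75)}
parameters_logreg = {'clf__C': (0.25, 0.5, 1.0), 'clf__penalty': ('l1', 'l2')}


def n_candidates(*grids):
    """Number of parameter combinations in the merged grids."""
    merged = {}
    for grid in grids:
        merged.update(grid)
    return len(list(product(*merged.values())))


cv = 5
mnb_candidates = n_candidates(parameters_vect, parameters_mnb)
logreg_candidates = n_candidates(parameters_vect, parameters_logreg)

print("MultinomialNB:", mnb_candidates, "candidates,", mnb_candidates * cv, "fits")
print("LogisticRegression:", logreg_candidates, "candidates,", logreg_candidates * cv, "fits")
```

With 36 and 72 candidates, each fitted on 5 folds, every step inside the pipeline is repeated hundreds of times, which is why the tutorial runs TextCounts and CleanText once up front rather than inside the search.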

Classifiers

Here we will compare the performance of a MultinomialNB and a LogisticRegression classifier.

mnb = MultinomialNB()
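logreg = LogisticRegression(solver='liblinear')  # liblinear supports both the l1 and l2 penalties in the grid; the current default solver rejects l1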

CountVectorizer

To use words in a classifier, we need to convert the words to numbers. Sklearn’s CountVectorizer takes all words in all tweets, assigns each an ID and counts the frequency of the word per tweet. We then use this bag of words as input for a classifier. This bag of words is a sparse data set: each record has many zeroes, one for every vocabulary word that does not occur in the tweet.

countvect = CountVectorizer()
# MultinomialNB
best_mnb_countvect = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_mnb_countvect, '../output/best_mnb_countvect.pkl')
# LogisticRegression
best_logreg_countvect = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_logreg_countvect, '../output/best_logreg_countvect.pkl')

TF-IDF Vectorizer

One issue with CountVectorizer is that some words occur frequently across many tweets. These words might not carry discriminative information. Rather than removing them, TF-IDF (term frequency-inverse document frequency) can be used to down-weight these frequent words.

tfidfvect = TfidfVectorizer()
# MultinomialNB
best_mnb_tfidf = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)
joblib.dump(best_mnb_tfidf, '../output/best_mnb_tfidf.pkl')
# LogisticRegression
best_logreg_tfidf = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)  # use the LogisticRegression grid, not parameters_mnb
joblib.dump(best_logreg_tfidf, '../output/best_logreg_tfidf.pkl')
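
To show the down-weighting, here is a small standard-library sketch of a TF-IDF computation on three made-up documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which to my knowledge matches the default settings of TfidfVectorizer; treat it as an illustration rather than a drop-in replacement.

```python
import math

# Hypothetical cleaned documents; "flight" occurs in all of them.
docs = ["flight delay", "flight cancel", "flight great"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({word for doc in tokenized for word in doc})


def idf(term):
    """Smoothed inverse document frequency (assumed default variant)."""
    df = sum(term in doc for doc in tokenized)  # number of documents containing the term
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Term frequency times idf, scaled to unit Euclidean length."""
    raw = [doc.count(term) * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(value, 3) for value in row])
```

In the first document both words occur once, yet "delay" receives a higher weight than "flight" because "flight" appears in every document.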

Word2Vec

Another way of converting the words to numerical values is to use Word2Vec. Word2Vec maps each word to a point in a multi-dimensional space. It does this by taking into account the context in which a word appears in the tweets. As a result, words that are similar are also close to each other in the multi-dimensional space.

The Word2Vec algorithm is part of the gensim package.

The Word2Vec algorithm uses lists of words as input. For that purpose, we use the word_tokenize function of the nltk package.

SIZE = 50
X_train['clean_text_wordlist'] = X_train.clean_text.apply(lambda x : word_tokenize(x))
X_test['clean_text_wordlist'] = X_test.clean_text.apply(lambda x : word_tokenize(x))
model = gensim.models.Word2Vec(X_train.clean_text_wordlist
, min_count=1
, vector_size=SIZE  # gensim >= 4.0; older releases name this parameter size
, window=5
, workers=4)
model.wv.most_similar('plane', topn=3)

The Word2Vec model provides a vocabulary of the words in all the tweets. For each word you also have its vector values. The number of vector values is equal to the chosen size. These are the dimensions on which each word is mapped in the multi-dimensional space. Words with an occurrence less than min_count are not kept in the vocabulary.

A side effect of the min_count parameter is that some tweets could have no vector values. This would be the case when every word in the tweet occurs fewer than min_count times in the corpus. Due to the small corpus of tweets, there is a risk of this happening in our case. Thus we set the min_count value equal to 1.

The tweets can have a different number of vectors, depending on the number of words they contain. To use this output for modeling we will calculate the average of all vectors per tweet. As such we will have the same number (i.e. size) of input variables per tweet.

We do this with the function compute_avg_w2v_vector. In this function we also check whether the words in the tweet occur in the vocabulary of the Word2Vec model. If none do, a list filled with 0.0 is returned; otherwise, the average of the word vectors is returned.

def compute_avg_w2v_vector(w2v_dict, tweet):
    list_of_word_vectors = [w2v_dict[w] for w in tweet if w in w2v_dict]  # membership test on the KeyedVectors; .vocab is not available in gensim 4
    
    if len(list_of_word_vectors) == 0:
        result = [0.0]*SIZE
    else:
        result = np.sum(list_of_word_vectors, axis=0) / len(list_of_word_vectors)
        
    return result
X_train_w2v = X_train['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
X_test_w2v = X_test['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
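
The averaging logic, including the zero-vector fallback, can be checked on toy data without a trained model. In the sketch below the word vectors are hypothetical values in a plain dictionary, and average_vector is an illustrative stand-in for compute_avg_w2v_vector.

```python
SIZE = 3  # tiny vector size so the numbers stay readable

# Hypothetical word vectors, not the output of a trained Word2Vec model.
vectors = {
    "flight": [1.0, 0.0, 2.0],
    "late": [3.0, -2.0, 0.0],
}


def average_vector(tokens, vectors, size):
    """Average the vectors of the known tokens; all zeros if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across all word vectors.
    return [sum(dimension) / len(known) for dimension in zip(*known)]


avg = average_vector(["flight", "late", "zzz"], vectors, SIZE)
empty = average_vector(["zzz"], vectors, SIZE)
print(avg)
print(empty)
```

Whatever the number of words in a tweet, the result always has SIZE values, which is what lets every tweet become one fixed-length row of features.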

Applying compute_avg_w2v_vector gives us a Series in which each element is a vector of dimension SIZE. Now we will split this vector and create a DataFrame with each vector value in a separate column. That way we can concatenate the Word2Vec variables with the TextCounts variables. We need to reuse the index of X_train and X_test. Otherwise this will cause issues (duplicates) in the concatenation later on.

X_train_w2v = pd.DataFrame(X_train_w2v.values.tolist(), index= X_train.index)
X_test_w2v = pd.DataFrame(X_test_w2v.values.tolist(), index= X_test.index)
# Concatenate with the TextCounts variables
X_train_w2v = pd.concat([X_train_w2v, X_train.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)
X_test_w2v = pd.concat([X_test_w2v, X_test.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)

We only consider LogisticRegression as we have negative values in the Word2Vec vectors. MultinomialNB assumes that the variables have a multinomial distribution. So they cannot contain negative values.

best_logreg_w2v = grid_vect(logreg, parameters_logreg, X_train_w2v, X_test_w2v, is_w2v=True)
joblib.dump(best_logreg_w2v, '../output/best_logreg_w2v.pkl')

Conclusion

  • Both classifiers achieve the best results when using the features of the CountVectorizer.
  • Logistic Regression outperforms the Multinomial Naive Bayes classifier.
  • The best performance on the test set comes from the LogisticRegression with features from CountVectorizer.

Best parameters

  • C value of 1
  • L2 regularization
  • max_df: 0.5, a maximum document frequency of 50%
  • min_df: 1, so a word needs to appear in at least 1 tweet
  • ngram_range: (1, 2), so both single words and bi-grams are used

Evaluation metrics

  • A test accuracy of 81.3%. This is better than the baseline of predicting the majority class (here, negative sentiment) for all observations, which would give 63% accuracy.
  • The Precision is rather high for all three classes. For instance, of all cases that we predict as negative, 80% are actually negative.
  • The Recall for the neutral class is low. Of all neutral cases in our test data, we only predict 48% as neutral.
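
The majority-class baseline mentioned above is easy to compute. The sketch below does so on a short hypothetical label list, so the resulting number is illustrative only and is not the 63% of the real data set.

```python
from collections import Counter

# Hypothetical labels, invented for this illustration.
y_true = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2

# Always predicting the most frequent class is right exactly as often
# as that class occurs, so the baseline accuracy equals its share.
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_class, baseline_accuracy)
```

On imbalanced data a model has to beat this number before its accuracy says anything, which is also why the per-class precision and recall are reported.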

Apply the best model on new tweets

For fun, we will use the best model and apply it to some new tweets that contain @VirginAmerica. I selected 3 negative and 3 positive tweets by hand.

Thanks to GridSearchCV, we now know the best hyperparameters. So now we can train the best model on all the data, including the test data that we split off before.

textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
,'count_mentions','count_urls','count_words']
features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
, ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text'))
, ('vect', CountVectorizer(max_df=0.5, min_df=1, ngram_range=(1,2)))]))]
, n_jobs=-1)
pipeline = Pipeline([
('features', features)
, ('clf', LogisticRegression(C=1.0, penalty='l2'))
])
best_model = pipeline.fit(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment)
# Applying on new positive tweets
new_positive_tweets = pd.Series(["Thank you @VirginAmerica for you amazing customer support team on Tuesday 11/28 at @EWRairport and returning my lost bag in less than 24h! #efficiencyiskey #virginamerica"
,"Love flying with you guys ask these years. Sad that this will be the last trip ? @VirginAmerica #LuxuryTravel"
,"Wow @VirginAmerica main cabin select is the way to fly!! This plane is nice and clean & I have tons of legroom! Wahoo! NYC bound! ✈️"])
df_counts_pos = tc.transform(new_positive_tweets)
df_clean_pos = ct.transform(new_positive_tweets)
df_model_pos = df_counts_pos
df_model_pos['clean_text'] = df_clean_pos
best_model.predict(df_model_pos).tolist()
# Applying on new negative tweets
new_negative_tweets = pd.Series(["@VirginAmerica shocked my initially with the service, but then went on to shock me further with no response to what my complaint was. #unacceptable @Delta @richardbranson"
,"@VirginAmerica this morning I was forced to repack a suitcase w a medical device because it was barely overweight - wasn't even given an option to pay extra. My spouses suitcase then burst at the seam with the added device and had to be taped shut. Awful experience so far!"
,"Board airplane home. Computer issue. Get off plane, traverse airport to gate on opp side. Get on new plane hour later. Plane too heavy. 8 volunteers get off plane. Ohhh the adventure of travel ✈️ @VirginAmerica"])
df_counts_neg = tc.transform(new_negative_tweets)
df_clean_neg = ct.transform(new_negative_tweets)
df_model_neg = df_counts_neg
df_model_neg['clean_text'] = df_clean_neg
best_model.predict(df_model_neg).tolist()

The model classifies all tweets correctly. A larger test set should be used to assess the model’s performance. But on this small data set it does what we are aiming for. I hope you enjoyed reading this story. If you did, feel free to share it.

Translated from: https://www.freecodecamp.org/news/sentiment-analysis-with-text-mining/
