Text Mining: Sentiment Analysis
In this tutorial, I will explore some text mining techniques for sentiment analysis. We'll look at how to prepare textual data. After that we will try two different classifiers to infer the tweets' sentiment. We will tune the hyperparameters of both classifiers with grid search. Finally, we evaluate the performance on a set of metrics like precision, recall and the F1 score.
For this project, we'll be working with the Twitter US Airline Sentiment data set on Kaggle. It contains the tweet’s text and one variable with three possible sentiment values. Let's start by importing the packages and configuring some settings.
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None means no truncation; older pandas releases used -1 here
from time import time
import re
import string
import os
import emoji
from pprint import pprint
import collections
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
sns.set(font_scale=1.3)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import gensim
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')
np.random.seed(37)
We read in the comma-separated file we downloaded from Kaggle Datasets. We shuffle the data frame in case the classes are sorted. Applying the reindex method to a permutation of the original indices does this. In this notebook, we will work with the text variable and the airline_sentiment variable.
df = pd.read_csv('../input/Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]
There are three class labels we will predict: negative, neutral or positive.
The class labels are imbalanced, as the chart below shows. This is something that we should keep in mind during the model training phase. With the catplot function of the seaborn package (called factorplot in older releases), we can visualize the distribution of the target variable.
# seaborn >= 0.9: factorplot was renamed to catplot and its size argument to height
sns.catplot(x="airline_sentiment", data=df, kind="count", height=6, aspect=1.5, palette="PuBuGn_d")
plt.show()
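To put a number on the imbalance, one can compute the share of each class. The sketch below uses a toy list of labels made up for illustration (these are not the real counts). In the notebook the same idea would apply to df.airline_sentiment. The majority share is also the accuracy a model would reach by always predicting the most frequent class, which is a useful baseline to keep in mind.

```python
from collections import Counter

# Toy labels, made up for illustration only (not the real data set)
labels = ['negative'] * 6 + ['neutral'] * 2 + ['positive'] * 2

counts = Counter(labels)
shares = {label: n / len(labels) for label, n in counts.items()}

# Accuracy of a "model" that always predicts the most frequent class
majority_share = max(shares.values())

print(shares)
print('majority-class baseline accuracy:', majority_share)
```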
To analyze the text variable we create a class TextCounts. In this class we compute some basic statistics on the text variable.
count_words: number of words in the tweet
count_mentions: number of referrals to other Twitter accounts, which start with a @
count_hashtags: number of tag words, preceded by a #
count_capital_words: number of uppercase words, which are sometimes used to “shout” and express (negative) emotions
count_excl_quest_marks: number of question or exclamation marks
count_urls: number of links in the tweet, preceded by http(s)
count_emojis: number of emoji, which might be a good sign of the sentiment
class TextCounts(BaseEstimator, TransformerMixin):
    def count_regex(self, pattern, tweet):
        return len(re.findall(pattern, tweet))
    def fit(self, X, y=None, **fit_params):
        # fit method is used when specific operations need to be done on the train data, but not on the test data
        return self
    def transform(self, X, **transform_params):
        count_words = X.apply(lambda x: self.count_regex(r'\w+', x))
        count_mentions = X.apply(lambda x: self.count_regex(r'@\w+', x))
        count_hashtags = X.apply(lambda x: self.count_regex(r'#\w+', x))
        count_capital_words = X.apply(lambda x: self.count_regex(r'\b[A-Z]{2,}\b', x))
        count_excl_quest_marks = X.apply(lambda x: self.count_regex(r'!|\?', x))
        count_urls = X.apply(lambda x: self.count_regex(r'http.?://[^\s]+[\s]?', x))
        # We will replace the emoji symbols with a description, which makes using a regex for counting easier
        # Moreover, it will result in having more words in the tweet
        count_emojis = X.apply(lambda x: emoji.demojize(x)).apply(lambda x: self.count_regex(r':[a-z_&]+:', x))
        df = pd.DataFrame({'count_words': count_words
                           , 'count_mentions': count_mentions
                           , 'count_hashtags': count_hashtags
                           , 'count_capital_words': count_capital_words
                           , 'count_excl_quest_marks': count_excl_quest_marks
                           , 'count_urls': count_urls
                           , 'count_emojis': count_emojis
                          })
        return df
tc = TextCounts()
df_eda = tc.fit_transform(df.text)
df_eda['airline_sentiment'] = df.airline_sentiment
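To see what these regular expressions pick up, here is a small sketch that applies the same patterns to a single string. The tweet is hypothetical and made up for illustration, and count_regex is written as a plain function rather than a method so that the snippet runs without pandas or the emoji package.

```python
import re

def count_regex(pattern, tweet):
    """Count the non-overlapping matches of pattern in tweet."""
    return len(re.findall(pattern, tweet))

# Hypothetical tweet, made up for illustration only
tweet = "@VirginAmerica WHY was my flight delayed AGAIN?! See http://example.com #fail"

counts = {
    'count_mentions': count_regex(r'@\w+', tweet),
    'count_hashtags': count_regex(r'#\w+', tweet),
    'count_capital_words': count_regex(r'\b[A-Z]{2,}\b', tweet),
    'count_excl_quest_marks': count_regex(r'!|\?', tweet),
    'count_urls': count_regex(r'http.?://[^\s]+[\s]?', tweet),
}
print(counts)
```

Note that the capital-word pattern only matches words of at least two uppercase letters, so a mixed-case handle such as VirginAmerica is not counted.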
It could be interesting to see how the TextCounts variables relate to the class variable. So we write a function show_dist that provides descriptive statistics and a plot per target class.
def show_dist(df, col):
    print('Descriptive stats for {}'.format(col))
    print('-'*(len(col)+22))
    print(df.groupby('airline_sentiment')[col].describe())
    bins = np.arange(df[col].min(), df[col].max() + 1)
    # seaborn >= 0.9 names this argument height; older releases used size
    g = sns.FacetGrid(df, col='airline_sentiment', height=5, hue='airline_sentiment', palette="PuBuGn_d")
    g = g.map(sns.distplot, col, kde=False, norm_hist=True, bins=bins)
    plt.show()
Below you can find the distribution of the number of words in a tweet per target class. For brevity, we will limit ourselves to this one variable. The charts for all TextCounts variables are in the notebook on GitHub.
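Before we start using the tweets’ text we need to clean it. We’ll do this in the class CleanText. With this class we’ll perform the following actions, in this order:
remove the mentions (words starting with @)
remove the URLs
turn each emoji description into one word by dropping the underscores
replace all punctuation, including question and exclamation marks, with spaces
remove digits
set all words to lowercase
remove stopwords, while keeping negations such as "not" and "no"
apply the PorterStemmer to keep the stem of the words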
class CleanText(BaseEstimator, TransformerMixin):
    def remove_mentions(self, input_text):
        return re.sub(r'@\w+', '', input_text)
    def remove_urls(self, input_text):
        return re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
    def emoji_oneword(self, input_text):
        # By compressing the underscore, the emoji is kept as one word
        return input_text.replace('_','')
    def remove_punctuation(self, input_text):
        # Make translation table
        punct = string.punctuation
        trantab = str.maketrans(punct, len(punct)*' ')  # Every punctuation symbol will be replaced by a space
        return input_text.translate(trantab)
    def remove_digits(self, input_text):
        return re.sub(r'\d+', '', input_text)
    def to_lower(self, input_text):
        return input_text.lower()
    def remove_stopwords(self, input_text):
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept via a whitelist
        whitelist = ["n't", "not", "no"]
        words = input_text.split()
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1]
        return " ".join(clean_words)
    def stemming(self, input_text):
        porter = PorterStemmer()
        words = input_text.split()
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X, **transform_params):
        clean_X = (X.apply(self.remove_mentions)
                    .apply(self.remove_urls)
                    .apply(self.emoji_oneword)
                    .apply(self.remove_punctuation)
                    .apply(self.remove_digits)
                    .apply(self.to_lower)
                    .apply(self.remove_stopwords)
                    .apply(self.stemming))
        return clean_X
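The regex and translation-table steps of CleanText can be tried on a single string without NLTK. The sketch below is a simplified stand-in, not the class itself: basic_clean is a hypothetical helper, the input tweet is made up, and stopword removal and stemming are left out. Collapsing whitespace at the end mimics the split/join that remove_stopwords performs in the class.

```python
import re
import string

def basic_clean(text):
    """Apply the regex-based cleaning steps to one string (no stopwords, no stemming)."""
    text = re.sub(r'@\w+', '', text)                    # remove mentions
    text = re.sub(r'http.?://[^\s]+[\s]?', '', text)    # remove URLs
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)                        # punctuation becomes spaces
    text = re.sub(r'\d+', '', text)                     # remove digits
    text = text.lower()
    return ' '.join(text.split())                       # collapse repeated whitespace

# Hypothetical tweet, made up for illustration only
cleaned = basic_clean("@united Flight 1234 was GREAT!!! Details: https://example.com/x #happy")
print(cleaned)
```

The order matters: mentions and URLs have to be removed before punctuation, because stripping the @ or the :// first would leave the patterns with nothing to match.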
To show what the cleaned text variable looks like, here’s a sample.
ct = CleanText()
sr_clean = ct.fit_transform(df.text)
sr_clean.sample(5)
glad rt bet bird wish flown south winter
point upc code check baggag tell luggag vacat day tri swimsuit
vx jfk la dirti plane not standard
tell mean work need estim time arriv pleas need laptop work thank
sure busi go els airlin travel name kathryn sotelo
One side effect of text cleaning is that some rows do not have any words left in their text. For the CountVectorizer and TfidfVectorizer this does not pose a problem. Yet, for the Word2Vec algorithm this causes an error. There are different strategies to deal with these missing values.
Here we will impute with placeholder text.
empty_clean = sr_clean == ''
print('{} records have no words left after text cleaning'.format(sr_clean[empty_clean].count()))
sr_clean.loc[empty_clean] = '[no_text]'
Now that we have the cleaned text of the tweets, we can have a look at what the most frequent words are. Below we’ll show the top 20 words. The most frequent word is “flight”.
cv = CountVectorizer()
bow = cv.fit_transform(sr_clean)
# scikit-learn >= 1.0; older releases call this method get_feature_names()
word_freq = dict(zip(cv.get_feature_names_out(), np.asarray(bow.sum(axis=0)).ravel()))
word_counter = collections.Counter(word_freq)
word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word', 'freq'])
fig, ax = plt.subplots(figsize=(12, 10))
sns.barplot(x="word", y="freq", data=word_counter_df, palette="PuBuGn_d", ax=ax)
plt.show();
To check the performance of the models we’ll need a test set. Evaluating on the train data would not be correct. You should not test on the same data used for training the model.
First, we combine the TextCounts variables with the CleanText variable. Initially, I made the mistake of executing TextCounts and CleanText inside the GridSearchCV. This took too long, as it applies these functions on every run of the grid search. It suffices to run them only once.
df_model = df_eda
df_model['clean_text'] = sr_clean
df_model.columns.tolist()
So df_model now contains several variables. But our vectorizers (see below) will only need the clean_text variable. The TextCounts variables can be added as such. To select columns, I wrote the class ColumnExtractor below.
class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols
    def transform(self, X, **transform_params):
        return X[self.cols]
    def fit(self, X, y=None, **fit_params):
        return self
X_train, X_test, y_train, y_test = train_test_split(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment, test_size=0.1, random_state=37)
As we will see below, the vectorizers and classifiers all have configurable parameters. To choose the best parameters, we need to test on a separate validation set. This validation set was not used during the training. Yet, using only one validation set may not produce reliable validation results. Due to chance, you might have a good model performance on the validation set. If you would split the data otherwise, you might end up with other results. To get a more accurate estimation, we perform cross-validation.
With cross-validation we split the data into a train and validation set many times. The evaluation metric is then averaged over the different folds. Luckily, GridSearchCV applies cross-validation out-of-the-box.
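The splitting idea behind cross-validation can be sketched in a few lines of plain Python. This is only an illustration: k_fold_indices is a hypothetical helper, and it does plain, unshuffled k-fold splitting, whereas scikit-learn stratifies the folds by class for classifiers by default.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, 5):
    print('train:', train_idx, 'validation:', val_idx)
```

A model is trained on each train part and scored on the matching validation part, and the scores are then averaged over the folds.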
To find the best parameters for both a vectorizer and classifier, we create a Pipeline.
By default GridSearchCV uses the estimator's default scorer to compute the best_score_. For both the MultinomialNB and LogisticRegression this default scoring metric is accuracy.
In our function grid_vect we additionally generate the classification_report on the test data. This provides some interesting metrics per target class, which might be more appropriate here because the classes are imbalanced. These metrics are the precision, recall and F1 score.
Precision: Of all rows we predicted to be a certain class, how many did we correctly predict?
Recall: Of all rows of a certain class, how many did we correctly predict?
F1 score: Harmonic mean of Precision and Recall.
With the elements of the confusion matrix we can calculate Precision and Recall.
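These definitions can be made concrete with a short sketch that counts true positives, false positives and false negatives for one class. The function per_class_metrics and the toy labels are made up for illustration. In the notebook, classification_report computes the same quantities for every class.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration only
y_true = ['negative', 'negative', 'negative', 'neutral', 'positive', 'positive']
y_pred = ['negative', 'negative', 'neutral', 'neutral', 'positive', 'negative']

for label in ('negative', 'neutral', 'positive'):
    print(label, per_class_metrics(y_true, y_pred, label))
```

For the positive class in this toy example, precision is perfect (the single positive prediction is right) while recall is only one half (one of the two positive tweets was missed), which is exactly the kind of difference that accuracy alone hides.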
# Based on http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
def grid_vect(clf, parameters_clf, X_train, X_test, parameters_text=None, vect=None, is_w2v=False):
    textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
                      ,'count_mentions','count_urls','count_words']
    if is_w2v:
        w2vcols = []
        for i in range(SIZE):
            w2vcols.append(i)
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('w2v', ColumnExtractor(cols=w2vcols))]
                                , n_jobs=-1)
    else:
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text')), ('vect', vect)]))]
                                , n_jobs=-1)
    pipeline = Pipeline([
        ('features', features)
        , ('clf', clf)
    ])
    # Join the parameters dictionaries together
    parameters = dict()
    if parameters_text:
        parameters.update(parameters_text)
    parameters.update(parameters_clf)
    # Make sure you have scikit-learn version 0.19 or higher to use multiple scoring metrics
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    # y_train and y_test are the global variables created by train_test_split above
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best CV score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
    print("Test score with best_estimator_: %0.3f" % grid_search.best_estimator_.score(X_test, y_test))
    print("\n")
    print("Classification Report Test Data")
    print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
    return grid_search
In the grid search, we will investigate the performance of the classifier. The set of parameters used to test the performance is specified below.
# Parameter grid settings for the vectorizers (Count and TFIDF)
parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1,2)
}
# Parameter grid settings for MultinomialNB
parameters_mnb = {
    'clf__alpha': (0.25, 0.5, 0.75)
}
# Parameter grid settings for LogisticRegression
parameters_logreg = {
    'clf__C': (0.25, 0.5, 1.0),
    'clf__penalty': ('l1', 'l2')
}
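A grid search tries every combination of the listed values, so the number of model fits grows quickly. The sketch below enumerates the combinations for the vectorizer grid joined with the MultinomialNB grid, using only itertools. The variable names candidates and n_folds are mine, and n_folds=5 mirrors the cv=5 setting in grid_vect.

```python
from itertools import product

parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1, 2),
}
parameters_mnb = {
    'clf__alpha': (0.25, 0.5, 0.75),
}

# Join both grids and build one dict per parameter combination
grid = {**parameters_vect, **parameters_mnb}
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 5  # same as cv=5 in grid_vect
print(len(candidates), 'candidates,', len(candidates) * n_folds, 'cross-validation fits')
print(candidates[0])
```

With 3 x 2 x 2 vectorizer settings and 3 values of alpha this gives 36 candidates, or 180 cross-validation fits, which is why it pays to keep expensive preprocessing outside the grid search.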
Here we will compare the performance of a MultinomialNB and LogisticRegression.
mnb = MultinomialNB()
logreg = LogisticRegression()
To use words in a classifier, we need to convert the words to numbers. Sklearn’s CountVectorizer takes all words in all tweets, assigns an ID and counts the frequency of the word per tweet. We then use this bag of words as input for a classifier. This bag of words is a sparse data set. This means that each record will have many zeroes for the words not occurring in the tweet.
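The bag-of-words idea can be illustrated without scikit-learn. The sketch below builds a vocabulary from three made-up documents, assigns each word an ID and counts occurrences per document. It is a toy version only: CountVectorizer additionally handles tokenization, n-grams and sparse storage.

```python
from collections import Counter

# Toy documents, made up for illustration only
docs = ["flight delayed again", "great flight great crew", "bag lost"]

# Vocabulary: every distinct word gets a column ID
vocab = sorted({word for doc in docs for word in doc.split()})
word_id = {word: i for i, word in enumerate(vocab)}

# One row per document, one column per vocabulary word
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in bow:
    print(row)
```

Even in this tiny example most entries are zero, which is why the real matrix is stored in a sparse format.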
countvect = CountVectorizer()
# MultinomialNB
best_mnb_countvect = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_mnb_countvect, '../output/best_mnb_countvect.pkl')
# LogisticRegression
best_logreg_countvect = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_logreg_countvect, '../output/best_logreg_countvect.pkl')
One issue with CountVectorizer is that some words occur frequently across many tweets. These words might not carry much discriminating information. Thus they can be removed. TF-IDF (term frequency-inverse document frequency) can be used to down-weight these frequent words.
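The down-weighting can be seen in a small worked example. The sketch below uses the textbook definition, tf times log(N / df), on three made-up documents. This is an assumption for illustration: TfidfVectorizer uses a smoothed variant of the idf and normalizes each row by default, so its numbers will differ, but the effect is the same.

```python
import math

# Toy tokenized documents, made up for illustration only
docs = [["flight", "late"], ["flight", "great"], ["flight", "cancelled"]]
n_docs = len(docs)

def idf(term):
    """Inverse document frequency: log of (number of docs / docs containing the term)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    """Term frequency within the document, scaled by the term's idf."""
    tf = doc.count(term) / len(doc)
    return tf * idf(term)

print('flight:', tf_idf('flight', docs[0]))
print('late:  ', tf_idf('late', docs[0]))
```

The word "flight" appears in every document, so its idf is log(1) = 0 and it contributes nothing, while "late" appears in only one document and keeps a positive weight.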
tfidfvect = TfidfVectorizer()
# MultinomialNB
best_mnb_tfidf = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)
joblib.dump(best_mnb_tfidf, '../output/best_mnb_tfidf.pkl')
# LogisticRegression
best_logreg_tfidf = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)
joblib.dump(best_logreg_tfidf, '../output/best_logreg_tfidf.pkl')
Another way of converting the words to numerical values is to use Word2Vec. Word2Vec maps each word in a multi-dimensional space. It does this by taking into account the context in which a word appears in the tweets. As a result, words that are similar are also close to each other in the multi-dimensional space.
The Word2Vec algorithm is part of the gensim package.
The Word2Vec algorithm uses lists of words as input. For that purpose, we use the word_tokenize method of the nltk package.
SIZE = 50
X_train['clean_text_wordlist'] = X_train.clean_text.apply(lambda x : word_tokenize(x))
X_test['clean_text_wordlist'] = X_test.clean_text.apply(lambda x : word_tokenize(x))
model = gensim.models.Word2Vec(X_train.clean_text_wordlist
, min_count=1
, vector_size=SIZE  # gensim >= 4.0; older releases name this parameter size
, window=5
, workers=4)
# The similarity query lives on the word vectors (model.wv)
model.wv.most_similar('plane', topn=3)
The Word2Vec model provides a vocabulary of the words in all the tweets. For each word you also have its vector values. The number of vector values is equal to the chosen size. These are the dimensions on which each word is mapped in the multi-dimensional space. Words with an occurrence less than min_count are not kept in the vocabulary.
A side effect of the min_count parameter is that some tweets could have no vector values. This would be the case when every word in the tweet occurs fewer than min_count times in the corpus. Due to the small corpus of tweets, there is a risk of this happening in our case. Thus we set the min_count value equal to 1.
The tweets can have a different number of vectors, depending on the number of words they contain. To use this output for modeling we will calculate the average of all vectors per tweet. As such we will have the same number (i.e. size) of input variables per tweet.
We do this with the function compute_avg_w2v_vector. In this function we also check whether the words in the tweet occur in the vocabulary of the Word2Vec model. If none of them do, a list filled with 0.0 is returned. Otherwise, the average of the word vectors is returned.
def compute_avg_w2v_vector(w2v_dict, tweet):
    # Membership is tested on the KeyedVectors object itself; the .vocab attribute was removed in gensim 4.0
    list_of_word_vectors = [w2v_dict[w] for w in tweet if w in w2v_dict]
    if len(list_of_word_vectors) == 0:
        result = [0.0]*SIZE
    else:
        result = np.sum(list_of_word_vectors, axis=0) / len(list_of_word_vectors)
    return result
X_train_w2v = X_train['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
X_test_w2v = X_test['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
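The averaging step can be checked by hand on a toy vocabulary. In the sketch below a plain dict stands in for the trained word vectors. The two-dimensional vectors and the helper average_vectors are made up for illustration and do not come from a real Word2Vec model.

```python
def average_vectors(word_vectors, words, size):
    """Average the vectors of the known words; return zeros if no word is known."""
    found = [word_vectors[w] for w in words if w in word_vectors]
    if not found:
        return [0.0] * size
    # zip(*found) groups the values per dimension, so each dimension is averaged separately
    return [sum(component) / len(found) for component in zip(*found)]

# Toy word vectors, made up for illustration only
toy_vectors = {'plane': [1.0, 0.0], 'late': [0.0, 1.0]}

print(average_vectors(toy_vectors, ['plane', 'late', 'unknownword'], 2))
print(average_vectors(toy_vectors, ['unknownword'], 2))
```

Unknown words are simply skipped, so a tweet of any length ends up as one vector with a fixed number of values.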
This gives us a Series in which each element is a vector of dimension SIZE. Now we will split this vector and create a DataFrame with each vector value in a separate column. That way we can concatenate the Word2Vec variables to the other TextCounts variables. We need to reuse the index of X_train and X_test. Otherwise this will give issues (duplicates) in the concatenation later on.
X_train_w2v = pd.DataFrame(X_train_w2v.values.tolist(), index= X_train.index)
X_test_w2v = pd.DataFrame(X_test_w2v.values.tolist(), index= X_test.index)
# Concatenate with the TextCounts variables
X_train_w2v = pd.concat([X_train_w2v, X_train.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)
X_test_w2v = pd.concat([X_test_w2v, X_test.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)
We only consider LogisticRegression as we have negative values in the Word2Vec vectors. MultinomialNB assumes that the variables have a multinomial distribution. So they cannot contain negative values.
best_logreg_w2v = grid_vect(logreg, parameters_logreg, X_train_w2v, X_test_w2v, is_w2v=True)
joblib.dump(best_logreg_w2v, '../output/best_logreg_w2v.pkl')
For fun, we will use the best model and apply it to some new tweets that contain @VirginAmerica. I selected 3 negative and 3 positive tweets by hand.
Thanks to the GridSearchCV, we now know what the best hyperparameters are. So now we can train the best model on all the data, including the test data that we split off before.
textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
,'count_mentions','count_urls','count_words']
features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
, ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text'))
, ('vect', CountVectorizer(max_df=0.5, min_df=1, ngram_range=(1,2)))]))]
, n_jobs=-1)
pipeline = Pipeline([
('features', features)
, ('clf', LogisticRegression(C=1.0, penalty='l2'))
])
best_model = pipeline.fit(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment)
# Applying on new positive tweets
new_positive_tweets = pd.Series(["Thank you @VirginAmerica for you amazing customer support team on Tuesday 11/28 at @EWRairport and returning my lost bag in less than 24h! #efficiencyiskey #virginamerica"
,"Love flying with you guys ask these years. Sad that this will be the last trip ? @VirginAmerica #LuxuryTravel"
,"Wow @VirginAmerica main cabin select is the way to fly!! This plane is nice and clean & I have tons of legroom! Wahoo! NYC bound! ✈️"])
df_counts_pos = tc.transform(new_positive_tweets)
df_clean_pos = ct.transform(new_positive_tweets)
df_model_pos = df_counts_pos
df_model_pos['clean_text'] = df_clean_pos
best_model.predict(df_model_pos).tolist()
# Applying on new negative tweets
new_negative_tweets = pd.Series(["@VirginAmerica shocked my initially with the service, but then went on to shock me further with no response to what my complaint was. #unacceptable @Delta @richardbranson"
,"@VirginAmerica this morning I was forced to repack a suitcase w a medical device because it was barely overweight - wasn't even given an option to pay extra. My spouses suitcase then burst at the seam with the added device and had to be taped shut. Awful experience so far!"
,"Board airplane home. Computer issue. Get off plane, traverse airport to gate on opp side. Get on new plane hour later. Plane too heavy. 8 volunteers get off plane. Ohhh the adventure of travel ✈️ @VirginAmerica"])
df_counts_neg = tc.transform(new_negative_tweets)
df_clean_neg = ct.transform(new_negative_tweets)
df_model_neg = df_counts_neg
df_model_neg['clean_text'] = df_clean_neg
best_model.predict(df_model_neg).tolist()
The model classifies all tweets correctly. A larger test set should be used to assess the model’s performance. But on this small data set it does what we are aiming for. I hope you enjoyed reading this story. If you did, feel free to share it.
Translated from: https://www.freecodecamp.org/news/sentiment-analysis-with-text-mining/