Sentiment Analysis on User Reviews with NLP
Natural Language Processing
A few days ago, I published an article that used this same machine learning module to perform a sentiment analysis on a dataset of tweets, reaching 96% accuracy. It is now time to increase the complexity and approach more complicated problems. One perfect dataset for this experiment is the movie review dataset that you can download on Kaggle (see the link above).
Machine Learning vs. Deep Learning
Why am I not using deep learning for these tasks? If I had to use TensorFlow, I would use an Embedding neural network. Unfortunately, this dataset only contains 2,000 reviews. Compared with the standard movie review dataset in Keras, which contains 50,000 reviews, there might not be enough data for the neural net to perform at its best. Deep learning only outperforms machine learning when there is a sufficient volume of data.
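For context, here is a minimal sketch of the kind of Embedding network I have in mind; the vocabulary size and layer widths below are placeholder values, not a tuned architecture:

import tensorflow as tf

# a small Embedding classifier (sketch; hyperparameters are illustrative)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),  # word index -> dense vector
    tf.keras.layers.GlobalAveragePooling1D(),                   # average the word vectors
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),             # probability of 'pos'
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])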
The nltk module
I will be using a machine learning library specialized for NLP, called nltk. I prefer scikit-learn for creating machine learning models, but it is a library geared toward tabular data rather than natural language processing.
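For comparison, the scikit-learn route would need an explicit vectorization step before any classifier; here is a minimal sketch of such a pipeline (not the approach used in this article):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# text must first be turned into a numeric matrix before sklearn can model it
sk_model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())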
Steps
In this article, I will follow the steps below. Compared with the Twitter sentiment analysis in the previous article, the preprocessing of the data will be much more troublesome.
- Importing Modules
- Looking at the data
- Creating Features and Labels (encoding)
- Creating train and test (splitting)
- Using the model: Naive Bayes Classifier
- Performance Evaluation
I will be using a particular kind of encoding: instead of converting words to numbers, I will store them in a dictionary. I will then feed this dictionary to the model.
1. Importing Modules
!pip install nltk
import nltk
# to fix a bug; otherwise it throws an error
nltk.download('punkt')
In addition to the main module I will be using (not only for machine learning but also to set up the tokenizer), I will also have to create my own tool for tokenization. This function will break every sentence of the reviews into individual strings and put every word into a Python dictionary.
# tokenizer: map each word in a sentence to True
def format_sentence(sent):
    return {word: True for word in nltk.word_tokenize(sent)}

# example
format_sentence('how are you')
{'are': True, 'how': True, 'you': True}
2. Looking at the data
import pandas as pd
total = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Projects/20200602_Twitter_Sentiment_Analysis/movie_review.csv')
total
As we can immediately see after importing our dataset, we have a big problem. The reviews have been stored sentence by sentence, and the chunks of a single review are spread across several rows. I will need to reassemble the reviews from these chunks of text, and then label them as positive or negative.
This may not make much of a difference with machine learning tools, but if we ever decide to use deep learning, we might want to have all the reviews bundled together. Either way, it is an instructive problem to solve.
I will be using the pandas groupby function to aggregate the rows by html_id and then convert them into lists.
#group by html_id
total = total.groupby('html_id').agg(lambda x: x.tolist())
total = total.reset_index()
total
All the reviews have been merged under the text column. I am now going to drop the extra columns.
# drop the other columns: keep only the text and the sentiment tag
total.columns
total = total.drop(['html_id', 'fold_id', 'cv_tag', 'sent_id'], axis=1)
total
# the result is a row of separate chunks for every review:
# [chunk1, chunk2, ...], [sentiment, ...]
total.values
I will now recreate a dataset with complete reviews in one column and one sentiment label in another column, rather than a list of sentiments per row.
# merge the chunks of each review together to obtain: review, sentiment
total_text = list()
for lists in total.values:
    combined_text = ''
    for chunk in lists[0]:
        # note: chunks are concatenated without a separator
        combined_text = combined_text + chunk
    total_text.append([combined_text, lists[1][0]])

total_text = pd.DataFrame(total_text)
total_text
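The same merge can also be written with pandas operations directly; here is a sketch, assuming the sentiment column of the Kaggle csv is named tag (as it is in the file I am using):

# equivalent, more idiomatic merge (sketch; assumes the sentiment
# column is named 'tag', as in the Kaggle file)
total_text = pd.DataFrame({
    0: total['text'].apply(''.join),              # same no-separator join as the loop above
    1: total['tag'].apply(lambda tags: tags[0]),  # first tag of each review
})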
Here is the final result: all positive and negative reviews combined in a single dataset.
Isolating the positive reviews
total_positive = total_text.copy()
total_positive.columns
total_positive = total_positive.loc[total_positive[1] == 'pos']
total_positive
Isolating the negative reviews
total_negative = total_text.copy()
total_negative.columns
total_negative = total_negative.loc[total_negative[1] == 'neg']
total_negative
3. Creating Features and Labels (encoding)
To train a supervised learning AI, I will need to get my data ready for the Naive Bayes Classifier model.
# encoder: tokenize each review and pair it with its label
def create_dict(total_positive, total_negative):
    positive_reviews = list()
    # word tokenization
    for sentence in list(total_positive.values):
        positive_reviews.append([format_sentence(sentence[0]), 'pos'])
        # saves the review in the format: [{tokenized review}, 'pos']
    negative_reviews = list()
    # word tokenization
    for sentence in list(total_negative.values):
        negative_reviews.append([format_sentence(sentence[0]), 'neg'])
        # saves the review in the format: [{tokenized review}, 'neg']
    return positive_reviews, negative_reviews

Xy_pos, Xy_neg = create_dict(total_positive, total_negative)
Let me have a look at the structure of the data. This is the first dictionary stored in the list of positive reviews.
Xy_pos[0]
[{'!': True,
'&': True,
"'d": True,
"'s": True,
"'ve": True,
...
'.women': True,
'with': True,
'without': True,
'women': True,
'words': True,
'would': True,
'you': True,
'yourself': True},
'pos']
I want to plot the dataset to see the proportion of positive and negative reviews.
import seaborn as sns

X = pd.concat([total_positive, total_negative], axis=0)
X.columns = ['text', 'sentiment']
sns.countplot(x='sentiment', data=X)

y = pd.DataFrame(X.pop('sentiment'))
4. Creating train and test (splitting)
def split(pos, neg, ratio):
    # train on the first (1 - ratio) fraction of each class,
    # test on the remaining ratio fraction
    train = pos[:int((1-ratio)*len(pos))] + neg[:int((1-ratio)*len(neg))]
    test = pos[int((1-ratio)*len(pos)):] + neg[int((1-ratio)*len(neg)):]
    return train, test

Xy_train, Xy_test = split(Xy_pos, Xy_neg, 0.1)
I can now prepare the train and test portions of the dataset to train the NLP model.
5. Using the model: Naive Bayes Classifier
Finally, I can create the Naive Bayes classifier model. I will be using the Xy_train portion of the dataset to feed the model.
from nltk.classify import NaiveBayesClassifier

# training the model
classifier = NaiveBayesClassifier.train(Xy_train)
classifier.show_most_informative_features()
Most Informative Features
            insulting = True              neg : pos    =     17.7 : 1.0
            ludicrous = True              neg : pos    =     13.4 : 1.0
               avoids = True              pos : neg    =     12.3 : 1.0
          outstanding = True              pos : neg    =     12.3 : 1.0
               regard = True              pos : neg    =     11.7 : 1.0
            animators = True              pos : neg    =     10.3 : 1.0
          fascination = True              pos : neg    =     10.3 : 1.0
                .yeah = True              neg : pos    =     10.3 : 1.0
                 3000 = True              neg : pos    =     10.3 : 1.0
                sucks = True              neg : pos    =      9.8 : 1.0
The model has associated a value with each word in the dataset. It will perform a calculation on all the words contained in every review it has to analyze, and then make an estimation: positive or negative.
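To see this in action, here is a quick check on a made-up review (the sentence is my own example, not from the dataset):

# classify a new, hand-written review (illustrative example)
example = 'an outstanding movie with a fascinating plot'
print(classifier.classify(format_sentence(example)))
# likely 'pos', given informative words such as 'outstanding'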
6. Performance Evaluation
from nltk.classify.util import accuracy
print(accuracy(classifier, Xy_test))
0.9477777777777778
Our accuracy is 94.8%, which we can round up to 95%. An astonishing result!
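Accuracy alone can hide how the model behaves on each class. nltk also ships precision and recall metrics; here is a minimal sketch of a per-class check:

import collections
from nltk.metrics import precision, recall

# collect reference labels and predictions as sets of example indices
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (features, label) in enumerate(Xy_test):
    refsets[label].add(i)
    testsets[classifier.classify(features)].add(i)

print('pos precision:', precision(refsets['pos'], testsets['pos']))
print('pos recall:', recall(refsets['pos'], testsets['pos']))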
Translated from: https://medium.com/towards-artificial-intelligence/sentiment-analysis-on-movie-reviews-with-nlp-achieving-95-accuracy-91eef597e0f7