Introduction to NLP Part 4: Supervised text classification model in Python


This post will show you a simplified example of building a basic supervised text classification model. If this sounds a little like gibberish, let’s see some definitions:


  • supervised: we know the correct output class for each text in the sample data

  • text: input data is in a text format

  • classification model: a model that uses input data to predict the output class

Each input text is also known as a ‘document’ and the output is also known as a ‘target’ (the term, not the shop!).


Does supervised text classification model sound more meaningful now? Maybe? Among supervised text classification models, we will focus on one particular type in this post. Here, we will build a supervised sentiment classifier, as we will be using sentiment polarity data on movie reviews with a binary target.


0. Python setup

This post assumes that you have access to and are familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.


I have used and tested the scripts in Python 3.7.1. Let’s make sure you have the right tools before we get started.


⬜️ Ensure the required packages are installed: pandas, nltk & sklearn

We will use the following powerful third party packages:


  • pandas: Data analysis library,


  • nltk: Natural Language Tool Kit library and


  • sklearn: Machine Learning library.

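If you want to confirm that they are available before going further, a quick optional check like the one below will print the installed versions (purely a sanity check, not part of the original post):

import pandas
import nltk
import sklearn

# Print the version of each third party package we rely on
for module in (pandas, nltk, sklearn):
    print(module.__name__, module.__version__)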

⬜️ Download ‘stopwords’, ‘wordnet’ and movie_reviews corpora from nltk

The script below can help you download these corpora. If you have already downloaded them, running this will notify you that they are up-to-date:


import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('movie_reviews')

1. Data preparation ➡️

1.1. Import sample data and packages

Firstly, let’s prepare the environment by importing the required packages:


import numpy as np
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

We will transform movie_reviews tagged corpus from nltk to a pandas dataframe with the script below:


# Script copied from here
reviews = []
for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))
sample = pd.DataFrame(reviews, columns=['target', 'document'])
print(f'Dimensions: {sample.shape}')
sample.head()

You will see that the dataframe has 2 columns: a column for the target (the polarity sentiment) and a column for the reviews (i.e. documents), for 2000 reviews. Each review is tagged as either a positive or a negative review. Let’s check the distribution of the target classes:


sample['target'].value_counts()

Each class (i.e. ‘pos’, ‘neg’) has 1000 records each, perfectly balanced. Let’s ensure that the classes are binary coded:


sample['target'] = np.where(sample['target']=='pos', 1, 0)
sample['target'].value_counts()

This looks good, let’s proceed to partitioning the data.


1.2. Partition data

When it comes to partitioning data, we have 2 options:


  1. Split the sample data into 3 groups: train, validation and test, where train is used to fit the model, validation is used to evaluate fitness of interim models, and test is used to assess final model fitness.


  2. Split the sample data into 2 groups: train and test, where train is further split into train and validation sets k times using k-fold cross validation, and test is used to assess final model fitness. With k-fold cross validation (see the sketch after this explanation):

    First: Train is split into k pieces.

    Second: Take one piece as the validation set to evaluate fitness of interim models after fitting the model to the remaining k-1 pieces.

    Third: Repeat the second step k-1 times, using a different piece for the validation set each time and the remaining pieces for the train set, such that each piece of train is used as the validation set only once.

Interim models here refer to the models created during the iterative process of comparing different machine learning classifiers as well as trying different hyperparameters for a given classifier to find the best model.

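For concreteness, here is a minimal sketch (not part of the original workflow) using sklearn’s KFold on a toy array, showing that each piece of the training data serves as the validation set exactly once:

import numpy as np
from sklearn.model_selection import KFold

# Ten dummy 'documents', standing in for the real training data
X_demo = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=123)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    # Across the 5 folds, every index appears in the validation set exactly once
    print(f'Fold {fold}: train={train_idx}, validation={val_idx}')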

We will be using the second option to partition the sample data as we don’t have big sample data. Let’s put aside some test data so that we could check how well the final model generalises on unseen data later:


X_train, X_test, y_train, y_test = train_test_split(sample['document'], sample['target'], test_size=0.3, random_state=123)

print(f'Train dimensions: {X_train.shape, y_train.shape}')
print(f'Test dimensions: {X_test.shape, y_test.shape}')

# Check out target distribution
print(y_train.value_counts())
print(y_test.value_counts())

We have 1400 documents in train and 600 documents in test dataset. The target is evenly distributed in both train and test dataset.


If you are slightly confused about this section on data partitioning, you may want to check this awesome article to learn more.


1.3. Preprocess documents

It’s time to preprocess the training documents, that is, to transform the unstructured data into a matrix of numbers. Let’s preprocess the text using an approach called bag-of-words, where each text is represented by its words regardless of the order in which they appear or the embedded grammar, with the following steps:


  1. Tokenise

  2. Normalise

  3. Remove stop words

  4. Count vectorise

  5. Transform to tf-idf representation

I have provided a detailed explanation on the preprocessing steps including the breakdown of the code chunk below in the first part of the series.


These sequential steps are accomplished with the code chunk below:


def preprocess_text(text):
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    # Lowercase and lemmatise
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    # Remove stop words
    keywords = [lemma for lemma in lemmas if lemma not in stopwords.words('english')]
    return keywords

# Create an instance of TfidfVectorizer
vectoriser = TfidfVectorizer(analyzer=preprocess_text)

# Fit to the data and transform to feature matrix
X_train_tfidf = vectoriser.fit_transform(X_train)
X_train_tfidf.shape
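To sanity-check the preprocessing function, you could first run it on a throwaway sentence (illustrative only; the exact output depends on your lemmatiser and stop word list):

# Quick look at what the preprocessing steps do to a made-up sentence
print(preprocess_text("The movies were not as entertaining as I was hoping!"))
# Expect something along the lines of ['movies', 'entertain', 'hope']:
# tokens are lowercased, lemmatised as verbs and stripped of English stop words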

If you are not sure what tf-idf is, I have provided a detailed explanation in the third part of the series.


Once we preprocess the text, our training data is now a 1400 x 27676 feature matrix stored in a sparse matrix format. This format provides efficient storage of the data and speeds up subsequent processes. We have 27676 features that represent the unique words from the training dataset. Now, the training data is ready for modelling!

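If you are curious about just how sparse the matrix is, a quick optional inspection (assuming the objects created above) shows the dimensions and the proportion of non-zero entries:

# Peek at the sparse tf-idf feature matrix
n_docs, n_features = X_train_tfidf.shape
density = X_train_tfidf.nnz / (n_docs * n_features)
print(f'{n_docs} documents x {n_features} features, '
      f'{X_train_tfidf.nnz} non-zero entries ({density:.2%} dense)')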

2. Modelling Ⓜ️

2.1. Baseline model

Let’s build a baseline model using a Stochastic Gradient Descent Classifier. I have chosen this classifier because it is fast and works well with sparse matrices. Using 5-fold cross validation, let’s fit the model to the data and evaluate it:


sgd_clf = SGDClassifier(random_state=123)
sgf_clf_scores = cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5)

print(sgf_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (sgf_clf_scores.mean(), sgf_clf_scores.std() * 2))

Given the data is perfectly balanced and we want both labels to be predicted as correctly as possible, we will use accuracy as the metric to evaluate model fitness. However, accuracy is not always the best measure; depending on the distribution of the target and the relative misclassification costs of the classes, other evaluation metrics such as precision, recall or F1 may be more appropriate.

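As an aside, if you did want to look at those alternative metrics for this data, cross_val_score accepts other scoring strings; a small sketch using the objects defined above:

# Illustrative only: cross-validated precision, recall and F1 for the same baseline classifier
for metric in ['precision', 'recall', 'f1']:
    scores = cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5, scoring=metric)
    print(f'{metric}: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})')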

The initial performance does not look bad. The baseline model can predict accurately ~83% +/- 3% of the time.


Of note, the default metric used by cross_val_score is accuracy, hence we don’t need to specify it unless we want to do so explicitly, like below:


cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5, scoring='accuracy')

Let’s understand the predictions a bit further by looking at confusion matrix:


sgf_clf_pred = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=5)
print(confusion_matrix(y_train, sgf_clf_pred))

The accuracy of predictions is similar for both classes.

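One optional way to make that comparison explicit is to express the confusion matrix as row percentages, so each row shows how a true class was distributed across predictions:

# Row-normalised confusion matrix: each row sums to 1
cm = confusion_matrix(y_train, sgf_clf_pred)
print(cm / cm.sum(axis=1, keepdims=True))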

2.2. Attempt to improve performance

The purpose of this section is to find the best machine learning algorithm as well as its hyperparameters. Let’s see if we are able to improve the model by tweaking some hyperparameters. We will leave most of the hyperparameters at their sensible default values. With the help of grid search, we will run a model with every single value combination of the subset of hyperparameters specified below and cross validate the results to get a feel for its accuracy:


grid = {'fit_intercept': [True, False],
        'early_stopping': [True, False],
        'loss': ['hinge', 'log', 'squared_hinge'],
        'penalty': ['l2', 'l1', 'none']}
search = GridSearchCV(estimator=sgd_clf, param_grid=grid, cv=5)
search.fit(X_train_tfidf, y_train)
search.best_params_

These are the best values for the hyperparameters specified above. Let’s train and validate the model using these values for the selected hyperparameters:


grid_sgd_clf_scores = cross_val_score(search.best_estimator_, X_train_tfidf, y_train, cv=5)
print(grid_sgd_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (grid_sgd_clf_scores.mean(), grid_sgd_clf_scores.std() * 2))

The model fitness is slightly better compared to baseline (small yay❕).


We will choose this model as our final model and stop this section here in the interest of time. However, this section could be extended much further by trying different modelling techniques and finding optimal values for the hyperparameters of the model using a grid search.


Exercise: See if you can further improve this model’s accuracy by using different modelling techniques and/or optimising the hyperparameters.

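As a purely hypothetical starting point for this exercise (not something covered in the original post), you could swap in a different classifier and tune it with the same grid search pattern, for example:

# Illustrative sketch: try a logistic regression instead of SGD and tune its regularisation strength
from sklearn.linear_model import LogisticRegression

lr_grid = {'C': [0.1, 1, 10]}
lr_search = GridSearchCV(estimator=LogisticRegression(solver='liblinear', random_state=123),
                         param_grid=lr_grid, cv=5)
lr_search.fit(X_train_tfidf, y_train)
print(lr_search.best_params_, lr_search.best_score_)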

2.3. Final model

Now that we have finalised the model, let’s put the data transformation step as well as the model in a pipeline:


pipe = Pipeline([('vectoriser', vectoriser),
                 ('classifier', search.best_estimator_)])

pipe.fit(X_train, y_train)

In the code shown above, the pipeline first transforms the unstructured data into a feature matrix, then fits the model to the preprocessed data. This is an elegant way of putting together the essential steps in a single pipeline.


Let’s assess the predictive power of the model on the test set. Here, we will pass the test data to the pipeline, which will first preprocess the data then make predictions using the previously fitted model:


y_test_pred = pipe.predict(X_test)
print("Accuracy: %0.2f" % (accuracy_score(y_test, y_test_pred)))
print(confusion_matrix(y_test, y_test_pred))

The accuracy of the final model on unseen data is ~85%. If this test data is representative of future data, the predictive power of the model is decent given the effort we have put in so far, don’t you think? Either way, congratulations! You have just built a simple supervised text classification model!


Thank you for taking the time to go through this post. I hope that you learned something from reading it. This was the last part of the 4-part series of posts on Introduction to NLP! Links to the rest of the posts are collated below:

◼️ Part 1: Preprocessing text in Python
◼️ Part 2: Difference between lemmatisation and stemming
◼️ Part 3: TF-IDF explained
◼️ Part 4: Supervised text classification model in Python


Happy modelling! Bye for now


3. References

  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008


  • Bird, Steven, Edward Loper and Ewan Klein, Natural Language Processing with Python. O’Reilly Media Inc, 2009


  • Jason Brownlee, What is the Difference Between Test and Validation Datasets?, Machine Learning Mastery, 2017


Originally published at: https://towardsdatascience.com/introduction-to-nlp-part-4-supervised-text-classification-model-in-python-96e9709b4267

