

This post will show you a simplified example of building a basic supervised text classification model. If this sounds a little like gibberish, let’s see some definitions:

  • supervised: we know the correct output class for each text in the sample data

  • text: the input data is in text format

  • classification model: a model that uses input data to predict an output class

Each input text is also known as a ‘document’ and the output is also known as the ‘target’ (the term, not the shop!).

Does ‘supervised text classification model’ sound more meaningful now? Maybe? Among supervised text classification models, we will focus on one particular type in this post: a supervised sentiment classifier, because we will be using sentiment polarity data on movie reviews with a binary target.

0. Python setup

This post assumes that you have access to and are familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.

I have used and tested the scripts in Python 3.7.1. Let’s make sure you have the right tools before we get started.

⬜️ Ensure the required packages are installed: pandas, nltk & sklearn

We will use the following powerful third party packages:


  • pandas: Data analysis library,

  • nltk: Natural Language Tool Kit library and


  • sklearn: Machine Learning library.


⬜️ Download ‘stopwords’, ‘wordnet’ and movie_reviews corpora from nltk

The script below downloads these corpora. If you have already downloaded them, running it will notify you that they are up-to-date:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('movie_reviews')

1. Data preparation

1.1. Import sample data and packages

Firstly, let’s prepare the environment by importing the required packages:


import numpy as np
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

We will transform movie_reviews tagged corpus from nltk to a pandas dataframe with the script below:


# Script copied from here
reviews = []
for fileid in movie_reviews.fileids():
    # Each file id looks like '<tag>/<filename>'
    tag, filename = fileid.split('/')
    reviews.append((tag, movie_reviews.raw(fileid)))
sample = pd.DataFrame(reviews, columns=['target', 'document'])
print(f'Dimensions: {sample.shape}')

You will see that the dataframe has 2000 rows and 2 columns: a column for the targets (the sentiment polarity) and a column for the reviews (i.e. documents). Each review is tagged as either positive or negative. Let’s check the distribution of the target classes:
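The code for this check is not shown in the post. A minimal sketch of the idea, using a short hypothetical list of tags (`toy_targets`) in place of the real `sample['target']` column:

```python
from collections import Counter

# Hypothetical stand-in for sample['target']: a short list of tags
# in the same 'pos'/'neg' format that movie_reviews uses.
toy_targets = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

# Count how many documents fall into each class
class_counts = Counter(toy_targets)
for tag, count in sorted(class_counts.items()):
    print(f'{tag}: {count}')
```

With the real dataframe, `sample['target'].value_counts()` should give the same kind of tally in one call.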

Each class (i.e. ‘pos’, ‘neg’) has 1000 records, so the data is perfectly balanced. Let’s ensure that the classes are binary coded:

sample['target'] = np.where(sample['target']=='pos', 1, 0)

This looks good, let’s proceed to partitioning the data.


1.2. Partition data

When it comes to partitioning data, we have 2 options:


  1. Split the sample data into 3 groups: train, validation and test, where train is used to fit the model, validation is used to evaluate fitness of interim models, and test is used to assess final model fitness.

  2. Split the sample data into 2 groups: train and test, where train is further split into train and validation set k times using k-fold cross validation, and test is used to assess final model fitness. With k-fold cross validation:

    First: Train is split into k pieces.

    Second: Take one piece for validation set to evaluate fitness of interim models after fitting the model to the remaining k-1 pieces.

    Third: Repeat the second step k-1 times using a different piece for the validation set each time and the remaining for the train set such that each piece of train is used as validation set only once.

Interim models here refer to the models created during the iterative process of comparing different machine learning classifiers as well as trying different hyperparameters for a given classifier to find the best model.
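The k-fold steps above can be sketched in plain Python. This is only an illustration of the splitting logic, not what sklearn does internally: `k_fold_indices` is a hypothetical helper, and it does not shuffle or stratify the rows.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # One piece is held out for validation ...
        val_idx = indices[start:start + size]
        # ... and the remaining k-1 pieces are used for training
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f'Fold {fold_number}: validate on {val_idx}, train on {len(train_idx)} rows')
```

Each row index lands in a validation set exactly once, which is the property the third step describes.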


We will use the second option to partition the sample data because our sample is small. Let’s put aside some test data so that we can check how well the final model generalises to unseen data later:

X_train, X_test, y_train, y_test = train_test_split(sample['document'], sample['target'], test_size=0.3, random_state=123)
print(f'Train dimensions: {X_train.shape, y_train.shape}')
print(f'Test dimensions: {X_test.shape, y_test.shape}')
# Check out target distribution
print(y_train.value_counts())
print(y_test.value_counts())

We have 1400 documents in train and 600 documents in test dataset. The target is evenly distributed in both train and test dataset.

If you are slightly confused about this section on data partitioning, you may want to check this awesome article to learn more.


1.3. Preprocess documents

It’s time to preprocess the training documents, that is, to transform unstructured text into numbers in a matrix. Let’s preprocess the text using an approach called bag-of-words, where each text is represented by its words regardless of their order or grammar, with the following steps:

  1. Tokenise

  2. Normalise

  3. Remove stop words

  4. Count vectorise

  5. Transform to tf-idf representation


I have provided a detailed explanation on the preprocessing steps including the breakdown of the code chunk below in the first part of the series.


These sequential steps are accomplished with the code chunk below:


def preprocess_text(text):
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    # Lowercase and lemmatise
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    # Remove stop words
    keywords = [lemma for lemma in lemmas if lemma not in stopwords.words('english')]
    return keywords

# Create an instance of TfidfVectorizer
vectoriser = TfidfVectorizer(analyzer=preprocess_text)
# Fit to the data and transform to feature matrix
X_train_tfidf = vectoriser.fit_transform(X_train)

If you are not sure what tf-idf is, I have provided a detailed explanation in the third part of the series.


Once we preprocess the text, our training data is now a 1400 x 27676 feature matrix stored in a sparse matrix format. This format provides efficient storage of the data and speeds up subsequent processes. We have 27676 features that represent the unique words from the training dataset. Now, the training data is ready for modelling!
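To make the tf-idf step less of a black box, here is a rough pure-Python sketch on three toy documents (hypothetical, already tokenised). It assumes the smoothed idf formula and l2 row normalisation that scikit-learn documents as the TfidfVectorizer defaults; it is a simplified illustration, not the library's implementation, and it builds a dense list of lists rather than a sparse matrix.

```python
import math
from collections import Counter

# Toy, already-tokenised documents (hypothetical, not from movie_reviews)
docs = [['good', 'film', 'good'],
        ['bad', 'film'],
        ['good', 'plot']]

n_docs = len(docs)
vocabulary = sorted({token for doc in docs for token in doc})

# Document frequency: number of documents containing each term
doc_freq = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Return one l2-normalised tf-idf row, with one value per vocabulary term."""
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
for row in matrix:
    print([round(value, 2) for value in row])
```

Each row corresponds to a document and each column to a unique term, which is the same layout as the 1400 x 27676 matrix above, only much smaller.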

2. Modelling Ⓜ️

2.1. Baseline model

Let’s build a baseline model using the Stochastic Gradient Descent Classifier. I have chosen this classifier because it is fast and works well with sparse matrices. Using 5-fold cross validation, let’s fit the model to the data and evaluate it:

sgd_clf = SGDClassifier(random_state=123)
sgf_clf_scores = cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5)
print(sgf_clf_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (sgf_clf_scores.mean(), sgf_clf_scores.std() * 2))

Given the data is perfectly balanced and we want both labels to be predicted as correctly as possible, we will use accuracy as a metric to evaluate the model fitness. However, accuracy is not always the best measure depending on the distribution of the target and relative misclassification costs of the classes. In which case, other evaluation metrics such as precision, recall or f1 may be more appropriate.
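To illustrate why accuracy can mislead when the target is imbalanced, here is a small sketch with made-up labels. `binary_metrics` is a hypothetical helper written for this example; sklearn.metrics offers ready-made equivalents.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and f1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up imbalanced example: 9 negatives, 1 positive,
# and a model that always predicts the negative class
y_true = [0] * 9 + [1]
y_pred = [0] * 10

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.9 even though the model never finds the positive class, so recall is 0. On our balanced data this trap does not arise, which is why accuracy is a reasonable choice.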

The initial performance does not look bad. The baseline model’s cross-validated accuracy is ~83% (+/- 3%).

Of note, cross_val_score uses accuracy by default for a classifier, so we don’t need to specify it unless we want to be explicit, like below:

cross_val_score(sgd_clf, X_train_tfidf, y_train, cv=5, scoring='accuracy')

Let’s understand the predictions a bit further by looking at confusion matrix:


sgf_clf_pred = cross_val_predict(sgd_clf, X_train_tfidf, y_train, cv=5)
print(confusion_matrix(y_train, sgf_clf_pred))

The accuracy of predictions is similar for both classes.
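As a sketch of how to read per-class accuracy off a confusion matrix, the snippet below uses made-up counts (not the results from this post). It assumes sklearn's layout, where rows are the true classes and columns are the predicted classes; `per_class_accuracy` is a hypothetical helper.

```python
def per_class_accuracy(matrix):
    """Share of each true class (row) that was predicted correctly."""
    return [row[label] / sum(row) for label, row in enumerate(matrix)]

# Made-up counts: row 0 = true negatives class, row 1 = true positives class
toy_matrix = [[80, 20],
              [10, 90]]

for label, share in enumerate(per_class_accuracy(toy_matrix)):
    print(f'Class {label}: {share:.0%} predicted correctly')
```

The diagonal holds the correct predictions, so dividing each diagonal entry by its row total gives the accuracy for that class.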


2.2. Attempt to improve performance

The purpose of this section is to find the best machine learning algorithm as well as its hyperparameters. Let’s see if we are able to improve the model by tweaking some hyperparameters. We will leave most of the hyperparameters at their sensible default values. With the help of grid search, we will run a model with every combination of values for the subset of hyperparameters specified below and cross validate the results to get a feel for its accuracy:

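# Note: recent scikit-learn releases name these options 'log_loss' and None
# instead of 'log' and 'none'; adjust the grid to match your installed version
grid = {'fit_intercept': [True, False],
        'early_stopping': [True, False],
        'loss': ['hinge', 'log', 'squared_hinge'],
        'penalty': ['l2', 'l1', 'none']}
search = GridSearchCV(estimator=sgd_clf, param_grid=grid, cv=5)
search.fit(X_train_tfidf, y_train)
print(search.best_params_)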

These are the best values for the hyperparameters specified above. Let’s train and validate the model using these values for the selected hyperparameters:

grid_sgd_clf_scores = cross_val_score(search.best_estimator_, X_train_tfidf, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (grid_sgd_clf_scores.mean(), grid_sgd_clf_scores.std() * 2))

The model fitness is slightly better compared to baseline (small yay❕).
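To get a feel for how much work the grid search did, the combinations can be enumerated by hand. This sketch only counts them using the same grid as above; it does not fit any models, and the `combinations` and `n_folds` names are just for illustration.

```python
from itertools import product

grid = {'fit_intercept': [True, False],
        'early_stopping': [True, False],
        'loss': ['hinge', 'log', 'squared_hinge'],
        'penalty': ['l2', 'l1', 'none']}

# Build one dict per combination of hyperparameter values
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5
print(f'{len(combinations)} combinations')                   # 36 combinations
print(f'{len(combinations) * n_folds} cross-validated fits')  # 180 cross-validated fits
print(combinations[0])
```

The count multiplies quickly with every extra hyperparameter or value, which is worth keeping in mind before widening the grid.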


We will choose this model as our final model and stop this section here in the interest of time. However, this section could be extended much further by trying different modelling techniques and finding optimal values for the hyperparameters of the model using a grid search.

Exercise: See if you can further improve this model’s accuracy by using different modelling techniques and/or optimising the hyperparameters.


2.3. Final model

Now that we have finalised the model, let’s put the data transformation step as well as the model in a pipeline:


pipe = Pipeline([('vectoriser', vectoriser),
                 ('classifier', search.best_estimator_)])
pipe.fit(X_train, y_train)

In the code shown above, the pipeline first transforms the unstructured data to a feature matrix, then fits the model to the preprocessed data. This is an elegant way of putting together the essential steps in a single pipeline.

Let’s assess the predictive power of the model on the test set. Here, we will pass the test data to the pipeline, which will first preprocess the data then make predictions using the previously fitted model:

y_test_pred = pipe.predict(X_test)
print("Accuracy: %0.2f" % (accuracy_score(y_test, y_test_pred)))
print(confusion_matrix(y_test, y_test_pred))

The accuracy of the final model on unseen data is ~85%. If this test data is representative of future data, the predictive power of the model is decent given the effort we have put in so far, don’t you think? Either way, congratulations! You have just built a simple supervised text classification model!

Thank you for taking the time to go through this post. I hope that you learned something from reading it. This was the last part of the 4-part series of posts on Introduction to NLP!

Happy modelling! Bye for now

