Recipe 5-5. Clustering Documents

Document clustering again follows similar steps, so let’s have a look at them:

  1. Tokenization
  2. Stemming and lemmatization
  3. Removing stop words and punctuation
  4. Computing term frequencies or TF-IDF
  5. Clustering: K-means/Hierarchical; we can then use
    any of the clustering algorithms to cluster different
    documents based on the features we have generated
  6. Evaluation and visualization: Finally, the clustering
    results can be visualized by plotting the clusters into
    a two-dimensional space
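
The steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the recipe's implementation: the four sample documents are invented, scikit-learn's built-in tokenizer and English stop-word list stand in for the NLTK-based preprocessing, and no stemming is done.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents (assumption): two about loans, two about credit reports.
docs = [
    "The mortgage loan payment was late",
    "My loan payment and mortgage were rejected",
    "The credit report shows an error",
    "An error appeared on my credit report",
]

# Steps 1, 3, and 4: tokenize, drop stop words, and compute TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(docs)

# Step 5: cluster the documents; random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)

for doc, label in zip(docs, labels):
    print(label, doc)
```

The two loan documents should share one label and the two credit-report documents the other; which group gets label 0 is arbitrary.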

Step 5-1 Import data and libraries
Here are the libraries, then the data:
!pip install mpld3
import numpy as np
import pandas as pd
import nltk
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS

Data = pd.read_csv("/Consumer_Complaints.
# Selecting required columns and rows
Data = Data[['consumer_complaint_narrative']]
Data = Data[pd.notnull(Data['consumer_complaint_narrative'])]

Let’s do the clustering for just 200 documents. It’s easier to
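
Drawing a fixed-size random subset of rows can be sketched with `DataFrame.sample`. The toy frame and the `random_state` value below are assumptions for illustration; only the column name comes from the recipe.

```python
import pandas as pd

# Toy stand-in for the complaints table (assumption): 1,000 dummy narratives.
data = pd.DataFrame(
    {"consumer_complaint_narrative": [f"complaint text {i}" for i in range(1000)]}
)

# Draw 200 rows at random; fixing random_state keeps the subset reproducible.
data_sample = data.sample(n=200, random_state=42)

print(data_sample.shape)  # (200, 1)
```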

Step 5-2 Preprocessing and TF-IDF feature engineering
Now we preprocess it:

Remove unwanted symbols

Data_sample['consumer_complaint_narrative'] = Data_
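
The cleaning statement above is cut off, so the exact symbols it removes are not shown. As one hedged possibility, a regular expression can strip everything except letters, digits, and whitespace. The helper `strip_symbols` and its pattern are hypothetical, not the recipe's code.

```python
import re

def strip_symbols(text: str) -> str:
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    no_symbols = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", no_symbols).strip()

cleaned = strip_symbols("Charged $35.00 twice!!  (see acct #12)")
print(cleaned)  # Charged 35 00 twice see acct 12
```

On a pandas column, the same function could be applied with `Series.apply(strip_symbols)`.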

Convert the dataframe to a list

complaints = Data_sample['consumer_complaint_narrative'].tolist()

Create the rank of each document; we will use it later

ranks = []
for i in range(1, len(complaints)+1):
    ranks.append(i)

Stop words

stopwords = nltk.corpus.stopwords.words('english')

Load the stemmer

stemmer = SnowballStemmer("english")

Functions for the sentence tokenizer, which also remove numeric tokens and raw punctuation

def tokenize_and_stem(text):
    # Tokenize by sentence, then by word
    tokens = [word for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        # Keep only tokens that contain at least one letter
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
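
The filtering step in both functions keeps only tokens that contain at least one letter, which drops pure numbers and bare punctuation. A minimal sketch of that filter follows; the regular-expression tokenizer is a stand-in for NLTK's (an assumption, so the example runs without NLTK data downloads), and `keep_alpha_tokens` is a hypothetical name.

```python
import re

def keep_alpha_tokens(text: str) -> list[str]:
    """Lowercase, split into word/punctuation tokens, keep tokens with a letter."""
    # Stand-in tokenizer (assumption): runs of word characters or single symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok for tok in tokens if re.search("[a-zA-Z]", tok)]

kept = keep_alpha_tokens("Paid 2 fees in 2015, then 3rd fee!")
print(kept)  # ['paid', 'fees', 'in', 'then', '3rd', 'fee']
```

Note that a mixed token such as `3rd` survives because it contains a letter; only tokens made up entirely of digits or symbols are removed.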

from sklearn.feature_extraction.text import TfidfVectorizer

TF-IDF vectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

# Fit the vectorizer to the data
tfidf_matrix = tfidf_vectorizer.fit_transform(complaints)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()
print(tfidf_matrix.shape)
(200, 30)
Step 5-3 Clustering using K-means
Let’s start the clustering:

Import KMeans

from sklearn.cluster import KMeans
# Define the number of clusters
num_clusters = 6
# Run the clustering algorithm
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
# Final clusters
clusters = km.labels_.tolist()
complaints_data = {'rank': ranks, 'complaints': complaints,
                   'cluster': clusters}
frame = pd.DataFrame(complaints_data, index=[clusters],
                     columns=['rank', 'cluster'])
# Number of docs per cluster
frame['cluster'].value_counts()

0 42
1 37
5 36
3 36
2 27
4 22

Quite disappointing, isn’t it?
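
One way to check whether a chosen number of clusters suits the data is the silhouette score, which rewards tight, well-separated clusters. The sketch below is an illustration on synthetic points from `make_blobs`, not part of the recipe; the same loop could be run on a TF-IDF matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three well-separated groups of 2D points.
points, _ = make_blobs(
    n_samples=150,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Score several candidate values of k; a higher silhouette is better.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On data this cleanly separated, the best score lands on the true number of groups; real text data usually gives flatter, less decisive scores.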

Step 5-4 Identify cluster behavior

Identify the top 5 words that are nearest to each cluster centroid.
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in complaints:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized},
                           index=totalvocab_stemmed)
# Sort each centroid's terms by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]:
        # .ix was removed from pandas; .loc is the label-based replacement
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].
              values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
Cluster 0 words: b'needs', b'time', b'bank', b'information', b'told'
Cluster 1 words: b'account', b'bank', b'credit', b'time', b'months'
Cluster 2 words: b'debt', b'collection', b'number', b'credit', b"n't"
Cluster 3 words: b'report', b'credit', b'credit', b'account',
Cluster 4 words: b'loan', b'payments', b'pay', b'months', b'state'
Cluster 5 words: b'payments', b'pay', b'told', b'did', b'credit'
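
Each K-means centroid is a vector of TF-IDF weights, one per term, so the terms with the largest weights characterize the cluster. The ranking trick used above (`argsort` followed by a column reversal) can be seen on a small hand-made example; the vocabulary and weights are invented.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (assumption), one row per cluster.
terms = np.array(["account", "bank", "credit", "debt", "loan"])
centers = np.array([
    [0.10, 0.70, 0.05, 0.00, 0.15],   # cluster 0
    [0.05, 0.00, 0.30, 0.60, 0.10],   # cluster 1
])

# argsort is ascending, so reversing each row puts the largest weights first.
order = centers.argsort()[:, ::-1]

top_terms_per_cluster = []
for cluster_id, row in enumerate(order):
    top_terms = terms[row[:3]].tolist()
    top_terms_per_cluster.append(top_terms)
    print(f"Cluster {cluster_id}: {top_terms}")
```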

Step 5-5 Plot the clusters on a 2D graph
Finally, we plot the clusters:
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
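
Subtracting cosine similarity from 1 turns it into a distance: documents pointing in the same direction get a distance near 0, and documents with no terms in common get 1. A minimal sketch with invented vectors follows; the `np.clip` call is an added safeguard (an assumption, not in the recipe) against tiny negative values from floating-point rounding.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document vectors (assumption): rows 0 and 1 point the same way.
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Distance = 1 - similarity; clip guards against tiny negative rounding errors.
distance = np.clip(1 - cosine_similarity(vectors), 0.0, None)

print(np.round(distance, 3))
```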

Reduce to two components, as we’re plotting points in a two-dimensional plane

mds = MDS(n_components=2, dissimilarity="precomputed")
pos = mds.fit_transform(similarity_distance) # shape (n_samples, n_components)
xs, ys = pos[:, 0], pos[:, 1]
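
Multidimensional scaling (MDS) places each item in a low-dimensional space so that the pairwise distances are preserved as well as possible. A small sketch on a hand-made distance matrix shows the shape of the result, one row per sample; the matrix values and `random_state=1` are assumptions.

```python
import numpy as np
from sklearn.manifold import MDS

# Hand-made symmetric distance matrix (assumption): items 0-1 and 2-3 are close pairs.
distances = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(distances)

print(pos.shape)  # one (x, y) row per item
```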
# Set up colors per cluster using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3',
                  3: '#e7298a', 4: '#66a61e', 5: '#D2691E'}
# Set up cluster names using a dict
cluster_names = {0: 'property, based, assist',
                 1: 'business, card',
                 2: 'authorized, approved, believe',
                 3: 'agreement, application,business',
                 4: 'closed, applied, additional',
                 5: 'applied, card'}

Finally, plot it

%matplotlib inline
# Create a data frame that holds the MDS result and the cluster labels
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters))
groups = df.groupby('label')

Set up the plot

fig, ax = plt.subplots(figsize=(17, 9)) # set size
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=20,
            label=cluster_names[name], color=cluster_colors[name],
axis= 'x',
axis= 'y',
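
The plotting code above is cut off, so here is a self-contained sketch of the same idea: one marker-only series per cluster, colored and labeled from the two dicts. The coordinates, the placeholder cluster names, and the `Agg` backend (used so the example runs as a plain script instead of in a notebook) are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 2D coordinates and cluster labels (assumption).
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.9, 1.0],
    "y": [0.1, 0.3, 0.8, 0.9],
    "label": [0, 0, 1, 1],
})
cluster_colors = {0: "#1b9e77", 1: "#d95f02"}
cluster_names = {0: "cluster A", 1: "cluster B"}  # placeholder names

fig, ax = plt.subplots(figsize=(8, 5))
for name, group in df.groupby("label"):
    # One marker-only series per cluster, so each gets its own legend entry.
    ax.plot(group.x, group.y, marker="o", linestyle="", ms=12,
            label=cluster_names[name], color=cluster_colors[name])

ax.legend(numpoints=1)
fig.canvas.draw()  # render in memory; use fig.savefig(...) to write a file
```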

That’s it. We have clustered 200 complaints into 6 groups using
K-means clustering on TF-IDF features, so that similar complaints fall
into the same bucket. Word embeddings could also be used as features and
may yield better clusters. The 2D plot gives a good view of the clusters’
behavior: dots (documents) of the same color lie close to one another.
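
Averaging word vectors is one simple way to turn embeddings into document features that a clustering algorithm can consume in place of TF-IDF rows. The sketch below is only an illustration: the three-dimensional vectors are invented, and `document_vector` is a hypothetical helper; real embeddings would come from a pretrained model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (assumption).
embeddings = {
    "loan": np.array([0.9, 0.1, 0.0]),
    "payment": np.array([0.8, 0.2, 0.0]),
    "credit": np.array([0.1, 0.9, 0.0]),
    "report": np.array([0.0, 0.8, 0.2]),
}

def document_vector(tokens, table, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [table[tok] for tok in tokens if tok in table]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# "overdue" has no vector in the table, so it is skipped.
doc_vec = document_vector(["loan", "payment", "overdue"], embeddings)
print(doc_vec)
```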

wordnet to give each cluster a name?

