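NLTK (Natural Language Toolkit) is one of the most commonly used Python libraries in NLP (natural language processing). It ships with corpora and part-of-speech resources, and it provides built-in classification and tokenization functions.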
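import nltk
# word_tokenize needs the Punkt tokenizer data, e.g. nltk.download('punkt')
sentence = "I Love China !"
tokens = nltk.word_tokenize(sentence)
print(tokens)
['I', 'Love', 'China', '!']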
>>> import jieba
>>> seg_list=jieba.cut("我正在学习机器学习",cut_all=True)
>>> print("Full mode:","/".join(seg_list))
Full mode: 我/正在/学习/学习机/机器/学习
>>> seg_list=jieba.cut("我正在学习机器学习",cut_all=False)
>>> print("Precise mode:","/".join(seg_list))
Precise mode: 我/正在/学习/机器/学习
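In text processing, the words in a text are discrete, categorical features. How do we encode such data in a form that an algorithm can use? The mapping from text data to real-valued vectors is called feature extraction. One of the simplest techniques for representing text numerically is Bag of Words.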
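To make the idea concrete, here is a minimal Bag of Words sketch in plain Python. The two toy documents and the helper name `bag_of_words` are made up for illustration; tokenization is a simple lowercase whitespace split, which is cruder than what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct token, in sorted order
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary entry; absent words count as 0
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors

# Toy documents, invented for this example
docs = ["the day is beautiful", "the day is long"]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['beautiful', 'day', 'is', 'long', 'the']
print(vecs)   # [[1, 1, 1, 0, 1], [0, 1, 1, 1, 1]]
```

Each document becomes a fixed-length vector whose positions correspond to vocabulary entries, so word order is discarded and only counts remain.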
TF('beautiful', Document1) = 2/10, IDF('beautiful') = log(2/2) = 0
TF('day', Document1) = 5/10, IDF('day') = log(2/1) = 0.30
TF-IDF('beautiful', Document1) = (2/10) * 0 = 0
TF-IDF('day', Document1) = (5/10) * 0.30 = 0.15
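As you can see for Document1, the TF-IDF method heavily penalizes the word "beautiful" but gives more weight to "day". This comes from the IDF part, which gives more weight to distinctive words, that is, words that appear in fewer documents. In other words, in the context of the whole corpus, "day" is an important word for Document1. The Python scikit-learn library provides efficient tools for text data mining, including functions that compute TF-IDF over the vocabulary of a given text corpus.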
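The arithmetic above can be checked with a few lines of Python. This is only a sketch of the textbook formula used in the example (base-10 logarithm, no smoothing); the function names are illustrative, and scikit-learn's `TfidfVectorizer` uses a smoothed natural-log variant with normalization by default, so its numbers will differ.

```python
import math

def tf(term_count, doc_length):
    # Term frequency: share of the document taken up by the term
    return term_count / doc_length

def idf(n_documents, n_documents_with_term):
    # Base-10 log, matching log(2/1) = 0.30 in the worked example
    return math.log10(n_documents / n_documents_with_term)

def tf_idf(term_count, doc_length, n_documents, n_documents_with_term):
    return tf(term_count, doc_length) * idf(n_documents, n_documents_with_term)

# Numbers taken from the worked example: 2 documents, Document1 has 10 words
beautiful = tf_idf(term_count=2, doc_length=10, n_documents=2, n_documents_with_term=2)
day = tf_idf(term_count=5, doc_length=10, n_documents=2, n_documents_with_term=1)
print(round(beautiful, 2), round(day, 2))  # 0.0 0.15
```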
Target words: ice, steam
Probe words: solid, gas, water, fashion
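Let P(k | w) be the probability that word k appears in the context of word w. Consider a word that is closely related to ice but not to steam, such as solid. P(solid | ice) will be relatively high and P(solid | steam) relatively low, so the ratio P(solid | ice) / P(solid | steam) will be large. If we take a word such as gas, which is related to steam but not to ice, the ratio P(gas | ice) / P(gas | steam) will be small. For a word related to both ice and steam, such as water, or to neither, such as fashion, we expect the ratio to be close to 1.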
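A minimal sketch of this ratio on a tiny corpus follows. The four sentences are invented for illustration, "context" is taken to mean "same sentence", and a small `eps` is added only to avoid division by zero; real co-occurrence statistics are estimated from a very large corpus with a sliding window.

```python
def context_prob(sentences, probe, target):
    """Estimate P(probe | target) as the share of sentences
    containing `target` that also contain `probe`."""
    with_target = [s for s in sentences if target in s]
    if not with_target:
        return 0.0
    return sum(probe in s for s in with_target) / len(with_target)

def prob_ratio(sentences, probe, eps=1e-3):
    """P(probe | ice) / P(probe | steam), with eps to avoid dividing by zero."""
    p_ice = context_prob(sentences, probe, "ice")
    p_steam = context_prob(sentences, probe, "steam")
    return (p_ice + eps) / (p_steam + eps)

# Hypothetical toy corpus, one token list per sentence
corpus = [s.split() for s in [
    "ice is a cold solid",
    "solid ice floats on water",
    "steam is a hot gas",
    "gas like steam rises from water",
]]

for probe in ["solid", "gas", "water", "fashion"]:
    print(probe, prob_ratio(corpus, probe))
```

On this toy data the ratio is much larger than 1 for solid, much smaller than 1 for gas, and equal to 1 for water and fashion, which is the pattern the paragraph describes.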
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_train = pd.read_csv("/Users/gawaintan/workSpace/movie-review-sentiment-analysis-kernels-only/train.tsv", sep='\t')
df_train.head()
| PhraseId | SentenceId | Phrase | Sentiment |
0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
2 | 3 | 1 | A series | 2 |
3 | 4 | 1 | A | 2 |
4 | 5 | 1 | series | 2 |
df_test = pd.read_csv("/Users/gawaintan/workSpace/movie-review-sentiment-analysis-kernels-only/test.tsv", sep='\t')
df_test.head()
| PhraseId | SentenceId | Phrase |
0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
2 | 156063 | 8545 | An |
3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
4 | 156065 | 8545 | intermittently pleasing but mostly routine |
df_train.Sentiment.value_counts()
2 79582
3 32927
1 27273
4 9206
0 7072
Name: Sentiment, dtype: int64
df_train.info()
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PhraseId 156060 non-null int64
1 SentenceId 156060 non-null int64
2 Phrase 156060 non-null object
3 Sentiment 156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB
df_train_1 = df_train.drop(['PhraseId','SentenceId'],axis=1)
df_train_1.head()
| Phrase | Sentiment |
0 | A series of escapades demonstrating the adage ... | 1 |
1 | A series of escapades demonstrating the adage ... | 2 |
2 | A series | 2 |
3 | A | 2 |
4 | series | 2 |
Let’s check the phrase length of each of the movie reviews.
df_train_1['phrase_len'] = [len(t) for t in df_train_1.Phrase]
df_train_1.head(4)
| Phrase | Sentiment | phrase_len |
0 | A series of escapades demonstrating the adage ... | 1 | 188 |
1 | A series of escapades demonstrating the adage ... | 2 | 77 |
2 | A series | 2 | 8 |
3 | A | 2 | 1 |
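fig,ax = plt.subplots(figsize=(5,5))
# Box plot of phrase lengths (in characters)
ax.boxplot(df_train_1.phrase_len)
plt.show()
From the box plot above, some reviews are longer than 100 characters.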
df_train_1[df_train_1.phrase_len > 100].head()
| Phrase | Sentiment | phrase_len |
0 | A series of escapades demonstrating the adage ... | 1 | 188 |
27 | is also good for the gander , some of which oc... | 2 | 110 |
28 | is also good for the gander , some of which oc... | 2 | 108 |
116 | A positively thrilling combination of ethnogra... | 3 | 152 |
117 | A positively thrilling combination of ethnogra... | 4 | 150 |
df_train_1[df_train_1.phrase_len > 100].loc[0].Phrase
'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .'
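A word cloud is a graphical representation of the words used most often in a collection of text documents. The size of each word in the image indicates how frequently it occurs in the whole text. Charts like this are very useful in text analysis.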
neg_phrases = df_train_1[df_train_1.Sentiment == 0]
neg_words = []
for t in neg_phrases.Phrase:
    neg_words.append(t)
neg_words[:4]
['would have a hard time sitting through this one',
'have a hard time sitting through this one',
'Aggressive self-glorification and a manipulative whitewash',
'self-glorification and a manipulative whitewash']
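**pandas.Series.str.cat**: concatenates the strings in a Series/Index with the given separator. Here we pass a space as the separator, so all the strings in the Series are joined into one string, separated by spaces.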
neg_text = pd.Series(neg_words).str.cat(sep=' ')
neg_text[:100]
'would have a hard time sitting through this one have a hard time sitting through this one Aggressive'
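For a list of plain strings with no missing values, `str.cat(sep=' ')` gives the same result as Python's built-in `' '.join(...)`. The sketch below uses only the four negative phrases shown earlier (not the full dataset) to illustrate how the first 100 characters of the joined text come about.

```python
# The four negative phrases printed earlier in this section
sample_neg_words = [
    'would have a hard time sitting through this one',
    'have a hard time sitting through this one',
    'Aggressive self-glorification and a manipulative whitewash',
    'self-glorification and a manipulative whitewash',
]

# Plain-Python equivalent of pd.Series(sample_neg_words).str.cat(sep=' ')
sample_neg_text = ' '.join(sample_neg_words)
print(sample_neg_text[:100])
```

The slice ends exactly after "Aggressive", which matches the output shown above.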
# Print negative phrases that nevertheless contain the word 'good'
for t in neg_phrases.Phrase[:300]:
    if 'good' in t:
        print(t)
's not a particularly good film
covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people .
huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people
a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people
pos_phrases = df_train_1[df_train_1.Sentiment == 4] ## 4 is positive sentiment
pos_string = []
for t in pos_phrases.Phrase:
    pos_string.append(t)
pos_text = pd.Series(pos_string).str.cat(sep=' ')
pos_text[:100]
'This quiet , introspective and entertaining independent is worth seeking . quiet , introspective and'
from wordcloud import WordCloud
wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_text)
plt.imshow(wordcloud, interpolation='bilinear')
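Some of the large words are fairly neutral, such as "film" and "movie". We can also see that some of the smaller words make sense in negative movie reviews, such as "bad movie", "dull", and "boring".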
wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(pos_text)
plt.imshow(wordcloud, interpolation='bilinear')
We need term frequency data to see which words are used in the movie reviews and how many times. Let's continue with CountVectorizer to compute the term frequencies:
from sklearn.feature_extraction.text import CountVectorizer
cvector = CountVectorizer(min_df = 0.0, max_df = 1.0, ngram_range=(1,2))
cvector.fit(df_train_1.Phrase)
CountVectorizer(min_df=0.0, ngram_range=(1, 2))
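It looks like the count vectorizer has extracted 94644 terms (unigrams and bigrams) from the corpus. The term frequencies for each class can be obtained with the following code block.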
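All_labels = ['negative', 'some-negative', 'neutral', 'some-positive', 'positive']
# Term counts for the negative class (Sentiment == 0); sum the columns once
neg_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 0].Phrase)
neg_counts = neg_matrix.sum(axis=0)
term_freq_df = pd.DataFrame(list(sorted([(word, neg_counts[0, idx]) for word, idx in cvector.vocabulary_.items()], key = lambda x: x[1], reverse=True)),columns=['Terms','negative']).set_index('Terms')
All_matrix, All_words = [], []
for i in range(1,5):
    # Same counting for the remaining classes (Sentiment 1 to 4)
    All_matrix.append(cvector.transform(df_train_1[df_train_1.Sentiment == i].Phrase))
    All_words.append(All_matrix[i-1].sum(axis=0))
    aa=pd.DataFrame(list(sorted([(word,All_words[i-1][0, idx]) for word, idx in cvector.vocabulary_.items()], key = lambda x: x[1], reverse=True)),columns=['Terms',All_labels[i]])
    term_freq_df = term_freq_df.join(aa.set_index('Terms'))
term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['some-negative'] + term_freq_df['neutral'] + term_freq_df['some-positive'] + term_freq_df['positive']
term_freq_df.sort_values(by='total', ascending=False).head(10)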
Terms | negative | some-negative | neutral | some-positive | positive | total |
the | 3462 | 10885 | 20619 | 12459 | 4208 | 51633 |
of | 2277 | 6660 | 12287 | 8405 | 3073 | 32702 |
and | 2549 | 6204 | 10241 | 9180 | 4003 | 32177 |
to | 1916 | 5571 | 8295 | 5411 | 1568 | 22761 |
in | 1038 | 2965 | 5562 | 3365 | 1067 | 13997 |
is | 1372 | 3362 | 3703 | 3489 | 1550 | 13476 |
that | 1139 | 2982 | 3677 | 3280 | 1260 | 12338 |
it | 1086 | 3067 | 3791 | 2927 | 863 | 11734 |
as | 757 | 2184 | 2941 | 2037 | 732 | 8651 |
with | 452 | 1533 | 2471 | 2365 | 929 | 7750 |
Next, let's explore how the tokens differ between classes, especially between the positive and negative ones.
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english',max_features=10000)
cvec.fit(df_train_1.Phrase)
CountVectorizer(max_features=10000, stop_words='english')
neg_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 0].Phrase)
som_neg_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 1].Phrase)
neu_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 2].Phrase)
som_pos_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 3].Phrase)
pos_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 4].Phrase)
neg_words = neg_matrix.sum(axis=0)
neg_words_freq = [(word, neg_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
neg_tf = pd.DataFrame(list(sorted(neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','negative'])
neg_tf_df = neg_tf.set_index('Terms')
som_neg_words = som_neg_matrix.sum(axis=0)
som_neg_words_freq = [(word, som_neg_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
som_neg_tf = pd.DataFrame(list(sorted(som_neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-negative'])
som_neg_tf_df = som_neg_tf.set_index('Terms')
neu_words = neu_matrix.sum(axis=0)
neu_words_freq = [(word, neu_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
neu_words_tf = pd.DataFrame(list(sorted(neu_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','neutral'])
neu_words_tf_df = neu_words_tf.set_index('Terms')
som_pos_words = som_pos_matrix.sum(axis=0)
som_pos_words_freq = [(word, som_pos_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
som_pos_words_tf = pd.DataFrame(list(sorted(som_pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-positive'])
som_pos_words_tf_df = som_pos_words_tf.set_index('Terms')
pos_words = pos_matrix.sum(axis=0)
pos_words_freq = [(word, pos_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
pos_words_tf = pd.DataFrame(list(sorted(pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','positive'])
pos_words_tf_df = pos_words_tf.set_index('Terms')
term_freq_df = pd.concat([neg_tf_df,som_neg_tf_df,neu_words_tf_df,som_pos_words_tf_df,pos_words_tf_df],axis=1)
term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['some-negative'] \
+ term_freq_df['neutral'] + term_freq_df['some-positive'] \
+ term_freq_df['positive']
term_freq_df.sort_values(by='total', ascending=False).head(15)
Terms | negative | some-negative | neutral | some-positive | positive | total |
film | 480 | 1281 | 2175 | 1848 | 949 | 6733 |
movie | 793 | 1463 | 2054 | 1344 | 587 | 6241 |
like | 332 | 942 | 1167 | 599 | 150 | 3190 |
story | 153 | 532 | 954 | 664 | 236 | 2539 |
rrb | 131 | 498 | 1112 | 551 | 146 | 2438 |
good | 100 | 334 | 519 | 974 | 334 | 2261 |
lrb | 119 | 452 | 878 | 512 | 137 | 2098 |
time | 153 | 420 | 752 | 464 | 130 | 1919 |
characters | 167 | 455 | 614 | 497 | 149 | 1882 |
comedy | 174 | 341 | 578 | 475 | 245 | 1813 |
just | 216 | 598 | 550 | 282 | 82 | 1728 |
life | 77 | 200 | 729 | 544 | 168 | 1718 |
does | 135 | 566 | 519 | 375 | 79 | 1674 |
little | 109 | 492 | 580 | 339 | 85 | 1605 |
funny | 73 | 257 | 267 | 639 | 347 | 1583 |
y_pos = np.arange(50)
plt.bar(y_pos, term_freq_df.sort_values(by='negative', ascending=False)['negative'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df.sort_values(by='negative', ascending=False)['negative'][:50].index,rotation='vertical')
plt.xlabel('Top 50 negative tokens')
plt.title('Top 50 tokens in negative movie reviews')
Text(0.5, 1.0, 'Top 50 tokens in negative movie reviews')
Let's also look at the top 50 positive tokens on a bar chart.
y_pos = np.arange(50)
plt.bar(y_pos, term_freq_df.sort_values(by='positive', ascending=False)['positive'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df.sort_values(by='positive', ascending=False)['positive'][:50].index,rotation='vertical')
plt.xlabel('Top 50 positive tokens')
plt.title('Top 50 tokens in positive movie reviews')
Text(0.5, 1.0, 'Top 50 tokens in positive movie reviews')
Once again, some neutral words such as "film" and "movie" rank quite high.
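One rough way to see which tokens actually separate the classes is to normalize the counts by class size and compare the rates. The sketch below uses only numbers already printed above (the class sizes from the Sentiment value counts and four rows of the term-frequency table); the helper name is illustrative, and the per-phrase rate is a crude normalization rather than a proper statistical test.

```python
# Number of phrases per class, from the Sentiment value counts above
class_sizes = {"negative": 7072, "positive": 9206}

# Token counts copied from the term-frequency table above
token_counts = {
    "film":  {"negative": 480, "positive": 949},
    "movie": {"negative": 793, "positive": 587},
    "good":  {"negative": 100, "positive": 334},
    "funny": {"negative": 73,  "positive": 347},
}

def positive_to_negative_ratio(token):
    """Ratio of occurrences per phrase in the positive vs. negative class."""
    counts = token_counts[token]
    pos_rate = counts["positive"] / class_sizes["positive"]
    neg_rate = counts["negative"] / class_sizes["negative"]
    return pos_rate / neg_rate

for token in token_counts:
    print(token, round(positive_to_negative_ratio(token), 2))
```

A ratio near 1 means the token is about equally common in both classes; "funny" and "good" come out well above 1, while "movie" falls below 1.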
phrase = np.array(df_train_1['Phrase'])
sentiments = np.array(df_train_1['Sentiment'])
# build train and test datasets
from sklearn.model_selection import train_test_split
phrase_train, phrase_test, sentiments_train, sentiments_test = train_test_split(phrase, sentiments, test_size=0.2, random_state=4)
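As a sanity check on the split sizes, here is a plain-Python sketch of an 80/20 shuffle split. The function `shuffle_split` is hypothetical and does not reproduce the exact rows that `train_test_split` selects (the shuffling algorithm differs); it only shows where the 124848/31212 sizes seen later come from.

```python
import random

def shuffle_split(items, test_size=0.2, seed=4):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# 156060 is the number of training phrases reported by df_train.info()
train_part, test_part = shuffle_split(list(range(156060)))
print(len(train_part), len(test_part))  # 124848 31212
```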
Next, we will look at how the tokens differ across the 5 classes (positive, some positive, neutral, some negative, negative).
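As we all know, machine learning algorithms work on numbers, so we have to extract or convert text data into numbers without losing much information. One way to do this conversion is Bag of Words (BOW), which gives every word a number, but this is very inefficient. An alternative is CountVectorizer: it counts the words in each document, that is, it converts a collection of text documents into a matrix of the number of occurrences of each word in each document.
For example, given the collection of 3 text documents below, CountVectorizer converts it into separate counts of each word in each document, as follows: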
cv1 = CountVectorizer()
x_traincv = cv1.fit_transform(["Hi How are you How are you doing","Hi what's up","Wow that's awesome"])
# get_feature_names() is deprecated since scikit-learn 1.0; use get_feature_names_out()
x_traincv_df = pd.DataFrame(x_traincv.toarray(),columns=list(cv1.get_feature_names_out()))
x_traincv_df
| are | awesome | doing | hi | how | that | up | what | wow | you |
0 | 2 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 2 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
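With CountVectorizer we are only counting the words in each document, and very often the counts of words such as "are", "you", and "hi" are very large, which will dominate the results of our machine learning algorithm.
Hence TF-IDF (Term Frequency-Inverse Document Frequency) lowers the weight of common words that appear in almost all documents and gives more importance to words that appear in only a subset of documents. It works by penalizing these common words with lower weights while giving importance to rare words in a particular document.
For the CountVectorizer this time, stop words will not help much, because the same high-frequency words such as "the" and "to" occur about equally often in every class. If these stop words dominate the classes, I will not get meaningful results. So I decided to remove stop words and also limit max_features to 10,000 in the CountVectorizer.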
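To see the down-weighting on the three example documents, here is a plain-Python sketch of the inverse document frequency. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what scikit-learn documents as its default, and a tokenizer that keeps lowercase tokens of two or more word characters; treat both as assumptions of this sketch rather than a reimplementation of `TfidfVectorizer`.

```python
import math
import re

docs = ["Hi How are you How are you doing", "Hi what's up", "Wow that's awesome"]

def tokenize(doc):
    # Lowercase, keep tokens of 2+ word characters ("what's" -> "what")
    return re.findall(r"\b\w\w+\b", doc.lower())

def smoothed_idf(documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, df = number of docs containing t."""
    n = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}

idf_weights = smoothed_idf(docs)
for term in sorted(idf_weights):
    print(term, round(idf_weights[term], 3))
```

In this tiny corpus only "hi" occurs in two documents, so it gets the lowest weight; every other term occurs in a single document and gets the highest weight.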
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
## Build Bag-Of-Words on train phrases
cv = CountVectorizer(stop_words='english',max_features=10000)
cv_train_features = cv.fit_transform(phrase_train)
# build TFIDF features on train reviews
tv = TfidfVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1,2))
tv_train_features = tv.fit_transform(phrase_train)
# transform test reviews into features
cv_test_features = cv.transform(phrase_test)
tv_test_features = tv.transform(phrase_test)
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)
BOW model:> Train features shape: (124848, 10000) Test features shape: (31212, 10000)
TFIDF model:> Train features shape: (124848, 93697) Test features shape: (31212, 93697)
#### Evaluation metrics
from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.preprocessing import label_binarize
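# scipy.interp is deprecated and removed in recent SciPy releases; numpy.interp has the same signature
from numpy import interp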
from sklearn.metrics import roc_curve, auc
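def get_metrics(true_labels, predicted_labels):
    # Weighted averages account for the class imbalance in the labels
    print('Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels), 4))
    print('Precision:', np.round(metrics.precision_score(true_labels, predicted_labels, average='weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(true_labels, predicted_labels, average='weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(true_labels, predicted_labels, average='weighted'), 4))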
def train_predict_model(classifier,
                        train_features, train_labels,
                        test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    return predictions
def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    total_classes = len(classes)
    level_labels = [total_classes*[0], list(range(total_classes))]
    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels,
                                  labels=classes)
    # Recent pandas versions take `codes=` (formerly `labels=`) in MultiIndex
    cm_frame = pd.DataFrame(data=cm,
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes],
                                                  codes=level_labels),
                            index=pd.MultiIndex(levels=[['Actual:'], classes],
                                                codes=level_labels))
    print(cm_frame)
def display_classification_report(true_labels, predicted_labels, classes=[1,0]):
    report = metrics.classification_report(y_true=true_labels,
                                           y_pred=predicted_labels,
                                           labels=classes)
    print(report)
def display_model_performance_metrics(true_labels, predicted_labels, classes=[1,0]):
    print('Model Performance metrics:')
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print('\nModel Classification report:')
    display_classification_report(true_labels=true_labels, predicted_labels=predicted_labels,
                                  classes=classes)
    print('\nPrediction Confusion Matrix:')
    display_confusion_matrix(true_labels=true_labels, predicted_labels=predicted_labels,
                             classes=classes)
def plot_model_decision_surface(clf, train_features, train_labels,
                                plot_step=0.02, cmap=plt.cm.RdYlBu,
                                markers=None, alphas=None, colors=None):
    if train_features.shape[1] != 2:
        raise ValueError("X_train should have exactly 2 columns!")
    x_min, x_max = train_features[:, 0].min() - plot_step, train_features[:, 0].max() + plot_step
    y_min, y_max = train_features[:, 1].min() - plot_step, train_features[:, 1].max() + plot_step
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    # clone() returns an unfitted copy, so fit it before predicting on the grid
    clf_est = clone(clf)
    clf_est.fit(train_features, train_labels)
    if hasattr(clf_est, 'predict_proba'):
        Z = clf_est.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]
    else:
        Z = clf_est.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=cmap)
    le = LabelEncoder()
    y_enc = le.fit_transform(train_labels)
    n_classes = len(le.classes_)
    plot_colors = ''.join(colors) if colors else [None] * n_classes
    label_names = le.classes_
    markers = markers if markers else [None] * n_classes
    alphas = alphas if alphas else [None] * n_classes
    # Overlay the training points, one scatter call per class
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_enc == i)
        plt.scatter(train_features[idx, 0], train_features[idx, 1], c=color,
                    label=label_names[i], cmap=cmap, edgecolors='black',
                    marker=markers[i], alpha=alphas[i])
    plt.legend()
    plt.show()
def plot_model_roc_curve(clf, features, true_labels, label_encoder=None, class_names=None):
    ## Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    if hasattr(clf, 'classes_'):
        class_labels = clf.classes_
    elif label_encoder:
        class_labels = label_encoder.classes_
    elif class_names:
        class_labels = class_names
    else:
        raise ValueError('Unable to derive prediction classes, please specify class_names!')
    n_classes = len(class_labels)
    y_test = label_binarize(true_labels, classes=class_labels)
    if n_classes == 2:
        if hasattr(clf, 'predict_proba'):
            prob = clf.predict_proba(features)
            y_score = prob[:, prob.shape[1]-1]
        elif hasattr(clf, 'decision_function'):
            # In the binary case decision_function returns a 1-D array of scores
            y_score = clf.decision_function(features)
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
        fpr, tpr, _ = roc_curve(y_test, y_score)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label='ROC curve (area = {0:0.2f})'
                                 ''.format(roc_auc))
    elif n_classes > 2:
        if hasattr(clf, 'predict_proba'):
            y_score = clf.predict_proba(features)
        elif hasattr(clf, 'decision_function'):
            y_score = clf.decision_function(features)
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])
        ## Compute micro-average ROC curve and ROC area
        fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
        roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
        ## Compute macro-average ROC curve and ROC area
        # First aggregate all false positive rates
        all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
        # Then interpolate all ROC curves at these points
        mean_tpr = np.zeros_like(all_fpr)
        for i in range(n_classes):
            mean_tpr += interp(all_fpr, fpr[i], tpr[i])
        # Finally average it and compute AUC
        mean_tpr /= n_classes
        fpr["macro"] = all_fpr
        tpr["macro"] = mean_tpr
        roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
        ## Plot ROC curves
        plt.figure(figsize=(6, 4))
        plt.plot(fpr["micro"], tpr["micro"],
                 label='micro-average ROC curve (area = {0:0.2f})'
                       ''.format(roc_auc["micro"]), linewidth=3)
        plt.plot(fpr["macro"], tpr["macro"],
                 label='macro-average ROC curve (area = {0:0.2f})'
                       ''.format(roc_auc["macro"]), linewidth=3)
        for i, label in enumerate(class_labels):
            plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                           ''.format(label, roc_auc[i]),
                     linewidth=2, linestyle=':')
    else:
        raise ValueError('Number of classes should be at least 2')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
from sklearn.linear_model import SGDClassifier, LogisticRegression
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
sgd = SGDClassifier(loss='hinge')
# Logistic Regression model on BOW features
lr_bow_predictions = train_predict_model(classifier=lr,
train_features=cv_train_features, train_labels=sentiments_train,
test_features=cv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=lr_bow_predictions,
                                  classes=[0,1,2,3,4])
Model Performance metrics:
Accuracy: 0.6369
Precision: 0.6177
Recall: 0.6369
F1 Score: 0.6132
Model Classification report:
precision recall f1-score support
0 0.55 0.28 0.37 1426
1 0.53 0.36 0.43 5428
2 0.68 0.87 0.77 15995
3 0.57 0.45 0.50 6603
4 0.56 0.34 0.42 1760
accuracy 0.64 31212
macro avg 0.58 0.46 0.50 31212
weighted avg 0.62 0.64 0.61 31212
Prediction Confusion Matrix:
0 1 2 3 4
Actual: 0 393 626 349 53 5
1 251 1967 2936 255 19
2 57 862 13982 1031 63
3 15 236 3023 2941 388
4 1 23 253 888 595
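The reported numbers can be read straight off the confusion matrix: accuracy is the diagonal sum over the total, recall divides each diagonal entry by its row sum, and precision divides it by its column sum. The sketch below copies the matrix printed above (rows are actual classes, columns are predicted classes) and recomputes them in plain Python.

```python
# Confusion matrix of Logistic Regression on BOW features, copied from above
# Rows: actual class 0..4, columns: predicted class 0..4
cm = [
    [393, 626, 349, 53, 5],
    [251, 1967, 2936, 255, 19],
    [57, 862, 13982, 1031, 63],
    [15, 236, 3023, 2941, 388],
    [1, 23, 253, 888, 595],
]
n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))

accuracy = correct / total
# Recall: correct predictions of class i over all actual members of class i
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions of class i over everything predicted as i
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]

print(round(accuracy, 4))                 # 0.6369
print([round(r, 2) for r in recall])      # [0.28, 0.36, 0.87, 0.45, 0.34]
print([round(p, 2) for p in precision])   # [0.55, 0.53, 0.68, 0.57, 0.56]
```

The low recall for classes 0 and 4 shows that the model pushes many extreme reviews toward the neighbouring or neutral classes.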
# Logistic Regression model on TF-IDF features
lr_tfidf_predictions = train_predict_model(classifier=lr,
train_features=tv_train_features, train_labels=sentiments_train,
test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=lr_tfidf_predictions,
                                  classes=[0,1,2,3,4])
Model Performance metrics:
Accuracy: 0.6455
Precision: 0.6314
Recall: 0.6455
F1 Score: 0.6189
Model Classification report:
precision recall f1-score support
0 0.60 0.22 0.32 1426
1 0.56 0.38 0.45 5428
2 0.67 0.89 0.77 15995
3 0.60 0.47 0.53 6603
4 0.60 0.29 0.39 1760
accuracy 0.65 31212
macro avg 0.61 0.45 0.49 31212
weighted avg 0.63 0.65 0.62 31212
Prediction Confusion Matrix:
0 1 2 3 4
Actual: 0 312 681 408 22 3
1 177 2051 3066 125 9
2 29 793 14193 944 36
3 2 109 3115 3088 289
4 0 9 281 966 504
# SGD model on Countvectorizer
sgd_bow_predictions = train_predict_model(classifier=sgd,
train_features=cv_train_features, train_labels=sentiments_train,
test_features=cv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=sgd_bow_predictions,
                                  classes=[0,1,2,3,4])
Model Performance metrics:
Accuracy: 0.5988
Precision: 0.5776
Recall: 0.5988
F1 Score: 0.5455
Model Classification report:
precision recall f1-score support
0 0.52 0.23 0.32 1426
1 0.54 0.19 0.28 5428
2 0.62 0.93 0.74 15995
3 0.54 0.30 0.38 6603
4 0.52 0.29 0.37 1760
accuracy 0.60 31212
macro avg 0.55 0.39 0.42 31212
weighted avg 0.58 0.60 0.55 31212
Prediction Confusion Matrix:
0 1 2 3 4
Actual: 0 332 392 646 49 7
1 234 1025 3909 230 30
2 56 371 14874 637 57
3 18 106 4156 1956 367
4 4 15 502 735 504
# SGD model on TF-IDF
sgd_tfidf_predictions = train_predict_model(classifier=sgd,
train_features=tv_train_features, train_labels=sentiments_train,
test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=sgd_tfidf_predictions,
                                  classes=[0,1,2,3,4])
Model Performance metrics:
Accuracy: 0.5594
Precision: 0.5543
Recall: 0.5594
F1 Score: 0.4666
Model Classification report:
precision recall f1-score support
0 0.60 0.11 0.18 1426
1 0.52 0.09 0.16 5428
2 0.56 0.97 0.71 15995
3 0.55 0.16 0.25 6603
4 0.59 0.15 0.24 1760
accuracy 0.56 31212
macro avg 0.56 0.30 0.31 31212
weighted avg 0.55 0.56 0.47 31212
Prediction Confusion Matrix:
0 1 2 3 4
Actual: 0 152 241 1020 13 0
1 83 512 4759 67 7
2 17 193 15447 315 23
3 2 38 5328 1085 150
4 0 2 993 502 263
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_jobs=-1)
# RandomForest model on TF-IDF
rfc_tfidf_predictions = train_predict_model(classifier=rfc,
train_features=tv_train_features, train_labels=sentiments_train,
test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=rfc_tfidf_predictions,
                                  classes=[0,1,2,3,4])
Model Performance metrics:
Accuracy: 0.6423
Precision: 0.6267
Recall: 0.6423
F1 Score: 0.6274
Model Classification report:
precision recall f1-score support
0 0.47 0.36 0.41 1426
1 0.56 0.42 0.48 5428
2 0.70 0.84 0.76 15995
3 0.58 0.46 0.51 6603
4 0.50 0.40 0.45 1760
accuracy 0.64 31212
macro avg 0.56 0.50 0.52 31212
weighted avg 0.63 0.64 0.63 31212
Prediction Confusion Matrix:
0 1 2 3 4
Actual: 0 520 605 283 17 1
1 465 2281 2539 133 10
2 101 1094 13479 1258 63
3 8 115 2793 3057 630
4 2 6 217 825 710
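To compare the five runs side by side, the accuracy and weighted F1 scores printed above can be collected in a small table. This sketch only re-uses the reported numbers; the dictionary keys are labels chosen here for readability.

```python
# Accuracy and weighted F1 copied from the outputs above
results = {
    "LogisticRegression + BOW":    {"accuracy": 0.6369, "f1": 0.6132},
    "LogisticRegression + TF-IDF": {"accuracy": 0.6455, "f1": 0.6189},
    "SGDClassifier + BOW":         {"accuracy": 0.5988, "f1": 0.5455},
    "SGDClassifier + TF-IDF":      {"accuracy": 0.5594, "f1": 0.4666},
    "RandomForest + TF-IDF":       {"accuracy": 0.6423, "f1": 0.6274},
}

best_by_accuracy = max(results, key=lambda name: results[name]["accuracy"])
best_by_f1 = max(results, key=lambda name: results[name]["f1"])

# Print the runs from highest to lowest accuracy
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:30s} accuracy={scores['accuracy']:.4f}  f1={scores['f1']:.4f}")
print("Best accuracy:", best_by_accuracy)
print("Best weighted F1:", best_by_f1)
```

Logistic Regression on TF-IDF features has the highest accuracy, while the random forest on TF-IDF features has the highest weighted F1, mostly because it recovers more of the minority classes 0 and 4.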