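[A Xu Machine Learning in Practice] [11] Text Classification in Practice: Classifying Emails with a Naive Bayes Model

The [A Xu Machine Learning in Practice] series introduces machine learning algorithms and models together with hands-on case studies. Likes and follows are welcome; let's learn and share together.

This article shows how to classify emails with a naive Bayes model. For the theory behind naive Bayes and its variants, see my previous article, "[A Xu Machine Learning in Practice] [10] Naive Bayes Principles and a Comparison of Three Models: Gaussian, Multinomial, and Bernoulli Naive Bayes".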

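Text classification in practice

Reading the text data

import pandas as pd
# sep specifies the field separator used in the file (a tab here);
# header=None because the file has no header row
sms = pd.read_csv("../data/SMSSpamCollection",sep="\t",header=None)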

sms
0 1
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...
10 ham I'm gonna be home soon and i don't want to tal...
11 spam SIX chances to win CASH! From 100 to 20,000 po...
12 spam URGENT! You have won a 1 week FREE membership ...
13 ham I've been searching for the right words to tha...
14 ham I HAVE A DATE ON SUNDAY WITH WILL!!
15 spam XXXMobileMovieClub: To use your credit, click ...
16 ham Oh k...i'm watching here:)
17 ham Eh u remember how 2 spell his name... Yes i di...
18 ham Fine if that’s the way u feel. That’s the way ...
19 spam England v Macedonia - dont miss the goals/team...
20 ham Is that seriously how you spell his name?
21 ham I‘m going to try for 2 months ha ha only joking
22 ham So ü pay first lar... Then when is da stock co...
23 ham Aft i finish my lunch then i go str down lor. ...
24 ham Ffffffffff. Alright no way I can meet up with ...
25 ham Just forced myself to eat a slice. I'm really ...
26 ham Lol your always so convincing.
27 ham Did you catch the bus ? Are you frying an egg ...
28 ham I'm back & we're packing the car now, I'll...
29 ham Ahhh. Work. I vaguely remember that! What does...
... ... ...
5542 ham Armand says get your ass over to epsilon
5543 ham U still havent got urself a jacket ah?
5544 ham I'm taking derek & taylor to walmart, if I...
5545 ham Hi its in durban are you still on this number
5546 ham Ic. There are a lotta childporn cars then.
5547 spam Had your contract mobile 11 Mnths? Latest Moto...
5548 ham No, I was trying it all weekend ;V
5549 ham You know, wot people wear. T shirts, jumpers, ...
5550 ham Cool, what time you think you can get here?
5551 ham Wen did you get so spiritual and deep. That's ...
5552 ham Have a safe trip to Nigeria. Wish you happines...
5553 ham Hahaha..use your brain dear
5554 ham Well keep in mind I've only got enough gas for...
5555 ham Yeh. Indians was nice. Tho it did kane me off ...
5556 ham Yes i have. So that's why u texted. Pshew...mi...
5557 ham No. I meant the calculation is the same. That ...
5558 ham Sorry, I'll call later
5559 ham if you aren't here in the next <#> hou...
5560 ham Anything lor. Juz both of us lor.
5561 ham Get me out of this dump heap. My mom decided t...
5562 ham Ok lor... Sony ericsson salesman... I ask shuh...
5563 ham Ard 6 like dat lor.
5564 ham Why don't you wait 'til at least wednesday to ...
5565 ham Huh y lei...
5566 spam REMINDER FROM O2: To get 2.50 pounds free call...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns
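The read step above can be tried without the data file. The following is a minimal sketch that parses a hypothetical three-row sample in the same tab-separated layout; the column names `label` and `text` are assumptions added for readability and are not used elsewhere in this article.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as the real file
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! You have won a prize\n"
    "ham\tSorry, I'll call later\n"
)

# names= gives the columns readable labels instead of the default 0 and 1
sms_sample = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=["label", "text"]
)

# Count how many messages fall into each class
label_counts = sms_sample["label"].value_counts()
print(label_counts)
```

Running `value_counts()` on the label column of the real table is a quick way to see how unbalanced the ham and spam classes are before training.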

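Extracting features and labels

# Column 1 holds the message text; double brackets keep it as a DataFrame
data = sms[[1]]
# Column 0 holds the labels ("ham" or "spam"); a 1-D Series is the shape scikit-learn expects for labels
target = sms[0]
data.shape
(5572, 1)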
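The shape `(5572, 1)` comes from indexing with a list of column names. A small sketch of the difference between the two indexing styles, using a hypothetical three-row table named `sms_demo`:

```python
import pandas as pd

# Hypothetical miniature version of the sms table (column 0 = label, column 1 = text)
sms_demo = pd.DataFrame(
    {0: ["ham", "spam", "ham"], 1: ["see you soon", "win cash now", "call me later"]}
)

as_frame = sms_demo[[1]]   # list of column names -> DataFrame, shape (3, 1)
as_series = sms_demo[1]    # single column name   -> Series, shape (3,)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

A text vectorizer iterates over individual strings, which is why the code in this article passes `data[1]` (a Series of messages) rather than the one-column DataFrame itself.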

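Converting text to a sparse matrix

For text data, the words in each string are usually converted to floating-point weights and stored as a sparse matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
# This model turns a collection of strings into a sparse matrix of TF-IDF weights
tf = TfidfVectorizer()
# Fit: let tf learn which words occur in the messages (the vocabulary)
tf.fit(data[1])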
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
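# Transform: convert the 5572 messages into a 5572 x N sparse matrix
data = tf.transform(data[1])
data
# The result is a 5572 x 8713 sparse matrix, so the 5572 messages contain 8713 distinct words
<5572x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 74169 stored elements in Compressed Sparse Row format>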
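To see what fitting and transforming do on a scale that can be checked by hand, here is a minimal sketch on a hypothetical three-sentence corpus. It uses the default `TfidfVectorizer` settings, under which each non-empty row is L2-normalized and words not seen during fitting are ignored at transform time.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with 10 distinct words in total
corpus = ["free entry to win cash", "ok see you soon", "win free cash now"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # fit and transform in one step

# One column per distinct word learned during fit
vocab_size = len(vectorizer.vocabulary_)
print(X.shape, vocab_size)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)

# Words that were not in the fitted vocabulary produce an all-zero row
unseen = vectorizer.transform(["hello world"])
print(unseen.nnz)
```

The same rule explains the matrix above: one row per message and one column per distinct token found while fitting.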

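Training the model

from sklearn.naive_bayes import BernoulliNB
# b_NB is taken to be a Bernoulli naive Bayes classifier, as its name suggests
b_NB = BernoulliNB()
b_NB.fit(data,target)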
message = ["Confidence doesn't need any specific reason. If you're alive , you should feel 100 percent confident.",
           "Avis is only NO.2 in rent a cars.SO why go with us?We try harder.",
           "SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"
          ]

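Prediction

# Convert message into a sparse matrix with the already fitted vectorizer
x_test = tf.transform(message)
b_NB.predict(x_test)
array(['ham', 'ham', 'spam'],
      dtype='<U4')
# Accuracy measured on the training data itself
b_NB.score(data,target)
0.98815506101938266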
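The score above is computed on the same messages the model was trained on, so it tends to be optimistic. A minimal sketch of a held-out evaluation follows; the helper name `holdout_accuracy`, the split ratio, the seed, and the toy messages are all assumptions made for illustration, not part of the original workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline


def holdout_accuracy(texts, labels, test_size=0.25, seed=0):
    """Fit TF-IDF + BernoulliNB on a training split and score on the held-out rest."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # The pipeline fits the vectorizer on the training texts only,
    # so no vocabulary information leaks in from the test split.
    model = make_pipeline(TfidfVectorizer(), BernoulliNB())
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)


# Hypothetical toy messages; with the real data, pass sms[1] and sms[0] instead
toy_texts = [
    "see you at lunch", "call me when you are home", "ok see you soon",
    "are you coming home tonight", "sorry i will call later", "lunch at home today",
    "win free cash now", "free entry to win a prize", "urgent you have won cash",
    "claim your free prize now", "win cash prize call now", "free membership claim now",
]
toy_labels = ["ham"] * 6 + ["spam"] * 6

acc = holdout_accuracy(toy_texts, toy_labels)
print(acc)
```

On a dataset this small the number itself means little; the point is the pattern of fitting on one split and scoring on another.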

Using multinomial naive Bayes

from sklearn.naive_bayes import MultinomialNB
m_NB = MultinomialNB()
m_NB.fit(data,target)
m_NB.score(data,target)
0.97613065326633164
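One difference that may contribute to the gap between the two scores: scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), so it only sees whether a word occurs, while `MultinomialNB` works with the TF-IDF weights themselves. The sketch below checks the first half of that statement on hypothetical toy texts by comparing a model fitted on TF-IDF weights with one fitted on an explicit 0/1 matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data
texts = ["win cash now", "free cash prize", "see you soon", "call me later"]
labels = ["spam", "spam", "ham", "ham"]

X = TfidfVectorizer().fit_transform(texts)
# 1.0 wherever a word occurs, regardless of its TF-IDF weight
X_binary = (X > 0).astype(float)

# Default binarize=0.0: values above 0 are mapped to 1 inside the model
nb_weighted = BernoulliNB().fit(X, labels)
nb_binary = BernoulliNB().fit(X_binary, labels)

same_probabilities = np.allclose(
    nb_weighted.predict_proba(X), nb_binary.predict_proba(X_binary)
)
print(same_probabilities)
```

If the two models agree, the TF-IDF weights carry no extra information for the Bernoulli model beyond word presence.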

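Using Gaussian naive Bayes

from sklearn.naive_bayes import GaussianNB
g_NB = GaussianNB()
# GaussianNB does not accept sparse input, so convert to a dense array first
g_NB.fit(data.toarray(),target)
g_NB.score(data.toarray(),target)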
0.94149318018664752
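Calling `toarray()` materializes every zero in the matrix. A rough sketch of the memory cost, assuming 8-byte float64 values; the helper `dense_bytes` and the random demo matrix are hypothetical and only there for comparison.

```python
import numpy as np
from scipy import sparse


def dense_bytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed to store an n_rows x n_cols matrix densely."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


# Shape reported earlier for the TF-IDF matrix
full_size = dense_bytes(5572, 8713)
print(f"{full_size / 1e6:.0f} MB as a dense float64 array")

# A small random sparse matrix with hypothetical shape and density, for comparison
demo = sparse.random(1000, 2000, density=0.002, format="csr", random_state=0)
sparse_size = demo.data.nbytes + demo.indices.nbytes + demo.indptr.nbytes
print(sparse_size, "bytes sparse vs", dense_bytes(*demo.shape), "bytes dense")
```

For the 5572 x 8713 matrix the dense copy comes to roughly 388 MB, and the code above creates it twice (once for `fit`, once for `score`), so storing `data.toarray()` in a variable and reusing it is a reasonable precaution on larger corpora.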

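If this article helped you, please remember to like and follow!

You are welcome to follow my official account, 阿旭算法与机器学习, so we can learn and share together.
More practical content is on the way…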
