Probabilistic Graphical Models 1: Naive Bayes for Spam SMS Classification
- 1. Loading the data
- 2. Word vectors
- 3. TF-IDF transformation
- 4. Splitting the dataset
- 5. Modeling
- 6. Prediction
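Spam SMS classification project:
- (1) Load the data
- (2) Build word vectors
- (3) Compute term statistics with TF-IDF; the model uses these weights to decide the class, that is, whether a message is spam
- (4) Build the model
- (5) Predict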
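The steps above can also be chained into a single scikit-learn pipeline. This is only a minimal sketch on a hypothetical four-message toy corpus (not taken from messages.csv); the name `spam_pipeline` and the choice of `MultinomialNB` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from messages.csv.
train_texts = [
    "free prize call now",
    "win cash prize text now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# TfidfVectorizer covers steps (2) and (3); MultinomialNB covers step (4).
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Step (5): raw strings go in, predicted labels come out.
new_texts = ["call now to win a free prize", "see you at home"]
predictions = spam_pipeline.predict(new_texts)
print(predictions)
```

A pipeline keeps the fitted vocabulary and the classifier together, so new messages cannot accidentally be vectorized with a different vocabulary.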
1. Loading the data
import pandas as pd
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Tab-separated file without a header row: column 0 is the label, column 1 is the message text
messages = pd.read_csv('./data/messages.csv', sep='\t', header=None)
messages
# Give the two columns descriptive names
messages.rename({0: 'label', 1: 'message'}, axis=1, inplace=True)
messages
![Probabilistic Graphical Models 1: Naive Bayes for Spam SMS Classification, image 1](http://img.e-com-net.com/image/info8/abc4212308fd45be8f36f6f452a55fb0.png)
y = messages['label']
y
0 ham
1 ham
...
5570 ham
5571 ham
Name: label, Length: 5572, dtype: object
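2. Word vectors
# Build the vocabulary and count how often each word occurs in each message (bag of words)
cv = CountVectorizer()
X = cv.fit_transform(messages['message'])
X
<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>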
5572*8713
48548836
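The product above is the number of cells a dense version of the matrix would have. A short sketch of what that means for sparsity and memory, using only the numbers printed in the outputs above; the 8 bytes per cell assumes an int64 or float64 dense array.

```python
# Figures copied from the sparse-matrix repr above.
n_rows, n_cols = 5572, 8713
stored = 74169

total_cells = n_rows * n_cols           # cells in a dense matrix
density = stored / total_cells          # fraction of cells that are nonzero
avg_terms_per_message = stored / n_rows # distinct vocabulary terms per message

# Assumption: 8 bytes per cell (int64 or float64) if the matrix were dense.
dense_bytes = total_cells * 8

print(f"density: {density:.4%}")
print(f"average distinct terms per message: {avg_terms_per_message:.1f}")
print(f"dense size: {dense_bytes / 2**20:.1f} MiB")
```

Roughly 0.15% of the cells are nonzero, about 13 distinct terms per message, and a dense copy would take about 370 MiB, which is why the sparse format is kept wherever the estimator allows it.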
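3. TF-IDF transformation
from sklearn.feature_extraction.text import TfidfVectorizer
# Option 1: reweight the existing count matrix X
tf_idf = TfidfTransformer()
X2 = tf_idf.fit_transform(X)
# Option 2: TfidfVectorizer goes from raw text to TF-IDF in one step (CountVectorizer followed by TfidfTransformer)
tf_idf2 = TfidfVectorizer()
X3 = tf_idf2.fit_transform(messages['message'])
X3
<5572x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 74169 stored elements in Compressed Sparse Row format>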
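X3 has the same shape and the same number of stored elements as the count matrix because the two routes are meant to be equivalent. A minimal check on a hypothetical toy corpus, assuming default parameters for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, not taken from messages.csv.
docs = [
    "free prize call now",
    "call me when you are home",
    "win a free cinema pass",
]

# Route 1: counts first, then TF-IDF reweighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: raw text straight to TF-IDF.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

With non-default options (for example a different `norm` or `use_idf`), the two routes only agree if both are configured the same way.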
4. Splitting the dataset
# Default split: 75% training, 25% test; no random_state is set, so the split differs between runs
X_train, X_test, y_train, y_test = train_test_split(X2, y)
display(X_train, X_test)
<4179x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 55087 stored elements in Compressed Sparse Row format>
<1393x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 19082 stored elements in Compressed Sparse Row format>
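Because the split above is random, the scores reported later will vary from run to run. A possible variation, sketched here on hypothetical labels, fixes `random_state` and passes `stratify` so that the ham/spam ratio is the same in both parts; the 16/4 class counts are invented for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 16 ham, 4 spam.
labels = ["ham"] * 16 + ["spam"] * 4
texts = [f"message {i}" for i in range(len(labels))]

texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts,
    labels,
    test_size=0.25,    # same proportion as the default used above
    random_state=0,    # fixed seed, so the split is reproducible
    stratify=labels,   # keep the class ratio equal in train and test
)

print(Counter(labels_train), Counter(labels_test))
```

On the real data this would correspond to `train_test_split(X2, y, random_state=..., stratify=y)`.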
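5. Modeling
%%time
# GaussianNB does not accept sparse input, so both matrices are converted to dense arrays
gNB = GaussianNB()
gNB.fit(X_train.toarray(), y_train)
gNB.score(X_test.toarray(), y_test)
Wall time: 7.04 s
0.8994974874371859
%%time
# BernoulliNB models word presence/absence; with the default binarize=0.0, any nonzero TF-IDF weight counts as "present"
bNB = BernoulliNB()
bNB.fit(X_train, y_train)
bNB.score(X_test, y_test)
Wall time: 386 ms
0.9806173725771715
%%time
# MultinomialNB is designed for count-like features; fractional TF-IDF weights also work in practice
mNB = MultinomialNB()
mNB.fit(X_train, y_train)
mNB.score(X_test, y_test)
Wall time: 39 ms
0.95908111988514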
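One detail about the BernoulliNB run: with its default `binarize=0.0`, the estimator thresholds its input, so every nonzero TF-IDF weight becomes 1. The sketch below, on a hypothetical toy corpus, suggests that BernoulliNB therefore learns the same parameters from raw counts as from TF-IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data, not taken from messages.csv.
texts = [
    "free prize call now",
    "win cash prize text now now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

counts = CountVectorizer().fit_transform(texts)
tfidf = TfidfTransformer().fit_transform(counts)

# Both models binarize their input at 0.0 before fitting.
nb_counts = BernoulliNB().fit(counts, labels)
nb_tfidf = BernoulliNB().fit(tfidf, labels)

same_params = np.allclose(nb_counts.feature_log_prob_, nb_tfidf.feature_log_prob_)
print(same_params)
```

In other words, the TF-IDF step matters for GaussianNB and MultinomialNB, while BernoulliNB only sees which words occur.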
6. Prediction
X_test = ['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
'Precious things are very few in the world,that is the reason there is only one you',
"GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
X_test
['Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify.I see the letter B on my car Please call now 08000930705 for delivery tomorrow',
'Precious things are very few in the world,that is the reason there is only one you',
"GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. U don't know how stubborn I am. Congrats! 1 year special cinema pass for 2 is yours.",
'Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!']
# Reuse the fitted CountVectorizer and TfidfTransformer: call transform, not fit_transform
X_test_tf_idf = tf_idf.transform(cv.transform(X_test))
X_test_tf_idf
<4x8713 sparse matrix of type '<class 'numpy.float64'>'
	with 94 stored elements in Compressed Sparse Row format>
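The new matrix has 8713 columns because `transform` maps text onto the vocabulary learned during fitting; words that were never seen are dropped. A small sketch with a hypothetical two-message training set (`cv_demo` is an illustrative name):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not taken from messages.csv.
cv_demo = CountVectorizer()
train_counts = cv_demo.fit_transform(["free prize call now", "see you at lunch"])

# "voucher" was never seen during fit, so it gets no column.
new_counts = cv_demo.transform(["free lunch voucher"])

vocab_size = len(cv_demo.vocabulary_)
print(train_counts.shape, new_counts.shape, new_counts.nnz)
```

Calling `fit_transform` on the new messages instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.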
bNB.predict(X_test_tf_idf)
array(['spam', 'ham', 'spam', 'spam'], dtype='<U4')
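Besides hard labels, the naive Bayes estimators also expose `predict_proba`, which can help when a borderline message should be reviewed rather than blocked outright. A minimal sketch on a hypothetical toy corpus; the names `vec` and `model` are illustrative, and the probabilities from such a tiny sample are not meaningful in themselves.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data, not taken from messages.csv.
texts = [
    "free prize call now",
    "win cash prize text now",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

vec = TfidfVectorizer()
model = BernoulliNB().fit(vec.fit_transform(texts), labels)

new_texts = ["call now for a free prize", "are you home"]
proba = model.predict_proba(vec.transform(new_texts))

# Columns of proba follow the order of model.classes_.
for text, row in zip(new_texts, proba):
    print(text, dict(zip(model.classes_, np.round(row, 3))))
```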