Spam SMS Detection with Logistic Regression


Code:

# -*- coding: utf-8 -*-
# Spam SMS detection

# 1. Import the required packages
import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

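# 2. Load the dataset
# Each line holds the label first, then a tab, then the message text
# ham: legitimate (non-spam) message
# spam: spam message
df = pd.read_csv('SMSSpamCollection.txt', delimiter='\t', header=None)  # tab-separated, two columns, no header row
y, X_train = df[0], df[1]  # df[0] (labels) goes to y, df[1] (message text) goes to X_train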

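# 3. Vectorize the text with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X_train)  # learn the vocabulary and IDF weights, return a sparse matrix

# 4. Train the model with logistic regression
lr = linear_model.LogisticRegression()
lr.fit(X, y)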

# 5. Test on new messages (reuse the fitted vectorizer, so call transform rather than fit_transform)
testX = vectorizer.transform(['URGENT! Your mobile No. 1234 was awarded a Prize',
                              'Hey honey, whats up?'])
predictions = lr.predict(testX)
print(predictions)
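
The script above trains on the whole dataset and only spot-checks two hand-written messages, so it gives no measure of accuracy. A minimal sketch of a hold-out evaluation follows. It is not part of the original script: the eight-message toy corpus is made up so the example runs without the dataset file, and the split ratio and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for SMSSpamCollection.txt
texts = [
    "WINNER! Claim your free prize now",
    "URGENT! You have won a cash award, call now",
    "Free entry in a weekly prize draw, text WIN",
    "Congratulations, you won a free mobile, claim now",
    "Are we still meeting for lunch today?",
    "Hey, can you call me when you get home?",
    "I will be late, see you at dinner",
    "Thanks for the help yesterday, talk soon",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out 25% of the messages, keeping the ham/spam ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training messages only,
# so no vocabulary or IDF statistics leak in from the held-out set
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"held-out accuracy: {accuracy:.2f}")
```

With the real data, `texts` and `labels` would be replaced by `df[1]` and `df[0]` from the script above. A score on a toy corpus this small says nothing about real-world performance.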

Dataset location:
https://github.com/lifelong37learner/CV-
