1. Collecting and cleaning the dataset
Pick an entry-level spam-classification dataset such as SpamBase (download: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/). Each record has 58 comma-separated fields: the first 57 are numeric features and the last one is the spam label (1 = spam, 0 = not spam).
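The file can be fetched once with the Python standard library. Below is a minimal download sketch; the local path ../data/spambase/spambase.data is only an assumption chosen to match the training code that follows.

import os
import urllib.request

SPAMBASE_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

def fetch_spambase(dest="../data/spambase/spambase.data"):
    # Download spambase.data once and cache it locally.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(SPAMBASE_URL, dest)
    return dest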
import numpy as np
from sklearn.model_selection import train_test_split

def load_SpamBase(filename):
    # Each line: 57 comma-separated feature values followed by the 0/1 spam label.
    x = []
    y = []
    with open(filename) as f:
        for line in f:
            v = line.strip('\n').split(',')
            y.append(int(v[-1]))                          # last field is the spam flag
            x.append(np.array([float(v[i]) for i in range(57)]))
    x = np.array(x)
    y = np.array(y)
    print(x.shape)
    print(y.shape)
    # Hold out 40% of the samples as the test set.
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.4, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
    return x_train, x_test, y_train, y_test
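Since the file is just comma-separated numbers, the hand-written parser above could also be replaced by a single np.loadtxt call. A sketch under the same file-path assumption:

data = np.loadtxt("../data/spambase/spambase.data", delimiter=",")
x, y = data[:, :57], data[:, 57].astype(int)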
2. Training with Naive Bayes and a DNN
import tensorflow as tf
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

def main(unused_argv):
    x_train, x_test, y_train, y_test = load_SpamBase("../data/spambase/spambase.data")

    # Baseline: Gaussian Naive Bayes.
    gnb = GaussianNB()
    y_predict = gnb.fit(x_train, y_train).predict(x_test)
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))

    # DNN via tf.contrib.learn (TensorFlow 1.x API).
    feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x_train)
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns, hidden_units=[30, 10], n_classes=2)
    classifier.fit(x_train, y_train, steps=500, batch_size=10)
    y_predict = list(classifier.predict(x_test, as_iterable=True))
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))
The DNN is configured with two hidden layers of 30 and 10 nodes respectively, and is trained for 500 steps with a batch size of 10.
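tf.contrib.learn has since been removed in TensorFlow 2.x; on a newer TensorFlow the same 30/10 architecture can be expressed with tf.keras. This is only a rough sketch: the optimizer, activations, and epoch count are my assumptions, not part of the original setup.

import tensorflow as tf

def build_dnn(input_dim=57):
    # Two hidden layers with 30 and 10 units, mirroring the DNNClassifier above,
    # plus a sigmoid output for the binary spam/ham decision.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(30, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# model = build_dnn()
# model.fit(x_train, y_train, batch_size=10, epochs=5)
# print(model.evaluate(x_test, y_test))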
3. Results
(Results from running on the server: the second- and third-to-last lines of the output show that the DNN does better than Naive Bayes, but the accuracy is still not high.)