Study Notes (13): Identifying Spam with TensorFlow

1. Collecting and cleaning the dataset

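     Start with an entry-level spam classification training set such as SpamBase (download: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/). Each row holds 58 comma-separated values: 57 numeric features followed by a final column that flags whether the message is spam.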

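import numpy as np
from sklearn.model_selection import train_test_split


def load_SpamBase(filename):
    """Load spambase.data and return a 60/40 train/test split."""
    x = []
    y = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            v = line.split(',')
            # The last column is the spam flag (1 = spam, 0 = not spam).
            y.append(int(v[-1]))
            # The first 57 columns are the numeric features.
            x.append([float(v[i]) for i in range(57)])

    x = np.array(x)
    y = np.array(y)
    print(x.shape)
    print(y.shape)

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.4, random_state=0)
    print(x_train.shape)
    print(x_test.shape)
    return x_train, x_test, y_train, y_test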
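
The row layout described above (57 features, then one 0/1 label) can be checked without the real file. What follows is only a minimal sketch using the Python 3 standard library; `parse_spambase_row`, `parse_spambase_text`, and the two-row sample are hypothetical and are not part of the original notes.

```python
import csv
import io

N_FEATURES = 57  # Spambase rows: 57 numeric features followed by a 0/1 label


def parse_spambase_row(fields):
    """Split one CSV row into (features, label); hypothetical helper."""
    if len(fields) != N_FEATURES + 1:
        raise ValueError(
            "expected %d fields, got %d" % (N_FEATURES + 1, len(fields)))
    features = [float(v) for v in fields[:N_FEATURES]]
    label = int(fields[-1])
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1, got %r" % label)
    return features, label


def parse_spambase_text(text):
    """Parse comma-separated text, skipping blank lines."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    parsed = [parse_spambase_row(r) for r in rows]
    x = [features for features, _ in parsed]
    y = [label for _, label in parsed]
    return x, y


# Synthetic two-row sample (made-up values, not taken from the real data set).
sample = "\n".join([
    ",".join(["0.5"] * N_FEATURES + ["1"]),
    ",".join(["0"] * N_FEATURES + ["0"]),
])
x, y = parse_spambase_text(sample)
print(len(x), len(x[0]), y)
```

Validating the field count up front means a truncated or malformed row fails loudly instead of silently producing a short feature vector.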



2. Training with naive Bayes and a DNN

import tensorflow as tf  # assumes TensorFlow 1.x, which still ships tf.contrib.learn
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB


def main(unused_argv):
    x_train, x_test, y_train, y_test = load_SpamBase("../data/spambase/spambase.data")

    # Baseline: Gaussian naive Bayes.
    gnb = GaussianNB()
    y_predict = gnb.fit(x_train, y_train).predict(x_test)
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))

    # DNN: two hidden layers with 30 and 10 units, binary output.
    feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(x_train)
    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns, hidden_units=[30, 10], n_classes=2)
    classifier.fit(x_train, y_train, steps=500, batch_size=10)
    # predict(..., as_iterable=True) returns a generator, so collect it into a list.
    y_predict = list(classifier.predict(x_test, as_iterable=True))
    score = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {0:f}'.format(score))


if __name__ == '__main__':
    tf.app.run()
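
Both models are scored with `metrics.accuracy_score`, which reports the fraction of predictions that match the true labels. The sketch below re-implements that calculation in plain Python 3 to make the number concrete; the `accuracy` helper and the five labels are made up for illustration and stand in for the real test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("length mismatch")
    if not y_true:
        raise ValueError("empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0]   # made-up labels
y_pred = [1, 0, 0, 1, 1]   # made-up predictions
print('Accuracy: {0:f}'.format(accuracy(y_true, y_pred)))  # Accuracy: 0.600000
```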


The DNN is configured with two hidden layers of 30 and 10 nodes. Training runs for 500 steps, with 10 training samples per batch.
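
To put those settings in perspective, the arithmetic below estimates how many passes over the training set 500 steps of 10 samples amount to. It is a rough sketch in Python 3: the row count of 4601 is an assumption taken from the UCI description of Spambase rather than from these notes, while the other values come from the calls in the code above.

```python
import math

n_samples = 4601      # assumed row count of spambase.data (per the UCI description)
test_size = 0.4       # from the train_test_split call
steps = 500           # from classifier.fit
batch_size = 10       # from classifier.fit

n_test = math.ceil(n_samples * test_size)   # scikit-learn rounds the test share up
n_train = n_samples - n_test
examples_seen = steps * batch_size
epochs = examples_seen / n_train

print(n_train, n_test, examples_seen, round(epochs, 2))  # 2760 1841 5000 1.81
```

Under these assumptions the network sees each training row fewer than two times on average, so the short training run may be one reason the accuracy stays modest.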

3. Results

[Image 1: screenshot of the program output]

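(Results from a run on the server. The second- and third-to-last lines of the output show that the DNN does better than naive Bayes, although its accuracy is still not high.)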
