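Text Classification Exercise 2: Classifying news articles by category on a subset of THUCNews

1. Characteristics: Chinese dataset, ten categories

2. Tool: TensorFlow

3. Dataset description and example code: https://github.com/gaussic/text-classification-cnn-rnn

4. Modify run_cnn.py from the example code as follows (run_rnn.py can be changed in the same way) and place the cnews data subset in the data folder. The command-line argument check is commented out and train() and test() are called one after the other, so the script can be run directly in PyCharm (macOS + PyCharm + TensorFlow 1.12.0 + Python 3.6):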

if __name__ == '__main__':
    # Command-line argument check disabled so the script runs without arguments.
    # if len(sys.argv) != 2 or sys.argv[1] not in ['train', 'test']:
    #     raise ValueError("""usage: python run_cnn.py [train / test]""")

    print('Configuring CNN model...')
    config = TCNNConfig()
    if not os.path.exists(vocab_dir):  # rebuild the vocabulary if it does not exist
        build_vocab(train_dir, vocab_dir, config.vocab_size)
    categories, cat_to_id = read_category()
    words, word_to_id = read_vocab(vocab_dir)
    config.vocab_size = len(words)
    model = TextCNN(config)

    # Mode selection disabled; train first, then test, in a single run.
    # if sys.argv[1] == 'train':
    #     train()
    # else:
    #     test()
    train()
    test()
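
An alternative that keeps the original command-line usage is to fall back to running both stages only when no argument is given. The following is a minimal sketch and not part of the original repository: `resolve_modes` is a hypothetical helper name, and it assumes the caller dispatches to the script's own `train()` and `test()` functions based on the returned list.

```python
def resolve_modes(argv):
    """Return the stages to run for a given argument vector.

    Hypothetical helper: with no extra argument (for example a plain
    PyCharm run) both stages are returned; with 'train' or 'test' only
    that stage is returned; anything else is rejected.
    """
    if len(argv) == 1:
        return ['train', 'test']
    if len(argv) == 2 and argv[1] in ('train', 'test'):
        return [argv[1]]
    raise ValueError("usage: python run_cnn.py [train / test]")


if __name__ == '__main__':
    import sys
    # In run_cnn.py this loop would call train() or test() instead of print().
    for mode in resolve_modes(sys.argv[:1]):
        print('would run:', mode)
```

With this approach the script still accepts `python run_cnn.py train` from a terminal, while a run without arguments behaves like the modified version above.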

5. Code output

/Users/gaoxuanxuan/anaconda3/envs/tensorflow/bin/python /Users/gaoxuanxuan/PycharmProjects/NLP/TextClassification/text-classification-cnn-rnn/run_cnn.py
Configuring CNN model...
WARNING:tensorflow:From /Users/gaoxuanxuan/PycharmProjects/NLP/TextClassification/text-classification-cnn-rnn/cnn_model.py:66: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:24
2019-03-03 21:34:01.237824: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:    2.3, Train Acc:   3.12%, Val Loss:    2.3, Val Acc:   9.92%, Time: 0:00:07 *
Iter:    100, Train Loss:   0.86, Train Acc:  78.12%, Val Loss:    1.2, Val Acc:  68.96%, Time: 0:01:17 *
Iter:    200, Train Loss:   0.36, Train Acc:  89.06%, Val Loss:   0.72, Val Acc:  80.48%, Time: 0:02:24 *
Iter:    300, Train Loss:   0.18, Train Acc:  96.88%, Val Loss:   0.43, Val Acc:  89.72%, Time: 0:03:24 *
Iter:    400, Train Loss:   0.14, Train Acc:  96.88%, Val Loss:   0.36, Val Acc:  91.00%, Time: 0:04:22 *
Iter:    500, Train Loss:   0.22, Train Acc:  93.75%, Val Loss:   0.39, Val Acc:  91.16%, Time: 0:05:26 *
Iter:    600, Train Loss:    0.3, Train Acc:  90.62%, Val Loss:   0.33, Val Acc:  92.28%, Time: 0:06:42 *
Iter:    700, Train Loss:   0.11, Train Acc:  95.31%, Val Loss:   0.29, Val Acc:  92.92%, Time: 0:08:45 *
Epoch: 2
Iter:    800, Train Loss:  0.051, Train Acc:  98.44%, Val Loss:   0.29, Val Acc:  92.84%, Time: 0:10:57 
Iter:    900, Train Loss:   0.21, Train Acc:  93.75%, Val Loss:    0.3, Val Acc:  90.86%, Time: 0:12:56 
Iter:   1000, Train Loss:  0.044, Train Acc: 100.00%, Val Loss:   0.29, Val Acc:  91.52%, Time: 0:14:52 
Iter:   1100, Train Loss:   0.13, Train Acc:  98.44%, Val Loss:   0.28, Val Acc:  92.72%, Time: 0:17:00 
Iter:   1200, Train Loss:   0.06, Train Acc:  98.44%, Val Loss:   0.29, Val Acc:  93.14%, Time: 0:19:00 *
Iter:   1300, Train Loss:  0.084, Train Acc:  98.44%, Val Loss:   0.29, Val Acc:  90.76%, Time: 0:21:04 
Iter:   1400, Train Loss:   0.13, Train Acc:  93.75%, Val Loss:   0.19, Val Acc:  94.68%, Time: 0:23:24 *
Iter:   1500, Train Loss:  0.066, Train Acc:  98.44%, Val Loss:   0.21, Val Acc:  94.20%, Time: 0:25:49 
Epoch: 3
Iter:   1600, Train Loss: 0.0086, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  94.92%, Time: 0:27:49 *
Iter:   1700, Train Loss: 0.0068, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  94.60%, Time: 0:46:25 
Iter:   1800, Train Loss:  0.039, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  94.94%, Time: 3:12:36 *
Iter:   1900, Train Loss:  0.043, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  94.70%, Time: 4:55:34 
Iter:   2000, Train Loss: 0.0047, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  94.46%, Time: 7:22:15 
Iter:   2100, Train Loss:  0.015, Train Acc: 100.00%, Val Loss:   0.17, Val Acc:  95.26%, Time: 9:09:48 *
Iter:   2200, Train Loss:   0.13, Train Acc:  96.88%, Val Loss:   0.22, Val Acc:  93.24%, Time: 10:55:41 
Iter:   2300, Train Loss:  0.091, Train Acc:  95.31%, Val Loss:   0.22, Val Acc:  93.14%, Time: 12:44:59 
Epoch: 4
Iter:   2400, Train Loss:    0.1, Train Acc:  96.88%, Val Loss:   0.21, Val Acc:  94.24%, Time: 13:53:55 
Iter:   2500, Train Loss:  0.021, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  94.96%, Time: 15:43:02 
Iter:   2600, Train Loss:  0.012, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  95.00%, Time: 18:03:03 
Iter:   2700, Train Loss:  0.037, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.18%, Time: 19:59:22 
Iter:   2800, Train Loss:  0.041, Train Acc:  98.44%, Val Loss:    0.2, Val Acc:  94.46%, Time: 22:22:35 
Iter:   2900, Train Loss:  0.022, Train Acc: 100.00%, Val Loss:   0.22, Val Acc:  94.10%, Time: 1 day, 0:11:13 
Iter:   3000, Train Loss:  0.051, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.42%, Time: 1 day, 2:37:24 *
Iter:   3100, Train Loss:   0.14, Train Acc:  96.88%, Val Loss:   0.22, Val Acc:  94.08%, Time: 1 day, 4:24:21 
Epoch: 5
Iter:   3200, Train Loss: 0.0027, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  95.60%, Time: 1 day, 6:07:38 *
Iter:   3300, Train Loss:  0.001, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.16%, Time: 1 day, 8:19:46 
Iter:   3400, Train Loss: 0.0047, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  95.36%, Time: 1 day, 10:04:01 
Iter:   3500, Train Loss: 0.0057, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.18%, Time: 1 day, 12:14:28 
Iter:   3600, Train Loss:  0.011, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  95.26%, Time: 1 day, 13:10:23 
Iter:   3700, Train Loss:  0.076, Train Acc:  98.44%, Val Loss:    0.2, Val Acc:  94.50%, Time: 1 day, 13:12:36 
Iter:   3800, Train Loss: 0.0061, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.64%, Time: 1 day, 13:14:34 *
Iter:   3900, Train Loss:  0.014, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  94.86%, Time: 1 day, 13:16:45 
Epoch: 6
Iter:   4000, Train Loss:  0.016, Train Acc:  98.44%, Val Loss:   0.22, Val Acc:  94.34%, Time: 1 day, 13:18:47 
Iter:   4100, Train Loss:  0.034, Train Acc:  96.88%, Val Loss:   0.22, Val Acc:  94.82%, Time: 1 day, 13:20:49 
Iter:   4200, Train Loss: 0.0029, Train Acc: 100.00%, Val Loss:   0.23, Val Acc:  94.72%, Time: 1 day, 13:23:00 
Iter:   4300, Train Loss: 0.0052, Train Acc: 100.00%, Val Loss:   0.16, Val Acc:  96.10%, Time: 1 day, 13:24:04 *
Iter:   4400, Train Loss:  0.025, Train Acc:  98.44%, Val Loss:   0.18, Val Acc:  95.36%, Time: 1 day, 13:25:03 
Iter:   4500, Train Loss: 0.0013, Train Acc: 100.00%, Val Loss:   0.21, Val Acc:  95.06%, Time: 1 day, 13:26:02 
Iter:   4600, Train Loss:  0.028, Train Acc:  98.44%, Val Loss:   0.25, Val Acc:  93.72%, Time: 1 day, 13:26:59 
Epoch: 7
Iter:   4700, Train Loss:  0.014, Train Acc:  98.44%, Val Loss:   0.24, Val Acc:  94.42%, Time: 1 day, 13:27:55 
Iter:   4800, Train Loss: 0.0071, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  95.98%, Time: 1 day, 13:28:51 
Iter:   4900, Train Loss: 0.00074, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  95.42%, Time: 1 day, 13:29:53 
Iter:   5000, Train Loss: 0.00081, Train Acc: 100.00%, Val Loss:   0.18, Val Acc:  95.60%, Time: 1 day, 13:31:02 
Iter:   5100, Train Loss: 0.00093, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.78%, Time: 1 day, 13:32:01 
Iter:   5200, Train Loss:  0.001, Train Acc: 100.00%, Val Loss:   0.22, Val Acc:  94.86%, Time: 1 day, 13:33:02 
Iter:   5300, Train Loss: 0.0074, Train Acc: 100.00%, Val Loss:   0.19, Val Acc:  95.26%, Time: 1 day, 13:34:01 
No optimization for a long time, auto-stopping...
Loading test data...
Testing...
Test Loss:   0.12, Test Acc:  97.08%
Precision, Recall and F1-Score...
              precision    recall  f1-score   support

          体育       1.00      0.99      0.99      1000
          财经       0.96      0.98      0.97      1000
          房产       1.00      1.00      1.00      1000
          家居       0.98      0.91      0.94      1000
          教育       0.95      0.94      0.95      1000
          科技       0.96      0.99      0.98      1000
          时尚       0.96      0.98      0.97      1000
          时政       0.93      0.97      0.95      1000
          游戏       0.99      0.97      0.98      1000
          娱乐       0.98      0.97      0.98      1000

   micro avg       0.97      0.97      0.97     10000
   macro avg       0.97      0.97      0.97     10000
weighted avg       0.97      0.97      0.97     10000

Confusion Matrix...
[[990   0   0   0   2   2   1   4   1   0]
 [  0 985   0   1   1   3   0  10   0   0]
 [  0   0 997   1   1   0   0   1   0   0]
 [  0  20   2 906  17   5  10  36   1   3]
 [  0   6   0   5 945  13   8  18   2   3]
 [  0   3   0   2   0 988   3   1   2   1]
 [  2   0   0   2   4   1 982   0   2   7]
 [  0   9   0   2  13   5   0 970   1   0]
 [  1   1   1   3   7   2   9   3 970   3]
 [  0   2   0   5   3   5   7   0   3 975]]
Time usage: 0:00:30

Process finished with exit code 0
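
The figures in the classification report can be cross-checked against the confusion matrix printed above. The sketch below is not part of the example repository; it assumes rows are true classes and columns are predicted classes, in the same category order as the report. That assumption agrees with the report, since every row sums to its support of 1000.

```python
# Confusion matrix copied from the output above.
# Assumption: rows = true class, columns = predicted class.
labels = ['体育', '财经', '房产', '家居', '教育',
          '科技', '时尚', '时政', '游戏', '娱乐']
cm = [
    [990,   0,   0,   0,   2,   2,   1,   4,   1,   0],
    [  0, 985,   0,   1,   1,   3,   0,  10,   0,   0],
    [  0,   0, 997,   1,   1,   0,   0,   1,   0,   0],
    [  0,  20,   2, 906,  17,   5,  10,  36,   1,   3],
    [  0,   6,   0,   5, 945,  13,   8,  18,   2,   3],
    [  0,   3,   0,   2,   0, 988,   3,   1,   2,   1],
    [  2,   0,   0,   2,   4,   1, 982,   0,   2,   7],
    [  0,   9,   0,   2,  13,   5,   0, 970,   1,   0],
    [  1,   1,   1,   3,   7,   2,   9,   3, 970,   3],
    [  0,   2,   0,   5,   3,   5,   7,   0,   3, 975],
]
n = len(cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# Recall: diagonal over the row sum (all samples that truly belong to the class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# Precision: diagonal over the column sum (all samples predicted as the class).
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

# F1: harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

if __name__ == '__main__':
    print('accuracy: %.4f' % accuracy)
    for name, p, r, f in zip(labels, precision, recall, f1):
        print('%s  precision=%.2f  recall=%.2f  f1=%.2f' % (name, p, r, f))
```

The diagonal sums to 9708 out of 10000 samples, which matches the reported Test Acc of 97.08%. The weakest class is 家居, where 906 of 1000 samples are recovered; most of its errors fall into 时政 (36) and 财经 (20).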

 
