This "Getting Started with NLP" series of posts records my learning process in a study group of the open-source organization Datawhale. The tasks and the related datasets were collected and provided free of charge by Datawhale; credit where credit is due, so let me say so here — you are welcome to follow Datawhale.
1. Concepts first. (Note: it helps to memorize these together with their English names — the translated Chinese terms are easy to confuse.)
(A bit embarrassing: I keep saying I'll learn LaTeX but still haven't, so no formulas for now.)
Accuracy: the fraction of all samples that are classified correctly.
Precision: of the samples the classifier labels as positive, the fraction that are actually positive.
Recall: of the samples that are actually positive, the fraction the classifier labels as positive.
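Since I can't write the formulas yet, here is a minimal sketch in plain Python with made-up labels (1 = positive), counting the confusion-matrix cells and computing the three metrics from them:

```python
# Hypothetical labels and predictions, 1 = positive class
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all samples
precision = tp / (tp + fp)          # correct positives / predicted positives
recall = tp / (tp + fn)             # correct positives / actual positives
print(accuracy, precision, recall)  # accuracy = 0.75, precision = recall = 2/3
```

Note how precision and recall pull in different directions: predicting positive more aggressively raises `tp` but also `fp`, helping recall at the cost of precision.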
P-R curve: a plot with recall on the x-axis and precision on the y-axis.
ROC curve: short for Receiver Operating Characteristic curve. Its x-axis is the false positive rate (FPR) and its y-axis is the true positive rate (TPR), where FPR = FP / N and TPR = TP / P. Here P is the number of real positive samples, N is the number of real negative samples, TP is how many of the P positives the classifier predicts as positive, and FP is how many of the N negatives the classifier predicts as positive.
AUC (Area Under Curve): the area under the ROC curve, a single number that quantifies the model quality the ROC curve depicts; integrating along the ROC curve yields the AUC.
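The ROC curve is traced by sweeping a decision threshold over the classifier's scores; the AUC is then the trapezoid-rule integral of TPR over FPR. A small sketch with made-up scores:

```python
# Hypothetical true labels and classifier scores (higher = more positive)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]

P = sum(y_true)        # number of real positives
N = len(y_true) - P    # number of real negatives

# One threshold above every score (so the curve starts at (0, 0)),
# then each distinct score from high to low.
thresholds = [2.0] + sorted(set(scores), reverse=True)

points = []
for thr in thresholds:
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
    points.append((fp / N, tp / P))  # (FPR, TPR)

# AUC = area under the ROC curve via the trapezoid rule
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points, auc)  # AUC = 8/9 here
```

As a sanity check, AUC also equals the probability that a random positive is scored above a random negative: of the 9 positive/negative pairs here, only one (0.3 vs 0.6) is misordered, giving 8/9.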
2. Exploring the IMDB dataset
Based on the code by 苏格兰折耳喵:
https://www.kesci.com/home/project/5b6c05409889570010ccce90
import tensorflow as tf
from tensorflow import keras
import numpy as np
def load_data_preview(path, imdb):
    """
    Load the IMDB data and preview a sample.
    :param path: file name under which to cache the dataset
    :param imdb: the keras.datasets.imdb module
    :return: the train and test splits
    """
    (train_data, train_labels), (test_data, test_labels) = imdb.load_data(path=path, num_words=15000)
    print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
    print("Train_labels[0]: {}".format(train_labels[0]))
    print("Length of train_data[0]: {}, and train_data[1]: {}".format(len(train_data[0]), len(train_data[1])))
    return (train_data, train_labels), (test_data, test_labels)
def convert_int2word(text, word_index):
    """
    Convert an integer-encoded review back to words.
    :param text: a list of word indices
    :param word_index: dict mapping words to integer indices
    :return: the decoded review as a string
    """
    reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
def prepare_data(train_data, test_data, word_index):
    # Pad every review to the same length so they can be batched together
    train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                            value=word_index["<PAD>"],
                                                            padding='post',
                                                            maxlen=256)
    test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                           value=word_index["<PAD>"],
                                                           padding='post',
                                                           maxlen=256)
    print("Length of data after standardizing: {}".format(len(train_data[0])))
    return train_data, test_data
def main():
    imdb = keras.datasets.imdb
    word_index = imdb.get_word_index()  # maps words to integer indices
    # Shift every index up by 3 to make room for the special tokens below
    word_index = {k: (v + 3) for k, v in word_index.items()}
    word_index["<PAD>"] = 0
    word_index["<START>"] = 1
    word_index["<UNK>"] = 2
    word_index["<UNUSED>"] = 3
    # 'imdb.npz' is the default cache file name used by imdb.load_data
    (train_data, train_labels), (test_data, test_labels) = load_data_preview('imdb.npz', imdb)
    print(convert_int2word(train_data[0], word_index))
    train_data, test_data = prepare_data(train_data, test_data, word_index)
if __name__ == '__main__':
    main()
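The decoding step doesn't actually need TensorFlow, so it can be checked in isolation. A minimal sketch with a made-up toy vocabulary (the offsets follow the script above: special tokens occupy indices 0-3, real words start at 4):

```python
# Toy vocabulary; in the real script word_index comes from imdb.get_word_index()
word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3,
              "the": 4, "movie": 5, "was": 6, "great": 7}

# Flip word -> index into index -> word, as convert_int2word does
reverse_word_index = {value: key for key, value in word_index.items()}

def decode(text):
    # Unknown indices fall back to '?', matching reverse_word_index.get(i, '?')
    return ' '.join(reverse_word_index.get(i, '?') for i in text)

print(decode([1, 4, 5, 6, 7]))  # <START> the movie was great
```

Every encoded review starts with the `<START>` token (index 1), which is why `train_labels[0]` decoding in the full script begins with it.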