In the embeddings exercise, the first thing we have to deal with is files in tfrecord format; for background on handling tfrecord, I recommend the article linked here.
Here we mainly walk through how to turn the strings we get out of those files into feature inputs.
First, the data for this exercise: the training set and the validation set.
I start by creating an input dict:
num = {'numbers': ['1', '1', '2', '3', '4', '5', '6'], 'prime': ['2', '3', '5', '7']}
Then create a vocabulary list:
lst = ('1', '2')
Next, use the vocabulary list to look up the dict and build the feature tensor:
column = tf.feature_column.categorical_column_with_vocabulary_list(key='numbers', vocabulary_list=lst)
indicator = tf.feature_column.indicator_column(column)
tensor = tf.feature_column.input_layer(num, [indicator])
Print the result and take a look:
with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(tensor))
[[1. 0.]
[1. 0.]
[0. 1.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]
Here we can clearly see that only the values under the key 'numbers' in the num dict were processed, because that is the key the feature column was built with; the 'prime' entry is ignored.
The vocabulary list has length 2 and 'numbers' in num has length 7, so the returned result has shape (7, 2).
If the input we feed a neural network has shape (None), where None means variable length, and the vocabulary has length 50, then the result has shape (None, 50).
Clearly that is not what we want.
So let's change how we write the input:
num = {'numbers': [['1', '1', '2', '3', '4', '5', '6']], 'prime': ['2', '3', '5', '7']}
[[2. 1.]]
Print the shape as well:
(1, 2)
This time the one-hot encoding has turned into term counts, which is much more convenient.
For an input of shape (batch_size, None), a vocabulary of size N gives a result of shape (batch_size, N). Since the input shape is now fixed, we can feed it straight into a neural network; a small sketch follows.
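To make this concrete, here is a minimal sketch with a hypothetical two-row batch (the values are made up for illustration), showing that a (batch_size, None) string input and a size-N vocabulary produce a (batch_size, N) count tensor:

import tensorflow as tf

# Hypothetical batch of 2 samples with 3 terms each: shape (2, 3).
batch = {'numbers': [['1', '1', '2'],
                     ['2', '2', '2']]}
lst = ('1', '2')  # vocabulary of size N = 2

column = tf.feature_column.categorical_column_with_vocabulary_list(
    key='numbers', vocabulary_list=lst)
indicator = tf.feature_column.indicator_column(column)
tensor = tf.feature_column.input_layer(batch, [indicator])

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(tensor))  # expected: [[2. 1.] [0. 3.]]
    print(tensor.shape)      # (2, 2), i.e. (batch_size, N)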
The remaining steps should be clear: just feed the features into a DNN. Here is the code:
import tensorflow as tf
def _parse_function(record):
    """Extracts features and labels.
    Args:
      record: File path to a TFRecord file
    Returns:
      A `tuple` `(labels, features)`:
        features: A dict of tensors representing the features
        labels: A tensor with the corresponding labels.
    """
    features = {
        "terms": tf.VarLenFeature(dtype=tf.string),                 # terms are strings of varying lengths
        "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)   # labels are 0 or 1
    }
    parsed_features = tf.parse_single_example(record, features)
    terms = parsed_features['terms'].values
    labels = parsed_features['labels']
    return {'terms': terms}, labels
def my_input_fn(input_filenames, num_epochs=None, shuffle=True):
    # Same code as above; create a dataset and map features and labels.
    ds = tf.data.TFRecordDataset(input_filenames)
    ds = ds.map(_parse_function)
    if shuffle:
        ds = ds.shuffle(10000)
    # Our feature data is variable-length, so we pad and batch
    # each field of the dataset structure to whatever size is necessary.
    ds = ds.padded_batch(25, ds.output_shapes)
    ds = ds.repeat(num_epochs)
    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
def add_layer(inputs, input_size, output_size, activation_function=None):
    weights = tf.Variable(tf.random_normal([input_size, output_size], stddev=.1))
    biases = tf.Variable(tf.zeros([output_size]) + .1)
    wx_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = wx_b
    else:
        outputs = activation_function(wx_b)
    return weights, biases, outputs
def _loss(pred, ys):
    log_loss = tf.reduce_mean(- ys*tf.log(pred) - (1-ys)*tf.log(1-pred))  # feel free to try other losses
    return log_loss
def train_step(learning_rate, loss):
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)  # dynamic learning rate
    return train_step
train_path = my_input_fn('./nlp/train.tfrecord') # shape [25, None]
test_path = my_input_fn('./nlp/test.tfrecord') # shape [25, None]
informative_terms = ("bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family")
terms_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(key="terms", vocabulary_list=informative_terms)
indicator = tf.feature_column.indicator_column(terms_feature_column)
xs = tf.feature_column.input_layer(train_path[0], [indicator])  # shape [batch_size, 50]
ys = train_path[1]  # shape [batch_size, 1]
vx = tf.feature_column.input_layer(test_path[0], [indicator])  # shape [batch_size, 50]
vy = test_path[1]  # shape [batch_size, 1]
global_step = tf.Variable(0, trainable=False)
start_learning_rate = .01
lr = tf.train.exponential_decay(start_learning_rate, global_step, 100, .98, staircase=True)
w1, b1, l1 = add_layer(xs, 50, 20, activation_function=tf.nn.tanh)  # try other activation functions yourself
w2, b2, l2 = add_layer(l1, 20, 20, activation_function=tf.nn.tanh)  # try other activation functions yourself
w3, b3, pred = add_layer(l2, 20, 1, activation_function=tf.nn.sigmoid)  # don't change this one: both inputs to AUC must lie in (0, 1),
# and for binary classification we use sigmoid
loss = _loss(pred, ys)
train = train_step(lr, loss)
sess = tf.Session()
init = tf.global_variables_initializer()
init_table = tf.tables_initializer()
sess.run([init, init_table])
v1 = tf.nn.tanh(tf.matmul(vx, w1) + b1)
v2 = tf.nn.tanh(tf.matmul(v1, w2) + b2)
v_pred = tf.nn.sigmoid(tf.matmul(v2, w3) + b3)
validation_auc = tf.metrics.auc(vy, v_pred, num_thresholds=500)
# train_auc = tf.metrics.auc(ys, pred, num_thresholds=500)
sess.run(tf.local_variables_initializer())
for i in range(2000):
    sess.run(train)
    if i % 50 == 0:
        print('validation_AUC:', sess.run(validation_auc)[0])
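As the comment in _loss suggests, other losses are worth trying. One option, shown here as an untested sketch rather than what I actually ran above, is tf.losses.log_loss, which clips predictions internally so the log never sees exactly 0 or 1:

def _loss(pred, ys):
    # Hypothetical drop-in replacement for the hand-written cross entropy:
    # tf.losses.log_loss clips predictions by a small epsilon internally,
    # so it will not produce NaN when pred hits exactly 0 or 1.
    return tf.losses.log_loss(labels=ys, predictions=pred)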
I did not print train_auc here; the validation results are essentially the same as the official exercise, somewhere between .86 and .90. Because the batch size is small, the optimization direction can be noisy, so try increasing the batch size when you experiment. Both _parse_function and my_input_fn are copied from the official exercise, and by now we are well practiced at writing my_input_fn. _parse_function is worth studying carefully, since tfrecord may come up a lot (even if I would rather not use it).
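If you want to experiment with _parse_function without the downloaded data, here is a minimal sketch of writing a tiny tfrecord file with the same 'terms'/'labels' schema; the file name and the sample values are made up:

import tensorflow as tf

# Hypothetical toy record: a variable-length list of terms plus one float label,
# matching the schema that _parse_function expects.
with tf.python_io.TFRecordWriter('./toy.tfrecord') as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        'terms': tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[b'great', b'fun', b'boring'])),
        'labels': tf.train.Feature(float_list=tf.train.FloatList(value=[1.0])),
    }))
    writer.write(example.SerializeToString())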
Mapping strings to numbers through a vocabulary, as we did here, is of course a good approach. But if the input contains a huge number of distinct words to extract and we have no idea how to build a vocabulary, we can use tf.feature_column.categorical_column_with_hash_bucket(key='key', hash_bucket_size=number of buckets) to force the words into buckets by hashing. The problem with this is obvious: two completely unrelated words, or even words with opposite meanings, may land in the same bucket. The upside is that we no longer have to type out a vocabulary by hand (although here the vocabulary was supplied by the official tutorial). I have not actually tried how well hash bucketing works, so I will not pass judgment; a sketch of the call is below.
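For reference, a minimal sketch of the hash-bucket variant; the bucket size of 1000 is an arbitrary choice, and I have not verified how it scores on this exercise:

# Hypothetical replacement for the vocabulary-based column above.
terms_hashed_column = tf.feature_column.categorical_column_with_hash_bucket(
    key='terms', hash_bucket_size=1000)
hashed_indicator = tf.feature_column.indicator_column(terms_hashed_column)
xs = tf.feature_column.input_layer(train_path[0], [hashed_indicator])  # shape [batch_size, 1000]
# The first hidden layer's input_size would then have to change from 50 to 1000.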
Of course, none of this touches the actual topic of this chapter, embeddings. But given how many new functions we had to use, I think it was worth splitting this part off, especially since embeddings deserve an introduction of their own.