[Study Notes] Converting Strings into Features

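In the embeddings exercise, we first have to deal with files in TFRecord format. For how to handle TFRecord files, I suggest reading this article.

Here we mainly cover how to turn the strings we get out of those files into feature inputs.

First, the data for this exercise: training set, validation set

I'll create an input dictionary here: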

num = {'numbers': ['1', '1', '2', '3', '4', '5', '6'], 'prime': ['2', '3', '5', '7']}

Then create a vocabulary:

lst = ('1', '2')

Next, use the vocabulary to map the dictionary's values:

column = tf.feature_column.categorical_column_with_vocabulary_list(key='numbers', vocabulary_list=lst)

indicator = tf.feature_column.indicator_column(column)

tensor = tf.feature_column.input_layer(num, [indicator])

Let's print the result and take a look:

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(tensor))
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]

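It is clear that, of the num dictionary, only the values under the key numbers were processed.

The vocabulary here has length 2 and numbers in num has length 7, so the returned result has shape (7, 2): one one-hot row per string, and any string that is not in the vocabulary becomes an all-zero row.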
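
To make the shape concrete, here is a minimal pure-Python sketch of the mapping seen in the output above. It only imitates the observed behaviour and is not how TensorFlow implements it; the helper name one_hot_rows is made up for this note.

```python
def one_hot_rows(values, vocabulary):
    """Return one row per input string (hypothetical helper, not a TF API)."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0.0] * len(vocabulary)
        if value in index:  # out-of-vocabulary strings stay all zeros
            row[index[value]] = 1.0
        rows.append(row)
    return rows


numbers = ['1', '1', '2', '3', '4', '5', '6']
vocabulary = ('1', '2')
rows = one_hot_rows(numbers, vocabulary)
print(len(rows), len(rows[0]))  # 7 rows, 2 columns
```

Each of the 7 strings gets its own row, which is why the first dimension follows the length of the input rather than the number of examples.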

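If the input we feed into a neural network has shape (None,), where None means variable length, and the vocabulary has length 50, then the returned result has shape (None, 50).

Clearly this is not what we want.

Let's change how the input is written: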

num = {'numbers': [['1', '1', '2', '3', '4', '5', '6']], 'prime': ['2', '3', '5', '7']}
[[2. 1.]]

Let's print the shape as well:

(1, 2)

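This time the one-hot encoding has turned into counts, which is much more convenient.

For an input of shape (batch_size, None), a vocabulary of length N gives a result of shape (batch_size, N). Now that the input shape is fixed, we can feed it into a neural network.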
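
The (batch_size, None) to (batch_size, N) step can be sketched in plain Python as below. This is an illustration under my own assumptions, not TensorFlow code: pad_batch and count_rows are hypothetical helpers, and padding with the empty string mirrors what I understand padded_batch does for string tensors by default.

```python
def pad_batch(batch, pad=''):
    """Pad every example to the length of the longest one in the batch."""
    width = max(len(terms) for terms in batch)
    return [list(terms) + [pad] * (width - len(terms)) for terms in batch]


def count_rows(batch, vocabulary):
    """Count vocabulary hits per example (hypothetical helper, not a TF API)."""
    index = {term: i for i, term in enumerate(vocabulary)}
    result = []
    for terms in batch:
        row = [0.0] * len(vocabulary)
        for term in terms:
            if term in index:  # '' padding and unknown words are skipped
                row[index[term]] += 1.0
        result.append(row)
    return result


batch = pad_batch([['1', '1', '2', '3', '4', '5', '6'], ['2', '7']])
counts = count_rows(batch, ('1', '2'))
print(counts)  # [[2.0, 1.0], [0.0, 1.0]]
```

However long each example is, every row ends up with exactly N entries, one per vocabulary term.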

The remaining steps should be clear: feed the features straight into a DNN. Here is my code:

import tensorflow as tf


def _parse_function(record):
    """Extracts features and labels.

    Args:
      record: A serialized Example proto read from a TFRecord file
    Returns:
      A `tuple` `(features, labels)`:
        features: A dict of tensors representing the features
        labels: A tensor with the corresponding labels.
    """
    features = {
        "terms": tf.VarLenFeature(dtype=tf.string),  # terms are strings of varying lengths
        "labels": tf.FixedLenFeature(shape=[1], dtype=tf.float32)  # labels are 0 or 1
    }

    parsed_features = tf.parse_single_example(record, features)

    terms = parsed_features['terms'].values
    labels = parsed_features['labels']

    return {'terms': terms}, labels


def my_input_fn(input_filenames, num_epochs=None, shuffle=True):
    # Create a dataset and map each record to features and labels.
    ds = tf.data.TFRecordDataset(input_filenames)
    ds = ds.map(_parse_function)

    if shuffle:
        ds = ds.shuffle(10000)

    # Our feature data is variable-length, so we pad and batch
    # each field of the dataset structure to whatever size is necessary.
    ds = ds.padded_batch(25, ds.output_shapes)

    ds = ds.repeat(num_epochs)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

def add_layer(inputs, input_size, output_size, activation_function=None):
    weights = tf.Variable(tf.random_normal([input_size, output_size], stddev=.1))
    biases = tf.Variable(tf.zeros([output_size]) + .1)
    wx_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = wx_b
    else:
        outputs = activation_function(wx_b)
    return weights, biases, outputs


def _loss(pred, ys):
    log_loss = tf.reduce_mean(- ys*tf.log(pred) - (1-ys)*tf.log(1-pred))  # feel free to try other losses
    return log_loss


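def train_step(learning_rate, loss):
    # Passing global_step makes minimize() increment it, which drives the decaying learning rate.
    # global_step is a module-level variable defined further down, before this function is called.
    return tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)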


train_path = my_input_fn('./nlp/train.tfrecord')    # shape [25, None]
test_path = my_input_fn('./nlp/test.tfrecord')      # shape [25, None]


informative_terms = ("bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family")

terms_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(key="terms", vocabulary_list=informative_terms)
indicator = tf.feature_column.indicator_column(terms_feature_column)
xs = tf.feature_column.input_layer(train_path[0], [indicator])  # shape [batch_size, 50]
ys = train_path[1]  # shape [batch_size, 1]
vx = tf.feature_column.input_layer(test_path[0], [indicator])  # shape [batch_size, 50]
vy = test_path[1]   # shape [batch_size, 1]
global_step = tf.Variable(0, trainable=False)
start_learning_rate = .01
lr = tf.train.exponential_decay(start_learning_rate, global_step, 100, .98, staircase=True)

w1, b1, l1 = add_layer(xs, 50, 20, activation_function=tf.nn.tanh)  # try other activation functions yourself
w2, b2, l2 = add_layer(l1, 20, 20, activation_function=tf.nn.tanh)  # try other activation functions yourself
w3, b3, pred = add_layer(l2, 20, 1, activation_function=tf.nn.sigmoid)  # do not change this one: the values fed to AUC must lie between 0 and 1,
# and for binary classification we use sigmoid

loss = _loss(pred, ys)
train = train_step(lr, loss)

sess = tf.Session()
init = tf.global_variables_initializer()
init_table = tf.tables_initializer()
sess.run([init, init_table])

v1 = tf.nn.tanh(tf.matmul(vx, w1) + b1)
v2 = tf.nn.tanh(tf.matmul(v1, w2) + b2)
v_pred = tf.nn.sigmoid(tf.matmul(v2, w3) + b3)

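# tf.metrics.auc returns a (value, update_op) pair backed by local variables.
validation_auc = tf.metrics.auc(vy, v_pred, num_thresholds=500)
# train_auc = tf.metrics.auc(ys, pred, num_thresholds=500)
sess.run(tf.local_variables_initializer())

for i in range(2000):
    sess.run(train)
    if i % 50 == 0:
        # Running the pair also runs update_op, so the streaming AUC accumulates over validation batches.
        print('validation_AUC:', sess.run(validation_auc)[0])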

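I did not print train_auc here. The validation result is basically the same as in the official documentation, between .86 and .90. Because the batch size is quite small, training may converge in the wrong direction, so try a larger batch size when you experiment. Both _parse_function and my_input_fn are copied from the official documentation. We are already comfortable with how my_input_fn is written. _parse_function deserves careful study, since TFRecord is probably used a lot (though I would rather not use it).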
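
One detail of the script above that is easy to miss is the learning-rate schedule. As far as I understand tf.train.exponential_decay with staircase=True, the rate is start_lr * decay_rate ** floor(step / decay_steps). The sketch below is plain Python for illustration only; staircase_decay is a name I made up.

```python
def staircase_decay(start_lr, step, decay_steps, decay_rate):
    """start_lr * decay_rate ** floor(step / decay_steps); illustrative helper."""
    return start_lr * decay_rate ** (step // decay_steps)


# Same settings as the script: start at .01, multiply by .98 every 100 steps.
for step in (0, 99, 100, 1999):
    print(step, staircase_decay(0.01, step, 100, 0.98))
```

Over the 2000 training steps the rate only drops 19 times, ending a bit under 0.007, so the decay here is fairly gentle.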

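Mapping strings to numbers through a vocabulary is of course a fine approach. But when the input strings contain a great many words we need to extract and we do not know how to build the vocabulary, we can force a categorization with tf.feature_column.categorical_column_with_hash_bucket(key='key', hash_bucket_size=number of buckets). The drawback is just as obvious: two completely unrelated words, or even words with opposite meanings, may land in the same bucket. The upside is that we no longer have to type out a vocabulary (although the vocabulary here was supplied by the official tutorial). I have not actually tried hash buckets, so I will not judge how well they work.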
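
The collision problem can be seen with a small stand-in. This sketch is an assumption-laden illustration: TensorFlow uses its own fingerprint hash, whereas here I use hashlib.md5 purely as a substitute, and hash_bucket is a hypothetical helper, so the bucket ids will not match what TensorFlow produces.

```python
import hashlib


def hash_bucket(term, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size); illustrative only."""
    digest = hashlib.md5(term.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % hash_bucket_size


words = ['great', 'awful', 'boring', 'fun', 'best', 'worst']
buckets = {word: hash_bucket(word, 4) for word in words}
print(buckets)  # six words in four buckets, so at least two must share one
```

No vocabulary is needed, but whichever words share a bucket become indistinguishable to the model, regardless of their meaning.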

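Of course, we have not touched the real topic of this chapter, embeddings. But given how many new functions we need, I think it is worth writing this up separately, and embeddings deserve their own introduction anyway.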
