Reading TFRecords files with TensorFlow

How the dataset pipeline works:

1. Writing a TFRecords file:

image / text -> format conversion -> build an example (tf.train.Example) -> write (tf.python_io.TFRecordWriter.write)
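The write path above can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.x, where the writer is exposed as `tf.io.TFRecordWriter` (the `tf.python_io` name used in this post is the TensorFlow 1.x spelling). The helper `make_example`, the sample data, and the file name are hypothetical; only the feature names `id`, `classes`, and `comment` are borrowed from the parsing code later in this post.

```python
import os
import tempfile

import tensorflow as tf


def make_example(doc_id, classes, comment_ids):
    """Pack one sample into a tf.train.Example (hypothetical helper)."""
    feature = {
        "id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[doc_id.encode("utf-8")])),
        "classes": tf.train.Feature(
            float_list=tf.train.FloatList(value=classes)),
        "comment": tf.train.Feature(
            int64_list=tf.train.Int64List(value=comment_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Made-up samples: (id, one-hot classes, token ids of the comment).
samples = [
    ("doc-1", [1.0, 0.0], [3, 1, 4]),
    ("doc-2", [0.0, 1.0], [1, 5, 9, 2]),
]

path = os.path.join(tempfile.mkdtemp(), "train.tfrecords")
with tf.io.TFRecordWriter(path) as writer:
    for doc_id, classes, comment_ids in samples:
        example = make_example(doc_id, classes, comment_ids)
        writer.write(example.SerializeToString())

# Read the file back only to confirm how many records were written.
num_records = sum(1 for _ in tf.data.TFRecordDataset(path))
print(num_records)
```

Each record in the file is one serialized `tf.train.Example`, which is why the reading side has to parse the records one by one.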

2. Reading a TFRecords file:

tf.data.Dataset loads the sequence of files -> dataset (an iterator over examples) -> tf.parse_single_example (parses the examples one by one)

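##Code implementation:

Build the dataset (TensorFlow 1.x API):

import logging
import tensorflow as tf

# files, num_threads, buffer_size, num_prefetch_batches, batch_size and decode_ex are defined elsewhere
# shuffle the file names, then read several files in parallel
dataset_b = tf.data.Dataset.list_files(files).shuffle(len(files))
dataset = dataset_b.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=num_threads))
# decode the serialized examples in parallel
dataset = dataset.map(decode_ex, num_parallel_calls=num_threads)
shapes = dataset.output_shapes
logging.info('dataset decode shapes: %s', shapes)
dataset = dataset.shuffle(buffer_size=buffer_size)
# pipeline: prefetch decoded examples
dataset = dataset.prefetch(num_prefetch_batches * batch_size)
# pad every batch to the shapes recorded above
dataset = dataset.padded_batch(batch_size, padded_shapes=shapes)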
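The last step, `padded_batch`, is easier to follow on a toy input. The sketch below assumes TensorFlow 2.x with eager execution and uses made-up data: each element is a vector whose length differs from its neighbours, much like the variable-length `comment` and `title` features in this post. A dimension given as `None` in `padded_shapes` is padded to the longest element in the batch.

```python
import tensorflow as tf

# Elements of length 1, 2 and 3: [1], [2, 2], [3, 3, 3].
dataset = tf.data.Dataset.range(1, 4).map(
    lambda x: tf.fill([tf.cast(x, tf.int32)], x))

# None means "pad this dimension to the longest element in the batch".
batched = dataset.padded_batch(3, padded_shapes=[None])

batch = next(iter(batched)).numpy().tolist()
print(batch)  # [[1, 0, 0], [2, 2, 0], [3, 3, 3]]
```

Shorter elements are filled with the default padding value, which is 0 for numeric tensors.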

###Parse a single example:

# example: one serialized tf.train.Example; NUM_CLASSES is defined elsewhere
features_dict = {
    'id': tf.FixedLenFeature([], tf.string),
    'classes': tf.FixedLenFeature([NUM_CLASSES], tf.float32),
    'comment': tf.VarLenFeature(tf.int64),
    'title': tf.VarLenFeature(tf.int64),
    'comment_str': tf.FixedLenFeature([], tf.string),
    'title_str': tf.FixedLenFeature([], tf.string),
}
features = tf.parse_single_example(example, features=features_dict)
id = features['id']
classes = features['classes']
# comment: a VarLenFeature is returned as a SparseTensor; densify it, then keep at most 500 ids
comment = features['comment']
comment = tf.sparse_tensor_to_dense(comment)
comment = comment[:500]
# raw comment string
comment_str = features['comment_str']
#comment_str = comment_str[:500]
# parse title
title = features['title']
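For readers on TensorFlow 2.x, the same parsing step might look like the sketch below. It is an illustration under assumptions, not the original author's function: the body of `decode_ex`, the value of `NUM_CLASSES`, and the sample record are made up, and only three of the six features are kept to stay short. In TensorFlow 2.x the parsing calls live under `tf.io` and `tf.sparse`.

```python
import tensorflow as tf

NUM_CLASSES = 2  # assumed value; the post does not state it
MAX_LEN = 500    # truncation length used in the post

FEATURES = {
    "id": tf.io.FixedLenFeature([], tf.string),
    "classes": tf.io.FixedLenFeature([NUM_CLASSES], tf.float32),
    "comment": tf.io.VarLenFeature(tf.int64),
}


def decode_ex(serialized):
    """Parse one serialized tf.train.Example into dense tensors."""
    features = tf.io.parse_single_example(serialized, features=FEATURES)
    # VarLenFeature comes back as a SparseTensor: densify, then truncate.
    comment = tf.sparse.to_dense(features["comment"])[:MAX_LEN]
    return {
        "id": features["id"],
        "classes": features["classes"],
        "comment": comment,
    }


# Build one record in memory so the sketch runs without any files.
example = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"doc-1"])),
    "classes": tf.train.Feature(
        float_list=tf.train.FloatList(value=[1.0, 0.0])),
    "comment": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[3, 1, 4, 1, 5])),
}))

decoded = decode_ex(example.SerializeToString())
print(decoded["comment"].numpy().tolist())
```

A function of this shape can be passed directly to `dataset.map`, since it takes one serialized record and returns a dictionary of tensors.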

References:

https://www.yanshuo.me/p/305699 (part 6: how to read data from TFRecords in batches with TensorFlow Eager)

https://www.tensorflow.org/performance/datasets_performance (the official dataset tutorial)

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py (the data_reader code in tensor2tensor)

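##Parallel data extraction and preprocessing in TensorFlow

###1.Background

####Division of labor between CPU and GPU:

Layers run on the GPU/TPU. In other words, the embedding_layer lookup happens on the GPU, while preparation (word -> id) happens on the CPU. Example parsing (parse_example) also runs on the CPU.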
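As a concrete picture of the CPU-side "prepare" step, here is a small pure-Python sketch of a word -> id conversion. The function names, the `<pad>` and `<unk>` tokens, and the sample sentences are all hypothetical; the post does not show how its vocabulary was built. The 500-token cap mirrors the truncation used in the parsing code of this post.

```python
def build_vocab(sentences, pad="<pad>", unk="<unk>"):
    """Assign an integer id to every word, reserving 0 and 1 for pad/unknown."""
    vocab = {pad: 0, unk: 1}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len=500, unk="<unk>"):
    """Map words to ids, falling back to the unknown id, then truncate."""
    ids = [vocab.get(word, vocab[unk]) for word in sentence.split()]
    return ids[:max_len]


vocab = build_vocab(["good movie", "bad movie"])
ids = encode("good bad film", vocab)
print(ids)  # [2, 4, 1]
```

The resulting integer ids are what gets stored in the TFRecords file and later fed to the embedding layer on the accelerator.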

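####Training performance compared:

In general: TPU > GPU > CPU

The TPU is a processor that Google developed specifically for TensorFlow workloads, aiming for lower power consumption and higher throughput. AlphaGo ran on TPUs.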

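####Ways to speed up CPU preprocessing

The CPU does two jobs: 1. extraction (I/O); 2. data parsing (map(parser)). There are therefore two ways to speed up CPU preprocessing:

1. Parallel extraction;

2. Parallel map.

When the CPU and the GPU work together, one more pipelining optimization is possible: while the GPU trains on the current batch, the CPU prepares the next one.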

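###2.The optimizations in detail

####I. Batch pipelining

[Figure: flow diagram of batch pipelining]

Code implementation:

--------------------------------------------------------

dataset = dataset.batch(batch_size=FLAGS.batch_size)
dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)

--------------------------------------------------------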
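To make the idea behind `prefetch` concrete, here is a pure-Python sketch of a bounded prefetch buffer. It only illustrates the producer/consumer overlap and is not how tf.data is implemented; `prefetch`, `prepare_batches`, and the toy "training" step are hypothetical names, and the sketch does not propagate errors raised on the producer thread.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items of `iterable` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    end = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has an item ready
        if item is end:
            return
        yield item


def prepare_batches(num_batches, batch_size):
    """Stand-in for the CPU side: build batches of example ids."""
    for b in range(num_batches):
        yield [b * batch_size + i for i in range(batch_size)]


# Stand-in for the training loop: consume batches as they become available.
consumed = [sum(batch) for batch in prefetch(prepare_batches(3, 4), buffer_size=2)]
print(consumed)  # [6, 22, 38]
```

While the consumer works on one batch, the producer thread is already filling the buffer with the next ones, up to `buffer_size` batches ahead. That is the same overlap the prefetch call in the snippet above asks tf.data to provide.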

####II. Parallel prepare (map):

[Figure: flow diagram of parallel map]

Code implementation:

---------------------------------------------

dataset = dataset.map(map_func=parse_fn, num_parallel_calls=FLAGS.num_parallel_calls)
dataset = dataset.batch(batch_size=FLAGS.batch_size)

------------------------------------------------
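A self-contained toy version of the parallel map, assuming TensorFlow 2.x with eager execution. The `parse_fn` here is a hypothetical stand-in (it squares its input) rather than a real parser, and the value 4 for `num_parallel_calls` is arbitrary.

```python
import tensorflow as tf


def parse_fn(x):
    """Stand-in for a real parsing function; any per-element transform works."""
    return x * x


dataset = tf.data.Dataset.range(8)
# Up to 4 elements are transformed concurrently on the CPU.
dataset = dataset.map(parse_fn, num_parallel_calls=4)
dataset = dataset.batch(4)

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)  # [[0, 1, 4, 9], [16, 25, 36, 49]]
```

With the default settings the output order matches the input order even though elements are processed in parallel, so switching from a sequential to a parallel map does not change what the model sees.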

####III. Parallel data extraction

[Figure: flow diagram of parallel data extraction]

Code implementation:

------------------------

Change:

dataset = files.interleave(tf.data.TFRecordDataset)

to:

dataset = files.apply(tf.contrib.data.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=FLAGS.num_parallel_readers))

------------------------
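Note that `tf.contrib` was removed in TensorFlow 2.x. There, parallel reading is usually expressed by passing `num_parallel_calls` to `Dataset.interleave` itself. The sketch below assumes TensorFlow 2.x with eager execution; the file names, record contents, and the reader count of 3 are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write three small TFRecords files with two raw-bytes records each.
tmp_dir = tempfile.mkdtemp()
for f in range(3):
    file_path = os.path.join(tmp_dir, "part-%d.tfrecords" % f)
    with tf.io.TFRecordWriter(file_path) as writer:
        for r in range(2):
            writer.write(("file%d-rec%d" % (f, r)).encode("utf-8"))

files = tf.data.Dataset.list_files(
    os.path.join(tmp_dir, "part-*.tfrecords"), shuffle=False)

# Read up to 3 files at once, pulling records from them in turn.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=3,
    num_parallel_calls=3)

records = [record.numpy().decode("utf-8") for record in dataset]
print(len(records))
```

Here `cycle_length` controls how many files are open at the same time, and `num_parallel_calls` controls how many of them are read concurrently.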

###3.References:

https://www.tensorflow.org/performance/datasets_performance

https://tensorflow.juejin.im/performance/datasets_performance.html
