First, generate a TFRecord file. The code below produces the simplest possible file: four records, each holding the single value f (so the data is f f f f).
import tensorflow as tf

def write_test(output):
    writer = tf.python_io.TFRecordWriter(output)
    for i in range(4):
        example = tf.train.Example(features=tf.train.Features(feature={
            # BytesList values must be bytes, so use b'f' under Python 3
            'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'f']))
        }))
        writer.write(example.SerializeToString())
    writer.close()

write_test('f.tfrecord')
Next, use the program above to generate several files: 'a.tfrecord', 'b.tfrecord', 'c.tfrecord', 'd.tfrecord', 'e.tfrecord', 'f.tfrecord'.
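One way to produce the six files is to loop the names through write_test; here is a minimal sketch (write_all is a hypothetical helper, not part of the original code — it takes the writer function as an argument so the sketch stays self-contained):

```python
def write_all(writer_fn, names=('a', 'b', 'c', 'd', 'e', 'f')):
    # Build the six file names and call writer_fn (e.g. write_test
    # from the snippet above) once per target file.
    files = ['{}.tfrecord'.format(n) for n in names]
    for f in files:
        writer_fn(f)
    return files
```

Calling write_all(write_test) would then emit all six files in one go.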
Now read those TFRecord files back, interleaving two files in parallel, and pull records through an iterator. Because this is the simplest possible example, the records were written without any preprocessing, so no parsing is needed when reading them back either.
import tensorflow as tf

current_file = ['a.tfrecord', 'b.tfrecord', 'c.tfrecord',
                'd.tfrecord', 'e.tfrecord', 'f.tfrecord']
filenames = tf.data.Dataset.list_files(current_file, shuffle=False)
# Read records from two files at a time, interleaving their elements
dataset = filenames.apply(tf.contrib.data.parallel_interleave(
    lambda filename: tf.data.TFRecordDataset(filename), cycle_length=2))
iterator_hdfs = dataset.make_initializable_iterator()
parse_data = iterator_hdfs.get_next()

with tf.Session() as sess:
    sess.run(iterator_hdfs.initializer)
    i = 0
    while True:
        i = i + 1
        print('Training Batch ', i)
        try:
            print(sess.run(parse_data))
        except tf.errors.OutOfRangeError:
            print('outofrange')
            break
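To see what cycle_length=2 does, here is a plain-Python sketch of the interleave pattern — an illustration of the concept only, not TensorFlow's actual implementation: it keeps two sources open at a time, alternates between them, and pulls in the next file when one is exhausted.

```python
def interleave(sources, cycle_length=2):
    # Round-robin over up to `cycle_length` iterators at a time,
    # refilling from the remaining sources as each one runs dry.
    pending = [iter(s) for s in sources]
    active = [pending.pop(0) for _ in range(min(cycle_length, len(pending)))]
    out = []
    while active:
        for it in list(active):
            try:
                out.append(next(it))
            except StopIteration:
                active.remove(it)
                if pending:
                    active.append(pending.pop(0))
    return out
```

With three "files" [1, 2], [3], [4, 5] and cycle_length=2, this yields 1, 3, 2, 4, 5: records alternate between the first two files until one ends, then the third file takes its slot.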
To be continued...
Data is often stored in time order or some other fixed order, and when reading it back we want the files read in that same order, which is where problems can arise. When the author used this API, the files came back in a scrambled order, and the culprit turned out to be list_files.
filenames = tf.data.Dataset.list_files(current_file)
Let's look at the function's source. It takes two parameters. The first is file_pattern: "A string or scalar string tf.Tensor, representing the filename pattern that will be matched." That one is straightforward. The second is shuffle: "(Optional.) If True, the file names will be shuffled randomly. Defaults to True." When it is true, the file names are shuffled randomly, and the default is True:
if shuffle is None:
  shuffle = True
So if the code does not pass the argument explicitly, the default applies, and that is exactly what scrambles the file order.
def list_files(file_pattern, shuffle=None):
  """A dataset of all files matching a pattern.

  Example:
    If we had the following files on our filesystem:
      - /path/to/dir/a.txt
      - /path/to/dir/b.py
      - /path/to/dir/c.py
    If we pass "/path/to/dir/*.py" as the directory, the dataset would
    produce:
      - /path/to/dir/b.py
      - /path/to/dir/c.py

  NOTE: The order of the file names returned can be non-deterministic even
  when `shuffle` is `False`.

  Args:
    file_pattern: A string or scalar string `tf.Tensor`, representing
      the filename pattern that will be matched.
    shuffle: (Optional.) If `True`, the file names will be shuffled randomly.
      Defaults to `True`.

  Returns:
    Dataset: A `Dataset` of strings corresponding to file names.
  """
  # TODO(b/73959787): Add a `seed` argument and make the `shuffle=False`
  # behavior deterministic (e.g. by sorting the filenames).
  if shuffle is None:
    shuffle = True
  matching_files = gen_io_ops.matching_files(file_pattern)
  dataset = Dataset.from_tensor_slices(matching_files)
  if shuffle:
    # NOTE(mrry): The shuffle buffer size must be greater than zero, but the
    # list of files might be empty.
    buffer_size = math_ops.maximum(
        array_ops.shape(matching_files, out_type=dtypes.int64)[0], 1)
    dataset = dataset.shuffle(buffer_size)
  return dataset
When we do not want the files shuffled, we should pass shuffle=False explicitly:
filenames = tf.data.Dataset.list_files(current_file, shuffle=False)
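Note the NOTE in the docstring above: even with shuffle=False, the order list_files returns can be non-deterministic (the TODO comment suggests sorting would fix this). If a strict order matters, one option — a sketch, where ordered_tfrecords is a hypothetical helper, not a TensorFlow API — is to resolve and sort the file names in plain Python, then build the dataset from the explicit list:

```python
import glob

def ordered_tfrecords(pattern):
    # Resolve the pattern eagerly and sort, so the file order is
    # deterministic regardless of how the filesystem lists matches.
    return sorted(glob.glob(pattern))

# The sorted list can then replace list_files, e.g. (TF 1.x):
#   filenames = tf.data.Dataset.from_tensor_slices(
#       ordered_tfrecords('*.tfrecord'))
```

Since the list is already in its final order, from_tensor_slices preserves it exactly.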