A Minimal Example of Writing and Reading TFRecord Data (to be continued)

First, generate a TFRecord file. The code below writes the simplest possible file: four records, each containing the single byte string f.

import tensorflow as tf

def write_test(output):
    # No tf.Session is needed here: TFRecordWriter writes straight to disk.
    writer = tf.python_io.TFRecordWriter(output)
    for i in range(4):
        example = tf.train.Example(features=tf.train.Features(feature={
            # BytesList expects bytes, not str, under Python 3.
            'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'f']))
        }))
        writer.write(example.SerializeToString())
    writer.close()

write_test('f.tfrecord')
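For intuition about what ends up on disk: each serialized Example is wrapped in a simple frame consisting of a little-endian record length, a masked CRC-32C of that length, the payload, and a masked CRC-32C of the payload. A pure-Python sketch of this framing (illustrative only; real readers and writers should go through TensorFlow itself):

```python
import struct

# CRC-32C (Castagnoli) lookup table, reversed polynomial 0x82F63B78.
_TABLE = []
for i in range(256):
    crc = i
    for _ in range(8):
        crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    _TABLE.append(crc)

def crc32c(data):
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc(data):
    # TFRecord stores a rotated-and-offset ("masked") CRC.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF

def frame_record(payload):
    # length (uint64 LE) + masked CRC of length + payload + masked CRC of payload
    length = struct.pack('<Q', len(payload))
    return (length
            + struct.pack('<I', masked_crc(length))
            + payload
            + struct.pack('<I', masked_crc(payload)))
```

A TFRecord file is just these frames concatenated back to back, which is why records can only be read sequentially.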

Next, use the program above to generate several files: 'a.tfrecord', 'b.tfrecord', 'c.tfrecord', 'd.tfrecord', 'e.tfrecord', 'f.tfrecord'.
Now read those TFRecord files back, interleaving reads across two files in parallel and pulling data through an iterator. Since this is the simplest possible example, the records were written without any preprocessing, so no parsing step is needed when reading them either.

import tensorflow as tf

current_file = ['a.tfrecord', 'b.tfrecord', 'c.tfrecord',
                'd.tfrecord', 'e.tfrecord', 'f.tfrecord']

filenames = tf.data.Dataset.list_files(current_file, shuffle=False)

# Read two files at a time, cycling between them record by record.
dataset = filenames.apply(tf.contrib.data.parallel_interleave(
    lambda filename: tf.data.TFRecordDataset(filename), cycle_length=2))

iterator_hdfs = dataset.make_initializable_iterator()
parse_data = iterator_hdfs.get_next()

with tf.Session() as sess:
    sess.run(iterator_hdfs.initializer)
    i = 0
    while True:
        i += 1
        print('Training batch', i)
        try:
            print(sess.run(parse_data))
        except tf.errors.OutOfRangeError:
            print('out of range')
            break
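The round-robin pattern behind parallel_interleave with cycle_length=2 can be sketched in plain Python (interleave here is a hypothetical helper, not TensorFlow API): keep cycle_length sources open at once, emit block_length items from each in turn, and replace a source with the next pending one when it is exhausted.

```python
def interleave(sources, cycle_length=2, block_length=1):
    # Pure-Python sketch of the round-robin pattern behind
    # tf.contrib.data.parallel_interleave (ordering semantics approximate).
    pending = iter(sources)
    active = []
    while True:
        # Refill the active set up to cycle_length open sources.
        while len(active) < cycle_length:
            try:
                active.append(iter(next(pending)))
            except StopIteration:
                break
        if not active:
            return
        idx = 0
        while idx < len(active):
            taken = 0
            try:
                while taken < block_length:
                    yield next(active[idx])
                    taken += 1
                idx += 1
            except StopIteration:
                # This source is exhausted; drop it so a pending
                # source can take its slot on the next pass.
                del active[idx]
```

With three two-record files and cycle_length=2, the first two files are consumed alternately before the third file's records appear, which mirrors the interleaved output you see from the TF program above.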

Program output:

To be continued...

One possible pitfall: list_files

Data is often stored in chronological or some other meaningful order, and when reading we usually want the files back in that same order; this is where things can go wrong. When I first used this code, the files came back in a scrambled order, and the culprit turned out to be list_files shuffling the file names.

filenames = tf.data.Dataset.list_files(current_file)

Let's look at the function's source. It takes two parameters. The first is file_pattern: "A string or scalar string tf.Tensor, representing the filename pattern that will be matched." That part is straightforward. The second is shuffle: "(Optional.) If True, the file names will be shuffled randomly. Defaults to True." When it is true, the file names are shuffled randomly, and the default is True:

 if shuffle is None:
      shuffle = True

So if the code does not specify this argument, the default kicks in, and that is exactly what scrambles the file order.
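The shuffle=None → True behavior can be mimicked in a few lines of plain Python (list_files_sketch is a hypothetical stand-in for illustration, not TensorFlow API):

```python
import random

def list_files_sketch(names, shuffle=None):
    # Mirrors the logic in tf.data.Dataset.list_files:
    # an unspecified shuffle argument is treated as True.
    if shuffle is None:
        shuffle = True
    result = list(names)
    if shuffle:
        random.shuffle(result)
    return result
```

Calling it without the keyword shuffles the names; only an explicit shuffle=False preserves the input order.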

  def list_files(file_pattern, shuffle=None):
    """A dataset of all files matching a pattern.

    Example:
      If we had the following files on our filesystem:
        - /path/to/dir/a.txt
        - /path/to/dir/b.py
        - /path/to/dir/c.py
      If we pass "/path/to/dir/*.py" as the directory, the dataset would
      produce:
        - /path/to/dir/b.py
        - /path/to/dir/c.py

    NOTE: The order of the file names returned can be non-deterministic even
    when `shuffle` is `False`.

    Args:
      file_pattern: A string or scalar string `tf.Tensor`, representing
        the filename pattern that will be matched.
      shuffle: (Optional.) If `True`, the file names will be shuffled randomly.
        Defaults to `True`.

    Returns:
     Dataset: A `Dataset` of strings corresponding to file names.
    """
    # TODO(b/73959787): Add a `seed` argument and make the `shuffle=False`
    # behavior deterministic (e.g. by sorting the filenames).
    if shuffle is None:
      shuffle = True
    matching_files = gen_io_ops.matching_files(file_pattern)
    dataset = Dataset.from_tensor_slices(matching_files)
    if shuffle:
      # NOTE(mrry): The shuffle buffer size must be greater than zero, but the
      # list of files might be empty.
      buffer_size = math_ops.maximum(
          array_ops.shape(matching_files, out_type=dtypes.int64)[0], 1)
      dataset = dataset.shuffle(buffer_size)
    return dataset

When we do not want the files shuffled, we should pass shuffle=False explicitly:

filenames = tf.data.Dataset.list_files(current_file, shuffle=False)
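Note the docstring's caveat, though: even with shuffle=False the order can be non-deterministic. When a strict order matters, a common workaround is to match and sort the file names in plain Python before handing them to tf.data. A sketch (fnmatch stands in for the filesystem globbing that list_files performs internally):

```python
import fnmatch

def matched_sorted(names, pattern):
    # Deterministic alternative to relying on list_files ordering:
    # match the pattern ourselves, then sort the result.
    return sorted(n for n in names if fnmatch.fnmatch(n, pattern))
```

The sorted list can then be passed to tf.data.Dataset.from_tensor_slices directly, bypassing list_files and its ordering caveats entirely.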
