These are my study notes on the official documentation.

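With the tf.data API you can build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.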

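The tf.data API introduces tf.data.Dataset, which represents a sequence of elements in which each element consists of one or more components. For a detailed description of the Dataset API, see [Tensorflow 2 API: Introduction to the Dataset class](https://www.jianshu.com/writer#/notebooks/48504509/notes/82119988).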

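There are two ways to create a Dataset:

From a data source: build a Dataset from data in memory or in one or more files.
From a transformation: derive a Dataset from one or more other Datasets.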

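Basics

To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in files in the recommended TFRecord format, you can use tf.data.TFRecordDataset().

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as [Dataset.map()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) and multi-element transformations such as Dataset.batch().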
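To make the difference between the two in-memory constructors concrete, here is a minimal sketch. The toy `data` list and the variable names are illustrative assumptions, not part of the original notes; the calls themselves are the standard tf.data ones named above.

```python
import tensorflow as tf

# Toy input (an assumption for illustration): 3 rows of 2 values each.
data = [[1, 2], [3, 4], [5, 6]]

# from_tensors wraps the whole input as ONE element of shape (3, 2).
whole = tf.data.Dataset.from_tensors(data)
# from_tensor_slices slices along the first axis: THREE elements of shape (2,).
slices = tf.data.Dataset.from_tensor_slices(data)

num_whole = len(list(whole))
num_slices = len(list(slices))
print(num_whole, num_slices)

# Chain a per-element transformation (map) with a multi-element one (batch).
pipeline = slices.map(lambda row: row * 10).batch(2)
batches = [batch.numpy().tolist() for batch in pipeline]
print(batches)
```

Each call returns a new Dataset, so the source `slices` stays unchanged and can be reused in other pipelines.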

A Dataset object is a Python iterable, so you can consume its elements with a for loop:

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

for elem in dataset:
  print(elem.numpy())

Or you can explicitly create a Python iterator with iter and consume its elements with next:

it = iter(dataset)

print(next(it).numpy())

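Alternatively, you can consume the elements of a dataset with the reduce transformation, which reduces all elements to a single result. The following example uses reduce to compute the sum of a dataset of integers.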

print(dataset.reduce(0, lambda state, value: state + value).numpy())

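Dataset structure

A Dataset produces a sequence of elements that all have the same (nested) structure. The individual components of that structure can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, and tf.data.Dataset.

The Dataset.element_spec property lets you inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects that matches the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example: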

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
dataset1.element_spec
--------------------------------------------------------------------
TensorSpec(shape=(10,), dtype=tf.float32, name=None)

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec
--------------------------------------------------------------------
(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3.element_spec
--------------------------------------------------------------------
(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
dataset4.element_spec
--------------------------------------------------------------------
SparseTensorSpec(TensorShape([3, 4]), tf.int32)

# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type
--------------------------------------------------------------------
tensorflow.python.framework.sparse_tensor.SparseTensor

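Dataset transformations support datasets of any structure. When you use Dataset.map() and Dataset.filter(), which apply a function to each element, the element structure determines the arguments of the function.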

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))
dataset1
--------------------------------------------------------------------

for z in dataset1:
  print(z.numpy())
--------------------------------------------------------------------
[1 3 6 2 4 3 7 5 9 5]
[8 9 5 3 6 4 1 6 6 3]
[4 8 2 3 3 6 9 8 5 5]
[7 8 7 9 5 5 6 4 8 8]

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2
--------------------------------------------------------------------

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3
--------------------------------------------------------------------

for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))
--------------------------------------------------------------------
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
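How the element structure maps onto function arguments is easiest to see on small data. The sketch below is illustrative only (the toy datasets and names are assumptions): tuple elements are unpacked into one argument per top-level component, while dict elements arrive as a single dict argument.

```python
import tensorflow as tf

# Tuple elements: the function receives one argument per top-level component.
pairs = tf.data.Dataset.from_tensor_slices(([1, 2, 3, 4], [10, 20, 30, 40]))
sums = pairs.map(lambda a, b: a + b)
even_only = pairs.filter(lambda a, b: tf.equal(a % 2, 0))

# Dict elements: the function receives a single dict argument.
records = tf.data.Dataset.from_tensor_slices({"x": [1, 2, 3], "y": [4, 5, 6]})
products = records.map(lambda rec: rec["x"] * rec["y"])

sum_values = [int(v) for v in sums.as_numpy_iterator()]
even_values = [(int(a), int(b)) for a, b in even_only.as_numpy_iterator()]
product_values = [int(v) for v in products.as_numpy_iterator()]
print(sum_values, even_values, product_values)
```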

Reading input data

NumPy arrays

If all of your input data fits in memory, the simplest way to build a Dataset from it is to convert it to tf.Tensor objects with Dataset.from_tensor_slices().

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset

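Note: the code snippet above embeds the features and labels arrays in the TensorFlow graph as tf.constant() operations. This works well for a small dataset, but it wastes memory, because the contents of the arrays are copied multiple times, and it can run into the 2GB limit of the tf.GraphDef protocol buffer.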

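Python generators

Another common data source is the Python generator.

Note: although this is a convenient approach, its portability and scalability are limited. It must run in the same Python process that created the generator, and it is still subject to the Python GIL.

A Python generator:

def count(stop):
  i = 0
  while i < stop:
    yield i
    i += 1

The Dataset.from_generator constructor converts the Python generator into a fully functional tf.data.Dataset.
The constructor takes a callable as input, not an iterator. This allows it to restart the generator when it reaches the end. It takes an optional args parameter, which is passed as the callable's arguments.
The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.

ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )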

for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())
------------------------------------------------------------------------
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
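In newer TensorFlow releases (2.4 and later, as far as I know) from_generator also accepts a single output_signature argument in place of output_types and output_shapes. The sketch below assumes such a version is installed; the generator is the same counting generator as above, redefined so the block is self-contained.

```python
import tensorflow as tf

def count(stop):
    # Same counting generator as in the notes above.
    i = 0
    while i < stop:
        yield i
        i += 1

# output_signature bundles dtype and shape into one tf.TensorSpec
# (assumes a TensorFlow version that supports this argument).
ds_counter = tf.data.Dataset.from_generator(
    count,
    args=[5],
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

first_pass = [int(v) for v in ds_counter.as_numpy_iterator()]
# Because a callable (not an iterator) was passed, the generator can restart.
two_epochs = [int(v) for v in ds_counter.repeat(2).as_numpy_iterator()]
print(first_pass)
print(two_epochs)
```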
 
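The output_shapes argument is not required, but it is strongly recommended because many TensorFlow operations do not support tensors of unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.
Note also that output_shapes and output_types follow the same nesting rules as other dataset methods.
Here is an example generator that demonstrates both aspects: it returns tuples of arrays, where the second array is a vector of unknown length.

def gen_series():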
  i = 0
  while True:
    size = np.random.randint(0, 10)
    yield i, np.random.normal(size=(size,))
    i += 1

for i, series in gen_series():
  print(i, ":", str(series))
  if i > 5:
    break
------------------------------------------------------------------------
0 : [-0.1241  0.5308  0.3018]
1 : []
2 : [ 0.5769 -0.8721  2.0072 -1.7862  0.8289  0.59  ]
3 : [-1.5209 -1.5252  0.2506 -0.526   1.2647 -1.2677 -1.4078]
4 : [ 1.6039  0.2602  0.2278  1.205  -0.8033  0.3032]
5 : [ 0.5982  1.5779  0.0248  1.3666 -1.9277  1.3854 -0.4739]
6 : [ 1.0598 -0.2546  0.5908  1.3619  1.1141 -0.6058  0.8438 -2.4862]
 
The first output is an int32 and the second is a float32.
The first item is a scalar with shape (), and the second is a vector of unknown length with shape (None,).

ds_series = tf.data.Dataset.from_generator(
    gen_series, 
    output_types=(tf.int32, tf.float32), 
    output_shapes=((), (None,)))

ds_series
------------------------------------------------------------------------

 
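Now it can be used like a regular tf.data.Dataset. Note that when batching a dataset with variable shapes, you need to use Dataset.padded_batch,
which pads the elements within each batch to a common length.

ds_series_batch = ds_series.shuffle(20).padded_batch(10)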

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())
------------------------------------------------------------------------
[ 3  7  6 22  0 13  4 23  5 17]

[[ 9.5867e-01 -2.5104e-01  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00]
 [-1.4743e-01  4.1422e-02  4.8626e-01 -4.4328e-01 -3.0196e+00  3.0172e-01
   0.0000e+00]
 [ 2.9469e-01 -2.8750e-01 -1.2391e-01 -7.7315e-01  7.5218e-01  7.9246e-01
   0.0000e+00]
 [ 1.5680e+00 -6.4869e-01 -7.5440e-01  3.3234e-01 -1.0759e+00  0.0000e+00
   0.0000e+00]
 [ 4.0357e-01 -7.8729e-01  2.1975e-02  2.4870e-02 -9.1991e-01 -2.1324e+00
   0.0000e+00]
 [-8.2417e-02  1.0919e+00 -6.6252e-01 -4.2764e-01  7.9078e-01  1.9829e-03
  -9.5911e-01]
 [-2.7661e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00]
 [ 0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00]
 [ 1.7720e-01  1.1324e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00]
 [-4.8885e-01  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00
   0.0000e+00]]
 
In some real-world cases you need to wrap preprocessing.image.ImageDataGenerator as a tf.data.Dataset.

# First download the data:
flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

# Create the image.ImageDataGenerator
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))
------------------------------------------------------------------------
Found 3670 images belonging to 5 classes.
 
print(images.dtype, images.shape)
print(labels.dtype, labels.shape)
------------------------------------------------------------------------
float32 (32, 256, 256, 3)
float32 (32, 5)
 
ds = tf.data.Dataset.from_generator(
    lambda: img_gen.flow_from_directory(flowers), 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([32,256,256,3], [32,5])
)

ds.element_spec
------------------------------------------------------------------------
(TensorSpec(shape=(32, 256, 256, 3), dtype=tf.float32, name=None),
 TensorSpec(shape=(32, 5), dtype=tf.float32, name=None))
 
for images, labels in ds.take(1):
  print('images.shape: ', images.shape)
  print('labels.shape: ', labels.shape)
------------------------------------------------------------------------
Found 3670 images belonging to 5 classes.
images.shape:  (32, 256, 256, 3)
labels.shape:  (32, 5)
 
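TFRecord
See Loading TFRecords for an end-to-end example.
The tf.data API supports a variety of file formats, so you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you stream over the contents of one or more TFRecord files as part of an input pipeline.
Here is an example that uses the test file from the French Street Name Signs (FSNS) dataset.

# Download one TFRecord test file from the FSNS dataset.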
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
 
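The filenames argument of the TFRecordDataset initializer can be a string, a list of strings, or a tf.Tensor of strings. So if you have two sets of files for training and validation, you can write a factory method that produces the dataset, taking the filenames as an input argument:

dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])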
dataset
------------------------------------------------------------------------

 
Many TensorFlow projects use serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())
parsed.features.feature['image/text']
------------------------------------------------------------------------
bytes_list {
  value: "Rue Perreyon"
}
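For experimenting without downloading the FSNS file, a small round trip can be sketched as follows: write two tf.train.Example records to a temporary TFRecord file, then read and parse them back. The feature names "text" and "value", the helper make_example, and the sample values are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

def make_example(text, value):
    # Hypothetical helper: pack one string and one integer into a tf.train.Example.
    return tf.train.Example(features=tf.train.Features(feature={
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
        "value": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[value])),
    }))

# Write two serialized records to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for text, value in [("Rue A", 1), ("Rue B", 2)]:
        writer.write(make_example(text, value).SerializeToString())

# Describe the expected features so the records can be parsed inside the pipeline.
feature_spec = {
    "text": tf.io.FixedLenFeature([], tf.string),
    "value": tf.io.FixedLenFeature([], tf.int64),
}
parsed_ds = tf.data.TFRecordDataset([path]).map(
    lambda record: tf.io.parse_single_example(record, feature_spec))

rows = [(r["text"].decode("utf-8"), int(r["value"]))
        for r in parsed_ds.as_numpy_iterator()]
print(rows)
```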
 
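Text data
See Loading Text for an end-to-end example.
Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset produces one string-valued element per line of those files.

directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'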
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]

dataset = tf.data.TextLineDataset(file_paths)
 
Here are the first few lines of the first file:

for line in dataset.take(5):
  print(line.numpy())
---------------------------------------------
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'
 
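To alternate lines between files, use Dataset.interleave. This makes it easier to shuffle files together. Here are the first, second, and third lines from each translation:
**This is a neat feature: it takes the first line of the first file, then the first line of the second file, then the first line of the third file, then the second line of the first file, the second line of the second file, and so on.**

files_ds = tf.data.Dataset.from_tensor_slices(file_paths)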
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(lines_ds.take(9)):
  if i % 3 == 0:
    print()
  print(line.numpy())
---------------------------------------------
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'
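The round-robin order can be modeled in plain Python. This is only a simplified, sequential sketch of the ordering (the function name interleave_order and the toy inputs are assumptions); the real Dataset.interleave also supports parallelism and non-deterministic ordering, which this model ignores.

```python
def interleave_order(sources, cycle_length, block_length=1):
    """Simplified, sequential model of the element order of Dataset.interleave."""
    remaining = iter(sources)
    slots = [None] * cycle_length  # each slot holds an open iterator or None
    inputs_exhausted = False
    output = []
    while True:
        for i in range(cycle_length):
            # Open the next input in an empty slot, if any inputs are left.
            if slots[i] is None and not inputs_exhausted:
                try:
                    slots[i] = iter(next(remaining))
                except StopIteration:
                    inputs_exhausted = True
            if slots[i] is None:
                continue
            # Take up to block_length consecutive items from this slot.
            for _ in range(block_length):
                try:
                    output.append(next(slots[i]))
                except StopIteration:
                    slots[i] = None
                    break
        if inputs_exhausted and all(slot is None for slot in slots):
            return output

files = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
all_three = interleave_order(files, cycle_length=3)
two_at_a_time = interleave_order(files, cycle_length=2)
print(all_three)
print(two_at_a_time)
```

With cycle_length=3 the three "files" alternate line by line, as in the output above; with cycle_length=2 only the first two alternate, and the third is read once a slot frees up.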
 
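By default, a TextLineDataset yields every line of each file, which may not be desirable, for example if the file starts with a header line or contains comments. These lines can be removed with the Dataset.skip() or Dataset.filter() transformations. Here, the first line is skipped and the rest is filtered to keep only survivors.

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")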
titanic_lines = tf.data.TextLineDataset(titanic_file)

for line in titanic_lines.take(3):
  print(line.numpy())
---------------------------------------------
b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
 
def survived(line):
  return tf.not_equal(tf.strings.substr(line, 0, 1), "0")

survivors = titanic_lines.skip(1).filter(survived)

for line in survivors.take(3):
  print(line.numpy())
---------------------------------------------
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
 
CSV files
Two end-to-end examples:

Loading CSV Files
Loading Pandas DataFrames

CSV is a very popular file format for storing structured data.

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df = pd.read_csv(titanic_file)
df.head()
 
If the data fits in memory, the same [Dataset.from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) method is a simple option. Here it is applied to a dictionary of columns:

titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
---------------------------------------------------------------------------
 'survived'          : 0
  'sex'               : b'male'
  'age'               : 22.0
  'n_siblings_spouses': 1
  'parch'             : 0
  'fare'              : 7.25
  'class'             : b'Third'
  'deck'              : b'unknown'
  'embark_town'       : b'Southampton'
  'alone'             : b'n'
 
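A more scalable approach is to load from disk as needed.
The tf.data module provides methods to extract records from one or more CSV files that comply with RFC 4180.
The experimental.make_csv_dataset function is the high-level interface for reading sets of CSV files. It supports column type inference and many other features, such as batching and shuffling, to simplify usage.

titanic_batches = tf.data.experimental.make_csv_dataset(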
    titanic_file, batch_size=4,
    label_name="survived")

for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  print("features:")
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
---------------------------------------------------------------------------
'survived': [0 0 1 1]
features:
  'sex'               : [b'male' b'male' b'female' b'female']
  'age'               : [38. 16. 28. 28.]
  'n_siblings_spouses': [0 0 0 0]
  'parch'             : [1 0 0 0]
  'fare'              : [153.4625  10.5      7.8792   7.7333]
  'class'             : [b'First' b'Second' b'Third' b'Third']
  'deck'              : [b'C' b'unknown' b'unknown' b'unknown']
  'embark_town'       : [b'Southampton' b'Southampton' b'Queenstown' b'Queenstown']
  'alone'             : [b'n' b'y' b'y' b'y']
 
You can use the select_columns argument if you only need a subset of columns.

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])

for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
---------------------------------------------------------------------------
'survived': [0 0 0 0]
  'fare'              : [25.4667  7.8958 13.      7.8958]
  'class'             : [b'Third' b'Third' b'Second' b'Third']
 
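There is also a lower-level experimental.CsvDataset class, which provides finer-grained control. It does not support column type inference; instead, you must specify the type of each column.

titanic_types  = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string]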
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)

for line in dataset.take(10):
  print([item.numpy() for item in line])
---------------------------------------------------------------------------
[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
 
If some columns are empty, this low-level interface lets you provide default values instead of column types.

# Creates a dataset that reads all of the records from a CSV file with
# four integer columns, any of which may have missing values.

record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

for line in dataset:
  print(line.numpy())

---------------------------------------------------------------------------
Contents of missing.csv:
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,

>>>>>>>>>>>>>>>>>>>>>>>>>>>

[1 2 3 4]
[999   2   3   4]
[  1 999   3   4]
[  1   2 999   4]
[  1   2   3 999]
[999 999 999 999]
 
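By default, a CsvDataset yields every column of every line of the file, which may not be desirable, for example if the file starts with a header line that should be ignored, or if some columns are not required in the input. These lines and fields can be removed with the header and select_cols arguments respectively.

# Creates a dataset that reads all of the records from the CSV file,
# extracting integer data from columns 2 and 4 (indices 1 and 3).
record_defaults = [999, 999] # Only provide defaults for the selected columns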
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

for line in dataset:
  print(line.numpy())
>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2 4]
[2 4]
[999   4]
[2 4]
[  2 999]
[999 999]
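The two CsvDataset snippets above assume that missing.csv already exists. A self-contained sketch that first writes such a file to a temporary directory might look like this; the file contents are inferred from the outputs shown above, and the temporary path is an assumption of this example.

```python
import os
import tempfile

import tensorflow as tf

# Write a small CSV file with missing values (contents inferred from the outputs above).
csv_text = "1,2,3,4\n,2,3,4\n1,,3,4\n1,2,,4\n1,2,3,\n,,,\n"
csv_path = os.path.join(tempfile.mkdtemp(), "missing.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)

# All four columns, with 999 substituted for every missing integer field.
record_defaults = [999, 999, 999, 999]
rows_ds = tf.data.experimental.CsvDataset(csv_path, record_defaults)
rows_ds = rows_ds.map(lambda *cols: tf.stack(cols))
rows = [row.tolist() for row in rows_ds.as_numpy_iterator()]

# Only the 2nd and 4th columns (indices 1 and 3).
picked_ds = tf.data.experimental.CsvDataset(
    csv_path, [999, 999], select_cols=[1, 3])
picked = [[int(a), int(b)] for a, b in picked_ds.as_numpy_iterator()]

print(rows)
print(picked)
```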
 
Sets of files
Many datasets are distributed as a set of files, where each file is one example.

flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)
 
The root directory contains a directory for each class:

for item in flowers_root.glob("*"):
  print(item.name)
-----------------------------------------------------------------
sunflowers
daisy
LICENSE.txt
roses
tulips
dandelion
 
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

for f in list_ds.take(5):
  print(f.numpy())
--------------------------------------------------------------------------
b'/home/kbuilder/.keras/datasets/flower_photos/daisy/9489270024_1b05f08492_m.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/daisy/11023214096_b5b39fab08.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/sunflowers/14928117202_139d2142cc_n.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/roses/2215318403_06eb99176a.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/tulips/7481215720_73e40f178f_n.jpg'
 
Read the data with the tf.io.read_file function and extract the label from the path, returning (image, label) pairs:

def process_path(file_path):
  label = tf.strings.split(file_path, os.sep)[-2]
  return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)

for image_raw, label_text in labeled_ds.take(1):
  print(repr(image_raw.numpy()[:100]))
  print()
  print(label_text.numpy())
--------------------------------------------------------------------------
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00acspMSFT\x00\x00\x00\x00IEC sRGB\x00\x00\x00\x00\x00\x00'

b'daisy'
 
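Batching dataset elements
Here batching means splitting a Dataset into batches, the same notion as a batch in training.
Simple batching
The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: that is, for each component i, all elements must have a tensor of the exact same shape.

inc_dataset = tf.data.Dataset.range(100)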
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

for batch in batched_dataset.take(4):
  print([arr.numpy() for arr in batch])
--------------------------------------------------------------------------------------
[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]
 
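While tf.data tries to propagate shape information, the default settings of Dataset.batch result in an unknown batch size, because the last batch may not be full. Note the None in the shape:

batched_dataset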
--------------------------------------------------------------------------------------

 
Use the drop_remainder argument to ignore that last batch and get full shape propagation:

batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset
--------------------------------------------------------------------------------------
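The effect of drop_remainder on the static shape can be checked directly through element_spec. A minimal sketch on a toy range dataset (variable names are illustrative):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)

loose = ds.batch(7)                         # last batch may be smaller
strict = ds.batch(7, drop_remainder=True)   # partial last batch is dropped

# Static batch dimension: unknown (None) vs. fixed (7).
loose_shape = loose.element_spec.shape.as_list()
strict_shape = strict.element_spec.shape.as_list()

# 100 = 14 * 7 + 2, so the default setting ends with a batch of 2 elements.
last_loose = list(loose.as_numpy_iterator())[-1].tolist()
n_strict = sum(1 for _ in strict)

print(loose_shape, strict_shape, last_loose, n_strict)
```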

 
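Batching tensors with padding
The recipe above works only when all tensors have the same size. However, many models (for example, sequence models) work with input data of varying size (for example, sequences of different lengths). To handle this case, the Dataset.padded_batch transformation pads the tensors so that they line up within each batch only; different batches may end up with different padded lengths.

dataset = tf.data.Dataset.range(100)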
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
  print(batch.numpy())
  print()
--------------------------------------------------------------------------------------
[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
 
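The Dataset.padded_batch transformation lets you set different padding for each dimension of each component, and the padded length may be variable (signified by None in the example above) or constant. You can also override the padding value, which defaults to 0.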
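As a sketch of the constant-length and custom-padding-value options just mentioned (the toy dataset, the target length 5, and the padding value -1 are arbitrary choices for illustration):

```python
import tensorflow as tf

# Elements of growing length: [1], [2 2], [3 3 3], [4 4 4 4] (dtype int64).
ds = tf.data.Dataset.range(1, 5)
ds = ds.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))

# Pad every element to a constant length of 5, using -1 instead of the default 0.
# The padding value is created with the same dtype as the elements (int64).
padded = ds.padded_batch(
    2,
    padded_shapes=(5,),
    padding_values=tf.constant(-1, dtype=tf.int64))

batches = [b.tolist() for b in padded.as_numpy_iterator()]
for batch in batches:
    print(batch)
```

A padding value other than 0 is handy when 0 is itself a valid data value, for example a token id.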
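Training workflows
Processing multiple epochs
The tf.data API offers two main ways to process multiple epochs of the same data.
The simplest way to iterate over a dataset in multiple epochs is to use the Dataset.repeat() transformation. First, create a dataset:

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")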
titanic_lines = tf.data.TextLineDataset(titanic_file)

def plot_batch_sizes(ds):
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')
 
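Then apply Dataset.repeat() so that the Dataset can be iterated more than once. Called with no arguments, repeat repeats the input indefinitely; the example below passes 3, so the data is repeated three times.
The Dataset.repeat transformation does not know where each epoch ends; it simply keeps supplying data as requested. Because of this, a Dataset.batch applied after Dataset.repeat yields batches that straddle epoch boundaries (in other words, elements left over at the end of one epoch are batched together with elements from the start of the next epoch):

titanic_batches = titanic_lines.repeat(3).batch(128)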
plot_batch_sizes(titanic_batches)
 
  
   
     
    
   
  
   
  
If each epoch should start from the beginning of the Dataset, so that no batch mixes two epochs, put batch before repeat:

titanic_batches = titanic_lines.batch(128).repeat(3)
plot_batch_sizes(titanic_batches)
 
  
   
     
    
   
  
   
  
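If you would like to perform a custom computation (for example, to collect statistics) at the end of each epoch, it is simplest to restart the dataset iteration on each epoch:

epochs = 3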
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)
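Since the bar charts are not reproduced in these notes, the effect of the two orderings can be checked numerically on a toy dataset. This sketch uses a 10-element range with batch size 4 instead of the Titanic file; both are arbitrary choices for illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# repeat before batch: batches straddle epoch boundaries,
# so only the very last batch is short (30 elements = 7 * 4 + 2).
repeat_then_batch = [len(b) for b in ds.repeat(3).batch(4).as_numpy_iterator()]

# batch before repeat: every epoch ends with its own short batch.
batch_then_repeat = [len(b) for b in ds.batch(4).repeat(3).as_numpy_iterator()]

print(repeat_then_batch)
print(batch_then_repeat)
```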
 
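Randomly shuffling input data
The Dataset.shuffle() transformation maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.
Note: while large buffer_sizes shuffle more thoroughly, they can take a lot of memory and significant time to fill. If this becomes a problem, consider using Dataset.interleave across files.
Add an index to the dataset so you can see the effect:

lines = tf.data.TextLineDataset(titanic_file)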
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset
 
Since the buffer_size is 100 and the batch size is 20, the first batch contains no elements with an index over 120.

n,line_batch = next(iter(dataset))
print(n.numpy())
--------------------------------------------------------------------
[ 94  56  75  71  27  35  99  14  20  33  60   4  87  47  32  19  55  93
 112 103]
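Why the first batch is bounded this way follows from how a fixed-size shuffle buffer works. The plain-Python model below is a simplified sketch of that mechanism, not TensorFlow's actual implementation; the function name buffered_shuffle, the seed, and the input length 628 are assumptions for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of a fixed-size shuffle buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered item, then refill its slot from the source.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: drain what is left in the buffer in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(628), buffer_size=100))
first_batch = shuffled[:20]
print(first_batch)
```

In this model the k-th output (counting from 0) can only come from the first buffer_size + k input elements, which is consistent with the bound stated above and explains why a small buffer shuffles only locally.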
 
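As with Dataset.batch, the order relative to Dataset.repeat matters.
Dataset.shuffle does not signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat shows every element of one epoch before moving on to the next:

dataset = tf.data.Dataset.zip((counter, lines))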
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())
--------------------------------------------------------------------
Here are the item ID's near the epoch boundary:

[371 576 469 293 598 618 559 512 491 524]
[527 613 566 625 573 621 608 375 568 587]
[600 578 617 580 496  18 541 601]
[16 54 62 59 98 18 82 61 91 99]
[50 86 75 90 40 92 63 94 51 80]
 
shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()
 
  
   
     
    
   
   
  
But a repeat placed before a shuffle mixes the epoch boundaries together:

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())

repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()
 
  
   
     
    
   
  
   
  
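Preprocessing data
The Dataset.map(f) transformation produces a new Dataset by applying a given function f to each element of the input Dataset, much like Python's built-in map function.
Decoding image data and resizing it

# Rebuild the flower filenames dataset: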
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

# Write a function that manipulates the dataset elements.
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
  parts = tf.strings.split(filename, os.sep)
  label = parts[-2]

  image = tf.io.read_file(filename)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  image = tf.image.resize(image, [128, 128])
  return image, label

# Test that it works.
file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
  plt.figure()
  plt.imshow(image)
  plt.title(label.numpy().decode('utf-8'))
  plt.axis('off')

show(image, label)

# Map it over the dataset.
images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
  show(image, label)
 
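Applying arbitrary Python logic
For performance reasons, use TensorFlow operations to preprocess your data whenever possible. However, it is sometimes useful to call external Python libraries when parsing your input data. You can use the tf.py_function() operation in a Dataset.map() transformation.
For example, if you want to apply a random rotation, the tf.image module only has tf.image.rot90, which is not very useful for image augmentation.
The example below uses scipy.ndimage.rotate together with tf.py_function to rotate images.

import scipy.ndimage as ndimage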

def random_rotate_image(image):
  image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
  return image

image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)
 
To use this function with Dataset.map, the same caveats apply as with Dataset.from_generator: you need to describe the return shapes and types when you apply the function:

def tf_random_rotate_image(image, label):
  im_shape = image.shape
  [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
  image.set_shape(im_shape)
  return image, label

rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
  show(image, label)
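The same wrapping pattern can be tried without the image data. The sketch below wraps a trivial NumPy operation (reversing a vector) in tf.py_function and restores the static shape afterwards; the function names and the toy dataset are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def reverse_with_numpy(x):
    # Runs as ordinary Python: x arrives as an eager tensor, so .numpy() works.
    return np.ascontiguousarray(x.numpy()[::-1])

def tf_reverse(x):
    static_shape = x.shape
    # tf.py_function needs the output dtypes declared up front.
    [y] = tf.py_function(reverse_with_numpy, [x], [tf.int64])
    # The shape is unknown after py_function, so set it back explicitly.
    y.set_shape(static_shape)
    return y

ds = tf.data.Dataset.range(6).batch(3, drop_remainder=True).map(tf_reverse)

spec_shape = ds.element_spec.shape.as_list()
values = [row.tolist() for row in ds.as_numpy_iterator()]
print(spec_shape, values)
```

Without the set_shape call, element_spec would report an unknown shape, which can cause problems for later transformations that need static shapes.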
 
Parsing tf.Example protocol buffer messages
Many input pipelines extract tf.train.Example protocol buffer messages from a TFRecord format. Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.

fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

# You can work with tf.train.Example protos outside of a tf.data.Dataset to understand the data:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])

raw_example = next(iter(dataset))

def tf_parse(eg):
  example = tf.io.parse_example(
      eg[tf.newaxis], {
          'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
          'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
      })
  return example['image/encoded'][0], example['image/text'][0]

img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")

decoded = dataset.map(tf_parse)
decoded

image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape
 
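Time series windowing
For an end-to-end example, see the time series tutorial.
Time series data is often organized with the time axis intact. A simple Dataset.range is used here to demonstrate:

range_ds = tf.data.Dataset.range(100000)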
 
Typically, models based on this sort of data want a contiguous time slice. The simplest approach is to batch the data:

batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())
--------------------------------------------------------------------------------
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
 
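Or, to make dense predictions one step into the future, you can shift the features and labels by one step relative to each other:

def dense_1_step(batch):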
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())
--------------------------------------------------------------
[0 1 2 3 4 5 6 7 8]  =>  [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]  =>  [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28]  =>  [21 22 23 24 25 26 27 28 29]
 
To predict a whole window instead of a fixed offset, you can split the batches into two parts:

batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Inputs: all except the last 5 steps
          batch[-5:])   # Labels: the last 5 steps

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())
 
To allow some overlap between the features of one batch and the labels of another, use Dataset.zip:

feature_length = 10
label_length = 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])

predict_5_steps = tf.data.Dataset.zip((features, labels))

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())
---------------------------------------------------------------------------------
[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19]  =>  [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29]  =>  [30 31 32 33 34]
 
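Using window
While Dataset.batch works, there are situations where you may need finer control. The Dataset.window method gives you complete control, but requires some care: it returns a Dataset of Datasets. See the Dataset structure section for details.

window_size = 5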

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)
---------------------------------------------------------------
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
 
The Dataset.flat_map method can take a dataset of datasets and flatten it into a single dataset:

for x in windows.flat_map(lambda x: x).take(30):
   print(x.numpy(), end=' ')
 
In nearly all cases, you will want to batch the dataset first:

def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())
---------------------------------------------------------------------------------
[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
 
The shift argument controls how far each window moves over. Putting this together, you might write this function:

def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows

ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)

for example in ds.take(10):
  print(example.numpy())
---------------------------------------------------------------------------------
[ 0  3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
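How window_size, shift, and stride interact can also be modeled with plain list slicing. The function below is a simplified sketch that keeps only full windows (the drop_remainder=True case used above); the name sliding_windows is hypothetical.

```python
def sliding_windows(values, size, shift=1, stride=1):
    """Plain-Python model of window() + batch(size, drop_remainder=True)."""
    values = list(values)
    span = (size - 1) * stride + 1   # input positions covered by one window
    windows = []
    start = 0
    while start + span <= len(values):
        # Take `size` elements, `stride` apart, starting at `start`.
        windows.append(values[start:start + span:stride])
        start += shift               # the next window starts `shift` elements later
    return windows

strided = sliding_windows(range(100), size=10, shift=5, stride=3)
dense = sliding_windows(range(9), size=5)
print(strided[0])
print(strided[1])
print(dense)
```

shift sets the distance between the starts of consecutive windows, and stride sets the distance between consecutive elements inside one window.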
 
Extracting labels

dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())
---------------------------------------------------------------------------------
[ 0  3  6  9 12 15 18 21 24] => [ 3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
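Judging from the printed pairs, each window is split so that the labels are the inputs shifted one step into the future. A minimal sketch of that split on a plain list follows. The function name `split_inputs_labels` is hypothetical and is not the `dense_1_step` used above.

```python
def split_inputs_labels(window):
    """Inputs are all steps but the last; labels are the same steps shifted by one."""
    return window[:-1], window[1:]

inputs, labels = split_inputs_labels([0, 3, 6, 9, 12, 15, 18, 21, 24, 27])
print(inputs, "=>", labels)
# [0, 3, 6, 9, 12, 15, 18, 21, 24] => [3, 6, 9, 12, 15, 18, 21, 24, 27]
```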
 
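Resampling
When working with a dataset that is very class-imbalanced, you may want to resample it. tf.data provides two methods to do this. The credit card fraud dataset is a good example of this kind of problem.
For background on imbalanced data, see Imbalanced Data.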
 zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')

creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])
 
Now check the distribution of classes; it is highly skewed:
 def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts

counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)
---------------------------------------------------------------------------------
[0.9957 0.0043]
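To see what these fractions mean for training, a little arithmetic helps. The sketch below uses only the numbers printed above and the batch size of 1024, and it assumes (as a simplification) that examples are drawn independently.

```python
fractions = [0.9957, 0.0043]  # class fractions printed above
batch_size = 1024             # batch size passed to make_csv_dataset

# Expected number of positive (fraud) examples in one batch.
expected_positives = batch_size * fractions[1]
print(round(expected_positives, 1))  # 4.4

# Probability that a batch contains no positive example at all,
# under the independence assumption stated above.
p_no_positive = fractions[0] ** batch_size
print(round(p_no_positive, 3))  # 0.012
```

With only about four positives per thousand examples, a model sees very little signal from the minority class in each step, which is the motivation for resampling.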
 
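A common approach to training on an imbalanced dataset is to balance it. tf.data includes a few methods that enable this workflow:
Dataset sampling
One approach to resampling a dataset is to use tf.data.experimental.sample_from_datasets. It is most applicable when you have a separate data.Dataset for each class. Here, simply use filter to generate them from the credit card fraud data: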
 negative_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==0)
    .repeat())
positive_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==1)
    .repeat())

for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())

#  To use tf.data.experimental.sample_from_datasets  pass the datasets, and the weight for each:
balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)

# Now the dataset produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
  print(labels.numpy())
---------------------------------------------------------------------------------
[1 1 0 1 1 0 1 0 1 0]
[0 0 1 0 0 0 0 1 1 1]
[0 0 0 0 0 0 0 1 0 1]
[0 0 0 0 1 0 0 0 0 0]
[1 1 0 0 1 0 1 1 0 1]
[0 1 1 0 0 0 1 0 1 0]
[0 1 0 0 1 1 0 0 1 0]
[1 0 0 0 1 0 1 0 1 0]
[1 1 1 0 1 1 1 0 1 0]
[1 0 1 1 1 0 1 1 0 1]
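Conceptually, sampling from several datasets means choosing a source at random for every output element, according to the given weights. The following is a pure-Python sketch of that idea using iterators. It is not how TensorFlow implements it, and the function name `sample_from_iterators` is hypothetical.

```python
import itertools
import random

def sample_from_iterators(iterators, weights, seed=None):
    """Yield elements by picking a source iterator at random for each element."""
    rng = random.Random(seed)
    indices = range(len(iterators))
    while True:
        i = rng.choices(indices, weights=weights)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            # Simplification: stop as soon as any chosen source runs out.
            return

# Two infinite sources standing in for negative_ds and positive_ds.
negatives = itertools.repeat(0)
positives = itertools.repeat(1)

mixed = sample_from_iterators([negatives, positives], [0.5, 0.5], seed=0)
labels = list(itertools.islice(mixed, 10000))
print(sum(labels) / len(labels))  # close to 0.5
```

This also shows why both per-class datasets are repeated above: the minority class must never run out while sampling continues. In newer TensorFlow releases the same operation is also available as tf.data.Dataset.sample_from_datasets.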

 
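Rejection resampling
One problem with the approach above is that it needs a separate Dataset per class. You could build them with Dataset.filter, but that causes the data to be loaded more than once. tf.data.experimental.rejection_resample rebalances a dataset while loading it only once: elements are dropped from the stream until the desired balance is reached. rejection_resample takes a class_func argument, which is applied to every dataset element and returns the class that element belongs to.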
 def class_func(features, label):
  return label

# The resampler also needs a target distribution, and optionally an initial distribution estimate:
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)

#  The resampler deals with individual examples, so you must unbatch the dataset before applying the resampler:
resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)

# The resampler returns (class, example) pairs built from the output of the 
# class_func. In this case, the example was already a (feature, label) pair, so use 
# map to drop the extra copy of the labels:
balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

# Now the dataset produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
  print(labels.numpy())
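The idea behind rejection resampling can be sketched in plain Python: keep each element with a probability proportional to target fraction divided by initial fraction for its class, scaled so the largest probability is 1. This is a simplified illustration and not TensorFlow's actual algorithm. The function name and the synthetic stream are assumptions, and every entry of `initial_dist` is assumed to be non-zero.

```python
import random

def rejection_resample(examples, class_func, target_dist, initial_dist, seed=None):
    """Yield (class, example) pairs, dropping elements to approach target_dist."""
    rng = random.Random(seed)
    ratios = [t / p for t, p in zip(target_dist, initial_dist)]
    max_ratio = max(ratios)
    # The most under-represented class is always kept; others are thinned out.
    accept_prob = [r / max_ratio for r in ratios]
    for example in examples:
        c = class_func(example)
        if rng.random() < accept_prob[c]:
            yield c, example

# Synthetic (features, label) stream with roughly the skew measured above.
data_rng = random.Random(0)
stream = [("features", int(data_rng.random() < 0.0043)) for _ in range(200_000)]

resampled = list(rejection_resample(
    stream, lambda ex: ex[1],
    target_dist=[0.5, 0.5], initial_dist=[0.9957, 0.0043], seed=1))

positive_fraction = sum(c for c, _ in resampled) / len(resampled)
print(len(resampled), round(positive_fraction, 2))
```

The stream is read once, but most majority-class elements are discarded, so the balanced output is much shorter than the input.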
 
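Iterator checkpointing
TensorFlow supports taking checkpoints so that when the training process restarts, it can restore the latest checkpoint and recover most of its progress. In addition to checkpointing model variables, you can also checkpoint the progress of the dataset iterator. This can be useful if you have a large dataset and don't want to start from the beginning on every restart. Note, however, that iterator checkpoints may be large, because transformations such as shuffle and prefetch buffer elements inside the iterator.
To include the iterator in a checkpoint, pass the iterator to the tf.train.Checkpoint (https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint) constructor.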
 range_ds = tf.data.Dataset.range(20)

iterator = iter(range_ds)
ckpt = tf.train.Checkpoint(step=tf.Variable(0), iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, '/tmp/my_ckpt', max_to_keep=3)

print([next(iterator).numpy() for _ in range(5)])

save_path = manager.save()

print([next(iterator).numpy() for _ in range(5)])

ckpt.restore(manager.latest_checkpoint)

print([next(iterator).numpy() for _ in range(5)])

# **Note:** It is not possible to checkpoint an iterator which relies on external state 
# such as a [`tf.py_function`](https://www.tensorflow.org/api_docs/python/tf/py_function). 
# Attempting to do so will raise an exception complaining about the external state.

 
 tf.data + tf.keras 
Many tf.keras APIs, including Model.fit, Model.evaluate, and Model.predict, accept a Dataset as input.
 train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)

fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])

model.fit(fmnist_train_ds, epochs=2)
 
If you pass an infinite dataset (for example, one created by calling Dataset.repeat()), you only need to pass the steps_per_epoch argument as well:
model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)

# Evaluate on one full pass over the finite dataset.
loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)

# For an infinite or very long dataset, limit evaluation with `steps`.
loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)

# Model.predict does not need labels.
predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps=10)
print(result.shape)

# If the dataset does contain labels, Model.predict ignores them.
result = model.predict(fmnist_train_ds, steps=10)
print(result.shape)
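When choosing steps_per_epoch or steps, it helps to know how many batches one full pass contains. A small worked calculation for the setup above, taking the Fashion-MNIST training split as 60,000 images and the batch size of 32 used for fmnist_train_ds:

```python
import math

num_examples = 60_000  # size of the Fashion-MNIST training split
batch_size = 32        # batch size used for fmnist_train_ds above

# Batches needed to see every training example once.
steps_for_full_epoch = math.ceil(num_examples / batch_size)
print(steps_for_full_epoch)  # 1875

# Examples actually seen per "epoch" when steps_per_epoch=20.
examples_per_short_epoch = 20 * batch_size
print(examples_per_short_epoch)  # 640
```

So steps_per_epoch=20 covers only a small fraction of the data per epoch, and steps=10 means predict returns 10 * 32 = 320 rows.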
