黄水生

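TensorFlow 2.0 Guide: official tutorial study notes 14, 'tf.data: Build TensorFlow input pipelines'

These notes follow the official TensorFlow tutorial. They are mainly a translation and restructuring of the 'tf.data: Build TensorFlow input pipelines' guide. Original link: tf.data: Build TensorFlow input pipelines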

Contents
1. Basic mechanics
1.1 Dataset structure
2. Reading input data
2.1 Consuming NumPy arrays
2.2 Consuming Python generators
2.3 Consuming TFRecord data
2.4 Consuming text data
2.5 Consuming CSV data
2.6 Consuming sets of files
3. Batching dataset elements
3.1 Simple batching
3.2 Batching tensors with padding
4. Training workflows
4.1 Processing multiple epochs
4.2 Randomly shuffling input data
5. Preprocessing data
5.1 Decoding image data and resizing it
5.2 Applying arbitrary Python logic
5.3 Parsing tf.Example protocol buffer messages
5.4 Time series windowing
5.5 Resampling
6. Using high-level APIs
6.1 tf.keras
6.2 tf.estimator

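The tf.data API lets us build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces a tf.data.Dataset abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset:
- A data source constructs a Dataset from data stored in memory or in one or more files.
- A data transformation constructs a dataset from one or more tf.data.Dataset objects.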

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pathlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.set_printoptions(precision=4)

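1. Basic mechanics

To create an input pipeline, we must start with a data source. For example, to construct a Dataset from data in memory, we can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if our input data is stored in files in the recommended TFRecord format, we can use tf.data.TFRecordDataset().
Once we have a Dataset object, we can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, we can apply per-element transformations such as Dataset.map(), and multi-element transformations such as Dataset.batch(). For details, see: tf.data.Dataset
A Dataset object is a Python iterable, which makes it possible to consume its elements using a for loop: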

dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset

for elem in dataset:
  print(elem.numpy())

Or explicitly create a Python iterator using iter and consume its elements using next:

it = iter(dataset)

print(next(it).numpy())

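Alternatively, dataset elements can be consumed using the reduce transformation, which reduces all elements to produce a single result. The following example shows how to use the reduce transformation to compute the sum of a dataset of integers.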

print(dataset.reduce(0, lambda state, value: state + value).numpy())
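
As a rough mental model, reduce behaves like a left fold over the elements. The sketch below is plain Python, not the tf.data implementation; the helper name dataset_reduce is hypothetical and only mirrors the (initial_state, reduce_func) arguments used above.

```python
# Pure-Python model of a fold; dataset_reduce is a hypothetical helper,
# not part of the tf.data API.
def dataset_reduce(elements, initial_state, reduce_func):
    """Apply reduce_func(state, value) to every element in order."""
    state = initial_state
    for value in elements:
        state = reduce_func(state, value)
    return state

# Same values as the dataset built above.
elements = [8, 3, 0, 8, 2, 1]
total = dataset_reduce(elements, 0, lambda state, value: state + value)
print(total)  # 22
```

The state is threaded through every call, so the final state is the single result that reduce returns.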

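1.1 Dataset structure
A dataset contains elements that all have the same (nested) structure. The individual components of the structure can be of any type representable by tf.TypeSpec, including Tensor, SparseTensor, RaggedTensor, TensorArray, or Dataset.
The Dataset.element_spec property lets us inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects matching the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example: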

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec

TensorSpec(shape=(10,), dtype=tf.float32, name=None)

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec

(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec

(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))

dataset4.element_spec

SparseTensorSpec(TensorShape([3, 4]), tf.int32)

# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type

tensorflow.python.framework.sparse_tensor.SparseTensor

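Dataset transformations support datasets of any structure. When using the Dataset.map() and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of the function: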

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1

for z in dataset1:
  print(z.numpy())

[1 4 5 3 5 8 8 3 9 6]
[2 2 1 6 5 8 7 7 2 9]
[5 6 4 3 7 4 9 5 6 6]
[8 7 8 5 7 2 2 6 5 4]

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3

for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)

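2. Reading input data
2.1 Consuming NumPy arrays
For more examples, see (a VPN may be needed): Loading NumPy arrays
If all of our input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().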

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset

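	Note: the code snippet above embeds the features and labels arrays in the TensorFlow graph as tf.constant() operations.
	This works well for a small dataset, but it wastes memory, because the contents of the arrays are copied multiple times, and
	it can run into the 2GB limit of the tf.GraphDef protocol buffer.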

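2.2 Consuming Python generators
Another common data source that can easily be ingested as a tf.data.Dataset is the Python generator.

		Note: while this is a convenient approach, it has limited portability and scalability. It must run in the same Python process that created the generator, and it is still subject to the Python GIL.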

def count(stop):
  i = 0
  while i<stop:
    yield i
    i += 1
for n in count(5):
  print(n)

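The Dataset.from_generator constructor converts a Python generator into a fully functional tf.data.Dataset. The constructor takes a callable as input, not an iterator, so that it can restart the generator when it reaches the end. It also takes an optional args argument, which is passed as the callable's arguments.
The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.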

ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )

for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
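
The wrap from 24 back to 0 in the output above depends on the generator being restartable, which is why a callable is required rather than an iterator. The plain-Python sketch below illustrates the difference; run_epochs is a hypothetical helper for illustration, not something tf.data exposes.

```python
def count(stop):
    """Same generator function as above: yields 0, 1, ..., stop - 1."""
    i = 0
    while i < stop:
        yield i
        i += 1

# A generator object (an iterator) can be consumed only once.
gen = count(3)
first_pass = list(gen)
second_pass = list(gen)  # already exhausted, so this is empty

# A callable can be invoked again to obtain a fresh generator per pass.
# run_epochs is a hypothetical helper, not a tf.data function.
def run_epochs(generator_fn, args, epochs):
    out = []
    for _ in range(epochs):
        out.extend(generator_fn(*args))
    return out

restarted = run_epochs(count, (3,), epochs=2)
print(first_pass, second_pass, restarted)
```

Passing the function plus its args, instead of an already created generator, is what lets each repetition start from the beginning.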

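The output_shapes argument is not required, but it is highly recommended, because many TensorFlow operations do not support tensors of unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.
Note also that output_shapes and output_types follow the same nesting rules as other dataset methods.
Here is an example generator that demonstrates both aspects: it returns tuples of arrays, where the second array is a vector of unknown length.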

def gen_series():
  i = 0
  while True:
    size = np.random.randint(0, 10)
    yield i, np.random.normal(size=(size,))
    i += 1

for i, series in gen_series():
  print(i, ":", str(series))
  if i > 5:
    break

0 : [-1.1226  1.6132 -0.0095  0.8728]
1 : [0.6396 0.4688 0.2611 0.9847 0.1679 0.0287]
2 : [-0.2065  0.2807 -1.1219  0.0603]
3 : [ 1.137   0.0087 -0.774  -0.321  -2.0574 -0.4246]
4 : [ 1.0536 -1.0681 -0.8049  0.5107  1.2738 -0.1986 -0.5262  0.7247 -0.1688]
5 : [-1.7257 -0.4691  0.418   1.7976  1.863   0.3992]
6 : [-0.3747 -1.2524  0.525  -0.6958  0.4991 -0.5964 -1.7148]

The first output is an int32 and the second is a float32. The first item is a scalar, shape (), and the second is a vector of unknown length, shape (None,).

ds_series = tf.data.Dataset.from_generator(
    gen_series, 
    output_types=(tf.int32, tf.float32), 
    output_shapes=((), (None,)))

ds_series

Now it can be used like a regular tf.data.Dataset. Note: when batching a dataset with a variable shape, we need to use Dataset.padded_batch.

ds_series_batch = ds_series.shuffle(20).padded_batch(10, padded_shapes=([], [None]))

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[12 19  3  0  9  7  5  6  4 16]

[[ 0.      0.      0.      0.      0.      0.      0.      0.      0.    ]
 [-0.9723 -0.4083 -0.0498 -0.9856  0.      0.      0.      0.      0.    ]
 [-1.2603  0.8078 -0.6713  0.0692  0.1462 -0.7181 -0.0713  0.801   0.    ]
 [ 1.6164  0.5583  1.0472  1.5479  0.4733  0.2503 -0.5349  1.0763 -0.385 ]
 [ 1.1203 -0.7176  0.3693 -0.2975 -1.5206  1.297   0.5356 -1.2834 -0.7963]
 [-1.7671 -0.723  -0.3565  1.2658  0.6733  0.106  -0.5957  0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.422   0.4992 -1.5497 -0.6262 -0.1558 -0.2029  0.      0.      0.    ]
 [ 0.5232  0.8569  0.0893 -0.3251 -0.9755  1.0572 -1.5325 -1.1672  0.    ]
 [ 1.7647  0.0114  2.0847  0.6158  0.      0.      0.      0.      0.    ]]

For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.
First download the data:

flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Create the image.ImageDataGenerator:

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))

Found 3670 images belonging to 5 classes.

print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)

ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers], 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([32,256,256,3], [32,5])
)

ds

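2.3 Consuming TFRecord data
For an end-to-end example, see: Loading TFRecords. (official link missing)
The tf.data API supports a variety of file formats, so that we can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets us stream over the contents of one or more TFRecord files as part of an input pipeline.
Here is an example using the test file from the French Street Name Signs (FSNS) dataset.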

# Download a single TFRecord test file from the FSNS dataset.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")

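The filenames argument to the TFRecordDataset initializer can be a string, a list of strings, or a tf.Tensor of strings. So if we have two sets of files for training and validation, we can create a factory method that produces the dataset, taking the filenames as an input argument: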

dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

Many TensorFlow projects use serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

parsed.features.feature['image/text']

bytes_list {
  value: "Rue Perreyon"
}

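2.4 Consuming text data
For an end-to-end example, see: Loading text.
Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset produces one string-valued element per line of those files.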

directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]

dataset = tf.data.TextLineDataset(file_paths)

Here are the first few lines of the first file:

for line in dataset.take(5):
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'

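To alternate lines between files, use Dataset.interleave. This makes it easier to shuffle files together. Here are the first, second, and third lines from each translation: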

files_ds = tf.data.Dataset.from_tensor_slices(file_paths)
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(lines_ds.take(9)):
  if i % 3 == 0:
    print()
  print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'
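
The ordering above can be reproduced with a small plain-Python model of interleaving. This is a simplified sketch, not the tf.data implementation: it keeps up to cycle_length sources open and takes block_length items from each in turn, and the short lists stand in for the three text files (the strings are placeholders, not the real lines).

```python
import itertools

def interleave(sources, cycle_length, block_length=1):
    """Simplified model: cycle over up to cycle_length open sources,
    taking block_length items from each one in turn."""
    pending = iter(sources)
    # Open the first cycle_length sources.
    slots = [iter(src) for _, src in zip(range(cycle_length), pending)]
    idx = 0
    while slots:
        idx %= len(slots)
        block = list(itertools.islice(slots[idx], block_length))
        yield from block
        if len(block) < block_length:
            # This source is exhausted: open the next source in the same
            # slot, or drop the slot when no sources remain.
            try:
                slots[idx] = iter(next(pending))
            except StopIteration:
                slots.pop(idx)
                continue
        idx += 1

# Placeholder "files" standing in for the three translations.
files = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
mixed = list(interleave(files, cycle_length=3))
print(mixed)
```

With cycle_length=1 the same function simply concatenates the sources, which matches what reading the files one after another produces.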

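By default, a TextLineDataset yields every line of each file, which may not be desirable, for example if the file starts with a header line or contains comments. These lines can be removed using the Dataset.skip() or Dataset.filter() transformations. Here, we skip the first line, then filter to keep only the survivors.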

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
for line in titanic_lines.take(10):
  print(line.numpy())

b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'

def survived(line):
  return tf.not_equal(tf.strings.substr(line, 0, 1), "0")

survivors = titanic_lines.skip(1).filter(survived)
for line in survivors.take(10):
  print(line.numpy())

b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'

2.5 Consuming CSV data
For more examples, see: Loading CSV Files and Loading Pandas DataFrames (official link missing).
The CSV file format is a popular format for storing tabular data in plain text.

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
df = pd.read_csv(titanic_file, index_col=None)
df.head()

If our data fits in memory, the same Dataset.from_tensor_slices method works on dictionaries, which makes this data easy to import.

titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

  'survived'          : 0
  'sex'               : b'male'
  'age'               : 22.0
  'n_siblings_spouses': 1
  'parch'             : 0
  'fare'              : 7.25
  'class'             : b'Third'
  'deck'              : b'unknown'
  'embark_town'       : b'Southampton'
  'alone'             : b'n'

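A more scalable approach is to load from disk as necessary. The tf.data module provides methods to extract records from one or more CSV files that comply with RFC 4180.
The experimental.make_csv_dataset function is the high-level interface for reading sets of CSV files. It supports column type inference and many other features, such as batching and shuffling, to simplify usage.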

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived")
for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  print("features:")
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [0 0 0 1]
features:
  'sex'               : [b'male' b'male' b'male' b'male']
  'age'               : [28. 31. 38. 28.]
  'n_siblings_spouses': [3 0 0 0]
  'parch'             : [1 0 0 0]
  'fare'              : [25.4667 50.4958  7.05   30.5   ]
  'class'             : [b'Third' b'First' b'Third' b'First']
  'deck'              : [b'unknown' b'A' b'unknown' b'C']
  'embark_town'       : [b'Southampton' b'Southampton' b'Southampton' b'Southampton']
  'alone'             : [b'n' b'y' b'y' b'y']

If only a subset of the columns is needed, use the select_columns argument.

titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])
for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))

'survived': [0 0 0 0]
  'fare'              : [27.7208 13.     20.2125  9.5   ]
  'class'             : [b'First' b'Second' b'Third' b'Third']

There is also a lower-level experimental.CsvDataset class. It does not support column type inference; instead, the type of each column must be specified.

titanic_types  = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string] 
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)

for line in dataset.take(10):
  print([item.numpy() for item in line])

[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
[0, b'male', 28.0, 0, 0, 8.4583, b'Third', b'unknown', b'Queenstown', b'y']
[0, b'male', 2.0, 3, 1, 21.075, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 27.0, 0, 2, 11.1333, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 14.0, 1, 0, 30.0708, b'Second', b'unknown', b'Cherbourg', b'n']
[1, b'female', 4.0, 1, 1, 16.7, b'Third', b'G', b'Southampton', b'n']
[0, b'male', 20.0, 0, 0, 8.05, b'Third', b'unknown', b'Southampton', b'y']

If some columns are empty, this low-level interface lets us provide default values instead of column types.

%%writefile missing.csv
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,

Writing missing.csv

# Creates a dataset that reads all of the records from missing.csv, which has
# four integer columns that may have missing values.

record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

for line in dataset:
  print(line.numpy())

[1 2 3 4]
[999   2   3   4]
[  1 999   3   4]
[  1   2 999   4]
[  1   2   3 999]
[999 999 999 999]
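
To make the role of the defaults concrete, here is a plain-Python sketch that produces the same rows with the standard csv module. It only models the idea that each default both fills an empty field and fixes the column's type; parse_with_defaults is a hypothetical helper, not how CsvDataset is implemented.

```python
import csv
import io

# Same content as missing.csv above.
text = "1,2,3,4\n,2,3,4\n1,,3,4\n1,2,,4\n1,2,3,\n,,,\n"

# parse_with_defaults is a hypothetical helper for illustration only.
def parse_with_defaults(lines, record_defaults):
    """Parse CSV rows, replacing empty fields with the column default and
    converting the others to the default's Python type."""
    rows = []
    for fields in csv.reader(lines):
        rows.append([
            default if field == "" else type(default)(field)
            for field, default in zip(fields, record_defaults)
        ])
    return rows

rows = parse_with_defaults(io.StringIO(text), [999, 999, 999, 999])
for row in rows:
    print(row)
```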

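By default, a CsvDataset yields every column of every line of the file, which may not be desirable, for example if the file starts with a header line that should be ignored, or if some columns are not needed in the input. These lines and fields can be removed with the header and select_cols arguments respectively.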

# Creates a dataset that reads all of the records from missing.csv,
# extracting integer data from columns 2 and 4.
record_defaults = [999, 999] # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset

for line in dataset:
  print(line.numpy())

[2 4]
[2 4]
[999   4]
[2 4]
[  2 999]
[999 999]

2.6 Consuming sets of files
Many datasets are distributed as a set of files, where each file is an example.

flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)

The root directory contains a directory for each class:

for item in flowers_root.glob("*"):
  print(item.name)

daisy
sunflowers
roses
LICENSE.txt
tulips
dandelion

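The files in each class directory are the examples. First list them with Dataset.list_files; their contents are then read with the tf.io.read_file function: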

list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

for f in list_ds.take(5):
  print(f.numpy())

b'/root/.keras/datasets/flower_photos/tulips/16680998737_6f6225fe36.jpg'
b'/root/.keras/datasets/flower_photos/roses/8437935944_aab997560a_n.jpg'
b'/root/.keras/datasets/flower_photos/dandelion/808239968_318722e4db.jpg'
b'/root/.keras/datasets/flower_photos/dandelion/3502447188_ab4a5055ac_m.jpg'
b'/root/.keras/datasets/flower_photos/roses/3145692843_d46ba4703c.jpg'

Convert the file paths into (image, label) pairs:

def process_path(file_path):
  parts = tf.strings.split(file_path, '/')
  return tf.io.read_file(file_path), parts[-2]

labeled_ds = list_ds.map(process_path)

for image_raw, label_text in labeled_ds.take(1):
  print(repr(image_raw.numpy()[:100]))
  print()
  print(label_text.numpy())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00acspMSFT\x00\x00\x00\x00IEC sRGB\x00\x00\x00\x00\x00\x00'

b'tulips'

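3. Batching dataset elements
3.1 Simple batching
The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: that is, for each component i, all elements must have a tensor of exactly the same shape.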

inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

for batch in batched_dataset.take(4):
  print([arr.numpy() for arr in batch])

[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]

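While tf.data tries to propagate shape information, the default settings of Dataset.batch result in an unknown batch size, because the last batch may not be full. Note the None values in the shape: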

batched_dataset

Use the drop_remainder argument to ignore that last batch and get full shape propagation:

batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset

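3.2 Batching tensors with padding
The approach above works for tensors that all have the same size. However, many models (for example sequence models) work with input data that can have varying size (for example sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation lets us batch tensors of different shapes by specifying one or more dimensions in which they may be padded.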

dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
  print(batch.numpy())
  print()

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]

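The Dataset.padded_batch() transformation lets us set different padding for each dimension of each component, and it may be variable-length (signified by None in the example above) or constant-length. It is also possible to override the padding value, which defaults to 0.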
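
The variable-length case can be modelled in a few lines of plain Python: every batch is padded to the length of its own longest member. This is only a sketch of the behaviour for one-dimensional elements, with hypothetical helper names, not the tf.data implementation.

```python
# Hypothetical helpers modelling variable-length padding of 1-D elements.
def pad_to_longest(batch, padding_value):
    """Right-pad every list in the batch to the longest length in it."""
    width = max(len(seq) for seq in batch)
    return [seq + [padding_value] * (width - len(seq)) for seq in batch]

def padded_batch(sequences, batch_size, padding_value=0):
    """Group lists into batches of batch_size and pad each batch."""
    batch = []
    for seq in sequences:
        batch.append(list(seq))
        if len(batch) == batch_size:
            yield pad_to_longest(batch, padding_value)
            batch = []
    if batch:  # final partial batch
        yield pad_to_longest(batch, padding_value)

# Element x is the value x repeated x times, as in the example above.
sequences = [[x] * x for x in range(8)]
batches = list(padded_batch(sequences, batch_size=4))
for rows in batches:
    print(rows)
```

Because the width is computed per batch, the first batch is 3 wide and the second is 7 wide, which is the pattern shown in the output above.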

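4. Training workflows
4.1 Processing multiple epochs
The tf.data API offers two main ways to process multiple epochs of the same data. The simplest way to iterate over a dataset in multiple epochs is to use the Dataset.repeat() transformation.
First, we create a dataset of the Titanic data: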

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

def plot_batch_sizes(ds):
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')

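Applying the Dataset.repeat() transformation with no arguments repeats the input indefinitely.
The Dataset.repeat transformation concatenates its repetitions without signaling the end of one epoch and the beginning of the next. Because of this, a Dataset.batch() applied after Dataset.repeat yields batches that straddle epoch boundaries: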

titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)

If we need clear epoch separation, put Dataset.batch before the repeat:

titanic_batches = titanic_lines.batch(128).repeat(3)

plot_batch_sizes(titanic_batches)
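
Since plot_batch_sizes only draws a chart, the batch sizes for the two orderings can also be worked out with simple arithmetic. The sketch below assumes the file has 628 lines including the header, which is what the batch shapes printed later in this section (four batches of 128 plus one of 116) imply; batch_sizes is a hypothetical helper.

```python
# batch_sizes is a hypothetical helper, not part of tf.data.
def batch_sizes(num_elements, batch_size):
    """Sizes of the batches cut from num_elements consecutive elements
    when the final partial batch is kept."""
    full, rest = divmod(num_elements, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

n_lines = 628  # assumed line count of train.csv, header included
epochs = 3

# repeat(3).batch(128): one long stream, so only the very last batch is short.
repeat_then_batch = batch_sizes(n_lines * epochs, 128)
# batch(128).repeat(3): every epoch ends with its own short batch.
batch_then_repeat = batch_sizes(n_lines, 128) * epochs

print(repeat_then_batch)
print(batch_then_repeat)
```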

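If we would like to perform a custom computation (for example, to collect statistics) at the end of each epoch, it is simplest to restart the dataset iteration on each epoch: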

epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)

(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  2

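4.2 Randomly shuffling input data
The Dataset.shuffle() transformation maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.

	Note: while large buffer_size values shuffle more thoroughly, they can take a lot of memory and significant time to fill. If this becomes a problem, consider using Dataset.interleave across files.

Add an index to the dataset so that we can see the effect: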

lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset

Since the buffer_size is 100 and the batch size is 20, the first batch contains no element with an index of 120 or higher.

n,line_batch = next(iter(dataset))
print(n.numpy())

[ 36   7  48  88  60  52  61  59  97 100  45  62  87  34  90  83  55  14
  32 114]
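
The bound on the first batch follows directly from how a fixed-size shuffle buffer works. Below is a plain-Python model of the idea (fill the buffer, emit a random slot, refill that slot with the next input); it is a sketch of the semantics with an assumed input of 628 indices, not the tf.data implementation.

```python
import random

def shuffle_buffer(items, buffer_size, seed=None):
    """Model of a shuffle buffer: emit a uniformly random buffered element
    and replace it with the next input element."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]
        buffer[pos] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# 628 indices is an assumed dataset size for illustration.
shuffled = list(shuffle_buffer(range(628), buffer_size=100, seed=0))
first_batch = shuffled[:20]
print(max(first_batch))
```

When the k-th element is emitted (counting from 0), the buffer can only hold indices up to 99 + k, so the first 20 outputs never reach index 120.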

As with Dataset.batch, the order relative to Dataset.repeat matters.
Dataset.shuffle does not signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat will show every element of one epoch before moving to the next:

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())

Here are the item ID's near the epoch boundary:

[565 571 381 557 594 627 503 383 562 513]
[556 603 407 604 613 496 165 252 624 623]
[484 464 572 602 569 605 574 610]
[25 52 94  2 43 85  7 44 26 74]
[ 41  49  95  61  76  16 104   0  67 105]

shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()

But a repeat before a shuffle mixes the epoch boundaries together:

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())

Here are the item ID's near the epoch boundary:

[545 487   5 583 133 418 609 384   6 620]
[458 597 477  28 595 586 400   3  36  33]
[623 535 491 587 589  42  21  41 601 399]
[ 40 361 607 498 416 575  34  53  50  27]
[520 599  38  56 619  54  17  52  59 314]
[ 15  72  30 574 582  35  31  26  13  61]
[ 29 549  39  69  11  23   9  37 530  46]
[526  20 521  45  57 566 270 555  86  95]
[ 70  12 544  66   2 573  76  82 553 110]
[617 107 608 624  49 104  75  67  92 611]
[ 24 101 614 102 618 105 602 109 111 112]
[482 578  77  88 598 464  10 579  79 106]
[ 97 605  64  71  85 108  58  74 141  19]
[145 140 129  22 475 120  99 446 137  32]
[150  84 557 147 588 130  48 136  51 163]

repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()

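5. Preprocessing data
The Dataset.map(f) transformation produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that is commonly applied to lists (and other structures) in functional programming languages. The function f takes the tf.Tensor objects that represent a single element in the input, and returns the tf.Tensor objects that will represent a single element in the new dataset. Its implementation uses standard TensorFlow operations to transform one element into another.
This section covers common examples of how to use Dataset.map().
5.1 Decoding image data and resizing it
When training a neural network on real-world image data, it is often necessary to convert images of different sizes to a common size, so that they can be batched into a fixed size.
Rebuild the flower filenames dataset: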

list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

Write a function that manipulates the dataset elements:

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
  parts = tf.strings.split(filename, '/')
  label = parts[-2]

  image = tf.io.read_file(filename)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  image = tf.image.resize(image, [128, 128])
  return image, label

Test that it works:

file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
  plt.figure()
  plt.imshow(image)
  plt.title(label.numpy().decode('utf-8'))
  plt.axis('off')

show(image, label)

Map it over the dataset:

images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
  show(image, label)

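5.2 Applying arbitrary Python logic
For performance reasons, we are encouraged to use TensorFlow operations for preprocessing our data whenever possible. However, it is sometimes useful to call external Python libraries when parsing our input data. We can use the tf.py_function() operation in a Dataset.map() transformation.
For example, if we want to apply a random rotation, the tf.image module only has tf.image.rot90, which is not very useful for image augmentation.

	Note: tensorflow_addons has a TensorFlow-compatible rotate in tensorflow_addons.image.rotate. To demonstrate tf.py_function, we can try using the scipy.ndimage.rotate function instead: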

import scipy.ndimage as ndimage

def random_rotate_image(image):
  image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
  return image

image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)

Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

To use this function with Dataset.map, the same caveats apply as with Dataset.from_generator: we need to describe the return shapes and types when we apply the function:

def tf_random_rotate_image(image, label):
  im_shape = image.shape
  [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
  image.set_shape(im_shape)
  return image, label

rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
  show(image, label)

Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

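5.3 Parsing tf.Example protocol buffer messages
Many input pipelines extract tf.train.Example protocol buffer messages from a TFRecord format. Each tf.train.Example record contains one or more 'features', and the input pipeline typically converts these features into tensors.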

fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset

We can work with tf.train.Example protos outside of a tf.data.Dataset to understand the data:

raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])

raw_example = next(iter(dataset))
def tf_parse(raw_example):
  example = tf.io.parse_example(
      raw_example[tf.newaxis], {
          'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
          'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
      })
  return example['image/encoded'][0], example['image/text'][0]

img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")

b'Rue Perreyon'
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X' ...

decoded = dataset.map(tf_parse)
decoded

image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape

TensorShape([10])

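5.4 Time series windowing
For an end-to-end time series example, see: Time series forecasting (official link missing)
Time series data is often organized with the time axis intact. We use a simple Dataset.range to demonstrate: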

range_ds = tf.data.Dataset.range(100000)

Typically, models based on this sort of data want a contiguous time slice. The simplest approach is to batch the data:
5.4.1 Using batch

batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]

Or, to make dense predictions one step into the future, we can shift the features and labels by one step relative to each other:

def dense_1_step(batch):
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8]  =>  [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]  =>  [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28]  =>  [21 22 23 24 25 26 27 28 29]

To predict a whole window instead of a fixed offset, we can split the batches into two parts:

batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Inputs: all except the last 5 steps
          batch[-5:])   # Labels: the last 5 steps

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]  =>  [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]  =>  [40 41 42 43 44]

To allow some overlap between the features of one batch and the labels of another, use Dataset.zip:

feature_length = 10
label_length = 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])

predict_5_steps = tf.data.Dataset.zip((features, labels))

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19]  =>  [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29]  =>  [30 31 32 33 34]

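5.4.2 Using window
While using Dataset.batch works, there are situations where we may need finer control. The Dataset.window method gives us complete control, but requires some care: it returns a Dataset of Datasets. For details, see: Dataset structure (official link missing)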

window_size = 5

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>

The Dataset.flat_map method can take a dataset of datasets and flatten it into a single dataset:

for x in windows.flat_map(lambda x: x).take(30):
  print(x.numpy(), end=' ')

0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9

In nearly all cases, we will want to batch the dataset first:

def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]

Now we can see that the shift argument controls how much each window moves over. Putting this together, we might write this function:

def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows

ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)

for example in ds.take(10):
  print(example.numpy())

[ 0  3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
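
How window_size, shift and stride interact can be checked with a small plain-Python sketch. It is not the tf.data implementation, only a list-based stand-in for window(...) followed by batch(window_size, drop_remainder=True). The name make_windows and the input range(100) are assumptions for illustration.

```python
def make_windows(values, window_size=5, shift=1, stride=1):
    """List-based stand-in for window(...) + batch(..., drop_remainder=True)."""
    values = list(values)
    result = []
    # A new window starts every `shift` elements.
    for start in range(0, len(values), shift):
        # Within a window, take every `stride`-th element, at most `window_size` of them.
        window = values[start::stride][:window_size]
        # Keep only complete windows, mirroring drop_remainder=True.
        if len(window) == window_size:
            result.append(window)
    return result

for window in make_windows(range(100), window_size=10, shift=5, stride=3)[:3]:
    print(window)
```

In short, shift sets the distance between the starts of consecutive windows, while stride sets the distance between consecutive elements inside one window.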

As before, extracting the labels is easy:

dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())

[ 0  3  6  9 12 15 18 21 24] => [ 3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
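
The output above is consistent with a label function that shifts each window by one step. As a sketch, assuming dense_1_step behaves like the list-based version below (the real function is defined earlier in these notes and operates on tensors):

```python
def dense_1_step(batch):
    """Split one window into (inputs, labels) shifted by one step."""
    # Inputs: everything except the last element.
    # Labels: the same window moved one position forward.
    return batch[:-1], batch[1:]

inputs, labels = dense_1_step([0, 3, 6, 9, 12, 15, 18, 21, 24, 27])
print(inputs, "=>", labels)
```

Each label is therefore the value that follows the corresponding input inside the same window.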

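5.5 Resampling

When working with a dataset that is very class-imbalanced, we may want to resample it. tf.data provides two methods for this. The credit card fraud dataset is a good example of this kind of problem.
For more information, see: Imbalanced Data (the link to the official tutorial was lost)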

zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')

creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])

Now check the distribution of classes; it is highly skewed:

def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts

counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)

[0.9948 0.0052]
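
Dataset.reduce folds a running state over the elements, much like functools.reduce in plain Python. The sketch below shows the same counting pattern on ordinary lists. The label batches are made up for illustration and are not taken from the credit card data.

```python
from functools import reduce

def count(counts, labels):
    """Fold one batch of labels into the running per-class totals."""
    return {
        'class_0': counts['class_0'] + sum(1 for y in labels if y == 0),
        'class_1': counts['class_1'] + sum(1 for y in labels if y == 1),
    }

# Hypothetical label batches, for illustration only.
batches = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
totals = reduce(count, batches, {'class_0': 0, 'class_1': 0})

total = sum(totals.values())
fractions = [totals['class_0'] / total, totals['class_1'] / total]
print(totals, fractions)
```

The initial state plays the role of initial_state, and the two-argument function plays the role of reduce_func.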

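A common approach to training on an imbalanced dataset is to balance it. tf.data includes a few methods that enable this workflow:
5.5.1 Datasets sampling
One way to resample a dataset is to use sample_from_datasets. It is most applicable when there is a separate data.Dataset for each class.
Here, filter is used to generate the per-class datasets from the credit card fraud data: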

negative_ds = creditcard_ds.unbatch().filter(lambda features,label: label==0).repeat()
positive_ds = creditcard_ds.unbatch().filter(lambda features,label: label==1).repeat()

for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())

[1 1 1 1 1 1 1 1 1 1]

To use tf.data.experimental.sample_from_datasets, pass the datasets and the weight for each one:

balanced_ds = tf.data.experimental.sample_from_datasets([negative_ds, positive_ds], [0.5, 0.5]).batch(10)

Now the dataset produces examples of each class with 50/50 probability:

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[0 1 1 1 0 0 1 0 1 0]
[0 0 0 0 0 1 1 0 0 0]
[0 1 1 0 1 0 0 1 0 0]
[0 1 0 0 0 1 1 1 0 1]
[1 1 0 1 0 1 0 1 0 1]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 0 1 0 1 1]
[0 1 1 0 1 0 1 1 1 0]
[0 0 0 1 0 1 0 0 1 1]
[1 1 1 0 1 1 0 0 0 0]
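
The idea behind weighted sampling from several sources can be sketched in plain Python. This is a simplified illustration, not how tf.data implements it: for every output element, one source is picked at random according to the weights and its next element is emitted. The function name sample_from_sources and the endless 0/1 streams are assumptions for illustration.

```python
import itertools
import random

def sample_from_sources(sources, weights, seed=0):
    """Yield elements by repeatedly picking one source at random, by weight."""
    rng = random.Random(seed)
    iterators = [iter(s) for s in sources]
    indices = range(len(iterators))
    while True:
        chosen = rng.choices(indices, weights=weights)[0]
        try:
            yield next(iterators[chosen])
        except StopIteration:
            return  # simplification: stop as soon as any chosen source runs dry

negatives = itertools.repeat(0)  # stand-in for negative_ds (endless label 0)
positives = itertools.repeat(1)  # stand-in for positive_ds (endless label 1)
balanced = sample_from_sources([negatives, positives], [0.5, 0.5])

sample = list(itertools.islice(balanced, 10000))
print(sum(sample) / len(sample))  # fraction of positive labels, close to 0.5
```

Because each source is consumed independently, the rare class has to be repeated (as with .repeat() above) or it would be exhausted long before the common one.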

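5.5.2 Rejection resampling
One drawback of the experimental.sample_from_datasets approach above is that it needs a separate tf.data.Dataset for each class. Dataset.filter works around this, but all the data ends up being loaded twice.
The data.experimental.rejection_resample function can rebalance a dataset while loading it only once. Elements are dropped from the dataset to achieve balance.
data.experimental.rejection_resample takes a class_func argument. This function is applied to each dataset element and determines which class an example belongs to for the purposes of balancing.
The elements of creditcard_ds are already (features, label) pairs, so class_func only needs to return the label: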

def class_func(features, label):
  return label

The resampler also needs a target distribution and, optionally, an estimate of the initial distribution:

resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)

The resampler works on individual examples, so the dataset must be unbatched before the resampler is applied:

resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)

WARNING:tensorflow:From /tensorflow-2.0.0/python3.6/tensorflow_core/python/data/experimental/ops/resampling.py:151: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:

The resampler returns (class, example) pairs, where class is the output of class_func. In this case the example is already a (feature, label) pair, so use map to drop the extra copy of the label:

balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

Now the dataset produces examples of each class with 50/50 probability:

for features, labels in balanced_ds.take(10):
  print(labels.numpy())

[1 0 0 1 0 0 1 1 1 0]
[0 0 1 0 0 0 0 1 1 1]
[0 0 1 1 1 0 0 1 0 1]
[0 0 1 0 1 1 1 1 0 1]
[1 1 1 0 0 0 0 0 1 1]
[1 1 1 1 1 1 0 1 1 1]
[1 0 1 0 0 1 0 0 0 1]
[1 1 0 0 1 1 1 1 0 0]
[1 0 0 1 0 0 1 0 0 1]
[1 0 0 0 1 0 0 0 1 0]
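
The principle of rejection resampling can be illustrated with a textbook sketch in plain Python. This is not the algorithm used inside tf.data.experimental.rejection_resample, which is more elaborate. Here each example is simply kept with a probability proportional to target/initial for its class. The function names and the synthetic label stream are assumptions for illustration.

```python
import random

def acceptance_probs(initial_dist, target_dist):
    """Per-class keep probability, scaled so the rarest class is always kept."""
    ratios = [t / i for t, i in zip(target_dist, initial_dist)]
    largest = max(ratios)
    return [r / largest for r in ratios]

def rejection_resample(labels, initial_dist, target_dist, seed=0):
    """Drop examples at random so the kept ones follow target_dist."""
    rng = random.Random(seed)
    probs = acceptance_probs(initial_dist, target_dist)
    return [y for y in labels if rng.random() < probs[y]]

initial = [0.9948, 0.0052]          # class fractions measured earlier in these notes
probs = acceptance_probs(initial, [0.5, 0.5])

# Synthetic, skewed label stream drawn with the same class fractions.
rng = random.Random(1)
stream = [1 if rng.random() < initial[1] else 0 for _ in range(200000)]
kept = rejection_resample(stream, initial, [0.5, 0.5])
print(probs, len(kept), sum(kept) / len(kept))
```

The price of a single pass is visible in the numbers: nearly all majority-class examples are thrown away, so the balanced output is far smaller than the input.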

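6. Using high-level APIs
6.1 tf.keras
The tf.keras API simplifies many aspects of creating and running machine learning models. Its .fit(), .evaluate() and .predict() APIs accept datasets as inputs. Here is a quick dataset and model setup: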

train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)

fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(), 
              metrics=['accuracy'])

Passing a dataset of (feature, label) pairs is all that Model.fit and Model.evaluate need:

model.fit(fmnist_train_ds, epochs=2)

If we pass an infinite dataset, for example one created with Dataset.repeat(), we also need to pass the steps_per_epoch argument:

model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)

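For evaluation, the number of evaluation steps can optionally be passed. Without it, a finite dataset is evaluated in full: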

loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)

For long or infinite datasets, set the number of steps to evaluate:

loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)

Labels are not required when calling Model.predict:

predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)

(320, 10)

If we pass a dataset that does contain labels, they are ignored:

result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)

(320, 10)
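
The step counts and output shapes in this section follow from simple arithmetic. As a small sketch (the helper names are made up for illustration, and 60,000 is the size of the Fashion-MNIST training split):

```python
import math

def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

def rows_seen(steps, batch_size):
    """Number of examples consumed when running a fixed number of full batches."""
    return steps * batch_size

print(batches_per_epoch(60000, 32))  # batches in one pass over the training set
print(rows_seen(10, 32))             # rows produced by predict(..., steps=10)
```

With a batch size of 32, ten prediction steps yield 320 rows, which matches the (320, 10) result shape above. A value such as batches_per_epoch(...) is a natural choice for steps_per_epoch when the dataset is repeated indefinitely.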

6.2 tf.estimator
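To use a Dataset in the input_fn of a tf.estimator.Estimator, simply return the Dataset from the input_fn and the framework will take care of consuming its elements. For example: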

import tensorflow_datasets as tfds

def train_input_fn():
  titanic = tf.data.experimental.make_csv_dataset(
      titanic_file, batch_size=32,
      label_name="survived")
  titanic_batches = (
      titanic.cache().repeat().shuffle(500)
      .prefetch(tf.data.experimental.AUTOTUNE))
  return titanic_batches

embark = tf.feature_column.categorical_column_with_hash_bucket('embark_town', 32)
cls = tf.feature_column.categorical_column_with_vocabulary_list('class', ['First', 'Second', 'Third']) 
age = tf.feature_column.numeric_column('age')

import tempfile
model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
    model_dir=model_dir,
    feature_columns=[embark, cls, age],
    n_classes=2
)

model = model.train(input_fn=train_input_fn, steps=100)

result = model.evaluate(train_input_fn, steps=10)

for key, value in result.items():
  print(key, ":", value)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-26T11:09:36Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp1rrmmxl3/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Finished evaluation at 2019-10-26-11:09:37
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.728125, accuracy_baseline = 0.625, auc = 0.7681875, auc_precision_recall = 0.6506202, average_loss = 0.5628802, global_step = 100, label/mean = 0.375, loss = 0.5628802, precision = 0.72602737, prediction/mean = 0.33942837, recall = 0.44166666
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: /tmp/tmp1rrmmxl3/model.ckpt-100
accuracy : 0.728125
accuracy_baseline : 0.625
auc : 0.7681875
auc_precision_recall : 0.6506202
average_loss : 0.5628802
label/mean : 0.375
loss : 0.5628802
precision : 0.72602737
prediction/mean : 0.33942837
recall : 0.44166666
global_step : 100

for pred in model.predict(train_input_fn):
  for key, value in pred.items():
    print(key, ":", value)
  break

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp1rrmmxl3/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
logits : [-0.4143]
logistic : [0.3979]
probabilities : [0.6021 0.3979]
class_ids : [0]
classes : [b'0']
all_class_ids : [0 1]
all_classes : [b'0' b'1']
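
The fields in this prediction are related to one another. For a binary classifier, logistic is the sigmoid of the logit, and probabilities lists the probability of class 0 followed by class 1. A minimal plain-Python check using the logit printed above:

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

logit = -0.4143                      # value printed in the prediction above
p1 = sigmoid(logit)                  # probability of class 1 ("logistic")
probabilities = [1.0 - p1, p1]       # [P(class 0), P(class 1)]
class_id = max(range(2), key=lambda i: probabilities[i])

print(round(p1, 4), [round(p, 4) for p in probabilities], class_id)
```

A negative logit therefore always means that class 0 is the predicted class.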
