This is my second post on tf.data. The first one, based on TF v1, covered tf.data in detail, but v1 and v2 are incompatible in many places, so this post looks at what is new in the v2 tf.data module.
TensorFlow version: 2.1.0
First, the link to the TF v1 tf.data post: 《TensorFlow tf.data 导入数据(tf.data官方教程)》
Contents:
- Building a data input pipeline with tf.data
- Dataset structure
- Data preprocessing with Dataset.map()
- tf.Example protocol buffer messages
- Datasets.sampling
- experimental.rejection_resample
- Using tf.data with tf.keras
- Using tf.data with tf.estimator
Building a data input pipeline with tf.data
Data input pipelines written with the tf.data API are simple and highly reusable, and tf.data can express very complex pipelines. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.
The tf.data API introduces the tf.data.Dataset abstraction, which represents a sequence of elements in which each element consists of one or more components. For example, in an image pipeline an element might be a single training example made up of an image and its label.
There are two ways to create a dataset:
- A data source builds a Dataset from data stored in memory or in files.
- A data transformation builds a new Dataset from one or more existing Dataset objects.
A quick sketch of both follows the imports below.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import pathlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
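Here is the promised sketch of the two ways, assuming the imports above: from_tensor_slices acts as a data source, while map and batch are transformations that each return a new Dataset.
source_ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])  # data source: in-memory list
transformed_ds = source_ds.map(lambda x: x * 2).batch(3)            # transformations: new Datasets
for batch in transformed_ds:
    print(batch.numpy())  # [2 4 6], then [ 8 10 12]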
Building an input pipeline usually starts from a data source. If your data is stored in memory, you can create a Dataset with tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). If your data is stored in TFRecord files, you can use tf.data.TFRecordDataset().
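As a quick sketch of the difference between those two in-memory constructors: from_tensors keeps the whole input as a single element, while from_tensor_slices slices it along the first axis.
t = tf.constant([[1, 2], [3, 4]])
ds_whole = tf.data.Dataset.from_tensors(t)         # 1 element of shape (2, 2)
ds_sliced = tf.data.Dataset.from_tensor_slices(t)  # 2 elements of shape (2,)
print(ds_whole.element_spec)   # TensorSpec(shape=(2, 2), dtype=tf.int32, name=None)
print(ds_sliced.element_spec)  # TensorSpec(shape=(2,), dtype=tf.int32, name=None)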
Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on it.
A Dataset is a Python iterable, so its elements can be consumed with a for loop:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset
for elem in dataset:
print(elem.numpy())
8
3
0
8
2
1
Alternatively, explicitly create a Python iterator with iter and consume its elements with next:
it = iter(dataset)
print(next(it).numpy())
8
Dataset elements can also be consumed with the reduce() transformation, which reduces all elements to a single result. The following example shows how to use reduce to compute the sum of a dataset of integers.
print(dataset.reduce(0, lambda state, value: state + value).numpy())
22
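As another small sketch, the same reduce pattern can count the elements by ignoring their values:
count = dataset.reduce(0, lambda state, _: state + 1)  # add 1 per element
print(count.numpy())  # 6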
Dataset structure
A Dataset contains elements that all share the same (nested) structure, and each element consists of components that can be represented by tf.TypeSpec (commonly Tensor, SparseTensor, RaggedTensor, TensorArray, or Dataset).
The Dataset.element_spec property lets you inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects that mirrors the structure of the element. For example:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
dataset1.element_spec
TensorSpec(shape=(10,), dtype=tf.float32, name=None)
dataset2 = tf.data.Dataset.from_tensor_slices(
(tf.random.uniform([4]),
tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset2.element_spec
(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3.element_spec
(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))
# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
dataset4.element_spec
SparseTensorSpec(TensorShape([3, 4]), tf.int32)
# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type
tensorflow.python.framework.sparse_tensor.SparseTensor
Dataset transformations support datasets of any structure. When using Dataset.map(), Dataset.flat_map(), and Dataset.filter(), which apply a function to each element, the element structure determines the arguments of that function:
dataset1 = tf.data.Dataset.from_tensor_slices(
tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))
dataset1
for z in dataset1:
print(z.numpy())
[6 7 1 1 5 6 7 8 7 6]
[8 3 3 7 9 3 8 4 8 4]
[2 3 6 9 4 2 1 8 1 6]
[6 7 1 9 6 2 4 7 9 1]
dataset2 = tf.data.Dataset.from_tensor_slices(
(tf.random.uniform([4]),
tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset2
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3
for a, (b,c) in dataset3:
print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
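Tying this back to the point about function arguments: mapping over the zipped dataset above, the function receives one argument per top-level component, so the (b, c) pair arrives as a single tuple argument. A minimal sketch:
first_only = dataset3.map(lambda a, bc: a)  # `a` from dataset1, `bc` is the (b, c) tuple
print(first_only.element_spec)  # TensorSpec(shape=(10,), dtype=tf.int32, name=None)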
Note: it is often convenient to name the components of an element (for example, when the components represent different features). Besides tuples, you can also use named tuples (collections.namedtuple) or dictionaries to represent a single Dataset element.
dataset = tf.data.Dataset.from_tensor_slices(
{"a": tf.random.uniform([4]),
"b": tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)})
dataset.element_spec
{'a': TensorSpec(shape=(), dtype=tf.float32, name=None), 'b': TensorSpec(shape=(100,), dtype=tf.int32, name=None)}
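For completeness, a small sketch of the namedtuple variant mentioned above (the type and field names here are arbitrary):
import collections
Example = collections.namedtuple('Example', ['feature', 'label'])
nt_dataset = tf.data.Dataset.from_tensor_slices(
    Example(feature=tf.random.uniform([4, 10]), label=tf.random.uniform([4])))
print(nt_dataset.element_spec)
# Example(feature=TensorSpec(shape=(10,), ...), label=TensorSpec(shape=(), ...))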
See Loading NumPy arrays for more examples.
If all of your data fits in memory, the simplest way to create a Dataset from it is Dataset.from_tensor_slices().
train, test = tf.keras.datasets.fashion_mnist.load_data() # out is np array
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
images, labels = train
images = images/255
dataset = tf.data.Dataset.from_tensor_slices((images, labels)) # auto convert np array to constant tensor
dataset
Note: the code above embeds the features and labels arrays in the TensorFlow graph as tf.constant() operations. This works well for small datasets, but it wastes memory (the array contents are copied multiple times) and can run into the 2 GB limit of the tf.GraphDef protocol buffer.
Another common data source is a Python generator.
Note: while Python generators are convenient, this approach has limited portability and scalability. The generator must run in the same Python process that created it, and it is still subject to the Python GIL.
def count(stop):
i = 0
while i<stop:
yield i
i += 1
for n in count(5):
print(n)
0
1
2
3
4
Dataset.from_generator converts a generator into a tf.data.Dataset. from_generator takes a callable as input (not the generator itself), so the generator can be restarted each time the end is reached. It also accepts an optional args argument, which is passed to the callable as its arguments.
The output_types argument is required because tf.data builds a tf.Graph behind the scenes, and graph edges require a tf.dtype.
ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )
for count_batch in ds_counter.repeat().batch(10).take(10):
print(count_batch.numpy())
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
The output_shapes argument is not required, but specifying it is strongly recommended, because many TensorFlow operations do not support tensors of unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.
Note that the same output_shapes and output_types rules apply to other dataset methods as well.
Here is an example generator; it yields tuples of arrays in which the second array is a vector of unknown length:
def gen_series():  # generator
    i = 0
    while True:
        size = np.random.randint(0, 10)
        yield i, np.random.normal(size=(size,))  # array of shape (size,)
        i += 1
for i, series in gen_series():
print(i, ":", str(series))
if i > 5:
break
0 : [ 1.9201 0.2124 -0.3383 -0.1141 0.7749 -0.1499]
1 : []
2 : [ 0.5885 -1.1092 0.4577 2.2978 -1.1854]
3 : [-1.7452 1.0516]
4 : []
5 : []
6 : [-0.8563 -1.2055 -0.291 1.0448 0.1486 1.0402 1.8017]
The first array is an int32 with shape (); the second is a float32 with shape (None,).
ds_series = tf.data.Dataset.from_generator(
gen_series,
    output_types=(tf.int32, tf.float32),  # required
    output_shapes=((), (None,)))          # optional, but strongly recommended (see above)
ds_series
The tf.data.Dataset is now built. Note, however, that when batching a dataset with a variable shape, you need to use Dataset.padded_batch:
ds_series_batch = ds_series.shuffle(20).padded_batch(10, padded_shapes=([], [None]))
ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())
[ 6 1 10 0 3 17 12 9 5 23]
[[ 0.5812 -0.825 0.6075 -1.3856 -0.8151 -1.1908 0. 0. ]
[-0.7208 0.0611 0.0084 0.6592 0.8364 0.8327 -0.7164 0.8826]
[ 0.0391 -2.0019 0.4077 0.9304 0. 0. 0. 0. ]
[ 0.4397 -0.0901 -0.4993 0.3485 0.2481 0. 0. 0. ]
[ 0.0346 0. 0. 0. 0. 0. 0. 0. ]
[-1.0478 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0.3163 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. ]]
Note: since TensorFlow 2.2 the padded_shapes argument is no longer required; the default behavior is to pad all axes to the longest in the batch.
ds_series_batch = ds_series.shuffle(20).padded_batch(10)
For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.
First download the data:
flowers = tf.keras.utils.get_file(
'flower_photos',
'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
untar=True)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 5s 0us/step
Create the image.ImageDataGenerator:
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)
images, labels = next(img_gen.flow_from_directory(flowers))
Found 3670 images belonging to 5 classes.
print(images.dtype, images.shape)
print(labels.dtype, labels.shape)
float32 (32, 256, 256, 3)
float32 (32, 5)
ds = tf.data.Dataset.from_generator(
img_gen.flow_from_directory, args=[flowers],
output_types=(tf.float32, tf.float32),
output_shapes=([32,256,256,3], [32,5])
)
ds
See Loading TFRecords for an end-to-end example.
The tf.data API supports a variety of file formats, so you can process datasets that do not fit in memory. For example, TFRecord is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class lets you stream the contents of one or more TFRecord files as part of an input pipeline.
Here is an example using the test file from the French Street Name Signs (FSNS) dataset:
# Creates a dataset that reads all of the examples from two files.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001
7905280/7904079 [==============================] - 0s 0us/step
The filenames argument of TFRecordDataset can be a string, a list of strings, or a tf.Tensor of strings. Therefore, if you have two sets of files (say, one for training and one for validation), you can write a factory method that takes the filenames as an input argument and produces the dataset.
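A minimal sketch of such a factory method (the helper name and the batch size are just for illustration):
def make_tfrecord_dataset(filenames, batch_size=32):
    # filenames may be a string, a list of strings, or a tf.Tensor of strings
    return tf.data.TFRecordDataset(filenames=filenames).batch(batch_size)
# The same function can then serve the training and validation file lists alike,
# e.g. make_tfrecord_dataset([fsns_test_file]).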
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset
Many TensorFlow projects use serialized tf.train.Example records in their TFRecord files. Inspecting this kind of data requires decoding:
raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())
parsed.features.feature['image/text']
bytes_list {
  value: "Rue Perreyon"
}
See Loading Text for an end to end example.
Many datasets are distributed as one or more text files. tf.data.TextLineDataset extracts lines from one or more text files: given the filenames, it produces one string-valued element per line of those files.
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = [
tf.keras.utils.get_file(file_name, directory_url + file_name)
for file_name in file_names
]
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step
dataset = tf.data.TextLineDataset(file_paths)
Here are the first few lines of the first file:
for line in dataset.take(5):
print(line.numpy())
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus’ son;"
b’His wrath pernicious, who ten thousand woes’
b"Caused to Achaia’s host, sent many a soul"
b’Illustrious into Ades premature,’
b’And Heroes gave (so stood the will of Jove)’
Dataset.interleave lets you read the files in an alternating fashion, which makes it easier to mix lines from different files together.
files_ds = tf.data.Dataset.from_tensor_slices(file_paths)
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)
for i, line in enumerate(lines_ds.take(9)):
if i % 3 == 0:
print()
print(line.numpy())
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus’ son;"
b"\xef\xbb\xbfOf Peleus’ son, Achilles, sing, O Muse,"
b’\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought’
b’His wrath pernicious, who ten thousand woes’
b’The vengeance, deep and deadly; whence to Greece’
b’countless ills upon the Achaeans. Many a brave soul did it send’
b"Caused to Achaia’s host, sent many a soul"
b’Unnumbered ills arose; which many a soul’
b’hurrying down to Hades, and many a hero did it yield a prey to dogs and’
By default, a TextLineDataset yields every line of every file, which may not be what you want, for example if a file starts with a header line or contains comments. Such lines can be removed with the Dataset.skip() and Dataset.filter() transformations.
Using the Titanic dataset as an example, the following removes the header line and filters the rest to find survivors:
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
for line in titanic_lines.take(10):
print(line.numpy())
b’survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone’
b’0,male,22.0,1,0,7.25,Third,unknown,Southampton,n’
b’1,female,38.0,1,0,71.2833,First,C,Cherbourg,n’
b’1,female,26.0,0,0,7.925,Third,unknown,Southampton,y’
b’1,female,35.0,1,0,53.1,First,C,Southampton,n’
b’0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y’
b’0,male,2.0,3,1,21.075,Third,unknown,Southampton,n’
b’1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n’
b’1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n’
b’1,female,4.0,1,1,16.7,Third,G,Southampton,n’
def survived(line):
return tf.not_equal(tf.strings.substr(line, 0, 1), "0")
survivors = titanic_lines.skip(1).filter(survived)
for line in survivors.take(10):
print(line.numpy())
b’1,female,38.0,1,0,71.2833,First,C,Cherbourg,n’
b’1,female,26.0,0,0,7.925,Third,unknown,Southampton,y’
b’1,female,35.0,1,0,53.1,First,C,Southampton,n’
b’1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n’
b’1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n’
b’1,female,4.0,1,1,16.7,Third,G,Southampton,n’
b’1,male,28.0,0,0,13.0,Second,unknown,Southampton,y’
b’1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y’
b’1,male,28.0,0,0,35.5,First,A,Southampton,y’
b’1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n’
See Loading CSV Files, and Loading Pandas DataFrames for more examples.
CSV is a common file format for storing tabular data in plain text.
For example:
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")  # download the data
df = pd.read_csv(titanic_file, index_col=None)
df.head()
|   | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.2500 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.9250 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |
If your data is small enough to fit in memory, the Dataset.from_tensor_slices method accepts a dictionary as input, which makes importing the data very easy:
titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))
for feature_batch in titanic_slices.take(1):
for key, value in feature_batch.items():
print(" {!r:20s}: {}".format(key, value))
  'survived'          : 0
  'sex'               : b'male'
  'age'               : 22.0
  'n_siblings_spouses': 1
  'parch'             : 0
  'fare'              : 7.25
  'class'             : b'Third'
  'deck'              : b'unknown'
  'embark_town'       : b'Southampton'
  'alone'             : b'n'
A more flexible approach is to read the data from disk as needed. The tf.data module provides methods for extracting records from one or more CSV files that comply with RFC 4180.
The experimental.make_csv_dataset function is a high-level API for reading CSV files. It supports column type inference and many other conveniences, such as batching and shuffling, to simplify usage.
titanic_batches = tf.data.experimental.make_csv_dataset(
titanic_file, batch_size=4,
label_name="survived")
for feature_batch, label_batch in titanic_batches.take(1):
print("'survived': {}".format(label_batch))
print("features:")
for key, value in feature_batch.items():
print(" {!r:20s}: {}".format(key, value))
'survived': [1 1 0 0]
features:
  'sex'               : [b'female' b'female' b'male' b'male']
  'age'               : [28. 24. 29. 28.]
  'n_siblings_spouses': [0 1 0 0]
  'parch'             : [0 0 0 0]
  'fare'              : [ 7.2292 26. 30. 7.725 ]
  'class'             : [b'Third' b'Second' b'First' b'Third']
  'deck'              : [b'unknown' b'unknown' b'D' b'unknown']
  'embark_town'       : [b'Cherbourg' b'Southampton' b'Southampton' b'Queenstown']
  'alone'             : [b'y' b'n' b'y' b'y']
If you only need a subset of the columns, use the select_columns argument:
titanic_batches = tf.data.experimental.make_csv_dataset(
titanic_file, batch_size=4,
label_name="survived", select_columns=['class', 'fare', 'survived'])
for feature_batch, label_batch in titanic_batches.take(1):
print("'survived': {}".format(label_batch))
for key, value in feature_batch.items():
print(" {!r:20s}: {}".format(key, value))
'survived': [1 0 1 0]
  'fare'              : [ 10.5 7.25 23. 106.425]
  'class'             : [b'Second' b'Third' b'Second' b'First']
There is also a lower-level experimental.CsvDataset class. It offers finer-grained control but does not support column type inference; instead, you must specify the type of each column.
titanic_types = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string]
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)
for line in dataset.take(10):
print([item.numpy() for item in line])
[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
[0, b'male', 28.0, 0, 0, 8.4583, b'Third', b'unknown', b'Queenstown', b'y']
[0, b'male', 2.0, 3, 1, 21.075, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 27.0, 0, 2, 11.1333, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 14.0, 1, 0, 30.0708, b'Second', b'unknown', b'Cherbourg', b'n']
[1, b'female', 4.0, 1, 1, 16.7, b'Third', b'G', b'Southampton', b'n']
[0, b'male', 20.0, 0, 0, 8.05, b'Third', b'unknown', b'Southampton', b'y']
If some columns are empty, this low-level interface lets you provide default values instead of column types. (The %%writefile cell below is an IPython magic command and only works inside IPython/Jupyter.)
%%writefile missing.csv
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,
Writing missing.csv
# Creates a dataset that reads all of the records from two CSV files, each with
# four float columns which may have missing values.
record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset
for line in dataset:
print(line.numpy())
[1 2 3 4 ]
[999 2 3 4 ]
[1 999 3 4 ]
[1 2 999 4 ]
[1 2 3 999]
[999 999 999 999]
By default, a CsvDataset yields every column of every line of the file, which may not be desirable. For example, if you want to ignore a header line at the start of the file, or drop some columns, use the header and select_cols arguments.
# Creates a dataset that reads all of the records from two CSV files with
# headers, extracting float data from columns 2 and 4.
record_defaults = [999, 999] # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset
for line in dataset:
print(line.numpy())
[2 4]
[2 4]
[999 4]
[2 4]
[2 999]
[999 999]
Many datasets are distributed as a set of files, where each file is a single example.
flowers_root = tf.keras.utils.get_file(
'flower_photos',
'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
untar=True)
flowers_root = pathlib.Path(flowers_root)
The root directory contains a folder for each class:
for item in flowers_root.glob("*"):
print(item.name)
sunflowers
daisy
LICENSE.txt
roses
tulips
dandelion
The files in each class folder are the examples of that class:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
for f in list_ds.take(5):
print(f.numpy())
b’/home/kbuilder/.keras/datasets/flower_photos/roses/2980099495_cf272e90ca_m.jpg’
b’/home/kbuilder/.keras/datasets/flower_photos/sunflowers/14678298676_6db8831ee6_m.jpg’
b’/home/kbuilder/.keras/datasets/flower_photos/tulips/485266837_671def8627.jpg’
b’/home/kbuilder/.keras/datasets/flower_photos/daisy/7377004908_5bc0cde347_n.jpg’
b’/home/kbuilder/.keras/datasets/flower_photos/dandelion/9726260379_4e8ee66875_m.jpg’
Read the file contents with tf.io.read_file, extract the label from the path, and return (image, label) pairs:
def process_path(file_path):
label = tf.strings.split(file_path, '/')[-2]
return tf.io.read_file(file_path), label
labeled_ds = list_ds.map(process_path)
for image_raw, label_text in labeled_ds.take(1):
print(repr(image_raw.numpy()[:100]))
print()
print(label_text.numpy())
b’\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00\x1ccmp3.10.3.2Lq3 0xad6b4f35\x00\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05\n\x07\x07\x06\x08\x0c\n\x0c\x0c\x0b\n\x0b\x0b\r\x0e\x12\x10\r\x0e\x11\x0e\x0b\x0b\x10’
b’roses’
The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly that, with the same restrictions as the tf.stack() operator applied to each component of the elements: for each component i, all elements must have a tensor of exactly the same shape.
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)
for batch in batched_dataset.take(4):
print([arr.numpy() for arr in batch])
[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8, 9, 10, 11]), array([ -8, -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]
While Dataset.batch works, it results in an unknown batch size, because the last batch may not be full. Note the Nones in the shape:
batched_dataset
Use the drop_remainder argument to ignore that last batch and get full shape propagation:
batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset
The recipe above works for tensors that all have the same size. However, many models (for example, sequence models) work with input data whose size varies (for example, sequences of different lengths). To handle this case, the Dataset.padded_batch() transformation lets you batch tensors of different shapes by specifying one or more dimensions in which they may be padded.
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))
for batch in dataset.take(2):
print(batch.numpy())
print()
[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
Dataset.padded_batch() lets you set different padding for each dimension of each component, and the padding may be variable length (signified by None) or constant length. You can also override the padding value, which defaults to 0.
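As a small sketch of overriding the padding value (the choice of -1 here is arbitrary; the value must match the element dtype, which is int64 for Dataset.range):
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
padded = dataset.padded_batch(4, padded_shapes=(None,),
                              padding_values=tf.constant(-1, dtype=tf.int64))
for batch in padded.take(1):
    print(batch.numpy())
# [[-1 -1 -1]
#  [ 1 -1 -1]
#  [ 2  2 -1]
#  [ 3  3  3]]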
The tf.data API offers two main ways to repeat data over multiple epochs. The simplest is the Dataset.repeat() transformation, demonstrated below:
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
def plot_batch_sizes(ds):
batch_sizes = [batch.shape[0] for batch in ds]
plt.bar(range(len(batch_sizes)), batch_sizes)
plt.xlabel('Batch number')
plt.ylabel('Batch size')
Calling Dataset.repeat() with no arguments repeats the input indefinitely.
Dataset.repeat concatenates its iterations without signaling the end of one epoch and the beginning of the next. Because of this, applying Dataset.batch after Dataset.repeat yields batches that straddle epoch boundaries:
titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)
If you need clear epoch separation, put Dataset.batch before Dataset.repeat:
titanic_batches = titanic_lines.batch(128).repeat(3)
plot_batch_sizes(titanic_batches)
If you want to perform a custom computation (for example, collecting statistics) at the end of each epoch, the simplest approach is to restart the dataset iteration on each epoch:
epochs = 3
dataset = titanic_lines.batch(128)
for epoch in range(epochs):
for batch in dataset:
print(batch.shape)
print("End of epoch: ", epoch)
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 2
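Building on the loop above, a small sketch of actually accumulating a per-epoch statistic (here just the number of lines seen per epoch):
for epoch in range(epochs):
    lines_seen = 0
    for batch in dataset:
        lines_seen += batch.shape[0]
    print("Epoch", epoch, "saw", lines_seen, "lines")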
Dataset.shuffle() maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.
Note: larger buffer_sizes shuffle more thoroughly, but they can take a lot of memory and significant time to fill (elements are only produced once the buffer is full). If this becomes a problem, consider using Dataset.interleave across files instead.
Add an index to the dataset so the effect can be seen:
lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()
dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset
Since buffer_size is 100 and the batch size is 20, the first batch contains no elements with an index over 120.
n,line_batch = next(iter(dataset))
print(n.numpy())
[ 92 84 52 3 27 100 44 26 2 63 54 93 69 97 10 101 32 65
109 40]
As with batch, the order of shuffle relative to repeat matters.
Dataset.shuffle does not signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat shows every element of one epoch before moving on to the next (only then does data from the next epoch enter the shuffle buffer):
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)
print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
print(n.numpy())
Here are the item ID’s near the epoch boundary:
[523 318 510 467 627 433 514 594 454 560]
[596 566 205 613 493 570 615 411 556 496]
[598 528 623 559 299 473 391 536]
[41 14 51 3 97 70 34 99 63 52]
[ 49 69 104 0 112 90 38 88 11 83]
shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()
A repeat placed before a shuffle, on the other hand, mixes the epoch boundaries together: as the current epoch ends, data from the start of the next epoch enters the shuffle buffer and is shuffled together with the tail of the previous epoch.
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)
print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
print(n.numpy())
Here are the item ID’s near the epoch boundary:
[545 576 610 588 0 595 582 10 597 495]
[353 540 7 490 440 563 559 27 600 504]
[624 476 25 519 608 525 477 30 560 363]
[468 34 3 32 47 22 609 449 627 20 ]
[611 599 577 541 62 13 601 606 15 18 ]
[26 43 607 434 73 616 55 552 57 6 ]
[587 544 584 1 16 51 596 614 21 50 ]
[39 46 76 40 78 71 37 28 2 69 ]
[574 24 88 12 543 100 89 68 445 83 ]
[441 619 557 97 113 96 38 79 613 92 ]
[29 414 65 462 537 232 126 118 75 11 ]
[87 121 80 585 114 72 99 112 102 589]
[77 61 542 369 8 133 129 567 136 344]
[81 91 139 128 49 66 565 64 152 90 ]
[538 494 154 547 131 147 166 158 111 165]
repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()
Dataset.map(f) applies the function f to each element of the dataset and returns the transformed dataset, which makes data preprocessing very convenient.
Note: the arguments and return values of f must all be tf.Tensor objects.
Data preprocessing with Dataset.map()
When training a neural network on real-world data, the images often need to be converted to a common size so that they can be batched. This section therefore demonstrates how to use Dataset.map() to decode and resize images.
The flowers dataset is used again as the example:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
Write a function that parses each element of list_ds:
# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
parts = tf.strings.split(filename, '/')
label = parts[-2]
image = tf.io.read_file(filename)
image = tf.image.decode_jpeg(image)
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.resize(image, [128, 128])
return image, label
Test that the function works:
file_path = next(iter(list_ds))
image, label = parse_image(file_path)
def show(image, label):
plt.figure()
plt.imshow(image)
plt.title(label.numpy().decode('utf-8'))
plt.axis('off')
show(image, label)
Apply the parse_image function to the whole list_ds dataset:
images_ds = list_ds.map(parse_image)
for image, label in images_ds.take(2):  # look at 2 examples to check correctness
show(image, label)
Preprocessing with functions that are not built into TensorFlow is slower than using native TF ops (Python does not run as fast as C++, and crossing the language boundary is itself a bottleneck), so use built-in TF functions for preprocessing wherever possible. Sometimes, however, it is convenient to call a Python library. You can do so inside Dataset.map() by wrapping the call with tf.py_function().
For example, suppose you want to apply an arbitrary rotation to the images, but tf.image only provides tf.image.rot90, which is not very useful for data augmentation.
Note: tensorflow_addons has a TF-compatible rotate in tensorflow_addons.image.rotate.
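A minimal sketch of that option, assuming the tensorflow_addons package is installed (the helper name and the +/-30 degree range are just for illustration):
import tensorflow_addons as tfa
def tfa_random_rotate(image, label):
    # tfa.image.rotate expects the angle in radians
    angle = tf.random.uniform([], minval=-30.0, maxval=30.0) * np.pi / 180.0
    return tfa.image.rotate(image, angle), label
# rotated_ds = images_ds.map(tfa_random_rotate)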
To implement the random rotation with a plain Python library instead, we can use the scipy.ndimage.rotate function:
import scipy.ndimage as ndimage
def random_rotate_image(image):
image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
return image
image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0…1] for floats or [0…255] for integers).
To use random_rotate_image inside Dataset.map, we need to describe the returned shapes and types:
def tf_random_rotate_image(image, label):
im_shape = image.shape
[image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
image.set_shape(im_shape)
return image, label
rot_ds = images_ds.map(tf_random_rotate_image)
for image, label in rot_ds.take(2):
show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0…1] for floats or [0…255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0…1] for floats or [0…255] for integers).
tf.Example protocol buffer messages
Many input pipelines extract tf.train.Example protocol buffer messages from TFRecord files. Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset
You can work with tf.train.Example protos outside of a tf.data.Dataset to inspect the data:
raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())
feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])
raw_example = next(iter(dataset))
def tf_parse(eg):
example = tf.io.parse_example(
eg[tf.newaxis], {
'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
})
return example['image/encoded'][0], example['image/text'][0]
img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")
b'Rue Perreyon'
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X' ...
decoded = dataset.map(tf_parse)
decoded
image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape
TensorShape([10])
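The batch has shape (10,) because the images are still undecoded PNG byte strings. A quick sketch of decoding them so they can be stacked into a dense batch (the 150x600 resize target simply mirrors the FSNS image size and is only illustrative):
def decode_img(img_bytes, txt):
    img = tf.image.decode_png(img_bytes)
    img = tf.image.resize(img, [150, 600])  # give every image the same static shape
    return img, txt
img_batch, txt_batch = next(iter(decoded.map(decode_img).batch(10)))
print(img_batch.shape)  # (10, 150, 600, 3) for this FSNS test shard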
For an end to end time series example see: Time series forecasting.
Time series data is usually organized with the time axis intact.
Use Dataset.range to simulate a time series:
range_ds = tf.data.Dataset.range(100000)
Typically, models based on this kind of data want a contiguous time slice.
The simplest approach is to batch the data:
batches = range_ds.batch(10, drop_remainder=True)
for batch in batches.take(5):
print(batch.numpy())
[0 1 2 3 4 5 6 7 8 9 ]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
To make dense one-step-ahead predictions, shift the features and labels by one step relative to each other:
def dense_1_step(batch):
# Shift features and labels one step relative to each other.
return batch[:-1], batch[1:]
predict_dense_1_step = batches.map(dense_1_step)
for features, label in predict_dense_1_step.take(3):
print(features.numpy(), " => ", label.numpy())
[0 1 2 3 4 5 6 7 8 ] => [1 2 3 4 5 6 7 8 9 ]
[10 11 12 13 14 15 16 17 18] => [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28] => [21 22 23 24 25 26 27 28 29]
To predict a whole window instead of a fixed offset, you can split the batches into two parts:
batches = range_ds.batch(15, drop_remainder=True)
def label_next_5_steps(batch):
return (batch[:-5], # Take the first 5 steps
batch[-5:]) # take the remainder
predict_5_steps = batches.map(label_next_5_steps)
for features, label in predict_5_steps.take(3):
print(features.numpy(), " => ", label.numpy())
[0 1 2 3 4 5 6 7 8 9 ] => [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24] => [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42 43 44]
To allow some overlap between the features of one batch and the labels of another, use Dataset.zip:
feature_length = 10
label_length = 5
features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:-5])
predict_5_steps = tf.data.Dataset.zip((features, labels))
for features, label in predict_5_steps.take(3):
print(features.numpy(), " => ", label.numpy())
[0 1 2 3 4 5 6 7 8 9 ] => [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19] => [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29] => [30 31 32 33 34]
While Dataset.batch works, there are situations that call for finer control. The Dataset.window method gives you complete control, but requires some care: it returns a Dataset of Datasets (see the Dataset structure section above for details).
window_size = 5
windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
print(sub_ds)
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
The Dataset.flat_map method can take a dataset of datasets and flatten it into a single dataset:
for x in windows.flat_map(lambda x: x).take(30):
print(x.numpy(), end=' ')
WARNING:tensorflow:AutoGraph could not transform the lambda and will run it as-is.
Cause: could not parse the source code:
for x in windows.flat_map(lambda x: x).take(30):
This error may be avoided by creating the lambda in a standalone statement.
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
In nearly all cases, you will want to batch the sub-datasets first:
def sub_to_batch(sub):
return sub.batch(window_size, drop_remainder=True)
for example in windows.flat_map(sub_to_batch).take(5):
print(example.numpy())
[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
Now you can see that the shift argument controls how far each window moves.
Putting all of this together, you might build the following function:
def make_window_dataset(ds, window_size=5, shift=1, stride=1):
windows = ds.window(window_size, shift=shift, stride=stride)
def sub_to_batch(sub):
return sub.batch(window_size, drop_remainder=True)
windows = windows.flat_map(sub_to_batch)
return windows
ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)
for example in ds.take(10):
print(example.numpy())
[0 3 6 9 12 15 18 21 24 27]
[5 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
Then it is easy to extract the labels, as before:
dense_labels_ds = ds.map(dense_1_step)
for inputs,labels in dense_labels_ds.take(3):
print(inputs.numpy(), "=>", labels.numpy())
[0 3 6 9 12 15 18 21 24] => [3 6 9 12 15 18 21 24 27]
[5 8 11 14 17 20 23 26 29] => [8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
When working with a dataset that is very class-imbalanced, you may want to resample it. tf.data provides two ways to do this. The credit card fraud dataset is a good example of this kind of problem.
Note: see Imbalanced Data for a full tutorial.
zip_path = tf.keras.utils.get_file(
origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
fname='creditcard.zip',
extract=True)
csv_path = zip_path.replace('.zip', '.csv')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip
69156864/69155632 [==============================] - 2s 0us/step
creditcard_ds = tf.data.experimental.make_csv_dataset(
csv_path, batch_size=1024, label_name="Class",
# Set the column types: 30 floats and an int.
column_defaults=[float()]*30+[int()])
Check the class distribution; it is highly skewed:
def count(counts, batch):
features, labels = batch
class_1 = labels == 1
class_1 = tf.cast(class_1, tf.int32)
class_0 = labels == 0
class_0 = tf.cast(class_0, tf.int32)
counts['class_0'] += tf.reduce_sum(class_0)
counts['class_1'] += tf.reduce_sum(class_1)
return counts
counts = creditcard_ds.take(10).reduce(
initial_state={'class_0': 0, 'class_1': 0},
reduce_func = count)
counts = np.array([counts['class_0'].numpy(),
counts['class_1'].numpy()]).astype(np.float32)
fractions = counts/counts.sum()
print(fractions)
[0.995 0.005]
A common approach to training with an imbalanced dataset is to balance it, and tf.data includes a few methods for doing so.
Datasets.sampling
One way to resample a dataset is to use sample_from_datasets. This works well when each class has its own separate data.Dataset.
Here, filters are used to generate per-class datasets from the credit card fraud data:
negative_ds = (
creditcard_ds
.unbatch()
.filter(lambda features, label: label==0)
.repeat())
positive_ds = (
creditcard_ds
.unbatch()
.filter(lambda features, label: label==1)
.repeat())
WARNING:tensorflow:AutoGraph could not transform the lambda and will run it as-is.
Cause: could not parse the source code:
.filter(lambda features, label: label==0)
This error may be avoided by creating the lambda in a standalone statement.
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
(The same warning is printed again for the label==1 filter.)
for features, label in positive_ds.batch(10).take(1):
print(label.numpy())
[1 1 1 1 1 1 1 1 1 1]
To balance the two datasets with tf.data.experimental.sample_from_datasets, pass the datasets and a weight for each:
balanced_ds = tf.data.experimental.sample_from_datasets(
[negative_ds, positive_ds], [0.5, 0.5]).batch(10)
Now the dataset produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
print(labels.numpy())
[1 0 1 1 0 1 0 0 0 0]
[1 1 0 0 0 0 0 1 0 1]
[1 1 0 0 0 1 0 0 1 1]
[1 0 0 1 0 1 1 0 0 0]
[0 0 1 0 0 0 0 1 1 1]
[0 1 1 1 1 0 0 1 0 1]
[0 0 0 0 1 0 1 1 1 1]
[0 0 0 1 1 1 0 0 0 1]
[1 1 0 1 1 1 1 1 1 0]
[1 1 1 1 0 1 0 0 1 1]
experimental.rejection_resample
One problem with experimental.sample_from_datasets is that it needs a separate tf.data.Dataset per class. That can be built with Dataset.filter, but doing so loads all the data twice.
The data.experimental.rejection_resample function can balance a dataset while loading the data only once; elements are dropped from the dataset to achieve balance.
rejection_resample takes a class_func argument. The class_func is applied to each dataset element and determines which class an example belongs to for balancing purposes.
The elements of creditcard_ds are already (features, label) pairs, so class_func just needs to return those labels:
def class_func(features, label):
return label
The resampler also needs a target distribution and, optionally, an estimate of the initial distribution:
resampler = tf.data.experimental.rejection_resample(
class_func, target_dist=[0.5, 0.5], initial_dist=fractions)
The resampler deals with individual examples, so you must unbatch the dataset before applying it:
resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/data/experimental/ops/resampling.py:156: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
The resampler returns (class, example) pairs, where class is the output of class_func. In this case, example was already a (features, label) pair, so use map to drop the extra copy of the labels:
balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)
Now the dataset produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
print(labels.numpy())
[1 1 1 1 0 0 1 0 0 0]
[1 1 1 1 1 1 1 1 0 1]
[1 1 0 1 0 0 0 1 0 0]
[0 0 0 1 1 0 0 1 1 0]
[1 0 1 0 0 1 1 0 1 0]
[1 1 0 1 0 1 0 0 1 0]
[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[1 1 0 1 1 0 0 1 1 1]
[1 1 0 0 0 1 0 1 1 0]
Using tf.data with tf.keras
The tf.keras API greatly simplifies creating and using machine learning models. Its .fit(), .evaluate(), and .predict() APIs accept tf.data datasets as input. Here is a quick setup:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255.0
labels = labels.astype(np.int32)
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)
model = tf.keras.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
Model.fit and Model.evaluate expect both data and labels:
model.fit(fmnist_train_ds, epochs=2)
Epoch 1/2
WARNING:tensorflow:Layer flatten is casting an input tensor from dtype float64 to the layer’s dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx.
If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.
To change all layers to have dtype float64 by default, call tf.keras.backend.set_floatx('float64'). To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
1875/1875 [==============================] - 3s 2ms/step - loss: 0.6013 - accuracy: 0.7970
Epoch 2/2
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4617 - accuracy: 0.8418
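Incidentally, the float64 warning above appears because images/255.0 produces a float64 NumPy array; a quick sketch that casts to float32 before building the dataset avoids it:
images32 = images.astype(np.float32)  # images is float64 after the /255.0 above; cast once, up front
fmnist_train_ds32 = tf.data.Dataset.from_tensor_slices((images32, labels)).shuffle(5000).batch(32)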
As you can see, tf.keras works well with tf.data.
If the input pipeline you pass to .fit() calls Dataset.repeat() while it is being built, you also need to pass the steps_per_epoch argument to .fit():
model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)
Epoch 1/2
20/20 [==============================] - 0s 2ms/step - loss: 0.4650 - accuracy: 0.8422
Epoch 2/2
20/20 [==============================] - 0s 2ms/step - loss: 0.3897 - accuracy: 0.8797
For evaluation, you can pass the dataset directly:
loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4423 - accuracy: 0.8473
Loss : 0.44227170944213867
Accuracy : 0.847266674041748
For large or repeated datasets, set the number of evaluation steps:
loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)
10/10 [==============================] - 0s 2ms/step - loss: 0.4557 - accuracy: 0.8188
Loss : 0.45573288202285767
Accuracy : 0.8187500238418579
Labels are not needed when calling Model.predict:
predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)
(320, 10)
If your dataset does contain labels, predict ignores them automatically:
result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)
(320, 10)
Using tf.data with tf.estimator
To use a Dataset in the input_fn of a tf.estimator.Estimator, simply make sure the input_fn returns the Dataset.
The official guide is a bit thin here; I recommend reading 《TensorFlow Estimator 官方文档之----Dataset for Estimator》, which explains in more detail how to use tf.data with tf.estimator.
import tensorflow_datasets as tfds
def train_input_fn():
titanic = tf.data.experimental.make_csv_dataset(
titanic_file, batch_size=32,
label_name="survived")
titanic_batches = (
titanic.cache().repeat().shuffle(500)
.prefetch(tf.data.experimental.AUTOTUNE))
return titanic_batches
embark = tf.feature_column.categorical_column_with_hash_bucket('embark_town', 32)
cls = tf.feature_column.categorical_column_with_vocabulary_list('class', ['First', 'Second', 'Third'])
age = tf.feature_column.numeric_column('age')
import tempfile
model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
model_dir=model_dir,
feature_columns=[embark, cls, age],
n_classes=2
)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {’_model_dir’: ‘/tmp/tmp7xfmvz5w’, ‘_tf_random_seed’: None, ‘_save_summary_steps’: 100, ‘_save_checkpoints_steps’: None, ‘_save_checkpoints_secs’: 600, ‘_session_config’: allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, ‘_keep_checkpoint_max’: 5, ‘_keep_checkpoint_every_n_hours’: 10000, ‘_log_step_count_steps’: 100, ‘_train_distribute’: None, ‘_device_fn’: None, ‘_protocol’: None, ‘_eval_distribute’: None, ‘_experimental_distribute’: None, ‘_experimental_max_worker_delay_secs’: None, ‘_session_creation_timeout_secs’: 7200, ‘_service’: None, ‘_cluster_spec’: ClusterSpec({}), ‘_task_type’: ‘worker’, ‘_task_id’: 0, ‘_global_id_in_cluster’: 0, ‘_master’: ‘’, ‘_evaluation_master’: ‘’, ‘_is_chief’: True, ‘_num_ps_replicas’: 0, ‘_num_worker_replicas’: 1}
model = model.train(input_fn=train_input_fn, steps=100)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column_v2.py:560: Layer.add_variable (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use layer.add_weight method instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/ftrl.py:143: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0…
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp7xfmvz5w/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0…
INFO:tensorflow:loss = 0.6931472, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100…
INFO:tensorflow:Saving checkpoints for 100 into /tmp/tmp7xfmvz5w/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100…
INFO:tensorflow:Loss for final step: 0.5968354.
result = model.evaluate(train_input_fn, steps=10)
for key, value in result.items():
print(key, ":", value)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-03-28T01:27:11Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp7xfmvz5w/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Inference Time : 0.65018s
INFO:tensorflow:Finished evaluation at 2020-03-28-01:27:11
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.684375, accuracy_baseline = 0.603125, auc = 0.73216105, auc_precision_recall = 0.6447562, average_loss = 0.60841894, global_step = 100, label/mean = 0.396875, loss = 0.60841894, precision = 0.76, prediction/mean = 0.31196585, recall = 0.2992126
INFO:tensorflow:Saving ‘checkpoint_path’ summary for global step 100: /tmp/tmp7xfmvz5w/model.ckpt-100
accuracy : 0.684375
accuracy_baseline : 0.603125
auc : 0.73216105
auc_precision_recall : 0.6447562
average_loss : 0.60841894
label/mean : 0.396875
loss : 0.60841894
precision : 0.76
prediction/mean : 0.31196585
recall : 0.2992126
global_step : 100
for pred in model.predict(train_input_fn):
for key, value in pred.items():
print(key, ":", value)
break
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp7xfmvz5w/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
logits : [-0.1131]
logistic : [0.4717]
probabilities : [0.5283 0.4717]
class_ids : [0]
classes : [b’0’]
all_class_ids : [0 1]
all_classes : [b’0’ b’1’]
Note: this article is based on the official TensorFlow guide for importing data with tf.data (Learn > Guide > tf.data).
Updated March 29, 2020.