So far we have only worked with datasets that fit in memory, but deep learning systems are often trained on very large datasets that will not fit in RAM. Ingesting such datasets can be tricky with other deep learning libraries, but TensorFlow makes it easy thanks to its Data API: you create a dataset object, tell it where to get the data and how to transform it, and TensorFlow takes care of the implementation details such as multithreading, queuing, batching and so on. The Data API also works seamlessly with tf.keras.
The Data API can read text files and binary files. TFRecord is a simple and efficient binary format based on Protocol Buffers. The Data API also supports reading from SQL databases.
After reading the data you usually need to preprocess it, for example by normalizing it. You can either write custom preprocessing layers or use the standard preprocessing layers provided by Keras.
TF Transform (tf.Transform): lets you write a preprocessing function that is run in batch mode over the full training set, converts it to a TF Function, and incorporates it into the model, so that once the model is deployed to production it can preprocess new samples on the fly.
TF Datasets (TFDS): provides convenient functions to download common datasets, including large ones such as ImageNet.
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import pandas as pd
import os
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import tensorflow_transform as tft
for i in (tf, np, mpl, pd, tfds, hub, tft):
print("{}: {}".format(i.__name__, i.__version__))
Output:
tensorflow: 2.2.0
numpy: 1.17.4
matplotlib: 3.1.2
pandas: 0.25.3
tensorflow_datasets: 3.1.0
tensorflow_hub: 0.8.0
tensorflow_transform: 0.22.0
tf.data.Dataset.from_tensor_slices() creates a dataset that is stored in memory:
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset
Output:
<TensorSliceDataset shapes: (), types: tf.int32>
The from_tensor_slices() function takes a tensor and returns a tf.data.Dataset whose elements are all the slices of that tensor, so the following is equivalent:
dataset = tf.data.Dataset.range(10)
for item in dataset:
print(item)
Output:
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
In the example below, repeat() is first used to repeat the dataset three times, returning a new dataset;
then batch() is called on the result to group the items into batches of 7, again returning a new dataset:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
print(item)
Output:
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)
You can set drop_remainder=True to drop the final batch when it contains fewer items than the batch size:
dataset = tf.data.Dataset.range(10)
dataset = dataset.repeat(3).batch(7, drop_remainder=True)
for item in dataset:
print(item)
Output:
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
Note: dataset methods do not modify the dataset they are called on; each method returns a new dataset.
You can also transform the items by calling the map() method:
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2)
dataset = dataset.repeat(3).batch(7, drop_remainder=True)
for item in dataset:
print(item)
Output:
tf.Tensor([ 0 2 4 6 8 10 12], shape=(7,), dtype=int64)
tf.Tensor([14 16 18 0 2 4 6], shape=(7,), dtype=int64)
tf.Tensor([ 8 10 12 14 16 18 0], shape=(7,), dtype=int64)
tf.Tensor([ 2 4 6 8 10 12 14], shape=(7,), dtype=int64)
Note: for computationally intensive transformations, you can spread the work over multiple threads by setting the num_parallel_calls argument. The function passed to map() must be convertible to a TF Function.
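For example, here is a minimal sketch of a parallel map (the doubling function and the thread count are just illustrative):
dataset = tf.data.Dataset.range(10)
# Run the transformation on 4 threads; tf.data.experimental.AUTOTUNE also works
# here and lets TensorFlow pick the number of threads itself.
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=4)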
While map() applies a function to each item, apply() applies a function to the dataset as a whole:
dataset = dataset.apply(tf.data.experimental.unbatch())
dataset = dataset.filter(lambda x: x < 10) # keep only items < 10
for item in dataset.take(5):
print(item)
Output:
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
Gradient descent works best when the training instances are independent and identically distributed. The simplest way to achieve this is to shuffle the instances with the shuffle() method: it creates a buffer of the given size, fills it with items from the source dataset, and whenever an item is requested it picks one at random from the buffer and replaces it with a fresh item from the source, until both the source and the buffer are exhausted. The buffer should be large enough for the shuffling to be effective, but it must not exceed the available RAM.
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
for item in dataset:
print(item)
Output:
tf.Tensor([0 3 4 2 1 5 8], shape=(7,), dtype=int64)
tf.Tensor([6 9 7 2 3 1 4], shape=(7,), dtype=int64)
tf.Tensor([6 0 7 9 0 1 2], shape=(7,), dtype=int64)
tf.Tensor([8 4 5 5 3 8 9], shape=(7,), dtype=int64)
tf.Tensor([7 6], shape=(2,), dtype=int64)
As the output above shows, the integers 0 to 9 are repeated three times, shuffled with a buffer of size 3, and batched with a batch size of 7.
If you call repeat() on a shuffled dataset, a new order is generated at every iteration by default; when you need a fixed order (for example when testing or debugging), set reshuffle_each_iteration=False.
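A minimal sketch of what this could look like (the buffer size and seed are just illustrative):
# With reshuffle_each_iteration=False, every pass over the repeated dataset
# reuses the same shuffled order instead of drawing a new one.
dataset = tf.data.Dataset.range(10)
dataset = dataset.shuffle(buffer_size=3, seed=42, reshuffle_each_iteration=False)
dataset = dataset.repeat(2)
for item in dataset:
    print(item.numpy(), end=" ")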
For a large dataset that does not fit in memory, this simple buffered shuffling may not be enough, because the buffer is tiny compared to the dataset. One solution is to shuffle the source data itself, for example by splitting it into multiple files and reading those files in a random order during training. Instances located in the same file would still end up close to each other, so you can additionally read from several files at once and interleave their records (e.g., one line from each file at a time). All of this takes only a few lines of code with the Data API.
Let's load the California housing dataset, shuffle it, and split it into a training set, a validation set and a test set;
then split each set into many CSV files:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
housing = fetch_california_housing()  # load the dataset
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target.reshape(-1, 1),
random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
housing_dir = os.path.join("datasets", "housing")
os.makedirs(housing_dir, exist_ok=True)
path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
filepaths = []
m = len(data)
for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
part_csv = path_format.format(name_prefix, file_idx)
filepaths.append(part_csv)
with open(part_csv, "wt", encoding="utf-8") as f:
if header is not None:
f.write(header)
f.write("\n")
for row_idx in row_indices:
f.write(",".join([repr(col) for col in data[row_idx]]))
f.write("\n")
return filepaths
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)
train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)
pd.read_csv(train_filepaths[0]).head()
Output: (the first five rows of the first CSV file, displayed as a DataFrame)
with open(train_filepaths[0]) as f:
for i in range(5):
print(f.readline(), end="")
Output:
MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621
train_filepaths
Output:
['datasets\\housing\\my_train_00.csv',
'datasets\\housing\\my_train_01.csv',
'datasets\\housing\\my_train_02.csv',
'datasets\\housing\\my_train_03.csv',
'datasets\\housing\\my_train_04.csv',
'datasets\\housing\\my_train_05.csv',
'datasets\\housing\\my_train_06.csv',
'datasets\\housing\\my_train_07.csv',
'datasets\\housing\\my_train_08.csv',
'datasets\\housing\\my_train_09.csv',
'datasets\\housing\\my_train_10.csv',
'datasets\\housing\\my_train_11.csv',
'datasets\\housing\\my_train_12.csv',
'datasets\\housing\\my_train_13.csv',
'datasets\\housing\\my_train_14.csv',
'datasets\\housing\\my_train_15.csv',
'datasets\\housing\\my_train_16.csv',
'datasets\\housing\\my_train_17.csv',
'datasets\\housing\\my_train_18.csv',
'datasets\\housing\\my_train_19.csv']
list_files() returns a dataset of file paths, shuffled by default; pass shuffle=False if you do not want the paths shuffled:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
for filepath in filepath_dataset:
print(filepath)
Output:
tf.Tensor(b'datasets\\housing\\my_train_05.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_16.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_01.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_17.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_00.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_14.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_10.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_02.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_12.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_19.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_07.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_09.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_13.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_15.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_11.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_18.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_04.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_06.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_03.csv', shape=(), dtype=string)
tf.Tensor(b'datasets\\housing\\my_train_08.csv', shape=(), dtype=string)
The interleave() method is used to read from 5 files at a time and interleave their lines; skip() skips the header row of each file:
n_readers = 5
dataset = filepath_dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1), cycle_length=n_readers)  # skip(1) skips the header row
for line in dataset.take(5):
print(line.numpy())
Output:
b'4.5909,16.0,5.475877192982456,1.0964912280701755,1357.0,2.9758771929824563,33.63,-117.71,2.418'
b'2.4792,24.0,3.4547038327526134,1.1341463414634145,2251.0,3.921602787456446,34.18,-118.38,2.0'
b'4.2708,45.0,5.121387283236994,0.953757225433526,492.0,2.8439306358381504,37.48,-122.19,2.67'
b'2.1856,41.0,3.7189873417721517,1.0658227848101265,803.0,2.0329113924050635,32.76,-117.12,1.205'
b'4.1812,52.0,5.701388888888889,0.9965277777777778,692.0,2.4027777777777777,33.73,-118.31,3.215'
Note: for interleave() to work well, the files should preferably be of identical length (number of lines); otherwise the end of the longest files will not be interleaved with the others (those lines are only read after the shorter files are exhausted).
By default, interleave() pulls one line at a time from each file without any parallelism. To read the files in parallel, set the num_parallel_calls argument to the number of threads you want, or to tf.data.experimental.AUTOTUNE to let TensorFlow choose the number of threads dynamically based on the available CPU.
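A minimal sketch of the parallel variant (it reuses filepath_dataset and n_readers from above; only the extra argument is new):
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)  # let TensorFlow pick the thread count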
#X_mean, X_std = [...] # mean and scale of each feature in the training set
n_inputs = 8 # X_train.shape[-1]
@tf.function
def preprocess(line):
defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
fields = tf.io.decode_csv(line, record_defaults=defs)
x = tf.stack(fields[:-1])
y = tf.stack(fields[-1:])
return (x - X_mean) / X_std, y
This assumes that the mean and standard deviation of each feature have already been computed on the training set (as we did above with StandardScaler).
tf.io.decode_csv(): the first argument is the line to parse, the second is a list containing the default value for each column.
decode_csv() returns a list of scalar tensors (one per column); tf.stack() stacks them into a 1-D tensor.
Finally, the input features are standardized by subtracting the mean and dividing by the standard deviation.
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')
Output:
(<a float32 tensor of shape (8,) containing the standardized features>,
 <a float32 tensor of shape (1,) containing the target, [2.782]>)
record_defaults=[0, np.nan, tf.constant(np.nan, dtype=tf.float64), "Hello", tf.constant([])]
parsed_fields = tf.io.decode_csv('1,2,3,4,5', record_defaults=record_defaults)
parsed_fields
Output:
[<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.0>,
 <tf.Tensor: shape=(), dtype=float64, numpy=3.0>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'4'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
All missing values are replaced by the provided default values:
parsed_fields = tf.io.decode_csv(',,,,5', record_defaults)
parsed_fields
Output:
[<tf.Tensor: shape=(), dtype=int32, numpy=0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=nan>,
 <tf.Tensor: shape=(), dtype=float64, numpy=nan>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'Hello'>,
 <tf.Tensor: shape=(), dtype=float32, numpy=5.0>]
The default value of the fifth field is tf.constant([]), which tells TensorFlow the field is required; an exception is raised if it is missing:
try:
parsed_fields = tf.io.decode_csv(',,,,', record_defaults)
except tf.errors.InvalidArgumentError as ex:
print(ex)
Output:
Field 4 is required but missing in record 0! [Op:DecodeCSV]
The number of fields must also match exactly, or an exception is raised:
try:
parsed_fields = tf.io.decode_csv('1,2,3,4,5,6,7', record_defaults)
except tf.errors.InvalidArgumentError as ex:
print(ex)
Output:
Expect 5 fields but have 7 in record 0 [Op:DecodeCSV]
Let's put all the preprocessing steps together in a single reusable function:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5, n_read_threads=None, shuffle_buffer_size=10000,
n_parse_threads=5, batch_size=32):
dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
dataset = dataset.interleave(lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
cycle_length=n_readers, num_parallel_calls=n_read_threads)
dataset = dataset.shuffle(shuffle_buffer_size)
dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
dataset = dataset.batch(batch_size)
return dataset.prefetch(1)
train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
print("X =", X_batch)
print("y =", y_batch)
print()
Output:
X = tf.Tensor(
[[ 1.0026022 -0.2867314 0.01174602 -0.0658901 -0.38811532 0.07317533
0.8215112 -1.2472363 ]
[-0.74581194 -0.8404887 -0.21125445 -0.02732265 3.6885073 -0.20515272
0.5404227 -0.07777973]
[-0.67934674 -0.44494775 -0.76743394 -0.14639002 -0.05045014 0.268618
-0.5745582 -0.0427962 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[3.087]
[0.743]
[2.326]], shape=(3, 1), dtype=float32)
X = tf.Tensor(
[[ 0.6130298 -1.7106786 4.2995334 2.9747813 -0.859934 0.0307362
1.7350441 -0.3026771 ]
[ 2.1392124 1.8491895 0.88371885 0.11082522 -0.5313949 -0.3833385
1.0089018 -1.4271526 ]
[ 1.200269 -0.998705 1.1007434 -0.15711978 0.43597025 0.17005198
-1.1976345 1.2715893 ]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[1.741 ]
[5.00001]
[3.356 ]], shape=(3, 1), dtype=float32)
The function ends with prefetch(1): while the model is busy training on one batch, the dataset is already preparing the next one in parallel (reading the data from disk and preprocessing it). This can improve performance significantly.
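A minimal sketch of the same idea with an automatically tuned buffer (shown on a toy dataset as an alternative to the fixed prefetch(1) above, not something csv_reader_dataset() uses):
dataset = tf.data.Dataset.range(10).batch(3)
# Keep some batches ready ahead of time; AUTOTUNE lets TensorFlow pick how many.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
for batch in dataset:
    print(batch.numpy())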
We can now use this function to build the datasets and train a model:
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
model = keras.models.Sequential([
keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))
batch_size = 32
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10, validation_data=valid_set)
Output:
Epoch 1/10
362/362 [==============================] - 1s 3ms/step - loss: 1.6426 - val_loss: 1.1271
Epoch 2/10
362/362 [==============================] - 1s 2ms/step - loss: 0.8001 - val_loss: 0.7965
Epoch 3/10
362/362 [==============================] - 1s 2ms/step - loss: 0.7323 - val_loss: 0.7688
Epoch 4/10
362/362 [==============================] - 1s 2ms/step - loss: 0.6579 - val_loss: 0.7544
Epoch 5/10
362/362 [==============================] - 1s 2ms/step - loss: 0.6361 - val_loss: 0.6197
Epoch 6/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5925 - val_loss: 0.6548
Epoch 7/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5661 - val_loss: 0.5478
Epoch 8/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5527 - val_loss: 0.5200
Epoch 9/10
362/362 [==============================] - 1s 2ms/step - loss: 0.5298 - val_loss: 0.5015
Epoch 10/10
362/362 [==============================] - 1s 2ms/step - loss: 0.4920 - val_loss: 0.5561
model.evaluate(test_set, steps=len(X_test) // batch_size)
Output:
161/161 [==============================] - 0s 1ms/step - loss: 0.4933
0.49326828122138977
new_set = test_set.map(lambda X, y: X) # we could instead just pass test_set, Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new) // batch_size)
Output:
array([[1.9393051 ],
[0.83419716],
[1.8839278 ],
...,
[1.8963358 ],
[3.858193 ],
[1.2957773 ]], dtype=float32)
A custom training loop:
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error
n_epochs = 5
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0
for X_batch, y_batch in train_set.take(total_steps):
global_step += 1
print("\rGlobal step {}/{}".format(global_step, total_steps), end="")
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Output:
Global step 1810/1810
We can also create a TF Function that performs the whole training loop:
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error
@tf.function
def train(model, n_epochs, batch_size=32,
n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
n_parse_threads=n_parse_threads, batch_size=batch_size)
for X_batch, y_batch in train_set:
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train(model, 5)
Here is a variant of the same TF Function that also displays progress every 100 steps using tf.print():
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error
@tf.function
def train(model, n_epochs, batch_size=32,
n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
n_parse_threads=n_parse_threads, batch_size=batch_size)
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0
for X_batch, y_batch in train_set.take(total_steps):
global_step += 1
if tf.equal(global_step % 100, 0):
tf.print("\rGlobal step", global_step, "/", total_steps)
with tf.GradientTape() as tape:
y_pred = model(X_batch)
main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
loss = tf.add_n([main_loss] + model.losses)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train(model, 5)
A short overview of the Dataset class methods:
for m in dir(tf.data.Dataset):
if not (m.startswith("_") or m.endswith("_")):
func = getattr(tf.data.Dataset, m)
if hasattr(func, "__doc__"):
print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))
Output:
● apply() Applies a transformation function to this dataset.
● as_numpy_iterator() Returns an iterator which converts all elements of the dataset to numpy.
● batch() Combines consecutive elements of this dataset into batches.
● cache() Caches the elements in this dataset.
● concatenate() Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec() The type specification of an element of this dataset.
● enumerate() Enumerates the elements of this dataset.
● filter() Filters this dataset according to `predicate`.
● flat_map() Maps `map_func` across this dataset and flattens the result.
● from_generator() Creates a `Dataset` whose elements are generated by `generator`.
● from_tensor_slices() Creates a `Dataset` whose elements are slices of the given tensors.
● from_tensors() Creates a `Dataset` with a single element, comprising the given tensors.
● interleave() Maps `map_func` across this dataset, and interleaves the results.
● list_files() A dataset of all files matching one or more glob patterns.
● map() Maps `map_func` across the elements of this dataset.
● options() Returns the options for this dataset and its inputs.
● padded_batch() Combines consecutive elements of this dataset into padded batches.
● prefetch() Creates a `Dataset` that prefetches elements from this dataset.
● range() Creates a `Dataset` of a step-separated range of values.
● reduce() Reduces the input dataset to a single element.
● repeat() Repeats this dataset so each original value is seen `count` times.
● shard() Creates a `Dataset` that includes only 1/`num_shards` of this dataset.
● shuffle() Randomly shuffles the elements of this dataset.
● skip() Creates a `Dataset` that skips `count` elements from this dataset.
● take() Creates a `Dataset` with at most `count` elements from this dataset.
● unbatch() Splits elements of a dataset into multiple elements.
● window() Combines (nests of) input elements into a dataset of (nests of) windows.
● with_options() Returns a new `tf.data.Dataset` with the given options set.
● zip() Creates a `Dataset` by zipping together the given datasets.
The TFRecord format is TensorFlow's preferred format for storing and efficiently reading large amounts of data. It is a very simple binary format that contains a sequence of binary records of varying sizes.
Each record is composed of: a length, a CRC checksum of the length, the data itself, and a CRC checksum of the data.
You can create a TFRecord file with the tf.io.TFRecordWriter class:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
f.write(b"This is the first record")
f.write(b"And this is the second record")
You can then read the file with a tf.data.TFRecordDataset:
filepaths = ["my_data.tfrecord"]  # one or more files can be passed
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
print(item)
Output:
tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
By default a TFRecordDataset reads the files one by one, but setting num_parallel_reads makes it read several files in parallel and interleave their records, achieving the same effect as combining list_files() and interleave() as shown earlier:
filepaths = ["my_test_{}.tfrecord".format(i) for i in range(5)]
for i, filepath in enumerate(filepaths):
with tf.io.TFRecordWriter(filepath) as f:
for j in range(3):
f.write("File {} record {}".format(i, j).encode("utf-8"))
dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)
for item in dataset:
print(item)
Output:
tf.Tensor(b'File 0 record 0', shape=(), dtype=string)
tf.Tensor(b'File 1 record 0', shape=(), dtype=string)
tf.Tensor(b'File 2 record 0', shape=(), dtype=string)
tf.Tensor(b'File 0 record 1', shape=(), dtype=string)
tf.Tensor(b'File 1 record 1', shape=(), dtype=string)
tf.Tensor(b'File 2 record 1', shape=(), dtype=string)
tf.Tensor(b'File 0 record 2', shape=(), dtype=string)
tf.Tensor(b'File 1 record 2', shape=(), dtype=string)
tf.Tensor(b'File 2 record 2', shape=(), dtype=string)
tf.Tensor(b'File 3 record 0', shape=(), dtype=string)
tf.Tensor(b'File 4 record 0', shape=(), dtype=string)
tf.Tensor(b'File 3 record 1', shape=(), dtype=string)
tf.Tensor(b'File 4 record 1', shape=(), dtype=string)
tf.Tensor(b'File 3 record 2', shape=(), dtype=string)
tf.Tensor(b'File 4 record 2', shape=(), dtype=string)
It can be useful to compress TFRecord files, for example when they must be loaded over a network; this is done via the options argument:
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
f.write(b"This is the first record")
f.write(b"And this is the second record")
When reading a compressed TFRecord file, the compression type must be specified:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
for item in dataset:
print(item)
Output:
tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
TFRecord files usually contain serialized protocol buffers (also called protobufs). Protocol buffers are a portable, extensible and efficient binary format developed by Google in 2001 and open-sourced in 2008.
%%writefile person.proto
syntax = "proto3";
message Person {
string name = 1;
int32 id = 2;
repeated string email = 3;
}
Output:
Overwriting person.proto
This definition says that we use version 3 of the protobuf format, and that each Person object may have a name of type string, an id of type int32, and zero or more email fields of type string; the numbers 1, 2 and 3 are the field identifiers.
The .proto file can be compiled with protoc, the protobuf compiler, to generate access classes usable in Python.
The compilation can be run on the command line as follows:
!protoc-3.12.3-win64\bin\protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports
!dir person*
The compilation produces person_pb2.py, from which the Person class can be imported:
from person_pb2 import Person
person = Person(name="Al", id=123, email=["[email protected]"]) # create a Person
print(person) # display the Person
Output:
name: "Al"
id: 123
email: "[email protected]"
person.name # read a field
Output:
'Al'
person.name = "Alice" # modify a field
person.name
Output:
'Alice'
person.email[0] # repeated fields can be accessed like arrays
Output:
'[email protected]'
person.email.append("[email protected]") # add an email address
person.email
Output:
['[email protected]', '[email protected]']
s = person.SerializeToString() # serialize to a byte string
s
Output:
b'\n\x05Alice\x10{\x1a\[email protected]\x1a\[email protected]'
person2 = Person() # create a new Person
person2.ParseFromString(s) # parse the byte string (27 bytes)
Output:
27
person == person2 # now they are equal
Output:
True
Decoding a custom protobuf with TensorFlow (using the compiled descriptor file):
person_tf = tf.io.decode_proto(bytes=s, message_type="Person", field_names=["name", "id", "email"],
output_types=[tf.string,tf.int32,tf.string],descriptor_source="person.desc")
person_tf.values
Output:
[<a string tensor containing [b'Alice']>,
 <an int32 tensor containing [123]>,
 <a string tensor containing [b'[email protected]', b'[email protected]']>]
The protobuf typically used in TensorFlow's TFRecord files is the Example protobuf, whose definition is:
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
oneof kind {
BytesList bytes_list = 1;
FloatList float_list = 2;
Int64List int64_list = 3;
}
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example
person_example = Example(features=Features(feature={"name": Feature(bytes_list=BytesList(value=[b"Alice"])),
"id": Feature(int64_list=Int64List(value=[123])),
"emails": Feature(bytes_list=BytesList(value=[b"[email protected]", b"[email protected]"]))
}))
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
f.write(person_example.SerializeToString())
The serialized Example protobufs can be loaded with tf.data.TFRecordDataset and parsed with tf.io.parse_single_example():
feature_description = {"name": tf.io.FixedLenFeature([], tf.string, default_value=""),
"id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
"emails": tf.io.VarLenFeature(tf.string)}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
parsed_example = tf.io.parse_single_example(serialized_example, feature_description)
parsed_example
Output:
{'emails': <a SparseTensor containing [b'[email protected]', b'[email protected]']>,
 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'Alice'>}
Fixed-length features are parsed as regular (dense) tensors, while variable-length features are parsed as sparse tensors; a sparse tensor can be converted to a dense tensor with tf.sparse.to_dense():
parsed_example["emails"].values[0]
Output:
<tf.Tensor: shape=(), dtype=string, numpy=b'[email protected]'>
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")
Output:
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'[email protected]', b'[email protected]'], dtype=object)>
parsed_example["emails"].values
Output:
<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'[email protected]', b'[email protected]'], dtype=object)>
A BytesList can contain any binary data. For example, you can encode a JPEG image with tf.io.encode_jpeg() and store the resulting binary data in a BytesList:
from sklearn.datasets import load_sample_images
img = load_sample_images()["images"][0]
plt.imshow(img)
plt.axis("off")
plt.title("Original Image")
plt.show()
Output: (the original image is displayed)
data = tf.io.encode_jpeg(img)  # encode the image as JPEG bytes
example_with_image = Example(features=Features(feature={
"image": Feature(bytes_list=BytesList(value=[data.numpy()]))}))
serialized_example = example_with_image.SerializeToString()
# then save to TFRecord
When reading the image back from the TFRecord, it must be decoded with tf.io.decode_jpeg(), or with tf.io.decode_image() which can handle BMP, GIF, JPEG and PNG images:
feature_description = { "image": tf.io.VarLenFeature(tf.string) }
example_with_image = tf.io.parse_single_example(serialized_example, feature_description)  # parse the serialized Example
decoded_img = tf.io.decode_jpeg(example_with_image["image"].values[0])  # decode back to the original image
#decoded_img = tf.io.decode_image(example_with_image["image"].values[0])  # or use decode_image()
plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()
Output: (the decoded image is displayed)
You can also serialize arbitrary tensors with tf.io.serialize_tensor() and store the resulting byte strings in a BytesList:
t = tf.constant([[0., 1.], [2., 3.], [4., 5.]])
s = tf.io.serialize_tensor(t)
s
Output:
<a scalar string tensor holding the serialized bytes of t>
tf.io.parse_tensor(s, out_type=tf.float32)
Output:
<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[0., 1.],
       [2., 3.],
       [4., 5.]], dtype=float32)>
serialized_sparse = tf.io.serialize_sparse(parsed_example["emails"])
serialized_sparse
Output:
<a string tensor of shape (3,) holding the three serialized components of the sparse tensor>
BytesList(value=serialized_sparse.numpy())
Output:
value: "\010\t\022\010\022\002\010\002\022\002\010\001\"\020\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000"
value: "\010\007\022\004\022\002\010\002\"\020\007\[email protected]@d.com"
value: "\010\t\022\004\022\002\010\001\"\010\002\000\000\000\000\000\000\000"
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
parsed_examples = tf.io.parse_example(serialized_examples, feature_description)
parsed_examples
Output:
{'image': <SparseTensor of dtype string>}
The Example protobuf covers most use cases, but it is a bit cumbersome for data that contains lists of lists; for that, TensorFlow provides the SequenceExample protobuf:
syntax = "proto3";
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
Features context = 1;
FeatureLists feature_lists = 2;
};
A SequenceExample contains a Features object for the contextual data and a FeatureLists object that holds one or more named FeatureList objects; each FeatureList contains a list of Feature objects, and each Feature may be a list of byte strings, of 64-bit integers, or of floats.
Building a SequenceExample is similar to building an Example:
FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample
context = Features(feature={"author_id": Feature(int64_list=Int64List(value=[123])),
"title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
"pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25]))
})
content = [["When", "shall", "we", "three", "meet", "again", "?"],
["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
["When", "the", "battle", "'s", "lost", "and", "won", "."]]
def words_to_feature(words):
return Feature(bytes_list=BytesList(value=[word.encode("utf-8") for word in words]))
content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
sequence_example = SequenceExample(context=context,
feature_lists=FeatureLists(feature_list={
"content": FeatureList(feature=content_features),
"comments": FeatureList(feature=comments_features)
}))
sequence_example
Output:
context {
feature {
key: "author_id"
value {
int64_list {
value: 123
}
}
}
feature {
key: "pub_date"
value {
int64_list {
value: 1623
value: 12
value: 25
}
}
}
feature {
key: "title"
value {
bytes_list {
value: "A"
value: "desert"
value: "place"
value: "."
}
}
}
}
feature_lists {
feature_list {
key: "comments"
value {
feature {
bytes_list {
value: "When"
value: "the"
value: "hurlyburly"
value: "\'s"
value: "done"
value: "."
}
}
feature {
bytes_list {
value: "When"
value: "the"
value: "battle"
value: "\'s"
value: "lost"
value: "and"
value: "won"
value: "."
}
}
}
}
feature_list {
key: "content"
value {
feature {
bytes_list {
value: "When"
value: "shall"
value: "we"
value: "three"
value: "meet"
value: "again"
value: "?"
}
}
feature {
bytes_list {
value: "In"
value: "thunder"
value: ","
value: "lightning"
value: ","
value: "or"
value: "in"
value: "rain"
value: "?"
}
}
}
}
}
serialized_sequence_example = sequence_example.SerializeToString()
context_feature_descriptions = {"author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
"title": tf.io.VarLenFeature(tf.string),
"pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
}
sequence_feature_descriptions = {"content": tf.io.VarLenFeature(tf.string),
"comments": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
serialized_sequence_example, context_feature_descriptions, sequence_feature_descriptions)
parsed_context
Output:
{'title': <a SparseTensor containing [b'A', b'desert', b'place', b'.']>,
 'author_id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'pub_date': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1623, 12, 25], dtype=int64)>}
parsed_context["title"].values
Output:
<tf.Tensor: shape=(4,), dtype=string, numpy=array([b'A', b'desert', b'place', b'.'], dtype=object)>
parsed_feature_lists
Output:
{'comments': <a SparseTensor containing the tokenized comment sentences>,
 'content': <a SparseTensor containing the tokenized content sentences>}
print(tf.RaggedTensor.from_sparse(parsed_feature_lists["content"]))
Output:
<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'], [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>
Before data can be fed to a neural network, all of its features must be converted to numerical values, and usually normalized as well.
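As a minimal sketch of a custom preprocessing layer (this Standardization class is illustrative, not a built-in Keras layer; it reuses np, keras and X_train from earlier):
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        # Compute the statistics once, on a (large enough) sample of the data.
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        # Standardize each feature; the epsilon avoids division by zero.
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

std_layer = Standardization()
std_layer.adapt(X_train)  # then use std_layer as the first layer of a model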
In the California housing dataset, ocean_proximity is a categorical feature with five possible values: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY and ISLAND; one option is to encode it as a one-hot vector.
Note: one-hot encoding is usually used when there are fewer than 10 categories and embeddings when there are more than about 50; for 10 to 50 categories it is worth trying both and keeping whichever works better.
Word embeddings are one of the most common techniques in natural language processing; pretrained embedding vectors are generally used. A sketch of both encodings follows.
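A minimal sketch of both encodings for ocean_proximity (the sample batch, the number of oov buckets and the embedding dimension are just illustrative):
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 1  # extra bucket for categories missing from the vocabulary
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)  # [3, 5, 1, 1]: "DESERT" falls into the oov bucket
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)

# Embedding alternative: a trainable matrix with one row per category
# (randomly initialized here; for words, pretrained vectors would normally be used).
embedding_dim = 2
embedding_matrix = tf.Variable(tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim]))
cat_embeddings = tf.nn.embedding_lookup(embedding_matrix, cat_indices)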
import tarfile
import urllib
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
os.makedirs(housing_path, exist_ok=True)
tgz_path = os.path.join(housing_path, "housing.tgz")
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
fetch_housing_data()
HOUSING_PATH = os.path.join("datasets", "housing")
def load_housing_data(housing_path=HOUSING_PATH):
csv_path = os.path.join(housing_path, "housing.csv")
return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
Output:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
TensorFlow also provides the Feature Columns API to describe how each feature should be handled; for example, a simple numeric column:
housing_median_age = tf.feature_column.numeric_column("housing_median_age")
housing_median_age
Output:
NumericColumn(key='housing_median_age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
TF Transform is part of TensorFlow Extended (TFX), an end-to-end platform for productionizing TensorFlow models.
TF Transform must be installed separately; it is not bundled with TensorFlow:
try:
import tensorflow_transform as tft
def preprocess(inputs): # inputs is a batch of input features
median_age = inputs["housing_median_age"]
ocean_proximity = inputs["ocean_proximity"]
standardized_age = tft.scale_to_z_score(median_age - tft.mean(median_age))
ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
return {"standardized_median_age": standardized_age,
"ocean_proximity_id": ocean_proximity_id }
except ImportError:
print("TF Transform is not installed. Try running: pip3 install -U tensorflow-transform")
TF Transform can then apply this preprocess() function to the whole training set using Apache Beam; during that run it computes the required statistics over the full training set, such as the mean and standard deviation in this example. The components that compute these statistics are called analyzers.
More importantly, TF Transform also generates an equivalent TF Function that you can plug into the model you deploy.
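A rough sketch of what this looks like with Apache Beam (the tiny in-memory dataset and feature spec below are only for illustration, and the exact tensorflow_transform / Apache Beam API may differ between versions):
import tempfile
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [{"housing_median_age": 41.0, "ocean_proximity": "NEAR BAY"},
            {"housing_median_age": 21.0, "ocean_proximity": "INLAND"}]
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        "housing_median_age": tf.io.FixedLenFeature([], tf.float32),
        "ocean_proximity": tf.io.FixedLenFeature([], tf.string),
    }))

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # The analyzers (mean, vocabulary, ...) run over the whole dataset here;
    # transform_fn is the resulting reusable TensorFlow transformation.
    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocess))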
With the Data API, TFRecords, the Keras preprocessing layers and TF Transform, you can build highly scalable input pipelines for training, and benefit from fast and portable data preprocessing in production.
TFDS makes it very easy to download common datasets, from small ones like MNIST or Fashion MNIST up to huge ones like ImageNet.
The supported datasets include image datasets, text datasets (including translation datasets), and audio and video datasets; see the full list at https://www.tensorflow.org/datasets/datasets
TFDS must be installed separately; it is not bundled with TensorFlow. Use tfds.load() to download and load a dataset:
import tensorflow_datasets as tfds
print(tfds.list_builders())
Output:
['abstract_reasoning', 'aeslc', 'aflw2k3d', 'amazon_us_reviews', 'anli', 'arc', 'bair_robot_pushing_small', 'beans', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'blimp', 'c4', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'cfq', 'chexpert', 'cifar10', 'cifar100', 'cifar10_1', 'cifar10_corrupted', 'citrus_leaves', 'cityscapes', 'civil_comments', 'clevr', 'cmaterdb', 'cnn_dailymail', 'coco', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'common_voice', 'cos_e', 'cosmos_qa', 'covid19sum', 'crema_d', 'curated_breast_imaging_ddsm', 'cycle_gan', 'deep_weeds', 'definite_pronoun_resolution', 'dementiabank', 'diabetic_retinopathy_detection', 'div2k', 'dmlab', 'downsampled_imagenet', 'dsprites', 'dtd', 'duke_ultrasound', 'emnist', 'eraser_multi_rc', 'esnli', 'eurosat', 'fashion_mnist', 'flic', 'flores', 'food101', 'forest_fires', 'gap', 'geirhos_conflict_stimuli', 'german_credit_numeric', 'gigaword', 'glue', 'groove', 'higgs', 'horses_or_humans', 'i_naturalist2017', 'imagenet2012', 'imagenet2012_corrupted', 'imagenet2012_subset', 'imagenet_resized', 'imagenette', 'imagewang', 'imdb_reviews', 'irc_disentanglement', 'iris', 'kitti', 'kmnist', 'lfw', 'librispeech', 'librispeech_lm', 'libritts', 'ljspeech', 'lm1b', 'lost_and_found', 'lsun', 'malaria', 'math_dataset', 'mctaco', 'mnist', 'mnist_corrupted', 'movie_rationales', 'moving_mnist', 'multi_news', 'multi_nli', 'multi_nli_mismatch', 'natural_questions', 'newsroom', 'nsynth', 'nyu_depth_v2', 'omniglot', 'open_images_challenge2019_detection', 'open_images_v4', 'openbookqa', 'opinion_abstracts', 'opinosis', 'opus', 'oxford_flowers102', 'oxford_iiit_pet', 'para_crawl', 'patch_camelyon', 'pet_finder', 'pg19', 'places365_small', 'plant_leaves', 'plant_village', 'plantae_k', 'qa4mre', 'quickdraw_bitmap', 'reddit', 'reddit_disentanglement', 'reddit_tifu', 'resisc45', 'robonet', 'rock_paper_scissors', 'rock_you', 'samsum', 'savee', 'scan', 'scene_parse150', 'scicite', 'scientific_papers', 'shapes3d', 'smallnorb', 'snli', 'so2sat', 'speech_commands', 'squad', 'stanford_dogs', 'stanford_online_products', 'starcraft_video', 'stl10', 'sun397', 'super_glue', 'svhn_cropped', 'ted_hrlr_translate', 'ted_multi_translate', 'tedlium', 'tf_flowers', 'the300w_lp', 'tiny_shakespeare', 'titanic', 'trivia_qa', 'uc_merced', 'ucf101', 'vgg_face2', 'visual_domain_decathlon', 'voc', 'voxceleb', 'voxforge', 'waymo_open_dataset', 'web_questions', 'wider_face', 'wiki40b', 'wikihow', 'wikipedia', 'winogrande', 'wmt14_translate', 'wmt15_translate', 'wmt16_translate', 'wmt17_translate', 'wmt18_translate', 'wmt19_translate', 'wmt_t2t_translate', 'wmt_translate', 'wordnet', 'xnli', 'xsum', 'yelp_polarity_reviews']
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
plt.figure(figsize=(6,3))
mnist_train = mnist_train.repeat(5).batch(32).prefetch(1)
for item in mnist_train:
images = item["image"]
labels = item["label"]
for index in range(5):
plt.subplot(1, 5, index + 1)
image = images[index, ..., 0]
label = labels[index].numpy()
plt.imshow(image, cmap="binary")
plt.title(label)
plt.axis("off")
break # just showing part of the first batch
Output: (displays the first five images of the batch with their labels)
Each item in the dataset is a dictionary containing the image and its label; for Keras it is more convenient to map it to an (image, label) tuple:
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
mnist_train = mnist_train.repeat(5).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
print(images.shape)
print(labels.numpy())
Output:
(32, 28, 28, 1)
[4 1 0 7 8 1 2 7 1 6 6 4 7 7 3 3 7 9 9 1 0 6 6 9 9 4 8 9 4 7 3 3]
You can also pass as_supervised=True to load(), so that it returns (features, label) pairs; the batch size can also be specified:
datasets = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = datasets["train"].repeat().prefetch(1)
model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28, 1]),
keras.layers.Lambda(lambda images: tf.cast(images, tf.float32)),
keras.layers.Dense(10, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(lr=1e-3),
metrics=["accuracy"])
model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)
Epoch 1/5
1875/1875 [==============================] - 5s 3ms/step - loss: 32.2213 - accuracy: 0.8415
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 26.2503 - accuracy: 0.8679
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.9240 - accuracy: 0.8737
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.6774 - accuracy: 0.8753
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 24.2004 - accuracy: 0.8763
TensorFlow Hub provides reusable pretrained model components (modules); for example, a pretrained sentence-embedding module can be used directly as a Keras layer:
import tensorflow_hub as hub
#hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
# output_shape=[50], input_shape=[], dtype=tf.string)
hub_layer = hub.KerasLayer("https://hub.tensorflow.google.cn/google/tf2-preview/nnlm-en-dim50/1",
output_shape=[50], input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
Output:
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 50) 48190600
_________________________________________________________________
dense_1 (Dense) (None, 16) 816
_________________________________________________________________
dense_2 (Dense) (None, 1) 17
=================================================================
Total params: 48,191,433
Trainable params: 833
Non-trainable params: 48,190,600
_________________________________________________________________
sentences = tf.constant(["It was a great movie", "The actors were amazing"])
embeddings = hub_layer(sentences)
embeddings
Output: a float32 tensor of shape (2, 50), containing one 50-dimensional embedding per sentence.