Tensorflow 2.0 教程持续更新 :https://blog.csdn.net/qq_31456593/article/details/88606284
完整tensorflow2.0教程代码请看tensorflow2.0:中文教程tensorflow2_tutorials_chinese(欢迎star)
入门教程:
TensorFlow 2.0 教程- Keras 快速入门
TensorFlow 2.0 教程-keras 函数api
TensorFlow 2.0 教程-使用keras训练模型
TensorFlow 2.0 教程-用keras构建自己的网络层
TensorFlow 2.0 教程-keras模型保存和序列化
本教程展示了如何对结构化数据进行分类(例如CSV中的表格数据)。我们使用Keras定义模型,并将csv中各列的特征转化为训练的输入。 本教程包含一下功能代码:
from __future__ import absolute_import, division, print_function
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
print(tf.__version__)
2.0.0-alpha0
我们将使用克利夫兰诊所心脏病基金会提供的一个小数据集。 CSV中有几百行。 每行描述一个患者,每列描述一个属性。 我们将使用此信息来预测患者是否患有心脏病,该疾病在该数据集中是二元分类任务。
Column Description Feature Type Data Type Age Age in years Numerical integer Sex (1 = male; 0 = female) Categorical integer CP Chest pain type (0, 1, 2, 3, 4) Categorical integer Trestbpd Resting blood pressure (in mm Hg on admission to the hospital) Numerical integer Chol Serum cholestoral in mg/dl Numerical integer FBS (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) Categorical integer RestECG Resting electrocardiographic results (0, 1, 2) Categorical integer Thalach Maximum heart rate achieved Numerical integer Exang Exercise induced angina (1 = yes; 0 = no) Categorical integer Oldpeak ST depression induced by exercise relative to rest Numerical integer Slope The slope of the peak exercise ST segment Numerical float CA Number of major vessels (0-3) colored by flourosopy Numerical integer Thal 3 = normal; 6 = fixed defect; 7 = reversable defect Categorical string Target Diagnosis of heart disease (1 = true; 0 = false) Classification integer
使用pandas读取数据
URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
dataframe.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |
划分训练集验证集和测试集
train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
193 train examples
49 validation examples
61 test examples
使用tf.data构造输入pipeline
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
dataframe = dataframe.copy()
labels = dataframe.pop('target')
ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
if shuffle:
ds = ds.shuffle(buffer_size=len(dataframe))
ds = ds.batch(batch_size)
return ds
batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
for feature_batch, label_batch in train_ds.take(1):
print('Every feature:', list(feature_batch.keys()))
print('A batch of ages:', feature_batch['age'])
print('A batch of targets:', label_batch )
Every feature: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
A batch of ages: tf.Tensor([61 51 57 51 44], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([0 0 0 1 0], shape=(5,), dtype=int32)
example_batch = next(iter(train_ds))[0]
def demo(feature_column):
feature_layer = layers.DenseFeatures(feature_column)
print(feature_layer(example_batch).numpy())
特征列的输出成为模型的输入。 数字列是最简单的列类型。 它用于表示真正有价值的特征。 使用此列时,模型将从数据框中接收未更改的列值。
age = feature_column.numeric_column("age")
demo(age)
[[61.]
[51.]
[57.]
[51.]
[44.]]
通常,您不希望将数字直接输入模型,而是根据数值范围将其值分成不同的类别。 考虑代表一个人年龄的原始数据。 我们可以使用bucketized列将年龄分成几个桶,而不是将年龄表示为数字列。 请注意,下面的one-hot描述了每行匹配的年龄范围。
age_buckets = feature_column.bucketized_column(age, boundaries=[
18, 25, 30, 35, 40, 50
])
demo(age_buckets)
[[0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 1. 0.]]
在该数据集中,thal表示为字符串(例如“固定”,“正常”或“可逆”)。 我们无法直接将字符串提供给模型。 相反,我们必须首先将它们映射到数值。 类别列提供了一种将字符串表示为单热矢量的方法(就像上面用年龄段看到的那样)。 类别表可以使用categorical_column_with_vocabulary_list作为列表传递,或者使用categorical_column_with_vocabulary_file从文件加载。
thal = feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)
[[0. 0. 1.]
[0. 1. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]]
假设我们不是只有几个可能的字符串,而是每个类别有数千(或更多)值。 由于多种原因,随着类别数量的增加,使用单热编码训练神经网络变得不可行。 我们可以使用嵌入列来克服此限制。 嵌入列不是将数据表示为多维度的单热矢量,而是将数据表示为低维密集向量,其中每个单元格可以包含任意数字,而不仅仅是0或1.嵌入的大小是必须训练调整的参数。
注:当分类列具有许多可能的值时,最好使用嵌入列。
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)
[[ 0.21029451 0.28502795 0.27186757 -0.13927 0.44176006 0.18506278
-0.14189719 0.2901029 ]
[-0.02674027 -0.21359333 -0.26675928 0.6544374 0.12530805 -0.5243998
-0.23030454 -0.10796055]
[ 0.21029451 0.28502795 0.27186757 -0.13927 0.44176006 0.18506278
-0.14189719 0.2901029 ]
[ 0.21029451 0.28502795 0.27186757 -0.13927 0.44176006 0.18506278
-0.14189719 0.2901029 ]
[-0.02674027 -0.21359333 -0.26675928 0.6544374 0.12530805 -0.5243998
-0.23030454 -0.10796055]]
表示具有大量值的分类列的另一种方法是使用categorical_column_with_hash_bucket。 此功能列计算输入的哈希值,然后选择一个hash_bucket_size存储桶来编码字符串。 使用此列时,您不需要提供词汇表,并且可以选择使hash_buckets的数量远远小于实际类别的数量以节省空间。
注:该技术的一个重要缺点是可能存在冲突,其中不同的字符串被映射到同一个桶。
thal_hashed = feature_column.categorical_column_with_hash_bucket('thal', hash_bucket_size=1000)
demo(feature_column.indicator_column(thal_hashed))
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
将特征组合成单个特征(更好地称为特征交叉),使模型能够为每个特征组合学习单独的权重。 在这里,我们将创建一个与age和thal交叉的新功能。 请注意,crossed_column不会构建所有可能组合的完整表(可能非常大)。 相反,它由hashed_column支持,因此您可以选择表的大小。
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
feature_columns = []
# numeric cols
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
feature_columns.append(feature_column.numeric_column(header))
# bucketized cols
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)
# indicator cols
thal = feature_column.categorical_column_with_vocabulary_list(
'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)
# embedding cols
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)
# crossed cols
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)
构建特征层
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
model = tf.keras.Sequential([
feature_layer,
layers.Dense(128, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds,epochs=5)
Epoch 1/5
7/7 [==============================] - 1s 133ms/step - loss: 1.1864 - accuracy: 0.6357 - val_loss: 0.6905 - val_accuracy: 0.5714
Epoch 2/5
7/7 [==============================] - 0s 24ms/step - loss: 0.9603 - accuracy: 0.6804 - val_loss: 0.4047 - val_accuracy: 0.8163
Epoch 3/5
7/7 [==============================] - 0s 24ms/step - loss: 0.5744 - accuracy: 0.7389 - val_loss: 0.6673 - val_accuracy: 0.7755
Epoch 4/5
7/7 [==============================] - 0s 24ms/step - loss: 0.4890 - accuracy: 0.8092 - val_loss: 0.6298 - val_accuracy: 0.6122
Epoch 5/5
7/7 [==============================] - 0s 24ms/step - loss: 0.5618 - accuracy: 0.6795 - val_loss: 0.3861 - val_accuracy: 0.8367
测试
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
2/2 [==============================] - 0s 16ms/step - loss: 0.8278 - accuracy: 0.6066
Accuracy 0.60655737