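This experiment was run with tensorflow-gpu 2.2. The dataset is provided by the Cleveland Clinic Foundation for Heart Disease. It is a binary classification dataset that records whether a patient has heart disease, with 13 features and 303 rows.
The post builds the input pipeline with tf.data.Dataset.from_tensor_slices in two ways: one passes the input features as a dictionary, the other as an array. The dictionary form lets each feature be processed in its own way; the array form is simpler to implement.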
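To make the difference between the two input styles concrete, here is a minimal pure-Python sketch that mimics slicing a table along its first axis, one element per row. It does not call TensorFlow; `slice_rows_as_dict` and `slice_rows_as_array` are hypothetical helper names used only for illustration, and the three rows are copied from the sample data shown later in this post.

```python
# Pure-Python stand-in for slicing a table row by row (not TensorFlow code).

def slice_rows_as_dict(columns, labels):
    """Yield (feature_dict, label) pairs, one per row; column names are kept."""
    names = list(columns)
    for i, label in enumerate(labels):
        yield {name: columns[name][i] for name in names}, label


def slice_rows_as_array(columns, labels):
    """Yield (feature_list, label) pairs, one per row; column names are lost."""
    names = list(columns)
    for i, label in enumerate(labels):
        yield [columns[name][i] for name in names], label


columns = {
    "age": [63, 67, 67],
    "chol": [233, 286, 229],
    "thal": ["fixed", "normal", "reversible"],
}
labels = [0, 1, 0]

dict_rows = list(slice_rows_as_dict(columns, labels))
array_rows = list(slice_rows_as_array(columns, labels))
print(dict_rows[0])   # features addressed by name
print(array_rows[0])  # features addressed by position
```

In the array form every value has to share one dtype, which is why the array-based listing at the end of the post first encodes `thal` as integer codes.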
1. Classify structured data with feature columns
2. Load a pandas.DataFrame
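import os

# With tensorflow-gpu 2.0, running on the GPU raised an error.
# CUDA_VISIBLE_DEVICES lists the GPU indices that are visible to the process:
# '0' exposes GPU 0. On a single-GPU machine, '1' or '-1' leaves no GPU visible,
# so TensorFlow falls back to the CPU.
# Set the variable before importing tensorflow so that it is sure to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import pandas as pd
import tensorflow as tf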
Read the data from the web with pandas
df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')
First 5 rows of sample data
index | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |
Meaning and type of each column
Column | Description | Feature type | Data type |
---|---|---|---|
Age | Age in years | Numerical | integer |
Sex | (1 = male; 0 = female) | Categorical | integer |
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer |
Trestbps | Resting blood pressure (on admission, in mm Hg) | Numerical | integer |
Chol | Serum cholesterol (mg/dl) | Numerical | integer |
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical | integer |
RestECG | Resting electrocardiogram result (0, 1, 2) | Categorical | integer |
Thalach | Maximum heart rate achieved | Numerical | integer |
Exang | Exercise-induced angina (1 = yes; 0 = no) | Categorical | integer |
Oldpeak | ST depression induced by exercise relative to rest | Numerical | float |
Slope | Slope of the peak exercise ST segment | Numerical | integer |
CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer |
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string |
Target | Heart disease diagnosis (1 = true; 0 = false) | Classification | integer |
Convert the DataFrame to a Dataset input pipeline
labels = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
Split the data into training and test sets. batch_size is 32, and the first two batches' worth of rows (64 rows) are used as the test set.
batch_size = 32
test_size = batch_size * 2
test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)
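The split above can be checked with a little arithmetic. The sketch below is plain Python and assumes the 303 rows stated in the introduction; the second argument of `batch` is `drop_remainder`, so an incomplete final training batch is discarded.

```python
# Batch arithmetic for the take/skip split (plain Python, no TensorFlow needed).
total_rows = 303            # dataset size stated in the introduction
batch_size = 32
test_size = batch_size * 2  # rows taken for the test set

train_rows = total_rows - test_size        # rows left after skip(test_size)
test_batches = test_size // batch_size     # take(64).batch(32)
train_batches = train_rows // batch_size   # batch(32, drop_remainder=True)
dropped_rows = train_rows % batch_size     # rows in the discarded partial batch

print(train_rows, test_batches, train_batches, dropped_rows)  # 239 2 7 15
```

The 7 training batches agree with the `7/7` step counter in the training output further down.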
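A Dataset input pipeline needs the type of every column declared explicitly, for example numeric, categorical, or discrete. Crossed features can also be defined so that the model learns feature combinations separately. This section declares continuous numeric features, bucketizes a continuous feature, converts a string-valued categorical feature to one-hot and embedding columns, and builds a crossed feature.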
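As a rough illustration of what bucketizing and one-hot encoding do to a single value, here is a pure-Python sketch. It is not TensorFlow's implementation; `bucket_index` and `one_hot` are hypothetical helpers, and the boundaries and vocabulary are copied from the feature definitions in this post. It assumes each bucket includes its lower boundary and excludes its upper one.

```python
from bisect import bisect_right

# Boundaries and vocabulary copied from the feature definitions in this post.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
THAL_VOCAB = ["fixed", "normal", "reversible"]


def bucket_index(value, boundaries):
    """Index of the bucket containing value; a bucket includes its lower bound."""
    return bisect_right(boundaries, value)


def one_hot(index, depth):
    """Vector of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(depth)]


age = 63
age_bucket = bucket_index(age, AGE_BOUNDARIES)             # 63 falls in [60, 65)
age_vector = one_hot(age_bucket, len(AGE_BOUNDARIES) + 1)  # 10 boundaries, 11 buckets
thal_vector = one_hot(THAL_VOCAB.index("normal"), len(THAL_VOCAB))

print(age_bucket)   # 9
print(age_vector)   # 1 at position 9, zeros elsewhere
print(thal_vector)  # [0, 1, 0]
```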
def dataset_features():
    features = []
    # Continuous numeric features
    for name in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
        features.append(tf.feature_column.numeric_column(name))
    # Continuous feature converted to discrete buckets
    age_numeric = tf.feature_column.numeric_column('age')
    age_bucket = tf.feature_column.bucketized_column(
        age_numeric, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    features.append(age_bucket)
    # Categorical feature (stored as strings) converted to one-hot
    thal_category = tf.feature_column.categorical_column_with_vocabulary_list(
        'thal', ['fixed', 'normal', 'reversible'])
    thal_onehot = tf.feature_column.indicator_column(thal_category)
    features.append(thal_onehot)
    # Categorical feature (stored as strings) converted to an embedding
    thal_embedding = tf.feature_column.embedding_column(thal_category, dimension=8)
    features.append(thal_embedding)
    # Crossed feature, so the model can learn the combination separately
    cross_feature = tf.feature_column.crossed_column(
        [age_bucket, thal_category], hash_bucket_size=1000)
    cross_feature = tf.feature_column.indicator_column(cross_feature)
    features.append(cross_feature)
    return features
features = dataset_features()
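The idea behind the crossed feature and the size of the resulting input vector can be sketched in plain Python. This is illustrative only: `crossed_index` is a hypothetical helper, and md5 is used as a stand-in hash, so the slot numbers will not match what TensorFlow computes internally. The widths are derived from the arguments passed in `dataset_features()`.

```python
import hashlib

HASH_BUCKET_SIZE = 1000  # same value as hash_bucket_size in dataset_features()


def crossed_index(age_bucket, thal, num_buckets=HASH_BUCKET_SIZE):
    """Map an (age bucket, thal) pair to a slot in [0, num_buckets).

    Illustrative only: md5 stands in for TensorFlow's internal hash,
    so the slot numbers differ from what crossed_column produces.
    """
    key = f"{age_bucket}_X_{thal}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets


# Width contributed by each group of columns in dataset_features().
widths = {
    "numeric": 7,                    # seven numeric columns, one value each
    "age_bucket": 10 + 1,            # 10 boundaries give 11 buckets
    "thal_onehot": 3,                # vocabulary of three strings
    "thal_embedding": 8,             # dimension=8
    "age_x_thal": HASH_BUCKET_SIZE,  # one-hot over the hash buckets
}
total_width = sum(widths.values())

slot = crossed_index(9, "fixed")
print(slot)         # some slot between 0 and 999
print(total_width)  # 1029
```

With 11 age buckets and a three-word vocabulary there are at most 33 in-vocabulary combinations, so most of the 1000 slots stay empty; a smaller hash_bucket_size may be sufficient, although that is a judgment call rather than something the original experiment tested.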
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(features),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    # No activation: the model outputs a logit
    tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])
model.fit(train_dataset, validation_data=test_dataset, epochs=30)
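Because the final Dense(1) layer has no activation, the model outputs a raw logit and the loss is created with from_logits=True. The plain-Python sketch below shows, for a single example, how a logit relates to a probability and to the binary cross-entropy. The helper names are hypothetical, and the stable formula is the commonly documented form for sigmoid cross-entropy on logits, not code taken from Keras.

```python
import math


def sigmoid(logit):
    """Convert a raw logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed directly from the logit.

    Uses the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logit, label = 2.0, 1
prob = sigmoid(logit)
loss = bce_from_logit(logit, label)

# The same loss written in terms of the probability, for comparison.
naive = -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

print(round(prob, 4))   # 0.8808
print(round(loss, 4))   # 0.1269
print(round(naive, 4))  # 0.1269
```

When predicting with the trained model, applying a sigmoid to the output in the same way gives the estimated probability of the positive class.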
Training output
C:\Users\merlin\.conda\envs\tfgpu22\python.exe C:/Users/merlin/Desktop/github/tensorflow2.0-tutorial/csv_classification.py
2020-05-18 23:58:19.957171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:23.292240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-18 23:58:24.029195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-05-18 23:58:24.029652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:24.034583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-18 23:58:24.038523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-18 23:58:24.039686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-18 23:58:24.043784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-18 23:58:24.047266: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-18 23:58:24.055371: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-18 23:58:24.056715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-18 23:58:24.057342: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-05-18 23:58:24.066817: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x211123c8070 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-18 23:58:24.067553: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-05-18 23:58:24.069027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-05-18 23:58:24.069700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:24.070004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-18 23:58:24.070285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-18 23:58:24.070574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-18 23:58:24.070860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-18 23:58:24.071310: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-18 23:58:24.071605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-18 23:58:24.072630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-18 23:58:25.974350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-18 23:58:25.974679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-05-18 23:58:25.974866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-05-18 23:58:25.975732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3031 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
2020-05-18 23:58:25.979506: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21135eea390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-18 23:58:25.979858: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
Epoch 1/30
WARNING:tensorflow:Layer dense_features is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because it's dtype defaults to floatx.
If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.
To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
2020-05-18 23:58:27.499490: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
7/7 [==============================] - 0s 68ms/step - loss: 4.8452 - accuracy: 0.5312 - val_loss: 2.3686 - val_accuracy: 0.7656
Epoch 2/30
7/7 [==============================] - 0s 10ms/step - loss: 1.9891 - accuracy: 0.6562 - val_loss: 2.0280 - val_accuracy: 0.3125
Epoch 3/30
7/7 [==============================] - 0s 10ms/step - loss: 1.6943 - accuracy: 0.5759 - val_loss: 1.1732 - val_accuracy: 0.7656
Epoch 4/30
7/7 [==============================] - 0s 13ms/step - loss: 1.4274 - accuracy: 0.7366 - val_loss: 0.8180 - val_accuracy: 0.7188
Epoch 5/30
7/7 [==============================] - 0s 11ms/step - loss: 1.3783 - accuracy: 0.6339 - val_loss: 0.7217 - val_accuracy: 0.7188
Epoch 6/30
7/7 [==============================] - 0s 10ms/step - loss: 1.1956 - accuracy: 0.7054 - val_loss: 0.6776 - val_accuracy: 0.7812
Epoch 7/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0995 - accuracy: 0.6741 - val_loss: 0.6698 - val_accuracy: 0.6719
Epoch 8/30
7/7 [==============================] - 0s 9ms/step - loss: 1.3149 - accuracy: 0.6830 - val_loss: 0.5413 - val_accuracy: 0.7812
Epoch 9/30
7/7 [==============================] - 0s 10ms/step - loss: 1.1399 - accuracy: 0.6607 - val_loss: 0.5280 - val_accuracy: 0.7812
Epoch 10/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0334 - accuracy: 0.6920 - val_loss: 0.5209 - val_accuracy: 0.7812
Epoch 11/30
7/7 [==============================] - 0s 9ms/step - loss: 0.9756 - accuracy: 0.6562 - val_loss: 0.5183 - val_accuracy: 0.7812
Epoch 12/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0135 - accuracy: 0.7232 - val_loss: 0.5064 - val_accuracy: 0.7969
Epoch 13/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0651 - accuracy: 0.6250 - val_loss: 0.5276 - val_accuracy: 0.7812
Epoch 14/30
7/7 [==============================] - 0s 9ms/step - loss: 1.0107 - accuracy: 0.7143 - val_loss: 0.4978 - val_accuracy: 0.7812
Epoch 15/30
7/7 [==============================] - 0s 11ms/step - loss: 0.8518 - accuracy: 0.6875 - val_loss: 0.5155 - val_accuracy: 0.7656
Epoch 16/30
7/7 [==============================] - 0s 10ms/step - loss: 0.8474 - accuracy: 0.7411 - val_loss: 0.5013 - val_accuracy: 0.7812
Epoch 17/30
7/7 [==============================] - 0s 9ms/step - loss: 0.8005 - accuracy: 0.7009 - val_loss: 0.5474 - val_accuracy: 0.7656
Epoch 18/30
7/7 [==============================] - 0s 10ms/step - loss: 0.8096 - accuracy: 0.7098 - val_loss: 0.4522 - val_accuracy: 0.7969
Epoch 19/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6782 - accuracy: 0.7411 - val_loss: 0.4300 - val_accuracy: 0.7969
Epoch 20/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6980 - accuracy: 0.7232 - val_loss: 0.4448 - val_accuracy: 0.7969
Epoch 21/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7002 - accuracy: 0.7188 - val_loss: 0.4895 - val_accuracy: 0.7969
Epoch 22/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7963 - accuracy: 0.6786 - val_loss: 0.4534 - val_accuracy: 0.7812
Epoch 23/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6811 - accuracy: 0.7054 - val_loss: 0.4219 - val_accuracy: 0.7969
Epoch 24/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6743 - accuracy: 0.7098 - val_loss: 0.4368 - val_accuracy: 0.8125
Epoch 25/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6222 - accuracy: 0.7500 - val_loss: 0.4901 - val_accuracy: 0.7656
Epoch 26/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6851 - accuracy: 0.7411 - val_loss: 0.4557 - val_accuracy: 0.7812
Epoch 27/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6646 - accuracy: 0.7009 - val_loss: 0.4757 - val_accuracy: 0.7656
Epoch 28/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7006 - accuracy: 0.7188 - val_loss: 0.4310 - val_accuracy: 0.8125
Epoch 29/30
7/7 [==============================] - 0s 9ms/step - loss: 0.5747 - accuracy: 0.7634 - val_loss: 0.4166 - val_accuracy: 0.8125
Epoch 30/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6378 - accuracy: 0.7366 - val_loss: 0.4180 - val_accuracy: 0.7969
Process finished with exit code 0
Full code, dictionary-type input features
import os

# Set before importing tensorflow; '0' exposes GPU 0
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import pandas as pd
import tensorflow as tf

df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')
# Keep a local copy of the data
df.to_csv('heart.csv')
labels = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))
batch_size = 32
test_size = batch_size * 2
test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)

def dataset_features():
    features = []
    # Continuous numeric features
    for name in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
        features.append(tf.feature_column.numeric_column(name))
    # Continuous feature converted to discrete buckets
    age_numeric = tf.feature_column.numeric_column('age')
    age_bucket = tf.feature_column.bucketized_column(
        age_numeric, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    features.append(age_bucket)
    # Categorical feature (stored as strings) converted to one-hot
    thal_category = tf.feature_column.categorical_column_with_vocabulary_list(
        'thal', ['fixed', 'normal', 'reversible'])
    thal_onehot = tf.feature_column.indicator_column(thal_category)
    features.append(thal_onehot)
    # Categorical feature (stored as strings) converted to an embedding
    thal_embedding = tf.feature_column.embedding_column(thal_category, dimension=8)
    features.append(thal_embedding)
    # Crossed feature, so the model can learn the combination separately
    cross_feature = tf.feature_column.crossed_column(
        [age_bucket, thal_category], hash_bucket_size=1000)
    cross_feature = tf.feature_column.indicator_column(cross_feature)
    features.append(cross_feature)
    return features

features = dataset_features()
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(features),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])
model.fit(train_dataset, validation_data=test_dataset, epochs=30)
Full code, array-type input features
import os

# Set before importing tensorflow; '0' exposes GPU 0
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import pandas as pd
import tensorflow as tf

df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')
# Keep a local copy of the data
df.to_csv('heart.csv')
labels = df.pop('target')
# Encode the string column 'thal' as integer category codes
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
# All input features are treated as continuous numeric data
dataset = tf.data.Dataset.from_tensor_slices((df.values, labels.values))
batch_size = 32
test_size = batch_size * 2
test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])
model.fit(train_dataset, validation_data=test_dataset, epochs=30)
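In the array-based listing, `pd.Categorical` followed by `.cat.codes` replaces each string in `thal` with an integer. When no categories are given, pandas uses the sorted unique values as categories, and each code is the position of the value in that list. The plain-Python sketch below reproduces that mapping without pandas; it assumes the column holds only the three values seen in the sample rows, which may not be true of the full file.

```python
# Plain-Python illustration of integer category codes (no pandas required).
# Assumption: the column contains only the three values seen in the sample rows.
thal_values = ["fixed", "normal", "reversible", "normal", "normal"]

categories = sorted(set(thal_values))  # sorted unique values
code_of = {name: code for code, name in enumerate(categories)}
thal_codes = [code_of[value] for value in thal_values]

print(code_of)     # {'fixed': 0, 'normal': 1, 'reversible': 2}
print(thal_codes)  # [0, 1, 2, 1, 1]
```

The codes impose an arbitrary numeric order on the categories, and the dense network treats them as ordinary numbers. That is the simplification the array approach accepts in exchange for shorter code.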