TensorFlow 2.x: Using the tf.data.Dataset API to Read a CSV File / DataFrame as a Keras Input Pipeline

Table of Contents

  • 0. Reference
  • 1. Prepare the Data
  • 2. Build the Input Pipeline and Split the Dataset
  • 3. Construct the Features
  • 4. Create and Compile the Keras Model
  • 5. Train the Model
  • 6. Complete Code for the Custom-Features Input Pipeline
  • 7. Simplified Input Pipeline: All Features as Continuous Numeric Values

This experiment was carried out with tensorflow-gpu 2.2. The dataset, provided by the Cleveland Clinic Foundation for Heart Disease, is a binary-classification dataset describing whether a patient has heart disease; it contains 13 features and 303 records.

This article builds the input pipeline with tf.data.Dataset.from_tensor_slices in two ways: in one the input features are a dict, in the other they are an array. The dict form lets the preprocessing of each feature be customized, while the array form is simpler to implement. A minimal sketch of the difference is shown below; both approaches are worked through in the following sections.
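A minimal sketch (using made-up toy data rather than the heart dataset, so the column names here are purely illustrative) of how the two styles differ:

import pandas as pd
import tensorflow as tf

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
toy_labels = pd.Series([0, 1, 0])

# Dict style: each element is ({'a': ..., 'b': ...}, label); pairs naturally with feature columns.
dict_ds = tf.data.Dataset.from_tensor_slices((dict(toy), toy_labels))

# Array style: each element is a single numeric row; all columns share one dtype.
array_ds = tf.data.Dataset.from_tensor_slices((toy.values, toy_labels.values))

for row_features, row_label in dict_ds.take(1):
    print(row_features)   # a dict of scalar tensors keyed by column name
for row_values, row_label in array_ds.take(1):
    print(row_values)     # a single tensor of shape (2,)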


0. Reference

1. Classify structured data with feature columns
2. Load a pandas.DataFrame


1. Prepare the Data

import pandas as pd

import tensorflow as tf
import os

# Note: under tensorflow-gpu 2.0 this script raised errors when run on the GPU.
# '0' selects GPU 0; set the variable to '-1' to hide all GPUs and force CPU execution.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

Read the data from the web with pandas:

df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')

The first five rows (as returned by df.head()):

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 1 145 233 1 2 150 0 2.3 3 0 fixed 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3 normal 1
2 67 1 4 120 229 0 2 129 1 2.6 2 2 reversible 0
3 37 1 3 130 250 0 0 187 0 3.5 3 0 normal 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0 normal 0
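Before declaring features it is worth confirming how pandas parsed each column. A quick check (the exact dtype listing depends on the pandas version, but 'thal' should come back as object/string and 'oldpeak' as float64):

print(df.shape)    # expected: (303, 14) -- 13 features plus the target column
print(df.dtypes)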

Meaning and type of each column:

Column    Description                                               Feature Type    Data Type
Age       Age in years                                              Numerical       integer
Sex       (1 = male; 0 = female)                                    Categorical     integer
CP        Chest pain type (0, 1, 2, 3, 4)                           Categorical     integer
Trestbpd  Resting blood pressure (on admission, in mm Hg)           Numerical       integer
Chol      Serum cholesterol in mg/dl                                Numerical       integer
FBS       (Fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)   Categorical     integer
RestECG   Resting electrocardiographic results (0, 1, 2)            Categorical     integer
Thalach   Maximum heart rate achieved                               Numerical       integer
Exang     Exercise-induced angina (1 = yes; 0 = no)                 Categorical     integer
Oldpeak   ST depression induced by exercise relative to rest        Numerical       float
Slope     Slope of the peak exercise ST segment                     Numerical       integer
CA        Number of major vessels (0-3) colored by fluoroscopy      Numerical       integer
Thal      3 = normal; 6 = fixed defect; 7 = reversible defect       Categorical     string
Target    Diagnosis of heart disease (1 = true; 0 = false)          Classification  integer

2. Build the Input Pipeline and Split the Dataset

Convert the DataFrame into a Dataset input pipeline:

labels = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))

Split into training and test sets: batch_size is 32, and the first two batches' worth of rows (64 rows) are held out as the test set. Note that take() uses the rows in file order here, so only the training portion is shuffled.

batch_size = 32
test_size = batch_size * 2

test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)
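To sanity-check the dict-style pipeline, one batch can be pulled and inspected; a small sketch using the names defined above (the exact values depend on the shuffle seed):

for feature_batch, label_batch in train_dataset.take(1):
    print('Feature keys :', list(feature_batch.keys()))
    print('Ages in batch:', feature_batch['age'].numpy())
    print('Targets      :', label_batch.numpy())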

3. Construct the Features

With this input pipeline, the type of each column should be declared explicitly, e.g. numeric, categorical, or bucketized; crossed features can also be defined so that the model can learn feature combinations separately. This section declares continuous numeric features, bucketizes a continuous feature, converts a string-valued categorical feature to one-hot and embedding representations, and generates a crossed feature.

def dataset_features():
    features = []

    # Continuous numeric features
    for name in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
        features.append(tf.feature_column.numeric_column(name))

    # Bucketize a continuous feature into discrete ranges
    age_numeric = tf.feature_column.numeric_column('age')
    age_bucket = tf.feature_column.bucketized_column(
        age_numeric, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    features.append(age_bucket)

    # Convert a string-valued categorical feature to one-hot
    thal_category = tf.feature_column.categorical_column_with_vocabulary_list(
        'thal', ['fixed', 'normal', 'reversible'])
    thal_onehot = tf.feature_column.indicator_column(thal_category)
    features.append(thal_onehot)

    # Convert the same categorical feature to an embedding
    thal_embedding = tf.feature_column.embedding_column(thal_category, dimension=8)
    features.append(thal_embedding)

    # Cross multiple features so the model can learn feature combinations separately
    cross_feature = tf.feature_column.crossed_column(
        [age_bucket, thal_category], hash_bucket_size=1000)
    cross_feature = tf.feature_column.indicator_column(cross_feature)
    features.append(cross_feature)

    return features

features = dataset_features()
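Feature columns are only declarative specifications; the tf.keras.layers.DenseFeatures layer is what applies them to a batch of raw inputs. As a quick sketch (reusing train_dataset from section 2), the width of the transformed feature vector can be inspected before building the model:

example_features = next(iter(train_dataset))[0]      # dict of feature tensors for one batch
feature_layer = tf.keras.layers.DenseFeatures(features)
print(feature_layer(example_features).shape)         # (batch_size, total width after transformation)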

4. Create and Compile the Keras Model

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(features),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(1)])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])

5. Train the Model

model.fit(train_dataset, validation_data=test_dataset, epochs=30)

Training log:

C:\Users\merlin\.conda\envs\tfgpu22\python.exe C:/Users/merlin/Desktop/github/tensorflow2.0-tutorial/csv_classification.py
2020-05-18 23:58:19.957171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:23.292240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-18 23:58:24.029195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-05-18 23:58:24.029652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:24.034583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-18 23:58:24.038523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-18 23:58:24.039686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-18 23:58:24.043784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-18 23:58:24.047266: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-18 23:58:24.055371: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-18 23:58:24.056715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-18 23:58:24.057342: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-05-18 23:58:24.066817: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x211123c8070 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-18 23:58:24.067553: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-18 23:58:24.069027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-05-18 23:58:24.069700: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-18 23:58:24.070004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-18 23:58:24.070285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-18 23:58:24.070574: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-18 23:58:24.070860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-18 23:58:24.071310: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-18 23:58:24.071605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-18 23:58:24.072630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-18 23:58:25.974350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-18 23:58:25.974679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-05-18 23:58:25.974866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-05-18 23:58:25.975732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3031 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0)
2020-05-18 23:58:25.979506: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21135eea390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-18 23:58:25.979858: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
Epoch 1/30
WARNING:tensorflow:Layer dense_features is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

2020-05-18 23:58:27.499490: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
7/7 [==============================] - 0s 68ms/step - loss: 4.8452 - accuracy: 0.5312 - val_loss: 2.3686 - val_accuracy: 0.7656
Epoch 2/30
7/7 [==============================] - 0s 10ms/step - loss: 1.9891 - accuracy: 0.6562 - val_loss: 2.0280 - val_accuracy: 0.3125
Epoch 3/30
7/7 [==============================] - 0s 10ms/step - loss: 1.6943 - accuracy: 0.5759 - val_loss: 1.1732 - val_accuracy: 0.7656
Epoch 4/30
7/7 [==============================] - 0s 13ms/step - loss: 1.4274 - accuracy: 0.7366 - val_loss: 0.8180 - val_accuracy: 0.7188
Epoch 5/30
7/7 [==============================] - 0s 11ms/step - loss: 1.3783 - accuracy: 0.6339 - val_loss: 0.7217 - val_accuracy: 0.7188
Epoch 6/30
7/7 [==============================] - 0s 10ms/step - loss: 1.1956 - accuracy: 0.7054 - val_loss: 0.6776 - val_accuracy: 0.7812
Epoch 7/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0995 - accuracy: 0.6741 - val_loss: 0.6698 - val_accuracy: 0.6719
Epoch 8/30
7/7 [==============================] - 0s 9ms/step - loss: 1.3149 - accuracy: 0.6830 - val_loss: 0.5413 - val_accuracy: 0.7812
Epoch 9/30
7/7 [==============================] - 0s 10ms/step - loss: 1.1399 - accuracy: 0.6607 - val_loss: 0.5280 - val_accuracy: 0.7812
Epoch 10/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0334 - accuracy: 0.6920 - val_loss: 0.5209 - val_accuracy: 0.7812
Epoch 11/30
7/7 [==============================] - 0s 9ms/step - loss: 0.9756 - accuracy: 0.6562 - val_loss: 0.5183 - val_accuracy: 0.7812
Epoch 12/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0135 - accuracy: 0.7232 - val_loss: 0.5064 - val_accuracy: 0.7969
Epoch 13/30
7/7 [==============================] - 0s 10ms/step - loss: 1.0651 - accuracy: 0.6250 - val_loss: 0.5276 - val_accuracy: 0.7812
Epoch 14/30
7/7 [==============================] - 0s 9ms/step - loss: 1.0107 - accuracy: 0.7143 - val_loss: 0.4978 - val_accuracy: 0.7812
Epoch 15/30
7/7 [==============================] - 0s 11ms/step - loss: 0.8518 - accuracy: 0.6875 - val_loss: 0.5155 - val_accuracy: 0.7656
Epoch 16/30
7/7 [==============================] - 0s 10ms/step - loss: 0.8474 - accuracy: 0.7411 - val_loss: 0.5013 - val_accuracy: 0.7812
Epoch 17/30
7/7 [==============================] - 0s 9ms/step - loss: 0.8005 - accuracy: 0.7009 - val_loss: 0.5474 - val_accuracy: 0.7656
Epoch 18/30
7/7 [==============================] - 0s 10ms/step - loss: 0.8096 - accuracy: 0.7098 - val_loss: 0.4522 - val_accuracy: 0.7969
Epoch 19/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6782 - accuracy: 0.7411 - val_loss: 0.4300 - val_accuracy: 0.7969
Epoch 20/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6980 - accuracy: 0.7232 - val_loss: 0.4448 - val_accuracy: 0.7969
Epoch 21/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7002 - accuracy: 0.7188 - val_loss: 0.4895 - val_accuracy: 0.7969
Epoch 22/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7963 - accuracy: 0.6786 - val_loss: 0.4534 - val_accuracy: 0.7812
Epoch 23/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6811 - accuracy: 0.7054 - val_loss: 0.4219 - val_accuracy: 0.7969
Epoch 24/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6743 - accuracy: 0.7098 - val_loss: 0.4368 - val_accuracy: 0.8125
Epoch 25/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6222 - accuracy: 0.7500 - val_loss: 0.4901 - val_accuracy: 0.7656
Epoch 26/30
7/7 [==============================] - 0s 9ms/step - loss: 0.6851 - accuracy: 0.7411 - val_loss: 0.4557 - val_accuracy: 0.7812
Epoch 27/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6646 - accuracy: 0.7009 - val_loss: 0.4757 - val_accuracy: 0.7656
Epoch 28/30
7/7 [==============================] - 0s 10ms/step - loss: 0.7006 - accuracy: 0.7188 - val_loss: 0.4310 - val_accuracy: 0.8125
Epoch 29/30
7/7 [==============================] - 0s 9ms/step - loss: 0.5747 - accuracy: 0.7634 - val_loss: 0.4166 - val_accuracy: 0.8125
Epoch 30/30
7/7 [==============================] - 0s 10ms/step - loss: 0.6378 - accuracy: 0.7366 - val_loss: 0.4180 - val_accuracy: 0.7969

Process finished with exit code 0
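Since the final Dense layer has no activation and the loss uses from_logits=True, the model outputs logits rather than probabilities. A sketch of evaluating and predicting after training (reusing model and test_dataset from above; a sigmoid converts logits to probabilities):

# Evaluate on the held-out batches.
loss, accuracy = model.evaluate(test_dataset)
print('Test accuracy:', accuracy)

# Predict returns logits; convert them to probabilities with a sigmoid.
logits = model.predict(test_dataset)
probabilities = tf.sigmoid(logits)
print(probabilities.numpy().flatten()[:5])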


6. Complete Code for the Custom-Features Input Pipeline

import pandas as pd

import tensorflow as tf
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')
df.to_csv('heart.csv')  # save a local copy of the downloaded data

labels = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((dict(df), labels))

batch_size = 32
test_size = batch_size * 2
test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)


def dataset_features():
    features = []

    # Continuous numeric features
    for name in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
        features.append(tf.feature_column.numeric_column(name))

    # Bucketize a continuous feature into discrete ranges
    age_numeric = tf.feature_column.numeric_column('age')
    age_bucket = tf.feature_column.bucketized_column(
        age_numeric, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    features.append(age_bucket)

    # Convert a string-valued categorical feature to one-hot
    thal_category = tf.feature_column.categorical_column_with_vocabulary_list(
        'thal', ['fixed', 'normal', 'reversible'])
    thal_onehot = tf.feature_column.indicator_column(thal_category)
    features.append(thal_onehot)

    # Convert the same categorical feature to an embedding
    thal_embedding = tf.feature_column.embedding_column(thal_category, dimension=8)
    features.append(thal_embedding)

    # Cross multiple features so the model can learn feature combinations separately
    cross_feature = tf.feature_column.crossed_column(
        [age_bucket, thal_category], hash_bucket_size=1000)
    cross_feature = tf.feature_column.indicator_column(cross_feature)
    features.append(cross_feature)

    return features


features = dataset_features()
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(features),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(1)])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])

model.fit(train_dataset, validation_data=test_dataset, epochs=30)

7. Simplified Input Pipeline: All Features as Continuous Numeric Values

import pandas as pd

import tensorflow as tf
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

df = pd.read_csv('https://storage.googleapis.com/applied-dl/heart.csv')
df.to_csv('heart.csv')  # save a local copy of the downloaded data

labels = df.pop('target')

# Encode the string column 'thal' as integer category codes
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
# All input features are then treated as continuous numeric values
dataset = tf.data.Dataset.from_tensor_slices((df.values, labels.values))

batch_size = 32
test_size = batch_size * 2
test_dataset = dataset.take(test_size).batch(batch_size)
train_dataset = dataset.skip(test_size).shuffle(200, seed=7).batch(batch_size, drop_remainder=True)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(rate=0.1),
    tf.keras.layers.Dense(1)])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'])

model.fit(train_dataset, validation_data=test_dataset, epochs=30)
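One optional refinement that is not part of the original code: because the array-style pipeline feeds raw values straight into the Dense layers, standardizing the columns first often helps training. A hedged sketch (for brevity the statistics are computed over the whole DataFrame; strictly they should come from the training split only):

# Standardize each column before building the Dataset ('thal' is already integer codes here).
normed_df = (df - df.mean()) / df.std()
dataset = tf.data.Dataset.from_tensor_slices((normed_df.values, labels.values))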
