This series contains my study notes on the official TensorFlow tutorials and documentation, extracting the key points together with my own understanding for later review.
tf.data
Loading CSV data
Use tf.data.experimental.make_csv_dataset to read a CSV file into a dataset object. Several important parameters:
batch_size: the number of records in a single batch;
column_names: the column names of the data; if not given, they are taken from the first row of the file by default;
label_name: the name of the column to use as the label;
na_value: additional strings that should also be treated as NaN;
num_epochs: the number of times the dataset is repeated.
For example:
import tensorflow as tf

TRAIN_DATA_PATH = "E:/Notes/Projects/tensorflow_to_pro/eat_tensorflow2_in_30_days/data/titanic/train.csv"
TEST_DATA_PATH = "E:/Notes/Projects/tensorflow_to_pro/eat_tensorflow2_in_30_days/data/titanic/test.csv"

def get_dataset(file_path):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=12,
        label_name='Survived',
        na_value='?',
        num_epochs=1,
        ignore_errors=True)
    return dataset
raw_train_data = get_dataset(TRAIN_DATA_PATH)
raw_test_data = get_dataset(TEST_DATA_PATH)
Each item in the dataset is one batch, represented as a tuple (many examples, many labels). The data in the examples are organized as column-based tensors (rather than row-based tensors); each tensor contains as many elements as the batch size (12 in this example).
# The dataset can be iterated over
examples, labels = next(iter(raw_train_data))
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
# Official example (loading NumPy data)
import numpy as np

DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'
path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
    train_examples = data['x_train']
    train_labels = data['y_train']
    test_examples = data['x_test']
    test_labels = data['y_test']
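As a quick sanity check (MNIST contains 60,000 training and 10,000 test images, each 28x28 pixels), the array shapes can be printed:

# Verify the loaded arrays have the expected MNIST shapes
print(train_examples.shape)  # (60000, 28, 28)
print(train_labels.shape)    # (60000,)
print(test_examples.shape)   # (10000, 28, 28)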
With the feature array and label array loaded, pass the two arrays as a tuple to tf.data.Dataset.from_tensor_slices to build the dataset objects:
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
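Before training, the datasets are usually shuffled and batched. A minimal sketch; the buffer and batch sizes here are assumptions to be tuned for your data:

# Shuffle the training data and group both datasets into batches
SHUFFLE_BUFFER_SIZE = 100
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)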
A pandas DataFrame can be loaded with the same tf.data.Dataset.from_tensor_slices API:
import pandas as pd
df = pd.read_csv("E:/Notes/Projects/tensorflow_to_pro/eat_tensorflow2_in_30_days/data/titanic/train.csv")
target = df.pop('Survived')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
Running this in practice raises an error:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
Reason: strings in pandas are stored with dtype object, which TensorFlow's API cannot convert to a tensor. String columns in the DataFrame should therefore be converted first.
First, check the dtypes of all columns with df.dtypes:
PassengerId int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Then process the object columns:
# For convenience of the demonstration, drop the Name column first
df.pop('Name')
# Convert the remaining object columns to discrete numeric codes
df['Sex'] = pd.Categorical(df['Sex'])
df['Sex'] = df.Sex.cat.codes
df['Ticket'] = pd.Categorical(df['Ticket'])
df['Ticket'] = df.Ticket.cat.codes
df['Cabin'] = pd.Categorical(df['Cabin'])
df['Cabin'] = df.Cabin.cat.codes
df['Embarked'] = pd.Categorical(df['Embarked'])
df['Embarked'] = df.Embarked.cat.codes
Once every column is numeric, the DataFrame can be converted to a dataset object without error.
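Re-running the conversion now succeeds. A minimal sketch that also inspects the first (features, label) pair with take:

dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
# Peek at the first record to confirm the conversion worked
for features, label in dataset.take(1):
    print('Features: {}, Label: {}'.format(features, label))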
Loading image data
First collect the paths of all image files (all_img_paths) under the data directory:
import tensorflow as tf
import pathlib
import random

data_orig = "E:\\Notes\\Projects\\tensorflow_to_pro\\eat_tensorflow2_in_30_days\\data\\cifar2\\train"
data_root = pathlib.Path(data_orig)

# Collect all files under the class subfolders
all_image_paths = list(data_root.glob("*/*"))
# Convert the WindowsPath objects to strings
all_img_paths = [str(path) for path in all_image_paths]
random.shuffle(all_img_paths)

def label_image(image_path):
    # The class label is the name of the folder containing the image
    img_rel = pathlib.Path(image_path).relative_to(data_root)
    return str(img_rel).split("\\")[0]

# Build the list of labels (class folder names), aligned with the shuffled path list
all_image_labels = [label_image(path) for path in all_img_paths]
# Wrap the preprocessing in a function
def load_and_preprocess_image(path):
    image = tf.io.read_file(path)
    img_tensor = tf.image.decode_jpeg(image)
    img_final = tf.image.resize(img_tensor, [28, 28])
    img_final = img_final / 255.0
    return img_final
# Build the image dataset
# First create a dataset of paths
path_ds = tf.data.Dataset.from_tensor_slices(all_img_paths)
# Then build a new dataset by mapping load_and_preprocess_image over the path dataset
image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
image_ds is the resulting dataset object.
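For training you usually also need a label dataset paired with image_ds. A minimal sketch, assuming the folder names serve as class labels as above; label_to_index is a helper mapping introduced here for illustration:

# Map each class folder name to an integer index
label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_to_index = dict((name, index) for index, name in enumerate(label_names))
all_image_label_indices = [label_to_index[label] for label in all_image_labels]

# Build a label dataset and zip it with the image dataset
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_label_indices, tf.int64))
image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))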