Estimator wraps up model construction, training, evaluation, prediction, and saving, and separates data input from the model itself. The input pipeline is written as a separate function.
Load the libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import tensorflow as tf
from tensorflow import keras
#import keras
print(tf.__version__)
print(sys.version_info)
for module in mpl, np, pd, sklearn, tf, keras:
    print(module.__name__, module.__version__)
Data loading and preprocessing
# Data source:
# https://storage.googleapis.com/tf-datasets/titanic/train.csv
# https://storage.googleapis.com/tf-datasets/titanic/eval.csv
train_file = "./data/titanic/train.csv"
eval_file = "./data/titanic/eval.csv"
train_df = pd.read_csv(train_file)
eval_df = pd.read_csv(eval_file)
print(train_df.head())
print(eval_df.head())
#'survived' is the label to predict, so it must be removed from the features.
y_train = train_df.pop('survived')
y_eval = eval_df.pop('survived')
#DataFrame.pop(column) removes the named column from the DataFrame and returns it as a Series (unlike list.pop, which works by position)
print(train_df.head())
print(eval_df.head())
print(y_train.head())
print(y_eval.head())
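The behavior of pop on a DataFrame can be checked on a toy frame (the small frame below is illustrative, not the real data):

```python
import pandas as pd

# Toy frame standing in for the Titanic data
df = pd.DataFrame({'survived': [0, 1, 1], 'age': [22.0, 38.0, 26.0]})
label = df.pop('survived')  # removes the column and returns it as a Series

print(df.columns.tolist())  # ['age'] -- 'survived' is no longer a feature
print(label.tolist())       # [0, 1, 1]
```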
#Look at the summary statistics
train_df.describe() #pandas makes this convenient
#Using tf.feature_column (categorical and continuous features)
categorical_columns = ['sex','n_siblings_spouses','parch','class','deck','embark_town','alone']
numeric_columns = ['age','fare']
feature_columns = []
#Process each categorical feature
for categorical_column in categorical_columns:
    vocab = train_df[categorical_column].unique() #all possible values in this column
    print(categorical_column, vocab)
    feature_columns.append( #3. append the encoded categorical feature to feature_columns
        tf.feature_column.indicator_column( #2. one-hot encode the categorical feature
            tf.feature_column.categorical_column_with_vocabulary_list( #1. define the feature_column for the categorical feature
                categorical_column, vocab)))
#Process the continuous features
for numeric_column in numeric_columns:
    feature_columns.append(
        tf.feature_column.numeric_column( #continuous features just need numeric_column
            numeric_column, dtype=tf.float32))
#Define a function that builds a dataset
def make_dataset(data_df, label_df, epochs=10, shuffle=True, batch_size=32): #(x, y, 10 passes, shuffled, 32 per batch)
    #tf.data.Dataset.from_tensor_slices slices the input tensors along their first dimension to build the dataset
    dataset = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df)) #build the dataset
    if shuffle:
        dataset = dataset.shuffle(10000)
    dataset = dataset.repeat(epochs).batch(batch_size)
    return dataset
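What from_tensor_slices does to a dict of columns can be seen on toy data (the values are made up): each dataset element is one row's feature dict paired with its label.

```python
import numpy as np
import tensorflow as tf

features = {'age': np.array([22.0, 38.0, 26.0]),
            'fare': np.array([7.25, 71.28, 7.92])}
labels = np.array([0, 1, 1])

# Same slice -> repeat -> batch pattern as make_dataset, without shuffling
ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.repeat(2).batch(2)

x, y = next(iter(ds))
print(x['age'].numpy(), y.numpy())  # first batch: ages [22. 38.], labels [0 1]
```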
#Inspect dict(train_df)
dict(train_df)
train_dataset = make_dataset(train_df,y_train,batch_size = 5)
for x, y in train_dataset.take(1): #take only one batch
    print(x, y)
#keras.layers.DenseFeatures
for x, y in train_dataset.take(1):
    age_column = feature_columns[7]
    gender_column = feature_columns[0]
    print(keras.layers.DenseFeatures(age_column)(x).numpy())
    print(keras.layers.DenseFeatures(gender_column)(x).numpy())
#keras.layers.DenseFeatures applied to all feature columns at once
for x, y in train_dataset.take(1):
    print(keras.layers.DenseFeatures(feature_columns)(x).numpy())
model = keras.models.Sequential([
keras.layers.DenseFeatures(feature_columns),
keras.layers.Dense(100,activation='relu'),
keras.layers.Dense(100,activation='relu'),
keras.layers.Dense(2,activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.SGD(lr=0.01),
              metrics=['accuracy'])
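sparse_categorical_crossentropy is the right loss here because y_train holds integer class ids (0/1) rather than one-hot vectors. A minimal illustration with made-up predictions:

```python
import numpy as np
from tensorflow import keras

y_true = np.array([0, 1])                    # integer labels, like the survived column
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])  # softmax outputs of the final 2-unit layer

# Per-example loss is -log(probability assigned to the true class)
loss = keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
print(loss.numpy())  # approximately [-log(0.9), -log(0.8)]
```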
# 1. model.fit
# 2.model -> estimator -> train
train_dataset = make_dataset(train_df, y_train, epochs=100)
eval_dataset = make_dataset(eval_df,y_eval,epochs=1,shuffle=False)
history = model.fit(train_dataset,
validation_data= eval_dataset,
steps_per_epoch=20,
validation_steps=8,
epochs = 100)
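Since matplotlib is already imported, the returned history can be plotted to inspect training. A sketch (the metric values below are fabricated so the snippet stands alone; in the notebook you would pass history.history):

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_learning_curves(history_dict):
    # history.history is a dict like {'loss': [...], 'accuracy': [...], ...}
    pd.DataFrame(history_dict).plot(figsize=(8, 5))
    plt.grid(True)
    plt.gca().set_ylim(0, 1)

# usage in the notebook: plot_learning_curves(history.history)
plot_learning_curves({'loss': [0.9, 0.7, 0.6], 'accuracy': [0.55, 0.65, 0.72]})
```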
estimator = keras.estimator.model_to_estimator(model)
#input_fn requirements: 1. it is a function 2. it returns either (features, labels) or a dataset yielding (feature, label)
estimator.train(input_fn = lambda : make_dataset(
train_df,y_train,epochs = 100))
'''linear'''
linear_output_dir = 'linear_model'
if not os.path.exists(linear_output_dir):
os.mkdir(linear_output_dir)
linear_estimator = tf.estimator.LinearClassifier(
model_dir = linear_output_dir,
n_classes=2,
feature_columns=feature_columns)
linear_estimator.train(input_fn= lambda : make_dataset(train_df,y_train,epochs=100))
linear_estimator.evaluate(input_fn= lambda : make_dataset(
eval_df,y_eval,epochs = 1,shuffle = False))
'''dnn'''
dnn_output_dir = './dnn_model'
if not os.path.exists(dnn_output_dir):
os.mkdir(dnn_output_dir)
dnn_estimator = tf.estimator.DNNClassifier(
model_dir=dnn_output_dir,
n_classes=2,
feature_columns=feature_columns,
hidden_units=[128,128],
activation_fn=tf.nn.relu,
optimizer='Adam')
dnn_estimator.train(input_fn = lambda : make_dataset(train_df,y_train,epochs = 100))
dnn_estimator.evaluate(input_fn=lambda:make_dataset(eval_df,y_eval,epochs=1,shuffle=False))
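The summary that follows mentions crossed features, which are not built in the code above. A hedged sketch of one way to add them, crossing bucketized age with sex (the bucket boundaries, hash size, and toy data are all assumptions):

```python
import pandas as pd
import tensorflow as tf
from tensorflow import keras

# Toy rows standing in for the Titanic columns
df = pd.DataFrame({'age': [22.0, 38.0, 26.0, 35.0],
                   'sex': ['male', 'female', 'female', 'male']})

age = tf.feature_column.numeric_column('age')
# Bucketize age so it becomes categorical and can be crossed
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[25, 30, 40])
sex = tf.feature_column.categorical_column_with_vocabulary_list('sex', ['male', 'female'])

# Cross the two categorical columns into a hashed feature space
crossed = tf.feature_column.crossed_column([age_buckets, sex], hash_bucket_size=16)
cross_feature = tf.feature_column.indicator_column(crossed)

dense = keras.layers.DenseFeatures([cross_feature])(dict(df))
print(dense.numpy().shape)  # (4, 16): one hash bucket hot per row
```

Such a crossed column would be appended to feature_columns before the estimators are constructed.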
Summary: with crossed features, the linear model improves, but the DNN actually gets worse.
This shows that a feature-engineering technique affects different models differently: one must consider not only how features complement each other, but also how well they fit the model.
API changes
How to upgrade TF 1.0 code to TF 2.0
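As a quick preview of the API changes, the biggest one: TF 1.x builds a graph and runs it in a session, while TF 2.0 executes eagerly and uses tf.function to trace graphs when needed. A minimal side-by-side sketch (the 1.x style is written through the compat.v1 module so it still runs under TF 2):

```python
import tensorflow as tf

# TF 1.x style: explicit graph + session (via tf.compat.v1 under TF 2)
g = tf.Graph()
with g.as_default():
    c = tf.constant(2.0) * tf.constant(3.0)
with tf.compat.v1.Session(graph=g) as sess:
    print(sess.run(c))  # 6.0

# TF 2.0 style: eager by default; tf.function traces a graph when needed
@tf.function
def multiply(a, b):
    return a * b

print(multiply(tf.constant(2.0), tf.constant(3.0)).numpy())  # 6.0
```

TensorFlow also ships a command-line script, tf_upgrade_v2, that rewrites most 1.x API calls to their 2.0 equivalents automatically.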