TensorFlow Wide & Deep Learning Tutorial

In the previous TensorFlow Linear Model Tutorial, we trained a logistic regression model using the Census Income dataset.

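In this tutorial, we show how to use the tf.learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It is useful for generic large-scale regression and classification problems with sparse input features (for example, categorical features with a large number of possible feature values). If you would like to learn more about how wide & deep learning works, see the research paper.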
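As a rough illustration of what joint training combines, the sketch below adds the logit of a wide (linear) part to the logit of a deep part and passes the sum through a sigmoid, which is how the research paper describes the combined prediction. The weights, feature names, and helpers (`wide_logit`, `deep_logit`, `predict`) are made up for illustration and are not part of tf.learn.

```python
import math

def wide_logit(features, weights):
    """Linear part: sum the weights of the active sparse (base and crossed) features."""
    return sum(weights.get(name, 0.0) for name in features)

def deep_logit(dense_inputs, hidden_weights, output_weights):
    """Deep part: one ReLU hidden layer followed by a linear output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense_inputs)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

def predict(features, dense_inputs, params):
    """Add both logits and a bias, then squash the sum to a probability."""
    logit = (wide_logit(features, params["wide"])
             + deep_logit(dense_inputs, params["hidden"], params["out"])
             + params["bias"])
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical parameters; in practice both parts are learned together.
params = {
    "wide": {"education=Bachelors": 0.4,
             "education=Bachelors_x_occupation=Sales": 0.3},
    "hidden": [[0.5, -0.2], [0.1, 0.3]],
    "out": [0.6, -0.4],
    "bias": -0.5,
}
p = predict(["education=Bachelors", "education=Bachelors_x_occupation=Sales"],
            [1.0, 2.0], params)
print(p)
```

Because the two logits are summed before the loss is computed, the gradient of a single loss reaches both parts, which is what distinguishes joint training from training two models separately and averaging them.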



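  • 1、Select features for the wide part: choose the sparse base columns and crossed columns you want to use.
  • 2、Select features for the deep part: choose the continuous columns, the embedding dimension for each categorical column, and the hidden layer sizes.
  • 3、Put them together into a wide & deep model (DNNLinearCombinedClassifier).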




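  • Install TensorFlow.
  • Download the tutorial code.
  • Install the pandas data analysis library. tf.learn does not require pandas, but it does support it, and this tutorial uses pandas. To install pandas:

    • Install pip: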
    # Ubuntu/Linux 64-bit
    $ sudo apt-get install python-pip python-dev
    # Mac OS X
    $ sudo easy_install pip
    $ sudo easy_install --upgrade six
    • Install pandas with pip:
    $ sudo pip install pandas
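  • If you have trouble installing pandas, see the instructions.

  • Run the tutorial code with the following command to train the wide & deep model described in this tutorial: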
$ python wide_n_deep_tutorial.py --model_type=wide_n_deep


2、Define Base Feature Columns


import tensorflow as tf

# categorical base columns
gender = tf.contrib.layers.sparse_column_with_keys(
    column_name = "gender",
    keys = ["Female","Male"])
race = tf.contrib.layers.sparse_column_with_keys(
    column_name = "race",
    # the five race categories that occur in the Census Income data
    keys = ["Amer-Indian-Eskimo","Asian-Pac-Islander",
            "Black","Other","White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education",hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship",hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass",hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation",hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country",hash_bucket_size=1000)

# continuous base columns
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(
    age,
    boundaries = [18,25,30,35,40,45,50,55,60,65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
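
To make the two kinds of base columns more concrete, here is a plain-Python sketch of the ideas behind them: a hash-bucket column maps an arbitrary string to one of `hash_bucket_size` integer ids, and a bucketized column maps a number to the index of the interval it falls in. The helpers `hash_bucket` and `bucketize` are hypothetical, and the MD5-based hash is only a stand-in. TensorFlow uses its own hashing function, so the actual ids will differ.

```python
import bisect
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to an id in [0, hash_bucket_size) with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

def bucketize(value, boundaries):
    """Return the index of the bucket that contains value.

    With n boundaries there are n + 1 buckets, and bucket i covers
    the half-open interval [boundaries[i - 1], boundaries[i]).
    """
    return bisect.bisect_right(boundaries, value)

age_boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

occupation_id = hash_bucket("Exec-managerial", 1000)
print(occupation_id)                    # some id between 0 and 999
print(bucketize(17, age_boundaries))    # 0: younger than 18
print(bucketize(18, age_boundaries))    # 1: 18 <= age < 25
print(bucketize(70, age_boundaries))    # 10: 65 and older
```

Hashing needs no vocabulary up front, at the cost of occasional collisions between unrelated values; a larger `hash_bucket_size` makes collisions less likely.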

3、The Wide Model: Linear Model with Crossed Feature Columns


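wide_columns = [
    gender, native_country, education, occupation, workclass,
    relationship, age_buckets,
    # crossed columns go here; they are created with
    # tf.contrib.layers.crossed_column([col_a, col_b], hash_bucket_size=...)
]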
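
A crossed column treats a combination of feature values as a single new categorical feature, which lets the linear model memorize co-occurrences such as a particular education level together with a particular occupation. The sketch below imitates that idea in plain Python by joining the values and hashing the result into a fixed number of buckets. The helper `cross_and_hash`, the separator, and the MD5 hash are assumptions for illustration; tf.contrib.layers.crossed_column uses its own fingerprinting, so real bucket ids will differ.

```python
import hashlib

def cross_and_hash(values, hash_bucket_size):
    """Combine several categorical values into one hashed bucket id."""
    key = "_X_".join(values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Hypothetical rows with two categorical features each.
rows = [
    {"education": "Bachelors", "occupation": "Sales"},
    {"education": "Bachelors", "occupation": "Tech-support"},
    {"education": "Bachelors", "occupation": "Sales"},
]

crossed_ids = [cross_and_hash([r["education"], r["occupation"]], 10000)
               for r in rows]
print(crossed_ids)  # the first and third ids are equal
```

Each crossed id gets its own weight in the linear model, so identical combinations always share a weight while different combinations usually do not.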


4、The Deep Model: Neural Network with Embeddings

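As mentioned earlier, the deep model is a feed-forward neural network. Each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector, often referred to as an embedding vector. These low-dimensional dense embedding vectors are concatenated with the continuous features and then fed into the hidden layers of the neural network in the forward pass. The embedding values are initialized randomly and are trained along with all other model parameters to minimize the training loss. If you would like to learn more about embeddings, see the TensorFlow tutorial Vector Representations of Words.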
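
The following sketch shows these mechanics in plain Python: each categorical id selects a row from a randomly initialized embedding table, and the resulting dense vectors are concatenated with the continuous features to form the input to the first hidden layer. The table sizes, the dimension of 4, and the helper names are hypothetical, and no training is performed here.

```python
import random

def make_embedding_table(vocab_size, dimension, rng):
    """Create one small random vector per category id."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(vocab_size)]

def build_dense_input(category_ids, tables, continuous_values):
    """Look up one embedding per categorical feature and concatenate."""
    dense = []
    # Iterate in sorted name order so the layout of the vector is stable.
    for name, category_id in sorted(category_ids.items()):
        dense.extend(tables[name][category_id])
    dense.extend(continuous_values)
    return dense

rng = random.Random(0)
tables = {
    "workclass": make_embedding_table(vocab_size=100, dimension=4, rng=rng),
    "occupation": make_embedding_table(vocab_size=1000, dimension=4, rng=rng),
}
example_ids = {"workclass": 7, "occupation": 42}
continuous = [39.0, 13.0]  # e.g. age and education_num

dense_input = build_dense_input(example_ids, tables, continuous)
print(len(dense_input))  # 4 + 4 + 2 = 10
```

During training, the rows of each table would be updated by gradient descent like any other weight, so categories that behave similarly tend to end up with similar vectors.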


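deep_columns = [
    # embedding columns for the categorical features go here, created with
    # tf.contrib.layers.embedding_column(column, dimension=...)
    age, education_num, capital_gain, capital_loss, hours_per_week]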




5、Combining Wide and Deep Models into One


import tempfile
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir = model_dir,
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [100,50])

6、Training and Evaluating The Model

Before training the model, we read in the Census dataset as we did in the TensorFlow Linear Model tutorial. The input data processing code is provided here again for your convenience.

import pandas as pd
import urllib

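# define the column names for the data sets.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race","gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = 'label'
# categorical names taken from the sparse columns defined in section 2
CATEGORICAL_COLUMNS = ["workclass", "education", "occupation", "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = [
    "age", "education_num","capital_gain","capital_loss", "hours_per_week"]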

# download the training and test data to temporary files.
# alternatively, you can download them yourself and change 
# train_file and test_file to your own paths.
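train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
# fetch the census training and test files into train_file.name and
# test_file.name here, e.g. with urllib.urlretrieve(url, train_file.name)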

# read the training and test data sets into pandas dataframe.
df_train = pd.read_csv(train_file,names=COLUMNS,skipinitialspace=True)
df_test = pd.read_csv(test_file,names=COLUMNS,skipinitialspace=True,skiprows=1)
# the data spells the positive label '>50K' with a capital K
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x:'>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x:'>50K' in x)).astype(int)

def input_fn(df):
    # creates a dictionary mapping from each continuous feature
    # column name(k) to the values of that column stored in a 
    # constant tensor
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # create a dictionary mapping from each categorical feature
    # column name(k) to the values of that column stored in a 
    # tf.SparseTensor.
    categorical_cols = {
        k: tf.SparseTensor(
            indices = [[i,0] for i in range(df[k].size)],
            values = df[k].values,
            # this argument was called `shape` before TensorFlow 1.0
            dense_shape = [df[k].size,1])
        for k in CATEGORICAL_COLUMNS}
    # merges the two dictionaries into one.
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # converts the label column into a constant tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # returns the feature columns and the label.
    return feature_cols,label

def train_input_fn():
    return input_fn(df_train)
def eval_input_fn():
    return input_fn(df_test)
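
To see what the `tf.SparseTensor` built inside `input_fn` contains, the sketch below constructs the same three pieces (indices, values, and dense shape) for a small column of strings, using plain Python lists instead of TensorFlow. The helper name `sparse_parts` and the sample values are made up for illustration.

```python
def sparse_parts(column_values):
    """Return (indices, values, dense_shape) for a one-value-per-row column."""
    n = len(column_values)
    indices = [[i, 0] for i in range(n)]   # row i, column 0
    values = list(column_values)           # the string stored at each index
    dense_shape = [n, 1]                   # n rows, one value per row
    return indices, values, dense_shape

indices, values, dense_shape = sparse_parts(["Private", "State-gov", "Private"])
print(indices)      # [[0, 0], [1, 0], [2, 0]]
print(values)       # ['Private', 'State-gov', 'Private']
print(dense_shape)  # [3, 1]
```

Every row holds exactly one value here, so the tensor is not actually sparse; the sparse representation is used because it is the form in which the categorical feature columns expect their string input.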


m.fit(input_fn = train_input_fn,steps=200)
results = m.evaluate(input_fn=eval_input_fn,steps=1)
for key in sorted(results):
    print("%s: %s" % (key,results[key]))

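The first line of the output should be something like accuracy: 0.84429705. We can see that accuracy improved from about 83.6% with the wide-only linear model to about 84.4% with the wide & deep model. If you would like to see a working end-to-end example, you can download the example code.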

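Note that this tutorial is just a quick example on a small dataset to get you familiar with the tf.learn API. Wide & deep learning becomes even more powerful on a large dataset with many sparse feature columns that have a large number of possible feature values. See the research paper for more ideas about how to apply wide & deep learning to real-world, large-scale machine learning problems.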



/usr/local/lib/python2.7/dist-packages/pandas/core/computation/__init__.py:18: UserWarning: The installed version of numexpr 2.2.2 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
WARNING:tensorflow:The default stddev value of initializer will change from "1/sqrt(vocab_size)" to "1/sqrt(dimension)" after 2017/02/25.

WARNING:tensorflow:From wide_deep_train.py:58: calling __init__ (from tensorflow.contrib.learn.python.learn.estimators.dnn_linear_combined) with fix_global_step_increment_bug=False is deprecated and will be removed after 2017-04-15.
Instructions for updating:
Please set fix_global_step_increment_bug=True and update training steps in your pipeline. See pydoc for details.

WARNING:tensorflow:Rank of input Tensor (1) should be the same as output_rank (2) for column. Will attempt to expand dims. It is highly recommended that you resize your input, as this behavior may change.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py:95: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py:96: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/feature_column.py:1861: calling sparse_feature_cross (from tensorflow.contrib.layers.python.ops.sparse_feature_cross_op) with hash_key=None is deprecated and will be removed after 2016-11-20.
Instructions for updating:
The default behavior of sparse_feature_cross is changing, the default
value for hash_key will change to SPARSE_FEATURE_CROSS_DEFAULT_HASH_KEY.
From that point on sparse_feature_cross will always use FingerprintCat64
to concatenate the feature fingerprints. And the underlying
_sparse_feature_cross_op.sparse_feature_cross operation will be marked
as deprecated.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:615: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.

2017-06-01 15:36:31.892371: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

