Using TensorFlow: Logistic Regression

This article shows how to train a logistic regression model with TensorFlow.

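The task is a binary classification problem solved with logistic regression: given census data about a person, such as age, gender, education and occupation, predict whether that person's annual income exceeds 50,000 US dollars. The model outputs 1 if it does and 0 otherwise.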


1. Model Overview

Let's first take a quick look at how logistic regression is defined for this model.

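Let Y be the label: Y = 1 if the income is greater than 50,000 and Y = 0 otherwise. Let the input vector be $X = (x_1, x_2, \ldots, x_d)$. Then, for a given input X, the probability that Y = 1 is:

$$P(Y = 1 \mid X) = S(w^T X + b) = S\left(\sum_{i=1}^{d} w_i x_i + b\right)$$

Here b is the model's bias, a constant that does not depend on the input. Each $w_i$ is the weight of feature $x_i$ and reflects how $x_i$ correlates with the label. If $x_i$ is positively correlated with the label, $w_i$ is learned as a positive value, so a larger $x_i$ pushes P(Y = 1 | X) toward 1; if the correlation is negative, $w_i$ is negative and the probability is pushed toward 0. S is the logistic function, which is a sigmoid function:

$$S(t) = \frac{1}{1 + e^{-t}}$$

It maps the output of the linear model above into the interval (0, 1), so the result can be read as a probability. The goal of training is to find the weights w (and the bias b) that minimize the cost function.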
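The definition above can be sketched in a few lines of plain Python. This is only an illustration: the weights, bias and inputs are made-up numbers, and the cost function is assumed to be the usual cross-entropy (log loss), which the text does not name explicitly.

```python
import math


def sigmoid(t):
    """Logistic function S(t): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(x, w, b):
    """Return P(Y=1 | X) = S(w . x + b) for a single example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def log_loss(y, p):
    """Cross-entropy cost of one example with true label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Hypothetical parameters and input, chosen only for illustration.
w = [0.8, -0.5]
b = 0.1
x = [2.0, 1.0]

p = predict_proba(x, w, b)  # linear part: 0.8*2.0 - 0.5*1.0 + 0.1 = 1.2
print("P(Y=1|X) =", round(p, 4))
print("loss if the true label is 1:", round(log_loss(1, p), 4))
print("loss if the true label is 0:", round(log_loss(0, p), 4))
```

A positive weight on a feature raises the probability as that feature grows, and the loss is smaller when the predicted probability agrees with the true label.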


2. Data Structure

Next, let's look at the data used in this article. It is structured as follows:

Column Name      Type          Description
age              Continuous    The age of the individual.
workclass        Categorical   The type of employer the individual has (government, military, private, etc.).
fnlwgt           Continuous    The number of people the census takers believe that observation represents (sample weight). This variable will not be used.
education        Categorical   The highest level of education achieved for that individual.
education_num    Continuous    The highest level of education in numerical form.
marital_status   Categorical   Marital status of the individual.
occupation       Categorical   The occupation of the individual.
relationship     Categorical   Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race             Categorical   White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
gender           Categorical   Female, Male.
capital_gain     Continuous    Capital gains recorded.
capital_loss     Continuous    Capital losses recorded.
hours_per_week   Continuous    Hours worked per week.
native_country   Categorical   Country of origin of the individual.
income           Categorical   ">50K" or "<=50K", meaning whether the person makes more than $50,000 annually.
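To make the column layout concrete, here is a minimal sketch that parses one record with Python's standard csv module, converts the continuous columns to numbers and derives the 0/1 label. The sample row is only an illustrative record in the comma-separated layout described by the table, the column names are taken from the table (the script later in this article calls the last column income_bracket), and parse_rows is a hypothetical helper, not part of the tutorial code.

```python
import csv
import io

COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]

# Illustrative record only; fields are separated by a comma and a space.
SAMPLE = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n")


def parse_rows(text):
    """Yield one dict per well-formed record, with an added 0/1 label."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        for name in CONTINUOUS_COLUMNS:
            row[name] = float(row[name])
        # Label is 1 when the income field contains ">50K", else 0.
        row["label"] = int(">50K" in row["income"])
        yield row


rows = list(parse_rows(SAMPLE))
print(rows[0]["age"], rows[0]["gender"], rows[0]["income"], rows[0]["label"])
```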


3. Environment and Code Walkthrough

The environment needs TensorFlow, of course, plus pandas. Install pandas as follows:

$ sudo pip install pandas
Run the script as follows:
$ python wide_n_deep_tutorial.py --model_type=wide
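The --model_type flag accepts wide, deep or wide_n_deep. Any value other than wide or deep selects the combined wide & deep model, the middle one in the figure below. Note that the code below sets the flag's default to wide, so omitting --model_type trains the wide model.

Also note that the deep model requires Python 3.5 or later.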

The three models are shown in the figure below:

[Figure 1: the three models, from left to right: wide, wide & deep, deep]

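This article uses TensorFlow's TF.Learn API to train a wide model (logistic regression with sparse features and feature transformations), a deep feed-forward neural network (with several hidden layers), and a network that combines the wide and deep models. The combined model is especially useful for large-scale regression and classification problems. That gives three models to choose from:
1. wide mode: the first model in the figure above
2. deep mode: the third model in the figure above
3. wide & deep mode: the second model in the figure above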
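As a rough sketch of how the three modes relate, the snippet below computes a prediction from a linear (wide) part, a small feed-forward (deep) part, or the sum of both logits followed by a sigmoid. It is a simplified illustration in plain Python, not TensorFlow's implementation: all weights, feature names and the single hidden layer are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def wide_logit(active_features, weights, bias):
    """Linear model over sparse features: sum the weights of the active ones."""
    return sum(weights.get(f, 0.0) for f in active_features) + bias


def deep_logit(dense_input, hidden_w, hidden_b, out_w, out_b):
    """Feed-forward network with one ReLU hidden layer and a scalar output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense_input)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


# Hypothetical parameters, for illustration only.
WIDE_W = {"education=Bachelors": 0.4,
          "education=Bachelors_X_occupation=Sales": -0.1}
WIDE_B = 0.2
HIDDEN_W = [[0.5, 0.25], [-1.0, 0.25]]
HIDDEN_B = [0.0, 0.1]
OUT_W = [0.3, 0.7]
OUT_B = -0.1


def predict(model_type, active_features, dense_input):
    """Return P(Y=1) for the chosen mode: wide, deep, or anything else (combined)."""
    wide = wide_logit(active_features, WIDE_W, WIDE_B)
    deep = deep_logit(dense_input, HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
    if model_type == "wide":
        logit = wide
    elif model_type == "deep":
        logit = deep
    else:  # wide & deep: both parts contribute to one logit
        logit = wide + deep
    return sigmoid(logit)


features = ["education=Bachelors", "education=Bachelors_X_occupation=Sales"]
dense = [1.0, 2.0]
for mode in ("wide", "deep", "wide_n_deep"):
    print(mode, round(predict(mode, features, dense), 4))
```

The wide part memorizes specific feature combinations through crossed features, while the deep part generalizes through learned dense representations; the combined mode trains both against the same output.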


That covers the overall picture. The full code follows, with detailed comments:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tempfile
from six.moves import urllib

import pandas as pd
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS

# Directory for the model output; the second argument is the default value.
flags.DEFINE_string("model_dir", "/home/duan/logistic_regression/", "Base directory for output models.")
# Which model to train; the second argument is the default value.
# Valid values: wide, deep, wide_n_deep.
flags.DEFINE_string("model_type", "wide",
                    "Valid model types: {'wide', 'deep', 'wide_n_deep'}.")
# Number of training steps, set to 200 here.
flags.DEFINE_integer("train_steps", 200, "Number of training steps.")
# Path to the training data; the second argument is the default value.
flags.DEFINE_string(
    "train_data",
    "/home/duan/logistic_regression/adult.data",
    "Path to the training data.")
# Path to the test data; the second argument is the default value.
flags.DEFINE_string(
    "test_data",
    "/home/duan/logistic_regression/adult.test",
    "Path to the test data.")

# Names of the columns in the data used for training.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
LABEL_COLUMN = "label"

""" The columns above fall into two types: categorical and continuous.
A categorical column takes one of a finite set of values,
e.g. workclass has {Private, Self-emp-not-inc, Self-emp-inc, etc.}.
A continuous column holds numeric values on a continuous range, e.g. age.
"""
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]


def maybe_download():
  """Download the training and test data if no local paths are given.
    Returns the file names of the training data and the test data.
  """
  if FLAGS.train_data:
    train_file_name = FLAGS.train_data
  else:
    train_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)  # pylint: disable=line-too-long
    train_file_name = train_file.name
    train_file.close()
    print("Training data is downloaded to %s" % train_file_name)

  if FLAGS.test_data:
    test_file_name = FLAGS.test_data
  else:
    test_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)  # pylint: disable=line-too-long
    test_file_name = test_file.name
    test_file.close()
    print("Test data is downloaded to %s" % test_file_name)

  return train_file_name, test_file_name


def build_estimator(model_dir):
  """
  Build the estimator (the prediction model).
  """
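  # Sparse base columns. Each key in the list gets an id that starts at 0
  # and increases by 1, e.g. below "Female" is 0 and "Male" is 1. Use this
  # form when the full set of values in the column is known in advance.
  # The keys must match the values in the data exactly ("Female", "Male").
  gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender",
                                                     keys=["Female", "Male"])
  # When the set of values is not known in advance, use a hash bucket.
  # Each value in the education column, for example, is hashed to an
  # integer id, e.g.
  """ ID  Feature
      ...
      9 "Bachelors"
      ...
      103 "Doctorate"
      ...
      375 "Masters"
  """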
  education = tf.contrib.layers.sparse_column_with_hash_bucket(
      "education", hash_bucket_size=1000)
  relationship = tf.contrib.layers.sparse_column_with_hash_bucket(
      "relationship", hash_bucket_size=100)
  workclass = tf.contrib.layers.sparse_column_with_hash_bucket(
      "workclass", hash_bucket_size=100)
  occupation = tf.contrib.layers.sparse_column_with_hash_bucket(
      "occupation", hash_bucket_size=1000)
  native_country = tf.contrib.layers.sparse_column_with_hash_bucket(
      "native_country", hash_bucket_size=1000)

  # Real-valued columns for the continuous features.
  age = tf.contrib.layers.real_valued_column("age")
  education_num = tf.contrib.layers.real_valued_column("education_num")
  capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
  capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
  hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

  # Income is related to age range rather than linearly to age, so split
  # the continuous age into buckets to help the model learn that pattern.
  age_buckets = tf.contrib.layers.bucketized_column(age,
                                                    boundaries=[
                                                        18, 25, 30, 35, 40, 45,
                                                        50, 55, 60, 65
                                                    ])

  # Feature columns for the wide model described above.
  wide_columns = [gender, native_country, education, occupation, workclass,
                  relationship, age_buckets,
                  tf.contrib.layers.crossed_column([education, occupation],
                                                   hash_bucket_size=int(1e4)),
                  tf.contrib.layers.crossed_column(
                      [age_buckets, education, occupation],
                      hash_bucket_size=int(1e6)),
                  tf.contrib.layers.crossed_column([native_country, occupation],
                                                   hash_bucket_size=int(1e4))]

  # Feature columns for the deep model.
  deep_columns = [
      tf.contrib.layers.embedding_column(workclass, dimension=8),
      tf.contrib.layers.embedding_column(education, dimension=8),
      tf.contrib.layers.embedding_column(gender, dimension=8),
      tf.contrib.layers.embedding_column(relationship, dimension=8),
      tf.contrib.layers.embedding_column(native_country,
                                         dimension=8),
      tf.contrib.layers.embedding_column(occupation, dimension=8),
      age,
      education_num,
      capital_gain,
      capital_loss,
      hours_per_week,
  ]

  # Build and return the model selected by --model_type.
  if FLAGS.model_type == "wide":
    m = tf.contrib.learn.LinearClassifier(model_dir=model_dir,
                                          feature_columns=wide_columns)
  elif FLAGS.model_type == "deep":
    m = tf.contrib.learn.DNNClassifier(model_dir=model_dir,
                                       feature_columns=deep_columns,
                                       hidden_units=[100, 50])
  else:
    m = tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=wide_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 50])
  return m


def input_fn(df):
  """Convert the input DataFrame into tensors."""

  # Build a dict that maps each continuous column name to
  # a constant tensor holding that column's values.
  continuous_cols = {k: tf.constant(df[k].values) for k in CONTINUOUS_COLUMNS}
  # Build a dict that maps each categorical column name to
  # a tf.SparseTensor holding that column's values.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      # Note: TensorFlow 1.0 and later name this argument dense_shape.
      shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merge the two dicts.
  feature_cols = dict(continuous_cols)
  feature_cols.update(categorical_cols)

  # Convert the label column into a constant tensor.
  label = tf.constant(df[LABEL_COLUMN].values)

  # Return the feature tensors keyed by column name, and the label.
  return feature_cols, label


def train_and_eval():
  """The real entry point: train the model first,
    then evaluate it.
  """
  # Get the file names of the training and test data.
  train_file_name, test_file_name = maybe_download()

  # Read the data with pandas.
  df_train = pd.read_csv(
      tf.gfile.Open(train_file_name),
      names=COLUMNS,
      skipinitialspace=True,
      engine="python")
  df_test = pd.read_csv(
      tf.gfile.Open(test_file_name),
      names=COLUMNS,
      skipinitialspace=True,
      skiprows=1,
      engine="python")

  # Drop rows that contain missing values (NaN).
  df_train = df_train.dropna(how='any', axis=0)
  df_test = df_test.dropna(how='any', axis=0)

  # Convert the income column into the label: 1 if the income is
  # greater than 50K, 0 if it is at most 50K.
  df_train[LABEL_COLUMN] = (
      df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
  df_test[LABEL_COLUMN] = (
      df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

  # Use FLAGS.model_dir if it is set; otherwise create a temporary directory.
  model_dir = tempfile.mkdtemp() if not FLAGS.model_dir else FLAGS.model_dir
  print("model directory = %s" % model_dir)

  # Build the model: one of wide, deep or wide & deep.
  m = build_estimator(model_dir)

  # Train the model.
  m.fit(input_fn=lambda: input_fn(df_train), steps=FLAGS.train_steps)

  # Evaluate on the test data.
  results = m.evaluate(input_fn=lambda: input_fn(df_test), steps=1)
  for key in sorted(results):
    print("%s: %s" % (key, results[key]))


def main(_):
  train_and_eval()


if __name__ == "__main__":
  tf.app.run()
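The feature transformations used in build_estimator (keyed ids, hash buckets, bucketized age and crossed columns) can be illustrated without TensorFlow. The sketch below is a plain-Python approximation under stated assumptions: zlib.crc32 stands in for TensorFlow's internal hash function, so the ids it produces will not match the ones TensorFlow assigns, and the "_X_" separator and all helper names are hypothetical.

```python
import bisect
import zlib

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
GENDER_KEYS = ["Female", "Male"]


def key_to_id(value, keys):
    """Known vocabulary: the id is the position in keys, -1 if unknown."""
    return keys.index(value) if value in keys else -1


def hash_bucket(value, hash_bucket_size):
    """Unknown vocabulary: hash the string into [0, hash_bucket_size)."""
    # crc32 is only a stand-in for the hash TensorFlow uses internally.
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


def bucketize(value, boundaries):
    """Return the bucket index; n boundaries give n + 1 buckets.

    A value equal to a boundary falls into the bucket that starts there.
    """
    return bisect.bisect_right(boundaries, value)


def cross(values, hash_bucket_size):
    """Crossed feature: hash the combination of several feature values."""
    return hash_bucket("_X_".join(str(v) for v in values), hash_bucket_size)


print("gender id:", key_to_id("Male", GENDER_KEYS))
print("age bucket for 39:", bucketize(39, AGE_BOUNDARIES))
print("education id:", hash_bucket("Bachelors", 1000))
print("education x occupation id:",
      cross(["Bachelors", "Adm-clerical"], int(1e4)))
```

Different strings can collide in the same hash bucket; a larger hash_bucket_size makes collisions less likely at the cost of more model parameters.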

The output of running the wide model is:






References:
https://www.tensorflow.org/versions/master/tutorials/wide_and_deep/index.html
https://www.tensorflow.org/versions/master/tutorials/wide/index.html
https://www.tensorflow.org/versions/master/api_docs/python/io_ops.html#inputs-and-readers
