Implementing Basic Regression with TensorFlow

Getting Started with TensorFlow

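TensorFlow is currently one of the most popular deep learning frameworks. Let's start by quoting the introduction from the official website to see how Google positions the product.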

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

The passage above never mentions the wildly popular Deep Learning; instead it focuses on the broader field of scientific computing. The key terms in the quotation are:

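Numerical Computation: the application domain is numerical computation, so TensorFlow supports not only Deep Learning but also other machine learning algorithms, and even more general numerical tasks (such as differentiation, integration, and transforms).

Data Flow Graph: a computation task is described as a graph.

Node: represents a mathematical operation (op for short), including the ops that deep learning models commonly need.

Edge: an edge pointing into a node is that node's input, and an edge leaving a node is that node's output. Inputs and outputs are both multidimensional data arrays, which mathematics calls tensors. This is where the name TensorFlow comes from: multidimensional arrays flowing through a graph.

CPUs/GPUs: both CPU and GPU devices are supported, on a single machine or distributed.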
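
To make the node and edge vocabulary concrete, here is a minimal sketch of a toy data flow graph in plain Python. It is not how TensorFlow is implemented; the `graph` dictionary and the `evaluate` helper are hypothetical names introduced only for illustration.

```python
# A toy data flow graph: nodes are operations, edges carry values between them.
# Each entry maps a node name to (operation, names of the nodes feeding it).
graph = {
    "x":   (lambda: 3.0, []),                    # source node producing a constant
    "w":   (lambda: 2.0, []),
    "b":   (lambda: 1.0, []),
    "mul": (lambda a, c: a * c, ["w", "x"]),     # edges w -> mul and x -> mul
    "add": (lambda a, c: a + c, ["mul", "b"]),   # edges mul -> add and b -> add
}


def evaluate(graph, name, cache=None):
    """Evaluate one node after evaluating the nodes its input edges come from."""
    if cache is None:
        cache = {}
    if name not in cache:
        op, inputs = graph[name]
        args = [evaluate(graph, dep, cache) for dep in inputs]
        cache[name] = op(*args)
    return cache[name]


result = evaluate(graph, "add")
print(result)  # 2.0 * 3.0 + 1.0 = 7.0
```

Defining the structure first and evaluating it afterwards is the same split that the rest of this article describes as building and launching a graph.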

TensorFlow supports several languages, of which Python has the most complete support, so this article focuses on the Python API.

Hello World

The following code comes from Get Started on the TensorFlow website and shows how TensorFlow trains a linear regression model.

import tensorflow as tf
import numpy as np

# Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but TensorFlow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.global_variables_initializer()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in range(201):
    sess.run(train)
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))

# Learns best fit is W: [0.1], b: [0.3]
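Let's dissect the key code. TensorFlow code usually consists of two parts: building the graph, and launching the graph in a Session.

Session is a class that places the graph's ops onto devices (CPUs/GPUs) and provides the methods that actually execute them.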

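Why this design? Python itself is slow, so for numerical computing we try to run code written in other languages, for example NumPy's precompiled C code for matrix operations. The time spent switching between the Python runtime and an external runtime (such as NumPy) is called overhead cost. For a single simple operation, such as a matrix operation, switching from Python to NumPy, computing the result there, and switching back still costs far less than doing the same operation in pure Python. However, a complex numerical computation is composed of many basic operations, and if every one of them triggers such a switch, the overhead cost is no longer negligible. To reduce the back-and-forth, TensorFlow first defines the entire graph inside Python and then runs the complete graph outside Python. The structure of TensorFlow code therefore follows these two phases.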

=== Build Graph ===
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))

tf.Variable is a TensorFlow class representing a Tensor whose value can change. The first argument of its constructor is the initial value, initial_value.

tf.zeros(shape, dtype=tf.float32, name=None) is an op that generates a constant-valued Tensor filled with zeros.

tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None, name=None) is an op that generates a random Tensor drawn from a uniform distribution.
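y = W * x_data + b

y is the Tensor produced by the linear regression. The operators * and + are equivalent to tf.multiply() and tf.add(), two math ops provided by TensorFlow. The inputs of tf.multiply() are W and x_data: W is a Variable, which is a Tensor and can be fed to an op directly; x_data is a NumPy ndarray, which TensorFlow ops convert to a tensor when they receive it. The output of tf.multiply() is a tensor, which is passed together with b to the op tf.add() to produce y.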
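
The idea that `*` and `+` add ops to a graph instead of computing a number right away can be sketched in plain Python with operator overloading. The `Node` class and `run` function below are hypothetical stand-ins written for this illustration, not TensorFlow's actual classes.

```python
class Node:
    """A hypothetical graph node; `*` and `+` create new nodes instead of computing."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    @staticmethod
    def wrap(x):
        # Plain Python numbers become constant nodes, similar in spirit to
        # ndarray inputs being converted to tensors.
        return x if isinstance(x, Node) else Node("const", value=x)

    def __mul__(self, other):
        return Node("multiply", (self, Node.wrap(other)))

    def __add__(self, other):
        return Node("add", (self, Node.wrap(other)))


def run(node):
    """Evaluate a node recursively; plays the role that sess.run() plays later."""
    if node.op == "const":
        return node.value
    left, right = (run(n) for n in node.inputs)
    return left * right if node.op == "multiply" else left + right


W = Node("const", value=0.1)
b = Node("const", value=0.3)
y = W * 2.0 + b                 # builds multiply -> add; nothing is computed yet
print(y.op, y.inputs[0].op)     # the recorded structure: add multiply
print(run(y))                   # only now is the value computed
```

Until `run` is called, `y` is only a description of the computation, which is why the real example needs a separate launch phase.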

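The linear regression model is now defined, but it is only part of the graph; we still need to define the loss.
loss = tf.reduce_mean(tf.square(y - y_data))
loss is the objective function required by the least squares method. It is a Tensor; the individual ops are not discussed further here.
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)
This step chooses the solver and sets its minimization target to the loss. train is the op that performs one update step of the solver. Here we use the gradient descent solver: at each step it computes the gradient of loss and updates the Variables that loss depends on according to that gradient.
init = tf.global_variables_initializer()
The code in the Build Graph phase only defines the structure of the graph in Python; nothing is actually executed. In the Launch Graph phase all variables must be initialized first. Each variable can be initialized individually, but that is tedious, so TensorFlow provides the convenience function global_variables_initializer(), which adds to the graph an op that initializes all variables.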
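
As a worked sketch of what one gradient descent step computes for the mean squared error of the line W x + b, the update can be written out as follows. The symbols N (number of data points), i (sample index) and η (the learning rate, 0.5 in the example) are notation introduced here.

```latex
\begin{aligned}
L(W, b) &= \frac{1}{N}\sum_{i=1}^{N}\left(W x_i + b - y_i\right)^2 \\
\frac{\partial L}{\partial W} &= \frac{2}{N}\sum_{i=1}^{N}\left(W x_i + b - y_i\right)x_i, \qquad
\frac{\partial L}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}\left(W x_i + b - y_i\right) \\
W &\leftarrow W - \eta\,\frac{\partial L}{\partial W}, \qquad
b \leftarrow b - \eta\,\frac{\partial L}{\partial b}
\end{aligned}
```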

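=== Launch Graph ===
sess.run(init)
Before any computation, assign the Variables their initial values.
for step in range(201):
    sess.run(train)
The train op corresponds to one iteration of gradient descent. At step 0 the variables hold their initial values; the gradient is computed from them and they are updated to better values. At step 1 the variables hold the values from the previous update; a new gradient is computed and the variables are updated again. This repeats until the maximum number of iterations is reached.
print(step, sess.run(W), sess.run(b))
Assigning the result of sess.run() to a Python variable, or passing it to Python's print, fetches the value of the op's output Tensor. The values are converted to NumPy ndarrays, which requires a switch between environments and adds overhead cost. We therefore usually fetch results only every so many steps to save time.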
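
The training loop walked through above can be sketched without TensorFlow. This is a minimal pure-Python version under the same setup (100 points on y = 0.1 * x + 0.3, learning rate 0.5, 201 steps); the gradients are written by hand rather than derived automatically, and the fixed random seed is an assumption added for reproducibility.

```python
import random

random.seed(0)

# Same synthetic data as the example: y = 0.1 * x + 0.3
xs = [random.random() for _ in range(100)]
ys = [0.1 * x + 0.3 for x in xs]

W, b = random.uniform(-1.0, 1.0), 0.0
learning_rate = 0.5
n = len(xs)

for step in range(201):
    # Residuals of the current line on every data point
    errors = [W * x + b - y for x, y in zip(xs, ys)]
    # Gradients of mean((W*x + b - y)^2) with respect to W and b
    grad_W = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    # One gradient descent update, the counterpart of sess.run(train)
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b
    if step % 20 == 0:
        print(step, W, b)
```

The printed values should approach W = 0.1 and b = 0.3, mirroring what the TensorFlow version reports.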

Linear Regression

We train a linear regression model with tensorflow and compare it with scikit-learn. The dataset comes from Andrew Ng's online open course Deep Learning.

import tensorflow as tf
import numpy as np
from sklearn import linear_model

# Read x and y
x_data = np.loadtxt("ex2x.dat")
y_data = np.loadtxt("ex2y.dat")


# We use scikit-learn first to get a sense of the coefficients
reg = linear_model.LinearRegression()
reg.fit(x_data.reshape(-1, 1), y_data)

print("Coefficient of scikit-learn linear regression: k=%f, b=%f" % (reg.coef_[0], reg.intercept_))


# Then we apply tensorflow to achieve similar results
# The structure of tensorflow code can be divided into two parts:

# First part: set up computation graph
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b

loss = tf.reduce_mean(tf.square(y - y_data)) / 2
optimizer = tf.train.GradientDescentOptimizer(0.07)  # Try 0.1 and you will see that it fails to converge
train = optimizer.minimize(loss)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()

# Second part: launch the graph
sess = tf.Session()
sess.run(init)

for step in range(1500):
    sess.run(train)
    if step % 100 == 0:
        print(step, sess.run(W), sess.run(b))
print("Coefficient of tensorflow linear regression: k=%f, b=%f" % (sess.run(W)[0], sess.run(b)[0]))
Output:
Coefficient of scikit-learn linear regression: k=0.063881, b=0.750163
0 [ 0.45234478] [ 0.10217379]
100 [ 0.13166969] [ 0.4169243]
200 [ 0.09332827] [ 0.58935112]
300 [ 0.07795752] [ 0.67282093]
400 [ 0.07064758] [ 0.71297228]
500 [ 0.06713474] [ 0.73227954]
600 [ 0.06544565] [ 0.74156356]
700 [ 0.06463348] [ 0.74602771]
800 [ 0.06424291] [ 0.74817437]
900 [ 0.06405514] [ 0.74920654]
1000 [ 0.06396478] [ 0.74970293]
1100 [ 0.06392141] [ 0.74994141]
1200 [ 0.06390052] [ 0.75005609]
1300 [ 0.06389045] [ 0.7501114]
1400 [ 0.0638856] [ 0.75013816]
Coefficient of tensorflow linear regression: k=0.063883, b=0.750151

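With tensorflow, the gradient descent step size alpha has to be set carefully: too large a step easily prevents convergence, while too small a step makes training take a long time. The number of iterations also needs some experimentation.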
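
The sensitivity to the step size can be illustrated on a one-parameter problem. The sketch below fits y = k * x on made-up data (not the course dataset) and compares a step below and above the stability threshold 2 / curvature; `gradient_descent` is a hypothetical helper written for this illustration.

```python
def gradient_descent(learning_rate, xs, ys, steps=50):
    """Fit y = k * x by minimizing mean((k*x - y)^2) / 2; return the loss history."""
    k = 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        errors = [k * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * n))
        grad = sum(e * x for e, x in zip(errors, xs)) / n
        k -= learning_rate * grad
    return history


# Made-up data: x between 2 and 8, y = 0.5 * x exactly
xs = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.5 * x for x in xs]

curvature = sum(x * x for x in xs) / len(xs)   # second derivative of the loss w.r.t. k
threshold = 2.0 / curvature                    # steps larger than this diverge

small = gradient_descent(0.5 * threshold, xs, ys)
large = gradient_descent(1.5 * threshold, xs, ys)
print("threshold:", threshold)
print("loss with small step:", small[0], "->", small[-1])
print("loss with large step:", large[0], "->", large[-1])
```

With the smaller step the loss shrinks toward zero, while with the larger step each update overshoots the minimum by more than the previous one and the loss grows.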

Multiple Linear Regression

import numpy as np
import tensorflow as tf
from sklearn import linear_model
from sklearn import preprocessing

# Read x and y
x_data = np.loadtxt("ex3x.dat").astype(np.float32)
y_data = np.loadtxt("ex3y.dat").astype(np.float32)


# We evaluate the x and y by sklearn to get a sense of the coefficients.
reg = linear_model.LinearRegression()
reg.fit(x_data, y_data)
print("Coefficients of sklearn: K=%s, b=%f" % (reg.coef_, reg.intercept_))


# Now we use tensorflow to get similar results.

# Before feeding x_data into tensorflow, we standardize it
# so that gradient descent performs well;
# without standardization, convergence is intolerably slow.
# Reason: if a feature has a variance that is orders of magnitude larger than the others,
# it might dominate the objective function
# and make the estimator unable to learn from the other features as expected.
scaler = preprocessing.StandardScaler().fit(x_data)
print(scaler.mean_, scaler.scale_)
x_data_standard = scaler.transform(x_data)


W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))
y = tf.matmul(x_data_standard, W) + b

loss = tf.reduce_mean(tf.square(y - y_data.reshape(-1, 1)))/2
optimizer = tf.train.GradientDescentOptimizer(0.3)
train = optimizer.minimize(loss)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()


sess = tf.Session()
sess.run(init)
for step in range(100):
    sess.run(train)
    if step % 10 == 0:
        print(step, sess.run(W).flatten(), sess.run(b).flatten())

print("Coefficients of tensorflow (input should be standardized): K=%s, b=%s" % (sess.run(W).flatten(), sess.run(b).flatten()))
print("Coefficients of tensorflow (raw input): K=%s, b=%s" % (sess.run(W).flatten() / scaler.scale_, sess.run(b).flatten() - np.dot(scaler.mean_ / scaler.scale_, sess.run(W))))
Output:

Coefficients of sklearn: K=[  139.21066284 -8738.02148438], b=89597.927966
[ 2000.6809082      3.17021275] [  7.86202576e+02   7.52842903e-01]
0 [ 31729.23632812  16412.6484375 ] [ 102123.7890625]
10 [ 97174.78125      5595.25585938] [ 333681.59375]
20 [ 106480.5703125    -3611.31201172] [ 340222.53125]
30 [ 108727.5390625    -5858.10302734] [ 340407.28125]
40 [ 109272.953125     -6403.52148438] [ 340412.5]
50 [ 109405.3515625    -6535.91503906] [ 340412.625]
60 [ 109437.4921875    -6568.05371094] [ 340412.625]
70 [ 109445.296875     -6575.85644531] [ 340412.625]
80 [ 109447.1875       -6577.75097656] [ 340412.625]
90 [ 109447.640625     -6578.20654297] [ 340412.625]
Coefficients of tensorflow (input should be standardized): K=[ 109447.7421875    -6578.31152344], b=[ 340412.625]
Coefficients of tensorflow (raw input): K=[  139.21061707 -8737.9609375 ], b=[ 89597.78125]

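For gradient descent, whether the variables are standardized matters a great deal. In this example one variable is the area and the other is the number of rooms, so their magnitudes differ greatly. Without standardization, the area dominates the objective function and the gradient, and convergence becomes extremely slow.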
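
The conversion from coefficients learned on standardized inputs back to the raw feature scale, used in the last print of the code above, can be checked with a small pure-Python sketch. The data and the coefficients `W_std` and `b_std` below are made up for illustration; only the formulas mirror the example.

```python
# Made-up data: two features on very different scales (area, rooms)
X = [[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0], [3000.0, 4.0]]
n, d = len(X), len(X[0])

# Per-feature mean and population standard deviation, as StandardScaler computes them
mean = [sum(row[j] for row in X) / n for j in range(d)]
scale = [(sum((row[j] - mean[j]) ** 2 for row in X) / n) ** 0.5 for j in range(d)]
X_std = [[(row[j] - mean[j]) / scale[j] for j in range(d)] for row in X]

# Suppose training on X_std produced these (made-up) coefficients
W_std, b_std = [100000.0, -6000.0], 340000.0

# Map them back to the raw feature scale:
#   W_raw[j] = W_std[j] / scale[j]
#   b_raw    = b_std - sum_j mean[j] / scale[j] * W_std[j]
W_raw = [W_std[j] / scale[j] for j in range(d)]
b_raw = b_std - sum(mean[j] / scale[j] * W_std[j] for j in range(d))

# Both parameterizations give the same predictions
for row, row_std in zip(X, X_std):
    pred_std = sum(w * v for w, v in zip(W_std, row_std)) + b_std
    pred_raw = sum(w * v for w, v in zip(W_raw, row)) + b_raw
    assert abs(pred_std - pred_raw) < 1e-6 * abs(pred_std)
print(W_raw, b_raw)
```

Because the two forms predict identically, the model can be trained on the well-conditioned standardized inputs and still be reported in the original units.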

Logistic Regression

import tensorflow as tf
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

# Read x and y
x_data = np.loadtxt("ex4x.dat").astype(np.float32)
y_data = np.loadtxt("ex4y.dat").astype(np.float32)

scaler = preprocessing.StandardScaler().fit(x_data)
x_data_standard = scaler.transform(x_data)

# We evaluate the x and y by sklearn to get a sense of the coefficients.
reg = LogisticRegression(C=999999999, solver="newton-cg")  # Set C as a large positive number to minimize the regularization effect
reg.fit(x_data, y_data)
print("Coefficients of sklearn: K=%s, b=%f" % (reg.coef_, reg.intercept_[0]))

# Now we use tensorflow to get similar results.
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))
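# The negation must cover the whole linear term, bias included: sigmoid(z) = 1 / (1 + exp(-z)) with z = XW + b.
# (Writing -matmul(...) + b instead makes the learned b come out with its sign flipped.)
y = 1 / (1 + tf.exp(-(tf.matmul(x_data_standard, W) + b)))
loss = tf.reduce_mean(- y_data.reshape(-1, 1) * tf.log(y) - (1 - y_data.reshape(-1, 1)) * tf.log(1 - y))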

optimizer = tf.train.GradientDescentOptimizer(1.3)
train = optimizer.minimize(loss)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()

sess = tf.Session()
sess.run(init)
for step in range(100):
    sess.run(train)
    if step % 10 == 0:
        print(step, sess.run(W).flatten(), sess.run(b).flatten())

print("Coefficients of tensorflow (input should be standardized): K=%s, b=%s" % (sess.run(W).flatten(), sess.run(b).flatten()))
print("Coefficients of tensorflow (raw input): K=%s, b=%s" % (sess.run(W).flatten() / scaler.scale_, sess.run(b).flatten() - np.dot(scaler.mean_ / scaler.scale_, sess.run(W))))


# Problem solved and we are happy. But...
# I'd like to implement logistic regression from a multi-class viewpoint instead of a binary one.
# In the machine learning domain, it is called softmax regression.
# In economics and statistics, it is called the multinomial logit (MNL) model, proposed by Daniel McFadden, who shared the 2000 Nobel Memorial Prize in Economic Sciences.

print("------------------------------------------------")
print("We solve this binary classification problem again from the viewpoint of multinomial classification")
print("------------------------------------------------")

# Following tradition, sklearn first
reg = LogisticRegression(C=9999999999, solver="newton-cg", multi_class="multinomial")
reg.fit(x_data, y_data)
print("Coefficients of sklearn: K=%s, b=%f" % (reg.coef_, reg.intercept_[0]))
print("A little bit difference at first glance. What about multiply them with 2?")

# Then try tensorflow
W = tf.Variable(tf.zeros([2, 2]))  # first 2 is feature number, second 2 is class number
b = tf.Variable(tf.zeros([1, 2]))
V = tf.matmul(x_data_standard, W) + b
y = tf.nn.softmax(V)  # tensorflow provides a utility function for the probability that individual n chooses alternative i; it is equivalent to `y = tf.exp(V) / tf.reduce_sum(tf.exp(V), keep_dims=True, reduction_indices=[1])`

# Encode the y label in one-hot manner
lb = preprocessing.LabelBinarizer()
lb.fit(y_data)
y_data_trans = lb.transform(y_data)
y_data_trans = np.concatenate((1 - y_data_trans, y_data_trans), axis=1)  # Only necessary for binary class 

loss = tf.reduce_mean(-tf.reduce_sum(y_data_trans * tf.log(y), reduction_indices=[1]))
optimizer = tf.train.GradientDescentOptimizer(1.3)
train = optimizer.minimize(loss)

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()

sess = tf.Session()
sess.run(init)
for step in range(100):
    sess.run(train)
    if step % 10 == 0:
        print(step, sess.run(W).flatten(), sess.run(b).flatten())

print("Coefficients of tensorflow (input should be standardized): K=%s, b=%s" % (sess.run(W).flatten(), sess.run(b).flatten()))
# scale_ holds one value per feature (one per row of W), so reshape it to divide W row by row
print("Coefficients of tensorflow (raw input): K=%s, b=%s" % ((sess.run(W) / scaler.scale_.reshape(-1, 1)).flatten(), sess.run(b).flatten() - np.dot(scaler.mean_ / scaler.scale_, sess.run(W))))
Output:

Coefficients of sklearn: K=[[ 0.14834077  0.15890845]], b=-16.378743
0 [ 0.33699557  0.34786162] [ -4.84287721e-09]
10 [ 1.15830743  1.22841871] [ 0.02142336]
20 [ 1.3378191   1.42655993] [ 0.03946959]
30 [ 1.40735555  1.50197577] [ 0.04853692]
40 [ 1.43754184  1.53418231] [ 0.05283691]
50 [ 1.45117068  1.54856908] [ 0.05484771]
60 [ 1.45742035  1.55512536] [ 0.05578374]
70 [ 1.46030474  1.55814099] [ 0.05621871]
80 [ 1.46163988  1.55953443] [ 0.05642065]
90 [ 1.46225858  1.56017959] [ 0.0565144]
Coefficients of tensorflow (input should be standardized): K=[ 1.46252561  1.56045783], b=[ 0.05655487]
Coefficients of tensorflow (raw input): K=[ 0.14831361  0.15888004], b=[-16.26265144]
------------------------------------------------
We solve this binary classification problem again from the viewpoint of multinomial classification
------------------------------------------------
Coefficients of sklearn: K=[[ 0.07417039  0.07945423]], b=-8.189372
A little bit difference at first glance. What about multiply them with 2?
0 [-0.33699557  0.33699557 -0.34786162  0.34786162] [  6.05359674e-09  -6.05359674e-09]
10 [-0.68416572  0.68416572 -0.72988117  0.72988123] [ 0.02157043 -0.02157041]
20 [-0.72234094  0.72234106 -0.77087188  0.77087194] [ 0.02693938 -0.02693932]
30 [-0.72958517  0.72958535 -0.7784785   0.77847856] [ 0.02802362 -0.02802352]
40 [-0.73103166  0.73103184 -0.77998811  0.77998811] [ 0.02824244 -0.02824241]
50 [-0.73132294  0.73132324 -0.78029168  0.78029174] [ 0.02828659 -0.02828649]
60 [-0.73138171  0.73138207 -0.78035289  0.78035301] [ 0.02829553 -0.02829544]
70 [-0.73139352  0.73139393 -0.78036523  0.78036535] [ 0.02829732 -0.0282972 ]
80 [-0.73139596  0.73139632 -0.78036767  0.78036791] [ 0.02829764 -0.02829755]
90 [-0.73139644  0.73139679 -0.78036815  0.78036839] [ 0.02829781 -0.02829765]
Coefficients of tensorflow (input should be standardized): K=[-0.7313965   0.73139679 -0.78036827  0.78036839], b=[ 0.02829777 -0.02829769]
Coefficients of tensorflow (raw input): K=[-0.07417037  0.07446811 -0.07913655  0.07945422], b=[ 8.1893692  -8.18937111]
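  • For logistic regression, the loss function is somewhat more complex than for linear regression. First, the sigmoid function converts the output of the linear model into a probability between 0 and 1. We then write down the probability (likelihood) of each sample; the probability of all samples together is the product of the individual probabilities. To make differentiation easier, we take the logarithm of that product, which preserves monotonicity and turns the product into a sum (the derivative rule for sums is much simpler than the one for products). Maximum log-likelihood estimation maximizes this quantity; machine learning conventionally calls the objective a loss, so the loss is defined as the negative log-likelihood, turning the task into a minimization problem.
  • When we say logistic regression we usually mean binary classification, but the same idea extends easily to multi-class problems, where the machine learning community usually calls it softmax regression. The author of this article has a background in statistics and econometrics and therefore usually calls it the MNL model.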
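
The relationship between the binary and the multinomial formulation can be checked with a minimal sketch using only the standard library. The values of `k`, `b`, `xs` and `labels` are made up for illustration. The sketch shows that a two-class softmax with scores -z/2 and +z/2 gives the same probability as sigmoid(z), which is one way to read why the multinomial coefficients in the output above are about half of the binary ones.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtracting the maximum improves numerical stability without changing the result
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs_of_class1, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], the loss described in the text."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(probs_of_class1, labels)]
    return sum(terms) / len(terms)


# Made-up binary model: z = k * x + b
k, b = 1.5, -0.2
xs = [-2.0, -0.5, 0.3, 1.0]
labels = [0, 0, 1, 1]

p_binary = [sigmoid(k * x + b) for x in xs]
# Symmetric two-class softmax: class 0 scores -z/2, class 1 scores +z/2
p_softmax = [softmax([-(k * x + b) / 2, (k * x + b) / 2])[1] for x in xs]

for pb, ps in zip(p_binary, p_softmax):
    assert abs(pb - ps) < 1e-12

nll = negative_log_likelihood(p_binary, labels)
print(nll)
```

Since both formulations assign identical probabilities, they also share the same negative log-likelihood; only the parameterization differs.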
