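In this section we use Theano to implement the most basic classifier: logistic regression.
The complete code can be downloaded for free from my CSDN downloads: http://download.csdn.net/detail/ws_20100/9222263.
We start with the model.
Logistic regression is a probabilistic, linear classifier. It is parameterized by a weight matrix W and a bias vector b. Classification is done by projecting an input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane reflects the probability that the input belongs to the corresponding class.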
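Mathematically, the probability that an input vector x belongs to class i (a value of the stochastic variable Y) is written as:
$$P(Y=i \mid x, W, b) = \mathrm{softmax}_i(Wx + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}$$
The model's prediction y_pred is the class whose probability is maximal:
$$y_{pred} = \mathrm{argmax}_i \, P(Y=i \mid x, W, b)$$
The Theano code for this is as follows: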
# initialize with 0 the weights W as a matrix of shape (n_in, n_out)
self.W = theano.shared(
    value=numpy.zeros(
        (n_in, n_out),
        dtype=theano.config.floatX
    ),
    name='W',
    borrow=True
)
# initialize the biases b as a vector of n_out 0s
self.b = theano.shared(
    value=numpy.zeros(
        (n_out,),
        dtype=theano.config.floatX
    ),
    name='b',
    borrow=True
)

# symbolic expression for computing the matrix of class-membership
# probabilities
# Where:
# W is a matrix where column-k represent the separation hyperplane for
# class-k
# x is a matrix where row-j represents input training sample-j
# b is a vector where element-k represent the free parameter of
# hyperplane-k
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

# symbolic description of how to compute prediction as class whose
# probability is maximal
self.y_pred = T.argmax(self.p_y_given_x, axis=1)
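Since the parameters of the model must maintain a persistent state throughout training, we allocate W and b as shared variables. This declares them as symbolic Theano variables and also initializes their contents. The dot and softmax operators are then used to compute the vector P(Y|x,W,b). The result p_y_given_x is a symbolic variable that holds one vector of class probabilities per example (for a minibatch, a matrix with one row per example). To get the actual model prediction, we use the T.argmax operator, which returns the index at which p_y_given_x is maximal (i.e., the class with the highest probability).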
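To make the dot, softmax, and argmax steps concrete, here is a minimal sketch in plain Python (no Theano) for a single input vector. The function names and the toy values of W, b, and x are illustrative assumptions, not part of the tutorial code.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(x, W, b):
    """Compute P(Y = k | x, W, b) for every class k.

    x is a list of n_in features, W is an n_in x n_out nested list
    (column k is the hyperplane of class k), b is a list of n_out biases.
    """
    scores = [sum(x[i] * W[i][k] for i in range(len(x))) + b[k]
              for k in range(len(b))]
    return softmax(scores)


def predict(x, W, b):
    """Return the index of the most probable class (the argmax)."""
    probs = class_probabilities(x, W, b)
    return max(range(len(probs)), key=lambda k: probs[k])


# Toy example with 2 features and 3 classes (made-up values).
W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.5]]
b = [0.0, 0.1, 0.0]
x = [0.5, 1.0]
probs = class_probabilities(x, W, b)
# The raw scores are 0.5, 2.1 and 0.0, so class 1 is the prediction.
print(probs, predict(x, W, b))
```

With all-zero W and b, as in the initialization above, every class gets the same probability 1/n_out, which is why the untrained model cannot yet distinguish anything.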
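At this point the model we have defined cannot do anything useful, because all of its parameters are still in their initial state. The following sections describe how to learn the optimal parameters.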
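Learning the optimal model parameters involves minimizing a loss function. For multi-class classification, a very natural choice is the negative log-likelihood. Minimizing it is equivalent to maximizing the likelihood of the data set D under the model parameterized by θ. We start by defining the log-likelihood L and the loss ℓ:
$$\mathcal{L}(\theta=\{W,b\}, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)} \mid x^{(i)}, W, b))$$
$$\ell(\theta=\{W,b\}, \mathcal{D}) = -\mathcal{L}(\theta=\{W,b\}, \mathcal{D})$$
The following Theano code defines the (symbolic) loss for a given minibatch: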
# y.shape[0] is (symbolically) the number of rows in y, i.e.,
# the number of examples (call it n) in the minibatch.
# T.arange(y.shape[0]) is a symbolic vector which will contain
# [0, 1, 2, ..., n-1].
# T.log(self.p_y_given_x) is a matrix of log-probabilities (call it
# LP) with one row per example and one column per class.
# LP[T.arange(y.shape[0]), y] is a vector v containing
# [LP[0, y[0]], LP[1, y[1]], LP[2, y[2]], ..., LP[n-1, y[n-1]]], and
# T.mean(LP[T.arange(y.shape[0]), y]) is the mean (across minibatch
# examples) of the elements in v, i.e., the mean log-likelihood
# across the minibatch.
return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
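Although the loss is formally defined as a sum of individual error terms over the data set, the code uses the mean (T.mean) instead, so that the learning rate depends less on the minibatch size.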
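The effect of averaging rather than summing can be checked with a small plain-Python sketch. The function name, the `reduce` argument, and the probability values are assumptions made for illustration; the list comprehension plays the role of LP[T.arange(n), y] in the Theano code.

```python
import math


def negative_log_likelihood(probs, labels, reduce="mean"):
    """Negative log-likelihood of the correct labels.

    probs[i][k] is the predicted probability of class k for example i;
    labels[i] is the index of the correct class of example i.
    """
    # Pick the log-probability of the correct class for each example.
    log_probs = [math.log(probs[i][labels[i]]) for i in range(len(labels))]
    total = -sum(log_probs)
    return total / len(labels) if reduce == "mean" else total


# Made-up predictions for a minibatch of two examples and three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]

# Doubling the minibatch (same examples twice) leaves the mean unchanged
# but doubles the sum, and with it the size of a gradient step.
mean_small = negative_log_likelihood(probs, labels)
mean_big = negative_log_likelihood(probs * 2, labels * 2)
sum_small = negative_log_likelihood(probs, labels, reduce="sum")
sum_big = negative_log_likelihood(probs * 2, labels * 2, reduce="sum")
print(mean_small, mean_big, sum_small, sum_big)
```

Because the summed loss scales with the number of examples, a learning rate tuned for one minibatch size would have to be retuned for another; the mean avoids most of that coupling.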
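We now define a LogisticRegression class that encapsulates the basic behaviour of logistic regression. The code below covers several aspects and is thoroughly commented.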
import numpy
import theano
import theano.tensor as T


class LogisticRegression(object):
    """Multi-class Logistic Regression Class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie
        """
        # start-snippet-1
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression for computing the matrix of class-membership
        # probabilities
        # Where:
        # W is a matrix where column-k represents the separation hyperplane
        # for class-k
        # x is a matrix where row-j represents input training sample-j
        # b is a vector where element-k represents the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        # end-snippet-1

        # parameters of the model
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        # raw docstring, so that the LaTeX backslashes are not read as escapes
        r"""Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.

        .. math::

            \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
            \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|}
                \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
            \ell (\theta=\{W,b\}, \mathcal{D})

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        # start-snippet-2
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # the number of examples (call it n) in the minibatch.
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0, 1, 2, ..., n-1].
        # T.log(self.p_y_given_x) is a matrix of log-probabilities (call it
        # LP) with one row per example and one column per class.
        # LP[T.arange(y.shape[0]), y] is a vector v containing
        # [LP[0, y[0]], LP[1, y[1]], LP[2, y[2]], ..., LP[n-1, y[n-1]]], and
        # T.mean(LP[T.arange(y.shape[0]), y]) is the mean (across minibatch
        # examples) of the elements in v, i.e., the mean log-likelihood
        # across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
        # end-snippet-2

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """
        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()
How do we instantiate the LogisticRegression class? The following code shows how:
# generate symbolic variables for input (x and y represent a
# minibatch)
x = T.matrix('x') # data, presented as rasterized images
y = T.ivector('y') # labels, presented as 1D vector of [int] labels
# construct the logistic regression class
# Each MNIST image has size 28*28
classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)
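In the code above, we first define symbolic variables for the input x and its corresponding label y. Note that x and y are defined outside the LogisticRegression object. Because the class takes its input as a parameter of its __init__ function, this design is very useful when you want to connect instances of such classes to form a deep network: the output of one layer can be passed as the input of the next. (No multi-layer network is built here, but the code can be reused in one.)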
Finally, we define a (symbolic) cost variable to minimize, using the instance method classifier.negative_log_likelihood.
# the cost we minimize during training is the negative log likelihood of
# the model in symbolic format
cost = classifier.negative_log_likelihood(y)
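Note that x is an implicit symbolic input to the definition of cost, because the symbolic variables of classifier were defined in terms of x at initialization.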
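To implement minibatch stochastic gradient descent (MSGD) in most programming languages (C/C++, Matlab, Python), one would start by manually deriving the expressions for the gradient of the loss with respect to the parameters: ∂ℓ/∂W and ∂ℓ/∂b. For complex models this gets tricky, because the expression for ∂ℓ/∂θ can become fairly complicated, especially once numerical stability has to be taken into account.
With Theano, this work is greatly simplified. It performs automatic differentiation and applies certain math transforms to improve numerical stability.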
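For comparison, a hand-derived gradient for this model might look like the sketch below, written in plain Python rather than Theano. It relies on the standard result that, for the mean negative log-likelihood of a softmax model, ∂ℓ/∂W[i][k] = (1/n) Σ_j x_j[i] (p_j[k] − 1[y_j = k]) and ∂ℓ/∂b[k] = (1/n) Σ_j (p_j[k] − 1[y_j = k]). All helper names and toy values are illustrative assumptions; the finite-difference function is there to catch derivation mistakes.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(X, W, b):
    """Class probabilities for every row of X."""
    return [softmax([sum(x[i] * W[i][k] for i in range(len(x))) + b[k]
                     for k in range(len(b))]) for x in X]


def mean_nll(X, y, W, b):
    """Mean negative log-likelihood of labels y under the model."""
    P = forward(X, W, b)
    return -sum(math.log(P[j][y[j]]) for j in range(len(y))) / len(y)


def manual_gradients(X, y, W, b):
    """Hand-derived gradients of mean_nll with respect to W and b."""
    n, n_in, n_out = len(X), len(W), len(b)
    P = forward(X, W, b)
    g_W = [[0.0] * n_out for _ in range(n_in)]
    g_b = [0.0] * n_out
    for j in range(n):
        for k in range(n_out):
            # (predicted probability - one-hot target), averaged over n
            delta = (P[j][k] - (1.0 if y[j] == k else 0.0)) / n
            g_b[k] += delta
            for i in range(n_in):
                g_W[i][k] += X[j][i] * delta
    return g_W, g_b


def numeric_gradient_W(X, y, W, b, i, k, eps=1e-6):
    """Central finite difference for a single entry W[i][k]."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][k] += eps
    W_minus[i][k] -= eps
    return (mean_nll(X, y, W_plus, b) - mean_nll(X, y, W_minus, b)) / (2 * eps)


# Toy data: 3 examples, 2 features, 3 classes (made-up values).
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y = [0, 2, 1]
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b = [0.0, 0.1, -0.1]
g_W, g_b = manual_gradients(X, y, W, b)
print(g_W, g_b)
```

Even for this simple model the manual version needs its own stability trick and a separate correctness check, which is the bookkeeping that automatic differentiation removes.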
To obtain ∂ℓ/∂W and ∂ℓ/∂b in Theano, the following code is all that is needed:
g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)
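g_W and g_b are symbolic variables that can be used as part of a computation graph. The function train_model, which performs one step of gradient descent, can then be defined as follows: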
# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs.
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]

# compiling a Theano function `train_model` that returns the cost, but in
# the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
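updates is a list of pairs. In each pair, the first element is the symbolic variable to be updated, and the second element is the symbolic expression for computing its new value. Similarly, givens is a dictionary whose keys are symbolic variables and whose values specify their replacements during the step. The function train_model is then defined such that its input is the minibatch index index, which, together with the batch size (not an input, since it is fixed), defines x and its corresponding labels y. Its return value is the cost/loss associated with the x and y selected by index. On every call, it first replaces x and y with the slices of the training set specified by index, then evaluates the cost for that minibatch and applies the operations defined by updates.
Each call to train_model(index) therefore computes and returns the cost of one minibatch while also performing one step of MSGD. The complete learning algorithm loops over all examples in the data set, considering the examples of one minibatch at a time, and calls train_model repeatedly.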
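The loop structure described above can be sketched without Theano as follows. Here `train_step` is a hypothetical stand-in for train_model that receives the minibatch slices directly (Theano's givens performs that substitution internally), and the gradients are the hand-derived softmax gradients rather than the output of T.grad. The data set, learning rate, and epoch count are made-up values.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(W, b, batch_x, batch_y, learning_rate):
    """One MSGD step: return the minibatch cost and update W, b in place."""
    n, n_in, n_out = len(batch_x), len(W), len(b)
    g_W = [[0.0] * n_out for _ in range(n_in)]
    g_b = [0.0] * n_out
    cost = 0.0
    for x, label in zip(batch_x, batch_y):
        p = softmax([sum(x[i] * W[i][k] for i in range(n_in)) + b[k]
                     for k in range(n_out)])
        cost -= math.log(p[label]) / n
        for k in range(n_out):
            delta = (p[k] - (1.0 if k == label else 0.0)) / n
            g_b[k] += delta
            for i in range(n_in):
                g_W[i][k] += x[i] * delta
    # the counterpart of the `updates` list: parameter <- parameter - lr * grad
    for k in range(n_out):
        b[k] -= learning_rate * g_b[k]
        for i in range(n_in):
            W[i][k] -= learning_rate * g_W[i][k]
    return cost


# Toy training set: class 0 has a negative feature, class 1 a positive one.
train_set_x = [[-2.0], [2.0], [-1.5], [1.5], [-1.0], [1.0], [-0.5], [0.5]]
train_set_y = [0, 1, 0, 1, 0, 1, 0, 1]
batch_size = 2
n_train_batches = len(train_set_x) // batch_size
W = [[0.0, 0.0]]
b = [0.0, 0.0]

epoch_costs = []
for epoch in range(20):
    costs = []
    for index in range(n_train_batches):
        # the same slicing that `givens` applies to x and y
        lo, hi = index * batch_size, (index + 1) * batch_size
        costs.append(train_step(W, b, train_set_x[lo:hi],
                                train_set_y[lo:hi], learning_rate=0.5))
    epoch_costs.append(sum(costs) / len(costs))
print(epoch_costs[0], epoch_costs[-1])
```

Only the minibatch index changes between calls; the batch size is fixed outside the step function, mirroring the way train_model takes index as its sole input.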
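When testing the model, what we care about is the number of misclassified examples. The LogisticRegression class therefore has an extra instance method that builds the symbolic expression for the fraction of misclassified examples in a minibatch. The code is as follows: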
def errors(self, y):
    """Return a float representing the number of errors in the minibatch
    over the total number of examples of the minibatch ; zero one
    loss over the size of the minibatch

    :type y: theano.tensor.TensorType
    :param y: corresponds to a vector that gives for each example the
              correct label
    """
    # check if y has same dimension of y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
    else:
        raise NotImplementedError()
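The quantity returned by T.mean(T.neq(self.y_pred, y)) is the zero-one loss averaged over the minibatch. A plain-Python equivalent, with an illustrative function name and made-up labels, could look like this:

```python
def error_rate(y_pred, y):
    """Fraction of positions where the prediction differs from the label."""
    if len(y_pred) != len(y):
        raise ValueError("y_pred and y must have the same length")
    # 1 marks a mistake, 0 a correct prediction (the role of T.neq)
    mistakes = [1 if p != t else 0 for p, t in zip(y_pred, y)]
    return sum(mistakes) / float(len(y))


y_pred = [3, 1, 4, 1, 5]
y = [3, 1, 4, 2, 6]
rate = error_rate(y_pred, y)  # 2 mistakes out of 5 examples
print(rate)
```

Note that the result is a rate between 0 and 1, not a raw count, which is why the training log reports errors as percentages.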
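We then create a function test_model and a function validate_model, which we can call to retrieve this value. As you will see, validate_model is the key to deciding when to stop training early. Both functions take a minibatch index as input and compute the fraction of examples in that minibatch that the model misclassifies. The only difference between them is that test_model draws its minibatches from the test set, while validate_model draws them from the validation set.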
# compiling a Theano function that computes the mistakes that are made by
# the model on a minibatch
test_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size],
        y: test_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

validate_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: valid_set_x[index * batch_size: (index + 1) * batch_size],
        y: valid_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
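One simple way to use validation error as a stopping criterion is sketched below. This is a simplified, epoch-based illustration and not necessarily the exact rule in logistic_sgd.py; the function name, the `patience` parameter, and the error values are assumptions. In practice each entry of the list would be the mean of validate_model(i) over all validation minibatches.

```python
def early_stopping(validation_errors, patience=3):
    """Scan per-epoch validation errors and stop once `patience` epochs
    pass without a new best value.

    Returns (best_epoch, best_error, epochs_run).
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, error in enumerate(validation_errors):
        epochs_run = epoch + 1
        if error < best_error:
            # new best model: this is where one would save it and
            # measure its error on the test set
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error, epochs_run


# Made-up validation error rates, one per epoch.
errors_per_epoch = [0.12, 0.10, 0.09, 0.095, 0.091, 0.093, 0.08, 0.07]
result = early_stopping(errors_per_epoch, patience=3)
print(result)  # (2, 0.09, 6): best at epoch 2, stopped after 6 epochs
```

The example also shows the trade-off: with a small patience the loop stops before the later improvements at epochs 6 and 7, so the patience value has to be chosen with the noisiness of the validation curve in mind.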
The complete code is available here: [download: http://download.csdn.net/detail/ws_20100/9222263]
To classify MNIST digits with SGD logistic regression, run the following command:
python code/logistic_sgd.py
The output should look like this:
...
epoch 72, minibatch 83/83, validation error 7.510417 %
epoch 72, minibatch 83/83, test error of best model 7.510417 %
epoch 73, minibatch 83/83, validation error 7.500000 %
epoch 73, minibatch 83/83, test error of best model 7.489583 %
Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 %
The code run for 74 epochs, with 1.936983 epochs/sec
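On a machine with an Intel Core 2 Duo CPU E8400 @ 3.00 GHz, the code runs at approximately 1.936 epochs/sec and reaches a test error of 7.489% after 75 epochs. On a GPU it runs at about 10.0 epochs/sec.
Once training has reached its lowest error, we can reload the model and use it to predict labels for new data. The predict function does exactly that: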
def predict():
    """
    An example of how to load a trained model and use it
    to predict labels.
    """
    # load the saved model; pickle files should be opened in binary mode
    with open('best_model.pkl', 'rb') as f:
        classifier = cPickle.load(f)

    # compile a predictor function
    predict_model = theano.function(
        inputs=[classifier.input],
        outputs=classifier.y_pred)

    # We can test it on some examples from the test set
    dataset = 'mnist.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()

    predicted_values = predict_model(test_set_x[:10])
    print("Predicted values for the first 10 examples in test set:")
    print(predicted_values)
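Theano deep learning tutorial: http://deeplearning.net/tutorial/logreg.html
[Footnotes]
[1] For smaller data sets or simpler models, more sophisticated descent algorithms can be more effective. The code in logistic_cg.py shows how to use SciPy's conjugate gradient solver for the logistic regression task. logistic_cg.py can be downloaded for free from my CSDN downloads: http://download.csdn.net/detail/ws_20100/9223959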