Classifying MNIST Handwritten Digits with Logistic Regression

Introduction

In this section we use Theano to implement the most basic classifier: logistic regression.
All of the code can be downloaded for free from my CSDN downloads page: http://download.csdn.net/detail/ws_20100/9222263.
Let's start with the model.
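
The code snippets throughout this post are excerpts from a single script and assume the usual imports of the Theano tutorials; a minimal set would look like this (load_data, used in the prediction example near the end, is a helper defined in logistic_sgd.py itself):

import cPickle

import numpy

import theano
import theano.tensor as T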

The Model

Logistic regression is a probabilistic, linear classifier. It is parameterized by a weight matrix W and a bias vector b. Classification is done by projecting an input vector onto a set of hyperplanes, one per class; the distance from the input to a hyperplane reflects the probability that the input belongs to the corresponding class.
Mathematically, the probability that an input vector x belongs to class i (a value of the random variable Y) can be written as:

P(Y=i|x,W,b) = \mathrm{softmax}_i(Wx+b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}
The model's prediction y_pred is the class whose probability is maximal, namely:
y_{pred} = \mathrm{argmax}_i \, P(Y=i|x,W,b)
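
As a concrete toy illustration of these two formulas, here is the same computation in plain NumPy for a single input vector; the parameter values below are made up and only serve to show the shapes involved:

import numpy

# Toy example (not part of the tutorial code): 3 input features, 2 classes.
W = numpy.array([[ 0.5, -0.5],
                 [ 0.1,  0.2],
                 [-0.3,  0.4]])      # shape (n_in, n_out)
b = numpy.array([0.0, 0.1])          # shape (n_out,)
x = numpy.array([1.0, 2.0, 3.0])     # one input vector, shape (n_in,)

scores = numpy.dot(x, W) + b                                # W_i x + b_i for every class i
p_y_given_x = numpy.exp(scores) / numpy.exp(scores).sum()   # softmax
y_pred = numpy.argmax(p_y_given_x)                          # most probable class

print(p_y_given_x)   # approximately [0.198, 0.802]; the entries sum to 1
print(y_pred)        # 1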
The Theano code for building this part of the model is the following:

# initialize with 0 the weights W as a matrix of shape (n_in, n_out)
self.W = theano.shared(
    value=numpy.zeros(
        (n_in, n_out),
        dtype=theano.config.floatX
    ),
    name='W',
    borrow=True
)
# initialize the biases b as a vector of n_out 0s
self.b = theano.shared(
    value=numpy.zeros(
        (n_out,),
        dtype=theano.config.floatX
    ),
    name='b',
    borrow=True
)

# symbolic expression for computing the matrix of class-membership
# probabilities
# Where:
# W is a matrix where column-k represent the separation hyperplane for
# class-k
# x is a matrix where row-j represents input training sample-j
# b is a vector where element-k represent the free parameter of
# hyperplane-k
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

# symbolic description of how to compute prediction as class whose
# probability is maximal
self.y_pred = T.argmax(self.p_y_given_x, axis=1)

Since the parameters of the model must maintain a persistent state throughout training, we allocate shared variables for W and b. This declares them as symbolic Theano variables and also initializes their contents. The dot product and the softmax operator are then used to compute the vector P(Y|x,W,b). The result p_y_given_x is a symbolic variable of vector type. To get the actual prediction of the model, we use the T.argmax operator, which returns the index at which p_y_given_x is maximal (i.e., the class with the highest probability).
Of course, the model we have defined so far does not do anything useful yet, since all of its parameters are still in their initial state. The following sections describe how to learn the optimal parameters.

Defining a Loss Function

Learning optimal model parameters involves minimizing a loss function. For multi-class classification, it is very natural to use the negative log-likelihood as the loss. This is equivalent to maximizing the likelihood of the data set D under the model parameterized by θ. Let us first define the likelihood L and the loss ℓ:

\mathcal{L}(\theta=\{W,b\}, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log\big(P(Y=y^{(i)} | x^{(i)}, W, b)\big)
\ell(\theta=\{W,b\}, \mathcal{D}) = -\mathcal{L}(\theta=\{W,b\}, \mathcal{D})
In optimization theory, the simplest method for minimizing an arbitrary non-linear function is gradient descent.
Here we use its minibatch variant, minibatch stochastic gradient descent (MSGD). The Theano code that defines the loss is the following:

# y.shape[0] is (symbolically) the number of rows in y, i.e.,
# number of examples (call it n) in the minibatch
# T.arange(y.shape[0]) is a symbolic vector which will contain
# [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
# Log-Probabilities (call it LP) with one row per example and
# one column per class LP[T.arange(y.shape[0]),y] is a vector
# v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
# LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
# the mean (across minibatch examples) of the elements in v,
# i.e., the mean log-likelihood across the minibatch.
return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

Although the loss is formally written as a sum of per-example terms over the data set, the code uses the mean (T.mean) instead, so that the learning rate is less dependent on the minibatch size.
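
The fancy-indexing trick in the snippet above is easy to check in plain NumPy; the probabilities below are made up purely for illustration:

import numpy

# Plain NumPy illustration (not tutorial code): pick, for every example, the
# log-probability of its correct class, then average.
p_y_given_x = numpy.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.3, 0.3, 0.4]])   # 3 examples, 3 classes
y = numpy.array([0, 1, 2])                     # correct label of each example

LP = numpy.log(p_y_given_x)                    # matrix of log-probabilities
v = LP[numpy.arange(y.shape[0]), y]            # [LP[0,0], LP[1,1], LP[2,2]]
nll = -v.mean()                                # mean negative log-likelihood

print(nll)   # approximately 0.499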

Creating a LogisticRegression Class

We now define a LogisticRegression class that encapsulates the basic behaviour of logistic regression. The following code covers everything discussed so far and is thoroughly commented.

class LogisticRegression(object):
    """Multi-class Logistic Regression Class The logistic regression is fully described by a weight matrix :math:`W` and bias vector :math:`b`. Classification is done by projecting data points onto a set of hyperplanes, the distance to which is used to determine a class membership probability. """

    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression :type input: theano.tensor.TensorType :param input: symbolic variable that describes the input of the architecture (one minibatch) :type n_in: int :param n_in: number of input units, the dimension of the space in which the datapoints lie :type n_out: int :param n_out: number of output units, the dimension of the space in which the labels lie """
        # start-snippet-1
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression for computing the matrix of class-membership
        # probabilities
        # Where:
        # W is a matrix where column-k represent the separation hyperplane for
        # class-k
        # x is a matrix where row-j represents input training sample-j
        # b is a vector where element-k represent the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        # end-snippet-1

        # parameters of the model
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction of this model under a given target distribution. .. math:: \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ \ell (\theta=\{W,b\}, \mathcal{D}) :type y: theano.tensor.TensorType :param y: corresponds to a vector that gives for each example the correct label Note: we use the mean instead of the sum so that the learning rate is less dependent on the batch size """
        # start-snippet-2
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # number of examples (call it n) in the minibatch
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
        # Log-Probabilities (call it LP) with one row per example and
        # one column per class LP[T.arange(y.shape[0]),y] is a vector
        # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
        # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
        # the mean (across minibatch examples) of the elements in v,
        # i.e., the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
        # end-snippet-2

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch over the total number of examples of the minibatch ; zero one loss over the size of the minibatch :type y: theano.tensor.TensorType :param y: corresponds to a vector that gives for each example the correct label """

        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()

So how is the LogisticRegression class instantiated? Have a look at the following code:

# generate symbolic variables for input (x and y represent a
# minibatch)
x = T.matrix('x')  # data, presented as rasterized images
y = T.ivector('y')  # labels, presented as 1D vector of [int] labels

# construct the logistic regression class
# Each MNIST image has size 28*28
classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)

The code above begins by declaring symbolic variables for the input x and the corresponding labels y. Note that x and y are defined outside the scope of the LogisticRegression object. Since the class takes the input as an argument of its __init__ method, this is useful when you want to connect several such instances to form a deep network: the output of one layer can serve as the input of the next. (We do not build a multi-layer network here, but the code is written so it can be reused in one.)
Finally, we define a (symbolic) cost variable to minimize, using the instance method classifier.negative_log_likelihood:

# the cost we minimize during training is the negative log likelihood of
# the model in symbolic format
cost = classifier.negative_log_likelihood(y)

Note that x is an implicit symbolic input in the definition of cost, because the symbolic variables of classifier were defined in terms of x at initialization time.

Learning the Model

To implement MSGD in most programming languages (C/C++, Matlab, Python), one would start by manually deriving the gradient expressions of the loss with respect to the parameters: in this case ∂ℓ/∂W and ∂ℓ/∂b. For complex models this can become quite tricky, as the expressions for ∂ℓ/∂θ can get fairly involved, especially once numerical stability has to be taken into account.
With Theano, this work is greatly simplified: it performs automatic differentiation and applies certain mathematical transformations to improve numerical stability.
To obtain the gradients ∂ℓ/∂W and ∂ℓ/∂b in Theano, the following two lines are all that is needed:

g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)
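
For reference, the gradient that T.grad derives here agrees with the standard closed form for softmax regression with a mean negative log-likelihood loss (this derivation is not part of the tutorial itself):

\frac{\partial \ell}{\partial W_{\cdot j}} = \frac{1}{n} \sum_{i=1}^{n} \big(P(Y=j|x^{(i)},W,b) - \mathbf{1}[y^{(i)}=j]\big)\, x^{(i)}, \qquad \frac{\partial \ell}{\partial b_j} = \frac{1}{n} \sum_{i=1}^{n} \big(P(Y=j|x^{(i)},W,b) - \mathbf{1}[y^{(i)}=j]\big)

where W_{\cdot j} is column j of W (the parameters of class j), 1[\cdot] is the indicator function and n is the minibatch size.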

g_W and g_b are symbolic variables that can be used in further computations. The function train_model, which performs one step of gradient descent, can then be defined as follows:

# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs.
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]

# compiling a Theano function `train_model` that returns the cost, but in
# the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

updates is a list of pairs. In each pair, the first element is the symbolic variable to be updated and the second element is the symbolic expression that computes its new value. Similarly, givens is a dictionary whose keys are symbolic variables and whose values specify their replacements. The function train_model is defined such that:

  • its input is the minibatch index index, which together with the batch size (this is not an input, since it is a fixed value) defines x and the corresponding labels y;
  • its return value is the cost/loss computed on the x and y defined by that index;
  • on every call, it first replaces x and y with the slices of the training set specified by index; it then evaluates the cost for that minibatch and applies the operations defined by updates.

Each call to train_model(index) therefore computes and returns the cost of a minibatch while also performing one step of MSGD. The whole learning algorithm thus consists of looping over all examples of the data set, considering the examples of one minibatch at a time, and calling train_model repeatedly.
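
A minimal sketch of that outer loop could look as follows. The full logistic_sgd.py script adds early stopping and periodic validation on top of this; n_train_batches is assumed to be the number of minibatches in the training set, derived from the same train_set_x and batch_size used above.

# minimal sketch of the training loop (no early stopping, no validation)
n_train_batches = train_set_x.get_value(borrow=True).shape[0] // batch_size
n_epochs = 1000

for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        # one call = one MSGD step on one minibatch, returning its cost
        minibatch_avg_cost = train_model(minibatch_index)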

Testing the Model

When testing the model, we are interested in the number of misclassified examples (and not just in the likelihood). The LogisticRegression class therefore has an extra instance method that builds the symbolic graph for retrieving the number of misclassified examples in each minibatch. The code is as follows:

def errors(self, y):
    """Return a float representing the number of errors in the minibatch over the total number of examples of the minibatch ; zero one loss over the size of the minibatch :type y: theano.tensor.TensorType :param y: corresponds to a vector that gives for each example the correct label """

    # check if y has same dimension of y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
    else:
        raise NotImplementedError()

We then create a function test_model and a function validate_model, which we can call to retrieve this error count. As you will see shortly, validate_model is the key to the early-stopping mechanism. Both functions take a minibatch index and compute, for the examples of that minibatch, the number that were misclassified; the only difference between them is that test_model draws its minibatches from the test set, while validate_model draws them from the validation set.

# compiling a Theano function that computes the mistakes that are made by
# the model on a minibatch
test_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size],
        y: test_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

validate_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: valid_set_x[index * batch_size: (index + 1) * batch_size],
        y: valid_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
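
Both functions return the zero-one loss on a single minibatch, so the error rate over an entire set is simply the average over all of its minibatches. A small sketch, assuming n_valid_batches and n_test_batches hold the number of minibatches in the validation and test sets:

# average the per-minibatch zero-one losses to get the error rate on a set
validation_losses = [validate_model(i) for i in range(n_valid_batches)]
this_validation_loss = numpy.mean(validation_losses)

test_losses = [test_model(i) for i in range(n_test_batches)]
test_score = numpy.mean(test_losses)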

Putting It All Together

The complete code can be downloaded here: http://download.csdn.net/detail/ws_20100/9222263
You can train the SGD logistic regression classifier on the MNIST digits by typing the following command:

python code/logistic_sgd.py

The output should look something like this:

...
epoch 72, minibatch 83/83, validation error 7.510417 %
     epoch 72, minibatch 83/83, test error of best model 7.510417 %
epoch 73, minibatch 83/83, validation error 7.500000 %
     epoch 73, minibatch 83/83, test error of best model 7.489583 %
Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 %
The code run for 74 epochs, with 1.936983 epochs/sec

On a machine with an Intel Core 2 Duo CPU E8400 @ 3.00 GHz, the code runs at roughly 1.936 epochs/sec and reaches a test error of 7.489% after 75 epochs. On a GPU it runs at roughly 10.0 epochs/sec.

Using a Trained Model for Prediction

Once training has driven the validation error to its minimum, we can reload the saved model and use it to predict labels for new data; the predict function does exactly that:

def predict():
    """ An example of how to load a trained model and use it to predict labels. """

    # load the saved model
    classifier = cPickle.load(open('best_model.pkl'))

    # compile a predictor function
    predict_model = theano.function(
        inputs=[classifier.input],
        outputs=classifier.y_pred)

    # We can test it on some examples from the test set
    dataset='mnist.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()

    predicted_values = predict_model(test_set_x[:10])
    print ("Predicted values for the first 10 examples in test set:")
    print predicted_values
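
The predict function assumes that the training script has already written best_model.pkl. A minimal sketch of that saving step (executed inside the training loop whenever a new best validation error is reached) could look like this:

# save the best model so far; `classifier` is the LogisticRegression instance
with open('best_model.pkl', 'wb') as f:
    cPickle.dump(classifier, f)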

References

Theano deep learning tutorial: http://deeplearning.net/tutorial/logreg.html

Footnotes
[1] For smaller data sets and simpler models, more sophisticated descent algorithms can be more effective. The logistic_cg.py code demonstrates how to use SciPy's conjugate gradient solver for logistic regression. It can be downloaded for free from my CSDN downloads page: http://download.csdn.net/detail/ws_20100/9223959
