Deep Learning: Deep Belief Networks

[Hinton06] showed that RBMs can be stacked and trained in a greedy manner to form a Deep Belief Network (DBN). DBNs are graphical models that learn to extract a deep hierarchical representation of the training data. They model the joint distribution between the observed vector x and the \ell hidden layers h^k as follows:

P(x, h^1, \ldots, h^{\ell}) = \left(\prod_{k=0}^{\ell-2} P(h^k|h^{k+1})\right) P(h^{\ell-1},h^{\ell})

where x = h^0, P(h^{k-1} | h^k) is the conditional distribution of the visible units given the hidden units of the RBM at level k, and P(h^{\ell-1}, h^{\ell}) is the visible-hidden joint distribution of the top-level RBM. This is illustrated in the figure below:
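As a concrete instance of the factorization above, for a DBN with two hidden layers (\ell = 2) the product contains a single factor and the joint distribution reduces to

P(x, h^1, h^2) = P(x | h^1)\, P(h^1, h^2)

that is, a directed (sigmoid) layer P(x | h^1) below the RBM formed by the top two layers.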

[Figure 1: Deep Belief Network]

The principle of greedy layer-wise unsupervised training can be applied to DBNs, with RBMs as the building blocks for each layer. The process is as follows:

1. Train the first layer as an RBM that models the raw input x = h^{(0)} as its visible layer.

2. Use that first layer to obtain a representation of the input that will serve as data for the second layer. Two choices exist: this representation can be either the mean activations p(h^{(1)}=1|h^{(0)}) or samples drawn from p(h^{(1)}|h^{(0)}) (a minimal sketch of this propagation step is given after the list).

3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).

4. Iterate steps 2 and 3 for the desired number of layers, each time propagating upward either samples or mean activations.

5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
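The following is a minimal NumPy sketch of the propagation used in steps 2 and 4 (the function propagate_up and its arguments are illustrative names, not part of the tutorial code): given the weights W and hidden biases b of a trained RBM, it returns either the mean activations or binary samples of the hidden layer.

import numpy as np

def propagate_up(X, W, b, rng, use_sample=False):
    """Map a dataset X through one trained RBM layer.

    Returns the mean activations p(h=1|v) = sigmoid(X.W + b), or binary
    samples drawn from the corresponding Bernoulli distribution.
    """
    mean_h = 1.0 / (1.0 + np.exp(-(X.dot(W) + b)))  # p(h=1|v)
    if use_sample:
        return (rng.uniform(size=mean_h.shape) < mean_h).astype(X.dtype)
    return mean_h

# e.g. data for the second-layer RBM:
# rng = np.random.RandomState(123)
# X1 = propagate_up(X0, W1, b1, rng)          # mean activations
# X1 = propagate_up(X0, W1, b1, rng, True)    # samples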

Here, we focus on fine-tuning via supervised gradient descent. Specifically, we use a logistic regression classifier to classify the input x based on the output of the last hidden layer of the DBN. Fine-tuning is then performed via supervised gradient descent of the negative log-likelihood cost function. Since the supervised gradient is only non-null for the weights and hidden-layer biases of each layer (i.e. null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep MLP with the weights and hidden-layer biases obtained from the unsupervised training strategy.

Justifying Greedy Layer-Wise Pre-Training

Why does such an algorithm work? Taking as an example a 2-layer DBN with hidden layers h^{(1)} and h^{(2)} (with respective weight parameters W^{(1)} and W^{(2)}), [Hinton06] established that \log p(x) can be rewritten as

\log p(x) = KL(Q(h^{(1)}|x)||p(h^{(1)}|x)) + H_{Q(h^{(1)}|x)} + \sum_h Q(h^{(1)}|x)\left(\log p(h^{(1)}) + \log p(x|h^{(1)})\right)

KL(Q(h^{(1)}|x) || p(h^{(1)}|x)) is the KL divergence between the posterior Q(h^{(1)}|x) of the first stand-alone RBM and the probability p(h^{(1)}|x) for the same layer as defined by the entire DBN (i.e. taking into account the prior defined by the top-level RBM).

H_{Q(h^{(1)}|x)} is the entropy of the distribution Q(h^{(1)}|x).

If we initialize both hidden layers such that W^{(2)}={W^{(1)}}^T, then Q(h^{(1)}|x)=p(h^{(1)}|x) and the KL divergence term is null. If we then learn the first-level RBM and keep its parameters W^{(1)} fixed, optimizing the equation above with respect to W^{(2)} can only increase the likelihood p(x).

Note that if we isolate the terms which depend only on W^{(2)}, we get:

\sum_h Q(h^{(1)}|x) \log p(h^{(1)})

Optimizing this with respect to W^{(2)} amounts to training a second-stage RBM, using Q(h^{(1)}|x) as the training distribution, where x is sampled from the training distribution of the first RBM.

Implementation

To implement DBNs in Theano, we use the class defined in the Restricted Boltzmann Machines (RBM) tutorial. One can observe that the code for the DBN is very similar to the code for SdA (stacked denoising autoencoders), because both involve unsupervised layer-wise pre-training followed by supervised fine-tuning of a deep MLP. The main difference is that we use the RBM class instead of the dA class.

We start off by defining the DBN class, which will store the layers of the MLP along with their associated RBMs. Since we take the viewpoint of using the RBMs to initialize an MLP, the code separates as much as possible the RBMs used to initialize the network from the MLP used for classification.

# Imports needed by the code in this tutorial. HiddenLayer, LogisticRegression
# and RBM are the classes introduced in the earlier tutorials; the module names
# below (mlp, logistic_sgd, rbm) follow the layout of the tutorial code and are
# assumed here.
import timeit

import numpy

import theano
import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams

from logistic_sgd import LogisticRegression
from mlp import HiddenLayer
from rbm import RBM


class DBN(object):
    """Deep Belief Network

    A deep belief network is obtained by stacking several RBMs on top of each
    other. The hidden layer of the RBM at layer `i` becomes the input of the
    RBM at layer `i+1`. The first layer RBM gets as input the input of the
    network, and the hidden layer of the last RBM represents the output. When
    used for classification, the DBN is treated as a MLP, by adding a logistic
    regression layer on top.
    """

    def __init__(self, numpy_rng, theano_rng=None, n_ins=784,
                 hidden_layers_sizes=[500, 500], n_outs=10):
        """This class is made to support a variable number of layers.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to draw initial
                    weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type n_ins: int
        :param n_ins: dimension of the input to the DBN

        :type hidden_layers_sizes: list of ints
        :param hidden_layers_sizes: intermediate layers size, must contain
                               at least one value

        :type n_outs: int
        :param n_outs: dimension of the output of the network
        """

        self.sigmoid_layers = []
        self.rbm_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

        if not theano_rng:
            theano_rng = MRG_RandomStreams(numpy_rng.randint(2 ** 30))

        # allocate symbolic variables for the data

        # the data is presented as rasterized images
        self.x = T.matrix('x')

        # the labels are presented as 1D vector of [int] labels
        self.y = T.ivector('y')

self.sigmoid_layers will store the feed-forward graphs that together form the MLP, while self.rbm_layers will store the RBMs used to pre-train each layer of the MLP.

Next, we construct n_layers sigmoid layers (we use the HiddenLayer class introduced in the Multilayer Perceptron tutorial, with the only modification being that we replace the tanh non-linearity with the logistic function s(x) = \frac{1}{1+e^{-x}}) and n_layers RBMs, where n_layers is the depth of our model. We link the sigmoid layers so that they form an MLP, and construct each RBM so that it shares its weight matrix and hidden bias with the corresponding sigmoid layer.

     
        for i in range(self.n_layers):
            # construct the sigmoidal layer

            # the size of the input is either the number of hidden
            # units of the layer below or the input size if we are on
            # the first layer
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            # the input to this layer is either the activation of the
            # hidden layer below or the input of the DBN if you are on
            # the first layer
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)

            # add the layer to our list of layers
            self.sigmoid_layers.append(sigmoid_layer)

            # it's arguably a philosophical question...  but we are
            # going to only declare that the parameters of the
            # sigmoid_layers are parameters of the DBN. The visible
            # biases in the RBM are parameters of those RBMs, but not
            # of the DBN.
            self.params.extend(sigmoid_layer.params)

            # Construct an RBM that shares weights with this layer
            rbm_layer = RBM(numpy_rng=numpy_rng,
                            theano_rng=theano_rng,
                            input=layer_input,
                            n_visible=input_size,
                            n_hidden=hidden_layers_sizes[i],
                            W=sigmoid_layer.W,
                            hbias=sigmoid_layer.b)
            self.rbm_layers.append(rbm_layer)

All that is left is to stack one final logistic regression layer on top, forming an MLP. We use the LogisticRegression class introduced in the Logistic Regression tutorial.

 
        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs)
        self.params.extend(self.logLayer.params)

        # compute the cost for second phase of training, defined as the
        # negative log likelihood of the logistic regression (output) layer
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)

        # compute the gradients with respect to the model parameters
        # symbolic variable that points to the number of errors made on the
        # minibatch given by self.x and self.y
        self.errors = self.logLayer.errors(self.y)

The class also provides a method that generates training functions for each of the RBMs. They are returned as a list, where element i is a function that implements one step of training for the RBM at layer i.

    def pretraining_functions(self, train_set_x, batch_size, k):
        '''Generates a list of functions, for performing one step of
        gradient descent at a given layer. The function will require
        as input the minibatch index, and to train an RBM you just
        need to iterate, calling the corresponding function on all
        minibatch indexes.

        :type train_set_x: theano.tensor.TensorType
        :param train_set_x: Shared var. that contains all datapoints used
                            for training the RBM
        :type batch_size: int
        :param batch_size: size of a [mini]batch
        :param k: number of Gibbs steps to do in CD-k / PCD-k

        '''

        # index to a [mini]batch
        index = T.lscalar('index')  # index to a minibatch

In order to be able to change the learning rate during training, we associate a Theano variable with it that has a default value:

   
        learning_rate = T.scalar('lr')  # learning rate to use

        # beginning of a batch, given `index`
        batch_begin = index * batch_size
        # ending of a batch given `index`
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for rbm in self.rbm_layers:

            # get the cost and the updates list
            # using CD-k here (persistent=None) for training each RBM.
            # TODO: change cost function to reconstruction error
            cost, updates = rbm.get_cost_updates(learning_rate,
                                                 persistent=None, k=k)

            # compile the theano function
            fn = theano.function(
                inputs=[index, theano.In(learning_rate, value=0.1)],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin:batch_end]
                }
            )
            # append `fn` to the list of functions
            pretrain_fns.append(fn)

        return pretrain_fns

Now, any function pretrain_fns[i] takes as arguments index and, optionally, lr, the learning rate.
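For example, a minimal usage sketch (assuming pretrain_fns = dbn.pretraining_functions(train_set_x, batch_size, k) has been called on a DBN instance dbn, which is only constructed later in this tutorial):

# one CD-k update on the first RBM, using the default learning rate (0.1)
cost = pretrain_fns[0](index=0)

# the same update with an explicitly chosen learning rate
cost = pretrain_fns[0](index=0, lr=0.01)

In the same fashion, the DBN class includes a method for building the functions required for fine-tuning: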

 
    def build_finetune_functions(self, datasets, batch_size, learning_rate):
        '''Generates a function `train` that implements one step of
        finetuning, a function `validate` that computes the error on a
        batch from the validation set, and a function `test` that
        computes the error on a batch from the testing set

        :type datasets: list of pairs of theano.tensor.TensorType
        :param datasets: It is a list that contains all the datasets;
                        it has to contain three pairs, `train`,
                        `valid`, `test` in this order, where each pair
                        is formed of two Theano variables, one for the
                        datapoints, the other for the labels
        :type batch_size: int
        :param batch_size: size of a minibatch
        :type learning_rate: float
        :param learning_rate: learning rate used during finetune stage

        '''

        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        # compute number of minibatches for training, validation and testing
        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches //= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches //= batch_size

        index = T.lscalar('index')  # index to a [mini]batch

        # compute the gradients with respect to the model parameters
        gparams = T.grad(self.finetune_cost, self.params)

        # compute list of fine-tuning updates
        updates = []
        for param, gparam in zip(self.params, gparams):
            updates.append((param, param - gparam * learning_rate))

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        # Create a function that scans the entire validation set
        def valid_score():
            return [valid_score_i(i) for i in range(n_valid_batches)]

        # Create a function that scans the entire test set
        def test_score():
            return [test_score_i(i) for i in range(n_test_batches)]

        return train_fn, valid_score, test_score

Note that the returned valid_score and test_score are not Theano functions, but rather Python functions that loop over the entire validation set and the entire test set, respectively, producing a list of the losses over those sets.
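A small usage sketch of these Python functions (dbn, datasets, batch_size and finetune_lr are placeholders assumed to be defined elsewhere in the script):

train_fn, valid_score, test_score = dbn.build_finetune_functions(
    datasets=datasets, batch_size=batch_size, learning_rate=finetune_lr)

# average the per-minibatch losses to get a single score over each set
this_validation_loss = numpy.mean(valid_score())
this_test_loss = numpy.mean(test_score())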

Putting it all together

The few lines of code below construct the Deep Belief Network:

    numpy_rng = numpy.random.RandomState(123)
    print('... building the model')
    # construct the Deep Belief Network
    dbn = DBN(numpy_rng=numpy_rng, n_ins=28 * 28,
              hidden_layers_sizes=[1000, 1000, 1000],
              n_outs=10)

There are two stages in training this network: (1) a layer-wise pre-training stage and (2) a fine-tuning stage.

For pre-training, we loop over all the layers of the network. For each layer, we use the compiled Theano function that determines the input to the i-th level RBM and performs one step of CD-k within that RBM. This function is applied to the training set for a fixed number of epochs given by pretraining_epochs.

    #########################
    # PRETRAINING THE MODEL #
    #########################
    print('... getting the pretraining functions')
    pretraining_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                                batch_size=batch_size,
                                                k=k)

    print('... pre-training the model')
    start_time = timeit.default_timer()
    # Pre-train layer-wise
    for i in range(dbn.n_layers):
        # go through pretraining epochs
        for epoch in range(pretraining_epochs):
            # go through the training set
            c = []
            for batch_index in range(n_train_batches):
                c.append(pretraining_fns[i](index=batch_index,
                                            lr=pretrain_lr))
            print('Pre-training layer %i, epoch %d, cost ' % (i, epoch), end=' ')
            print(numpy.mean(c, dtype='float64'))

    end_time = timeit.default_timer()

The fine-tuning loop is very similar to the one in the Multilayer Perceptron tutorial; the only difference is that we now use the functions given by build_finetune_functions.
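A simplified sketch of that loop, assuming train_fn, valid_score and test_score were obtained as in the earlier sketch, and that training_epochs and n_train_batches are defined elsewhere (variable names here are illustrative; the full early-stopping bookkeeping of the MLP tutorial is omitted):

best_validation_loss = numpy.inf
for epoch in range(training_epochs):
    for minibatch_index in range(n_train_batches):
        train_fn(minibatch_index)              # one supervised gradient step
    this_validation_loss = numpy.mean(valid_score())
    print('epoch %i, validation error %f %%' %
          (epoch, this_validation_loss * 100.))
    if this_validation_loss < best_validation_loss:
        best_validation_loss = this_validation_loss
        test_losses = test_score()
        print('     test error of best model %f %%' %
              (numpy.mean(test_losses) * 100.))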

Running the Code

python code/DBN.py

With the default parameters, the code runs for 100 pre-training epochs with mini-batches of size 10, which corresponds to performing 500,000 unsupervised parameter updates. We use an unsupervised learning rate of 0.01 and a supervised learning rate of 0.1. The DBN itself consists of three hidden layers with 1000 units per layer. With early stopping, this configuration achieved a minimal validation error of 1.27, with a corresponding test error of 1.34, after 46 supervised epochs.

On an Intel(R) Xeon(R) X5560 CPU running at 2.80GHz, using a multi-threaded MKL library (running on 4 cores), pre-training took 615 minutes, with an average of 2.05 minutes per layer per epoch. Fine-tuning took only 101 minutes, or approximately 2.20 minutes per epoch.

Hyper-parameters were selected by optimizing the validation error. We tested a range of unsupervised and supervised learning rates. Apart from early stopping, we did not use any other form of regularization, nor did we optimize over the number of pre-training updates.

Tips and Tricks

One way to improve the running time of your code (assuming you have sufficient memory available) is to compute the representation of the entire dataset at layer i in a single pass, once the weights of layer i-1 are fixed. Namely, start by training the first-layer RBM. Once it is trained, compute the hidden-unit values for every example in the dataset and store them as a new dataset, which is then used to train the second-layer RBM, and so on for the higher layers. This avoids recomputing the intermediate (hidden-layer) representations at every step of pre-training, at the cost of increased memory usage.
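A hedged sketch of this trick (the names transform_i and new_train_set_x are illustrative and not part of DBN.py): compile a Theano function that outputs the fixed representation produced by the layers trained so far, run it once over the training set, and wrap the result in a new shared variable.

# after pre-training up to (and including) sigmoid layer i-1:
transform_i = theano.function(
    inputs=[],
    outputs=dbn.sigmoid_layers[i - 1].output,
    givens={dbn.x: train_set_x}
)
new_train_set_x = theano.shared(
    numpy.asarray(transform_i(), dtype=theano.config.floatX),
    borrow=True
)
# new_train_set_x can now be passed as the (fixed) dataset when
# pre-training the RBM at layer i.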

