Feedforward Neural Network

Also known as: feedforward neural network, deep feedforward network, or multilayer perceptron (MLP).

XOR
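The classic motivating example: no linear model can represent XOR, but a network with one hidden layer of two rectified linear units fits it exactly, using f(x) = w^T \max\{0, W^T x + c\} with all entries of W equal to 1, c = (0, -1)^T, and w = (1, -2)^T. A quick numeric check of this known solution in NumPy:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # the four XOR inputs, one per row
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])

h = np.maximum(0, X @ W + c)  # hidden layer: relu(W^T x + c) for each row of X
print(h @ w)                  # [0 1 1 0], the XOR of each input pair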



Gradient-based Learning


The largest difference between the linear models and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.


Cost Function

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function. The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term.


Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution.


\begin{align*}
J(\theta) &= -\frac{1}{N}\sum_{i=1}^{N}\log p_{\text{model}}(y_i \mid x_i) \\
&= -\sum_{x,y}\frac{\mathrm{count}(x,y)}{N}\log p_{\text{model}}(y \mid x) \\
&= -\sum_{x,y}\hat p_{\text{data}}(x,y)\log p_{\text{model}}(y \mid x) \\
&= -\mathbb{E}_{x,y\sim \hat p_{\text{data}}}\log p_{\text{model}}(y \mid x)
\end{align*}
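A quick numeric check of the identity above, with a made-up toy dataset and model table (both hypothetical):

import numpy as np
from collections import Counter

# Toy (x, y) pairs and an arbitrary conditional model table p_model(y | x).
data = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
p_model = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.9}
N = len(data)

# Per-example form: -(1/N) sum_i log p_model(y_i | x_i)
nll_mean = -np.mean([np.log(p_model[xy]) for xy in data])

# Count-weighted form: -sum_{x,y} (count(x,y)/N) log p_model(y | x)
counts = Counter(data)
nll_counted = -sum(c / N * np.log(p_model[xy]) for xy, c in counts.items())

print(nll_mean, nll_counted)  # equal up to floating-point rounding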

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Many output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units.


Output Units

The choice of cost function is tightly coupled with the choice of output unit.


Linear Units for Gaussian Output Distributions

Given features h, a layer of linear output units produces a vector \hat y = W^T h + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution, p(y \mid x) = \mathcal{N}(y; \hat y, I). Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

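Writing out the negative log-likelihood of a conditional Gaussian with identity covariance makes the equivalence explicit (the second term does not depend on θ):

\begin{align*}
-\log p(y \mid x) &= -\log \mathcal{N}(y; \hat y, I) \\
&= \frac{1}{2}\lVert y - \hat y \rVert^2 + \frac{d}{2}\log(2\pi)
\end{align*}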

Sigmoid Units for Bernoulli Output Distributions

A sigmoid output unit can be used when the output y is a binary variable.

The sigmoid can be motivated by constructing an unnormalized probability distribution P ̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z , we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:


\begin{align*}
z &= W^T h + b \\
\log \tilde P(y) &= yz \\
\tilde P(y) &= \exp(yz) \\
P(y) &= \frac{\exp(yz)}{\sum_{y'=0}^{1}\exp(y'z)} \\
P(y) &= \sigma((2y-1)z) \\
P(y=1) &= \sigma(z) \\
P(y=0) &= 1-\sigma(z) = \sigma(-z)
\end{align*}

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

J(\theta) = -\log P(y \mid x) = -\log \sigma((2y-1)z) = \zeta((1-2y)z)

The following sigmoid and softplus identities are used here:

\begin{align*}
\sigma(x) &= \frac{\exp(x)}{\exp(0)+\exp(x)} = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)} \\
\frac{d\sigma}{dx} &= \sigma(x)(1-\sigma(x)) = \sigma(x)\sigma(-x) \\
\sigma(-x) &= 1-\sigma(x) \\
\zeta(x) &= \log(1+\exp(x)) \\
\log \sigma(x) &= -\zeta(-x) \\
\frac{d\zeta}{dx} &= \sigma(x) \\
\zeta(x)-\zeta(-x) &= x
\end{align*}

By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer: when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.

In short: the loss saturates (the gradient becomes small) only when the prediction already matches the label; when the prediction disagrees with the label, the loss does not saturate and the gradient magnitude approaches 1. This lets gradient-based learning quickly correct a mistaken z.
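A small numeric sketch of this behavior (the specific values of z are arbitrary): the derivative of ζ((1 − 2y)z) with respect to z is (1 − 2y)σ((1 − 2y)z), which is near 0 when the prediction is already correct and near ±1 when z has the wrong sign.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_grad(y, z):
    # d/dz zeta((1 - 2y) z) = (1 - 2y) * sigma((1 - 2y) z)
    return (1 - 2 * y) * sigmoid((1 - 2 * y) * z)

for y, z in [(1, 8.0), (0, -8.0), (1, -8.0), (0, 8.0)]:
    print(y, z, loss_grad(y, z))
# correct sign of z (first two cases): gradient ~ 0
# wrong sign of z (last two cases): gradient ~ -1 and ~ +1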

Softmax Units for Multinoulli Output Distributions

A softmax output unit can be used when the output is a discrete variable that can take one of n values.

The softmax function is

\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize \log P(y = i; z) = \log \mathrm{softmax}(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)

The first term, z_i, contributes directly to the cost and never saturates, so learning can always make progress. The second term, \log \sum_j \exp(z_j), can be roughly approximated by \max_j z_j. When z_i is the largest element of z, the two terms approximately cancel, and this example contributes little to the overall cost; when z_i is not the largest, the cost stays large and the incorrect prediction is penalized.
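Because log softmax(z)_i = z_i − log Σ_j exp(z_j), a numerically stable implementation subtracts max_j z_j before exponentiating; this leaves the result unchanged but prevents overflow. A minimal sketch:

import numpy as np

def log_softmax(z):
    # Shifting by max(z) does not change the result but keeps exp() in range.
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
print(log_softmax(z))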

Hidden Units


Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W^T x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = \max\{0, z\}. The derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent.

Rectified linear units are typically used on top of an affine transformation:

h = g(W^T x + b)

It can be good practice to set all elements of b to a small positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set, allowing the derivatives to pass through.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.
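Several generalizations guarantee a gradient everywhere by using a nonzero slope α_i when z_i < 0: h_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1, a leaky ReLU fixes α_i to a small value such as 0.01, and a parametric ReLU (PReLU) treats α_i as a learnable parameter. A minimal sketch:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # h = max(0, z) + alpha * min(0, z): the small negative slope keeps a
    # gradient flowing even when the unit is not active.
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.    0.    0.    0.5   2.  ]
print(leaky_relu(z))  # [-0.02  -0.005  0.    0.5   2.  ]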

Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = \sigma(z)

or the hyperbolic tangent activation function

g(z) = \tanh(z)

These activation functions are closely related, because \tanh(z) = 2\sigma(2z) - 1.

The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.

Other Hidden Units

Radial basis function (RBF): h_i = \exp\left(-\frac{1}{\sigma_i^2}\lVert W_{:,i} - x \rVert^2\right). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

Softplus: g(a) = \zeta(a) = \log(1 + e^a). This is a smooth version of the rectifier. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

Hard tanh: g(a) = \max(-1, \min(1, a)). This is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded.
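Sketches of these three units in NumPy (here σ_i and the template columns W_{:,i} are per-unit parameters; the values below are made up):

import numpy as np

def rbf(x, W, sigma):
    # h_i = exp(-||W[:, i] - x||^2 / sigma_i^2): active when x is near column i of W
    d2 = np.sum((W - x[:, None]) ** 2, axis=0)
    return np.exp(-d2 / sigma ** 2)

def softplus(a):
    return np.log1p(np.exp(a))  # smooth version of the rectifier

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)  # g(a) = max(-1, min(1, a))

x = np.array([0.5, -1.0])
W = np.array([[0.5, 2.0],
              [-1.0, 2.0]])  # two templates, one per column
print(rbf(x, W, np.array([1.0, 1.0])))  # first unit fully active: exp(0) = 1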

Architecture Design


The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h^{(1)} = g^{(1)}\left({W^{(1)}}^T x + b^{(1)}\right),

the second layer is given by

h^{(2)} = g^{(2)}\left({W^{(2)}}^T h^{(1)} + b^{(2)}\right),

and so on.

In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks are often able to use far fewer units per layer and far fewer parameters, and often generalize better to the test set, but they are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
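A chain architecture reduces to a loop over layers, with depth and per-layer width left as hyperparameters. A minimal NumPy sketch (the sizes and initialization scale below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def init_chain(widths):
    # widths = [input_dim, hidden_1, ..., output_dim]
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(layers, x):
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layers
    W, b = layers[-1]
    return h @ W + b                    # linear output layer

layers = init_chain([4, 16, 16, 3])     # depth and width are design choices
print(forward(layers, rng.normal(size=4)).shape)  # (3,)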

Universal Approximation Properties and Depth

The universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function. However, we are not guaranteed that the training algorithm will be able to learn that function. First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting.

The theorem does not say how large this network will be.

Other Architectural Considerations

Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision are called convolutional networks. Feedforward networks may also be generalized to recurrent neural networks for sequence processing.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.

Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W , every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer.

Back-Propagation and Other Differentiation Algorithms


When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm, often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.

Computational Graphs

We use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type. To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Chain Rule of Calculus

Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
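Concretely, for x ∈ R^m, y = g(x) ∈ R^n, and z = f(y) ∈ R, the chain rule states

\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i},
\qquad \text{or in vector form} \qquad
\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^T \nabla_y z,

where \frac{\partial y}{\partial x} is the n × m Jacobian matrix of g. Back-propagation applies this rule once per operation in the graph, reusing shared subexpressions.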


Recursively Applying the Chain Rule to Obtain Backprop

[TODO]


Back-Propagation Computation in Fully-Connected MLP

[TODO]


Symbol-to-Symbol Derivatives 

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2,3.765,−1.8] .

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach “symbol-to-number” differentiation. This is the approach used by libraries such as Torch and Caffe.

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano and TensorFlow.
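The difference can be made concrete with a toy scalar autodiff class (hypothetical, not any particular library's API). Calling backward() here produces numbers, i.e. symbol-to-number differentiation; a symbol-to-symbol system would instead append new graph nodes that represent the derivative expressions themselves, so they can be evaluated later or differentiated again.

class Var:
    """A node in a computational graph: a value plus parents with local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, grad=1.0):
        # Chain rule: accumulate the upstream gradient into this node,
        # then push grad * (local derivative) to each parent.
        self.grad += grad
        for parent, local in self.parents:
            parent.backward(grad * local)

x, w, b = Var(2.0), Var(3.0), Var(1.0)
y = w * x + b   # y.value == 7.0
y.backward()
print(x.grad, w.grad, b.grad)  # 3.0 2.0 1.0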

General Back-Propagation

[TODO]


Example: Back-Propagation for MLP Training

Suppose the model for a single training example is

a^{(0)} = {W^{(1)}}^T x + b^{(1)} \\
h^{(1)} = \sigma(a^{(0)}) \\
a^{(1)} = {W^{(2)}}^T h^{(1)} + b^{(2)} \\
y = \mathrm{softmax}(a^{(1)}) \\
J = -\sum_{i=1}^{d} \hat y_i \log y_i

(here y is the network output and \hat y is the target),

where x \in R^{n\times 1}, W^{(1)} \in R^{n \times m}, b^{(1)} \in R^{m \times 1}, W^{(2)} \in R^{m \times d}, b^{(2)} \in R^{d\times 1},\\ y, \hat y \in R^{d \times 1}, a^{(0)}, h^{(1)} \in R^{m \times 1}, a^{(1)} \in R^{d \times 1}.

Back-propagation requires the following local derivatives:

\frac{dJ}{dy}=[-\frac{\hat y_1}{y_1}, -\frac{\hat y_2}{y_2}, \ldots,-\frac{\hat y_d}{y_d}]^T \\
\frac{dy}{da^{(1)}}=\left[
\begin{matrix}
 \frac{dy_1}{da^{(1)}_1}      & \frac{dy_1}{da^{(1)}_2}       & \cdots & \frac{dy_1}{da^{(1)}_d}       \\
 \frac{dy_2}{da^{(1)}_1}       & \frac{dy_2}{da^{(1)}_2}       & \cdots & \frac{dy_2}{da^{(1)}_d}       \\
 \vdots & \vdots & \ddots & \vdots \\
 \frac{dy_d}{da^{(1)}_1}       & \frac{dy_d}{da^{(1)}_2}       & \cdots & \frac{dy_d}{da^{(1)}_d}       \\
\end{matrix}
\right]=\left[
\begin{matrix}
 y_1(1-y_1)      & -y_1y_2      & \cdots & -y_1y_d     \\
 -y_2y_1       & y_2(1-y_2)       & \cdots & -y_2y_ d      \\
 \vdots & \vdots & \ddots & \vdots \\
 -y_dy_1       & -y_dy_2       & \cdots & y_d(1-y_d)       \\
\end{matrix}
\right] \\
\frac{\partial a^{(1)}_j}{\partial W^{(2)}_{ij}} = h^{(1)}_i \\
\frac{da^{(1)}}{db^{(2)}}=I \\
\frac{da^{(1)}}{dh^{(1)}}={W^{(2)}}^T \\
\frac{dh^{(1)}}{da^{(0)}}=[h^{(1)}_1(1-h^{(1)}_1),h^{(1)}_2(1-h^{(1)}_2),\ldots,h^{(1)}_m(1-h^{(1)}_m)]^T \\
\frac{\partial a^{(0)}_j}{\partial W^{(1)}_{ij}} = x_i \\
\frac{da^{(0)}}{db^{(1)}}=I

Applying the chain rule:

\frac{dJ}{da^{(1)}}=\left(\frac{dy}{da^{(1)}}\right)^T\frac{dJ}{dy}=y-\hat y \quad (\text{using } \textstyle\sum_i \hat y_i = 1) \\
\frac{dJ}{dW^{(2)}}=h^{(1)}\left(\frac{dJ}{da^{(1)}}\right)^T \\
\frac{dJ}{db^{(2)}}=I\,\frac{dJ}{da^{(1)}}=\frac{dJ}{da^{(1)}} \\
\frac{dJ}{dh^{(1)}}=\left(\frac{da^{(1)}}{dh^{(1)}}\right)^T\frac{dJ}{da^{(1)}}=W^{(2)}\frac{dJ}{da^{(1)}} \\
\frac{dJ}{da^{(0)}}=\frac{dJ}{dh^{(1)}}\odot\frac{dh^{(1)}}{da^{(0)}} \quad (\odot \text{ denotes element-wise multiplication}) \\
\frac{dJ}{dW^{(1)}}=x\left(\frac{dJ}{da^{(0)}}\right)^T \\
\frac{dJ}{db^{(1)}}=\frac{dJ}{da^{(0)}}
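These formulas translate directly into NumPy. The sizes below are arbitrary, and the finite-difference comparison at the end is a quick sanity check on one weight:

import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 5, 3
x = rng.normal(size=(n, 1))
y_hat = np.zeros((d, 1)); y_hat[1] = 1.0          # one-hot target \hat y
W1, b1 = rng.normal(size=(n, m)), np.zeros((m, 1))
W2, b2 = rng.normal(size=(m, d)), np.zeros((d, 1))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1):
    a0 = W1.T @ x + b1
    h1 = sigmoid(a0)
    a1 = W2.T @ h1 + b2
    y = np.exp(a1 - a1.max()); y /= y.sum()       # softmax
    J = -np.sum(y_hat * np.log(y))                # cross-entropy
    return J, h1, y

J, h1, y = forward(W1)

# Backward pass, following the equations above
dJ_da1 = y - y_hat                    # softmax + cross-entropy shortcut
dJ_dW2 = h1 @ dJ_da1.T
dJ_db2 = dJ_da1
dJ_dh1 = W2 @ dJ_da1
dJ_da0 = dJ_dh1 * h1 * (1 - h1)       # element-wise product
dJ_dW1 = x @ dJ_da0.T
dJ_db1 = dJ_da0

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
print(dJ_dW1[0, 0], (forward(W1p)[0] - J) / eps)  # should agree closely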

Complications

[TODO]


Differentiation outside the Deep Learning Community

[TODO]


Higher-Order Derivatives

[TODO]
