xulinshadow701

UFLDL Tutorial_Sparse Autoencoder

Neural Networks

Consider a supervised learning problem where we have access to labeled training examples $(x^{(i)}, y^{(i)})$ . Neural networks give a way of defining a complex, non-linear form of hypotheses $h_{W,b}(x)$ , with parameters $W, b$ that we can fit to our data.

To describe neural networks, we will begin by describing the simplest possible neural network, one which comprises a single "neuron." We will use the following diagram to denote a single neuron:

This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term), and outputs $\textstyle h_{W,b}(x) = f(W^Tx + b) = f(\sum_{i=1}^3 W_{i}x_i +b)$ , where $f : \Re \mapsto \Re$ is called the activation function. In these notes, we will choose $f(\cdot)$ to be the sigmoid function:

$f(z) = \frac{1}{1+\exp(-z)}.$

Thus, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.

Although these notes will use the sigmoid function, it is worth noting that another common choice for $f$ is the hyperbolic tangent, or tanh, function:

$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.$

Here are plots of the sigmoid and $\tanh$ functions:

The $\tanh(z)$ function is a rescaled version of the sigmoid (specifically, $\tanh(z) = 2\sigma(2z) - 1$ , where $\sigma$ denotes the sigmoid), and its output range is $[-1,1]$ instead of $[0,1]$ .

Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention here of $x_0 = 1$ . Instead, the intercept term is handled separately by the parameter $b$ .

Finally, one identity that'll be useful later: If $f(z) = 1 / (1 + \exp(-z))$ is the sigmoid function, then its derivative is given by $f'(z) = f(z)(1 - f(z))$ . (If $f$ is the tanh function, then its derivative is given by $f'(z) = 1 - (f(z))^2$ .) You can derive this yourself using the definition of the sigmoid (or tanh) function.
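
As a quick sanity check of the two derivative identities, the following minimal Python sketch compares them with a central-difference slope. The function names (`sigmoid_prime`, `tanh_prime`, `central_difference`) and the test points are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity f'(z) = f(z) * (1 - f(z))."""
    fz = sigmoid(z)
    return fz * (1.0 - fz)

def tanh_prime(z):
    """Derivative via the identity f'(z) = 1 - f(z)^2 for f = tanh."""
    return 1.0 - math.tanh(z) ** 2

def central_difference(f, z, eps=1e-5):
    """Numerical slope of f at z, used here only as a sanity check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in (-2.0, 0.0, 0.5, 3.0):
    # Both closed-form derivatives should match the numerical slope.
    assert abs(sigmoid_prime(z) - central_difference(sigmoid, z)) < 1e-8
    assert abs(tanh_prime(z) - central_difference(math.tanh, z)) < 1e-8
    # tanh is a shifted and rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1.
    assert abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
```

The identities matter in practice because they let us compute $f'(z)$ from an activation we have already stored, without evaluating another exponential.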

Neural Network model

A neural network is put together by hooking together many of our simple "neurons," so that the output of a neuron can be the input of another. For example, here is a small neural network:

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We will let $n_l$ denote the number of layers in our network; thus $n_l = 3$ in our example. We label layer $l$ as $L_l$ , so layer $L_1$ is the input layer, and layer $L_{n_l}$ the output layer. Our neural network has parameters $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$ , where we write $W^{(l)}_{ij}$ to denote the parameter (or weight) associated with the connection between unit $j$ in layer $l$ , and unit $i$ in layer $l + 1$ . (Note the order of the indices.) Also, $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l + 1$ . Thus, in our example, we have $W^{(1)} \in \Re^{3\times 3}$ , and $W^{(2)} \in \Re^{1\times 3}$ . Note that bias units don't have inputs or connections going into them, since they always output the value +1. We also let $s_l$ denote the number of nodes in layer $l$ (not counting the bias unit).

We will write $a^{(l)}_i$ to denote the activation (meaning output value) of unit $i$ in layer $l$ . For $l = 1$ , we also use $a^{(1)}_i = x_i$ to denote the $i$ -th input. Given a fixed setting of the parameters $W, b$ , our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. Specifically, the computation that this neural network represents is given by:

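$\begin{align}a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})\end{align}$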

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit $i$ in layer $l$ , including the bias term (e.g., $\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i$ ), so that $a^{(l)}_i = f(z^{(l)}_i)$ .

Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function $f(\cdot)$ to apply to vectors in an element-wise fashion (i.e., $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$ ), then we can write the equations above more compactly as:

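$\begin{align}z^{(2)} &= W^{(1)} x + b^{(1)} \\a^{(2)} &= f(z^{(2)}) \\z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\h_{W,b}(x) &= a^{(3)} = f(z^{(3)})\end{align}$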

We call this step forward propagation. More generally, recalling that we also use $a^{(1)} = x$ to denote the values from the input layer, then given layer $l$ 's activations $a^{(l)}$ , we can compute layer $l + 1$ 's activations $a^{(l+1)}$ as:

$\begin{align}z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\a^{(l+1)} &= f(z^{(l+1)})\end{align}$

By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.
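
The forward propagation equations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: matrices are stored as nested lists, the function name `forward` is ours, and the parameter values for the 3-3-1 example network are made up. A real implementation would use a linear algebra library for the matrix-vector products.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return (zs, activations) for input x.

    weights[k] and biases[k] hold W^{(k+1)} and b^{(k+1)} (0-based lists),
    with weights[k][i][j] connecting unit j of one layer to unit i of the next.
    """
    activations = [list(x)]          # a^{(1)} = x
    zs = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        zs.append(z)
        # a^{(l+1)} = f(z^{(l+1)}), applied element-wise
        activations.append([sigmoid(z_i) for z_i in z])
    return zs, activations

# Hypothetical parameters for the 3-3-1 example network.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.5, -0.5, 0.25]]
b2 = [0.05]

zs, activations = forward([W1, W2], [b1, b2], [1.0, 0.0, -1.0])
h = activations[-1]   # h_{W,b}(x), a list with one element here
```

Keeping every layer's activations, rather than only the final output, is a deliberate choice: training algorithms typically need the intermediate values again.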

We have so far focused on one example neural network, but one can also build neural networks with other architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an $\textstyle n_l$ -layered network where layer $\textstyle 1$ is the input layer, layer $\textstyle n_l$ is the output layer, and each layer $\textstyle l$ is densely connected to layer $\textstyle l+1$ . In this setting, to compute the output of the network, we can successively compute all the activations in layer $\textstyle L_2$ , then layer $\textstyle L_3$ , and so on, up to layer $\textstyle L_{n_l}$ , using the equations above that describe the forward propagation step. This is one example of a feedforward neural network, since the connectivity graph does not have any directed loops or cycles.

Neural networks can also have multiple output units. For example, here is a network with two hidden layers $L_2$ and $L_3$ and two output units in layer $L_4$ :

To train this network, we would need training examples $(x^{(i)}, y^{(i)})$ where $y^{(i)} \in \Re^2$ . This sort of network is useful if there are multiple outputs that you're interested in predicting. (For example, in a medical diagnosis application, the vector $x$ might give the input features of a patient, and the different outputs $y_i$ might indicate presence or absence of different diseases.)

Backpropagation Algorithm

Suppose we have a fixed training set $\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$ of $m$ training examples. We can train our neural network using batch gradient descent. In detail, for a single training example $(x, y)$ , we define the cost function with respect to that single example to be:

$\begin{align}J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.\end{align}$

This is a (one-half) squared-error cost function. Given a training set of $m$ examples, we then define the overall cost function to be:

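$\begin{align}J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \\&= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2\end{align}$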

The first term in the definition of $J (W, b)$ is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.

[Note: Usually weight decay is not applied to the bias terms $b^{(l)}_i$ , as reflected in our definition for $J (W, b)$ . Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]

The weight decay parameter $λ$ controls the relative importance of the two terms. Note also the slightly overloaded notation: $J (W, b; x, y)$ is the squared error cost with respect to a single example; $J (W, b)$ is the overall cost function, which includes the weight decay term.
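
To make the two uses of $J$ concrete, here is a minimal Python sketch of both costs, assuming the overall cost is the mean of the per-example squared-error costs plus $\frac{\lambda}{2}$ times the sum of all squared weights (biases excluded). The names `hypothesis`, `example_cost` and `overall_cost`, the nested-list parameter layout, and the numbers in the usage lines are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(weights, biases, x):
    """Forward propagation; returns h_{W,b}(x) as a list."""
    a = list(x)
    for W, b in zip(weights, biases):
        a = [sigmoid(sum(w * a_j for w, a_j in zip(row, a)) + b_i)
             for row, b_i in zip(W, b)]
    return a

def example_cost(weights, biases, x, y):
    """J(W,b; x,y) = 0.5 * ||h_{W,b}(x) - y||^2 for a single example."""
    h = hypothesis(weights, biases, x)
    return 0.5 * sum((h_i - y_i) ** 2 for h_i, y_i in zip(h, y))

def overall_cost(weights, biases, examples, lam):
    """J(W,b): mean per-example cost plus (lam / 2) * sum of squared weights."""
    m = len(examples)
    data_term = sum(example_cost(weights, biases, x, y) for x, y in examples) / m
    # Weight decay covers every entry of every W but none of the biases.
    decay_term = 0.5 * lam * sum(w ** 2 for W in weights for row in W for w in row)
    return data_term + decay_term

# Hypothetical single-layer usage: two inputs, one sigmoid output unit.
weights, biases = [[[1.0, -1.0]]], [[0.0]]
examples = [([1.0, 1.0], [0.5]), ([0.0, 1.0], [0.0])]
cost_without_decay = overall_cost(weights, biases, examples, lam=0.0)
cost_with_decay = overall_cost(weights, biases, examples, lam=0.1)
```

With these made-up numbers the two costs differ by exactly the decay term, $\frac{0.1}{2}(1^2 + (-1)^2) = 0.1$, which shows how $\lambda$ trades data fit against weight magnitude.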

This cost function above is often used both for classification and for regression problems. For classification, we let $y = 0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0,1]$ ; if we were using a tanh activation function, we would instead use -1 and +1 to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0,1]$ range (or if we were using a tanh activation function, then the $[ - 1,1]$ range).

Our goal is to minimize $J(W, b)$ as a function of $W$ and $b$ . To train our neural network, we will initialize each parameter $W^{(l)}_{ij}$ and each $b^{(l)}_i$ to a small random value near zero (say according to a ${\rm Normal}(0, \epsilon^2)$ distribution for some small $\epsilon$ , say $0.01$ ), and then apply an optimization algorithm such as batch gradient descent. Since $J(W, b)$ is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, $W^{(1)}_{ij}$ will be the same for all values of $i$ , so that $a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots$ for any input $x$ ). The random initialization serves the purpose of symmetry breaking.

One iteration of gradient descent updates the parameters $W, b$ as follows:

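$\begin{align}W_{ij}^{(l)} &:= W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \\b_{i}^{(l)} &:= b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)\end{align}$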

where $\alpha$ is the learning rate. The key step is computing the partial derivatives above. We will now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.

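We will first describe how backpropagation can be used to compute $\textstyle \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y)$ and $\textstyle \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y)$ , the partial derivatives of the cost function $J(W, b; x, y)$ defined with respect to a single example $(x, y)$ . Once we can compute these, we see that the derivative of the overall cost function $J(W, b)$ can be computed as:

$\begin{align}\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) &= \left[ \frac{1}{m} \sum_{k=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(k)}, y^{(k)}) \right] + \lambda W_{ij}^{(l)} \\\frac{\partial}{\partial b_{i}^{(l)}} J(W,b) &= \frac{1}{m}\sum_{k=1}^m \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(k)}, y^{(k)})\end{align}$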

The two lines above differ slightly because weight decay is applied to $W$ but not $b$ .

The intuition behind the backpropagation algorithm is as follows. Given a training example $(x, y)$ , we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_{W,b}(x)$ . Then, for each node $i$ in layer $l$ , we would like to compute an "error term" $\delta^{(l)}_i$ that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). How about hidden units? For those, we will compute $\delta^{(l)}_i$ based on a weighted average of the error terms of the nodes that use $a^{(l)}_i$ as an input. In detail, here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers $L_2$ , $L_3$ , and so on up to the output layer $L_{n_l}$ .
2. For each output unit $i$ in layer $n_l$ (the output layer), set

$\begin{align}\delta^{(n_l)}_i= \frac{\partial}{\partial z^{(n_l)}_i} \;\; \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)\end{align}$
3. For $l = n_l-1, n_l-2, n_l-3, \ldots, 2$

For each node $i$ in layer $l$ , set

$\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)$
4. Compute the desired partial derivatives, which are given as:

$\begin{align}\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.\end{align}$

Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use " $\textstyle \bullet$ " to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if $\textstyle a = b \bullet c$ , then $\textstyle a_i = b_ic_i$ . Similar to how we extended the definition of $\textstyle f(\cdot)$ to apply element-wise to vectors, we also do the same for $\textstyle f'(\cdot)$ (so that $\textstyle f'([z_1, z_2, z_3]) =[f'(z_1),f'(z_2),f'(z_3)]$ ).

The algorithm can then be written:

1. Perform a feedforward pass, computing the activations for layers $\textstyle L_2$ , $\textstyle L_3$ , up to the output layer $\textstyle L_{n_l}$ , using the equations defining the forward propagation steps.
2. For the output layer (layer $\textstyle n_l$ ), set

$\begin{align}\delta^{(n_l)}= - (y - a^{(n_l)}) \bullet f'(z^{(n_l)})\end{align}$
3. For $\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2$

Set

$\begin{align} \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)}) \end{align}$
4. Compute the desired partial derivatives:

$\begin{align}\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.\end{align}$

Implementation note: In steps 2 and 3 above, we need to compute $\textstyle f'(z^{(l)}_i)$ for each value of $\textstyle i$ . Assuming $\textstyle f(z)$ is the sigmoid activation function, we would already have $\textstyle a^{(l)}_i$ stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for $\textstyle f'(z)$ , we can compute this as $\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)$ .
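
The backpropagation steps above, together with the implementation note, can be sketched for a single example as follows. This is a plain-Python illustration under our own assumptions (nested lists for matrices, sigmoid activations in every layer, a hypothetical function name `backprop`, made-up parameter values), not a reference implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(weights, biases, x, y):
    """Gradients of J(W,b; x,y) = 0.5 * ||h(x) - y||^2 for one example.

    Returns (grad_W, grad_b), shaped like weights and biases.
    """
    # Feedforward pass, storing every layer's activations.
    activations = [list(x)]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a_j for w, a_j in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)])

    # Output-layer error term, using f'(z) = a * (1 - a) for the sigmoid.
    a_out = activations[-1]
    delta = [-(y_i - a_i) * a_i * (1.0 - a_i) for y_i, a_i in zip(y, a_out)]

    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    # Walk backwards through the weight matrices.
    for k in range(len(weights) - 1, -1, -1):
        a_prev = activations[k]
        # Gradients for this layer: delta^{(l+1)} (a^{(l)})^T and delta^{(l+1)}.
        grad_W[k] = [[d_i * a_j for a_j in a_prev] for d_i in delta]
        grad_b[k] = list(delta)
        if k > 0:
            # delta^{(l)} = ((W^{(l)})^T delta^{(l+1)}) .* f'(z^{(l)})
            W = weights[k]
            delta = [
                sum(W[j][i] * delta[j] for j in range(len(delta)))
                * a_prev[i] * (1.0 - a_prev[i])
                for i in range(len(a_prev))]
    return grad_W, grad_b

# Hypothetical 2-2-1 network and a single training example.
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6]], [0.2]
grad_W, grad_b = backprop([W1, W2], [b1, b2], [1.0, -1.0], [1.0])
```

Note that the sketch never stores $z^{(l)}$ : because the activation is the sigmoid, the stored activations are enough to evaluate $f'(z^{(l)}_i)$ .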

Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, $\textstyle \Delta W^{(l)}$ is a matrix (of the same dimension as $\textstyle W^{(l)}$ ), and $\textstyle \Delta b^{(l)}$ is a vector (of the same dimension as $\textstyle b^{(l)}$ ). Note that in this notation, " $\textstyle \Delta W^{(l)}$ " is a matrix, and in particular it isn't " $\textstyle \Delta$ times $\textstyle W^{(l)}$ ." We implement one iteration of batch gradient descent as follows:

1. Set $\textstyle \Delta W^{(l)} := 0$ , $\textstyle \Delta b^{(l)} := 0$ (matrix/vector of zeros) for all $\textstyle l$ .
2. For $\textstyle i = 1$ to $\textstyle m$ ,
   a. Use backpropagation to compute $\textstyle \nabla_{W^{(l)}} J(W,b;x,y)$ and $\textstyle \nabla_{b^{(l)}} J(W,b;x,y)$ .
   b. Set $\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)$ .
   c. Set $\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)$ .
3. Update the parameters:

$\begin{align}W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]\end{align}$

To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function $\textstyle J(W,b)$ .
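
One iteration of batch gradient descent might look like the sketch below. To keep it short, the per-example gradient routine is passed in as a parameter `grad_fn` (a stand-in for a backpropagation implementation); that parameter, the nested-list layout, and the single-neuron demo with its made-up data are all assumptions of this example.

```python
import math

def gradient_descent_step(weights, biases, examples, grad_fn, alpha, lam):
    """One iteration of batch gradient descent with weight decay.

    grad_fn(weights, biases, x, y) must return (grad_W, grad_b) for a single
    example, shaped like weights and biases (e.g. a backpropagation routine).
    """
    m = len(examples)
    # Delta W^{(l)} := 0 and Delta b^{(l)} := 0 for all l.
    dW = [[[0.0] * len(row) for row in W] for W in weights]
    db = [[0.0] * len(b) for b in biases]
    # Accumulate the per-example gradients.
    for x, y in examples:
        grad_W, grad_b = grad_fn(weights, biases, x, y)
        for l in range(len(weights)):
            for i in range(len(weights[l])):
                db[l][i] += grad_b[l][i]
                for j in range(len(weights[l][i])):
                    dW[l][i][j] += grad_W[l][i][j]
    # W := W - alpha * (dW / m + lam * W),  b := b - alpha * (db / m).
    new_weights = [[[w - alpha * (dW[l][i][j] / m + lam * w)
                     for j, w in enumerate(row)]
                    for i, row in enumerate(W)]
                   for l, W in enumerate(weights)]
    new_biases = [[b_i - alpha * db[l][i] / m for i, b_i in enumerate(b)]
                  for l, b in enumerate(biases)]
    return new_weights, new_biases

def single_neuron_grad(weights, biases, x, y):
    """Gradient of 0.5 * (f(w.x + b) - y)^2 for one sigmoid neuron."""
    w, b = weights[0][0], biases[0][0]
    a = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
    delta = -(y[0] - a) * a * (1.0 - a)
    return [[[delta * xj for xj in x]]], [[delta]]

# Hypothetical data: the target equals the second input feature.
examples = [([0.0, 1.0], [1.0]), ([1.0, 0.0], [0.0])]
weights, biases = [[[0.01, -0.01]]], [[0.0]]
for _ in range(200):
    weights, biases = gradient_descent_step(
        weights, biases, examples, single_neuron_grad, alpha=1.0, lam=0.0)
```

Returning new parameter lists instead of updating in place is a design choice made here for clarity; an in-place update would be equally valid.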

Gradient checking and advanced optimization

Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it—for example, one that has an off-by-one error in the indices and that thus only trains some of the layers of weights, or an implementation that omits the bias term—will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize $\textstyle J(\theta)$ as a function of $\textstyle \theta$ . For this example, suppose $\textstyle J : \Re \mapsto \Re$ , so that $\textstyle \theta \in \Re$ . In this 1-dimensional case, one iteration of gradient descent is given by

$\begin{align}\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).\end{align}$

Suppose also that we have implemented some function $\textstyle g(\theta)$ that purportedly computes $\textstyle \frac{d}{d\theta}J(\theta)$ , so that we implement gradient descent using the update $\textstyle \theta := \theta - \alpha g(\theta)$ . How can we check if our implementation of $\textstyle g$ is correct?

Recall the mathematical definition of the derivative as

$\begin{align}\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.\end{align}$

Thus, at any specific value of $\textstyle \theta$ , we can numerically approximate the derivative as follows:

$\begin{align}\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}\end{align}$

In practice, we set ${\rm EPSILON}$ to a small constant, say around $\textstyle 10^{-4}$ . (There's a large range of values of ${\rm EPSILON}$ that should work well, but we don't set ${\rm EPSILON}$ to be "extremely" small, say $\textstyle 10^{-20}$ , as that would lead to numerical roundoff errors.)

Thus, given a function $\textstyle g(\theta)$ that is supposedly computing $\textstyle \frac{d}{d\theta}J(\theta)$ , we can now numerically verify its correctness by checking that

$\begin{align}g(\theta) \approx\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.\end{align}$

The degree to which these two values should approximate each other will depend on the details of $\textstyle J$ . But assuming $\textstyle {\rm EPSILON} = 10^{-4}$ , you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
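
The one-dimensional check is easy to try on a function whose derivative we know. In this sketch, $J(\theta) = \theta^3$ , the correct `g`, and the deliberately wrong `g_buggy` are hypothetical examples chosen for illustration.

```python
def numerical_derivative(J, theta, epsilon=1e-4):
    """Two-sided difference approximation to dJ/dtheta at theta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2.0 * epsilon)

def J(theta):
    return theta ** 3

def g(theta):
    """Correct derivative of J."""
    return 3.0 * theta ** 2

def g_buggy(theta):
    """Deliberately wrong derivative, standing in for a buggy implementation."""
    return 2.0 * theta ** 2

theta = 1.5
approx = numerical_derivative(J, theta)
error_correct = abs(g(theta) - approx)       # tiny: the check passes
error_buggy = abs(g_buggy(theta) - approx)   # large: the bug is exposed
```

For this particular $J$ the two-sided difference equals $3\theta^2 + {\rm EPSILON}^2$ exactly (up to roundoff), so the correct implementation differs from the approximation by about $10^{-8}$ , while the buggy one is off by more than 2.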

Now, consider the case where $\textstyle \theta \in \Re^n$ is a vector rather than a single real number (so that we have $\textstyle n$ parameters that we want to learn), and $\textstyle J: \Re^n \mapsto \Re$ . In our neural network example we used " $\textstyle J(W,b)$ ," but one can imagine "unrolling" the parameters $\textstyle W,b$ into a long vector $\textstyle \theta$ . We now generalize our derivative checking procedure to the case where $\textstyle \theta$ may be a vector.

Suppose we have a function $\textstyle g_i(\theta)$ that purportedly computes $\textstyle \frac{\partial}{\partial \theta_i} J(\theta)$ ; we'd like to check if $\textstyle g_i$ is outputting correct derivative values. Let $\textstyle \theta^{(i+)} = \theta +{\rm EPSILON} \times \vec{e}_i$ , where

$\begin{align}\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}\end{align}$

is the $\textstyle i$ -th basis vector (a vector of the same dimension as $\textstyle \theta$ , with a "1" in the $\textstyle i$ -th position and "0"s everywhere else). So, $\textstyle \theta^{(i+)}$ is the same as $\textstyle \theta$ , except its $\textstyle i$ -th element has been incremented by ${\rm EPSILON}$ . Similarly, let $\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i$ be the corresponding vector with the $\textstyle i$ -th element decreased by ${\rm EPSILON}$ . We can now numerically verify $\textstyle g_i(\theta)$ 's correctness by checking, for each $\textstyle i$ , that:

$\begin{align}g_i(\theta) \approx\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.\end{align}$

When implementing backpropagation to train a neural network, in a correct implementation we will have that

$\begin{align}\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.\end{align}$

This result shows that the final block of pseudo-code in Backpropagation Algorithm is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of $\textstyle J(W,b)$ , and thereby verify that your computations of $\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}$ and $\textstyle \frac{1}{m}\Delta b^{(l)}$ are indeed giving the derivatives you want.
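
For a parameter vector, the check perturbs one coordinate at a time. The sketch below assumes the parameters have already been unrolled into a flat list `theta`; the quadratic `J`, its gradient `g`, and the helper `max_abs_difference` are illustrative stand-ins for a network's cost and its backpropagation gradient.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate each partial derivative of J at theta, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon      # theta^{(i+)}
        theta_minus = list(theta)
        theta_minus[i] -= epsilon     # theta^{(i-)}
        grad.append((J(theta_plus) - J(theta_minus)) / (2.0 * epsilon))
    return grad

def max_abs_difference(u, v):
    """Largest coordinate-wise gap between two gradient vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical cost and analytic gradient, standing in for J(W,b) and backprop.
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

def g(theta):
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]

theta = [1.0, 2.0]
gap = max_abs_difference(g(theta), numerical_gradient(J, theta))
```

Each coordinate costs two evaluations of $J$ , so this check is far slower than backpropagation; it is a debugging aid to run on a small network, not something to leave on during training.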

Finally, so far our discussion has centered on using gradient descent to minimize $\textstyle J(\theta)$ . If you have implemented a function that computes $\textstyle J(\theta)$ and $\textstyle \nabla_\theta J(\theta)$ , it turns out there are more sophisticated algorithms than gradient descent for trying to minimize $\textstyle J(\theta)$ . For example, one can envision an algorithm that uses gradient descent, but automatically tunes the learning rate $\textstyle \alpha$ so as to try to use a step-size that causes $\textstyle \theta$ to approach a local optimum as quickly as possible. There are other algorithms that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these algorithms is beyond the scope of these notes, but one example is the L-BFGS algorithm. (Another example is the conjugate gradient algorithm.) You will use one of these algorithms in the programming exercise. The main thing these advanced optimization algorithms require is that, for any $\textstyle \theta$ , you are able to compute $\textstyle J(\theta)$ and $\textstyle \nabla_\theta J(\theta)$ . These optimization algorithms will then do their own internal tuning of the learning rate/step-size $\textstyle \alpha$ (and compute their own approximations to the Hessian, etc.) to automatically search for a value of $\textstyle \theta$ that minimizes $\textstyle J(\theta)$ . Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.

Autoencoders and Sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples $\textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$ , where $\textstyle x^{(i)} \in \Re^{n}$ . An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses $\textstyle y^{(i)} = x^{(i)}$ .

Here is an autoencoder:

[Figure: autoencoder network diagram]

The autoencoder tries to learn a function $\textstyle h_{W,b}(x) \approx x$ . In other words, it is trying to learn an approximation to the identity function, so as to output $\textstyle \hat{x}$ that is similar to $\textstyle x$ . The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs $\textstyle x$ are the pixel intensity values from a $\textstyle 10 \times 10$ image (100 pixels) so $\textstyle n=100$ , and there are $\textstyle s_2=50$ hidden units in layer $\textstyle L_2$ . Note that we also have $\textstyle y \in \Re^{100}$ . Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations $\textstyle a^{(2)} \in \Re^{50}$ , it must try to reconstruct the 100-pixel input $\textstyle x$ . If the input were completely random---say, each $\textstyle x_i$ comes from an IID Gaussian independent of the other features---then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.

Our argument above relied on the number of hidden units $\textstyle s_2$ being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as being inactive when it outputs values close to -1.

Recall that $\textstyle a^{(2)}_j$ denotes the activation of hidden unit $\textstyle j$ in the autoencoder. However, this notation doesn't make explicit what was the input $\textstyle x$ that led to that activation. Thus, we will write $\textstyle a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $\textstyle x$ . Further, let

$\begin{align}\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]\end{align}$

be the average activation of hidden unit $\textstyle j$ (averaged over the training set). We would like to (approximately) enforce the constraint

$\begin{align}\hat\rho_j = \rho,\end{align}$

where $\textstyle \rho$ is a sparsity parameter, typically a small value close to zero (say $\textstyle \rho = 0.05$ ). In other words, we would like the average activation of each hidden neuron $\textstyle j$ to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.

To achieve this, we will add an extra penalty term to our optimization objective that penalizes $\textstyle \hat\rho_j$ deviating significantly from $\textstyle \rho$ . Many choices of the penalty term will give reasonable results. We will choose the following:

$\begin{align}\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.\end{align}$

Here, $\textstyle s_2$ is the number of neurons in the hidden layer, and the index $\textstyle j$ is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

$\begin{align}\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}$

where $\textstyle {\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}$ is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\textstyle \rho$ and a Bernoulli random variable with mean $\textstyle \hat\rho_j$ . KL-divergence is a standard function for measuring how different two different distributions are. (If you've not seen KL-divergence before, don't worry about it; everything you need to know about it is contained in these notes.)

This penalty function has the property that $\textstyle {\rm KL}(\rho || \hat\rho_j) = 0$ if $\textstyle \hat\rho_j = \rho$ , and otherwise it increases monotonically as $\textstyle \hat\rho_j$ diverges from $\textstyle \rho$ . For example, in the figure below, we have set $\textstyle \rho = 0.2$ , and plotted $\textstyle {\rm KL}(\rho || \hat\rho_j)$ for a range of values of $\textstyle \hat\rho_j$ :

[Figure: plot of $\textstyle {\rm KL}(\rho || \hat\rho_j)$ against $\textstyle \hat\rho_j$ for $\textstyle \rho = 0.2$ ]

We see that the KL-divergence reaches its minimum of 0 at $\textstyle \hat\rho_j = \rho$ , and blows up (it actually approaches $\textstyle \infty$ ) as $\textstyle \hat\rho_j$ approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing $\textstyle \hat\rho_j$ to be close to $\textstyle \rho$ .
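
The behavior of the penalty is easy to reproduce numerically. The following is a small sketch of the Bernoulli KL divergence and the summed penalty; the function names are ours, and the code assumes $0 < \hat\rho_j < 1$ (which holds for sigmoid activations) so that the logarithms are defined.

```python
import math

def kl_bernoulli(rho, rho_hat):
    """KL(rho || rho_hat) between Bernoulli variables with means rho and rho_hat.

    Assumes 0 < rho < 1 and 0 < rho_hat < 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(rho, rho_hats):
    """Sum of KL(rho || rho_hat_j) over all hidden units j."""
    return sum(kl_bernoulli(rho, rho_hat) for rho_hat in rho_hats)

rho = 0.2
# Evaluate the penalty on a grid of average activations, as in the plot.
curve = [(r / 100.0, kl_bernoulli(rho, r / 100.0)) for r in range(1, 100)]
lowest = min(curve, key=lambda point: point[1])   # located at rho_hat = 0.2
```

On this grid the smallest value occurs at $\hat\rho_j = 0.2$ , and the values grow as $\hat\rho_j$ moves toward either end of the interval.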

Our overall cost function is now

$\begin{align}J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),\end{align}$

where $\textstyle J(W,b)$ is as defined previously, and $\textstyle \beta$ controls the weight of the sparsity penalty term. The term $\textstyle \hat\rho_j$ (implicitly) depends on $\textstyle W,b$ also, because it is the average activation of hidden unit $\textstyle j$ , and the activation of a hidden unit depends on the parameters $\textstyle W,b$ .

To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code. Specifically, where previously for the second layer ( $\textstyle l=2$ ), during backpropagation you would have computed

$\begin{align}\delta^{(2)}_i = \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),\end{align}$

now instead compute

$\begin{align}\delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .\end{align}$

One subtlety is that you'll need to know $\textstyle \hat\rho_i$ to compute this term. Thus, you'll need to compute a forward pass on all the training examples first to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples and keep the resulting activations in memory and compute the $\textstyle \hat\rho_i$ s. Then you can use your precomputed activations to perform backpropagation on all your examples. If your data is too large to fit in memory, you may have to scan through your examples computing a forward pass on each to accumulate (sum up) the activations and compute $\textstyle \hat\rho_i$ (discarding the result of each forward pass after you have taken its activations $\textstyle a^{(2)}_i$ into account for computing $\textstyle \hat\rho_i$ ). Then after having computed $\textstyle \hat\rho_i$ , you'd have to redo the forward pass for each example so that you can do backpropagation on that example. In this latter case, you would end up computing a forward pass twice on each example in your training set, making it computationally less efficient.

The full derivation showing that the algorithm above results in gradient descent is beyond the scope of these notes. But if you implement the autoencoder using backpropagation modified this way, you will be performing gradient descent exactly on the objective $\textstyle J_{\rm sparse}(W,b)$ . Using the derivative checking method, you will be able to verify this for yourself as well.
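
A minimal sketch of the modified hidden-layer error term is given below. It assumes the hidden activations of all training examples have already been computed in a first forward pass and are available as a list of lists; the function names, argument layout and numbers are hypothetical.

```python
def average_activations(hidden_activations):
    """rho_hat_j: mean activation of hidden unit j over the training set.

    hidden_activations[i][j] is a^{(2)}_j for training example i.
    """
    m = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    return [sum(a[j] for a in hidden_activations) / m for j in range(n_hidden)]

def sparse_hidden_delta(W2, delta3, a2, rho_hat, rho, beta):
    """delta^{(2)} for one example, including the sparsity term.

    W2[j][i] connects hidden unit i to output unit j; a2 holds this
    example's hidden activations (sigmoid, so f'(z) = a * (1 - a)).
    """
    delta2 = []
    for i in range(len(a2)):
        backpropagated = sum(W2[j][i] * delta3[j] for j in range(len(delta3)))
        sparsity = beta * (-rho / rho_hat[i] + (1.0 - rho) / (1.0 - rho_hat[i]))
        delta2.append((backpropagated + sparsity) * a2[i] * (1.0 - a2[i]))
    return delta2

# Hypothetical numbers: two training examples, two hidden units, one output.
all_hidden = [[0.2, 0.4], [0.4, 0.8]]
rho_hat = average_activations(all_hidden)
delta2 = sparse_hidden_delta(W2=[[1.0, -1.0]], delta3=[0.1],
                             a2=all_hidden[0], rho_hat=rho_hat,
                             rho=0.05, beta=3.0)
```

The extra term vanishes when $\hat\rho_i = \rho$ and is positive when the unit is on average more active than $\rho$ , which pushes that unit's activations down during gradient descent.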

Visualizing a Trained Autoencoder

Having trained a (sparse) autoencoder, we would now like to visualize the function learned by the algorithm, to try to understand what it has learned. Consider the case of training an autoencoder on $\textstyle 10 \times 10$ images, so that $\textstyle n = 100$ . Each hidden unit $\textstyle i$ computes a function of the input:

$\begin{align}a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).\end{align}$

We will visualize the function computed by hidden unit $\textstyle i$ ---which depends on the parameters $\textstyle W^{(1)}_{ij}$ (ignoring the bias term for now)---using a 2D image. In particular, we think of $\textstyle a^{(2)}_i$ as some non-linear feature of the input $\textstyle x$ . We ask: What input image $\textstyle x$ would cause $\textstyle a^{(2)}_i$ to be maximally activated? (Less formally, what is the feature that hidden unit $\textstyle i$ is looking for?) For this question to have a non-trivial answer, we must impose some constraints on $\textstyle x$ . If we suppose that the input is norm constrained by $\textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1$ , then one can show (try doing this yourself) that the input which maximally activates hidden unit $\textstyle i$ is given by setting pixel $\textstyle x_j$ (for all 100 pixels, $\textstyle j=1,\ldots, 100$ ) to

$\begin{align}x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{k=1}^{100} (W^{(1)}_{ik})^2}}.\end{align}$

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit $\textstyle i$ is looking for.
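
Computing the maximally activating input for one hidden unit amounts to normalizing that unit's row of $W^{(1)}$ . The sketch below assumes the row is given as a flat list and that pixels are laid out row by row when reshaped into a square image; both the layout and the function names are assumptions made for illustration.

```python
import math

def max_activation_input(w_row):
    """Norm-bounded input x that maximally activates a hidden unit.

    w_row is that unit's row of W^{(1)}; the result is w_row scaled to unit norm.
    """
    norm = math.sqrt(sum(w ** 2 for w in w_row))
    return [w / norm for w in w_row]

def as_image(pixels, side):
    """Reshape a flat pixel list into side x side rows (row-major layout assumed)."""
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical weights for one hidden unit of a network with 2x2 inputs.
w_row = [3.0, 0.0, 0.0, 4.0]
x = max_activation_input(w_row)
image = as_image(x, side=2)
```

For a $\textstyle 10 \times 10$ input, the same two calls with `side=10` would produce the image for one hidden unit, ready to be passed to whatever plotting tool is at hand.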

If we have an autoencoder with 100 hidden units (say), then our visualization will have 100 such images, one per hidden unit. By examining these 100 images, we can try to understand what the ensemble of hidden units is learning.
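Computing all of these images at once amounts to normalizing each row of the first-layer weight matrix. The following is a minimal sketch on a made-up weight matrix: the sizes and the names `rowNorms`, `X`, and `img1` are illustrative assumptions.

```matlab
% Hypothetical weights: 4 hidden units, 9 inputs (3x3 "images").
W1 = reshape(1:36, 4, 9) - 18.5;

% Row i of X is the unit-norm input that maximally activates hidden unit i.
rowNorms = sqrt(sum(W1 .^ 2, 2));
X = bsxfun(@rdivide, W1, rowNorms);

% The image for hidden unit 1 is its row reshaped back into a square.
img1 = reshape(X(1, :), 3, 3);
```

For the 100-unit, 10×10 case in the text, `W1` would be 100×100 and each row would be reshaped to 10×10.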

When we do this for a sparse autoencoder (trained with 100 hidden units on 10x10 pixel inputs)¹, we get the following result:

Each square in the figure above shows the (norm bounded) input image $\textstyle x$ that maximally activates one of 100 hidden units. We see that the different hidden units have learned to detect edges at different positions and orientations in the image.

These features are, not surprisingly, useful for such tasks as object recognition and other vision tasks. When applied to other input domains (such as audio), this algorithm also learns useful representations/features for those domains too.

¹ The learned features were obtained by training on whitened natural images. Whitening is a preprocessing step which removes redundancy in the input, by causing adjacent pixels to become less correlated.

Sparse Autoencoder Notation Summary

Here is a summary of the symbols used in our derivation of the sparse autoencoder:

Symbol	Meaning
$\textstyle x$	Input features for a training example, $\textstyle x \in \Re^{n}$ .
$\textstyle y$	Output/target values. Here, $\textstyle y$ can be vector valued. In the case of an autoencoder, $\textstyle y=x$ .
$\textstyle (x^{(i)}, y^{(i)})$	The $\textstyle i$ -th training example
$\textstyle h_{W,b}(x)$	Output of our hypothesis on input $\textstyle x$ , using parameters $\textstyle W,b$ . This should be a vector of the same dimension as the target value $\textstyle y$ .
$\textstyle W^{(l)}_{ij}$	The parameter associated with the connection between unit $\textstyle j$ in layer $\textstyle l$ , and unit $\textstyle i$ in layer $\textstyle l+1$ .
$\textstyle b^{(l)}_{i}$	The bias term associated with unit $\textstyle i$ in layer $\textstyle l+1$ . Can also be thought of as the parameter associated with the connection between the bias unit in layer $\textstyle l$ and unit $\textstyle i$ in layer $\textstyle l+1$ .
$\textstyle \theta$	Our parameter vector. It is useful to think of this as the result of taking the parameters $\textstyle W,b$ and "unrolling" them into a long column vector.
$\textstyle a^{(l)}_i$	Activation (output) of unit $\textstyle i$ in layer $\textstyle l$ of the network. In addition, since layer $\textstyle L_1$ is the input layer, we also have $\textstyle a^{(1)}_i = x_i$ .
$\textstyle f(\cdot)$	The activation function. Throughout these notes, we used the sigmoid function $\textstyle f(z) = 1/(1+\exp(-z))$ .
$\textstyle z^{(l)}_i$	Total weighted sum of inputs to unit $\textstyle i$ in layer $\textstyle l$ . Thus, $\textstyle a^{(l)}_i = f(z^{(l)}_i)$ .
$\textstyle \alpha$	Learning rate parameter
$\textstyle s_l$	Number of units in layer $\textstyle l$ (not counting the bias unit).
$\textstyle n_l$	Number of layers in the network. Layer $\textstyle L_1$ is usually the input layer, and layer $\textstyle L_{n_l}$ the output layer.
$\textstyle \lambda$	Weight decay parameter.
$\textstyle \hat{x}$	For an autoencoder, its output; i.e., its reconstruction of the input $\textstyle x$ . Same meaning as $\textstyle h_{W,b}(x)$ .
$\textstyle \rho$	Sparsity parameter, which specifies our desired level of sparsity
$\textstyle \hat\rho_i$	The average activation of hidden unit $\textstyle i$ (in the sparse autoencoder).
$\textstyle \beta$	Weight of the sparsity penalty term (in the sparse autoencoder objective).

Exercise: Sparse Autoencoder

1 Download Related Reading
2 Sparse autoencoder implementation
- 2.1 Step 1: Generate training set
- 2.2 Step 2: Sparse autoencoder objective
- 2.3 Step 3: Gradient checking
- 2.4 Step 4: Train the sparse autoencoder
- 2.5 Step 5: Visualization
3 Results

Download Related Reading

sparseae_reading.pdf
sparseae_exercise.pdf

Sparse autoencoder implementation

In this problem set, you will implement the sparse autoencoder algorithm, and show how it discovers that edges are a good representation for natural images. (Images provided by Bruno Olshausen.) The sparse autoencoder algorithm is described in the lecture notes found on the course website.

In the file sparseae_exercise.zip, we have provided some starter code in Matlab. You should write your code at the places indicated in the files ("YOUR CODE HERE"). You have to complete the following files: sampleIMAGES.m, sparseAutoencoderCost.m, computeNumericalGradient.m. The starter code in train.m shows how these functions are used.

Specifically, in this exercise you will implement a sparse autoencoder, trained with 8×8 image patches using the L-BFGS optimization algorithm.

A note on the software: The provided .zip file includes a subdirectory minFunc with 3rd party software implementing L-BFGS, that is licensed under a Creative Commons, Attribute, Non-Commercial license. If you need to use this software for commercial purposes, you can download and use a different function (fminlbfgs) that can serve the same purpose, but runs ~3x slower for this exercise (and thus is less recommended). You can read more about this in the Fminlbfgs_Details page.

Step 1: Generate training set

The first step is to generate a training set. To get a single training example $x$ , randomly pick one of the 10 images, then randomly sample an 8×8 image patch from the selected image, and convert the image patch (either in row-major order or column-major order; it doesn't matter) into a 64-dimensional vector to get a training example $x \in \Re^{64}.$

Complete the code in sampleIMAGES.m. Your code should sample 10000 image patches and concatenate them into a 64×10000 matrix.

To make sure your implementation is working, run the code in "Step 1" of train.m. This should result in a plot of a random sample of 200 patches from the dataset.

Implementational tip: When we run our implemented sampleIMAGES(), it takes under 5 seconds. If your implementation takes over 30 seconds, it may be because you are accidentally making a copy of an entire 512×512 image each time you're picking a random image. Copying a 512×512 image 10000 times makes your implementation much less efficient. While this doesn't slow down your code significantly for this exercise (because we have only 10000 examples), when we scale to much larger problems later this quarter with $10^6$ or more examples, this will significantly slow down your code. Please implement sampleIMAGES so that you aren't making a copy of an entire 512×512 image each time you need to cut out an 8×8 image patch.

Step 2: Sparse autoencoder objective

Implement code to compute the sparse autoencoder cost function $J_{\rm sparse}(W,b)$ (Section 3 of the lecture notes) and the corresponding derivatives of $J_{\rm sparse}$ with respect to the different parameters. Use the sigmoid function for the activation function, $f(z) = \frac{1}{1+e^{-z}}$ . In particular, complete the code in sparseAutoencoderCost.m.

The sparse autoencoder is parameterized by matrices $W^{(1)} \in \Re^{s_2\times s_1}$ , $W^{(2)} \in \Re^{s_3\times s_2}$ and vectors $b^{(1)} \in \Re^{s_2}$ , $b^{(2)} \in \Re^{s_3}$ . However, for subsequent notational convenience, we will "unroll" all of these parameters into a very long parameter vector $\theta$ with $s_1 s_2 + s_2 s_3 + s_2 + s_3$ elements. The code for converting between the $(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})$ and the $\theta$ parameterization is already provided in the starter code.
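As a rough illustration of the unrolling, the sketch below packs toy parameters into a single vector and recovers them again. It mirrors the ordering used in the starter code (W1, W2, b1, b2), but the sizes and the names `nW`, `W1r`, `W2r`, `b1r`, and `b2r` are assumptions chosen for the example.

```matlab
visibleSize = 4;  hiddenSize = 3;   % toy sizes; the exercise uses 64 and 25

W1 = reshape(1:12,  hiddenSize,  visibleSize);
W2 = reshape(13:24, visibleSize, hiddenSize);
b1 = (25:27)';
b2 = (28:31)';

% Unroll: stack every parameter into one long column vector.
theta = [W1(:); W2(:); b1(:); b2(:)];

% Roll back: slice theta and reshape each piece to its original size.
nW  = hiddenSize * visibleSize;
W1r = reshape(theta(1:nW),        hiddenSize,  visibleSize);
W2r = reshape(theta(nW+1:2*nW),   visibleSize, hiddenSize);
b1r = theta(2*nW+1 : 2*nW+hiddenSize);
b2r = theta(2*nW+hiddenSize+1 : end);
```

With these toy sizes, theta has 4·3 + 3·4 + 3 + 4 = 31 elements, matching the count given above. Because `W(:)` and `reshape` both use column-major order, the round trip reproduces the original matrices exactly.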

Implementational tip: The objective $J_{\rm sparse}(W,b)$ contains 3 terms, corresponding to the squared error term, the weight decay term, and the sparsity penalty. You're welcome to implement this however you want, but for ease of debugging, you might implement the cost function and derivative computation (backpropagation) only for the squared error term first (this corresponds to setting $\lambda = \beta = 0$ ), and implement the gradient checking method in the next section to first verify that this code is correct. Then only after you have verified that the objective and derivative calculations corresponding to the squared error term are working, add in code to compute the weight decay and sparsity penalty terms and their corresponding derivatives.

Step 3: Gradient checking

Following Section 2.3 of the lecture notes, implement code for gradient checking. Specifically, complete the code in computeNumericalGradient.m. Please use EPSILON = 10^-4 as described in the lecture notes.

We've also provided code in checkNumericalGradient.m for you to test your code. This code defines a simple quadratic function $h: \Re^2 \mapsto \Re$ given by $h(x) = x_1^2 + 3x_1 x_2$ , and evaluates it at the point $x = (4,10)^T$ . It allows you to verify that your numerically evaluated gradient is very close to the true (analytically computed) gradient.

After using checkNumericalGradient.m to make sure your implementation is correct, next use computeNumericalGradient.m to make sure that your sparseAutoencoderCost.m is computing derivatives correctly. For details, see Step 3 in train.m. We strongly encourage you not to proceed to the next step until you've verified that your derivative computations are correct.

Implementational tip: If you are debugging your code, performing gradient checking on smaller models and smaller training sets (e.g., using only 10 training examples and 1-2 hidden units) may speed things up.
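For orientation, here is a minimal sketch of the central-difference check on the quadratic function mentioned above. It is one plausible way to write the loop, not the reference solution. The names `J`, `numgrad`, `grad`, and `relDiff` are assumptions, and in the exercise `J` would be a handle to sparseAutoencoderCost instead.

```matlab
% J is a function handle returning a scalar; theta is a column vector.
J = @(x) x(1)^2 + 3 * x(1) * x(2);
theta = [4; 10];
EPSILON = 1e-4;

% Perturb one coordinate at a time and take the symmetric difference quotient.
numgrad = zeros(size(theta));
for k = 1:numel(theta)
    e = zeros(size(theta));
    e(k) = EPSILON;
    numgrad(k) = (J(theta + e) - J(theta - e)) / (2 * EPSILON);
end

% Analytic gradient of h(x) = x1^2 + 3*x1*x2 for comparison.
grad = [2 * theta(1) + 3 * theta(2); 3 * theta(1)];
relDiff = norm(numgrad - grad) / norm(numgrad + grad);
```

Each gradient entry costs two evaluations of `J`, so the full check needs twice as many cost evaluations as there are parameters. This is why the tip above suggests shrinking the model and the training set while debugging.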

Step 4: Train the sparse autoencoder

Now that you have code that computes $J_{\rm sparse}$ and its derivatives, we're ready to minimize $J_{\rm sparse}$ with respect to its parameters, and thereby train our sparse autoencoder.

We will use the L-BFGS algorithm. This is provided to you in a function called minFunc (code provided by Mark Schmidt) included in the starter code. (For the purpose of this assignment, you only need to call minFunc with the default parameters. You do not need to know how L-BFGS works.) We have already provided code in train.m (Step 4) to call minFunc. The minFunc code assumes that the parameters to be optimized are a long parameter vector; so we will use the " $\theta$ " parameterization rather than the " $(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})$ " parameterization when passing our parameters to it.

Train a sparse autoencoder with 64 input units, 25 hidden units, and 64 output units. In our starter code, we have provided a function for initializing the parameters. We initialize the biases $b^{(l)}_i$ to zero, and the weights $W^{(l)}_{ij}$ to random numbers drawn uniformly from the interval $\left[-\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}},\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}}\,\right]$ , where $n_{\rm in}$ is the fan-in (the number of inputs feeding into a node) and $n_{\rm out}$ is the fan-out (the number of units that a node feeds into).

The values we provided for the various parameters ( $\lambda, \beta, \rho$ , etc.) should work, but feel free to play with different settings of the parameters as well.

Implementational tip: Once you have your backpropagation implementation correctly computing the derivatives (as verified using gradient checking in Step 3), when you are now using it with L-BFGS to optimize $J_{\rm sparse}(W,b)$ , make sure you're not doing gradient-checking on every step. Backpropagation can be used to compute the derivatives of $J_{\rm sparse}(W,b)$ fairly efficiently, and if you were additionally computing the gradient numerically on every step, this would slow down your program significantly.

Step 5: Visualization

After training the autoencoder, use display_network.m to visualize the learned weights. (See train.m, Step 5.) Run "print -djpeg weights.jpg" to save the visualization to a file "weights.jpg" (which you will submit together with your code).

Results

To successfully complete this assignment, you should demonstrate your sparse autoencoder algorithm learning a set of edge detectors. For example, this was the visualization we obtained:

Our implementation took around 5 minutes to run on a fast computer. In case you end up needing to try out multiple implementations or different parameter values, be sure to budget enough time for debugging and to run the experiments you'll need.

Also, by way of comparison, here are some visualizations from implementations that we do not consider successful (either a buggy implementation, or where the parameters were poorly tuned):

CS294A/CS294W Programming Assignment Starter Code

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  programming assignment. You will need to complete the code in sampleIMAGES.m,
%  sparseAutoencoderCost.m and computeNumericalGradient.m.
%  For the purpose of completing the assignment, you do not need to
%  change the code in this file.
%
%%======================================================================

STEP 0: Here we provide the relevant parameter values that will allow your sparse autoencoder to get good filters; you do not need to change the parameters below.

visibleSize = 8*8;   % number of input units
hiddenSize = 25;     % number of hidden units
sparsityParam = 0.01;   % desired average activation of the hidden units
                        % (denoted by the Greek letter rho, which looks like
                        %  a lower-case "p", in the lecture notes)
lambda = 0.0001;     % weight decay parameter
beta = 3;            % weight of sparsity penalty term

%%======================================================================

STEP 1: Implement sampleIMAGES

After implementing sampleIMAGES, the display_network command should
display a random sample of 200 patches from the dataset

patches = sampleIMAGES;
display_network(patches(:,randi(size(patches,2),204,1)),8);
% randi(size(patches,2),204,1) returns a 204-element column vector of random
% integers between 1 and the number of patches, so 204 randomly chosen
% patches are displayed.


%  Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%%======================================================================

STEP 2: Implement sparseAutoencoderCost

You can implement all of the components (squared error cost, weight decay term,
sparsity penalty) in the cost function at once, but it may be easier to do
it step-by-step and run gradient checking (see STEP 3) after each step.  We
suggest implementing the sparseAutoencoderCost function using the following steps:

(a) Implement forward propagation in your neural network, and implement the
    squared error term of the cost function.  Implement backpropagation to
    compute the derivatives.   Then (using lambda=beta=0), run Gradient Checking
    to verify that the calculations corresponding to the squared error cost
    term are correct.

(b) Add in the weight decay term (in both the cost function and the derivative
    calculations), then re-run Gradient Checking to verify correctness.

(c) Add in the sparsity penalty term, then re-run Gradient Checking to
    verify correctness.

Feel free to change the training settings when debugging your
code.  (For example, reducing the training set size or
number of hidden units may make your code run faster; and setting beta
and/or lambda to zero may be helpful for debugging.)  However, in your
final submission of the visualized weights, please use parameters we
gave in Step 0 above.

[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
                                     sparsityParam, beta, patches);

%%======================================================================

STEP 3: Gradient Checking

Hint: If you are debugging your code, performing gradient checking on smaller models and smaller training sets (e.g., using only 10 training examples and 1-2 hidden units) may speed things up.

% First, lets make sure your numerical gradient computation is correct for a
% simple function.  After you have implemented computeNumericalGradient.m,
% run the following:
%checkNumericalGradient();

% Now we can use it to check your cost function and derivative calculations
% for the sparse autoencoder.
% numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
%                                                   hiddenSize, lambda, ...
%                                                   sparsityParam, beta, ...
%                                                   patches), theta);

% Use this to visually compare the gradients side by side
%disp([numgrad grad]);

% Compare numerically computed gradients with the ones obtained from backpropagation
% diff = norm(numgrad-grad)/norm(numgrad+grad);
% disp(diff); % Should be small. In our implementation, these values are
            % usually less than 1e-9.

            % When you got this working, Congratulations!!!

%%======================================================================

STEP 4: After verifying that your implementation of sparseAutoencoderCost is correct, you can start training your sparse autoencoder with minFunc (L-BFGS).

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

%  Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
                          % function. Generally, for minFunc to work, you
                          % need a function pointer with two outputs: the
                          % function value and the gradient. In our problem,
                          % sparseAutoencoderCost.m satisfies this.
options.maxIter = 400;      % Maximum number of iterations of L-BFGS to run
options.display = 'on';


[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                              theta, options);

%%======================================================================

STEP 5: Visualization

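W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = opttheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

% Hidden-layer activations for every patch (logsig is the sigmoid function
% from the Neural Network Toolbox).
hiddenAct = logsig(W1*patches + repmat(b1, 1, size(patches,2)));

% Zero out activations below 0.01 to obtain a sparse feature matrix.
featureMatrix = hiddenAct;
featureMatrix(abs(hiddenAct) < 0.01) = 0;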


figure;
display_network(W1', 12);

print -djpeg weights.jpg   % save the visualization to a file

---------------------------------------------------------------

function patches = sampleIMAGES()

% sampleIMAGES
% Returns 10000 patches for training

load IMAGES;    % load images from disk

patchsize = 8;  % we'll use 8x8 patches
numpatches = 10000;

% Initialize patches with zeros.  Your code will fill in this matrix--one
% column per patch, 10000 columns.
patches = zeros(patchsize*patchsize, numpatches);

---------- YOUR CODE HERE --------------------------------------

Instructions: Fill in the variable called "patches" using data
from IMAGES.

IMAGES is a 3D array containing 10 images
For instance, IMAGES(:,:,6) is a 512x512 array containing the 6th image,
and you can type "imagesc(IMAGES(:,:,6)), colormap gray;" to visualize
it. (The contrast on these images looks a bit off because they have
been preprocessed using "whitening."  See the lecture notes for
more details.) As a second example, IMAGES(21:30,21:30,1) is an image
patch corresponding to the pixels in the block (21,21) to (30,30) of
Image 1

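for imageNum = 1:10   % sample 1000 random patches from each image, 10000 in total
    rowNum = size(IMAGES, 1);
    colNum = size(IMAGES, 2);
    for patchNum = 1:1000   % 1000 patches per image
        % Top-left corner of the patch, chosen so the patch stays inside the image.
        xPos = randi([1, rowNum-patchsize+1]);
        yPos = randi([1, colNum-patchsize+1]);
        patches(:,(imageNum-1)*1000+patchNum) = reshape(IMAGES(xPos:xPos+patchsize-1, ...
                                                        yPos:yPos+patchsize-1, imageNum), ...
                                                        patchsize*patchsize, 1);
    end
end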

---------------------------------------------------------------

For the autoencoder to work well we need to normalize the data. Specifically, since the output of the network is bounded between [0,1] (due to the sigmoid activation function), we have to make sure the range of pixel values is also bounded between [0,1].

patches = normalizeData(patches);

end

---------------------------------------------------------------

function patches = normalizeData(patches)

% Squash data to [0.1, 0.9] since we use sigmoid as the activation
% function in the output layer

% Remove DC (mean of images).
patches = bsxfun(@minus, patches, mean(patches));

% Truncate to +/-3 standard deviations and scale to -1 to 1
pstd = 3 * std(patches(:));
patches = max(min(patches, pstd), -pstd) / pstd;  % by the three-sigma rule, well over 95% of the
                                                  % values fall within this range; after this step
                                                  % the data lies in [-1, 1]

% Rescale from [-1,1] to [0.1,0.9]
patches = (patches + 1) * 0.4 + 0.1;

end

function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
% This function visualizes the filters in matrix A. Each column of A is a
% filter. Each column is reshaped into a square image and drawn in one
% cell of the visualization panel.
% All other parameters are optional; usually you do not need to worry
% about them.
% opt_normalize: whether to normalize each filter so that all of
% them have similar contrast. Default value is true.
% opt_graycolor: whether to use a gray colormap. Default is true.
% cols: number of columns in the display. Default value is the
% square root of the number of columns in A.
% opt_colmajor: switches the convention to row major for A. In that
% case, each row of A is a filter. Default value is false.
warning off all

if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    opt_normalize= true;
end

if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    opt_graycolor= true;
end

if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    opt_colmajor = false;
end

% rescale
A = A - mean(A(:));

if opt_graycolor, colormap(gray); end

% compute rows, cols
[L M]=size(A);
sz=sqrt(L);
buf=1;
if ~exist('cols', 'var')
    if floor(sqrt(M))^2 ~= M
        n=ceil(sqrt(M));
        while mod(M, n)~=0 && n<1.2*sqrt(M), n=n+1; end
        m=ceil(M/n);
    else
        n=sqrt(M);
        m=n;
    end
else
    n = cols;
    m = ceil(M/n);
end

array=-ones(buf+m*(sz+buf),buf+n*(sz+buf));

if ~opt_graycolor
    array = 0.1.* array;
end


if ~opt_colmajor
    k=1;
    for i=1:m
        for j=1:n
            if k>M,
                continue;
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));
            end
            k=k+1;
        end
    end
else
    k=1;
    for j=1:n
        for i=1:m
            if k>M,
                continue;
            end
            clim=max(abs(A(:,k)));
            if opt_normalize
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
            else
                array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz);
            end
            k=k+1;
        end
    end
end

if opt_graycolor
    h=imagesc(array,'EraseMode','none',[-1 1]);
else
    h=imagesc(array,'EraseMode','none',[-1 1]);
end
axis image off

drawnow;

warning on all

Initialize parameters randomly based on layer sizes.

function theta = initializeParameters(hiddenSize, visibleSize)

r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;

b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% your parameters into a vector, which can then be used with minFunc.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

end

---------------------------------------------------------------

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)

% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the Greek letter rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.

% Convert the long parameter vector into the weight matrix and bias vector of each layer.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));

---------- YOUR CODE HERE --------------------------------------

Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
              and the corresponding gradients W1grad, W2grad, b1grad, b2grad.

W1grad, W2grad, b1grad and b2grad should be computed using backpropagation. Note that W1grad has the same dimensions as W1, b1grad has the same dimensions as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 of the lecture notes (and similarly for W2grad, b1grad, b2grad).

Stated differently, if we were using batch gradient descent to optimize the parameters, the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.

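Jcost = 0;    % squared-error (reconstruction) term
Jweight = 0;  % weight decay term
Jsparse = 0;  % sparsity penalty term
[n, m] = size(data);  % m is the number of examples, n the number of features

% Forward pass: compute the weighted sums (z) and activations (a) of each layer.
z2 = W1*data+repmat(b1,1,m);  % b1 must be replicated into m columns to match W1*data
a2 = sigmoid(z2);
z3 = W2*a2+repmat(b2,1,m);
a3 = sigmoid(z3);

% Average squared reconstruction error.
Jcost = (0.5/m)*sum(sum((a3-data).^2));

% Weight decay term.
Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));

% Sparsity penalty term.
rho = (1/m).*sum(a2,2);  % average activation of each hidden unit (rho-hat in the notes)
Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+ ...
        (1-sparsityParam).*log((1-sparsityParam)./(1-rho)));

% Total cost.
cost = Jcost+lambda*Jweight+beta*Jsparse;

% Backward pass: compute the error term of each unit.
% Note: sigmoidInv returns the sigmoid derivative f'(z), despite its name.
d3 = -(data-a3).*sigmoidInv(z3);
sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));  % extra term that the sparsity
                                                                % penalty adds to the hidden-layer deltas
d2 = (W2'*d3+repmat(sterm,1,m)).*sigmoidInv(z2);

% Compute W1grad.
W1grad = W1grad+d2*data';
W1grad = (1/m)*W1grad+lambda*W1;

% Compute W2grad.
W2grad = W2grad+d3*a2';
W2grad = (1/m).*W2grad+lambda*W2;

% Compute b1grad. The bias gradient is a vector, so sum each row over the examples.
b1grad = b1grad+sum(d2,2);
b1grad = (1/m)*b1grad;

% Compute b2grad.
b2grad = b2grad+sum(d3,2);
b2grad = (1/m)*b2grad;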



% %% Method 2: process one example at a time (slow)
% m=size(data,2);
% rho=zeros(size(b1));
% for i=1:m
%     %feedforward
%     a1=data(:,i);
%     z2=W1*a1+b1;
%     a2=sigmoid(z2);
%     z3=W2*a2+b2;
%     a3=sigmoid(z3);
%     %cost=cost+(a1-a3)'*(a1-a3)*0.5;
%     rho=rho+a2;
% end
% rho=rho/m;
% sterm=beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho));
% %sterm=beta*2*rho;
% for i=1:m
%     %feedforward
%     a1=data(:,i);
%     z2=W1*a1+b1;
%     a2=sigmoid(z2);
%     z3=W2*a2+b2;
%     a3=sigmoid(z3);
%     cost=cost+(a1-a3)'*(a1-a3)*0.5;
%     %backpropagation
%     delta3=(a3-a1).*a3.*(1-a3);
%     delta2=(W2'*delta3+sterm).*a2.*(1-a2);
%     W2grad=W2grad+delta3*a2';
%     b2grad=b2grad+delta3;
%     W1grad=W1grad+delta2*a1';
%     b1grad=b1grad+delta2;
% end
%
% kl=sparsityParam*log(sparsityParam./rho)+(1-sparsityParam)*log((1-sparsityParam)./(1-rho));
% %kl=rho.^2;
% cost=cost/m;
% cost=cost+sum(sum(W1.^2))*lambda/2.0+sum(sum(W2.^2))*lambda/2.0+beta*sum(kl);
% W2grad=W2grad./m+lambda*W2;
% b2grad=b2grad./m;
% W1grad=W1grad./m+lambda*W1;
% b1grad=b1grad./m;


%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).

function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));
end

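% Derivative of the sigmoid function, f'(z) = f(z).*(1 - f(z)).
% (Despite the name, this is not the inverse of the sigmoid.)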
function sigmInv = sigmoidInv(x)

    sigmInv = sigmoid(x).*(1-sigmoid(x));
end
