GarfieldEr007

BP反向传播算法是如何工作的How the backpropagation algorithm works

In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn't discuss how to compute the gradient of the cost function. That's quite a gap! In this chapter I'll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?

The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative of the cost function with respect to any weight (or bias ) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.

With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.

Warm up: a fast matrix-based approach to computing the output from a neural network

Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.

Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use to denote the weight for the connection from the neuron in the layer to the neuron in the layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

This notation is cumbersome at first, and it does take some work to master. But with a little effort you'll find the notation becomes easy and natural. One quirk of the notation is the ordering of the

and

indices. You might think that it makes more sense to use

to refer to the input neuron, and

to the output neuron, not vice versa, as is actually done. I'll explain the reason for this quirk below.

We use a similar notation for the network's biases and activations. Explicitly, we use for the bias of the neuron in the layer. And we use for the activation of the neuron in the layer. The following diagram shows examples of these notations in use:

With these notations, the activation

of the

neuron in the

layer is related to the activations in the

layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)

where the sum is over all neurons

in the

layer. To rewrite this expression in a matrix form we define a weight matrix

for each layer,

. The entries of the weight matrix

are just the weights connecting to the

layer of neurons, that is, the entry in the

row and

column is

. Similarly, for each layer

we define a bias vector ,

. You can probably guess how this works - the components of the bias vector are just the values

, one component for each neuron in the

layer. And finally, we define an activation vector

whose components are the activations

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as . We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as to every element in a vector . We use the obvious notation to denote this kind of elementwise application of a function. That is, the components of are just . As an example, if we have the function then the vectorized form of has the effectthat is, the vectorized just squares every element of the vector.

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized formThis expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the function**By the way, it's this expression that motivates the quirk in the notation mentioned earlier. If we used to index the input neuron, and to index the output neuron, then we'd need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That's a small change, but annoying, and we'd lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".. That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.

When using Equation (25) to compute , we compute the intermediate quantity along the way. This quantity turns out to be useful enough to be worth naming: we call theweighted input to the neurons in layer . We'll make considerable use of the weighted input later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as . It's also worth noting that has components , that is, is just the weighted input to the activation function for neuron in layer .

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives and of the cost function with respect to any weight or bias in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it's useful to have an example cost function in mind. We'll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the formwhere: is the total number of training examples; the sum is over individual training examples, ; is the corresponding desired output; denotes the number of layers in the network; and is the vector of activations output from the network when is input.

Okay, so what assumptions do we need to make about our cost function, , in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average over cost functions for individual training examples, . This is the case for the quadratic cost function, where the cost for a single training example is . This assumption will also hold true for all the other cost functions we'll meet in this book.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives and for a single training example. We then recover and by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example has been fixed, and drop the subscript, writing the cost as . We'll eventually put the back in, but for now it's a notational nuisance that is better left implicit.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example

may be written as

and thus is a function of the output activations. Of course, this cost function also depends on the desired output

, and you may wonder why we're not regarding the cost also as a function of

. Remember, though, that the input training example

is fixed, and so the output

is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns. And so it makes sense to regard

as a function of the output activations

alone, with

merely a parameter that helps define that function.

The Hadamard product,

The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose and are two vectors of the same dimension. Then we use to denote theelementwise product of the two vectors. Thus the components of are just . As an example,This kind of elementwise multiplication is sometimes called theHadamard product or Schur product. We'll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.

The four fundamental equations behind backpropagation

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives and . But to compute those, we first introduce an intermediate quantity, , which we call the error in the neuron in the layer. Backpropagation will give us a procedure to compute the error , and then will relate to and .

To understand how the error is defined, imagine there is a demon in our neural network:

The demon sits at the

neuron in layer

. As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change

to the neuron's weighted input, so that instead of outputting

, the neuron instead outputs

. This change propagates through later layers in the network, finally causing the overall cost to change by an amount

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a which makes the cost smaller. Suppose has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing to have the opposite sign to . By contrast, if is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input . So far as the demon can tell, the neuron is already pretty near optimal**This is only the case for small changes , of course. We'll assume that the demon is constrained to make such small changes.. And so there's a heuristic sense in which is a measure of the error in the neuron.

Motivated by this story, we define the error of neuron in layer byAs per our usual conventions, we use to denote the vector of errors associated with layer . Backpropagation will give us a way of computing for every layer, and then relating those errors to the quantities of real interest, and .

You might wonder why the demon is changing the weighted input . Surely it'd be more natural to imagine the demon changing the output activation , with the result that we'd be using as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with as our measure of error**In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage..

Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.

An equation for the error in the output layer, : The components of are given byThis is a very natural expression. The first term on the right, , just measures how fast the cost is changing as a function of the output activation. If, for example, doesn't depend much on a particular output neuron, , then will be small, which is what we'd expect. The second term on the right, , measures how fast the activation function is changing at .

Notice that everything in (BP1) is easily computed. In particular, we compute while computing the behaviour of the network, and it's only a small additional overhead to compute . The exact form of will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing . For example, if we're using the quadratic cost function then , and so , which obviously is easily computable.

Equation (BP1) is a componentwise expression for . It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, asHere, is defined to be a vector whose components are the partial derivatives . You can think of as expressing the rate of change of with respect to the output activations. It's easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have , and so the fully matrix-based form of (BP1) becomesAs you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.

An equation for the error in terms of the error in the next layer, : In particularwhere is the transpose of the weight matrix for the layer. This equation appears complicated, but each element has a nice interpretation. Suppose we know the error at the layer. When we apply the transpose weight matrix, , we can think intuitively of this as moving the error backwardthrough the network, giving us some sort of measure of the error at the output of the layer. We then take the Hadamard product . This moves the error backward through the activation function in layer , giving us the error in the weighted input to layer .

By combining (BP2) with (BP1) we can compute the error for any layer in the network. We start by using (BP1) to compute , then apply Equation (BP2) to compute , then Equation (BP2) again to compute , and so on, all the way back through the network.

An equation for the rate of change of the cost with respect to any bias in the network: In particular:That is, the error is exactly equal to the rate of change . This is great news, since (BP1) and (BP2) have already told us how to compute . We can rewrite (BP3) in shorthand aswhere it is understood that is being evaluated at the same neuron as the bias .

An equation for the rate of change of the cost with respect to any weight in the network: In particular:This tells us how to compute the partial derivatives in terms of the quantities and , which we already know how to compute. The equation can be rewritten in a less index-heavy notation aswhere it's understood that is the activation of the neuron input to the weight , and is the error of the neuron output from the weight . Zooming in to look at just the weight , and the two neurons connected by that weight, we can depict this as:

A nice consequence of Equation (32) is that when the activation

is small,

, the gradient term

will also tend to be small. In this case, we'll say the weight learns slowly , meaning that it's not changing much during gradient descent. In other words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.

There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term in (BP1). Recall from the graph of the sigmoid function in the last chapter that the function becomes very flat when is approximately or . When this occurs we will have . And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation () or high activation (). In this case it's common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold also for the biases of output neuron.

We can obtain similar insights for earlier layers. In particular, note the term in (BP2). This means that is likely to get small if the neuron is near saturation. And this, in turn, means that any weights input to a saturated neuron will learn slowly**This reasoning won't hold if has large enough entries to compensate for the smallness of . But I'm speaking of the general tendency..

Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.

None of these observations is too greatly surprising. Still, they help improve our mental model of what's going on as a neural network learns. Furthermore, we can turn this type of reasoning around. The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function (that's because, as we'll see in a moment, the proofs don't use any special properties of ). And so we can use these equations to design activation functions which have particular desired learning properties. As an example to give you the idea, suppose we were to choose a (non-sigmoid) activation function so that is always positive, and never gets close to zero. That would prevent the slow-down of learning that occurs when ordinary sigmoid neurons saturate. Later in the book we'll see examples where this kind of modification is made to the activation function. Keeping the four equations (BP1)-(BP4) in mind can help explain why such modifications are tried, and what impact they can have.

Problem

Alternate presentation of the equations of backpropagation: I've stated the equations of backpropagation (notably (BP1) and (BP2)) using the Hadamard product. This presentation may be disconcerting if you're unused to the Hadamard product. There's an alternative approach, based on conventional matrix multiplication, which some readers may find enlightening. (1) Show that (BP1) may be rewritten aswhere is a square matrix whose diagonal entries are the values , and whose off-diagonal entries are zero. Note that this matrix acts on by conventional matrix multiplication. (2) Show that (BP2) may be rewritten as(3) By combining observations (1) and (2) show thatFor readers comfortable with matrix multiplication this equation may be easier to understand than (BP1) and (BP2). The reason I've focused on (BP1) and (BP2) is because that approach turns out to be faster to implement numerically.

Proof of the four fundamental equations (optional)

We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.

Let's begin with Equation (BP1), which gives an expression for the output error, . To prove this equation, recall that by definitionApplying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations,where the sum is over all neurons in the output layer. Of course, the output activation of the neuron depends only on the input weight for the neuron when . And so vanishes when . As a result we can simplify the previous equation toRecalling that the second term on the right can be written as , and the equation becomeswhich is just (BP1), in component form.

Next, we'll prove (BP2), which gives an equation for the error in terms of the error in the next layer, . To do this, we want to rewrite in terms of . We can do this using the chain rule,where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of . To evaluate the first term on the last line, note thatDifferentiating, we obtainSubstituting back into (42) we obtainThis is just (BP2) written in component form.

The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.

Exercise

Prove Equations (BP3) and (BP4).

That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

Input : Set the corresponding activation for the input layer.
Feedforward: For each compute and .
Output error : Compute the vector .
Backpropagate the error: For each compute .
Output: The gradient of the cost function is given by and .

Examining the algorithm you can see why it's calledbackpropagation. We compute the error vectors backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.

Exercises

Backpropagation with a single modified neuronSuppose we modify a single neuron in a feedforward network so that the output from the neuron is given by , where is some function other than the sigmoid. How should we modify the backpropagation algorithm in this case?
Backpropagation with linear neurons Suppose we replace the usual non-linear function with throughout the network. Rewrite the backpropagation algorithm for this case.

As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, . In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

Input a set of training examples
For each training example : Set the corresponding input activation ax,1, and perform the following steps:
- Feedforward: For each compute and .
- Output error : Compute the vector.
- Backpropagate the error: For each compute .
Gradient descent: For each update the weights according to the rule , and the biases according to the rule .

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I've omitted those for simplicity.

The code for backpropagation

Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Networkclass. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batchmethod updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses the backprop method to figure out the partial derivatives

and

. The backprop method follows the algorithm in the last section closely. There is one small change - we use a slightly different approach to indexing the layers. This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., l[-3] is the third last entry in a list l . The code for backprop is below, together with a few helper functions, which are used to compute the

function, the derivative

, and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way. If something's tripping you up, you may find it helpful to consult the original description (and complete listing) of the code .

class Network(object):
...
   def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

Problem

Fully matrix-based approach to backpropagation over a mini-batch Our implementation of stochastic gradient descent loops over training examples in a mini-batch. It's possible to modify the backpropagation algorithm so that it computes the gradients for all training examples in a mini-batch simultaneously. The idea is that instead of beginning with a single input vector, , we can begin with a matrix whose columns are the vectors in the mini-batch. We forward-propagate by multiplying by the weight matrices, adding a suitable matrix for the bias terms, and applying the sigmoid function everywhere. We backpropagate along similar lines. Explicitly write out pseudocode for this approach to the backpropagation algorithm. Modify network.pyso that it uses this fully matrix-based approach. The advantage of this approach is that it takes full advantage of modern libraries for linear algebra. As a result it can be quite a bit faster than looping over the mini-batch. (On my laptop, for example, the speedup is about a factor of two when run on MNIST classification problems like those we considered in the last chapter.) In practice, all serious libraries for backpropagation use this fully matrix-based approach or some variant.

In what sense is backpropagation a fast algorithm?

In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights alone (we'll get back to the biases in a moment). You number the weights , and want to compute for some particular weight . An obvious way of doing that is to use the approximationwhere is a small positive number, and is the unit vector in the direction. In other words, we can estimate by computing the cost for two slightly different values of , and then applying Equation (46). The same idea will let us compute the partial derivatives with respect to the biases.

This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight we need to compute in order to compute . That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute as well, so that's a total of a million and one passes through the network.

What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass**This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it's actually much, much faster.

This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.

Backpropagation: the big picture

As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.

To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change to some weight in the network, :

That change in weight will cause a change in the output activation from the corresponding neuron: That, in turn, will cause a change in all the activations in the next layer: Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function: The change

in the cost is related to the change

in the weight by the equation

This suggests that a possible approach to computing

is to carefully track how a small change in

propagates to cause a small change in

. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute

Let's try to carry this out. The change causes a small change in the activation of the neuron in the layer. This change is given by

The change in activation

will cause changes in all the activations in the next layer, i.e., the

layer. We'll concentrate on the way just a single one of those activations is affected, say

In fact, it'll cause the following change:

Substituting in the expression from Equation (48) , we get:

Of course, the change

will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from

, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations

then the resulting expression is

that is, we've picked up a

type term for each additional neuron we've passed through, as well as the

term at the end. This represents the change in

due to changes in the activations along this particular path through the network. Of course, there's many paths by which a change in

can propagate to affect the cost, and we've been considering just a single path. To compute the total change in

it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,

where we've summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that

Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We're computing the rate of change of

with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The edge from the first weight to the first neuron has a rate factor

. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change

is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:

What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation(53). That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.

Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.

What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier**There is one clever step required. In Equation(53) the intermediate variables are activations like . The clever idea is to switch to using weighted inputs, like , as the intermediate variables. If you don't have this idea, and instead continue using the activations , the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.

from:　http://neuralnetworksanddeeplearning.com/chap2.html

你可能感兴趣的:(Deep,Learning)

deepin分享-Linux & Windows 双系统时间不一致解决方案 deepin
在双系统环境中（如Windows和Linux），时间同步问题是一个常见的困扰。Windows和Linux对系统时间的处理方式不同，这可能导致时间显示不一致。本文将介绍两种解决方法，帮助你解决Linux和Windows双系统时间不一致的问题。问题背景Windows操作系统直接将CMOS时间（硬件时钟）视为本地时间，不根据时区进行转换。每次调整系统时区或修改时间时，Windows会直接修改CMOS时间
deepin-UEFI 引导：从入门到重装 deepin
在现代计算机中，UEFI（统一可扩展固件接口）已成为主流的启动方式，逐渐取代了传统的BIOS。UEFI提供了许多改进，如更灵活的启动管理、更大的分区支持以及更快的启动速度。然而，对于许多Linux用户来说，UEFI的复杂性可能会带来一些挑战，尤其是在多系统环境中。本文将详细介绍如何在Linux下使用UEFI引导系统，以及如何在出现问题时进行修复和重装。1.UEFI的基本原理UEFI是一种替代传统B
强化学习代码实践1.DDQN:在CartPole游戏中实现 Double DQN 洪小帅游戏 python gym pytorch 深度学习
强化学习代码实践1.DDQN:在CartPole游戏中实现DoubleDQN1.导入依赖2.定义Q网络3.创建Agent4.训练过程5.解释6.调整超参数在CartPole游戏中实现DoubleDQN（DDQN）训练网络时，我们需要构建一个使用两个Q网络（一个用于选择动作，另一个用于更新目标）的方法。DoubleDQN通过引入目标网络来减少Q-learning中过度估计的偏差。下面是一个基于PyT
目标跟踪概念、多目标跟踪算法SORT和deep SORT原理 yhwang-hub 深度学习
目录目标跟踪、单目标跟踪、多目标跟踪的概念欧氏距离、马氏距离、余弦距离欧氏距离马氏距离余弦距离SORT算法原理SORT算法中的匈牙利匹配算法指派问题中的匈牙利算法预测模型（卡尔曼滤波器）数据关联（匈牙利匹配）目标丢失问题的处理SORT算法过程deepSORT算法原理状态估计轨迹处理分配问题的评价指标级联匹配深度表观描述子算法总结目标跟踪、单目标跟踪、多目标跟踪的概念目标跟踪分为静态背景下的目标跟踪
mtls加密双向认证 sun007700 安全 ssl https http
https://www.cloudflare.com/en-gb/learning/access-management/what-is-mutual-tls/HTTPS双向认证（MutualTLSauthentication)-API网关-阿里云SSL/TLS双向认证(一)--SSL/TLS工作原理_ustccw-CSDN博客_双向认证SSL/TSL双向认证过程与Wireshark抓包分析_区块链
DeepSpeed 常见问题解决方案申晓容Lucille
DeepSpeed常见问题解决方案DeepSpeedDeepSpeedisadeeplearningoptimizationlibrarythatmakesdistributedtrainingandinferenceeasy,efficient,andeffective.项目地址:https://gitcode.com/gh_mirrors/de/DeepSpeed1.项目基础介绍和主要编程语言
Flink系列-2、Flink架构体系技术武器库大数据专栏 flink 架构 jvm
版权声明：本文为博主原创文章，遵循CC4.0BY-SA版权协议，转载请附上原文出处链接和本声明。大数据系列文章目录官方网址：https://flink.apache.org/学习资料：https://flink-learning.org.cn/目录Flink中的重要角⾊Flink数据流编程模型Libraries支持Flink集群搭建Local本地模式（开发测试）Standalone-伪分布环境（开
几个导致DeepFaceLab训练速度较慢的原因 AlphaFinance 多媒体AI技术人工智能 python 机器学习
可能有几个原因导致DeepFaceLab训练速度较慢：复杂度：DeepFaceLab的算法和模型较为复杂，需要处理大量数据和计算复杂的数学运算，这可能导致训练速度较慢。硬件配置：DeepFaceLab需要较高的计算机配置才能运行，包括较大的内存、高性能的GPU、快速的存储器等。如果你的计算机配置不够高，可能会导致训练速度较慢。数据量：DeepFaceLab需要大量的训练数据来训练模型，如果你的数据
Cursor 收费太贵？3分钟教你接入超低价 DeepSeek-V3，代码质量逼近 Claude 3.5 人工智能
DeepSeek-V3实在是太便宜了，就跟不要钱似的：每百万输入tokens0.1元(缓存命中)/1元(缓存未命中)，每百万输出tokens2元跟其他模型相比，DeepSeek-V3的性价比非常高，只能用“真香”来形容。Sealos推出的AI聚合代理服务SealosAIProxy为用户提供了便捷的AI模型访问通道，其中就包含了DeepSeek-V3模型。而且通过SealosAIProxy使用这些模
DeepSeek-V2 百态老人学习
DeepSeek-V2是由幻方量化旗下的AI公司DeepSeek发布的第二代MoE（Mixture-of-Experts）大模型，具有显著的性能和成本优势。以下是关于DeepSeek-V2的详细分析：性能表现：DeepSeek-V2是一个参数量为2360亿的MoE模型，其性能接近GPT-4Turbo，并在多个基准测试中表现优异，如AlignBench、MT-Bench等，超越了GPT-4，与GPT
error: libcublasLt.so.11: cannot open shared object file: No such file or directory/缺少libcublas.so查找鼾声鼾语 linux ubuntu 服务器 python can通讯方法
1,问题：gstnvtracker:Loadinglow-levellibat/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.sogstnvtracker:Failedtoopenlow-levellibat/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmulti
增强大型语言模型（LLM）可访问性：深入探究在单块AMD GPU上通过QLoRA微调Llama 2的过程 109702008 人工智能 #ROCm #python 语言模型 llama 人工智能
EnhancingLLMAccessibility:ADeepDiveintoQLoRAThroughFine-tuningLlama2onasingleAMDGPU—ROCmBlogs基于之前的博客《使用LoRA微调Llama2》的内容，我们深入研究了一种称为量化低秩调整（QLoRA）的参数高效微调（PEFT）方法。本次重点是利用QLoRA技术在单块AMDGPU上，使用ROCm微调Llama-2
【深度学习基础】线性神经网络 | softmax回归的简洁实现 Francek Chen PyTorch深度学习深度学习神经网络回归 softmax 人工智能
【作者主页】FrancekChen【专栏介绍】⌈⌈⌈PyTorch深度学习⌋⌋⌋深度学习(DL,DeepLearning)特指基于深层神经网络模型和方法的机器学习。它是在统计机器学习、人工神经网络等算法模型基础上，结合当代大数据和大算力的发展而发展出来的。深度学习最重要的技术特征是具有自动提取特征的能力。神经网络算法、算力和数据是开展深度学习的三要素。深度学习在计算机视觉、自然语言处理、多模态数据
深度求索DeepSeek V2.5-1210发布：AI代码生成器迎来全新升级 2401_89759264 人工智能前端
深度学习技术日新月异，而强大的AI代码生成器也随之不断进化。今天，我们将聚焦于深度求索团队发布的DeepSeekV2.5-1210版本，这款标志着DeepSeekV2系列收官之作，为我们带来了令人惊喜的Post-Training能力提升和备受期待的联网搜索功能。这篇文章将深入探讨DeepSeekV2.5-1210的各项改进，以及其开源带来的深远影响。DeepSeekV2系列的研发历程与V2.5-1
DCGAN - 深度卷积生成对抗网络：基于卷积神经网络的GAN 池央生成对抗网络 cnn 深度学习
深度卷积生成对抗网络（DCGAN，DeepConvolutionalGenerativeAdversarialNetwork）是生成对抗网络（GAN）的一种扩展，它通过使用卷积神经网络（CNN）来实现生成器和判别器的构建。与标准的GAN相比，DCGAN通过引入卷积层来改善图像生成质量，使得生成器能够生成更清晰、更高分辨率的图像。DCGAN提出了一种通过卷积结构来提高图像生成效果的策略，并在多个领域
细嗦Transformer（三）：准备训练，讲解及代码实现优化器、学习率调整策略、正则化和KL散度损失 Ace_bb 算法 LLM transformer
文章目录关注我：细嗦大模型批处理对象/BatchesandMasking训练循环主函数/TrainingLoop优化器/Optimizer学习率调整策略/Learningrateadjustmentstrategy样例测试正则化/RegularizationLabelsmoothing标签平滑KL散度损失样例测试Github完整代码----求求了给个star和关注吧参考资料求求了，给个star和关
lxml.etree模式使用(一) 卫生纸不够用 python爬虫 python 前端 javascript
fromlxmlimportetreefromcopyimportdeepcopydefprettyprint(element,**kwargs):print("/")xml=etree.tostring(element,pretty_print=True,**kwargs)print(xml.decode(),end='')#1.创建元素root=etree.Element("root")#2.
深度求索DeepSeek V2.5-1210发布：AI代码生成器迎来全新升级前端
深度学习技术日新月异，而强大的AI代码生成器也随之不断进化。今天，我们将聚焦于深度求索团队发布的DeepSeekV2.5-1210版本，这款标志着DeepSeekV2系列收官之作，为我们带来了令人惊喜的Post-Training能力提升和备受期待的联网搜索功能。这篇文章将深入探讨DeepSeekV2.5-1210的各项改进，以及其开源带来的深远影响。DeepSeekV2系列的研发历程与V2.5-1
deepin-grep详解：文本搜索的强大工具 deepin
在Linux系统中，grep命令是一个极其强大的文本搜索工具，广泛应用于文本处理、日志分析和数据筛选等场景。它的全称是“GlobalsearchREgularexpressionandPrintouttheline”，即全局搜索正则表达式并打印匹配的行。本文将详细介绍grep命令的基本用法、常用选项以及正则表达式的使用技巧。1.grep命令的基本功能grep命令的主要作用是从文本文件或管道数据流中
deepin分享-Linux 磁盘分区和挂载指南 deepin
在Linux系统中(如deepin等)，磁盘分区和挂载是系统管理的重要组成部分。了解如何进行分区、格式化和挂载操作，可以帮助你更好地管理磁盘空间，优化系统性能，并确保数据的安全存储。本文将详细介绍Linux磁盘分区和挂载的基本概念、操作步骤以及一些实用的命令。1.基本概念Linux系统采用了一种独特的文件系统结构，无论系统中有多少个分区，它们最终都归属于一个根目录（/），形成一个统一的文件系统。每
【人工智能】Python实战：构建高效的多任务学习模型蒙娜丽宁 Python杂谈 AI 人工智能 python 学习
《PythonOpenCV从菜鸟到高手》带你进入图像处理与计算机视觉的大门！解锁Python编程的无限可能：《奇妙的Python》带你漫游代码世界多任务学习（Multi-taskLearning,MTL）作为机器学习领域中的一种重要方法，通过在单一模型中同时学习多个相关任务，不仅能够提高模型的泛化能力，还能有效利用任务间的共享信息。本文深入探讨了多任务学习的基本概念、优势及其在实际应用中的重要性。
DeepSeek：极致的中国技术理想 X_taiyang18 AI与机器学习人工智能
揭秘DeepSeek:一个更极致的中国技术理想主义故事划重点中国的大模型创业公司DeepSeek因其创新的MLA架构和DeepSeekMoESparse结构，使推理成本降低至每百万token仅1块钱，引发中国大模型价格战。与其他大公司烧钱补贴不同，DeepSeek是有利润的，背后是DeepSeek对模型架构的全面创新。DeepSeek创始人梁文锋认为，中国的大模型创业者除应用创新外，也可以加入到全
vscode accelerate deepspeed配置 Ctrl_Cver vscode ide 编辑器
accelerate训练{//UseIntelliSensetolearnaboutpossibleattributes.//Hovertoviewdescriptionsofexistingattributes.//Formoreinformation,visit:https://go.microsoft.com/fwlink/?linkid=830387"version":"0.2.0","c
deepin 23 Preview 运行自定义 exe 的方法 deepin
在deepin23Preview版本中，运行自定义的exe程序可以通过以下步骤实现：一、安装Wine运行器（一）使用linglong格式包的Wine应用如果你已经安装了linglong格式包的Wine程序，在WINE版本处将直接出现选项供你选择使用。需要注意的是：在使用linglong包的Wine应用时，必须先安装至少一个linglong的使用Wine软件包，才会出现该选项。程序识别到的Wine是
AAAI2024论文解读|Towards Fairer Centroids in k-means Clustering面向更公平的 k 均值聚类中心 paixiaoxin 文献阅读论文合集支持向量机机器学习人工智能聚类公平性 k 均值聚类质心代表性群体代表性公平性
论文标题TowardsFairerCentroidsink-meansClustering面向更公平的k均值聚类中心论文链接TowardsFairerCentroidsink-meansClustering论文下载论文作者StanleySimoes,DeepakP,MuirisMacCarthaigh内容简介本文提出了一种新的聚类级质心公平性（Cluster-levelCentroidFairne
大模型介绍詹姆斯爱研究Java spring
大模型（LargeModel）指的是拥有庞大参数量的机器学习模型。由于具有更多的参数，大模型能够更好地拟合复杂的数据和模式，从而提供更准确的预测和更好的性能。大模型的参数量通常远远超过常规模型，可以达到数百万甚至数十亿个参数。这些参数通常通过深度神经网络（DeepNeuralNetwork）来表示，包括多个隐藏层和大量的神经元。大模型的训练需要大量的计算资源和数据。通常，它们需要在多个GPU或TP
论文阅读：Deep Bilateral Learning for Real-Time Image Enhancement-google-hdrnet-slicing SetMaker 论文阅读
项目地址:https://gitcode.com/google/hdrnethdrnet作为超分领域的经典文章，由google提出主要用来用轻量化的方法来实现高分辨率的图像生成，hdrnet结合cnn可以让更高分辨率的图像部署在板端。如图所示，原始图像比如4k图像，首先分为两个主要模块：grid和guide。grid就是对应图上面的那一条特征提取网络，具体来说，原始图像经过下采样之后，默认256分
2017-SIGGRAPH-Google,MIT-(HDRNet)Deep Bilateral Learning for Real-Time Image Enhancements WX Chen HDR技术深度学习神经网络机器学习
双边网格本质上是一个可以保存边缘信息的3维的数据结构。对于一张2维图片,在2维空间中增加了一维代表像素的强度slice操作(上采样)BilateralGuidedUpsampling这篇文章用双边网格实现图像的操作算子的加速。算法的核心思想是将一幅高分辨率的图像通过下采样转换成一个双边网格,在双边网格中每个格子就是一个图像的仿射变换算子,它的原理是在空间与值域相近的区域内,相似输入图像的亮度经算子
AWS GCR EKS Resource：构建高效弹性云原生应用的利器杨女嫚
AWSGCREKSResource：构建高效弹性云原生应用的利器eks-workshop-greater-chinaAWSWorkshopforLearningEKSforGreaterChina项目地址:https://gitcode.com/gh_mirrors/ek/eks-workshop-greater-china在云计算的浪潮中，AWS（AmazonWebServices）一直处于创新
AI行业高压与人才健康：纪念Felix Hill，并探讨AI代码生成工具的价值前端
今天，我们怀着沉痛的心情悼念GoogleDeepMind研究科学家FelixHill，这位杰出的AI学者在41岁的年纪离开了我们。他的离世引发了我们对AI行业高压环境与人才健康问题的深刻反思。Felix生前曾公开表达AI行业前所未有的压力，这促使我们思考如何利用技术，例如AI代码生成器，来改善开发者的工作环境，提升效率，守护人才健康。FelixHill在自然语言处理和人工智能领域取得了令人瞩目的成
HQL之投影查询归来朝歌 HQL Hibernate 查询语句投影查询
在HQL查询中，常常面临这样一个场景，对于多表查询，是要将一个表的对象查出来还是要只需要每个表中的几个字段，最后放在一起显示？针对上面的场景，如果需要将一个对象查出来： HQL语句写“from 对象”即可 Session session = HibernateUtil.openSession();
Spring整合redis bylijinnan redis
pom.xml <dependencies>  <dependency> <groupId>org.springframework.data</groupId> <artifactId>spring-data-redi
org.hibernate.NonUniqueResultException: query did not return a unique result: 2 0624chenhong Hibernate
参考：http://blog.csdn.net/qingfeilee/article/details/7052736 org.hibernate.NonUniqueResultException: query did not return a unique result: 2 在项目中出现了org.hiber
android动画效果不懂事的小屁孩 android动画
前几天弄alertdialog和popupwindow的时候，用到了android的动画效果，今天专门研究了一下关于android的动画效果，列出来，方便以后使用。 Android 平台提供了两类动画。一类是Tween动画，就是对场景里的对象不断的进行图像变化来产生动画效果（旋转、平移、放缩和渐变）。第二类就是 Frame动画，即顺序的播放事先做好的图像，与gif图片原理类似。
js delete 删除机理以及它的内存泄露问题的解决方案换个号韩国红果果 JavaScript
delete删除属性时只是解除了属性与对象的绑定，故当属性值为一个对象时，删除时会造成内存泄露（其实还未删除）举例： var person={name:{firstname:'bob'}} var p=person.name delete person.name p.firstname -->'bob' // 依然可以访问p.firstname，存在内存泄露
Oracle将零干预分析加入网络即服务计划蓝儿唯美 oracle
由Oracle通信技术部门主导的演示项目并没有在本月较早前法国南斯举行的行业集团TM论坛大会中获得嘉奖。但是，Oracle通信官员解雇致力于打造一个支持零干预分配和编制功能的网络即服务（NaaS）平台，帮助企业以更灵活和更适合云的方式实现通信服务提供商（CSP）的连接产品。这个Oracle主导的项目属于TM Forum Live!活动上展示的Catalyst计划的19个项目之一。Catalyst计
spring学习——springmvc（二） a-john springMVC
Spring MVC提供了非常方便的文件上传功能。 1，配置Spring支持文件上传： DispatcherServlet本身并不知道如何处理multipart的表单数据，需要一个multipart解析器把POST请求的multipart数据中抽取出来，这样DispatcherServlet就能将其传递给我们的控制器了。为了在Spring中注册multipart解析器，需要声明一个实现了Mul
POJ-2828-Buy Tickets aijuans ACM_POJ
POJ-2828-Buy Tickets http://poj.org/problem?id=2828 线段树，逆序插入 #include<iostream>#include<cstdio>#include<cstring>#include<cstdlib>using namespace std;#define N 200010struct
Java Ant build.xml详解 asia007 build.xml
1,什么是antant是构建工具2,什么是构建概念到处可查到，形象来说，你要把代码从某个地方拿来，编译，再拷贝到某个地方去等等操作，当然不仅与此，但是主要用来干这个3,ant的好处跨平台 --因为ant是使用java实现的，所以它跨平台使用简单--与ant的兄弟make比起来语法清晰--同样是和make相比功能强大--ant能做的事情很多，可能你用了很久，你仍然不知道它能有
android按钮监听器的四种技术百合不是茶 android xml配置监听器实现接口
android开发中经常会用到各种各样的监听器,android监听器的写法与java又有不同的地方; 1,activity中使用内部类实现接口 ,创建内部类实例使用add方法与java类似创建监听器的实例 myLis lis = new myLis(); 使用add方法给按钮添加监听器
软件架构师不等同于资深程序员 bijian1013 程序员架构师架构设计
本文的作者Armel Nene是ETAPIX Global公司的首席架构师，他居住在伦敦，他参与过的开源项目包括 Apache Lucene,，Apache Nutch， Liferay 和 Pentaho等。如今很多的公司
TeamForge Wiki Syntax & CollabNet User Information Center sunjing TeamForge How do Attachement Anchor Wiki Syntax
the CollabNet user information center http://help.collab.net/ How do I create a new Wiki page? A CollabNet TeamForge project can have any number of Wiki pages. All Wiki pages are linked, and
【Redis四】Redis数据类型 bit1129 redis
概述 Redis是一个高性能的数据结构服务器，称之为数据结构服务器的原因是，它提供了丰富的数据类型以满足不同的应用场景，本文对Redis的数据类型以及对这些类型可能的操作进行总结。 Redis常用的数据类型包括string、set、list、hash以及sorted set.Redis本身是K/V系统，这里的数据类型指的是value的类型，而不是key的类型，key的类型只有一种即string
SSH2整合-附源码白糖_ eclipse spring tomcat Hibernate Google
今天用eclipse终于整合出了struts2+hibernate+spring框架。我创建的是tomcat项目，需要有tomcat插件。导入项目以后，鼠标右键选择属性，然后再找到“tomcat”项，勾选一下“Is a tomcat project”即可。具体方法见源码里的jsp图片，sql也在源码里。补充1：项目中部分jar包不是最新版的，可能导
[转]开源项目代码的学习方法 braveCS 学习方法
转自： http://blog.sina.com.cn/s/blog_693458530100lk5m.html http://www.cnblogs.com/west-link/archive/2011/06/07/2074466.html 1）阅读features。以此来搞清楚该项目有哪些特性2）思考。想想如果自己来做有这些features的项目该如何构架3）下载并安装d
编程之美-子数组的最大和（二维） bylijinnan 编程之美
package beautyOfCoding; import java.util.Arrays; import java.util.Random; public class MaxSubArraySum2 { /** * 编程之美子数组之和的最大值（二维） */ private static final int ROW = 5; private stat
读书笔记-3 chengxuyuancsdn jquery笔记 resultMap配置 ibatis一对多配置
1、resultMap配置 2、ibatis一对多配置 3、jquery笔记 1、resultMap配置当<select resultMap="topic_data"> <resultMap id="topic_data">必须一一对应。 (1)<resultMap class="tblTopic&q
[物理与天文]物理学新进展 comsci
如果我们必须获得某种地球上没有的矿石,才能够进行某些能量输出装置的设计和建造,而要获得这种矿石,又必须首先进行深空探测,而要进行深空探测,又必须获得这种能量输出装置,这个矛盾的循环,会导致地球联盟在与宇宙文明建立关系的时候,陷入困境怎么办呢?
Oracle 11g新特性:Automatic Diagnostic Repository daizj oracle ADR
Oracle Database 11g的FDI（Fault Diagnosability Infrastructure）是自动化诊断方面的又一增强。 FDI的一个关键组件是自动诊断库（Automatic Diagnostic Repository-ADR）。在oracle 11g中，alert文件的信息是以xml的文件格式存在的，另外提供了普通文本格式的alert文件。这两份log文
简单排序:选择排序 dieslrae 选择排序
public void selectSort(int[] array){ int select; for(int i=0;i<array.length;i++){ select = i; for(int k=i+1;k<array.leng
C语言学习六指针的经典程序，互换两个数字 dcj3sjt126com c
示例程序，swap_1和swap_2都是错误的，推理从1开始推到2，2没完成，推到3就完成了 # include <stdio.h> void swap_1(int, int); void swap_2(int *, int *); void swap_3(int *, int *); int main(void) { int a = 3; int b =
php 5.4中php-fpm 的重启、终止操作命令 dcj3sjt126com PHP
php 5.4中php-fpm 的重启、终止操作命令: 查看php运行目录命令：which php/usr/bin/php 查看php-fpm进程数：ps aux | grep -c php-fpm 查看运行内存/usr/bin/php -i|grep mem 重启php-fpm/etc/init.d/php-fpm restart 在phpinfo()输出内容可以看到php
线程同步工具类 shuizhaosi888 同步工具类
同步工具类包括信号量（Semaphore）、栅栏（barrier）、闭锁（CountDownLatch）闭锁（CountDownLatch） public class RunMain { public long timeTasks(int nThreads, final Runnable task) throws InterruptedException { fin
bleeding edge是什么意思 haojinghua DI
不止一次，看到很多讲技术的文章里面出现过这个词语。今天终于弄懂了——通过朋友给的浏览软件，上了wiki。我再一次感到，没有辞典能像WiKi一样，给出这样体贴人心、一清二楚的解释了。为了表达我对WiKi的喜爱，只好在此一一中英对照，给大家上次课。 In computer science, bleeding edge is a term that
c中实现utf8和gbk的互转 jimmee c iconv utf8&gbk编码
#include <iconv.h> #include <stdlib.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <string.h> #include <sys/stat.h> int code_c
大型分布式网站架构设计与实践 lilin530 应用服务器搜索引擎
1.大型网站软件系统的特点？ a.高并发，大流量。 b.高可用。 c.海量数据。 d.用户分布广泛，网络情况复杂。 e.安全环境恶劣。 f.需求快速变更，发布频繁。 g.渐进式发展。 2.大型网站架构演化发展历程？ a.初始阶段的网站架构。应用程序，数据库，文件等所有的资源都在一台服务器上。 b.应用服务器和数据服务器分离。 c.使用缓存改善网站性能。 d.使用应用
在代码中获取Android theme中的attr属性值 OliveExcel android theme
Android的Theme是由各种attr组合而成, 每个attr对应了这个属性的一个引用, 这个引用又可以是各种东西. 在某些情况下, 我们需要获取非自定义的主题下某个属性的内容 (比如拿到系统默认的配色colorAccent), 操作方式举例一则: int defaultColor = 0xFF000000; int[] attrsArray = { andorid.r.
基于Zookeeper的分布式共享锁 roadrunners zookeeper 分布式共享锁
首先，说说我们的场景，订单服务是做成集群的，当两个以上结点同时收到一个相同订单的创建指令，这时并发就产生了，系统就会重复创建订单。等等......场景。这时，分布式共享锁就闪亮登场了。共享锁在同一个进程中是很容易实现的，但在跨进程或者在不同Server之间就不好实现了。Zookeeper就很容易实现。具体的实现原理官网和其它网站也有翻译，这里就不在赘述了。官
两个容易被忽略的MySQL知识 tomcat_oracle mysql
1、varchar(5)可以存储多少个汉字，多少个字母数字？　　相信有好多人应该跟我一样，对这个已经很熟悉了，根据经验我们能很快的做出决定，比如说用varchar(200)去存储url等等，但是，即使你用了很多次也很熟悉了，也有可能对上面的问题做出错误的回答。　　这个问题我查了好多资料，有的人说是可以存储5个字符，2.5个汉字（每个汉字占用两个字节的话），有的人说这个要区分版本，5.0
zoj 3827 Information Entropy(水题) 阿尔萨斯 format
题目链接：zoj 3827 Information Entropy 题目大意：三种底，计算和。解题思路：调用库函数就可以直接算了，不过要注意Pi = 0的时候，不过它题目里居然也讲了。。。limp→0+plogb(p)=0，因为p是logp的高阶。 #include <cstdio> #include <cstring> #include <cmath&

BP反向传播算法是如何工作的How the backpropagation algorithm works

Warm up: a fast matrix-based approach to computing the output from a neural network

The two assumptions we need about the cost function

The Hadamard product, s⊙t

The four fundamental equations behind backpropagation

Problem

Proof of the four fundamental equations (optional)

Exercise

The backpropagation algorithm

Exercises

The code for backpropagation

Problem

In what sense is backpropagation a fast algorithm?

Backpropagation: the big picture

你可能感兴趣的:(Deep,Learning)

The Hadamard product,