In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn't discuss how to compute the gradient of the cost function. That's quite a gap! In this chapter I'll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.
This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?
The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative $\partial C/\partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.
With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.
Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.
Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:
We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram shows examples of these notations in use:
With these notations, the activation $a^l_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation

$$a^l_j = \sigma\biggl(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\biggr), \tag{23}$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To rewrite this in matrix form we define a weight matrix $w^l$ for each layer (whose entries are the weights $w^l_{jk}$), a bias vector $b^l$, and an activation vector $a^l$. The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect

$$f\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = \begin{bmatrix} f(2) \\ f(3) \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \end{bmatrix},$$

that is, the vectorized $f$ just squares every element of the vector.
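In Numpy this elementwise behaviour comes essentially for free, since arithmetic on arrays is applied element by element. A minimal sketch of the vectorized $f$:

import numpy as np

def f(v):
    """Vectorized form of f(x) = x**2: squares every element of v."""
    return v**2  # numpy applies ** elementwise

print(f(np.array([2.0, 3.0])))  # [4. 9.]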
With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{25}$$

This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function* *By the way, it's this expression that motivates the quirk in the $w^l_{jk}$ notation: if we used $j$ to index the input neuron, and $k$ to index the output neuron, then we'd need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That's a small change, but annoying, and we'd lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".. That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.
When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1} + b^l$ along the way. This quantity turns out to be useful enough to be worth naming: we call $z^l$ the weighted input to the neurons in layer $l$. We'll make considerable use of the weighted input $z^l$ later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$; that is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
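To make the matrix form concrete, here's a minimal feedforward sketch, assuming weights and biases are lists of numpy arrays, one per layer, in the style of the Network class we'll revisit later in the chapter:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def feedforward(weights, biases, a):
    """Compute the network's output by repeatedly applying
    a^l = sigmoid(w^l a^{l-1} + b^l), layer by layer."""
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b  # the weighted input z^l
        a = sigmoid(z)
    return a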
The goal of backpropagation is to compute the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it's useful to have an example cost function in mind. We'll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form

$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \tag{26}$$

where $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.
Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n}\sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2}\|y - a^L\|^2$. This assumption will also hold true for all the other cost functions we'll meet in this book.
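To make the first assumption concrete, here's a minimal sketch of the averaged quadratic cost; outputs and ys are assumed to be matching lists of output activations $a^L$ and desired outputs $y$:

import numpy as np

def quadratic_cost(outputs, ys):
    """C = (1/n) sum_x C_x, with C_x = 0.5*||y - a^L||^2 per example."""
    return np.mean([0.5*np.linalg.norm(y-a)**2 for a, y in zip(outputs, ys)])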
The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x/\partial w$ and $\partial C_x/\partial b$ for a single training example. We then recover $\partial C/\partial w$ and $\partial C/\partial b$ by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network: $C = C(a^L)$. For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as $C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2}\sum_j (y_j - a^L_j)^2$, and thus is a function of the output activations. Remember that the input training example $x$ is fixed, and so the desired output $y$ is also fixed; it's not something the network can change by varying the weights and biases. In this sense $y$ is merely a parameter that helps define the cost function.
The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,

$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \times 3 \\ 2 \times 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix}.$$

This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We'll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
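In Numpy there's nothing special to implement: the ordinary * operator on arrays is already elementwise, i.e., it computes the Hadamard product:

import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)  # [3 8], the Hadamard product of s and t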
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$.
To understand how the error is defined, imagine there is a demon in our neural network. The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation: it adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron outputs $\sigma(z^l_j + \Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$. So far as the demon can tell, the neuron is already pretty near optimal* *This is only the case for small changes $\Delta z^l_j$, of course. We'll assume that the demon is constrained to make such small changes.. And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}$$

As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$.
You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error* *In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our $\delta$ vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.
An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}$$

This is a very natural expression. The first term on the right, $\partial C/\partial a^L_j$, measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation. The second term, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C/\partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing $\partial C/\partial a^L_j$. For example, if we're using the quadratic cost function then $C = \frac{1}{2}\sum_j (y_j - a^L_j)^2$, and so $\partial C/\partial a^L_j = (a^L_j - y_j)$, which obviously is easily computable.
Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$

Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C/\partial a^L_j$; you can think of it as expressing the rate of change of $C$ with respect to the output activations. In the case of the quadratic cost we have $\nabla_a C = (a^L - y)$, and so the fully matrix-based form of (BP1) becomes

$$\delta^L = (a^L - y) \odot \sigma'(z^L). \tag{30}$$

As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.
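As a minimal sketch, here's Equation (30) in numpy, assuming the quadratic cost and the sigmoid helpers used with the chapter's code:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def output_error(a_L, y, z_L):
    """delta^L = (a^L - y) * sigma'(z^L), i.e. Equation (30)."""
    return (a_L - y) * sigmoid_prime(z_L)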
An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}$$

where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{\rm th}$ layer. When we apply the transpose weight matrix we can think of this intuitively as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer. Taking the Hadamard product with $\sigma'(z^l)$ then moves the error backward through the activation function in layer $l$, giving us the error $\delta^l$ in the weighted input to layer $l$.
By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.
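In code that backward sweep is a single loop. Here's a minimal sketch, assuming weights[i] holds $w^l$ and zs[i] holds $z^l$ for layer $l = i+2$, and the quadratic cost for the output error:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def backward_errors(weights, zs, a_L, y):
    """Return [delta^2, ..., delta^L], computed via (BP1) and (BP2)."""
    deltas = [None]*len(zs)
    deltas[-1] = (a_L - y) * sigmoid_prime(zs[-1])  # (BP1), quadratic cost
    for i in range(len(zs)-2, -1, -1):              # layers L-1 down to 2
        deltas[i] = np.dot(weights[i+1].transpose(), deltas[i+1]) * \
            sigmoid_prime(zs[i])                    # (BP2)
    return deltas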
An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial C/\partial b^l_j$. This is great news, since (BP1) and (BP2) have already told us how to compute $\delta^l_j$. We can write (BP3) in shorthand as $\partial C/\partial b = \delta$, where it is understood that $\delta$ is being evaluated at the same neuron as the bias $b$.
An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}$$

This tells us how to compute the partial derivatives $\partial C/\partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute. The equation can be rewritten in a less index-heavy notation as $\partial C/\partial w = a_{\rm in} \delta_{\rm out}$, where it's understood that $a_{\rm in}$ is the activation of the neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of the neuron output from the weight $w$. Zooming in to look at just the weight $w$, and the two neurons connected by that weight, we can depict this as: [figure omitted]. A nice consequence is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx 0$, the gradient term $\partial C/\partial w$ will also tend to be small. In this case, we'll say the weight learns slowly, meaning that it's not changing much during gradient descent.
There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall from the graph of the sigmoid function in the last chapter that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation ($\approx 0$) or high activation ($\approx 1$). In this case it's common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold also for the biases of the output neuron.
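You can check the saturation effect numerically:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

# sigma' peaks at z = 0 and collapses toward zero as the neuron saturates:
print(sigmoid_prime(np.array([-10.0, 0.0, 10.0])))  # approx [4.5e-05, 0.25, 4.5e-05]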
We can obtain similar insights for earlier layers. In particular, note the $\sigma'(z^l)$ term in (BP2). This means that $\delta^l_j$ is likely to get small if the neuron is near saturation. And this, in turn, means that any weights input to a saturated neuron will learn slowly* *This reasoning won't hold if $(w^{l+1})^T \delta^{l+1}$ has large enough entries to compensate for the smallness of $\sigma'(z^l_j)$. But I'm speaking of the general tendency..
Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they help improve our mental model of what's going on as a neural network learns. Furthermore, we can turn this type of reasoning around. The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function (that's because, as we'll see in a moment, the proofs don't use any special properties of $\sigma$). And so we can use these equations to design activation functions which have particular desired learning properties. As an example to give you the idea, suppose we were to choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$ is always positive, and never gets close to zero. That would prevent the slow-down of learning that occurs when ordinary sigmoid neurons saturate. Later in the book we'll see examples where this kind of modification is made to the activation function. Keeping the four equations (BP1)-(BP4) in mind can help explain why such modifications are tried, and what impact they can have.
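As a purely illustrative sketch (this activation isn't one used in the book's networks), the so-called leaky rectifier has a derivative that is positive everywhere and bounded below by a constant, so the $\sigma'(z)$ factor in (BP1)-(BP4) can never fall to zero:

import numpy as np

def leaky_relu(z, alpha=0.01):
    """An activation whose derivative is always positive."""
    return np.where(z > 0, z, alpha*z)

def leaky_relu_prime(z, alpha=0.01):
    # The derivative is 1 for z > 0 and alpha for z <= 0, so it is
    # bounded away from zero and neurons can't saturate as sigmoids do.
    return np.where(z > 0, 1.0, alpha)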
We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.
Let's begin with Equation (BP1), which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition

$$\delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}$$

Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations,

$$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}$$

where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k/\partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}$$

Recalling that $a^L_j = \sigma(z^L_j)$, the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}$$

which is just (BP1), in component form.
Next, we'll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C/\partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C/\partial z^{l+1}_k$. We can do this using the chain rule,

$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{40-42}$$

where in the last step we substituted the definition of $\delta^{l+1}_k$. To evaluate the remaining partial derivative, note that

$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k. \tag{43}$$

Differentiating, we obtain

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}$$

Substituting back we obtain

$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}$$

This is just (BP2) written in component form.
The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

1. Input $x$: Set the corresponding activation $a^1$ for the input layer.
2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.
As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

1. Input a set of training examples.
2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the feedforward, output error, and backpropagation steps above to obtain the errors $\delta^{x,l}$ for each layer.
3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.
Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:
import numpy as np

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment). You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C/\partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \tag{46}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$. The same idea will let us compute the partial derivatives with respect to the biases.
This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!
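Here's a minimal sketch of the idea, assuming cost is a function mapping a flat numpy vector of weights to the scalar cost (the name estimate_gradient is mine, not standard):

import numpy as np

def estimate_gradient(cost, w, eps=1e-5):
    """Estimate dC/dw_j for every weight using Equation (46).
    One call to `cost` per weight, plus one for the base value C(w)."""
    base = cost(w)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0
        grad[j] = (cost(w + eps*e_j) - base) / eps
    return grad

Note that the loop calls cost once per weight - precisely the expense discussed next.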
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C/\partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.
What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass* *This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it's actually much, much faster.
This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.
As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place. It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.
To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:
That change in weight will cause a change in the output activation from the corresponding neuron. That, in turn, will cause changes in all the activations in the next layer, then the next, and so on, all the way through to a change in the final layer, and then in the cost function. The change $\Delta C$ in the cost is related to the change $\Delta w^l_{jk}$ in the weight by

$$\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47}$$

This suggests that a possible approach to computing $\partial C/\partial w^l_{jk}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$.

Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^l_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. This change is given by

$$\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48}$$

The change $\Delta a^l_j$ will, in turn, cause changes in all the activations in the next layer, and so we can track a path of activations all the way through the network from $w^l_{jk}$ to the cost $C$, with each change causing a change in the next. Multiplying the rate factors along a single path, and summing over all such paths between the weight and the cost, gives

$$\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \tag{53}$$

That is, $\partial C/\partial w^l_{jk}$ is a sum of rate factors over all paths through the network from the perturbed weight to the final cost.
What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53). That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.
What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* *There is one clever step required. In Equation (53) the intermediate variables are activations like $a^{l+1}_q$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.