What is a neural network? To get started, I’ll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it’s more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We’ll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it’s worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs, x1,x2,…, and produces a single binary output:
In the example shown the perceptron has three inputs, x1,x2,x3. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, w1,w2,…, real numbers expressing the importance of the respective inputs to the output. The neuron’s output, 0 or 1, is determined by whether the weighted sum ∑jwjxj is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:
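output = 0 if ∑jwjxj ≤ threshold
output = 1 if ∑jwjxj > threshold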
That’s all there is to how a perceptron works!
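The rule is equally compact in code. Here is a minimal Python sketch, with function and variable names of my own choosing:

```python
def perceptron_output(weights, inputs, threshold):
    """Apply the perceptron rule: output 1 if the weighted sum of the
    inputs exceeds the threshold, and 0 otherwise."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0
```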
That’s the basic mathematical model. A way you can think about the perceptron is that it’s a device that makes decisions by weighing up evidence. Let me give an example. It’s not a very realistic example, but it’s easy to understand, and we’ll soon get to more realistic examples. Suppose the weekend is coming up, and you’ve heard that there’s going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:
We can represent these three factors by corresponding binary variables x1,x2, and x3. For instance, we’d have x1=1 if the weather is good, and x1=0 if the weather is bad. Similarly, x2=1 if your boyfriend or girlfriend wants to go, and x2=0 if not. And similarly again for x3 and public transit.
Now, suppose you absolutely adore cheese, so much so that you’re happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there’s no way you’d go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight w1=6 for the weather, and w2=2 and w3=2 for the other conditions. The larger value of w1 indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it’d be a different model of decision-making. Dropping the threshold means you’re more willing to go to the festival.
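To see both decision models at once, here is a small sketch that evaluates every combination of the three factors at thresholds 5 and 3 (the variable names are illustrative):

```python
from itertools import product

weights = [6, 2, 2]  # weather, partner wants to go, near public transit

for threshold in (5, 3):
    print(f"threshold = {threshold}:")
    for x1, x2, x3 in product((0, 1), repeat=3):
        weighted_sum = sum(w * x for w, x in zip(weights, (x1, x2, x3)))
        go = 1 if weighted_sum > threshold else 0
        print(f"  weather={x1} partner={x2} transit={x3} -> go: {go}")
```

With the threshold at 5 the output tracks the weather alone; at 3, good weather or the combination of willing company and nearby transit is enough.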
Obviously, the perceptron isn’t a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:
In this network, the first column of perceptrons - what we’ll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.
Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they’re still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It’s less unwieldy than drawing a single output line which then splits.
Let’s simplify the way we describe perceptrons. The condition ∑jwjxj>threshold is cumbersome, and we can make two notational changes to simplify it. The first change is to write ∑jwjxj as a dot product, w⋅x≡∑jwjxj, where w and x are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what’s known as the perceptron’s bias, b≡−threshold. Using the bias instead of the threshold, the perceptron rule can be rewritten:
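output = 0 if w⋅x+b ≤ 0
output = 1 if w⋅x+b > 0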
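In this vector form the rule is easy to express with a dot product. A minimal sketch, assuming NumPy (the function name is mine):

```python
import numpy as np

def perceptron(w, x, b):
    """Vectorised perceptron rule: output 1 if w·x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# The festival example again: a threshold of 5 becomes a bias of -5.
print(perceptron(np.array([6, 2, 2]), np.array([1, 0, 0]), -5))  # prints 1
```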
I’ve described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight −2, and an overall bias of 3. Here’s our perceptron:
Then we see that input 00 produces output 1, since (−2)∗0+(−2)∗0+3=3 is positive. Here, I’ve introduced the ∗ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since (−2)∗1+(−2)∗1+3=−1 is negative. And so our perceptron implements a NAND gate!
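Here is a quick sketch verifying the full truth table (the helper name is mine):

```python
def nand_perceptron(x1, x2):
    """Perceptron with weights -2, -2 and bias 3: computes NAND."""
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"{x1} NAND {x2} -> {nand_perceptron(x1, x2)}")
# Prints 1 for inputs 00, 01, and 10, and 0 for input 11.
```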
The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, x1 and x2. This requires computing the bitwise sum, x1⊕x2, as well as a carry bit which is set to 1 when both x1 and x2 are 1, i.e., the carry bit is just the bitwise product x1x2:
To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight −2, and an overall bias of 3. Here’s the resulting network. Note that I’ve moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn’t say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn’t much matter. If we don’t want to allow this kind of thing, then it’s possible to simply merge the two lines, into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don’t find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked:
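To tie this together, here is a sketch of the two-bit adder built entirely from the NAND perceptron above, wired as in the diagram (the function names are mine; the doubled use of the first perceptron’s output is exactly the merged −4 connection just discussed):

```python
def nand(x1, x2):
    """Perceptron with weights -2, -2 and bias 3: a NAND gate."""
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def add_bits(x1, x2):
    """Add two bits using five NAND perceptrons, as in the diagram."""
    m = nand(x1, x2)                               # leftmost perceptron
    bitwise_sum = nand(nand(x1, m), nand(x2, m))   # x1 XOR x2
    carry = nand(m, m)                             # x1 AND x2; m is used twice here
    return bitwise_sum, carry

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", add_bits(x1, x2))  # (bitwise sum, carry)
```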
Up to now I’ve been drawing inputs like x1 and x2 as variables floating to the left of the network of perceptrons. In fact, it’s conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:
This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn’t actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum ∑jwjxj would always be zero, and so the perceptron would output 1 if b>0, and 0 if b≤0. That is, the perceptron would simply output a fixed value, not the desired value (x1, in the example above). It’s better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, x1,x2,…
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
The computational universality of perceptrons is simultaneously reassuring and disappointing. It’s reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it’s also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That’s hardly big news!
However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.