Training Neural Networks Explained Simply

In this post we will explore the mechanism of neural network training, but I’ll do my best to avoid rigorous mathematical discussions and keep it intuitive.


Consider the following task: you receive an image, and want an algorithm that returns (predicts) the correct number of people in the image.


We start by assuming that there is, indeed, some mathematical function out there that relates the collection of all possible images with the collection of integer values describing the number of people in each image. We accept the fact that we will never know the actual function, but we hope to learn a model with finite complexity that approximates this function well enough.


Let’s assume that you’ve constructed some kind of neural network to perform this task. For the sake of this discussion, it’s not really important how many layers there are in the network or the nature of the mathematical manipulations carried out in each layer. What is important, however, is that in the end there is one output neuron that predicts a (non-negative, hopefully integer) value.


The mathematical operation of the network can be expressed as a function:


f(x, w) = y


where x is the input image (we can think of it as a vector containing all the pixel values), y is the network’s prediction, and w is a vector containing all the internal parameters of the function (e.g. in f(x, w) = a + bx + exp(c*x) = y, the values of a, b and c are the parameters of the function).

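As a hypothetical illustration (this is not the network from the post, just the toy formula above written out), such a parametric function could look like this in code:

```python
import numpy as np

def f(x, w):
    """Toy parametric model from the text: f(x, w) = a + b*x + exp(c*x).
    x is the input value and w = (a, b, c) holds the function's parameters."""
    a, b, c = w
    return a + b * x + np.exp(c * x)

# Example: the same function with one particular choice of parameters.
w = np.array([0.5, 2.0, -1.0])
y = f(3.0, w)   # the model's prediction for input x = 3.0
```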

As we saw in the post on perceptrons, during training we want some kind of mechanism that:


  1. Evaluates the network’s prediction on a given input,

  2. Compares it to the desired answer (the ground truth),

  3. Produces feedback that corresponds to the magnitude of the error,

  4. And finally, modifies the network parameters in a way that improves its prediction (decreases the error magnitude).


And thanks to some clever minds, we have such a mechanism. In order to understand it we need to cover two topics:


  1. Loss function


  2. Backpropagation



Loss Function

Simply put, the loss function is the error magnitude. In more detail, a good loss function should be a metric, i.e. it defines a distance between points in the space of prediction values. You can read more about distance functions here.


We would like to use a loss function that returns a small value when the network’s prediction is close to the ground truth, and large when it is far from the ground truth.


The aim of the loss function is to tell our training mechanism how big the error is. If there are 6 people in the image and our network predicted only 1, it’s a bigger error than if the network had predicted 5. So a logical choice of loss function would be


L(y, y’) = abs(y-y’),


where y and y’ are the network prediction and the ground truth, respectively.


A more common choice is:


L(y, y’) = (y-y’)²


This function is preferred because of two important properties:


  1. It penalizes large errors more than linearly,

  2. It is smooth everywhere, even at y=y’. We will soon understand why this is important.

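To make the two choices concrete, here is a minimal sketch of both losses and of the derivative of the squared loss with respect to the prediction (the quantity that will matter in a moment):

```python
def abs_loss(y, y_true):
    """L(y, y') = abs(y - y'): grows linearly with the error."""
    return abs(y - y_true)

def squared_loss(y, y_true):
    """L(y, y') = (y - y')**2: emphasizes large errors and is smooth everywhere."""
    return (y - y_true) ** 2

def squared_loss_grad(y, y_true):
    """dL/dy = 2*(y - y'): well defined even at y == y'."""
    return 2.0 * (y - y_true)

# Predicting 1 person when there are 6 is penalized far more than predicting 5:
print(squared_loss(1, 6), squared_loss(5, 6))   # 25 1
```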

The value we are actually interested in is not the loss itself. What we want our training mechanism to calculate is the loss derivative. ‘Derivative with respect to what?’, you’re probably asking. And here comes the heart of the matter: contrary to what many machine learning beginners might think, we don’t differentiate the loss with respect to the input value x. For us, the input is a given, independent value. Instead, we differentiate the loss with respect to the network parameters w.



Why?


Because the loss derivative, or gradient with respect to network parameters, dL/dw, tells us how much the loss changes (on the given input x), if we change the network parameters slightly. If we imagine the loss function as a landscape with hills and valleys on the parameter space (it’s easy to do it with just two parameters, but remember that in practice there are usually millions of parameters), then the loss gradient is a vector that points uphill. If we want to decrease the error we should move (i.e. slightly change the parameters) in the negative gradient direction — downhill.


Now it is clear why we prefer differentiable functions. Using functions with non-smooth properties can cause issues if not handled carefully.



How can we differentiate an unknown function?


We will never know the real underlying function for the task. What we have is a model function learning to approximate that underlying function. And that model function can be differentiated numerically by slightly changing each parameter separately and calculating the change to the loss.

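A sketch of that numerical procedure, assuming a helper loss_fn that maps a parameter vector w to the loss on some fixed input (the names here are illustrative, not part of any library):

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-6):
    """Approximate dL/dw by perturbing each parameter separately and
    measuring the resulting change in the loss (central differences)."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad
```

Doing this for every one of millions of parameters is prohibitively slow, which is one reason the technique in the next section matters.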


Backpropagation

Backpropagation is a clever way to compute these derivatives efficiently. We will not go into detail, but it basically works through the network parameters one at a time (starting from the output layer and moving backward), calculating the loss derivative according to the chain rule. You can read more on backpropagation here.

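As a rough, hand-written illustration of the chain rule in action (not how a real framework stores things internally), consider a tiny model y = w2 * relu(w1 * x) trained with the squared loss; w1, w2 and relu are just names chosen for this example:

```python
def tiny_forward_backward(x, y_true, w1, w2):
    """Forward pass and chain-rule backward pass for y = w2 * relu(w1 * x)."""
    # Forward pass: keep the intermediate values; the backward pass needs them.
    h_pre = w1 * x
    h = max(h_pre, 0.0)                 # ReLU activation
    y = w2 * h                          # network prediction
    loss = (y - y_true) ** 2            # squared loss

    # Backward pass: multiply local derivatives from the output backward.
    dL_dy = 2.0 * (y - y_true)          # derivative of the loss w.r.t. y
    dL_dw2 = dL_dy * h                  # y = w2 * h      ->  dy/dw2 = h
    dL_dh = dL_dy * w2                  #                 ->  dy/dh  = w2
    dL_dhpre = dL_dh * (1.0 if h_pre > 0 else 0.0)   # ReLU derivative
    dL_dw1 = dL_dhpre * x               # h_pre = w1 * x  ->  dh_pre/dw1 = x
    return loss, dL_dw1, dL_dw2
```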

Deep learning frameworks have built-in methods for backpropagation, so luckily we don’t need to implement it ourselves each time we want to build and train a network.


Once we have the Loss gradient, we update the network parameters:


w_new = w - lr * grad(L)


where w_new is the updated value of the parameter vector, grad(L) is the loss gradient with respect to the parameters, and lr is the learning rate: a coefficient that controls the step size of the update. It’s a hyperparameter set by the user and usually decreases over the training stages according to a preset schedule.

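A sketch of that update step, reusing the hypothetical numerical_gradient helper from above (in a real framework the gradient would come from backpropagation instead):

```python
def training_step(loss_fn, w, lr):
    """Single gradient-descent step: w_new = w - lr * grad(L)."""
    grad = numerical_gradient(loss_fn, w)
    return w - lr * grad
```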

By updating the parameters, we’ve completed a single training step, and are now ready to repeat the process with another data point.


Each full cycle through the training set is called an epoch. In a typical training process there can be hundreds or thousands of epochs, depending on the case.

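Putting the pieces together, a bare-bones training loop over epochs might look like the sketch below; initial_parameters, training_set and loss_for_example are placeholders for whatever model, data and per-example loss you actually have:

```python
w = initial_parameters()   # placeholder: some starting parameter vector
lr = 0.01
num_epochs = 100

for epoch in range(num_epochs):
    # One epoch = one full cycle through the training set.
    for x, y_true in training_set:
        loss_fn = lambda w_: loss_for_example(x, y_true, w_)
        w = training_step(loss_fn, w, lr)   # one training step per data point
```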

When to stop training?

Training loss (blue) and validation loss (red) over training steps. After step 17 the training is not effective.

We use two indicators to understand whether the network is training effectively:


  1. Monitoring the loss: When a network trains effectively, we expect to see a decrease of the loss over time. If the loss doesn’t decrease, it may mean that the network has converged to a deep minimum point, or that the learning rate is too high and the network keeps missing the minima in the loss landscape.


  2. Running a validation test: At fixed periods, run the network in inference mode (using the network for prediction without updating parameters) on a validation set, a collection of images that is not part of the training set. We expect the loss to decrease on that set as well. If the loss decreases on the training set but not on the validation set, it means that the network is starting to overfit: it is learning the attributes of the training examples but isn’t learning to generalize, and its performance on unseen data will eventually degrade. This occurs when the training set is too small or the network has too many parameters. A minimal sketch of this check appears below.

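A minimal sketch of the second indicator, with predict, validation_set, num_epochs and the training steps left as placeholders: every epoch the network runs in inference mode on held-out data, and training can be stopped once the validation loss stops improving.

```python
best_val_loss = float("inf")
patience, bad_epochs = 5, 0        # how many non-improving checks we tolerate

for epoch in range(num_epochs):
    # ... training steps for this epoch go here ...

    # Inference only: no parameter updates on the validation set.
    val_loss = sum(squared_loss(predict(x, w), y_true)
                   for x, y_true in validation_set) / len(validation_set)

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation loss stopped improving
            break                   # a likely sign of overfitting
```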

Visualizing the Loss Landscape of Neural Nets

A few more words about training

This post is meant to give the basic concepts of neural network training. Beyond the basics, things get complicated very quickly.


  1. Training a network belongs to a branch of math and engineering called optimization. You probably encountered optimization problems in math class at school, where you were required to find the shape of a field with the largest area, etc. Analytically, these problems require setting the derivative of some function to zero and solving for the required properties (e.g. field length, width etc.). In real-life problems we can’t always calculate the derivative, but we can approximate it. That’s what we do in network training: we define a loss function that a perfect model would minimize, and the goal of the training process is to end up at a point in parameter space which is at a deep enough local minimum of the loss function.


  2. Usually neural networks predict more than a single value: detection networks predict thousands of values per image, corresponding to object bounding box coordinates, confidence levels and object classes; image enhancement networks literally predict whole images; etc. Generally the loss will have multiple dependencies on the different output features: e.g. an object detection network will have a loss term for missed objects, but also for objects that were not localized correctly, and for objects that were not classified correctly. Choosing a good loss function is more than just math and engineering, it’s art. Two ML engineers working on the same network architecture and the same task can design different loss functions. Through the loss function you sculpt the network to learn features in a certain way. Some loss functions may not be good enough, in the sense that it is hard to converge to their local minima. Some loss functions contain complex relations between different output nodes of the network, or loss coefficients that change at different phases of the training (e.g. Yolo).

  3. Usually training steps are performed on more than one data point at a time. The average loss is calculated over a batch of inputs, resulting in a smoother loss landscape (and better utilizing parallel computation techniques for higher speed).


  4. As can be seen in the loss landscape image above, finding the deepest minimum is not as simple as letting a marble roll into a pit. A typical loss landscape has many crevasses, canyons and saddle points that will divert us on the route to the deepest point. Furthermore, we don’t expect to find the global minimum even in a successful training run. We are satisfied with one of the many local minima that give good results on the data.


  5. There are many heuristics on how to change the learning rate during training. Too big a step will prevent convergence, while too small a step will make the training very long and may get stuck in less-than-optimal minima. A common method is exponential decay (multiplying lr by a fixed factor < 1 at fixed training intervals). Another method changes lr in a cosine manner, either over just half a cycle during the training, or periodically; a sketch of both appears after this list. See some examples here. Besides learning rate, there are other tricks to control the dynamics of the convergence, such as momentum (using a weighted average of the current and previous gradients).


  6. In order to help the network generalize (and also increase the effective size of the training set), we usually add random augmentations to the training images (e.g. changing brightness, flipping the image horizontally, adding noise), as long as the resulting image is a valid input for the task. This method is very helpful when your training data is insufficient or the images are highly correlated (as is the case when you use frames taken from a video clip).

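To illustrate point 5 above, here is what exponential decay and a half-cycle cosine schedule might look like in code; deep learning frameworks ship their own scheduler utilities, so this is only a sketch with illustrative names:

```python
import math

def exponential_decay(lr0, step, factor=0.9, every=1000):
    """Multiply the learning rate by a fixed factor < 1 at fixed intervals."""
    return lr0 * factor ** (step // every)

def cosine_half_cycle(lr0, step, total_steps):
    """Decay the learning rate along half a cosine cycle over the whole run."""
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```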

Translated from: https://medium.com/@urialmog/training-neural-networks-explained-simply-902388561613
