10 Invaluable Tips & Tricks for Building Successful Neural Networks
Building neural networks is difficult because there is so much variability involved. With these 10 tips and tricks, you'll not only have concrete pointers on changes to try but also a strategy and mentality for approaching the ambiguous task of building a successful neural network.
Activation functions for output layers
The last layer of a neural network is important to get right because it is the final step in the prediction process: its activation function is the last transformation applied before the network's information flow is output as a prediction.
The sigmoid function is used for binary outputs. For instance, if a network were to predict whether an image is a cat or a dog, sigmoid would be the way to go. The sigmoid function, defined mathematically as 1/(1+e^(-x)), is the inverse of the logit function, which is why it is a natural choice for modelling probability.
On the other hand, if you have multiple outputs, for instance when classifying images into 10 possible digits, use the softmax function, which is a generalization of the sigmoid function to multiple classes with the restriction that all probabilities must sum to 1. For instance, if the predictions for four classes are [0.48, 0.23, 0.56, 0.03], the result is invalid because the probabilities add up to 1.30 instead of 1 (100%).
Since sigmoid and softmax are both bounded functions, meaning their outputs cannot go below 0 or above 1, they cannot handle regression problems, whose targets are continuous values rather than probabilities.
Instead, use a linear activation function (y=x), which is not bounded. Additionally, if all the target values are greater than 0, ReLU is also acceptable, since the slope of the line (a in max(0, ax)) is a tunable parameter. Alternatively, activations that allow negative outputs, such as Leaky ReLU or ELU (Exponential Linear Unit), can be used.
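As a quick reference, here is a minimal Keras sketch of the three output heads discussed above (the layer sizes are illustrative choices):

```python
from tensorflow.keras import layers

# Binary classification (cat vs. dog): one unit with a sigmoid activation
binary_head = layers.Dense(1, activation="sigmoid")

# Multi-class classification (10 digits): one unit per class with softmax
multiclass_head = layers.Dense(10, activation="softmax")

# Regression (unbounded continuous target): linear activation (the Dense default)
regression_head = layers.Dense(1, activation="linear")
```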
Use pretrained models
It’s difficult to learn everything from scratch — even as humans — which is why it’s beneficial to have some universal knowledge as a base for further learning. In the image recognition world, several pretrained models like VGG16 and Inception, designed and trained to recognize objects, can be accessed and further fine-tuned.
For instance, consider the task of recognizing a dog or a cat in an image; while one could construct a neural network from scratch, why not use an architecture and weights that have proven to work on a similar task? Libraries that offer pretrained models also let users stack additional layers before or after the pretrained model for extra customizability.
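A minimal sketch of fine-tuning VGG16 in Keras with a custom head might look like this (the input size and head layers are illustrative assumptions, not prescribed by the article):

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

# Load VGG16 with ImageNet weights, dropping its original classifier
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained weights to start

# Stack a custom head on top of the pretrained feature extractor
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary output: cat vs. dog
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```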
In Natural Language Processing, embeddings are used to map words or other tokens from their high-dimensional vectorized form into a lower-dimensional space. Words that have similar meanings, or fall into similar categories with respect to the context of the dataset (e.g. water & liquid, king & man, queen & woman), are placed closer together in the embedding space.
Embeddings are valuable because they are a less costly way to extract deeper meaning from words before the expensive neural network operations. They reduce the dimensionality of the data effectively, and they can be visualized and interpreted using manifold learning techniques like t-SNE to make sure the network is on the right learning track.
Pretrained embeddings like GloVe (Global Vectors for Word Representation) are trained on some of the world's largest repositories of text, like Wikipedia. Incorporating pretrained elements into your model increases its power and reduces guesswork.
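A common pattern, sketched below, is to load GloVe vectors into a frozen Keras Embedding layer; the file path is an assumption, and word_index is a hypothetical vocabulary mapping from a fitted tokenizer:

```python
import numpy as np
from tensorflow.keras import layers

EMBEDDING_DIM = 100  # e.g. the 100-dimensional glove.6B vectors

# Parse the GloVe text file into a {word: vector} lookup
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        embeddings_index[word] = np.asarray(values, dtype="float32")

# Build an embedding matrix for our own vocabulary
# (word_index is assumed to come from a fitted tokenizer)
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# A frozen embedding layer initialized with the pretrained vectors
embedding_layer = layers.Embedding(
    input_dim=len(word_index) + 1,
    output_dim=EMBEDDING_DIM,
    weights=[embedding_matrix],
    trainable=False,
)
```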
Check out these resources: using pre-trained image recognition models in Keras, guide to using pre-trained embeddings in Keras, guide to using embeddings in Keras.
Stimulate healthy gradient flow
Especially with deep networks, it's always important to prioritize a healthy gradient flow. Backpropagation can only work effectively if the architecture is built so that the signal can pass through the entire network. If the gradient gradually diminishes, the front layers are barely updated (the vanishing gradient problem); if it is amplified too much, weight updates become so extreme that the weights eventually blow up to NaN (the exploding gradient problem).
Employ skip connections. These are connections that jump over one or more layers, which means that gradients can keep flowing even if a problematic layer would otherwise block them. Such connections are central to the successful ResNet architecture (which is also available as a pretrained model in most deep learning libraries).
Use batch normalization layers. Not only do these normalize the inputs to each layer to keep training stable, they also smooth the error surface, giving the optimizer an easier landscape to work with.
Try to use unbounded activations. Vanishing gradient problems are often caused by bounded functions like tanh, sigmoid, and softmax, because their gradients shrink towards zero when inputs drift into the saturated regions, so the signal diminishes as it is propagated back through many layers.
ReLU, which is unbounded on one side, is usually the default; sometimes alternatives like Leaky ReLU perform better; in general, bounded functions are not widely used except in the final output layer.
Read this to get a good intuition for why ReLU is so effective and how the linear unit adds valuable nonlinearity.
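Putting these three ideas together, here is a minimal ResNet-style block in Keras (a sketch, not the actual ResNet implementation) that combines convolutions, batch normalization, ReLU, and a skip connection:

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Conv -> BatchNorm -> ReLU twice, with a skip connection around the pair.

    Assumes x already has `filters` channels so the shapes match for the addition.
    """
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # The skip connection lets gradients flow through the addition even if
    # the convolutional path becomes problematic
    y = layers.Add()([shortcut, y])
    return layers.ReLU()(y)
```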
Keep regularization in mind
Training a neural network is a difficult task: it is highly dependent on a variety of parameters and on initialization. Regularization can act as a guard rail and prevent the neural network from straying too far from its fairly abstract purpose: to neither underfit nor overfit the data.
Of course, the actual goal is to perform well on the test set, but the neural network is not supposed to see it during training, or it would not be a test set in the first place. Naturally, a neural network, which is designed to approximate functions, will fit itself to the data. Regularization prevents it from overfitting, that is, from taking the easy way out by memorizing data points instead of actually learning generalizations.
Adding dropout layers is perhaps the easiest way to add regularization to a neural network. Dropout randomly blocks a fraction of neurons from connecting to the next layer, which intuitively prevents the network from passing on too much overly specific information. There are many other perspectives on why dropout works so well; for example, it can be thought of as training an ensemble or as progressive updating. Read more here.
Additionally, L2 regularization is another method of keeping weights in check, although it is arguably a less 'natural' way of doing so. Regardless of the means used to achieve it, regularization and gradient flow should be at the forefront of architecture design.
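In Keras, both forms of regularization can be added in a line or two; the sketch below uses arbitrary layer sizes and rates:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 weight penalty
    layers.Dropout(0.5),  # randomly drop 50% of activations each training step
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
```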
Don't sweat too much about the exact numbers.
Building neural networks can be daunting because so much of the process is variable: the number of layers, the number of neurons in each layer, the types of layers, and so on. Chances are, however, that tweaking the number of neurons isn't really going to change the predictive power.
In tutorials, you may often see the number of neurons in each layer or the batch size written as a power of two. While some studies claim that using powers of two is more efficient (something to do with how GPUs use memory), the result is likely not universal; it hasn't been shown decisively that powers of two are optimal in any way for selecting hyperparameters.
However, it is a good convention to follow for a few reasons:
- It provides structure to a process that otherwise has very little. Instead of fidgeting over whether a layer should have 25 or 26 neurons, just settle on 32. It's a convenient template for choosing layer sizes.
- It is a logarithmic approach to search. Since the change from 32 to 33 neurons matters far less than the change from 1 to 2, it makes sense to double the number of neurons each time when searching for the next plausible value. If you want to change the number of neurons in a layer, halve it or double it to see any real change, as in the sketch below.
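As a rough sketch of that doubling search (build_model, x_train, and y_train are hypothetical stand-ins, not from the article):

```python
# Double the layer width each trial instead of nudging it by one or two units.
# build_model is a hypothetical helper that returns a compiled Keras model
# with the given hidden-layer width; x_train / y_train are assumed data.
for units in [32, 64, 128, 256]:
    model = build_model(hidden_units=units)
    history = model.fit(x_train, y_train, validation_split=0.2,
                        epochs=10, verbose=0)
    print(units, max(history.history["val_accuracy"]))
```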
Of course, it’s good to think through the information flow of the network, but the number of layers or neurons is relatively arbitrary within a certain range. For a more intuitive explanation, explore the Universal Approximation Theorem, which demonstrates how neural networks approach tasks and the role a single neuron plays in prediction.
Try to set as few constants as possible
In general, the philosophy of model-building is to follow the data and reduce the dependence on hard-coded constants as much as possible. This is especially true, yet difficult, with neural networks. While certain constants like the number of neurons in a layer or filter sizes must be hard-coded*, try to shy away from setting constants yourself where you can, because they are potential sources of error.
For instance, don't hard-code the learning rate. You can set a high initial learning rate and use a tool like Reduce Learning Rate on Plateau to automatically lower the rate when performance stagnates. The same applies to many other parameters of the neural network: training is a very dynamic process, and some parameters simply should not stay fixed.
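In Keras, this is the ReduceLROnPlateau callback; the sketch below uses illustrative settings, and model, x_train, and y_train are assumed to exist:

```python
from tensorflow import keras

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch validation loss
    factor=0.2,          # shrink the learning rate to 20% when it stagnates
    patience=5,          # wait 5 epochs without improvement before reducing
    min_lr=1e-6,
)

# Start with a relatively high learning rate and let the callback adapt it
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-2), loss="mse")
model.fit(x_train, y_train, validation_split=0.2, epochs=100,
          callbacks=[reduce_lr])
```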
*However, work on automatically selecting neural network architectures, and machine learning pipelines in general, with very little hard-coding is very new and promising.
Make use of preprocessing layers
Keras and several other deep learning libraries offer preprocessing layers that can be added as the first few layers of a neural network, for instance to vectorize text, standardize images, or normalize data. To make your model portable and easy to work with, it's always a good idea to use preprocessing layers for deployment.
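A short sketch of what this looks like for tabular data, using the Keras Normalization preprocessing layer (available in recent TensorFlow versions; x_train is an assumed array of training features):

```python
from tensorflow import keras
from tensorflow.keras import layers

# The Normalization layer learns the feature means/variances from the data
normalizer = layers.Normalization()
normalizer.adapt(x_train)  # x_train is assumed to be your training features

model = keras.Sequential([
    normalizer,                         # preprocessing now travels with the model
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                    # e.g. a regression output
])
```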
Read about preprocessing layers in Keras here.
Data augmentation
If you’re not using data augmentation in image recognition tasks, you’re wasting your dataset.
Image data is hard to come by, and it's a shame if a network can only extract a limited amount of learning from such an expensive piece of information. Data augmentation artificially boosts the size of the dataset by passing images through randomly generated transformations, which can apply zooms, rotations, flips, dimming, brightening, whitening, color changes, and more.
When data augmentation is applied correctly, it improves the network's ability to generalize to new images and better addresses real-world issues in object recognition, such as adversarial or malicious inputs, which can for instance trick a sign-recognizing self-driving car into accelerating at deadly speeds.
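A minimal sketch with Keras's ImageDataGenerator (the transformation ranges are illustrative, and model, x_train, and y_train are assumed to exist):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each batch is drawn with a random combination of these transformations
datagen = ImageDataGenerator(
    rotation_range=20,            # random rotations up to 20 degrees
    zoom_range=0.15,              # random zoom in/out
    width_shift_range=0.1,        # random horizontal shifts
    height_shift_range=0.1,       # random vertical shifts
    horizontal_flip=True,         # random left-right flips
    brightness_range=(0.8, 1.2),  # random dimming/brightening
)

# The model sees a freshly augmented batch at every training step
model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)
```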
Understand the various parameters in data generators, and why you need to be careful about which ones you use, here.
More Data!
More data is, of course, the best solution. The fanciest algorithms can’t even compare to the benefit a good batch of data can bring: data is a valuable, and hence expensive, commodity. Simply incorporating additional data can widen the horizons of the model, more than spending hours fine-tuning a model’s technical parameters ever could.
In this article, I reduce the mean absolute error of coronavirus forecasts more than tenfold with a simple random forest regression model, simply by adding Wikipedia data on country-level statistics.
Use inspiration from existing models.
To build a great neural network, look towards the masters! Existing models like BERT, Google Neural Machine Translation (GNMT), and Inception are built in certain ways for certain reasons.
Inception is built in a modular format, with multiple stacked Inception cells. It uses 1x1 convolutions before the larger ones and orders its pooling and convolutions deliberately. GNMT stacks eight LSTMs into both an encoder and a decoder for Google Translate, and while you may not care about how translation works, it's worth exploring how the architecture handles such a deep recurrent stack and applying those ideas to your own RNN designs.
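For illustration, here is a simplified Inception-style cell in Keras (a sketch, not the exact GoogLeNet implementation), showing the cheap 1x1 convolutions applied before the more expensive filters:

```python
from tensorflow.keras import layers

def inception_cell(x, filters):
    """Parallel 1x1, 3x3, 5x5 and pooling branches, concatenated channel-wise."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)

    b2 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)  # cheap 1x1 first
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 5, padding="same", activation="relu")(b3)

    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)

    return layers.Concatenate()([b1, b2, b3, b4])
```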
The best tips and tricks come from the innovation at top companies and research departments developing the newest methods and cultivating the newest concepts. Take time to explore and understand their ideas.
Thanks for reading!
Translated from: https://towardsdatascience.com/10-invaluable-tips-tricks-for-building-successful-neural-networks-566aca17a0f1