The softmax activation function
The objective of this post is three-fold. The first part discusses the motivation behind sparsemax and its relation to softmax, summarizes the original research paper in which this activation function was first introduced, and gives an overview of the advantages of using sparsemax. Parts two and three are dedicated to the mathematical derivations, concretely finding a closed-form solution as well as an appropriate loss function.
1. Overview of Sparsemax
In the paper “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification”, Martins et al. propose a new alternative to the widely known softmax activation function by introducing Sparsemax.
While softmax is an appropriate choice for multi-class classification, outputting a normalized probability distribution over K classes, in many tasks we want to obtain an output that is sparser. Martins et al. introduce a new activation function, called sparsemax, that outputs sparse probabilities of a multinomial distribution and therefore filters out noise from the mass of the distribution. This means that sparsemax assigns a probability of exactly 0 to some classes, while softmax would instead keep those classes and assign them very small values like 10⁻³. Sparsemax can be especially favorable in large classification problems, for instance in Natural Language Processing (NLP) tasks where the softmax layer models a multinomial distribution over a very large vocabulary.
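To make the difference concrete, here is a minimal NumPy sketch contrasting the two mappings. It assumes the closed-form solution of sparsemax that is derived in Part 2, and the function names and example scores are only for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift scores for numerical stability
    return e / e.sum()

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex,
    # using the closed form derived in Part 2 of this post.
    z_sorted = np.sort(z)[::-1]            # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum    # coordinates kept in the support
    k_z = k[support][-1]                   # size of the support
    tau = (cumsum[support][-1] - 1) / k_z  # threshold tau(z)
    return np.maximum(z - tau, 0.0)

z = np.array([1.0, 0.8, 0.1, -0.5])  # arbitrary example scores
print(softmax(z))    # all classes get a non-zero probability
print(sparsemax(z))  # low-scoring classes get exactly 0 (here: 0.6, 0.4, 0.0, 0.0)
```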
In practice, however, changing the softmax function into a sparse estimator is not a straightforward task. Obtaining such a transformation while preserving some of the fundamental properties of softmax (e.g. being simple to evaluate, inexpensive to differentiate, and easily turned into a convex loss function) turns out to be quite challenging. A traditional workaround in machine learning is to use an L1 penalty, which allows for some level of sparsity in the input variables and/or the deep layers of a neural network. While this approach is relatively straightforward, the L1 penalty influences the weights of a neural network rather than the outputs themselves, which is where we want the sparse probabilities. Therefore, Martins et al. recognize the need for a supplementary activation function, i.e. sparsemax, which they formulate as a solvable quadratic problem and solve under a set of constraints so that it retains properties similar to softmax.
Before diving into the proofs behind the sparsemax implementation, let us first discuss a few important high-level findings from the paper. The following bullet points summarize some of the main takeaways:
Sparsemax is a piecewise linear activation function
While softmax has the shape of the traditional sigmoid, sparsemax in one dimension is a “hard” sigmoid. Additionally, in two dimensions, sparsemax is a piecewise linear function with entire saturated zones (exactly 0 or 1). Here is a figure from the paper to help you visualize softmax and sparsemax.
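To see where the “hard” sigmoid comes from, consider the two-class case with scores (t, 0). Assuming the closed-form solution derived later in the post, the first coordinate of sparsemax is a clipped linear function of t, whereas softmax gives the usual smooth sigmoid:

$$\operatorname{sparsemax}([t, 0])_1 = \min\left(1, \max\left(0, \frac{t+1}{2}\right)\right), \qquad \operatorname{softmax}([t, 0])_1 = \frac{1}{1 + e^{-t}}.$$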
Sparsemax Loss is related to the classification Huber Loss
The derived sparsemax loss function in the binary case is directly related to the modified Huber loss used for classification (as defined in Zhang, Tong. Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. Annals of Statistics, pp. 56–85, 2004; and Zou, Hui, Zhu, Ji, and Hastie, Trevor. The Margin Vector, Admissible Loss and Multi-class Margin-Based Classifiers. Technical report, Stanford University, 2006). That is, if x and y are the two scores before sparsemax, and we use a sparsemax layer with the sparsemax loss, then setting t = x - y and assuming without loss of generality that the correct label is the first one, we can show that:
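Reconstructing the paper's binary-case result in the notation above (t = x - y, first class correct):

$$L_{\text{sparsemax}}(t) = \begin{cases} -t & \text{if } t \le -1, \\ \frac{(1-t)^2}{4} & \text{if } -1 < t < 1, \\ 0 & \text{if } t \ge 1, \end{cases}$$

which coincides with the modified Huber loss up to a constant scaling factor.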
This is a nice property that supports the theoretical foundation of sparsemax; the Huber loss is a tradeoff between L1 and L2 penalties, which is exactly what we are trying to obtain from the softmax activation while adding sparsity. Additionally, this similarity to the Huber loss can be demonstrated by comparing the loss to other standard classification losses:
In the above graph, you can see that for negative values of t, i.e. for cases of large error, the loss scales linearly with the error, similarly to the hinge loss. However, as t approaches 1, i.e. as the error diminishes, we observe a squared relationship, similar to the least-squares loss.
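As a quick numerical check of this piecewise behavior, here is a small sketch (the helper name is hypothetical) that evaluates the binary sparsemax loss from the equation above in its three regimes:

```python
def binary_sparsemax_loss(t):
    """Binary sparsemax loss for the score difference t = x - y,
    assuming the class scored by x is the correct one."""
    if t <= -1:
        return -t                # large error: linear, hinge-like regime
    if t < 1:
        return (1 - t) ** 2 / 4  # small error: quadratic, least-squares-like regime
    return 0.0                   # margin of at least 1: zero loss

for t in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"t = {t:+.1f} -> loss = {binary_sparsemax_loss(t):.4f}")
```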