Face recognition is mainly about extracting facial features and then telling identities apart by those features.
Early losses for training face models include contrastive loss and triplet loss; both feed several samples into the model together and try to pull apart the differences between them.
Today we focus on the most important losses of recent years (common in interviews and daily work).
The authors of the center loss paper point out in their introduction:
Many tasks are close-set identification, e.g. classification networks.
Close-set identification means that the predicted result must belong to one of the classes in the training set. For example, with ImageNet, the target in an image we detect must belong to one of the 1000 ImageNet classes.
This raises a generalization issue.
The essence of face recognition is to distinguish every individual person. Each person's face is effectively its own class, and new identities (new "classes") can appear at any time, so face recognition is not close-set identification. The way to handle this open-set setting is to make the learned features clearly discriminative: features of the same person should stay compact, while each person's facial features should be clearly distinguishable from everyone else's.
So, although both are classification, the authors discuss two kinds of solutions: Separable and Discriminative. As the figure below shows, both separate the classes, but the latter clearly has stronger discriminating power.
From this, the authors introduce center loss, designing a loss that gives the network good discriminative ability.
For an ordinary classification model we use the cross entropy loss. Figures (a) and (b) below plot the layer activations on the training set and the test set respectively; this is the classification result under the cross entropy loss.
The cross entropy loss is just the negative log-likelihood loss (NLL), where the predicted value is the output of a softmax.
On the relation between NLL and cross entropy (CE): NLL is a special case of CE. CE measures the cross entropy between the true distribution and the predicted distribution. In information theory, if an event $k$ occurs with probability $P_k$, then the information content of that event is $-\log P_k$, so the entropy $H(p)$ of the distribution $p$ is

$$H(p) = -\sum_k P_k \log P_k$$

If the events follow two different distributions $p$ and $q$, then their cross-entropy $H(p, q)$ is defined as

$$H(p, q) = -\sum_k p_k \log q_k$$

Now treat $p$ as the ground-truth distribution of the training set and $q$ as the predicted distribution. When $p$ is a one-hot vector, the cross entropy loss simplifies directly to

$$L = -\log q_{y}$$

where $y$ is the index of the true class.
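As a quick sanity check of these definitions, here is a minimal sketch with made-up numbers:

```python
# Cross entropy between a one-hot p and a predicted distribution q.
import math

p = [1.0, 0.0, 0.0]   # one-hot ground-truth distribution
q = [0.7, 0.2, 0.1]   # predicted (softmax) distribution

# H(p, q) = -sum_k p_k * log(q_k); with one-hot p this collapses to -log(q_y)
cross_entropy = -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)
print(cross_entropy)  # ~0.357, i.e. -log(0.7)
```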
The formula in the original paper is as follows; note that $m$ and $n$ are the batch size and the number of classes respectively, and the loss is computed once per batch:

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^T x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^T x_i + b_j}}$$
When PyTorch and similar frameworks implement torch.nn.CrossEntropyLoss, it actually combines the LogSoftmax() and NLLLoss() functions (for reference see the post “pytorch中的一些loss函数”). So it expects the raw outputs of the model rather than outputs passed through a softmax layer, because the softmax is folded directly into torch.nn.CrossEntropyLoss. This is also why people sometimes like to call it the “softmax loss”, which is not really a formal name.
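A minimal sketch verifying that equivalence on random logits (shapes here are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)              # raw model outputs, no softmax applied
target = torch.randint(0, 10, (4,))

a = F.cross_entropy(logits, target)                    # CrossEntropyLoss
b = F.nll_loss(F.log_softmax(logits, dim=1), target)   # LogSoftmax + NLLLoss
print(torch.allclose(a, b))              # True
```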
Conclusion (quoted from the paper): “learned features are separable, but deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to directly use these features for recognition.”
Below is the center loss, which is easy to understand: $c_{y_i}$ is the center of the features of all samples belonging to class $y_i$, and the loss pulls each feature $x_i$ as close as possible to its center, reducing the intra-class distance (noise):

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2$$
Below are three questions and answers. The answers were written by GPT, and I think they are quite good.
1. Explain center loss
Center loss is a loss function used in deep learning to improve the performance of models on classification tasks, particularly for face recognition. The idea behind center loss is to learn a set of features that better represent the different classes in the dataset by encouraging the model to group together features of the same class and separate those of different classes.
To do this, center loss calculates a "center" for each class, which is a vector that represents the average of all the feature vectors of that class. The goal of the loss function is then to minimize the distance between a given input feature vector and its corresponding class center, while maximizing the distance between feature vectors of different classes.
The formula for center loss can be written as:
$$L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert_2^2$$

where $L_C$ is the center loss, $N$ is the number of samples in a mini-batch, $x_i$ is the feature vector of the $i$-th sample, $c_{y_i}$ is the center of the class to which the $i$-th sample belongs, and $\lVert \cdot \rVert_2^2$ denotes the squared Euclidean distance.
During training, center loss is typically used in conjunction with a softmax cross-entropy loss, which is a standard loss function for classification tasks. The softmax cross-entropy loss encourages the model to correctly predict the class label of each input, while center loss helps to learn better feature representations.
By combining these two losses, the model learns to both accurately classify inputs and separate the feature representations of different classes. This can result in improved performance on face recognition and other classification tasks.
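The formula above is easy to check on a toy batch (all tensors below are made up for illustration):

```python
import torch

features = torch.randn(8, 128)        # x_i: one 128-d feature per sample
labels = torch.randint(0, 5, (8,))    # y_i: class index of each sample
centers = torch.randn(5, 128)         # c_j: one center per class

# L_c = 1/2 * sum_i ||x_i - c_{y_i}||^2
center_loss = 0.5 * (features - centers[labels]).pow(2).sum()
print(center_loss)
```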
2. Why center loss cannot be used directly
Center loss is a popular method in deep learning for improving the performance of
face recognition and other related tasks. It aims to learn a set of feature
representations such that the distance between the features of the same class is
minimized while the distance between the features of different classes is
maximized. This is achieved by adding a center loss term to the standard softmax
cross-entropy loss.
While center loss can be effective in improving the performance of deep learning
models, it cannot be used directly in all cases. One reason is that center loss
is a form of regularization, which means that it is intended to prevent
overfitting by constraining the model's learned representations. However, if the
model is already underfitting or is not complex enough to learn meaningful
features, center loss may not be effective.
Another reason why center loss cannot be used directly is that it may not be
appropriate for all types of tasks. For example, center loss is most commonly
used for face recognition tasks, where the goal is to identify individuals based
on their facial features. However, for tasks that require more general feature
learning, such as object detection or semantic segmentation, other loss functions
may be more appropriate.
Additionally, center loss may require careful tuning of hyperparameters, such as
the weight assigned to the center loss term, in order to achieve the best
performance. This tuning process can be time-consuming and computationally
expensive.
In summary, while center loss can be a useful tool in deep learning, it is not a
one-size-fits-all solution and may require careful consideration and adaptation
to different tasks and models.
3. Why center loss acts as regularization
Center loss can be considered as a form of regularization in deep learning
because it helps to prevent overfitting of the model by constraining the learned
representations.
During training, the center loss encourages the model to learn features that are
more representative of each class by minimizing the distance between an input
feature vector and its corresponding class center. By doing so, the model is
encouraged to group together features of the same class and separate those of
different classes.
This has a regularizing effect because it encourages the model to learn more
generalizable features that are less likely to be specific to the training data
and more likely to be applicable to new, unseen data. In other words, the model
is encouraged to learn features that are better at generalizing to new data and
less likely to memorize specific examples from the training data.
Additionally, center loss can help to reduce the impact of noisy or mislabeled
data on the model's learned representations. Since the loss is based on the
distance between input feature vectors and class centers, it can help to pull
misclassified or noisy samples towards the correct class center and away from the
incorrect one.
Overall, by encouraging the model to learn better, more generalizable feature
representations, center loss can help to prevent overfitting and improve the
performance of deep learning models on classification tasks.
To make center loss usable, the authors made a few adjustments during training: the centers are updated per mini-batch rather than over the entire training set, with their own update rate $\alpha$, and the center loss is combined with the softmax loss as a joint objective $L = L_S + \lambda L_C$, since training on $L_C$ alone would simply let the features and centers collapse toward zero.
With different values of $\lambda$ we get different training results, and the results match our expectations.
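Below is a minimal PyTorch sketch of this joint training setup (class and variable names are mine, not from the paper's code); here the centers are a learnable nn.Parameter updated from mini-batch gradients:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2 with learnable class centers."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        return 0.5 * (features - self.centers[labels]).pow(2).sum()

# Joint supervision: L = L_S + lambda * L_C
center_loss = CenterLoss(num_classes=10, feat_dim=2)
ce_loss = nn.CrossEntropyLoss()
lam = 0.003                                 # lambda weight; value is illustrative

features = torch.randn(8, 2, requires_grad=True)   # would come from the backbone
logits = torch.randn(8, 10, requires_grad=True)    # would come from the FC head
labels = torch.randint(0, 10, (8,))

loss = ce_loss(logits, labels) + lam * center_loss(features, labels)
loss.backward()
```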
In face recognition algorithms such as CosFace or ArcFace, what does the angle $\theta$ mean? The answer follows.
In face recognition systems such as CosFace and ArcFace, the theta angle refers
to the angle between the feature embedding vector of a face and the weight vector
associated with the ground truth label of that face.
In more detail, the feature embedding vector is a high-dimensional representation
of a face that is generated by a deep neural network, while the weight vector is
a learned parameter vector associated with the ground truth label of that face.
The cosine of the theta angle is then the cosine similarity between these two
vectors, which measures the degree of alignment between the feature embedding
vector and the weight vector.
The use of the theta angle in face recognition systems allows for a more
discriminative and robust feature representation of faces, leading to improved
performance in face recognition tasks.
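As a concrete illustration of this (my own sketch, not code from either paper): with L2-normalized features and weights, the dot product is exactly $\cos\theta$, and the angle itself can be recovered with arccos:

```python
import torch
import torch.nn.functional as F

embedding = F.normalize(torch.randn(1, 512), dim=1)     # unit-norm face feature
class_weight = F.normalize(torch.randn(1, 512), dim=1)  # unit-norm weight W_y

cos_theta = (embedding * class_weight).sum()
theta = torch.acos(cos_theta.clamp(-1.0, 1.0))          # the angle, in radians
print(cos_theta.item(), theta.item())
```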
And what exactly is the feature embedding vector? The answer follows.
The feature embedding vector is a high-dimensional numerical representation of a
face that is learned by a deep neural network. In face recognition systems, the
neural network is typically trained on a large dataset of face images in an
unsupervised or supervised manner to learn a feature representation that is
highly discriminative and robust to variations in lighting, pose, and other
factors.
The output of the neural network for a given face image is a feature embedding
vector, which can be thought of as a set of numerical values that encode the
salient characteristics of the face. These values can be used to compare and
match faces for tasks such as identification or verification.
The feature embedding vector is typically a high-dimensional vector, ranging from
a few hundred to a few thousand dimensions, depending on the specific neural
network architecture used. The dimensions of the vector are not directly
interpretable, but the values of the vector are learned to be highly informative
for distinguishing between different faces.
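In practice, face verification usually reduces to thresholding a similarity between two such embeddings; a minimal sketch (the 0.5 threshold is arbitrary and would be tuned per dataset):

```python
import torch
import torch.nn.functional as F

emb_a = torch.randn(512)   # embedding of face A (would come from the network)
emb_b = torch.randn(512)   # embedding of face B

similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
same_person = similarity > 0.5   # arbitrary threshold for this sketch
print(similarity.item(), bool(same_person))
```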
Having understood the meaning of $\theta$, we can better understand how these losses are designed. Looking back at the CosFace loss:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{s\cos\theta_j}}$$
Design idea:
Suppose we have two angles $\theta_1$ and $\theta_2$, and recall that $\theta$ is the angle between the output feature vector and the weight vector of the ground-truth class. For the $i$-th class, the angle $\theta_2$ corresponding to $\cos(\theta) - m$ is larger than the angle $\theta_1$ corresponding to $\cos(\theta)$. In other words, after introducing the parameter $m$, the loss treats each sample as if its angle were larger than it really is, making it harder to learn; once the model truly adapts to $m$, $\theta$ ends up smaller, i.e. each class becomes more tightly clustered.
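A quick numeric illustration of this effect (the angle and margin values are picked arbitrarily):

```python
import math

theta1 = math.radians(60)                  # the sample's true angle
m = 0.35                                   # CosFace margin
cos_with_margin = math.cos(theta1) - m     # 0.5 - 0.35 = 0.15
theta2 = math.acos(cos_with_margin)        # angle the loss "pretends" we are at
print(math.degrees(theta2))                # ~81.4 degrees > 60 degrees
```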
For the CosFace loss implementation, look directly at the forward part.
First, L2-normalize both the feature and the weight.
Then multiply the two matrices; since both now have norm 1, the products are exactly the cosine values.
Next subtract m, and use the one-hot label to pick out the entries that should have m subtracted, keeping the original cosine values at all other positions.
Finally multiply by s, return the result, and feed it into the cross entropy loss.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class AddMarginProduct(nn.Module):
    r"""Implement of large margin cosine distance (CosFace):
    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        s: norm of input feature
        m: margin
    cos(theta) - m
    """

    def __init__(self, in_features, out_features, s=30.0, m=0.40):
        super(AddMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        # both feature and weight are L2-normalized, so their product is cos(theta)
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        phi = cosine - self.m
        # --------------------------- convert label to one-hot ---------------------------
        one_hot = torch.zeros(cosine.size(), device=cosine.device)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        # keep phi = cos(theta) - m only at the ground-truth class, cosine elsewhere
        # (equivalent to torch.where(one_hot.bool(), phi, cosine))
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s  # rescale by s before feeding into cross entropy
        return output

    def __repr__(self):
        return self.__class__.__name__ + '(' \
               + 'in_features=' + str(self.in_features) \
               + ', out_features=' + str(self.out_features) \
               + ', s=' + str(self.s) \
               + ', m=' + str(self.m) + ')'
```
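A minimal usage sketch for the head above (assuming the class and imports just defined; shapes and values are arbitrary). The returned logits go straight into nn.CrossEntropyLoss, matching the walkthrough above:

```python
head = AddMarginProduct(in_features=512, out_features=1000, s=30.0, m=0.40)
criterion = nn.CrossEntropyLoss()

features = torch.randn(8, 512)            # embeddings from some backbone
labels = torch.randint(0, 1000, (8,))

logits = head(features, labels)           # s * (cos(theta) - m) at the target class
loss = criterion(logits, labels)
loss.backward()
```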
Now look at the ArcFace pipeline. Much of it is similar to CosFace; the difference is how the angle is handled.
The idea is much the same as CosFace: by adding a margin $m$ to $\theta$, the value $\cos(\theta+m)$ becomes smaller than $\cos(\theta)$, so the whole loss becomes larger and learning becomes harder, which in turn pushes $\theta$ to shrink.
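A quick numeric comparison of the two margins (angle chosen arbitrarily): ArcFace perturbs the angle itself, while CosFace perturbs the cosine:

```python
import math

theta = math.radians(60)
m = 0.5                          # ArcFace margin, in radians
print(math.cos(theta))           # 0.5        (no margin)
print(math.cos(theta + m))       # ~0.024     (ArcFace target logit)
print(math.cos(theta) - 0.35)    # 0.15       (CosFace, with m = 0.35)
```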
The ArcFace loss code is below.
Here m is an angle in radians, set to 0.5 in the experiments.
The only potentially confusing part of the code is the easy margin branch. If easy_margin is true, the margin m is added to the angle only when the angle is below 90° (i.e. the cosine is positive). If it is false, self.th serves as the threshold: sketch the cosine curve and you will see that whenever the angle θ is smaller than π − m, its cosine on the monotonic interval [0, π] must be greater than this threshold cos(π − m). So the role of th is to guarantee that θ + m still lies within the monotonic interval [0, π].
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class ArcMarginProduct(nn.Module):
    r"""Implement of large margin arc distance (ArcFace):
    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        s: norm of input feature
        m: margin
    cos(theta + m)
    """

    def __init__(self, in_features, out_features, s=30.0, m=0.50, easy_margin=False):
        super(ArcMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)      # threshold keeping theta + m in [0, pi]
        self.mm = math.sin(math.pi - m) * m  # linear fallback penalty m * sin(m)

    def forward(self, input, label):
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        # clamp guards against tiny negatives from floating point error before sqrt
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        # cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            # add the margin only when theta < 90 degrees (cosine > 0)
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            # if theta + m would leave the monotonic interval [0, pi],
            # fall back to the linear penalty cos(theta) - m*sin(m)
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        one_hot = torch.zeros(cosine.size(), device=cosine.device)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s
        return output
```
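Usage mirrors the CosFace head (again a minimal sketch, assuming the class and imports above):

```python
head = ArcMarginProduct(in_features=512, out_features=1000, s=30.0, m=0.50,
                        easy_margin=False)
criterion = nn.CrossEntropyLoss()

features = torch.randn(8, 512)
labels = torch.randint(0, 1000, (8,))

loss = criterion(head(features, labels), labels)   # s * cos(theta + m) at targets
loss.backward()
```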