Deep SAD Paper Notes

Prerequisites:

  • Kernel function
    Let X be the input space (a Euclidean space or a discrete set) and H a feature space (a Hilbert space).
    Suppose there exists a map $\phi: X \to H$ such that $K(x, z) = \langle \phi(x), \phi(z) \rangle$ for all $x, z \in X$.
    Then $\phi(x)$ is called the feature map and $K(x, z) = \langle \phi(x), \phi(z) \rangle$ the kernel function.
    A valid kernel must satisfy Mercer's theorem.
    When can kernel methods be applied to data?
    In my understanding, kernel methods can be applied whenever the original data has dimension greater than zero and the problem involves some spatial metric (whether to actually use one still depends on the optimization objective). This is also where a common misconception arises: people assume a kernel is simply a function that maps from a low-dimensional space to a high-dimensional one. A kernel-based feature transformation does involve a mapping, but the "low-dimensional to high-dimensional" phrasing is not rigorous: ordinary spaces have a metric and hence a notion of dimension, whereas a Hilbert space may extend to infinitely many dimensions, so talking about "dimension" there no longer carries much meaning. A concrete sketch of the kernel trick follows below.
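To make the definition concrete, here is a minimal NumPy sketch (the degree-2 polynomial kernel on 2-D inputs is only an illustrative choice) showing that evaluating the kernel directly agrees with taking the inner product after an explicit feature map:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel K(x, z) = (x·z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """The same kernel evaluated directly in the input space, without mapping."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Both routes give the same value: <phi(x), phi(z)> == K(x, z)
print(np.dot(phi(x), phi(z)))  # 16.0
print(poly_kernel(x, z))       # 16.0
```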

  • Gaussian kernel
    $K(x, z) = \exp\left(-\dfrac{\lVert x - z \rVert^2}{2\sigma^2}\right)$, where $\sigma > 0$ is the bandwidth.
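A minimal NumPy sketch of the Gaussian kernel as written above; the bandwidth σ = 1.0 is just an illustrative choice:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([0.0, 1.0])
z = np.array([1.0, 1.0])
print(gaussian_kernel(x, z))  # exp(-0.5) ≈ 0.6065
```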

  • SVDD
    Map the original data into a high-dimensional feature space, find the smallest enclosing hypersphere in that space, and then map the hypersphere back to the original data space; this yields a contour around the data.
    Overview
    Details
    SVDD objective:
    $\min_{R, c, \xi}\; R^2 + C \sum_i \xi_i \quad \text{s.t. } \lVert \phi(x_i) - c \rVert^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0$

  • One-class SVM
    Support vector data description (SVDD) finds the smallest hypersphere that contains all samples, except for some outliers. One-class SVM (OC-SVM) separates the inliers from the outliers by finding a hyperplane of maximal distance from the origin.
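As a rough illustration of these shallow baselines, the sketch below fits scikit-learn's OneClassSVM with an RBF kernel on toy 2-D data (with a Gaussian kernel the OC-SVM and SVDD solutions coincide); ν and the data are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy data: one normal cluster; nu upper-bounds the fraction of training
# points allowed to fall outside the learned boundary.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_test = np.array([[0.1, -0.2],   # looks normal
                   [6.0, 6.0]])   # obvious outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)

print(oc_svm.predict(X_test))            # +1 = inlier, -1 = outlier
print(oc_svm.decision_function(X_test))  # larger value = more "normal"
```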

  • Deep SVDD
    Paper: Deep One-class Classification
    Commentary: Deep One-class Classification
    Deep SVDD loss (soft-boundary version):
    $\min_{R, W}\; R^2 + \dfrac{1}{\nu n}\sum_{i=1}^{n}\max\{0,\, \lVert \phi(x_i; W) - c \rVert^2 - R^2\} + \dfrac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^\ell \rVert_F^2$
    One-Class Deep SVDD (simplified) loss:
    $\min_{W}\; \dfrac{1}{n}\sum_{i=1}^{n}\lVert \phi(x_i; W) - c \rVert^2 + \dfrac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^\ell \rVert_F^2$
    Anomaly score:
    $s(x) = \lVert \phi(x; W^*) - c \rVert^2$
    One-Class Deep SVDD can also be viewed as finding a hypersphere of minimum volume:
    soft-boundary Deep SVDD contracts the sphere by directly penalizing the radius and the representations of data points that fall outside the sphere;
    One-Class Deep SVDD contracts the sphere by minimizing the mean distance of all data representations to the center. Likewise, to map the data (on average) as close as possible to the center c, the neural network must extract the common factors of variation. Penalizing the mean distance over all data points, instead of allowing some points to fall outside the hypersphere, is consistent with the assumption that the majority of the training data comes from one class. A training sketch follows below.
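A minimal PyTorch sketch of the One-Class Deep SVDD objective above. The encoder, data, and hyperparameters are placeholders for illustration; as in the paper's setup, c is fixed to the mean of the initial embeddings and weight decay plays the role of the Frobenius-norm regularizer:

```python
import torch
import torch.nn as nn

# Small encoder phi(.; W). Bias terms are omitted because they admit a
# trivial constant-mapping solution (a pitfall noted in the Deep SVDD paper).
encoder = nn.Sequential(
    nn.Linear(784, 128, bias=False), nn.ReLU(),
    nn.Linear(128, 32, bias=False),
)

X = torch.randn(512, 784)  # stand-in for (mostly normal) training data

# Fix the hypersphere center c as the mean of the initial embeddings.
with torch.no_grad():
    c = encoder(X).mean(dim=0)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3, weight_decay=1e-6)

for epoch in range(20):
    dist_sq = torch.sum((encoder(X) - c) ** 2, dim=1)
    loss = dist_sq.mean()  # One-Class Deep SVDD: mean squared distance to c
    opt.zero_grad()
    loss.backward()
    opt.step()

# Anomaly score s(x): squared distance to the center in latent space.
scores = torch.sum((encoder(X) - c) ** 2, dim=1)
```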

--------------------------------------------------------------------------------
Deep SAD

  1. Motivation:
  • shallow methods: require manual feature engineering to be effective on high-dimensional data and are limited in their scalability to
    large datasets.
  • unsupervised (e.g., OC-SVM): only learns the distribution of the normal data, so the boundary between normal and anomalous data is blurry. The outcome of this shortfall is low-confidence detection of anomalies.
  • semi-supervised setting (prior work):
    • only labelled normal samples
    • cluster assumption: invalid for the “anomaly class” since anomalies are not necessarily similar to one another.

contribution:

  • deep SAD: a generalization of the unsupervised Deep SVDD
  • an information-theoretic framework for deep AD
  2. Information-theoretic view:
  • Information Bottleneck principle
    Consider an input variable X, a latent variable Z (e.g., the final layer of a deep network), and an output variable Y.
    The IB principle describes the trade-off between finding a minimal compression Z of the input X while retaining the informativeness of Z for predicting the label Y.
    Supervised deep learning thus seeks to minimize the mutual information I(X; Z) while maximizing the mutual information I(Z; Y) between Z and the classification task Y:
    $\max_{p(z|x)}\; I(Z; Y) - \alpha\, I(X; Z)$
    where $\alpha > 0$ controls the trade-off between classification accuracy and the complexity (compression) of the representation.
    ps. latent variable–quora

  • Unsupervised: there is no Y.
    Infomax principle: maximize the mutual information I(X; Z), together with a regularizer:
    $\max_{p(z|x)}\; I(X; Z) + \beta\, R(Z)$
    R(Z): regularization term with hyperparameter β > 0, chosen to obtain statistical properties desired for some specific downstream task.
    In deep representations for AD, R(Z) is typically sparsity, the distance to some latent prior distribution (e.g., measured via the KL divergence), an adversarial loss, or simply a bottleneck in dimensionality
    ---- i.e., the latent representation of the normal data should be in some sense "compact." A sketch of one such regularizer follows below.
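As one concrete instance of such an R(Z), here is a small PyTorch sketch (shapes and names are hypothetical) of the KL divergence from a diagonal-Gaussian latent code to a standard normal prior, i.e., the "distance to a latent prior" regularizer mentioned above:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent
    dimensions and averaged over the batch -- one common choice of R(Z)
    that pulls the latent code toward a compact prior."""
    kl_per_dim = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return kl_per_dim.sum(dim=1).mean()

# Hypothetical encoder outputs: batch of 8 samples, 16-d latent code.
mu = torch.randn(8, 16)
logvar = torch.zeros(8, 16)  # unit variance
print(kl_to_standard_normal(mu, logvar))
```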

  • Deep SAD loss:
    $\min_{W}\; \dfrac{1}{n}\sum_{i=1}^{n}\lVert \phi(x_i; W) - c \rVert^2 + \dfrac{\eta}{m}\sum_{j=1}^{m}\left(\lVert \phi(\tilde{x}_j; W) - c \rVert^2\right)^{\tilde{y}_j} + \dfrac{\lambda}{2}\sum_{\ell=1}^{L}\lVert W^\ell \rVert_F^2$

  • Objectives:
    ① The latent distribution of normal data should have low entropy, while the latent distribution of anomalies should have high entropy.
    ② The network must attempt to map known anomalies to some heavy-tailed distribution.
    (heavy-tailed distribution – wiki)
    ③ The objective does not impose any cluster assumption on the anomaly-generating distribution.

For labeled data: the hyperparameter η > 0 balances the labeled and the unlabeled terms. Setting η > 1 puts more emphasis on the labeled data, whereas η < 1 emphasizes the unlabeled data.
For the labeled normal samples (ỹ = +1): impose a quadratic loss on the distances, thus intending to learn a latent distribution that concentrates the normal data overall.
For the labeled anomalies (ỹ = −1): penalize the inverse of the distances, so that anomalies must be mapped further away from the center.
Optimization is done via SGD using backpropagation; a sketch of the loss in code follows below.
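A minimal PyTorch sketch of the Deep SAD loss above (excluding the weight-decay term), assuming an encoder and fixed center c as in the Deep SVDD sketch earlier; eta is the labeled/unlabeled trade-off and eps is a small constant I add for numerical stability:

```python
import torch

def deep_sad_loss(z_unlabeled, z_labeled, y_labeled, c, eta=1.0, eps=1e-6):
    """Deep SAD loss without the network weight-decay term.

    z_unlabeled : latent codes of unlabeled samples, shape (n, d)
    z_labeled   : latent codes of labeled samples,   shape (m, d)
    y_labeled   : labels in {+1 (normal), -1 (anomaly)}, shape (m,)
    c           : fixed hypersphere center, shape (d,)
    """
    dist_u = torch.sum((z_unlabeled - c) ** 2, dim=1)
    dist_l = torch.sum((z_labeled - c) ** 2, dim=1)

    # Quadratic loss for labeled normals (y = +1), inverse distance for
    # labeled anomalies (y = -1): (||z - c||^2) ** y.
    labeled_term = (dist_l + eps) ** y_labeled.float()

    return dist_u.mean() + eta * labeled_term.mean()

# Hypothetical usage with random 32-d latent codes.
c = torch.zeros(32)
z_u = torch.randn(64, 32)
z_l = torch.randn(8, 32)
y_l = torch.tensor([1, 1, 1, 1, -1, -1, -1, -1])
print(deep_sad_loss(z_u, z_l, y_l, c, eta=1.0))
```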

  • Experiments
    Baselines:
    • shallow unsupervised: OC-SVM, SVDD (Gaussian kernel), Isolation Forest, KDE
    • deep unsupervised: autoencoder, Deep SVDD
    • semi-supervised (uses the labels): shallow SSAD
      ---- no deep competitor exists, so a hybrid is used: apply SSAD to the latent codes of an autoencoder
    • deep semi-supervised (classification-based): SS-DGM
    Experimental design:
    • grant the shallow and hybrid methods an (unfair) advantage by selecting their hyperparameters to maximize AUC on a subset (10%) of the test set, to minimize hyperparameter-selection issues.
    • use the same (LeNet-type) deep networks for all deep methods.

Talk notes:
The challenges addressed in this paper are as follows.
First, most supervised methods impose a cluster assumption on the anomaly distribution, even though anomalies need not be similar to one another. The paper therefore tries to generalize an unsupervised method, Deep SVDD, into a semi-supervised one; the challenge is how to perform this generalization, i.e., how to utilize the labeled anomalies while keeping the detector general rather than domain-specific.

Second, previous deep unsupervised methods implicitly apply the Infomax principle and share the idea that normal data should be in some sense "compact". The question is how to interpret this and how to construct a general framework around this idea in unsupervised and semi-supervised settings.

They build their solution on Deep SVDD.
Deep SVDD's objective is to train the neural network φ to learn a transformation that minimizes the volume of a data-enclosing hypersphere centered on a predetermined point c.
Penalizing the mean squared distance forces the network to extract the common factors of variation that are most stable across the dataset. As a consequence, normal data points get mapped near the hypersphere center, while anomalies are mapped further away.

The Deep SVDD objective is equivalent to minimizing the empirical variance, and thus an upper bound on the entropy of a latent Gaussian. In simple terms, the technique minimizes the entropy of the latent distribution concentrated around the point c. I'll explain this further in the next slides.
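A short sketch of the entropy argument, assuming the latent distribution is (at most) as entropic as a Gaussian with the same covariance Σ:

```latex
% Maximum-entropy argument: for any d-dimensional Z with covariance \Sigma,
H(Z) \;\le\; \tfrac{1}{2}\log\!\big((2\pi e)^{d}\det\Sigma\big)
      \;\le\; \tfrac{d}{2}\log\!\big(2\pi e\,\sigma^{2}\big),
\qquad \sigma^{2} = \tfrac{1}{d}\operatorname{tr}(\Sigma),
% where the second step uses AM-GM on the eigenvalues of \Sigma.
% The Deep SVDD term \tfrac{1}{n}\sum_{i}\lVert\phi(x_i;W)-c\rVert^{2} is an empirical
% estimate of \operatorname{tr}(\Sigma) (up to the offset of c from the latent mean),
% so minimizing it shrinks this upper bound on the latent entropy.
```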
The loss for Deep SAD is basically the same as for Deep SVDD, except for the second term. In that term, η (eta) is a hyperparameter controlling the emphasis placed on labeled versus unlabeled data, m is the number of labeled samples, and ỹ takes the value −1 or +1 depending on whether the sample is labeled as anomalous or normal, respectively.
For the labeled normal samples (ỹ = +1), they impose a quadratic loss on the distances of the mapped points to the center c, thus intending to learn a latent distribution which concentrates the normal data.
For labeled anomalies (ỹ = −1), in contrast, they penalize the inverse of the distances, so that anomalies must be mapped further away from the center.
In addition to the inverse squared norm loss, the authors experimented with several other losses, including the negative squared norm loss, negative robust losses, and the hinge loss, but the inverse squared norm ultimately performed best while remaining conceptually simple.
An information-theoretic view:
the latent distribution of normal data — low entropy
the latent distribution of anomalies — high entropy

Deep SVDD may not only be interpreted in geometric terms as minimum volume estimation, but also in probabilistic terms as entropy minimization over the latent distribution.
