The first, topmost concept to establish is this: a CNN is a visual representation learning process. Every convolutional filter w_i is a learned feature template, and every convolution operation is a template matching step that uses a dot product to quantify the similarity between the image patch x_i at the current sliding-window position and the convolutional filter (feature template) w_i. This similarity is expressed as a single one-dimensional real number.
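To make the template-matching view concrete, here is a minimal sketch (plain NumPy; the function and variable names are my own, not from the paper) that computes each output value as the dot product between the filter and the image patch under the sliding window:

```python
import numpy as np

def conv_as_template_matching(image, w, stride=1):
    """Slide the filter (feature template) w over the image; each output
    value is the dot product <w, x_i> with the image patch x_i."""
    kh, kw = w.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # similarity between template w and patch x_i, as one real number
            out[i, j] = np.sum(patch * w)
    return out

# usage: a 5x5 image and a 3x3 filter give a 3x3 response map
response = conv_as_template_matching(np.random.rand(5, 5), np.random.rand(3, 3))
print(response.shape)  # (3, 3)
```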
So what kind of features (filters) do we hope the network learns? From the classification point of view, we want a feature that can discriminate and generalize well with respect to two things: (1) intra-class variation, i.e., the variation within a class; and (2) inter-class variation, i.e., the semantic difference between classes. (Generalizing from classification to machine learning tasks in general, intra-class variation can be viewed as the energy between w_i and x_i, and inter-class variation as the similarity between w_i and x_i.)
Clearly, the traditional CNN's dot-product-based similarity yields only a single one-dimensional real value, so how could one scalar explicitly represent both kinds of variation? This lack of representational capacity can be understood as the dot product "coupling" two variations that ought to be kept separate, which limits the discrimination and generalization power of the conventional convolution operation. By "coupling" we mean that the traditional CNN makes the strong assumption that the two variations can be represented by multiplying norms and a cosine into one single value (i.e., the dot-product operation).
Naturally, to improve the discrimination and generalization power of convolutional operations, we need to "de-couple" the coupled dot product.
So how do we decouple it? And once decoupled, how do the two parts separately reflect intra- and inter-class variations? This is where the geometric interpretation comes in:
1) the norm (i.e., the magnitude) of a feature reflects intra-class variation;
2) the angle of a feature reflects inter-class variation. (The "features" here are the values obtained after the convolution operation. The original dot product yields one real number, whereas now the norm and the angle are used as two separate quantities to reflect the similarity between w_i and x_i.)
As shown in the figure above, for a 10-class digit (0-9) classification problem, different classes are separated by angle: the larger the inter-class difference, the larger the angle between them. Within a class, e.g., the digit "8", a feature with a larger norm along that direction indicates higher intra-class similarity, i.e., x_i looks more like the template w_i (a canonical "8"), so the similarity is larger.
The so-called "coupled" and "de-coupled" convolution operations are then simply written as follows:
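If memory of the paper serves (the notation here is mine), with θ_(w,x) the angle between the filter w and the patch x, the coupled operation is the standard dot product, while the decoupled operation replaces it with a magnitude function h(·,·) and an angular function g(·):

```latex
% coupled (standard) convolution: norms and angle mixed into one scalar
f(\mathbf{w}, \mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle
  = \|\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta_{(\mathbf{w},\mathbf{x})}

% decoupled convolution: magnitude and angle handled by separate functions
f_d(\mathbf{w}, \mathbf{x}) = h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr)\, g\bigl(\theta_{(\mathbf{w},\mathbf{x})}\bigr)
```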
For the design of these two functions h and g, i.e., the operators, the paper proposes two classes of schemes:
1) bounded operators: the benefits are faster convergence and better robustness against adversarial attacks;
2) unbounded operators: the benefit is better representational power.
Orthogonally, there are two designs with respect to whether ‖w‖ enters h(·):
1) Without ‖w‖: removing ‖w‖ from h() means all kernels (operators) are assigned equal importance, which encourages the network to make decisions based on as many kernels as possible and therefore may help the network generalize better.
2) With ‖w‖ (weighted decoupled operators): incorporating the kernel importance into network learning can improve the representational power and may be useful when dealing with a large-scale dataset with numerous categories. (A sketch of the difference follows below.)
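Roughly (my notation, with h' standing for the ‖x‖-dependent part of the operator), the weighted variants simply multiply the unweighted magnitude function by ‖w‖:

```latex
% without ||w||: every kernel (operator) gets equal importance
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = h'\bigl(\|\mathbf{x}\|\bigr)

% weighted decoupled operator: kernel importance enters through ||w||
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \|\mathbf{w}\|\cdot h'\bigl(\|\mathbf{x}\|\bigr)
```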
The bounded operators are (formula sketches follow after this list):
a). Hyperspherical Convolution (SphereConv): projects w and x onto a hypersphere and then performs the dot product;
b). Hyperball Convolution (BallConv): more robust and flexible than SphereConv, in the sense that SphereConv may amplify an x with a very small ‖x‖;
c). Hyperbolic Tangent Convolution (TanhConv): a smooth version of BallConv, gaining additional convergence benefits from its smoothness.
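As best I recall from the paper (α > 0 and ρ > 0 are hyperparameters; treat the exact forms as approximate rather than authoritative), the magnitude functions of the three bounded operators look like:

```latex
% SphereConv: constant magnitude, so the output depends only on the angle
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha

% BallConv: grows linearly with ||x|| and saturates at radius \rho
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha\,\frac{\min\bigl(\|\mathbf{x}\|, \rho\bigr)}{\rho}

% TanhConv: smooth version of BallConv
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha\tanh\!\Bigl(\frac{\|\mathbf{x}\|}{\rho}\Bigr)
```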
The unbounded operators are (formula sketches follow after this list):
a). Linear Convolution (LinearConv);
b). Segmented Convolution (SegConv): a flexible multi-range linear function of ‖x‖; both LinearConv and BallConv are special cases of SegConv;
c). Logarithm Convolution (LogConv): a smooth, logarithmic function of ‖x‖;
d). Mixed Convolution (MixConv): a combination of any of the forms above, offering better flexibility.
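Again from memory (same caveat about the exact constants), plausible forms for the unbounded magnitude functions are:

```latex
% LinearConv: magnitude grows linearly with ||x||
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha\|\mathbf{x}\|

% SegConv: piecewise-linear in ||x||; LinearConv and BallConv are special cases
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) =
  \begin{cases}
    \alpha\|\mathbf{x}\|, & 0 \le \|\mathbf{x}\| \le \rho \\
    \beta\|\mathbf{x}\| + (\alpha - \beta)\rho, & \|\mathbf{x}\| > \rho
  \end{cases}

% LogConv: smooth, logarithmic growth in ||x||
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha\log\bigl(1 + \|\mathbf{x}\|\bigr)

% MixConv: any combination of the above, e.g. linear plus logarithmic
h\bigl(\|\mathbf{w}\|, \|\mathbf{x}\|\bigr) = \alpha\|\mathbf{x}\| + \beta\log\bigl(1 + \|\mathbf{x}\|\bigr)
```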
For the angular part g(θ), several angular activations are considered (a code sketch combining h and g follows after this list):
a). Linear Angular Activation;
b). Cosine Angular Activation: the standard dot product uses this;
c). Sigmoid Angular Activation;
d). Square Cosine Angular Activation.
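Putting the pieces together, here is a minimal PyTorch sketch (my own toy implementation, not the paper's code; class and parameter names are made up) of a decoupled convolution that pairs the TanhConv magnitude function with the square cosine angular activation g(θ) = sign(cos θ)·cos²θ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledConv2d(nn.Module):
    """Toy decoupled convolution: f_d(w, x) = h(||w||, ||x||) * g(theta)."""

    def __init__(self, in_ch, out_ch, k, alpha=1.0, rho=1.0, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
        self.k, self.alpha, self.rho, self.eps = k, alpha, rho, eps

    def forward(self, x):
        # Plain convolution gives <w, x_i> = ||w|| ||x_i|| cos(theta) per patch.
        dot = F.conv2d(x, self.weight)
        # Per-patch norm ||x_i||: convolve x^2 with an all-ones kernel, then sqrt.
        ones = torch.ones(1, x.size(1), self.k, self.k, device=x.device)
        x_norm = torch.sqrt(F.conv2d(x * x, ones).clamp_min(self.eps))
        # Per-filter norm ||w||, broadcast to shape (1, out_ch, 1, 1).
        w_norm = self.weight.flatten(1).norm(dim=1).view(1, -1, 1, 1)
        # Recover cos(theta) from the coupled dot product.
        cos_theta = (dot / (w_norm * x_norm)).clamp(-1.0, 1.0)
        # h: TanhConv magnitude (bounded, smooth version of BallConv).
        h = self.alpha * torch.tanh(x_norm / self.rho)
        # g: square cosine angular activation, sign(cos) * cos^2.
        g = torch.sign(cos_theta) * cos_theta ** 2
        return h * g

# usage example on a random input
layer = DecoupledConv2d(in_ch=3, out_ch=8, k=3)
out = layer(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 8, 30, 30])
```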