layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1    # learning-rate multiplier for the weights
  }
  param {
    lr_mult: 2    # learning-rate multiplier for the bias
  }
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "gaussian"   # fixed-variance Gaussian initialization
      std: 0.0001
    }
    bias_filler {
      type: "constant"   # constant initialization (defaults to 0)
    }
  }
}
When the activation function is sigmoid, standard initialization often performs poorly: convergence is slow and the network easily gets stuck in poor local optima.
"xavier" initialization is an effective way to initialize a neural network. It comes from the 2010 paper by Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", and works particularly well with tanh-like activation functions.
Let n be the input dimension (fan-in) and m the output dimension (fan-out) of the layer the parameters belong to; the parameters are then drawn uniformly from
$W \sim U\!\left[-\sqrt{\tfrac{6}{n+m}},\ \sqrt{\tfrac{6}{n+m}}\right]$
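Note that the Caffe code below actually samples from $U\!\left[-\sqrt{3/n},\ \sqrt{3/n}\right]$, where n may be the fan-in, the fan-out, or their average. When the average is used, this coincides exactly with the range above, since
$\sqrt{\tfrac{3}{(n+m)/2}} = \sqrt{\tfrac{6}{n+m}}$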
Implementation in Caffe:
/**
* @brief Fills a Blob with values @f$ x \sim U(-a, +a) @f$ where @f$ a @f$ is
* set inversely proportional to number of incoming nodes, outgoing
* nodes, or their average.
*
* A Filler based on the paper [Bengio and Glorot 2010]: Understanding
 * the difficulty of training deep feedforward neural networks.
*
* It fills the incoming matrix by randomly sampling uniform data from [-scale,
* scale] where scale = sqrt(3 / n) where n is the fan_in, fan_out, or their
* average, depending on the variance_norm option. You should make sure the
* input blob has shape (num, a, b, c) where a * b * c = fan_in and num * b * c
* = fan_out. Note that this is currently not the case for inner product layers.
*
* TODO(dox): make notation in above comment consistent with rest & use LaTeX.
*/
template <typename Dtype>
class XavierFiller : public Filler<Dtype> {
 public:
  explicit XavierFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    CHECK(blob->count());
    int fan_in = blob->count() / blob->num();
    int fan_out = blob->count() / blob->channels();
    Dtype n = fan_in;  // default to fan_in
    if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_AVERAGE) {
      n = (fan_in + fan_out) / Dtype(2);
    } else if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_FAN_OUT) {
      n = fan_out;
    }
    Dtype scale = sqrt(Dtype(3) / n);
    caffe_rng_uniform<Dtype>(blob->count(), -scale, scale,
        blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
        << "Sparsity not supported by this Filler.";
  }
};
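As a concrete example (the number of input channels of conv1 depends on the data layer, so it is left symbolic as C): the weight blob of the conv1 layer above has shape (num, a, b, c) = (32, C, 5, 5), so

fan_in  = a * b * c   = C * 5 * 5  = 25C
fan_out = num * b * c = 32 * 5 * 5 = 800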
Caffe provides three options, selected through the variance_norm field of FillerParameter:
(1) the default, FillerParameter_VarianceNorm_FAN_IN, which only considers the input: n = fan_in;
(2) FillerParameter_VarianceNorm_FAN_OUT, which only considers the output: n = fan_out;
(3) FillerParameter_VarianceNorm_AVERAGE, which uses their average: n = (fan_in + fan_out) / 2.
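For example, a minimal prototxt sketch that swaps the gaussian filler of the conv1 layer above for Xavier initialization (the commented-out variance_norm line is optional; it defaults to FAN_IN, matching the code above):

convolution_param {
  num_output: 32
  pad: 2
  kernel_size: 5
  stride: 1
  weight_filler {
    type: "xavier"            # uniform in [-sqrt(3/n), sqrt(3/n)]
    # variance_norm: AVERAGE  # uncomment to use n = (fan_in + fan_out) / 2
  }
  bias_filler {
    type: "constant"
  }
}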
MSRA initialization comes from the 2015 paper by MSRA researcher Kaiming He and colleagues, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".
Traditional fixed-variance Gaussian initialization makes the model very hard to converge once the network gets deep. One remedy is to initialize part of the network from a pre-trained model; Xavier initialization is also a good choice, but its derivation assumes the activation function is linear, which does not hold for ReLU. MSRA initialization explicitly takes ReLU and PReLU into account.
The MSRA weight distribution is a Gaussian with mean 0 and variance 2/n; that is, the initialization satisfies
$W \sim N\!\left(0,\ \tfrac{2}{n}\right)$, i.e. std $= \sqrt{2/n}$.
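The factor of 2 compensates for ReLU setting roughly half of its inputs to zero, which halves the variance of the activations; the (sufficient) forward-propagation condition derived in the paper is
$\tfrac{1}{2}\, n_l \operatorname{Var}(w_l) = 1 \;\Rightarrow\; \operatorname{Var}(w_l) = \tfrac{2}{n_l}$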
Implementation in Caffe:
/**
* @brief Fills a Blob with values @f$ x \sim N(0, \sigma^2) @f$ where
* @f$ \sigma^2 @f$ is set inversely proportional to number of incoming
* nodes, outgoing nodes, or their average.
*
* A Filler based on the paper [He, Zhang, Ren and Sun 2015]: Specifically
* accounts for ReLU nonlinearities.
*
* Aside: for another perspective on the scaling factor, see the derivation of
* [Saxe, McClelland, and Ganguli 2013 (v3)].
*
* It fills the incoming matrix by randomly sampling Gaussian data with std =
* sqrt(2 / n) where n is the fan_in, fan_out, or their average, depending on
* the variance_norm option. You should make sure the input blob has shape (num,
* a, b, c) where a * b * c = fan_in and num * b * c = fan_out. Note that this
* is currently not the case for inner product layers.
*/
template <typename Dtype>
class MSRAFiller : public Filler<Dtype> {
 public:
  explicit MSRAFiller(const FillerParameter& param)
      : Filler<Dtype>(param) {}
  virtual void Fill(Blob<Dtype>* blob) {
    CHECK(blob->count());
    int fan_in = blob->count() / blob->num();
    int fan_out = blob->count() / blob->channels();
    Dtype n = fan_in;  // default to fan_in
    if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_AVERAGE) {
      n = (fan_in + fan_out) / Dtype(2);
    } else if (this->filler_param_.variance_norm() ==
        FillerParameter_VarianceNorm_FAN_OUT) {
      n = fan_out;
    }
    Dtype std = sqrt(Dtype(2) / n);
    caffe_rng_gaussian<Dtype>(blob->count(), Dtype(0), std,
        blob->mutable_cpu_data());
    CHECK_EQ(this->filler_param_.sparse(), -1)
        << "Sparsity not supported by this Filler.";
  }
};
The same three options apply:
(1) by default, n is the input dimension (fan_in);
(2) n is the output dimension (fan_out);
(3) n is the average of the input and output dimensions.
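A minimal prototxt sketch of the corresponding filler (same caveats as the xavier example above):

weight_filler {
  type: "msra"              # Gaussian with std = sqrt(2 / n)
  # variance_norm: FAN_OUT  # uncomment to use n = fan_out instead of fan_in
}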
Besides xavier and msra, Caffe also provides several other fillers:
(1) "constant": constant initialization;
(2) "gaussian": fixed-variance Gaussian initialization;
(3) "positive_unitball": every value lies in [0, 1] and, for each i, $\sum_j x_{ij} = 1$;
(4) "uniform": uniform-distribution initialization;
(5) "bilinear": bilinear-interpolation initialization, usually used for deconvolution kernels (see the sketch below).
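For the bilinear filler, the usual pattern (a sketch following the recipe in the comments of Caffe's filler.hpp; the layer names here are made up, and num_output/group must equal the number of channels being upsampled) is a Deconvolution layer with frozen weights, e.g. for 2x upsampling of the 32-channel conv1 output:

layer {
  name: "upsample"
  type: "Deconvolution"
  bottom: "conv1"
  top: "upsample"
  param { lr_mult: 0 decay_mult: 0 }   # freeze the fixed bilinear kernel
  convolution_param {
    num_output: 32     # = number of input channels
    group: 32          # one independent bilinear kernel per channel
    kernel_size: 4     # 2 * factor - factor % 2, with factor = 2
    stride: 2          # the upsampling factor
    pad: 1             # ceil((factor - 1) / 2)
    weight_filler { type: "bilinear" }
    bias_term: false
  }
}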