Paper notes: Multiple Adaptive Bayesian Linear Regression for Scalable Bayesian Optimization with Warm Start...

The paper proposes a computational architecture that combines a feed-forward neural network with a Bayesian estimation algorithm to perform multi-task learning (transfer learning).

Background on Bayesian estimation:
https://www.cnblogs.com/joezou/p/10658883.html

Related companion papers:

  • Scalable Bayesian Optimization Using Deep Neural Networks
  • Scalable Hyperparameter Transfer Learning

Although this is already contained in the objective function of the BO algorithm, the paper gives another, more intuitive representation:
(equation images in the original post)
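
For reference, these equations presumably take the standard form of Bayesian linear regression on the learned NN features φ_z(x). The reconstruction below is mine (consistent with the author's remark in the Q&A at the end that equations (1)-(3) only require inverting a D x D matrix), so the paper's exact notation may differ.

```latex
% Reconstructed sketch, not copied from the paper's images.
% \Phi_t = \varphi_z(X_t) \in \mathbb{R}^{N_t \times D}; \alpha_t: prior precision, \beta_t: noise precision.
\begin{aligned}
K_t &= \beta_t\,\Phi_t^\top \Phi_t + \alpha_t I_D \qquad (D \times D),\\
\mu_t(\mathbf{x}_\ast) &= \beta_t\,\varphi_z(\mathbf{x}_\ast)^\top K_t^{-1} \Phi_t^\top \mathbf{y}_t,\\
\sigma_t^2(\mathbf{x}_\ast) &= \varphi_z(\mathbf{x}_\ast)^\top K_t^{-1} \varphi_z(\mathbf{x}_\ast) + \beta_t^{-1}.
\end{aligned}
```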

At the same time, this rewrites the original objective function into a new form (both shown as equation images in the original post). The algorithmic complexity thereby drops further from N(D^3 + D^2) to ND^2 + D^3 (here N = T); I have not yet verified this count.
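
My reading of this rewrite, sketched from the standard Bayesian-linear-regression derivation rather than quoted from the paper: the per-task negative log marginal likelihood can be written through the N_t x N_t covariance of the observations, or equivalently (via the Woodbury identity and the matrix determinant lemma) through the Cholesky factor L_t L_t^T = K_t of the D x D matrix above, which is what keeps the per-task cost around N_t D^2 + D^3.

```latex
% Sketch of the equivalence (standard derivation; the paper's exact expression may differ).
\begin{aligned}
-\log p(\mathbf{y}_t \mid z, \alpha_t, \beta_t)
  &= \tfrac{1}{2}\,\mathbf{y}_t^\top C_t^{-1}\mathbf{y}_t + \tfrac{1}{2}\log\det C_t + \mathrm{const},
  \qquad C_t = \beta_t^{-1} I_{N_t} + \alpha_t^{-1}\Phi_t\Phi_t^\top \\
  &= \tfrac{\beta_t}{2}\,\lVert\mathbf{y}_t\rVert^2
   - \tfrac{\beta_t^2}{2}\,\bigl\lVert L_t^{-1}\Phi_t^\top\mathbf{y}_t\bigr\rVert^2
   + \textstyle\sum_i \log\,[L_t]_{ii}
   - \tfrac{D}{2}\log\alpha_t - \tfrac{N_t}{2}\log\beta_t + \mathrm{const}.
\end{aligned}
```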

The aim of the paper is to achieve multi-task learning while keeping the complexity cubic in the basis (feature) dimension and linear in the number of observations. The contribution has two parts: combining a feed-forward neural network (NN) with BO to handle multiple tasks, and coupling the BO estimate back into the NN so that it in turn strengthens the NN's learning of the latent representation vectors.
The schematic is as follows:
(Figure 1: architecture schematic, taken from the paper's authors)

As shown above, the sub-task datasets D_i are fed into the feed-forward network NN simultaneously. This rests on the premise that the basis vectors (or latent representation vectors) of the target functions are somewhat similar across tasks, so the NN uses shared weights. The multiple outputs are passed to their respective BO warm-start estimators, which compute the GP parameters, the error, and the partial derivatives; the gradients with respect to the NN weights are then summed (or weighted) and fed back into the NN, completing one training step. As a result, the NN's latent representation coefficients are learned, along with the GP parameters of each sub-task.
The benefit is that, for a new sub-task, the NN has already been pre-trained and the optimization is therefore accelerated; this is what is meant by "multi-task transfer learning".
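
To make this training step concrete, here is a minimal PyTorch sketch of my own (not the authors' implementation): a shared feed-forward feature network, one Bayesian linear regression head per task with its own precisions (alpha_t, beta_t), the summed negative log marginal likelihood as the loss, and L-BFGS as the optimizer (the author's answer below notes that L-BFGS is used instead of SGD). All layer sizes, task counts, and the toy data are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code) of one joint ABLR training step:
# a shared feature network phi, per-task precisions, and a summed marginal-likelihood loss.
import torch
import torch.nn as nn

D = 50                                              # dimension of the learned basis
phi = nn.Sequential(nn.Linear(3, 50), nn.Tanh(),    # shared feed-forward feature map
                    nn.Linear(50, D), nn.Tanh())

T = 4                                               # number of sub-tasks
log_alpha = nn.Parameter(torch.zeros(T))            # per-task prior precision (log scale)
log_beta = nn.Parameter(torch.zeros(T))             # per-task noise precision (log scale)

def neg_log_marginal_likelihood(Phi, y, log_a, log_b):
    """-log p(y | features), up to an additive constant; only a D x D matrix is factorized."""
    N, Df = Phi.shape
    alpha, beta = log_a.exp(), log_b.exp()
    K = beta * Phi.T @ Phi + alpha * torch.eye(Df)   # D x D
    L = torch.linalg.cholesky(K)
    c = torch.cholesky_solve(Phi.T @ y, L)           # K^{-1} Phi^T y
    data_fit = 0.5 * (beta * (y * y).sum() - beta ** 2 * (y.T @ Phi @ c).squeeze())
    log_det = L.diagonal().log().sum()               # = 0.5 * log det K
    return data_fit + log_det - 0.5 * Df * log_a - 0.5 * N * log_b

# Toy per-task datasets (X_t, y_t); in practice these are the evaluations collected for task t.
tasks = [(torch.randn(20, 3), torch.randn(20, 1)) for _ in range(T)]
opt = torch.optim.LBFGS(list(phi.parameters()) + [log_alpha, log_beta], max_iter=20)

def closure():
    # Sum the per-task losses so that the gradients of all heads flow back into the shared NN.
    opt.zero_grad()
    loss = sum(neg_log_marginal_likelihood(phi(X), y, log_alpha[t], log_beta[t])
               for t, (X, y) in enumerate(tasks))
    loss.backward()
    return loss

opt.step(closure)
```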

The paper also compares the learned NN basis with a random Fourier (random kitchen sinks) basis: the random basis is a robust alternative, but it cannot learn a representation shared across tasks, so it is less suited to the multi-task transfer setting.
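For reference, a random Fourier ("random kitchen sinks") feature map in its standard form looks like the minimal NumPy sketch below; this is my illustration of the general technique assuming an RBF-type kernel with bandwidth `lengthscale`, not necessarily the paper's exact variant. Because W and b are sampled once and never trained, there is no shared representation to learn across tasks.

```python
# Minimal sketch of a random Fourier ("random kitchen sinks") feature map for an RBF-type
# kernel. The projection W and phases b are sampled once and then frozen, so, unlike the
# NN basis, nothing here can be adapted to or shared across tasks.
import numpy as np

def random_fourier_features(X, D=256, lengthscale=1.0, seed=0):
    """Map inputs X of shape (N, d) to D random cosine features."""
    rng = np.random.default_rng(seed)
    _, d = X.shape
    W = rng.normal(scale=1.0 / lengthscale, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```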

Experimental setup:
Test environments:
(1) Fitting quadratic functions, with the target-function parameters randomly generated; a train-test ratio of 29:1 is used.
(2) Fitting on LIBSVM:
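
As an illustration of setting (1), a family of related quadratic objectives could look like the sketch below. The parameterization and coefficient ranges are my own assumptions (the paper's exact setup may differ), and I read the 29:1 ratio as 29 warm-start tasks plus 1 held-out task.

```python
# Hypothetical generator of related quadratic objectives with randomly drawn parameters;
# only meant to illustrate the experimental setting, not to reproduce the paper's exact family.
import numpy as np

def sample_quadratic_task(d=3, rng=None):
    rng = rng or np.random.default_rng()
    a = rng.uniform(0.1, 10.0)      # curvature
    b = rng.uniform(-5.0, 5.0)      # linear term
    c = rng.uniform(-5.0, 5.0)      # offset
    return lambda x: 0.5 * a * np.sum(x ** 2) + b * np.sum(x) + c

# 30 random tasks: 29 for warm-starting the shared representation, 1 held out for testing.
tasks = [sample_quadratic_task() for _ in range(30)]
train_tasks, test_task = tasks[:29], tasks[29]
```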

Q&A with the author:

  1. Would you mind sharing your idea about why your algorithm works more robustly compared with the one in [21], as mentioned in the sixth line on page 3?
  • What we meant there is that L-BFGS does not come with hyperparameters such as the SGD stepsize (note that [21] uses SGD). This is an advantage, as you would otherwise have to set the stepsize for each specific BO problem, and if you just fix the stepsize to be the same for all BO problems, then your algorithm may not perform as robustly.
  2. I wonder if I missed the proof that ABLR scales linearly, or whether you consider it prerequisite knowledge and did not mention it. Would you mind pointing out where I can find it?
  • The idea is that instead of inverting an N x N matrix when computing the predictive mean and variance, you invert a D x D matrix, so the scaling is D^3 instead of N^3. This can be observed directly from looking at equations (1), (2), (3), where the most expensive operation is the matrix inversion.
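
To illustrate the answer above with runnable code: the predictive mean and variance only need K = beta * Phi^T Phi + alpha * I (a D x D matrix) to be built, at cost O(N D^2), and factorized, at cost O(D^3). A minimal NumPy sketch using the standard Bayesian-linear-regression formulas (variable names are mine):

```python
# Minimal sketch of the D x D computation described above: no N x N matrix is ever inverted.
import numpy as np

def ablr_predict(Phi, y, phi_star, alpha=1.0, beta=1.0):
    """Phi: (N, D) features, y: (N,) targets, phi_star: (D,) features of the query point."""
    _, D = Phi.shape
    K = beta * Phi.T @ Phi + alpha * np.eye(D)       # O(N D^2) to build
    L = np.linalg.cholesky(K)                        # O(D^3) to factorize
    tmp = np.linalg.solve(L, Phi.T @ y)              # L^{-1} Phi^T y
    mean = beta * phi_star @ np.linalg.solve(L.T, tmp)
    v = np.linalg.solve(L, phi_star)
    var = v @ v + 1.0 / beta
    return mean, var
```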

Thanks to Valerio Perrone for answering the questions posed on this page.

Reposted from: https://www.cnblogs.com/joezou/p/10658905.html
