Contents
- Preface
- 《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》ICLR 19
- 《Rigging the Lottery: Making All Tickets Winners》ICML 20
- 《Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training》ICML 21
- 《EFFECTIVE MODEL SPARSIFICATION BY SCHEDULED GROW-AND-PRUNE METHODS》ICLR 2022
- 《Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training》 NAACL 2022
- 《How fine can fine-tuning be? Learning efficient language models》AISTATS 2020
- 《Movement Pruning: Adaptive Sparsity by Fine-Tuning》 NIPS20
- 《SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY》 ICLR19
- 《PROSPECT PRUNING: FINDING TRAINABLE WEIGHTS AT INITIALIZATION USING META-GRADIENTS》 ICLR 22
- 《DiSparse: Disentangled Sparsification for Multitask Model Compression》CVPR 22
- 《Cross-stitch Networks for Multi-task Learning》 CVPR 16
Preface
The ICLR 2019 best paper 《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》 proposed the lottery ticket hypothesis: “dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations.”
I also recorded (rather sloppily) work in this area in an earlier post, [Literature Reading] Sparsity in Deep Learning: Pruning and growth for efficient inference and training in NN.
This post goes on to briefly summarize more recent work in this area. In addition, by “when to sparsify”, this line of work can be divided into Sparsify after training, Sparsify during training, and Sparse training. I care more about the latter two (because they are end-to-end), so this post will (probably) focus more on work in those two subcategories.
《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》ICLR 19
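Steps:
- Initialize a fully connected neural network θ and choose the pruning rate p
- Train for a certain number of steps to obtain θ1
- Prune the fraction p of weights in θ1 with the smallest magnitude, and reset the remaining weights to their original initialization values
- Continue training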
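The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the authors' code: `prune_smallest` and `toy_train` are hypothetical helpers, and the "training" is a stand-in (gradient descent on a toy quadratic loss with made-up `targets`) so that the example runs on its own.

```python
def prune_smallest(weights, mask, p):
    """Return a new mask with the fraction p of active weights of smallest |w| removed."""
    active = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(p * len(active))
    smallest = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in smallest:
        new_mask[i] = 0
    return new_mask


def toy_train(theta, mask, targets, lr=0.1, steps=50):
    """Hypothetical stand-in for real training: gradient descent on
    0.5 * sum((w - target)^2), updating only the unmasked weights."""
    theta = list(theta)
    for _ in range(steps):
        theta = [w - lr * (w - t) * m for w, t, m in zip(theta, targets, mask)]
    return theta


# Step 1: initial weights theta0 (fixed toy values) and pruning rate p.
theta0 = [0.4, -0.7, 0.1, 0.9, -0.2, 0.6, -0.5, 0.3, -0.8, 0.05]
targets = [0.0, -1.0, 0.02, 1.5, 0.0, 0.01, -2.0, 0.7, -0.03, 1.0]
mask = [1] * len(theta0)
p = 0.2

# Step 2: train to obtain theta1.
theta1 = toy_train(theta0, mask, targets)

# Step 3: prune the smallest-magnitude weights of theta1,
# then reset the survivors to their values in theta0.
mask = prune_smallest(theta1, mask, p)
ticket = [w0 * m for w0, m in zip(theta0, mask)]

# Step 4: continue training the sparse network from the reset weights.
theta2 = toy_train(ticket, mask, targets)
```

The mask is chosen from the trained weights θ1, while the values that training restarts from come from the initialization. Steps 2 to 4 can be repeated to prune iteratively.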
Code:
- tf: https://github.com/google-research/lottery-ticket-hypothesis
- pt: https://github.com/rahulvigneswaran/Lottery-Ticket-Hypothesis-in-Pytorch
《Rigging the Lottery: Making All Tickets Winners》ICML 20
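Steps:
- Initialize the neural network and prune it up front. Options for the initial pruning:
  - uniform: every layer has the same sparsity;
  - other schemes: the more parameters a layer has, the sparser it is, so that the number of remaining parameters is roughly balanced across layers;
- During training, update the distribution of the sparse parameters every ΔT steps. Two update operations are used, drop and grow:
  - drop: prune a certain fraction of the weights with the smallest magnitude
  - grow: from the pruned weights, restore the same fraction of weights with the largest gradient magnitude
  - the drop/grow fraction follows a decay schedule, where α is the initial update fraction, usually set to 0.3.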
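One drop/grow update can be sketched as below. This is a toy sketch under my own assumptions rather than the reference implementation: the cosine decay in `cosine_update_fraction` is how I recall the schedule from the RigL paper, newly grown weights start at zero, and grow candidates are taken only from weights that were inactive before the drop. All function names are hypothetical.

```python
import math


def cosine_update_fraction(t, alpha, t_end):
    """Assumed schedule: cosine decay of the update fraction from alpha (t=0) to 0 (t=t_end)."""
    return alpha / 2.0 * (1.0 + math.cos(math.pi * t / t_end))


def drop_and_grow(weights, grads, mask, fraction):
    """Drop the smallest-magnitude active weights, grow the same number of
    inactive weights with the largest gradient magnitude."""
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    k = min(int(fraction * len(active)), len(inactive))
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:k]
    grown = sorted(inactive, key=lambda i: -abs(grads[i]))[:k]
    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_weights[i], new_mask[i] = 0.0, 0
    for i in grown:
        new_weights[i], new_mask[i] = 0.0, 1  # assumption: grown weights start at zero
    return new_weights, new_mask


weights = [0.5, -0.01, 0.3, 0.0, 0.0, 0.0, 0.02, -0.8]
mask = [1, 1, 1, 0, 0, 0, 1, 1]
# Dense gradients: they are also needed for the currently inactive weights.
grads = [0.1, 0.2, 0.0, 0.9, -0.05, -1.2, 0.3, 0.0]

fraction = cosine_update_fraction(t=0, alpha=0.3, t_end=100)
new_weights, new_mask = drop_and_grow(weights, grads, mask, fraction)
```

The number of active weights is unchanged by the update, so the overall sparsity stays fixed while the connectivity moves.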
Characteristics:
Code:
- tf: https://github.com/google-research/rigl
- pt: https://github.com/varun19299/rigl-reproducibility
《Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training》ICML 21
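Proposes the In-Time Over-Parameterization (ITOP) metric.
The paper argues that, provided the exploration is reliable, the higher this metric is (that is, the more parameters the model explores during training), the better the final performance.
The training procedure should therefore be similar to that of 《Rigging the Lottery: Making All Tickets Winners》, except that it uses Sparse Evolutionary Training (SET), from 《Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science》, whose grow step is random. The reason given for this choice: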
SET activates new weights in a random fashion which naturally considers all possible parameters to explore. It also helps to avoid the dense over-parameterization bias introduced by the gradient-based methods e.g., The Rigged Lottery (RigL) (Evci et al., 2020a) and Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), as the latter utilize dense gradients in the backward pass to explore new weights
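To make the idea concrete, here is a small sketch of SET-style random growth together with an exploration rate. It rests on my own assumptions: I take the ITOP rate to be the fraction of all parameters that have been active at least once during training, `set_update` and `itop_rate` are hypothetical names, and there is no real training between updates (grown weights just receive small random values).

```python
import random


def set_update(weights, mask, fraction, rng):
    """Drop the smallest-magnitude active weights and grow the same number
    of inactive weights chosen uniformly at random."""
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    k = min(int(fraction * len(active)), len(inactive))
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:k]
    grown = rng.sample(inactive, k)
    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_weights[i], new_mask[i] = 0.0, 0
    for i in grown:
        # Toy initialization for newly grown weights (an assumption).
        new_weights[i], new_mask[i] = rng.uniform(-0.1, 0.1), 1
    return new_weights, new_mask


def itop_rate(explored):
    """Assumed definition: share of parameters activated at least once so far."""
    return sum(explored) / len(explored)


rng = random.Random(0)
mask = [1] * 5 + [0] * 15
rng.shuffle(mask)
weights = [rng.uniform(-1.0, 1.0) * m for m in mask]

explored = list(mask)
rates = [itop_rate(explored)]
for step in range(10):
    weights, mask = set_update(weights, mask, 0.4, rng)
    explored = [max(e, m) for e, m in zip(explored, mask)]
    rates.append(itop_rate(explored))
```

The network stays at 25% density throughout, but the explored fraction can only grow over time. No gradients are needed for the inactive weights here, which is the contrast with gradient-based growth drawn in the quote above.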
Characteristics:
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《EFFECTIVE MODEL SPARSIFICATION BY SCHEDULED GROW-AND-PRUNE METHODS》ICLR 2022
Approach:
Characteristics:
- Enlarges the exploration space
- Includes experiments on WMT14 De-En translation
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training》 NAACL 2022
Approach:
Characteristics:
- The mask is trainable rather than magnitude-based
- Task-Agnostic Mask Training (it may be worth looking at task-specific variants next?)
Code: https://github.com/llyx97/TAMT
《How fine can fine-tuning be? Learning efficient language models》AISTATS 2020
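Method:
- L0-close fine-tuning: preliminary experiments show that for some layers and modules, the parameters after fine-tuning on a downstream task differ little from the original pretrained parameters, so these are excluded from the fine-tuning procedure proposed in the paper
- Sparsification as fine-tuning: train a 0/1 mask for each task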
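Training a 0/1 mask over frozen pretrained weights can be sketched as follows. This is a generic toy illustration, not the paper's parameterization: I assume real-valued scores binarized at 0 and a straight-through estimator for the gradient, and the data, weights and names (`binarize`, `predict`) are all made up.

```python
def binarize(scores):
    """Turn real-valued scores into a 0/1 mask (assumed threshold: 0)."""
    return [1 if s > 0 else 0 for s in scores]


def predict(weights, mask, x):
    """Linear model that only uses the weights kept by the mask."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


# Frozen "pretrained" weights; the toy task only needs weights 0 and 3.
pretrained = [1.0, -2.0, 0.5, 3.0]
data = [
    ([1, 0, 0, 0], 1.0),
    ([0, 1, 0, 0], 0.0),
    ([0, 0, 1, 0], 0.0),
    ([0, 0, 0, 1], 3.0),
]

scores = [0.1] * 4  # all weights start switched on
lr = 0.05
for epoch in range(20):
    for x, t in data:
        mask = binarize(scores)
        err = predict(pretrained, mask, x) - t
        # Loss is 0.5 * err^2. Straight-through: treat d(mask)/d(score) as 1,
        # so the gradient w.r.t. score i is err * w_i * x_i.
        scores = [s - lr * err * w * xi for s, w, xi in zip(scores, pretrained, x)]

task_mask = binarize(scores)
```

Only the scores are updated. The pretrained weights never change, so each task only needs to store its own binary mask.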
Characteristics:
- Trains a mask to achieve the effect of fine-tuning a pretrained model
Code: none
However, https://github.com/llyx97/TAMT also contains mask-training code, which may be usable as a reference?
《Movement Pruning: Adaptive Sparsity by Fine-Tuning》 NIPS20
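Method:
- Movement Pruning:
  - hard: keeps an importance score matrix S, obtains the mask via top_k, and updates S with the straight-through estimator
  - soft: instead of top_k, sets a threshold r and keeps the parameters with S > r. The sum of S is then added as a regularization term that pushes the entries of S down, which produces the mask. The weight of the regularization term controls the sparsity
Interpretation:
S_i_j increases in only two cases, and both of them mean that the model parameter is moving away from 0.
Magnitude pruning keeps the weights with large magnitude, while movement pruning keeps the weights that are moving away from 0. In the pretrain-finetune setting, the magnitudes of the fine-tuned model's parameters do not change much, so movement works better there.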
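The contrast between the two criteria can be illustrated with a toy example. It is a simplification under my own assumptions: the scores are accumulated as S -= lr * (dL/dW) * W during dense fine-tuning, with no mask in the forward pass, on a made-up quadratic loss. As I understand the paper, a score grows exactly when dL/dW < 0 with W > 0, or dL/dW > 0 with W < 0.

```python
def top_k_mask(scores, k):
    """Keep the k entries with the highest score."""
    keep = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [1 if i in keep else 0 for i in range(len(scores))]


# Toy "pretrained" weights and the values fine-tuning pulls them toward
# (loss = 0.5 * sum((w - target)^2), chosen only for illustration).
pretrained = [2.0, 0.1, -1.0, -0.2]
target = [1.5, 0.6, -0.4, -0.9]

w = list(pretrained)
S = [0.0] * len(w)
lr = 0.1
for _ in range(20):
    grads = [wi - ti for wi, ti in zip(w, target)]
    # Movement score: rises when the update pushes the weight away from 0.
    S = [s - lr * g * wi for s, g, wi in zip(S, grads, w)]
    w = [wi - lr * g for wi, g in zip(w, grads)]

magnitude_mask = top_k_mask([abs(wi) for wi in w], 2)
movement_mask = top_k_mask(S, 2)
```

Weight 0 is large but shrinking, so magnitude pruning keeps it while movement pruning drops it. Weight 1 is small but growing, so the opposite happens.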
Characteristics: movement pruning
Code: https://github.com/huggingface/transformers/tree/eca77f4719531ecaabe9ec6b2dee6075a391d98a/examples/research_projects/movement-pruning
《SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY》 ICLR19
Method: use a single mini-batch to compute, for every weight, its sensitivity on the current task, then complete the pruning before training starts.
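A minimal sketch of this kind of sensitivity score, assuming the saliency of weight j is |g_j * w_j| normalized over all weights, where g is the gradient of the loss on one mini-batch. The linear model, the squared-error loss, the data and the function names are my own illustration.

```python
def snip_saliency(weights, batch):
    """Normalized |gradient * weight| for a linear model with squared-error loss.
    Assumes at least one weight has a non-zero score."""
    grads = [0.0] * len(weights)
    for x, t in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - t
        for j, xi in enumerate(x):
            grads[j] += err * xi / len(batch)
    raw = [abs(g * w) for g, w in zip(grads, weights)]
    total = sum(raw)
    return [r / total for r in raw]


def snip_mask(weights, batch, keep):
    """Keep the `keep` weights with the highest saliency, before any training."""
    saliency = snip_saliency(weights, batch)
    top = sorted(range(len(saliency)), key=lambda j: -saliency[j])[:keep]
    return [1 if j in top else 0 for j in range(len(weights))]


init_weights = [0.5, -0.3, 0.8, 0.1]
mini_batch = [
    ([1.0, 0.0, 2.0, 0.0], 3.0),
    ([0.0, 1.0, 1.0, 0.0], 1.0),
]
saliency = snip_saliency(init_weights, mini_batch)
mask = snip_mask(init_weights, mini_batch, keep=2)
```

The last weight never receives a gradient on this batch, so its saliency is zero and it is pruned regardless of its initial value.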
Characteristics:
Code:
- tf: https://github.com/namhoonlee/snip-public
- pt: https://github.com/mil-ad/snip
《PROSPECT PRUNING: FINDING TRAINABLE WEIGHTS AT INITIALIZATION USING META-GRADIENTS》 ICLR 22
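Method:
Differences from SNIP:
Characteristics:
- Uses multiple batches and meta-gradients (gradients of gradients) to improve the trainability of the pruned network
- Pruning is completed before training
Code: https://github.com/mil-ad/prospr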
《DiSparse: Disentangled Sparsification for Multitask Model Compression》CVPR 22
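Method: first learn a mask for each task with methods such as SNIP or RigL. Because this is multitask learning in CV, the parameters already consist of shared and task-specific parts, so the masks can likewise be split into shared and task-specific parts. The task-specific part of each mask is applied directly to the corresponding task. For the shared part, once every task's mask over the shared parameters has been obtained, the final mask for the shared parameters is produced by OR or Majority Vote.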
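The merging step for the shared parameters can be sketched as below. The function name is hypothetical, and treating "majority" as a strict majority (ties are pruned) is my own assumption.

```python
def merge_shared_masks(task_masks, strategy="or"):
    """Combine per-task 0/1 masks over the shared parameters into one mask."""
    n_tasks = len(task_masks)
    merged = []
    for votes in zip(*task_masks):
        kept = sum(votes)
        if strategy == "or":
            merged.append(1 if kept > 0 else 0)
        elif strategy == "majority":
            # Assumption: strict majority, so a tie prunes the parameter.
            merged.append(1 if 2 * kept > n_tasks else 0)
        else:
            raise ValueError("unknown strategy: " + strategy)
    return merged


# Masks that three tasks learned independently over the same shared parameters.
shared_masks = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
or_mask = merge_shared_masks(shared_masks, "or")
majority_mask = merge_shared_masks(shared_masks, "majority")
```

OR keeps any shared parameter that at least one task wants, so it is the more conservative choice. Majority Vote yields a sparser shared mask.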
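Characteristics:
- Deals with multitask learning. However, I do not quite understand the motivation for merging all tasks' masks on the shared parameters. (Reply: the paper's motivation does not lie here; it mainly aims to address how to design a standard compression framework for multitask learning.)
Code: https://github.com/SHI-Labs/DiSparse-Multitask-Model-Compression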
《Cross-stitch Networks for Multi-task Learning》 CVPR 16
Method: train intermediate cross-stitch units to decide whether to share.
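As I understand it, a cross-stitch unit mixes the activations of two task networks with a small learnable matrix. The sketch below assumes a single 2x2 matrix `alpha` applied elementwise to two activation vectors. In the real model `alpha` would be trained together with the networks, whereas here it is fixed by hand.

```python
def cross_stitch(x_a, x_b, alpha):
    """Linearly mix two tasks' activations.
    alpha = [[a_AA, a_AB], [a_BA, a_BB]] would be learned in practice."""
    out_a = [alpha[0][0] * a + alpha[0][1] * b for a, b in zip(x_a, x_b)]
    out_b = [alpha[1][0] * a + alpha[1][1] * b for a, b in zip(x_a, x_b)]
    return out_a, out_b


x_a = [1.0, 2.0]  # activations from task A's network
x_b = [3.0, 4.0]  # activations from task B's network

# Identity matrix: the two tasks share nothing at this layer.
separate = cross_stitch(x_a, x_b, [[1.0, 0.0], [0.0, 1.0]])
# Equal weights: both tasks see the same averaged representation.
shared = cross_stitch(x_a, x_b, [[0.5, 0.5], [0.5, 0.5]])
```

Values of `alpha` between these two extremes give partial sharing, which is how the unit can learn how much to share.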
Characteristics: it can compute weighted combinations of representations, which is not very relevant to multilingual machine translation.