“Continual Learning is the constant development of increasingly complex behaviors; the process of building more complicated skills on top of those already developed.” - Ring (1997), CHILD: A First Step Towards Continual Learning
Continual Learning is also referred to as lifelong learning, sequential learning, or incremental learning. These terms share the same definition.
“Studies the problem of learning from an infinite stream of data, with the goal of gradually extending acquired knowledge and using it for future learning.” - Z. Chen, Lifelong Machine Learning
In other words, Continual Learning aims to let machines learn the way humans do: adapting continuously in a dynamic environment and learning tasks sequentially (from birth to death).
Continual Learning (CL) studies algorithms that train machine learning models on non-stationary data (as opposed to i.i.d. data) arriving as a sequence of tasks.
As an example [1], we define a sequence of tasks $D = \{D_1, \ldots, D_T\}$, where the $t$-th task $D_t = \{(\mathbf{x}_i^t, y_i^t)\}_{i=1}^{n_t}$ contains tuples of an input sample $\mathbf{x}_i^t \in \mathcal{X}$ and its label $y_i^t \in \mathcal{Y}$. The goal is to train a single model $f_\theta : \mathcal{X} \rightarrow \mathcal{Y}$, parameterized by $\theta$, that predicts the label $y = f_\theta(\mathbf{x}) \in \mathcal{Y}$, where $\mathbf{x}$ is an unseen test sample from an arbitrary task. Data from previous tasks may no longer be available when training on future tasks.
As is well known, AlphaGo dominates the game of Go, yet it is powerless when faced with chess. Similarly, YOLO (You Only Look Once) can detect a dog easily, but it only detects the object classes it was trained on. People therefore look forward to a model that overcomes these limitations, which calls for systems that adapt continually and keep on learning over time.
As for application scenarios, Continual Learning can be used in many areas. For example, a robot needs to acquire new skills in different environments to accomplish new tasks; a self-driving car needs to adapt to different environments (from a country road to a highway to a city); and conversational agents should adapt to different users, situations, and tasks.
Nowadays, almost all methods for Continual Learning are built on neural networks (CNNs, Transformers, and so on). Due to the limitations of neural networks, Continual Learning faces two major challenges: catastrophic forgetting, and the balance between learning and forgetting (stability vs. plasticity).
Catastrophic Forgetting. When the model is updated incrementally on new data, it suffers from catastrophic interference, or forgetting: after learning the new task, the model forgets how to solve the old one.
For example, consider a vision model that classifies images into two categories. First, we train it on a Cat vs. Dog dataset and reach near-perfect accuracy (say, 99.98%) on that dataset. Second, we continue training the same model on another dataset (e.g. a Car vs. Ship dataset) and also obtain good performance there. However, when we go back to the Cat vs. Dog dataset, we find that the model has forgotten the earlier data and can no longer separate the two classes accurately.
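The effect can be reproduced on a toy problem. The sketch below is not the vision experiment described above: it assumes a linear regression model trained with plain gradient descent, and two synthetic tasks (the names `X_a`, `X_b`, `train`, and `mse` are made up for illustration) whose target mappings deliberately conflict.

```python
import numpy as np

def train(w, X, y, lr=0.1, steps=500):
    """Plain gradient descent on mean squared error for a linear model."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X_a = rng.normal(size=(200, 5))
X_b = rng.normal(size=(200, 5))
w_a_true = rng.normal(size=5)
w_b_true = -w_a_true              # task B wants the opposite mapping (assumption)
y_a, y_b = X_a @ w_a_true, X_b @ w_b_true

w = train(np.zeros(5), X_a, y_a)  # learn task A
loss_a_before = mse(w, X_a, y_a)
w = train(w, X_b, y_b)            # then learn task B, with no access to task A
loss_a_after = mse(w, X_a, y_a)   # error on task A after learning task B
loss_b_after = mse(w, X_b, y_b)
print(loss_a_before, loss_a_after, loss_b_after)
```

After the second training phase the model fits task B, while its error on task A is far larger than it was right after training on A.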
Stability vs Plasticity.
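For people, the faster you learn, the faster you forget. The same is true for machines: a model plastic enough to absorb new knowledge quickly tends to overwrite old knowledge, while a model stable enough to retain old knowledge learns new tasks slowly. Balancing the two is also a challenge.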
Albeit a challenging problem, progress in Continual Learning has led to real-world applications starting to emerge.
Due to the general difficulty and variety of challenges in Continual Learning, many methods relax the general setting to an easier task incremental one.
Before looking at the assumptions of Continual Learning, we need some notation, consistent with the low-level definition given earlier:
X - input vector
Y - class label
T - task.
The concept ‘task’ refers to an isolated training phase with a new batch of data, belonging to a new group of classes, a new domain, or a different output space.
$(\mathcal{X}^t, \mathcal{Y}^t)$ - Dataset for task $t$.
$\{\mathcal{Y}^t\}$ - Class labels, e.g. Dog, Cat, Bird, …
$P(\mathcal{X}^t)$ - Input distribution. For different tasks, $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$.
$f_t(\mathcal{X}^t; \theta)$ - The predicted label for $\mathcal{Y}^t$; the model is parameterized by $\theta$.
The four assumptions of Continual Learning :
Task ID observed at training:
Task ID not observed at training:
Detailed description of the four settings:
Task incremental learning considers a sequence of tasks, receiving training data of just one task at a time and training on it until convergence. In this setting, models are always informed about which task needs to be performed (both at train and test time). However, data is no longer available for old tasks, impeding evaluation of statistical risk for the new parameter values.
Expressed with formulas:
Let $(\mathcal{X}^t, \mathcal{Y}^t)$ be the training data of task $t$, and let $\mathcal{T}$ be the current task.
The goal is to control the statistical risk of all seen tasks given limited or no access to data from previous tasks $t < \mathcal{T}$. In other words, the research focuses on minimizing the following objective over the parameters $\theta$:
$$\sum_{t=1}^{\mathcal{T}} \mathbb{E}_{(\mathcal{X}^t, \mathcal{Y}^t)}\left[\mathscr{L}\left(f_t(\mathcal{X}^t; \theta), \mathcal{Y}^t\right)\right]$$
For the current task $\mathcal{T}$, the statistical risk can be approximated by the empirical risk:
$$\frac{1}{N_\mathcal{T}} \sum_{i=1}^{N_\mathcal{T}} \mathscr{L}\left(f_\mathcal{T}(x_i^{\mathcal{T}}; \theta), y_i^{\mathcal{T}}\right)$$
where $N_{\mathcal{T}}$ is the number of samples in task $\mathcal{T}$.
In summary, the assumptions of this setting are: $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$, $\{\mathcal{Y}^t\} \neq \{\mathcal{Y}^{t+1}\}$ (different tasks have different labels), and $P(\mathcal{Y}^t) \neq P(\mathcal{Y}^{t+1})$, but the task is known at test time (each task has its own task label $t$).
‘An algorithm learns continuously from a sequential data stream in which new classes occur. At any time, the learner is able to perform multi-class classification for all classes observed so far.’ [2]
Models must be able not only to solve each task seen so far, but also to infer which task they are presented with (the task identity is not given). New class labels may be added to the model with each new task.
The assumptions of this setting are: $P(\mathcal{X}^t) \neq P(\mathcal{X}^{t+1})$, $\{\mathcal{Y}^t\} \subset \{\mathcal{Y}^{t+1}\}$ (class incremental), and $P(\mathcal{Y}^t) \neq P(\mathcal{Y}^{t+1})$; the task is not known at test time.
It defines a more general continual learning setting for any data stream without notion of task, class or domain.
Models only need to solve the task at hand; they are not required to infer which task it is. In other words, the task is no longer identified explicitly, but tasks still exist.
The assumptions of this setting are: $\{\mathcal{Y}^t\} = \{\mathcal{Y}^{t+1}\}$ and $P(\mathcal{Y}^t) = P(\mathcal{Y}^{t+1})$.
Task identity is not available even at training time! Task-Agnostic Learning has no task concept at all, and it is the ideal condition of Continual Learning.
For a clearer understanding of task-incremental, class-incremental, and domain-incremental learning, see the following images [3]:
Permuted MNIST task: flatten each MNIST image into a vector, then apply a fixed random permutation of indices to shuffle the positions of its elements. Each distinct random permutation generates a different task.
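A minimal sketch of how such tasks can be generated, assuming the images are already loaded as a NumPy array. The function name `make_permuted_tasks` and the stand-in array `fake_images` are illustrative; no real MNIST loader is used here.

```python
import numpy as np

def make_permuted_tasks(images, num_tasks, seed=0):
    """Return one permuted copy of `images` per task, plus the permutations used.

    images: array of shape (n, h, w) or (n, d); it is flattened to (n, d).
    """
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    tasks, perms = [], []
    for _ in range(num_tasks):
        perm = rng.permutation(flat.shape[1])  # one fixed index shuffle per task
        perms.append(perm)
        tasks.append(flat[:, perm])            # same shuffle for every image
    return tasks, perms

# Stand-in data with MNIST-like shape (4 images of 28 x 28 pixels).
fake_images = np.arange(4 * 28 * 28, dtype=np.float32).reshape(4, 28, 28)
tasks, perms = make_permuted_tasks(fake_images, num_tasks=3)
print(len(tasks), tasks[0].shape)
```

Every task contains exactly the same pixel values as the original data; only their positions differ, so the labels stay valid while the input distribution changes from task to task.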
Multi-Task Gradient Dynamics: Tug-of-War
However, the tasks are not available simultaneously in CL!
We need some form of memory, or a way to modify the gradients, so that training still takes into account which solutions are good for previous tasks.
Note: We need to maximize Transfer and minimize Interference.
Following Lange, M. D., et al. [4], I drew a mind map to better understand the current mainstream methods of Continual Learning.
The definition of each method [4]:
As you can see, replay is the key. To realize replay, this line of work either stores samples in raw format or, when privacy policy rules out keeping raw data, generates pseudo-samples with a generative model (e.g. a GAN or diffusion model). These previous-task samples are then replayed while learning a new task to alleviate forgetting. Depending on how the samples are used, replay methods can be divided into the following three categories:
Rehearsal (easy to implement, but poor performance)
This is the easiest approach to understand: combine a limited subset of stored samples from old tasks with the new task's data, and retrain the model.
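A rough sketch of this idea follows. The class name `RehearsalBuffer`, the `per_task` budget, and the uniform random choice of stored samples are assumptions made for illustration; actual rehearsal methods differ in how they choose what to store.

```python
import numpy as np

class RehearsalBuffer:
    """Keeps a small random subset of each finished task for later replay."""

    def __init__(self, per_task, seed=0):
        self.per_task = per_task              # hypothetical per-task memory budget
        self.rng = np.random.default_rng(seed)
        self.xs, self.ys = [], []

    def add_task(self, X, y):
        """Store at most `per_task` randomly chosen samples of a finished task."""
        k = min(self.per_task, len(X))
        idx = self.rng.choice(len(X), size=k, replace=False)
        self.xs.append(X[idx])
        self.ys.append(y[idx])

    def mix_with(self, X_new, y_new):
        """Return the new task's data combined with all stored old samples."""
        if not self.xs:
            return X_new, y_new
        X = np.concatenate(self.xs + [X_new])
        y = np.concatenate(self.ys + [y_new])
        return X, y

buf = RehearsalBuffer(per_task=10)
X1, y1 = np.zeros((50, 3)), np.zeros(50, dtype=int)
X2, y2 = np.ones((40, 3)), np.ones(40, dtype=int)
X_mix1, y_mix1 = buf.mix_with(X1, y1)  # first task: nothing to replay yet
buf.add_task(X1, y1)
X_mix2, y_mix2 = buf.mix_with(X2, y2)  # 10 stored old samples + 40 new ones
```

The model is then trained on the mixed set instead of the new task alone, so old samples keep contributing to the loss.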
Advantage:
Disadvantage:
Pseudo Rehearsal
Feed random input to previous models and use the output as a pseudo-sample. (Generative models are also used nowadays, but they add training complexity.) [5]
A novel generative replay (GR) method [6]: internal or hidden representations are replayed, generated by the network’s own context-modulated feedback connections.
Constrained Optimization
Minimize interference with old tasks by constraining updates on the new task. The goal is to optimize the loss on the current example(s) without increasing the losses on the previously learned examples.
Assume the examples are observed one at a time. Formulate the goal as the following constrained optimization problem.
$$\theta^{t} = \arg\min_\theta \ell\left(f\left(x_{t}; \theta\right), y_{t}\right) \quad \text{s.t.} \quad \ell\left(f\left(x_{i}; \theta\right), y_{i}\right) \leq \ell\left(f\left(x_{i}; \theta^{t-1}\right), y_{i}\right), \; \forall i \in [0 \ldots t-1]$$
Here $f(\cdot; \theta)$ is a model parameterized by $\theta$, $\ell$ is the loss function, $t$ is the index of the current example, and $i$ indexes the previous examples.
The original constraints can be rephrased to the constraints in the gradient space:
$$\left\langle g, g_{i}\right\rangle = \left\langle \frac{\partial \ell\left(f\left(x_{t}; \theta\right), y_{t}\right)}{\partial \theta}, \frac{\partial \ell\left(f\left(x_{i}; \theta\right), y_{i}\right)}{\partial \theta} \right\rangle \geq 0$$
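When the inner-product constraint is violated, the proposed gradient has to be corrected before the update. The sketch below handles only a single reference gradient `g_ref`, which has a closed-form projection; this is the simplification used by the A-GEM variant, whereas GEM itself enforces one constraint per previous task and solves a small quadratic program. The function name `project_gradient` is illustrative.

```python
import numpy as np

def project_gradient(g, g_ref):
    """Return g unchanged if <g, g_ref> >= 0; otherwise remove the conflicting part.

    g:     gradient of the loss on the current example(s)
    g_ref: gradient of the loss on stored previous examples
    """
    dot = float(g @ g_ref)
    if dot >= 0.0:
        return g  # no interference: the constraint already holds
    # Project g onto the half-space {v : <v, g_ref> >= 0}.
    return g - (dot / float(g_ref @ g_ref)) * g_ref

g = np.array([1.0, -2.0])
g_ref = np.array([0.0, 1.0])
g_proj = project_gradient(g, g_ref)  # conflicting component along g_ref removed
print(g_proj, g_proj @ g_ref)
```

After projection the inner product with `g_ref` is zero, so to first order the update no longer increases the loss on the stored examples.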
These methods avoid storing raw inputs, which prioritizes privacy and alleviates memory requirements.
In these methods, an extra regularization term is introduced in the loss function to consolidate previous knowledge when learning on new data. We can further divide these methods into data-focused and prior-focused methods. [4]
Data-Focused Methods
The basic building block in data-focused methods is knowledge distillation from a previous model (trained on a previous task) to the model being trained on the new data.
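As a rough illustration of the distillation term, the sketch below compares the outputs of the previous model and the new model on the same new-task inputs. The temperature value and the cross-entropy form are assumptions chosen for illustration, not details given in this post.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax with temperature, computed in a numerically stable way."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between softened old outputs (targets) and new outputs."""
    p_old = softmax(old_logits, temperature)
    log_p_new = np.log(softmax(new_logits, temperature) + 1e-12)
    return float(-(p_old * log_p_new).sum(axis=1).mean())

old_logits = np.array([[2.0, 0.0, -1.0]])      # previous model's outputs
drifted_logits = np.array([[-1.0, 0.0, 2.0]])  # new model after drifting away
print(distillation_loss(old_logits, old_logits))
print(distillation_loss(drifted_logits, old_logits))
```

The term is smallest when the new model reproduces the old model's outputs, so adding it to the new-task loss penalizes drifting away from previously learned behavior without needing any old data.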
Prior-Focused Methods
To mitigate forgetting, prior-focused methods estimate a distribution over the model parameters, used as prior when learning from new data. Typically, importance of all neural network parameters is estimated, with parameters assumed independent to ensure feasibility. During training of later tasks, changes to important parameters are penalized.
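With parameters assumed independent, the penalty reduces to a per-parameter quadratic term weighted by importance. The sketch below shows only that term; how `importance` is estimated is method-specific (EWC, for example, uses a diagonal Fisher information estimate) and is not covered here. The function names and the `strength` coefficient are illustrative.

```python
import numpy as np

def importance_penalty(theta, theta_old, importance, strength=1.0):
    """0.5 * strength * sum_k importance_k * (theta_k - theta_old_k)^2"""
    return 0.5 * strength * float(np.sum(importance * (theta - theta_old) ** 2))

def penalty_gradient(theta, theta_old, importance, strength=1.0):
    """Gradient of the penalty; it pulls important parameters back harder."""
    return strength * importance * (theta - theta_old)

theta_old = np.array([1.0, 1.0])    # parameters after the previous task
importance = np.array([10.0, 0.1])  # first parameter matters much more
move_important = importance_penalty(np.array([2.0, 1.0]), theta_old, importance)
move_unimportant = importance_penalty(np.array([1.0, 2.0]), theta_old, importance)
print(move_important, move_unimportant)
```

Moving the important parameter by one unit costs a hundred times more than moving the unimportant one by the same amount, which is how later tasks are steered toward the parameters that earlier tasks do not rely on.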
This family dedicates different model parameters to each task to prevent any possible forgetting.
Best-suited for: task-incremental setting, unconstrained model capacity, performance is the priority.
Fixed Network Methods
Network parts used for previous tasks are masked out when learning new tasks, e.g. at the neuron level (HAT) or at the parameter level (PackNet, PathNet).
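A sketch of masking at the parameter level follows. It shows only the masked update step; how the mask is obtained (for example by pruning, as in PackNet) is not reproduced here, and the name `masked_update` is made up for illustration.

```python
import numpy as np

def masked_update(weights, grad, frozen_mask, lr=0.1):
    """SGD step that leaves parameters claimed by earlier tasks untouched.

    frozen_mask: boolean array, True where a parameter belongs to a previous task.
    """
    free = ~frozen_mask                 # parameters still available to the new task
    return weights - lr * grad * free   # gradient is zeroed on frozen entries

weights = np.array([1.0, 2.0, 3.0])
grad = np.array([1.0, 1.0, 1.0])
frozen = np.array([True, False, True])
updated = masked_update(weights, grad, frozen)
print(updated)
```

Because frozen parameters never change, the outputs that previous tasks compute from them are preserved exactly, at the price of a shrinking pool of free parameters.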
Dynamic Architecture Methods
When model size is not constrained: grow new branches for new tasks, while freezing previous task parameters (RCL), or dedicate a model copy to each task (Expert Gate), etc.
TODO! Summaries will be added when I am familiar enough with this field.
Only a brief introduction is given here; read the original paper for more information.
iCaRL belongs to the rehearsal family and targets class-incremental learning.
iCaRL allows learning in a class-incremental way: only the training data for a small number of classes has to be present at the same time (not all data: new data plus some old data), and new classes can be added progressively.
The author introduces three main components that in combination allow iCaRL to fulfill all criteria put forth above.
classification by a nearest-mean-of-exemplars rule
prioritized exemplar selection based on herding
representation learning using knowledge distillation and prototype rehearsal.
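To give a feel for the second component, here is a sketch of herding-style exemplar selection on precomputed feature vectors: samples are picked greedily so that the mean of the selected set stays as close as possible to the class mean. Forbidding repeated picks and skipping feature normalization are simplifying assumptions of this sketch, and the name `herding_selection` is illustrative.

```python
import numpy as np

def herding_selection(features, m):
    """Greedily pick up to m sample indices whose running mean tracks the class mean.

    features: float array of shape (n, d) holding the features of ONE class.
    Returns indices in priority order (earlier means more representative).
    """
    mu = features.mean(axis=0)
    selected = []
    running_sum = np.zeros_like(mu)
    for k in range(1, min(m, len(features)) + 1):
        # Mean of the exemplar set if each candidate were added as the k-th item.
        candidate_means = (running_sum + features) / k
        dists = np.linalg.norm(candidate_means - mu, axis=1)
        if selected:
            dists[selected] = np.inf  # assumption: select without replacement
        best = int(np.argmin(dists))
        selected.append(best)
        running_sum += features[best]
    return selected

feats = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # class mean is 2.0
order = herding_selection(feats, 3)
print(order)
```

Because the result is ordered by priority, the exemplar set of a class can later be shrunk by simply dropping items from the end of the list.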
Classification (nearest-mean-of-exemplars)
Algorithm 1 describes the mean-of-exemplars classifier that is used to classify images into the set of classes observed so far.
where $\mathcal{P} = (P_1, \ldots, P_t)$ are the sets of exemplar images that iCaRL selects dynamically out of the data stream,
and $t$ denotes the number of classes that have been observed so far ($t$ increases with time).
$\varphi : \mathcal{X} \rightarrow \mathbb{R}^d$ is a trainable feature extractor, followed by a single classification layer with as many sigmoid output nodes as classes observed so far.
Class label $y \in \{1, \ldots, t\}$.
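The nearest-mean-of-exemplars rule can be sketched as follows, assuming the feature vectors have already been computed by the feature extractor. Classes are indexed from 0 here rather than from 1, feature normalization is omitted, and `nearest_mean_classify` is an illustrative name.

```python
import numpy as np

def nearest_mean_classify(feature, exemplar_features):
    """Assign `feature` to the class whose exemplar mean is closest.

    feature:           array of shape (d,), the feature vector of the input image
    exemplar_features: list where entry y is an (n_y, d) array of class y's exemplars
    """
    means = np.stack([f.mean(axis=0) for f in exemplar_features])  # (t, d)
    dists = np.linalg.norm(means - feature, axis=1)
    return int(np.argmin(dists))

exemplars = [
    np.array([[0.0, 0.0], [0.0, 2.0]]),      # class 0, mean (0, 1)
    np.array([[10.0, 10.0], [12.0, 10.0]]),  # class 1, mean (11, 10)
]
pred_a = nearest_mean_classify(np.array([1.0, 1.0]), exemplars)
pred_b = nearest_mean_classify(np.array([11.0, 9.0]), exemplars)
print(pred_a, pred_b)
```

Since the class means are recomputed from the current features of the exemplars, the classifier follows the feature extractor as it changes, without a separately trained output layer deciding the label.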
Training
For training, iCaRL processes batches of classes at a time using an incremental learning strategy. Every time data for new classes is available iCaRL calls an update routine (Algorithm 2)
Other algorithms (for more detail, see DOI 10.1109/CVPR.2017.587)
Some important definitions:
Note: Analogous to Transfer and Interference.
Evaluation:
GEM:
Experiments:
TODO: The future of Continual Learning.
TODO: Details of some papers.
[1] Wang, Z., et al. (2022). Learning to Prompt for Continual Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Rebuffi, S., et al. (2017). iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] van de Ven, G. M. and Tolias, A. S. (2019). Three scenarios for continual learning. arXiv:1904.07734.
[4] Lange, M. D., et al. (2022). A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7): 3366-3385.
[5] https://icml.cc/virtual/2021/tutorial/10833 (some of this blog’s pictures come from this link; thanks).
[6] van de Ven, G. M., et al. (2020). Brain-inspired replay for continual learning with artificial neural networks. Nature Communications 11(1): 4069.
[7] Lopez-Paz, D. and Ranzato, M. (2017). Gradient Episodic Memory for Continual Learning. Advances in Neural Information Processing Systems, Curran Associates, Inc.