Lecture01: Introduction to RL
-future state distribution depends only on present action and state (Markovian)
-γ: discount factor. Rewards we get in the future are less valuable than rewards we get right now (like money).
-value function: how good the state is
-q-function: how good is the state-action pair
-Bellman equation: means that the optimal policy dictates taking the optimal action at every state. The optimal action is specified by Q∗.
Solving the Bellman equation enables us to discover Q∗.
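A minimal statement of the Bellman optimality equation referred to above, in standard notation (the exact form was not written out in the notes):

```latex
% Bellman optimality equation for Q* (standard form, assumed notation)
Q^{*}(s,a) = \mathbb{E}\!\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\, a_t = a \right],
\qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)
```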
- Reinforcement learning is a sequential decision making problem (the output will affect the input).
Features of RL:
-Need access to the environment
-Jointly learning and planning from correlated samples
-Data distribution changes with action choice
1) Model-based (policy/value iteration): needs the transition matrix.
Cons: if π is deterministic, we will never reach some state-action pairs.
Solution: start at a random state.
2) Model-free: model Q(s,a) directly.
Cons: needs complete episodes. Uses Monte Carlo (sampling). Can't evaluate states directly.
3) Policy iteration: policy evaluation (test the current π, update V) -> policy improvement (find a better π, update π).
Cons: not efficient, because it needs to re-evaluate all states again and again.
4) Value iteration: evaluates each state only once per sweep.
Policy/value iteration are the most efficient methods in DP (and always optimal).
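A minimal value-iteration sketch for a small tabular MDP; the transition tensor P and reward array R below are hypothetical toy numbers, not from the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Returns the optimal state values V and the greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q = R + gamma * (P @ V)          # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)            # each state is "evaluated only once" per sweep
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # greedy policy from the converged values
    return V, policy

# toy 2-state, 2-action example (hypothetical numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.9, 0.1], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, pi = value_iteration(P, R)
```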
5) On-policy:
Attempt to evaluate/improve the policy that is used to make the decisions (cannot reach everything, but more efficient). E.g., SARSA: the TD target uses the Q-value of the action chosen by the current policy, R + γQ(s',a'); it is a temporal-difference (TD) method (SARSA uses more exploration) with ε-greedy action selection.
Pros: takes exploration into account.
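A sketch contrasting the SARSA (on-policy) and Q-learning (off-policy) update targets for a tabular Q; the variable names and step sizes are mine, not from the notes:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: the target uses the action a_next actually chosen by the current policy
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: the target uses the greedy (max) action, regardless of what is executed
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```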
6)Off policy: attempt to evaluate/improve a policy other than the one used to generate the data
Target policy: try to learn;
Behavior policy: generate the data (stochastic)
Importance sampling: the relative probability of the trajectory under the target and behavior policies.
Why importance sampling: to estimate the expectation of rewards under the target policy from behavior-policy data.
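The importance-sampling ratio referred to above, in its standard form (π is the target policy, b the behavior policy):

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
\mathbb{E}\big[\rho_{t:T-1}\, G_t \mid S_t = s\big] = v_{\pi}(s)
```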
6.5) Monte Carlo
Pros: Can learn V and Q directly from the environment; No need for models of the environment; No need to learn about all states.
Cons: Only works for episodic (finite) environments;
Learns from complete episodes (i.e., no bootstrapping).
Must wait until the end of the episode to know the return.
Solution: Temporal Difference
7) Temporal Difference (TD) learning
Pros: can update at each step; combines ideas from both Monte Carlo and dynamic programming; estimates V instead of the true vπ.
TD error: δt = Rt+1 + γV(st+1) − V(st)
Note: we only have to wait one time step in order to update; we assume V does not change much from one step to the next.
8) ε-Greedy algorithm
Exploration: try different actions to find better rewards.
Exploitation: maximize reward by choosing the best-known action.
A purely greedy algorithm never explores, so it may get stuck on sub-optimal actions.
With probability (1−ε) select the best action; with probability ε select a random action.
Improvement: a good heuristic is to set all Q values to a very high value (optimistic initialization) and then let sampling (e.g., Monte Carlo updates) drive them down gradually, as sketched below.
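A minimal ε-greedy action-selection sketch with the optimistic-initialization heuristic just mentioned (toy bandit setting; the number of actions, ε, and the initial value 10.0 are hypothetical):

```python
import numpy as np

n_actions = 5
Q = np.full(n_actions, 10.0)   # optimistic initialization: start all values very high
N = np.zeros(n_actions)        # visit counts
eps = 0.1

def select_action():
    if np.random.rand() < eps:          # explore with probability eps
        return np.random.randint(n_actions)
    return int(np.argmax(Q))            # exploit with probability 1 - eps

def update(a, reward):
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]      # sample-average update drives the optimistic values down
```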
9) Deep Q-learning
Use a function approximator for Q: q(s,a;θ), trained with a loss function and gradient descent.
Cons: sequential prediction over correlated states, and the target is not stationary; it easily forgets previous experience.
Improvements: fix the Q-target (freeze a target network); experience replay (uniform random replay, or prioritized sampling of high-TD or stochastic samples), which addresses forgetting and correlation.
Prioritized replay (high TD): make the sampling probability proportional to the TD error; this suits incremental approaches such as SARSA and Q-learning. A sketch of the frozen target + replay buffer follows.
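A compact sketch of the two fixes above (frozen target network, replay buffer), using a simple linear Q-function so the example stays self-contained; the feature/action sizes and learning rate are hypothetical:

```python
import random
from collections import deque
import numpy as np

n_features, n_actions, gamma, lr = 8, 4, 0.99, 1e-3
W = np.random.randn(n_actions, n_features) * 0.01   # online Q-network (linear here)
W_target = W.copy()                                  # frozen copy, synced periodically
replay = deque(maxlen=10_000)                        # experience replay buffer

def q_values(weights, s):
    return weights @ s

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def train_step(batch_size=32):
    batch = random.sample(list(replay), batch_size)  # random replay breaks correlation
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_values(W_target, s_next))
        td_error = target - q_values(W, s)[a]
        W[a] += lr * td_error * s                    # gradient step on the squared TD error

def sync_target():
    global W_target
    W_target = W.copy()                              # "freeze" the Q-target between syncs
```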
10) Double DQN
Problem: DQN produces inaccurate (overestimated) Q-values, leading to sub-optimal actions.
Method: two Q-functions, learned from different sets of experience; decisions are made jointly.
Pros: Double Q-learning is not a full solution to the problem of inaccurate values, but it has been shown to perform well.
11) Dueling DQN
Intuition: the importance of choosing the right action is not equal across all states, so decouple the action value from the state value.
State value: V(s) | advantage values: A(s,a)
Q(s,a) = V(s) + (A(s,a) − mean over a' of A(s,a'))
(the mean advantage is subtracted to keep the decomposition identifiable).
Pros: keeps a relative rank of the actions.
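A tiny sketch of the aggregation step in the standard dueling formulation (this is the usual mean-subtraction form, written out by me rather than copied from the notes):

```python
import numpy as np

def dueling_aggregate(V_s, A_sa):
    """V_s: scalar state value; A_sa: vector of advantages, one per action.
    Subtracting the mean keeps only the *relative* ranking of the actions."""
    return V_s + (A_sa - A_sa.mean())

Q = dueling_aggregate(1.5, np.array([0.2, -0.1, 0.4]))   # hypothetical values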
Lecture02: Policy Gradient DQN
1) Policy gradients: model the policy directly.
The output is a parameterized policy.
Pros: better than ε-greedy because the action probabilities change smoothly; with ε-greedy, a small change in values can cause a drastic change in the chosen action.
Policy gradients use a learning rate to control the size of the change.
2) REINFORCE (an improvement of 1)
The update is based on the return from that point onward and the probability of taking the action;
all updates are made after completion of the trajectory.
3) REINFORCE with baseline
Why: because of the use of sampling, REINFORCE can fluctuate and be slow to converge (high variance).
Intuition: subtract a value to reduce the size of the update (the baseline does not change with the action).
State-value function: v(s,w).
Update: δ = Gt − v(s,w); this scaled error is used to backpropagate the policy gradient.
Usually converges faster than plain REINFORCE.
4) Summary of policy gradients:
Pros: models the probability of taking an action; balances exploration/exploitation; handles continuous state spaces; sometimes easier than learning a value function.
Cons: requires a lot of sampling; slow to converge.
5) Actor-Critic
Intuition: combine an actor and a critic.
Actor-only methods: rely on sampling/simulation. Cons: large variance, no accumulation of old information.
Critic-only methods: rely on a value function and solving the Bellman equation. Pros: aim at near-optimal policies. Cons: lack guarantees that the resulting policy is actually near-optimal.
The actor attempts to improve the policy
-Can be done greedily (e.g. Q-functions), mostly applicable to small state-space
-Can be done using policy gradients
The critic evaluates the current policy
-Can use a different policy to evaluate Q-functions (target vs. behavior policies)
-Bootstrapping methods can reduce the variance, as in REINFORCE with baseline
E.g.: One-step actor-critic + one step TD;
Pros: can be applied to a continuous or very large actions space
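A minimal one-step actor-critic update for a softmax policy with linear features, matching the "one-step actor-critic + one-step TD" example above; all names, sizes, and step sizes are hypothetical:

```python
import numpy as np

def softmax_policy(theta, x):
    prefs = theta @ x
    prefs -= prefs.max()                       # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_step(theta, w, x, a, r, x_next, done,
                      alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
    """theta: (n_actions, n_features) actor parameters, w: (n_features,) critic parameters.
    Both arrays are updated in place."""
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - v             # one-step TD error (the critic's signal)
    w += alpha_w * delta * x                   # critic: move V(s) toward the TD target
    pi = softmax_policy(theta, x)
    grad_log_pi = -np.outer(pi, x)             # d log pi(a|x) / d theta for softmax-linear
    grad_log_pi[a] += x
    theta += alpha_theta * delta * grad_log_pi # actor: reinforce the taken action by delta
    return delta
```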
6)A3C
Intuition: An asynchronous version of the actor-critic algorithm
Method: multiple workers operating simultaneously (asynchronously), all updating a single shared network that produces the value function (and policy).
Pros: The use of multiple agents can increase exploration, since they can be initialized with differing policies.
Lecture03. Imitation learning
0) Terminology: regret: the difference in performance between π* and π. Maximizing rewards = minimizing regret.
Why imitation (intuition): applying RL is difficult in these situations (nonlinear dynamics, few training samples, highly correlated states);
it is sometimes difficult to define similarity between similar states.
Solution: convert the problem into a prediction problem, and maximize rewards directly rather than learning a value function.
Suitable for: Navigation, autonomous driving, helicopter, question answering
Problems: with plain supervised learning it is hard to recover from failures, and hard to respond to unseen states.
Notation: dπ,t — the state distribution induced by following π from time 1 to t−1;
C(s,a) — the immediate cost of performing a in state s.
1)Apprenticeship learning:
Algorithm: record expert data -> learn a dynamics model -> find a near-optimal policy by reinforcement learning -> test. (The dynamics of the environment and the transition matrix are learned together.)
Training: we can use regularized linear regression over all the trajectories to calculate the utility, where utility is defined as the average sum of rewards.
Cons: completely greedy (only exploitation); cannot cope with unexplored states; cannot recover from errors.
Pros: suitable for cases where exploration is dangerous/costly (e.g., autonomous driving).
Compared with imitation: imitation assumes unknown and complex world dynamics; it is slower but more general.
2) Supervised learning:
Approach: reduce the sequence into many decoupled supervised-learning problems.
Use a stationary policy throughout the trajectory.
Objective: find a policy π that minimizes the observed surrogate loss.
Intuition: once the classifier makes a mistake, it finds itself in a previously unseen state; from that point on every action may be a mistake. The mistakes compound, with each step incurring up to the maximal loss, so the loss grows super-linearly in T.
3) Forward training:
Algorithm: sample trajectories generated by the latest policy -> ask the expert to provide state-action combinations -> train a new classifier to provide the policy for the next step -> use that policy to advance one step forward.
Cons: the algorithm requires iterating over all T steps; if T is large, it is impractical.
Pros: near-linear regret; because we train on the currently visited states, the algorithm can make good decisions; the input from the expert enables it to recover from mistakes.
Analysis: the regret of this algorithm is near-linear in T, bounded by a term proportional to the per-step deviation from the optimal policy.
If the deviation is maximal at every step, the bound degrades to order T² – like supervised learning.
4) DAgger (dataset aggregation)
DAgger uses a stationary policy within each iteration (the policy is not updated while running).
Train π on D -> run π to get D_new -> ask the expert to label D_new -> D = D ∪ D_new -> retrain π on D (repeat; see the sketch below).
Pros: faster and more efficient; updates once per trajectory; near-linear regret (low regret bound); works well for both simple and complex problems.
Cons: there can be a huge difference between the learner and the expert.
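A sketch of the DAgger loop described above; `expert_label`, `run_policy`, and `train_classifier` are hypothetical placeholders for the expert, the environment rollout, and the supervised learner:

```python
def dagger(expert_label, run_policy, train_classifier, n_iterations=10):
    """D grows by aggregating expert labels for the states visited by the
    *current* policy (D = D U D_new)."""
    D = []                                             # aggregated (state, expert_action) pairs
    policy = None
    for _ in range(n_iterations):
        if policy is None:
            states = run_policy(expert_label)          # first rollouts can follow the expert
        else:
            states = run_policy(policy)                # run the learned policy, collect visited states
        D_new = [(s, expert_label(s)) for s in states] # ask the expert to label them
        D = D + D_new                                  # dataset aggregation
        policy = train_classifier(D)                   # retrain pi on the aggregated data
    return policy
```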
With Coaching: balance between the expert's actions and the policy's own actions – pick actions that are both near-optimal and within the policy's capability.
The "hope action" is
π_i(s) = argmax_{a∈A} ( λ_i · score_{π_i}(s,a) − L(s,a) )
where λ_i specifies how close the coach is to the expert (λ_i ≥ 0), score_{π_i}(s,a) is the likelihood of choosing action a in state s, and L(s,a) is the immediate loss.
Lecture04. Multi-arm bandits
why: get to know the different distribution (fixed) of each arm’s payoff.
Target: maximize their profit or minimize their regret across all plays.
Expected cumulative regret (optimal minus our strategy):
E[Reg_n] = n·R* − Σ_{i=1..n} E[r_i]
Application: recommendation system.
EE-problem: Exploration (the discovery of new information) vs. Exploitation (the use of the information the agent already has).
Algorithm:
Context-free: ε-greedy, UCB1
Contextual: LinUCB, Thompson sampling
Method (ε-greedy):
With probability ε explore, otherwise exploit (a constant step-size parameter gives greater weight to more recent rewards).
Cons: not elegant – the algorithm explicitly distinguishes between exploration and exploitation; sub-optimal estimation: during exploration every arm is equally likely to be picked; less effective at dealing with context.
UCB1
Solves: ε-greedy explores every arm equally.
Intuition: explore the arms we are less confident about.
A_t = argmax_a [ Q_t(a) + c·sqrt( ln t / N_t(a) ) ]
A_t = value term + exploration term.
Pull the arm maximizing A_t (the most uncertain arms get the largest bonus); a sketch follows below.
Cons: nonstationary problems make the method more complex; with a large state space, this kind of action selection is not practical.
Pros: gets higher rewards; achieves (near-)optimal regret.
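A minimal UCB1 sketch implementing the selection rule above; `pull_arm` is a hypothetical placeholder that returns a reward, and c is a tunable exploration constant:

```python
import numpy as np

def ucb1(pull_arm, n_arms, n_rounds, c=2.0):
    Q = np.zeros(n_arms)            # estimated value of each arm
    N = np.zeros(n_arms)            # number of pulls per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            a = t - 1               # pull every arm once to initialize
        else:
            ucb = Q + c * np.sqrt(np.log(t) / N)   # value term + exploration term
            a = int(np.argmax(ucb))
        r = pull_arm(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # incremental sample-average update
    return Q, N
```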
LinUCB
Example: different rewards from the same arm for varying contexts (states).
Intuition: the expected reward of each arm is modeled as a linear function of the context.
The confidence interval is, in effect, a standard deviation around that estimate.
Computational complexity:
Linear in the number of arms; at most cubic in the number of features.
Pros: works well for a dynamic arm set (arms come and go)
Suitable for: article recommendation.
Thompson sampling
Intuition: a simple, natural Bayesian heuristic (maintain a belief/distribution over the unknown parameters); it assumes that the reward distribution of every arm is fixed, though unknown. Play each arm according to its posterior probability of being the best.
Beta distribution: the shape of the Beta distribution is defined by a) how many times we "won" and b) how many times we "lost", which is why it is great for exploration/exploitation. As the loss count grows relative to the win count, small values become more probable; hence, the observed samples change the distribution.
Algorithm:
For each round:
Sample random variable X from each arm’s Beta Distribution
Select the arm with largest X
Observe the result of selected arm
Update prior Beta distribution for selected arm
Pros: near-linear regret.
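A minimal Thompson-sampling sketch for Bernoulli arms, matching the algorithm above (wins/losses update a Beta prior; `pull_arm` is a hypothetical placeholder returning 0/1 rewards):

```python
import numpy as np

def thompson_sampling(pull_arm, n_arms, n_rounds):
    wins = np.ones(n_arms)     # Beta(1, 1) uniform prior for every arm
    losses = np.ones(n_arms)
    for _ in range(n_rounds):
        samples = np.random.beta(wins, losses)   # sample X from each arm's Beta posterior
        a = int(np.argmax(samples))              # select the arm with the largest sample
        r = pull_arm(a)                          # observe a 0/1 reward
        wins[a] += r                             # update the selected arm's posterior
        losses[a] += 1 - r
    return wins, losses
```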
Lecture05. Monte Carlo search tree, AlphaGo, AutoML
0)general
Selection with an upper confidence bound (UCB) that combines exploration and exploitation -> Expansion: reach a leaf (end of game or an uncharted state; initialize it, e.g., with the rollout policy) -> Simulation: play to the end of the game and get a score (we can run more than one simulation) -> Backpropagation: update Q(s,a) along the path.
Exploration is usually given a larger weight at the beginning and then reduced over time; see the selection/backpropagation sketch below.
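A sketch of the UCB-based selection and backpropagation steps in MCTS; the Node fields are hypothetical, but the formula is the standard exploration/exploitation trade-off described above:

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []        # expanded child nodes
        self.visits = 0           # N(s)
        self.value_sum = 0.0      # sum of simulation results backed up through this node

def uct_select(node, c=1.4):
    """Pick the child maximizing mean value + exploration bonus (UCB)."""
    def uct(child):
        if child.visits == 0:
            return float("inf")   # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=uct)

def backpropagate(node, result):
    while node is not None:       # update the Q(s, a) statistics up to the root
        node.visits += 1
        node.value_sum += result
        node = node.parent
```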
1)MCTS
Pros: no specific knowledge / interrupt at any time
Cons: difficulty handling sacrifices (accepting short-term losses for long-term gain).
Application: MCTS with a policy, MCTS with a value function
Improve:
Combine MCTS with a policy
Combine MCTS with a value function
2)Apply monte-carlo to Go
Perfect information games: chess, checkers, Go. They have optimal value function
b: game breadth (number of legal moves per position)
d: game depth (length of the game)
2.2) Combine MCTS with a Policy
policy of the software is defined using a set of hand-crafted features;
Once an expanded node has had a sufficient number of simulations, the policy is used to determine whether or not it should remain in the tree or be “pruned-off”;
This approach slows the growth of the tree while focusing its “attention” on high-probability moves;
2.3) Combine MCTS with a Value Function
The value function is defined as
v_θ(s) = σ( φ(s)ᵀ θ )
where the combination of φ(s) (binary features) and θ (weights) defines the value function, and σ maps it to an estimated probability of winning the game. Instead of a random policy for the sampling process, the authors experimented with three options: an ε-greedy policy; a greedy policy with a noisy value function; a softmax distribution with a temperature parameter.
3)AlphaGo:
Supervised-learning policy network: recommends good actions by prediction; maximizes the log-likelihood of taking the 'human' action (3 weeks, 50 GPUs).
RL policy network (gives the action): improves the network by playing against itself; replays are used to determine the gradients (30 GPUs, one day).
RL value network (evaluation): same structure as the policy network; outputs how good the position is (50M mini-batches, 50 GPUs, one week).
Rollout policy: the first L steps look ahead for all moves; after L steps, use the (policy- or value-) based estimate.
When the number of visits exceeds a predefined threshold, the leaf is added to the tree
4)AlphaGo Zero
Improvements: self-play; only black/white stone features; a single neural net; simple tree search (no MC rollouts).
two outputs: policy; value
Tree search: policy iteration
MCTS: policy evaluation and improvement
Training: play against the previous generation; a 55% win rate is required to become the new baseline.
Compare with AlphaGo:
AlphaGo Zero does not use rollouts
A single DNN
Leaf nodes are always expanded and evaluated using the neural net
No tree policy
Based solely on self-play
No reliance on human knowledge
5) Applying AlphaZero to other domains:
Field of autoML:
Automatic pipeline generation; AlphaD3M; Neural Architecture Search; Deepline
Lecture06: Meta and Transfer
Classification/regression aim to generalize across data points. A meta-learner is readily applicable, because it learns how to compare input points, rather than memorize a specific mapping from points to classes.
Purpose: meta-learning generalizes across datasets (learning to learn).
Process: Input – many tasks and data, train F
1)Meta-Learning with Memory-Augmented Networks
Problems:
Zero-shot learning: at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
One/few shot learning: similar to zero-shot learning.
Goal: develop a policy that optimizes our performance across multiple learning tasks.
Method:
1) Label shuffling – labels are presented one step after their samples were introduced. This prevents the model from simply mapping inputs to outputs.
2) External memory module – a stable "container" of knowledge that is called upon to respond to changing circumstances.
Algorithm:
1) Feed sample x_t at time t, but provide its label as input only at the next step: (x_1, null), (x_2, y_1), …, (x_{t+1}, y_t).
2) In addition, the labels are shuffled across datasets (i.e., samples from different datasets may appear in the same sequence).
3) This forces the neural net to hold the samples in memory until the relevant label appears.
Pros: general and expressive; variety choice to design architecture
Cons: the model is very complex for such problems, and needs a lot of data to train.
External memory module: store the accurate classification.
Which "memories" to read: hard indexes cannot be used directly, so a similarity measure is used to generate a weight vector over memory locations.
Updating memory: least-recently-used access (write memories either to the least-used memory location, or to the most-recently-used location – update rather than replace).
Problems: hard to identify events most relevant to the current problem. possibly because of the difficulty of modeling long and complex sequences.
Solution: Convolutions with dilations solve this problem.
2)Simple Neural Attentive Meta-Learner(SNAIL)
Combines dilated convolutions and attention (they complement each other).
Temporal convolutions with dilations: provide high-bandwidth access and limit the size of the network and layers
Attention: provide pinpoint access over an infinitely large context.
Algorithm:
Dense blocks – concatenate the current input with a 1D dilated convolution of all previous states.
Temporal convolutions layer – stacks the dense blocks into a single input.
Attention block – uses an attention mechanism to output a single value (the action).
Cons: model consumes more time.
3) Model-Agnostic Meta-Learning (MAML)
To solve: with a limited amount of data (particularly 1-shot), models easily overfit.
Method: generalize well to held-out data over tasks (design the network specifically for fast adaptation, so that small changes in the parameters produce large improvements on the loss of any of the tasks).
Algorithm: pre-train by meta-learning and then fine-tune on the specific task.
Cons: uses second-order derivatives (a gradient over a gradient), which makes it more expensive.
4)ReBAL(Real time)
Challenge: generating samples is expensive; unexpected perturbations and unseen situations cause specialized policies to fail at test time.
difficulties in the real-world:
Failure of a robot’s components
Encountering new terrain
Environmental factors (lightning, wind)
Intuition:
Environments share a common structure but have differing dynamics.
For each environment, we learn from a trajectory that lasts as long as that environment does.
To solve: generating samples is expensive; unexpected perturbations and unseen situations cause specialized policies to fail at test time.
The proposed method: use meta-learning to train a dynamics model that, when combined with recent data, can be rapidly adapted to the local context.
5) comparison
| | SNAIL | MAML | ReBAL |
|---|---|---|---|
| Ability to improve with more data (consistency) | No | Yes | Yes |
| Ability to represent multiple types of algorithms / re-use algorithms (expressiveness) | Yes | No | ~ |
| Ability to explore the problem space efficiently, with fewer samples (structured exploration) | ~ | ~ | No |
| Efficient & off-policy (can run in real time) | No (policy gradients; no way to use off-policy data) | No (policy gradients; no way to use off-policy data) | Yes |

6) Transfer Learning
Aim to: model trained on one problem/dataset is applied to another.
Why:
Multiple models are harder to train/manage.
We would like to leverage knowledge across problems/domains.
Example of Reuse model: Image analysis, Word embedding.
“Forward” transfer: train on one task, transfer to another
Multi-task transfer: train on many tasks, transfer to other
7)Forward Transfer
7.1) Fine-Tune
A Diversity Model:
Always with stochastic policy (deterministic policy cannot easily consider multiple possible actions).
Pros:
Can be learned on one model and fine-tuned on another;
better for multi-objective tasks;
be helpful for uncertain dynamics and imitation learning;
A stochastic policy helps with:
exploration in cases of multi-modal objectives, robustness to noise and attacks, and imitation learning.
7.2) Deep Energy-Based Policies (Entropy reinforcement learning)
Aim to: train model with diversity.
Intuition: based on stochastic policy, Instead of learning the best way to perform the task, the generated policies try to learn all the ways of performing the task. (robust to noise, better for exploration, a good initialization).
How to do: Prioritize (in addition to the reward) states with high entropy; Supports exploration of multiple solutions to a given problem.
Problems: solving maximum entropy in stochastic policy is difficult.
Solution: use energy-based models.
Idea: formulate the stochastic policy as a conditional energy-based model. The energy function corresponds to a "soft" Q-function that is obtained when optimizing for maximum entropy.
Algorithm: Max entropy for the entire trajectory -> learn all ways to perform the task instead of optimal -> use energy-based model.
Sampling:
Pros: Far better for exploration; achieve better results when dealing with the optimization of multiple objectives.
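The energy-based (maximum-entropy) policy form referred to above, in the usual soft-Q notation (α is the temperature parameter; the symbols are mine, not spelled out in the notes):

```latex
\pi(a \mid s) \;\propto\; \exp\!\left(\tfrac{1}{\alpha}\, Q_{\text{soft}}(s,a)\right),
\qquad
J(\pi) = \sum_t \mathbb{E}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \big]
```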
7.3) Progressive Networks (fine-tuning)
Fine-tuning problems: easily overfits; easily forgets previous knowledge ("catastrophic forgetting").
Solution: freeze the network; add new layers for new tasks.
Structure: every task has its own network column/layers. When applying to a new task, the old network layers are frozen and one new column is added. (Like homework 3.)
Target:
Solve K independent tasks at the end of the training
Accelerate training via transfer
Avoid catastrophic forgetting
Cons: the network grows with every new task (a new column per task).
7.4) Self-Supervision for RL
Idea: the paper puts emphasis on representation learning (while learning to optimize the policy for the reward function, we also implicitly learn to represent the environment).
Problems: learning from sparse rewards may be difficult. Learning from the environment, however, occurs all the time.
Methods:
Focuses on auxiliary losses: state, dynamics, inverse dynamics, rewards.
Example:
Reward bins: the rewards are binned into "positive" and "negative"; the model must predict the bin of the next step.
Dynamics: given a state and an action, predict the next state.
Inverse dynamics: given two consecutive states, predict the action.
Corrupted dynamics: replace states (or observations of states) with "nearby" (either past or future) ones; the model must predict the correct outcome.
Reconstruction: uses a variational autoencoder to reconstruct the input.
Algorithm:
An encoder-decoder architecture->Train the network on one domain->discard the decoder, and place a new network on-top of the encoder.
The encoder can either be frozen or be updated with the new task (the latter can get better results).
Pros:
representation can be useful for multiple tasks.
Transformation into discriminative learning enables the development of useful and more robust representations.
7.5) Another Option: Randomize the Data for Better Generalization
7.6) Forward Transfer Summary
Pre-training and fine-tuning:
In general, we are usually interested in using modest amounts of data from the target domain.
We make the assumption that the differences are functionally irrelevant – sometimes a problematic assumption!
8)multi-task transfer:
Assumption:More tasks = more diversity = better transfer
Challenge: how do we merge/transfer information from multiple sources.
solution: Build a lifetime of diverse experiences (harder to model, we don’t know how humans solve problems in the target domain)
8.1) Model-Based RL (63-64)
Intuition: across different past tasks, the laws of physics are shared.
But the tasks differ: the same robot does different chores;
Same car driving to multiple destinations;
Trying to accomplish different things in the same open-ended video game.
2 approaches:
8.2) Actor-Mimic Approach (65-69)
Target: train one network that can play all Atari (which is a deterministic game) games at a level close to that of a dedicated network;
Process: multi-task training produces expert models E1…En; these pretrained experts are then used to train a single multi-task policy. The loss is not a direct reward loss; instead the student attempts to imitate the Q-values of each "expert" for its own game. (A SoftMax is used instead of the raw values because the ranges of Q-values differ across tasks, which makes direct training very difficult.)
SoftMax: we can view using the SoftMax from the perspective of forcing the student to focus more on mimicking the action chosen by the guiding expert at each state, where the exact values of the state are less important.
Why is the approach called “actor-mimic”?
Policy Loss: the cross-entropy between the expert and multi-task policies
Feature regression: makes the activations themselves more similar to the expert's (like a critic net), bridging the gap between different layer sizes.
(Another way to improve the performance of the AMN is to push the activations of each hidden layer to be close to those of each expert.)
Transfer the knowledge to a new task:
Remove the final softmax layer of the AMN.
Use the weights of AMN as an instantiation for a DQN that will be trained on the new target task.
Train the DQN (standard process).
Pros: Learn very fast.
8.3) Distillation for Multi-Task Transfer (70-72)
Distillation: Knowledge distillation is the process of moving knowledge from a large model to a smaller one while maintaining validity.
Why: An ensemble of models is too expensive and slow to run in production. So, train the ensemble off-line and then distill the information into a new single model.
Key insight: there is a lot to be learned from the “wrong” classifications as well (Even when there are some labels that received low probabilities, some classes received much lower scores than others).
method: An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model.
Structures:
The original and target models are trained using a "soft" SoftMax with temperature T.
When T = 1 we have a "regular" SoftMax; higher values produce "softer" class distributions.
Training:
The original model is trained with a high temperature, and so is the distilled network during training. After training is concluded, however, the distilled network uses T = 1.
Improvement: predict not only the score distribution but also the "right" class; the two losses are combined using a weighted average (see the sketch below).
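A sketch of the temperature-scaled ("soft") SoftMax and the combined loss described above; the temperature and weighting values are hypothetical:

```python
import numpy as np

def soft_softmax(logits, T=1.0):
    """T = 1 gives the regular SoftMax; higher T produces softer class distributions."""
    z = (logits - logits.max()) / T
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, w=0.7):
    p_teacher = soft_softmax(teacher_logits, T)                     # soft targets from the big model
    p_student_T = soft_softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student_T + 1e-12))    # match the score distribution
    p_student = soft_softmax(student_logits, T=1.0)                 # hard prediction (T = 1)
    hard_loss = -np.log(p_student[true_label] + 1e-12)              # predict the "right" class
    return w * soft_loss + (1 - w) * hard_loss                      # weighted average of both
```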
8.4) Modular Neural Network (73-77)
Definition: A modular neural network is an artificial neural network characterized by a series of independent neural networks moderated by some intermediary.
Idea: decompose neural network policies into “task-specific” and “robot-specific” modules
pros:
This modeling can train “mix-and-match” modules that can solve new robot-task combinations that were not seen during training.
This allows for sharing task information, such as perception, between robots and sharing robot information, such as dynamics and kinematics, between tasks.
Structure:
Suitable for:
8.5) Summary of multi-task transfer
Pros:
More tasks = more diversity = better transfer;
Often easier to obtain multiple different but relevant prior tasks.
Specific features:Model-based RL: transfer the physics, not the behavior
Distillation: combines multiple policies into one, can accelerate all tasks through sharing
Modular network: an architecture designed specifically for multi-task learning
Lecture07: Dealing with Large State and Action Spaces
Why: Being able to operate in environments with a large number of discrete actions is essential for applying DRL to many domains. Achieving this goal requires the ability to generalize.
Common method: embed the discrete actions into a continuous space where they can be generalized.
Then, use generative models to generate actions.
Apply k-NN to select a small subset of actions to evaluate.
Pros: This setting decouples the policy network from the Q-function network.
1)GANs
Idea: train a Generator and a Discriminator to compete.
Generator: generate fake samples, tries to fool the Discriminator. (Never directly see true samples.)
Discriminator: tries to distinguish between real and fake samples.
pros:Sampling (or generation) is straightforward; Training doesn’t involve Maximum Likelihood estimation; Robust to Overfitting since the Generator never sees the training data; Empirically, good at capturing the modes of the distribution.
Training method: alternate – freeze one network (Generator or Discriminator) while training the other.
The Generator selects random noise z, passes it through the generator network to produce a fake image, and then passes that through the discriminator network to get the probability of it being fake.
It then calculates its loss and updates its weights while keeping the discriminator's weights constant.
We repeat this process of alternating updates.
How to deal with large action spaces: use an AEN network (with bandit techniques) to recommend actions to take (see section 3 below).
2)Wolpertinger Policy
Based on actor-critic. The actor generates a proto-action and then finds the K nearest discrete actions using simple L2 distance. The critic performs additional refinement and chooses the highest-ranked neighbor according to the Q-function.
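A sketch of the Wolpertinger-style action selection just described; the actor output, critic, and action-embedding matrix are hypothetical placeholders:

```python
import numpy as np

def wolpertinger_select(state, proto_action, action_embeddings, q_function, k=10):
    """proto_action: continuous output of the actor.
    action_embeddings: (n_actions, dim) matrix of embedded discrete actions."""
    # 1) k nearest discrete actions to the proto-action (simple L2 distance)
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    candidates = np.argsort(dists)[:k]
    # 2) the critic refines the choice: pick the candidate with the highest Q-value
    q_vals = [q_function(state, a) for a in candidates]
    return int(candidates[int(np.argmax(q_vals))])
```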
3)Action Elimination with DRL
To solve: large action spaces cause Q-learning to converge to a sub-optimal policy.
Idea: reduce the number of actions by eliminating some of them from consideration.
Two methods:
Reward shaping – modify the reward received in a trajectory based on “wrongness” of the chosen actions (also used in imitation learning)
Two headed (interleaved) policy – maximize the reward while minimizing action elimination error
Main idea: decouple the elimination signal from the MDP (Markov decision process) using contextual multi-armed bandits.
The challenges of using bandits:
A very large action space would mean we need a lot of bandits; The representation of the actions needs to be fixed, although the network trains and changes.
Algorithm:
Two networks: Standard DQN and An action elimination network (AEN).
The last hidden layer of AEN is the input of the contextual bandit.
Applied every L iterations: The bandits eliminate some actions, and pass their recommendation to the DQN.
The bandits’ recommendation is “plugged” into the AEN’s last layer so it can calculate the loss
Actions are chosen using ε-greedy, to ensure that eliminated actions still get a chance.
Comments:
Use the AEN to produce an embedding of the states. The bandits are trained on this embedding, thus eliminating actions. The valid actions are then sent to the DQN.
The bandits use confidence bounds (see relevant previous lecture) to determine which actions need to be eliminated.
Because bandits require a static representation, we need to retrain a new bandits model often.
The architecture also uses a replay buffer (this way we have a lot more data to train the bandits). The new embedding representation is applied to the previous states.
4)Hierarchical DRL for Sparse Reward Environments(Auxiliary goal)
Method: define intrinsic goals and try to accomplish them.
Algorithm (generative model): define a controller and a meta-controller.
The meta-controller receives a state s_t and chooses a goal g_t ∈ G.
The controller then chooses an action based on s_t and g_t.
The controller pursues g_t for a fixed number of steps, then a new goal is selected.
An internal critic is responsible for assessing whether the controller succeeded, and allocates the (intrinsic) rewards accordingly.
How to deal:
Controller: choose actions normally, to max cumulative intrinsic rewards.
Meta-controller: max external rewards received from the environment.
Key: work in different update time;
Process:
Meta controller -goals-> critic -reward (intrinsic rewards)-> controller -action-> system.
e.g. Montezuma’s revenge.
Cons: hand-defined intrinsic goals/rewards only work for the specific problem they were designed for.
Pros:
Efficient exploration in large environments
Learning when rewards are sparse
5) “Montezuma’s Revenge” Atari Game
Problems: very long sequences, with very few rewards.
Compare: Regular DQN obtains a score of 0
Asynchronous actor-critic achieves a non-zero score after 100s of millions of frames.
6) Automatic ML pipeline generation
Two approaches:
Constrained space – create a fixed frame with “placeholders”, then populate it.
Unconstrained space – place little or no restriction on the pipeline architecture, but come at a higher computational cost.
AlphaD3M:
Attempts to apply the framework of Alpha Zero to the field of automatic pipeline generation.
Solution:
Data pre-processing; Feature pre-processing; Feature selection; Feature engineering; Classification & regression; Combiners.
Hierarchical-Step Plugin:
Enables us to use a fixed-size representation to analyze a changing number of actions.
Using this representation, we can dynamically create a list of only the “legal” actions without having to use various penalties.
Considerably accelerates training.
7) Resource-efficient Malware Detection
Problem:
Current malware detection platforms often deploy multiple detectors (an ensemble) to increase their performance:
Creates lots of redundancy (in most cases one detector is enough)
Computationally expensive, time-consuming
Solution:
query a subset of the detectors, decide based on the classification whether to query more.
Challenge:
find a way to model both the ”reward” of correct classification and the ”cost” of performing the classification.
8) Branching Dueling Q-Networks (BDQ)
Algorithm:
Common state-value estimator (think dueling DQN)
-Can more efficiently identify action redundancies.
-Similar to dueling DQN, the advantage and state value are combined via an aggregation layer.
The top-performing method for TD-error calculation was averaging across the branches.
The chosen loss function averages the TD-error across all branches.
Pros:
Applies to problems where complex actions can be broken down - like joints in the human body
Useful in cases where each “sub-problem” can be optimized with a large degree of independence.
9) Jointly-Learned State-Action Embedding
Problem:
Multiple studies propose advanced ways to learn states and action embeddings. However, these embeddings are separate and don’t take into account the relationships among actions and states.
Pros:
Better generalization for large state/action spaces; Can be combined with any policy gradients-based algorithm; The embeddings are learned in a supervised manner, which improves sampling efficiency.
Lecture08: Advanced Model Learning & Exploration Methods
States are unknown; only observations are available, so we need to work in a latent space.
Solution: model-free or model-based.
Choice:
Learn directly on the observations, disregarding the state s, or learn an embedding of the states (z_t); this is called a latent space. Then we can apply model-based or model-free approaches.
Pros: learns to process visual input very efficiently.
Cons: no guarantee that the autoencoder will capture the "essence" of the problem; not necessarily suitable for model-based methods.
Important: model-free methods take a long time to converge for high-dimensional data (this is a relatively small problem here, so it's OK).
e.g., train a small car running.
Algorithm:
Use an exploratory policy to collect data.
Use an autoencoder (the bottleneck part of the neural network) to create a low-dimensional representation of the image.
Run Q-learning to train the policy.
Loss in autoencoder:
The loss is the reconstruction error – how different is the reconstructed image from the original. This can be calculated in many ways. The simplest is MSE.
Algorithm:
Use an exploratory policy to collect data.
Learn a smooth, structured embedding of the image.
Learn a local-linear model in the embedding space.
Use a quadratic approximation (iLQG) to reach the goal image and goal gripper pose.
Training: the vision part and the controller are trained separately; during training they alternate between minimizing the reconstruction error and minimizing the error with respect to the modeled dynamics.
Rewards: because states are not used, a different method is needed to compute the reward: the image of the goal state is used instead.
Pros:
Efficiently learn complex visual skills
Structured representation enables effective learning
Cons: No guarantee that the autoencoder will capture the “essence” of the problem.
Simple: Input –(encoding)-> latent representation -(decoding)-> input reconstruction.
Variational Autoencoder: Input -(encoding)-> latent distribution -(sampling)-> sampled latent representation -(decoding)-> input reconstruction. (more efficient)
Model directly in image space.
Key idea: predict frames forward in time, thus improving the model's ability to decide on better actions.
Algorithm: predict multiple steps (visual encoding and decoding) into the future; use curriculum learning.
Problem: MSE is not a good way to measure the similarity of images, and there is still no good alternative.
Problems: when making 1-step predictions, errors can compound over time.
Solution: train the model to minimize the average squared error over K-steps.
Problem 2: training the model to predict K-steps into the future is difficult.
Solution: use curriculum learning
Combine predictions for a few steps with the “real” trajectory in order to reach K steps.
Increase the number of predicted steps as the network converges.
Problem: one common technique for exploration is ε-greedy; it is effective but relatively slow.
Solution: Video prediction can be used to improve the exploration process.
Algorithm:
Use the ε-greedy method, but instead of selecting randomly, choose actions that lead to frames seen least often in the last time steps.
The last frames are stored in a trajectory memory.
The predictive model is used to generate the predicted next frame for every action.
The visit frequency is estimated by summing the similarity between a predicted frame and the most recent frames.
Pros:
Cons:
Problem: the methods above rely on image reconstruction to make decisions/predictions. Reconstructing images is hard in practice, and most of the reconstructed image is irrelevant to the task.
Idea: try to recover the action a_t from observations o_t and o_{t+1} (an inverse model).
How to train: gather data from many trajectories and let the robot learn the model of the system.
How to use: the inverse model provides the actions required for achieving the goal.
Pros: very limited human involvement (supervised learning); no need to reconstruct the image.
Cons: can't plan with an inverse model; inverse models focus only on the action and might ignore many important aspects that don't relate to the action.
Model-free approaches
Pros: Make little assumptions aside from the reward function; Effective for learning complex policies
Cons: Requires a lot of experience (slower); Not transferable
Model-based approaches
Pros: Easy to collect data in a scalable way; Transferability – can learn across tasks; Typically requires a smaller quantity of supervised data
Cons: models don't optimize for task performance (they try to predict everything); sometimes harder to learn than a policy; often needs strong assumptions to learn complex skills (the world is too complicated).
Problem: Many studies on exploration use discriminative or generative models to predict the next state(s) or state distribution. These approaches may struggle in states that have complex representations.
Intuition: explore the novelty of various states without modeling the state or the observation
-Some studies use counts (where applicable) or use generative models to approximate the density of the states.
-The aim here is to use only discriminative models.
Goal: Our goal is to approximate the states’ density distribution, which is essential for efficient exploration.
Algorithm: the approach uses exemplar models:
given a dataset X = {x_1, …, x_n}, we train a set of classifiers/discriminators {D_{x_1}, …, D_{x_n}}.
Each discriminator D_{x_i} is trained to distinguish sample x_i (the "exemplar") from all other samples.
Training:
It is possible to train a discriminator per sample, but for efficiency's sake we can share layers, or train a single discriminator with multiple "labels".
We integrate the novelty into the reward function
The rationale: the better we are able to distinguish x_i from the rest of the samples, the more unique it is, and therefore the more worth exploring.
To solve: extrinsic rewards (e.g., reaching the goal) are very sparse.
Method: define an intrinsic reward based on the agent’s inability to predict the outcomes of its own actions.
However, unlike previous solutions, the authors ensure to consider only changes that are the result of the agent’s actions (and not the environment).
This is accomplished through self-supervision.
Difficulty: How to judge non-predictable state/event that is caused by our action.
Solution: consider actions together with the environmental changes they cause. Classify everything the agent observes into three types:
-Things the agent can control
-agent can’t control, can affect the agent (e.g., another car)
-agent can’t control, have no effect on it (e.g., leaves blowing)
Algorithm:
Define two types of rewards:
r_t^e – the extrinsic reward at time t
r_t^i – the intrinsic reward at time t
The overall reward for timestep t is r_t = r_t^e + r_t^i.
The solution consists of two sub-architectures:
a reward generator that outputs the intrinsic rewards (a sketch follows below), and
a policy that outputs actions to maximize r_t.
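A sketch of a curiosity-style intrinsic reward under these assumptions; `phi` and `forward_model` are hypothetical placeholders for the learned feature encoder and forward dynamics model:

```python
import numpy as np

def intrinsic_reward(phi, forward_model, s_t, a_t, s_next, eta=0.01):
    """Curiosity bonus: prediction error of the forward model in feature space."""
    pred_phi_next = forward_model(phi(s_t), a_t)      # predicted next-state features
    true_phi_next = phi(s_next)                       # actual next-state features
    return eta * 0.5 * np.sum((pred_phi_next - true_phi_next) ** 2)

def total_reward(r_extrinsic, r_intrinsic):
    return r_extrinsic + r_intrinsic                  # r_t = r_t^e + r_t^i
```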
Training: consists of two parts (the reward generator and the policy).
Lecture09: Transformers & DRL
Background: RNNs/CNNs are slow, and long-term dependencies are difficult to capture.
Solution: use advanced attention mechanisms.
Network components: scaled dot-product attention; multi-head attention; position-wise feed-forward networks; embeddings and SoftMax; positional encoding.
Algorithm: formed by two parts (encoder and decoder). First, extract every word's embedding and its position in the sentence; then combine these two embeddings and feed them into the encoder. Finally, feed the encoder's output vectors into the decoder.
BERT uses a bidirectional structure. (The masked-LM training method is important, but coarse and highly stochastic.)
Problem: DRL mainly deals with sequential data, and current algorithms are based on RNNs (e.g., LSTM). Training DRL with a Transformer needs a lot of data and is hard to optimize; in the future it may become practical.
Training:
Rather than train specifically per task, BERT is trained on two general problems:
Masked language model (LM).
Next Sentence prediction.
BERT is trained on a binary classification task whose goal is to determine whether the given sentence will appear next.
While the task is easy (~98% success rate), incorporating it improved the model's performance on multiple downstream tasks.
Pros:
Bidirectional.
Deep – multiple layers of transformers.
Pre-training a general model using two tasks.
Embedding both for segment and token.
Method: modify layer normalization and add a gating mechanism so that DRL converges under a transformer architecture.
Component:
Reordering of layer normalization.
Relative positional encoding to support a larger contextual horizon.
Gating mechanisms and modifying layer normalization.
The skip connections do not undergo normalization.
Cons: due to its computational complexity, the transformer can't deal with very long sequences.
Gating function: used to modify the sequential state representation of a trajectory.
How: use a gating function to replace the residual sum after multi-head attention.
Problems: Hard to apply Transformer-based architectures to long sequences in their complexity:
BERT supports a 512-token window.
Larger windows are computationally intractable.
Approaches:
Longformer; Big-Bird; Extended Transformer Construction (ETC)
idea: don’t use a larger attention span (stick to 512), but divide it among several “perspectives”(global sliding window).
Sliding window of full-attention: same as the “standard” BERT, but applied to a sliding window around the analyzed token rather than the entire document.
Dilated sliding window: the same techniques as in dilated CNNs, but applied to the analyzed tokens.
Global attention: assigned to fixed locations throughout the entire input.
Developed at the same time as Longformer, using the same intuition.
Main difference: sparse random attention
The main idea:
Enables one to bypass the need for bootstrapping (due to long credit assignment).
No need to use discounts (which may result in short-sighted behavior).
Can use recently developed stable training methods.
Process: (return-to-go, state, action, …) -> causal transformer -> linear decoder -> (a_{t-1}, a_t, …).
Training: a standard supervised process; only the action is predicted (offline learning). (Also predicting the reward or the state did not yield any improvement.)
Test: the model is "motivated" to perform well by conditioning it on a target return R̂ with a very high value.
In order to reduce the error, the transformer needs to produce actions that will result in very high rewards.
The target returns have to be defined manually (if you ask for something unrealistic, things go badly).
Idea: attempt to solve RL problems as a supervised learning task – an offline-learning approach, evaluated on large amounts of data.
Training: based on the GPT training process, with teacher forcing.
Beam search: for planning (heuristic solution)
Enables overcoming the problem of local optima.
Lecture10: Model-based approaches and gradient-free learning
Model-based methods aim to model the dynamics of the problem
Knowing the dynamics + the cost/reward function enables optimal planning.
While they are more efficient sampling-wise, optimization might prove much trickier.
Knowing the model dynamics enables using simple (linear or near-linear) techniques, such as LQR.
In this type of algorithms, if the dynamics is not known, we can use sampling in order to evaluate it.
Problem: this type of methods could be exposed to model bias.
Cons:
Works well only if we have a good initial representation of the data.
Vulnerable to drifting.
Version 1: collect random samples, train the dynamics, plan. (use the mean square error (MSE) to determine how far we are from the intended goal, works particularly well if we have an initial good representation of the data)
Pros: simple, no iterative procedure.
Cons: distribution mismatch problem, is vulnerable to drifting.
Version 2: iteratively collect data, re-plan, collect data(Dagger, updated once all sampled trajectories are followed)
Pros: simple, solves distribution mismatch
Con: open loop plan might perform poorly, esp. in stochastic domains; it’s more problematic for stochastic states because you can’t predict a state, only a distribution.
Version 3: iteratively collect data using MPC (re-plan at each step)
Pro: robust to small model errors
Con: computationally expensive, but has a planning algorithm available
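A sketch of the MPC-style loop in Version 3; `plan_with_model`, `fit_dynamics`, and the environment interface are hypothetical placeholders, not a specific library API:

```python
def mpc_control(env, fit_dynamics, plan_with_model,
                n_iterations=10, horizon=15, steps=200):
    data = []                                     # (s, a, s') transitions
    model = None
    for _ in range(n_iterations):
        s = env.reset()
        for _ in range(steps):
            if model is None:
                a = env.sample_random_action()    # bootstrap with random data
            else:
                # re-plan at EVERY step, execute only the first planned action
                a = plan_with_model(model, s, horizon)[0]
            s_next, _, done = env.step(a)
            data.append((s, a, s_next))
            s = s_next
            if done:
                break
        model = fit_dynamics(data)                # refit the dynamics on all collected data
    return model
```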
Version 4: learn the dynamics model and backpropagate directly into the policy (more efficient than 3)
Pro: computationally cheap at runtime
Con: can be numerically unstable (vanishing/exploding gradients), especially in stochastic domains.
Gaussian processes:
Pros: Very data efficient
Cons: Difficulty with non-smooth dynamics
Slow when the data is big/high dimensional
Neural networks
Pros: Very expressive
Can work with massive amounts of data
Cons:
Problematic when data is limited
Gaussian mixture models (GMM):
Decomposes the problem space into multiple regions.
Once we identify the region, we can fit the relevant model.
DNN: require a lot of data, they are not always used in model-based learning.
Pros: it takes only seconds to train the model. Compare this to DQN;
cons: if 100K experiments are required to learn a task, robots and other physical objects simply wear out.
Global: one model was used to make decisions throughout the state space.
Cons: the planner will seek out the areas where the global model performs best and worst; there may be many such areas, which makes training time-costly. Sometimes the global model is much more complex than the policy.
Local: instead of finding a “one fit all” model, find a model that describes your current location well.
pros: local model is stable and faster to model specific environment.
A linear local model (because most of the dynamics are linear).
Intuition:
The goal of LQR is to calculate a gain matrix K such that u = −Kx.
It is used to minimize a quadratic cost and is referred to as the linear quadratic regulator.
K is called the gain matrix.
LQR also assumes there is noise, sampled from a Gaussian;
however, because the mean of the Gaussian is assumed to be zero, it does not affect the optimal policy and can be ignored. (A quadratic cost makes it easy to find the minimum using linear algebra.)
Idea of the cost:
Define a cost on how far we ended up from where the action was supposed to take us (i.e., where we wanted to go); some errors can be more "painful" than others; the cost also factors in the effort of taking different actions.
So slow convergence "costs" more (we have an incentive to converge faster); slow convergence also means taking many actions.
Target: minimize two things: a) the distance of the state from what I wanted in to be; b) the amount of effort it took me to get to where I currently am.
Method:
Fit the A, B matrices (taking noise into account): x_{t+1} = A x_t + B u_t; then run a backward recursion followed by a forward recursion (see the sketch below).
Pros: learns easily and efficiently on simple problems; best for cases where we want to stay close to a given state (e.g., the cart-pole problem).
Cons: more tricky when we’re dealing with more complex goals/trajectories
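A minimal finite-horizon, discrete-time LQR backward recursion (standard Riccati form; the A, B, Q, R matrices below are hypothetical toy numbers). The forward pass then applies u_t = −K_t x_t:

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Returns gain matrices K_t such that u_t = -K_t @ x_t minimizes
    sum_t (x_t^T Q x_t + u_t^T R u_t) for dynamics x_{t+1} = A x_t + B u_t."""
    P = Q.copy()                       # terminal cost-to-go
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain matrix
        P = Q + A.T @ P @ A - A.T @ P @ B @ K               # Riccati update
        gains.append(K)
    return gains[::-1]                 # gains ordered from t = 0 to T-1

# toy 2D system (hypothetical numbers)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.01]])
K_seq = lqr_backward(A, B, Q, R, T=50)
x = np.array([1.0, 0.0])
for K in K_seq:                        # forward pass: roll the system out with u = -Kx
    u = -K @ x
    x = A @ x + B @ u
```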
Staying local: our models are only good in local regions. If we go too far (e.g., into another region where the model is very inaccurate), we will have problems; so we set up a mechanism to ensure we don't "stray" too much.
iLQR: apply LQR to non-linear systems.
Method: iteratively approximate the system using a Taylor expansion (because calculating second derivatives of the dynamics is very complicated, iLQR uses only first derivatives).
Cost: estimate the linear dynamics using the derivatives, then estimate the cost with a quadratic approximation.
Initialization: given x̂_0, pick a random control sequence û_0 … û_T and obtain the corresponding state sequence x̂_0 … x̂_T.
Test: once trained, we make decisions based on our understanding of the dynamics and hope for the best.
Problem: does not compensate well for
Perturbations/noise (e.g., winds pushing the ball)
Initial state being ”off”
Imperfect modeling of the dynamics
Solution: re-plan
Method:
-Iteratively solve the control problem using iLQR from the current time step t to the horizon T (if T is not fixed, plan over a window t..t+H);
-re-plan at each later time step (this is called receding horizon control).
Problem: a fixed horizon.
Receding horizon control: solves the fixed-horizon problem – we execute only the first control choice and then re-plan; otherwise, with a fixed final horizon, the model would "give up" when there is not enough time left to reach the objective.
Pros: both re-plans the actions and updates the dynamic models.
Method:
Train local policies for multiple scenarios, with the object located at multiple positions.
Use the local policies as demonstrations, and use one neural net to learn a global policy.
Training: The cost is modified in order to consider the distance from other policies, to make sure there are no conflicts
Step 2: train the global model
Step 3: enforce the KL divergence constraint
The underlying principle of this solution: distillation.
These algorithms often perform an exploration/exploitation trade-off:
Pros: Lack of backpropagation usually enables these methods to be faster. Also enables parallelization (the lack of which is one of the main problems in deep learning)
Black-box optimization (a black-box search aims at minimizing f: R^n → R).
Can be applied for “rugged” (non-convex) problems:
Discontinuous. “sharp” bends. Noise. Local optima.
The cost function is non-linear, non-quadratic, non-convex;
Ruggedness: non-smooth, discontinuous, noisy;
High dimensionality;
Non-separability: dependency between the objective variables;
Ill-conditioned problems;
Intuition: maximum-likelihood – increase the likelihood of sampling candidates that performed well. The goal is to update the values of the distribution so that sampling of high-performing candidates becomes more likely.
Co-variance matrix adaptation – incrementally update the matrix so that the probability of taking previously-successful search steps (i.e. actions) is increased
Comment: this process is a form of natural gradient descent (NGD). In NGD we take into account both the gradient and the curvature of the distribution space, by applying the Fisher information matrix to the distribution; it can be estimated with a Monte-Carlo approximation.
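These distribution-update ideas can be illustrated with the simpler cross-entropy method (a relative of CMA-ES without the full covariance adaptation); the objective f and dimensions below are hypothetical:

```python
import numpy as np

def cross_entropy_method(f, dim, n_iters=50, pop_size=64, elite_frac=0.2):
    """Black-box minimization of f: R^dim -> R by iteratively refitting a Gaussian
    to the best ('elite') candidates, so sampling good candidates becomes more likely."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    for _ in range(n_iters):
        samples = mean + std * np.random.randn(pop_size, dim)     # sample candidates
        scores = np.array([f(x) for x in samples])
        elite = samples[np.argsort(scores)[:n_elite]]             # keep the best performers
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6  # refit the distribution
    return mean

best = cross_entropy_method(lambda x: np.sum((x - 3.0) ** 2), dim=5)
```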
Caveat: warning; Skewed: tilted/inclined; Asynchronous: not synchronized; Aggregation: cumulative; Trade-off: balancing one factor against another; heuristic: rule of thumb; uncharted: unknown; prune: trim away; retain: maintain; entropy: measure of uncertainty; accrue: obtain/accumulate; dilation: expansion; auxiliary: assisting; adversarial: opposing/hostile; interleaved: alternating; latent: hidden/potential; arbitrary: casual; intrinsic: internal/original; agnostic: making no assumptions; receding: moving back; mitigate: lighten; permissive: tolerant; ensemble: a group of models combined
Past exam questions and answers
The model will seek out the areas where it performs best and worst. However, there may be many of these areas, which causes training the model to take a lot of time. The real model can also be much more complex than the policy.
Because in the dueling model we set two different estimators for different aspects of the state; we need the part that chooses the better action for the given condition (the issue of greater concern).
It makes the rewards we get in the future less valuable than the rewards we get right now. Without it, the expected return could be infinite over an infinite horizon, which is meaningless; with it, rewards in the far future have almost no effect on the goal (the expected return).
Assume we would like to use Modular Neural Networks to operate a self-driving vehicle. Propose two modules, trained on different domains, that could be combined to achieve this goal (needless to say, there’s no need - or way - to ensure good performance). Use 2-3 sentences.
State one advantage and one disadvantage of model-based approaches compared to model-free approaches (1 sentence each)
The model-free approach is not very efficient, because there is no guarantee the network captures the core of the problem. The model-based approach enables more effective learning.
Because the target policy (from the expert) is not achievable for the agent, it is hard for the agent to converge to a correct policy. This may also cause the model to oscillate.
No, because in this case the algorithm cannot go through all the states or all the actions, and thus can't guarantee the optimality of the solution. We can still use approximation methods to get a good, but only sub-optimal, solution.
SARSA uses the Q-value of the next action drawn from its ε-greedy policy, since A' is sampled from that policy. In contrast, Q-learning uses the maximum Q-value over all possible next actions, which is like following a greedy policy with ε = 0, i.e., no exploration in this part.
The agent can't get a reward and is therefore not able to learn what a good action is; the model is not motivated to learn new things. With an intrinsic reward, the model can explore a large environment efficiently.
explain the meaning of each of the two components in this function (1-2 sentences for each)
L_policy gives the policy loss (the cross-entropy between the expert and multi-task policies). L_FeatureRegression aims to make the activations themselves more similar to the expert.
Importance sampling is the relative probability of the trajectory under the target and behavior policies. It is used to bridge the gap between the target and behavior policies, and it is suitable for off-policy methods.
Off-Policy.
For example, when the state-space model of the problem is time-varying. In this case the model parameters keep changing, and a single global model may not be able to adapt, or may need to be far more complex.
DRL gets its reward as feedback from the environment, and the agent learns through a trial-and-error process. Supervised learning needs labels for the input.
Using the TD error for selection is sensitive to the noise generated by stochastic rewards. It can also reduce exploration because it is greedy, and states could be correlated, which prevents exploration.
DQN in general has the shortcoming of sequential prediction, where states can be correlated and the target keeps moving. In this task the target is non-stationary because the queue is dynamic and changing; another problem is that the input size also changes, making DQN impractical.
Divide the final goal into multiple small goals. Use a meta-controller network to select goals and a common DRL controller to select actual actions that achieve those small goals. This avoids the case where the final goal is too difficult to reach (no reward is ever obtained), which would make the network hard to converge.
One shortcoming: our system has a large state space, so it will not be practical.
It needs to evaluate states multiple times during the evaluation phase, and again during the improvement phase, where the tested actions differ from those recommended by the current policy.
An unbiased estimator uses the collected data to estimate the average (target value). It becomes easy for the model to overestimate.
Hard to converge; easy to overfit.
The action space is too large to converge, and different operations have very different distributions.
At the test phase, MCTS is used together with the value network to estimate whether the current state is good or not. Besides, while doing self-play, MCTS helps record the explored action-states and how good they are.
The reward here is called the "return-to-go". The model needs prior knowledge to set the target return-to-go, and it is currently only implemented as an offline policy.
SARSA takes the exploration into account, so it plots a safer, non-optimal course. Q-learning learns the optimal policy and executes it, but because of the ε-greedy exploration it sometimes falls off the cliff.
In other use cases, where the price of a wrong action is smaller, Q-learning might perform better because its greedy nature will prevail.
Policy iteration is very inefficient due to the need to evaluate the current policy at all states again and again. Besides, this method is infeasible for a large number of actions and states.
Generally, use value iteration. We have many states and a few actions, and value iteration is generally cheaper than policy iteration.
On-policy methods tend to be more efficient, but can't ensure an optimal solution because of exploration. Off-policy methods are more powerful and general.
a.) Because we train sequentially, we may override previous experiences with new information.
b.) Because we use the same parameters for the estimation and the Q-target, both values move.
Both problems will make the algorithm hard to converge.
Sometimes the action space and state space are very large; in that case it is not practical to record every transition and maintain a Q-table. Also, sometimes we may not be able to compute the value function for every action.
The training of AlphaGo Zero doesn't rely on human expert knowledge, while AlphaGo's does. AlphaGo Zero uses only a single DNN instead of several, and doesn't use rollouts. In AlphaGo Zero, leaf nodes are always expanded, and the newly expanded nodes are evaluated using the neural net (instead of simulation and backup).