BGU Deep Reinforcement Learning final examination review

Lecture01: Introduction to RL

  1. Terminology

-future state distribution depends only on present action and state (Markovian)

-γ: discount factor. Rewards we get in the future are less valuable than rewards we get right now (like money).

-value function: how good the state is

-q-function: how good is the state-action pair

-Bellman equation: means that the optimal policy dictates taking the optimal action at every state. The optimal action is specified by Q∗.

Solving the Bellman equation enables us to discover Q∗ (and hence the optimal policy).
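For reference, a standard form of the Bellman optimality equation implied above (P is the transition distribution):

```latex
Q^{*}(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ r(s,a) + \gamma \max_{a'} Q^{*}(s',a') \right],
\qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)
```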

- Reinforcement learning is a sequential decision making problem (the output will affect the input).

Feature of RL:

-Need access to the environment

-Jointly learning and planning from correlated samples

-Data distribution changes with action choice

1) Model-based (policy/value iteration): needs the transition matrix.

Cons: if π is deterministic, we will never visit some state-action pairs;

Solution: start at a random state

2) Model-free: model Q(s,a) directly.

Cons: needs complete episodes; uses Monte Carlo (sampling); can't evaluate states.

3) Policy iteration: policy evaluation (test the current π, update V) -> policy improvement (find a better π, update π)

Cons: not efficient, because every iteration re-evaluates all the states.

4) Value iteration: evaluates each state only once.

Policy/value iteration are the most efficient methods in DP (always optimal).
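A minimal tabular value-iteration sketch (my own illustration, not from the slides), assuming the transition matrix is available as `P[s][a] = [(prob, next_state, reward), ...]`:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Model-based DP: requires the transition matrix P[s][a] = list of (prob, next_state, reward)."""
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        V_new = Q.max(axis=1)                       # each state is evaluated only once per sweep
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)          # optimal values and the greedy (optimal) policy
        V = V_new
```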

5)On policy:

Attempt to evaluate/improve the policy that is used to make the decisions (cannot reach all state-action pairs, but more efficient). E.g., SARSA: the target uses the Q-value of the action the current policy actually takes, γQ(s',a'); it is also a temporal-difference (TD) method (SARSA does more exploration), typically with ε-greedy action selection.

Pros: takes exploration into account.

6) Off-policy: attempt to evaluate/improve a policy other than the one used to generate the data (the learned policy is not the one being executed; can be slower; pros: powerful, general, can learn from experts). E.g., Q-learning: the target uses the optimal (greedy) policy, γ max_a' Q(s',a'), also with temporal difference; the update is greedy with respect to Q.
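The two update rules side by side, in their standard textbook form (α is the learning rate); the only difference is the action used in the bootstrapped target:

```latex
\begin{aligned}
\text{SARSA (on-policy):} \quad & Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma\, Q(s',a') - Q(s,a) \,\big] \\
\text{Q-learning (off-policy):} \quad & Q(s,a) \leftarrow Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]
\end{aligned}
```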

Target policy: try to learn;

Behavior policy: generate the data (stochastic)

Importance sampling: the relative probability of the data under the target and behavior policies.

Why importance sampling: to obtain the expectation of the rewards under the target policy from data generated by the behavior policy.
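The importance-sampling ratio in its standard form, with π the target policy and b the behavior policy, used to correct the expected return:

```latex
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad v_{\pi}(s) = \mathbb{E}_{b}\!\left[\, \rho_{t:T-1}\, G_t \mid S_t = s \,\right]
```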

6.5) Monte Carlo

Pros: Can learn V and Q directly from the environment; No need for models of the environment; No need to learn about all states.

Cons: Only works for episodic (finite) environments;

Learn from complete episodes (i.e. no bootstrapping)

Must wait until the end of the episode to know the return.

Solution: Temporal Difference

7)Temporal Difference (TD)/learning.

Pros: can update at every step; combines ideas from Monte Carlo and dynamic programming; estimates V instead of the true vπ.

TD error: δ_t = R_{t+1} + γ·V(S_{t+1}) − V(S_t)

Assumption: we only have to wait one time step in order to update, so V is assumed not to change much from one step to the next.
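A tiny tabular TD(0) prediction sketch (my own illustration; assumes a Gym-style `env` with integer states and a fixed `policy(s)`):

```python
import numpy as np

def td0_prediction(env, policy, n_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrap from the estimate of the next state
            V[s] += alpha * (target - V[s])                    # update every step -- no need to wait for episode end
            s = s_next
    return V
```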

8) ϵ-Greedy algorithm

Exploration: try different actions to find better rewards.

Exploitation: maximize reward by choosing the best known action.

The traditional greedy algorithm never explores, so it may get stuck on sub-optimal actions.

With probability (1−ϵ), select the best action; with probability ϵ, select a random action.

Improve: A good heuristic to improve this algorithm is to set all values of Q to a very high value and then use Monte Carlo sampling to drive them down gradually.
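A minimal ε-greedy selection sketch (illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    """q_values: 1-D array of Q-values for the current state."""
    if rng.random() < epsilon:              # explore: random action with probability epsilon
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))         # exploit: best known action with probability 1 - epsilon
```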

9)Deep Q learning

Use a function approximator for Q: Q(s,a;θ), trained with a loss function and gradient descent.

Cons: states arrive sequentially (correlated samples) and the target is not stationary; the network easily forgets previous experience.

Improvements: fix the Q-target (freeze a target network); experience replay (uniform random replay, or prioritized samples <high TD error or stochastic prioritization>) — this addresses forgetting and correlation.

High TD: make the sampling probability proportional to the TD error; this is suitable for incremental approaches such as SARSA and Q-learning.
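A sketch of a single DQN update with a frozen target network and uniform experience replay (PyTorch-style; `q_net`, `target_net`, and `buffer` are placeholder names, and states are assumed to be fixed-size numeric vectors):

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    """buffer: list of (state, action, reward, next_state, done) transitions."""
    s, a, r, s2, done = map(np.array, zip(*random.sample(buffer, batch_size)))  # random replay breaks correlation
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    q_sa = q_net(s).gather(1, torch.as_tensor(a).long().view(-1, 1)).squeeze(1)
    with torch.no_grad():                                    # frozen target network keeps the target stationary
        target = torch.as_tensor(r, dtype=torch.float32) + \
                 gamma * (1 - torch.as_tensor(done, dtype=torch.float32)) * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```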

10)Double DQN

DQN tends to overestimate Q-values (leading to sub-optimal actions).

Method: two Q-functions, learned from different sets of experience; decisions are made jointly (one network selects the action, the other evaluates it).

Pros: Double Q-learning is not a full solution to the problem of inaccurate values, but it has been proven to perform well.

11)Dueling DQN

Intuition: the importance of choosing the right action is not equal across all states, so decouple the action's value from the state's value.

State value: V(s) | action advantages: A(s,a);

Q(s,a) = V(s) + (A(s,a) − mean_a' A(s,a'))

The mean advantage is subtracted so that V and A are identifiable.

Pros: keep a relative rank of the actions
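The dueling aggregation in code (a common formulation that subtracts the mean advantage so that V and A are identifiable):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)               # V(s): how good the state is
        self.advantage = nn.Linear(hidden_dim, n_actions)   # A(s,a): relative rank of the actions

    def forward(self, h):
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)          # Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))
```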

Lecture02: Policy Gradient DQN

1)Policy gradients: model policy directly

Output is parameterized policy:

  • Discrete action space – probability of action
  • Continuous – mean and covariance of a Gaussian

Pros: better than ϵ-greedy because the action probabilities change smoothly; with ϵ-greedy, a small change in the values can cause a drastic change in the chosen action.

Policy gradients use a learning rate to control the scale of the change.

2) REINFORCE (a basic algorithm implementing 1)

The update is based on the return from that point onward and the probability of taking the action;

All updates are made after the trajectory is complete.

3)REINFORCE with baseline

Why: Because of our use of sampling, the REINFORCE algorithm might fluctuate and be slow to converge (i.e., high variance)

Intuition: subtract a value from the return to reduce the size of the update (the baseline does not change with the action).

State value function: v̂(s,w)

Update signal: δ = G_t − v̂(S_t,w); use this δ to scale the back-propagated gradient.

Generally converges faster than plain REINFORCE.
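A per-trajectory REINFORCE-with-baseline update sketch (illustrative; `policy_net` returning action logits and `value_net` returning a scalar state value are placeholder assumptions):

```python
import torch

def reinforce_with_baseline(trajectory, policy_net, value_net, policy_opt, value_opt, gamma=0.99):
    """trajectory: list of (state, action, reward) collected with the current policy."""
    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):                 # compute discounted returns G_t from the end
        G = r + gamma * G
        returns.insert(0, G)
    for (s, a, _), G_t in zip(trajectory, returns):      # all updates happen after the trajectory is complete
        s = torch.as_tensor(s, dtype=torch.float32)
        delta = G_t - value_net(s).item()                # baseline reduces the variance of the update
        value_opt.zero_grad()
        (0.5 * (G_t - value_net(s)) ** 2).backward()     # move the baseline toward the observed return
        value_opt.step()
        policy_opt.zero_grad()
        (-delta * policy_net(s).log_softmax(-1)[a]).backward()  # scale the log-prob gradient by delta
        policy_opt.step()
```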

4)Summary of policy gradient:

Pros: models the probability of taking an action; balances exploration/exploitation; handles continuous state spaces; sometimes easier than learning a value function.

Cons: requires plenty of sampling; slow to converge.

5)Actor- Critic

Intuition: Combine Actor and Critic

Actor-only methods: rely on sampling/simulation. Cons: large variance; no accumulation of old information.

Critic-only methods: rely on a value function, solving the Bellman equation. Pros: can get close to optimal. Cons: lack guarantees that the resulting policy is near-optimal.

The actor attempts to improve the policy

-Can be done greedily (e.g. Q-functions), mostly applicable to small state-space

-Can be done using policy gradients

The critic evaluates the current policy

-Can use a different policy to evaluate Q-functions (target vs. behavior policies)

-Bootstrapping methods can reduce the variance, as in REINFORCE with baseline

E.g., one-step actor-critic: the critic uses a one-step TD update.

Pros: can be applied to continuous or very large action spaces.
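A one-step actor-critic update sketch (illustrative, same placeholder modules as above); the TD error acts as the critic's feedback to the actor:

```python
import torch

def actor_critic_step(s, a, r, s_next, done, actor, critic, actor_opt, critic_opt, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        target = r + (0.0 if done else gamma * critic(s_next).item())   # bootstrapped one-step TD target
    td_error = target - critic(s).item()
    critic_opt.zero_grad()
    (0.5 * (target - critic(s)) ** 2).backward()                        # critic: evaluate the current policy
    critic_opt.step()
    actor_opt.zero_grad()
    (-td_error * actor(s).log_softmax(-1)[a]).backward()                # actor: improve the policy using the TD error
    actor_opt.step()
```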

6)A3C

Intuition: An asynchronous version of the actor-critic algorithm

Method: multiple actor-critic workers operating simultaneously and asynchronously, all updating a single shared network that produces the policy and the value function.

Pros: The use of multiple agents can increase exploration, since they can be initialized with differing policies.

Lecture03. Imitation learning

0) Terminology — regret: the difference in performance between π* and π. Maximizing rewards = minimizing regret.

Why imitation (intuition): applying RL is difficult in these situations (nonlinear dynamics, few training samples, highly correlated states).

It is sometimes difficult to define similarity between similar states.

Solution: convert the problem into a prediction problem, maximizing rewards directly rather than learning a value function.

Suitable for: navigation, autonomous driving, helicopter control, question answering.

Problems: with supervised learning it is hard to recover from failures, and hard to respond to unseen states.

Notation: dπ,t — the state distribution obtained by following π from time 1 to t−1.

C(s,a) — the immediate cost of performing a in state s.

1)Apprenticeship learning:

Algorithm: record expert data → learn a dynamics model → find a near-optimal policy by reinforcement learning → test. (The dynamics of the environment and the transition matrix are learned together.)

Training: we can use regularized linear regression over all the explored trajectories to calculate the utility, defined as the average sum of rewards.

Cons: completely greedy, only exploitation; cannot cope with unexplored states; cannot recover from errors.

Pros: suitable for cases where exploration is dangerous/costly (autonomous driving).

Compared with imitation: imitation assumes unknown and complex world dynamics; it is slower but more general.

2)Supervised learning:

Approach: reduce the sequence into many decoupled supervised learning problems.

Use a stationary policy throughout the trajectory.

Objective: find a policy π to minimize the observed surrogate loss.

Intuition: once the classifier makes a mistake, it finds itself in a previously unseen state, and from that point on every action may be a mistake. The mistakes compound, with each step potentially incurring the maximal loss, so the loss grows super-linearly.

3)Forward training:

Algorithm: sample trajectories generated by the latest policy -> ask the expert to provide state-action combinations -> We train a new classifier to provide the policy for the next step -> use the policy to advance us one step forward.

Cons: The algorithm requires iterating over all T steps; if T is large, it is impractical.

Pros: near linear regret; we train the algorithm on currently visited states, the algorithm can make good decisions; The input from the expert enables it to recover from mistakes

Analysis: the regret of this algorithm is bounded by O(Tε) — near-linear in T; ε represents the divergence from the optimal policy.

If the divergence is maximal at each step, we get O(T²ε) — like supervised learning.

4) DAgger (dataset aggregation)

DAgger is a stationary algorithm (the policy is not updated while a trajectory is running).

Train π on D → run π to collect Dnew → ask the expert to label Dnew → D = D ∪ Dnew → retrain π on D.
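A DAgger loop sketch (illustrative; `expert`, `train_classifier`, and `rollout` are assumed helper functions, not the lecture's exact notation):

```python
def dagger(expert, train_classifier, rollout, env, n_iters=10):
    """expert(state) -> expert action; rollout(policy, env) -> list of visited states."""
    D = []                                          # aggregated dataset
    policy = expert                                 # the first pass simply runs the expert
    for _ in range(n_iters):
        visited = rollout(policy, env)              # run the current policy
        D += [(s, expert(s)) for s in visited]      # ask the expert to label the visited states
        policy = train_classifier(D)                # retrain on D = D U D_new
    return policy
```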

Pros: faster and more efficient; update once per trajectory; near-linear regret (low regret bound); work well for both simple and complex problems.

Cons: performs poorly when there is a huge difference between the learner and the expert.

With coaching: balance between the expert's actions and the policy's actions — actions that are near-optimal yet within the policy's capability.

The "hope action" is defined as

ã_i(s) = argmax_{a∈A} ( λ_i · score_{π_i}(s,a) − L(s,a) )

where λ_i specifies how close the coach is to the expert (λ_i ≥ 0), score_{π_i}(s,a) is the likelihood of choosing action a in state s, and L(s,a) is the immediate loss.

Lecture04. Multi-arm bandits

  1. general

Why: learn the (fixed but unknown) payoff distribution of each arm.

Target: maximize the profit, or minimize the regret, across all plays.

Expected cumulative regret (optimal strategy minus ours):

E[Reg_n] = n·R* − Σ_{i=1}^{n} E[r_i]

Application: recommendation system.

EE-problem: Exploration (the discovery of new information) vs. Exploitation (the use of information the agent already has).

Algorithm:

Context-free: ε-greedy, UCB1

Contextual: LinUCB, Thompson sampling

  1. ϵ-greedy approach (simple, stationary)

Method:

With probability ε, explore; otherwise exploit (a constant step-size parameter gives greater weight to more recent rewards).

Cons: not elegant — the algorithm explicitly distinguishes between exploration and exploitation; suboptimal exploration: any arm is equally likely to be picked; less effective when dealing with context.

Solution: explore the arms we are less confident about (addresses the "all arms equally likely" issue).

  1. Upper confidence bound (UCB):

Solve: epsilon-greedy explore any arm equally

Intuition: explore the arms we are less confident about.

A_t = argmax_a [ Q_t(a) + c·sqrt( ln t / N_t(a) ) ]

A_t = value term + exploration term

Pull the arm that maximizes this expression (best combination of value and uncertainty).

Cons: nonstationary problems require more complex methods; with a large state space, this kind of action selection is not practical.

Pros: get higher rewards. Optimal.
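A UCB1 arm-selection sketch (illustrative):

```python
import math

def ucb1_select(counts, values, t, c=2.0):
    """counts[a]: pulls of arm a; values[a]: mean reward estimate Q_t(a); t: total pulls so far."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                                                  # pull every arm once before using the bound
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])      # value term + exploration term
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)
```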

  1. Contextual Bandits (model state)

Example: different rewards from the same arm for varying contexts (state).

  1. LinUCB algorithm

Intuition: Expectation for each arm is modeled as a linear function of the context.

Loss: confidence interval is in fact standard deviation

Computational complexity: linear in the number of arms; at most cubic in the number of features.

Pros: works well for a dynamic arm set (arms come and go)

Suitable for : article recommendation

  1. Thompson sampling

Intuition: a simple, natural Bayesian heuristic (maintain a belief (distribution) over the unknown parameters); assumes that the reward distribution of every arm is fixed, though unknown. Play each arm according to its posterior probability of being the best.

Beta distribution: its shape is defined by a) how many times we "won" (α) and b) how many times we "lost" (β) — that's why it is great for exploration/exploitation. When β (losses) is larger, small values are more probable; when α (wins) is larger, large values are more probable. Hence, every observed sample changes the distribution.

Algorithm:

For each round:

Sample random variable X from each arm’s Beta Distribution

Select the arm with largest X

Observe the result of selected arm

Update prior Beta distribution for selected arm
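A Thompson sampling sketch for Bernoulli arms (illustrative; `pull(arm)` returns 1 for a win, 0 for a loss):

```python
import numpy as np

def thompson_sampling(pull, n_arms, n_rounds=1000, rng=np.random.default_rng()):
    wins = np.ones(n_arms)                     # Beta prior starts at (1, 1), i.e. uniform
    losses = np.ones(n_arms)
    for _ in range(n_rounds):
        samples = rng.beta(wins, losses)       # sample X from each arm's Beta distribution
        arm = int(np.argmax(samples))          # select the arm with the largest sample
        reward = pull(arm)                     # observe the result of the selected arm
        wins[arm] += reward                    # update the posterior of the selected arm only
        losses[arm] += 1 - reward
    return wins, losses
```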

Pros: low (logarithmic) regret.

Lecture05. Monte Carlo search tree, AlphaGo, AutoML

0)general

Selection using the Upper Confidence Bound (UCB) (balances exploration and exploitation) → Expansion: reach a leaf (end of game or an uncharted state; initialize it, e.g. with the rollout policy) → Simulation: play to the end of the game and get a score (we can run more than one simulation) → Backpropagation: update Q(s,a).

Exploration is usually given a larger weight at the beginning and then reduced over time.

1)MCTS

Pros: no specific knowledge / interrupt at any time

Cons: difficulty handling sacrifices (giving up short-term gains for longer-term advantage).

Application: MCTS with a policy, MCTS with a value function

Improve:

Combine MCTS with a policy

Combine MCTS with a value function

2)Apply monte-carlo to Go

Perfect information games: chess, checkers, Go. They have optimal value function

b: game breadth (number of legal moves per position)

d: game depth (length of the game)

2.2) Combine MCTS with a Policy

policy of the software is defined using a set of hand-crafted features;

Once an expanded node has had a sufficient number of simulations, the policy is used to determine whether or not it should remain in the tree or be “pruned-off”;

This approach slows the growth of the tree while focusing its “attention” on high-probability moves;

2.3) Combine MCTS with a Value Function

The value function is defined as

v_θ(s,a) = σ(φ(s,a)ᵀ θ)

where the combination of φ(s,a) (binary features) and θ (weights) defines the value function, and σ maps it to an estimated probability of winning the game. Instead of setting a random policy for the sampling process, the authors experimented with three options: an ε-greedy policy; a greedy policy with a noisy value function; a softmax distribution with a temperature parameter.

3)AlphaGo:

Supervised-learning policy network: recommends a good action by prediction; maximizes the log-likelihood of taking the 'human' action (3 weeks, 50 GPUs).

RL policy network (gives the action): improves the network through self-play; uses replay to determine the gradients (1 day, 30 GPUs).

RL value network (evaluation): same structure as the policy network; outputs how good the position is (50M mini-batches, 50 GPUs, one week).

Rollout policy: for the first L steps, look ahead for all moves; after L steps, use the (policy or value) based estimate.

When the number of visits exceeds a predefined threshold, the leaf is added to the tree

4)AlphaGo Zero

Improvements: self-play; only black/white stone features; a single neural net; simpler tree search (no MC rollouts).

two outputs: policy; value

Tree search: policy iteration

MCTS: policy evaluation and improvement

Training: play against the older generation; a new network must win 55% of games to become the new baseline.

Compare with AlphaGo:

AlphaGo Zero does not use rollouts

Single DNN

Leaf nodes are always expanded and evaluated using the neural net

No tree policy

Solely based on self-play

No reliance on human knowledge

5)Apply Alpha Zero to other domain:

Field of autoML:

Automatic pipeline generation; AlphaD3M; Neural Architecture Search; Deepline

Lecture06: Meta and Transfer

  1. Meta Learning

Classification/regression aim to generalize across data points. A meta-learner is readily applicable, because it learns how to compare input points, rather than memorize a specific mapping from points to classes.

Purpose: meta-learning generalizes across datasets (learning to learn).

Process: input — many tasks and their data; train a learner F across them.

1)Meta-Learning with Memory-Augmented Networks

Problems:

Zero-shot learning: at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.

One/few-shot learning: similar to zero-shot learning, but one or a few labeled examples of the new classes are available.

Goal: develop a policy that optimizes our performance across multiple learning tasks.

Method:
1) Label shuffling – labels are presented a step after their samples were introduced. This prevents the model from simply mapping inputs to outputs.

2)External memory module – a stable “container” of knowledge that is called upon to respond to changing circumstances

Algorithm:

1) Feed sample x_t at time t, but provide its label only as input at the next iteration: (x_1, null), (x_2, y_1), …, (x_{t+1}, y_t).

2) In addition, the labels are shuffled across datasets (i.e., samples from different datasets may appear in the same sequence).

3) This forces the neural net to hold the samples in memory until the relevant label appears.

Pros: general and expressive; variety choice to design architecture

Cons: the model is very complex for these complex problems; it needs a lot of data to train.

External memory module: store the accurate classification.

Which "memories" to read: indexes cannot be embedded, so use a similarity measure to generate a weight vector.

Updating memory: least recently used access (write memories to: The least used memory location; The most recently used memory location (update, don’t replace)).

Problems: hard to identify events most relevant to the current problem. possibly because of the difficulty of modeling long and complex sequences.

Solution: Convolutions with dilations solve this problem.

2)Simple Neural Attentive Meta-Learner(SNAIL)

Combine dilation and attention (complement with each other).

Temporal convolutions with dilations: provide high-bandwidth access and limit the size of the network and layers

Attention: provide pinpoint access over an infinitely large context.

Algorithm:

Dense blocks – concatenate the current input with a 1D dilated convolution of all previous states.

Temporal convolutions layer – stacks the dense blocks into a single input.

Attention block – uses an attention mechanism to output a single value (the action).

Cons: model consumes more time.

3)Model Agnostic Meta-Learning (MAML)

To solve: with a limited amount of data (particularly 1-shot), models easily overfit.

Method: Generalize well for held-out data over tasks (Design the network specifically for fast adaptation, small changes in the parameters will produce large improvements on the loss function of any of the tasks).

Algorithm: pre-train by meta learning and then fine-tune in specific task.

Cons: uses second-order derivatives, which makes it more expensive (gradient over a gradient)
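A schematic first-order MAML (FOMAML) step, only to illustrate the inner/outer loop structure; `loss_fn(model, params, batch)` is an assumed functional loss that evaluates the model with the given parameter list, and the true MAML would additionally differentiate through the inner step (second-order):

```python
import torch

def fomaml_step(model, tasks, loss_fn, meta_opt, inner_lr=0.01, inner_steps=1):
    """tasks: iterable of (support_batch, query_batch) pairs."""
    meta_opt.zero_grad()
    for support, query in tasks:
        fast = [p.detach().clone().requires_grad_(True) for p in model.parameters()]
        for _ in range(inner_steps):                                  # inner loop: adapt to the task
            grads = torch.autograd.grad(loss_fn(model, fast, support), fast)
            fast = [(p - inner_lr * g).detach().requires_grad_(True) for p, g in zip(fast, grads)]
        outer_grads = torch.autograd.grad(loss_fn(model, fast, query), fast)
        for p, g in zip(model.parameters(), outer_grads):             # outer loop: accumulate across tasks
            p.grad = g.detach() if p.grad is None else p.grad + g.detach()
    meta_opt.step()
```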

4)ReBAL(Real time)

Challenge: generating samples is expensive; unexpected perturbations & unseen situations cause specialized policies to fail at test time.

difficulties in the real-world:

Failure of a robot’s components

Encountering new terrain

Environmental factors (lightning, wind)

Intuition:

Environments share a common structure but have differing dynamics.

For each environment, we learn from a trajectory that lasts as long as that environment does.

The proposed method: use meta-learning to train a dynamics model that, when combined with recent data, can be rapidly adapted to the local context,

5) comparison

| | SNAIL | MAML | ReBAL |
|---|---|---|---|
| Ability to improve with more data (consistency) | No | Yes | Yes |
| Ability to represent multiple types of algorithms / re-use algorithms (expressiveness) | Yes | No | ~ |
| Ability to explore the problem space efficiently, i.e. fewer samples needed (structured exploration) | ~ | ~ | No |
| Efficient & off-policy (can run in real time) | Uses policy gradients, so inefficient; no way to use off-policy data | Same as SNAIL | Yes |

6)Transfer Learning

Aim to: model trained on one problem/dataset is applied to another.

Why:

Multiple models are harder to train/manage.

We would like to leverage knowledge across problems/domains.

Example of Reuse model: Image analysis, Word embedding.

“Forward” transfer: train on one task, transfer to another

  • Fingers crossed – no adaptation, hope for the best
  • Finetuning
  • Architectures for transfer: progressive networks

Multi-task transfer: train on many tasks, transfer to other

  • Model-based RL
  • Model distillation
  • Modular policy networks

7)Forward Transfer

7.1) Fine-Tune

  • Option 1: take a trained model from one task, then re-train on another
    • Can be useful in cases where the target dataset is small
    • In deep learning, it is also possible to “freeze” the lower layers so only the final layers are re-trained
  • Option 2: pre-train the model for diversity
    • Training the model to search for multiple solutions to a given problem
    • Makes it more robust and general

A Diversity Model:

Always uses a stochastic policy (a deterministic policy cannot easily consider multiple possible actions).

Pros:

Can be learned on one model and fine-tuned on another;

better for multi-objective tasks;

be helpful for uncertain dynamics and imitation learning;

A stochastic policy helps with: exploration in cases of multi-modal objectives, robustness to noise & attacks, and imitation learning.

7.2) Deep Energy-Based Policies (Entropy reinforcement learning)

Aim to: train model with diversity.

Intuition: based on a stochastic policy; instead of learning the best way to perform the task, the generated policies try to learn all the ways of performing it (robust to noise, better for exploration, a good initialization).

How to do: Prioritize (in addition to the reward) states with high entropy; Supports exploration of multiple solutions to a given problem.

Problems: solving maximum entropy in stochastic policy is difficult.

Solution: use energy-based models.

Idea: formulate the stochastic policy as a conditional energy-based model. The energy function corresponds to a “soft” Q-function that is obtained when optimizing for maximal entropy.

Algorithm: Max entropy for the entire trajectory -> learn all ways to perform the task instead of optimal -> use energy-based model.

Sampling:

  • Monte Carlo sampling is not used, since it is too slow for real-time use.
  • Instead, it uses Stein Variational Gradient Descent (SVGD) (faster and produces an unbiased estimate of the posterior distribution; being unbiased enables an actor-critic-like architecture).

Pros: Far better for exploration; achieve better results when dealing with the optimization of multiple objectives.

7.3) Progressive Networks(fine-tune)

Fine-tuning problems: easily overfits; easily forgets previous knowledge (“catastrophic forgetting”).

Solution: Freeze the network; add new layers for new tasks.

Structure: every task has its own column of network layers. When applying to a new task, the old columns are frozen and one new column is added (like homework 3).

Target:

Solve K independent tasks at the end of the training

Accelerate training via transfer

Avoid catastrophic forgetting

Cons:

  • The number of networks grows with the number of given tasks, but only a fraction of the capacity is used
  • While the network retains the ability to solve K tasks, there’s no way of knowing which of them are actually similar/relevant to the current problem

7.4) Self-Supervision for RL

Idea: the paper puts emphasis on representation learning (while learning to optimize the policy for the reward function, we also implicitly learn to represent the environment).

Problems: learning from sparse rewards may be difficult. Learning from the environment, however, occurs all the time.

Methods:

Focuses on auxiliary losses: state, dynamics, inverse dynamics, rewards.

Example:

Reward binning: the rewards are binned into “positive” and “negative”; the model is required to predict the outcome of the next step.

Dynamics: given a state and an action, predict the next state.

Inverse dynamics: given two consecutive states, predict the action.

Corrupt dynamics: replace states (or observations of states) with “nearby” (either past or future) ones. The model is required to predict the correct outcome.

Reconstruction: uses a variational autoencoder to reconstruct the input.

Algorithm:

An encoder-decoder architecture->Train the network on one domain->discard the decoder, and place a new network on-top of the encoder.

The encoder can be either frozen or updated on the new task (the latter can give better results).

Pros:

representation can be useful for multiple tasks.

Transformation into discriminative learning enables the development of useful and more robust representations.

7.5) Another Option: Randomize the Data for Better Generalization

7.6) Forward Transfer Summary

Pre-training and fine-tuning:

  • Standard fine-tuning with RL is difficult
  • Maximum entropy formulation can help (robustness, generality). Randomizing the input can help a lot.

In general, we are usually interested in using modest amounts of data from the target domain.

We make the assumption that the differences are functionally irrelevant – sometimes a problematic assumption!

8)multi-task transfer:

Assumption:More tasks = more diversity = better transfer

Challenge: how do we merge/transfer information from multiple sources?

solution: Build a lifetime of diverse experiences (harder to model, we don’t know how humans solve problems in the target domain)

8.1) Model-Based RL (63-64)

Intuition: across different past tasks, the laws of physics are shared.

But: Same robot does different chores;

Same car driving to multiple destinations;

Trying to accomplish different things in the same open-ended video game.

2 approaches:

  • Simple: train model on past tasks, use it to solve new tasks
  • complex: adapt/fine-tune the model to the new task

8.2) Actor-Mimic Approach (65-69)

Target: train one network that can play all Atari games (which are deterministic) at a level close to that of a dedicated network;

Process: multi-task training produces expert models E1…En; these pretrained experts are then used to train a single multi-task policy. The loss is not a direct reward loss but rather attempts to imitate the Q-values of each “expert” on its own game. (A SoftMax is used instead of the raw values because the range of Q-values differs across tasks, which makes direct training very difficult.)

SoftMax: we can view using the SoftMax from the perspective of forcing the student to focus more on mimicking the action chosen by the guiding expert at each state, where the exact values of the state are less important.

Why is the approach called “actor-mimic”?

  • We can think of the policy regression as a teacher telling the AMN how to act (by mimicking the expert)
  • the feature regression is analogous to a teacher telling a student why it should act that way (mimic expert’s thinking process)

Policy Loss: the cross-entropy between the expert and multi-task policies

Feature regression: makes the activations themselves more similar to the expert’s (like a critic net), bridging the gap between different layer sizes.

(Another way to improve the performance of the AMN is by pushing the activations of each hidden layer to be close to those of the corresponding expert.)

Transfer the knowledge to a new task:

Remove the final softmax layer of the AMN.

Use the weights of AMN as an instantiation for a DQN that will be trained on the new target task.

Train the DQN (standard process).

Pros: Learn very fast.

8.3) Distillation for Multi-Task Transfer (70-72)

Distillation: Knowledge distillation is the process of moving knowledge from a large model to a smaller one while maintaining validity.

Why: An ensemble of models is too expensive and slow to run in production. So, train the ensemble off-line and then distill the information into a new single model.

Key insight: there is a lot to be learned from the “wrong” classifications as well (even among the labels that received low probabilities, some classes score much lower than others).

method: An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model.

Structures:

The original and target model are trained using a “soft” SoftMax.

When T=1 we have a “regular” SoftMax; higher temperature values produce “softer” class distributions.

Training:

The original model is trained with a high temperature, and so is the distilled network during training. However, after training is concluded, the distilled network uses T=1.

Improvement: train not only on the soft score distribution but also on predicting the “right” (hard) class; the two objectives are combined using a weighted average.
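A sketch of the temperature-scaled SoftMax and the combined distillation loss (illustrative; `alpha` and `T` are placeholder hyper-parameters):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=1)                  # "soft" SoftMax: higher T -> softer distribution
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         soft_targets, reduction="batchmean") * T * T    # match the teacher's full score distribution
    hard_loss = F.cross_entropy(student_logits, labels)                  # also predict the "right" class
    return alpha * soft_loss + (1 - alpha) * hard_loss                   # weighted average of the two objectives
```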

8.4) Modular Neural Network (73-77)

Definition: A modular neural network is an artificial neural network characterized by a series of independent neural networks moderated by some intermediary.

Idea: decompose neural network policies into “task-specific” and “robot-specific” modules

  • task-specific modules are shared across robots
  • robot-specific modules are shared across all tasks on that robot

pros:

This modeling can train “mix-and-match” modules that can solve new robot-task combinations that were not seen during training.

This allows for sharing task information, such as perception, between robots and sharing robot information, such as dynamics and kinematics, between tasks.

Structure:

  • The “world” is defined as a matrix of all tasks and configurations (e.g. degrees of freedom on the robotic arm)
  • The network operates on observations, not states
  • Observations can be converted to the same dimensionality for all cases
  • The cost is defined using two components: an “intrinsic” part (i.e., the robot) and an “extrinsic” part (i.e., the task being performed):
  • Let the functions f and g represent the robot-specific and task-specific parts of the policy, respectively.
  • The rationale: worlds with the same robot instantiation reuse the same robot module, while worlds with the same task instantiation reuse the same task module.

Suitable for:

  • This approach could also be applied to larger tasks: the modules can be arranged in any order, as long as they form a DAG
  • Key point: this setting enables modular components that were never used together before to be applied to a new problem

8.5) Summary of multi-task transfer

Pros:

More tasks = more diversity = better transfer;

Often easier to obtain multiple different but relevant prior tasks.

Specific features — Model-based RL: transfer the physics, not the behavior

Distillation: combines multiple policies into one, can accelerate all tasks through sharing

Modular network: an architecture designed specifically for multi-task learning

Lecture07: Dealing with Large State and Action Spaces

  1. General

Why: Being able to operate in environments with a large number of discrete actions is essential for applying DRL to many domains. Achieving this goal requires the ability to generalize.

**Common method:**Embed the discrete actions into a continuous space where they can be generalized.

Then, use generative models to generate actions.

Apply KNN to select a small subset of actions to evaluate.

Pros: This setting decouples the policy network from the Q-function network.

1)GANs

Idea: train a Generator and a Discriminator to compete.

Generator: generate fake samples, tries to fool the Discriminator. (Never directly see true samples.)

Discriminator: tries to distinguish between real and fake samples.

Pros: sampling (or generation) is straightforward; training doesn’t involve maximum-likelihood estimation; robust to overfitting since the Generator never sees the training data; empirically, good at capturing the modes of the distribution.

Training method: alternate updates — freeze one network (Generator or Discriminator) while training the other.

Generator selects a random noise z, passes it through the generator network to generate a fake image, and then passes it through the discriminator network to output the probability of it being fake.

It then calculates its loss and updates its weights while keeping the discriminator’s weights constant.

And we repeat this process of Alternate updates.

How to deal: use a network AEN (with bandit technology) to recommend actions to take.

2)Wolpertinger Policy

Based on actor-critic. The actor generates a proto-action and then finds its K nearest neighbors (actual actions) using simple L2 distance. The critic performs additional refinement and chooses the neighbor with the highest Q-value.

3)Action Elimination with DRL

To solve: large action spaces cause Q-learning to converge to a sub-optimal policy.

Idea: reduce the number of actions by eliminating some from consideration.

Two methods:

Reward shaping – modify the reward received in a trajectory based on “wrongness” of the chosen actions (also used in imitation learning)

  • Difficult to tune, slows convergence
  • Not sample efficient (requires exploring actions before their elimination)

Two headed (interleaved) policy – maximize the reward while minimizing action elimination error

  • Main challenge: the two goals are strongly coupled and affect each other’s observations
  • Convergence, therefore, is not trivial

Main idea: decouple the elimination signal from the MDP (Markov decision process) using contextual multi-armed bandits.

The challenges of using bandits:

A very large action space would mean we need a lot of bandits; The representation of the actions needs to be fixed, although the network trains and changes.

Algorithm:

Two networks: Standard DQN and An action elimination network (AEN).

The last hidden layer of AEN is the input of the contextual bandit.

Applied every L iterations: The bandits eliminate some actions, and pass their recommendation to the DQN.

The bandits’ recommendation is “plugged” into the AEN’s last layer so it can calculate the loss

Actions are chosen using ε-greedy to ensure that eliminated actions still get a chance

Comments:
Use the AEN to produce an embedding of the states. The bandits are trained on this embedding, thus eliminating actions. The valid actions are then sent to the DQN.

The bandits use confidence bounds (see relevant previous lecture) to determine which actions need to be eliminated.

Because bandits require a static representation, we need to retrain a new bandits model often.

The architecture also uses a replay buffer (this way we have a lot more data to train the bandits). The new embedding representation is applied to the previous states.

4)Hierarchical DRL for Sparse Reward Environments(Auxiliary goal)

Method: define intrinsic goals and try to accomplish them.

Algorithm (generative model): define a controller and a meta-controller;

The meta-controller receives a state s_t and chooses a goal g_t ∈ G.

The controller then chooses an action a_t based on s_t and g_t.

The controller will pursue g_t for a fixed number of steps, then another goal is selected.

An internal critic is responsible for assessing whether the controller succeeded, and allocates the rewards accordingly.

How to deal:

Controller: choose actions normally, to max cumulative intrinsic rewards.

Meta-controller: max external rewards received from the environment.

Key: work in different update time;

Process:

Meta controller -goals-> critic -reward (intrinsic rewards)-> controller -action-> system.

e.g. Montezuma’s revenge.

Cons: intrinsic rewards can only deal with the specific problem they were designed for.

Pros:

Efficient exploration in large environments

Learning when rewards are sparse

5) “Montezuma’s Revenge” Atari Game

Problems: very long sequences, with very few rewards.

Compare: Regular DQN obtains a score of 0

Asynchronous actor-critic achieves a non-zero score after 100s of millions of frames.

6) Automatic ML pipeline generation

Two approaches:
Constrained space – create a fixed frame with “placeholders”, then populate it.

Unconstrained space – place little or no restriction on the pipeline architecture, but come at a higher computational cost.

AlphaD3M:

Attempts to apply the framework of Alpha Zero to the field of automatic pipeline generation.

Solution:
Data pre-processing; Feature pre-processing; Feature selection; Feature engineering; Classification & regression; Combiners.

Hierarchical-Step Plugin:
Enables us to use a fixed-size representation to analyze a changing number of actions.

Using this representation, we can dynamically create a list of only the “legal” actions without having to use various penalties.

Considerably accelerates training.

7) Resource-efficient Malware Detection

Problem:
Current malware detection platforms often deploy multiple detectors (an ensemble) to increase their performance:

Creates lots of redundancy (in most cases one detector is enough)

Computationally expensive, time-consuming

Solution:

query a subset of the detectors, decide based on the classification whether to query more.

Challenge:

find a way to model both the ”reward” of correct classification and the ”cost” of performing the classification.

8) Branching Dueling Q-Networks (BDQ)

Algorithm:
Common state-value estimator (think dueling DQN)

-Can more efficiently identify action redundancies.

-Similar to dueling DQN, the advantage and state value are combined via an aggregation layer.

The top-performing method for TD-error calculation was averaging across the branches.

The chosen loss function averages the TD-error across all branches.

Pros:

Applies to problems where complex actions can be broken down - like joints in the human body

Useful in cases where each “sub-problem” can be optimized with a large degree of independence.

9) Jointly-Learned State-Action Embedding

Problem:

Multiple studies propose advanced ways to learn states and action embeddings. However, these embeddings are separate and don’t take into account the relationships among actions and states.

Pros:

Better generalization for large state/action spaces; Can be combined with any policy gradients-based algorithm; The embeddings are learned in a supervised manner, which improves sampling efficiency.

Lecture08: Advanced Model Learning & Exploration Methods

  1. general

States are unknown — we only have observations — so we need to work in a latent space.

Solution: model-free or model-based.

Choice:

Learn directly on the observations, disregarding the state s; or learn an embedding of the states — this is called a latent space.

  1. Model-Free Approaches for Learning the Latent Space

Pros: Learn to process visual input very efficiently

Cons: no guarantee that the autoencoder will capture the “essence” of the problem; not necessarily suitable for model-based methods. (After learning the latent representation, we can apply model-based or model-free approaches.)

Important: model-free methods take a long time to converge for high-dimensional data (this is a relatively small problem, so it’s OK).

e.g., train a small car running.

Algorithm:

Use an exploratory policy to collect data.

Use an autoencoder (the bottleneck part of the neural network) to create a low-dimensional representation of the image.

Run Q-learning to train the policy.

Loss in autoencoder:

The loss is the reconstruction error – how different is the reconstructed image from the original. This can be calculated in many ways. The simplest is MSE.

  1. Model-Based Approaches for Learning the Latent Space

Algorithm:
Use an exploratory policy to collect data

Learn a smooth, structured embedding of the image

Learn a local-linear model over the embedding

Use quadratic approximation (iLQG) to reach image of goal & goal gripper pose.

Training: the vision module and the controller are trained separately. In the training process, they alternate between minimizing the reconstruction error and minimizing the error with respect to the modeled dynamics.

Rewards: Because they don’t use states, they need a different method to calculate the reward: they use the image of the state they want to reach

Pros:

Efficiently learn complex visual skills

Structured representation enables effective learning

Cons: No guarantee that the autoencoder will capture the “essence” of the problem.

  1. Variational Autoencoder

Simple: Input –(encoding)-> latent representation -(decoding)-> input reconstruction.

Variational Autoencoder: Input -(encoding)-> latent distribution -(sampling)-> sampled latent representation -(decoding)-> input reconstruction. (more efficient)

  1. Action-Conditional Video Prediction

Models directly in the image space.

Key idea: predict frames forward, thus improving the model’s ability to decide on better actions.

Algorithm: predicting multiple steps (visual encoding and visual decoding) into the future; curriculum learning.

Problems: MSE is not a good way to measure the similarity of images, and there is still no good alternative.

  1. Curriculum learning with multi-step prediction

Problems: when making 1-step predictions, errors can compound over time.

Solution: train the model to minimize the average squared error over K-steps.

Problem 2: training the model to predict K-steps into the future is difficult.

Solution: use curriculum learning:
Combine predictions for a few steps with the “real” trajectory in order to reach K steps.

Increase the number of predicted steps as the network converges.

  1. Informed Exploration(Action-Conditional Video Prediction)

Problems: one common technique for exploration is ε-greedy. The strategy is effective, but it’s relatively slow.

Solution: Video prediction can be used to improve the exploration process.

Algorithm:

Use the ε-greedy method, but instead of picking a random action, choose the action that leads to the frame seen least often in the recent time steps.

The most recent frames are stored in a trajectory memory.

The predictive model is used to get the predicted next frame for every action.

The visit frequency is estimated by summing the similarity between a predicted frame and the most recent frames.

Pros:

  • Stability through multi-step prediction and curriculum learning
  • Useful for control (via improved exploration)

Cons:

  • Generating synthetic images (e.g., Atari) is easy, real-world images are much harder
  • Not immediately clear how to use this approach for planning (lacking a good metric for image similarity)
  1. Inverse Reinforcement Learning on Images

Problem: the above methods rely on image reconstruction to make decisions/predictions. Reconstructing real-world images is hard, and most of the reconstruction effort is not relevant to the task.

Idea: try to recover the action a_t from observations o_t and o_{t+1}.

How to train: gather data from many trajectories; the robot learns the inverse model of the system.

How to use: inverse model provides the actions required for achieving the goal.

Pros: very limited human involvement (supervised learning); no need to reconstruct the image.

Cons: can’t plan with an inverse model; inverse models focus only on the action and might ignore many important aspects that don’t relate to it.

  1. compare

Model-free approaches

Pros: Make little assumptions aside from the reward function; Effective for learning complex policies

Cons: Requires a lot of experience (slower); Not transferable

Model-based approaches

Pros: Easy to collect data in a scalable way; Transferability – can learn across tasks; Typically requires a smaller quantity of supervised data

Cons: models don’t optimize for task performance (they try to predict everything); sometimes harder to learn than a policy; often needs base assumptions to learn complex skills (the world is too complicated).

  1. EX2: Exploration with Exemplar Models for DRL

Problem: Many studies on exploration use discriminative or generative models to predict the next state(s) or state distribution. These approaches may struggle in states that have complex representations.

Intuition: explore the novelty of various states without modeling the state or the observation

-Some studies use counts (where applicable) or use generative models to approximate the density of the states.

-The aim here is to use only discriminative models.

Goal: Our goal is to approximate the states’ density distribution, which is essential for efficient exploration.

Algorithm: The approach uses Exemplar models:

Given a dataset X = {x_1, …, x_n}, we train a set of classifiers/discriminators {D_{x_1}, …, D_{x_n}}.

Each discriminator D_{x_i} is trained to distinguish sample x_i (the “exemplar”) from all other samples.

Training:

It is possible to train a discriminator per sample, but for efficiency’s sake we can share layers, or train a single discriminator with multiple “labels”.

We integrate the novelty into the reward function

The rationale: the better we are able to distinguish x_i from the rest of the samples, the more unique (novel) it is, and therefore the more worth exploring.

  1. Curiosity-driven Exploration by Self-supervised Prediction

To solve: extrinsic rewards (e.g., reaching the goal) are very sparse.

Method: define an intrinsic reward based on the agent’s inability to predict the outcomes of its own actions.

However, unlike previous solutions, the authors ensure to consider only changes that are the result of the agent’s actions (and not the environment).

This is accomplished through self-supervision.

Difficulty: How to judge non-predictable state/event that is caused by our action.

Solution: consider actions together with the environmental changes they cause. Classify the elements of the observation into three types:

-Things the agent can control

-Things the agent can’t control but that can affect it (e.g., another car)

-Things the agent can’t control and that have no effect on it (e.g., leaves blowing in the wind)

Algorithm:

define two types of rewards:

r_t^e – the extrinsic reward at time t

r_t^i – the intrinsic reward at time t

The overall reward for timestep t is r_t = r_t^e + r_t^i

The solution consists of two sub-architectures:

A reward generator that outputs the intrinsic rewards

A policy that outputs actions to maximize r_t

training:(contains 2 parts)

  • A module that takes the encodings ϕ(st) and ϕ(st+1) and predicts the action that was used to get from one to the other (an inverse model), trained to decrease the prediction error.
  • A module that receives at and ϕ(st) and predicts the future encoding (a forward model); its prediction error serves as the intrinsic reward.

Lecture09: Transformers & DRL

  1. Transformer

Background: RNNs/CNNs are slow and long-term dependencies are difficult to model;

solution: use advanced attention mechanisms

network structure: Scaled Dot-Product Attention; Multi-Head Attention; Position-wise Feed-Forward Networks; Embeddings and SoftMax; Positional Encoding.

Algorithm: formed by two parts (encoder and decoder). First extract every word’s embedding and its position in the sentence; then combine these two embeddings and feed them into the encoder; finally, feed the encoder’s output vectors to the decoder for processing.

  1. BERT: Bidirectional Encoder Representation from Transformers

BERT uses a bidirectional structure. (The masked-LM training method is important, but coarse and highly stochastic.)

Problem: DRL mainly deals with sequential data, and current algorithms are based on RNNs (e.g., LSTM). Training DRL with a Transformer needs a lot of data and is hard to optimize; in the future it may become practical.

Training:

Rather than train specifically per task, BERT is trained on two general problems:

Masked language model (LM).

Next Sentence prediction.

BERT is trained on a binary classification task whose goal is to determine whether the given sentence will appear next.

While this task is easy (~98% success rate), incorporating it improved the model’s performance on multiple downstream tasks.

Pros:

Bidirectional.

Deep – multiple layers of transformers.

Pre-training a general model using two tasks.

Embedding both for segment and token.

  1. Gated Transformer Architectures

Method: modify layer normalization and use a gating mechanism to make DRL converge with a transformer architecture.

Component:

Reordering of layer normalization.

Relative positional encoding to support a larger contextual horizon.

Gating mechanisms and modifying layer normalization.

The skip connections do not undergo normalization.

Cons: due to computational complexity, the transformer can’t deal with long sequences.

Gating function: used to modify the sequential state representation of a trajectory.

How: use a gating function to replace the residual sum after multi-head attention.

  1. Adapting Transformers for Long Sequences

Problems: it is hard to apply Transformer-based architectures to long sequences because of their complexity:

BERT supports a 512-token window.

Larger windows are computationally intractable.

Approaches:

Longformer; Big-Bird; Extended Transformer Construction (ETC)

  1. Longformer

Idea: don’t use a larger attention span (stick to 512), but divide it among several “perspectives” (sliding window, dilated sliding window, global).

Sliding window of full-attention: same as the “standard” BERT, but applied to a sliding window around the analyzed token rather than the entire document.

Dilated sliding window: the same techniques as in dilated CNNs, but applied to the analyzed tokens.

Global attention: assigned to fixed locations throughout the entire input.

  1. Big-Bird

Developed at the same time as Longformer, using the same intuition.

Main difference: sparse random attention

  1. The Decision Transformer

The main idea:

Enables one to bypass the need for bootstrapping (due to long credit assignment).

No need to use discounts (which may result in short-sighted behavior).

Can use recently developed stable training methods.

Process: (return-to-go, state, action, …) → causal transformer → linear decoder → (…, a_{t−1}, a_t, …)

Training: standard sequence-model training; only the action is predicted (offline learning; predicting the reward or the state did not yield any improvement).

Test: the model is “motivated” to perform well when it is provided with a target return R̂ set to a very high value

In order to reduce the error, the transformer needs to produce actions that will result in very high rewards

The rewards have to be defined manually (if you ask for something unrealistic, things go badly)

  1. Offline Reinforcement Learning as One Big Sequence Modeling Problem

Idea: attempt to solve RL problems as a supervised sequence-modeling task; an offline learning approach, evaluated on large amounts of data.

Training: based on the GPT training process, with teacher forcing.

Beam search: for planning (heuristic solution)

Enables overcoming the problem of local optima.

Lecture10: Model-based approaches and gradient-free learning

  1. General

Model-based methods aim to model the dynamics of the problem

Knowing the dynamics + the cost/reward function enables optimal planning.

While they are more efficient sampling-wise, optimization might prove much trickier.

Knowing the model dynamics enables using simple (linear or near-linear) techniques, such as LQR.

In this type of algorithms, if the dynamics is not known, we can use sampling in order to evaluate it.

Problem: this type of methods could be exposed to model bias.

  1. Learning the Model

Notes:

This approach works particularly well if we have a good initial representation of the data.

Cons: vulnerable to drifting.

Version 1: collect random samples, train the dynamics, plan. (use the mean square error (MSE) to determine how far we are from the intended goal, works particularly well if we have an initial good representation of the data)

Pros: simple, no iterative procedure.

Cons: distribution mismatch problem, is vulnerable to drifting.

Version 2: iteratively collect data, re-plan, collect more data (DAgger-style; the model is updated once all sampled trajectories have been followed)

Pros: simple, solves distribution mismatch

Con: open loop plan might perform poorly, esp. in stochastic domains; it’s more problematic for stochastic states because you can’t predict a state, only a distribution.

Version 3: iteratively collect data using MPC (re-plan at each step)

Pro: robust to small model errors

Con: computationally expensive, but has a planning algorithm available

Version 4: learn the dynamics model and backpropagate directly into the policy (more efficient than 3)

Pro: computationally cheap at runtime

Con: can be numerically unstable (vanishing/exploding gradients), especially in stochastic domains.

  1. Model the environment

Gaussian processes:

Pros: Very data efficient

Cons: Difficulty with non-smooth dynamics

Slow when the data is big/high dimensional

Neural networks

Pros: Very expressive

Can work with massive amounts of data

Cons:

Problematic when data is limited

Gaussian mixture models (GMM):

Decomposes the problem space into multiple regions.

Once we identify the region, we can fit the relevant model.

DNNs require a lot of data, so they are not always used in model-based learning.

  1. Backpropagating to Policy: the PILCO Algorithm

Pros: it takes only seconds to train the model. Compare this to DQN;

cons: if 100K experiments are required to learn a task, robots and other physical objects simply wear out.

  1. Global vs Local model

Global: one model was used to make decisions throughout the state space.

Cons: the optimizer will seek out the areas where the global model performs best and worst; there may be many such areas, which makes training time-costly. Sometimes the global model is much more complex than the policy.

Local: instead of finding a “one fit all” model, find a model that describes your current location well.

pros: local model is stable and faster to model specific environment.

  1. LQR:

A linear local model (because most of the dynamics are linear).

Intuition:

The goal of LQR is to calculate a gain matrix K such that u_t = −K·x_t.

It is used to minimize a quadratic cost and is referred to as the linear quadratic regulator.

K is called the gain matrix.

LQR also assumes there is noise, which is sampled from a Gaussian.

However, because the mean of the Gaussian is assumed to be zero, it does not affect the optimal policy and can be ignored. (A quadratic cost makes it easy to find the minimum using linear algebra.)

Idea of cost:

Define a cost for how far we ended up from where the action was supposed to take us (i.e., where we wanted to go); some errors can be more “painful” than others; it also factors in the cost of taking different actions.

So slow convergence will “cost” more (we have an incentive to converge faster). Slow convergence also means that I had to take a lot of actions.

Target: minimize two things: a) the distance of the state from where I wanted it to be; b) the amount of effort it took to get to where I currently am.

Method:

Method: fit the A, B matrices (taking noise into account): x_{t+1} = A·x_t + B·u_t; then run a backward recursion followed by a forward recursion.
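In symbols (standard finite-horizon LQR formulation, consistent with the line above; here Q and R are the state- and action-cost matrices, not the RL Q-function or reward):

```latex
x_{t+1} = A x_t + B u_t, \qquad
J = \sum_{t=0}^{T} \left( x_t^{\top} Q\, x_t + u_t^{\top} R\, u_t \right), \qquad
u_t^{*} = -K_t\, x_t
```

K_t comes from the backward (Riccati) recursion; the forward pass then rolls the states out under u_t = −K_t·x_t.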

Pros: learns easily and efficiently on simple problems. Best for cases where we want to stay close to a given state (e.g., the cart-pole problem).

Cons: more tricky when we’re dealing with more complex goals/trajectories

Staying local: our models are only good in local regions. If we go too far (e.g., to another region where our modeling is very inaccurate) we will have problems. So, we want a mechanism designed to ensure that we don’t “stray” too much.

  1. iLQR:

solve: apply LQR to non-linear systems.

Method: iteratively approximate the system using a Taylor expansion. (Because calculating second derivatives of the dynamics can be very complicated, iLQR uses only first derivatives of the dynamics.)

Loss: Estimate the linear dynamics using the derivatives, then estimate the cost and quadratic cost.

Method: initialization — given x̂_0, pick a random control sequence û_0 … û_T and obtain the corresponding state sequence x̂_0 … x̂_T.

Test: once trained, we make decisions based on our understanding of the dynamics and hope for the best.

Problem: does not compensate well for

Perturbations/noise (e.g., winds pushing the ball)

Initial state being ”off”

Imperfect modeling of the dynamics

Solution: re-plan

  1. MPC:

Method:

-Iteratively solve the control problem using iLQR from time-step t (the current step) to T (if T is not fixed, we plan over a horizon t+H);

-Re-plan as time moves forward past t (this is called receding horizon control).

Problem: fixed horizon

Receding horizon control: solves the fixed-horizon problem — commit only to the first control choice (not the full predicted plan) and re-plan over a horizon that slides forward; otherwise the model may “give up” when there is not enough time left to satisfy the objective function.

Pros: both re-plans the actions and updates the dynamic models.

  1. Combining Local Models Into a Global Model

Method:

Train a local policies for multiple scenarios, with the object located at multiple positions.

Use the local policies as demonstrations, and use one neural net to learn a global policy.

Training: The cost is modified in order to consider the distance from other policies, to make sure there are no conflicts

Step 2: train the global model

Step 3: enforce the KL divergence constraint

The underlying principle of this solution: distillation.

  1. Evolutionary Algorithm: Highlights
  • Evolutionary algorithms use a population of individual solutions that are evaluated using a fitness function
  • The population is updated at each step using the following operations:
    • Selection – a stochastic process of choosing the best individuals from the population (quality = fitness function score)
    • Mutation – generate “children” of existing parents through local perturbation
    • Recombination – a non-local perturbation, generated by combining two or more solutions

These algorithms often perform an exploration/exploitation trade-off:

  • Exploration – generation of new solutions
  • Exploitation – selection of (current) good solutions

Pros: the lack of backpropagation usually makes these methods faster; it also enables parallelization (the lack of which is one of the main problems in deep learning).

Black-box optimization (a black-box search aims at minimizing a function f: R^n → R).

Can be applied to "rugged" (non-convex) problems: discontinuities, "sharp" bends, noise, local optima (a minimal sketch of such a loop follows below).
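A minimal evolutionary loop on a black-box fitness function, just to make the selection/mutation/recombination steps concrete (the population sizes and the Rastrigin stand-in objective are arbitrary choices, not from the lecture):

```python
import numpy as np

def evolve(f, dim, pop_size=50, parents=10, sigma=0.3, generations=200):
    """Minimize a black-box function f using selection, recombination and mutation."""
    pop = np.random.randn(pop_size, dim)                 # initial random population
    for _ in range(generations):
        fitness = np.array([f(ind) for ind in pop])      # evaluate fitness
        elite = pop[np.argsort(fitness)[:parents]]       # selection: keep the best individuals
        children = []
        for _ in range(pop_size):
            a, b = elite[np.random.randint(parents, size=2)]
            child = 0.5 * (a + b)                        # recombination (non-local perturbation)
            child += sigma * np.random.randn(dim)        # mutation (local perturbation)
            children.append(child)
        pop = np.array(children)
    return pop[np.argmin([f(ind) for ind in pop])]

# Works even on a "rugged", non-convex fitness landscape (Rastrigin as a stand-in)
rastrigin = lambda x: 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))
best = evolve(rastrigin, dim=5)
```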

  1. Some Problems Are Hard to Solve with BP

The cost function is non-linear, non-quadratic, non-convex;

Ruggedness: non-smooth, discontinuous, noisy;

High dimensionality;

Non-separability: dependency between the objective variables;

Ill-conditioned problems;

  1. Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

Intuition: maximum-likelihood – increase the likelihood of sampling candidates that performed well. The goal is to update the values of the distribution so that sampling of high-performing candidates becomes more likely.

Co-variance matrix adaptation – incrementally update the matrix so that the probability of taking previously-successful search steps (i.e. actions) is increased

Comment: this process is a form of natural gradient descent (NGD). In NGD we take into account both the gradient and the curvature of the distribution space; this is done by applying the Fisher information matrix of the distribution, which can be approximated with Monte-Carlo sampling.
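A heavily simplified sketch of the distribution-update idea (this is closer to a cross-entropy-method step; real CMA-ES adds evolution paths, rank-based weights and step-size control, all omitted here, and the weighting scheme below is an assumption):

```python
import numpy as np

def simplified_cma_step(f, mean, cov, pop_size=30, elite_frac=0.3):
    """One distribution-update step: sample, rank, and refit mean/covariance
    to the best candidates so that good search steps become more likely."""
    samples = np.random.multivariate_normal(mean, cov, size=pop_size)
    scores = np.array([f(x) for x in samples])
    elite = samples[np.argsort(scores)[: int(elite_frac * pop_size)]]
    new_mean = elite.mean(axis=0)
    # Covariance fitted to the successful steps taken from the *old* mean
    steps = elite - mean
    new_cov = (steps.T @ steps) / len(elite) + 1e-6 * np.eye(len(mean))
    return new_mean, new_cov

sphere = lambda x: float(np.sum(x**2))          # stand-in objective
mean, cov = np.zeros(4), np.eye(4)
for _ in range(100):
    mean, cov = simplified_cma_step(sphere, mean, cov)
```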

  1. Unfamiliar words

Caveat: warning; Skewed: slanted/distorted; Asynchronous: not synchronized; Aggregation: cumulative; Trade-off: balancing competing factors; heuristic: rule of thumb; uncharted: unknown; prune: trim away; retain: maintain; entropy: measure of disorder; accrue: obtain; dilation: expansion; auxiliary: assisting; adversarial: opposing; Interleaved: alternating; latent: hidden/potential; arbitrary: without particular reason; intrinsic: internal/original; Agnostic: independent of a particular choice; receding: moving back; mitigate: lighten; permissive: lenient; ensemble: a combination/group

Past exam questions and answers

  1. State true/false regarding the following statements (distillation):
    1. Distillation produces a more efficient model after the training is complete, but requires additional steps during training - true
    2. Using a “hard” softmax (i.e. a less “smooth” distribution of values across classes), we improve the distillation training process because it is easier for the distilled network to identify mistakes – false
  2. Provide two reasons why we may prefer multiple local models instead of one global model (describe a general case, 1 sentence for each reason).

The model will seek out the areas where it performs best and worst; however, there may be many such areas, which makes training a single global model take a lot of time. Also, the real dynamics can be far more complex than the (local) policy, so local models are easier to fit.

  1. When implementing a dueling DQN, it is critical that we are able to discern the contribution of the state from that of the chosen action. Explain why (2-3 sentences).

Because the dueling architecture splits the estimate into a state-value stream V(s) and an advantage stream A(s,a): in many states the choice of action barely matters, so most of the value comes from the state itself. To recombine the two streams into Q(s,a) unambiguously, we must be able to discern how much of the value comes from the state versus the chosen action (otherwise V and A are not identifiable).
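For reference, the standard dueling aggregation (from the original dueling DQN paper) makes the two contributions identifiable by centering the advantage:

$$ Q(s,a) = V(s) + \Big(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\Big) $$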

  1. What is the purpose of the discount factor (1 sentence)? what are the risks of setting the value to be too big (1 sentence)? too small (1 sentence)?

It makes rewards received in the future less valuable than rewards received right now. If γ is too big (close to or equal to 1), the expected return may become infinite over an infinite horizon, which is meaningless; if γ is too small, future rewards have almost no effect on the objective (the expected return), making the agent myopic.
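For reference, the discounted return that γ controls (standard definition):

$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1 $$

With bounded rewards the sum converges for γ < 1, but can diverge for γ = 1 over an infinite horizon.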

  1. Assume we would like to use Modular Neural Networks to operate a self-driving vehicle. Propose two modules, trained on different domains, that could be combined to achieve this goal (needless to say, there’s no need - or way - to ensure good performance). Use 2-3 sentences.

  2. State one advantage and one disadvantage of model-based approaches compared to model-free approaches (1 sentence each)

Model-based approaches are generally more sample-efficient and enable effective learning/planning, while model-free approaches give no such guarantee. On the other hand, model-based approaches require access to (or learning of) an accurate transition model, and errors in the model propagate to the policy.

  1. One of the possible shortcomings of the “basic” DAgger algorithm can occur when the learner’s policy space is very far/different from that of the expert. Explain why that is a problem (2-3 sentences)

Because the target policy (from the expert) may not be achievable for the learner, it is hard for the agent to converge to a correct policy; this can also cause the model to oscillate.

  1. Are policy iteration/value iteration methods easier to operate in a large state space or a large action space? Explain. (2-3 sentences)

Neither is easy: in both cases the algorithm must sweep over all the states and maximize over all the actions, so it cannot go through them all and cannot guarantee the optimality of the solution. We can still use approximation methods to get a good-enough, but only sub-optimal, solution.

  1. SARSA is an on-policy algorithm, while Q-learning is off-policy. This difference is mostly expressed by the way they update Q(s,a). Explain what is the difference and its learning (2 sentences for the first part, 1 for the second)

SARSA updates towards Q(s',a') where a' is the action actually drawn from the ε-greedy behavior policy, so it evaluates the policy it is following. Q-learning updates towards the maximum Q(s',a') over all possible actions, i.e. it evaluates the greedy policy (as if ε=0), which means no exploration is reflected in the update target.
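The difference is visible directly in the two tabular update rules (a minimal sketch; the Q-array indexing and hyperparameters are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next actually chosen by the
    (epsilon-greedy) behavior policy."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action, regardless of what the
    behavior policy will actually do next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```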

    1. Explain in short (2-3 sentences) how the approach used by AlphaGoZero was converted in order to be applied to the AlphaD3M algorithm.
  • By changing the Unit, State, Action and Reward in the network. In AlphaD3M the unit is a pipeline primitive, the state is the meta-data, task and pipeline, the action is insert/delete/replace, and the reward is the pipeline performance. This "game" does not have two players as in Go or Chess, but that does not really matter, because the classifier outputs a result at the end.
  1. In the paper “hierarchical DRL for sparse reward environments” we discussed the use of intrinsic goals (e.g., “Montezuma’s Revenge”). Explain why sparse rewards are a problem in general (1 sentence) and how the use of intrinsic goals addresses this problem (2 sentences).

With sparse rewards the agent almost never receives a reward signal and therefore cannot learn which actions are good; the model is not motivated to learn new things. With intrinsic rewards, the agent keeps receiving intermediate learning signal and can explore a large environment efficiently.

  1. The loss function of the actor-mimic approach is defined as follows:
    (figure: the actor-mimic loss, i.e. the sum of a policy-regression term and a feature-regression term: L_ActorMimic = L_policy + β·L_FeatureRegression)

explain the meaning of each of the two components in this function (1-2 sentences for each)

L_policy gives the policy loss (the cross-entropy between the expert policy and the multi-task policy). L_FeatureRegression aims to make the network's activations themselves more similar to the expert network's activations.

  1. Explain the purpose of importance sampling (what problem is it used to address?) (2 sentences). To which type (or types) of algorithms is it suitable (1 sentence).

Importance sampling weights each return by the relative probability of the trajectory under the target and behavior policies; it is used to bridge the gap between them, so that expectations under the target policy can be estimated from data generated by the behavior policy. It is suitable for off-policy methods.
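For reference, the per-trajectory importance-sampling ratio and the identity it provides (standard Sutton & Barto notation):

$$ \rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}, \qquad \mathbb{E}_b\big[\rho_{t:T-1}\,G_t \mid S_t=s\big] = v_\pi(s) $$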

  1. Would we prefer meta-learning algorithms to be on-policy or off-policy? Explain.

Off-policy. Meta-learning needs experience from many different tasks, and off-policy methods can reuse data collected by other policies and tasks, which makes them far more sample-efficient in this setting.

  1. Provide an example of a scenario where the use of local models might be preferable to the use of a global model. Explain your reasoning. (up to 3 sentences for the entire answer)

For example, when the dynamics of the problem are time-varying. In this case the model parameters keep changing, so a single global model may not be able to adapt, or it may require a much more complex model than a set of simple local ones.

  1. A common question one encounters when explaining DRL for the first time is “what is the difference from simply using LSTM? Both can be applied to a sequence of actions”. Describe two substantial differences between DRL and supervised learning using LSTM (1-2 sentences each).

DRL gets its reward as feedback from the environment and the agent learns by trial and error, whereas supervised learning with an LSTM needs explicit labels for every input. In addition, in DRL the agent's actions change the data distribution it sees, while in supervised learning the data is fixed.

  1. [Large state & action spaces] Why are sparse rewards problematic for DRL algorithms (1 sentence)? How is this problem made worse by large state/action spaces (1 sentence)? Explain how auxiliary rewards can help alleviate this problem? (1-2 sentences)

Using the TD-error for selection is sensitive to noise generated by stochastic rewards; it can also reduce exploration of the replay buffer, because the selection becomes greedy and the replayed states may be correlated.

  1. [Model-based learning] Receding horizon control functions as a “compromise” between using finite and infinite horizons. A) Explain the possible drawbacks of using either finite or infinite horizon (1 sentence each). B) Explain why receding horizon possibly addresses these shortcomings (1 sentence).

DQN in general has the shortcomings of sequential prediction (correlated states) and chasing a moving target. In this task the target is non-stationary because the queue is dynamic and changing; another problem is that the size of the input also changes, making a standard DQN impractical.

  1. In class, we’ve seen how auxiliary goals can be used to solve complex challenges like Montezuma’s revenge. Propose a way to combine generative models in this approach (2-3 sentences). Describe the merits of your approach.

Divide the final goal into multiple small goals; use a meta-level network to choose among goals, and a standard DRL agent to select the actual actions that achieve each small goal. This avoids the situation where the final goal is too difficult to reach (so no reward is ever obtained), which would make convergence of the network hard.

  1. Applying DRL to the problem of NAS is a very challenging problem because of the extremely large state and action spaces. Assume that you are tasked with developing a DRL solution to such a domain, where there are no constraints on the size or form of the architecture. Propose two techniques/approaches (1-2 sentences each) that can assist you in overcoming this challenge.

One shortcoming is that our system has a large state space, so it will not be practical.

It needs to evaluate states multiple times during the evaluation phase, and again during the improvement phase, where the tested actions differ from those recommended by the policy.

  1. We often talked about the importance of unbiased estimators for DRL algorithms. Explain what is an unbiased estimator in the context of the algorithms we learned (1 sentence). What is the potential outcome of using a biased estimator (1-2 sentences).

An unbiased estimator uses the collected data to estimate a quantity whose expectation equals the true target value (e.g., the true return). With a biased estimator it becomes easier for the model to systematically over-estimate values.

  1. State two shortcomings of only selecting the experience replay samples with the highest TD-Error (1 sentence per shortcoming – I expect a short, to-the-point explanation).

Hard to converge; easy to overfit to the replayed samples.

  1. Assume a large queue of dynamic size, with each item requiring a dynamic set of operations/tests for classification. Each operation requires a different amount of time, and this time is Normally distributed. Assume a standard DRL agent (assume DQN) whose goals are to minimize the average “wait” time in the queue and maintain a sufficiently high classification accuracy (the agent decides how many tests to apply). Describe two challenges/problems in applying the standard DRL agent to this task.

The action space is too large for the agent to converge, and the different operations have different time distributions.

  1. In the AlphaGo algorithm we have the policy network and value network that enable us to select the best actions. Explain what role does the MCTS component of the algorithm play at the test phase (2-3 sentences).

At the test phase, MCTS performs lookahead search: guided by the policy network (to focus on promising moves) and the value network (to estimate whether states are good or not), it runs many simulations, records the explored state-action pairs and how good they are, and the move actually played is chosen from these search statistics.

  1. The Decision Transformer uses an interesting setup, where the reward is part of the input (figure: the input sequence of returns-to-go, states and actions). Explain how the rewards should be set at the test phase (1 sentence)? What is a potential problem with this setup (1-2 sentences)?

The reward input is the "return-to-go": at test time we set the first token to the target return we want the agent to achieve, and decrement it by the rewards actually received. A potential problem is that choosing this target requires prior knowledge of what returns are achievable (a target outside the training distribution may not be reached); the method is also an offline approach.
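A minimal sketch of computing returns-to-go from a reward sequence (the Decision Transformer uses undiscounted returns-to-go; at test time the first value is simply replaced by the desired target return):

```python
def returns_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * r_{t+1} + ... computed backwards over the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 0.0, 2.0]))   # [3.0, 2.0, 2.0]
```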

  1. In class we discussed an example of a use-case where Sarsa outperformed Q-Learning (the “cliff” example). Explain why Sarsa did better in this use-case (1 sentence). Explain why Q-Learning might work better in other use-cases (2 sentences).

Sarsa takes the exploration into account, so it plots a safer, non-optimal course. Q-learning learns the optimal policy and executes it, but because of the ϵ-greedy exploration it sometimes falls in the cliff.

In other use-cases, where the price for a wrong action is smaller, Q-learning may perform better, since its greedy nature will prevail and it converges to the optimal policy.

  1. State two shortcomings of the policy iteration approach (1 sentence each)

Policy iteration is very inefficient due to the need to evaluate the current policy at all states again and again. Besides, this method is infeasible for a large number of actions and states.

  1. Suppose you have a robot trying to reach a goal and avoid cliffs in a small grid world. It can only move North, South, East, or West, but occasionally fails to move in the intended direction. If you were to model this using an MDP and were trying to solve it optimally, should you use value iteration or policy iteration?

Generally, use value iteration. We have many states and a few actions, and value iteration is generally cheaper than policy iteration.

  1. State one advantage of on-policy methods with respect to off-policy methods and vice versa (1 sentence each).

On-policy methods tend to be more efficient, but they cannot ensure an optimal solution because exploration is baked into the learned policy. Off-policy methods are more powerful and general (e.g., they can learn from experts or from other policies' data), but are typically slower.

  1. When discussing the shortcomings of DQN we mentioned two: a) sequential prediction problem (i.e. correlated states); b) Non-stationary target. Explain each problem using 1 sentence.

a.) Because we train sequentially, we may override previous experiences with new information.

b.) Because we use the same parameters for the estimation and the Q-target, both values move.

Both problems will make the algorithm hard to converge.

  1. Policy gradients approaches use a parameterized approximation of the policy. Explain why this is the case (that is, instead of directly modelling the policy with respect to every state and action). Use 2-3 sentences.

Sometimes the action space and state space are very large (or continuous); in that case it is not practical to record every transition and maintain a Q-table over every state and action. Besides, sometimes we may not be able to obtain a good value function at all, while the policy itself can still be represented compactly.

  1. We described imitation learning as "brittle" (fragile) in class. Explain what we meant (1 sentence). Explain how DAGGER attempts to alleviate this problem (2 sentences)
    Because of the "compounding error" problem: small mistakes take the learner to states the expert never demonstrated, and the algorithm cannot recover from failure. DAgger alleviates this by iterating: at each iteration the expert labels the states actually visited by the learner's current policy, and the aggregated data is used to retrain, so the learner gets corrections for the states it really encounters.
  2. Specify two significant differences between the AlphaGo and AlphaGoZero algorithm (1 sentence for each difference)

The training of AlphaGoZero doesn't rely on human expert knowledge, while AlphaGo's does. AlphaGoZero uses a single DNN instead of several, and it doesn't use rollouts: leaf nodes are always expanded, and newly expanded nodes are evaluated using the neural net (instead of simulation and backup).
