Reinforcement Learning Week 4 Course Notes

Three things this week: watch the lecture videos, read Sutton (1988), and do Homework 3 (HW3).

Below are screenshots from the videos together with my notes:

Temporal Difference Learning

[Slide 1]
Read Sutton 1988 first
  • Read Sutton, Read Sutton, Read Sutton, because the final project is based on it!
[Slide 2]
Three families of RL algorithms
  1. Model based
  2. Model free
  3. Policy search
  • From 1 to 3: more direct learning.
  • From 3 to 1: more supervised.

TD-lambda

[Slide 3]
TD-lambda
[Slide 4]
Quiz 1: TD-lambda Example
  • In this case the model is known, so the calculation is easy; the value is just an expectation over successor states (see the sketch below).
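As a concrete illustration, here is a minimal sketch with a made-up MDP (not the quiz's actual numbers): when the transition probabilities and rewards are known, V(s) is an expectation over successor states and can be computed by backing values up from the terminal state.

```python
# Hypothetical acyclic MDP: transitions[s] = list of (probability, reward, next_state);
# terminal states have an empty list.
transitions = {
    "s1": [(0.5, 1.0, "s2"), (0.5, 2.0, "s3")],
    "s2": [(1.0, 0.0, "s4")],
    "s3": [(1.0, 1.0, "s4")],
    "s4": [],  # terminal
}

def value(s, gamma=1.0):
    """Expected discounted return from s under the known model (Bellman backup)."""
    return sum(p * (r + gamma * value(s_next, gamma)) for p, r, s_next in transitions[s])

print(value("s1"))  # 0.5*(1 + 0) + 0.5*(2 + 1) = 2.0
```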
[Slide 5]
Quiz 2: Estimating from Data
  • Remember from the previous lecture: without a model, we take the return from each episode and average over the episodes (see the sketch below).
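A minimal sketch of that episode-averaging (Monte Carlo) estimate, with hypothetical episode data:

```python
# Each episode is a list of (state, reward received on leaving that state) pairs.
episodes = [
    [("s1", 1.0), ("s2", 0.0), ("s3", 1.0)],
    [("s1", 2.0), ("s3", 1.0)],
]

def mc_estimate(state, episodes, gamma=1.0):
    """Every-visit Monte Carlo: average the return-to-go over all visits to `state`."""
    returns = []
    for ep in episodes:
        g = 0.0
        for s, r in reversed(ep):   # walk backwards so g accumulates the return-to-go
            g = r + gamma * g
            if s == state:
                returns.append(g)
    return sum(returns) / len(returns)

print(mc_estimate("s1", episodes))  # (2 + 3) / 2 = 2.5
```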
[Slide 6]
Computing Estimates Incrementally
  • Rewriting the average incrementally makes the formula look a lot like a neural-network (delta-rule) update, and a learning rate alpha is introduced (see the sketch below).
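A small sketch showing that the incremental update V_T = V_{T-1} + alpha_T (R_T - V_{T-1}) with alpha_T = 1/T reproduces the plain average (the per-episode returns below are made up):

```python
returns = [3.0, 5.0, 4.0, 8.0]   # hypothetical per-episode returns R_1..R_4 for some state

v = 0.0
for T, R in enumerate(returns, start=1):
    alpha = 1.0 / T
    v = v + alpha * (R - v)      # V_T = V_{T-1} + alpha_T * (R_T - V_{T-1})

print(v, sum(returns) / len(returns))   # both print 5.0
```

With a different (e.g., constant) alpha this is exactly the delta-rule style update toward the target R_T, which is why it looks like neural-network learning.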
[Slide 7]
Quiz 2: which learning rates alpha make the learning converge? (Tip: if the exponent i is greater than 1, the sum over T of 1/T^i is bounded.)
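For reference, the step-size conditions for convergence here are the standard stochastic-approximation (Robbins-Monro) conditions:

$$\sum_{T} \alpha_T = \infty \qquad \text{and} \qquad \sum_{T} \alpha_T^2 < \infty$$

For example, alpha_T = 1/T satisfies both (the harmonic series diverges while the sum of 1/T^2 is bounded), alpha_T = 1/T^2 fails the first condition, and a constant alpha fails the second.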
[Slide 8]
TD(1) Rule
[Slide 9]
TD(1) with and without repeated states
  • When there are no repeated states, TD(1) is the same as outcome-based updates (i.e., see all the rewards of the episode and then update the weights).
  • When there are repeated states, extra learning happens; see the sketch below.
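Below is a minimal sketch of the TD(1) per-episode update with eligibility traces, following my reading of the lecture's rule; the notation and data structures are my own.

```python
def td1_episode(V, episode, alpha, gamma=1.0):
    """One TD(1) pass over an episode given as a list of (s_prev, r, s_next) transitions.

    V maps state -> value and is updated in place."""
    V_old = dict(V)                  # values frozen at the start of the episode
    e = {s: 0.0 for s in V}          # eligibility of each state
    for s_prev, r, s_next in episode:
        e[s_prev] += 1.0             # the state just left becomes (more) eligible
        delta = r + gamma * V_old[s_next] - V_old[s_prev]   # one-step TD error
        for s in V:
            V[s] += alpha * delta * e[s]   # every eligible state absorbs the error
        for s in e:
            e[s] *= gamma            # traces decay by gamma (lambda = 1)
    return V
```

With no repeated states each eligibility is 0 or 1 and the propagated TD errors telescope into the full-return (outcome-based) update; when a state repeats within an episode its eligibility exceeds 1, which is the extra learning mentioned above.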
[Slide 10]
Why TD(1) is "Wrong"
  • With the TD(1) rule, V(s2) is estimated by averaging the returns of the episodes that visit s2. We only see s2 once, with a return of 12, so V(s2) = 12.
  • With maximum-likelihood estimates, we instead learn the transition model from the data. E.g., over the first 5 episodes we saw s3 -> s4 3 times and s3 -> s5 2 times, so the transition probabilities are estimated from the data as 0.6 and 0.4 respectively (see the sketch below).
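A minimal sketch of that maximum-likelihood estimate from counts (the data below is made up to match the 3-out-of-5 example):

```python
from collections import Counter

# Successor states observed on the 5 recorded transitions out of s3.
observed_from_s3 = ["s4", "s4", "s5", "s4", "s5"]

counts = Counter(observed_from_s3)
total = sum(counts.values())
p_hat = {s: n / total for s, n in counts.items()}
print(p_hat)   # {'s4': 0.6, 's5': 0.4}
```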
[Slide 11]
TD(0) Rule
  • First of all, with infinite data TD(1) will also do the right thing.
  • With finite data, we can replay the data repeatedly (in the limit, infinitely often), and the estimates converge to those of the maximum-likelihood model. This is what TD(0) does; see the sketch below.
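For reference, a minimal sketch of the standard TD(0) backup (notation mine). Replaying such backups over the same finite data set until the values stop changing is what pushes the estimates toward the values of the maximum-likelihood model:

```python
def td0_step(V, s_prev, r, s_next, alpha, gamma=1.0):
    """One TD(0) backup: move V(s_prev) toward the one-step target r + gamma * V(s_next)."""
    V[s_prev] += alpha * (r + gamma * V[s_next] - V[s_prev])
    return V
```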
[Slide 12]
Connecting TD(0) and TD(1)

K-Step Estimators

[Slide 13]
K-Step Estimators
  • E1 is the one-step estimator (one-step lookahead), i.e., TD(0).
  • E2 is the two-step estimator, and Ek is the k-step estimator.
  • As k goes to infinity, we get TD(1). (The updates are written out below.)
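Written out (my reconstruction of the slide's notation, with γ the discount and α_T the learning rate), the k-step estimator backs up k observed rewards and then bootstraps from the current value estimate:

$$E_1:\quad V(s_t) \leftarrow V(s_t) + \alpha_T \bigl( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr)$$

$$E_k:\quad V(s_t) \leftarrow V(s_t) + \alpha_T \bigl( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V(s_{t+k}) - V(s_t) \bigr)$$

As k goes to infinity the bootstrap term γ^k V(s_{t+k}) falls off the end of the episode and only observed rewards remain, which is the outcome-based TD(1) update.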
[Slide 14]
K-step Estimators and TD-lambda

TD-lambda can be seen as a weighted combination of the k-step estimators, where the k-step estimator Ek gets weight (1-λ)λ^(k-1).
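As a quick sanity check, those weights form a geometric series that sums to 1 (for 0 ≤ λ < 1):

$$\sum_{k=1}^{\infty} (1-\lambda)\,\lambda^{k-1} = (1-\lambda) \cdot \frac{1}{1-\lambda} = 1$$

So λ = 0 puts all the weight on E1, i.e., TD(0), while as λ approaches 1 the weight shifts toward the long-horizon estimators, recovering TD(1).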

[Slide 15]
Why use TD-lambda?

The best-performing lambda is typically not TD(0), but some λ in between 0 and 1.

[Slide 16]
Summary
2015-09-05: first draft
2015-12-03: reviewed and revised up to the "Connecting TD(0) and TD(1)" slide
