Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO

https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf

https://zhuanlan.zhihu.com/p/29934206

 

 

 

 

The blue curve in the slide's figure is the lower bound.

 

 

 

The conjugate gradient method is used to solve the optimization problem.
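A minimal sketch of the conjugate gradient idea, assuming the problem has been reduced to a linear system A x = b with a symmetric positive-definite A (in TRPO-style methods, A plays the role of the Fisher information matrix and b the policy gradient). The function name `conjugate_gradient`, the toy matrix `A`, and the vector `g` are illustrative assumptions, not taken from the lecture.

```python
def conjugate_gradient(matvec, b, max_iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A.

    `matvec(v)` must return the product A v as a list; A itself is never formed.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, which equals b because x starts at zero
    p = list(b)  # first search direction is the residual
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        if rs_old < tol:  # squared residual norm is small enough, stop early
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy stand-in for the Fisher matrix and gradient (hypothetical values).
A = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
step_direction = conjugate_gradient(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A], g
)
```

Because only matrix-vector products are needed, the matrix never has to be stored or inverted explicitly, which is what makes this approach practical for large policy networks.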

 

 

Fisher information matrix, natural policy gradient
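As a reminder of how these two terms fit together, here is the standard textbook form, written as a sketch rather than a transcription of the slide. The symbols g (the policy gradient), F (the Fisher information matrix), and \delta (the bound on the KL step size) are notation chosen here and are assumptions, not taken from the original notes.

```latex
F = \mathbb{E}_{s,\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]

\overline{D}_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\theta + \Delta\theta}\right) \approx \tfrac{1}{2}\, \Delta\theta^{\top} F\, \Delta\theta

\theta_{\text{new}} = \theta + \sqrt{\frac{2\delta}{g^{\top} F^{-1} g}}\; F^{-1} g
```

The last line is the natural policy gradient step: it maximizes the linearized objective g^{\top} \Delta\theta subject to the quadratic approximation of the KL divergence staying below \delta.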

 

 

 

 

Goal: write down an optimization problem that we can solve more robustly, and with better sample efficiency, when updating the policy.

But L^IS (like L^PG) is unconstrained, so we use a KL divergence to ...
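A small sketch of the two quantities involved, assuming discrete action distributions: the importance-sampled surrogate L^IS (probability ratio times advantage, averaged over samples) and the mean KL divergence between the old and new policies that a trust-region method keeps small. The function name `surrogate_and_kl`, the toy probabilities, and the threshold `delta` are hypothetical values chosen for illustration.

```python
import math


def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (L_IS, mean KL(old || new)) over a batch of sampled states.

    old_probs / new_probs: per-state action distributions (lists of floats).
    actions: index of the action sampled under the old policy in each state.
    advantages: advantage estimate for each sampled (state, action) pair.
    """
    n = len(actions)
    surrogate_sum = 0.0
    kl_sum = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance ratio pi_new(a|s) / pi_old(a|s) reweights the old samples.
        surrogate_sum += (p_new[a] / p_old[a]) * adv
        kl_sum += sum(
            po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0
        )
    return surrogate_sum / n, kl_sum / n


# Hypothetical batch: two states, two actions each.
old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.4, 0.6]]
actions = [0, 1]
advantages = [1.0, 2.0]

surrogate, mean_kl = surrogate_and_kl(old, new, actions, advantages)
delta = 0.01  # hypothetical trust-region size
within_trust_region = mean_kl <= delta
```

In this toy batch the surrogate improves, but the mean KL exceeds `delta`, so a constrained update would have to take a smaller step.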

 

 

 

It is hard to choose a good value for the KL penalty coefficient beta.
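One common workaround is to adapt beta between updates instead of fixing it. The sketch below assumes the heuristic described for the KL-penalty variant of PPO: double beta when the measured KL overshoots the target, halve it when the KL undershoots. The function name `adapt_beta` and the default constants 1.5 and 2.0 are assumptions for illustration and may need tuning.

```python
def adapt_beta(beta, observed_kl, target_kl, tolerance=1.5, factor=2.0):
    """Return an updated KL penalty coefficient.

    If the policy moved too far (KL well above target), penalize more.
    If it barely moved (KL well below target), penalize less.
    """
    if observed_kl > target_kl * tolerance:
        return beta * factor
    if observed_kl < target_kl / tolerance:
        return beta / factor
    return beta


# Hypothetical measurements after three policy updates.
beta_after_large_step = adapt_beta(1.0, observed_kl=0.05, target_kl=0.01)
beta_after_small_step = adapt_beta(1.0, observed_kl=0.001, target_kl=0.01)
beta_after_ok_step = adapt_beta(1.0, observed_kl=0.01, target_kl=0.01)
```

With this scheme the initial beta matters less, since it is corrected quickly whenever the observed KL drifts away from the target.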

 

 

 

 

TRPO performs much worse than A3C on image-based games, where PPO does better.

See the slide "Limitations of TRPO".

 

 

 

Reposted from: https://www.cnblogs.com/ecoflex/p/8976876.html
