【CS229 lecture19】微分动态规划

首先声明一下,这节课基本没听懂,但是还是把课程笔记写下。

lecture19 微分动态规划
继续强化学习算法的讨论
Agenda:
【CS229 lecture19】微分动态规划_第1张图片

课程中段我曾讲过调试learning algorithm,今天再来将强化学习的部分:
【CS229 lecture19】微分动态规划_第2张图片
The motivating example is robotic control;
first build a simulator…
【CS229 lecture19】微分动态规划_第3张图片
debugging a RL algorithm(这只是在直升机上一个例子,事实上,一般的ML问题中你总是要自己提出diagnostic approach去debug your algorithm,方法都不唯一。)

下面是LQR,微分动态规划
recap:
【CS229 lecture19】微分动态规划_第4张图片
one specific way to take a system and come up with a linear model for it is if you have some simulator, say…(下面是提出一个线性模型并用DP算法反向迭代的公式)

最后一件事:黑板左上角

what’s more, there is one special property in LQR.(虽然挺不太懂,但是我把Andrew的原话抄一下,同样利用上一个图片)
从黑板左上角来看,pi的获取与psi无关(因为Lt与psi无关);
而黑板下面可以看到,phi的更新也与psi无关;
再看黑板最下面,sigma矩阵只在psi更新中出现;
也就是说,我们可以不管psi而得到optimal policy,
或者说对于St+1=At*St+Bt*at+wt
我们得到的optimal policy 与噪声(wt)的幅度无关。
这个性质如果脱离了线性动力学的LQR问题背景则不复存在。
等会讲卡尔曼滤波时还会利用到LQR的这个性质。

接下来讲 a specific way of applying LQR, that’s called differential dynamic programming微分动态规划
Differential Dynamic Programming
假如要去控制一架直升机。比如你有一个simulator St+1 = f(St,at) nonlinear, deterministric
比如有一些轨迹你的直升机需要遵循,here’s what DDP does:
首先得到标称轨迹 nominal trajectory(使用很差的控制器得到) 然后在其附近线性化
This will actually be the first time that I’ll make explicit use of the ability of LQR or these finer horizon problems to handle non-stationery dynamics.
【CS229 lecture19】微分动态规划_第5张图片
然后又得到一个new nominal trajectory,然后又线性化…and repeat
这正是我们在直升机上做的,而且在很多问题上都能很好地应用。
【CS229 lecture19】微分动态规划_第6张图片

这就是DDP,下一讲将会讲一个DDP的实例。。。。。。不明觉厉??

最后我想要讲的是Kalman filter and LQG control (linear quadratic gaussian control)
Remember all I talk previously, I’ve been assuming that every time step you know what the state of the system is, and so you can compute a policy to some function of the state is in.(for instance, in LQR the action we take is Lt*St). But what I want to do now is to talk about the different type of problem where you don’t get to observe the state explicitly…let’s first forget control for now, and consider some dynamic systems where we can’t observe the state explicitly, and later we’ll tie this back to controlling system.
As a concrete example, let’s consider using a radar to track (从A矩阵的构成就可以看出来)the helicopter(using extremely simplified model)…

根据直升机飞行轨迹的observations(在states的基础上加了高斯噪声)寻找实际states的最佳估计。可以用之前说到的factor analysis等得到一个对于states的很好的估计,但是随着observations的增多,sigma矩阵等的维数会迅速增大,因此这虽然理论上是合理的,但在计算上却是inefficient的。
【CS229 lecture19】微分动态规划_第7张图片
相反,Kalman 滤波算法 can efficiently do this.
Just on the side, if you remember Dan’s discussion section on HMM’s the Kalman filter model turns out to actually be a hidden Markov model. If you remember Dan’s kind of section of the hidden Markov model, this linear dynamical system with observations is actually an HMM problem.(notations will be a bit different.见下图左下角 In Dan’s section, z and x are used to represented states and observations)
What I am going to do turns out to be a hidden Markov model with continuous states rather than discrete states.
Here’s is the outline of Kalman filter algorithm:
first step: predict second: update… (看不懂!!!!!!????)
【CS229 lecture19】微分动态规划_第8张图片
【CS229 lecture19】微分动态规划_第9张图片

Andrew解释为什么kalman在计算上更小, 图片上面是用来解释原先的计算复杂度,下面是大概解释kalman的原理,即利用y1得到p(s1|y1),再利用y2和p(s1|y1)得到p(s2|y1,y2)…不明觉厉??
【CS229 lecture19】微分动态规划_第10张图片
最后,综合以上:putting kalman filters and LQR control together, you get an algorithm called LQG control. So we have a linear dynamical system I want to control St+1=A*St+B*at+wt…, and I don’t get to observe the states directly. I only get to observe these variables yt. It turns out you can solve the LQG problem as follows(图片下半部分) 不明觉厉????

以上! 感觉跟没上一样,什么都没学到 -_-

你可能感兴趣的:(算法,动态规划)