Reinforcement Learning Week 5 Course Notes

Three things this week: watch the Convergence lecture videos, read Littman and Szepesvari (1996), and do Homework 4.

[Figure 1]
[Figure 2]
Learning without control:
  • No actions by the learner. The value of a state equals its reward R(s) plus the discounted expected value of the next state: V(s) = R(s) + γ Σ_{s'} T(s, s') V(s'), where T(s, s') is the transition function.
  • TD(0) updates the estimated value of the state the learner just left, using the observed reward and the estimated value of the state it arrived in (see the sketch below).
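A minimal sketch of this TD(0) update, assuming a tabular value function; the names `V`, `alpha`, and `gamma` are illustrative, not from the lecture.

```python
# TD(0): nudge the value of the state we just left toward the observed
# reward plus the discounted value of the state we arrived in.

def td0_update(V, s_prev, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup for the transition (s_prev, r, s_next)."""
    td_target = r + gamma * V[s_next]              # one-step look-ahead
    V[s_prev] += alpha * (td_target - V[s_prev])   # move toward the target
    return V

# Example: a transition from state 0 to state 1 with reward 1.
V = {0: 0.0, 1: 0.0}
V = td0_update(V, s_prev=0, r=1.0, s_next=1)
print(V[0])  # 0.1
```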
[Figure 3]
Learning with control (actions):

The Q-function estimates the value of the current state and action: the learner takes an action, receives a reward, and ends up in a new state, giving the experience tuple ⟨s_{t-1}, a_{t-1}, r_t, s_t⟩. The Q-learning update rule handles two approximations at once: 1) if the model were known, it could be used to update Q; 2) if Q* were known, it could also be used to update Q.

And both will converge.
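A minimal sketch of the Q-learning update on one experience tuple ⟨s_{t-1}, a_{t-1}, r_t, s_t⟩, assuming a tabular Q-function; `Q`, `alpha`, and `gamma` are illustrative names.

```python
# Q-learning: move Q(s_prev, a_prev) toward the observed reward plus the
# discounted value of the best action in the new state.

def q_learning_update(Q, s_prev, a_prev, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Update Q for the experience tuple (s_prev, a_prev, r, s_next)."""
    best_next = max(Q[(s_next, a)] for a in actions)        # max_a' Q(s_t, a')
    td_target = r + gamma * best_next
    Q[(s_prev, a_prev)] += alpha * (td_target - Q[(s_prev, a_prev)])
    return Q

# Example with two states, two actions, and all Q-values initialised to zero.
actions = [0, 1]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
Q = q_learning_update(Q, s_prev=0, a_prev=1, r=1.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # 0.1
```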

[Figure 4]
Quiz 1: Bellman Operator
  • See value iteration here
[Figure 5]
Contraction mapping definition: B is a contraction mapping if applying B to any two functions makes the distance between them strictly smaller (by a constant factor less than 1) than the distance between the original functions.
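Written out, a standard way to state the definition (in the max norm, which is the norm used for the Bellman operator argument below) is:

```latex
\exists\, \gamma < 1 \ \text{such that}\ \forall F, G:\quad
\lVert BF - BG \rVert \;\le\; \gamma\, \lVert F - G \rVert
```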
[Figure 6]
Quiz 2:
  • Practically, if B multiplies its input by a constant in [0, 1), then B is a contraction mapping.
  • Take the first option for example: ||B(x) − B(y)|| = ||x/2 − y/2|| = (1/2)||x − y||, so γ = 1/2 < 1 satisfies ||B(x) − B(y)|| ≤ γ||x − y||. Therefore B(x) = x/2 is a contraction mapping.
[Figure 7]
Contraction Properties
  • ① and ② are true, so with BF_{t-1} = F_t, the sequence F_t converges to F* through value iteration (see the sketch after this list).

  • If there were two fixed points, G* and F*, putting them into the B operator would not change the distance between them, because both are fixed; this violates the contraction definition of B, so the fixed point is unique.
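A tiny numerical illustration, not from the lecture, of why iterating a contraction converges: B(x) = x/2 + 1 contracts distances by 1/2, and repeated application drives any starting point to its unique fixed point x* = 2.

```python
# Iterating a contraction mapping converges to its unique fixed point.
# B(x) = x/2 + 1 halves the distance between any two points; its fixed
# point is x* = 2 (since 2 = 2/2 + 1).

def B(x):
    return x / 2 + 1

x = 10.0                     # arbitrary starting point
for t in range(10):
    x = B(x)                 # F_t = B(F_{t-1})
    print(t, x)              # the gap to 2 halves at every step
```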

[Figure 8]
  • Apply the B operator to the Bellman equation and unpack ||BQ1 − BQ2||.
  • We cannot combine the two max over a' terms, because the maximizing a' is not necessarily the same for Q1 and Q2.
[Figure 9]
  • We stop here for now and will come back to it; call this the unfinished proof.
[Figure 10]
Quiz 3
[Figure 11]
  • Max is a non-expansion: |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|. Using this, we can finish the unfinished proof (see the derivation below).
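A sketch of how the non-expansion property finishes the unfinished proof that the Bellman operator is a contraction (my reconstruction, written in the max norm):

```latex
\begin{aligned}
\lVert BQ_1 - BQ_2 \rVert
&= \max_{s,a} \Big| \gamma \sum_{s'} T(s,a,s')
     \big( \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \big) \Big| \\
&\le \gamma \max_{s,a} \sum_{s'} T(s,a,s')
     \big| \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \big| \\
&\le \gamma \max_{s,a} \sum_{s'} T(s,a,s')
     \max_{a'} \big| Q_1(s',a') - Q_2(s',a') \big|
     \quad \text{(max is a non-expansion)} \\
&\le \gamma\, \lVert Q_1 - Q_2 \rVert .
\end{aligned}
```

Since γ < 1, B is a contraction, so it has a unique fixed point Q* and repeated application converges to it.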
[Figure 12]
Statement of the theorem

Three properties must hold for Q_t to converge to Q*, and they do hold.

So, Q-learning converges.

[Figure 13]

B_t is the operator we use to update Q at time step t. Q(s,a) is the Q-value of the state we just left, and the second Q-function, w, supplies the Q-value of the state we just arrived in. In the regular Q-learning update, Q and w are the same function; here the theorem separates them.
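One way to write this update with the two Q-functions separated (my notation, following the description above): the state we just left is looked up in Q, while the one-step look-ahead uses w.

```latex
B_t[Q, w](s_{t-1}, a_{t-1})
  \;=\; (1 - \alpha_t)\, Q(s_{t-1}, a_{t-1})
  \;+\; \alpha_t \big( r_t + \gamma \max_{a'} w(s_t, a') \big)
```

Ordinary Q-learning applies B_t[Q, Q]; the theorem analyses what happens when the two arguments are varied independently.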

So, property one says that if we know Q* and use the update rule (in pink) with w = Q*, the expected value of the one-step look-ahead can be computed and the stochasticity averages out; in expectation, the update leaves Q* fixed.

[Figure 14]

Property two: if we hold Q(s,a) fixed and vary only the way we compute the one-step look-ahead, the distance between Q and Q* can only shrink with each update.

The third condition is the usual learning-rate condition, which is what the stochastic Bellman-equation updates need in order to converge (see below).
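The standard learning-rate conditions for this kind of stochastic approximation are:

```latex
\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^{2} < \infty, \qquad 0 \le \alpha_t \le 1 .
```

For example, α_t = 1/t satisfies both sums.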

[Figure 15]
Quiz 4: 1. Decision making on estimated value based on the best next action (the regular MDP). 2. The environment puts you in the worst possible state and you choose the best next action given that state (risk-averse). 3. The value is an expected value, but the action to take is chosen based on the ranking of the actions (exploration-sensitive: min, max, mediocre). 4. The zero-sum game.
  • The take-home message is that these generalized MDPs all converge.
  • So the correct answers are (from top to bottom): worst-possible-case decision making, non-optimal decision making, and two agents competing (the zero-sum game). See the sketch after this list.
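A hedged sketch of how three of these cases can be written as generalized Bellman operators, in the spirit of Littman and Szepesvari (1996); the exact notation on the slide may differ, and the rank-based case is omitted:

```latex
% Regular (expected) MDP: best action, expectation over next states
V(s) = \max_a \sum_{s'} T(s,a,s')\, \big[ R(s,a,s') + \gamma V(s') \big]

% Risk-averse / worst case: best action against the worst reachable next state
V(s) = \max_a \min_{s':\, T(s,a,s') > 0} \big[ R(s,a,s') + \gamma V(s') \big]

% Zero-sum game: max over our action, min over the opponent's action
V(s) = \max_a \min_o \sum_{s'} T(s,a,o,s')\, \big[ R(s,a,o,s') + \gamma V(s') \big]
```

The reason they all converge is the same as before: as long as the operators replacing max and expectation are non-expansions, the generalized Bellman operator remains a contraction.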
[Figure 16]
Recap

A generalized MDP can be seen as redefining the fixed point. "Contraction" might correspond roughly to "收敛" (convergence) in Chinese. Q-learning converges to Q*. The generalized convergence theorem uses two Q-functions to prove convergence of the Bellman-equation updates.
