A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients


The stochastic process to be controlled is described by the state transition probability density function f.
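In this continuous-state formulation (notation assumed here), f(x, u, x′) gives the density of the next state x′ when action u is taken in state x, so that for any region X′ of the state space

P(x_{k+1} ∈ X′ | x_k = x, u_k = u) = ∫_{X′} f(x, u, x′) dx′.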




Once the first transition to a next state has been made, π governs the rest of the action selection. The relationship between these two definitions of the value function is given below.
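In a standard form (assuming the stochastic policy is written as a density π(x, u) over actions and the state–action value function as Q^π; the survey's exact symbols may differ), this relationship reads

V^π(x) = ∫_U π(x, u) Q^π(x, u) du.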


With some manipulation, (2) and (3) can be put into a recursive form [18]. For the state value function, the resulting recursion is given below.
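In a generic form (with r_{k+1} denoting the reward received on the transition, γ ∈ [0, 1) the discount factor, and the expectation taken over u_k ∼ π(x_k, ·) and x_{k+1} ∼ f(x_k, u_k, ·); symbols assumed here), it reads

V^π(x) = E{ r_{k+1} + γ V^π(x_{k+1}) | x_k = x }.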



These recursive relationships are called Bellman equations [7].
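The state–action value function satisfies an analogous recursion, written here in the same assumed notation, with the next action drawn from π:

Q^π(x, u) = E{ r_{k+1} + γ Q^π(x_{k+1}, u_{k+1}) | x_k = x, u_k = u }.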



B. Average Reward


The Bellman equations for the average reward—in this case also called the Poisson equations [20]—are given below.
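In a generic form (with J(π) denoting the average reward per time step under π, and V^π, Q^π now denoting differential value functions; symbols assumed here), they read

V^π(x) + J(π) = E{ r_{k+1} + V^π(x_{k+1}) | x_k = x },
Q^π(x, u) + J(π) = E{ r_{k+1} + Q^π(x_{k+1}, u_{k+1}) | x_k = x, u_k = u }.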




Note that (8) and (9) both require the value J(π), which is unknown and hence needs to be estimated in some way. The Bellman optimality equations, describing an optimum for the average reward case, are given below.
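In the same assumed notation, with J* the optimal average reward and V* the corresponding optimal differential value function, a standard form is

V*(x) + J* = max_u E{ r_{k+1} + V*(x_{k+1}) | x_k = x, u_k = u }.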



III. ACTOR-CRITIC IN THE CONTEXT OF REINFORCEMENT LEARNING





A prerequisite for this is that the critic is able to accurately evaluate a given policy. In other words, the goal of the critic is to find an approximate solution to the Bellman equation for that policy. The difference between the right-hand and left-hand sides of the Bellman equation, whether it is the one for the discounted reward setting (4) or the average reward setting (8), is called the TD error and is used to update the critic. Using the function approximator of the critic and a transition sample (x_k, u_k, r_{k+1}, x_{k+1}), the TD error is estimated as shown below.
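With a parameterized critic V_θ (the symbol θ for the critic parameters is assumed here), a standard estimate in the discounted setting is

δ_k = r_{k+1} + γ V_{θ_k}(x_{k+1}) − V_{θ_k}(x_k),

while in the average reward setting the discount factor disappears and the current estimate Ĵ_k of J(π) is subtracted from the reward instead:

δ_k = r_{k+1} − Ĵ_k + V_{θ_k}(x_{k+1}) − V_{θ_k}(x_k).

The critic is then typically updated using this error, e.g. θ_{k+1} = θ_k + α_c δ_k ∇_θ V_{θ_k}(x_k), with α_c a learning rate.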








This clearly shows the relationship between the policy gradient ∇_ϑ J and the critic function Q^π(x, u), and it ties together the update equations of the actor and the critic in the templates (19) and (20).
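A standard statement of the underlying policy gradient theorem (with π_ϑ the parameterized stochastic policy and the expectation taken over the states and actions generated by π_ϑ; notation assumed here) is

∇_ϑ J = E{ ∇_ϑ log π_ϑ(x, u) Q^π(x, u) }.

This is why actor updates can be driven by the critic's estimate of Q^π, or by the TD error δ_k, whose expected value given (x, u) equals the advantage Q^π(x, u) − V^π(x) when the critic is exact; subtracting the baseline V^π(x) does not change the expectation of the gradient estimate.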




The product δ_k z_{a,k} in Equation (29e) can then be interpreted as the gradient of the performance with respect to the policy parameter.
Although no use was made of advanced function approximation techniques, good results were obtained. A mere division of the state space into boxes meant that there was no generalization among the states, indicating that learning speeds could definitely be improved upon. Nevertheless, the actor-critic structure itself remained, and later work largely focused on better representations for the actor and on the computation of the critic.
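The same update pattern, in which the critic and the actor are both driven by the TD error, recurs throughout the algorithms surveyed here. The sketch below is a minimal illustration only, assuming linear function approximation over a feature vector phi(x) and a Gaussian exploration policy; the environment interface env, the feature map phi, and the step sizes are illustrative placeholders rather than choices taken from any specific algorithm in the survey.

```python
import numpy as np

def actor_critic(env, phi, n_features, n_episodes=200,
                 gamma=0.97, lam=0.9, alpha_c=0.1, alpha_a=0.01, sigma=0.5):
    """Generic actor-critic with eligibility traces (illustrative sketch)."""
    theta = np.zeros(n_features)     # critic parameters: V(x) ~ theta . phi(x)
    vartheta = np.zeros(n_features)  # actor parameters: mean action = vartheta . phi(x)

    for _ in range(n_episodes):
        x = env.reset()
        z_c = np.zeros(n_features)   # critic eligibility trace
        z_a = np.zeros(n_features)   # actor eligibility trace
        done = False
        while not done:
            feat = phi(x)
            mean_u = vartheta @ feat
            u = mean_u + sigma * np.random.randn()      # Gaussian exploration
            x_next, r, done = env.step(u)

            # TD error: difference between the two sides of the Bellman equation
            v = theta @ feat
            v_next = 0.0 if done else theta @ phi(x_next)
            delta = r + gamma * v_next - v

            # Decaying eligibility traces; the actor trace accumulates the
            # score function grad_vartheta log pi(u | x) of the Gaussian policy
            z_c = gamma * lam * z_c + feat
            z_a = gamma * lam * z_a + (u - mean_u) / sigma**2 * feat

            # Critic and actor are updated from the same TD error;
            # delta * z_a plays the role of the policy gradient estimate
            theta = theta + alpha_c * delta * z_c
            vartheta = vartheta + alpha_a * delta * z_a
            x = x_next
    return theta, vartheta
```

The product delta * z_a in the actor update corresponds to the interpretation of δ_k z_{a,k} as a gradient estimate discussed above.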


The update of the actor in Equation (31d) uses the policy gradient estimate from Theorem 2.
The update equations for the average cost and the critic are the same as Equations (31a) and (31c), but the actor update takes a slightly different form.

V. NATURAL GRADIENT ACTOR-CRITIC ALGORITHMS




The natural gradient clearly performs better, as it always finds the optimal point, whereas the standard gradient generates paths leading to points in the space that are not even feasible, because the radius must remain positive.


As shown above, the crucial property of the natural gradient is that it takes into account the structure of the manifold over which the cost function is defined, locally characterized by the Riemannian metric tensor.
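In generic terms (with G(ϑ) denoting the Riemannian metric tensor of the parameter manifold; notation assumed here), the natural gradient of a performance measure J is

∇̃_ϑ J(ϑ) = G(ϑ)^{-1} ∇_ϑ J(ϑ),

i.e., the steepest direction measured with respect to the local metric of the manifold rather than the Euclidean distance in parameter space.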
To apply this insight in the context of gradient methods, the main questions are what an appropriate manifold is and, once that is known, what its Riemannian metric tensor is. For a manifold of probability distributions, the Riemannian metric tensor is the so-called Fisher information matrix (FIM) [75], which for the parameterized policy above takes the form given below.
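In one common form (with the expectation taken over the states and actions generated by the policy; notation assumed here), the FIM is

F(ϑ) = E{ ∇_ϑ log π_ϑ(x, u) ∇_ϑ log π_ϑ(x, u)^⊤ },

and the natural policy gradient is obtained by using F(ϑ) as the metric G(ϑ) above, i.e., ∇̃_ϑ J = F(ϑ)^{-1} ∇_ϑ J. A well-known consequence, exploited by natural actor-critic algorithms, is that when the critic uses the compatible features ∇_ϑ log π_ϑ(x, u), the natural gradient estimate reduces to the critic's own parameter vector, so the explicit inversion of F(ϑ) can be avoided.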

