[Chapter 6] Reinforcement Learning (4) Policy Search

In the previous sections, we tried to learn the utility function or, more commonly, the action-value function (Q-function), and then greedily selected the action with the highest Q-value:

$$\pi(s) = \arg\max_a Q(s, a)$$

This means that once the Q-function has been learned well, we can read an optimal policy directly from it. All the methods so far therefore learn the Q-function, directly or indirectly. Policy search, by contrast, updates the policy function directly.
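As a small illustration of the greedy selection described above, the sketch below reads a policy out of a Q-table. The table `Q` and its values are hypothetical, made up only for this example.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-table: Q[state] is a list of Q-values, one per action.
Q = {
    "s0": [0.1, 0.7, 0.2],
    "s1": [0.5, 0.4, 0.9],
}

# The greedy policy maps every state to its best action index.
policy = {state: greedy_action(q_values) for state, q_values in Q.items()}
```

Here the policy is never stored or learned on its own; it is fully determined by the Q-table, which is exactly what policy search changes.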

Policy Search

Using function approximation, we can write the policy function as:

$$\pi_\theta(s) = \arg\max_a \hat{Q}_\theta(s, a)$$

As a mapping from states to actions, the policy function is itself a function with parameters $\theta$ to learn. Policy search adjusts $\theta$ to improve the policy directly, without needing an accurate estimate of the Q-values or utilities.

However, the formula above raises two main problems that we need to solve first:

  • The arg max operation is not differentiable, which makes gradient-based search difficult.
  • In environments with discrete actions, the output of the function is discrete, so the policy changes discontinuously as the parameters change.

In fact, one method solves both problems: treat action selection as a classification problem. When the agent selects an action, it picks the action with the highest Q-value in the current state; in a classification problem, the model predicts a probability for each class and outputs the class with the highest probability. The two are essentially the same task. For classification we use the softmax function, and we can use it here as well:

$$\pi_\theta(s, a) = \frac{e^{\hat{Q}_\theta(s, a)}}{\sum_{a'} e^{\hat{Q}_\theta(s, a')}}$$

where $\hat{Q}_\theta(s, a)$ is the parameterized Q-value estimate for action $a$ in state $s$.

Given a state $s$, the model assigns it to a class, and that class indicates which action to execute (the one with the highest Q-value).
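The softmax step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the author's implementation; the function names and the example Q-values are assumptions.

```python
import math
import random


def softmax_policy(q_values):
    """Turn per-action scores into a probability distribution over actions."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical Q-value estimates for three actions in one state.
probs = softmax_policy([1.0, 2.0, 0.5])
action = sample_action(probs)
```

Unlike arg max, the output probabilities vary smoothly with the scores, which is what makes gradient-based search possible.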

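Using gradient ascent, we get the parameter update formula for an action $a$ sampled from the policy in state $s$, where $\alpha$ is the learning rate:

$$\theta \leftarrow \theta + \alpha \, Q(s, a) \, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$$

Because $\nabla_\theta \log \pi_\theta(s, a) = \nabla_\theta \pi_\theta(s, a) / \pi_\theta(s, a)$, the same update can be written in logarithmic form:

$$\theta \leftarrow \theta + \alpha \, Q(s, a) \, \nabla_\theta \log \pi_\theta(s, a)$$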
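To make the update concrete, here is a minimal sketch of one gradient step in logarithmic form, $\theta \leftarrow \theta + \alpha \, Q(s, a) \, \nabla_\theta \log \pi_\theta(s, a)$. It assumes a tabular softmax policy in which the parameters for a state are simply one score per action; in that case the gradient of $\log \pi_\theta(s, a)$ with respect to the score of action $b$ is $1[a = b] - \pi_\theta(s, b)$. The function names, the learning rate, and the Q-value are illustrative assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta_s, action, q_value, alpha=0.1):
    """One update of the per-action scores theta_s for a single state.

    Assumes a tabular softmax policy, so that
    d/d theta_s[b] log pi(a) = 1[a == b] - pi(b).
    """
    probs = softmax(theta_s)
    return [
        t + alpha * q_value * ((1.0 if b == action else 0.0) - probs[b])
        for b, t in enumerate(theta_s)
    ]


theta_s = [0.0, 0.0, 0.0]  # start from a uniform policy
new_theta = policy_gradient_step(theta_s, action=1, q_value=2.0)
```

With a positive Q-value, the score of the sampled action rises and the others fall, so the action becomes more likely the next time this state is visited.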

Variance Reduction using a Baseline

Another technique uses a baseline to reduce the variance of the gradient estimate: replace $Q(s, a)$ with $Q(s, a) - b(s)$. A natural choice for the baseline is the state value $V(s)$, which gives the advantage function:

$$A(s, a) = Q(s, a) - V(s)$$
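The effect of the baseline can be seen in a small numeric sketch. It assumes the state value is the policy-weighted average of the Q-values, $V(s) = \sum_a \pi(s, a) Q(s, a)$; the probabilities and Q-values below are made up for illustration.

```python
def state_value(probs, q_values):
    """V(s) as the expected Q-value under the current policy."""
    return sum(p * q for p, q in zip(probs, q_values))


def advantages(probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action a."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]


# Hypothetical policy and Q-values for one state with three actions.
probs = [0.2, 0.5, 0.3]
q_values = [10.0, 12.0, 11.0]
adv = advantages(probs, q_values)  # roughly [-1.3, 0.7, -0.3]
```

All three Q-values are large and positive, so without a baseline every sampled action would be reinforced. The advantages are centered around zero instead: only the better-than-average action gets a positive update signal.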

Actor Critic

The actor-critic algorithm combines Q-function-based learning with policy search. It has two outputs: one, called the actor, learns a policy that selects actions; the other, called the critic, learns a value function or Q-function that is used only for evaluation. This splits policy evaluation and policy improvement into two parts, which are executed alternately.
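The alternation between evaluation and improvement can be sketched as a single tabular update step. This is one common variant, not necessarily the one the author has in mind: it assumes the critic learns a state-value table `V`, and that the TD error $\delta = r + \gamma V(s') - V(s)$ serves as the advantage estimate for the actor. All names, step sizes, and the discount factor are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.5, gamma=0.9):
    """Apply one transition (s, a, r, s_next) to the critic and the actor."""
    # Critic (evaluation): TD error of the current value estimate.
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    probs = softmax(theta[s])  # policy before the update
    V[s] += alpha_critic * delta

    # Actor (improvement): policy-gradient step weighted by the TD error.
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * delta * grad_log_pi
    return delta


# Hypothetical two-state, two-action problem.
theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
V = {"s0": 0.0, "s1": 0.0}
delta = actor_critic_step(theta, V, s="s0", a=1, r=1.0, s_next="s1", done=False)
```

The critic never chooses actions; it only supplies `delta`, which tells the actor whether the action it took turned out better or worse than expected.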

In deep reinforcement learning (DRL), to save memory and training time, we usually let the two parts share the bottom layers, which extract features, and split the network at a higher layer.
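The shared-layer layout can be sketched as a tiny network in plain Python, forward pass only. The class name, layer sizes, tanh activation, and random initialization are all assumptions chosen for illustration; a real DRL model would use a deep learning framework and train these weights.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class SharedActorCritic:
    """Shared feature layer feeding an actor head and a critic head."""

    def __init__(self, n_features, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(n_hidden, n_features, rng)      # shared
        self.actor_head = init_layer(n_actions, n_hidden, rng)  # policy
        self.critic_head = init_layer(1, n_hidden, rng)         # value

    def forward(self, state):
        # Shared bottom layer: feature extraction, computed once.
        hidden = [math.tanh(h) for h in linear(state, *self.trunk)]
        # Actor head: softmax over action scores.
        logits = linear(hidden, *self.actor_head)
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Critic head: a single scalar value estimate.
        value = linear(hidden, *self.critic_head)[0]
        return probs, value


net = SharedActorCritic(n_features=4, n_hidden=8, n_actions=3)
probs, value = net.forward([0.1, -0.2, 0.3, 0.0])
```

Both heads read the same `hidden` features, so the feature-extraction weights are stored and evaluated once per state rather than twice.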
