In this article, I will do my best to explain the DQN algorithm.
We have witnessed the power of deep learning at solving high-dimensional perception problems and the strength of reinforcement learning at decision-making. Combining the two is a natural idea, and it gave rise to deep reinforcement learning. DQN, first proposed by DeepMind, is a representative deep reinforcement learning algorithm.
Meanwhile, the basic assumption behind reinforcement learning is that the agent can gain a deeper understanding of the environment by interacting with it, and that its ability to reach the goal improves over repeated interactions.
A Markov Decision Process (MDP) satisfies the Markov property: the next state is determined only by the current state and the current action, not by the history.
An MDP consists of five components:
S: the finite set of states
A: the finite set of actions
P: the state transition probability matrix
R: the reward function
γ: the discount factor
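As a concrete illustration, an MDP can be written down directly as Python data structures. The two-state MDP below is a made-up toy example of my own, not taken from any paper:

```python
# A toy 2-state MDP (hypothetical example for illustration).
# States: "cool", "hot"; actions: "work", "rest".
states = ["cool", "hot"]
actions = ["work", "rest"]

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    "cool": {"work": [("cool", 0.7), ("hot", 0.3)],
             "rest": [("cool", 1.0)]},
    "hot":  {"work": [("hot", 1.0)],
             "rest": [("cool", 0.6), ("hot", 0.4)]},
}

# R[s][a] is the expected immediate reward for taking a in s.
R = {
    "cool": {"work": 2.0, "rest": 1.0},
    "hot":  {"work": -1.0, "rest": 0.0},
}

gamma = 0.9  # discount factor

# Sanity check: transition probabilities from each (s, a) sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```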
We introduce the value function to measure the long-term reward and to evaluate a policy, so that we can make better choices.
The value function can be rewritten as the Bellman equation, and the standard Bellman equation can be solved by iteration.
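In standard notation (this is the textbook form of the definition and the Bellman equation, not transcribed from the original post's figures):

```latex
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_{0}=s\right] \\
V^{\pi}(s) &= \sum_{a} \pi(a\mid s)\Big(R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V^{\pi}(s')\Big)
\end{aligned}
```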
Overall, there are three classical ways to estimate the value function: dynamic programming, Monte Carlo methods, and temporal-difference learning.
We also want to analyze the value of each action in the current state, which gives rise to the action-value function.
We can rewrite the action-value function in a recursive form.
We can then normalize it into this form:
Substituting this back into the action-value function, we get:
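In standard notation, the action-value function and its Bellman form are (again the textbook equations, reconstructed rather than copied from the original figures):

```latex
\begin{aligned}
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_{0}=s,\; a_{0}=a\right] \\
Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) \sum_{a'} \pi(a'\mid s')\, Q^{\pi}(s',a')
\end{aligned}
```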
Widely used iterative methods for solving the Bellman equation can be categorized into policy iteration and value iteration.
It can be proven theoretically that the policy π converges to the optimal policy.
Value iteration is based on the Bellman optimality equation, converted into an iterative form.
Policy iteration updates the value using the plain Bellman equation; the resulting value vπ is the value under the current policy, i.e. an evaluation of that particular policy.
Value iteration updates the value using the Bellman optimality equation; the resulting value is the optimal value at the current state.
Value iteration is more direct when the goal is to obtain the optimal value.
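A minimal value-iteration sketch on a toy MDP (the 2-state example and names like `value_iteration` are my own assumptions, not from any paper):

```python
# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
gamma = 0.9
P = {  # P[s][a] = list of (next_state, probability); a made-up 2-state MDP
    "cool": {"work": [("cool", 0.7), ("hot", 0.3)], "rest": [("cool", 1.0)]},
    "hot":  {"work": [("hot", 1.0)], "rest": [("cool", 0.6), ("hot", 0.4)]},
}
R = {"cool": {"work": 2.0, "rest": 1.0}, "hot": {"work": -1.0, "rest": 0.0}}

def value_iteration(P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup over all actions
            q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                 for a in P[s]}
            new_v = max(q.values())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = value_iteration(P, R, gamma)
# Extract the greedy policy from the converged values.
policy = {s: max(P[s], key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in P[s][a]))
          for s in P}
```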
The basic idea of Q-learning comes from value iteration: in principle we would update the Q value for every state and action in each iteration. But since we only have a limited number of samples, Q-learning proposes a different way to update the Q value:
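The Q-learning update rule in its standard form, with learning rate α (this is the textbook formula):

```latex
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]
```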
Like gradient descent, this update gradually diminishes the error, and it converges to the optimal Q value.
We can combine exploration and exploitation by setting a fixed threshold ε; this method is called the ε-greedy policy.
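Putting the update rule and ε-greedy together, here is a tabular Q-learning sketch on a toy corridor environment (the environment, hyperparameters, and names are my own assumptions for illustration):

```python
import random

random.seed(0)

# Toy corridor: states 0..4, action 0 = left, 1 = right; reward 1 at state 4.
N_STATES, ACTIONS = 5, [0, 1]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic dynamics: move left/right, reward on reaching the end."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

def epsilon_greedy(s):
    if random.random() < epsilon:                       # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])        # exploit

for _ in range(500):                                    # training episodes
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])       # Q-learning update
        s = s2

# Greedy action in each non-terminal state after training.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
```

After training, the greedy policy walks right toward the reward in every state.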
We store Q(s, a) in a table covering all states and actions. But when we deal with image inputs, the number of states grows exponentially and the table becomes intractable. So we need to rethink how to represent the value function.
First, we introduce Value Function Approximation.
To reduce the dimensionality, we approximate the value function with another function. For example, we may use a linear function, like this:
Q(s, a) = f(s, a) = w1·s + w2·a + b
Thus Q(s, a) is approximated by f(s, a; w).
As is often the case, the action space is much smaller than the state space. To update the Q values more efficiently, we make the function take only the state as input, in this form:
Q(s) ≈ f(s, w), where f(s, w) outputs a vector [Q(s, a1), ..., Q(s, an)].
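A minimal sketch of this vector-valued form, using a linear map as a stand-in for the real network (the shapes and names below are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 4, 2
W = rng.normal(size=(N_ACTIONS, STATE_DIM))   # learnable parameters w
b = np.zeros(N_ACTIONS)

def q_values(s, W, b):
    """f(s, w): map a state vector to the vector [Q(s,a1), ..., Q(s,an)]."""
    return W @ s + b

s = rng.normal(size=STATE_DIM)
q = q_values(s, W, b)              # one forward pass gives all action values
best_action = int(np.argmax(q))    # the greedy action needs a single evaluation
```

The benefit is that choosing the greedy action requires only one forward pass, instead of one evaluation per action.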
Training a deep neural network is essentially an optimization problem. The optimization target is the loss function, which measures the deviation between the labels and the outputs; minimizing the loss means minimizing this deviation. We need many samples to train the network parameters by gradient descent via backpropagation.
Following this idea, we treat the target Q value as the label, and train the network so that the predicted Q value approaches the target Q value.
Thus, the training loss function is:
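In its standard form (for naive DQN, both the target and the prediction use the same weights w):

```latex
L(w) = \mathbb{E}\!\left[\left(\, r + \gamma \max_{a'} Q(s',a';w) - Q(s,a;w) \,\right)^{2}\right]
```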
The basic idea of naive DQN is to run Q-learning and SGD training together: store all the transitions and then sample from them randomly, which is what we call experience replay. In effect, the agent learns by replaying its past experience.
Run the agent for several episodes and store all the data; once a considerable amount has accumulated, perform SGD on randomly sampled minibatches.
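The storage-and-sampling part can be sketched as a small replay buffer (the class and names are my own, not from the DQN paper):

```python
import random
from collections import deque

random.seed(0)

class ReplayBuffer:
    """A minimal experience-replay buffer (illustrative sketch)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform random minibatch; this breaks temporal correlation."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(100):                   # store transitions from interaction
    buf.push(t, t % 2, 0.0, t + 1, False)
batch = buf.sample(32)                 # later: one SGD step on this batch
```

Sampling uniformly from the buffer, rather than learning from consecutive transitions, is what decorrelates the training data.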
There are several methods for improving the efficiency of DQN. Double DQN, Prioritized Replay, and Dueling Network are three representative ones.
Nature DQN refers to the method described in the DeepMind paper (Human Level Control …). It is also based on experience replay; the difference from naive DQN is that Nature DQN introduces a target Q network, like this:
To reduce the correlation between the target Q value and the current Q value, they designed a target Q network with delayed updates: its parameters are refreshed only after the online network has been trained for a while.
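A sketch of the delayed-update idea with a linear Q-function standing in for the deep network (all shapes, hyperparameters, and the fake transitions below are my own assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 4, 2
gamma, lr, COPY_EVERY = 0.9, 0.01, 100

W = rng.normal(size=(N_ACTIONS, STATE_DIM))   # online network parameters
W_target = W.copy()                           # target network: a delayed copy

def q(s, weights):
    return weights @ s

for step in range(1000):
    # Fake transition, just to exercise the update (a real agent would
    # sample these from a replay buffer).
    s = rng.normal(size=STATE_DIM)
    a = int(rng.integers(N_ACTIONS))
    r = float(rng.normal())
    s2 = rng.normal(size=STATE_DIM)

    # The target uses the frozen W_target, not the online W.
    y = r + gamma * np.max(q(s2, W_target))
    td_error = y - q(s, W)[a]
    W[a] += lr * td_error * s                 # gradient step on (y - Q)^2

    if (step + 1) % COPY_EVERY == 0:
        W_target = W.copy()                   # periodic delayed update
```

Freezing the target parameters between copies keeps the regression target stable while the online network moves.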
Double DQN, Prioritized Replay, and Dueling Network will be discussed later, after I have read those papers.
I will also give a brief summary of policy gradient methods and the A3C family of deep reinforcement learning algorithms.