The Actor-Critic (AC) architecture goes back three to four decades: the idea was first proposed by Witten in 1977, and Barto, Sutton, and Anderson introduced the actor-critic architecture around 1983. AC combines value-based and policy-based methods. Value-based methods can update at every step of the game, but they only handle discrete action spaces; policy-based methods can handle both discrete and continuous actions, but they have to wait until the end of each episode before they can update. AC combines the advantages of both: it can handle continuous actions and still update at every single step.
Paper:
Witten(1977): An adaptive optimal controller for discrete-time Markov environments
Barto(1983): Neuronlike adaptive elements that can solve difficult learning control problems
Advantage Actor Critic (A2C): Actor-Critic Algorithms
Github:https://github.com/xiaochus/Deep-Reinforcement-Learning-Practice
Environment
- Python 3.6
- Tensorflow-gpu 1.8.0
- Keras 2.2.2
- Gym 0.10.8
Algorithm Principle
The structure of the AC algorithm is shown in the figure below. In AC, the policy network is the actor, which outputs the action (action selection); the value network is the critic, which evaluates how good the actor's chosen action is (action-value estimation) and produces a TD_error signal that guides the update of the actor network. Here we use DNN models as the function approximators.
The Actor-Critic workflow is as follows (the update rules are written out right after the list):
- The actor looks at the current state of the game and takes an action.
- The critic scores the actor's move based on both the state and the action.
- Based on the critic's score, the actor adjusts its policy (the parameters of the actor network), trying to do better next time.
- The critic adjusts its own scoring strategy (the parameters of the critic network) based on the reward given by the environment (which acts as a kind of ground truth) and the scores given by the other "judge" (the critic target).
At the start, the actor acts randomly and the critic scores randomly, but because the reward signal is always present, the critic's scoring becomes more and more accurate and the actor's behaviour keeps improving.
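Written out in standard actor-critic notation (with a state-value critic V, which is how the critic is used in the implementation below; r_t is the reward, γ the discount factor, π_θ the actor's policy), one step of this loop amounts to:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$$L_{\text{critic}} = \left(r_t + \gamma V(s_{t+1}) - V(s_t)\right)^2$$
$$L_{\text{actor}} = -\log \pi_\theta(a_t \mid s_t)\,\delta_t$$

The critic is regressed onto the TD target r_t + γV(s_{t+1}) with a squared error, while the actor increases the log-probability of the taken action in proportion to the TD error δ_t (and decreases it when δ_t is negative).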
The key question in the AC algorithm is how to use the critic to guide the actor's update. In the plain Policy Network (policy gradient) approach, we use the discounted reward of a whole episode to decide the direction in which the policy model is updated; in AC, that discounted reward is replaced by the value estimated by the critic (in the implementation below, the TD error computed from it). In AC, the critic's learning rate is set higher than the actor's, because we want the critic to learn faster than the actor so that it can reliably guide the direction of the actor's updates.
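As a quick numerical illustration of these two signals (the numbers here are made up for illustration; only gamma = 0.9 matches the implementation below):

import numpy as np

gamma = 0.9
reward = 1.0                 # reward returned by the environment for the last action
v_s, v_s_next = 0.5, 0.8     # critic's value estimates for the current and next state

# TD target the critic is regressed onto (if the episode had ended, it would just be reward).
td_target = reward + gamma * v_s_next
# TD error: the signal that is handed to the actor.
td_error = td_target - v_s

# Bernoulli policy: the actor outputs the probability of taking action 1.
prob, action = 0.6, 1
# Binary cross-entropy of the taken action, weighted by td_error --
# this mirrors _actor_loss in the code below.
actor_loss = -(action * np.log(prob) + (1 - action) * np.log(1 - prob)) * td_error

print(td_target, td_error, actor_loss)   # approximately 1.72, 1.22, 0.62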
Algorithm Implementation
The Keras implementation of AC is shown below:
# -*- coding: utf-8 -*-
import os
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
from DRL import DRL
class AC(DRL):
    """Actor Critic Algorithms with sparse action.
    """
    def __init__(self):
        super(AC, self).__init__()

        self.actor = self._build_actor()
        self.critic = self._build_critic()

        if os.path.exists('model/actor_acs.h5') and os.path.exists('model/critic_acs.h5'):
            self.actor.load_weights('model/actor_acs.h5')
            self.critic.load_weights('model/critic_acs.h5')

        self.gamma = 0.9
    def _build_actor(self):
        """Actor network: outputs the probability of choosing action 1.
        """
        inputs = Input(shape=(4,))
        x = Dense(20, activation='relu')(inputs)
        x = Dense(20, activation='relu')(x)
        x = Dense(1, activation='sigmoid')(x)

        model = Model(inputs=inputs, outputs=x)

        return model
    def _build_critic(self):
        """Critic network: estimates the state value V(s).
        """
        inputs = Input(shape=(4,))
        x = Dense(20, activation='relu')(inputs)
        x = Dense(20, activation='relu')(x)
        x = Dense(1, activation='linear')(x)

        model = Model(inputs=inputs, outputs=x)

        return model
    def _actor_loss(self, y_true, y_pred):
        """Actor loss function.

        Arguments:
            y_true: [action, td_error]
            y_pred: predicted probability of action 1

        Returns:
            loss: binary cross-entropy of the taken action, weighted by td_error
        """
        action_pred = y_pred
        action_true, td_error = y_true[:, 0], y_true[:, 1]
        action_true = K.reshape(action_true, (-1, 1))

        loss = K.binary_crossentropy(action_true, action_pred)
        loss = loss * K.flatten(td_error)

        return loss
    def discount_reward(self, next_states, reward, done):
        """Compute the TD target for the critic.

        Arguments:
            next_states: the next state (as a batch of size 1).
            reward: reward of the last action.
            done: whether the episode has ended.

        Returns:
            target: r + gamma * V(s'), or just r if the episode ended.
        """
        q = self.critic.predict(next_states)[0][0]
        target = reward

        if not done:
            target = reward + self.gamma * q

        return target
    def train(self, episode):
        """Train the model.

        Arguments:
            episode: number of game episodes.

        Returns:
            history: training history
        """
        self.actor.compile(loss=self._actor_loss, optimizer=Adam(lr=0.001))
        self.critic.compile(loss='mse', optimizer=Adam(lr=0.01))

        history = {'episode': [], 'Episode_reward': [],
                   'actor_loss': [], 'critic_loss': []}

        for i in range(episode):
            observation = self.env.reset()

            rewards = []
            alosses = []
            closses = []

            while True:
                x = observation.reshape(-1, 4)

                # Choose an action according to the actor's output probability.
                prob = self.actor.predict(x)[0][0]
                action = np.random.choice(np.array(range(2)), p=[1 - prob, prob])
                next_observation, reward, done, _ = self.env.step(action)
                next_observation = next_observation.reshape(-1, 4)
                rewards.append(reward)

                target = self.discount_reward(next_observation, reward, done)
                y = np.array([target])

                # loss1 = mse((r + gamma * next_q), current_q)
                loss1 = self.critic.train_on_batch(x, y)

                # TD_error = (r + gamma * next_q) - current_q
                td_error = target - self.critic.predict(x)[0][0]
                y = np.array([[action, td_error]])
                loss2 = self.actor.train_on_batch(x, y)

                observation = next_observation[0]

                alosses.append(loss2)
                closses.append(loss1)

                if done:
                    episode_reward = sum(rewards)
                    aloss = np.mean(alosses)
                    closs = np.mean(closses)

                    history['episode'].append(i)
                    history['Episode_reward'].append(episode_reward)
                    history['actor_loss'].append(aloss)
                    history['critic_loss'].append(closs)

                    print('Episode: {} | Episode reward: {} | actor_loss: {:.3f} | critic_loss: {:.3f}'.format(i, episode_reward, aloss, closs))

                    break

        self.actor.save_weights('model/actor_acs.h5')
        self.critic.save_weights('model/critic_acs.h5')

        return history
if __name__ == '__main__':
    model = AC()

    history = model.train(300)
    model.save_history(history, 'ac_sparse.csv')

    model.play('ac')
The game results are as follows:
play...
Reward for this episode was: 137.0
Reward for this episode was: 132.0
Reward for this episode was: 144.0
Reward for this episode was: 118.0
Reward for this episode was: 124.0
Reward for this episode was: 113.0
Reward for this episode was: 117.0
Reward for this episode was: 131.0
Reward for this episode was: 154.0
Reward for this episode was: 139.0
The experiment above shows that the AC algorithm can optimize this problem, but convergence is not stable and the final performance is far from optimal. This is because the plain AC algorithm is an on-policy method, and the performance of the actor depends entirely on the td_error produced by the critic. Without any additional stabilization measures, the DQN-style critic is itself hard to train to convergence, which in turn keeps the whole AC algorithm from converging.