A3C stands for Asynchronous Advantage Actor-Critic. Compared with the (synchronous) advantage actor-critic algorithm, A3C does not update the parameters at every step; instead, each worker collects a whole trajectory and updates the parameters once the episode ends. This makes it possible to run multiple workers that collect trajectories and update the parameters in parallel, with each worker's update applied asynchronously. The main advantage over the plain advantage actor-critic algorithm is efficiency: for policy-based algorithms, trajectory collection is by far the most time-consuming part, and asynchronous workers let it proceed in parallel.
The steps of the algorithm (for a single worker) are as follows:
Input: the environment.
Output: an estimate of the optimal policy.
Parameters: the optimizer, the discount factor γ, and parameters controlling the number of episodes and the number of steps per episode.
1. (Synchronize global parameters) θ' ← θ, w' ← w.
2. Repeat for each episode:
2.1 Use the policy π(θ') to generate a trajectory S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, until the episode ends or the number of steps reaches the limit T.
2.2 Compute gradients:
2.2.1 (Initialize the return target) If S_T is a terminal state, set U ← 0; otherwise set U ← v(S_T; w').
2.2.2 (Initialize the gradients) g_θ ← 0, g_w ← 0.
2.3 (Compute gradients asynchronously) For t = T-1, T-2, ..., 0, do the following:
2.3.1 (Estimate the return target) Compute U ← γU + R_{t+1}.
2.3.2 (Estimate the policy gradient direction) g_θ ← g_θ + [U − v(S_t; w')] ∇ ln π(A_t | S_t; θ').
2.3.3 (Estimate the value gradient direction) g_w ← g_w + [U − v(S_t; w')] ∇ v(S_t; w').
3. (Update global parameters asynchronously) Use the optimizer to update w with the gradient direction g_w, and θ with the gradient direction g_θ.
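Steps 2.3.1 to 2.3.3 walk backward through the trajectory: the return target U is bootstrapped from the last state, and both gradient directions are weighted by the advantage U − v(S_t; w'). A minimal NumPy sketch of this backward pass (with made-up rewards and values and an assumed discount factor of 0.99) looks like this:
import numpy as np

gamma = 0.99                         # discount factor (assumed value)
rewards = np.array([0., 0., 1.])     # R_1, ..., R_T from a hypothetical trajectory
values = np.array([0.2, 0.5, 0.8])   # v(S_t; w') for t = 0, ..., T-1
bootstrap = 0.0                      # U starts at 0 for a terminal state, else v(S_T; w')

returns = np.zeros_like(rewards)
U = bootstrap
for t in reversed(range(len(rewards))):   # t = T-1, T-2, ..., 0
    U = gamma * U + rewards[t]            # step 2.3.1: U <- gamma*U + R_{t+1}
    returns[t] = U

advantages = returns - values             # the weight used in steps 2.3.2 and 2.3.3
print(returns, advantages)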
Below we implement the A3C algorithm with TensorFlow 2.0 and train it on an Atari game.
First, load the required libraries:
import tensorflow as tf
from tensorflow import keras as k
import numpy as np
from typing import Any, List, Sequence, Tuple
import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib notebook
import pandas as pd
import gym
import cv2
import random
from queue import Queue
from threading import Thread, Lock
import threading
import time
Next, define the actor and critic models. The model is simple: the most recent four frames from the environment are cropped, downsampled, and converted to black and white, then stacked and flattened into a vector of length 80*80*4. This vector is passed through a fully connected layer with 512 units, and two separate heads produce the actor output (action probabilities) and the critic output (state value).
image_height = 80
image_width = 80
image_stack = 4
num_actions = 6
original_inputs = k.Input(shape=(image_height, image_width, image_stack,), name="atari_input")
outputs = k.layers.Flatten(data_format='channels_last')(original_inputs)
outputs = k.layers.Dense(512, activation='elu', kernel_initializer='he_uniform')(outputs)
actor_outputs = k.layers.Dense(num_actions, activation="softmax", kernel_initializer='he_uniform')(outputs)
critic_outputs = k.layers.Dense(1, kernel_initializer='he_uniform')(outputs)
actor_model = k.Model(inputs=original_inputs, outputs=actor_outputs)
critic_model = k.Model(inputs=original_inputs, outputs=critic_outputs)
actor_model.compile(loss='categorical_crossentropy', optimizer=k.optimizers.RMSprop(learning_rate=0.000025))
critic_model.compile(loss='mse', optimizer=k.optimizers.RMSprop(learning_rate=0.000025))
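The choice of losses is worth a note: fitting the actor with categorical_crossentropy on one-hot encoded actions and sample_weight set to the advantage makes Keras minimize −A·ln π(a|s), which is the policy-gradient surrogate of step 2.3.2, while the critic's mse loss corresponds to step 2.3.3. As a quick sanity check (a hypothetical snippet, not part of the original text), a dummy batch can be passed through both models to confirm the output shapes:
# Hypothetical sanity check: one all-zero stack of four 80x80 frames
dummy_state = np.zeros((1, image_height, image_width, image_stack), dtype=np.float32)
print(actor_model(dummy_state).shape)   # (1, 6): a probability for each of the 6 actions
print(critic_model(dummy_state).shape)  # (1, 1): the state-value estimate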
The following code defines a function that collects trajectories and trains the models; it will later be run in multiple worker threads:
def thread_training(episodes, env, max_steps, out_q):
    global lock, actor_model, critic_model, gamma
    t = threading.current_thread()
    thread_id = t.ident
    thread_message = {'ThreadID': thread_id}

    def image_process(image: np.ndarray) -> np.ndarray:
        # Crop the playing field and downsample by a factor of 2 to 80x80
        frame_cropped = image[35:195:2, ::2, :]
        # Convert the image from RGB to grayscale
        frame_rgb = 0.299*frame_cropped[:,:,0] + 0.587*frame_cropped[:,:,1] + 0.114*frame_cropped[:,:,2]
        # Threshold to black and white and rescale to [0, 1]
        frame_rgb[frame_rgb < 100] = 0
        frame_rgb[frame_rgb >= 100] = 255
        new_frame = np.array(frame_rgb).astype(np.float32) / 255.0
        return new_frame

    def run_episode(initial_state):
        """Runs a single episode to collect training data."""
        action_probs = []
        values = []
        rewards = []
        images = []
        states = []
        actions = []
        returns = []
        random_steps = random.randint(0, 30)
        image = image_process(initial_state)
        image = image[np.newaxis, :, :]
        images = [image, image, image, image]
        state = np.vstack(images)
        state = np.transpose(state, [1, 2, 0])
        for t in range(max_steps):
            # Convert state into a batched tensor (batch size = 1)
            state = state[np.newaxis, :, :, :]
            states.append(state)
            # Run the models to get the action probabilities and the critic value
            action_probs_t = actor_model(state, training=False)
            value = critic_model(state, training=False)
            if t <= random_steps:
                # Generate random actions at the beginning of the game
                action = np.random.choice(num_actions)
            else:
                # Sample the next action from the action probability distribution
                action = np.random.choice(num_actions, p=action_probs_t[0].numpy())
            # Store the critic value
            values.append(tf.squeeze(value).numpy())
            # Store the probability of the action chosen
            action_probs.append(action_probs_t[0, action].numpy())
            actions.append((t, action))
            # Apply the action to the environment to get the next state and reward
            observation, reward, done, _ = env.step(action)
            # Process the image from the observation and combine it with the previous three images to form the state
            image = image_process(observation)
            image = image[np.newaxis, :, :]
            images = images[1:]
            images.append(image)
            state = np.vstack(images)
            state = np.transpose(state, [1, 2, 0])
            # Store the reward
            rewards.append(reward)
            if done:
                break
        action_probs = np.stack(action_probs)
        values = np.stack(values)
        rewards = np.stack(rewards)
        states = np.vstack(states)
        actions_indices = np.stack(actions)
        actions = np.zeros([actions_indices.shape[0], num_actions])
        actions[actions_indices[:, 0], actions_indices[:, 1]] = 1.
        # Compute the discounted expected return
        reversed_rewards = rewards[::-1]
        expected_return = 0
        for i in range(rewards.shape[0]):
            expected_return = gamma*expected_return + reversed_rewards[i]
            returns.append(expected_return)
        returns = np.stack(returns)[::-1]
        returns -= np.mean(returns)  # normalize the result
        returns /= np.std(returns)   # divide by the standard deviation
        return actions, action_probs, values, rewards, states, returns

    for _ in range(episodes):
        initial_state = env.reset()
        actions, action_probs, values, rewards, states, returns = run_episode(initial_state)
        # Compute the advantages
        advantages = returns - values
        # Acquire the lock to train and update the models
        lock.acquire()
        history = actor_model.fit(states, actions, sample_weight=advantages, epochs=1, verbose=0)
        actor_loss = history.history['loss'][0]
        history = critic_model.fit(states, returns, epochs=1, verbose=0)
        critic_loss = history.history['loss'][0]
        lock.release()
        # Compute the total loss and send a message to the queue
        total_loss = actor_loss + critic_loss
        thread_message['loss'] = total_loss
        thread_message['reward'] = np.sum(rewards)
        thread_message['steps'] = values.shape[0]
        out_q.put(thread_message)
Finally, the main thread starts several worker threads, each of which collects trajectories and updates the models. Here I start 5 worker threads, and each of them trains for 100 episodes. Another thread acts as a consumer that receives the training messages and prints the results:
lock = Lock()
q = Queue()
_sentinel = object()
threads = []
train_episodes = 0
gamma = 0.99   # discount factor used by thread_training (value assumed; not given in the original text)
thread_num = 5
# envs is assumed to be a list of thread_num independently created Atari environments,
# e.g. envs = [gym.make('PongDeterministic-v4') for _ in range(thread_num)]
for i in range(thread_num):
    threads.append(Thread(target=thread_training, args=(100, envs[i], 10000, q,)))

def consumer(in_q):
    global train_episodes
    while True:
        # Get some data
        time.sleep(1)
        data = in_q.get()
        if data is _sentinel:
            in_q.put(_sentinel)
            break
        if data:
            train_episodes += 1
            print("Episode {}".format(train_episodes))
            print(data)

t_consumer = Thread(target=consumer, args=(q,))
t_consumer.start()
for t in threads:
    t.start()
    time.sleep(1)
for t in threads:
    time.sleep(3)
    t.join()
q.put(_sentinel)
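The pandas, seaborn, and matplotlib imports above are presumably intended for plotting the training curve. As a hypothetical extension (not shown in the original text), the consumer thread could also append each reported reward to a list, which can then be plotted after training:
# Hypothetical extension: keep the per-episode rewards reported by the workers
# (e.g. appended inside consumer()) and plot the learning curve.
episode_rewards = []  # filled with data['reward'] for every message received

def plot_rewards(rewards):
    df = pd.DataFrame({'episode': np.arange(1, len(rewards) + 1), 'reward': rewards})
    sns.lineplot(data=df, x='episode', y='reward')
    plt.xlabel('episode')
    plt.ylabel('reward per episode')
    plt.show()

# plot_rewards(episode_rewards)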
In total, the model was trained for 500 episodes, which took 2 hours and 33 minutes; the per-episode return after training was