Applying the A3C Reinforcement Learning Algorithm (Training an Atari Game)

A3C stands for Asynchronous Advantage Actor-Critic. It differs from the plain advantage actor-critic algorithm in that the parameters are not updated at every step of execution; instead, the update is performed with the whole trajectory once the episode ends. This makes it possible to have multiple workers collect trajectories and update the parameters, with each worker's update happening asynchronously. Compared with the advantage actor-critic algorithm, the benefit is a large gain in efficiency, because for policy-update algorithms the most time-consuming part is trajectory collection.

The algorithm proceeds as follows:

Input: the environment

Output: an estimate of the optimal policy \pi(\theta)

Parameters: the optimizer, the discount factor \gamma, and the parameters controlling the number of episodes and the number of steps per episode

1. (Synchronize the global parameters) \theta'\leftarrow \theta, w'\leftarrow w

2. For each episode, perform the following:

2.1 Use the policy \pi(\theta') to generate a trajectory S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},\cdots,S_{T-1},A_{T-1},R_{T},S_{T}, until the episode terminates or the number of steps reaches the limit T

2.2 Compute the gradients

2.2.1 (Initialize the target U_{T}) If S_{T} is a terminal state, set U\leftarrow 0; otherwise set U\leftarrow v(S_{T};w')

2.2.2 (Initialize the gradients) g^{(\theta)}\leftarrow 0, g^{(w)}\leftarrow 0

2.3 (Asynchronously accumulate the gradients) For t=T-1,T-2,\ldots,0, do the following:

2.3.1 (Estimate the target U_{t}) Compute U\leftarrow \gamma U+R_{t+1}

2.3.2 (Estimate the policy gradient direction) g^{(\theta)}\leftarrow g^{(\theta)}+[U-v(S_{t};w')]\nabla \ln\pi(A_{t}|S_{t};\theta')

2.3.3 (Estimate the value gradient direction) g^{(w)}\leftarrow g^{(w)}+[U-v(S_{t};w')]\nabla v(S_{t};w')

3. Use the accumulated gradients g^{(\theta)} and g^{(w)} to update the global parameters \theta and w
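
Before turning to the implementation, note what steps 2.2.1 and 2.3.1 compute: for every step t, the target U_{t} is the discounted sum of the remaining rewards plus the bootstrapped value v(S_{T};w') when the episode was cut off before a terminal state. The short NumPy sketch below of this backward recursion is illustrative only and not part of the original code (the function name is made up for this example):

import numpy as np

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute U_t for t = T-1, ..., 0 via the recursion U <- gamma*U + R_{t+1}.

    rewards:         [R_1, ..., R_T] collected along the trajectory
    bootstrap_value: v(S_T; w') if S_T is not terminal, otherwise 0.0
    """
    targets = np.zeros(len(rewards))
    u = bootstrap_value                      # step 2.2.1: initialize U
    for t in reversed(range(len(rewards))):  # step 2.3: t = T-1, ..., 0
        u = gamma * u + rewards[t]           # step 2.3.1: U <- gamma*U + R_{t+1}
        targets[t] = u
    return targets

print(n_step_targets([0.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.9))
# U_0, U_1, U_2 = 1.1745, 1.305, 1.45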

Below, the A3C algorithm is implemented with TensorFlow 2.0 and used to train an agent on an Atari game.

First, import the required libraries:

import tensorflow as tf
from tensorflow import keras as k
import numpy as np
from typing import Any, List, Sequence, Tuple
import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib notebook
import pandas as pd
import gym
import cv2
import random
from queue import Queue
from threading import Thread, Lock
import threading
import time

Next, define the actor and critic models. The model is quite simple: the most recent four frames from the environment are downscaled, converted to black and white, and stacked; the data of these four frames is flattened into a vector of length 80*80*4, passed through a fully connected layer with 512 units, and then fed into two separate output heads, one for the actor model and one for the critic model.

image_height = 80
image_width = 80
image_stack = 4
num_actions = 6

original_inputs = k.Input(shape=(image_height, image_width, image_stack,), name="atari_input")
outputs = k.layers.Flatten(data_format='channels_last')(original_inputs)
outputs = k.layers.Dense(512, activation='elu', kernel_initializer='he_uniform')(outputs)
actor_outputs = k.layers.Dense(num_actions, activation="softmax", kernel_initializer='he_uniform')(outputs)
critic_outputs = k.layers.Dense(1, kernel_initializer='he_uniform')(outputs)
actor_model = k.Model(inputs=original_inputs, outputs=actor_outputs)
critic_model = k.Model(inputs=original_inputs, outputs=critic_outputs)
actor_model.compile(loss='categorical_crossentropy', optimizer=k.optimizers.RMSprop(learning_rate=0.000025))
critic_model.compile(loss='mse', optimizer=k.optimizers.RMSprop(learning_rate=0.000025))
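
As a quick sanity check (not part of the original code), the two models can be called on a dummy batch to confirm their output shapes: the actor returns a probability distribution over the 6 actions, the critic returns a single state-value estimate, and the Flatten and Dense(512) layers are shared between them because both models were built from the same functional graph.

dummy_state = np.zeros((1, image_height, image_width, image_stack), dtype=np.float32)
print(actor_model(dummy_state).shape)   # (1, 6): action probabilities
print(critic_model(dummy_state).shape)  # (1, 1): state-value estimate
actor_model.summary()                   # the Flatten and Dense(512) layers are shared with critic_model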

The following code defines a function that collects trajectories and trains the models. It will later be run in several worker threads:

def thread_training(episodes, env, max_steps, out_q):
    global lock, actor_model, critic_model, gamma
    t = threading.current_thread()
    thread_id = t.ident
    thread_message = {'ThreadID': thread_id}

    def image_process(image: np.ndarray) -> np.ndarray:
        # Crop the playing field (rows 35..194) and keep every 2nd pixel -> 80 x 80
        frame_cropped = image[35:195:2, ::2, :]
        # Convert the image from rgb to black and white
        frame_rgb = 0.299*frame_cropped[:,:,0] + 0.587*frame_cropped[:,:,1] + 0.114*frame_cropped[:,:,2]
        frame_rgb[frame_rgb < 100] = 0
        frame_rgb[frame_rgb >= 100] = 255
        new_frame = np.array(frame_rgb).astype(np.float32) / 255.0
        return new_frame

    def run_episode(initial_state):
        """Runs a single episode to collect training data."""
        action_probs = []
        values = []
        rewards = []
        images = []
        states = []
        actions = []
        returns = []
        random_steps = random.randint(0,30)

        image = image_process(initial_state)
        image = image[np.newaxis,:,:]
        images = [image, image, image, image]
        state = np.vstack(images)
        state = np.transpose(state, [1,2,0])

        for t in range(max_steps):
            # Convert state into a batched tensor (batch size = 1)
            state = state[np.newaxis,:,:,:]
            states.append(state)
            # Run the models to get the action probabilities and the critic value
            action_probs_t = actor_model(state, training=False)
            value = critic_model(state, training=False)
            if t<=random_steps:
                # Generate random action in the beginning of the game
                action = np.random.choice(num_actions)
            else:
                # Sample next action from the action probability distribution
                action = np.random.choice(num_actions, p=action_probs_t[0].numpy())
            # Store critic values
            values.append(tf.squeeze(value).numpy())
            # Store log probability of the action chosen
            action_probs.append(action_probs_t[0,action].numpy())
            actions.append((t,action))
            # Apply action to the environment to get next state and reward
            observation, reward, done, _ = env.step(action)
            # Process the image from observation and combine with the previous three images to the state
            image = image_process(observation)
            image = image[np.newaxis,:,:]
            images = images[1:]
            images.append(image)
            state = np.vstack(images)
            state = np.transpose(state, [1,2,0])
            # Store reward
            rewards.append(reward)
            if done:
                break
        action_probs = np.stack(action_probs)
        values = np.stack(values)
        rewards = np.stack(rewards)
        states = np.vstack(states)
        actions_indices = np.stack(actions)
        actions = np.zeros([actions_indices.shape[0], num_actions])
        actions[actions_indices[:,0], actions_indices[:,1]] = 1.
        # Compute the discounted expected return
        reversed_rewards = rewards[::-1]
        expected_return = 0
        for i in range(rewards.shape[0]):
            expected_return = gamma*expected_return + reversed_rewards[i]
            returns.append(expected_return)
        returns = np.stack(returns)[::-1]
        returns -= np.mean(returns) # normalizing the result
        returns /= (np.std(returns) + 1e-8) # divide by standard deviation (epsilon avoids division by zero)
        return actions, action_probs, values, rewards, states, returns
    
    for _ in range(episodes):
        initial_state = env.reset()
        actions, action_probs, values, rewards, states, returns = run_episode(initial_state)
        # Compute advantage
        advantages = returns - values
        # Acquire the lock to train and update the model
        lock.acquire()
        history = actor_model.fit(states, actions, sample_weight=advantages, epochs=1, verbose=0)
        actor_loss = history.history['loss'][0]
        history = critic_model.fit(states, returns, epochs=1, verbose=0)
        critic_loss = history.history['loss'][0]
        lock.release()
        # Calculate the loss and send message to the queue
        total_loss = actor_loss + critic_loss
        thread_message['loss'] = total_loss
        thread_message['reward'] = np.sum(rewards)
        thread_message['steps'] = values.shape[0]
        # Put a copy on the queue so that updates made in the next episode
        # cannot race with the consumer thread that prints the message
        out_q.put(dict(thread_message))
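
Note how the policy-gradient step 2.3.2 is realized above: fitting the actor with categorical cross-entropy on the one-hot actions and sample_weight=advantages minimizes the mean of -A_t \ln\pi(A_t|S_t;\theta), which is the negated policy-gradient direction accumulated in the algorithm. The GradientTape sketch below shows an equivalent hand-written update; it is illustrative only, and the function name is hypothetical:

def actor_policy_gradient_step(model, optimizer, states, actions_one_hot, advantages):
    """One update equivalent to the advantage-weighted cross-entropy fit above."""
    actions_one_hot = tf.cast(actions_one_hot, tf.float32)
    advantages = tf.cast(advantages, tf.float32)
    with tf.GradientTape() as tape:
        probs = model(states, training=True)                   # pi(.|S_t; theta)
        log_prob = tf.math.log(tf.reduce_sum(probs * actions_one_hot, axis=1) + 1e-10)
        loss = -tf.reduce_mean(advantages * log_prob)           # mean of -A_t * ln pi(A_t|S_t)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss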

Finally, the main thread starts several worker threads, each of which collects trajectories and updates the models; here 5 worker threads are started, each training for 100 episodes. One additional thread receives the training messages and prints the corresponding results:

# gamma, thread_num and envs are assumed here: discount factor 0.99, 5 worker threads,
# and Pong environments (Pong provides the 6 actions expected by the model)
gamma = 0.99
thread_num = 5
envs = [gym.make('PongDeterministic-v4') for _ in range(thread_num)]

lock = Lock()
q = Queue()
_sentinel = object()
threads = []
train_episodes = 0
for i in range(thread_num):
    threads.append(Thread(target=thread_training, args=(100, envs[i], 10000, q,)))

def consumer(in_q):
    global train_episodes
    while True:
        # Get some data from the queue
        time.sleep(1)
        data = in_q.get()
        if data is _sentinel:
            in_q.put(_sentinel)
            break            
        if data:
            train_episodes += 1
            print("Episode {}".format(train_episodes))
            print(data)

t_consumer = Thread(target=consumer, args=(q,))
t_consumer.start()

for t in threads:
    t.start()
    time.sleep(1)

for t in threads:
    time.sleep(3)
    t.join()

q.put(_sentinel)

In total, the training ran for 500 episodes and took 2 hours and 33 minutes; the per-episode return after training was
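
To visualize the per-episode return, one option (not shown in the original code) is to record each reported reward inside consumer(), for example by appending data['reward'] to a list, and then plot the raw returns together with a moving average. A minimal sketch, assuming such a list episode_rewards has been collected:

def plot_rewards(episode_rewards):
    """Plot the raw per-episode returns and a 20-episode moving average."""
    series = pd.Series(episode_rewards)
    plt.figure()
    plt.plot(series, alpha=0.4, label='episode return')
    plt.plot(series.rolling(20, min_periods=1).mean(), label='20-episode average')
    plt.xlabel('episode')
    plt.ylabel('return')
    plt.legend()
    plt.show()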
