AI Learning: Reinforcement Learning Control of an Inverted Pendulum with DQN (10)

DQN (Deep Q-Network) can be used to control an inverted pendulum (CartPole) system.

The idea behind DQN is to build a neural network that replaces the Q-table of the Q-Learning algorithm: given the state of the system and a candidate action, the network outputs the corresponding Q value, and a higher Q value indicates a higher expected reward for that action. When DQN is used for reinforcement learning, two strategies are applied: experience replay and a fixed target network. With experience replay, system states and actions are recorded in a history buffer and replayed during training, loosely imitating how humans learn from past experience. With the fixed target strategy, two networks are used: one network (the Target Net) is kept relatively stable and evaluates the target Q values, while the other network is trained continuously; after a certain number of iterations its weights replace the old Target Net. For detailed introductions to Q-Learning and DQN, see the reference links at the end of this article.
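Concretely, the target value used in the implementation below is the standard Q-learning target with a discount factor of 0.9:

q_target = reward + 0.9 * max_a Q_target(next_state, a)

where Q_target(next_state, a) is evaluated by the fixed target network, and the evaluation network is trained toward q_target with a mean squared error loss.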

An implementation example follows.

1 Load modules
Import the required modules:

# import required modules
import gym
import random
import numpy as np
import math

import torch
import torch.nn as nn

2 Define the network

class Net(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        
        # fully connected layers: state -> hidden (10 units) -> one Q value per action
        self.fc1 = nn.Linear(n_states, 10)
        self.fc2 = nn.Linear(10, n_actions)
        
        # initialize weights with small normally distributed values
        self.fc1.weight.data.normal_(0, 0.1)
        self.fc2.weight.data.normal_(0, 0.1)
        
    def forward(self, inputs):
        x = self.fc1(inputs)
        x = nn.functional.relu(x)
        outputs = self.fc2(x)
        
        return outputs

The network consists of one fully connected layer, a ReLU activation, and a fully connected output layer. Its input is the state of the CartPole system (position, velocity, angle, angular velocity), and its output is the Q value for each of the two actions (0 and 1); the action with the higher Q value is selected.
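As a quick sanity check of the interface (an illustrative sketch, not part of the control program), a forward pass on a dummy state returns one Q value per action:

# illustrative check of the input/output shapes of the Net class above
net = Net(4, 2)
dummy_state = torch.zeros(1, 4)   # one state: position, velocity, angle, angular velocity
q_values = net(dummy_state)       # shape (1, 2): one Q value per action
print(q_values.shape)             # torch.Size([1, 2])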

3 Define the DQN learning strategy

# define DQN
class DQN:
    def __init__(self, n_states, n_actions):
        # two nets
        self.eval_net = Net(n_states, n_actions)
        self.target_net = Net(n_states, n_actions)
        
        self.loss = nn.MSELoss()
        
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=0.01)
        
        self.learn_step = 0
        
        self.n_states = n_states
        self.n_actions = n_actions
        
        # define variables for history data storage
        self.history_capacity = 5000
        self.history_data = np.zeros((self.history_capacity, 2*n_states+2)) # state, action, reward, next_state
        self.history_index = 0
        self.history_num = 0
        
    def random_action(self):
        r = random.random()
        if r >= 0.5:
            action = 1
        else:
            action = 0
        
        return action
    
    def choose_action(self, state, epsilon):
        # with probability epsilon follow the evaluation network, otherwise act randomly
        r = random.random()
        if r > epsilon:
            action = self.random_action()
        else:
            state = torch.FloatTensor(state)
            state = torch.unsqueeze(state, 0)
            
            evals = self.eval_net(state)
            action = np.argmax(evals.data.numpy())
            
        return action
    
    # shaped reward: higher when the cart is near the center and the pole near upright
    def get_reward(self, state):
        pos, vel, ang, avel = state
        
        pos1 = 1.0          # position scale
        ang1 = math.pi/9    # angle scale
        
        r1 = 5 - 10*abs(pos/pos1)
        r2 = 5 - 10*abs(ang/ang1)
        if r1 < -5.0:
            r1 = -5.0
        if r2 < -5.0:
            r2 = -5.0
            
        return r1+r2
    
    # episode failure: cart too far from the center or pole fallen too far
    def gg(self, state):
        pos, vel, ang, avel = state
        
        bad = abs(pos) > 2.0 or abs(ang) > math.pi/4
        
        return bad

    # function to store history data
    def store_transition(self, prev_state, action, reward, state):
        transition = np.hstack((prev_state, action, reward, state))
        self.history_data[self.history_index] = transition
        
        self.history_index = (self.history_index + 1)%self.history_capacity
        
        # count stored samples, capped at the buffer capacity
        if self.history_num < self.history_capacity:
            self.history_num += 1

    # learn
    def learn(self):
        # randomly choose a batch of 64 sample indices from the stored history
        indices = np.random.choice(self.history_num, 64)
        
        samples = self.history_data[indices, :]
        state = torch.FloatTensor(samples[:, 0:self.n_states])
        action = torch.LongTensor(samples[:, self.n_states:self.n_states+1])
        reward = torch.FloatTensor(samples[:, self.n_states+1:self.n_states+2])
        next_state = torch.FloatTensor(samples[:, self.n_states+2:])

        q_eval = self.eval_net(state).gather(1, action)
        q_next = self.target_net(next_state).detach()  # detach: the target net is not trained by this loss
        
        # target q values: reward plus discounted (0.9) best q value of the next state
        q_target = 0.9*q_next.max(1).values.unsqueeze(1) + reward
        
        loss = self.loss(q_eval, q_target)
        
        # backward training
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        self.learn_step += 1
        
        # update target net every 50 learn steps
        if self.learn_step % 50 == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())

The class contains two networks: an evaluation network (eval_net) and a target network (target_net). The evaluation network is used for the current action decisions and is updated by training, while the target_net is used to compute the target Q values. After the eval_net has been trained for a while, its weights are copied into the target_net for the next phase of target Q value computation; this mechanism helps stabilize learning. Action selection includes some randomness, and as the number of learning steps increases, more confidence is placed in the decision network.
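A minimal usage sketch of the class above (the state values here are hypothetical and only illustrate the interface):

# hypothetical example, only to show how the DQN class is called
agent = DQN(4, 2)
state = [0.0, 0.0, 0.05, 0.0]                      # position, velocity, angle, angular velocity
action = agent.choose_action(state, epsilon=0.9)   # with probability 0.9 the network chooses the action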

4 Simulation program

# create cartpole model
env = gym.make('CartPole-v1', render_mode='human')

# reset state of env
state, _ = env.reset()

# create DQN model
model = DQN(4, 2)

# parameters
max_sim_step = 100000

# simulate
for i in range(max_sim_step):
    env.render()

    epsilon = 0.7 + i / max_sim_step * (0.95 - 0.7)
    action = model.choose_action(state, epsilon)
    
    prev_state = state
    state, reward, _, _, _ = env.step(action)
    
    reward += model.get_reward(state)

    model.store_transition(prev_state, action, reward, state)

    # perform model learning every 10 simulation steps
    if i > 1000 and i % 10 == 0:
        model.learn()

    if model.gg(state):
        state, _ = env.reset()

env.close()

The program runs for max_sim_step (100000) simulation steps. In each step, the DQN selects an action according to its policy and applies it to the CartPole; starting from step 1000, the model learns once every 10 steps. In each step the CartPole's state, the action, the reward, and the next state are recorded into the history buffer, and during learning a batch of samples is drawn at random from this history.

In practice, the pendulum falls over easily at the beginning; as control failures accumulate and learning proceeds, the pendulum stays balanced for longer and longer, and after some time it can remain in a stable state for extended periods.
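To make this trend measurable, one possible addition (a sketch, not part of the original program; only the lines added to the simulation loop are shown) is to count the steps between resets and print each episode length:

# sketch: count how many steps each episode survives before a reset
episode_steps = 0
for i in range(max_sim_step):
    # ... same simulation step as above ...
    episode_steps += 1
    if model.gg(state):
        print('episode ended after', episode_steps, 'steps')
        episode_steps = 0
        state, _ = env.reset()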

[Figure 1]
Reference links:
https://blog.csdn.net/Leon_winter/article/details/106456683
https://blog.csdn.net/BIT_csy/article/details/124557798
