DQN (Deep Q-Network) can be used to control an inverted pendulum (CartPole) system.
The idea of DQN is to replace the Q-Table of the Q-Learning algorithm with a neural network that, given the state of the system and a candidate action, outputs the corresponding Q value; a higher Q value means the action is expected to yield a higher reward. When DQN is used for reinforcement learning, two techniques are applied: experience replay and a fixed target network (Fixed Target). With experience replay, system states and actions are recorded into a history buffer and replayed during learning, loosely mimicking how humans learn from past experience. In addition, two networks are used: one relatively stable network (the Target Net) evaluates the target Q values, while the other network is trained continuously and, after a certain number of iterations, its weights replace those of the Target Net. For detailed introductions to Q-Learning and DQN, see the reference links at the end of this post.
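As a quick reference, the learning target used by DQN can be summarized with a short sketch (a minimal illustration, assuming a PyTorch target network; the function and variable names here are illustrative and do not appear in the code below):

import torch

# Sketch of the DQN learning target (illustrative only): for a transition
# (s, a, r, s'), the target network provides the bootstrap value
#     target = r + gamma * max_a' Q_target(s', a')
# and the evaluation network's Q(s, a) is regressed towards it with an MSE loss.
def dqn_target(reward, next_state, target_net, gamma=0.9):
    with torch.no_grad():                      # target values are not back-propagated
        q_next = target_net(next_state)        # Q values of all actions in the next state
    return reward + gamma * q_next.max(dim=1, keepdim=True).values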
An example implementation follows.
1 Loading Modules
Load the required modules:
# import required modules
import gym
import random
import numpy as np
import math
import torch
import torch.nn as nn
2 Defining the Network
# define the Q network
class Net(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        # two fully connected layers: state -> hidden(10) -> one Q value per action
        self.fc1 = nn.Linear(n_states, 10)
        self.fc2 = nn.Linear(10, n_actions)
        # initialize weights with a small normal distribution
        self.fc1.weight.data.normal_(0, 0.1)
        self.fc2.weight.data.normal_(0, 0.1)

    def forward(self, inputs):
        x = self.fc1(inputs)
        x = nn.functional.relu(x)
        outputs = self.fc2(x)
        return outputs
The network consists of a fully connected layer, a ReLU activation, and a fully connected output layer. Its input is the state of the CartPole (cart position, cart velocity, pole angle, pole angular velocity), and its output is the Q value of each of the two actions (0 and 1); the action with the higher Q value is the one taken.
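As a quick check, the network can be exercised on its own, as in the sketch below (assuming the modules above are already imported; the state values are made-up example numbers):

# create the network for 4 state values and 2 actions
net = Net(4, 2)
# a batch of one example state: position, velocity, angle, angular velocity
state = torch.FloatTensor([[0.0, 0.1, 0.02, -0.1]])
q_values = net(state)                          # shape (1, 2): one Q value per action
action = int(q_values.argmax(dim=1).item())    # the action with the higher Q value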
3 Defining the DQN Learning Strategy
# define DQN
class DQN:
    def __init__(self, n_states, n_actions):
        # two nets: eval_net is trained, target_net provides stable target Q values
        self.eval_net = Net(n_states, n_actions)
        self.target_net = Net(n_states, n_actions)
        self.loss = nn.MSELoss()
        self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=0.01)
        self.learn_step = 0
        self.n_states = n_states
        self.n_actions = n_actions
        # define variables for history (experience replay) data storage
        self.history_capacity = 5000
        self.history_data = np.zeros((self.history_capacity, 2*n_states+2))  # state, action, reward, next_state
        self.history_index = 0
        self.history_num = 0
    def random_action(self):
        # pick action 0 or 1 with equal probability
        r = random.random()
        if r >= 0.5:
            action = 1
        else:
            action = 0
        return action

    def choose_action(self, state, epsilon):
        # epsilon here is the probability of trusting the evaluation network;
        # with probability (1 - epsilon) a random action is taken for exploration
        r = random.random()
        if r > epsilon:
            action = self.random_action()
        else:
            state = torch.FloatTensor(state)
            state = torch.unsqueeze(state, 0)
            evals = self.eval_net.forward(state)
            action = np.argmax(evals.data.numpy())
        return action
    # function to calculate the shaped reward: the closer the cart is to the center
    # and the pole to vertical, the higher the reward (each term is clipped to [-5, 5])
    def get_reward(self, state):
        pos, vel, ang, avel = state
        pos1 = 1.0            # position scale
        ang1 = math.pi/9      # angle scale (20 degrees)
        r1 = 5 - 10*abs(pos/pos1)
        r2 = 5 - 10*abs(ang/ang1)
        if r1 < -5.0:
            r1 = -5.0
        if r2 < -5.0:
            r2 = -5.0
        return r1 + r2

    # definition of game end: cart too far from center or pole angle too large
    def gg(self, state):
        pos, vel, ang, avel = state
        bad = abs(pos) > 2.0 or abs(ang) > math.pi/4
        return bad
    # function to store history data in a circular buffer
    def store_transition(self, prev_state, action, reward, state):
        transition = np.hstack((prev_state, action, reward, state))
        self.history_data[self.history_index] = transition
        self.history_index = (self.history_index + 1) % self.history_capacity
        # track how many valid samples are stored, up to the buffer capacity
        if self.history_num < self.history_capacity:
            self.history_num += 1
    # learn from a random batch of stored transitions
    def learn(self):
        # randomly choose a batch of sample indices
        indices = np.random.choice(self.history_num, 64)
        samples = self.history_data[indices, :]
        state = torch.FloatTensor(samples[:, 0:self.n_states])
        action = torch.LongTensor(samples[:, self.n_states:self.n_states+1])
        reward = torch.FloatTensor(samples[:, self.n_states+1:self.n_states+2])
        next_state = torch.FloatTensor(samples[:, self.n_states+2:])
        # Q values predicted by the evaluation network for the actions actually taken
        q_eval = self.eval_net(state).gather(1, action)
        # target network output is detached so no gradient flows into it
        q_next = self.target_net(next_state).detach()
        # calculate target q values with discount factor 0.9
        q_target = reward + 0.9*q_next.max(1).values.unsqueeze(1)
        loss = self.loss(q_eval, q_target)
        # backward training
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.learn_step += 1
        # update target net every 50 learn steps
        if self.learn_step % 50 == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
The DQN contains two networks: the evaluation network (eval_net) and the target network (target_net). The evaluation network is used for the current action decisions and is the one being trained; target_net is used to compute the target Q values. After eval_net has been trained for a while, its weights are copied to target_net, which then serves for the next phase of target Q value computation. This mechanism helps stabilize learning. Action selection includes some randomness; as the number of learning steps increases, more confidence is placed in the decision network.
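For illustration, action selection with the class above can be tried directly, as in the sketch below (the state values are made-up numbers):

# with epsilon = 0.9 the evaluation network is trusted 90% of the time,
# and a random action is taken otherwise
model = DQN(4, 2)
example_state = [0.0, 0.0, 0.05, 0.0]          # pos, vel, angle, angular velocity
action = model.choose_action(example_state, 0.9)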
4 Simulation Program
# create cartpole model
env = gym.make('CartPole-v1', render_mode='human')
# reset state of env
state, _ = env.reset()
# create DQN model
model = DQN(4, 2)
# parameters
max_sim_step = 100000
# simulate
for i in range(max_sim_step):
    env.render()
    # epsilon grows linearly from 0.7 to 0.95 over the simulation
    epsilon = 0.7 + i / max_sim_step * (0.95 - 0.7)
    action = model.choose_action(state, epsilon)
    prev_state = state
    state, reward, _, _, _ = env.step(action)
    reward += model.get_reward(state)
    model.store_transition(prev_state, action, reward, state)
    # perform model learning every 10 simulation steps
    if i > 1000 and i % 10 == 0:
        model.learn()
    if model.gg(state):
        state, _ = env.reset()
env.close()
The program runs max_sim_step (100000) simulation steps. At each step, the DQN selects an action according to its policy and applies it to the CartPole; starting from step 1000, the model learns every 10 steps. At every step, the CartPole's state, the action, the reward, and the next state are recorded into the history buffer, and during learning a batch of samples is drawn at random from this history.
In testing, the CartPole falls over easily at the beginning; as control failures and learning accumulate, the pole stays upright for longer and longer, and after some time the CartPole can remain in a stable state for a long period.
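After training, the learned policy can be checked with a purely greedy run, as in the sketch below (passing epsilon = 1.0 to choose_action makes the random branch unreachable, so the evaluation network is always used; a fresh environment is created since the training one has been closed):

# greedy evaluation run of the trained model (sketch)
eval_env = gym.make('CartPole-v1', render_mode='human')
state, _ = eval_env.reset()
for _ in range(1000):
    action = model.choose_action(state, 1.0)   # always take the network's action
    state, _, _, _, _ = eval_env.step(action)
    if model.gg(state):                        # restart if the pole falls or the cart drifts away
        state, _ = eval_env.reset()
eval_env.close()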
References:
https://blog.csdn.net/Leon_winter/article/details/106456683
https://blog.csdn.net/BIT_csy/article/details/124557798