Hindsight isn't useless after all: how OpenAI uses "hindsight" to mitigate the sparse-reward problem in reinforcement learning
Hindsight Experience Replay (HER): How It Works
Hindsight Experience Replay (HER) is a key technique in reinforcement learning for dealing with the **sparse reward** problem. Through **goal relabeling**, it turns failed experience into useful training data and significantly improves sample efficiency. The core idea and implementation details are as follows:

1. Collect experience. The agent interacts with the environment and stores experience tuples (s, a, r, s', g), where:
   - s: the current state
   - a: the action taken
   - r: the reward computed with respect to the original goal g
   - s': the next state
   - g: the original goal
2. Relabel goals. For every stored experience, additional experiences are generated by replacing the original goal g with a substitute goal g' and recomputing the reward r'. Common choices of g' are:
   - final: the state actually reached at the end of the episode
   - future: a state reached later in the same episode (the strategy used in the code below)
   - episode / random: a state achieved elsewhere in the same episode, or anywhere in the replay buffer
3. Train on both. The original and the relabeled experiences are used together, so the agent learns a goal-conditioned (multi-goal) policy while the sparse-reward problem is alleviated.

With a sparse reward, the reward function for goal g is typically designed as r(s', g) = 0 if ‖s' − g‖ < ε and −1 otherwise, so a failed episode provides almost no learning signal. Once the goal is relabeled to g', the same transition can become a success. Example task: a robotic arm must move an object to target position A, but the object actually ends up at position B. HER stores the additional transition (s, a, r=0, s', g'=B) and uses it for training. In this way the agent learns how to reach arbitrary positions (such as B), not only A.
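To make the relabeling step concrete, here is a minimal sketch in plain NumPy (the positions and the tolerance eps are made-up values for illustration) showing how the reward of a failed transition flips once the original goal A is replaced by the actually reached position B:

import numpy as np

def sparse_reward(achieved, goal, eps=0.05):
    """0 if the achieved position is within eps of the goal, else -1."""
    return 0.0 if np.linalg.norm(achieved - goal) < eps else -1.0

# A transition from a failed attempt: the object was supposed to reach A
# but actually ended up at B.
A = np.array([1.0, 0.0, 0.0])   # original goal g
B = np.array([0.2, 0.4, 0.0])   # position actually reached (s')
print(sparse_reward(B, A))      # -1.0 -> no learning signal for goal A
# HER relabels the same transition with g' = B (the achieved position):
print(sparse_reward(B, B))      #  0.0 -> the transition becomes a "success"

The simplified end-to-end example below combines this idea with a goal-conditioned DQN (illustrative code, not a production implementation):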
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random
# Define the DQN network (the example off-policy algorithm "A" that HER wraps)
class DQN(nn.Module):
    def __init__(self, state_dim, goal_dim, action_dim, hidden_dim=256):
        super(DQN, self).__init__()
        # The state and the goal are concatenated and fed in together
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state, goal):
        # Concatenate state and goal along the feature dimension
        x = torch.cat([state, goal], dim=1)
        return self.net(x)
# Replay buffer
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        """Add a single transition."""
        self.buffer.append(transition)

    def add_batch(self, transitions):
        """Add a batch of transitions."""
        self.buffer.extend(transitions)

    def sample(self, batch_size):
        """Sample a random minibatch."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
# HER hyperparameters
HER_SAMPLE_N = 4  # number of additional goals sampled per transition
FUTURE_STEP = 4   # horizon of the "future" strategy (not used in this simplified sketch)

# Initialize components
state_dim = 3   # example state dimension
goal_dim = 3    # goal dimension (often the same space as the state)
action_dim = 2  # example number of discrete actions
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dqn = DQN(state_dim, goal_dim, action_dim).to(device)
target_dqn = DQN(state_dim, goal_dim, action_dim).to(device)
target_dqn.load_state_dict(dqn.state_dict())
optimizer = optim.Adam(dqn.parameters(), lr=1e-3)
buffer = ReplayBuffer(100000)  # replay buffer capacity

# Example reward function (replace with the task-specific one)
def compute_reward(state, action, goal):
    """Example: negative distance to the goal (a dense reward, for simplicity)."""
    distance = torch.norm(state - goal, dim=-1, keepdim=True)
    return -distance  # keep a trailing dimension so rewards stack into (batch, 1)
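# Note: the HER paper itself uses a *sparse* reward (0 when the goal is reached
# within some tolerance, -1 otherwise); the dense negative-distance reward above
# is kept only to keep this toy example simple. A sparse variant would look like
# the sketch below (the threshold eps is an arbitrary choice for illustration):
def compute_reward_sparse(state, action, goal, eps=0.05):
    success = torch.norm(state - goal, dim=-1, keepdim=True) < eps
    return success.float() - 1.0  # 0.0 if within eps of the goal, else -1.0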
# Training loop
num_episodes = 1000
for episode in range(num_episodes):
    # Sample an initial state and a goal
    current_state = torch.randn(state_dim).to(device)  # random initialization for the example
    goal = torch.randn(goal_dim).to(device)            # random goal for the example
    episode_transitions = []

    # Collect a trajectory
    for t in range(10):  # example episode length of 10
        # Behaviour policy: simple epsilon-greedy
        if np.random.rand() < 0.3:
            action = np.random.randint(action_dim)
        else:
            with torch.no_grad():
                q_values = dqn(current_state.unsqueeze(0), goal.unsqueeze(0))
                action = q_values.argmax().item()
        # Execute the action and observe the next state
        # (a random walk stands in for the real environment here)
        next_state = current_state + torch.randn_like(current_state) * 0.1
        # Store the raw transition (rewards are computed later, per goal)
        episode_transitions.append((
            current_state.clone(),
            torch.tensor([action]),
            next_state.clone(),
            goal.clone()
        ))
        current_state = next_state
    # Hindsight experience replay
    for t in range(len(episode_transitions)):
        state, action, next_state, original_goal = episode_transitions[t]
        # Add the transition with the original goal
        reward = compute_reward(next_state, action, original_goal)
        buffer.add((
            torch.cat([state, original_goal]),
            action,
            reward,
            torch.cat([next_state, original_goal])
        ))
        # HER core step: sample additional goals.
        # With the "future" strategy, goals are states reached later in this episode.
        future_indices = np.random.choice(
            np.arange(t, len(episode_transitions)),
            size=HER_SAMPLE_N,
            replace=True
        )
        additional_goals = [episode_transitions[idx][2] for idx in future_indices]
        # Generate a relabeled transition for each additional goal
        for g in additional_goals:
            new_reward = compute_reward(next_state, action, g)
            buffer.add((
                torch.cat([state, g]),
                action,
                new_reward,
                torch.cat([next_state, g])
            ))
    # Optimization step
    if len(buffer) >= 512:  # start training once the buffer is large enough
        batch = buffer.sample(512)
        state_batch = torch.stack([x[0] for x in batch]).to(device)
        action_batch = torch.stack([x[1] for x in batch]).to(device)
        reward_batch = torch.stack([x[2] for x in batch]).to(device)
        next_state_batch = torch.stack([x[3] for x in batch]).to(device)

        # Standard DQN update on the goal-conditioned Q-network
        current_q = dqn(state_batch[:, :state_dim], state_batch[:, state_dim:]).gather(1, action_batch)
        with torch.no_grad():
            max_next_q = target_dqn(
                next_state_batch[:, :state_dim],  # next state
                next_state_batch[:, state_dim:]   # goal
            ).max(1)[0].unsqueeze(1)
            target_q = reward_batch + 0.99 * max_next_q
        loss = nn.MSELoss()(current_q, target_q)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Periodically update the target network
    if episode % 10 == 0:
        target_dqn.load_state_dict(dqn.state_dict())
# Note: this is a simplified implementation; for a real application you need to adapt:
# 1. the environment-interaction logic
# 2. the reward function
# 3. the goal-sampling strategy (the "future" strategy is used here)
# 4. the exploration strategy (plain epsilon-greedy here)
# 5. the network architecture and hyperparameters
Key implementation details:
- Double-network architecture: the standard DQN design (an online network plus a target network) is used to stabilize training.
- State-goal concatenation: the state and the goal are always concatenated before being fed to the network:
  torch.cat([state, goal], dim=1)
- Goal sampling ("future" strategy): substitute goals are drawn from states actually reached later in the same episode:
  future_indices = np.random.choice(
      np.arange(t, len(episode_transitions)),
      size=HER_SAMPLE_N,
      replace=True
  )
- Reward recomputation: every relabeled transition gets its reward recomputed against the substitute goal before it is stored:
  new_reward = compute_reward(next_state, action, g)
  buffer.add((
      torch.cat([state, g]),
      action,
      new_reward,
      torch.cat([next_state, g])
  ))
- Goal-conditioned Q-values: the concatenated input is split back into its state and goal parts when querying the network:
  current_q = dqn(state_batch[:, :state_dim],
                  state_batch[:, state_dim:]).gather(1, action_batch)
In practice, the parts flagged in the code comments above (environment interaction, reward function, goal-sampling strategy, exploration strategy, network architecture and hyperparameters) need to be adapted to the task at hand; the goal-sampling strategy, for example, can be varied as sketched below.
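The HER paper describes several relabeling variants (final, future, episode, and a "random" variant that draws achieved goals from the whole replay buffer, omitted here). The helper below is a hypothetical sketch of how the first three could be swapped in for the inline sampling used above; the function name and signature are illustrative and not part of the original code:

import numpy as np

def sample_her_goals(episode_transitions, t, strategy="future", k=4):
    """Return k substitute goals for the transition at index t.

    episode_transitions holds (state, action, next_state, goal) tuples;
    the "achieved goal" of a transition is taken to be its next_state.
    """
    achieved = [tr[2] for tr in episode_transitions]
    if strategy == "final":      # the goal achieved at the very end of the episode
        return [achieved[-1]] * k
    if strategy == "future":     # goals achieved later in the same episode (used above)
        idx = np.random.randint(t, len(achieved), size=k)
    elif strategy == "episode":  # goals achieved anywhere in the same episode
        idx = np.random.randint(0, len(achieved), size=k)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [achieved[i] for i in idx]

With such a helper, the inline future_indices sampling in the training loop could be replaced by additional_goals = sample_her_goals(episode_transitions, t, strategy="future", k=HER_SAMPLE_N).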
By relabeling the goals of failed experience and recomputing their rewards, HER turns a sparse-reward problem into one with a much richer learning signal, making it an efficient way to tackle hard-exploration environments. Its core insight, that there is still something valuable to learn even when the original goal was not reached, is widely applied in robotic control, game AI, and related areas.