Deep Reinforcement Learning in Large Discrete Action Spaces--Wolpertinger Architecture


When using DDPG on my own problems, I often run into very large action spaces, which make the algorithm converge very slowly or not at all. How can such Large Discrete Action Spaces be handled?


A k-nearest-neighbor mapping can map the action output by the DDPG policy network to its K closest discrete actions, which helps convergence.

[1]G. Dulac-Arnold et al., ‘Deep Reinforcement Learning in Large Discrete Action Spaces’. arXiv, Apr. 04, 2016. Accessed: Dec. 27, 2023. [Online]. Available:


ChangyWen/wolpertinger_ddpg: Wolpertinger Training with DDPG (Pytorch), Deep Reinforcement Learning in Large Discrete Action Spaces. Multi-GPU/Singer-GPU/CPU compatible. (



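When the action space in RL is large or continuous, DDPG or PPO is the usual choice.

DDPG outputs continuous values, i.e. a single deterministic action (or action vector). For a large discrete action space, this continuous output therefore has to be mapped to a discrete action. Even with DDPG, large-scale problems remain hard to converge.

The approach taken: Wolpertinger Architecture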


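The core of the Wolpertinger Architecture is the use of approximate nearest-neighbor methods to generalize over actions.

Specifically, standard DDPG takes the action \hat{a} output by the policy network, executes it, and moves to the next state. With the Wolpertinger policy, the action output by the policy network is not executed directly. It is first mapped to a set of K candidate actions A_k, the critic network then evaluates the Q-value of each of these K actions, and the action with the highest Q-value is executed.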
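
In symbols, the two steps can be summarized roughly as follows. This is a sketch of the description above, not a quotation: f_\theta stands for the policy (actor) network, Q for the critic, s for the current state, and A for the full discrete action set; these names are assumed here for readability.

$$
\hat{a} = f_\theta(s), \qquad A_k = \{\text{the } K \text{ actions in } A \text{ nearest to } \hat{a}\}, \qquad \pi(s) = \arg\max_{a \in A_k} Q(s, a)
$$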

Action generation:

This raises the next question: how exactly is \hat{a} mapped to K actions?


g_k is a k-nearest-neighbor mapping from a continuous space to a discrete set. It returns the k actions in A that are closest to \hat{a} by L_2 distance.
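
The mapping g_k can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's or the linked repository's code: it uses an exact brute-force search instead of an approximate nearest-neighbor index, and the names `knn_actions` and `action_table` as well as the toy action set are assumptions made for the example.

```python
import numpy as np

def knn_actions(proto_action, action_table, k):
    """Return indices and embeddings of the k actions closest to proto_action (L2)."""
    # L2 distance from the proto-action to every discrete action embedding.
    dists = np.linalg.norm(action_table - proto_action, axis=1)
    k = min(k, len(action_table))
    # argpartition selects the k smallest distances without a full sort;
    # the second step orders just those k from nearest to farthest.
    idx = np.argpartition(dists, k - 1)[:k]
    idx = idx[np.argsort(dists[idx])]
    return idx, action_table[idx]

# Hypothetical action set: 11 one-dimensional actions evenly spaced in [0, 1].
action_table = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
idx, neighbors = knn_actions(np.array([0.33]), action_table, k=3)
```

The brute-force distance computation costs time linear in the size of the action set, so for truly large action sets an approximate nearest-neighbor index would likely replace the `np.linalg.norm` line.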

Action refinement:





Time-complexity of the above algorithm scales linearly in the number of selected actions, k.
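
The refinement step, and why its cost grows with k, can be sketched as below. This is an illustrative example under stated assumptions: `refine_action` is a hypothetical helper, and `toy_q` is a hand-written stand-in for a trained critic network, not a real Q-function.

```python
import numpy as np

def refine_action(state, candidates, q_fn):
    """Pick the candidate action with the highest critic value Q(state, a)."""
    # One critic evaluation per candidate, so the cost is linear in k.
    q_values = np.array([q_fn(state, a) for a in candidates])
    best = int(np.argmax(q_values))
    return candidates[best], q_values

# Hypothetical stand-in for a trained critic: prefers actions close to the state.
def toy_q(state, action):
    return -float(np.sum((action - state) ** 2))

state = np.array([0.42])
# Candidate set A_k, e.g. the output of a k-nearest-neighbor lookup.
candidates = np.array([[0.3], [0.4], [0.2]])
best_action, q_values = refine_action(state, candidates, toy_q)
```

With a neural-network critic, the k evaluations would normally be batched into a single forward pass, but the amount of work still scales with k.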
