We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
In short: this paper studies applying Q-learning to multiple interacting agents (whose relationships may be competitive, cooperative, or a mix of both, e.g. a shop owner and a customer), within the framework of general-sum stochastic games. Each agent maintains Q-functions over joint actions and updates them by assuming Nash equilibrium behavior over the current Q-values; under certain conditions this learning protocol provably converges. The paper compares two grid games: in the first, which has a unique equilibrium Q-function, learning consistently converges, while in the second, which has three different equilibrium Q-functions, learning sometimes fails to converge. In a comparison of offline learning performance in both games, agents are more likely to reach a joint optimal path with Nash Q-learning than with single-agent Q-learning. The paper also implements an online version of Nash Q-learning that balances exploration and exploitation.
Reinforcement learning allows agents to learn while acting, i.e. to learn on the job without being given an environment model in advance. In a typical multiagent system, each agent lacks complete information about the others, so the multiagent environment keeps changing as the agents learn about one another and adapt their behavior accordingly.
There are two reasons why single-agent Q-learning cannot simply be applied to each agent independently: (1) an environment containing multiple agents is no longer stationary, so the single-agent theory does not apply; (2) the non-stationarity of the environment does not arise from an arbitrary stochastic process, but from other agents whose behavior is regular in important ways. Multiagent reinforcement learning is therefore not just single-agent Q-learning copied onto several agents; it requires a new formulation.
To extend Q-learning to the multiagent setting, the general-sum stochastic game framework is adopted. In a stochastic game, each agent's reward depends on the joint action of all agents and on the current state, and state transitions obey the Markov property. In a purely competitive relationship the agents' rewards are negatively correlated, while in a cooperative relationship they are positively correlated.
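For concreteness, a general-sum stochastic game with $n$ agents can be written in the standard way as a tuple (the symbols introduced here are reused in the formulas below):
$$\Gamma = \langle S,\; A^1, \dots, A^n,\; r^1, \dots, r^n,\; p \rangle,$$
where $S$ is the state space, $A^i$ is agent $i$'s action space, $r^i : S \times A^1 \times \cdots \times A^n \to \mathbb{R}$ is agent $i$'s reward function over joint actions, and $p(s' \mid s, a^1, \dots, a^n)$ is the Markov transition probability.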
The basic solution concept for general-sum stochastic games is the Nash equilibrium, in which each player effectively holds correct expectations about the other players' behavior and acts rationally with respect to those expectations.
Goal of the work: each agent should reach its goal in as few steps as possible; to do so it must learn the other agents' strategies and respond to them, continually interacting with the other agents while choosing its own next action.
The NashQ algorithm generalizes single-agent Q-learning to stochastic games by replacing expected-utility maximization (the single-agent operator) with an equilibrium operator.
For reference, the single-agent Q-function and its update rule are:
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a'),$$
$$Q_{t+1}(s, a) = (1 - \alpha_t)\, Q_t(s, a) + \alpha_t \left[ r_t + \gamma \max_{a'} Q_t(s', a') \right],$$
where $\alpha_t$ is the learning rate, taking values between 0 and 1, and $\gamma$ is the discount factor.
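As a minimal code sketch of this single-agent update (the state count, action count, and the numbers in the call are made up for illustration; the multiagent version appears later in this post):
import numpy as np

n_states, n_actions = 3, 2
alpha, gamma = 0.1, 0.99            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # single-agent Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)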
The stochastic game framework models multiple agents acting in discrete time with possibly conflicting interests; each agent's objective is to maximize its discounted sum of rewards, given by:
$$v^i(s, \pi^1, \dots, \pi^n) = \sum_{t=0}^{\infty} \gamma^{t}\, E\!\left(r^i_t \mid \pi^1, \dots, \pi^n, s_0 = s\right),$$
where $\gamma \in [0, 1)$ is the discount factor, $\pi^j$ is agent $j$'s strategy, and $r^i_t$ is agent $i$'s reward at time $t$.
A Nash equilibrium is a joint strategy in which each agent's strategy is a best response to the other agents' strategies.
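Formally, a joint strategy $(\pi^1_*, \dots, \pi^n_*)$ is a Nash equilibrium if, for every agent $i$, every state $s$, and every alternative strategy $\pi^i$,
$$v^i(s, \pi^1_*, \dots, \pi^n_*) \;\ge\; v^i(s, \pi^1_*, \dots, \pi^{i-1}_*, \pi^i, \pi^{i+1}_*, \dots, \pi^n_*),$$
i.e. no agent can improve its discounted value by deviating unilaterally.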
Nash Q-learning extends the single-agent $Q(s, a)$ to a Q-function over joint actions, $Q(s, a^1, a^2, \dots)$. Agent $i$'s Nash Q-value is defined as:
$$Q^i_*(s, a^1, \dots, a^n) = r^i(s, a^1, \dots, a^n) + \gamma \sum_{s'} p(s' \mid s, a^1, \dots, a^n)\, v^i(s', \pi^1_*, \dots, \pi^n_*),$$
the current reward plus the discounted future reward obtained when all agents follow a joint Nash equilibrium strategy from the next state onward.
Nash Q-learning differs from ordinary Q-learning in one key respect: how the Q-values of the next state are used to update the Q-values of the current state. Single-agent Q-learning updates toward the agent's own maximum payoff, $r + \gamma \max_{a'} Q(s', a')$, whereas multiagent Nash Q-learning updates toward the future Nash equilibrium payoff, $r^i + \gamma\, \mathrm{NashQ}^i(s')$.
The Nash Q-value update rule is:
$$Q^i_{t+1}(s, a^1, \dots, a^n) = (1 - \alpha_t)\, Q^i_t(s, a^1, \dots, a^n) + \alpha_t \left[ r^i_t + \gamma\, \mathrm{NashQ}^i_t(s') \right],$$
where $\mathrm{NashQ}^i_t(s') = \pi^1(s') \cdots \pi^n(s') \cdot Q^i_t(s')$ is agent $i$'s expected payoff in state $s'$ when all agents play a Nash equilibrium $(\pi^1(s'), \dots, \pi^n(s'))$ of the stage game defined by $(Q^1_t(s'), \dots, Q^n_t(s'))$.
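To make the $\mathrm{NashQ}^i(s')$ term concrete, here is a minimal sketch of how a stage-game equilibrium and the resulting Nash payoff can be computed with the nashpy library (which the implementation below also uses); the two 2x2 Q-tables are made-up values, not numbers from the paper:
import numpy as np
import nashpy

q1 = np.array([[100.0, 0.0], [0.0, 50.0]])    # agent 1's Q(s', a1, a2)
q2 = np.array([[50.0, 0.0], [0.0, 100.0]])    # agent 2's Q(s', a1, a2)

stage_game = nashpy.Game(q1, q2)
pi1, pi2 = next(stage_game.support_enumeration())  # one Nash equilibrium of the stage game

nash_q1 = pi1 @ q1 @ pi2         # agent 1's Nash payoff: pi1^T Q1 pi2
target = 0.0 + 0.99 * nash_q1    # update target r^1 + gamma * NashQ^1(s'), with r^1 = 0 here
print(pi1, pi2, nash_q1)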
The paper then proves convergence of $Q_t$ to the equilibrium Q-values $Q_*$ under certain restrictions on the stage games that arise during learning.
The Nash Q algorithm is validated with games on two different grids. Although simple, the grid-world games have all the key elements of a dynamic game: location-specific (state-specific) actions, qualitative transitions (agents moving around), and both immediate and long-term rewards. The figure below shows the two games: the bottom cells are the starting positions, the numbered cells at the top are the corresponding agents' goals, and agents can only move up, down, left, or right.
Each agent's objective is to reach its goal cell in as few steps as possible; when choosing its own action it can observe the previous actions of both agents and the current state.
Reward settings: reaching its own goal cell earns a reward of 100; moving to any other cell without colliding with the other agent earns 0; colliding with the other agent earns -1, and both agents are bounced back to their previous positions.
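As an illustration of these rules only (the runnable code later in this post uses a simpler matrix game instead), a minimal sketch of such a grid environment might look as follows; the class name GridGame, the 3x3 grid, and the start/goal coordinates are assumptions for the sketch, not the paper's implementation:
class GridGame():
    # two agents on a small grid; rewards follow the rules described above
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, size=3):
        self.size = size
        self.pos = [(0, 0), (2, 0)]      # assumed starting cells on the bottom row
        self.goals = [(2, 2), (0, 2)]    # assumed goal cells on the top row

    def _move(self, pos, action):
        dx, dy = self.MOVES[action]
        x = min(max(pos[0] + dx, 0), self.size - 1)
        y = min(max(pos[1] + dy, 0), self.size - 1)
        return (x, y)

    def step(self, action1, action2):
        new_pos = [self._move(self.pos[0], action1), self._move(self.pos[1], action2)]
        if new_pos[0] == new_pos[1]:
            # collision: both agents are bounced back and receive -1
            return tuple(self.pos), (-1, -1), False
        self.pos = new_pos
        rewards = tuple(100 if self.pos[i] == self.goals[i] else 0 for i in range(2))
        done = any(r == 100 for r in rewards)   # episode ends when an agent reaches its goal
        return tuple(self.pos), rewards, done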
The figures above show the Nash equilibrium paths for grid game 1 and grid game 2. A game restarts once an agent reaches its goal cell. In the new episode each agent is assigned a random starting position (other than its goal cell), and the learning agents keep the Q-values learned in previous episodes. Training stops after 5000 episodes.
Contributions of the paper:
(1) It provides a starting point for further theoretical work on multiagent Q-learning, offering a known sufficient condition for convergence that, in specific settings, can be relaxed by exploiting additional known structure of the game or of the equilibrium-selection process;
(2) No other existing multiagent learning method provides directly applicable performance guarantees for general-sum stochastic games.
import numpy as np


# epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
class EpsGreedyQPolicy():
def __init__(self, epsilon=.1, decay_rate=1):
self.epsilon = epsilon
self.decay_rate = decay_rate
def select_action(self, q_values):
assert q_values.ndim == 1
nb_actions = q_values.shape[0]
if np.random.uniform() < self.epsilon:
            action = np.random.randint(nb_actions)  # random action for exploration
else:
action = np.argmax(q_values)
return action
def select_greedy_action(self, q_values):
assert q_values.ndim == 1
action = np.argmax(q_values)
return action
class MatrixGame():
def __init__(self):
self.reward_matrix = self._create_reward_table()
def step(self, action1, action2):
r1 = self.reward_matrix[0][action1][action2]
r2 = self.reward_matrix[1][action1][action2]
return None, r1, r2
def _create_reward_table(self):
reward_matrix = [
[[1, -1], [-1, 1]],
[[-1, 1], [1, -1]]
]
return reward_matrix
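The payoff tables above encode a matching-pennies-style zero-sum game: reward_matrix[0] holds agent 1's payoffs and reward_matrix[1] holds agent 2's, indexed by (action1, action2). A quick usage check:
game = MatrixGame()
_, r1, r2 = game.step(0, 1)
print(r1, r2)   # -1 1: when the actions differ, agent 2 wins
Since the only Nash equilibrium of matching pennies is for both players to mix uniformly over their two actions, the pi(0) curves plotted by the script at the end of this post should settle around 0.5.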
import numpy as np
import nashpy
class NashQLearner():
def __init__(self,
alpha=0.1,
policy=None,
gamma=0.99,
ini_state="nonstate",
actions=None):
self.alpha = alpha
self.gamma = gamma
self.policy = policy
self.actions = actions
self.state = ini_state
# q values (my and opponent)
self.q, self.q_o = {}, {}
self.q[ini_state] = {}
self.q_o[ini_state] = {}
# nash q value
self.nashq = {}
self.nashq[ini_state] = 0
# pi (my and opponent)
self.pi, self.pi_o = {}, {}
self.pi[ini_state] = np.repeat(1.0/len(self.actions), len(self.actions))
self.pi_o[ini_state] = np.repeat(1.0/len(self.actions), len(self.actions))
self.previous_action = None
self.reward_history = []
self.pi_history = []
def act(self, training=True):
if training:
action_id = self.policy.select_action(self.pi[self.state])
action = self.actions[action_id]
self.previous_action = action
else:
            action_id = self.policy.select_greedy_action(self.pi[self.state])
action = self.actions[action_id]
return action
def observe(self, state="nonstate", reward=None, reward_o=None, opponent_action=None, is_learn=True):
"""
observe next state and learn
"""
if is_learn:
self.check_new_state(state) # if the state is new state, extend q table
self.learn(state, reward, reward_o, opponent_action)
def learn(self, state, reward, reward_o, opponent_action):
self.reward_history.append(reward)
self.q[state][(self.previous_action, opponent_action)] = self.compute_q(state, reward, opponent_action, self.q)
self.q_o[state][(self.previous_action, opponent_action)] = self.compute_q(state, reward_o, opponent_action, self.q_o)
self.pi[state], self.pi_o[state] = self.compute_pi(state)
self.nashq[state] = self.compute_nashq(state)
self.pi_history.append(self.pi[state][0])
def compute_q(self, state, reward, opponent_action, q):
if (self.previous_action, opponent_action) not in q[state].keys():
q[state][(self.previous_action, opponent_action)] = 0.0
q_old = q[state][(self.previous_action, opponent_action)]
updated_q = q_old + (self.alpha * (reward + self.gamma*self.nashq[state] - q_old))
return updated_q
def compute_nashq(self, state):
"""
compute nash q value
"""
nashq = 0
for action1 in self.actions:
for action2 in self.actions:
nashq += self.pi[state][action1]*self.pi_o[state][action2] * \
self.q[state][(action1, action2)]
return nashq
def compute_pi(self, state):
"""
compute pi (nash)
"""
q_1, q_2 = [], []
for action1 in self.actions:
row_q_1, row_q_2 = [], []
for action2 in self.actions:
joint_action = (action1, action2)
row_q_1.append(self.q[state][joint_action])
row_q_2.append(self.q_o[state][joint_action])
q_1.append(row_q_1)
q_2.append(row_q_2)
game = nashpy.Game(q_1, q_2)
equilibria = game.support_enumeration()
pi = []
for eq in equilibria:
pi.append(eq)
return pi[0][0], pi[0][1]
    def check_new_state(self, state):
        """
        if the state is new, extend the q tables, strategies, and nash q value
        """
        if state not in self.q.keys():
            self.q[state] = {}
            self.q_o[state] = {}
        if state not in self.pi.keys():
            # uniform initial strategies and a random initial nash value for the new state
            self.pi[state] = np.repeat(
                1.0/len(self.actions), len(self.actions))
            self.pi_o[state] = np.repeat(
                1.0/len(self.actions), len(self.actions))
            self.nashq[state] = np.random.random()
        for action1 in self.actions:
            for action2 in self.actions:
                if (action1, action2) not in self.q[state].keys():
                    # unseen joint actions start from random q values
                    self.q[state][(action1, action2)] = np.random.random()
                    self.q_o[state][(action1, action2)] = np.random.random()
import matplotlib.pyplot as plt
if __name__ == '__main__':
nb_episode = 1000
agent1 = NashQLearner(alpha=0.1, policy=EpsGreedyQPolicy(), actions=np.arange(2))
agent2 = NashQLearner(alpha=0.1, policy=EpsGreedyQPolicy(), actions=np.arange(2))
game = MatrixGame()
for episode in range(nb_episode):
action1 = agent1.act()
action2 = agent2.act()
_, r1, r2 = game.step(action1, action2)
agent1.observe(reward=r1, reward_o=r2, opponent_action=agent2.previous_action)
agent2.observe(reward=r2, reward_o=r1, opponent_action=agent1.previous_action)
plt.plot(np.arange(len(agent1.pi_history)), agent1.pi_history, label="agent1's pi(0)")
plt.plot(np.arange(len(agent2.pi_history)), agent2.pi_history, label="agent2's pi(0)")
plt.xlabel("episode")
plt.ylabel("pi(0)")
plt.legend()
plt.savefig(r"C:\Users\Administrator\Desktop\Implement-of-algorithm\Fig\Nash-Q.jpg")
plt.show()
Reference blog: 《多智能体学习:强化学习方法》——代码实现