Reinforcement learning studies agents and how they learn by trial and error. It formalizes the idea that rewarding (or punishing) an agent's behavior makes the agent more (or less) likely to repeat that behavior in the future.
Reinforcement learning is a third category of machine learning alongside supervised and unsupervised learning; the relationship among the three is shown in the figure below:
The biggest difference between reinforcement learning and supervised learning is that reinforcement learning has no prepared output labels in its training data. It only has a reward signal, and this reward differs from a supervised output: it is not given in advance but arrives with a delay, the way the brain only registers the feedback after you fall while walking. Moreover, each step in reinforcement learning is closely tied in time to the steps before and after it, whereas supervised training samples are generally independent, with no such temporal dependency.
The difference between reinforcement learning and unsupervised learning also comes down to the reward. Unsupervised learning has neither output labels nor rewards, only data features. And, like supervised learning, its samples are independent of one another, without the temporal dependencies of reinforcement learning.
The modeling process of reinforcement learning is shown in the figure:
The brain at the top represents the agent running our algorithm; we control the agent to make decisions, i.e. to choose a suitable action $A_t$. The earth below represents the environment under study, which has its own state model. After we choose action $A_t$, the environment's state changes and we observe that it has become $S_{t+1}$; at the same time we receive the delayed reward $R_{t+1}$ for taking action $A_t$. The agent then chooses its next suitable action, the environment's state changes again, a new reward arrives, and so on. This is the core loop of reinforcement learning.
Let us organize the reinforcement learning elements that appear in this loop.
The first is the environment's state $S$: the state $S_t$ at time $t$ is one element of the environment's state set.
The second is the agent's action $A$: the action $A_t$ the agent takes at time $t$ is one element of its action set.
The third is the environment's reward $R$: the reward $R_{t+1}$ corresponding to the agent taking action $A_t$ in state $S_t$ at time $t$ is received at time $t+1$.
The following model elements are somewhat more involved.
The fourth is the agent's policy $\pi$: it is the basis on which the agent chooses actions. The most common representation of a policy is a conditional probability distribution $\pi(a \mid s)$, the probability of taking action $a$ in state $s$, i.e. $\pi(a \mid s) = P(A_t = a \mid S_t = s)$. Actions with higher probability under the policy are more likely to be chosen.
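As a minimal sketch of such a tabular policy, the snippet below stores $\pi(a \mid s)$ as nested dictionaries and samples an action accordingly; the state names, actions, and probabilities are invented purely for illustration.

```python
import random

# A hypothetical tabular policy pi(a|s): for each state, a probability
# distribution over actions (all names and numbers are made up).
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(pi, state, rng=None):
    """Draw an action a with probability pi(a | state)."""
    rng = rng or random.Random(0)
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))
```

Over many samples in state "s0", "right" should come up roughly four times as often as "left", matching its probability under the policy.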
The fifth is the value function $v_{\pi}(s)$: it represents the value of being in state $s$ under policy $\pi$, and is generally defined as an expectation. Looking only at the immediate delayed reward $R_{t+1}$ is not enough, because a high reward now does not guarantee high rewards at times $t+1, t+2, \ldots$. In chess, for example, a move may capture the opponent's rook for a high immediate reward, yet we go on to lose the game; the capturing move has a high reward but a low value. The value must therefore weigh both the current delayed reward and the subsequent ones. The value function $v_{\pi}(s)$ can generally be written as below; different algorithms use variants of it, but the idea is the same:

$$v_{\pi}(s) = \mathbb{E}_{\pi}\left(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right)$$
The sixth is the reward discount factor $\gamma$, which lies in $[0, 1]$. If it is 0, the method is greedy: the value is determined by the current delayed reward alone. If it is 1, all later rewards count exactly as much as the current one. Most of the time we pick a value strictly between 0 and 1, so the current delayed reward carries more weight than subsequent ones.
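To make the role of $\gamma$ concrete, here is a minimal sketch (the reward sequence is invented for illustration) that evaluates the discounted sum in the value formula above for different discount factors:

```python
def discounted_return(rewards, gamma):
    """G = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    computed backwards: G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.0))  # 1.0 — greedy: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 3.0 — all rewards weighted equally
print(discounted_return(rewards, 0.5))  # 1.5 — 1 + 0.5*0 + 0.25*2
```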
The seventh is the environment's state transition model $P_{ss'}^a$: it can be understood as a probabilistic state machine, a probability model giving the probability of moving to the next state $s'$ after taking action $a$ in state $s$, written $P_{ss'}^a$.
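A minimal sketch of how such a transition model might be stored, with made-up states, actions, and probabilities; `P[s][a]` maps each successor state $s'$ to $P_{ss'}^a$:

```python
# Hypothetical transition model: P[s][a][s'] = probability of reaching s'
# after taking action a in state s (all entries invented for illustration).
P = {
    "s0": {"go": {"s1": 0.7, "s2": 0.3}},
    "s1": {"go": {"s1": 1.0}},
}

def transition_prob(s, a, s_next):
    """Return P_{s s'}^a, defaulting to 0 for unreachable successors."""
    return P[s][a].get(s_next, 0.0)

# The probabilities over successors of any (s, a) pair must sum to 1.
assert abs(sum(P["s0"]["go"].values()) - 1.0) < 1e-12
print(transition_prob("s0", "go", "s1"))  # 0.7
```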
The eighth is the exploration rate $\varepsilon$, used mainly during training iterations. Because we usually pick the action with the highest value in the current iteration, promising actions we have never tried can be missed. So when selecting the optimal action during training, with probability $\varepsilon$ we deliberately choose some other action instead of the one with the highest current value.
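This exploration/exploitation trade-off can be sketched as an $\varepsilon$-greedy rule; the snippet below is a generic sketch, not tied to the tic-tac-toe code later in this article:

```python
import random

def epsilon_greedy(values, epsilon, rng=None):
    """With probability epsilon pick a random action index (explore);
    otherwise pick the index with the highest estimated value (exploit)."""
    rng = rng or random.Random(0)
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

# With epsilon = 0 the choice is purely greedy:
print(epsilon_greedy([0.1, 0.5, 0.3], 0.0))  # 1
```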
These eight are the basic elements of a reinforcement learning model. Different reinforcement learning models may consider additional elements or omit some of the above, but most models build on these eight.
Our example is a simple game: on a 3x3 grid, two players take turns placing pieces. If a player's pieces fill a row, a column, or a diagonal, that player wins and the game ends; if the grid fills up with no winner, the game is a tie.
First, look at the first element, the environment's state $S$. The grid has nine cells, and each cell is in one of three states: empty (value 0), holding the first player's piece (value 1), or holding the second player's piece (value -1). The model therefore has at most $3^9 = 19683$ states.
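The $3^9$ count comes from reading the board as a 9-digit base-3 number. The sketch below mirrors the base-3 hash used by the full code later in this article and checks the bound:

```python
# Each of the 9 cells takes one of 3 values, so a board maps to a base-3
# number in [0, 3^9). This mirrors the hash the full code uses: -1 is
# re-encoded as digit 2 so all digits lie in {0, 1, 2}.
def board_hash(cells):
    """cells: flat list of 9 values in {0, 1, -1}."""
    h = 0
    for v in cells:
        h = h * 3 + (2 if v == -1 else v)
    return h

print(board_hash([0] * 9))   # 0 — the empty board
print(board_hash([-1] * 9))  # 19682 — the largest encoding, 3^9 - 1
print(3 ** 9)                # 19683 — upper bound on distinct boards
```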
Next, the agent's action $A$: there are only nine cells and each turn places one piece, so there are at most nine action choices. In practice, cells that already hold a piece cannot be played, so the actually available actions are exactly the cells with value 0.
Third is the environment's reward $R$, which we generally design ourselves. Since the goal is to win, an action whose resulting state wins and ends the game gets the highest reward, and one that loses gets the lowest. All other moves by either player earn a reward, but a smaller one. In particular, for the player who moves first, non-terminating moves are rewarded less than for the player who moves second.
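One hedged way to encode this reward design is a small function over game outcomes; the 1.0 / 0.5 / 0.0 ordering below matches the terminal value initialization used later in the code, though any ordering with win > tie > loss would do:

```python
# A hypothetical terminal reward scheme matching the design above:
# winning yields the highest reward, losing the lowest, and a tie
# sits in between (the exact numbers are a free design choice).
def terminal_reward(winner, my_symbol):
    if winner == my_symbol:
        return 1.0   # we won: highest reward
    if winner == 0:
        return 0.5   # tie: between winning and losing
    return 0.0       # we lost: lowest reward

print(terminal_reward(1, 1))  # 1.0
```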
Fourth is the agent's policy $\pi$, which here is learned: in each round the AI chooses the currently highest-valued action with high probability, and explores a new action with a small probability. The AI's policy is implemented in the code below.
The epsilon in the code is our eighth element, the exploration rate $\varepsilon$: the policy selects the action with the highest current value with probability $1-\varepsilon$, and a random action with probability $\varepsilon$.
Fifth is the value function, stored as estimations in the code. The value-update code considers only the action's existing value and the value obtained afterwards, so our sixth element, the discount factor $\gamma$, can be regarded as 0. The update happens in the backup method of the Player class below; why the value function is updated this way will be covered later.
Seventh is the environment's state transition model. Here, the next state after every action is deterministic (which player's piece sits in each cell is fully determined), so every transition probability is 1; no action leads with some probability to several different new states. This keeps things simple.[1]
import numpy as np
import pickle

# number of board rows, columns, and total cells
BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS
class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # [[0., 0., 0.],
        #  [0., 0., 0.],
        #  [0., 0., 0.]]
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in self.data.reshape(BOARD_ROWS * BOARD_COLS):
                if i == -1:
                    i = 2
                self.hash_val = self.hash_val * 3 + i
        return int(self.hash_val)
    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check rows
        for i in range(0, BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(0, BOARD_COLS):
            results.append(np.sum(self.data[:, i]))
        # check diagonal: \
        results.append(0)
        for i in range(0, BOARD_ROWS):
            results[-1] += self.data[i, i]
        # check diagonal: /
        results.append(0)
        for i in range(0, BOARD_ROWS):
            results[-1] += self.data[i, BOARD_ROWS - 1 - i]
        # determine the winner
        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end
        # check whether it's a tie: if every cell holds a 1 or a -1,
        # the sum of absolute values over all cells is 9
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_ROWS * BOARD_COLS:
            self.winner = 0
            self.end = True
            return self.end
        # game is still going on
        self.end = False
        return self.end
    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print(self):
        for i in range(0, BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(0, BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                if self.data[i, j] == 0:
                    token = '0'
                if self.data[i, j] == -1:
                    token = 'x'
                out += token + ' | '
            print(out)
        print('-------------')
def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(0, BOARD_ROWS):
        for j in range(0, BOARD_COLS):
            if current_state.data[i][j] == 0:
                newState = current_state.next_state(i, j, current_symbol)
                newHash = newState.hash()
                if newHash not in all_states.keys():
                    isEnd = newState.is_end()
                    all_states[newHash] = (newState, isEnd)
                    if not isEnd:
                        get_all_states_impl(newState, -current_symbol, all_states)

# get all possible states
def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states

# all possible board configurations
all_states = get_all_states()
class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    # generator that yields the two players in turn
    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print: if True, print each board during the game
    def play(self, print=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        while True:
            player = next(alternator)
            if print:
                current_state.print()
            [i, j, symbol] = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if is_end:
                if print:
                    current_state.print()
                return current_state.winner
# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states.keys():
            (state, is_end) = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimations backwards along the trajectory
    def backup(self):
        self.states = [state.hash() for state in self.states]
        for i in reversed(range(len(self.states) - 1)):
            state = self.states[i]
            td_error = self.greedy[i] * (
                self.estimations[self.states[i + 1]] - self.estimations[state])
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(i, j, self.symbol).hash())
        # explore: with probability epsilon pick a random empty position
        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action
        # exploit: pick the position with the highest estimated value,
        # breaking ties randomly via the shuffle
        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)
# human interface
# input a key to put a chessman:
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    # a human player does not learn, so there is nothing to back up
    def backup(self, *args):
        pass

    def act(self):
        self.state.print()
        key = input("Input your position: ")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return (i, j, self.symbol)
def train(epochs):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        print('Epoch %d, player 1 win %.02f, player 2 win %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()

def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for i in range(0, turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))

# The game is a zero-sum game. If both players play an optimal strategy,
# every game ends in a tie. So we test whether the AI can guarantee at
# least a tie when it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")

if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
[1] 刘建平Pinard, 强化学习(一)模型基础: https://www.cnblogs.com/pinard/p/9385570.html