Previous article:
The code comes from https://zhuanlan.zhihu.com/p/32089487
The Gomoku version of the AlphaZero network starts with a common body of 3 fully convolutional layers using 32, 64, and 128 filters of size 3×3 respectively, each with a ReLU activation. The network then splits into a policy head and a value head. On the policy side, 4 filters of size 1×1 first reduce the dimensionality, followed by a fully connected layer with a softmax nonlinearity that directly outputs the probability of playing at each position on the board. On the value side, 2 filters of size 1×1 reduce the dimensionality, followed by a fully connected layer with 64 neurons and then a final fully connected layer with a tanh nonlinearity that outputs a position score in [-1, 1]. The whole policy-value network is only 5-6 layers deep, so both training and inference are relatively fast.
For training the model I used the TensorFlow version provided by the author. One thing to note is that the TensorFlow API frequently exposes a data_format argument, which can be channels_last or channels_first. TensorFlow chose NHWC as its default format because early development was CPU-based, and NHWC is slightly faster than NCHW on a CPU (which is easy to understand: NHWC has better locality and therefore better cache utilization). NCHW is the default format of Nvidia cuDNN, and it is usually the faster choice when training on a GPU (with a few exceptions).
NHWC: the layout is [batch, height, width, channels];
NCHW: the layout is [batch, channels, height, width].
Here N is the number of images in the batch, H is the number of pixels in the vertical direction, W is the number of pixels in the horizontal direction, and C is the number of channels (for example, C = 1 for a grayscale image, C = 3 for an RGB image, and the Gomoku network uses 4 input planes).
When designing the network, it is best to account for both formats and be able to switch between them: use NCHW when training on a GPU and NHWC when running inference on a CPU.
Since my machine has no GPU support, the NCHW input has to be transposed to NHWC:
self.input_states = tf.placeholder(
        tf.float32, shape=[None, 4, board_height, board_width])
self.input_state = tf.transpose(self.input_states, [0, 2, 3, 1])
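As a quick sanity check on the permutation indices, here is a minimal NumPy sketch; the 2×4×6×6 batch shape is just an illustrative assumption for a 6×6 board with 4 feature planes:

import numpy as np

# a fake NCHW batch: 2 positions, 4 feature planes, 6x6 board
batch_nchw = np.zeros((2, 4, 6, 6))

# [0, 2, 3, 1] moves the channel axis to the end: NCHW -> NHWC
batch_nhwc = np.transpose(batch_nchw, (0, 2, 3, 1))
print(batch_nhwc.shape)   # (2, 6, 6, 4)

# the inverse permutation [0, 3, 1, 2] goes back: NHWC -> NCHW
print(np.transpose(batch_nhwc, (0, 3, 1, 2)).shape)  # (2, 4, 6, 6)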
The full code:
import numpy as np
import tensorflow as tf


class PolicyValueNet():
    def __init__(self, board_width, board_height, model_file=None):
        self.board_width = board_width
        self.board_height = board_height
        # Define the tensorflow neural network
        # 1. Input:
        self.input_states = tf.placeholder(
                tf.float32, shape=[None, 4, board_height, board_width])
        self.input_state = tf.transpose(self.input_states, [0, 2, 3, 1])
        # 2. Common Networks Layers
        self.conv1 = tf.layers.conv2d(inputs=self.input_state,
                                      filters=32, kernel_size=[3, 3],
                                      padding="same", data_format="channels_last",
                                      activation=tf.nn.relu)
        self.conv2 = tf.layers.conv2d(inputs=self.conv1, filters=64,
                                      kernel_size=[3, 3], padding="same",
                                      data_format="channels_last",
                                      activation=tf.nn.relu)
        self.conv3 = tf.layers.conv2d(inputs=self.conv2, filters=128,
                                      kernel_size=[3, 3], padding="same",
                                      data_format="channels_last",
                                      activation=tf.nn.relu)
        # 3-1 Action Networks
        self.action_conv = tf.layers.conv2d(inputs=self.conv3, filters=4,
                                            kernel_size=[1, 1], padding="same",
                                            data_format="channels_last",
                                            activation=tf.nn.relu)
        # Flatten the tensor
        self.action_conv_flat = tf.reshape(
                self.action_conv, [-1, 4 * board_height * board_width])
        # 3-2 Full connected layer, the output is the log probability of moves
        # on each slot on the board
        self.action_fc = tf.layers.dense(inputs=self.action_conv_flat,
                                         units=board_height * board_width,
                                         activation=tf.nn.log_softmax)
        # 4 Evaluation Networks
        self.evaluation_conv = tf.layers.conv2d(inputs=self.conv3, filters=2,
                                                kernel_size=[1, 1],
                                                padding="same",
                                                data_format="channels_last",
                                                activation=tf.nn.relu)
        self.evaluation_conv_flat = tf.reshape(
                self.evaluation_conv, [-1, 2 * board_height * board_width])
        self.evaluation_fc1 = tf.layers.dense(inputs=self.evaluation_conv_flat,
                                              units=64, activation=tf.nn.relu)
        # output the score of evaluation on current state
        self.evaluation_fc2 = tf.layers.dense(inputs=self.evaluation_fc1,
                                              units=1, activation=tf.nn.tanh)

        # Define the Loss function
        # 1. Label: the array containing if the game wins or not for each state
        self.labels = tf.placeholder(tf.float32, shape=[None, 1])
        # 2. Predictions: the array containing the evaluation score of each state
        # which is self.evaluation_fc2
        # 3-1. Value Loss function
        self.value_loss = tf.losses.mean_squared_error(self.labels,
                                                       self.evaluation_fc2)
        # 3-2. Policy Loss function
        self.mcts_probs = tf.placeholder(
                tf.float32, shape=[None, board_height * board_width])
        self.policy_loss = tf.negative(tf.reduce_mean(
                tf.reduce_sum(tf.multiply(self.mcts_probs, self.action_fc), 1)))
        # 3-3. L2 penalty (regularization)
        l2_penalty_beta = 1e-4
        vars = tf.trainable_variables()
        l2_penalty = l2_penalty_beta * tf.add_n(
            [tf.nn.l2_loss(v) for v in vars if 'bias' not in v.name.lower()])
        # 3-4 Add up to be the Loss function
        self.loss = self.value_loss + self.policy_loss + l2_penalty

        # Define the optimizer we use for training
        self.learning_rate = tf.placeholder(tf.float32)
        self.optimizer = tf.train.AdamOptimizer(
                learning_rate=self.learning_rate).minimize(self.loss)

        # Make a session
        self.session = tf.Session()

        # calc policy entropy, for monitoring only
        self.entropy = tf.negative(tf.reduce_mean(
                tf.reduce_sum(tf.exp(self.action_fc) * self.action_fc, 1)))

        # Initialize variables
        init = tf.global_variables_initializer()
        self.session.run(init)

        # For saving and restoring
        self.saver = tf.train.Saver()
        if model_file is not None:
            self.restore_model(model_file)

    def policy_value(self, state_batch):
        """
        input: a batch of states
        output: a batch of action probabilities and state values
        """
        log_act_probs, value = self.session.run(
                [self.action_fc, self.evaluation_fc2],
                feed_dict={self.input_states: state_batch}
                )
        act_probs = np.exp(log_act_probs)
        return act_probs, value

    def policy_value_fn(self, board):
        """
        input: board
        output: a list of (action, probability) tuples for each available
        action and the score of the board state
        """
        legal_positions = board.availables
        current_state = np.ascontiguousarray(board.current_state().reshape(
                -1, 4, self.board_width, self.board_height))
        act_probs, value = self.policy_value(current_state)
        act_probs = zip(legal_positions, act_probs[0][legal_positions])
        return act_probs, value

    def train_step(self, state_batch, mcts_probs, winner_batch, lr):
        """perform a training step"""
        winner_batch = np.reshape(winner_batch, (-1, 1))
        loss, entropy, _ = self.session.run(
                [self.loss, self.entropy, self.optimizer],
                feed_dict={self.input_states: state_batch,
                           self.mcts_probs: mcts_probs,
                           self.labels: winner_batch,
                           self.learning_rate: lr})
        return loss, entropy

    def save_model(self, model_path):
        self.saver.save(self.session, model_path)

    def restore_model(self, model_path):
        self.saver.restore(self.session, model_path)
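A quick smoke test of the class, just to check that the tensor shapes line up; the 6×6 board size and the random batch here are purely illustrative assumptions:

import numpy as np

# build the net for a 6x6 board (4 input feature planes, as above)
net = PolicyValueNet(board_width=6, board_height=6)

# a fake batch of 8 positions in NCHW layout, matching input_states
fake_states = np.random.random((8, 4, 6, 6)).astype(np.float32)

act_probs, values = net.policy_value(fake_states)
print(act_probs.shape)   # (8, 36): one probability per board position
print(values.shape)      # (8, 1): one score in [-1, 1] per position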
The model itself is fairly simple; the key part is the loss function:
# 3-1. Value Loss function
self.value_loss = tf.losses.mean_squared_error(self.labels,
                                               self.evaluation_fc2)
# 3-2. Policy Loss function
self.mcts_probs = tf.placeholder(
        tf.float32, shape=[None, board_height * board_width])
self.policy_loss = tf.negative(tf.reduce_mean(
        tf.reduce_sum(tf.multiply(self.mcts_probs, self.action_fc), 1)))
# 3-3. L2 penalty (regularization)
l2_penalty_beta = 1e-4
vars = tf.trainable_variables()
l2_penalty = l2_penalty_beta * tf.add_n(
    [tf.nn.l2_loss(v) for v in vars if 'bias' not in v.name.lower()])
# 3-4 Add up to be the Loss function
self.loss = self.value_loss + self.policy_loss + l2_penalty
As you can see, the loss is the sum of three terms. As mentioned in the previous article, the loss function is
\ell = (z - v)^2 - \boldsymbol{\pi}^{\mathrm{T}} \log \mathbf{p} + c \lVert \theta \rVert^2,
and the terms correspond one-to-one to the code above: (z - v)^2 is the value loss, -\boldsymbol{\pi}^{\mathrm{T}} \log \mathbf{p} is the policy loss, and c \lVert \theta \rVert^2 is the L2 penalty.
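To make the correspondence concrete, here is a small NumPy sketch of the same three terms outside of TensorFlow; all the numbers are made-up toy values, with mcts_probs playing the role of π and log_act_probs the role of log p:

import numpy as np

# toy batch: 2 positions on a 2x2 board (4 possible moves)
z = np.array([[1.0], [-1.0]])            # game outcomes (labels)
v = np.array([[0.6], [-0.2]])            # network value predictions
mcts_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.25, 0.25, 0.25, 0.25]])        # pi from MCTS
log_act_probs = np.log(np.array([[0.5, 0.2, 0.2, 0.1],
                                 [0.3, 0.3, 0.2, 0.2]]))  # log p from the net

value_loss = np.mean((z - v) ** 2)                                   # (z - v)^2
policy_loss = -np.mean(np.sum(mcts_probs * log_act_probs, axis=1))   # -pi^T log p
weights = [np.random.randn(3, 3), np.random.randn(4)]     # stand-ins for theta
l2_penalty = 1e-4 * sum(0.5 * np.sum(w ** 2) for w in weights)  # c * ||theta||^2

loss = value_loss + policy_loss + l2_penalty
print(loss)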
Once the model is built, we can start training it.
The full code is as follows:
import random
import numpy as np
from collections import defaultdict, deque
# Board, Game, MCTSPlayer, MCTS_Pure and PolicyValueNet come from the
# project's other modules (the game, MCTS and policy-value net implementations)


class TrainPipeline():
    def __init__(self, init_model=None):
        # params of the board and the game
        self.board_width = 6
        self.board_height = 6
        self.n_in_row = 4
        self.board = Board(width=self.board_width,
                           height=self.board_height,
                           n_in_row=self.n_in_row)
        self.game = Game(self.board)
        # training params
        self.learn_rate = 2e-3
        self.lr_multiplier = 1.0  # adaptively adjust the learning rate based on KL
        self.temp = 1.0  # the temperature param
        self.n_playout = 400  # num of simulations for each move
        self.c_puct = 5
        self.buffer_size = 10000
        self.batch_size = 512  # mini-batch size for training
        self.data_buffer = deque(maxlen=self.buffer_size)
        self.play_batch_size = 1
        self.epochs = 5  # num of train_steps for each update
        self.kl_targ = 0.02
        self.check_freq = 50
        self.game_batch_num = 100
        self.best_win_ratio = 0.0
        # num of simulations used for the pure mcts, which is used as
        # the opponent to evaluate the trained policy
        self.pure_mcts_playout_num = 1000
        if init_model:
            # start training from an initial policy-value net
            self.policy_value_net = PolicyValueNet(self.board_width,
                                                   self.board_height,
                                                   model_file=init_model)
        else:
            # start training from a new policy-value net
            self.policy_value_net = PolicyValueNet(self.board_width,
                                                   self.board_height)
        self.mcts_player = MCTSPlayer(self.policy_value_net.policy_value_fn,
                                      c_puct=self.c_puct,
                                      n_playout=self.n_playout,
                                      is_selfplay=1)

    def get_equi_data(self, play_data):
        """augment the data set by rotation and flipping
        play_data: [(state, mcts_prob, winner_z), ..., ...]
        """
        extend_data = []
        for state, mcts_porb, winner in play_data:
            for i in [1, 2, 3, 4]:
                # rotate counterclockwise
                equi_state = np.array([np.rot90(s, i) for s in state])
                equi_mcts_prob = np.rot90(np.flipud(
                    mcts_porb.reshape(self.board_height, self.board_width)), i)
                extend_data.append((equi_state,
                                    np.flipud(equi_mcts_prob).flatten(),
                                    winner))
                # flip horizontally
                equi_state = np.array([np.fliplr(s) for s in equi_state])
                equi_mcts_prob = np.fliplr(equi_mcts_prob)
                extend_data.append((equi_state,
                                    np.flipud(equi_mcts_prob).flatten(),
                                    winner))
        return extend_data

    def collect_selfplay_data(self, n_games=1):
        """collect self-play data for training"""
        for i in range(n_games):
            winner, play_data = self.game.start_self_play(self.mcts_player,
                                                          temp=self.temp)
            play_data = list(play_data)[:]
            self.episode_len = len(play_data)
            # augment the data
            play_data = self.get_equi_data(play_data)
            self.data_buffer.extend(play_data)

    def policy_update(self):
        """update the policy-value net"""
        mini_batch = random.sample(self.data_buffer, self.batch_size)
        state_batch = [data[0] for data in mini_batch]
        mcts_probs_batch = [data[1] for data in mini_batch]
        winner_batch = [data[2] for data in mini_batch]
        old_probs, old_v = self.policy_value_net.policy_value(state_batch)
        for i in range(self.epochs):
            loss, entropy = self.policy_value_net.train_step(
                    state_batch,
                    mcts_probs_batch,
                    winner_batch,
                    self.learn_rate*self.lr_multiplier)
            new_probs, new_v = self.policy_value_net.policy_value(state_batch)
            kl = np.mean(np.sum(old_probs * (
                    np.log(old_probs + 1e-10) - np.log(new_probs + 1e-10)),
                    axis=1)
            )
            if kl > self.kl_targ * 4:  # early stopping if D_KL diverges badly
                break
        # adaptively adjust the learning rate
        if kl > self.kl_targ * 2 and self.lr_multiplier > 0.1:
            self.lr_multiplier /= 1.5
        elif kl < self.kl_targ / 2 and self.lr_multiplier < 10:
            self.lr_multiplier *= 1.5

        explained_var_old = (1 -
                             np.var(np.array(winner_batch) - old_v.flatten()) /
                             np.var(np.array(winner_batch)))
        explained_var_new = (1 -
                             np.var(np.array(winner_batch) - new_v.flatten()) /
                             np.var(np.array(winner_batch)))
        print(("kl:{:.5f},"
               "lr_multiplier:{:.3f},"
               "loss:{},"
               "entropy:{},"
               "explained_var_old:{:.3f},"
               "explained_var_new:{:.3f}"
               ).format(kl,
                        self.lr_multiplier,
                        loss,
                        entropy,
                        explained_var_old,
                        explained_var_new))
        return loss, entropy

    def policy_evaluate(self, n_games=10):
        """
        Evaluate the trained policy by playing against the pure MCTS player
        Note: this is only for monitoring the progress of training
        """
        current_mcts_player = MCTSPlayer(self.policy_value_net.policy_value_fn,
                                         c_puct=self.c_puct,
                                         n_playout=self.n_playout)
        pure_mcts_player = MCTS_Pure(c_puct=5,
                                     n_playout=self.pure_mcts_playout_num)
        win_cnt = defaultdict(int)
        for i in range(n_games):
            winner = self.game.start_play(current_mcts_player,
                                          pure_mcts_player,
                                          start_player=i % 2,
                                          is_shown=0)
            win_cnt[winner] += 1
        win_ratio = 1.0*(win_cnt[1] + 0.5*win_cnt[-1]) / n_games
        print("num_playouts:{}, win: {}, lose: {}, tie:{}".format(
                self.pure_mcts_playout_num,
                win_cnt[1], win_cnt[2], win_cnt[-1]))
        return win_ratio

    def run(self):
        """run the training pipeline"""
        try:
            for i in range(self.game_batch_num):
                self.collect_selfplay_data(self.play_batch_size)
                print("batch i:{}, episode_len:{}".format(
                        i+1, self.episode_len))
                if len(self.data_buffer) > self.batch_size:
                    loss, entropy = self.policy_update()
                # check the performance of the current model,
                # and save the model params
                if (i+1) % self.check_freq == 0:
                    print("current self-play batch: {}".format(i+1))
                    win_ratio = self.policy_evaluate()
                    self.policy_value_net.save_model('./current_policy.model')
                    if win_ratio > self.best_win_ratio:
                        print("New best policy!!!!!!!!")
                        self.best_win_ratio = win_ratio
                        # update the best_policy
                        self.policy_value_net.save_model('./best_policy.model')
                        if (self.best_win_ratio == 1.0 and
                                self.pure_mcts_playout_num < 5000):
                            self.pure_mcts_playout_num += 1000
                            self.best_win_ratio = 0.0
        except KeyboardInterrupt:
            print('\n\rquit')
In the main function you only need to write
training_pipeline = TrainPipeline()
training_pipeline.run()
to train the model, so the main thing to analyze is the TrainPipeline.run() function.
The rough flow of run() is: for each of game_batch_num iterations, collect self-play data with the current MCTS player (collect_selfplay_data); once the buffer contains more than one mini-batch, update the policy-value network (policy_update); every check_freq iterations, evaluate the current model against a pure-MCTS opponent (policy_evaluate), save the current model, and, whenever a new best win ratio is reached, save it as the best policy (and strengthen the pure-MCTS opponent once the win ratio hits 1.0).
The trained model also needs to be used in actual human-vs-AI play to test whether it performs well in practice.
Below is the human-vs-AI play function; note that it uses the PolicyValueNetNumpy class to load and run the model:
import pickle
# Board, Game, MCTSPlayer, Human and PolicyValueNetNumpy come from the
# project's other modules


def run():
    n = 5
    width, height = 8, 8
    model_file = 'best_policy_8_8_5.model'
    try:
        board = Board(width=width, height=height, n_in_row=n)
        game = Game(board)
        # load the trained model parameters into a pure-numpy policy-value net
        try:
            policy_param = pickle.load(open(model_file, 'rb'))
        except:
            policy_param = pickle.load(open(model_file, 'rb'),
                                       encoding='bytes')  # To support python3
        best_policy = PolicyValueNetNumpy(width, height, policy_param)
        mcts_player = MCTSPlayer(best_policy.policy_value_fn,
                                 c_puct=5,
                                 n_playout=400)  # set larger n_playout for better performance
        human = Human()
        game.start_play(human, mcts_player, start_player=1, is_shown=1)
    except KeyboardInterrupt:
        print('\n\rquit')
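A game is then started from the usual entry-point guard:

if __name__ == '__main__':
    run()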
PolicyValueNetNumpy(): uses the trained network parameters to analyze the current position and output the probability of each move together with the value of the position.
import numpy as np
# conv_forward, fc_forward, relu and softmax are small pure-numpy helper
# functions defined alongside this class in the project


class PolicyValueNetNumpy():
    """policy-value network in numpy """
    def __init__(self, board_width, board_height, net_params):
        self.board_width = board_width
        self.board_height = board_height
        self.params = net_params

    def policy_value_fn(self, board):
        """
        input: board
        output: a list of (action, probability) tuples for each available
        action and the score of the board state
        """
        legal_positions = board.availables
        current_state = board.current_state()

        X = current_state.reshape(-1, 4, self.board_width, self.board_height)
        # first 3 conv layers with ReLu nonlinearity
        for i in [0, 2, 4]:
            X = relu(conv_forward(X, self.params[i], self.params[i+1]))
        # policy head
        X_p = relu(conv_forward(X, self.params[6], self.params[7], padding=0))
        X_p = fc_forward(X_p.flatten(), self.params[8], self.params[9])
        act_probs = softmax(X_p)
        # value head
        X_v = relu(conv_forward(X, self.params[10],
                                self.params[11], padding=0))
        X_v = relu(fc_forward(X_v.flatten(), self.params[12], self.params[13]))
        value = np.tanh(fc_forward(X_v, self.params[14], self.params[15]))[0]
        act_probs = zip(legal_positions, act_probs.flatten()[legal_positions])
        return act_probs, value
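The helpers relu, softmax, fc_forward and conv_forward themselves are not shown above. As a rough illustration of what they compute, here is a minimal (and deliberately slow) NumPy sketch; it assumes NCHW activations, OIHW-shaped convolution weights, (in_features, out_features)-shaped dense weights and "same" padding by default, and the project's actual implementations may differ in layout and efficiency:

import numpy as np


def relu(X):
    return np.maximum(X, 0)


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fc_forward(x, W, b):
    # fully connected layer, assuming W has shape (in_features, out_features)
    return x.dot(W) + b


def conv_forward(X, W, b, stride=1, padding=1):
    # naive 2D convolution, assuming X is (N, C, H, W_in) and
    # W is (n_filters, C, kh, kw)
    n, c, h, w = X.shape
    n_f, _, kh, kw = W.shape
    Xp = np.pad(X, ((0, 0), (0, 0), (padding, padding), (padding, padding)),
                mode='constant')
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    out = np.zeros((n, n_f, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = Xp[:, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
            # sum over channels and the kernel window for every filter
            out[:, :, i, j] = np.tensordot(patch, W,
                                           axes=([1, 2, 3], [1, 2, 3]))
    return out + b.reshape(1, -1, 1, 1)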
Over.