Playing Flappy Bird with Reinforcement Learning

I am just getting started with reinforcement learning. For my graduation project, which combines Q-learning with video games, I spent a few days working through a nice project by yenchenlin, adding plenty of comments and my own notes along the way. The result is, I would say, quite easy to read, and I hope it is helpful to you.

GitHub repository: ReDeepLearningFlappyBird
https://github.com/ZhangRui111/ReDeepLearningFlappyBird

Using Deep Q-Network to Learn How To Play Flappy Bird

Based on DeepLearningFlappyBird

Overview

This project follows the Deep Q Learning algorithm described in Playing Atari with Deep Reinforcement Learning [2] and shows that this learning algorithm can be further generalized to the notorious Flappy Bird.

[Demo GIF: a trained agent playing Flappy Bird (bird_demo.gif)]

7 mins version: DQN for flappy bird

How to Run?

git clone https://github.com/ZhangRui111/ReDeepLearningFlappyBird.git
cd ReDeepLearningFlappyBird
python deep_q_network.py

The program runs with pretrained weights by default. If you want to train the network from scratch, delete /saved_networks/checkpoint. Note that training may take several days depending on your hardware.
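
For reference, the weight-loading logic in a TensorFlow 1.x project such as this one usually looks like the sketch below. The saved_networks folder name matches the repo; the function itself is my own illustrative assumption, not the repo's exact code.

import tensorflow as tf  # TensorFlow 1.x style, as used by the repo

def load_or_init(sess, saver, checkpoint_dir="saved_networks"):
    """Restore pretrained weights if a checkpoint exists; otherwise start fresh."""
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        print("Loaded weights from", ckpt.model_checkpoint_path)
    else:
        print("No checkpoint found, training from scratch")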

About deep_q_network.py

[Figures: annotated screenshots of the code, case 1 and case 2]

Deep Q-Network Algorithm

The pseudo-code for the Deep Q Learning algorithm, as given in [1], can be found below:

Initialize replay memory D to size N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ϵ select a random action a_t
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the emulator and observe reward r_t and state s_(t+1)
        Store transition (s_t, a_t, r_t, s_(t+1)) in D
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from D
        Set y_j :=
            r_j                                      for terminal s_(j+1)
            r_j + γ * max_(a') Q(s_(j+1), a'; θ)     for non-terminal s_(j+1)
        Perform a gradient descent step on (y_j - Q(s_j, a_j; θ))^2 with respect to θ
    end for
end for
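
To make the target computation concrete, here is a minimal NumPy sketch of how y_j can be computed for a sampled minibatch. The variable names are my own; the repo builds the equivalent quantity inside its TensorFlow graph.

import numpy as np

def q_targets(rewards, q_next, terminals, gamma=0.99):
    """y_j = r_j for terminal transitions,
    y_j = r_j + gamma * max_a' Q(s_(j+1), a') otherwise."""
    rewards = np.asarray(rewards, dtype=np.float32)
    terminals = np.asarray(terminals, dtype=bool)
    max_q_next = np.max(q_next, axis=1)  # max over actions for each s_(j+1)
    return rewards + gamma * max_q_next * (~terminals)

# Example: a batch of 3 transitions with 2 actions each
q_next = np.array([[0.5, 1.0], [0.2, 0.3], [0.0, 0.0]])
print(q_targets([0.1, 0.1, -1.0], q_next, [False, False, True]))
# -> approximately [1.09, 0.397, -1.0]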

Network Architecture

According to [1], I first preprocessed the game screens with the following steps (sketched in code after the list):

  1. Convert image to grayscale
  2. Resize image to 80x80
  3. Stack last 4 frames to produce an 80x80x4 input array for network
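
A minimal sketch of these three steps with OpenCV and NumPy; the repo does its preprocessing with cv2 as well, but details such as thresholding may differ from this assumption:

import cv2
import numpy as np

def preprocess_frame(frame):
    """Steps 1 and 2: grayscale, then resize to 80x80."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (80, 80))

def stack_frames(last_four):
    """Step 3: stack the last 4 preprocessed frames into an 80x80x4 input."""
    assert len(last_four) == 4
    return np.stack(last_four, axis=2).astype(np.float32)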

The architecture of the network is shown in the figure below. The first layer convolves the input image with an 8x8x4x32 kernel at a stride size of 4. The output is then put through a 2x2 max pooling layer. The second layer convolves with a 4x4x32x64 kernel at a stride of 2. We then max pool again. The third layer convolves with a 3x3x64x64 kernel at a stride of 1. We then max pool one more time. The last hidden layer consists of 256 fully connected ReLU nodes.

[Figure: network architecture]
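
For illustration, the architecture described above can be written down with tf.keras as below. The repo itself builds the graph with low-level TensorFlow 1.x ops, so this Keras version is only my reading of the text, with 2 valid actions assumed for Flappy Bird.

import tensorflow as tf

def build_q_network(num_actions=2):
    """80x80x4 input -> three conv (+ max pool) blocks -> 256-unit FC -> Q-values."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 8, strides=4, padding="same",
                               activation="relu", input_shape=(80, 80, 4)),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_actions),  # one Q-value per valid action
    ])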

The final output layer has the same dimensionality as the number of valid actions in the game, where the 0th index always corresponds to doing nothing. The values at this output layer represent the Q function of the input state for each valid action. At each time step, the network selects an action with an ϵ-greedy policy: with probability ϵ a random action, otherwise the action with the highest Q value.
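
A small sketch of this action-selection rule against the network's Q outputs (again my own illustrative code, not the repo's):

import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random):
    """With probability epsilon pick a random action (index 0 = do nothing),
    otherwise pick the action with the highest predicted Q-value."""
    if rng.random() < epsilon:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))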

Disclaimer

This work is largely based on the following repos:

  1. sourabhv/FlapPyBird (https://github.com/sourabhv/FlapPyBird)
  2. asrivat1/DeepLearningVideoGames
  3. yenchenlin/DeepLearningFlappyBird


Original article. Please credit the source when reposting: https://www.jianshu.com/p/755f9f2604d0
