Learn to play Pong with PG from scratch and pixels

http://karpathy.github.io/2016/05/31/rl/

Policy Gradients (PG) is the default choice for attacking RL problems.

DQN is a well-known alternative; it is a modified form of Q-Learning that uses a deep network.

PG is preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward.

Pong is a special case of a Markov Decision Process (MDP): a graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.

The policy network (the original post includes a diagram):
Input: raw image pixels.

2-layer neural network

Output: move UP or DOWN. The policy is stochastic: the network produces only a probability of moving UP.

Every iteration we sample from this distribution to get the actual move, as in the sketch below.
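A minimal sketch of that sampling step (aprob is a placeholder for the UP-probability returned by the forward pass shown next; 2/3 are the ATARI UP/DOWN action codes used in the original code):

import numpy as np

aprob = 0.7  # placeholder: probability of UP from policy_forward(x)
action = 2 if np.random.uniform() < aprob else 3  # 2 = UP, 3 = DOWN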

Policy network forward pass in Python/numpy:

def policy_forward(x):
    # assumes numpy is imported as np; W1, W2 are the policy network's weight matrices (globals)
    h = np.dot(W1, x)  # compute hidden layer neuron activations
    h[h < 0] = 0  # ReLU nonlinearity: threshold at zero
    logp = np.dot(W2, h)  # compute log probability of going UP
    p = 1.0 / (1.0 + np.exp(-logp))  # sigmoid (gives probability of going UP)
    return p, h  # return probability of taking action 2 (UP), and hidden state
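A small usage sketch: the 200 hidden units and the flattened 80x80 input are assumptions borrowed from the original post, and the random vector stands in for a preprocessed difference frame.

import numpy as np

H, D = 200, 80 * 80                            # hidden units, input dimensionality (assumed)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((H, D)) / np.sqrt(D)  # random "Xavier"-style initialization
W2 = rng.standard_normal(H) / np.sqrt(H)

x = rng.standard_normal(D)                     # stand-in for a preprocessed frame
p, h = policy_forward(x)                       # p is a single probability in (0, 1), h has shape (H,)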

Training protocol: Initialize the policy network weights W1, W2 and play 100 games. Assume each game is made up of 200 frames, so we make 200 decisions per game. Suppose we won 12 games and lost 88. We take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and a parameter update encouraging the actions we picked in all those states). We take the other 200*88 = 17600 decisions we made in the losing games and do a negative update. Then we play another 100 games and repeat, as sketched below.
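A compressed sketch of that loop, with a hypothetical play_game() stub standing in for the real rollout and gradient bookkeeping (the numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def play_game():
    # hypothetical stub: returns per-decision gradient signals and whether we won
    n_decisions = 200
    grads = rng.standard_normal(n_decisions)   # placeholder for the per-decision grad-log-prob terms
    won = rng.uniform() < 0.12                 # pretend we win about 12% of games
    return grads, won

batch = []
for _ in range(100):                           # play a batch of 100 games
    grads, won = play_game()
    advantage = 1.0 if won else -1.0           # positive update for wins, negative for losses
    batch.append(advantage * grads)            # encourage/discourage every decision of that game

batch_grad = np.concatenate(batch)             # 100 games * 200 decisions = 20000 modulated gradients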

In summary, the loss looks like $\sum_i A_i \log p(y_i \mid x_i)$, where $y_i$ is the action we happened to sample and $A_i$ is a number that we call an advantage. In the case of Pong, for example, $A_i$ could be 1.0 if we eventually won the episode that contained $x_i$ and -1.0 if we lost.
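A worked example of the per-decision gradient this loss produces (the values are made up; y - aprob is the gradient of log p with respect to the sigmoid's pre-activation):

aprob = 0.7            # network's probability of UP for this frame
y = 1                  # 1 if the sampled action was UP, 0 if DOWN
dlogp = y - aprob      # d(log p(y|x)) / d(logit) for a sigmoid output: 0.3 here
advantage = -1.0       # the episode containing this frame was eventually lost
dlogp *= advantage     # advantage-modulated gradient (this is what epdlogp holds per decision)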

In a more general RL setting we would receive some reward $r_t$ at every time step. The "eventual reward" above then becomes $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma$ is a number between 0 and 1 called the discount factor (e.g. 0.99), so later rewards are exponentially less important. In practice it is also important to normalize these returns: standardize them by subtracting the mean and dividing by the standard deviation before using them in backprop.
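A minimal sketch of the discounted-return computation and the standardization step (Karpathy's full code additionally resets the running sum at game boundaries, a Pong-specific detail):

import numpy as np

gamma = 0.99  # discount factor

def discount_rewards(r):
    # compute R_t = sum_k gamma^k * r_{t+k} by scanning the reward array backwards
    discounted = np.zeros_like(r, dtype=float)
    running_add = 0.0
    for t in reversed(range(len(r))):
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

r = np.array([0, 0, 0, 0, 1], dtype=float)   # reward only at the last of 5 steps
R = discount_rewards(r)                      # [0.961, 0.970, 0.980, 0.990, 1.0]
R = (R - np.mean(R)) / np.std(R)             # standardize before using in backprop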

def policy_backward(eph, epdlogp):
    """ backward pass. (eph is an array of intermediate hidden states) """
    # epdlogp is the array of advantage-modulated (y - aprob) values
    # model (dict with 'W1', 'W2') and epx (array of episode inputs) are globals
    dW2 = np.dot(eph.T, epdlogp).ravel()  # ravel() flattens the result to match W2's shape
    dh = np.outer(epdlogp, model['W2'])
    dh[eph <= 0] = 0  # backprop through the ReLU nonlinearity
    dW1 = np.dot(dh.T, epx)
    return {'W1': dW1, 'W2': dW2}
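A shape-level usage sketch: the buffers epx, eph, epdlogp are normally built by stacking per-frame values with np.vstack, the plain gradient-ascent step here is a simplification of the RMSProp update used in the original code, and all sizes are assumptions.

import numpy as np

T, D, H = 200, 80 * 80, 200                     # decisions per episode, input pixels, hidden units (assumed)
rng = np.random.default_rng(0)
model = {'W1': rng.standard_normal((H, D)) / np.sqrt(D),
         'W2': rng.standard_normal(H) / np.sqrt(H)}

epx = rng.standard_normal((T, D))               # stacked inputs x_t
eph = np.maximum(0, epx @ model['W1'].T)        # stacked hidden states h_t (recomputed here for the sketch)
epdlogp = rng.standard_normal((T, 1))           # stacked advantage-modulated (y - aprob) values

grad = policy_backward(eph, epdlogp)            # {'W1': (H, D), 'W2': (H,)}
learning_rate = 1e-3
for k in model:
    model[k] += learning_rate * grad[k]         # simple gradient ascent step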
