Q-learning: Reinforcement Learning Solution to the Towers of Hanoi Puzzle

Our goal is to write a Q-learning implementation and then use it to solve the Towers of Hanoi puzzle.

Introduction to Reinforcement Learning

The detailed basic definitions are not repeated here. What follows goes straight to the parts that are useful in practice.

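Steps in reinforcement learning:

  • For each state, and for every action available in that state, estimate the potential reward of that state-action pair.

    • These values are usually recorded in a Q table, which can be written as \(Q[(state, move)] = value\)
  • For the Towers of Hanoi puzzle, the final goal can always be reached, so we set the reinforcement \(r = 1\)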

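  • For reinforcement learning, there are two strategies for choosing actions (note: each choice requires a different Q-table update equation)

    • One: always choose the smallest value, where a smaller value means being closer to the goal.
    • Two: always choose the largest value, where a larger value means being closer to the goal.
    • Here we set the goal value to 1 and choose actions by the smallest value. The selection rule is
      • \(a_t = \mathop{\arg\min}_{a} Q(s_t,a)\)
      • where \(a_t\) is the chosen action and \(s_t\) is the current state: among the actions \(a\) available in \(s_t\), choose the action \(a_t\) with the smallest Q value.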
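  • Now consider how the Q table is updated

  • To update the Q table we use the following two equations, with \(r = 1\). (Note: all Q values are initialized to 0 and then updated according to the state-action pairs that are visited.)

    • If the goal is reached
      \[ Q(s_t,a_t) = Q(s_t,a_t) + \rho (r - Q(s_t,a_t)) \]

      • Alternatively, assign the value 1 directly to mark that the goal was reached. To keep the computation simple, we assign 1 directly here.
    • Otherwise
      \[ Q(s_t,a_t) = Q(s_t,a_t) + \rho (r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \]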

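    • Understanding the equations above:

      • First, in \(s_t\), we pick action \(a_t\) according to the Q table and move to \(s_{t+1}\)
      • In \(s_{t+1}\), the first thing we do is update the Q value of action \(a_t\) in the previous state \(s_t\).
      • At this point the Q table gives us the value of \(a_{t+1}\) in \(s_{t+1}\), and we have the reinforcement \(r = 1\)
      • We treat \((r + Q(s_{t+1},a_{t+1}))\) as the actual Q value of action \(a_t\) in \(s_t\), while \(Q(s_t,a_t)\) is the current estimate
      • So \((r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))\) is the difference between the actual value and the estimate. Multiplying it by the learning rate \(\rho\) gives the correction learned in this step.
      • Finally, adding this correction to the old estimate \(Q(s_t,a_t)\) gives the updated \(Q(s_t,a_t)\)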
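As a rough illustration of the two update equations, the sketch below applies one update to a tiny hand-built Q table. The function name `q_update`, its parameters, and the state labels `"s0"` and `"s1"` are assumptions made for this example only; they do not appear in the code later in this post.

```python
def q_update(Q, key, next_key, rho, r=1.0, at_goal=False):
    """Apply one update to Q[key] and return the new value.

    Q is a dict mapping (state, move) pairs to values; missing entries count as 0.
    """
    q_old = Q.get(key, 0.0)
    if at_goal:
        # Goal case: the "actual" value is just r.
        target = r
    else:
        # Non-goal case: r plus the current value of the next state-action pair.
        target = r + Q.get(next_key, 0.0)
    Q[key] = q_old + rho * (target - q_old)
    return Q[key]


# Hypothetical labels: the move out of "s1" reaches the goal, so its value is 1.
Q = {("s1", (1, 3)): 1.0}
new_value = q_update(Q, ("s0", (1, 2)), ("s1", (1, 3)), rho=0.5)
print(new_value)  # 0 + 0.5 * (1 + 1 - 0) = 1.0
```

Repeating the same update moves the estimate halfway toward the target each time (1.0, then 1.5, then 1.75, and so on toward 2).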

That covers the basics of reinforcement learning needed here.
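To see why a smaller Q value means "closer to the goal" under this scheme, here is a minimal sketch on a toy problem that is not the Hanoi puzzle: a chain of four states where the only move is a step to the right and the last state is the goal. The function `train_chain` and the chain itself are assumptions introduced for illustration.

```python
def train_chain(n_states=4, rho=0.5, sweeps=200):
    """Run the update rule on a chain 0 -> 1 -> ... -> n_states - 1 (the goal).

    Each state has a single move, "right", so no action selection is needed.
    Returns a dict mapping (state, "right") to its learned value.
    """
    goal = n_states - 1
    Q = {(s, "right"): 0.0 for s in range(goal)}
    for _ in range(sweeps):
        for s in range(goal):
            key = (s, "right")
            if s + 1 == goal:
                # Reaching the goal: assign 1 directly.
                Q[key] = 1.0
            else:
                # Otherwise: r = 1 plus the value of the next state-action pair.
                nxt = (s + 1, "right")
                Q[key] += rho * (1.0 + Q[nxt] - Q[key])
    return Q


Q = train_chain()
for (s, move), v in sorted(Q.items()):
    print(s, round(v, 3))
```

The learned values approach 3, 2 and 1 for states 0, 1 and 2, which is the number of moves still needed to reach the goal. Choosing the move with the smallest Q therefore heads toward the goal along the shortest route.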

What Needs to Be Done

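  • First, we need a way to visualize the Towers of Hanoi puzzle, so that the process and the results are easy to inspect.

  • Put simply, [[1, 2, 3], [], []] represents one state: the three inner lists are the three pegs, the numbers are the three disks, and the size of a number stands for the size of the disk.

  • A move can likewise be reduced to a single pair, [1, 2] or (1, 2), meaning: move the top disk of peg 1 onto peg 2 (pegs are numbered 1, 2, 3 from left to right)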

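  • We can then write the following four functions:

    • printState(state): prints the state of the towers, for visualization
    • validMoves(state): returns all valid moves in the current state
    • makeMove(state, move): returns the state that results from applying move (the action)
    • stateMoveTuple(state, move): converts state and move (the action) into tuple form, (state, move), because the Q table is stored as a dictionary here, which is simpler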
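  • Next, write the epsilonGreedy function

    • What it does: draw a random number. If it is smaller than the current epsilon, take a random move; otherwise take the move with the smallest Q value in the Q table.
    • In the epsilonGreedy function (if np.random.uniform() < epsilon), a small epsilon means moves are more likely to be chosen from the Q table. An epsilon that stays too large can keep training from converging. For this problem, \(epsilon *= epsilonDecayFactor\) is applied to keep shrinking epsilon toward 0.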
  • trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF,startState,goalState)

    • Trains the Q table from the given start and goal states and returns a reasonable Q table
    • Pseudocode for trainQ, where Qold is Q[(stateOld, moveOld)] and Qnew is Q[(state, move)]:
      1. Initialize Q.
      2. Repeat:
        1. Use the epsilonGreedy function to get the move and compute stateNew
        2. If (state, move) is not in Q,
          1. Set Q[(state, move)] = 0
        3. If stateNew is goalState,
          1. Set Q[(state, move)] = 1
        4. Otherwise (not at goal),
          1. If not first step, update Qold = Qold + rho * (1 + Qnew - Qold)
          2. Shift current state and action to old ones.
  • testQ(Q, maxSteps, validMovesF, makeMoveF,startState,goalState)

    • Given the start and goal states, automatically picks the best sequence of moves according to the values in the Q table

    • Pseudocode for testQ:

      1. Get Q from trainQ.

      2. Repeat:

        1. Use the validMoves function to get the list of moves.
        2. Use the Q table to get the value of each \((state, move)\); if a pair is not in Q, set its value to infinity.
        3. Choose the move given by \(\arg\min Q[(state,move)]\).
        4. Record the new state in path.
        5. If at the goal, return path.
        6. If step >= maxSteps, report 'Goal not reached in maxSteps' and stop.

Code & Test

import numpy as np
import random
import matplotlib.pyplot as plt
import copy
%matplotlib inline
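def stateModify(state):
    # Flatten the state row by row for printing, padding each peg with
    # zeros on top so that every peg has N entries.
    N = 3
    row = []
    flatState = []
    columns = len(state)
    # Deep copy: a shallow copy would share the inner lists, so the
    # padding below would modify the caller's state.
    stateCopy = copy.deepcopy(state)
    for i in range(columns):
        row.append(len(state[i]))
    # add 0 in modified state
    for i in range(columns):
        while row[i] < N:
            stateCopy[i].insert(0,0)
            row[i]= len(stateCopy[i])
    # read the padded pegs row by row
    for i in range(max(row)):
        for j in range(len(stateCopy)):
            flatState.append(stateCopy[j][i])
    return(flatState)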
def printState(state):
    statePrint = stateModify(state)
    # print the state 
    i = 0
    for num in statePrint:
        # if the number is zero, we print ' '
        if num == 0:
            print(" ",end=" ")
        else:
            print(num, end=" ")
        i += 1
        if i%3 == 0:
            print("")
    print('------')
def validMoves(state):
    actions = []    
    # check left 
    if state[0] != []:
        # left to middle
        if state[1]==[] or state[0][0] < state[1][0]:
            actions.append([1,2])
        # left to right
        if state[2]==[] or state[0][0] < state[2][0]:
            actions.append([1,3])
   
    # check middle
    if state[1] != []:
        # middle to left
        if state[0]==[] or state[1][0] < state[0][0]:
            actions.append([2,1])
        # middle to right   
        if state[2]==[] or state[1][0] < state[2][0]:
            actions.append([2,3])
    
    # check right        
    if state[2] != []:
        # right to left
        if state[0]==[] or state[2][0] < state[0][0]:
            actions.append([3,1])
        # right to middle
        if state[1]==[] or state[2][0] < state[1][0]:
            actions.append([3,2])            
    return actions
def stateMoveTuple(state, move):
    stateTuple = []
    returnTuple = [tuple(move)]
    for i in range (len(state)):
        stateTuple.append(tuple(state[i]))
    returnTuple.insert(0,tuple(stateTuple))
    return tuple(returnTuple)
def makeMove(state, move):
    stateMove = []
    stateMove = copy.deepcopy(state)
    
    stateMove[move[1]-1].insert(0,stateMove[move[0]-1][0])
    stateMove[move[0]-1].pop(0)
    return stateMove
def epsilonGreedy(Q, state, epsilon, validMovesF):
    # use the function passed in rather than the global validMoves
    validMoveList = validMovesF(state)
    if np.random.uniform() < epsilon:
        # Random Move
        lens = len(validMoveList)
        return validMoveList[random.randint(0,lens-1)]
    else:
        # Greedy Move
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList]) 
        return validMoveList[np.argmin(Qs)]
def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF,startState,goalState):
    epsilon = 1.0
    outcomes = np.zeros(nRepetitions)
    Q = {}
    for nGames in range(nRepetitions):
        epsilon *= epsilonDecayFactor
        step = 0
        done = False
        state = copy.deepcopy(startState)
    
        while not done:
            step += 1
            move = epsilonGreedy(Q, state, epsilon, validMovesF)         
            stateNew = makeMoveF(state,move)
            if stateMoveTuple(state, move) not in Q:
                Q[stateMoveTuple(state, move)] = 0 
                
            if stateNew == goalState:
#                 Q[stateMoveTuple(state, move)] += learningRate * (1 - Q[stateMoveTuple(state, move)])
                Q[stateMoveTuple(state, move)] = 1
                done = True
                outcomes[nGames] = step  
                
            else:
                if step > 1:
                    Q[stateMoveTuple(stateOld, moveOld)] += learningRate * \
                                    (1 + Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)]) 
                stateOld = copy.deepcopy(state)
                moveOld = copy.deepcopy(move)
                state = copy.deepcopy(stateNew)
    return Q, outcomes                  
def testQ(Q, maxSteps, validMovesF, makeMoveF,startState,goalState):
    # Follow the greedy (smallest Q) move from startState and return the list
    # of visited states, or [] if the goal is not reached within maxSteps.
    state = copy.deepcopy(startState)
    path = []
    path.append(state)
    step = 0
    while True:
        step += 1
        Qs = []
        validMoveList = validMovesF(state)
        for m in validMoveList:
            if stateMoveTuple(state, m) in Q:
                Qs.append(Q[stateMoveTuple(state, m)])
            else:
                # state-move pairs never seen in training count as infinitely far
                Qs.append(float('inf'))
        stateNew = makeMoveF(state,validMoveList[np.argmin(Qs)])
        path.append(stateNew)
        if stateNew == goalState:
            return path
        elif step >=maxSteps:
            print('Goal not reached in {} steps'.format(maxSteps))
            return []
        state = copy.deepcopy(stateNew)
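def minsteps(steps,minStepOld,nRepetitions):
    # Count how many leading episodes must be dropped before the mean number
    # of steps of the remaining episodes is at most 7, the shortest solution
    # for three disks. Return (count, True) if the count beats minStepOld,
    # otherwise (minStepOld, False).
    delStep =0

    steps = list(steps)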
    while delStep != nRepetitions:
        if np.mean(steps)>7:
            steps.pop(0)
            delStep += 1
        else:
            if delStep < minStepOld:
                return delStep,True
            else:
                return minStepOld,False
    if delStep < minStepOld:
        return delStep,True
    else:
        return minStepOld,False
def findBetter(nRepetitions,learningRate,epsilonDecayFactor):
    Q, steps = trainQ(nRepetitions, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])       
    minStepOld,_ = minsteps(steps,0xffffff,50)
    bestlRate = 0.5
    besteFactor = 0.7
    LAndE = []
    for k in range(10):
        for i in range(len(learningRate)):
            for j in range(len(epsilonDecayFactor)):
                Q, steps = trainQ(nRepetitions, learningRate[i], epsilonDecayFactor[j], validMoves, makeMove,\
                                  startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
                minStepNew,B = minsteps(steps,minStepOld,nRepetitions)
                if B:
                    bestlRate = learningRate[i]
                    besteFactor = epsilonDecayFactor[j]
                    minStepOld = copy.deepcopy(minStepNew)
        LAndE.append([bestlRate,besteFactor])            
    return LAndE

Test part

state = [[1, 2, 3], [], []]
printState(state)
1     
2     
3     
------
state = [[1, 2, 3], [], []]
move =[1, 2]
stateMoveTuple(state, move)
(((1, 2, 3), (), ()), (1, 2))
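The conversion to nested tuples is what makes a state-move pair usable as a dictionary key. A minimal sketch of the difference, using only built-in Python behavior (the variable names `list_key_ok` and `key` are made up for this example):

```python
state = [[1, 2, 3], [], []]
move = [1, 2]

Q = {}
try:
    # A tuple that still contains lists cannot be hashed.
    Q[(state, move)] = 0
    list_key_ok = True
except TypeError:
    list_key_ok = False

# Converting every nested list to a tuple gives a hashable key.
key = (tuple(tuple(peg) for peg in state), tuple(move))
Q[key] = 0

print(list_key_ok)  # False
print(key)          # (((1, 2, 3), (), ()), (1, 2))
```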
state = [[1, 2, 3], [], []]
newstate = makeMove(state, move)
newstate
[[2, 3], [1], []]
Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path = testQ(Q, 20, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path
[[[1, 2, 3], [], []],
 [[2, 3], [], [1]],
 [[3], [2], [1]],
 [[3], [1, 2], []],
 [[], [1, 2], [3]],
 [[1], [2], [3]],
 [[1], [], [2, 3]],
 [[], [], [1, 2, 3]]]
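The path above has 8 states, that is 7 moves, which matches the known minimum of 2^n - 1 moves for n = 3 disks. As a quick sanity check, the sketch below verifies that every consecutive pair of states differs by one legal move. The helper `is_legal_step` is an assumption added for this check and is not part of the original code; the path is copied from the output above.

```python
def is_legal_step(before, after):
    """Return True if `after` follows from `before` by moving one top disk
    onto an empty peg or onto a larger disk. Index 0 of each peg is its top."""
    changed = [i for i in range(3) if before[i] != after[i]]
    if len(changed) != 2:
        return False
    # Try both orientations: either changed peg could be the source.
    for src, dst in (changed, changed[::-1]):
        if before[src] and after[src] == before[src][1:]:
            disk = before[src][0]
            if after[dst] == [disk] + before[dst]:
                return not before[dst] or disk < before[dst][0]
    return False


path = [[[1, 2, 3], [], []],
        [[2, 3], [], [1]],
        [[3], [2], [1]],
        [[3], [1, 2], []],
        [[], [1, 2], [3]],
        [[1], [2], [3]],
        [[1], [], [2, 3]],
        [[], [], [1, 2, 3]]]

moves = len(path) - 1
print(all(is_legal_step(a, b) for a, b in zip(path, path[1:])))  # True
print(moves, moves == 2 ** 3 - 1)                                # 7 True
```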
for s in path:
    printState(s)
    print()
1     
2     
3     
------

2     
3   1 
------

3 2 1 
------

  1   
3 2   
------

  1   
  2 3 
------

1 2 3 
------   

    2 
1   3 
------

    1 
    2 
    3 
------

# find better learningRate and epsilonDecayFactor
learningRate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
epsilonDecayFactor = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
LAndE = findBetter(100,learningRate,epsilonDecayFactor)
print(LAndE)
[[0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6]]
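To get a feel for what these decay factors mean, the sketch below lists how quickly epsilon shrinks when it is multiplied by the decay factor at the start of every episode, as trainQ does. The helper `epsilon_schedule` and the 1% threshold are assumptions chosen for illustration.

```python
def epsilon_schedule(decay, n_episodes, epsilon=1.0):
    """Return the epsilon used in each episode when it is multiplied by
    `decay` at the start of every episode."""
    values = []
    for _ in range(n_episodes):
        epsilon *= decay
        values.append(epsilon)
    return values


first_greedy = {}
for decay in (0.2, 0.6, 0.9):
    eps = epsilon_schedule(decay, 100)
    # First episode in which fewer than 1% of moves are random.
    first_greedy[decay] = next(i + 1 for i, e in enumerate(eps) if e < 0.01)
    print(decay, first_greedy[decay])
```

With a factor of 0.2 the agent is almost fully greedy by episode 3, with 0.6 by episode 10, and with 0.9 only by episode 44, so a smaller factor trades exploration for faster exploitation of the Q table.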
