莫烦python: Reinforcement Learning


The Q-Learning decision process
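In equation form, the agent picks an action A in state S (mostly the one with the larger Q(S, a), occasionally a random one), observes the reward R and next state S', and then moves its estimate toward a target built from R and the best Q value of S'. Here α is the learning rate (ALPHA in the code) and γ is the discount factor (GAMMA):

```latex
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]
```

When S' is terminal there is no future value, so the target is just R.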


A small Q-learning example

-o---T
# T is the treasure's position, o is the explorer's position
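The world above can be drawn as a plain string. The sketch below uses a hypothetical helper, `render`, and assumes 6 cells as in the diagram; the tutorial's own display function (`update_env`, called in `rl()`) is not shown in the text and may differ.

```python
N_STATES = 6  # assumed width, matching "-o---T"


def render(S, n_states=N_STATES):
    # Hypothetical helper: every cell is '-' except the last, which holds the treasure 'T'
    cells = ["-"] * (n_states - 1) + ["T"]
    if S != "terminal":
        cells[S] = "o"  # mark the explorer's current cell
    return "".join(cells)


print(render(1))  # -o---T
```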

Environment feedback: the next state and the reward returned after each move

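N_STATES = 6   # assumed from the 6-cell world "-o---T"; the treasure is cell N_STATES - 1

def get_env_feedback(S, A):
    # How the agent interacts with the environment: return the next state S_ and reward R
    if A == 'right':    # move right
        if S == N_STATES - 2:   # one cell left of the treasure: this move ends the episode
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # hit the left wall, stay in place
        else:
            S_ = S - 1
    return S_, R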
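To see what this feedback function does, the sketch below replays fixed action sequences through it. `rollout` is a hypothetical helper, and `N_STATES = 6` is assumed from the diagram. Always moving right reaches the treasure in 5 steps with a total reward of 1, and moving left at cell 0 leaves the explorer where it is.

```python
N_STATES = 6  # assumed: the 6-cell world "-o---T"


def get_env_feedback(S, A):
    # Same dynamics as the function in the text
    if A == 'right':
        if S == N_STATES - 2:
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:
        R = 0
        S_ = S if S == 0 else S - 1  # the left wall keeps the explorer at cell 0
    return S_, R


def rollout(actions, S=0):
    # Hypothetical helper: apply a fixed list of actions from state S,
    # recording (state, action, reward, next_state) and stopping at the treasure
    trajectory = []
    for A in actions:
        S_, R = get_env_feedback(S, A)
        trajectory.append((S, A, R, S_))
        if S_ == 'terminal':
            break
        S = S_
    return trajectory


path = rollout(['right'] * 10)
print(len(path), path[-1])      # 5 (4, 'right', 1, 'terminal')
print(rollout(['left'], S=0))   # [(0, 'left', 0, 0)]
```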

The RL algorithm: choose an action, then update the Q-table

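def rl():
    # Relies on N_STATES, ACTIONS, MAX_EPISODES, ALPHA, GAMMA and the helpers
    # build_q_table, choose_action and update_env, which are defined elsewhere
    q_table = build_q_table(N_STATES, ACTIONS)  # initial Q-table
    for episode in range(MAX_EPISODES):     # one episode per iteration
        step_counter = 0
        S = 0   # starting position of the episode
        is_terminated = False   # whether the episode has ended
        update_env(S, episode, step_counter)    # redraw the environment
        while not is_terminated:

            A = choose_action(S, q_table)   # choose an action
            S_, R = get_env_feedback(S, A)  # take the action and get the environment's feedback
            q_predict = q_table.loc[S, A]    # estimated (state, action) value
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # target (state, action) value: episode not over
            else:
                q_target = R     # target (state, action) value: episode over
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update the Q-table
            S = S_  # the explorer moves to the next state

            update_env(S, episode, step_counter+1)  # redraw the environment

            step_counter += 1
    return q_table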
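The loop above needs `build_q_table` and `choose_action`, which the text does not show. Below is a minimal self-contained sketch of the whole procedure under stated assumptions: the constant values are hypothetical, `choose_action` is an epsilon-greedy choice (random when a state's values are all zero), and the Q-table is a plain list of dicts rather than the pandas DataFrame that `.loc`/`.iloc` imply, so `q_table.loc[S, A]` becomes `q_table[S][A]`. Drawing (`update_env`) is left out.

```python
import random

# Assumed settings (hypothetical: the text does not define them)
N_STATES = 6                # cells in the 1-D world "-o---T"
ACTIONS = ["left", "right"]
EPSILON = 0.9               # probability of acting greedily
ALPHA = 0.1                 # learning rate
GAMMA = 0.9                 # discount factor
MAX_EPISODES = 30


def build_q_table(n_states, actions):
    # One dict of action values per state, all starting at zero
    return [{a: 0.0 for a in actions} for _ in range(n_states)]


def choose_action(state, q_table, rng):
    # Epsilon-greedy: act randomly with probability 1 - EPSILON,
    # or whenever this state's values are all still zero
    row = q_table[state]
    if rng.random() > EPSILON or all(v == 0.0 for v in row.values()):
        return rng.choice(ACTIONS)
    return max(row, key=row.get)  # action with the largest Q value


def get_env_feedback(S, A):
    # Same dynamics as the function in the text
    if A == "right":
        if S == N_STATES - 2:
            return "terminal", 1
        return S + 1, 0
    return (S if S == 0 else S - 1), 0


def rl(seed=0):
    rng = random.Random(seed)  # seeded so runs are reproducible
    q_table = build_q_table(N_STATES, ACTIONS)
    for _ in range(MAX_EPISODES):
        S = 0
        is_terminated = False
        while not is_terminated:
            A = choose_action(S, q_table, rng)
            S_, R = get_env_feedback(S, A)
            q_predict = q_table[S][A]
            if S_ != "terminal":
                q_target = R + GAMMA * max(q_table[S_].values())
            else:
                q_target = R
                is_terminated = True
            q_table[S][A] += ALPHA * (q_target - q_predict)
            S = S_
    return q_table


if __name__ == "__main__":
    for s, row in enumerate(rl()[:N_STATES - 1]):
        print(s, row)
```

After training, the greedy action in every non-terminal cell is "right". The reward propagates back one cell per episode, so after a handful of episodes every Q(s, "right") is positive and larger than Q(s, "left"). The last row stays at zero because reaching the treasure is reported as 'terminal' instead of a cell index.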
