【强化学习纲要】Study Notes: Overview

【强化学习纲要】Study Notes Series


Definition and Application Scenarios

Prerequisite

  • Background knowledge needed before studying RL: linear algebra, probability, and machine-learning-related topics (data mining, pattern recognition, deep learning, etc.)
  • Programming skills: Python, PyTorch

Definition of RL

A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment. (by Sutton and Barto)

In other words: a computational method for maximizing the total reward an agent obtains while interacting with a complex and uncertain environment. Breaking this down:

  • Subject: the agent
  • Environment: the world the agent acts in, which is complex and uncertain
  • Goal: maximize the total reward

Differences between RL and Supervised Learning

  • Supervised learning is used for tasks such as classification: it relies on labeled data that are assumed to be i.i.d. (independent and identically distributed)
  • RL aims to let machines surpass human behavior: the data are not i.i.d., and there is neither immediate feedback nor data labels

RL usually faces sequential data with strong correlations between samples. The learner is not told the correct action; it can only discover, through trial-and-error interaction with the environment, which behavior maximizes reward, and that reward is often delayed rather than immediate.

RL has to continually trade off between exploration and exploitation.

The upper bound of supervised learning is human performance, because the training data come from human annotations; RL, by contrast, can surpass humans, as AlphaGo did in Go.

Deep Reinforcement Learning

  • Deep Reinforcement Learning = Deep Learning + Reinforcement Learning
  • Standard RL: game → hand-crafted features → policy and value functions → actions
  • Deep RL: end-to-end training, game → Deep RL → actions
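
To make "end-to-end" concrete, here is a minimal PyTorch sketch of a policy network that maps a raw observation vector directly to action probabilities. The network shape, layer sizes, and dimensions below are illustrative assumptions, not the architecture used in the course.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a raw observation vector directly to action probabilities (end-to-end)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax over the last dimension gives a stochastic policy pi(a|s).
        return torch.softmax(self.net(obs), dim=-1)

# Example: sample an action for a single (made-up) 4-dimensional observation.
policy = PolicyNet(obs_dim=4, n_actions=2)
obs = torch.randn(4)
probs = policy(obs)                          # pi(a|s) as a probability vector
action = torch.multinomial(probs, 1).item()  # sampled action index
```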

Recent development of RL

  • Sufficient computing power
  • Application scenarios with simple and well-defined rules
  • End-to-end training: features and policies are jointly optimized toward the goal

Sequential Decision Making

Agent and Environment

The agent learns how to interact with the environment: it perceives information from the environment and takes a sequence of actions that maximizes reward; this sequence of actions (together with the observations and rewards along the way) is usually called a trajectory.
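
A minimal sketch of this interaction loop, assuming the Gymnasium API and a random agent (the CartPole-v1 environment and the random action choice are just placeholders for a real task and policy):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

trajectory = []            # sequence of (observation, action, reward) tuples
done = False
while not done:
    action = env.action_space.sample()   # a real agent would query its policy here
    next_obs, reward, terminated, truncated, info = env.step(action)
    trajectory.append((obs, action, reward))
    obs = next_obs
    done = terminated or truncated

total_reward = sum(r for _, _, r in trajectory)
print(f"episode length: {len(trajectory)}, total reward: {total_reward}")
```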

Rewards

  • A reward is a scalar feedback signal
  • It guides the agent's behavior
  • The goal of RL is to maximize cumulative reward

Typical rewards include: winning or losing a game, profit from an investment, and a robot completing a manipulation correctly.

Sequential Decision Making

  • Decisions are made sequentially, and past states determine later actions. The history is $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
  • The state is a function of the history: $S_t = f(H_t)$; it includes both the environment state and the agent state
  • Full observability: the agent directly observes the environment state, as in a Markov Decision Process (MDP):
    $O_t = S_t^e = S_t^a$
  • Partial observability: the agent only indirectly perceives the environment, as in a Partially Observable Markov Decision Process (POMDP)

Main Components of an RL Agent

  • Policy: the agent's behavior function, the mapping from states to actions. A stochastic policy is expressed as a probability distribution
    $\pi(a|s) = P[A_t = a \mid S_t = s]$
    while a deterministic (greedy) policy picks
    $a^* = \arg\max_a \pi(a|s)$

  • Value function: evaluates how good a state or an action is.
    Given a policy $\pi$, it is the expected future reward as a discounted sum (weighing immediate against delayed feedback):
    $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$, for all $s \in \mathcal{S}$
    where $\gamma$ is the discount factor. From this we further obtain the Q-function
    $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
    The Q-function evaluates, for a given state, the return of each possible action, so the best action can be selected (a small numeric sketch follows this list)

  • Model: the agent's representation of the environment's state, used to predict what the environment will do next.
    It predicts the next state
    $P_{ss'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
    and the next reward
    $R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
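
As noted in the list above, here is a small numeric sketch of these ideas: a discounted return, a toy Q-table, and the greedy policy derived from it. All numbers, states, and actions are invented purely for illustration.

```python
import numpy as np

gamma = 0.9                                  # discount factor
rewards = [1.0, 0.0, 0.0, 2.0]               # R_{t+1}, R_{t+2}, ... along one trajectory

# Return G_t = sum_k gamma^k * R_{t+k+1} (finite horizon here for simplicity)
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"G_t = {G_t:.3f}")

# A toy Q-table: rows are states, columns are actions.
q = np.array([[0.1, 0.8],                    # q(s0, a0), q(s0, a1)
              [0.5, 0.2]])                   # q(s1, a0), q(s1, a1)

# Deterministic (greedy) policy: a* = argmax_a q(s, a) in every state.
greedy_policy = q.argmax(axis=1)
print(f"greedy action per state: {greedy_policy}")   # -> [1 0]
```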

Markov Decision Processes (MDPs)

  • A dynamics / transition model for each action: $P(S_{t+1} = s' \mid S_t = s, A_t = a)$
  • A reward function: $R(S_t = s, A_t = a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$
  • A discount factor $\gamma \in [0, 1]$ (a toy MDP built from these three ingredients is sketched below)
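
A toy MDP written out explicitly with these three ingredients, as referenced above. The states, actions, probabilities, and rewards are made up for illustration; the one-step lookahead at the end is just one way such a model gets used.

```python
# Transition model P[s' | s, a]: dict keyed by (state, action) -> {next_state: prob}
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s1": 0.6, "s0": 0.4},
}

# Reward function R(s, a): expected immediate reward
R = {
    ("s0", "a0"): 0.0,
    ("s0", "a1"): 1.0,
    ("s1", "a0"): 0.0,
    ("s1", "a1"): 2.0,
}

gamma = 0.95   # discount factor

def one_step_lookahead(s, a, V):
    """Expected value of taking action a in state s under a fixed value estimate V."""
    return R[(s, a)] + gamma * sum(p * V[s_next] for s_next, p in P[(s, a)].items())

V = {"s0": 0.0, "s1": 0.0}
print(one_step_lookahead("s0", "a1", V))   # -> 1.0
```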

Types of RL Agents

By what the agent learns:

  • Value-based agent
    Explicit: value function
    Implicit: policy (it can be derived from the value function)
  • Policy-based agent
    Explicit: policy
    Implicit: no value function
  • Actor-critic agent
    Explicit: both policy function and value function

By whether a model is used:

  • Model-based
    Explicit: model
    Implicit: policy and value function are optional
  • Model-free
    Explicit: value function and/or policy function
    Implicit: no model

Fundamental Problems

  • Planning
    The model of the environment is given, so the solution can be computed without interacting with it, e.g. the A* algorithm or RRT in path planning
  • RL
    The agent may not fully perceive the environment, but it can interact with it and choose a policy according to the rewards it receives
    The environment is a black box whose rules are not explicitly expressed (or are hard to express)
    The agent perceives the environment through the feedback to its actions
    Planning can still assist with inference and exploration

Exploration and Exploitation

  • Exploration: try the unknown; it may produce results better than anything seen before, but it may also turn out worse
  • Exploitation: choose according to past experience; this secures a lower bound on performance, but the upper bound is also fixed
  • Trade-off: combine the two in a principled way (one simple scheme is sketched below)
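
One common, simple way to make this trade-off is an ε-greedy rule: with probability ε take a random action (explore), otherwise take the action with the highest estimated value (exploit). A minimal sketch, with the Q-values and ε below chosen arbitrarily:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploitation

q_values = [0.2, 0.5, 0.1]        # toy action-value estimates for one state
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values, epsilon=0.1)] += 1
print(counts)   # mostly action 1, with occasional exploratory picks
```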
