Reinforcement Learning Fundamentals

Reinforcement Learning

The policy network takes a state $s$ as input and outputs a probability distribution over actions $a$: $\pi(a \mid s)$.
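As a concrete illustration, here is a minimal sketch of such a policy network for a discrete action space. The framework (PyTorch), layer sizes, and names like `state_dim` and `action_dim` are assumptions for the example, not specified by the text.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A small MLP that maps a state s to a distribution pi(a|s) over actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns the logits into a probability distribution over actions.
        return torch.softmax(self.net(state), dim=-1)

# Sampling an action a ~ pi(a|s) for a single state s (illustrative usage):
# probs = policy(torch.as_tensor(s, dtype=torch.float32))
# a = torch.distributions.Categorical(probs).sample()
```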

Collecting multiple training trajectories yields the following arrangement:

  • $r$ denotes the reward
  • the horizontal axis is $T$, the number of steps in one episode
  • the vertical axis is $N$, the number of episodes; each row is one trajectory, sampled according to the distribution $P$

$$
\begin{bmatrix}
s_{11}\,a_{11}\,r_{11} & \cdots & s_{1t}\,a_{1t}\,r_{1t} & \cdots & s_{1T}\,a_{1T}\,r_{1T} \\
\vdots & & \vdots & & \vdots \\
s_{n1}\,a_{n1}\,r_{n1} & \cdots & s_{nt}\,a_{nt}\,r_{nt} & \cdots & s_{nT}\,a_{nT}\,r_{nT} \\
\vdots & & \vdots & & \vdots \\
s_{N1}\,a_{N1}\,r_{N1} & \cdots & s_{Nt}\,a_{Nt}\,r_{Nt} & \cdots & s_{NT}\,a_{NT}\,r_{NT}
\end{bmatrix}
$$
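A hedged sketch of how such a matrix of trajectories could be collected: $N$ episodes, each up to $T$ steps, storing the triples $(s_{nt}, a_{nt}, r_{nt})$ row by row. The environment API follows the Gymnasium convention (`env.reset` / `env.step`); the names `env` and `policy` are assumptions, not fixed by the text.

```python
import torch

def collect_trajectories(env, policy, N: int, T: int):
    """Roll out N episodes of at most T steps under the current policy."""
    trajectories = []
    for _ in range(N):                      # N episodes, one per row of the matrix
        state, _ = env.reset()
        episode = []
        for _ in range(T):                  # T steps per episode
            s = torch.as_tensor(state, dtype=torch.float32)
            probs = policy(s)               # pi(a|s)
            action = torch.distributions.Categorical(probs).sample().item()
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))   # (s_nt, a_nt, r_nt)
            state = next_state
            if terminated or truncated:
                break
        trajectories.append(episode)
    return trajectories
```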

A trajectory under the policy: $\tau = s_1, a_1, s_2, a_2, \ldots, s_T, a_T$

The probability of this trajectory occurring:
$$
\begin{aligned}
P(\tau) &= P(s_1, a_1, s_2, a_2, \ldots, s_T, a_T) \\
&= P(s_1)\,\pi(a_1 \mid s_1)\,P(s_2 \mid s_1, a_1)\,\pi(a_2 \mid s_2)\,P(s_3 \mid s_1, a_1, s_2, a_2)\cdots \\
&= P(s_1)\prod_{t=1}^{T-1}\pi(a_t \mid s_t)\,P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)
\end{aligned}
$$

By the Markov property, the next state depends only on the current state and action, so this simplifies to:
$$
P(\tau) = P(s_1)\prod_{t=1}^{T-1}\pi(a_t \mid s_t)\,P(s_{t+1} \mid s_t, a_t)
$$
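A minimal sketch of this factorization in code, computed in log space for numerical stability. It assumes a tabular setting where the initial-state distribution, policy, and transition model are available as lookup functions; the names `initial_prob`, `policy_prob`, and `transition_prob` are hypothetical.

```python
import math

def log_prob_trajectory(trajectory, initial_prob, policy_prob, transition_prob):
    """log P(tau) = log P(s_1) + sum_t [log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t)].

    `trajectory` is a list of (s_t, a_t) pairs, t = 1..T.
    """
    states, actions = zip(*trajectory)
    logp = math.log(initial_prob(states[0]))                              # log P(s_1)
    for t in range(len(trajectory) - 1):                                  # t = 1..T-1
        logp += math.log(policy_prob(actions[t], states[t]))              # log pi(a_t|s_t)
        logp += math.log(transition_prob(states[t + 1], states[t], actions[t]))  # log P(s_{t+1}|s_t,a_t)
    return logp
```

Note that only the $\pi(a_t \mid s_t)$ factors depend on the policy parameters; the transition terms $P(s_{t+1} \mid s_t, a_t)$ belong to the environment.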
