$$\tau^\pi Q(s_t,a_t) = r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1}\sim p}\left[V(s_{t+1})\right]$$

$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t,a_t) - \alpha \cdot \log\pi(a_t|s_t)\right]$$

$$Q^{k+1} = \tau^\pi Q^k$$
As $k \to \infty$, $Q^k$ converges to the soft Q-value of $\pi$.
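The iteration $Q^{k+1}=\tau^\pi Q^k$ can be checked numerically on a small tabular problem. The following is a minimal sketch, not the author's code: the two-state, two-action MDP (`P`, `R`), the fixed policy `pi`, and the function names are all hypothetical and chosen only for illustration.

```python
import math

def soft_backup(Q, P, R, pi, gamma, alpha):
    """Apply the soft Bellman backup tau^pi once to a tabular Q."""
    n_s, n_a = len(R), len(R[0])
    # V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a | s)]
    V = [sum(pi[s][a] * (Q[s][a] - alpha * math.log(pi[s][a])) for a in range(n_a))
         for s in range(n_s)]
    # (tau^pi Q)(s, a) = r(s, a) + gamma * E_{s' ~ p}[V(s')]
    return [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
             for a in range(n_a)] for s in range(n_s)]

def soft_policy_evaluation(P, R, pi, gamma=0.9, alpha=1.0, iters=500):
    """Iterate Q^{k+1} = tau^pi Q^k starting from Q^0 = 0."""
    Q = [[0.0] * len(R[0]) for _ in R]
    for _ in range(iters):
        Q = soft_backup(Q, P, R, pi, gamma, alpha)
    return Q

# Hypothetical toy MDP: P[s][a][s2] are transition probabilities, R[s][a] rewards.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi = [[0.7, 0.3],
      [0.4, 0.6]]  # every action has positive probability, so log pi is finite

Q = soft_policy_evaluation(P, R, pi)
Q_next = soft_backup(Q, P, R, pi, 0.9, 1.0)
# Largest change from one more backup; close to zero once Q is a fixed point.
residual = max(abs(x - y) for ra, rb in zip(Q, Q_next) for x, y in zip(ra, rb))
print(residual)
```

A residual near zero indicates that `Q` is (numerically) a fixed point of $\tau^\pi$, which is what the statement above asserts in the limit.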
Proof (the steps below take $\alpha = 1$):
Define the entropy-augmented reward:

$$r_\pi(s_t,a_t) = r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1}\sim p}\left[H(\pi(\cdot|s_{t+1}))\right]$$

The update rule can then be written as:

$$Q(s_t,a_t) = r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1}\sim p}\left[H(\pi(\cdot|s_{t+1}))\right] + \gamma \cdot \mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\left[Q(s_{t+1},a_{t+1})\right]$$

Writing the entropy as the expectation of $-\log\pi$:

$$Q(s_t,a_t) = r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\left[-\log\pi(a_{t+1}|s_{t+1})\right] + \gamma \cdot \mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\left[Q(s_{t+1},a_{t+1})\right]$$

Merging the two expectations:

$$Q(s_t,a_t) = r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1},a_{t+1}\sim \rho_\pi}\left[Q(s_{t+1},a_{t+1}) - \log\pi(a_{t+1}|s_{t+1})\right]$$
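When $|A| < \infty$, the entropy is bounded, so the entropy-augmented reward is bounded as well and convergence is guaranteed, as in standard policy evaluation.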
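One way to see why bounded rewards suffice is the usual contraction argument. The following is a sketch under the assumptions that $\gamma < 1$ and that $\|\cdot\|_\infty$ denotes the sup norm over $S \times A$ (notation not used in the original). The entropy term does not depend on $Q$, so it cancels in the difference of two backups:

$$
\begin{aligned}
\left\|\tau^\pi Q_1 - \tau^\pi Q_2\right\|_\infty
&= \gamma \cdot \max_{s_t,a_t}\left|\mathbb{E}_{s_{t+1}\sim p,\,a_{t+1}\sim\pi}\left[Q_1(s_{t+1},a_{t+1}) - Q_2(s_{t+1},a_{t+1})\right]\right| \\
&\le \gamma \cdot \left\|Q_1 - Q_2\right\|_\infty
\end{aligned}
$$

Under these assumptions $\tau^\pi$ is a $\gamma$-contraction, so repeated application converges to its unique fixed point.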
$$\pi_{new} = \operatorname*{arg\,min}_{\pi'\in \Pi} D_{KL}\left(\pi'(\cdot|s_t)\,\Big\|\,\frac{\exp(Q^{\pi_{old}}(s_t,\cdot))}{Z^{\pi_{old}}(s_t)}\right)$$

$$Q^{\pi_{new}}(s_t,a_t) \ge Q^{\pi_{old}}(s_t,a_t)$$
This holds under the following conditions:

$$\pi_{old}\in \Pi,\quad (s_t,a_t)\in S \times A,\quad |A| < \infty$$

The proof is as follows:
$$
\begin{aligned}
\pi_{new} &= \operatorname*{arg\,min}_{\pi'\in \Pi} D_{KL}\left(\pi'(\cdot|s_t)\,\big\|\,\exp\left(Q^{\pi_{old}}(s_t,\cdot) - \log Z^{\pi_{old}}(s_t)\right)\right) \\
&= \operatorname*{arg\,min}_{\pi'\in \Pi} J_{\pi_{old}}(\pi'(\cdot|s_t))
\end{aligned}
$$

$$J_{\pi_{old}}(\pi'(\cdot|s_t)) = \mathbb{E}_{a_t \sim \pi'}\left[\log\pi'(a_t|s_t) - Q^{\pi_{old}}(s_t,a_t) + \log Z^{\pi_{old}}(s_t)\right]$$
Since $\pi_{new}=\pi_{old}$ is always an admissible choice, the minimizer can do no worse than $\pi_{old}$. The $\log Z^{\pi_{old}}(s_t)$ term depends only on the state and cancels on both sides, so the following always holds:

$$\mathbb{E}_{a_t\sim \pi_{new}}\left[\log\pi_{new}(a_t|s_t) - Q^{\pi_{old}}(s_t,a_t)\right] \le \mathbb{E}_{a_t \sim \pi_{old}}\left[\log\pi_{old}(a_t|s_t) - Q^{\pi_{old}}(s_t,a_t)\right]$$

By the definition of the soft value function (with $\alpha = 1$), the right-hand side equals $-V^{\pi_{old}}(s_t)$, hence:

$$\mathbb{E}_{a_t\sim \pi_{new}}\left[\log\pi_{new}(a_t|s_t) - Q^{\pi_{old}}(s_t,a_t)\right] \le -V^{\pi_{old}}(s_t)$$

$$\mathbb{E}_{a_t\sim \pi_{new}}\left[Q^{\pi_{old}}(s_t,a_t) - \log\pi_{new}(a_t|s_t)\right] \ge V^{\pi_{old}}(s_t)$$

Substituting this bound into the soft Bellman equation and repeatedly expanding $Q^{\pi_{old}}$ on the right-hand side gives:

$$
\begin{aligned}
Q^{\pi_{old}}(s_t,a_t) &= r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1}\sim p}\left[V^{\pi_{old}}(s_{t+1})\right] \\
&\le r(s_t,a_t) + \gamma \cdot \mathbb{E}_{s_{t+1}\sim p}\,\mathbb{E}_{a_{t+1}\sim \pi_{new}}\left[Q^{\pi_{old}}(s_{t+1},a_{t+1}) - \log\pi_{new}(a_{t+1}|s_{t+1})\right] \\
&\le \dots \\
&\le Q^{\pi_{new}}(s_t,a_t)
\end{aligned}
$$
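The key inequality in the improvement step can be checked for a single state. The following is a minimal sketch under two assumptions that are not in the original: $\Pi$ contains every distribution over actions, so the KL minimizer is exactly the Boltzmann policy $\exp(Q)/Z$, and $\alpha = 1$. The Q-values and the old policy are made-up numbers, and the function names are hypothetical.

```python
import math

def soft_value(q_row, pi_row):
    """E_{a ~ pi}[Q(s, a) - log pi(a | s)] for one state (alpha = 1)."""
    return sum(p * (q - math.log(p)) for q, p in zip(q_row, pi_row) if p > 0)

def boltzmann_policy(q_row):
    """pi(a | s) = exp(Q(s, a)) / Z(s), computed with a max-shift for stability."""
    m = max(q_row)
    weights = [math.exp(q - m) for q in q_row]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical Q^{pi_old}(s, .) and pi_old(. | s) for one state with three actions.
q_old = [1.0, 0.5, -2.0]
pi_old = [0.2, 0.3, 0.5]

pi_new = boltzmann_policy(q_old)
v_old = soft_value(q_old, pi_old)   # V^{pi_old}(s)
v_new = soft_value(q_old, pi_new)   # E_{a ~ pi_new}[Q^{pi_old} - log pi_new]
log_z = math.log(sum(math.exp(q) for q in q_old))
print(v_old, v_new, log_z)
```

For the Boltzmann policy, $Q - \log\pi_{new} = \log Z$ for every action, so `v_new` coincides with `log_z` up to rounding, and it is never smaller than `v_old`.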
Assumptions: $|A| < \infty$; $\pi \in \Pi$

By repeatedly alternating soft policy evaluation and soft policy improvement, the policy eventually converges to $\pi^\star$, which satisfies

$$Q^{\pi^\star}(s_t,a_t) \ge Q^{\pi}(s_t,a_t) \quad \text{for all } \pi \in \Pi$$
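The alternation of the two steps can be sketched end to end on a tabular problem. This is an illustrative sketch rather than the author's implementation: it assumes $\alpha = 1$ and an unrestricted policy class (so the improvement step is an exact softmax), and the toy MDP and all function names are hypothetical.

```python
import math

def evaluate(P, R, pi, gamma, iters=300):
    """Soft policy evaluation: iterate Q <- tau^pi Q from Q = 0 (alpha = 1)."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [sum(pi[s][a] * (Q[s][a] - math.log(pi[s][a])) for a in range(n_a))
             for s in range(n_s)]
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def improve(Q):
    """Soft policy improvement: pi_new(. | s) proportional to exp(Q(s, .))."""
    pi = []
    for row in Q:
        m = max(row)
        weights = [math.exp(q - m) for q in row]
        z = sum(weights)
        pi.append([w / z for w in weights])
    return pi

def soft_policy_iteration(P, R, pi, gamma=0.8, rounds=100):
    """Alternate evaluation and improvement; keep every evaluated Q."""
    history = []
    for _ in range(rounds):
        Q = evaluate(P, R, pi, gamma)
        history.append(Q)
        pi = improve(Q)
    return pi, history

# Hypothetical toy MDP: P[s][a][t] are transition probabilities, R[s][a] rewards.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi0 = [[0.5, 0.5], [0.5, 0.5]]

pi_star, history = soft_policy_iteration(P, R, pi0)
print(pi_star)
```

In this sketch, each entry of `history` should be elementwise no smaller than the previous one, mirroring $Q^{\pi_{new}} \ge Q^{\pi_{old}}$, and re-evaluating `pi_star` should reproduce the last Q-table, which is the fixed-point behavior the theorem describes.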
By CyrusMay 2022.09.06
However big the world is, it is no more than you and me
Building a universe out of the smallest memories
————Mayday (因为你 所以我)————