Incremental Implementation

Stationary Problem

In stationary problems, the reward probabilities do not change over time.
As the number of steps grows, we would like to compute the value of an action efficiently, in particular, with constant memory and constant per-time-step computation.
To simplify notation, we concentrate on a single action.

  1. Denote by $R_i$ the reward received after the $i$th selection of this action, and let $Q_n$ denote the estimate of its action value after it has been selected $n-1$ times, which we can now write simply as: $Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$.
  2. Given $Q_n$ and the $n$th reward $R_n$, the new average of all $n$ rewards can be written as: $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\big(R_n + (n-1)Q_n\big) = Q_n + \frac{1}{n}\big[R_n - Q_n\big]$ (a short code sketch of this update follows the list).
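
A minimal sketch of how the update in item 2 becomes constant-memory code (Python is used here only for illustration; the function name `run_incremental_average` and the Gaussian reward stream are assumptions, not part of the original text):

```python
import random

def run_incremental_average(rewards):
    """Maintain the sample-average estimate with O(1) memory per step."""
    q = 0.0                  # Q_1: estimate before any reward is observed
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n     # Q_{n+1} = Q_n + (1/n) [R_n - Q_n]
    return q

rewards = [random.gauss(1.0, 1.0) for _ in range(1000)]
incremental = run_incremental_average(rewards)
batch_mean = sum(rewards) / len(rewards)
assert abs(incremental - batch_mean) < 1e-6   # both compute the same sample average
print(incremental, batch_mean)
```

Storing only $Q_n$ and $n$ is exactly the constant-memory, constant-computation goal stated above.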

The general form of this update rule is:
$NewEstimate \leftarrow OldEstimate + StepSize\,\big[Target - OldEstimate\big]$
In general, the step-size parameter is denoted $\alpha_t(a)$; for the sample-average method above, $\alpha_t(a) = \frac{1}{n}$.
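
As a hedged sketch of this general pattern, the hypothetical helper below applies the rule with an arbitrary step size; choosing $\frac{1}{n}$ recovers the sample average from the list above:

```python
def update(old_estimate, target, step_size):
    """One step of: NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# With step size 1/n this reduces to the incremental sample average:
q, n = 0.0, 0
for reward in [1.0, 0.0, 2.0]:
    n += 1
    q = update(q, target=reward, step_size=1.0 / n)
print(q)  # 1.0, the arithmetic mean of the three rewards
```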

Nonstationary Problem

In nonstationary problems, the reward probabilities change over time. In such cases, it makes sense to give more weight to recent rewards than to rewards received long ago.

  1. The incremental update rule is modified to be:
    $Q_{n+1} \doteq Q_n + \alpha\big[R_n - Q_n\big]$, where $\alpha \in (0, 1]$
  2. Exponential recency-weighted average: this results in $Q_{n+1}$ being a weighted average of past rewards and the initial estimate $Q_1$ (a numeric check of this identity appears after the list):
    $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$
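
A small numeric check of this expansion (the values of $\alpha$, $Q_1$, and the rewards below are arbitrary illustrations): applying the constant-step-size update repeatedly should match the weighted-average form exactly.

```python
alpha = 0.1
q1 = 5.0                        # initial estimate Q_1 (arbitrary)
rewards = [2.0, 0.0, 3.0, 1.0, 4.0]

# Left-hand side: run Q_{n+1} = Q_n + alpha [R_n - Q_n] repeatedly.
q = q1
for r in rewards:
    q += alpha * (r - q)

# Right-hand side: expand into (1-alpha)^n Q_1 + sum_i alpha (1-alpha)^(n-i) R_i.
n = len(rewards)
expanded = (1 - alpha) ** n * q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)

assert abs(q - expanded) < 1e-12
print(q, expanded)
```

Because the weights $\alpha(1-\alpha)^{n-i}$ decay geometrically with the age of the reward, recent rewards dominate the estimate, which is the behavior wanted for nonstationary problems.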
