UCAS Advanced Artificial Intelligence 10: Reinforcement Learning (Multi-Armed Bandits, Bellman Equations)

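Contents

  • Multi-armed bandit (stateless)
  • Markov decision process (MDP)
  • 1. Dynamic programming
  • Monte Carlo methods: for when the full environment model is unknown
    • 2.1 On-policy Monte Carlo
    • 2.2 Off-policy Monte Carlo
  • Temporal-difference methods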

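  • Reinforcement learning
    • Goal: learn a mapping from environment states to actions so that the agent chooses the actions that earn the most reward from the environment, i.e. so that the environment's evaluation of the learning system is, in some sense, optimal
    • How it differs from supervised learning:
      • Supervised learning: learns from labels
      • Reinforcement learning: learns a policy through interaction
    • Characteristics (used to judge whether a problem can be solved with reinforcement learning)
      • Trial-and-error search
      • Delayed reward
    • Challenge: trading off
      • Exploitation (keep doing what is already known to work)
      • Exploration (try other actions to see whether something better exists)
    • The overall objective is what matters; intermediate stages do not
    • Participants: the agent and the environment
      • States, actions, and rewards
    • Elements
      • Policy
        • A mapping from states to actions
          • Deterministic policy: S -> A
          • Stochastic policy: S -> a probability distribution over actions (A1, A2, A3, ...)
      • Reward
        • A function of state and action; it can be stochastic
      • Value
        • Cumulative reward
        • The long-term objective
      • Environment model
        • Describes the feedback the environment returns
  • Feedback
    • Evaluative feedback (reinforcement learning)
      • Evaluates the action that was taken
    • Instructive feedback (supervised learning)
      • Independent of the action that was taken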

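Multi-armed bandit (stateless)

| Method | Deterministic? | Properties |
| --- | --- | --- |
| Greedy | Action: $A_t=\arg\max_a Q_t(a)$, where $Q_t(a)$ is the sample mean. Deterministic | - |
| $\epsilon$-greedy | With probability $1-\epsilon$ choose greedily; with probability $\epsilon$ choose at random. Deterministic value estimates, random choice with probability $\epsilon$ | - |
| Optimistic initial values | Every action starts with a high $Q_1$, with $\epsilon=0$. Deterministic | Explores at first, greedy in the end |
| UCB | Action: $A_t=\arg\max_a\left(Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right)$, where $N_t(a)$ is the number of times $a$ has been selected. Deterministic | Poor at first, then better than greedy; converges to greedy |
| Gradient bandit | $P(A_t=a)=\frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}=\pi_t(a)$; objective $E(R_t)=\sum_b\pi_t(b)q(b)$. Non-deterministic (stochastic policy) | Updates $H_t$ |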
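  • Formalization

    • Action: which arm to pull
      • $A_t$: the action in round $t$
    • Reward: the payoff from each pull
      • $R_t$: the reward in round $t$
    • Expected reward of taking action $a$ in round $t$:
      • $q(a)=E(R_t \mid A_t=a)$
      • A greedy policy would always pick the $a$ with the largest expectation, but the expectation is unknown
      • So $q(a)$ has to be estimated from experience as $Q_t(a)$, and the greedy policy acts on $Q_t(a)$
  • Objective: the expected reward of the current action

  • Policy

    • Exploitation
      • Choose greedily, i.e. pick the action with the largest $Q_t(a)$
      • Advantage: maximizes the immediate reward
      • Drawback: $Q_t(a)$ is only an estimate of $q(a)$, so because of estimation uncertainty the greedy action is not necessarily the one that maximizes $q(a)$
    • Exploration
      • Choose an action other than the greedy one (non-greedy actions)
      • Drawback: short-term reward tends to be lower
      • Advantage: long-term reward tends to be higher, because exploration can uncover better actions for later use
    • Each round is one or the other, so how should the two be balanced?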
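  • Greedy policy

    • $A_t=\arg\max_a Q_t(a)$
    • If several actions tie for the maximum, pick one of them at random
  • $\epsilon$-greedy policy

    • With probability $1-\epsilon$: greedy choice (exploitation)
    • With probability $\epsilon$: random choice (exploration)
    • The appropriate $\epsilon$ depends on the variance of the rewards around $q(a)$: the larger the variance, the larger $\epsilon$ should be
    • Example
      • Assume $q(a) \sim N(0,1)$
      • The reward of the chosen action $A_t$ is then normally distributed as well, e.g. $R_t \sim N(q(A_t),1)$ in the standard 10-armed testbed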
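The $\epsilon$-greedy rule above can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the function name `epsilon_greedy`, the sample estimates, and the seed are all made up for the example.

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """Return an arm index: random with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        # Exploration: every arm is equally likely, greedy arms included.
        return int(rng.integers(len(q_estimates)))
    # Exploitation: break ties among maximal estimates at random.
    best = np.flatnonzero(q_estimates == np.max(q_estimates))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_estimates = np.array([0.2, 1.5, 1.5, -0.3])   # hypothetical Q_t(a) values
picks = [epsilon_greedy(q_estimates, 0.1, rng) for _ in range(10_000)]

# Arms 1 and 2 are tied for the maximum, so together they should receive
# roughly (1 - epsilon) + epsilon * 2/4 = 0.95 of the selections.
greedy_share = float(np.mean(np.isin(picks, [1, 2])))
print(greedy_share)
```

Note that the random branch does not exclude the greedy arms, which matches the usual textbook definition; a variant that samples only non-greedy arms would need a small change.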
  • Action-value estimation $Q_t(a)$

    • $Q_t(a)=\frac{\text{sum of rewards received when } a \text{ was taken}}{\text{number of times } a \text{ was taken}}=\frac{\sum_{i=1}^{t-1}R_i\,\mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{1}_{A_i=a}}$, i.e. the mean reward of action $a$
    • Convention: if the denominator is 0, set $Q_t(a)=0$
    • As the denominator goes to infinity, $Q_t(a) \to q(a)$
    • Incremental implementation
      • $Q_n(a)=\frac{R_1+R_2+\dots+R_{n-1}}{n-1}$
      • $Q_{n+1}(a)=\frac{R_1+R_2+\dots+R_{n-1}+R_n}{n}=\frac{1}{n}\left(R_n+\sum_{i=1}^{n-1}R_i\right)=\frac{1}{n}\left(R_n+(n-1)Q_n(a)\right)=Q_n(a)+\frac{1}{n}\left(R_n-Q_n(a)\right)$
    • Update rule: $\text{NewEstimate} \leftarrow \text{OldEstimate}+\text{StepSize}\,(\text{Target}-\text{OldEstimate})$
      • Step size of the sample-average method: $1/n$
        • Converges
      • More generally: $\alpha$ or $\alpha_t(a)$, much like the learning rate in SGD
    • Update rule for the nonstationary case
      • $Q_{n+1}(a)=Q_n(a)+\alpha\left(R_n-Q_n(a)\right)=\alpha R_n+(1-\alpha)Q_n(a)=\alpha R_n+(1-\alpha)\alpha R_{n-1}+(1-\alpha)^2Q_{n-1}(a)=\dots=(1-\alpha)^nQ_1(a)+\sum_{i=1}^n\alpha(1-\alpha)^{n-i}R_i$
      • This estimate is already nonstationary: the more recent a reward, the larger its weight (a weighted average)
      • Does not converge
    • Convergence conditions
      • $\sum_{n=1}^{\infty}\alpha_n(a)=\infty$: the steps are large enough to overcome the initial value and random fluctuations
      • $\sum_{n=1}^{\infty}\alpha_n^2(a)<\infty$: the steps eventually become small enough to guarantee convergence
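As a quick check of the incremental rule $Q_{n+1}=Q_n+\frac{1}{n}(R_n-Q_n)$, the sketch below compares it with the batch mean. The reward data are synthetic and the function name `incremental_mean` is an illustrative choice.

```python
import numpy as np

def incremental_mean(rewards, q1=0.0):
    """Sample-average estimate computed one reward at a time."""
    q = q1
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n   # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    return q

rng = np.random.default_rng(1)
rewards = rng.normal(loc=2.0, scale=1.0, size=1000)   # synthetic rewards

q_inc = incremental_mean(rewards)
q_batch = float(rewards.mean())
print(q_inc, q_batch)   # the two agree up to floating-point rounding
```

With step size $1/n$ the very first update overwrites the initial value, so `q1` has no influence on the result; only a single running value and a counter need to be stored.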
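  • Stationary problems

    • $q(a)$ is fixed and does not change over time
    • As more samples are observed, the sample-average estimate eventually converges to $q(a)$
  • Nonstationary problems

    • $q(a)$ is a function of time (the arm may have aged, for example)
    • Focus on recent observations; older ones become unreliable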
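To see why a constant step size suits nonstationary problems, the following sketch tracks an arm whose mean reward jumps halfway through. The data are made up, and `closed_form` simply evaluates the weighted-average expression $(1-\alpha)^nQ_1+\sum_i\alpha(1-\alpha)^{n-i}R_i$ for comparison.

```python
import numpy as np

def constant_step_estimate(rewards, alpha, q1=0.0):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q

def closed_form(rewards, alpha, q1=0.0):
    """(1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
    return (1 - alpha) ** n * q1 + float(np.dot(weights, rewards))

# Hypothetical drifting arm: reward 0 for 200 rounds, then 5 for 200 rounds.
rewards = np.concatenate([np.zeros(200), np.full(200, 5.0)])

q_const = constant_step_estimate(rewards, alpha=0.1)
q_avg = float(rewards.mean())
print(q_const, q_avg)   # the constant-step estimate sits near 5, the plain average at 2.5
```

The plain average gives every observation equal weight and lags behind the change, whereas the constant-step estimate forgets old rewards geometrically.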


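  • $N(A)$: the number of times action $A$ has been selected

  • Action-selection strategies

    • How should one be designed?
      • Greedy: choose the action with the best current estimate
      • $\epsilon$-greedy: with some probability choose a non-greedy action at random, without distinguishing among the non-greedy actions
    • Balance exploitation and exploration to cope with the uncertainty in the value estimates
    • Key: decide the probability with which each action is selected
    • Initial action-value estimates
      • In the greedy policies above, every action's initial estimate is 0
      • Initial estimates are a way to bring in prior knowledge
      • They can also help balance exploitation and exploration
      • Optimistic initial values
        • Every action starts with a high initial value
        • Advantage: early on every action has a good chance of being tried, so exploration is fast
        • Early on it only explores and does not exploit, regardless of history
        • Performs poorly at first but catches up quickly
        • Drawback: it may take forever to finish exploring
        • Equivalent to a greedy policy with $Q_1=5$ and $\epsilon=0$
    • UCB (upper confidence bound)
      • $A_t=\arg\max_a\left(Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right)$, where $N_t(a)$ is the number of times $a$ has been selected
      • Picks the action with the most potential: selection follows the upper confidence bound of the estimate
        • First term: the current estimate is high (close to greedy)
        • Second term: the uncertainty is high (the action has been selected few times, so it has more potential)
        • $c$: controls the degree of exploration
      • Comparison:
        • Worse in the first few rounds, then better than the greedy policy
        • Stable
        • The parameter is hard to tune
        • Eventually converges to the greedy policy
      • More complex; rarely used outside the multi-armed bandit setting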
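A minimal UCB sketch follows. Two choices here are assumptions rather than part of the notes: arms with $N_t(a)=0$ are tried first (a common convention, since the bonus is undefined for them), and the rewards are Gaussian with unit variance. The function names, true means, and $c=2$ are illustrative.

```python
import numpy as np

def ucb_action(q_estimates, counts, t, c=2.0):
    """argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ], trying unpulled arms first."""
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])             # assumption: N_t(a) = 0 counts as maximal
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_estimates + bonus))

def run_ucb(true_means, steps, c=2.0, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)                        # Q_t(a), sample-average estimates
    n = np.zeros(k)                        # N_t(a), selection counts
    for t in range(1, steps + 1):
        a = ucb_action(q, n, t, c)
        r = rng.normal(true_means[a], 1.0) # assumed Gaussian reward noise
        n[a] += 1
        q[a] += (r - q[a]) / n[a]          # incremental sample average
    return q, n

q, n = run_ucb([0.1, 0.5, 1.5], steps=2000)
print(n)   # the arm with the highest true mean should dominate the counts
```

The bonus term shrinks as an arm's count grows, so rarely chosen arms keep getting occasional retries while the best arm absorbs most of the pulls.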
  • Gradient bandit algorithm

    • A non-deterministic algorithm (stochastic policy)
    • $H_t(a)$: the preference for action $a$ in round $t$
      • $H_t(a)$ is updated after the action has been chosen and its reward observed
    • Probability of selecting $a$: $P(A_t=a)=\frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}=\pi_t(a)$
    • Update rule (a form of stochastic gradient ascent)
      • $H_{t+1}(A_t)=H_t(A_t)+\alpha\left(R_t-\bar{R}_t\right)\left(1-\pi_t(A_t)\right)$, where $\bar{R}_t$ is the mean of the rewards so far
      • For all $a \neq A_t$: $H_{t+1}(a)=H_t(a)-\alpha\left(R_t-\bar{R}_t\right)\pi_t(a)$
    • Objective: the expected reward in round $t$
      • $E(R_t)=\sum_b\pi_t(b)q(b)$
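The two preference updates can be written as one vector operation, $H \leftarrow H+\alpha(R_t-\bar{R}_t)(\mathbf{e}_{A_t}-\pi_t)$, where $\mathbf{e}_{A_t}$ is a one-hot vector. The sketch below assumes Gaussian rewards and uses the mean of the rewards seen before the current round as the baseline; the names and parameter values are illustrative.

```python
import numpy as np

def softmax(h):
    z = np.exp(h - np.max(h))   # subtract the max for numerical stability
    return z / z.sum()

def gradient_bandit(true_means, steps, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    h = np.zeros(k)             # preferences H_t(a)
    baseline = 0.0              # running mean of past rewards
    for t in range(1, steps + 1):
        pi = softmax(h)
        a = rng.choice(k, p=pi)
        r = rng.normal(true_means[a], 1.0)   # assumed reward noise
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        # Chosen arm:  +alpha * (r - baseline) * (1 - pi[a])
        # Other arms:  -alpha * (r - baseline) * pi[b]
        h += alpha * (r - baseline) * (one_hot - pi)
        baseline += (r - baseline) / t
    return softmax(h)

pi = gradient_bandit([0.0, 1.0, 3.0], steps=3000)
print(pi)   # most of the probability mass should sit on the last arm
```

Only the differences between preferences matter, since adding a constant to every $H_t(a)$ leaves the softmax unchanged.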
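  • The multi-armed bandit is a simplified reinforcement learning problem

    • Actions and states are unrelated
  • Extension

    • Contextual multi-armed bandits
      • Actions do not change the state
  • The more general case

    • Markov decision processes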

Markov decision process (MDP)

  • Commonly used to model sequential decision making

  • Actions

    • Earn rewards
    • Change the state, which affects long-term reward
  • Learn a mapping from states to actions: the policy

    • Multi-armed bandit: learns $q(a)$
    • MDP: learns $q(s,a)$ or $v(s)$
  • The agent and the environment interact at discrete time steps

  • Formal notation

    • $S_t \in S$: state
    • $A_t \in A$: action (some moves are allowed and others are not, so the set of admissible values is restricted)
    • After taking $A_t$, the process moves to state $S_{t+1}$ and yields reward $R_{t+1}$
    • The sequence produced by a Markov decision process is written
      • $S_0,A_0,R_1,S_1,A_1,R_2,S_2,\dots$
  • Modeling a finite Markov decision process

    • $p(s',r \mid s,a)=P(S_t=s',R_t=r \mid S_{t-1}=s,A_{t-1}=a) \in [0,1]$
      • Sums to 1 over all $(s',r)$: $\sum_{s'}\sum_r p(s',r \mid s,a)=1$
      • The enumeration is very large (if everything could be enumerated, A* search would be enough)
    • State-transition probability:
      • $p(s' \mid s,a)=\sum_r p(s',r \mid s,a)$
    • Expected reward for a state-action pair
      • $r(s,a)=E(R_t \mid S_{t-1}=s,A_{t-1}=a)=\sum_r r\sum_{s'} p(s',r \mid s,a)$
    • Expected reward for a state, action, next-state triple
      • $r(s,a,s')=E(R_t \mid S_{t-1}=s,A_{t-1}=a,S_t=s')=\sum_r r\,\frac{p(s',r \mid s,a)}{p(s' \mid s,a)}$
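The three derived quantities are all marginals of the joint dynamics $p(s',r \mid s,a)$. The sketch below stores the dynamics as a dictionary for one made-up state-action pair; the state names, rewards, and probabilities are purely illustrative.

```python
# dynamics[(s, a)] maps (next_state, reward) -> probability.
# The numbers are invented for this example and sum to 1.
dynamics = {
    ("s0", "go"): {("s1", 1.0): 0.7, ("s0", 0.0): 0.2, ("s1", 5.0): 0.1},
}

def transition_prob(dynamics, s, a, s_next):
    """p(s'|s,a) = sum_r p(s',r|s,a)."""
    return sum(p for (sn, _), p in dynamics[(s, a)].items() if sn == s_next)

def expected_reward(dynamics, s, a):
    """r(s,a) = sum_r r * sum_s' p(s',r|s,a)."""
    return sum(r * p for (_, r), p in dynamics[(s, a)].items())

def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s,a,s') = sum_r r * p(s',r|s,a) / p(s'|s,a)."""
    p_next = transition_prob(dynamics, s, a, s_next)
    joint = sum(r * p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return joint / p_next

p_s1 = transition_prob(dynamics, "s0", "go", "s1")               # 0.7 + 0.1
r_sa = expected_reward(dynamics, "s0", "go")                     # 0.7*1 + 0.2*0 + 0.1*5
r_sas = expected_reward_given_next(dynamics, "s0", "go", "s1")   # (0.7 + 0.5) / 0.8
print(p_s1, r_sa, r_sas)
```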
  • The reward hypothesis

    • Goal: long-term or final
    • Reward: immediate
    • Hypothesis (the foundation of reinforcement learning)
      • Every goal or purpose can be expressed as maximizing the expected cumulative reward
  • Cumulative reward (the return)

    • Episodic tasks: $G_t=R_{t+1}+R_{t+2}+R_{t+3}+\dots+R_T$ ($t<T$; $T$ is the final step, at which the terminal state is reached)
      • A Markov decision process with a terminal state is an episodic task
    • Continuing tasks: $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\dots=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$, with $0 \leq \gamma \leq 1$ (the discount rate)
      • No termination
      • Recursion: $G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}=R_{t+1}+\gamma G_{t+1}$
      • Unified sum: $G_t=\sum_{k=t+1}^{T}\gamma^{k-t-1}R_k$, where $T=\infty$ and $\gamma=1$ cannot both hold (the sum would not converge)
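The recursion $G_t=R_{t+1}+\gamma G_{t+1}$ gives a one-pass way to compute every return of an episode by sweeping backwards. The episode below is a made-up three-step example.

```python
def discounted_returns(rewards, gamma):
    """rewards[t] holds R_{t+1}; the result holds G_t for t = 0..T-1."""
    returns = [0.0] * len(rewards)
    g = 0.0                      # the return after the terminal step is 0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

rewards = [1.0, 0.0, 2.0]        # R_1, R_2, R_3 of a hypothetical episode
returns = discounted_returns(rewards, gamma=0.5)
# G_2 = 2.0, G_1 = 0.0 + 0.5 * 2.0 = 1.0, G_0 = 1.0 + 0.5 * 1.0 = 1.5
print(returns)
```

Setting `gamma=1.0` recovers the undiscounted episodic return, which is fine here because the episode is finite.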
  • Policy

    • A mapping from states to actions
    • Stochastic policy: $\pi(a \mid s)$, a probability
    • Deterministic policy: $a=\pi(s)$
    • State-value function
      • $v_\pi(s)=E_\pi(G_t \mid S_t=s)=E_\pi\left(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} \mid S_t=s\right)$ for all $s \in S$
    • Action-value function
      • $q_\pi(s,a)=E_\pi(G_t \mid S_t=s,A_t=a)=E_\pi\left(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} \mid S_t=s,A_t=a\right)$
  • Bellman equation (a genuine equation, so the equations for all states can be solved simultaneously)

    • $n$ states give a system of $n$ linear equations in $n$ unknowns
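In standard notation the Bellman equation for a fixed policy reads $v_\pi(s)=\sum_a\pi(a \mid s)\sum_{s',r}p(s',r \mid s,a)\left(r+\gamma v_\pi(s')\right)$. Averaging over the policy turns it into the matrix equation $v=r_\pi+\gamma P_\pi v$, which can be solved directly. The sketch below does this for a made-up 3-state chain; `P` and `r` are assumed to be already averaged over the policy.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi:
#   P[s, s'] = sum_a pi(a|s) * p(s'|s,a)
#   r[s]     = sum_a pi(a|s) * r(s,a)
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# v = r + gamma * P v   <=>   (I - gamma * P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```

A direct solve costs on the order of $n^3$ operations for $n$ states, which is why iterative methods are preferred when the state space is large.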
  • Optimal policy

    • For two policies $\pi$ and $\pi'$: $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
    • $v_*(s)=\max_\pi v_\pi(s)$; there can be several optimal policies, but they all share the same $v_*$
    • Action-value function: $q_*(s,a)=\max_\pi q_\pi(s,a)$
  • Bellman optimality equation (this one is an assignment)
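The image that carried these equations is not available in this copy. In standard textbook notation, and using the same $p(s',r \mid s,a)$ and $\gamma$ as above, the Bellman optimality equations are usually written as:

```latex
\begin{aligned}
v_*(s)   &= \max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr] \\
q_*(s,a) &= \sum_{s',r} p(s',r \mid s,a)\,\bigl[\,r + \gamma \max_{a'} q_*(s',a')\,\bigr]
\end{aligned}
```

Because of the $\max$, these equations are no longer linear in the unknowns, so they are typically solved by repeatedly assigning the right-hand side to the left-hand side rather than by solving a linear system.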

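  • Bellman optimality equation based on the state-value function

    • Step 1: solve the Bellman optimality equation for the state-value function to obtain the state-value function of the optimal policy
    • Step 2: using that state-value function, run a one-step search to find the optimal action in every state
    • Note: there may be more than one optimal policy
    • Advantage of the Bellman optimality equation: a greedy local search is enough to reach the global optimum
  • Bellman optimality equation based on the action-value function

    • Gives the optimal policy directly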
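The difference between the two routes can be sketched on a tiny MDP. With $v_*$ a one-step lookahead through the model is needed; with $q_*$ a plain argmax is enough. The 2-state, 2-action MDP below is invented for illustration, and $v_*$ is obtained by value iteration, which is one possible solver among several.

```python
import numpy as np

def value_iteration(p, r, gamma, sweeps=500):
    """Repeatedly apply v(s) <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    v = np.zeros(p.shape[0])
    for _ in range(sweeps):
        v = np.max(r + gamma * p @ v, axis=1)
    return v

def greedy_from_v(p, r, v, gamma):
    """One-step lookahead: needs the model p and r."""
    return np.argmax(r + gamma * p @ v, axis=1)

def greedy_from_q(q):
    """With q* the optimal action is a plain argmax, no model needed."""
    return np.argmax(q, axis=1)

# p[s, a, s'] and r[s, a] for a made-up MDP: action 0 stays, action 1 switches state.
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

v_star = value_iteration(p, r, gamma)
q_star = r + gamma * p @ v_star
policy_from_v = greedy_from_v(p, r, v_star, gamma)
policy_from_q = greedy_from_q(q_star)
print(v_star, policy_from_v, policy_from_q)
```

In this example staying in state 1 earns 2 per step, so the optimal policy switches out of state 0 and then stays put.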
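  • Limitations

    1. Requires a model of the environment
    2. Requires heavy computation and memory (to store the value function)
    3. Relies on the Markov property
  • Practical approaches

    1. Dynamic programming (on the exam)
    2. Monte Carlo methods
    3. Temporal-difference learning (widely used)
    4. Parameterized methods (widely used)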

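1. Dynamic programming

  • Policy evaluation
    1. Write out and solve the equations (computationally expensive)
    2. Iterative policy evaluation: search for the fixed point
      • Update rule (expected update): $v_{k+1}(s)=\sum_a\pi(a \mid s)\sum_{s',r}p(s',r \mid s,a)\left(r+\gamma v_k(s')\right)$
      • Once the iteration reaches a fixed point, it has found the solution of the equations
      • Two implementations
        1. Synchronous update: two arrays, one holding the new values and one holding the old
        2. Asynchronous (in-place) update: a single array that holds new and old values at the same time (converges faster, and convergence is still guaranteed)
    • Goal: find the optimal policy (policy improvement)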
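Before the gridworld script, here is a smaller side-by-side sketch of the two implementations on a made-up 3-state chain under a fixed policy. `P` and `r` are assumed to be already averaged over the policy, and the exact solution from a linear solve serves as the reference.

```python
import numpy as np

def evaluate_sync(P, r, gamma, sweeps):
    """Two-array version: every new value is computed from the old array only."""
    v = np.zeros(len(r))
    for _ in range(sweeps):
        v = r + gamma * P @ v
    return v

def evaluate_in_place(P, r, gamma, sweeps):
    """One-array version: later states in a sweep already see updated values."""
    v = np.zeros(len(r))
    for _ in range(sweeps):
        for s in range(len(r)):
            v[s] = r[s] + gamma * P[s] @ v
    return v

# Hypothetical policy-averaged transition matrix and rewards.
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)
err_sync = float(np.max(np.abs(evaluate_sync(P, r, gamma, 10) - v_exact)))
err_in_place = float(np.max(np.abs(evaluate_in_place(P, r, gamma, 10) - v_exact)))
print(err_sync, err_in_place)   # after 10 sweeps the in-place error is no larger
```

With non-negative rewards and a zero start, both iterations climb towards the fixed point from below, and the in-place sweep is never behind the synchronous one; in general the gain depends on the order in which states are visited.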
import numpy as np
v=np.zeros((5,5))
print(v)
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
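# The four moves as (row, column) offsets: up, down, left, right
action=np.array([[-1,0],[1,0],[0,-1],[0,1]])
for k in range(100):  # 100 sweeps of in-place iterative policy evaluation
    for i in range(0,5):
        for j in range(5):
            s=np.array([i,j])
            
            v_a=0.0
            for a in action:  # equiprobable random policy: each move has probability 1/4
                s_1=s+a                
                # Off the grid: stay in place, reward -1.
                # Note: this check runs first, so at A and B the move that would
                # leave the grid earns -1 rather than the special reward.
                if(s_1[0]<0 or s_1[0]>4 or s_1[1]<0 or s_1[1]>4):
                    s_1=s
                    v_a+=1/4*(-1.+0.9*v[s_1[0],s_1[1]])
                elif(np.equal([0,1],s).all()):# special state A: jump to (4,1), reward +10
                    
                    s_1=np.array([4,1])
                    v_a+=1/4*(10.+0.9*v[s_1[0],s_1[1]])
                elif(np.equal([0,3],s).all()):# special state B: jump to (2,3), reward +5
                    
                    s_1=np.array([2,3])
                    v_a+=1/4*(5.+0.9*v[s_1[0],s_1[1]])
                else:  # ordinary move, reward 0
                    v_a+=1/4*0.9*v[s_1[0],s_1[1]]
            v[i,j]=v_a  # in-place (asynchronous) update with discount 0.9
    print(v)  # value table after each sweep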
[[-0.5         7.25        1.38125     3.5         0.2875    ]
 [-0.3625      1.5496875   0.65946094  0.93587871  0.02526021]
 [-0.3315625   0.27407813  0.21004629  0.25783312 -0.186304  ]
 [-0.32460156 -0.01136777  0.04470267  0.06807055 -0.27660253]
 [-0.57303535 -0.3814907  -0.32577731 -0.30798402 -0.63153197]]
[[ 0.8246875   8.62374378  2.93700231  4.46153736  0.63890445]
 [ 0.12807031  2.17920446  1.40897965  1.38456233  0.16904517]
 [-0.30715352  0.46591413  0.48992165  0.39515637 -0.22720862]
 [-0.5236356  -0.08876464  0.03227631 -0.03535962 -0.51340812]
 [-0.96151933 -0.64544919 -0.5305602  -0.58872306 -1.0321689 ]]
[[ 1.84026754  8.75466414  3.70149128  4.77057646  0.89892187]
 [ 0.61408748  2.52982021  1.82380398  1.61068095  0.30157386]
 [-0.19392719  0.61583626  0.64509141  0.44847092 -0.24787869]
 [-0.64776552 -0.14514798 -0.01484469 -0.15041361 -0.6873706 ]
 [-1.22365701 -0.82258324 -0.69026002 -0.80385226 -1.30000115]]
[[ 2.43608951  8.66455575  4.01609618  4.87609758  1.06949091]
 [ 0.96186575  2.71486389  2.0220148   1.72083536  0.38990482]
 [-0.08439791  0.70434212  0.71099621  0.45754634 -0.26975458]
 [-0.72271789 -0.19255583 -0.07250248 -0.24889027 -0.81385373]
 [-1.39833841 -0.94834094 -0.81586503 -0.96293696 -1.48477842]]
[[ 2.76218512e+00  8.55939491e+00  4.13156078e+00  4.90596573e+00
   1.17284179e+00]
 [ 1.17976629e+00  2.80474158e+00  2.10783013e+00  1.76878058e+00
   4.38898838e-01]
 [-7.67666250e-03  7.45988690e-01  7.28744102e-01  4.45247962e-01
  -2.94878841e-01]
 [-7.72289979e-01 -2.35607559e-01 -1.28614221e-01 -3.28535315e-01
  -9.07460420e-01]
 [-1.51639424e+00 -1.04114675e+00 -9.13426666e-01 -1.08017741e+00
  -1.61536880e+00]]
[[ 2.93429457  8.4730898   4.16415045  4.90438466  1.23001759]
 [ 1.3050033   2.84218018  2.13836745  1.78355225  0.46045771]
 [ 0.0359807   0.75854192  0.7230472   0.42371669 -0.32158709]
 [-0.80986999 -0.27474503 -0.17857346 -0.39206128 -0.97820746]
 [-1.59885617 -1.11133929 -0.98879128 -1.16718972 -1.70963033]]
[[ 3.02050351  8.40629118  4.16296859  4.88949532  1.25724735]
 [ 1.37082523  2.8516558   2.14227537  1.78108765  0.46487126]
 [ 0.05498252  0.75486161  0.70701305  0.39925177 -0.34802609]
 [-0.84140995 -0.30970375 -0.22129723 -0.4426746  -1.03267116]
 [-1.65885386 -1.16545484 -1.04711494 -1.23248716 -1.77899427]]
[[ 3.05907777  8.3547335   4.14863137  4.86963139  1.2660244 ]
 [ 1.4007218   2.84683326  2.1338022   1.77020024  0.45944071]
 [ 0.05806009  0.7429956   0.68731927  0.37503424 -0.37290002]
 [-0.86917918 -0.34041052 -0.25714818 -0.48313626 -1.07523288]
 [-1.70427689 -1.20788287 -1.09254246 -1.28211103 -1.8309498 ]]
[[ 3.07156244  8.3144941   4.12997579  4.84881517  1.26406856]
 [ 1.40986496  2.83526029  2.12012001  1.75576728  0.44893472]
 [ 0.05189183  0.7276637   0.6672757   0.35257651 -0.39548988]
 [-0.89394432 -0.36704513 -0.28697583 -0.51564223 -1.10889583]
 [-1.73983572 -1.24164389 -1.12823647 -1.3203114  -1.87049904]]
[[ 3.07018389  8.28265155  4.11085157  4.82897256  1.25635999]
 [ 1.40762022  2.82106248  2.10486533  1.74045355  0.43630814]
 [ 0.04097707  0.71151078  0.64844478  0.3324974  -0.41550554]
 [-0.91596582 -0.38994182 -0.31170954 -0.54189436 -1.13577882]
 [-1.76838826 -1.26884735 -1.15654857 -1.35008201 -1.90104325]]
[[ 3.0618939   8.25712464  4.09290817  4.81095457  1.2459961 ]
 [ 1.39959957  2.80644757  2.08985717  1.72566389  0.42330408]
 [ 0.02812736  0.69594252  0.6314822   0.31494289 -0.43293341]
 [-0.93538792 -0.40950052 -0.33220378 -0.56320239 -1.15741552]
 [-1.79172765 -1.29099042 -1.17921057 -1.3735461  -1.92493583]]
[[ 3.0506152   8.23643451  4.07678474  4.79505123  1.23482819]
 [ 1.38907768  2.79254517  2.0759571   1.71208244  0.41088829]
 [ 0.01499592  0.68164262  0.61657624  0.29981765 -0.44791967]
 [-0.95236454 -0.42613113 -0.34919277 -0.58057577 -1.17494053]
 [-1.81103231 -1.309157   -1.19749895 -1.39222524 -1.94383342]]
[[ 3.03851708e+00  8.21951679e+00  4.06264472e+00  4.78126344e+00
   1.22390683e+00]
 [ 1.37790557e+00  2.77987997e+00  2.06351626e+00  1.69998427e+00
   3.99543436e-01]
 [ 2.49040483e-03  6.68883485e-01  6.03680541e-01  2.86913109e-01
  -4.60690822e-01]
 [-9.67083453e-01 -4.40223689e-01 -3.63289019e-01 -5.94796878e-01
  -1.18920887e+00]
 [-1.82711864e+00 -1.32414961e+00 -1.21236164e+00 -1.40722387e+00
  -1.95892241e+00]]
[[ 3.02675272  8.20559029  4.05042831  4.76945062  1.21378173]
 [ 1.36708145  2.76864108  2.0526152   1.68941753  0.38946167]
 [-0.00894133  0.65771024  0.59263864  0.27597791 -0.47150353]
 [-0.9797576  -0.45213435 -0.3749972  -0.60647671 -1.20087509]
 [-1.84058251 -1.33657632 -1.22451078 -1.4193551  -1.97106687]]
[[ 3.01588987  8.1940688   4.03997666  4.75941148  1.20469824]
 [ 1.35710099  2.75883642  2.04319558  1.68031049  0.38066755]
 [-0.01912473  0.6480486   0.5832506   0.26675569 -0.48061496]
 [-0.99060982 -0.46218032 -0.38473137 -0.61609632 -1.21044698]
 [-1.85187901 -1.34690795 -1.23448867 -1.42922657 -1.98090664]]
[[ 3.00616364  8.18450261  4.03109442  4.75092768  1.19672313]
 [ 1.34816967  2.75038121  2.03513326  1.67253394  0.37309467]
 [-0.02804116  0.63976732  0.5753081   0.25900442 -0.48826664]
 [-0.99985982 -0.47063966 -0.39283122 -0.62403758 -1.21832301]
 [-1.8613683  -1.35551603 -1.24271406 -1.43729909 -1.98892296]]
[[ 2.9976249   8.17653977  4.0235814   4.74378671  1.18982372]
 [ 1.34033029  2.74314839  2.02827866  1.66593701  0.36663247]
 [-0.03575576  0.63271374  0.56861226  0.25250514 -0.49467671]
 [-1.0077153  -0.47775348 -0.39957589 -0.63060589 -1.22481893]
 [-1.86934278 -1.36269843 -1.24951468 -1.44392709 -1.99548319]]
[[ 2.99022697  8.16990001  4.01724803  4.73779298  1.1839164 ]
 [ 1.33353873  2.73699701  2.02247872  1.66036709  0.36115383]
 [-0.04237418  0.62673336  0.5629818   0.24706491 -0.5000373 ]
 [-1.01436679 -0.48372924 -0.4051953  -0.63604719 -1.23018699]
 [-1.87604393 -1.36869691 -1.25515015 -1.44938671 -2.00087152]]
[[ 2.98387585  8.16435708  4.01192228  4.73277224  1.17889575]
 [ 1.32770842  2.73178746  2.01758819  1.65568031  0.35653083]
 [-0.04801732  0.62168011  0.55825603  0.24251667 -0.50451478]
 [-1.01998539 -0.48874444 -0.40987929 -0.64056067 -1.23463014]
 [-1.88167329 -1.37370958 -1.25982829 -1.45389562 -2.00531048]]
[[ 2.97845887  8.15972638  4.00745205  4.7285725   1.17465134]
 [ 1.32273592  2.72738938  2.013475    1.65174638  0.3526431 ]
 [-0.052807    0.61742114  0.55429504  0.23871734 -0.50825151]
 [-1.02472228 -0.49295025 -0.41378494 -0.64430851 -1.23831264]
 [-1.88640015 -1.37789986 -1.26371696 -1.4576271  -2.00897616]]
[[ 2.97386051  8.15585603  4.00370501  4.72506302  1.17107698]
 [ 1.31851523  2.72368517  2.01002211  1.64845025  0.34938173]
 [-0.0568584   0.6138386   0.55097845  0.23554546 -0.51136832]
 [-1.02870949 -0.49647503 -0.41704246 -0.64742327 -1.24136809]
 [-1.89036717 -1.38140328 -1.2669527  -1.46072033 -2.01200916]]
[[ 2.96997076  8.15262039  4.00056737  4.72213236  1.16807531]
 [ 1.31494537  2.72057096  2.00712758  1.6456921   0.34665069]
 [-0.06027638  0.61082955  0.54820353  0.23289841 -0.51396714]
 [-1.03206132 -0.49942744 -0.41975997 -0.65001374 -1.24390558]
 [-1.89369476 -1.38433259 -1.26964726 -1.46328786 -2.01452265]]
[[ 2.96668914  8.14991509  3.99794204  4.71968621  1.16555969]
 [ 1.31193404  2.71795641  2.00470367  1.64338627  0.34436664]
 [-0.06315417  0.60830512  0.54588313  0.23068991 -0.51613364]
 [-1.03487598 -0.50189927 -0.42202736 -0.65216945 -1.24601455]
 [-1.89648457 -1.38678183 -1.27189247 -1.4654213  -2.01660826]]
[[ 2.96392617  8.14765316  3.99574664  4.71764509  1.1634545 ]
 [ 1.30939905  2.71576373  2.00267545  1.64145985  0.34245815]
 [-0.06557335  0.6061892   0.54394362  0.22884759 -0.51793955]
 [-1.03723746 -0.50396793 -0.4239194  -0.65396422 -1.24776848]
 [-1.8988224  -1.38882954 -1.27376411 -1.46719552 -2.01834062]]
[[ 2.96160352  8.14576202  3.99391157  4.71594227  1.16169462]
 [ 1.30726841  2.71392639  2.00097932  1.63985115  0.34086448]
 [-0.06760447  0.60441696  0.54232301  0.22731084 -0.51944486]
 [-1.03921726 -0.50569858 -0.42549838 -0.6554591  -1.24922794]
 [-1.90078061 -1.39054139 -1.27532486 -1.46867202 -2.01978077]]
[[ 2.95965343  8.14418102  3.99237819  4.71452182  1.1602245 ]
 [ 1.30547985  2.71238786  1.99956155  1.6385082   0.33953427]
 [-0.06930811  0.60293344  0.54096917  0.22602902 -0.52069964]
 [-1.04087602 -0.50714603 -0.42681618 -0.6567046  -1.25044291]
 [-1.90242019 -1.39197231 -1.27662671 -1.46990142 -2.02097882]]
[[ 2.95801774  8.14285942  3.99109722  4.713337    1.15899706]
 [ 1.3039799   2.71110022  1.99837683  1.63738735  0.33842429]
 [-0.07073593  0.60169217  0.53983842  0.22495984 -0.52174564]
 [-1.04226509 -0.50835632 -0.42791607 -0.65774263 -1.25145475]
 [-1.9037925  -1.39316826 -1.27771281 -1.47092553 -2.02197603]]
[[ 2.95664683  8.14175479  3.99002732  4.71234872  1.1579726 ]
 [ 1.30272298  2.71002303  1.99738712  1.63645199  0.33749823]
 [-0.07193182  0.60065399  0.5388941   0.22406801 -0.52261769]
 [-1.04342779 -0.50936808 -0.42883412 -0.65860794 -1.25229769]
 [-1.90494074 -1.39416772 -1.27861904 -1.47177892 -2.02280645]]
[[ 2.95549857  8.14083161  3.98913382  4.71152437  1.15711776]
 [ 1.30167037  2.7091222   1.99656048  1.63567149  0.3367257 ]
 [-0.07293293  0.59978594  0.53810557  0.22332407 -0.52334476]
 [-1.04440064 -0.51021372 -0.4296004  -0.65932941 -1.25300012]
 [-1.90590121 -1.39500288 -1.27937528 -1.47249026 -2.02349824]]
[[ 2.9545373   8.14006017  3.98838774  4.71083673  1.15640454]
 [ 1.30078931  2.70836908  1.99587012  1.63502024  0.33608129]
 [-0.07377062  0.59906032  0.53744717  0.22270348 -0.52395103]
 [-1.0452144  -0.51092041 -0.43024003 -0.65993106 -1.2535856 ]
 [-1.90670443 -1.39570068 -1.28000641 -1.47308334 -2.02407472]]
[[ 2.95373292  8.13941558  3.98776479  4.71026311  1.15580953]
 [ 1.30005215  2.70773959  1.99529365  1.63447685  0.33554374]
 [-0.07447132  0.59845388  0.53689747  0.22218575 -0.5244566 ]
 [-1.04589488 -0.51151088 -0.43077395 -0.66043286 -1.2540737 ]
 [-1.90737599 -1.39628364 -1.28053315 -1.47357792 -2.02455524]]
[[ 2.95306005  8.13887705  3.98724469  4.70978458  1.15531316]
 [ 1.29943561  2.70721354  1.99481232  1.63402344  0.33509534]
 [-0.07505726  0.59794715  0.53643854  0.22175382 -0.52487826]
 [-1.04646378 -0.5120042  -0.43121963 -0.66085142 -1.25448069]
 [-1.90793737 -1.39677063 -1.2809728  -1.47399041 -2.02495585]]
[[ 2.95249737  8.13842716  3.98681047  4.70938536  1.15489908]
 [ 1.29892008  2.70677401  1.99441045  1.63364512  0.33472129]
 [-0.07554711  0.59752378  0.53605539  0.22139344 -0.52522995]
 [-1.0469393  -0.5124163  -0.43159165 -0.6612006  -1.25482009]
 [-1.90840655 -1.39717741 -1.28133976 -1.47433449 -2.02528992]]
[[ 2.95202695  8.13805136  3.98644797  4.70905228  1.15455364]
 [ 1.29848914  2.70640681  1.99407494  1.63332944  0.33440924]
 [-0.07595654  0.59717011  0.53573554  0.22109275 -0.52552331]
 [-1.0473367  -0.51276052 -0.4319022  -0.66149191 -1.25510318]
 [-1.90879862 -1.39751717 -1.28164607 -1.47462154 -2.02556852]]
[[ 2.95163374  8.13773746  3.98614535  4.70877437  1.15426545]
 [ 1.29812896  2.70610008  1.99379484  1.63306602  0.33414891]
 [-0.07629869  0.59687469  0.53546852  0.22084185 -0.52576804]
 [-1.04766877 -0.51304803 -0.43216143 -0.66173497 -1.25533931]
 [-1.90912622 -1.39780093 -1.28190174 -1.47486102 -2.02580091]]
[[ 2.95130513  8.1374753   3.98589272  4.70854248  1.15402502]
 [ 1.29782798  2.70584388  1.99356101  1.63284621  0.33393172]
 [-0.07658458  0.59662795  0.53524561  0.22063248 -0.52597221]
 [-1.04794621 -0.51328814 -0.43237783 -0.66193778 -1.2555363 ]
 [-1.90939991 -1.39803791 -1.28211516 -1.47506085 -2.02599477]]
[[ 2.95103055  8.13725635  3.98568182  4.70834898  1.15382442]
 [ 1.29757651  2.70562991  1.9933658   1.63266277  0.33375051]
 [-0.07682342  0.59642189  0.53505953  0.22045777 -0.52614255]
 [-1.04817798 -0.51348866 -0.43255847 -0.66210701 -1.25570064]
 [-1.90962853 -1.39823581 -1.28229332 -1.47522759 -2.0261565 ]]
[[ 2.95080114  8.13707351  3.98550578  4.70818752  1.15365704]
 [ 1.29736643  2.70545122  1.99320284  1.63250969  0.33359931]
 [-0.07702294  0.59624981  0.53490419  0.22031197 -0.52628468]
 [-1.04837158 -0.51365611 -0.43270926 -0.66224824 -1.25583776]
 [-1.9098195  -1.39840107 -1.28244203 -1.47536673 -2.02629143]]
[[ 2.9506095   8.13692082  3.98535881  4.70805277  1.15351739]
 [ 1.29719095  2.70530199  1.9930668   1.63238194  0.33347314]
 [-0.0771896   0.59610611  0.53477452  0.2201903  -0.52640328]
 [-1.04853328 -0.51379594 -0.43283513 -0.6623661  -1.25595218]
 [-1.909979   -1.39853906 -1.28256616 -1.47548285 -2.02640403]]
[[ 2.95044942  8.13679332  3.98523614  4.70794032  1.15340085]
 [ 1.29704437  2.70517739  1.99295325  1.63227533  0.33336786]
 [-0.07732879  0.59598611  0.53466627  0.22008875 -0.52650224]
 [-1.04866833 -0.51391269 -0.4329402  -0.66246446 -1.25604765]
 [-1.91011221 -1.39865428 -1.28266978 -1.47557975 -2.02649798]]
[[ 2.95031572  8.13668686  3.98513373  4.70784648  1.15330361]
 [ 1.29692196  2.70507334  1.99285845  1.63218635  0.33328   ]
 [-0.07744504  0.59588592  0.53457591  0.220004   -0.52658483]
 [-1.04878111 -0.51401017 -0.43302791 -0.66254655 -1.25612733]
 [-1.91022346 -1.39875048 -1.28275628 -1.47566063 -2.02657638]]
[[ 2.95020406  8.13659797  3.98504824  4.70776816  1.15322246]
 [ 1.29681972  2.70498646  1.99277932  1.63211208  0.33320669]
 [-0.07754211  0.59580227  0.53450048  0.21993327 -0.52665374]
 [-1.04887529 -0.51409157 -0.43310113 -0.66261506 -1.25619381]
 [-1.91031635 -1.39883081 -1.28282849 -1.47572812 -2.02664181]]
[[ 2.95011081  8.13652375  3.98497688  4.70770279  1.15315474]
 [ 1.29673435  2.70491393  1.99271326  1.6320501   0.3331455 ]
 [-0.07762318  0.59573242  0.53443751  0.21987423 -0.52671126]
 [-1.04895394 -0.51415953 -0.43316225 -0.66267224 -1.2562493 ]
 [-1.91039393 -1.39889787 -1.28288877 -1.47578446 -2.02669641]]
[[ 2.95003293  8.13646178  3.98491731  4.70764823  1.15309822]
 [ 1.29666306  2.70485337  1.99265811  1.63199837  0.33309444]
 [-0.07769087  0.59567411  0.53438495  0.21982496 -0.52675926]
 [-1.04901961 -0.51421627 -0.43321327 -0.66271997 -1.25629561]
 [-1.9104587  -1.39895386 -1.28293908 -1.47583148 -2.02674198]]
[[ 2.94996791  8.13641005  3.98486758  4.7076027   1.15305106]
 [ 1.29660353  2.7048028   1.99261208  1.63195519  0.33305182]
 [-0.07774739  0.59562542  0.53434107  0.21978383 -0.52679933]
 [-1.04907444 -0.51426363 -0.43325586 -0.6627598  -1.25633426]
 [-1.91051278 -1.39900061 -1.28298108 -1.47587073 -2.02678001]]
[[ 2.94991361  8.13636685  3.98482607  4.70756469  1.15301169]
 [ 1.29655383  2.70476059  1.99257366  1.63191915  0.33301625]
 [-0.07779458  0.59558477  0.53430444  0.2197495  -0.52683276]
 [-1.04912022 -0.51430318 -0.43329142 -0.66279305 -1.25636652]
 [-1.91055794 -1.39903963 -1.28301614 -1.47590349 -2.02681176]]
[[ 2.94986828  8.13633079  3.98479142  4.70753297  1.15297884]
 [ 1.29651233  2.70472535  1.99254158  1.63188907  0.33298656]
 [-0.07783398  0.59555084  0.53427386  0.21972085 -0.52686067]
 [-1.04915845 -0.5143362  -0.43332109 -0.66282081 -1.25639345]
 [-1.91059564 -1.39907221 -1.28304541 -1.47593083 -2.02683825]]
[[ 2.94983043  8.13630068  3.9847625   4.70750649  1.15295141]
 [ 1.29647768  2.70469593  1.99251481  1.63186396  0.33296179]
 [-0.07786688  0.59552251  0.53424834  0.21969694 -0.52688396]
 [-1.04919036 -0.51436376 -0.43334587 -0.66284397 -1.25641592]
 [-1.91062712 -1.39909941 -1.28306984 -1.47595365 -2.02686037]]
[[ 2.94979882  8.13627555  3.98473835  4.70748439  1.15292853]
 [ 1.29644875  2.70467136  1.99249245  1.631843    0.3329411 ]
 [-0.07789435  0.59549886  0.53422704  0.21967697 -0.52690341]
 [-1.04921701 -0.51438677 -0.43336655 -0.66286331 -1.25643467]
 [-1.9106534  -1.39912212 -1.28309024 -1.4759727  -2.02687882]]
[[ 2.94977244  8.13625457  3.9847182   4.70746595  1.15290942]
 [ 1.2964246   2.70465086  1.9924738   1.63182551  0.33292384]
 [-0.07791728  0.59547911  0.53420925  0.21966031 -0.52691963]
 [-1.04923925 -0.51440598 -0.43338381 -0.66287945 -1.25645033]
 [-1.91067534 -1.39914108 -1.28310726 -1.4759886  -2.02689423]]
[[ 2.94975041  8.13623705  3.98470137  4.70745055  1.15289348]
 [ 1.29640443  2.70463374  1.99245822  1.63181091  0.33290943]
 [-0.07793642  0.59546263  0.5341944   0.2196464  -0.52693318]
 [-1.04925782 -0.51442202 -0.43339822 -0.66289292 -1.2564634 ]
 [-1.91069365 -1.3991569  -1.28312147 -1.47600188 -2.02690709]]
[[ 2.94973202  8.13622243  3.98468733  4.70743769  1.15288017]
 [ 1.2963876   2.70461945  1.99244522  1.63179872  0.33289741]
 [-0.0779524   0.59544887  0.53418201  0.21963479 -0.52694449]
 [-1.04927333 -0.51443541 -0.43341025 -0.66290417 -1.25647431]
 [-1.91070895 -1.39917011 -1.28313334 -1.47601296 -2.02691782]]
[[ 2.94971666  8.13621022  3.9846756   4.70742697  1.15286906]
 [ 1.29637354  2.70460752  1.99243437  1.63178854  0.33288737]
 [-0.07796575  0.59543738  0.53417167  0.2196251  -0.52695392]
 [-1.04928627 -0.51444658 -0.43342029 -0.66291355 -1.25648341]
 [-1.91072171 -1.39918114 -1.28314324 -1.4760222  -2.02692678]]
[[ 2.94970385  8.13620003  3.98466582  4.70741801  1.15285979]
 [ 1.29636181  2.70459756  1.99242531  1.63178005  0.33287899]
 [-0.07797688  0.5954278   0.53416303  0.21961701 -0.5269618 ]
 [-1.04929708 -0.51445591 -0.43342868 -0.66292139 -1.25649101]
 [-1.91073237 -1.39919035 -1.28315151 -1.47602992 -2.02693426]]
[[ 2.94969314  8.13619152  3.98465765  4.70741054  1.15285205]
 [ 1.29635202  2.70458924  1.99241774  1.63177296  0.33287199]
 [-0.07798618  0.59541979  0.53415582  0.21961026 -0.52696838]
 [-1.0493061  -0.5144637  -0.43343567 -0.66292793 -1.25649736]
 [-1.91074127 -1.39919803 -1.28315841 -1.47603637 -2.02694051]]
[[ 2.94968421  8.13618442  3.98465083  4.7074043   1.15284559]
 [ 1.29634384  2.7045823   1.99241143  1.63176705  0.33286616]
 [-0.07799395  0.59541311  0.5341498   0.21960462 -0.52697386]
 [-1.04931363 -0.5144702  -0.43344151 -0.66293339 -1.25650265]
 [-1.91074869 -1.39920445 -1.28316417 -1.47604175 -2.02694572]]
[[ 2.94967675  8.13617849  3.98464513  4.70739909  1.15284019]
 [ 1.29633701  2.70457651  1.99240616  1.63176211  0.33286128]
 [-0.07800043  0.59540753  0.53414478  0.21959992 -0.52697845]
 [-1.04931991 -0.51447563 -0.43344639 -0.66293795 -1.25650707]
 [-1.91075489 -1.39920981 -1.28316897 -1.47604624 -2.02695007]]
[[ 2.94967053  8.13617354  3.98464038  4.70739474  1.15283569]
 [ 1.29633132  2.70457167  1.99240176  1.63175798  0.33285722]
 [-0.07800584  0.59540287  0.53414059  0.21959599 -0.52698227]
 [-1.04932516 -0.51448016 -0.43345046 -0.66294175 -1.25651076]
 [-1.91076007 -1.39921428 -1.28317299 -1.47604998 -2.0269537 ]]
[[ 2.94966533  8.13616941  3.98463642  4.70739111  1.15283193]
 [ 1.29632656  2.70456764  1.99239809  1.63175454  0.33285382]
 [-0.07801035  0.59539898  0.53413709  0.21959271 -0.52698546]
 [-1.04932954 -0.51448394 -0.43345386 -0.66294493 -1.25651384]
 [-1.91076439 -1.39921801 -1.28317634 -1.47605311 -2.02695673]]
[[ 2.94966099  8.13616596  3.9846331   4.70738808  1.1528288 ]
 [ 1.29632259  2.70456426  1.99239502  1.63175167  0.33285099]
 [-0.07801412  0.59539574  0.53413416  0.21958997 -0.52698813]
 [-1.0493332  -0.5144871  -0.4334567  -0.66294758 -1.25651641]
 [-1.910768   -1.39922113 -1.28317914 -1.47605572 -2.02695926]]
[[ 2.94965737  8.13616308  3.98463034  4.70738555  1.15282618]
 [ 1.29631927  2.70456145  1.99239247  1.63174927  0.33284862]
 [-0.07801727  0.59539303  0.53413172  0.21958769 -0.52699035]
 [-1.04933625 -0.51448973 -0.43345906 -0.66294979 -1.25651856]
 [-1.91077101 -1.39922373 -1.28318147 -1.4760579  -2.02696137]]
[[ 2.94965435  8.13616068  3.98462803  4.70738344  1.15282399]
 [ 1.2963165   2.7045591   1.99239033  1.63174727  0.33284664]
 [-0.0780199   0.59539077  0.53412969  0.21958578 -0.52699221]
 [-1.0493388  -0.51449193 -0.43346104 -0.66295164 -1.25652035]
 [-1.91077352 -1.3992259  -1.28318342 -1.47605972 -2.02696313]]
[[ 2.94965182  8.13615867  3.98462611  4.70738168  1.15282217]
 [ 1.29631419  2.70455714  1.99238855  1.6317456   0.332845  ]
 [-0.07802209  0.59538888  0.53412799  0.21958419 -0.52699376]
 [-1.04934093 -0.51449377 -0.43346269 -0.66295318 -1.25652184]
 [-1.91077562 -1.39922771 -1.28318505 -1.47606124 -2.0269646 ]]
[[ 2.94964971  8.136157    3.9846245   4.70738021  1.15282065]
 [ 1.29631227  2.70455551  1.99238706  1.6317442   0.33284362]
 [-0.07802392  0.59538731  0.53412657  0.21958286 -0.52699505]
 [-1.0493427  -0.5144953  -0.43346407 -0.66295447 -1.25652309]
 [-1.91077737 -1.39922922 -1.28318641 -1.47606251 -2.02696583]]
[[ 2.94964796  8.1361556   3.98462316  4.70737898  1.15281938]
 [ 1.29631066  2.70455414  1.99238582  1.63174304  0.33284247]
 [-0.07802545  0.59538599  0.53412539  0.21958175 -0.52699613]
 [-1.04934419 -0.51449658 -0.43346522 -0.66295554 -1.25652413]
 [-1.91077883 -1.39923049 -1.28318754 -1.47606357 -2.02696686]]
[[ 2.94964649  8.13615443  3.98462204  4.70737795  1.15281831]
 [ 1.29630931  2.704553    1.99238478  1.63174207  0.33284151]
 [-0.07802673  0.59538489  0.5341244   0.21958083 -0.52699703]
 [-1.04934542 -0.51449765 -0.43346618 -0.66295644 -1.256525  ]
 [-1.91078006 -1.39923154 -1.28318849 -1.47606445 -2.02696771]]
[[ 2.94964526  8.13615346  3.9846211   4.7073771   1.15281743]
 [ 1.29630819  2.70455205  1.99238391  1.63174125  0.33284071]
 [-0.07802779  0.59538398  0.53412357  0.21958005 -0.52699779]
 [-1.04934646 -0.51449854 -0.43346698 -0.66295719 -1.25652573]
 [-1.91078107 -1.39923242 -1.28318928 -1.47606519 -2.02696843]]
[[ 2.94964424  8.13615264  3.98462032  4.70737638  1.15281669]
 [ 1.29630725  2.70455125  1.99238319  1.63174058  0.33284004]
 [-0.07802868  0.59538321  0.53412288  0.21957941 -0.52699841]
 [-1.04934732 -0.51449929 -0.43346765 -0.66295781 -1.25652634]
 [-1.91078192 -1.39923315 -1.28318994 -1.47606581 -2.02696903]]
[[ 2.94964338  8.13615197  3.98461967  4.70737579  1.15281607]
 [ 1.29630647  2.70455059  1.99238259  1.63174001  0.33283949]
 [-0.07802942  0.59538257  0.53412231  0.21957887 -0.52699894]
 [-1.04934804 -0.51449991 -0.4334682  -0.66295833 -1.25652684]
 [-1.91078264 -1.39923377 -1.28319049 -1.47606632 -2.02696952]]
[[ 2.94964267  8.1361514   3.98461912  4.70737529  1.15281556]
 [ 1.29630582  2.70455003  1.99238208  1.63173954  0.33283902]
 [-0.07803004  0.59538204  0.53412183  0.21957842 -0.52699938]
 [-1.04934864 -0.51450043 -0.43346867 -0.66295877 -1.25652726]
 [-1.91078323 -1.39923428 -1.28319095 -1.47606675 -2.02696994]]
[[ 2.94964208  8.13615093  3.98461867  4.70737487  1.15281513]
 [ 1.29630528  2.70454957  1.99238166  1.63173914  0.33283863]
 [-0.07803056  0.59538159  0.53412143  0.21957805 -0.52699974]
 [-1.04934914 -0.51450086 -0.43346906 -0.66295913 -1.25652762]
 [-1.91078372 -1.39923471 -1.28319133 -1.47606711 -2.02697029]]
[[ 2.94964158  8.13615053  3.98461829  4.70737453  1.15281477]
 [ 1.29630482  2.70454919  1.99238131  1.63173882  0.33283831]
 [-0.07803099  0.59538122  0.53412109  0.21957773 -0.52700005]
 [-1.04934956 -0.51450122 -0.43346938 -0.66295943 -1.25652791]
 [-1.91078414 -1.39923506 -1.28319165 -1.47606741 -2.02697058]]
[[ 2.94964116  8.1361502   3.98461797  4.70737424  1.15281447]
 [ 1.29630444  2.70454886  1.99238102  1.63173854  0.33283804]
 [-0.07803135  0.59538091  0.53412081  0.21957747 -0.5270003 ]
 [-1.04934991 -0.51450153 -0.43346965 -0.66295969 -1.25652816]
 [-1.91078448 -1.39923536 -1.28319192 -1.47606766 -2.02697082]]
[[ 2.94964082  8.13614993  3.98461771  4.707374    1.15281422]
 [ 1.29630412  2.7045486   1.99238077  1.63173831  0.33283781]
 [-0.07803165  0.59538065  0.53412058  0.21957725 -0.52700051]
 [-1.0493502  -0.51450178 -0.43346988 -0.6629599  -1.25652836]
 [-1.91078477 -1.39923561 -1.28319214 -1.47606787 -2.02697102]]
[[ 2.94964053  8.1361497   3.98461749  4.7073738   1.15281401]
 [ 1.29630386  2.70454837  1.99238057  1.63173812  0.33283762]
 [-0.0780319   0.59538044  0.53412039  0.21957707 -0.52700069]
 [-1.04935045 -0.51450199 -0.43347007 -0.66296008 -1.25652853]
 [-1.91078501 -1.39923582 -1.28319233 -1.47606804 -2.02697119]]
[[ 2.94964029  8.1361495   3.98461731  4.70737363  1.15281383]
 [ 1.29630364  2.70454818  1.9923804   1.63173796  0.33283746]
 [-0.07803211  0.59538026  0.53412022  0.21957692 -0.52700084]
 [-1.04935065 -0.51450216 -0.43347023 -0.66296022 -1.25652868]
 [-1.91078521 -1.39923599 -1.28319248 -1.47606818 -2.02697133]]
[[ 2.94964009  8.13614934  3.98461715  4.70737349  1.15281369]
 [ 1.29630345  2.70454803  1.99238026  1.63173783  0.33283733]
 [-0.07803229  0.5953801   0.53412009  0.21957679 -0.52700096]
 [-1.04935082 -0.51450231 -0.43347036 -0.66296035 -1.2565288 ]
 [-1.91078538 -1.39923614 -1.28319261 -1.47606831 -2.02697145]]
[[ 2.94963992  8.13614921  3.98461702  4.70737337  1.15281357]
 [ 1.2963033   2.7045479   1.99238014  1.63173772  0.33283722]
 [-0.07803243  0.59537998  0.53411997  0.21957669 -0.52700107]
 [-1.04935096 -0.51450243 -0.43347047 -0.66296045 -1.2565289 ]
 [-1.91078552 -1.39923626 -1.28319272 -1.47606841 -2.02697154]]
[[ 2.94963978  8.1361491   3.98461692  4.70737327  1.15281347]
 [ 1.29630317  2.70454779  1.99238004  1.63173762  0.33283713]
 [-0.07803255  0.59537987  0.53411988  0.2195766  -0.52700115]
 [-1.04935108 -0.51450253 -0.43347056 -0.66296053 -1.25652898]
 [-1.91078563 -1.39923636 -1.28319281 -1.47606849 -2.02697163]]
[[ 2.94963966  8.13614901  3.98461683  4.70737319  1.15281338]
 [ 1.29630307  2.7045477   1.99237996  1.63173755  0.33283705]
 [-0.07803266  0.59537979  0.5341198   0.21957652 -0.52700122]
 [-1.04935118 -0.51450262 -0.43347064 -0.66296061 -1.25652905]
 [-1.91078573 -1.39923644 -1.28319289 -1.47606856 -2.02697169]]
[[ 2.94963956  8.13614893  3.98461675  4.70737312  1.15281331]
 [ 1.29630298  2.70454762  1.99237989  1.63173748  0.33283699]
 [-0.07803274  0.59537971  0.53411973  0.21957646 -0.52700128]
 [-1.04935126 -0.51450269 -0.4334707  -0.66296067 -1.25652911]
 [-1.91078581 -1.39923651 -1.28319295 -1.47606862 -2.02697175]]
[[ 2.94963948  8.13614886  3.98461669  4.70737306  1.15281325]
 [ 1.2963029   2.70454756  1.99237983  1.63173743  0.33283694]
 [-0.07803281  0.59537965  0.53411968  0.21957641 -0.52700133]
 [-1.04935133 -0.51450275 -0.43347075 -0.66296072 -1.25652915]
 [-1.91078588 -1.39923657 -1.283193   -1.47606867 -2.0269718 ]]
[[ 2.94963941  8.13614881  3.98461664  4.70737302  1.1528132 ]
 [ 1.29630284  2.7045475   1.99237978  1.63173738  0.33283689]
 [-0.07803287  0.5953796   0.53411963  0.21957637 -0.52700138]
 [-1.04935139 -0.5145028  -0.4334708  -0.66296076 -1.25652919]
 [-1.91078594 -1.39923662 -1.28319304 -1.47606871 -2.02697184]]
[[ 2.94963936  8.13614876  3.9846166   4.70737298  1.15281316]
 [ 1.29630279  2.70454746  1.99237974  1.63173734  0.33283685]
 [-0.07803292  0.59537956  0.5341196   0.21957633 -0.52700141]
 [-1.04935143 -0.51450284 -0.43347084 -0.66296079 -1.25652923]
 [-1.91078598 -1.39923666 -1.28319308 -1.47606874 -2.02697187]]
[[ 2.94963931  8.13614873  3.98461656  4.70737294  1.15281313]
 [ 1.29630274  2.70454742  1.99237971  1.63173731  0.33283682]
 [-0.07803296  0.59537952  0.53411956  0.2195763  -0.52700144]
 [-1.04935147 -0.51450288 -0.43347087 -0.66296082 -1.25652926]
 [-1.91078602 -1.39923669 -1.28319311 -1.47606877 -2.0269719 ]]
[[ 2.94963927  8.1361487   3.98461653  4.70737292  1.1528131 ]
 [ 1.29630271  2.70454739  1.99237968  1.63173729  0.3328368 ]
 [-0.078033    0.59537949  0.53411954  0.21957628 -0.52700147]
 [-1.04935151 -0.5145029  -0.43347089 -0.66296085 -1.25652928]
 [-1.91078606 -1.39923672 -1.28319314 -1.4760688  -2.02697192]]
[[ 2.94963924  8.13614867  3.9846165   4.70737289  1.15281307]
 [ 1.29630268  2.70454737  1.99237966  1.63173727  0.33283678]
 [-0.07803303  0.59537947  0.53411951  0.21957626 -0.52700149]
 [-1.04935154 -0.51450293 -0.43347091 -0.66296087 -1.2565293 ]
 [-1.91078608 -1.39923675 -1.28319316 -1.47606882 -2.02697194]]
[[ 2.94963921  8.13614865  3.98461648  4.70737287  1.15281305]
 [ 1.29630265  2.70454735  1.99237964  1.63173725  0.33283676]
 [-0.07803305  0.59537945  0.5341195   0.21957624 -0.5270015 ]
 [-1.04935156 -0.51450295 -0.43347093 -0.66296088 -1.25652932]
 [-1.91078611 -1.39923677 -1.28319318 -1.47606883 -2.02697196]]
[[ 2.94963919  8.13614863  3.98461646  4.70737286  1.15281304]
 [ 1.29630263  2.70454733  1.99237962  1.63173723  0.33283674]
 [-0.07803307  0.59537943  0.53411948  0.21957622 -0.52700152]
 [-1.04935158 -0.51450297 -0.43347095 -0.6629609  -1.25652933]
 [-1.91078613 -1.39923678 -1.28319319 -1.47606885 -2.02697197]]
[[ 2.94963917  8.13614861  3.98461645  4.70737284  1.15281302]
 [ 1.29630261  2.70454731  1.99237961  1.63173722  0.33283673]
 [-0.07803309  0.59537942  0.53411947  0.21957621 -0.52700153]
 [-1.04935159 -0.51450298 -0.43347096 -0.66296091 -1.25652934]
 [-1.91078614 -1.3992368  -1.2831932  -1.47606886 -2.02697198]]
[[ 2.94963915  8.1361486   3.98461644  4.70737283  1.15281301]
 [ 1.2963026   2.7045473   1.9923796   1.63173721  0.33283672]
 [-0.0780331   0.5953794   0.53411946  0.2195762  -0.52700154]
 [-1.04935161 -0.51450299 -0.43347097 -0.66296092 -1.25652935]
 [-1.91078615 -1.39923681 -1.28319321 -1.47606887 -2.02697199]]
[[ 2.94963914  8.13614859  3.98461643  4.70737282  1.152813  ]
 [ 1.29630259  2.70454729  1.99237959  1.6317372   0.33283671]
 [-0.07803311  0.59537939  0.53411945  0.21957619 -0.52700155]
 [-1.04935162 -0.514503   -0.43347098 -0.66296093 -1.25652936]
 [-1.91078617 -1.39923682 -1.28319322 -1.47606888 -2.026972  ]]
[[ 2.94963913  8.13614858  3.98461642  4.70737282  1.15281299]
 [ 1.29630258  2.70454728  1.99237958  1.63173719  0.3328367 ]
 [-0.07803312  0.59537939  0.53411944  0.21957619 -0.52700155]
 [-1.04935163 -0.51450301 -0.43347099 -0.66296093 -1.25652936]
 [-1.91078618 -1.39923683 -1.28319323 -1.47606888 -2.02697201]]
[[ 2.94963912  8.13614857  3.98461641  4.70737281  1.15281299]
 [ 1.29630257  2.70454727  1.99237957  1.63173719  0.3328367 ]
 [-0.07803313  0.59537938  0.53411943  0.21957618 -0.52700156]
 [-1.04935164 -0.51450302 -0.43347099 -0.66296094 -1.25652937]
 [-1.91078618 -1.39923683 -1.28319324 -1.47606889 -2.02697201]]
[[ 2.94963911  8.13614857  3.98461641  4.7073728   1.15281298]
 [ 1.29630256  2.70454727  1.99237957  1.63173718  0.33283669]
 [-0.07803314  0.59537937  0.53411943  0.21957618 -0.52700156]
 [-1.04935164 -0.51450302 -0.433471   -0.66296094 -1.25652937]
 [-1.91078619 -1.39923684 -1.28319324 -1.47606889 -2.02697202]]
[[ 2.9496391   8.13614856  3.9846164   4.7073728   1.15281298]
 [ 1.29630255  2.70454726  1.99237956  1.63173718  0.33283669]
 [-0.07803314  0.59537937  0.53411942  0.21957617 -0.52700157]
 [-1.04935165 -0.51450303 -0.433471   -0.66296095 -1.25652938]
 [-1.91078619 -1.39923684 -1.28319325 -1.4760689  -2.02697202]]
[[ 2.9496391   8.13614856  3.9846164   4.7073728   1.15281297]
 [ 1.29630255  2.70454726  1.99237956  1.63173717  0.33283669]
 [-0.07803315  0.59537936  0.53411942  0.21957617 -0.52700157]
 [-1.04935165 -0.51450303 -0.43347101 -0.66296095 -1.25652938]
 [-1.9107862  -1.39923685 -1.28319325 -1.4760689  -2.02697202]]
[[ 2.94963909  8.13614855  3.98461639  4.70737279  1.15281297]
 [ 1.29630254  2.70454725  1.99237955  1.63173717  0.33283668]
 [-0.07803315  0.59537936  0.53411942  0.21957616 -0.52700157]
 [-1.04935166 -0.51450303 -0.43347101 -0.66296095 -1.25652938]
 [-1.9107862  -1.39923685 -1.28319325 -1.4760689  -2.02697203]]
[[ 2.94963909  8.13614855  3.98461639  4.70737279  1.15281297]
 [ 1.29630254  2.70454725  1.99237955  1.63173717  0.33283668]
 [-0.07803315  0.59537936  0.53411941  0.21957616 -0.52700158]
 [-1.04935166 -0.51450304 -0.43347101 -0.66296096 -1.25652939]
 [-1.91078621 -1.39923685 -1.28319325 -1.47606891 -2.02697203]]
[[ 2.94963909  8.13614855  3.98461639  4.70737279  1.15281297]
 [ 1.29630254  2.70454725  1.99237955  1.63173717  0.33283668]
 [-0.07803316  0.59537936  0.53411941  0.21957616 -0.52700158]
 [-1.04935166 -0.51450304 -0.43347101 -0.66296096 -1.25652939]
 [-1.91078621 -1.39923685 -1.28319326 -1.47606891 -2.02697203]]

[Figure 6] Greedy policy: move to the next state $s'$ with the largest return $G_t$, i.e. the largest $v$.

  • Policy improvement

    • Using the current value function, look for a better policy and so approach the optimal policy step by step
      • From the value function $v_\pi$ of $\pi$, derive an improved policy $\pi'$
    • How to improve
      • Check whether $q_\pi(s,a)$ is greater than $v_\pi(s)$ (a special case of the theorem below)
    • Theorem
      • If $q_\pi(s,\pi'(s)) \geq v_\pi(s)$ for every state $s$, then $\pi'$ is at least as good as $\pi$: $v_{\pi'}(s) \geq v_\pi(s)$
        [Figure 7]
  • Repeating evaluation and improvement in a loop gives policy iteration

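  • Policy evaluation
    [Figure 8]

  • Policy iteration = policy evaluation + policy improvement

    • Based on the Bellman equation
  • Value iteration = inexact evaluation (a single evaluation sweep) + policy improvement

    • Based on the Bellman optimality equation
  • Can the policy be improved while the evaluation is still inexact? Exact evaluation takes a long time.

    • Yes: that is value iteration
  • Policy iteration
    [Figure 9]

  • Value iteration
    [Figure 10]

  • Comparison
    [Figure 11]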

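  • Dynamic programming

    • A bootstrapping method (seemingly something from nothing: estimates are updated from other estimates)
    • Turns the Bellman equations into update rules
    • Advantage: computationally efficient
    • Drawback: requires a complete model of the environment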
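
As a rough illustration of the two dynamic-programming loops described above, the sketch below runs policy iteration and value iteration on a tiny made-up MDP. The three-state chain, the action names `stay` and `go`, the discount `GAMMA = 0.9`, and all function names are assumptions chosen for illustration; they do not come from the course material.

```python
# Hypothetical 3-state chain MDP (an assumption for illustration).
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {"stay": [(1.0, 2, 0.0)], "go": [(1.0, 2, 0.0)]},  # absorbing, zero reward
}


def q_value(v, s, a):
    """One-step lookahead: expected reward plus discounted value of s'."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: repeat the Bellman expectation backup until stable."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = q_value(v, s, policy[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def improve(v):
    """Policy improvement: act greedily with respect to v."""
    return {s: max(P[s], key=lambda a: q_value(v, s, a)) for s in P}


def policy_iteration():
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    policy = {s: "stay" for s in P}
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:
            return policy, v
        policy = new_policy


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup directly, then read off a greedy policy."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new = max(q_value(v, s, a) for a in P[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return improve(v), v
```

On this chain both procedures settle on `go` in states 0 and 1, with values 0.9 and 1.0. Note that both functions read the transition table `P` directly, which is exactly the "complete model" requirement listed as the drawback.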

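Monte Carlo methods: when the complete environment model is unknown

  • Compute state (or action) value functions from real or simulated experience

  • No complete model is needed

  • Sampling
    [Figure 12]

  • Later returns to a state already visited in the same episode are not counted (first-visit)

  • Policy iteration based on Monte Carlo methods

    • Without a model, state values alone do not yield a policy
    • Monte Carlo estimates $q_\pi(s,a)$, and acting greedily on it gives the policy
  • Advantage: the value of each state is estimated independently (no reliance on bootstrapping)

    • Suitable when the model is unknown or the environment model is complex
    • Convergence follows from the law of large numbers
  • Drawback: some state-action pairs never appear in the Monte Carlo simulations

    • Remedy: exploring starts. Every state-action pair has a nonzero probability of being the starting point of a simulation (like starting from an endgame position)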
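
The first-visit averaging of $q_\pi(s,a)$ described above can be sketched as follows. This is a minimal sketch under stated assumptions: episodes are already recorded as lists of `(state, action, reward)` steps, the reward is the one received after taking the action, and the function names and the `gamma` parameter are hypothetical.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Average first-visit returns for each (state, action) pair.

    Each episode is a list of (state, action, reward) steps; the episode
    format is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backwards so the return G accumulates incrementally.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            # Earlier steps are processed last, so overwriting leaves the
            # return that follows the FIRST visit of each pair.
            first_return[(state, action)] = g
        for pair, g_first in first_return.items():
            totals[pair] += g_first
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def greedy_policy(q):
    """Pick, for every state seen in q, the action with the largest estimate."""
    best = {}
    for (state, action), value in q.items():
        if state not in best or value > q[(state, best[state])]:
            best[state] = action
    return best
```

No transition model appears anywhere in this code; only sampled episodes are used. A pair that never shows up in `episodes` gets no estimate at all, which is the drawback that exploring starts is meant to address.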

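[Figure 13]

  • Doing without exploring starts
  • Other approaches: balance exploitation and exploration
    • on-policy
      • Explore in every state, e.g. with $\epsilon$-greedy
        • Take the greedy action with probability $1-\epsilon+\frac{\epsilon}{|A(s)|}$; take each non-greedy action with probability $\frac{\epsilon}{|A(s)|}$
      • Drawback: the resulting policy is only optimal among $\epsilon$-soft policies (a small gap from the true optimum remains)
    • off-policy
      • Uses two policies:
        • Target policy $\pi$
          • The policy being optimized
          • Greedy
        • Behavior policy $b$
          • Guarantees that every action can be explored in every state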
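
The $\epsilon$-greedy probabilities used by the on-policy approach can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and actions are assumed to be indexed `0..n-1` with their current estimates held in a list.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return the selection probability of every action.

    The greedy action gets 1 - epsilon + epsilon/n, every other action
    gets epsilon/n, where n is the number of actions.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw one action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs)[0]
```

Because every action keeps probability at least `epsilon / n`, each state-action pair continues to be visited, which is what lets this scheme replace exploring starts.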

2.1 On-policy Monte Carlo

[Figure 14]

2.2 Off-policy Monte Carlo

[Figure 15]
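
One common way to realise off-policy Monte Carlo is weighted importance sampling, sketched below. The notes do not say which estimator the lecture used, so treat this choice as an assumption; the episode format `(state, action, reward)` and the callables `target_prob` and `behavior_prob` are likewise hypothetical names.

```python
from collections import defaultdict


def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate q for the target policy from episodes generated by the behavior policy.

    target_prob(s, a) and behavior_prob(s, a) return the probability of
    choosing a in s under each policy (hypothetical callables).
    """
    q = defaultdict(float)
    c = defaultdict(float)  # cumulative importance weights per (state, action)
    for episode in episodes:
        g, w = 0.0, 1.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            c[(state, action)] += w
            # Weighted average of returns, updated incrementally.
            q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
            # Extend the importance ratio by this step's action.
            w *= target_prob(state, action) / behavior_prob(state, action)
            if w == 0.0:
                # The target policy would never take this action, so earlier
                # steps of the episode carry no information about it.
                break
    return dict(q)
```

The early `break` shows why the behavior policy must cover every action the target policy might take: only the tail of an episode that agrees with the target policy contributes to the estimate.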

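Temporal-difference methods

  • Does Monte Carlo have to simulate all the way to the end of an episode?
  • Nonstationary simulation
    [Figure 16]
    [Figure 17]
  • Temporal-difference (TD) methods are the core policy-learning methods in reinforcement learning
  • TD versus Monte Carlo
    • In common: both learn from experience
    • Monte Carlo in the nonstationary setting is a special case of TD
    • Difference: Monte Carlo needs the complete episode, while TD needs only part of it
    • TD is probably faster than Monte Carlo
  • TD versus dynamic programming
    • In common: both bootstrap
    • Difference: dynamic programming estimates from a complete environment model, while TD estimates from experience
  • Learning a guess from a guess
    • What keeps the learning on track: the target looks one real step further ahead
  • Converges
  • Learns a policy online from experience
  • Learns the action-value function directly to obtain the policy
  • Suited to problems with small state and action spaces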
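
To make "learning a guess from a guess" concrete, the sketch below shows two tabular TD updates: TD(0) for state values and Q-learning for action values. Q-learning is used here only as one well-known TD control rule; the notes do not state which algorithm the lecture figure shows. All function and parameter names are hypothetical.

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """TD(0): move V(s) toward the one-step target r + gamma * V(s')."""
    target = r if terminal else r + gamma * v.get(s_next, 0.0)
    old = v.get(s, 0.0)
    v[s] = old + alpha * (target - old)
    return v[s]


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, terminal=False):
    """Q-learning: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    if terminal:
        target = r
    else:
        target = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]
```

Each update uses a single observed transition `(s, a, r, s_next)`, so learning can proceed online without waiting for the episode to finish and without a transition model. The tables `v` and `q` are plain dictionaries, which is why this form fits small state and action spaces.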

[Figure 18]
