Chapter 2: Markov Decision Processes

2.1 Markov Decision Processes (Part 1)

Markov Decision Process (MDP)

[Figure 1]

  1. Markov Decision Processes can model many real-world problems. They formally describe the framework of reinforcement learning
  2. Under MDP, the environment is fully observable.
    1. Optimal control primarily deals with continuous MDPs
    2. Partially observable problems can be converted into MDPs

Markov Property

  1. The history of states: $h_{t}=\{ s_{1},s_{2},s_{3},...,s_{t} \}$

  2. State $s_{t}$ is Markovian if and only if:

    $p(s_{t+1}|s_{t})=p(s_{t+1}|h_{t})$

    $p(s_{t+1}|s_{t},a_{t})=p(s_{t+1}|h_{t},a_{t})$

  3. “The future is independent of the past given the present”

Markov Process/Markov Chain

[Figure 2]

  1. The state transition matrix $P$ specifies $p(s_{t+1}=s'|s_{t}=s)$:
    $P=\begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \cdots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \cdots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots \\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \cdots & P(s_{N}|s_{N}) \end{bmatrix}$

Example of MP

[Figure 3]

  1. Sample episodes starting from $s_{3}$ (a small sampling sketch follows after this list):
    1. $s_{3},s_{4},s_{5},s_{6},s_{6}$
    2. $s_{3},s_{2},s_{3},s_{2},s_{1}$
    3. $s_{3},s_{4},s_{4},s_{5},s_{5}$
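To make the sampling concrete, here is a minimal Python sketch that draws episodes from a Markov chain given its transition matrix. The 7-state matrix below is an illustrative assumption, not the exact numbers from the figure.

```python
import numpy as np

np.random.seed(0)

n_states = 7
# Assumed 7-state transition matrix (each row sums to 1)
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
])

def sample_episode(P, start, length):
    """Sample a state trajectory of the given length from the chain."""
    states = [start]
    for _ in range(length - 1):
        states.append(np.random.choice(len(P), p=P[states[-1]]))
    return states

# Three sample episodes starting from s3 (index 2 with 0-based indexing)
for _ in range(3):
    print([f"s{s + 1}" for s in sample_episode(P, start=2, length=5)])
```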

Markov Reward Process (MRP)

  1. A Markov Reward Process is a Markov chain plus a reward function
  2. Definition of Markov Reward Process (MRP)
    1. $S$ is a (finite) set of states ($s \in S$)
    2. $P$ is the dynamics/transition model that specifies $P(s_{t+1}=s'|s_{t}=s)$
    3. $R$ is a reward function $R(s_{t}=s)=E[r_{t}|s_{t}=s]$
    4. Discount factor $\gamma \in [0,1]$
  3. If there is a finite number of states, $R$ can be represented as a vector

Example of MRP

[Figure 4]

Reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$.

Return and Value function

  1. Definition of Horizon

    1. Maximum number of time steps in each episode
    2. Can be infinite; otherwise the process is called a finite Markov (reward) process
  2. Definition of Return

    1. Discounted sum of rewards from time step $t$ to the horizon
      $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}$
  3. Definition of the state value function $V_{t}(s)$ for an MRP

    1. Expected return from $t$ in state $s$
      $V_{t}(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}|s_{t}=s]$

    2. Present value of future rewards

Why Discount Factor γ

  1. Avoid infinite returns in cyclic Markov processes
  2. Uncertainty about the future may not be fully represented
  3. If the reward is financial, immediate rewards may earn more interest than delayed rewards
  4. Animal/human behaviour shows preference for immediate reward
  5. It is sometimes possible to use undiscounted Markov reward processes (i.e. $\gamma = 1$), e.g. if all sequences terminate.
    1. $\gamma = 0$: only care about the immediate reward
    2. $\gamma = 1$: future rewards are weighted the same as the immediate reward

Example of MRP

[Figure 5]

  1. Reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$
  2. Sample returns $G$ for 4-step episodes with $\gamma = 1/2$
    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0$
  3. How do we compute the value function, e.g., the value of state $s_{4}$, $V(s_{4})$?

Compute the Value of a Markov Reward Process

  1. Value function: expected return from starting in state $s$
    $V(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}|s_{t}=s]$

  2. The MRP value function satisfies the following Bellman equation:
    $V(s)=\underbrace{R(s)}_{\text{Immediate reward}}+\underbrace{\gamma \sum_{s'\in S}P(s'|s)V(s')}_{\text{Discounted sum of future rewards}}$

  3. Practice: derive the Bellman equation for $V(s)$

    1. Hint: $V(s)=E[R_{t+1}+\gamma E[R_{t+2}+\gamma R_{t+3}+...]|s_{t}=s]$

Understanding Bellman equation

  1. The Bellman equation describes the recursive relation between the value of a state and the values of its successor states
    $V(s)=R(s)+\gamma \sum_{s'\in S}P(s'|s)V(s')$

[Figure 6]

Matrix Form of Bellman Equation for MRP

Therefore, we can express the value function in matrix form:

$\begin{bmatrix} V(s_{1})\\ V(s_{2})\\ \vdots \\ V(s_{N}) \end{bmatrix}=\begin{bmatrix} R(s_{1})\\ R(s_{2})\\ \vdots \\ R(s_{N}) \end{bmatrix}+\gamma \begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \cdots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \cdots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots \\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \cdots & P(s_{N}|s_{N}) \end{bmatrix}\begin{bmatrix} V(s_{1})\\ V(s_{2})\\ \vdots \\ V(s_{N}) \end{bmatrix}$

i.e., $V = R + \gamma P V$.

  1. Analytic solution for the value of an MRP: $V=(I-\gamma P)^{-1}R$ (a NumPy sketch follows below)
    1. The matrix inverse takes $O(N^{3})$ time for $N$ states
    2. Only feasible for small MRPs
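A minimal NumPy sketch of the analytic solution, reusing the reward vector $R = [5, 0, 0, 0, 0, 0, 10]$ from the example; the transition matrix `P` below is an assumed stand-in for the one in the figure.

```python
import numpy as np

# Assumed 7-state transition matrix (same illustrative chain as in the sampling sketch)
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
])
R = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])  # reward vector from the example
gamma = 0.5

# Solve (I - γP) V = R directly instead of forming the inverse explicitly;
# still O(N^3), but numerically more stable than computing the inverse.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(np.round(V, 3))
```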

Iterative Algorithm for Computing the Value of an MRP

  1. Iterative methods for large MRPs:
    1. Dynamic Programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning

Monte Carlo Algorithm for Computing the Value of an MRP

Algorithm 1 Monte Carlo simulation to calculate MRP value function

  1. $i\leftarrow 0,\; G_{t}\leftarrow 0$

  2. while $i \neq N$ do

  3. generate an episode, starting from state $s$ and time $t$

  4. using the generated episode, calculate the return $g=\sum_{i=t}^{H-1}\gamma^{i-t}r_{i}$

  5. $G_{t}\leftarrow G_{t}+g,\; i\leftarrow i+1$

  6. end while

  7. $V_{t}(s)\leftarrow G_{t}/N$

  8. For example: to calculate $V(s_{4})$ we can generate many trajectories and then take the average of the returns (a code sketch follows after this list):

    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0$
    4. more trajectories
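A minimal Python sketch of the Monte Carlo estimate above; it assumes `P` and `R` are the NumPy arrays from the earlier analytic-solution sketch, and the episode length and number of episodes are illustrative choices.

```python
import numpy as np

def mc_value(P, R, start, gamma=0.5, horizon=4, n_episodes=10000, seed=0):
    """Estimate V(start) by averaging the returns of sampled episodes."""
    rng = np.random.default_rng(seed)
    total_return = 0.0
    for _ in range(n_episodes):
        s, g = start, 0.0
        for step in range(horizon):
            g += (gamma ** step) * R[s]      # accumulate discounted reward of visited state
            s = rng.choice(len(P), p=P[s])   # transition to the next state
        total_return += g
    return total_return / n_episodes         # V_t(s) ≈ G_t / N

# Estimate V(s4) (index 3) as the average return over many sampled trajectories:
# print(mc_value(P, R, start=3))
```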

Iterative Algorithm for Computing the Value of an MRP

Algorithm 1 Iterative Algorithm to calculate MRP value function

  1. for all states $s \in S$: $V'(s)\leftarrow 0$, $V(s)\leftarrow \infty$
  2. while $||V-V'|| > \epsilon$ do
  3. $V \leftarrow V'$
  4. for all states $s \in S$: $V'(s)=R(s)+\gamma \sum_{s'\in S}P(s'|s)V(s')$
  5. end while
  6. return $V'(s)$ for all $s \in S$
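A minimal NumPy sketch of this iterative backup; as before, `P` and `R` are assumed to be the arrays from the analytic-solution sketch.

```python
import numpy as np

def mrp_value_iterative(P, R, gamma=0.5, eps=1e-6):
    """Iteratively apply V'(s) = R(s) + γ Σ_{s'} P(s'|s) V(s') until convergence."""
    V_new = np.zeros(len(R))
    V = np.full(len(R), np.inf)
    while np.max(np.abs(V - V_new)) > eps:
        V = V_new.copy()
        V_new = R + gamma * P @ V   # Bellman backup for all states at once
    return V_new

# print(np.round(mrp_value_iterative(P, R), 3))
```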

Markov Decision Process (MDP)

  1. Markov Decision Process is Markov Reward Process with decisions.

  2. Definition of MDP

    1. $S$ is a finite set of states

    2. $A$ is a finite set of actions

    3. $P^{a}$ is the dynamics/transition model for each action

      $P(s_{t+1}=s'|s_{t}=s,a_{t}=a)$

    4. $R$ is a reward function $R(s_{t}=s,a_{t}=a)=E[r_{t}|s_{t}=s,a_{t}=a]$

    5. Discount factor $\gamma \in [0,1]$

  3. An MDP is a tuple: $(S,A,P,R,\gamma)$

Policy in MDP

  1. A policy specifies what action to take in each state

  2. Given a state, the policy specifies a distribution over actions

  3. Policy: $\pi(a|s)=P(a_{t}=a|s_{t}=s)$

  4. Policies are stationary (time-independent), $A_{t}\sim \pi(a|s)$ for any $t > 0$

  5. Given an MDP $(S,A,P,R,\gamma)$ and a policy $\pi$

  6. The state sequence $S_{1},S_{2},...$ is a Markov process $(S,P^{\pi})$

  7. The state and reward sequence $S_{1},R_{1},S_{2},R_{2},...$ is a Markov reward process $(S,P^{\pi},R^{\pi},\gamma)$ where (see the sketch below)
    $P^{\pi}(s'|s)=\sum_{a\in A}\pi(a|s)P(s'|s,a)$

$R^{\pi}(s)=\sum_{a\in A}\pi(a|s)R(s,a)$
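A minimal sketch of how a policy induces an MRP from an MDP. The array shapes — `P_a` of shape (|A|, |S|, |S|), `R_a` of shape (|S|, |A|), and `pi` of shape (|S|, |A|) — are assumed conventions for this sketch, not something fixed by the lecture.

```python
import numpy as np

def induced_mrp(P_a, R_a, pi):
    """Return (P^π, R^π) of the MRP induced by policy pi on the MDP (P_a, R_a)."""
    # P^π(s'|s) = Σ_a π(a|s) P(s'|s,a)
    P_pi = np.einsum("sa,ast->st", pi, P_a)
    # R^π(s) = Σ_a π(a|s) R(s,a)
    R_pi = np.einsum("sa,sa->s", pi, R_a)
    return P_pi, R_pi
```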

Comparison of MP/MRP and MDP

[Figure 8]

Value function for MDP

  1. The state-value function $v^{\pi}(s)$ of an MDP is the expected return starting from state $s$ and following policy $\pi$
    $v^{\pi}(s)=E_{\pi}[G_{t}|s_{t}=s]$

  2. The action-value function $q^{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
    $q^{\pi}(s,a)=E_{\pi}[G_{t}|s_{t}=s,A_{t}=a]$

  3. We have the following relation between $v^{\pi}(s)$ and $q^{\pi}(s,a)$:
    $v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

Bellman Expectation Equation

  1. The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,
    $v^{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v^{\pi}(s_{t+1})|s_{t}=s]$

  2. The action-value function can similarly be decomposed,
    $q^{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q^{\pi}(s_{t+1},A_{t+1})|s_{t}=s,A_{t}=a]$

Bellman Expectation Equation for $V^{\pi}$ and $Q^{\pi}$

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')$

Thus
$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Backup Diagram for $V^{\pi}$

[Figure 9]

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

Backup Diagram for $Q^{\pi}$

[Figure 10]

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Policy Evaluation

  1. Evaluate the value of each state under a given policy $\pi$: compute $v^{\pi}(s)$
  2. Also called (value) prediction

Example: Navigate the boat

[Figure 11]

Example: Policy Evaluation

[Figure 12]

  1. Two actions: Left and Right

  2. For all actions, reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$

  3. Let's take a deterministic policy $\pi(s)=$ Left and $\gamma=0$ for every state $s$; then what is the value of the policy?

    1. $v^{\pi}= [5, 0, 0, 0, 0, 0, 10]$
  4. Iteration: $v_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$

  5. $R = [5, 0, 0, 0, 0, 0, 10]$

  6. Practice 1: deterministic policy $\pi(s)=$ Left and $\gamma=0.5$ for every state $s$; what are the state values under the policy?

  7. Practice 2: stochastic policy $P(\pi(s)=\text{Left})=0.5$, $P(\pi(s)=\text{Right})=0.5$ and $\gamma=0.5$ for every state $s$; what are the state values under the policy? (A code sketch for both practices follows below.)

  8. Iteration: $v_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$
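A minimal sketch for Practice 1 and Practice 2. The Left/Right dynamics (deterministic one-step moves, staying in place at the two ends) are an assumption about the chain in the figure, so treat the printed numbers as illustrative.

```python
import numpy as np

n = 7
R = np.array([5.0, 0, 0, 0, 0, 0, 10.0])   # reward depends only on the state

# P_a[a, s, s']: assumed transition model for a = 0 (Left) and a = 1 (Right)
P_a = np.zeros((2, n, n))
for s in range(n):
    P_a[0, s, max(s - 1, 0)] = 1.0          # Left
    P_a[1, s, min(s + 1, n - 1)] = 1.0      # Right

def evaluate(pi, gamma=0.5, eps=1e-8):
    """Iterative policy evaluation; pi has shape (|S|, |A|)."""
    P_pi = np.einsum("sa,ast->st", pi, P_a)  # P^π(s'|s)
    v = np.zeros(n)
    while True:
        v_new = R + gamma * P_pi @ v         # v_k(s) = r(s) + γ Σ_{s'} P^π(s'|s) v_{k-1}(s')
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new

pi_left = np.tile([1.0, 0.0], (n, 1))        # deterministic Left policy
pi_rand = np.full((n, 2), 0.5)               # uniform Left/Right policy
print(np.round(evaluate(pi_left), 3))        # Practice 1
print(np.round(evaluate(pi_rand), 3))        # Practice 2
```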

2.2 Markov Decision Processes (Part 2)

Decision Making in Markov Decision Processes (MDP)

  1. Prediction (evaluate a given policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$ and policy $\pi$, or MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$
    2. Output: value function $v^{\pi}$
  2. Control (search for the optimal policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$
    2. Output: optimal value function $v^{*}$ and optimal policy $\pi^{*}$
  3. Prediction and control in an MDP can be solved by dynamic programming.

Dynamic programming

Dynamic programming is a very general solution method for problems which have two properties:

  1. Optimal substructure
    1. Principle of optimality applies
    2. Optimal solution can be decomposed into subproblems
  2. Overlapping subproblems
    1. Subproblems recur many times
    2. Solutions can be cached and reused

Markov decision processes satisfy both properties

  1. Bellman equation gives recursive decomposition
  2. Value function stores and reuses solutions

Policy evaluation on MDP

  1. Objective: evaluate a given policy $\pi$ for an MDP

  2. Output: the value function under the policy, $v^{\pi}$

  3. Solution: iteration on the Bellman expectation backup

  4. Algorithm: synchronous backup

    1. At each iteration $t+1$,

      update $v_{t+1}(s)$ from $v_{t}(s')$ for all states $s\in S$, where $s'$ is a successor state of $s$
      $v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$

  5. Convergence: $v_{1}\rightarrow v_{2}\rightarrow ...\rightarrow v^{\pi}$

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy:
$v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$
Or, in the form of the induced MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$:
$v_{t+1}(s)=R^{\pi}(s)+\gamma\sum_{s'\in S}P^{\pi}(s'|s)v_{t}(s')$

Evaluating a Random Policy in the Small Gridworld

Example 4.1 in the Sutton RL textbook

[Figure 13]

  1. Undiscounted episodic MDP ($\gamma=1$)
  2. Nonterminal states 1, …, 14
  3. Two terminal states (the two shaded squares)
  4. Actions leading out of the grid leave the state unchanged, e.g., $P(7|7,\text{right})=1$
  5. Reward is -1 until a terminal state is reached
  6. Transitions are deterministic given the action, e.g., $P(6|5,\text{right})=1$
  7. Uniform random policy $\pi(l|\cdot)=\pi(r|\cdot)=\pi(u|\cdot)=\pi(d|\cdot)=0.25$ (a code sketch follows after this list)
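A minimal sketch of this evaluation for the 4x4 gridworld of Example 4.1 (uniform random policy, γ = 1, reward -1 per step, two terminal corners); the grid layout follows the textbook description above.

```python
import numpy as np

rows, cols = 4, 4
terminals = {(0, 0), (3, 3)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move; actions leading off the grid leave the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else state

V = np.zeros((rows, cols))
while True:
    V_new = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            if (r, c) in terminals:
                continue
            # Bellman expectation backup under π(a|s) = 0.25 for every action
            V_new[r, c] = sum(0.25 * (-1.0 + V[step((r, c), a)]) for a in actions)
    if np.max(np.abs(V_new - V)) < 1e-4:
        break
    V = V_new
print(np.round(V, 1))  # approaches the values shown in the textbook figure
```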

A live demo on policy evaluation

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

  1. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Optimal Value Function

  1. The optimal state-value function $v^{*}(s)$ is the maximum value function over all policies
    $v^{*}(s)=\max_{\pi}\, v^{\pi}(s)$

  2. The optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}\, v^{\pi}(s)$

  3. An MDP is "solved" when we know the optimal value function

  4. There exists a unique optimal value function, but there can be multiple optimal policies (e.g., when two actions have the same optimal value)

Finding Optimal Policy

  1. An optimal policy can be found by maximizing over $q^{*}(s,a)$,
    $\pi^{*}(a|s)=\begin{cases} 1, & \text{if } a=\arg\max_{a\in A}\, q^{*}(s,a) \\ 0, & \text{otherwise} \end{cases}$

  2. There is always a deterministic optimal policy for any MDP

  3. If we know q ∗ ( s , a ) q^{*}(s,a) q(s,a), we immediately have the optimal policy

Policy Search

  1. One option is to enumerate all deterministic policies and search for the best one
  2. The number of deterministic policies is $|A|^{|S|}$
  3. Other approaches such as policy iteration and value iteration are more efficient

MDP Control

  1. Compute the optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}\, v^{\pi}(s)$

  2. The optimal policy for an MDP in an infinite-horizon problem (agent acts forever) is

    1. Deterministic
    2. Stationary (does not depend on the time step)
    3. Unique? Not necessarily; there may be state-actions with identical optimal values

Improving a Policy through Policy Iteration

  1. Iterate through the two steps:

    1. Evaluate the policy $\pi$ (compute $v^{\pi}$ given the current $\pi$)

    2. Improve the policy by acting greedily with respect to $v^{\pi}$
      $\pi'=\text{greedy}(v^{\pi})$

[Figure 14]

Policy Improvement

  1. Compute the state-action value of a policy $\pi_{i}$:
    $q^{\pi_{i}}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi_{i}}(s')$

  2. Compute the new policy $\pi_{i+1}$ for all $s\in S$ following (see the sketch after the figure below)
    $\pi_{i+1}(s)=\arg\max_{a}\, q^{\pi_{i}}(s,a)$

[Figure 15]
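A minimal sketch of the full policy-iteration loop built from these two steps. The array shapes — `P` of shape (|A|, |S|, |S|) and `R` of shape (|S|, |A|) — follow the same assumed conventions as the earlier sketches, and the evaluation step solves the linear system exactly rather than iterating.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)        # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve v = R^π + γ P^π v for the current policy
        P_pi = P[pi, np.arange(n_states)]     # P^π(s'|s)
        R_pi = R[np.arange(n_states), pi]     # R^π(s)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q^π
        q = R.T + gamma * P @ v               # q[a, s] = R(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
        pi_new = np.argmax(q, axis=0)
        if np.array_equal(pi_new, pi):        # policy stable -> optimal
            return v, pi
        pi = pi_new
```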

Monotonic Improvement in Policy

  1. Consider a deterministic policy $a=\pi(s)$

  2. We improve the policy through
    $\pi'(s)=\arg\max_{a}\, q^{\pi}(s,a)$

  3. This improves the value from any state $s$ over one step,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}\, q^{\pi}(s,a)\geq q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  4. It therefore improves the value function, $v^{\pi'}(s)\geq v^{\pi}(s)$:

    $v^{\pi}(s)\leq q^{\pi}(s,\pi'(s))=E_{\pi'}[R_{t+1}+\gamma v^{\pi}(S_{t+1})|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma q^{\pi}(S_{t+1},\pi'(S_{t+1}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}q^{\pi}(S_{t+2},\pi'(S_{t+2}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+...|S_{t}=s]=v^{\pi'}(s)$

  5. If improvements stop,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}\, q^{\pi}(s,a)=q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  6. then the Bellman optimality equation has been satisfied,

    $v^{\pi}(s)=\max_{a\in A}\, q^{\pi}(s,a)$

  7. Therefore $v^{\pi}(s)=v^{*}(s)$ for all $s\in S$, so $\pi$ is an optimal policy

Bellman Optimality Equation

1. The optimal value functions satisfy the Bellman optimality equations:
$v^{*}(s)=\max_{a}\, q^{*}(s,a)$

$q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')$

thus
$v^{*}(s)=\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

$q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\,\max_{a'}\, q^{*}(s',a')$

Value Iteration: turning the Bellman Optimality Equation into an update rule

1. If we know the solution to the subproblems $v^{*}(s')$, which is optimal,

2. then the optimal $v^{*}(s)$ can be found by iterating over the following Bellman optimality backup rule,
$v(s)\leftarrow\max_{a\in A}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

3. The idea of value iteration is to apply these updates iteratively

Algorithm of Value Iteration

  1. Objective: find the optimal policy π \pi π

  2. Solution: iteration on the Bellman optimality backup

  3. Value iteration algorithm (a code sketch follows after this list):

    1. initialize $k=1$ and $v_{0}(s)=0$ for all states $s$

    2. for $k=1:H$

      1. for each state $s$:
        $q_{k+1}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{k}(s')$

        $v_{k+1}(s)=\max_{a}\, q_{k+1}(s,a)$

      2. $k\leftarrow k+1$

    3. To retrieve the optimal policy after value iteration:
      $\pi(s)=\arg\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{k+1}(s')\right)$
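A minimal NumPy sketch of this algorithm; it stops on a tolerance rather than a fixed horizon `H`, which is an implementation choice, and the array conventions for `P` and `R` are the same assumptions as in the earlier sketches.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup, then extract a policy."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                    # v_0(s) = 0 for all states
    while True:
        q = R.T + gamma * P @ v               # q_{k+1}(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) v_k(s')
        v_new = np.max(q, axis=0)             # v_{k+1}(s) = max_a q_{k+1}(s,a)
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    # Policy extraction: π(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) v(s') ]
    pi = np.argmax(R.T + gamma * P @ v_new, axis=0)
    return v_new, pi
```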

Example: Shortest Path

[Figure 16]

After the optimal values are reached, we run policy extraction to retrieve the optimal policy.

Demo of policy iteration and value iteration

[Figure 17]

1. Policy iteration: iteration of policy evaluation and policy improvement (update)

2. Value iteration

3. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Policy iteration and value iteration on FrozenLake

https://github.com/cuhkrlcourse/RLexample/tree/master/MDP

Difference between Policy Iteration and Value Iteration

1. Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.

2. Value iteration consists of finding the optimal value function + one policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e., it has converged).

3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of $v(s)$ after just one sweep of all states, regardless of convergence).

Summary for Prediction and Control in MDP

[Figure 18]

End

Optional Homework 1 is available at https://github.com/cuhkrlcourse/ierg6130-assignment
