Reinforcement Learning – Shiyu Zhao (2): The Bellman Equation (used to compute the state values of a given policy π: ① solving the linear system, ② iterative method) and Action Values (obtained from state values; used to evaluate how good an action is)

State Value: the average return that an agent can obtain if it follows a given policy $\pi$, i.e., the expectation of the returns over all possible trajectories generated by that policy: $v_\pi(s)\doteq\mathbb{E}[G_t\mid S_t=s]$.
Return: the discounted sum of all the rewards collected along a trajectory: $G_t\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots$

$v_\pi(s)$: the expected return that can be obtained by starting from state $s$ and following policy $\pi$.

  • Starting from a given state $s_i$, a given policy $\pi$ can generate multiple trajectories because, at each state, the action the agent takes next is sampled according to the probabilities that the policy assigns, as illustrated below:
    [figure: a stochastic policy, under which several actions are possible at each state]
    If the action taken at each state is fixed (indicated by a single arrow), then each state has exactly one trajectory under that policy:
    [figure: a deterministic policy, under which each state has a single fixed trajectory]

  • $v_\pi(s)$ depends on $s$: its definition is a conditional expectation with the condition that the agent starts from $S_t=s$.

  • $v_\pi(s)$ depends on $\pi$: the trajectories are generated by following the policy $\pi$. For a different policy, the state value may be different.

  • $v_\pi(s)$ does not depend on $t$: if the agent moves in the state space, $t$ represents the current time step. The value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns can be further clarified as follows:

  • When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained by starting from a state is equal to the value of that state.
  • By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns.
  • The greater the state value is, the better the corresponding policy is.
  • State values can be used as a metric to evaluate whether a policy is good or not.

While state values are important, how can we analyze them?

  • The answer is the Bellman equation, which is an important tool for analyzing state values.
  • In a nutshell, the Bellman equation describes the relationships between the values of all states.
  • By solving the Bellman equation, we can obtain the state values. This process is called policy evaluation, which is a fundamental concept in reinforcement learning.

Finally, this chapter introduces another important concept called the action value.

2.1 Motivating example 1: Why are returns important?

The previous chapter introduced the concept of returns. In fact, returns play a fundamental role in reinforcement learning since they can evaluate whether a policy is good or not. This is demonstrated by the following examples.
[Figure 2.2: three policies that differ at $s_1$]
Consider the three policies shown in Figure 2.2. It can be seen that the three policies are different at s1. Which is the best and which is the worst? Intuitively,

  • The leftmost policy is the best because the agent starting from $s_1$ can avoid the forbidden area.
  • The middle policy is intuitively worse because the agent starting from $s_1$ moves into the forbidden area.
  • The rightmost policy is in between the other two because it has a probability of 0.5 of entering the forbidden area.

While the above analysis is based on intuition, a question that immediately follows is whether we can use mathematics to describe such intuition. The answer is yes, and it relies on the concept of the return.

In particular, suppose that the agent starts from $s_1$.

  • Following the first policy, the trajectory is s1 → s3 → s4 → s4 · · · . The corresponding discounted return is
    $$\text{return}_1 = 0+\gamma\cdot 1+\gamma^2\cdot 1+\ldots = \gamma(1+\gamma+\gamma^2+\ldots) = \frac{\gamma}{1-\gamma},$$
    where $\gamma\in(0,1)$ is the discount rate.
  • Following the second policy, the trajectory is s1 → s2 → s4 → s4 · · · . The discounted return is
    $$\text{return}_2 = -1+\gamma\cdot 1+\gamma^2\cdot 1+\ldots = -1+\gamma(1+\gamma+\gamma^2+\ldots) = -1+\frac{\gamma}{1-\gamma}.$$
  • Following the third policy, two trajectories can possibly be obtained. One is s1 → s3 → s4 → s4 · · · , and the other is s1 → s2 → s4 → s4 · · · . The probability of either of the two trajectories is 0.5. Then, the average return that can be obtained starting from $s_1$ is
    $$\text{return}_3 = 0.5\left(-1+\frac{\gamma}{1-\gamma}\right)+0.5\left(\frac{\gamma}{1-\gamma}\right) = -0.5+\frac{\gamma}{1-\gamma}.$$

By comparing the returns of the three policies, we notice that
$$\text{return}_1 > \text{return}_3 > \text{return}_2$$

for any value of $\gamma\in(0,1)$. This inequality suggests that the first policy is the best because its return is the greatest, and the second policy is the worst because its return is the smallest.

This mathematical conclusion is consistent with the aforementioned intuition: the first policy is the best since it can avoid entering the forbidden area, and the second policy is the worst because it leads to the forbidden area.
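As a quick sanity check of this ordering, the three returns can be evaluated numerically; the minimal Python sketch below plugs in a concrete discount rate.

```python
# Compare the returns of the three policies starting from s1 (Section 2.1).
gamma = 0.9  # any value in (0, 1) gives the same ordering

return_1 = gamma / (1 - gamma)               # trajectory s1 -> s3 -> s4 -> s4 ...
return_2 = -1 + gamma / (1 - gamma)          # trajectory s1 -> s2 -> s4 -> s4 ...
return_3 = 0.5 * return_1 + 0.5 * return_2   # 50/50 mixture of the two trajectories

print(return_1, return_3, return_2)          # 9.0 8.5 8.0
assert return_1 > return_3 > return_2
```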

2.2 Motivating example 2: How to calculate returns?

While we have demonstrated the importance of returns, a question that immediately follows is how to calculate the returns when following a given policy.
[Figure 2.3: an example used to demonstrate how to calculate returns]

There are two ways to calculate returns.

  • The first is simply by definition: a return equals the discounted sum of all the rewards collected along a trajectory. Consider the example in Figure 2.3. Let $v_i$ denote the return obtained by starting from $s_i$ for $i = 1, 2, 3, 4$. Then, the returns obtained when starting from the four states in Figure 2.3 can be calculated as
    $$\begin{aligned} v_1 &= r_1+\gamma r_2+\gamma^2 r_3+\ldots,\\ v_2 &= r_2+\gamma r_3+\gamma^2 r_4+\ldots,\\ v_3 &= r_3+\gamma r_4+\gamma^2 r_1+\ldots,\\ v_4 &= r_4+\gamma r_1+\gamma^2 r_2+\ldots \end{aligned}\tag{2.2}$$
    where $r_i$ denotes the immediate reward collected when leaving $s_i$ under this policy.
  • The second way, which is more important, is based on the idea of bootstrapping. By observing the expressions of the returns in (2.2), we can rewrite them as
    $$\begin{aligned} v_1 &= r_1+\gamma(r_2+\gamma r_3+\ldots)=r_1+\gamma v_2,\\ v_2 &= r_2+\gamma(r_3+\gamma r_4+\ldots)=r_2+\gamma v_3,\\ v_3 &= r_3+\gamma(r_4+\gamma r_1+\ldots)=r_3+\gamma v_4,\\ v_4 &= r_4+\gamma(r_1+\gamma r_2+\ldots)=r_4+\gamma v_1. \end{aligned}\tag{2.3}$$
    The above equations indicate an interesting phenomenon that the values of the returns rely on each other. More specifically, v1 relies on v2, v2 relies on v3, v3 relies on v4, and v4 relies on v1. This reflects the idea of bootstrapping, which is to obtain the values of some quantities from themselves.

At first glance, bootstrapping is an endless loop because the calculation of an unknown value relies on another unknown value. In fact, bootstrapping is easier to understand if we view it from a mathematical perspective. In particular, the equations in (2.3) can be reformed into a linear matrix-vector equation:
$$\underbrace{\begin{bmatrix}v_1\\v_2\\v_3\\v_4\end{bmatrix}}_{v}=\underbrace{\begin{bmatrix}r_1\\r_2\\r_3\\r_4\end{bmatrix}}_{r}+\gamma\underbrace{\begin{bmatrix}0&1&0&0\\0&0&1&0\\0&0&0&1\\1&0&0&0\end{bmatrix}}_{P}\begin{bmatrix}v_1\\v_2\\v_3\\v_4\end{bmatrix},$$
which can be written compactly as
$$v=r+\gamma Pv.$$
Thus, the value of $v$ can be calculated easily as $v=(I-\gamma P)^{-1}r$, where $I$ is the identity matrix with appropriate dimensions. One may ask whether $I-\gamma P$ is always invertible. The answer is yes, as explained in Section 2.7.1.

In fact, (2.3) is the Bellman equation for this simple example. Although it is simple, (2.3) demonstrates the core idea of the Bellman equation: the return obtained by starting from one state depends on those obtained when starting from other states.
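To see the bootstrapping idea in code, the following minimal sketch solves $v=(I-\gamma P)^{-1}r$ for a four-state cycle of the kind described above; the reward vector here is a placeholder, since the concrete rewards come from Figure 2.3, which is not reproduced in this post.

```python
import numpy as np

gamma = 0.9
# Transition structure implied by (2.3): s1 -> s2 -> s3 -> s4 -> s1 -> ...
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
# Placeholder immediate rewards r1..r4 (the actual values are read off Figure 2.3).
r = np.array([0.0, 1.0, 1.0, 1.0])

# Closed-form solution of the bootstrapped system v = r + gamma * P @ v
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # each v_i depends on the others, yet the linear system pins them all down
```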

2.3 State values

We mentioned that returns can be used to evaluate policies. However, they are inapplicable to stochastic systems because starting from one state may lead to different returns.

Motivated by this problem, we introduce the concept of state value in this section.

First, we need to introduce some necessary notations.

Consider a sequence of time steps $t=0,1,2,\ldots$ At time $t$, the agent is at state $S_t$, and the action taken following a policy $\pi$ is $A_t$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as

$$S_t\xrightarrow{A_t}S_{t+1},R_{t+1}$$

Note that $S_t, S_{t+1}, A_t, R_{t+1}$ are all random variables. Moreover, $S_t,S_{t+1}\in\mathcal{S}$, $A_t\in\mathcal{A}(S_t)$, and $R_{t+1}\in\mathcal{R}(S_t,A_t)$.

Starting from $t$, we can obtain a state-action-reward trajectory:

$$S_t\xrightarrow{A_t}S_{t+1},R_{t+1}\xrightarrow{A_{t+1}}S_{t+2},R_{t+2}\xrightarrow{A_{t+2}}S_{t+3},R_{t+3}\ldots$$

By definition, the discounted return along a trajectory (note: under a given policy $\pi$, there may be multiple possible trajectories) is
$$G_t\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots,$$
where $\gamma\in(0,1)$ is the discount rate. Note that $G_t$ is a random variable since $R_{t+1},R_{t+2},\ldots$ are all random variables.

Since $G_t$ is a random variable, we can calculate its expected value (also called the expectation or mean):

$$v_\pi(s)\doteq\mathbb{E}[G_t\mid S_t=s]$$

Here, $v_\pi(s)$ is called the state-value function or simply the state value of $s$.

Some important remarks are given below.

  • $v_\pi(s)$ depends on $s$: its definition is a conditional expectation with the condition that the agent starts from $S_t=s$.
  • $v_\pi(s)$ depends on $\pi$: the trajectories are generated by following the policy $\pi$. For a different policy, the state value may be different.
  • $v_\pi(s)$ does not depend on $t$: if the agent moves in the state space, $t$ represents the current time step. The value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns can be further clarified as follows:

  • When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained by starting from a state is equal to the value of that state.
  • By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns.

2.4 Bellman equation

The Bellman equation is a mathematical tool for analyzing state values. In a nutshell, it is a set of linear equations that describe the relationships between the values of all the states.

Derivation of the Bellman equation:

  • First, note that $G_t$ can be rewritten as
    $$\begin{aligned} G_{t}&=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots \\ &=R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\ldots) \\ &=R_{t+1}+\gamma G_{t+1}, \end{aligned}$$

where $G_{t+1}=R_{t+2}+\gamma R_{t+3}+\ldots$ This equation establishes the relationship between $G_t$ and $G_{t+1}$.
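As a quick numerical sanity check of this recursion (a minimal sketch with an arbitrary finite reward sequence), the discounted sum computed from time $t$ equals the first reward plus $\gamma$ times the discounted sum computed from time $t+1$:

```python
# Verify G_t = R_{t+1} + gamma * G_{t+1} on an arbitrary (finite) reward sequence.
gamma = 0.9
rewards = [0.0, -1.0, 1.0, 1.0, 1.0]  # R_{t+1}, R_{t+2}, ... (hypothetical values)

def discounted_return(rews, gamma):
    """Discounted sum of a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rews))

G_t = discounted_return(rewards, gamma)
G_t1 = discounted_return(rewards[1:], gamma)
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12
```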

The state value can then be expressed as (2.4):
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t\mid S_t=s]\\ &=\mathbb{E}[R_{t+1}+\gamma G_{t+1}\mid S_t=s] \\ &=\mathbb{E}[R_{t+1}\mid S_t=s]+\gamma\mathbb{E}[G_{t+1}\mid S_t=s] \end{aligned}\tag{2.4}$$

  • The first term, $\mathbb{E}[R_{t+1}\mid S_t=s]$, is the expectation of the immediate reward. By the law of total expectation, it can be rewritten as (2.5):
    $$\begin{aligned} \mathbb{E}[R_{t+1}\mid S_t=s]&=\sum_{a\in\mathcal{A}}\pi(a|s)\mathbb{E}[R_{t+1}\mid S_t=s,A_t=a]\\ &=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r \end{aligned}\tag{2.5}$$
    • Here, $\mathcal{A}$ and $\mathcal{R}$ are the sets of possible actions and rewards, respectively.
    • It should be noted that $\mathcal{A}$ may be different for different states; in that case, $\mathcal{A}$ should be written as $\mathcal{A}(s)$. Similarly, $\mathcal{R}$ may also depend on $(s,a)$.
    • We drop the dependence on $s$ or $(s,a)$ for the sake of notational simplicity in this book.
    • Nevertheless, the conclusions remain valid in the presence of such dependence.
  • The second term, $\mathbb{E}[G_{t+1}\mid S_t=s]$, is the expectation of the future rewards. It can be rewritten as (2.6):
    $$\begin{aligned} \mathbb{E}[G_{t+1}\mid S_t=s]&=\sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}\mid S_t=s,S_{t+1}=s']p(s'|s)\\ &=\sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}\mid S_{t+1}=s']p(s'|s)\quad\text{(due to the Markov property)}\\ &=\sum_{s'\in\mathcal{S}}v_\pi(s')p(s'|s) \\ &=\sum_{s'\in\mathcal{S}}v_\pi(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s) \end{aligned}\tag{2.6}$$
    • The above derivation uses the fact that $\mathbb{E}[G_{t+1}\mid S_t=s,S_{t+1}=s']=\mathbb{E}[G_{t+1}\mid S_{t+1}=s']$, which follows from the Markov property: the future rewards depend merely on the present state rather than on the previous ones.
    • Here, $s'$ denotes the state at the next time step.

Substituting (2.5) and (2.6) into (2.4) gives (2.7):

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}\mid S_t=s]+\gamma\mathbb{E}[G_{t+1}\mid S_t=s] \\ &=\underbrace{\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r}_{\text{mean of immediate rewards}}+\underbrace{\gamma\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')}_{\text{mean of future rewards}}\\ &=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right],\quad\text{for all }s\in\mathcal{S}\end{aligned}\tag{2.7}$$
This equation is the Bellman equation, which characterizes the relationships among state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.

  • $v_\pi(s)$ and $v_\pi(s')$ are the unknown state values to be calculated.
    • The Bellman equation refers to a set of linear equations, one for each state, rather than a single equation.
    • If we put these equations together, it becomes clear how to calculate all the state values.
  • $\pi(a|s)$ is a given policy.
    • Since state values can be used to evaluate a policy, solving the state values from the Bellman equation is a policy evaluation process.
  • $p(r|s,a)$ and $p(s'|s,a)$ represent the system model.
    • We will first show how to calculate the state values *with* this model and then show how to do that *without* the model by using model-free algorithms later in this book.

In addition to the expression in (2.7), readers may also encounter other expressions of the Bellman equation in the literature. We next introduce two equivalent expressions.

  • First, it follows from the law of total probability that
    $$p(s'|s,a)=\sum_{r\in\mathcal{R}}p(s',r|s,a),\qquad p(r|s,a)=\sum_{s'\in\mathcal{S}}p(s',r|s,a).$$
    Then, equation (2.7) can be rewritten as
    $$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r|s,a)\left[r+\gamma v_{\pi}(s')\right]$$
  • Second, the reward $r$ may depend solely on the next state $s'$ in some problems. As a result, we can write the reward as $r(s')$ and hence $p(r(s')|s,a)=p(s'|s,a)$, substituting which into (2.7) gives
    $$v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s'\in\mathcal{S}}p(s'|s,a)\left[r(s')+\gamma v_\pi(s')\right]$$

2.5 Examples for illustrating the Bellman equation

We next use two examples to demonstrate how to write out the Bellman equation and calculate the state values step by step. Readers are advised to go through the examples carefully to gain a better understanding of the Bellman equation.

If we compare the state values of the two policies in the examples below, it can be seen that
$$v_{\pi_1}(s_i)\geq v_{\pi_2}(s_i),\quad i=1,2,3,4,$$

  • which indicates that the policy in Figure 2.4 is better because it has greater state values.
  • This mathematical conclusion is consistent with the intuition that the first policy is better because it can avoid entering the forbidden area when the agent starts from $s_1$.
  • As a result, the two examples demonstrate that state values can be used to evaluate policies.

2.5.1 Example 1: a deterministic policy

[Figure 2.4: a deterministic policy]
Consider the first example shown in Figure 2.4, where the policy is deterministic.

We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy:

  • The probabilities of taking the actions are $\pi(a=a_3|s_1)=1$ and $\pi(a\neq a_3|s_1)=0$.
  • The state transition probabilities are $p(s'=s_3|s_1,a_3)=1$ and $p(s'\neq s_3|s_1,a_3)=0$.
  • The reward probabilities are $p(r=0|s_1,a_3)=1$ and $p(r\neq 0|s_1,a_3)=0$.

Substituting these values into (2.7) gives
$$v_\pi(s_1)=0+\gamma v_\pi(s_3).$$

Similarly, it can be obtained that
$$\begin{aligned} v_{\pi}(s_2)&=1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{3})&=1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{4})&=1+\gamma v_\pi(s_4). \end{aligned}$$

We can solve the state values from these equations. Since the equations are simple, we can solve them manually:
$$v_\pi(s_4)=\frac{1}{1-\gamma},\quad v_{\pi}(s_{3})=\frac{1}{1-\gamma},\quad v_{\pi}(s_{2})=\frac{1}{1-\gamma},\quad v_{\pi}(s_{1})=\frac{\gamma}{1-\gamma}.$$

Furthermore, if we set $\gamma = 0.9$, then
$$v_\pi(s_4)=\frac{1}{1-0.9}=10,\quad v_{\pi}(s_{3})=\frac{1}{1-0.9}=10,\quad v_{\pi}(s_{2})=\frac{1}{1-0.9}=10,\quad v_{\pi}(s_{1})=\frac{0.9}{1-0.9}=9.$$
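The hand-derived values can also be obtained by treating the four Bellman equations above as a linear system; the minimal sketch below reproduces the numbers for $\gamma=0.9$.

```python
import numpy as np

gamma = 0.9
# Bellman equations of the deterministic policy in Figure 2.4, written as v = r + gamma * P @ v:
# v(s1) = 0 + gamma*v(s3), v(s2) = 1 + gamma*v(s4), v(s3) = 1 + gamma*v(s4), v(s4) = 1 + gamma*v(s4)
r = np.array([0.0, 1.0, 1.0, 1.0])
P = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)

v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # [ 9. 10. 10. 10.]
```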

2.5.2 Example 2: a stochastic policy

[Figure 2.5: a stochastic policy]
Consider the second example shown in Figure 2.5, where the policy is stochastic.

We next write out the Bellman equation and then solve the state values from it.

First, consider state $s_1$. Under the policy:

  • At state $s_1$, the probabilities of going right and down both equal 0.5: $\pi(a=a_2|s_1)=0.5$ and $\pi(a=a_3|s_1)=0.5$.
  • The state transition probabilities are deterministic: $p(s'=s_3|s_1,a_3)=1$ and $p(s'=s_2|s_1,a_2)=1$.
  • The reward probabilities are also deterministic: $p(r=0|s_1,a_3)=1$ and $p(r=-1|s_1,a_2)=1$.

Substituting these values into (2.7) gives
$$v_\pi(s_1)=0.5[0+\gamma v_\pi(s_3)]+0.5[-1+\gamma v_\pi(s_2)].$$

Similarly, it can be obtained that

$$\begin{aligned} v_{\pi}(s_{2})&=1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{3})&=1+\gamma v_{\pi}(s_4), \\ v_{\pi}(s_{4})&=1+\gamma v_\pi(s_4). \end{aligned}$$

The state values can be solved from the above equations. Since the equations are simple, we can solve the state values manually and obtain

$$\begin{aligned} v_\pi(s_4)&=\frac{1}{1-\gamma}, \\ v_{\pi}(s_{3})&=\frac{1}{1-\gamma}, \\ v_{\pi}(s_{2})&=\frac{1}{1-\gamma}, \\ v_{\pi}(s_{1})&=0.5[0+\gamma v_\pi(s_3)]+0.5[-1+\gamma v_\pi(s_2)]=-0.5+\frac{\gamma}{1-\gamma}. \end{aligned}$$

Furthermore, if we set $\gamma = 0.9$, then

$$v_\pi(s_4)=10,\quad v_{\pi}(s_{3})=10,\quad v_{\pi}(s_{2})=10,\quad v_{\pi}(s_{1})=-0.5+9=8.5.$$

2.6 Matrix-vector form of the Bellman equation

The Bellman equation in (2.7) is in an element-wise form. Since it is valid for every state, we can combine all of these equations and write them concisely in a matrix-vector form, which will be used frequently to analyze the Bellman equation.

To derive the matrix-vector form, we first rewrite the Bellman equation (2.7) as

$$v_{\pi}(s)=r_{\pi}(s)+\gamma\sum_{s'\in\mathcal{S}}p_{\pi}(s'|s)v_{\pi}(s')\tag{2.8}$$

where
$$r_{\pi}(s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r,\qquad p_{\pi}(s'|s)\doteq\sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a).$$

  • $r_{\pi}(s)$ denotes the mean of the immediate rewards;
  • $p_{\pi}(s'|s)$ is the probability of transitioning from $s$ to $s'$ under policy $\pi$.

Suppose that the states are indexed as $s_i$ with $i=1,\ldots,n$, where $n=|\mathcal{S}|$. For state $s_i$, (2.8) can be written as
$$v_{\pi}(s_{i})=r_{\pi}(s_{i})+\gamma\sum_{s_{j}\in\mathcal{S}}p_{\pi}(s_{j}|s_{i})v_{\pi}(s_{j}).\tag{2.9}$$

Let $v_\pi=[v_\pi(s_1),\ldots,v_\pi(s_n)]^T\in\mathbb{R}^n$, $r_\pi=[r_\pi(s_1),\ldots,r_\pi(s_n)]^T\in\mathbb{R}^n$, and $P_\pi\in\mathbb{R}^{n\times n}$ with $[P_\pi]_{ij}=p_\pi(s_j|s_i)$. Then, (2.9) can be written in the following matrix-vector form:

$$v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi},\tag{2.10}$$

where $v_\pi$ is the unknown to be solved, and $r_\pi$ and $P_\pi$ are known.

The matrix $P_\pi$ has some interesting properties.

  • First, it is a nonnegative matrix, meaning that all of its elements are equal to or greater than zero. This property is denoted as $P_{\pi}\geq 0$, where $0$ denotes a zero matrix with appropriate dimensions. In this book, $\geq$ or $\leq$ represents an elementwise comparison.
  • Second, $P_\pi$ is a stochastic matrix, meaning that the sum of the values in every row is equal to one. This property is denoted as $P_{\pi}\mathbf{1}=\mathbf{1}$, where $\mathbf{1}=[1,\ldots,1]^T$ has appropriate dimensions.

[Figure 2.6: an example for illustrating the matrix-vector form of the Bellman equation]
Consider the example shown in Figure 2.6. The matrix-vector form of the Bellman equation is

$$\underbrace{\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}}_{v_\pi}=\underbrace{\begin{bmatrix}r_\pi(s_1)\\r_\pi(s_2)\\r_\pi(s_3)\\r_\pi(s_4)\end{bmatrix}}_{r_\pi}+\gamma\underbrace{\begin{bmatrix}p_\pi(s_1|s_1)&p_\pi(s_2|s_1)&p_\pi(s_3|s_1)&p_\pi(s_4|s_1)\\p_\pi(s_1|s_2)&p_\pi(s_2|s_2)&p_\pi(s_3|s_2)&p_\pi(s_4|s_2)\\p_\pi(s_1|s_3)&p_\pi(s_2|s_3)&p_\pi(s_3|s_3)&p_\pi(s_4|s_3)\\p_\pi(s_1|s_4)&p_\pi(s_2|s_4)&p_\pi(s_3|s_4)&p_\pi(s_4|s_4)\end{bmatrix}}_{P_\pi}\underbrace{\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}}_{v_\pi}$$

Substituting the specific values into the above equation gives

$$\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}=\begin{bmatrix}0.5(0)+0.5(-1)\\1\\1\\1\end{bmatrix}+\gamma\begin{bmatrix}0&0.5&0.5&0\\0&0&0&1\\0&0&0&1\\0&0&0&1\end{bmatrix}\begin{bmatrix}v_\pi(s_1)\\v_\pi(s_2)\\v_\pi(s_3)\\v_\pi(s_4)\end{bmatrix}$$

It can be seen that $P_\pi$ satisfies $P_{\pi}\mathbf{1}=\mathbf{1}$.
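The specific $r_\pi$ and $P_\pi$ above can be checked directly in code; the minimal sketch below verifies the row-sum property $P_\pi\mathbf{1}=\mathbf{1}$ and solves (2.10), recovering the state values of the stochastic policy from Section 2.5.2.

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])  # mean immediate reward at each state
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # P_pi is a stochastic matrix: every row sums to 1

v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)  # [ 8.5 10. 10. 10.]
```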

2.7 Solving state values from the Bellman equation

Calculating the state values of a given policy is a fundamental problem in reinforcement learning. This problem is often referred to as policy evaluation.

In this section, we present two methods for calculating state values from the Bellman equation.

2.7.1 Closed-form solution

Since $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$ is a simple linear equation, its closed-form solution can be easily obtained as

$$v_\pi=(I-\gamma P_\pi)^{-1}r_\pi.$$

Some properties of $(I-\gamma P_{\pi})^{-1}$ are given below.

  • $I-\gamma P_{\pi}$ is invertible. The proof is as follows. According to the Gershgorin circle theorem [4], every eigenvalue of $I-\gamma P_{\pi}$ lies within at least one of the Gershgorin circles. The $i$th Gershgorin circle has its center at $[I-\gamma P_{\pi}]_{ii}=1-\gamma p_{\pi}(s_{i}|s_{i})$ and a radius equal to $\sum_{j\neq i}\left|[I-\gamma P_{\pi}]_{ij}\right|=\sum_{j\neq i}\gamma p_{\pi}(s_{j}|s_{i})$. Since $\gamma<1$, the radius is less than the magnitude of the center: $\sum_{j\neq i}\gamma p_\pi(s_j|s_i)<1-\gamma p_\pi(s_i|s_i)$. Therefore, none of the Gershgorin circles encircles the origin, and hence no eigenvalue of $I-\gamma P_{\pi}$ is zero.

  • $(I-\gamma P_{\pi})^{-1}\geq I$, meaning that every element of $(I-\gamma P_{\pi})^{-1}$ is nonnegative and, more specifically, no less than the corresponding element of the identity matrix. This is because $P_\pi$ has nonnegative entries, and hence $(I-\gamma P_{\pi})^{-1}=I+\gamma P_{\pi}+\gamma^{2}P_{\pi}^{2}+\cdots\geq I\geq 0$.

  • For any vector $r\geq 0$, it holds that $(I-\gamma P_{\pi})^{-1}r\geq r\geq 0$. This property follows from the second property because $[(I-\gamma P_{\pi})^{-1}-I]r\geq 0$. As a consequence, if $r_1\geq r_2$, we have $(I-\gamma P_{\pi})^{-1}r_{1}\geq(I-\gamma P_{\pi})^{-1}r_{2}$ (see the numerical check below).
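These properties can be verified numerically on the example from Section 2.6; the minimal sketch below checks that $(I-\gamma P_\pi)^{-1}\geq I$ elementwise and that it maps a nonnegative vector to one that is no smaller.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

M = np.linalg.inv(np.eye(4) - gamma * P_pi)

assert np.all(M >= np.eye(4) - 1e-12)   # (I - gamma*P_pi)^{-1} >= I, elementwise
r = np.array([0.0, 1.0, 2.0, 3.0])      # an arbitrary nonnegative vector
assert np.all(M @ r >= r - 1e-12)       # (I - gamma*P_pi)^{-1} r >= r >= 0
```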

2.7.2 Iterative solution

Although the closed-form solution is useful for theoretical analysis, it is not applicable in practice because it involves a matrix inversion operation, which still needs to be computed by other numerical algorithms. In fact, we can solve the Bellman equation directly using the following iterative algorithm:

$$v_{k+1}=r_{\pi}+\gamma P_{\pi}v_{k},\quad k=0,1,2,\ldots\tag{2.11}$$

This algorithm generates a sequence of values $\{v_0, v_1, v_2, \ldots\}$, where $v_0\in\mathbb{R}^n$ is an initial guess of $v_\pi$.

It holds that

$$v_k\to v_\pi=(I-\gamma P_\pi)^{-1}r_\pi\quad\text{ as }k\to\infty.\tag{2.12}$$
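A direct implementation of the iteration (2.11) is sketched below for the example of Section 2.6; the stopping tolerance and the all-zero initial guess are arbitrary choices, and the iterates converge to the same values as the closed-form solution.

```python
import numpy as np

gamma = 0.9
r_pi = np.array([-0.5, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)

v = np.zeros(4)                       # initial guess v_0
for k in range(10_000):
    v_new = r_pi + gamma * P_pi @ v   # v_{k+1} = r_pi + gamma * P_pi * v_k
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print(v_new)                                            # approx. [ 8.5 10. 10. 10.]
print(np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi))  # closed-form solution, for comparison
```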

2.7.3 Illustrative examples

We next apply the algorithm in (2.11) to solve the state values of some examples.

The examples are shown in Figure 2.7. The orange cells represent forbidden areas. The blue cell represents the target area. The reward settings are $r_{\text{boundary}}=r_{\text{forbidden}}=-1$ and $r_{\text{target}}=1$. Here, the discount rate is $\gamma=0.9$.
[Figure 2.7: (a) two "good" policies and their state values; (b) two "bad" policies and their state values]

  • Figure 2.7(a) shows two “good” policies and their corresponding state values obtained by (2.11). The two policies have the same state values but differ at the top two states in the fourth column. Therefore, we know that different policies may have the same state values.
  • Figure 2.7(b) shows two “bad” policies and their corresponding state values. These two policies are bad because the actions of many states are intuitively unreasonable. Such intuition is supported by the obtained state values. As can be seen, the state values of these two policies are negative and much smaller than those of the good policies in Figure 2.7(a).

2.8 From state value to action value

While we have been discussing state values thus far in this chapter, we now turn to the action value, which indicates the “value” of taking an action at a state. While the concept of action value is important, the reason why it is introduced in the last section of this chapter is that it heavily relies on the concept of state values. It is important to understand state values well first before studying action values.

State value:
$$v_\pi(s)\doteq\mathbb{E}[G_t\mid S_t=s]$$

Action value:
$$q_\pi(s,a)\doteq\mathbb{E}[G_t\mid S_t=s,A_t=a]$$

As can be seen, the action value is defined as the expected return that can be obtained after taking an action at a state.

It must be noted that $q_\pi(s,a)$ depends on a state-action pair $(s,a)$ rather than on an action alone. It may be more rigorous to call this value a state-action value, but it is conventionally called an action value for simplicity.

What is the relationship between action values and state values?

  • First, it follows from the properties of conditional expectation that
    $$\underbrace{\mathbb{E}[G_t\mid S_t=s]}_{v_\pi(s)}=\sum_{a\in\mathcal{A}}\underbrace{\mathbb{E}[G_t\mid S_t=s,A_t=a]}_{q_\pi(s,a)}\pi(a|s)$$
    It then follows that
    $$v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_{\pi}(s,a).\tag{2.13}$$
    As a result, a state value is the expectation of the action values associated with that state.
  • Second, since the state value is given by
    $$v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\Big[\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_\pi(s')\Big],$$
    comparing it with (2.13) leads to
    $$q_\pi(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r+\gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_\pi(s').\tag{2.14}$$
    It can be seen that the action value consists of two terms:
    • the first term is the mean of the immediate rewards;
    • the second term is the mean of the future rewards.

Both (2.13) and (2.14) describe the relationship between state values and action values. They are the two sides of the same coin:

  • (2.13) shows how to obtain state values from action values (action values → state values);
  • (2.14) shows how to obtain action values from state values (state values → action values); a small numerical illustration is given below.
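To make both directions concrete, here is a minimal sketch (an illustration, not from the book) that uses the state values of the stochastic policy from Section 2.5.2, whose transitions and rewards at $s_1$ are deterministic: it computes the action values at $s_1$ via (2.14) and then recovers $v_\pi(s_1)$ via (2.13).

```python
gamma = 0.9
v = {'s1': 8.5, 's2': 10.0, 's3': 10.0, 's4': 10.0}  # state values from Section 2.5.2 (gamma = 0.9)

# (2.14): q(s,a) = sum_r p(r|s,a)*r + gamma * sum_s' p(s'|s,a)*v(s').
# At s1 the model is deterministic: a2 leads to s2 with r = -1, a3 leads to s3 with r = 0.
q_s1 = {
    'a2': -1 + gamma * v['s2'],  # 8.0
    'a3':  0 + gamma * v['s3'],  # 9.0
}

# (2.13): v(s) = sum_a pi(a|s) * q(s,a), with pi(a2|s1) = pi(a3|s1) = 0.5.
v_s1 = 0.5 * q_s1['a2'] + 0.5 * q_s1['a3']
print(q_s1, v_s1)  # {'a2': 8.0, 'a3': 9.0} 8.5
```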

2.8.1 Illustrative examples

[Figure 2.8: a stochastic policy used to illustrate action values]
We next present an example to illustrate the process of calculating action values and discuss a common mistake that beginners may make.

Consider the stochastic policy shown in Figure 2.8. We next only examine the actions of $s_1$. The other states can be examined similarly. The action value of $(s_1, a_2)$ is

$$q_\pi(s_1,a_2)=-1+\gamma v_\pi(s_2),$$

where $s_2$
