CONTENT
- Introduction
- Terminology
- Goal
- Classification
- Markov Decision Process
- Markov Process
- Markov Reward Process
- Markov Decision Process
- Dynamic Programming
- Sampling Methods for Model-Free RL (solutions for small MDPs)
- Monte Carlo
- Temporal-Difference Learning (TD)
- Off-Policy Learning
- Value Function Approximation (solution for large MDPs)
- Basic Idea
- Incremental Methods
- Stochastic Gradient Descent
- Incremental Prediction Algorithms
- Batch Methods
- Least Squares Prediction
- Least Squares Control
- Policy Gradient
- Monte-Carlo methods
- Actor-Critic methods
- Model-Based Reinforcement Learning
- Model Definition
- Integrating Learning and Planning
- Simulation-Based Search
Having taken a quick look at several overviews of reinforcement learning, I wrote this note to summarize and record some key concepts and points, to help myself understand reinforcement learning.
Introduction
Terminology
- Environment
- State
- agent state
the state that the agent can observe.
- environment state
the complete information that the environment holds at that time (usually more than the agent can observe).
- Action
- Reward
- Policy
A policy defines the learning agent’s way of behaving at a given time. It is an agent’s behavior function. It is a map from state to action.
- Deterministic policy: $a = \pi(s)$, where $a$ is an action and $s$ is a state.
- Stochastic policy: $\pi(a|s) = \mathbb{P}[A = a \mid S = s]$
- Value function
The value function is a prediction of future reward. It is used to evaluate how good each state and/or action is, and therefore to select between actions.
- model
How the agent senses the environment: the agent's representation of the environment, which predicts what the environment will do next. An agent can also be model-free; not every problem needs a model.
- Transitions model: $\mathcal{P}$ predicts the next state (i.e. the dynamics), e.g.
$\mathcal{P}^{a}_{ss'} = \mathbb{P}[S'=s' \mid S=s, A=a]$
- Rewards model: $\mathcal{R}$ predicts the next (immediate) reward, e.g.
$\mathcal{R}^a_s = \mathbb{E}[R \mid S=s, A=a]$
- model-free
- on-policy / off-policy
Goal
To maximize the expected cumulative reward.
(Reward Hypothesis)
Classification
- two types of tasks
- episodic
- continuous (no terminal state), e.g. automated stock trading.
- two ways of learning (sampling methods)
- Monte Carlo (only update when the episodes end)
- TD Learning Methods (update every time states switch)
- two approaches (whether MDP is known)
- Model Free
Policy and/or Value Function. No model.
We don’t try to explicitly understand the environment. Don’t need to build a model to describe the environment.
- value-based
In value-based RL, the goal is to optimize the value function $V(s)$. The value function tells us the maximum expected future reward the agent will get at each state: the total amount of reward an agent can expect to accumulate over the future, starting at that state. The agent uses this value function to select which state to move to at each step, choosing the state with the biggest value.
Therefore the policy is only implicit: you don't need an explicit policy to determine which action to choose.
- policy-based
The policy $\pi(s)$ is what defines the agent's behavior at a given time. There is no value function.
- Deterministic: a policy at a given state will always return the same action.
- Stochastic: output a distribution probability over actions.
- Actor Critic
It stores both the policy and the value function at the same time.
- Model Based
Policy and/or Value Function. Model.
First build the model, then act based on the model.
Markov Decision Process
Markov Process
A Markov Process (or Markov Chain) is a tuple $\langle\mathcal{S}, \mathcal{P}\rangle$.
- $\mathcal{S}$ is a (finite) set of states.
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_t=s]$.
Markov Reward Process
A Markov Reward Process is a tuple $\langle\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma\rangle$.
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t=s]$ (the immediate reward).
- $\gamma$ is a discount factor, $\gamma \in [0,1]$.
- The return $G_t$ is the total discounted reward from time-step $t$:
$G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$
There is no expectation here because $G_t$ is defined for a single sampled sequence.
- The value function $v(s)$ gives the long-term value of state $s$. It is the expected return starting from state $s$:
$v(s) = \mathbb{E}[G_t \mid S_t=s]$
- Bellman Equation for MRPs
The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state, $\gamma v(S_{t+1})$
$v(s) = \mathbb{E}[G_t \mid S_t=s] = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t=s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$
In matrix form: $v = \mathcal{R} + \gamma \mathcal{P} v$
This can be solved directly for small MRPs: $v = (I - \gamma\mathcal{P})^{-1}\mathcal{R}$
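As a worked example, the matrix form above can be solved directly with a few lines of NumPy. The 3-state MRP below (transition matrix, rewards, and discount factor) is made up purely for illustration:

```python
import numpy as np

# A hypothetical 3-state MRP (numbers chosen only for illustration).
P = np.array([[0.5, 0.5, 0.0],   # transition probabilities P[s, s']
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])   # expected immediate reward R_s
gamma = 0.9

# v = (I - gamma * P)^{-1} R  -- direct solution of the Bellman equation
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```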
Markov Decision Process
A Markov Decision Process is a tuple $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\rangle$.
- $\mathcal{A}$ is a finite set of actions.
- $\mathcal{P}$ is a state transition probability matrix,
$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_t=s, A_t=a]$
- $\mathcal{R}$ is a reward function,
$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]$
- A policy $\pi$ is a distribution over actions given states. It fully defines the behavior of an agent.
$\pi(a|s) = \mathbb{P}[A_t=a \mid S_t=s]$
- The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]$
Greedy policy improvement over $V(s)$ requires a model of the MDP.
- The action-value function $q_\pi(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]$
Greedy policy improvement over $Q(s,a)$ is model-free.
- Bellman Expectation Equation (derived in the same way as for MRPs)
$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s]$
$v_\pi = \mathcal{R}^\pi + \gamma\mathcal{P}^\pi v_\pi$
$q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a]$
- Optimal Value Function
The optimal state-value function $v_*(s)$ is the maximum value function over all policies:
$v_*(s) = \max_\pi v_\pi(s)$
The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies:
$q_*(s,a) = \max_\pi q_\pi(s,a)$
Dynamic Programming
Dynamic programming assumes full knowledge of the MDP and is used for planning, e.g. iterative policy evaluation, policy iteration, and value iteration.
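For example, value iteration applies the Bellman optimality backup repeatedly until the values converge. Below is a minimal tabular sketch, assuming the model is given as `P[s][a]` = list of `(prob, next_state, reward)` triples; this representation is my own choice for illustration:

```python
def value_iteration(P, n_states, gamma=0.9, theta=1e-8):
    """P[s][a]: list of (prob, next_state, reward) triples -- an assumed model format."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: V(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * V(s'))
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(len(P[s]))]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```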
Sampling Methods for Model-Free RL (solutions for small MDPs)
To estimate and optimise the value function of an unknown MDP.
Monte Carlo
Monte Carlo policy evaluation uses the empirical mean return instead of the expected return in $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]$, using complete episodes that reach a terminal state.
- Update value $V(S_t)$ toward the actual return $G_t$:
$V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t))$
- GLIE Monte-Carlo Control (On-Policy) updates the action-value function through sampling (a minimal sketch follows this list).
- Sample the $k$-th episode using $\pi$: $\{S_1, A_1, R_2, ..., S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode,
$N(S_t, A_t) \gets N(S_t, A_t) + 1$
$Q(S_t, A_t) \gets Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))$
- Improve the policy based on the new action-value function:
$\epsilon \gets 1/k$
$\pi \gets \epsilon\text{-greedy}(Q)$
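A minimal Python sketch of GLIE Monte-Carlo control. The `env.reset()` / `env.step(a)` interface returning `(next_state, reward, done)` is an assumed toy interface, not something from the course:

```python
import random
from collections import defaultdict

def glie_mc_control(env, n_actions, episodes=10000, gamma=1.0):
    Q = defaultdict(float)   # Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)
    for k in range(1, episodes + 1):
        eps = 1.0 / k                       # GLIE schedule: epsilon <- 1/k
        # Sample one episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)       # assumed interface
            episode.append((s, a, r))
            s = s2
        # Every-visit MC update toward the return G_t
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```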
Temporal-Difference Learning (TD)
TD learning learns from incomplete episodes by bootstrapping (update involves an estimate).
- (TD / TD(0)) Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
$V(S_t) \gets V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
SARSA (On-Policy)
Use the TD target instead of the MC return in GLIE Monte-Carlo Control (a tabular sketch follows the convergence conditions below):
$Q(S, A) \gets Q(S, A) + \alpha(R + \gamma Q(S', A') - Q(S, A))$
Convergence conditions:
- GLIE sequence of policies $\pi_t(a|s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$:
$\sum^\infty_{t=1}\alpha_t = \infty$
$\sum^\infty_{t=1}\alpha_t^2 < \infty$
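A minimal tabular SARSA sketch, under the same assumed `env.reset()` / `env.step(a)` interface as the Monte-Carlo sketch above:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=10000, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)       # assumed interface
            a2 = eps_greedy(s2)
            # On-policy TD update toward R + gamma * Q(S', A')
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```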
- n-step temporal-difference learning
$V(S_t) \gets V(S_t) + \alpha(G^{(n)}_t - V(S_t))$, where $G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})$
When $n = \infty$, this reduces to the MC method.
n-step SARSA:
$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha(q^{(n)}_t - Q(S_t, A_t))$, where $q^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1}R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
- Forward-view TD($\lambda$)
$V(S_t) \gets V(S_t) + \alpha(G^\lambda_t - V(S_t))$, where $G^\lambda_t = (1-\lambda)\sum^\infty_{n=1}\lambda^{n-1}G^{(n)}_t$
Like MC, this can only be computed from complete episodes.
Forward-view SARSA($\lambda$) combines all n-step Q-returns $q^{(n)}_t$:
$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha(q^\lambda_t - Q(S_t, A_t))$, where $q^\lambda_t = (1-\lambda)\sum^\infty_{n=1}\lambda^{n-1}q^{(n)}_t$
- Backward-view TD($\lambda$)
$V(s) \gets V(s) + \alpha\delta_t E_t(s)$, where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, $E_0(s) = 0$, and $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$
TD is biased but has lower variance than MC methods. It is usually more efficient but more sensitive to the initial value estimates.
Off-Policy Learning
Evaluate the target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, while following the behaviour policy $\mu(a|s)$: $\{S_1, A_1, R_2, ..., S_T\} \sim \mu$
- Importance sampling
- Q-Learning (a sketch of the update follows)
$Q(S, A) \gets Q(S, A) + \alpha(R + \gamma\max_{a'}Q(S', a') - Q(S, A))$
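Compared with SARSA, only the target changes: Q-learning bootstraps from the greedy action $\max_{a'} Q(S', a')$ rather than the action the behaviour policy actually takes next. A minimal sketch of one backup, assuming `Q` is a `defaultdict(float)` keyed by `(state, action)`:

```python
def q_learning_update(Q, s, a, r, s2, done, n_actions, alpha=0.1, gamma=1.0):
    """One Q-learning backup: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A))."""
    best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```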
Value Function Approximation (solution for large MDPs)
Basic Idea
There are too many states and/or actions to store in memory, and it is too slow to learn the value of each state individually. So we estimate the value function with function approximation:
- $\hat{v}(s, w) \approx v_\pi(s)$
- $\hat{q}(s, a, w) \approx q_\pi(s, a)$
- $\hat{q}(s, a_1, w), ..., \hat{q}(s, a_m, w) \approx q_\pi(s, a_1), ..., q_\pi(s, a_m)$ (one output per action)
Options for the function approximator (differentiable):
- Linear combinations of features
- Neural network
Incremental Methods
Stochastic Gradient Descent
Goal: find the parameter vector $w$ minimising the mean-squared error between the approximate value function $\hat{v}(s, w)$ and the true value function $v_\pi(s)$: $J(w) = \mathbb{E}_\pi[(v_\pi(S) - \hat{v}(S, w))^2]$
- Gradient descent finds a local minimum:
$\Delta w = -\frac{1}{2}\alpha\nabla_w J(w) = \alpha\mathbb{E}_\pi[(v_\pi(S) - \hat{v}(S, w))\nabla_w\hat{v}(S, w)]$
- Stochastic gradient descent samples the gradient:
$\Delta w = \alpha(v_\pi(S) - \hat{v}(S, w))\nabla_w\hat{v}(S, w)$
- The expected update is equal to the full gradient update.
Incremental Prediction Algorithms
In practice, we substitute a target for $v_\pi(s)$:
- For MC, the target is the return $G_t$:
$\Delta w = \alpha(G_t - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$
- For TD(0), the target is the TD target $R_{t+1} + \gamma\hat{v}(S_{t+1}, w)$:
$\Delta w = \alpha(R_{t+1} + \gamma\hat{v}(S_{t+1}, w) - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$
- For TD($\lambda$), the target is the $\lambda$-return $G^\lambda_t$:
$\Delta w = \alpha(G^\lambda_t - \hat{v}(S_t, w))\nabla_w\hat{v}(S_t, w)$
The derivation for action-value function approximation is similar. A minimal sketch of linear TD(0) prediction follows.
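Below is a minimal sketch of semi-gradient TD(0) prediction with a linear approximator $\hat{v}(s, w) = \phi(s)^\top w$, so that $\nabla_w\hat{v}(s, w) = \phi(s)$. The feature function `phi` and the trajectory format are assumptions made for illustration:

```python
import numpy as np

def linear_td0(trajectories, phi, n_features, alpha=0.01, gamma=1.0):
    """trajectories: iterable of lists of (s, r, s_next, done) tuples (assumed format).
    phi(s): feature vector of state s as a NumPy array of length n_features (assumed)."""
    w = np.zeros(n_features)
    for trajectory in trajectories:
        for s, r, s_next, done in trajectory:
            v_s = phi(s) @ w
            v_next = 0.0 if done else phi(s_next) @ w
            # Semi-gradient TD(0): delta_w = alpha * (TD target - v_hat(s,w)) * phi(s)
            td_error = r + gamma * v_next - v_s
            w += alpha * td_error * phi(s)
    return w
```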
Batch Methods
Least Squares Prediction
Given a value function approximation $\hat{v}(s, w) \approx v_\pi(s)$ and experience $\mathcal{D}$ consisting of $\langle state, value\rangle$ pairs $\mathcal{D} = \{\langle s_1, v^\pi_1\rangle, \langle s_2, v^\pi_2\rangle, ..., \langle s_T, v^\pi_T\rangle\}$, which parameters $w$ give the best fitting value function $\hat{v}(s, w)$?
Least squares algorithms find the parameter vector $w$ minimising the sum-squared error between $\hat{v}(s_t, w)$ and the target values $v^\pi_t$:
$LS(w) = \sum^T_{t=1}(v^\pi_t - \hat{v}(s_t, w))^2 = \mathbb{E}_\mathcal{D}[(v^\pi - \hat{v}(s, w))^2]$
- Stochastic Gradient Descent with Experience Replay
Sample experience $\langle s, v^\pi\rangle \sim \mathcal{D}$ and apply a stochastic gradient descent update. This converges to the least squares solution $w^\pi = \mathrm{argmin}_w LS(w)$.
$\Delta w = \alpha(v^\pi - \hat{v}(s, w))\nabla_w\hat{v}(s, w)$
- Deep Q-Networks (DQN) with Experience Replay
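A minimal sketch of the buffer used in experience replay: store experience and sample random mini-batches for the SGD step, which de-correlates consecutive samples. The capacity and batch size below are arbitrary choices for illustration:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is dropped automatically

    def add(self, experience):
        self.buffer.append(experience)        # e.g. a (state, value) pair or a transition

    def sample(self, batch_size=32):
        # Random mini-batch; breaks the correlation between consecutive steps
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```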
Least Squares Control
Policy Gradient
The score function is $\nabla_\theta\log\pi_\theta(s, a)$.
For a policy $\pi_\theta$, the gradient $\nabla_\theta\pi_\theta(s, a)$ can be rewritten using the likelihood-ratio trick:
$\nabla_\theta\pi_\theta(s, a) = \pi_\theta(s, a)\frac{\nabla_\theta\pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a)\nabla_\theta\log\pi_\theta(s, a)$
This gives the policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)\,Q^{\pi_\theta}(s, a)]$.
Monte-Carlo methods
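The Monte-Carlo policy-gradient method (REINFORCE) uses the return $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s, a)$ in the gradient above, giving the update $\Delta\theta = \alpha\nabla_\theta\log\pi_\theta(s_t, a_t)\,G_t$. Below is a minimal sketch with a linear softmax policy; the feature function `phi(s, a)` returning a NumPy vector is a made-up placeholder:

```python
import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi_theta(a|s) proportional to exp(phi(s,a) . theta) -- a linear softmax policy."""
    prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
    prefs -= prefs.max()                      # for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, phi, n_actions, alpha=0.01, gamma=1.0):
    """episode: list of (s, a, r); applies delta_theta = alpha * grad log pi(a|s) * G_t."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):         # compute returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax_policy(theta, phi, s, n_actions)
        # Score function for a linear softmax policy: phi(s,a) - sum_b pi(b|s) phi(s,b)
        expected_phi = sum(probs[b] * phi(s, b) for b in range(n_actions))
        score = phi(s, a) - expected_phi
        theta += alpha * score * G_t
    return theta
```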
Actor-Critic methods
Maintain two sets of parameters
- Critic: Updates action-value function parameters w w w
- Actor: Updates policy parameters θ \theta θ, in direction suggested by critic
(The following techniques improve the policy-gradient estimate and speed up learning.)
Reducing Variance Using a Baseline
Subtract a baseline function $B(s)$ from the policy gradient. This can reduce variance without changing the expectation:
- $\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)B(s)] = \sum_{s\in\mathcal{S}}d^{\pi_\theta}(s)\sum_a\nabla_\theta\pi_\theta(s, a)B(s) = \sum_{s\in\mathcal{S}}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in\mathcal{A}}\pi_\theta(s, a) = 0$
A good baseline is the state value function $B(s) = V^{\pi_\theta}(s)$.
So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s, a)$:
- $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(s, a)A^{\pi_\theta}(s, a)]$
Eligibility Traces (can be used online)
The deterministic policy gradient theorem can also help improve performance.
Model-Based Reinforcement Learning
Model Definition
A model $\mathcal{M}$ is a representation of an MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\rangle$, parametrized by $\eta$. Here, we assume the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are known.
So a model $\mathcal{M} = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:
$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$
$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
We typically assume conditional independence between state transitions and rewards:
$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$
Goal: estimate the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, ..., S_T\}$.
Given a model $\mathcal{M} = \langle\mathcal{P}_\eta, \mathcal{R}_\eta\rangle$, we then solve the MDP $\langle\mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta\rangle$.
Integrating Learning and Planning
- Model-Free RL
- No model
- Learn value function (and/or policy) from real experience
- Model-Based RL (using Sample-Based Planning)
- Learn a model from real experience
- Plan value function (and/or policy) from simulated experience
- Dyna
- Learn a model from real experience
- Learn and plan value function (and/or policy) from real and simulated experience
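A minimal sketch of classic tabular Dyna-Q (my own illustration, using the same assumed `env` interface as the earlier sketches): each real step is followed by a model update and a few planning updates from simulated experience.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, steps=10000, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                 # (s, a) -> (r, s') learned from real experience
    s = env.reset()
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
        s2, r, done = env.step(a)              # assumed interface
        # (1) Direct RL: Q-learning update from real experience
        best = max(Q[(s2, a_)] for a_ in range(n_actions))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        # (2) Model learning: remember the observed transition (deterministic-model assumption)
        model[(s, a)] = (r, s2)
        # (3) Planning: replay simulated experience from the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps2) = random.choice(list(model.items()))
            pbest = max(Q[(ps2, a_)] for a_ in range(n_actions))
            Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
        s = env.reset() if done else s2
    return Q
```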
Simulation-Based Search
Simulation-based search plans by forward search from the current state: simulate episodes of experience from the model and apply model-free RL to the simulated episodes. This can also be combined with Dyna-style learning.
Ref:
https://www.freecodecamp.org/news/an-introduction-to-reinforcement-learning-4339519de419/
Online courses by David Silver