paper
code
mathematical reasoning mainly from this paper
components in traditional Transformer design and their limitations
main contributions of VSR-Transformer
notation
a calligraphic letter $\mathcal{X}$: a data sequence
a calligraphic letter $\mathcal{D}$: a distribution
a bold upper case letter $\mathbf{X}$: a matrix
a bold lower case letter $\mathbf{x}$: a vector
a lower case letter $x$: an element of a matrix
$[T]$: the set $\{1, ..., T\}$
$\mathbf{1}\{\cdot\}$: an indicator function, where $\mathbf{1}\{A\}=1$ if $A$ is true and $\mathbf{1}\{A\}=0$ if $A$ is false
$\mathbb{E}_{\mathcal{D}}$: an empirical expectation with respect to distribution $\mathcal{D}$
definition 1 (function distance): given a function $f: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$ and a target function $f^{\ast}: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$, we define a distance between these two functions as
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(f):=\mathbb{E}_{\mathbf{X}\sim\mathcal{D}}[\ell(f(\mathbf{X}), f^{\ast}(\mathbf{X}))]$$
for the ground truth $Y=f^{\ast}(\mathcal{D})$, the loss is denoted by $\mathcal{L}_\mathcal{D}(f)$
definition 2 ($k$-pattern function): a function $f: \mathcal{X}\rightarrow\mathcal{Y}$ is a $k$-pattern if for some $g: \{\pm1\}^k\rightarrow\mathcal{Y}$ and index $j^{\ast}$: $f(\mathbf{x})=g(x_{j^{\ast}, ..., j^{\ast}+k})$. we say a function $h_{\mathbf{u}, \mathbf{W}}(\mathbf{x})=\sum_{j}\langle \mathbf{u}^{(j)}, \mathbf{v}_{\mathbf{W}}^{(j)}\rangle$ can learn a $k$-pattern function from a feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ of the data $\mathbf{x}$ with a layer $\mathbf{u}^{(j)}\in\mathbb{R}^q$ if for $\epsilon>0$ we have
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}})\leq\epsilon$$
the feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ is learned by a convolutional attention network or a fully connected attention network parameterized by $\mathbf{W}$
$\implies$ any function that captures the locality of the data should be able to learn a $k$-pattern function
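to make definition 2 concrete, here is a minimal toy sketch (an illustration of the definition, not code from the paper) of a $k$-pattern target on $\pm1$ sequences: the output depends only on a window of $k$ consecutive coordinates, where the index `j_star` is a hypothetical choice

```python
import torch

# toy k-pattern target: f*(x) depends only on k consecutive coordinates
# starting at an arbitrary (hypothetical) index j_star
n, k, j_star = 16, 3, 5

def g(window: torch.Tensor) -> torch.Tensor:
    # g: {±1}^k -> {±1}; here, the parity of the window
    return window.prod(dim=-1)

def f_star(x: torch.Tensor) -> torch.Tensor:
    # f*(x) = g(x_{j*}, ..., x_{j*+k-1}), a k-pattern in the sense of definition 2
    return g(x[..., j_star:j_star + k])

x = torch.randint(0, 2, (4, n)).float() * 2 - 1   # a batch of ±1 sequences
print(f_star(x).shape)                            # torch.Size([4])
```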
given an LR video sequence $\{V_1, ..., V_T\}\sim\mathcal{D}$, where $V_t\in\mathbb{R}^{3\times H\times W}$ is the $t$-th LR frame and $\mathcal{D}$ is a distribution over videos
extract features $\mathcal{X}=\{X_1, ..., X_T\}$ from the LR video frames, where $X_t\in\mathbb{R}^{C\times H\times W}$ is the $t$-th feature
learn a non-linear mapping $F$ to reconstruct HR frames $\widehat{\mathcal{Y}}$ by exploiting spatial-temporal information across the sequence
$$\widehat{\mathcal{Y}}\triangleq(\widehat{Y}_1, ..., \widehat{Y}_T)=F(X_1, ..., X_T)$$
given ground-truth HR frames $\mathcal{Y}=\{Y_1, ..., Y_T\}$, where $Y_t$ is the $t$-th HR frame
minimize a loss function between the generated HR frame $\widehat{Y}_t$ and the ground-truth HR frame $Y_t$
$$\widehat{F}=\underset{F}{\arg\min}\,\mathcal{L}_\mathcal{D}(F)\triangleq\widehat{\mathbb{E}}_{\mathcal{D}, t\in[T]}[d(\widehat{Y}_t, Y_t)]$$
where $d(\cdot, \cdot)$ is a distance metric, such as the L1 loss, L2 loss, or Charbonnier loss
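a minimal PyTorch sketch of this objective with the Charbonnier distance used later in the paper; the $\epsilon$ value and the mean reduction are assumptions, not the paper's exact settings

```python
import torch

def charbonnier(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # d(Y_hat, Y) = sqrt((Y_hat - Y)^2 + eps^2), averaged over frames and pixels
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

# toy shapes: T=5 HR frames of an RGB video
Y_hat = torch.rand(5, 3, 256, 256)   # generated HR frames
Y     = torch.rand(5, 3, 256, 256)   # ground-truth HR frames
loss  = charbonnier(Y_hat, Y)        # empirical E_{t in [T]}[d(Y_hat_t, Y_t)]
```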
for VSR, a sequence model can be used, such as an RNN, LSTM, or Transformer
note that the Transformer is of particular interest since it avoids recursion and thus allows parallel computation in practice
given an input feature $X\in\mathbb{R}^{d\times n}$ ($d$-dimensional embeddings of $n$ tokens)
a Transformer block is a sequence-to-sequence function mapping a sequence in $\mathbb{R}^{d\times n}$ to another sequence in $\mathbb{R}^{d\times n}$
it consists of two parts; the first is a self-attention layer with a skip connection
$$f_1(X)=LN\Big(X+\sum_{i=1}^hW_o^i(W_v^iX)\,\mathrm{SoftMax}\big((W_k^iX)^T(W_q^iX)\big)\Big)$$
where $W_o^i\in\mathbb{R}^{d\times m}$ is a linear layer, $W_v^i, W_k^i, W_q^i\in\mathbb{R}^{m\times d}$ are linear layers mapping the feature to value, key, and query, $h$ is the number of heads, and $m$ is the head size
the second is a token-wise feed-forward layer with a skip connection
$$f_2(X)=LN\big(f_1(X)+W_2\,\mathrm{ReLU}(W_1f_1(X)+b_1\mathbf{1}_n^T)+b_2\mathbf{1}_n^T\big)$$
where $W_1\in\mathbb{R}^{r\times d}, W_2\in\mathbb{R}^{d\times r}$ are linear layers, $b_1\in\mathbb{R}^r, b_2\in\mathbb{R}^d$ are biases, and $r$ is the hidden size of the feed-forward layer
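a minimal PyTorch sketch of this block; tokens are stored as rows ($X^T\in\mathbb{R}^{n\times d}$ for convenience), and the $1/\sqrt{m}$ scaling usually added in practice is omitted to stay close to the formulas above

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaTransformerBlock(nn.Module):
    """Sketch of f_2(f_1(X)); d, m, h, r follow the notation in the text."""
    def __init__(self, d: int, m: int, h: int, r: int):
        super().__init__()
        self.Wq = nn.ModuleList([nn.Linear(d, m, bias=False) for _ in range(h)])
        self.Wk = nn.ModuleList([nn.Linear(d, m, bias=False) for _ in range(h)])
        self.Wv = nn.ModuleList([nn.Linear(d, m, bias=False) for _ in range(h)])
        self.Wo = nn.ModuleList([nn.Linear(m, d, bias=False) for _ in range(h)])
        self.ffn = nn.Sequential(nn.Linear(d, r), nn.ReLU(), nn.Linear(r, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, X):                                # X: (n, d), tokens as rows
        attn = 0
        for Wq, Wk, Wv, Wo in zip(self.Wq, self.Wk, self.Wv, self.Wo):
            A = F.softmax(Wq(X) @ Wk(X).T, dim=-1)       # (n, n) token-to-token weights
            attn = attn + Wo(A @ Wv(X))                  # aggregate values, project back to d
        f1 = self.ln1(X + attn)                          # self-attention + skip + LN
        return self.ln2(f1 + self.ffn(f1))               # token-wise FFN + skip + LN

X = torch.randn(10, 64)                                  # n=10 tokens, d=64
print(VanillaTransformerBlock(d=64, m=16, h=4, r=128)(X).shape)  # torch.Size([10, 64])
```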
The framework of video super-resolution Transformer. Given a low-resolution (LR) video, we first use an extractor to capture features of the LR videos. Then, a spatial-temporal convolutional self-attention and an optical flow-based feed-forward network model a sequence of continuous representations. Note that these two layers both have skip connections. Last, the reconstruction network restores a high-resolution video from the representations and the up-sampling frames.
feature extractor: captures features from the LR input
transformer: maps the features to a sequence of continuous representations
reconstruction: restores HR videos from the representations
loss function: Charbonnier loss
Network architecture of the feature extractor and reconstruction network.
T: number of frames, C: number of channels, H: image height, W: image width
I: number of input channels, O: number of output channels
CONV: convolution with kernel size K, stride S, padding P, and groups G
PixelShuffle: pixel shuffle with an upscale factor of 2
LeakyReLU: Leaky ReLU activation with a negative slope of 0.01
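the exact layer counts and widths of the extractor/reconstruction network are not reproduced here; below is a minimal sketch of how the listed building blocks (CONV, PixelShuffle, LeakyReLU(0.01)) compose into a $4\times$ upsampling head, with placeholder channel sizes rather than the paper's exact design

```python
import torch
import torch.nn as nn

# minimal 4x reconstruction head built from the blocks listed above;
# channel widths and layer counts are placeholders, not the paper's architecture
class ReconstructionHead(nn.Module):
    def __init__(self, c_feat: int = 64):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(c_feat, 4 * c_feat, kernel_size=3, stride=1, padding=1),
            nn.PixelShuffle(2),                     # 2x spatial upscale
            nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(c_feat, 4 * c_feat, kernel_size=3, stride=1, padding=1),
            nn.PixelShuffle(2),                     # another 2x -> 4x total
            nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(c_feat, 3, kernel_size=3, stride=1, padding=1),  # to RGB
        )

    def forward(self, feat):                        # feat: (T, C, H, W)
        return self.up(feat)                        # (T, 3, 4H, 4W)

print(ReconstructionHead()(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 3, 128, 128])
```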
Q: can a fully connected self-attention (FCSA) layer learn $k$-patterns with gradient descent?
theorem 1: assume $m=1$ and $\vert u_i\vert\leq1$, the weights are initialized from some permutation-invariant distribution $\mathcal{W}$ over $\mathbb{R}^n$, and for all $\mathbf{x}$ we have $h_{\mathbf{u}, \mathbf{W}}^{FCSA}(\mathbf{x})\in[-1, 1]$ satisfying definition 2. then the following holds
$$\mathbb{E}_{\mathbf{W}\sim\mathcal{W}}\Big\Vert\frac{\partial}{\partial\mathbf{W}}\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}}^{FCSA})\Big\Vert_2^2\leq qn\min\Big\{\binom{n-1}{k}^{-1}, \binom{n-1}{k-1}^{-1}\Big\}$$
from theorem 1:
$\implies$ the binomial terms make this gradient bound vanishingly small for moderate $k$, so gradient descent makes little progress; the FCSA layer therefore cannot use the spatial information of each frame, since local information is not encoded in the embeddings of all tokens
Illustration of the spatial-temporal convolutional self-attention. The unfold operation is to extract sliding local patches from a batched input feature map, while the fold operation is to combine an array of sliding local patches into a large feature map.
given feature maps of the input video frames $X\in\mathbb{R}^{T\times C\times H\times W}$
step 1: capture the spatial information of each frame in $X$
$$X\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_q, W_k, W_v}Q, K, V\in\mathbb{R}^{T\times C\times H\times W}$$
where $W_q, W_k, W_v$ are three independent convolutional layers
step 2: unfold the features into sliding local patches of size $H_p\times W_p$ in each frame, and reshape them into the query, key, and value matrices
$$Q, K, V\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{\mathrm{unfold}}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{\mathrm{reshape}}\mathbb{R}^{n\_heads\times\frac{CH_pW_p}{n\_heads}\times T\frac{HW}{H_pW_p}}$$
where $n\_patches=\frac{HW}{H_pW_p}$ is the number of patches in each frame, $dim=CH_pW_p$ is the dimension of each patch, and $n\_heads$ is the number of heads
step 3: compute the similarity matrix and aggregate it with the value to obtain the attention output
$$\mathrm{Attention}(Q, K, V)=\mathrm{softmax}\Big(\frac{Q^TK}{\sqrt{d}}\Big)V^T\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}$$
where $d=\frac{CH_pW_p}{n\_heads}$ is the hidden dimension
note that the similarity matrix $Q^TK$ relates all embedding tokens across the whole video sequence
step 4: reshape the attention output, and fold the updated sliding local patches back into feature maps
$$\mathrm{Attention}\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}\xrightarrow{\mathrm{reshape}}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{\mathrm{fold}}\mathbb{R}^{T\times C\times H\times W}$$
step 5: obtain the final features, and produce the output with a skip connection and layer normalization
$$\mathrm{Attention}\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_o}F\in\mathbb{R}^{T\times C\times H\times W}$$
$$f_1(X)=LN(X+F)\in\mathbb{R}^{T\times C\times H\times W}$$
where $W_o$ is a convolutional layer
steps 2 to 4 are inspired by COLA-Net
summarizing the steps above, the STCSA layer is formulated as
$$f_1(X)=LN\Big(X+\sum_{i=1}^hW_o^i\kappa_2\Big(\underbrace{\kappa_1(W_v^iX)}_{\text{value}}\,\mathrm{softmax}\big({\underbrace{\kappa_1(W_k^iX)}_{\text{key}}}^T\underbrace{\kappa_1(W_q^iX)}_{\text{query}}\big)\Big)\Big)$$
where $\kappa_1(\cdot), \kappa_2(\cdot)$ are the unfold and fold operations, and $h$ is the number of heads, set to $h=1$ for good performance
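a minimal PyTorch sketch of the STCSA layer with $h=1$, following steps 1 to 5 above; the patch size, the non-overlapping unfold stride, the $1/\sqrt{d}$ scaling, and the GroupNorm stand-in for $LN$ are assumptions rather than the paper's exact implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STCSA(nn.Module):
    """Sketch of spatial-temporal convolutional self-attention (h = 1)."""
    def __init__(self, C: int, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.Wq = nn.Conv2d(C, C, 3, padding=1)     # step 1: spatial info per frame
        self.Wk = nn.Conv2d(C, C, 3, padding=1)
        self.Wv = nn.Conv2d(C, C, 3, padding=1)
        self.Wo = nn.Conv2d(C, C, 3, padding=1)     # step 5: output conv
        self.norm = nn.GroupNorm(1, C)              # stand-in for LN over (C, H, W)

    def forward(self, X):                           # X: (T, C, H, W)
        T, C, H, W = X.shape
        p = self.patch
        unfold = nn.Unfold(kernel_size=p, stride=p)                  # step 2 (kappa_1)
        fold = nn.Fold(output_size=(H, W), kernel_size=p, stride=p)  # step 4 (kappa_2)

        def to_tokens(Z):                           # (T, C, H, W) -> (T*n_patches, C*p*p)
            return unfold(Z).permute(0, 2, 1).reshape(-1, C * p * p)

        q, k, v = to_tokens(self.Wq(X)), to_tokens(self.Wk(X)), to_tokens(self.Wv(X))
        d = q.shape[-1]
        A = F.softmax(q @ k.transpose(0, 1) / d ** 0.5, dim=-1)     # step 3: similarity over
        out = A @ v                                                 # all patches of all frames
        out = out.reshape(T, -1, C * p * p).permute(0, 2, 1)        # step 4: patch layout
        F_out = self.Wo(fold(out))                                  # fold back to (T, C, H, W)
        return self.norm(X + F_out)                                 # step 5: skip + norm

X = torch.randn(5, 64, 64, 64)                      # T=5, C=64, H=W=64
print(STCSA(C=64)(X).shape)                         # torch.Size([5, 64, 64, 64])
```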
Q: can the STCSA layer learn $k$-patterns with gradient descent?
theorem 2: assume each element of the weights is initialized uniformly from $\{\pm\frac1k\}$. fix some $\delta>0$, some $k$-pattern $f$, and some distribution $\mathcal{D}$. if $q>2^{k+3}\log(\frac{2^k}{\delta})$ and $h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA}$ is a function satisfying definition 2, then with probability at least $1-\delta$ over the initialization, when training a spatial-temporal convolutional self-attention layer using gradient descent with learning rate $\eta$, we have
$$\frac{1}{S}\sum_{s=1}^S\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})\leq\eta^2S^2nk^{\frac52}2^{k+1}+\frac{k^22^{2k+1}}{q\eta S}+\eta nqk$$
from theorem 2:
$\implies$ the STCSA layer trained with gradient descent can capture the locality of each frame
the VSR-Transformer is permutation-invariant, so it requires precise spatial-temporal positional information
3D fixed positional encoding: two spatial components (horizontal and vertical) and one temporal component
$$PE(pos, i)=\begin{cases} \sin(pos\cdot\alpha_k) &\text{if } i=2k \\ \cos(pos\cdot\alpha_k) &\text{if } i=2k+1 \end{cases}$$
where $\alpha_k=1/1000^{2k/\frac{d}{3}}$, $k$ is an integer in $[0, \frac{d}{6})$, $pos$ is the position in the corresponding dimension, and $d$ is the channel dimension size
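a sketch of this 3D encoding: the channel dimension $d$ is split into three groups of $d/3$ for the temporal, vertical, and horizontal positions; the base constant follows the formula above, while the ordering and interleaving of the three groups are assumptions

```python
import torch

def axis_encoding(positions: torch.Tensor, d_axis: int, base: float = 1000.0):
    """Sin/cos encoding for one axis using d_axis channels (d_axis even);
    alpha_k = 1 / base^{2k / d_axis}, as in the formula above."""
    k = torch.arange(d_axis // 2, dtype=torch.float32)
    alpha = 1.0 / base ** (2 * k / d_axis)                     # (d_axis/2,)
    angles = positions[:, None].float() * alpha[None, :]        # (len, d_axis/2)
    pe = torch.zeros(len(positions), d_axis)
    pe[:, 0::2] = torch.sin(angles)                             # even channels: sin
    pe[:, 1::2] = torch.cos(angles)                             # odd channels: cos
    return pe

def positional_encoding_3d(T: int, H: int, W: int, d: int):
    """Concatenate temporal, vertical, horizontal encodings along channels (d % 3 == 0);
    the (t, y, x) ordering of the groups is an assumption."""
    d3 = d // 3
    pe_t = axis_encoding(torch.arange(T), d3)[:, :, None, None]    # (T, d/3, 1, 1)
    pe_y = axis_encoding(torch.arange(H), d3).T[None, :, :, None]  # (1, d/3, H, 1)
    pe_x = axis_encoding(torch.arange(W), d3).T[None, :, None, :]  # (1, d/3, 1, W)
    return torch.cat([pe_t.expand(T, d3, H, W),
                      pe_y.expand(T, d3, H, W),
                      pe_x.expand(T, d3, H, W)], dim=1)            # (T, d, H, W)

print(positional_encoding_3d(T=5, H=16, W=16, d=48).shape)  # torch.Size([5, 48, 16, 16])
```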
Illustration of the bidirectional optical flow-based feed-forward layer. Given a video sequence, we first bidirectionally estimate the forward and backward optical flows and warp the feature maps with the corresponding optical flows. Then we learn a forward and a backward propagation network to produce two sequences of features from the concatenated warped features and LR frames. Last, we fuse these two feature sequences into one feature sequence.
given features $X\in\mathbb{R}^{T\times C\times H\times W}$ output by the STCSA layer
step 1: learn bidirectional optical flows between neighboring frames
$$\overleftarrow{O}_t=\begin{cases} spy(V_1, V_1) &\text{if } t=1 \\ spy(V_{t-1}, V_t) &\text{if } t\in(1, T] \end{cases},\qquad \overrightarrow{O}_t=\begin{cases} spy(V_{t+1}, V_t) &\text{if } t\in[1, T) \\ spy(V_T, V_T) &\text{if } t=T \end{cases}$$
where $\overleftarrow{O}, \overrightarrow{O}\in\mathbb{R}^{T\times2\times H\times W}$ are the backward and forward optical flows; $spy(\cdot, \cdot)$ is a flow estimation function such as SPyNet, which is pre-trained and updated during training
step 2: warp the features with the backward and forward flows to obtain bidirectional features for backward and forward propagation
$$\overleftarrow{X}=warp(X, \overleftarrow{O}),\quad \overrightarrow{X}=warp(X, \overrightarrow{O})$$
where $\overleftarrow{X}, \overrightarrow{X}\in\mathbb{R}^{T\times C\times H\times W}$ are the backward and forward warped features
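a sketch of the $warp(\cdot,\cdot)$ operation using `torch.nn.functional.grid_sample`; the flow convention (channel 0 = horizontal, channel 1 = vertical pixel offsets) and the border padding are assumptions

```python
import torch
import torch.nn.functional as F

def flow_warp(x: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features x (N, C, H, W) with optical flow (N, 2, H, W);
    assumes flow[:, 0] is the horizontal and flow[:, 1] the vertical offset in pixels."""
    N, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()[None]            # (1, 2, H, W) base coords
    coords = grid + flow                                          # displaced sampling positions
    # normalize to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)         # (N, H, W, 2)
    return F.grid_sample(x, grid_norm, mode="bilinear",
                         padding_mode="border", align_corners=True)

X = torch.randn(5, 64, 32, 32)          # features from the STCSA layer
O = torch.zeros(5, 2, 32, 32)           # zero flow -> identity warp
print(torch.allclose(flow_warp(X, O), X, atol=1e-5))   # True
```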
step 3: aggregate the LR frames and warped features, and feed them into 2-layer CNNs for backward and forward propagation
$$f_2(X)=LN\Big(f_1(X)+fusion\big(\overleftarrow{W_1}\,\mathrm{ReLU}(\overleftarrow{W_2}[V, \overleftarrow{X}])+\overrightarrow{W_1}\,\mathrm{ReLU}(\overrightarrow{W_2}[V, \overrightarrow{X}])\big)\Big)$$
where $[\cdot, \cdot]$ is an aggregation (concatenation) operator and $\overleftarrow{W_1}, \overleftarrow{W_2}, \overrightarrow{W_1}, \overrightarrow{W_2}$ are the weights of the backward and forward networks
the 2-layer networks can be extended to multi-layer networks
$$f_2(X)=LN\big(f_1(X)+fusion(R_1(V, \overleftarrow{X})+R_2(V, \overrightarrow{X}))\big)$$
where $R_1, R_2$ are flexible networks
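a minimal sketch of the BOFF wiring, reusing the `flow_warp` sketch above; `spy` is passed in as a callable standing in for a pre-trained SPyNet (not implemented here), and the channel widths, 2-layer propagation nets, and 1×1-conv fusion are placeholder assumptions rather than the paper's exact design

```python
import torch
import torch.nn as nn

class BOFF(nn.Module):
    """Sketch of the bidirectional optical flow-based feed-forward layer."""
    def __init__(self, spy, C: int = 64):
        super().__init__()
        self.spy = spy                               # flow estimator, e.g. pre-trained SPyNet
        self.backward_net = nn.Sequential(nn.Conv2d(3 + C, C, 3, padding=1),
                                          nn.ReLU(inplace=True),
                                          nn.Conv2d(C, C, 3, padding=1))
        self.forward_net = nn.Sequential(nn.Conv2d(3 + C, C, 3, padding=1),
                                         nn.ReLU(inplace=True),
                                         nn.Conv2d(C, C, 3, padding=1))
        self.fusion = nn.Conv2d(C, C, 1)             # placeholder fusion
        self.norm = nn.GroupNorm(1, C)               # stand-in for LN

    def forward(self, f1_X, V):                      # f1_X: (T, C, H, W), V: (T, 3, H, W)
        # step 1: bidirectional flows (end frames duplicated, as in the cases above)
        O_bwd = self.spy(torch.cat([V[:1], V[:-1]]), V)   # spy(V_{t-1}, V_t)
        O_fwd = self.spy(torch.cat([V[1:], V[-1:]]), V)   # spy(V_{t+1}, V_t)
        # step 2: warp the STCSA features with both flows (flow_warp from the sketch above)
        X_bwd, X_fwd = flow_warp(f1_X, O_bwd), flow_warp(f1_X, O_fwd)
        # step 3: propagate the concatenated [frame, warped feature] pairs and fuse
        prop = self.backward_net(torch.cat([V, X_bwd], dim=1)) + \
               self.forward_net(torch.cat([V, X_fwd], dim=1))
        return self.norm(f1_X + self.fusion(prop))

# usage with a dummy zero-flow estimator standing in for SPyNet
dummy_spy = lambda a, b: torch.zeros(a.shape[0], 2, a.shape[2], a.shape[3])
print(BOFF(dummy_spy)(torch.randn(5, 64, 32, 32), torch.randn(5, 3, 32, 32)).shape)
# torch.Size([5, 64, 32, 32])
```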
dataset
| dataset | resolution | training set | testing set |
| --- | --- | --- | --- |
| REDS | $1280\times720$ | 266 clips | REDS4 (4 clips) |
| Vimeo-90K | $448\times256$ | 64,612 clips | Vimeo-90K-T (7,824 clips) |
| Vid4 | $720\times480$ | — | 4 clips, each 34 frames |
experiment detail
Quantitative comparison (PSNR/SSIM) on REDS4 for $4\times$ VSR. The results are tested on RGB channels. Red and blue indicate the best and the second best performance, respectively. “†” means a method trained on 5 frames for a fair comparison.
Qualitative comparison on the REDS4 dataset for $4\times$ VSR. Zoom in for the best view.
key findings
Quantitative comparison (PSNR/SSIM) on Vimeo-90K-T for $4\times$ VSR. Red and blue indicate the best and the second best performance, respectively.
Qualitative comparison on Vimeo-90K-T for $4\times$ VSR. Zoom in for the best view.
key findings
Quantitative comparison (PSNR/SSIM) on Vid4 for $4\times$ VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels. “†” means a method trained and tested on 7 frames for a fair comparison.
Qualitative comparison on Vid4 for $4\times$ VSR. Zoom in for the best view.
w/o optical flow: replace SPyNet in BOFF layer with a stack of Residual ReLU networks
Ablation study on REDS for $4\times$ VSR. Here, “w/o” and “w/ optical flow” mean the VSR-Transformer without and with the optical flow, respectively. Zoom in for the best view.
optical flow is important in the BOFF layer and helps feature propagation and alignment
w/o STCSA: remove STCSA layer
w/o BOFF: replace BOFF layer with a stack of Residual ReLU networks
Ablation study on REDS for $4\times$ VSR. Here, “w/o STCSA” and “w/o BOFF” mean the VSR-Transformer without the spatial-temporal convolutional self-attention (STCSA) layer and the bidirectional optical flow-based feed-forward (BOFF) layer, respectively.
the STCSA layer exploits the locality of data and fuses information among different frames
the BOFF layer helps perform feature propagation and alignment
w/ 3 frames: train model with 3 frames
training with more frames helps restore missing information from neighboring frames