【TPAMI 2022】A Survey on Vision Transformer

Table of Contents

  • WHAT
  • Contents
    • 2. Formulation of Transformer
      • 2.1 Self-Attention
      • 2.2 Other Key Concepts in Transformer
    • 3 VISION TRANSFORMER
      • 3.1 Backbone for Representation Learning
        • 3.1.1 Pure Transformer
        • 3.1.2 Transformer with Convolution
        • 3.1.3 Self-supervised Representation Learning
        • 3.1.4 Discussions
      • 3.2 High/Mid-level Vision
        • 3.2.1 Generic Object Detection
        • 3.2.2 Segmentation
        • 3.2.3 Pose Estimation
        • 3.2.4 Other Tasks
        • 3.2.5 Discussions
      • 3.3 Low-level Vision
        • 3.3.1 Image Generation
        • 3.3.2 Image Processing
      • 3.4 Video Processing
        • 3.4.1 High-level Video Processing
        • 3.4.2 Low-level Video Processing
        • 3.4.3 Discussions
      • 3.5 Multi-Modal Tasks
      • 3.6 Efficient Transformer
        • 3.6.1 Pruning and Decomposition
        • 3.6.2 Knowledge Distillation
        • 3.6.3 Quantization
        • 3.6.4 Compact Architecture Design
    • 4 CONCLUSIONS AND DISCUSSIONS
      • 4.1 Challenges
      • 4.2 Future Prospects

WHAT

  1. In this paper, the authors review these vision transformer models, categorize them according to different tasks, and analyze their advantages and disadvantages.
  2. Main categories explored:
    • backbone networks
    • high/mid-level vision
    • low-level vision
    • video processing
  3. Efficient transformer methods are introduced for deploying the transformer in real device-based applications.
  4. The self-attention mechanism in computer vision is briefly revisited, as it is the basic component of the transformer.
  5. The challenges faced by vision transformers are discussed, and several future research directions are proposed.

The transformer is a new type of neural network. It mainly uses the self-attention mechanism [7], [8] to extract intrinsic features [9] and shows great potential for broad use in AI applications.

Contents

2. Formulation of Transformer

Each transformer block is composed of a multi-head attention layer, a feed-forward neural network, shortcut connection and layer normalization.

2.1 Self-Attention

The input vector is first transformed into three different vectors: the query vector q, the key vector k, and the value vector v, each with dimension dq = dk = dv = dmodel = 512.
Vectors derived from different inputs are then packed together into three different matrices, namely, Q, K and V.
The attention function between different input vectors is then computed as

    Attention(Q, K, V) = softmax(Q·Kᵀ / √dk) · V

that is, the pairwise scores Q·Kᵀ are scaled by √dk for gradient stability, converted into probabilities with a softmax, and used to weight the value vectors.

Vectors with larger probabilities receive additional focus from the following layers.
Note that the preceding process is invariant to the position of each word, meaning that the self-attention layer lacks the ability to capture the positional information of words in a sentence.
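
As a concrete illustration of this formulation, here is a minimal NumPy sketch of scaled dot-product attention (the shapes and the toy input are illustrative, not taken from the survey):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (n_tokens, d_k), V: (n_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> probabilities
    return weights @ V                                # weighted sum of value vectors

# toy usage: 4 tokens with d_model = d_k = d_v = 512
x = np.random.randn(4, 512)
Wq, Wk, Wv = (np.random.randn(512, 512) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)   # (4, 512)
```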

To retain this information, a positional encoding with dimension dmodel is added to the original input embedding. Specifically, the position is encoded with the following equations:

    PE(pos, 2i)   = sin( pos / 10000^(2i / dmodel) )
    PE(pos, 2i+1) = cos( pos / 10000^(2i / dmodel) )
in which pos denotes the position of the word in a sentence, and i represents the current dimension of the positional encoding.
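
A small NumPy sketch of the sinusoidal encoding defined by these equations (assuming an even dmodel):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_positions)[:, None]                 # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)     # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe

# added element-wise to the input embeddings, e.g. x = x + sinusoidal_positional_encoding(len(x), 512)
```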

Multi-Head Attention.
A single self-attention layer limits our ability to focus on one or more specific positions without simultaneously influencing the attention on other equally important positions.
Different heads use different query, key, and value matrices; these matrices are randomly initialized and, after training, project the input vectors into different representation subspaces.
    MultiHead(Q', K', V') = Concat(head_1, ..., head_h) · W^o,   where head_i = Attention(Q_i, K_i, V_i)

Each head attends over its own projected queries, keys, and values of dimension dmodel / h = 64 (for h = 8 heads), and W^o is a linear output projection applied to the concatenated heads.
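
A minimal NumPy sketch of the multi-head computation above; the head count and weight shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Concat(head_1, ..., head_h) W_o, with per-head dimension d_model / h."""
    n, d_model = x.shape
    d_head = d_model // num_heads
    # project once, then split the feature dimension into heads
    Q = (x @ Wq).reshape(n, num_heads, d_head).transpose(1, 0, 2)   # (h, n, d_head)
    K = (x @ Wk).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    V = (x @ Wv).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)             # (h, n, n)
    heads = softmax(scores) @ V                                     # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)           # concatenate heads
    return concat @ Wo

# toy usage: 4 tokens, d_model = 512, 8 heads
x = np.random.randn(4, 512)
Wq, Wk, Wv, Wo = (np.random.randn(512, 512) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads=8)            # (4, 512)
```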

2.2 Other Key Concepts in Transformer

  • FFN: consists of two linear transformation layers with a nonlinear activation function between them, and can be denoted as the following function:

        FFN(X) = W2 · σ(W1 · X)

    where W1 and W2 are the parameter matrices of the two linear layers and σ denotes the nonlinear activation function (e.g., GELU).

  • Residual Connection in the Encoder and Decoder: a residual connection is added around each sub-layer, followed by layer normalization. The output of these operations can be described as:

        LayerNorm(X + Attention(X))

  • Final Layer in the Decoder: implemented with a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with dword dimensions, where dword is the number of words in the vocabulary. The softmax layer then converts the logits into probabilities.
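
Putting the pieces of this section together, here is a minimal PyTorch sketch of one encoder block (multi-head attention and FFN, each wrapped with a residual connection and layer normalization, in the post-norm layout of the original Transformer); the dimensions are illustrative defaults:

```python
import torch
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: multi-head attention + FFN, each followed by
    a residual connection and layer normalization (post-norm layout)."""

    def __init__(self, d_model=512, num_heads=8, d_ffn=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                    # FFN(X) = W2 * act(W1 * X)
            nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                            # x: (batch, tokens, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                 # LayerNorm(X + Attention(X))
        x = self.norm2(x + self.ffn(x))              # LayerNorm(X + FFN(X))
        return x

# y = TransformerEncoderBlock()(torch.randn(2, 16, 512))   # -> (2, 16, 512)
```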

3 VISION TRANSFORMER


3.1 Backbone for Representation Learning



3.1.1 Pure Transformer

ViT

  • Vision Transformer (ViT) [15] is a pure transformer directly applied to sequences of image patches for the image classification task.
  • Figure 5 shows the framework of ViT.

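To make the patch-embedding step concrete, here is a minimal PyTorch sketch of ViT-style tokenization (patch projection, class token, and learnable position embeddings); the sizes are common defaults used for illustration, not necessarily the exact ViT configuration:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches, linearly project each patch,
    then prepend a class token and add learnable position embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # a conv with stride = kernel = patch_size equals patch splitting + linear projection
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

    def forward(self, x):                                   # x: (B, 3, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)   # (B, n_patches, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # one class token per image
        return torch.cat([cls, patches], dim=1) + self.pos_embed

# tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))    # -> (2, 197, 768)
```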

DeiT: Data-efficient image Transformer (DeiT), a competitive convolution-free transformer trained only on the ImageNet database.

Variants of ViT

  • TNT [29]: further divides the patch into a number of subpatches and introduces a novel transformer-in-transformer architecture which utilizes an inner transformer block to model the relationship between sub-patches and an outer transformer block for patch-level information exchange.
  • Swin Transformer [60], [64] performs local attention within each window and introduces a shifted window partitioning approach for cross-window connections (a window-partition sketch follows this list).
  • DeepViT [68] proposes to establish cross-head communication to re-generate the attention maps and increase the diversity at different layers.
  • KVT [69] introduces k-NN attention to utilize the locality of image patches and ignore noisy tokens by computing attention only with the top-k similar tokens.
  • XCiT [71] performs self-attention calculation across feature channels rather than tokens, which allows efficient processing of high-resolution images.
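
To illustrate the (shifted) window idea, the sketch below partitions a token map into non-overlapping local windows and cyclically shifts the map so that the next block's windows straddle the previous block's boundaries; the window size and feature shapes are assumptions for illustration, and the real Swin implementation additionally masks attention across the wrapped-around regions:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) token map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C); attention is then
    computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# shifted windows: cyclically roll the feature map before partitioning so that
# the next block's windows cross the previous block's window boundaries
x = torch.randn(1, 56, 56, 96)                          # (B, H, W, C), illustrative sizes
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))   # shift by window_size // 2 = 3
regular_windows = window_partition(x, window_size=7)         # (64, 49, 96)
shifted_windows = window_partition(shifted, window_size=7)   # (64, 49, 96)
```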

The computational complexity and attention precision of the self-attention mechanism are two key points for future optimization.

3.1.2 Transformer with Convolution

There are still gaps in performance between transformers and existing CNNs. One main reason is the lack of ability to extract local information. Combining the transformer with convolution can be a more straightforward way to introduce locality into the conventional transformer.

3.1.3 Self-supervised Representation Learning

Generative Based Approach.

  • Take iGPT [14] as an example.
  • It consists of a pre-training stage followed by a fine-tuning stage.
  • During the pre-training stage, auto-regressive and BERT objectives are explored. To implement pixel prediction, a sequence transformer architecture is adopted instead of language tokens (as used in NLP).
  • When combined with early stopping, pre-training can be regarded as a favorable initialization or as a regularizer.
  • During the fine-tuning stage, they add a small classification head to the model. This helps optimize a classification objective and adapts all weights.



The differences between iGPT and ViT-like models mainly lie in three aspects:

  • The input of iGPT is a sequence of color palettes obtained by clustering pixels, while ViT uniformly divides the image into a number of local patches;
  • The architecture of iGPT is an encoder-decoder framework, while ViT only has a transformer encoder;
  • iGPT utilizes an auto-regressive self-supervised loss for training, while ViT is trained on a supervised image classification task.

Contrastive Learning Based Approach.

  • The MoCo v3 framework is an incremental improvement of MoCo [112].
  • The authors take two crops of each image under random data augmentation. The crops are encoded by two encoders, fq and fk, outputting vectors q and k.
  • The encoder fq consists of a backbone (e.g., ViT), a projection head, and an extra prediction head, while the encoder fk has the backbone and projection head but not the prediction head. fk is updated by the moving average of fq, excluding the prediction head.

The model can still be unstable if the learning rate is too large, and the first layer is unlikely to be the essential reason for the instability.
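
A rough sketch of the moving-average (momentum) update described above; the toy backbone and the 0.99 coefficient are placeholders, not MoCo v3's actual architecture or hyper-parameters:

```python
import copy
import torch
from torch import nn

# f_q (backbone + projection + prediction head) is trained by gradients;
# f_k (backbone + projection head) is a moving average of f_q's shared parts
backbone_q = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256))
backbone_k = copy.deepcopy(backbone_q)          # same initialization, no gradient updates
for p in backbone_k.parameters():
    p.requires_grad = False

@torch.no_grad()
def momentum_update(model_q, model_k, m=0.99):
    """k <- m * k + (1 - m) * q, applied after every optimizer step on f_q."""
    for p_q, p_k in zip(model_q.parameters(), model_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

# after each training step on the query encoder:
momentum_update(backbone_q, backbone_k)
```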

3.1.4 Discussions

The main components of the vision transformer include:

  • multihead self-attention,
  • multi-layer perceptron,
  • shortcut connection,
  • layer normalization,
  • positional encoding
  • network topology

From the results in Figure 6, we can see that combining CNNs and transformers achieves better performance, indicating that they complement each other: convolution captures local connections while self-attention captures global connections.

3.2 High/Mid-level Vision

  • object detection [16], [17], [113], [114], [115],
  • lane detection [116],
  • segmentation [33], [25], [18]
  • pose estimation [34], [35], [36], [117].

3.2.1 Generic Object Detection

Transformer-based object detection methods can be broadly categorized into two groups:

  • transformer-based set prediction methods [16], [17], [120], [121], [122]
  • transformer-based backbone methods [113], [115]

Compared with CNN-based detectors, transformer-based detectors show strong performance in terms of both accuracy and running speed.

Transformer-based Set Prediction for Detection:

  1. DETR: a simple and fully end-to-end object detector that treats object detection as an intuitive set prediction problem, eliminating traditional hand-crafted components such as anchor generation and non-maximum suppression (NMS) post-processing (a minimal bipartite-matching sketch follows this list).

  2. Deformable DETR: deformable attention module attends to a small set of key positions around a reference point rather than looking at all spatial locations on image feature maps as performed by the original multi-head attention mechanism in transformer
    This approach greatly reduces the computational complexity and leads to faster convergence.
    The deformable attention module can easily be applied for fusing multi-scale features.

  3. TSP-FCOS and TSP-RCNN: a new bipartite matching scheme is designed for greater training stability and faster convergence, and two transformer-based set prediction models are proposed accordingly.

  4. Spatially Modulated Co-Attention (SMCA): to accelerate the convergence by constraining co-attention responses to be high near initially estimated bounding box locations.

  5. Adaptive Clustering Transformer (ACT) : to reduce the computation cost of pre-trained DETR. ACT adaptively clusters the query features using a locality sensitivity hashing (LSH) method and broadcasts the attention output to the queries represented by the selected prototypes.
    ACT replaces the self-attention module of the pre-trained DETR model without requiring any retraining.
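
To make the set-prediction idea concrete, here is a minimal sketch of DETR-style bipartite (Hungarian) matching between object queries and ground-truth boxes; the cost below uses only class probability and L1 box distance, which is a simplification of DETR's full matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Return (pred_idx, gt_idx) pairs minimizing the total matching cost.

    pred_probs: (num_queries, num_classes) softmax scores
    pred_boxes: (num_queries, 4), gt_boxes: (num_gt, 4)
    """
    # classification cost: negative probability of the ground-truth class
    cls_cost = -pred_probs[:, gt_labels]                           # (num_queries, num_gt)
    # box cost: L1 distance between predicted and ground-truth boxes
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cls_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                 # Hungarian algorithm
    return pred_idx, gt_idx

# toy usage: 100 object queries, 3 ground-truth objects
probs = np.random.dirichlet(np.ones(92), size=100)
boxes = np.random.rand(100, 4)
p_idx, g_idx = hungarian_match(probs, boxes,
                               gt_labels=np.array([3, 17, 42]),
                               gt_boxes=np.random.rand(3, 4))
```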

Transformer-based Backbone for Detection

  • ViT-FRCNN: The input image is divided into several patches and fed into a vision transformer, whose output embedding features are reorganized according to spatial information before passing through a detection head for the final results.
  • A massively pre-trained transformer backbone could bring benefits to the proposed ViT-FRCNN.
  • Quite a few methods also explore versatile vision transformer backbone designs [29], [72], [60], [62] and transfer these backbones to traditional detection frameworks such as RetinaNet [127] and Cascade R-CNN [128]. For example, Swin Transformer [60] obtains about 4 box AP gains over a ResNet-50 backbone with similar FLOPs across various detection frameworks.

Pre-training for Transformer-based Object Detection.

  • Dai et al. [32] proposed unsupervised pre-training for object detection (UP-DETR). Specifically, a novel unsupervised pretext task named random query patch detection is proposed to pre-train the DETR model.
  • UP-DETR still outperforms DETR, demonstrating the effectiveness of the unsupervised pre-training scheme.
  • Fang et al. [126] explored how to transfer the pure ViT structure that is pre-trained on ImageNet to the more challenging object detection task and proposed the YOLOS detector.
  • The proposed YOLOS first drops the classification token in ViT and appends learnable detection tokens. In addition, a bipartite matching loss is utilized to perform set prediction for objects. With this simple pre-training scheme on the ImageNet dataset, the proposed YOLOS shows competitive performance for object detection on the COCO benchmark.

3.2.2 Segmentation

  • panoptic segmentation
  • instance segmentation
  • semantic segmentation

Transformer for Panoptic Segmentation.

  • Wang et al. [25] proposed Max-DeepLab to directly predict panoptic segmentation results with a mask transformer, without involving surrogate sub-tasks such as box detection.
  • Max-DeepLab streamlines the panoptic segmentation tasks in an end-to-end fashion and directly predicts a set of nonoverlapping masks and corresponding labels.
  • MaxDeepLab adopts a dual-path framework that facilitates combining the CNN and transformer.

Transformer for Instance Segmentation.

  • VisTR, a transformerbased video instance segmentation model, was proposed by Wang et al. [33] to produce instance prediction results from a sequence of input images. A strategy for matching instance sequence is proposed to assign the predictions with ground truths. In order to obtain the mask sequence for each instance, VisTR utilizes the instance sequence segmentation module to accumulate the mask features from multiple frames and segment the mask sequence with a 3D CNN.
  • Hu et al. [130] proposed an instance segmentation Transformer (ISTR) to predict low-dimensional mask embeddings, and match them with ground truth for the set loss. ISTR conducted detection and segmentation with a recurrent refinement strategy which is different from the existing top-down and bottom-up frameworks.

Transformer for Semantic Segmentation.

  • Zheng et al. [18] proposed a transformer-based semantic segmentation network (SETR). SETR utilizes an encoder similar to ViT [15] to extract features from the input image, and adopts a multi-level feature aggregation module for performing pixel-wise segmentation.
  • Strudel et al. [134] introduced Segmenter which relies on the output embedding corresponding to image patches and obtains class labels with a point-wise linear decoder or a mask transformer decoder.
  • Xie et al. [135] proposed a simple, efficient yet powerful semantic segmentation framework which unifies transformers with lightweight multilayer perceptron (MLP) decoders, outputs multiscale features, and avoids complex decoders.

Transformer for Medical Image Segmentation.

  • Cao et al. [30] proposed a Unet-like pure transformer for medical image segmentation, feeding the tokenized image patches into a transformer-based U-shaped encoder-decoder architecture with skip connections for local-global semantic feature learning.
  • Valanarasu et al. [136] explored transformer-based solutions, studied the feasibility of using transformer-based network architectures for medical image segmentation tasks, and proposed a Gated Axial-Attention model that extends existing architectures by introducing an additional control mechanism in the self-attention module.

3.2.3 Pose Estimation

Transformer for Hand Pose Estimation.

  • Huang et al. [34] proposed a transformer based network for 3D hand pose estimation from point sets.
    • The encoder first utilizes a PointNet [138] to extract point-wise features from input point clouds and then adopts standard multi-head self-attention module to produce embeddings.
    • a feature extractor such as PointNet++ [139] is used to extract hand joint-wise features, which are then fed into the decoder as positional encodings.
  • Huang et al. [35] proposed HOT-Net (short for hand-object transformer network) for 3D hand-object pose estimation.
    • HOT-Net uses a ResNet to generate initial 2D hand-object pose and then feeds it into a transformer to predict the 3D hand-object pose.
    • A spectral graph convolution network is therefore used to extract input embeddings for the encoder.
  • Hampali et al. [140] proposed to estimate the 3D poses of two hands given a single color image.
    • Appearance and spatial encodings of a set of potential 2D locations for the joints of both hands are input to a transformer, and attention mechanisms are used to sort out the correct configuration of the joints and output the 3D poses of both hands.

Transformer for Human Pose Estimation.

  • Lin et al. [36] proposed a mesh transformer (METRO) for predicting 3D human pose and mesh from a single RGB image.
    • METRO extracts image features via a CNN and then performs positional encoding by concatenating a template human mesh to the image features.
    • A multi-layer transformer encoder with progressive dimensionality reduction is proposed to gradually reduce the embedding dimensions and finally produce the 3D coordinates of human joints and mesh vertices.
    • METRO randomly masks some input queries during training.
  • Yang et al. [117] constructed an explainable model named TransPose based on Transformer architecture and low-level convolutional blocks.
    • The attention layers built into the transformer can capture long-range spatial relationships between keypoints and explain which dependencies the predicted keypoint locations rely on.
  • Li et al. [141] proposed a novel approach based on Token representation for human Pose estimation (TokenPose).
  • Mao et al. [142] proposed a human pose estimation framework that solved the task in the regression-based fashion.
  • Jiang et al. [143] proposed a novel transformer based network that can learn a distribution over both pose and motion in an unsupervised fashion rather than tracking body parts and trying to temporally smooth them.
  • Hao et al. [144] proposed to personalize a human pose estimator given a set of test images of a person without using any manual annotations

3.2.4 Other Tasks

Pedestrian Detection.

  • End-to-end Detector (PED): employs a new decoder called Dense Queries and Rectified Attention field (DQRF) to support dense queries and alleviate the noisy or narrow attention field of the queries.
  • They also proposed V-Match, which achieves additional performance improvements by fully leveraging visible annotations.

Lane Detection

  • LSTR: improves performance of curve lane detection by learning the global context with a transformer network.
  • LSTR regards lane detection as a task of fitting lanes with polynomials and uses neural networks to predict the parameters of polynomials.
  • Liu et al. [147] utilized a transformer encoder structure for more efficient context feature extraction.

Scene Graph.
A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene [148].

  • Graph R-CNN [149] utilizes self-attention to integrate contextual information from neighboring nodes in the graph.
  • Sharifzadeh et al. [150] employed transformers over the extracted object embedding.
  • Sharifzadeh et al. [151] proposed a new pipeline called Texema and employed a pre-trained Text-to-Text Transfer Transformer (T5) [152] to create structured graphs from textual input and utilized them to improve the relational reasoning module.

Tracking.
The transformer has also been applied to the object tracking task, e.g., TMT [153], TrTr [154] and TransT [155]. All these works use a Siamese-like tracking pipeline for video object tracking and utilize an encoder-decoder network to replace the explicit cross-correlation operation, capturing global and rich contextual inter-dependencies.

In these frameworks, the transformer encoder and decoder are assigned to the template branch and the search branch, respectively.

  • Sun et al. proposed TransTrack [156], which is an online joint-detection-and-tracking pipeline.

Re-Identification.

  • He et al. [157] proposed TransReID to investigate the application of pure transformers in the field of object re-identification (ReID).
  • Both Liu et al. [158] and Zhang et al. [159] provided solutions for introducing the transformer network into video-based person Re-ID.

Point Cloud Learning.

  • Guo et al. [161] proposed a novel framework that replaces the original self-attention module with a more suitable offset-attention module, which includes an implicit Laplacian operator and normalization refinement.
  • Zhao et al. [162] designed a novel transformer architecture called Point Transformer.

3.2.5 Discussions

  1. The key issues that need to be resolved before the transformer can be adopted for high-level tasks relate to input embedding, position encoding, and prediction loss.
  2. Nevertheless, exploration into the use of transformers for high-level vision tasks is still at a preliminary stage, so further research may prove beneficial.

3.3 Low-level Vision

These tasks often take images as outputs (e.g., high-resolution or denoised images), which makes them more challenging than high-level vision tasks.

3.3.1 Image Generation

Jiang et al. [38] proposed TransGAN, which builds a GAN using the transformer architecture, as shown in Figure 9 (a).

  • Kwonjoon Lee et al. [163] proposed ViTGAN, which introduces several techniques to both the generator and the discriminator to stabilize training and convergence.
  • ViTGAN is the first work to demonstrate that transformer-based GANs can achieve performance comparable to state-of-the-art CNN-based GANs.
  • Parmar et al. [27] proposed Image Transformer, taking the first step toward generalizing the transformer model to formulate image translation and generation tasks in an auto-regressive manner

Image Transformer consists of two parts:

  1. an encoder for extracting image representation
  2. a decoder to generate pixels.
  • Esser et al. [37] proposed Taming Transformer. Taming Transformer consists of two parts: a VQGAN and a transformer. VQGAN is a variant of VQVAE [164], which uses a discriminator and perceptual loss to improve the visual quality.
  • DALL·E [41] proposed the transformer model for text-to-image generation, which synthesizes images according to the given captions.
    • The whole framework consists of two stages.
    • In the first stage, a discrete VAE is utilized to learn the visual codebook.
    • In the second stage, the text is encoded by BPE and the corresponding image is encoded into tokens by the dVAE learned in the first stage.
    • An autoregressive transformer is then used to learn the prior over the encoded text and image tokens.

3.3.2 Image Processing

  • Yang et al. [39] proposed Texture Transformer Network for Image Super-Resolution (TTSR), using the transformer architecture in the reference-based image super-resolution problem.
  • Chen et al. [19] proposed Image Processing Transformer (IPT), which fully utilizes the advantages of transformers by using large pre-training datasets.
  • Wang et al. [165] proposed SceneFormer to utilize the transformer in 3D indoor scene generation. By treating a scene as a sequence of objects, the transformer decoder can be used to predict a series of objects along with their locations, categories, and sizes.


3.4 Video Processing

3.4.1 High-level Video Processing

Video Action Recognition.

  • Rohit et al. proposed the Action Transformer [167] to model the underlying relationship between the human of interest and the surrounding context.
    • I3D [169] is used as the backbone to extract high-level feature maps.
  • Lohit et al. [170] proposed an interpretable differentiable module, named temporal transformer network, to reduce the intra-class variance and increase the inter-class variance.
  • Fayyaz and Gall proposed a temporal transformer [171] to perform action recognition tasks under weakly supervised settings.
  • Gavrilyuk et al. proposed an actor-transformer [173] architecture to learn the representation, using the static and dynamic representations generated by the 2D and 3D networks as input.

Video Retrieval.

  • Shao et al. [174] suggested using the transformer to model the long-range semantic dependency.
  • Gabeur et al. [175] presented a multi-modal transformer to learn different cross-modal cues in order to represent videos.

Video Object Detection.

  • Chen et al. introduced the memory enhanced global-local aggregation (MEGA) [176] to capture more content.
  • Yin et al. [177] proposed a spatiotemporal transformer to aggregate spatial and temporal information.
    Together with another spatial feature encoding component, these two components perform well on 3D video object detection tasks.

Multi-task Learning.

  • Seong et al. proposed the video multi-task transformer network [178], which handles multi-task learning on untrimmed videos.

3.4.2 Low-level Video Processing

Frame/Video Synthesis.

  • Liu et al. proposed ConvTransformer [166], which comprises five components: feature embedding, position encoding, encoder, query decoder, and the synthesis feed-forward network.
    • Compared with LSTM-based works, ConvTransformer achieves superior results with a more parallelizable architecture.
  • Another transformer-based approach was proposed by Schatz et al. [179], which uses a recurrent transformer network to synthesize human actions from novel views.

Video Inpainting: Video inpainting tasks involve completing any missing regions within a frame.

  • Zeng et al. proposed a spatial-temporal transformer network [28], which uses all the input frames as input and fills them in parallel.
    • The spatial-temporal adversarial loss is used to optimize the transformer network.

3.4.3 Discussions

3.5 Multi-Modal Tasks

Multi-modal tasks (e.g., video-text, image-text, and audio-text)

  • VideoBERT [180], which uses a CNN-based module to pre-process videos in order to obtain representation tokens. A transformer encoder is then trained on these tokens to learn the video-text representations for downstream tasks such as video captioning.

  • VisualBERT [181] and VL-BERT [182], which adopt a single-stream unified transformer to capture visual elements and image-text relationship for downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR)

  • SpeechBERT [183] explores the possibility of encoding audio and text pairs with a transformer encoder to process audio-text tasks such as speech question answering (SQA).

  • Contrastive Language-Image Pre-training (CLIP) [40] takes natural language as supervision to learn more efficient image representation.

  • DALL-E [41] synthesizes new images of categories described in an input text.

  • Ding et al. proposed CogView [42], a transformer with a VQ-VAE tokenizer similar to DALL-E, but supporting Chinese text input.

  • Unified Transformer (UniT) [43] model is proposed to cope with multi-modal multi-task learning, which can simultaneously handle multiple tasks across different domains, including object detection, natural language understanding and vision-language reasoning.

3.6 Efficient Transformer

We review the research carried out on compressing and accelerating transformer models for efficient implementation.
This includes:

  1. network pruning,
  2. low-rank decomposition,
  3. knowledge distillation,
  4. network quantization,
  5. compact architecture design.

Table 4 lists some representative works for compressing transformer-based models.

3.6.1 Pruning and Decomposition

  • Michel et al. [44] presented empirical evidence that a large percentage of attention heads can be removed at test time without significantly impacting performance (a toy sketch follows this list).
  • Dalvi et al. [184] analyzed the redundancy in pre-trained transformer models from two perspectives:
    • general redundancy
    • task-specific redundancy.
  • Prasanna et al. [184] analyzed lottery tickets in BERT and showed that good sub-networks also exist in transformer-based models; reducing both the FFN layers and attention heads makes it possible to achieve high compression rates.
  • For the vision transformer [15], which splits an image into multiple patches, Tang et al. [186] proposed to reduce the patch computation to accelerate inference; the redundant patches can be automatically discovered by considering their contributions to the effective output features.
  • Zhu et al. [187] extended the network slimming approach [188] to vision transformers for reducing the dimensions of linear projections in both FFN and attention modules.
  • Fan et al. [198] proposed a layer-wise dropping strategy to regularize the training of models; whole layers are then removed together at the test phase.
  • Wang et al. [200] decomposed the standard matrix multiplication in transformer models, improving the inference efficiency.
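
As a toy sketch of the head-pruning idea, the snippet below keeps only the highest-scoring attention heads; the norm-based importance score is a placeholder for illustration, not the criterion used in [44]:

```python
import torch

def prune_heads_by_score(head_outputs, scores, keep_ratio=0.5):
    """Zero out the least important attention heads.

    head_outputs: (num_heads, tokens, d_head) per-head outputs
    scores: (num_heads,) importance scores
    """
    num_keep = max(1, int(keep_ratio * len(scores)))
    keep = torch.topk(scores, num_keep).indices      # indices of heads to keep
    mask = torch.zeros(len(scores), 1, 1)
    mask[keep] = 1.0
    return head_outputs * mask                       # pruned heads contribute nothing

heads = torch.randn(8, 16, 64)                       # 8 heads, 16 tokens, d_head = 64
scores = heads.norm(dim=(1, 2))                      # placeholder importance score
pruned = prune_heads_by_score(heads, scores, keep_ratio=0.5)
```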

3.6.2 Knowledge Distillation

Knowledge distillation aims to train student networks by transferring knowledge from large teacher networks [201], [202], [203].

  • Mukherjee et al. [204] used the pre-trained BERT [10] as a teacher to guide the training of small models, leveraging large amounts of unlabeled data.
  • Wang et al. [205] trained the student networks to mimic the output of the self-attention layers in pre-trained teacher models.
  • A teacher's assistant [206] is also introduced in [205], reducing the gap between large pre-trained transformer models and compact student networks, thereby facilitating the mimicking process.
  • Jiao et al. [45] designed different objective functions to transfer knowledge from teachers to students.
  • Jia et al. [207] proposed a fine-grained manifold distillation method, which excavates effective knowledge through the relationship between images and the divided patches.
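
These methods build on the classic soft-target distillation loss; here is a minimal sketch (the temperature and mixing weight are illustrative, and the cited works use more elaborate objectives):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KD: KL divergence between temperature-softened teacher and
    student distributions, mixed with ordinary cross-entropy on hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy usage: batch of 8, 1000 classes
loss = distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                         torch.randint(0, 1000, (8,)))
```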

3.6.3 Quantization

Quantization aims to reduce the number of bits needed to represent network weights or intermediate features [208], [209]. Quantization methods for general neural networks have been discussed at length and achieve performance on par with the original networks [210], [211], [212].

  • Shridhar et al. [215] suggested embedding the input into binary high-dimensional vectors, and then using the binary input representation to train the binary neural networks.
  • Cheong et al. [216] represented the weights in transformer models with low-bit (e.g., 4-bit) representations.
  • Zhao et al. [217] empirically investigated various quantization methods and showed that k-means quantization has huge development potential.
  • Prato et al. [46] proposed a fully quantized transformer, which, as the paper claims, is the first 8-bit model not to suffer any loss in translation quality.
  • Liu et al. [218] explored a post-training quantization scheme to reduce the memory storage and computational costs of vision transformers.
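
A minimal sketch of uniform post-training weight quantization (per-tensor, asymmetric, 8-bit); the actual schemes in [46], [216], [218] are considerably more elaborate:

```python
import numpy as np

def quantize_uint8(w):
    """Uniform affine quantization of a weight tensor to 8-bit integers."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    q = np.round((w - w_min) / scale).astype(np.uint8)    # stored 8-bit weights
    return q, scale, w_min

def dequantize(q, scale, w_min):
    return q.astype(np.float32) * scale + w_min           # values used at inference

w = np.random.randn(512, 512).astype(np.float32)
q, scale, zero = quantize_uint8(w)
err = np.abs(dequantize(q, scale, zero) - w).max()        # small reconstruction error
```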

3.6.4 Compact Architecture Design

  • Jiang et al. [47] simplified the calculation of self-attention by proposing a new module, called span-based dynamic convolution, that combines fully-connected layers and convolutional layers.

  • Interesting “hamburger” layers are proposed in [220], using matrix decomposition to substitute the original self-attention layers

  • Su et al. [82] searched over the patch size, the dimensions of the linear projections, and the number of attention heads to obtain an efficient vision transformer.

  • Li et al. [223] explored a self-supervised search strategy to obtain a hybrid architecture composed of both convolutional modules and self-attention modules.

  • Katharopoulos et al. [224] approximated self-attention as a linear dot-product of kernel feature maps and revealed the relationship between tokens via RNNs (a rough sketch of this linear attention follows).
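
A rough sketch of the kernelized linear-attention idea, using the elu(x)+1 feature map; this illustrates the general technique rather than the exact formulation in [224]:

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1, a simple positive kernel feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Approximate softmax attention with phi(Q) (phi(K)^T V), reducing the cost
    from O(n^2 d) to O(n d^2) in the sequence length n."""
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)       # (n, d)
    context = Kp.T @ V                                    # (d, d_v), no n x n matrix
    normalizer = Qp @ Kp.sum(axis=0, keepdims=True).T     # (n, 1)
    return (Qp @ context) / (normalizer + eps)

Q = np.random.randn(1024, 64); K = np.random.randn(1024, 64); V = np.random.randn(1024, 64)
out = linear_attention(Q, K, V)                           # (1024, 64)
```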

The preceding methods take different approaches in how they attempt to identify redundancy in transformer models (see Figure 13).

4 CONCLUSIONS AND DISCUSSIONS

4.1 Challenges

Although researchers have proposed many transformer-based models to tackle computer vision tasks, these works are only the first steps in this field and still have much room for improvement.

The transformer architecture in ViT [15] follows the standard transformer for NLP [9], but an improved version specifically designed for CV remains to be explored.

The generalization and robustness of transformers for computer vision are also challenging.

  • Compared with CNNs, pure transformers lack some inductive biases and rely heavily on massive datasets for large-scale training [15].
  • The quality of data has a significant influence on the generalization and robustness of transformers.
  • There is still a long way to go in order to better generalize pre-trained transformers on more generalized visual tasks.
  • Although the robustness has been investigated in [232], [233], [234], it is still an open problem waiting to be solved.
  • It remains a challenging subject to clearly explain why the transformer works well on visual tasks.
  • Position embeddings are added into image patches to retain positional information, which is important in computer vision tasks.
  • Developing efficient transformer models for CV remains an open problem.
  • Although several methods have been proposed to compress transformers, they remain highly complex.
  • Consequently, efficient transformer models are urgently needed so that vision transformer can be deployed on resource-limited devices.

4.2 Future Prospects

  • One direction is the effectiveness and the efficiency of transformers in computer vision.
    • The goal is to develop highly effective and efficient vision transformers;
    • transformers with high performance and low resource cost.
    • The effectiveness is usually correlated with the efficiency, so determining how to achieve a better balance between them is a meaningful topic for future study.
  • Most of the existing vision transformer models are designed to handle only a single task.
    • We believe that more tasks can be involved in only one model. Unifying all visual tasks and even other tasks in one transformer (i.e., a grand unified model) is an exciting topic.
  • CNNs perform well on small datasets, whereas transformers perform better on large datasets. The question for the future is whether to use CNNs or transformers.
  • By training with large datasets, transformers can achieve state-of-the-art performance on both NLP [11], [10] and CV benchmarks [15]. It is possible that neural networks need big data rather than inductive bias.
  • Can transformers obtain satisfactory results with a very simple computational paradigm (e.g., only fully connected layers) and massive data for training?
