Attn: Illustrated Attention
Raimi Karim
Jan 20 · 12 min read
(TLDR: animation for attention here)
For decades, Statistical Machine Translation was the dominant translation paradigm [9], until the birth of Neural Machine Translation (NMT). NMT is an emerging approach to machine translation that attempts to build and train a single, large neural network that reads an input text and outputs a translation [1].
The pioneers of NMT are the proposals from Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b), of which the most familiar framework is the sequence-to-sequence (seq2seq) learning from Sutskever et al. This article is based on the seq2seq framework and shows how attention can be built on top of it.
Fig. 0.1: seq2seq with an input sequence of length 4
In seq2seq, the idea is to have two recurrent neural networks (RNNs) with an encoder-decoder architecture: read the input words one by one to obtain a vector representation of a fixed dimensionality (encoder), and, conditioned on this representation, extract the output words one by one using another RNN (decoder). Explanation adapted from [5].
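For readers who prefer code, here is a minimal PyTorch sketch of such a vanilla seq2seq model (the GRU cells, layer sizes and class structure are illustrative assumptions, not the exact setup from [5]):

import torch.nn as nn

# A bare-bones seq2seq skeleton: the encoder RNN compresses the source sentence into its
# final hidden state, and the decoder RNN generates the target conditioned on that state.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, last_hidden = self.encoder(self.src_embed(src_tokens))   # fixed-size summary of the input
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), last_hidden)
        return self.out(dec_out)                                    # logits for each target time step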
Fig. 0.2: seq2seq with an input sequence of length 64
The trouble with seq2seq is that the only information that the decoder receives from the encoder is the last encoder hidden state (the 2 red nodes in Fig. 0.1), a vector representation which is like a numerical summary of an input sequence. So, for a long input text (Fig. 0.2), we unreasonably expect the decoder to use just this one vector representation (hoping that it ‘sufficiently summarises the input sequence’) to output a translation. This might lead to catastrophic forgetting. This paragraph has 100 words. Can you translate this paragraph to another language you know, right after this question mark?
If we can’t, then we shouldn’t be so cruel to the decoder. What if, instead of just one vector representation, we give the decoder a vector representation from every encoder time step so that it can make well-informed translations? Enter attention.
Fig. 0.3: Adding attention as an interface between the encoder and decoder. Here, the first decoder time step is getting ready to receive information from the encoder before giving the first translated word.
Attention is an interface between the encoder and decoder that provides the decoder with information from every encoder hidden state (apart from the hidden state in red in Fig. 0.3). With this setting, the model is able to selectively focus on useful parts of the input sequence and hence, learn the alignment between them. This helps the model to cope effectively with long input sentences [9].
Definition: alignment
Alignment means matching segments of original text with their corresponding segments of the translation. Definition adapted from here.
Fig. 0.4: Alignment for the French word ‘la’ is distributed across the input sequence, but mainly on these 4 words: ‘the’, ‘European’, ‘Economic’ and ‘Area’. Darker purple indicates better attention scores (Image source)
There are 2 types of attention, as introduced in [2]. The type of attention that uses all the encoder hidden states is also known as global attention. In contrast, local attention uses only a subset of the encoder hidden states. As the scope of this article is global attention, any references made to “attention” in this article are taken to mean “global attention”.
This article provides a summary of how attention works using animations, so that we can understand it without (or after having read a paper or tutorial full of) mathematical notation. As examples, I will be sharing 4 NMT architectures that were designed in the past 5 years. I will also be peppering this article with some intuitions on some concepts, so keep a lookout for them!
Contents
1. Attention: Overview
2. Attention: Examples
3. Summary
Appendix: Score Functions
Before we look at how attention is used, allow me to share with you the intuition behind a translation task using the seq2seq model.
Intuition: seq2seq
A translator reads the German text from start till the end. Once done, he starts translating to English word by word. It is possible that if the sentence is extremely long, he might have forgotten what he has read in the earlier parts of the text.
So that’s a simple seq2seq model. The step-by-step attention-layer calculation that I am about to go through belongs to a seq2seq+attention model. Here’s a quick intuition on this model.
Intuition: seq2seq + attention
A translator reads the German text while writing down the keywords from the start till the end, after which he starts translating to English. While translating each German word, he makes use of the keywords he has written down.
Attention places different focus on different words by assigning each word a score. Then, using the softmaxed scores, we aggregate the encoder hidden states with a weighted sum to obtain the context vector. The implementation of an attention layer can be broken down into 4 steps.
Step 0: Prepare hidden states.
Let’s first prepare all the available encoder hidden states (green) and the first decoder hidden state (red). In our example, we have 4 encoder hidden states and the current decoder hidden state. (Note: the last consolidated encoder hidden state is fed as input to the first time step of the decoder. The output of this first time step of the decoder is called the first decoder hidden state, as seen below.)
Fig. 1.0: Getting ready to pay attention
Step 1: Obtain a score for every encoder hidden state.
A score (scalar) is obtained by a score function (also known as alignment score function [2] or alignment model [1]). In this example, the score function is a dot product between the decoder and encoder hidden states.
See Appendix A for a variety of score functions.
Fig. 1.1: Get the scores
decoder_hidden = [10, 5, 10]

encoder_hidden  score
---------------------
[0, 1, 1]       15 (= 10×0 + 5×1 + 10×1, the dot product)
[5, 0, 1]       60
[1, 1, 0]       15
[0, 5, 1]       35
In the above example, we obtain a high attention score of 60 for the encoder hidden state [5, 0, 1]. This means that the next word to be translated is going to be heavily influenced by this encoder hidden state.
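To make the arithmetic concrete, here is a minimal NumPy sketch of this scoring step, using the same toy vectors as above:

import numpy as np

decoder_hidden = np.array([10, 5, 10])
encoder_hiddens = np.array([[0, 1, 1],
                            [5, 0, 1],
                            [1, 1, 0],
                            [0, 5, 1]])

# Dot-product score between the decoder hidden state and every encoder hidden state
scores = encoder_hiddens @ decoder_hidden
print(scores)  # [15 60 15 35]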
Step 2: Run all the scores through a softmax layer.
We pass the scores through a softmax layer so that the softmaxed scores (scalars) add up to 1. These softmaxed scores represent the attention distribution [3, 10].
Fig. 1.2: Get the softmaxed scores
encoder_hidden  score  score^
-----------------------------
[0, 1, 1]       15     0
[5, 0, 1]       60     1
[1, 1, 0]       15     0
[0, 5, 1]       35     0
Notice that based on the softmaxed score score^, the distribution of attention is only placed on [5, 0, 1], as expected. In reality, these numbers are not binary but floating-point values between 0 and 1.
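Continuing the NumPy sketch, the softmax step looks like this. Note that with such large gaps between the scores, the weights come out extremely close to, but not exactly, 0 and 1:

# Softmax the scores so that they sum to 1 (the attention distribution)
exp_scores = np.exp(scores - scores.max())        # subtract the max for numerical stability
attention_weights = exp_scores / exp_scores.sum()
print(attention_weights)  # approximately [2.9e-20, 1.0, 2.9e-20, 1.4e-11]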
Step 3: Multiply each encoder hidden state by its softmaxed score.
By multiplying each encoder hidden state with its softmaxed score (scalar), we obtain the alignment vector [2] or the annotation vector [1]. This is exactly the mechanism where alignment takes place.
Fig. 1.3: Get the alignment vectors
encoder    score  score^  alignment
-----------------------------------
[0, 1, 1]  15     0       [0, 0, 0]
[5, 0, 1]  60     1       [5, 0, 1]
[1, 1, 0]  15     0       [0, 0, 0]
[0, 5, 1]  35     0       [0, 0, 0]
Here we see that the alignment for all encoder hidden states except [5, 0, 1] is reduced to 0 due to low attention scores. This means we can expect the first translated word to match the input word associated with the [5, 0, 1] embedding.
Step 4: Sum up the alignment vectors.
The alignment vectors are summed up to produce the context vector [1, 2]. The context vector is an aggregation of the alignment vectors from the previous step.
Fig. 1.4: Get the context vector
encoder    score  score^  alignment
-----------------------------------
[0, 1, 1]  15     0       [0, 0, 0]
[5, 0, 1]  60     1       [5, 0, 1]
[1, 1, 0]  15     0       [0, 0, 0]
[0, 5, 1]  35     0       [0, 0, 0]

context = [0+5+0+0, 0+0+0+0, 0+1+0+0] = [5, 0, 1]
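In the NumPy sketch, steps 3 and 4 together are just a weighted sum of the encoder hidden states:

# Steps 3 & 4: weight each encoder hidden state by its softmaxed score, then sum them up
alignment_vectors = attention_weights[:, None] * encoder_hiddens
context_vector = alignment_vectors.sum(axis=0)
print(context_vector)  # approximately [5. 0. 1.]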
Step 5: Feed the context vector into the decoder.
The manner in which this is done depends on the architecture design. Later, in the examples in Sections 2a, 2b and 2c, we will see how the architectures make use of the context vector for the decoder.
Fig. 1.5: Feed the context vector to decoder
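As one possible design (a Luong-style concatenation is a common choice, but this is an assumption rather than the only option), the context vector can be concatenated with the current decoder hidden state and passed through a small learned layer:

# One possible design: concatenate the context vector and the decoder hidden state,
# then project through a learned layer (W_c here is a random stand-in for learned weights).
W_c = np.random.randn(3, 6) * 0.1
combined = np.concatenate([context_vector, decoder_hidden])   # shape (6,)
attentional_hidden = np.tanh(W_c @ combined)                  # shape (3,)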
That’s about it! Here’s the entire animation:
Fig. 1.6: Attention
Training and inference
During inference, the input to each decoder time step t is the predicted output from decoder time step t-1.
During training, the input to each decoder time step t is the ground truth output from decoder time step t-1.
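In code, the only difference between the two is which token gets fed in at the next step. Here is a rough sketch of the decoding loop, continuing the NumPy example; decoder_step and embed are hypothetical placeholders standing in for the actual model:

# Teacher forcing (training) vs. greedy decoding (inference) in the decoder loop
def decode(decoder_step, embed, context, target_tokens=None, max_len=20, start_token=0):
    outputs, prev_token = [], start_token
    steps = len(target_tokens) if target_tokens is not None else max_len
    for t in range(steps):
        logits = decoder_step(embed(prev_token), context)   # one decoder time step
        predicted = int(np.argmax(logits))
        outputs.append(predicted)
        if target_tokens is not None:
            prev_token = target_tokens[t]    # training: feed the ground truth from step t
        else:
            prev_token = predicted           # inference: feed the model's own prediction
    return outputs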
Intuition: How does attention actually work?
Answer: Backpropagation, surprise surprise. Backpropagation will do whatever it takes to ensure that the outputs will be close to the ground truth. This is done by altering the weights in the RNNs and in the score function, if any. These weights will affect the encoder hidden states and decoder hidden states, which in turn affect the attention scores.
We have seen both the seq2seq and the seq2seq+attention architectures in the previous section. In the next sub-sections, let’s examine 3 more seq2seq-based architectures for NMT that implement attention. For completeness, I have also appended their Bilingual Evaluation Understudy (BLEU) scores — a standard metric for evaluating a generated sentence against a reference sentence.
This implementation of attention is one of the founding fathers of attention. The authors use the word ‘align’ in the title of the paper, “Neural Machine Translation by Jointly Learning to Align and Translate”, to mean adjusting the weights that are directly responsible for the score while training the model. The following are things to note about the architecture:
Fig. 2a: NMT from Bahdanau et al. The encoder is a BiGRU, the decoder is a GRU.
The authors achieved a BLEU score of 26.75 on the WMT’14 English-to-French dataset.
Intuition: seq2seq with bidirectional encoder + attention
Translator A reads the German text while writing down the keywords. Translator B (who takes on a senior role because he has an extra ability to translate a sentence from reading it backwards) reads the same German text from the last word to the first, while jotting down the keywords. These two regularly discuss every word they have read thus far. Once done reading the German text, Translator B is then tasked to translate the German sentence to English word by word, based on the discussion and the consolidated keywords that both of them have picked up.
Translator A is the forward RNN, Translator B is the backward RNN.
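A minimal PyTorch-style sketch of such a bidirectional encoder (the use of a GRU and the layer sizes are illustrative assumptions):

import torch.nn as nn

# Bidirectional GRU encoder in the spirit of Bahdanau et al. (illustrative sizes only).
# The forward direction plays the role of Translator A, the backward direction Translator B;
# their hidden states are concatenated at every time step for the attention layer to use.
encoder = nn.GRU(input_size=256, hidden_size=512, batch_first=True, bidirectional=True)

# src_embeddings: (batch, src_len, 256) -> encoder_hiddens: (batch, src_len, 2 * 512)
# encoder_hiddens, _ = encoder(src_embeddings)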
The authors of Effective Approaches to Attention-based Neural Machine Translation have made it a point to simplify and generalise the architecture from Bahdanau et al. Here’s how:
Fig. 2b: NMT from Luong et al. The encoder is a 2-layer LSTM, likewise for the decoder.
On the WMT’15 English-to-German task, the model achieved a BLEU score of 25.9.
Intuition: seq2seq with 2-layer stacked encoder + attention
Translator A reads the German text while writing down the keywords. Likewise, Translator B (who is more senior than Translator A) also reads the same German text, while jotting down the keywords. Note that the junior Translator A has to report to Translator B at every word they read. Once done reading, both of them translate the sentence to English together, word by word, based on the consolidated keywords that they have picked up.
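And a corresponding sketch of a 2-layer stacked LSTM encoder and decoder (sizes again are illustrative assumptions, not those of the paper):

import torch.nn as nn

# 2-layer stacked LSTM encoder and decoder in the spirit of Luong et al. (illustrative sizes only).
# The bottom layer plays the role of the junior Translator A, whose output at every time step
# is fed to the top layer, the senior Translator B.
encoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
decoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)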
Because most of us must have used Google Translate in one way or another, I feel that it is imperative to talk about Google’s NMT (GNMT), which was implemented in 2016. GNMT is a combination of the previous 2 examples we have seen (heavily inspired by the first [1]).
Fig. 2c: Google’s NMT for Google Translate. Skip connections are denoted by curved arrows. *Note that the LSTM cells only show the hidden state and input; they do not show the cell state input.
The model achieves 38.95 BLEU on WMT’14 English-to-French, and 24.17 BLEU on WMT’14 English-to-German.
Intuition: GNMT — seq2seq with 8-stacked encoder (+bidirection+residual connections) + attention
8 translators sit in a column from bottom to top, starting with Translator A, B, …, H. Every translator reads the same German text. At every word, Translator A shares his/her findings with Translator B, who will improve it and share it with Translator C — repeat this process until we reach Translator H. Also, while reading the German text, Translator H writes down the relevant keywords based on what he knows and the information he has received.
Once everyone is done reading this German text, Translator A is told to translate the first word. First, he tries to recall, then he shares his answer with Translator B, who improves the answer and shares it with Translator C — repeat this until we reach Translator H. Translator H then writes the first translation word, based on the keywords he wrote and the answers he got. Repeat this until we get the translation out.
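To give a flavour of the “column of translators”, here is a rough sketch of a deep LSTM stack with residual connections (an assumption-heavy simplification, not the real GNMT implementation; sizes and details are illustrative):

import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    # A stack of LSTM layers with residual (skip) connections between layers,
    # loosely in the spirit of the GNMT encoder.
    def __init__(self, hidden_size=512, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            out, _ = layer(x)
            x = out + x if i > 0 else out   # skip connection: add the layer's input to its output
        return x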
Here’s a quick summary of all the architectures that you have seen in this article:
That’s it for now! In my next post, I will walk you through the concept of self-attention and how it has been used in Google’s Transformer and the Self-Attention Generative Adversarial Network (SAGAN). Keep an eye on this space!
Below are some of the score functions as compiled by Lilian Weng. Additive/concat and dot product have been mentioned in this article. The idea behind score functions involving the dot product operation (dot product, cosine similarity, etc.) is to measure the similarity between two vectors. For feed-forward neural network score functions, the idea is to let the model learn the alignment weights together with the translation.
Fig. A0: Summary of score functions
Fig. A1: Summary of score functions. h represents encoder hidden states while s represents decoder hidden states. (Image source)
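For reference, here are three of the score functions above expressed as minimal NumPy sketches (the weights W_a and v_a would be learned parameters in a real model; here they are just placeholders):

import numpy as np

def dot_score(s, h):
    # Dot product (Luong): similarity between decoder state s and encoder state h
    return s @ h

def general_score(s, h, W_a):
    # "General" / bilinear (Luong): dot product through a learned weight matrix
    return s @ (W_a @ h)

def additive_score(s, h, W_a, v_a):
    # Additive / concat (Bahdanau): a small feed-forward network over [s; h]
    return v_a @ np.tanh(W_a @ np.concatenate([s, h]))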
[1] Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)
[2] Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
[3] Attention Is All You Need (Vaswani et al., 2017)
[4] Self-Attention GAN (Zhang et al., 2018)
[5] Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
[6] TensorFlow’s seq2seq Tutorial with Attention (Tutorial on seq2seq+attention)
[7] Lilian Weng’s Blog on Attention (Great start to attention)
[8] Jay Alammar’s Blog on Seq2Seq with Attention (Great illustrations and worked example on seq2seq+attention)
[9] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Wu et al., 2016)
Animated RNN, LSTM and GRU
Line-by-Line Word2Vec Implementation (on word embeddings)
Step-by-Step Tutorial on Linear Regression with Stochastic Gradient Descent
10 Gradient Descent Optimisation Algorithms + Cheat Sheet
Counting No. of Parameters in Deep Learning Models