Speech Recognition Using Attention-Based Sequence-to-Sequence Methods

Abstract—Speech is one of the most important and prominent ways for human beings to communicate, and it also has the capacity to serve as a medium for human-computer interaction. Speech recognition has become a popular research area among research institutes and Internet companies. This paper presents a brief overview of the two main steps of speech recognition: feature extraction and model training using deep learning. In particular, five state-of-the-art methods that use attention-based sequence-to-sequence models for the speech recognition training process are discussed.

Keywords-speech recognition; attention mechanism; sequence to sequence; neural transducer; Mel-frequency cepstrum coefficient

I. INTRODUCTION
Natural language refers to a kind of language that evolves naturally with culture, and it is the primary tool of human communication and thinking. Speech recognition, as the name suggests, takes natural-language speech as the input to a model and produces the text of that speech as output; in other words, it converts speech signals into text sequences. It is simple for humans to convert speech audio into text manually. Still, when facing large amounts of data, it takes plenty of time, and it is, to some extent, very difficult or impossible for humans to convert in real time. Moreover, there are hundreds of languages in the world, and few experts can master multiple languages simultaneously. As a result, people expect machine learning to help accomplish this task.

At present, the typical steps of speech recognition include preprocessing, feature extraction, training, and recognition. Feature extraction is difficult because the speech signal is volatile: even if a person tries hard to say the same sentence twice, the two signals always differ to some degree. Speech feature extraction is therefore a hard problem for computer scientists.

In this paper, we introduce the main process of speech recognition. For feature extraction, we introduce one of the most popular approaches, the Mel-frequency Cepstrum Coefficient (MFCC) [1]. For the training part, it is evident that the lengths of the input (the sequence of speech vectors) and the output (the sequence of text vectors) are usually different. The input length is determined by humans (e.g., 25 ms frames), while the output length is determined by the model itself. Thus, Sequence-To-Sequence (Seq2Seq) based models are the most widely used today.

The remainder of this article is organized as follows. In Section II, the MFCC feature extraction approach is illustrated. In Section III, we describe the basic attention mechanism, as well as five training methods based on it: Listen, Attend and Spell; Connectionist Temporal Classification; the RNN Transducer; the Neural Transducer; and Monotonic Chunkwise Attention. Finally, concluding remarks are given in Section IV.

II. FEATURE EXTRACTION

Because of the instability of speech signals, feature extraction from the speech signal is very difficult. Different words have different features, and for the same word there are differences among different speakers, such as adults and children, or male and female. Even for the same person saying the same word, the signal changes from one time to the next [2]. The Mel-frequency cepstrum coefficient is proposed based on the auditory characteristics of the human ear. It uses a nonlinear frequency unit, called the Mel frequency [4], to simulate the human auditory system [8,10,11,17,18]. The calculation method is shown in Formula (1):
$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right) \tag{1}$$

where $f$ is the linear frequency in Hz.
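As a minimal illustration of this mapping, the conversion between linear frequency and the Mel scale (and its inverse, used later when placing filter-bank centers) can be computed directly; the constants follow the common 2595/700 form of Formula (1):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map linear frequency (Hz) to the Mel scale, as in Formula (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used when placing filter-bank center frequencies."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# 1000 Hz maps to roughly 1000 Mel; 8000 Hz to roughly 2840 Mel.
print(hz_to_mel(np.array([1000.0, 8000.0])))
```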
Figure 1 shows the construction of the MFCC model.
[Figure 1: The construction of the MFCC model]
After the original acoustic wave passes through windowing and other pre-processing steps, we obtain the frame signal.

Because it is difficult to observe the characteristics of the signal in the time domain, transforming it into the energy distribution in the frequency domain can solve this problem. The energy distribution in the spectrum, which represents the characteristics of different sounds, is obtained by fast Fourier transform.
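As a sketch of this step (with illustrative, not prescribed, settings of a 25 ms frame, a 10 ms hop, a Hamming window, and a 512-point FFT), the frame-wise power spectrum can be computed as follows:

```python
import numpy as np

def frame_power_spectrum(signal, sample_rate, frame_ms=25, hop_ms=10, n_fft=512):
    """Split a waveform into overlapping frames, window them, and take the FFT power spectrum."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame, n=n_fft)
        frames.append(np.abs(spectrum) ** 2)     # energy distribution in the frequency domain
    return np.array(frames)                      # shape: (num_frames, n_fft // 2 + 1)
```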

After the fast Fourier transform of the speech signal is completed, Mel-frequency filtering is performed [3]. Specifically, a filter bank composed of triangular bandpass filters is defined. Assume that the center frequency of the m-th filter is f(m), that f(m-1) is the lower frequency of its coverage range after the cross-overlap of adjacent filters, and that f(m+1) is the corresponding upper frequency. The frequency response of the m-th filter is then given by
$$H_m(k)=\begin{cases}0, & k < f(m-1)\\[4pt] \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k \le f(m)\\[4pt] \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k \le f(m+1)\\[4pt] 0, & k > f(m+1)\end{cases} \tag{2}$$
From this filter bank we obtain the output spectrum energy generated by each filter, and the data are then transformed to a logarithm. Finally, a discrete cosine transform converts the result back to the time (cepstral) domain to obtain the final MFCC. The main advantage of MFCC is that it uses Mel frequency scaling, which closely approximates the human auditory system.
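The remaining steps, triangular Mel filtering, the logarithm, and the discrete cosine transform, can be sketched as below. The filter count, FFT size, and number of kept coefficients are illustrative assumptions, and the input is the frame-wise power spectrum from the earlier sketch:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power(power_frames, sample_rate, n_fft=512, n_filters=26, n_ceps=13):
    """Apply a Mel filter bank, take logs, then a DCT to obtain MFCCs (illustrative sketch)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges: equally spaced on the Mel scale, mapped back to FFT bin indices.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:center] = (np.arange(lo, center) - lo) / max(center - lo, 1)   # rising edge
        fbank[m - 1, center:hi] = (hi - np.arange(center, hi)) / max(hi - center, 1)   # falling edge

    filter_energy = power_frames @ fbank.T        # output spectrum energy of each filter
    log_energy = np.log(filter_energy + 1e-10)    # transform to a logarithm
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]   # keep the lower coefficients
```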

III. TRAINING MODEL
In this section, we first introduce the sequence-to-sequence model and the attention mechanism, and then we discuss five models based on Seq2seq and attention: Listen, Attend and Spell; Connectionist Temporal Classification; the RNN Transducer; the Neural Transducer; and Monotonic Chunkwise Attention.

A. Description of attention-based sequence-to-sequence model

A typical sequence-to-sequence model has two parts: an encoder and a decoder. In practice, both parts are two different neural network models combined into one network. The task of the encoder network is to understand the input sequence and create a high-dimensional representation of it. This representation is forwarded to the decoder network, which generates a sequence of its own that represents the output. Figure 2 shows the encoder-decoder with an attention mechanism; the Match function can be the dot-product method or the additive attention method.
[Figure 2: Encoder-decoder with an attention mechanism]
The attention mechanism [19] is realized through the encoder and decoder. When a human tries to understand a picture, he or she focuses on specific portions of the image to grasp its overall meaning. Similarly, we can train an artificial system to focus on particular elements of the input to capture the whole “picture”; this is essentially how the attention mechanism works. To implement an attention mechanism, we take input from each time step of the encoder but assign a weight to each time step. The weight reflects how important that time step is for the decoder to optimally generate the next word in the sequence. The output is computed as a weighted sum of the values, where the weight of each value is calculated from the query and the corresponding key [19]. We compute the dot products of the query (Q) with all keys (K), divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. The calculation of attention is shown as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{3}$$
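A small numpy sketch of this scaled dot-product attention (toy dimensions, not tied to any particular model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of the values

# Toy example: 2 queries attending over 4 key/value pairs of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (2, 8)
```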
A common choice for the Match function is this dot-product method [19]. Additive attention [26] is another option: instead of a dot product, the query and key are combined by a feed-forward network with a single hidden layer, and the score function is shown as
$$\mathrm{score}(q, k) = v^{\top}\tanh\!\left(W_1 q + W_2 k\right) \tag{4}$$

where $W_1$, $W_2$, and $v$ are learned parameters.
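For comparison, a sketch of the additive score function with a single hidden layer; W1, W2, and v are illustrative parameter names matching Formula (4):

```python
import numpy as np

def additive_attention_scores(q, keys, W1, W2, v):
    """Additive (Bahdanau-style) score: v^T tanh(W1 q + W2 k) for every key k."""
    hidden = np.tanh(keys @ W2.T + q @ W1.T)    # single hidden layer, broadcast over the keys
    return hidden @ v                           # one unnormalised score per key

# Toy shapes: a query of dimension 8, 4 keys of dimension 8, hidden layer of size 16.
rng = np.random.default_rng(0)
q, keys = rng.normal(size=8), rng.normal(size=(4, 8))
W1, W2, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=16)
scores = additive_attention_scores(q, keys, W1, W2, v)    # shape (4,), then fed to a softmax
print(scores.shape)
```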

B. Speech recognition using attention-based sequence-to-sequence methods

Listen, Attend and Spell (LAS) [9] is a neural network that learns to transcribe an audio signal into a word sequence, one character at a time, without using explicit language models, pronunciation models, HMMs, etc. LAS makes no independence assumptions about the probability distribution of the output character sequence given the input acoustic sequence. The method is based on the sequence-to-sequence learning framework with attention. It consists of an encoder pyramid recurrent neural network (p-RNN), named the Listener, and a decoder RNN, named the Speller. The Listener converts acoustic features, such as those produced by MFCC, into high-level features. The purpose of this module is to remove information that is irrelevant to recognition, such as noise and speaker-to-speaker variation, and keep only the information needed for recognition. The Listener can be implemented with an RNN or a convolutional neural network (CNN), and a current mainstream approach is to add self-attention to the recurrent layers. Because speech signals are often very long and adjacent vectors carry largely redundant information, the p-RNN adds a down-sampling operation to the Listen module. Figure 3 shows the structure of the p-RNN; the blocks in the figure indicate that there are four vectors in the first RNN layer but only two in the second.
[Figure 3: The structure of the pyramid RNN (p-RNN)]
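As an illustration of this down-sampling idea, one pyramidal layer can concatenate every two adjacent time steps before the next bidirectional LSTM, halving the sequence length. The following PyTorch-style sketch uses illustrative names and is not the authors' code:

```python
import torch
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pBLSTM layer: concatenate adjacent time steps, then run a bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, input_dim)
        b, t, d = x.shape
        if t % 2 == 1:                          # drop the last frame if the length is odd
            x = x[:, :-1, :]
            t -= 1
        x = x.reshape(b, t // 2, d * 2)         # merge every two frames -> half as many steps
        out, _ = self.blstm(x)
        return out                              # (batch, time // 2, 2 * hidden_dim)
```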
For down-sampling, Pooling Over Time [4] and Truncated Self-attention [17] are optional measures. Pooling Over Time selects fewer vectors as output, while Truncated Self-attention considers only part of the sequence instead of the whole sequence. The difference between them is shown in Figure 4 and Figure 5.
[Figure 4: Pooling Over Time]
[Figure 5: Truncated Self-attention]
The Speller is an RNN that converts the high-level features into the output transcription by modeling the probability distribution of the next character given all of the acoustics and the preceding characters. At each step, the RNN uses its internal state to guide an attention mechanism [21, 22, 23] that computes a “context” vector from the high-level features of the Listener. The context vector and the internal state are then used both to update the internal state and to predict the next character in the sequence. Figure 6 shows the construction of the Listen, Attend and Spell model.
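A single speller step might look like the following sketch (illustrative shapes and layer names; the real LAS decoder uses a deeper RNN and a learned attention energy, so this is only a conceptual simplification, assuming batch size 1 and that `state` may be None on the first step):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpellerStep(nn.Module):
    """One decoding step of an LAS-style speller (conceptual sketch, not the original code)."""
    def __init__(self, enc_dim, hidden_dim, vocab_size, emb_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + enc_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, enc_dim)            # project the state into the key space
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, prev_char, prev_context, state, enc_out):
        # prev_char: (1,) index of the previous character; enc_out: (time, enc_dim) Listener features.
        rnn_in = torch.cat([self.embed(prev_char), prev_context], dim=-1)
        h, c = self.cell(rnn_in, state)
        scores = self.query(h) @ enc_out.T                     # (1, time) attention energies
        weights = F.softmax(scores, dim=-1)
        context = weights @ enc_out                            # (1, enc_dim) "context" vector
        logits = self.out(torch.cat([h, context], dim=-1))     # distribution over the next character
        return logits, context, (h, c)
```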

Connectionist Temporal Classification (CTC) [11,15] can achieve online streaming speech recognition: it emits a token distribution as soon as it sees each input vector. Figure 7 shows the structure of CTC. To allow online operation, the encoder must be a model that does not need to see the whole sequence, and a uni-directional RNN is a natural choice. One difficulty is that each acoustic feature covers a very short span, so it is hard for the machine to decide which token a single frame belongs to. To address this, the model adds a blank symbol ∅ to the token distribution for frames where no decision can be made. At the end, duplicate tokens are first merged and then the blank outputs are removed. Another problem is that the labels alone are not enough to compute the cross-entropy loss: if phonemes are used as output tokens, ten input vectors may correspond to a single word and also include blanks. Following the merge-and-remove rule, new frame-level label sequences therefore have to be designed and created, and this operation is called alignment. In that paper, all valid alignments are taken into account during training.
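The merge-then-remove decoding rule can be written in a few lines; the sketch below uses the string "-" as the blank symbol purely for illustration:

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path: merge repeated tokens, then drop blanks."""
    merged = []
    for token in path:
        if not merged or token != merged[-1]:   # keep a token only when it differs from the previous one
            merged.append(token)
    return [t for t in merged if t != blank]

# Frame-level path "cc-aa--t" collapses to the label sequence "cat".
print("".join(ctc_collapse(["c", "c", "-", "a", "a", "-", "-", "t"])))
```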

In speech recognition, there is no doubt that people prefer obtaining a timely result over getting it only after the whole sentence has been input. However, the standard sequence-to-sequence model is not suitable for such latency-critical tasks. The Neural Transducer [13], a generalization of the sequence-to-sequence model, meets this demand by separating the outputs into ‘chunks’. The key is a transducer RNN that generates extensions to the previously produced output. Before the Neural Transducer was published, there were related models, such as the sequence transducer, which can model the conditional dependence between input and output and places no limit on the lengths of input and output. However, its prediction model and transcription model cannot operate within a single time step. The Neural Transducer generally intercepts the voice stream in blocks of fixed duration (for example, 300 ms). It receives the voice information of the current block and decodes the text corresponding to the current segment of the voice stream, combining it with the output text of the previous block and the neural network state. In this way, the decoding delay is effectively controlled. As an extension of the RNN and the sequence-to-sequence model, the Neural Transducer can realize ‘online’ speech recognition and produce output in real time.
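Conceptually, the block-wise decoding loop looks like the sketch below, where decode_block is a hypothetical stand-in for the transducer's per-block attention decoder:

```python
def neural_transducer_decode(audio_frames, block_size, decode_block):
    """Block-wise decoding loop of a Neural-Transducer-style model (conceptual sketch).

    `decode_block(block, history, state)` is a hypothetical per-block decoder that
    returns the tokens emitted for this block and the updated recurrent state.
    """
    history, state = [], None
    for start in range(0, len(audio_frames), block_size):
        block = audio_frames[start:start + block_size]        # e.g. a 300 ms chunk of features
        tokens, state = decode_block(block, history, state)   # condition on previous output + state
        history.extend(tokens)
        yield tokens                                          # output is available immediately
```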

[Figure 6: The construction of the Listen, Attend and Spell model]

[Figure 7: The structure of CTC]
As one of the most popular models, the recurrent neural network has demonstrated great success in many kinds of tasks. In speech recognition, RNNs can transcribe raw speech waveforms. However, because a multi-stage speech pipeline has shortcomings such as high cost, an end-to-end speech recognition system [15] consisting of a single RNN architecture instead of a speech pipeline can improve performance. In this model, the combination of a deep bidirectional LSTM [6] recurrent neural network and the Connectionist Temporal Classification objective function plays an essential role in optimizing the word error rate. Similar ideas had been tried before, but the results were not convincing enough. The bidirectional LSTM recurrent neural network is the combination of BRNNs with LSTM. Figure 8 shows the structure of BRNNs.
[Figure 8: The structure of bidirectional RNNs]
The idea of the bidirectional recurrent neural network [5] is to split the traditional recurrent layer into two parts, one for the positive time direction (forward states) and the other for the negative time direction (backward states). The outputs of the forward states are not connected to the inputs of the backward states. This structure therefore provides the output layer with complete past and future context for every point in the input sequence. In speech recognition, the data set consists of audio files and their corresponding text. Unfortunately, audio and text are difficult to align at the word level. The Connectionist Temporal Classification objective function allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences. In general, it lets the RNN learn from sequence data directly, without labeling the mapping between input and output sequences in the training data in advance; this removes the data-dependency constraint of applying RNNs to speech recognition and allows the RNN model to achieve better results in sequence learning tasks. Finally, combining the new models with the basic models improves the state-of-the-art accuracy of speaker-independent recognition.
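In modern frameworks a bidirectional LSTM of this kind is built in; the minimal PyTorch sketch below (with illustrative dimensions) shows that each output position concatenates forward and backward states and therefore carries both past and future context:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over a batch of feature sequences (e.g. 13-dimensional MFCC frames).
blstm = nn.LSTM(input_size=13, hidden_size=128, num_layers=2,
                batch_first=True, bidirectional=True)

features = torch.randn(4, 200, 13)           # (batch, time, feature)
outputs, (h_n, c_n) = blstm(features)
# Each time step concatenates the forward and backward hidden states.
print(outputs.shape)                         # torch.Size([4, 200, 256])
```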

In network training, the RNN transducer [7] is a method for improving online speech recognition and is an improvement over the CTC model. The CTC model still has many problems, the most significant of which is that it assumes the outputs of the model are conditionally independent. In view of this shortcoming, Alex Graves proposed the RNN-T model around 2012. The RNN-T model skillfully integrates the language model and the acoustic model and optimizes them jointly, making it a relatively complete model structure in theory. Figure 9 shows the structure of the RNN-T. It includes three modules: an encoder network, a language-model prediction network, and a joint network. The encoder network, generally composed of multi-layer bidirectional LSTMs, takes the time sequence of acoustic features as input and produces high-level feature representations. The prediction network is composed of multi-layer unidirectional LSTMs; it takes the embedding of the previously emitted label as input and produces the corresponding output. The joint network is a stack of fully connected layers: its input is a combination of the two encodings, which is fed into a feed-forward neural network to obtain the probability distribution over the predicted characters [14].
[Figure 9: The structure of the RNN transducer]
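A sketch of the joint network, assuming the concatenation-style combination described above (some implementations add the two encodings instead; the layer names here are illustrative):

```python
import torch
import torch.nn as nn

class RNNTJoiner(nn.Module):
    """Joint network of an RNN-T (illustrative sketch): combine encoder and prediction outputs."""
    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)           # +1 for the blank symbol

    def forward(self, enc_t, pred_u):
        # enc_t: (batch, enc_dim) acoustic encoding at time t;
        # pred_u: (batch, pred_dim) prediction-network encoding after u emitted labels.
        joint = torch.tanh(self.proj(torch.cat([enc_t, pred_u], dim=-1)))
        return self.out(joint)                                    # logits over labels + blank
```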
However, there were still problems, such as the quadratic time and space cost incurred during the decoding process. To overcome these problems and improve efficiency, the Monotonic Chunkwise Attention (MoCha) model [17] splits the input sequence into small chunks.

The structure of MoCha is similar to that of a sequence-to-sequence model combined with soft attention, which converts the input sequence into a new output sequence. Figure 10 displays the structure of MoCha. Feeding the generated output elements back into the input helps the model regulate the output generation process more directly, which a pure self-attention model does not achieve to the same extent. The main difference between MoCha and the Neural Transducer is that the placement of the attention window is determined adaptively by the learned monotonic attention rather than by a fixed segmentation of the input, so the window placement is learned by the model. The other parts are the same as in the Neural Transducer, so the output can still be produced in a timely manner.
[Figure 10: The structure of MoCha]
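A rough sketch of MoCha's greedy inference for one output step: hard monotonic attention scans forward until a selection probability passes 0.5, and soft attention is then computed over a fixed-size chunk ending at that position. For brevity the same energies are reused for selection and for the chunk softmax, whereas MoCha uses separate monotonic and chunk energies:

```python
import numpy as np

def mocha_infer_step(energies, prev_pos, chunk_size=4):
    """One greedy MoCha-style inference step (conceptual sketch, not the paper's exact algorithm)."""
    pos = prev_pos
    # Hard monotonic scan: move right until the sigmoid selection probability exceeds 0.5.
    while pos < len(energies) and 1.0 / (1.0 + np.exp(-energies[pos])) < 0.5:
        pos += 1
    if pos == len(energies):                    # ran off the end: nothing selected this step
        return None, prev_pos
    # Soft attention over the chunk of `chunk_size` frames ending at the selected position.
    start = max(0, pos - chunk_size + 1)
    chunk = energies[start:pos + 1]
    weights = np.exp(chunk - chunk.max())
    weights /= weights.sum()
    return (start, weights), pos
```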
IV. CONCLUSION

In this paper, we have reviewed MFCC, a feature extraction method. We then introduced the attention mechanism and the Seq2seq model. Finally, we discussed five prominent current speech recognition models based on the attention and Seq2seq mechanisms.

In conclusion, MFCC is the essential first step of speech signal processing. It is a cepstral parameter extracted in the Mel-scale frequency domain, and the Mel scale describes the nonlinear frequency characteristics of the human ear. LAS is a neural speech recognizer that can map acoustic features to characters directly without using any of the traditional components of a speech recognition system, such as HMMs. This approach is the cornerstone of new neural speech recognizers that are simpler to train and achieve better results than traditional speech recognition systems. CTC is an online streaming speech recognition method, so it gives a token distribution when seeing one input vector. The alignment mechanism is flexible enough to allow extremely non-sequential alignments, although in speech recognition the acoustic inputs and the corresponding outputs are generally produced in the same order with only small deviations. Another problem is that the input and output sequences have very different lengths, and the difference varies from case to case depending on the speaking rate and the writing system, which makes the alignment harder to process. The Neural Transducer is a model that helps realize real-time transcription; it is useful for speech recognition systems and is also very important for future online speech translation systems. Recurrent neural networks are a kind of neural network that can be used for prediction; they place no requirement on the input sequence length and can predict from time-series data. Some of their extensions and variants have remedied their defects and found a wide range of applications according to different needs. Moreover, RNN-T is an improvement on CTC aimed at real-time recognition: on top of the CTC-style encoder it adds an RNN that takes the previous output as input, called the prediction network; it trains quickly and is highly efficient. MoCha is a seq2seq model with soft attention that retains the advantages of hard monotonic attention, namely linear time complexity and real-time decoding, while still allowing soft alignment.

REFERENCES

W. Junqin, and Y. Junjun, “An improved arithmetic of MFCC in speech recognition system,” International Conference on Electronics, Communications and Control (ICECC), 2011, pp. 719–722.

S. Wang, “The Research and Simulation of Isolated Word in Noise,” Master’s thesis, Jiangxi University of Science and Technology, 2009.

Y. Chen, Z. Qu, Y. Liu, K. Jiu, A. Guo, and Z. Yang, “The Extraction and Application of Phonetic Characteristic Parameter MFCC,” Journal of Hunan Agricultural University (Natural Sciences), 2009.

C. Ittichaichareon, S. Suksri, and T. Yingthawornsuk, “Speech recognition using MFCC,” International Conference on Computer Graphics, Simulation and Modeling, 2012, pp. 135-138.

M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, pp. 2673–2681, 1997.

A. Graves and J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, June/July 2005.

A. Graves, “Sequence transduction with recurrent neural networks,” in ICML Representation Learning Workshop, 2012.

W. Han, C. Chan, C. Choy, and K. Pun, “An efficient MFCC extraction method in speech recognition,” IEEE International Symposium on Circuits and Systems, 2006, pp. 4- .

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.

G. Saha, and U. S. Yadhunandan, “Modified mel-frequency cepstral coefficient,” Proceedings of the IASTED, 2004.

S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.

A. Graves, and N. Jaitly, “Towards End-To-End Speech Recognition with Recurrent Neural Networks,” International conference on machine learning, 2014, pp. 1764-1772.

N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, “A neural transducer,” arXiv preprint, 2015.

C. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, “Transformer-transducer: End-to-end speech recognition with self-attention,” arXiv preprint, 2019.

A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.

C. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, and M. L. Seltzer, “Transformer-transducer: End-to-end speech recognition with self-attention,” arXiv preprint, 2019.

C. C. Chiu, and C. Raffel, “Monotonic chunkwise attention,” arXiv preprint, 2017.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, 2017, pp. 5998-6008.

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint, 2014.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, pp. 1735–1780, 1997.

T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” arXiv preprint, 2017.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369-376.

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint, 2014.

D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” IEEE international conference on acoustics, speech and signal processing (ICASSP), 2016, pp. 4945-4949.
