Conformer: Convolution-augmented Transformer for Speech Recognition

1. Paper Summary

Transformer-based models are good at capturing content-based global interactions, while convolutions are better at exploiting local features. This paper combines the strengths of both in a single architecture, uses fewer parameters than competing models, and achieves state-of-the-art results on LibriSpeech with 2.1%/4.3% WER (test / test-other).

2. Model Architecture

[Figure 1: overall model architecture]

  • Multi-Headed Self-Attention Module
    [Figure 2: Multi-Headed Self-Attention Module]
  • Convolution Module
    [Figure 3: Convolution Module]
  • Feed Forward Module
    [Figure 4: Feed Forward Module]

The feed-forward module places a nonlinear activation (Swish in the paper) between two linear layers, together with a residual connection and layer normalization.

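A minimal PyTorch-style sketch of such a feed-forward module follows. The class name `FeedForwardModule`, the sizes, and the dropout rate are illustrative assumptions rather than values from the official implementation; the Swish activation is written as `nn.SiLU`:

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    """Feed-forward module: LayerNorm -> Linear -> Swish -> Dropout -> Linear -> Dropout,
    plus a residual connection from the input."""
    def __init__(self, d_model=256, d_ff=1024, dropout=0.1):  # illustrative sizes
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ff)
        self.act = nn.SiLU()                 # SiLU is the Swish activation
        self.w2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                    # x: (batch, time, d_model)
        y = self.dropout(self.act(self.w1(self.norm(x))))
        y = self.dropout(self.w2(y))
        return x + y                         # residual; the Conformer block weights it by 0.5
```

In the Conformer block this module appears twice, each time with a half-step (0.5×) residual weight, as in the block sketch below.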
  • Conformer Block
    [Figure 5: Conformer Block]
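
The block stacks the pieces in a sandwich (Macaron-style) order: a half-step feed-forward module, multi-head self-attention, the convolution module, a second half-step feed-forward module, and a final LayerNorm. Below is a minimal, self-contained PyTorch-style sketch of that wiring. The class names and default sizes are illustrative assumptions; the feed-forward helper repeats the module above in compact form but leaves the residual to the block (which adds it with a 0.5 weight); and torch's built-in `nn.MultiheadAttention` is used only as a stand-in for the paper's self-attention with relative positional encoding.

```python
import torch
import torch.nn as nn

def feed_forward(d_model, d_ff, dropout):
    # Compact form of the feed-forward module above; the residual is added by the caller.
    return nn.Sequential(
        nn.LayerNorm(d_model),
        nn.Linear(d_model, d_ff),
        nn.SiLU(),                            # SiLU == Swish
        nn.Dropout(dropout),
        nn.Linear(d_ff, d_model),
        nn.Dropout(dropout),
    )

class ConvModule(nn.Module):
    # Pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv, with residual.
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)      # Conv1d expects (batch, channels, time)
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = nn.functional.silu(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return x + y                          # residual connection

class ConformerBlock(nn.Module):
    # Half-step FFN -> MHSA -> convolution module -> half-step FFN -> final LayerNorm.
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, kernel_size=31, dropout=0.1):
        super().__init__()
        self.ffn1 = feed_forward(d_model, d_ff, dropout)
        self.mhsa_norm = nn.LayerNorm(d_model)
        # Plain absolute-position attention as a stand-in; the paper's MHSA
        # uses relative positional encoding.
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.mhsa_dropout = nn.Dropout(dropout)
        self.conv = ConvModule(d_model, kernel_size, dropout)
        self.ffn2 = feed_forward(d_model, d_ff, dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)            # first half-step feed-forward
        y = self.mhsa_norm(x)
        attn_out, _ = self.mhsa(y, y, y, need_weights=False)
        x = x + self.mhsa_dropout(attn_out)
        x = self.conv(x)                      # ConvModule adds its own residual
        x = x + 0.5 * self.ffn2(x)            # second half-step feed-forward
        return self.final_norm(x)
```

With these defaults the block preserves the input shape (batch, time, d_model), so blocks can simply be stacked after the convolutional subsampling front end to form the encoder.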

3. Experimental Results

[Figure 6: experimental results]
