Paper notes: Convolutional Neural Networks for Speech Recognition

Question:
When a CNN is used as the acoustic model in ASR, with FBANK input features arranged as three input channels, how do we handle the fact that utterances contain different numbers of frames?

  • https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/CNN_ASLPTrans2-14.pdf
  • https://yh1008.github.io/DNN-HMM/slides#/

CONVOLUTIONAL NEURAL NETWORKS AND THEIR USE IN ASR

The convolutional neural network (CNN) can be regarded as a variant of the standard neural network. Instead of using fully connected hidden layers as described in the preceding section, the CNN introduces a special network structure, which consists of alternating so-called convolution and pooling layers.

  • A. Organization of the Input Data to the CNN
  • B. Convolution Ply
  • C. Pooling Ply
  • D. Learning Weights in the CNN
  • E. Pretraining CNN Layers
  • F. Treatment of Energy Features
  • G. The Overall CNN Architecture
  • H. Benefits of CNNs for ASR

A. Organization of the Input Data to the CNN

In this section, we discuss how to organize speech feature vectors into feature maps that are suitable for CNN processing.

The input “image” in question for our purposes can loosely be thought of as a spectrogram, with static, delta and delta-delta features (i.e., first and second temporal derivatives) serving in the roles of red, green and blue, although, as described below, there is more than one alternative for how precisely to bundle these into feature maps.

We need to use inputs that preserve locality along both the frequency and time axes.
Time presents no immediate problem from the standpoint of locality: as in other DNNs for speech, a single window of input to the CNN consists of a wide context of 9–15 frames.
As for frequency, the conventional use of MFCCs does present a major problem because the discrete cosine transform projects the spectral energies into a new basis that may not maintain locality.
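This loss of locality can be checked directly: each DCT (and hence MFCC) coefficient is a weighted sum over all mel bands, so a change confined to a single band perturbs every coefficient. A minimal numerical sketch (the sizes and the perturbed band index are chosen arbitrarily for illustration):

```python
import numpy as np

n = 40
k = np.arange(n)
# DCT-II basis: each column is a cosine spanning ALL n mel bands,
# so no cepstral coefficient corresponds to a local frequency region.
basis = np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / n)

log_mel = np.random.default_rng(0).standard_normal(n)
bumped = log_mel.copy()
bumped[10] += 1.0                    # perturb one frequency band only

mfcc_a = basis.T @ log_mel           # project onto the DCT basis
mfcc_b = basis.T @ bumped
changed = np.abs(mfcc_b - mfcc_a) > 1e-9
print(changed.sum())                 # → 40: every coefficient moves
```

A CNN with local filters along frequency cannot exploit such a representation, which is why the paper stops before the DCT step.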

In this paper, we shall use the log-energy computed directly from the mel-frequency spectral coefficients (i.e., with no DCT), which we will denote as MFSC features. These will be used to represent each speech frame, along with their deltas and delta-deltas, in order to describe the acoustic energy distribution in each of several different frequency bands.

Speech is analyzed using a 25-ms Hamming window with a fixed 10-ms frame rate. Speech feature vectors are generated by Fourier-transform-based filter-bank analysis, which includes 40 log energy coefficients distributed on a mel scale, along with their first and second temporal derivatives. All speech data were normalized so that each vector dimension has a zero mean and unit variance.
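The pipeline above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: `mel_filterbank` is a minimal triangular filterbank, `np.gradient` stands in for the usual regression-based delta computation, and mean/variance normalization is done per utterance rather than over the corpus:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Minimal triangular mel filterbank (for illustration only)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fb

def mfsc_features(signal, sample_rate=16000, n_mels=40,
                  win_ms=25.0, hop_ms=10.0, n_fft=512):
    """40 log mel-filterbank energies (no DCT) + deltas + delta-deltas."""
    win = int(sample_rate * win_ms / 1000)   # 25-ms Hamming window
    hop = int(sample_rate * hop_ms / 1000)   # 10-ms frame rate
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hamming(win)
        frames.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2)
    power = np.array(frames)                             # (T, n_fft//2+1)

    fb = mel_filterbank(n_mels, n_fft, sample_rate)
    mfsc = np.log(power @ fb.T + 1e-10)                  # (T, 40)

    # First and second temporal derivatives (simplified via np.gradient).
    delta = np.gradient(mfsc, axis=0)
    ddelta = np.gradient(delta, axis=0)
    feats = np.stack([mfsc, delta, ddelta])              # (3, T, 40)

    # Zero mean, unit variance per dimension (here: per utterance).
    return (feats - feats.mean(axis=1, keepdims=True)) / (
        feats.std(axis=1, keepdims=True) + 1e-10)
```

One second of 16-kHz audio yields 98 frames, so the result has shape `(3, 98, 40)`: three feature streams ready to be organized into maps.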

There are several alternatives for organizing these MFSC features into maps for the CNN.

First, as shown in Fig. 1(b), they can be arranged as three 2-D feature maps, each of which represents MFSC features (static, delta and delta-delta) distributed along both frequency (using the frequency band index) and time (using the frame number within each context window). In this case, a two-dimensional convolution is performed (explained below) to normalize both frequency and temporal variations simultaneously.

Alternatively, we may only consider normalizing frequency variations. In this case, the same MFSC features are organized as a number of one-dimensional (1-D) feature maps (along the frequency band index), as shown in Fig. 1(c). For example, if the context window contains 15 frames and 40 filter banks are used for each frame, we will construct 45 (i.e., 15 times 3) 1-D feature maps, with each map having 40 dimensions, as shown in Fig. 1(c). As a result, a one-dimensional convolution will be applied along the frequency axis.

In this paper, we focus only on this latter arrangement from Fig. 1(c): a one-dimensional convolution along frequency.
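The Fig. 1(c) arrangement can be sketched as follows; the function name and the `(3, T, 40)` input layout are assumptions for illustration:

```python
import numpy as np

def make_1d_feature_maps(feats, center, context=15):
    """Arrange MFSC features into 1-D feature maps along frequency,
    following the Fig. 1(c) layout. `feats` has shape (3, T, 40):
    the static, delta and delta-delta streams over T frames."""
    half = context // 2
    window = feats[:, center - half:center + half + 1, :]   # (3, 15, 40)
    # Every (stream, frame) pair becomes one 1-D map over the 40
    # frequency bands: 3 * 15 = 45 maps, each of dimension 40.
    return window.reshape(-1, window.shape[-1])             # (45, 40)
```

Note how this also answers the question at the top of these notes: the CNN never sees a whole utterance at once. Each frame of an utterance contributes one fixed-size context window, so utterances of different lengths simply produce different numbers of such windows.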


Once the input feature maps are formed, the convolution and pooling layers apply their respective operations to generate the activations of the units in those layers, in sequence, as shown in Fig. 2. Similar to those of the input layer, the units of the convolution and pooling layers can also be organized into maps. In CNN terminology, a pair of convolution and pooling layers in Fig. 2 in succession is usually referred to as one CNN “layer.” A deep CNN thus consists of two or more of these pairs in succession. To avoid confusion, we will refer to convolution and pooling layers as convolution and pooling plies, respectively.
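Under this terminology, one CNN “layer” (a convolution ply followed by a pooling ply) can be sketched as below. This is an illustrative sketch, not the paper's implementation: the shapes follow the 45-map, 40-band input described above, while the number of filters, filter size, and pool size are placeholders:

```python
import numpy as np

def convolution_ply(maps, weights, bias):
    """One 1-D convolution ply along frequency (a sketch).
    Each output unit sees only a local band of the input, and the
    same filter is shared across all band positions.
    maps:    (n_in, bands)          input feature maps
    weights: (n_out, n_in, fsize)   local filters
    bias:    (n_out,)
    returns  (n_out, bands - fsize + 1) activation maps."""
    n_out, n_in, fsize = weights.shape
    out_bands = maps.shape[1] - fsize + 1
    out = np.empty((n_out, out_bands))
    for j in range(n_out):
        for b in range(out_bands):
            out[j, b] = np.sum(weights[j] * maps[:, b:b + fsize]) + bias[j]
    return 1.0 / (1.0 + np.exp(-out))   # sigmoid nonlinearity

def pooling_ply(maps, pool_size):
    """Max-pooling ply: the maximum over each non-overlapping group
    of `pool_size` adjacent frequency bands."""
    n, bands = maps.shape
    usable = bands - bands % pool_size
    return maps[:, :usable].reshape(n, -1, pool_size).max(axis=2)
```

For example, 45 input maps of 40 bands convolved with 64 filters of size 8 give 64 maps of 33 bands; pooling with size 3 reduces these to 64 maps of 11 bands, which tolerate small shifts along frequency.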

B. Convolution Ply

CNNs are also often said to be local because the individual units computed at a particular positioning of the window depend only on features of the local region of the image that the window currently covers.
