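Preface: The Speex website is http://speex.org/; the English manual is available there under Documentation, in PDF or online HTML form. My English and my familiarity with speech coding are both limited, so my reading of the manual may contain errors. For that reason the original English text of every passage is kept here. Corrections are very welcome, so that we can all improve together.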




Before introducing all the Speex features, here are some concepts in speech coding that help better understand the rest of the manual. Although some are general concepts in speech/audio processing, others are specific to Speex.




The sampling rate, expressed in Hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.
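As a small worked example of the Fs/2 rule, the sketch below computes the Nyquist frequency for the three sampling rates named above. The dictionary and function names are illustrative and are not part of any Speex API.

```python
# Speex's three design sampling rates, as listed in the text (in Hz).
SPEEX_MODES = {
    "narrowband": 8000,
    "wideband": 16000,
    "ultra-wideband": 32000,
}

def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable at a given sampling rate (Fs / 2)."""
    return sampling_rate_hz / 2

for name, fs in SPEEX_MODES.items():
    print(f"{name}: Fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz")
```

Narrowband audio therefore carries content only up to 4 kHz, while ultra-wideband extends to 16 kHz.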



When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps), or generally kilobits per second. It is important to make the distinction between kilobits per second (kbps) and kilobytes per second (kBps).
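The kbps/kBps distinction is a factor of eight, which is easy to get wrong when sizing storage or bandwidth. The sketch below makes the conversion explicit; the 8 kbps figure is only an illustrative value, and 1 kbps is assumed to mean 1000 bits per second.

```python
BITS_PER_BYTE = 8

def kbps_to_kBps(kilobits_per_second):
    """Convert kilobits per second (kbps) to kilobytes per second (kBps)."""
    return kilobits_per_second / BITS_PER_BYTE

def bytes_for_duration(kilobits_per_second, seconds):
    """Payload size in bytes, taking 1 kbps as 1000 bits per second."""
    return kilobits_per_second * 1000 * seconds / BITS_PER_BYTE

# Illustrative value: one minute of speech encoded at 8 kbps.
print(kbps_to_kBps(8))            # kilobytes per second
print(bytes_for_duration(8, 60))  # bytes in one minute
```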




Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.




With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10 in a way that’s similar to the -1 to -9 options to gzip and bzip2 compression utilities. (Blogger's note: for bzip2 these options set the compression block size, from 100k to 900k.) For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about 5 times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF (dual-tone multi-frequency) tones.



Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (e.g. s, f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit-rate for the same quality, or a better quality for a certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there’s no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.



Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.
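The idea of steering VBR quality toward a target bit-rate on the fly, without re-encoding earlier frames, can be illustrated with a toy controller. This is not Speex's actual ABR algorithm, which the text does not describe: `toy_bitrate_bps` is a hypothetical stand-in for an encoder, and the step size and linear bit-rate model are assumptions chosen only to make the feedback visible.

```python
def toy_bitrate_bps(quality):
    """Hypothetical encoder model: bit-rate grows linearly with quality."""
    return 2000 + 2200 * quality

def abr_step(quality, measured_bps, target_bps, step=0.1):
    """Nudge the VBR quality (0 to 10) toward the target bit-rate."""
    if measured_bps > target_bps:
        quality -= step
    elif measured_bps < target_bps:
        quality += step
    return max(0.0, min(10.0, quality))

def run_abr(target_bps, start_quality=8.0, frames=100):
    """Apply the adjustment once per frame; past frames are never revisited."""
    quality = start_quality
    for _ in range(frames):
        quality = abr_step(quality, toy_bitrate_bps(quality), target_bps)
    return quality

final_quality = run_abr(target_bps=15000)
print(f"quality settles near {final_quality:.1f}, "
      f"about {toy_bitrate_bps(final_quality):.0f} bps")
```

Because the controller only reacts after the fact, the quality hovers around the ideal setting instead of sitting exactly on it, which mirrors the slight quality loss the paragraph above attributes to ABR.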




When enabled, voice activity detection (VAD) detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).
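To make the speech versus non-speech decision concrete, here is a minimal energy-threshold detector. It is a textbook simplification swapped in for illustration, not the detector Speex uses (the text does not describe that one); the threshold and the sample values are assumptions for 16-bit audio.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of PCM samples."""
    return sum(s * s for s in frame) / len(frame)

def is_speech(frame, threshold=1e6):
    """Classify a frame as speech when its energy exceeds the threshold."""
    return frame_energy(frame) > threshold

# Two illustrative 160-sample frames of 16-bit audio.
background = [10, -12, 8, -9] * 40
voiced = [3000, -2500, 2800, -3100] * 40

print(is_speech(background))
print(is_speech(voiced))
```

A fixed threshold fails as soon as the background noise level changes, which is one reason practical detectors track the noise floor over time.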




Discontinuous transmission (DTX) is an addition to VAD/VBR operation that allows transmission to stop completely when the background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).
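The 250 bps figure follows from the frame duration. The passage does not state that duration; the sketch below assumes 20 ms frames, which is the value implied by 5 bits corresponding to 250 bps.

```python
def bitrate_bps(bits_per_frame, frame_ms=20):
    """Bit-rate for a fixed number of bits per frame.

    frame_ms=20 is an assumption implied by the 5 bits <-> 250 bps figure.
    """
    frames_per_second = 1000 / frame_ms
    return bits_per_frame * frames_per_second

print(bitrate_bps(5))  # DTX frames in file-based operation
```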




Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound further from the original objectively (e.g. considering only SNR), but in the end it still sounds better (subjective improvement).




Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
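The delay figures can be split into their two parts. The sketch below assumes a 20 ms frame, a value the passage does not give, and derives the look-ahead that the quoted 30 ms and 34 ms delays would then imply, along with the number of samples per frame in each mode.

```python
FRAME_MS = 20  # assumption: frame duration, not stated in this passage

# Algorithmic delays and sampling rates quoted in the text.
ALGORITHMIC_DELAY_MS = {"narrowband": 30, "wideband": 34}
SAMPLING_RATE_HZ = {"narrowband": 8000, "wideband": 16000}

def look_ahead_ms(mode):
    """Look-ahead implied by delay = frame size + look-ahead."""
    return ALGORITHMIC_DELAY_MS[mode] - FRAME_MS

def samples_per_frame(mode):
    """Number of PCM samples in one frame at the mode's sampling rate."""
    return SAMPLING_RATE_HZ[mode] * FRAME_MS // 1000

for mode in ALGORITHMIC_DELAY_MS:
    print(f"{mode}: {samples_per_frame(mode)} samples/frame, "
          f"look-ahead {look_ahead_ms(mode)} ms")
```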


2.2 Codec

The main characteristics of Speex can be summarized as follows:
    • Free software/open-source, patent and royalty-free
    • Integration of narrowband and wideband using an embedded bit-stream
    • Wide range of bit-rates available (from 2.15 kbps to 44 kbps)
    • Dynamic bit-rate switching (AMR) and Variable Bit-Rate (VBR) operation
    • Voice Activity Detection (VAD, integrated with VBR) and discontinuous transmission (DTX)
    • Variable complexity
    • Embedded wideband structure (scalable sampling rate)
    • Ultra-wideband sampling rate at 32 kHz
    • Intensity stereo encoding option
    • Fixed-point implementation

2.3 Preprocessor

This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the audio before running the encoder. The preprocessor provides three main functionalities:
• noise suppression
• automatic gain control (AGC)
• voice activity detection (VAD)


The denoiser can be used to reduce the amount of background noise present in the input signal. This provides higher quality speech whether or not the denoised signal is encoded with Speex (or at all). However, when using the denoised signal with the codec, there is an additional benefit. Speech codecs in general (Speex included) tend to perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly reduces this effect.



Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.
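Adjusting a signal to a reference volume can be sketched as a single gain computed from the measured level. This is a deliberately simple block-based illustration, not the Speex preprocessor's AGC: the target level, the gain cap and the function names are all assumptions, and a practical AGC would change its gain gradually over time.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def apply_agc(samples, target_rms=8000.0, max_gain=20.0, limit=32767):
    """Scale a block toward target_rms, capping the gain and clamping to 16 bits."""
    level = rms(samples)
    if level == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [max(-limit, min(limit, round(s * gain))) for s in samples]

quiet = [400, -400] * 80      # low microphone gain
loud = [16000, -16000] * 80   # high microphone gain

print(rms(apply_agc(quiet)), rms(apply_agc(loud)))
```

Both blocks end up at the same reference level, which is the behaviour that removes the need for manual microphone adjustment.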


The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the codec.


2.4 Adaptive Jitter Buffer

When transmitting voice (or any content for that matter) over UDP or RTP, packets may be lost, arrive with different delays, or even arrive out of order. The purpose of a jitter buffer is to reorder packets and buffer them long enough (but no longer than necessary) so they can be sent to be decoded.
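The reordering role of a jitter buffer can be shown with a small fixed-depth buffer keyed on sequence numbers. This is only a sketch: the class and its methods are hypothetical and unrelated to the Speex jitter buffer API, the depth is fixed where Speex's buffer is adaptive, and sequence-number wrap-around is ignored.

```python
import heapq

class JitterBuffer:
    """Fixed-depth reordering buffer (hypothetical, for illustration only)."""

    def __init__(self, depth=3):
        self.depth = depth        # packets to hold before releasing one
        self._heap = []           # min-heap ordered by sequence number
        self._last_played = None  # sequence number of the last released packet

    def put(self, seq, payload):
        """Store a packet; drop it if its slot has already been played."""
        if self._last_played is not None and seq <= self._last_played:
            return False
        heapq.heappush(self._heap, (seq, payload))
        return True

    def get(self):
        """Release the oldest packet once enough packets are buffered."""
        if len(self._heap) < self.depth:
            return None
        return self._pop()

    def drain(self):
        """Release everything that is left, in order."""
        out = []
        while self._heap:
            out.append(self._pop())
        return out

    def _pop(self):
        seq, payload = heapq.heappop(self._heap)
        self._last_played = seq
        return seq, payload

def play_order(arrivals, depth=3):
    """Feed sequence numbers in arrival order and return the playback order."""
    buf = JitterBuffer(depth)
    played = []
    for seq in arrivals:
        buf.put(seq, f"frame-{seq}")
        packet = buf.get()
        if packet is not None:
            played.append(packet[0])
    played.extend(seq for seq, _ in buf.drain())
    return played

print(play_order([1, 3, 2, 5, 4, 6]))
```

A larger depth tolerates more reordering at the cost of added delay, which is the trade-off behind "long enough, but no longer than necessary".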


2.5 Acoustic Echo Cancellation


Figure 2.1: Acoustic echo model

In any hands-free communication system (Fig. 2.1), speech from the remote end is played in the local loudspeaker, propagates in the room and is captured by the microphone. If the audio captured from the microphone is sent directly to the remote end, then the remote user hears an echo of his voice. An acoustic echo canceller is designed to remove the acoustic echo before it is sent to the remote end. It is important to understand that the echo canceller is meant to improve the quality on the remote end.
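The principle of removing the echo before it is sent back can be sketched with a normalized LMS (NLMS) adaptive filter, a standard textbook technique used here in place of Speex's own canceller, which this text does not describe. The filter estimates the loudspeaker-to-microphone path from the far-end signal and subtracts the predicted echo from the microphone signal. The echo path, tap count and step size are made-up values, and no near-end speech is simulated.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the mic signal."""
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # recent far-end samples, newest first
    cleaned = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # what is sent to the remote end
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm
                   for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned

def energy(samples):
    return sum(s * s for s in samples)

# Simulate a far-end signal and its echo through a made-up room response.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.0, 0.5, 0.3, 0.1]  # hypothetical loudspeaker-to-mic path
mic = [sum(echo_path[k] * far_end[n - k]
           for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far_end))]

cleaned = nlms_echo_cancel(far_end, mic)
print(energy(mic[-200:]), energy(cleaned[-200:]))
```

Once the filter has converged, the residual sent to the remote end carries far less energy than the raw microphone signal.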


2.6 Resampler

In some cases, it may be useful to convert audio from one sampling rate to another. There are many reasons for that. It can be for mixing streams that have different sampling rates, for supporting sampling rates that the soundcard doesn’t support, for transcoding, etc. That’s why there is now a resampler that is part of the Speex project. This resampler can be used to convert between any two arbitrary rates (the ratio must only be a rational number) and there is control over the quality/complexity tradeoff.
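The requirement that the ratio be a rational number can be made concrete by reducing the two rates to lowest terms. The helper names below are illustrative and unrelated to the Speex resampler API; the 44.1 kHz and 48 kHz rates are simply common examples.

```python
from fractions import Fraction

def resampling_ratio(in_rate_hz, out_rate_hz):
    """Output/input rate ratio, reduced to lowest terms."""
    return Fraction(out_rate_hz, in_rate_hz)

def output_length(num_input_samples, in_rate_hz, out_rate_hz):
    """Whole number of output samples produced for a given input length."""
    ratio = resampling_ratio(in_rate_hz, out_rate_hz)
    return num_input_samples * ratio.numerator // ratio.denominator

ratio = resampling_ratio(44100, 48000)
print(ratio.numerator, ratio.denominator)
print(output_length(441, 44100, 48000))
```

Converting 44.1 kHz audio to 48 kHz thus produces 160 output samples for every 147 input samples.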




