


1. Concepts



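The sampling rate, expressed in Hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a fundamental property of signal processing and is described by the sampling theorem. Speex is mainly designed for three sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are referred to as narrowband, wideband, and ultra-wideband, respectively.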






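Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the speech being encoded. In Speex, sounds such as vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (e.g. s and f sounds) can be coded adequately with fewer bits. VBR can therefore achieve a lower bit-rate for the same quality, or better quality for the same bit-rate. Despite these advantages, VBR has two main drawbacks: first, only the quality can be specified, so the final average bit-rate is not guaranteed; second, in some real-time applications such as VoIP, the maximum bit-rate must fit within the communication channel.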











3. Preprocessor







Figure 1: Echo cancellation model







4. Adaptive jitter buffer


5. Acoustic echo canceller


6. Resampling





2 Codec description

This section describes Speex and its features in more detail.


2.1 Concepts

Before introducing all the Speex features, here are some concepts in speech coding that help better understand the rest of the manual. Although some are general concepts in speech/audio processing, others are specific to Speex.


Sampling rate

The sampling rate, expressed in Hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz (Fs/2 is known as the Nyquist frequency).

This is a fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.
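To make the Fs/2 rule concrete, here is a minimal sketch that computes the Nyquist frequency for the three sampling rates named above. The names `SPEEX_BANDS_HZ` and `nyquist_hz` are made up for this illustration and are not part of the Speex API.

```python
# Band names and sampling rates are the ones listed in the text.
# The dictionary and function names are illustrative, not Speex API.
SPEEX_BANDS_HZ = {
    "narrowband": 8000,
    "wideband": 16000,
    "ultra-wideband": 32000,
}


def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable at a given sampling rate (Fs / 2)."""
    return sampling_rate_hz / 2


for name, fs in SPEEX_BANDS_HZ.items():
    print(f"{name}: Fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz")
```

So narrowband audio can carry content only up to 4 kHz, which is why wideband speech sounds noticeably clearer.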



Bit-rate

When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps), or generally kilobits per second. It is important to make the distinction between kilobits per second (kbps) and kilobytes per second (kBps).
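The kbps/kBps distinction is a factor of eight, which is easy to get wrong when estimating storage or bandwidth. The sketch below shows the conversion; the helper names are hypothetical, and 1 kbps is taken to mean 1000 bits per second.

```python
BITS_PER_BYTE = 8


def kbps_to_kBps(kilobits_per_second):
    """Convert kilobits per second to kilobytes per second."""
    return kilobits_per_second / BITS_PER_BYTE


def bytes_for_duration(kilobits_per_second, seconds):
    """Encoded payload size in bytes, assuming 1 kbps = 1000 bits/s."""
    return kilobits_per_second * 1000 * seconds / BITS_PER_BYTE


print(kbps_to_kBps(8))            # 1.0
print(bytes_for_duration(8, 60))  # 60000.0
```

One minute of speech at 8 kbps is therefore about 60 kB of payload, not 480 kB.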


Quality (variable)

Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.


Complexity (variable)

With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10, in a way that’s similar to the -1 to -9 options of the gzip and bzip2 compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about 5 times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.


Variable Bit-Rate (VBR)

Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (e.g. s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit-rate for the same quality, or a better quality for a certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there’s no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.


Average Bit-Rate (ABR)

Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.


Voice Activity Detection (VAD)

When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).


Discontinuous Transmission (DTX)

Discontinuous transmission is an addition to VAD/VBR operation that allows transmission to stop completely when the background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).
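The 250 bps figure follows directly from the per-frame bit count once a frame duration is fixed. The sketch below assumes 20 ms frames (50 frames per second), which is the duration that makes 5 bits per frame come out to 250 bps; `frame_bit_rate_bps` is a hypothetical helper name.

```python
def frame_bit_rate_bps(bits_per_frame, frame_ms=20):
    """Bit-rate produced by sending bits_per_frame once every frame_ms.

    frame_ms=20 is an assumption consistent with the 250 bps figure.
    """
    frames_per_second = 1000 / frame_ms
    return bits_per_frame * frames_per_second


print(frame_bit_rate_bps(5))  # 250.0
```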


Perceptual enhancement

Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound further from the original objectively (e.g. considering only SNR), but in the end it still sounds better (subjective improvement).


Latency and algorithmic delay

Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
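Since the delay is frame size plus look-ahead, the look-ahead can be backed out from the totals quoted above. This sketch assumes a 20 ms frame in both modes; the function names are illustrative only.

```python
FRAME_MS = 20  # assumed frame duration for both narrowband and wideband


def lookahead_ms(total_delay_ms, frame_ms=FRAME_MS):
    """Look-ahead implied by a total algorithmic delay and a frame size."""
    return total_delay_ms - frame_ms


def delay_in_samples(delay_ms, sampling_rate_hz):
    """Express a delay in samples at the given sampling rate."""
    return delay_ms * sampling_rate_hz // 1000


# Totals (30 ms and 34 ms) are the figures given in the text.
for name, total_ms, fs in [("narrowband", 30, 8000), ("wideband", 34, 16000)]:
    print(name, lookahead_ms(total_ms), "ms look-ahead,",
          delay_in_samples(total_ms, fs), "samples total")
```

Under that assumption the look-ahead is 10 ms for narrowband and 14 ms for wideband. CPU time and network delay come on top of these figures.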


2.3 Preprocessor

This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the audio before running the encoder. The preprocessor provides three main functionalities:

• noise suppression

• automatic gain control (AGC)

• voice activity detection (VAD)


The denoiser can be used to reduce the amount of background noise present in the input signal. This provides higher quality speech whether or not the denoised signal is encoded with Speex (or at all). However, when using the denoised signal with the codec, there is an additional benefit. Speech codecs in general (Speex included) tend to perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly reduces this effect.


Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.
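The idea of adjusting a signal to a reference volume can be sketched with a single block-level gain. This is not the algorithm the Speex preprocessor uses; it is a deliberately simple RMS normalizer on floating-point samples in the range -1.0 to 1.0, and the names `rms`, `apply_gain_to_target` and `max_gain` are assumptions made for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def apply_gain_to_target(samples, target_rms, max_gain=30.0):
    """Scale a block so its RMS level approaches target_rms.

    max_gain caps the amplification so that near-silent blocks are not
    blown up into loud noise; the output is hard-limited to [-1.0, 1.0].
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


quiet = [0.01, -0.01] * 100           # a very quiet recording
louder = apply_gain_to_target(quiet, target_rms=0.1)
print(round(rms(quiet), 4), "->", round(rms(louder), 4))
```

A practical AGC would smooth the gain over time instead of recomputing it per block, to avoid audible pumping.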


The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the codec.



2.4  Adaptive Jitter Buffer

When transmitting voice (or any content for that matter) over UDP or RTP, packets may be lost, arrive with different delays, or even arrive out of order. The purpose of a jitter buffer is to reorder packets and buffer them long enough (but no longer than necessary) so they can be sent to be decoded.
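The reordering part of that job can be sketched with a small priority queue keyed on sequence number. This is a fixed-depth illustration, not the adaptive Speex jitter buffer: the class name `ReorderBuffer`, its methods and the `depth` parameter are all hypothetical, and a real buffer would also adapt its depth to the observed delay variation and handle sequence-number wraparound.

```python
import heapq


class ReorderBuffer:
    """Holds up to `depth` packets and releases them in sequence order."""

    def __init__(self, depth=3):
        self.depth = depth
        self._heap = []        # min-heap of (sequence number, payload)
        self._next_seq = None  # lowest sequence number still acceptable

    def push(self, seq, payload):
        """Queue a packet; drop it if its slot has already been released."""
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Release the oldest packets once more than `depth` are queued."""
        ready = []
        while len(self._heap) > self.depth:
            seq, payload = heapq.heappop(self._heap)
            self._next_seq = seq + 1
            ready.append((seq, payload))
        return ready

    def flush(self):
        """Release everything that is left, in order (end of stream)."""
        ready = [heapq.heappop(self._heap) for _ in range(len(self._heap))]
        if ready:
            self._next_seq = ready[-1][0] + 1
        return ready


buf = ReorderBuffer(depth=2)
played = []
for seq in [1, 3, 2, 5, 4]:            # packets arriving out of order
    buf.push(seq, f"frame-{seq}")
    played.extend(buf.pop_ready())
played.extend(buf.flush())
print([seq for seq, _ in played])
```

A larger `depth` tolerates more reordering at the cost of added latency, which is the trade-off behind "long enough, but no longer than necessary".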


2.5 Acoustic Echo Canceller

In any hands-free communication system (Fig. 2.1), speech from the remote end is played in the local loudspeaker, propagates in the room and is captured by the microphone. If the audio captured from the microphone is sent directly to the remote end, then the remote user hears an echo of their own voice. An acoustic echo canceller is designed to remove the acoustic echo before it is sent to the remote end. It is important to understand that the echo canceller is meant to improve the quality on the remote end.
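The principle can be illustrated with a normalized LMS (NLMS) adaptive filter: it learns the loudspeaker-to-microphone path from the far-end signal and subtracts the predicted echo from the microphone signal. NLMS is used here only because it is short to write; it is not the algorithm used by the Speex echo canceller, and the function name, parameters and the simulated three-tap echo path are all assumptions of this sketch.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return the mic signal with the estimated far-end echo removed."""
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what gets sent to the remote end
        norm = sum(h * h for h in history) + eps
        step = mu * error / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


# Simulated setup: the microphone hears only an echo of the far-end signal.
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
path = [0.5, 0.3, -0.2]  # made-up room response
mic = [sum(path[k] * far[n - k] for k in range(len(path)) if n - k >= 0)
       for n in range(len(far))]

out = nlms_echo_cancel(far, mic)
echo_energy = sum(v * v for v in mic[-500:])
left_energy = sum(v * v for v in out[-500:])
print("residual/echo energy ratio:", left_energy / echo_energy)
```

In a real call the microphone also picks up near-end speech, so a practical canceller has to avoid adapting while the local user is talking.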


2.6 Resampler

In some cases, it may be useful to convert audio from one sampling rate to another. There are many reasons for that. It can be for mixing streams that have different sampling rates, for supporting sampling rates that the soundcard doesn’t support, for transcoding, etc. That’s why there is now a resampler that is part of the Speex project. This resampler can be used to convert between any two arbitrary rates (the ratio must only be a rational number) and there is control over the quality/complexity tradeoff.
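The "rational ratio" condition means any pair of integer sampling rates works, because their ratio reduces to a fraction of two integers. The sketch below only computes that fraction and the resulting sample counts; it performs no filtering, and the helper names are hypothetical, not the Speex resampler API.

```python
from fractions import Fraction


def resampling_ratio(in_rate_hz, out_rate_hz):
    """Output/input rate as a reduced fraction (numerator/denominator)."""
    return Fraction(out_rate_hz, in_rate_hz)


def output_length(n_input_samples, in_rate_hz, out_rate_hz):
    """Number of whole output samples produced from n_input_samples."""
    return int(n_input_samples * resampling_ratio(in_rate_hz, out_rate_hz))


ratio = resampling_ratio(44100, 48000)
print(ratio.numerator, ratio.denominator)  # 160 147
print(output_length(441, 44100, 48000))    # 480
```

Converting 44.1 kHz to 48 kHz thus produces 160 output samples for every 147 input samples, while 8 kHz to 16 kHz is a plain 2:1 ratio.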


