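Preface: The Speex website is http://speex.org/. The English manual is available there under Documentation, as a PDF or as online HTML. Because my English is limited and I am not familiar with speech coding, the translation may contain mistakes, so I include the original English paragraph with each translated passage. If you spot an error, please point it out so that we can all improve together.
PS: If you repost this, please credit the source. Thank you.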
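2.1 Concepts
Sampling rate
Bit-rate
Quality (variable)
Complexity (variable)
Variable bit-rate
Average bit-rate
Voice activity detection
Discontinuous transmission
Perceptual enhancement
Algorithmic delay
2.2 Codec
2.3 Preprocessor
2.4 Adaptive jitter buffer
2.5 Acoustic echo canceller
2.6 Resampler
Postscript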
This section describes Speex and its features in more detail.
This part introduces Speex and its features in detail.
Before introducing all the Speex features, here are some concepts in speech coding that help better understand the rest of the manual. Although some are general concepts in speech/audio processing, others are specific to Speex
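Before introducing the Speex features, a few speech coding concepts need explaining to make the rest of the manual easier to follow. Some are common to speech and audio processing in general; others are specific to Speex.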
Sampling rate
The sampling rate expressed in Hertz (Hz) is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is equal to Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a fundamental property in signal processing and is described by the sampling theorem. Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.
The sampling rate is the number of samples taken from a continuous signal per second, expressed in Hz. At a sampling rate of Fs kHz, the highest frequency that can be represented is Fs/2 kHz, known as the Nyquist frequency. This fundamental property of signal processing is stated by the sampling theorem. Speex is designed mainly for three sampling rates, 8 kHz, 16 kHz and 32 kHz, called narrowband, wideband and ultra-wideband respectively.
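As a small illustration of the Fs/2 relation, the sketch below computes the Nyquist frequency for the three sampling rates named above. The dictionary and function names are made up for this example and are not part of any Speex API.

```python
# Sampling rates named in the text, in Hz (names are illustrative only).
SPEEX_BANDS = {
    "narrowband": 8_000,
    "wideband": 16_000,
    "ultra-wideband": 32_000,
}


def nyquist_hz(sampling_rate_hz: int) -> float:
    """Highest representable frequency for a given sampling rate (Fs / 2)."""
    return sampling_rate_hz / 2


for name, fs in SPEEX_BANDS.items():
    print(f"{name}: Fs = {fs} Hz, Nyquist = {nyquist_hz(fs):.0f} Hz")
```

So narrowband audio can carry content up to 4 kHz, wideband up to 8 kHz, and ultra-wideband up to 16 kHz.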
Bit-rate
When encoding a speech signal, the bit-rate is defined as the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps), or generally kilobits per second. It is important to make the distinction between kilobits per second (kbps) and kilobytes per second (kBps).
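The bit-rate is the number of bits transmitted per second. When encoding speech, it is the number of bits needed to represent each second of speech, measured in bps (bits per second) or kbps (kilobits per second). Take care to distinguish kbps from kBps (kilobytes per second).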
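The kbps/kBps distinction is a factor of 8, which is easy to get wrong when sizing buffers or estimating bandwidth. A minimal sketch follows; the function names are hypothetical, and the 20 ms default frame duration is an assumption not stated in this passage.

```python
def kbps_to_kBps(kbps: float) -> float:
    """Convert kilobits per second to kilobytes per second (8 bits per byte)."""
    return kbps / 8


def bytes_per_frame(bitrate_bps: float, frame_ms: float = 20.0) -> float:
    """Payload size of one frame at a given bit-rate.

    frame_ms defaults to 20 ms, which is an assumption for this example.
    """
    bits = bitrate_bps * frame_ms / 1000
    return bits / 8


print(kbps_to_kBps(8))        # 8 kbps is 1 kBps
print(bytes_per_frame(8000))  # bytes in one 20 ms frame at 8000 bps
```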
Quality (variable)
Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.
Speex is a lossy codec: it achieves compression at the cost of some fidelity of the input speech signal. Unlike some other speech codecs, it lets you control the trade-off between quality and bit-rate. Most of the time, Speex encoding is controlled by a quality parameter ranging from 0 to 10. In constant bit-rate (CBR) operation the parameter is an integer; in variable bit-rate (VBR) operation it is a float.
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10 in a way that’s similar to the -1 to -9 options to gzip and bzip2 compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 is about 5 times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.
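With Speex, the complexity allowed for the encoder is adjustable. An integer from 1 to 10 controls how the search is performed, much like the -1 to -9 options of the gzip and bzip2 compression tools (translator's note: for bzip2 these options set the block size, 100k to 900k; for gzip they trade speed against compression ratio). In normal use, the noise level at complexity 1 is 1 to 2 dB higher than at complexity 10, while complexity 10 needs about 5 times the CPU of complexity 1. In practice the best trade-off lies between complexity 2 and 4, although higher settings often help when encoding non-speech sounds such as DTMF (dual-tone multi-frequency) tones.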
Variable bit-rate (VBR)
Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to adapt to the “difficulty” of the audio being encoded. In the example of Speex, sounds like vowels and high-energy transients require a higher bit-rate to achieve good quality, while fricatives (e.g. s,f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve lower bit-rate for the same quality, or a better quality for a certain bit-rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there’s no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.
Variable bit-rate (VBR) lets a codec change its bit-rate dynamically to match the “difficulty” of the audio being encoded. In Speex, sounds such as vowels and high-energy transients need a higher bit-rate to reach good quality, while fricatives (such as s and f sounds) can be coded adequately with fewer bits. VBR can therefore reach the same quality at a lower bit-rate, or better quality at a given bit-rate. Despite these advantages, VBR has two main drawbacks. First, because only the quality is specified, nothing guarantees the final average bit-rate. Second, for some real-time applications such as VoIP, what matters is the maximum bit-rate, which must stay low enough for the communication channel.
Average bit-rate (ABR)
Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.
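Average bit-rate (ABR) solves one of the problems of VBR by dynamically adjusting the VBR quality to meet a specific target bit-rate. Because ABR adjusts the quality/bit-rate in real time (open-loop), the overall quality is slightly lower than that of a VBR encode whose quality setting happens to hit the target average bit-rate exactly.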
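Voice activity detection (VAD)
When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).
When enabled, voice activity detection (VAD) detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly on when encoding in VBR, so the option only matters in non-VBR operation. In that case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called “comfort noise generation” (CNG).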
Discontinuous transmission (DTX)
Discontinuous transmission is an addition to VAD/VBR operation that makes it possible to stop transmitting completely when the background noise is stationary. In file-based operation, since we cannot just stop writing to the file, only 5 bits are used for such frames (corresponding to 250 bps).
Discontinuous transmission (DTX) is an addition to VAD/VBR operation that stops transmission completely when the background noise is stationary. In file-based operation, where writing to the file cannot simply stop, such frames use only 5 bits each (corresponding to 250 bps).
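The two figures quoted for DTX frames (5 bits, 250 bps) can be checked against each other. The sketch below does the arithmetic; the 20 ms frame duration is an inference from those two numbers rather than something this passage states, and the function names are illustrative.

```python
def bitrate_bps(bits_per_frame: float, frame_ms: float) -> float:
    """Bit-rate produced by sending bits_per_frame once every frame_ms."""
    return bits_per_frame * 1000 / frame_ms


def implied_frame_ms(bits_per_frame: float, rate_bps: float) -> float:
    """Frame duration implied by a frame size and a bit-rate."""
    return bits_per_frame * 1000 / rate_bps


# 5 bits per frame at 250 bps implies one frame every 20 ms.
print(implied_frame_ms(5, 250))
print(bitrate_bps(5, 20))
```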
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement brings the sound further from the original objectively (e.g. considering only SNR), but in the end it still sounds better (subjective improvement).
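Perceptual enhancement is a part of the decoder. When turned on, it tries to reduce the perceived noise and distortion introduced by encoding and decoding. In most cases it moves the sound objectively further from the original (for example, when only SNR is considered), yet the result still sounds better (a subjective improvement).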
Algorithmic delay
Every speech codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of “look-ahead” required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don’t account for the CPU time it takes to encode or decode the frames.
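Every speech codec introduces a delay in transmission. For Speex, the delay equals the frame size plus the “look-ahead” needed to process each frame. The delay is 30 ms in narrowband (8 kHz) operation and 34 ms in wideband (16 kHz) operation. These figures do not include the CPU time spent encoding or decoding the frames.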
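The delay figures can be split into frame size and look-ahead. The sketch below assumes a 20 ms frame (an assumption, not stated in this passage) and derives the look-ahead from the quoted totals, in milliseconds and in samples. The names are illustrative.

```python
FRAME_MS = 20.0  # assumed frame duration, not taken from the passage above


def look_ahead_ms(total_delay_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Look-ahead = algorithmic delay minus the frame size."""
    return total_delay_ms - frame_ms


def ms_to_samples(duration_ms: float, sampling_rate_hz: int) -> int:
    """Number of samples covering duration_ms at the given sampling rate."""
    return round(duration_ms * sampling_rate_hz / 1000)


for name, fs, delay in [("narrowband", 8_000, 30.0), ("wideband", 16_000, 34.0)]:
    la = look_ahead_ms(delay)
    print(f"{name}: look-ahead {la} ms = {ms_to_samples(la, fs)} samples")
```

Under that assumption, the look-ahead works out to 10 ms in narrowband and 14 ms in wideband.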
The main characteristics of Speex can be summarized as follows:
• Free software/open-source, patent and royalty-free
• Integration of narrowband and wideband using an embedded bit-stream
• Wide range of bit-rates available (from 2.15 kbps to 44 kbps)
• Dynamic bit-rate switching (AMR) and Variable Bit-Rate (VBR) operation
• Voice Activity Detection (VAD, integrated with VBR) and discontinuous transmission (DTX)
• Variable complexity
• Embedded wideband structure (scalable sampling rate)
• Ultra-wideband sampling rate at 32 kHz
• Intensity stereo encoding option
• Fixed-point implementation
The list above summarizes the main characteristics of Speex.
This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the audio before running the encoder. The preprocessor provides three main functionalities:
• noise suppression
• automatic gain control (AGC)
• voice activity detection (VAD)
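This part covers the preprocessor module introduced in the 1.1.x branch. The preprocessor is applied to the audio before it is encoded, and it provides the three main functions listed above.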
The denoiser can be used to reduce the amount of background noise present in the input signal. This provides higher quality speech whether or not the denoised signal is encoded with Speex (or at all). However, when using the denoised signal with the codec, there is an additional benefit. Speech codecs in general (Speex included) tend to perform poorly on noisy input, which tends to amplify the noise. The denoiser greatly reduces this effect.
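The denoiser reduces the amount of background noise in the input signal. This gives higher-quality speech whether or not the denoised signal is then encoded with Speex, or encoded at all. Feeding the denoised signal to the codec brings an extra benefit: speech codecs in general, Speex included, perform poorly on noisy input and tend to amplify the noise, and the denoiser greatly reduces this effect.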
Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the microphone gain to a conservative (low) level, it is easier to avoid clipping.
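Recording volume can vary widely from one setup to another, and automatic gain control (AGC) deals with exactly that. It provides a way to adjust a signal to a reference volume. This is useful for VoIP (voice over IP) because the microphone gain no longer needs manual adjustment. A second advantage is that the microphone gain can be set to a conservative (low) level, which makes clipping easier to avoid.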
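To make "adjust a signal to a reference volume" concrete, here is a toy per-frame gain adjustment on 16-bit samples. It is only a sketch of the idea and not the Speex preprocessor's AGC: the target level, the gain cap and all names are hypothetical, and a real AGC would smooth the gain over time rather than recompute it for every frame.

```python
import math

INT16_MAX = 32767
INT16_MIN = -32768


def rms(frame):
    """Root-mean-square level of a frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def apply_gain_to_reference(frame, target_rms=8000.0, max_gain=30.0):
    """Scale a frame so its RMS approaches target_rms (hypothetical values).

    The gain is capped at max_gain so that near-silence is not blown up,
    and the result is clamped to the 16-bit range to avoid wrap-around.
    """
    level = rms(frame)
    if level == 0.0:
        return list(frame)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [max(INT16_MIN, min(INT16_MAX, round(s * gain))) for s in frame]


quiet = [400, -400] * 80   # low recording level
loud = [20000, -20000] * 80  # high recording level
print(rms(apply_gain_to_reference(quiet)), rms(apply_gain_to_reference(loud)))
```

Both the quiet and the loud frame end up at the same reference level, which is the behaviour the paragraph describes.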
The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the codec.
The voice activity detector (VAD) provided by the preprocessor is more advanced than the one built directly into the codec.
When transmitting voice (or any content for that matter) over UDP or RTP, packets may be lost, arrive with different delays, or even arrive out of order. The purpose of a jitter buffer is to reorder packets and buffer them long enough (but no longer than necessary) so they can be sent to be decoded.
When voice (or any other content) is sent over UDP or RTP, packets may be lost, arrive with varying delay, or even arrive out of order. The jitter buffer reorders the packets and holds them just long enough (and no longer than necessary) before passing them to the decoder.
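The reorder-and-hold behaviour can be sketched with a small priority queue keyed on sequence number. This is a toy with a fixed depth, not the Speex jitter buffer (which adapts its delay), and every name in it is hypothetical.

```python
import heapq


class ToyJitterBuffer:
    """Fixed-depth reordering buffer keyed on packet sequence number."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to hold before releasing any
        self._heap = []         # min-heap of (seq, payload)
        self._next_seq = None   # lowest sequence number still acceptable

    def put(self, seq, payload):
        """Queue a packet; drop it if it arrives after its slot was played."""
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def get(self):
        """Release the oldest packet once `depth` packets are buffered."""
        if len(self._heap) < self.depth:
            return None
        seq, payload = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return seq, payload

    def drain(self):
        """Release everything that is left, in sequence order."""
        items = []
        while self._heap:
            items.append(heapq.heappop(self._heap))
        return items


def reorder(arrivals, depth=3):
    """Feed (seq, payload) pairs in arrival order; return the played order."""
    buf = ToyJitterBuffer(depth)
    played = []
    for seq, payload in arrivals:
        buf.put(seq, payload)
        item = buf.get()
        if item is not None:
            played.append(item[0])
    played.extend(seq for seq, _ in buf.drain())
    return played


arrivals = [(s, b"frame") for s in (1, 3, 2, 5, 4, 6)]
print(reorder(arrivals))
```

A larger depth tolerates more reordering at the cost of more added delay, which is the trade-off behind "long enough, but no longer than necessary".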
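Figure 2.1: Acoustic echo model
In any hands-free communication system (Fig. 2.1), speech from the remote end is played in the local loudspeaker, propagates in the room and is captured by the microphone. If the audio captured from the microphone is sent directly to the remote end, then the remote user hears an echo of his voice. An acoustic echo canceller is designed to remove the acoustic echo before it is sent to the remote end. It is important to understand that the echo canceller is meant to improve the quality on the remote end.
As Figure 2.1 shows, in any hands-free communication system the speech from the remote end is played through the local loudspeaker, propagates in the room and is picked up by the microphone. If the captured audio is sent straight back to the remote end, the remote user hears an echo of their own voice. An acoustic echo canceller removes this echo before the audio is sent to the remote end. It is important to understand that the echo canceller improves the quality heard at the remote end.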
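The idea of removing the echo before it is sent back can be illustrated with a textbook NLMS (normalized least mean squares) adaptive filter. This is a swapped-in teaching example, not the algorithm Speex itself uses, and it assumes there is no near-end speech or noise. The echo path coefficients, step size and names are all hypothetical.

```python
import random


def nlms_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the mic signal."""
    w = [0.0] * taps        # current estimate of the echo path
    history = [0.0] * taps  # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate  # what would be sent to the remote end
        norm = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual


def energy(signal):
    return sum(v * v for v in signal)


def simulate(n=4000, seed=0):
    """Far-end noise played into a made-up room, captured by the microphone."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    echo_path = [0.0, 0.5, 0.3, -0.2, 0.1]  # hypothetical room response
    mic = [
        sum(h * far_end[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    return mic, nlms_cancel(far_end, mic)


mic, residual = simulate()
print(energy(residual[-500:]) / energy(mic[-500:]))  # fraction of echo left
```

Once the filter has converged, the residual sent to the remote end carries only a small fraction of the echo energy captured by the microphone.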
In some cases, it may be useful to convert audio from one sampling rate to another. There are many reasons for that. It can be for mixing streams that have different sampling rates, for supporting sampling rates that the soundcard doesn’t support, for transcoding, etc. That’s why there is now a resampler that is part of the Speex project. This resampler can be used to convert between any two arbitrary rates (the ratio must only be a rational number) and there is control over the quality/complexity tradeoff.
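In some cases it is useful to convert audio from one sampling rate to another. There are many reasons: mixing streams that have different sampling rates, supporting sampling rates that the sound card does not support, transcoding, and so on. That is why a resampler is now part of the Speex project. It can convert between any two rates (the ratio only has to be a rational number), and the quality/complexity trade-off is adjustable.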
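The "rational ratio" condition is met automatically whenever both rates are integers. The sketch below reduces the ratio to lowest terms and computes how many output samples a block of input produces. It only illustrates the arithmetic and does not call the Speex resampler; the function names are made up.

```python
from fractions import Fraction


def resampling_ratio(in_rate: int, out_rate: int) -> Fraction:
    """Output/input rate ratio in lowest terms."""
    return Fraction(out_rate, in_rate)


def output_length(n_in: int, in_rate: int, out_rate: int) -> int:
    """Whole number of output samples produced from n_in input samples."""
    return n_in * out_rate // in_rate


ratio = resampling_ratio(44_100, 48_000)
print(ratio.numerator, ratio.denominator)    # 160 output per 147 input samples
print(output_length(441, 44_100, 48_000))    # 10 ms of input at 44.1 kHz
```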
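Postscript:
Overall, this translation feels rather rough to me, and there are places I did not fully understand. I am posting it here so that everyone can criticize it.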