(Part 1 of this two-part series examines the basic causes of echo in a networking design and provides detailed insight into the echo cancellation architectures that can be used to solve these echo problems.)
As more users turn to IP networks for carrying voice services, the need for strong echo cancellation techniques in networking system designs increases. But implementing echo cancellation is a tricky task that some in the communications sector consider a black art.
In reality, echo cancellation is not a black art. Through the use of proper and careful design practices, designers can effectively bring echo cancellation techniques to VoIP, telecom, and other networking designs.
This two-part series is designed to take the black magic out of echo cancellation design, examining the base technologies and the challenges engineers will face when building an echo canceller. In Part 1, we'll define the key causes of echo in a design as well as the two main cancellation techniques. Then, in Part 2, we'll further the discussion by examining the challenges designers will face when implementing echo cancellation in a system. In Part 2 we'll also explore echo cancellation testing techniques.
Where Does Echo Come From?
There are generally two kinds of echo that can appear when talking on the phone: hybrid echo and acoustic echo. The two differ in where they are created and in their characteristics. Let's look at each in turn, starting with hybrid echo.
1. Hybrid Echo
Line echo (also known as electric or hybrid echo) is created by the electrical circuitry connected to the wire lines. Figure 1 provides a simplified version of a network with two phone users.
In Figure 1, designers can see the network using two-wire lines to connect the user phones with the switching station. Each of the two-wire lines between a phone and the switching station carries voice signals in both directions. The switching station provides the power supply to feed the microphones and the switching functionality, which is needed if there are more than two phone line users.
The above network is indeed simple, and it can operate very well provided the distance between the phone line users is short. Now, if we want to make calls between very distant users, we need to do something about the attenuation of the signals in the long lines. So, we need to amplify the signals. But we can't just amplify what is being sent and received over the two-wire line, because that line carries signals in both directions at the same time.
To solve this problem, designers can separate the send and receive signals on the two-wire line and amplify each of them on its own. This separation is performed by a dedicated electrical device called a hybrid.
The hybrid basically provides a conversion between two-wire and four-wire lines (the switching stations are now connected with four-wire lines). Another reason for signal separation is that the signals can be transmitted over digital networks between the switching stations. Digital transmission improves the quality of the calls and increases the capacity of the phone networks due to digital signal compression. This makes more efficient use of the network equipment.
Figure 2 shows a phone network connecting the distant phone line users while also showing signal amplification being performed on separated send and receive channels.
Now, in Figure 2, the interesting part is the hybrid, because the hybrid is where the echo is created. Ideally, the hybrid should carry the sum of the send and receive signals on the two-wire side and the same signals separated on the four-wire side. In reality, factors such as the spread of equipment parameters and the mismatch of line impedances all contribute to imperfect signal separation in the hybrid, and this imperfect separation is what creates the echo (Figure 3).
So part of the signal sent to the hybrid on the four-wire side is returned as echo, superimposed on the signal received from the hybrid on the four-wire side. If, for example, the left hybrid in Figure 2 has this kind of impairment, then the right talker will hear his own voice in the handset as an echo, and the greater the distance (and thus the signal delay) between the user phones, the more audible this echo will be.
Hybrid echoes have a few peculiar properties. One is that the echo path delays are very short and each hybrid has a single echo path. Another is that the echo paths don't change, or change only very slowly, over time, because the parameters of the electrical circuitry and wire lines in the network change very slowly. These properties allow hybrid echoes to be removed easily and effectively wherever they appear.
Since the hybrid echo is inherent in the designs involving two- to four-wire conversion, we always need to cancel this echo in the switching stations and any other devices having this kind of conversion. Regular phones, which are connected by two-wire lines with the switching stations, may also have this kind of conversion, but there's an excuse for not doing full-blown echo cancellation in the regular phones.
The delay in the electrical path between the microphone and earpiece (or speaker) in the phone is essentially zero, so a cheap transformer-based attenuator can be used, and hearing your own undelayed voice at low amplitude causes little, if any, discomfort. In fact, hearing one's own voice is desirable: people expect to hear themselves, and when they can't, they think that the phone is not working.
However, echo cancellation is required in hands-free phones and in all phones that amplify the signal right before the earpiece or loudspeaker. Not doing echo cancellation in such devices leads not only to clearly audible echoes for those who call such phones (acoustic echo cancellation is treated in the following section), but also to self-excitation of the amplifier.
The self-excitation results from non-ideal signal separation in the phone's hybrid: part of the signal from the microphone is reflected at the hybrid into the other signal path and gets amplified by the amplifier. Thus, what the microphone picks up can be heard from the speaker. The acoustic feedback from the speaker back to the microphone then closes the loop between the amplifier's output and input, effectively turning the amplifier into an oscillator. Therefore, hands-free and all other amplifying phones (for example, phones for people with hearing impairments, who tend to speak louder) must include line echo cancellers.
Unlike phones, dial-up modems and fax machines always employ built-in echo cancellers to combat the local echo, because these digital devices are much more sensitive to distortions of the received signal than humans are. The same echo cancellation may be desirable in answering machines, which record the voice from the phone line.
2. Acoustic Echo
The second kind of echo is acoustic echo. It is easier to understand why and where this echo is created, although, as we will see later, that doesn't make it any easier to cancel efficiently.
The loudspeaker in a phone creates acoustic echo. The sound coming out of the speaker bounces off the walls, ceiling, and other objects in the room and comes back to the phone's microphone. The same thing can occur anywhere sound from the loudspeaker can be reflected to the microphone. Similarly, if there's poor acoustic decoupling between the microphone and earpiece in the handset, acoustic echo will exist in the handset, whether it's a regular or a cellular phone (Figure 4).
Acoustic echoes differ a lot from hybrid echoes. First of all, the echo path delays aren't short. The echo path delay is the echo path length divided by the wave propagation speed. Electromagnetic (EM) waves propagate in the wires at about the speed of light, i.e. about 3×10⁸ meters/second, while sound propagates in air at about 3×10² meters/second. As you can see, the difference is 6 orders of magnitude.
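To make the delay arithmetic concrete, here is a minimal sketch that applies the length-divided-by-speed rule to both kinds of path. The 6-meter path length is a made-up illustration, not a figure from any measurement, and the speeds are the rounded values quoted in the text.

```python
# Rough propagation speeds as quoted in the text (orders of magnitude only).
SPEED_IN_WIRE_M_PER_S = 3e8    # electromagnetic wave, about the speed of light
SPEED_OF_SOUND_M_PER_S = 3e2   # sound in air, rounded


def echo_path_delay_ms(path_length_m: float, speed_m_per_s: float) -> float:
    """Return the echo path delay in milliseconds: path length / speed."""
    return path_length_m / speed_m_per_s * 1000.0


# Hypothetical example: a 6 m path (speaker -> wall -> microphone, or 6 m of wire).
acoustic_ms = echo_path_delay_ms(6.0, SPEED_OF_SOUND_M_PER_S)
electrical_ms = echo_path_delay_ms(6.0, SPEED_IN_WIRE_M_PER_S)
print(f"acoustic: {acoustic_ms:.1f} ms, electrical: {electrical_ms:.6f} ms")
```

For the same path length, the acoustic delay comes out at about 20 ms while the electrical delay is a tiny fraction of a microsecond, which is the six-orders-of-magnitude gap described above.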
The echo path is determined by the size of the room where the phone is used and, obviously, the larger the room, the longer the echo path delay. If we don't cancel this acoustic echo, the person who has called the hands-free phone can hear a very annoying echo, delayed by the sum of the acoustic echo path delay in the room and the round-trip delay in the network between the phones.
But longer echo path delays aren't the only interesting feature of acoustic echo. The other, which complicates acoustic echo cancellation, is that there are many echo paths in the room, since the sound can be reflected to the microphone by many objects, and these paths can vary over time as the objects move. Suppose you move around the room, or somebody opens or closes the door. This changes the effective echo path.
Approaching Echo Cancellation
As the exact network and room echo paths (and their impulse responses) are generally unknown, there's no simple means of removing the echo other than an adaptive system. Let's look at how echo cancellation can be done in general for one direction of transmission. We will explain this using the example of hybrid echo cancellation; most of it also applies to acoustic echo cancellation.
Figure 5 shows a switching station with an echo canceling system integrated on the four-wire side between points A, B, C, and D. The signals (all shown as functions of the sample number, i) are as follows: x(i) is the signal from the user connected by a two-wire line to the switching station (near-end talker signal), while y(i) and u(i) are the signals from and to the other phone line user (far-end talker), which come through the four-wire line.
The idea of an echo canceller is simple. The signal from the far-end talker, y(i), when passing through the hybrid's echo path (between points B and A) is affected by the echo path's impulse response and is transformed to the signal r(i), which is the undesired echo. The signal from the near-end talker, x(i), is added to r(i) at point A. The adaptive filter (normally, FIR) used in the system mimics the impulse response of the hybrid's echo path and produces a replica, r'(i), of the echo signal r(i). If r(i) and r'(i) are the same, then they will cancel each other in the summer connected between points A and C and the filter's output. If r(i) and r'(i) aren't the same, the far-end talker will be hearing not only the near-end talker's x(i) signal, but also the difference of r(i) and r'(i), which is called the residual echo error signal.
The residual echo error, e(i) = r(i) - r'(i), is used to adapt the filter's coefficients. That is, the echo canceller is a tracking system that uses the residual echo error signal as feedback, and the purpose of this system is to minimize this error.
Obviously, since the echo path's impulse response is unknown, some time is needed for the echo canceller to minimize the residual echo error signal below a required level. This time is called the convergence time. Note that while the far-end talker's signal y(i) is equal to zero, the echo canceller is not able to converge, because both r(i) and r'(i) are zero, thus the feedback is also zero and there's no adaptation possible. This is why the reference signal y(i) should be present in the beginning of the conversation; grabbing the handset and just saying "hello" should be more than enough for the adaptation to proceed.
Note that adaptation is possible only when the near-end talker's signal x(i) is zero or close to zero; otherwise x(i) effectively acts as additive noise in the feedback, causing the system to become unstable, diverge, and stop working. This is why, during double-talk periods, i.e. when both the near-end and far-end talkers talk simultaneously, the filter coefficients are either not adapted at all or adapted very slowly.
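The article does not say how double-talk is detected. One widely used approach is the Geigel detector, sketched below as an illustration rather than as the method used in any particular canceller. It compares the sample at point A with the recent peak of the far-end reference y(i). The class name, the window length, the 0.5 threshold (which assumes the hybrid attenuates the echo by at least 6 dB), and the hangover length are all assumptions chosen for the example.

```python
from collections import deque


class GeigelDoubleTalkDetector:
    """Geigel-style detector: flags double-talk when the near-end sample is
    large compared with the recent peak of the far-end reference."""

    def __init__(self, window: int = 128, threshold: float = 0.5,
                 hangover: int = 240):
        self.history = deque([0.0] * window, maxlen=window)  # recent |y(i)|
        self.threshold = threshold  # 0.5 assumes >= 6 dB of loss in the hybrid
        self.hangover = hangover    # samples to keep adaptation frozen
        self.counter = 0

    def update(self, y: float, d: float) -> bool:
        """y: far-end sample; d: sample at point A (echo plus near-end talk).
        Returns True while coefficient adaptation should be frozen."""
        self.history.append(abs(y))
        if abs(d) > self.threshold * max(self.history):
            self.counter = self.hangover   # near-end activity: start hangover
        elif self.counter > 0:
            self.counter -= 1              # hangover still running
        return self.counter > 0


# Usage: echo only first, then a near-end talker joins.
dtd = GeigelDoubleTalkDetector(window=4, hangover=2)
print(dtd.update(y=1.0, d=0.2))   # False: |d| is below half the far-end peak
print(dtd.update(y=1.0, d=0.9))   # True: near-end talker detected
```

The hangover keeps adaptation frozen for a short while after the last detection, so brief pauses within near-end speech do not let the filter adapt on a contaminated error signal.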
It is worth mentioning that such an echo cancellation scheme is inherently linear and is very sensitive to nonlinearities in the echo path. A linear system cannot match the response of a nonlinear echo path and therefore cannot effectively remove the echo.
Good echo cancellation performance can be achieved by using the normalized least mean squares (NLMS) algorithm, also known as the normalized stochastic gradient algorithm, or one of its many variations. The NLMS algorithm is the most widely used one, and it provides a low-cost way to determine the optimum filter coefficients. The algorithm minimizes the mean square of the residual echo error signal at each adaptation step (i.e. at each sample), hence its name. Normalization by the signal power is used because speech is a highly non-stationary process.
Without derivations, which you can find elsewhere in the literature on adaptive signal processing, the coefficient adaptation for the NLMS algorithm can be given, in its common textbook form, as:
$$a_k(i+1) = a_k(i) + \beta \, \frac{e(i)\, y(i-k)}{N \sigma^2(i)}, \qquad k = 0, 1, \ldots, N-1$$
Where: i is the sample number
a_k is the k-th coefficient of the filter
N is the number of filter coefficients
β is the adaptation step, which controls the convergence time and adaptation quality
e is the residual echo error signal
y is the far-end talker signal
σ² is the reference signal power, taken here as the mean of y² over the N most recent samples, so that Nσ² is their energy
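As an illustration of the coefficient adaptation described above, here is a minimal time-domain NLMS sketch in plain Python. It uses the standard textbook normalization by the energy of the N most recent reference samples, which is an assumption about the exact form intended here. The function name, the small regularization constant eps, the made-up four-tap echo path, and the white-noise reference are all hypothetical test scaffolding. With no near-end speech, the signal at point A is just the echo, d(i) = r(i), so the canceller output equals the residual error e(i).

```python
import random


def nlms_echo_canceller(y, d, num_taps, beta=0.5, eps=1e-6):
    """Cancel the echo of reference y from d with a time-domain NLMS FIR filter.

    y: far-end (reference) samples; d: samples at point A (echo + near-end).
    Returns (u, a): the output after subtracting the echo replica, and the taps.
    """
    a = [0.0] * num_taps     # filter coefficients a_k
    buf = [0.0] * num_taps   # buf[k] holds y(i - k)
    u = []
    for y_i, d_i in zip(y, d):
        buf = [y_i] + buf[:-1]                             # shift in new sample
        replica = sum(ak * yk for ak, yk in zip(a, buf))   # r'(i)
        e = d_i - replica                                  # residual error e(i)
        energy = sum(yk * yk for yk in buf) + eps          # N * sigma^2(i) + eps
        step = beta * e / energy
        a = [ak + step * yk for ak, yk in zip(a, buf)]     # coefficient update
        u.append(e)
    return u, a


# Hypothetical test setup: a short made-up echo path and a white-noise reference.
random.seed(0)
echo_path = [0.0, 0.5, -0.3, 0.1]
y = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(h * y[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
     for i in range(len(y))]
u, a = nlms_echo_canceller(y, d, num_taps=8)
print([round(ak, 3) for ak in a])   # taps approach echo_path, padded with zeros
```

In this noise-free setup the coefficients settle on the echo path and the residual shrinks toward zero. Real speech, near-end talk, and line noise make convergence slower and less exact, which is why the step β and the double-talk handling matter in practice.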
The number of coefficients in the filter should be large enough to cover the echo path delay and all additional delays due to the lines and circuitry between the echo canceller and the place where the echo is created (i.e. the total delay between points A and B of the echo canceller as per Figure 5). This should also include the dispersion time due to the network elements.
The hybrid echo path delay is known to be short. The time span over which its impulse response is significant is typically 2 to 4 milliseconds. However, when canceling hybrid echo, the number of filter coefficients is usually chosen to cover echo path delays of up to 16 milliseconds, which is usually the upper bound of the hybrid echo path delay. A 16 ms path needs 128 coefficients at a sampling rate of 8 kHz.
The length of the acoustic echo path, as has already been pointed out, depends on the size of the room. So, without going into room measurements, for acoustic echo path delays of up to 256 ms, we will need 2048 coefficients at a sampling rate of 8 kHz.
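The coefficient counts above follow directly from multiplying the delay to be covered by the sampling rate. A small sketch, with a hypothetical helper name, reproduces both figures:

```python
import math


def taps_for_delay(max_delay_ms: float, sample_rate_hz: int = 8000) -> int:
    """Number of FIR coefficients needed to span max_delay_ms of echo path."""
    return math.ceil(max_delay_ms * sample_rate_hz / 1000)


print(taps_for_delay(16))    # hybrid echo: 128 coefficients at 8 kHz
print(taps_for_delay(256))   # acoustic echo: 2048 coefficients at 8 kHz
```

Rounding up ensures that a delay which is not a whole number of sample periods is still fully covered by the filter.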
Canceller Performance
NLMS-based echo cancellers for both hybrid and acoustic echo do exist and perform well. However, acoustic echo cancellation is more complex because of the specifics of the acoustic echo paths and the need for acoustic echo cancellers to operate in the presence of noise in the echo path (examples: noise in a car, noise in a crowded room). For these reasons, researchers have proposed and implemented a number of enhancements in acoustic echo cancellers (AECs).
There are certain improvements possible when employing a frequency-domain AEC. As you have probably realized, the NLMS algorithm presented earlier operates entirely in the time domain; no work is done on the spectrum.
The first problem with time-domain AECs is their resource requirements, measured in MIPS. Normally, AECs need many filter coefficients to cancel the acoustic echo efficiently. But the long convolutions needed to generate the replica of the echo signal are expensive when computed directly in the time domain. It is possible to modify the initial NLMS algorithm so that the filter coefficients are updated once per block of N samples, y(i)...y(i+N-1), instead of at each new sample y(i).
The modified algorithm is called the block NLMS (or BNLMS) algorithm. The advantage of keeping the coefficients fixed during the block of N samples is that it is possible to replace the time-domain convolution with multiplication in the frequency domain. Direct computation of the convolution in the time domain has a cost proportional to N² (i.e. the number of multiplications, if we compute it for N samples and there are N coefficients in the filter). Frequency-domain processing requires computation of several discrete Fourier transforms (DFTs) of the signals.
The fast Fourier transform (FFT) is a very efficient DFT implementation, known to have a computational cost proportional to N·log2(N). So for large N it is more efficient to compute the convolution by FFT.
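To get a feel for the gap between the two growth rates, the sketch below compares them for the two filter lengths used earlier. It is an order-of-magnitude comparison only: the function names are hypothetical, and constant factors are ignored, including the fact that a real frequency-domain canceller needs several FFTs per block, typically of length 2N.

```python
import math


def direct_cost(n: int) -> int:
    """Multiplications for N output samples through an N-tap FIR filter."""
    return n * n


def fft_cost(n: int) -> float:
    """Order-of-magnitude FFT cost, N * log2(N); constant factors ignored."""
    return n * math.log2(n)


for n in (128, 2048):
    ratio = direct_cost(n) / fft_cost(n)
    print(f"N={n}: direct {direct_cost(n)}, "
          f"FFT-based ~{fft_cost(n):.0f}, ratio ~{ratio:.0f}x")
```

Even after allowing for the ignored constants, the advantage grows with N, which is why frequency-domain processing pays off mainly for the long filters that acoustic echo requires.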
The disadvantage of using BNLMS is slower convergence, which is due to the effective scaling of the adaptation step β down by a factor of N. Also, this and similar simple frequency-domain echo cancellers introduce a processing delay. Hence, we have a tradeoff between convergence time, delay, and cost. It should be noted, however, that FFTs require additional memory, so we also trade memory for MIPS.
Performing in Light of Noise
Another problem with AECs is that a time-domain NLMS implementation will perform poorly in the presence of noise (BNLMS suffers from noise too). This problem is especially important for AECs that need to work in cars, crowded rooms, or other noisy locations.
When implementing AECs with frequency-domain analysis and synthesis blocks (or sub-band processing), it is possible not only to reduce the computational cost, but also to reduce the processing delay, improve noise immunity, and even suppress the noise with frequency-domain noise suppressors integrated directly into such AECs. Doing so greatly improves the resulting speech quality.
Also, with frequency-domain processing, it's possible to have better immunity to nonlinearities in the echo paths, because the AEC will adapt to the strong fundamental frequencies, while their weak harmonics can be suppressed as part of the noise. All of this is impossible to achieve with time-domain AECs.
Finally, it is important to know how the echo canceller behaves in double-talk situations, when both talkers talk simultaneously. Obviously, the parties prefer to hear each other throughout the entire conversation and to hear little or no echo during double-talk. Therefore, the echo canceller's double-talk performance should also be addressed when designing and testing echo cancellers, or simply when choosing which one to integrate into the phone.
Frequency-domain echo cancellers are very effective in canceling acoustic echoes. Compared with time-domain AECs, they need fewer DSP MIPS, perform better in double-talk situations, work well in the presence of noise, can have embedded noise suppression almost for free, and perform better with nonlinearities in the echo path. This is why frequency-domain echo cancellers should be preferred over time-domain ones.
On to Part 2
That wraps up Part 1 in our series on demystifying echo cancellation design. In Part 2, we'll further the discussion by looking at the most typical design and integration mistakes that lead to failures in achieving echo cancellation.