Background:
Fundamental frequency is a basic feature of the speech signal, and it is now widely used in both research and practical applications. It provides an additional criterion for speaker identification, and in the musical instrument market it is used to adjust and improve sound quality.
Principle:
We use a time-domain method, autocorrelation, to accomplish this task.
The procedure is shown in the picture below:
Preprocess (center clip):
If we take the autocorrelation of a frame of speech directly, the result contains too much information. Specifically, it has many peaks, most of which are attributable to the damped oscillations of the vocal tract response. Rapidly changing formant frequencies can therefore confuse the pitch detector.
To avoid this problem, we preprocess the signal with a method called "center clipping", which is a non-linear transformation:
y[n] = C[x[n]]

where the center clipper C, with clipping level CL (a fraction of the frame's peak amplitude), is

y[n] = x[n] - CL   if x[n] > CL
y[n] = 0           if |x[n]| <= CL
y[n] = x[n] + CL   if x[n] < -CL
Through this preprocessing, the peaks are converted into short pulses consisting of the part of each peak that exceeds the clipping level.
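The clipping step can be sketched as follows (a minimal NumPy illustration; the function name and the 30% clipping-level ratio are assumptions for the example, not values stated in the report):

```python
import numpy as np

def center_clip(x, ratio=0.3):
    """Center clipping: zero out samples inside [-CL, CL] and keep only
    the part of each sample that exceeds the clipping level CL, chosen
    here as a fraction of the frame's peak amplitude (assumed 30%)."""
    cl = ratio * np.max(np.abs(x))
    y = np.zeros_like(x, dtype=float)
    y[x > cl] = x[x > cl] - cl     # positive peaks: keep the excess
    y[x < -cl] = x[x < -cl] + cl   # negative peaks: keep the excess
    return y
```

Samples inside the band are zeroed, so only the pulse-like peak tips survive into the autocorrelation.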
Autocorrelation:
The short-time autocorrelation function is defined by:

R_n(k) = sum_{m=0}^{L-1-k} [x(n+m) w(m)] [x(n+m+k) w(m+k)]

where w(m) is a window defined over 0 ≤ m ≤ L-1. This reflects the periodicity of a frame of the signal.
Local maximum value:
Due to the periodicity of pitch, the autocorrelation of a voiced speech frame shows peaks at regular intervals. We only need to locate the peak closest to the origin (excluding the zero-lag peak) to obtain the period.
For example, the second red mark indicates the peak that gives us the period.
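The peak-picking step above can be sketched as a short function (a simplified illustration without the clipping stage; the 50–500 Hz search range and the test tone are assumptions for the example):

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate F0 from one frame: autocorrelate, then pick the lag of
    the highest peak inside a plausible pitch-period range (assumed
    50-500 Hz), skipping the zero-lag peak."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag search window
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

# 40 ms frame of a 170 Hz test tone (hypothetical input)
fs = 8000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 170 * t)
```

Restricting the lag search window is what keeps the strong zero-lag peak and very long lags out of the decision.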
Results:
This picture shows the results for normal speech: the real-time spectrum, the waveform, and the fundamental frequency track. The average fundamental frequency is about 170 Hz, which corresponds to the theoretical range of male pitch.
This picture shows the results when I try to pronounce a single tone (high pitch). The spectral energy is concentrated around 800 Hz and the waveform resembles a sinusoid. Because the tone is held steady, the estimated fundamental frequency is almost constant, around 270 Hz.
Discussion:
Improvement I:
In the preprocessing stage, different variants of "center clipping" can be used depending on the desired tradeoff.
The 3-level clipper quantizes the signal:
Its autocorrelation and a possible result are shown below.
In hardware terms, this requires only some simple combinational logic and an up-down counter to accumulate the autocorrelation value for each lag k.
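A minimal sketch of the 3-level clipper (the function name and the clipping-level ratio are assumptions for the example):

```python
import numpy as np

def three_level_clip(x, ratio=0.3):
    """3-level center clipping: quantize each sample to +1, 0, or -1
    depending on whether it exceeds +CL, lies inside the band, or falls
    below -CL. Products inside the autocorrelation sum are then only
    +1, -1, or 0, which is why hardware needs just combinational logic
    and an up-down counter per lag."""
    cl = ratio * np.max(np.abs(x))
    return np.where(x > cl, 1, np.where(x < -cl, -1, 0))
```

With samples restricted to {-1, 0, +1}, each autocorrelation term reduces to incrementing, decrementing, or skipping the counter for lag k.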
Improvement II:
This clipping-autocorrelation method is not fail-safe: a pitch-doubling error may occur.
This is the STFT of the clipped signal.
The first harmonic peak is weak while the second is strong, so a misjudgment occurs: the detector takes the second harmonic as the fundamental frequency.
Overall, the method still requires a variety of logical tests and post-processing with a non-linear filter to improve accuracy.
# Because the delay is so large, we did not integrate this task with the fundamental frequency monitor; we run them separately.
Background:
Voice activity detection (VAD), also known as voiced/unvoiced detection, speech/word boundary detection, or speech endpoint detection, refers to distinguishing speech from non-speech in a signal stream recorded against a complex noise background, and to determining the start and end points of the speech signal, providing necessary support for subsequent processing. Accurate endpoint detection has practical significance for multi-channel transmission systems, speech recognition systems, and speech enhancement systems: it can improve the efficiency of the transmission system, the accuracy of the recognition system, and the resulting speech quality.
Principles:
Speech endpoint detection essentially distinguishes speech from noise by comparing the two on the same set of parameters. Pre-processing usually includes framing and pre-filtering: framing segments the voice signal into speech frames (each frame usually overlapping its neighbors), and pre-filtering generally uses a high-pass filter to remove low-frequency noise. Parameter extraction selects characteristic parameters that reflect the difference between speech and noise. Endpoint decision applies a decision criterion (such as a threshold test or a pattern classifier) to label each frame as speech or non-speech. Post-processing smooths the decision sequence to obtain the final speech endpoints. Within this pipeline, parameter extraction and endpoint decision are the two key steps.
Parameter extraction selects feature parameters that reflect the difference between speech and noise, and it relies on the characteristics of both. The speech signal is a typical non-stationary signal; however, speech production is tied to the movement of the articulators, which is much slower than the acoustic vibration itself, so the signal can usually be assumed stationary over short intervals. Speech can be roughly divided into unvoiced and voiced sounds. Voiced sounds show clear periodicity in the time domain; in the frequency domain they exhibit formants, with most of the energy concentrated in the lower bands. Unvoiced segments have no obvious time-domain or frequency-domain structure and resemble white noise. Endpoint detection algorithms can therefore exploit the periodicity of voiced speech, while unvoiced speech remains difficult to distinguish from broadband noise.
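The pre-processing stage described above (high-pass pre-filtering plus overlapping framing) can be sketched as follows; the 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame and hop sizes are common defaults assumed for the example, not values given in the report:

```python
import numpy as np

def preprocess(x, fs, frame_ms=25, hop_ms=10):
    """Pre-processing sketch: a first-order high-pass (pre-emphasis)
    filter to suppress low-frequency noise, followed by splitting the
    signal into overlapping frames."""
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])  # pre-emphasis
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = np.stack([x[i:i + flen]
                       for i in range(0, len(x) - flen + 1, hop)])
    return frames
```

Each row of the returned array is one speech frame, ready for per-frame feature extraction.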
Results:
VAD based on magnitude:
The results of this method are shown in the figure: the voiced parts of the speech signal are enclosed in red rectangles, and the start and end points are marked by green circles.
VAD based on energy:
The results of this method are shown in the figure: the voiced parts of the speech signal are enclosed in red rectangles, and the start and end points are marked by purple circles. However, the red rectangles are not accurate enough: the energy-based method makes some wrong judgments.
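A minimal sketch of the energy-based decision (the threshold ratio, frame sizes, and test signal are assumptions for the example; replacing the squared sum with a sum of absolute values gives the magnitude-based variant):

```python
import numpy as np

def vad_energy(x, fs, frame_ms=25, hop_ms=10, thresh=0.1):
    """Short-time-energy VAD: frame the signal, compute the energy of
    each frame, and mark frames whose energy exceeds a fraction of the
    maximum frame energy as speech. Using np.sum(np.abs(...)) instead
    of the squared sum gives the magnitude-based variant."""
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    energies = np.array([np.sum(x[i:i + flen] ** 2)
                         for i in range(0, len(x) - flen, hop)])
    return energies > thresh * energies.max()

# hypothetical test input: one second of silence, then a 200 Hz tone
fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 200 * t)])
```

The boolean frame labels are then smoothed and converted to start/end points in post-processing.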
VAD based on zero crossing rate:
The results of this method are shown in the figure: the voiced parts of the speech signal are enclosed in red rectangles, and the start and end points are marked by green circles.
We can test the performance of ZCR-based VAD on some special speech material. For weak fricatives and weak plosives, such as in "Peter", "Piper", and "picked", the detection performance is poor.
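The zero-crossing-rate feature itself is simple to compute (a minimal sketch; the test signals are assumptions for the example):

```python
import numpy as np

def zcr(frame):
    """Short-time zero-crossing rate: fraction of adjacent sample
    pairs whose signs differ. It tends to be high for noise-like
    (unvoiced) frames and low for voiced frames."""
    signs = np.signbit(frame).astype(int)
    return float(np.mean(np.abs(np.diff(signs))))
```

A low-frequency tone crosses zero rarely, while a rapidly alternating (noise-like) signal crosses on nearly every sample; thresholding this rate gives the frame decision.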
VAD based on correlation:
The results of this method are shown in the figure: the voiced parts of the speech signal are enclosed in red rectangles, and the start and end points are marked by red circles. The correlation-based VAD has weak noise resistance.
We can test the performance of the correlation-based VAD on some special speech material. For voiceless consonants, such as in "street" and "sister", the detection performance is poor.
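A sketch of the per-frame correlation feature (the normalisation, lag range, and test signals are assumptions for the example):

```python
import numpy as np

def periodicity(frame, fs, fmin=50.0, fmax=500.0):
    """Normalised autocorrelation peak in the pitch-lag range: near 1
    for periodic (voiced) frames, small for noise-like frames.
    Thresholding this value gives a simple correlation-based VAD."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if r[0] <= 0:
        return 0.0
    lo, hi = int(fs / fmax), int(fs / fmin)  # assumed 50-500 Hz range
    return float(r[lo:hi].max() / r[0])
```

Because voiceless consonants have little periodicity, their feature value stays low and they are easily missed, which matches the weakness noted above.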
Wavelet algorithm:
Finally, we try the wavelet transform method, which performs best.
Its results are much better than those of the four methods above, but its drawback is that the processing speed is much slower.
Noised speech signal results analysis:
We then evaluate our methods on test data from Tanyer, S. Gökhun, and Hamza Ozer, "Voice activity detection in nonstationary noise," IEEE Transactions on Speech and Audio Processing 8.4 (2000): 478-482.
The performances of the different methods on the noised signals are shown in this table.

Detection performance, from best to worst: wavelet > correlation > short-time zero-crossing rate > short-time energy > short-time magnitude.

Noise-resistance scores of the five methods:
- short-time magnitude: 82
- short-time energy: 85
- short-time zero-crossing rate: 86
- correlation: 91
- wavelet: 95
Result and discussion:
We can now compare the five methods. Short-time energy, short-time magnitude, and short-time zero-crossing rate all run fast; however, short-time energy cannot separate weak fricatives from noise, short-time magnitude has weak noise resistance, and short-time zero-crossing rate cannot detect weak fricatives or weak plosives. The correlation-based VAD detects voiced sounds well, but performs poorly on voiceless consonants. The wavelet-based VAD achieves highly accurate detection, but its running speed is too slow to be acceptable.
Problems and future direction:
Problems:
Future direction: