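Contents
I. Framing Strategies in Real-Time Speech Processing
1. Framing strategy: method 1
2. Framing strategy: method 2
3. Code implementation
3.1 Method 1
3.2 Method 2
3.3 Implementation 3
II. Simulating Real-Time Processing
1. Splitting in the time domain (equal blocks)
2. Splitting in the time domain (overlapping blocks)
III. A Real-Time Processing Example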
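Previous article: A Brief Introduction to Preprocessing in Speech Signal Processing (II)
That second preprocessing article explained how speech is split into frames. The method and implementation shown there were chosen only to make the theory easier to follow, and they have clear drawbacks:
(1) Memory: a real-time speech task cannot set aside a large block of memory in advance to hold every frame of data.
(2) The code runs very inefficiently.
(3) The last frame is not handled. In real-time speech processing, the last frame of every large block passed to the algorithm must be considered and processed.
Despite these shortcomings, that approach is easier to debug during theory study, algorithm implementation, and simulation, where only the correctness of the algorithm matters and memory use and speed can be set aside.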
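In real-time speech processing, the length of the data passed to the algorithm on each call is an integer multiple of 8, and the algorithm works on short data. The frame shift and the frame overlap are usually equal, each taking 50% of the frame length.
Take an audio file of 4096 bytes as an example. Its length is 4096 char values, which the algorithm sees as 2048 short values, so the total input length is 2048 samples. Assume a frame length of 256 with a frame shift and a frame overlap of 128 each. The input can then be split into 15 frames, with the following parameter values:
Input_len=2048; FRAME_LEN=256; STEP_LEN=128; OVERLAP_LEN=128;
The framing process is shown in the figure below:
From the figure, the total number of frames is: Input_len / OVERLAP_LEN - 1 = 16 - 1 = 15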
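The frame count can be checked with a small helper. This is a minimal sketch, and `count_frames` is a hypothetical name that does not appear in the original code. It uses the general form (input_len - frame_len) / step_len + 1, which reduces to Input_len / OVERLAP_LEN - 1 when the shift and the overlap are each half of the frame length.

```c
#include <assert.h>

/* Number of complete frames that fit in input_len samples.
 * Hypothetical helper for illustration, not part of the original code. */
int count_frames(int input_len, int frame_len, int step_len)
{
    assert(frame_len > 0 && step_len > 0);
    if (input_len < frame_len) {
        return 0;   /* not even one full frame */
    }
    return (input_len - frame_len) / step_len + 1;
}
```

With the parameters above, `count_frames(2048, 256, 128)` gives 15, the same value as the formula in the text.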
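Each output is one overlap-added segment, so there are 15 outputs in total. Once the last frame has been processed and its result written out, the total output is one OVERLAP_LEN shorter than the total input, and that missing part amounts to lost speech information.
Method 2 builds on method 1 by splitting off one extra frame whose second half is padded with zeros. The output of this extra frame (of length OVERLAP_LEN) fills the last small segment, so the input and output lengths become equal.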
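The zero padding described above can be sketched as a frame-extraction helper. This is only an illustration under stated assumptions: `get_frame_zero_padded` is a hypothetical name, and the samples are assumed to be short values with the frame parameters used in this article.

```c
#include <assert.h>
#include <string.h>

#define FRAME_LEN (256)
#define STEP_LEN  (128)

/* Copy frame number frame_idx out of x (input_len samples) into frame.
 * Samples that would lie past the end of x are set to 0, which is what
 * lets method 2 form one extra frame. Hypothetical helper. */
void get_frame_zero_padded(const short *x, int input_len,
                           int frame_idx, short *frame)
{
    int start = frame_idx * STEP_LEN;
    int avail = input_len - start;   /* samples left from this offset */

    assert(x != NULL && frame != NULL && frame_idx >= 0);
    if (avail > FRAME_LEN) {
        avail = FRAME_LEN;
    }

    memset(frame, 0, FRAME_LEN * sizeof(short));
    if (avail > 0) {
        memcpy(frame, x + start, (size_t)avail * sizeof(short));
    }
}
```

For a 2048-sample input, frame index 15 starts at sample 1920, so its first 128 values come from the input and its last 128 values are zeros.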
Test code
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define INPUT_LEN (4096)
#define FRAME_LEN (256)
#define STEP_LEN (128)
//#define OVERLAP_LEN (FRAME_LEN-STEP_LEN)
#define OVERLAP_LEN (128)
int main(void)
{
    FILE* input_ptr = NULL;
    FILE* output_ptr = NULL;
    char all_in_dat[INPUT_LEN];
    char all_out_dat[INPUT_LEN];
    // char* all_in_dat = (char*)calloc(INPUT_LEN, sizeof(char));
    // char* all_out_dat = (char*)calloc(INPUT_LEN, sizeof(char));
    char in_dat[FRAME_LEN];
    char prev_out_data[OVERLAP_LEN];
    char out_dat[OVERLAP_LEN];
    memset(all_in_dat, 0, INPUT_LEN);
    /* Without this clear, the last OVERLAP_LEN values are indeterminate and show up as wildly abnormal values */
    memset(all_out_dat, 0, INPUT_LEN);
    memset(in_dat, 0, FRAME_LEN);
    memset(prev_out_data, 0, OVERLAP_LEN);
    memset(out_dat, 0, OVERLAP_LEN);
    int inputdata_length, Nframes, frame_idx, i, j, k;
    // the file has a fixed length of 4096 bytes; open in binary mode because it holds raw PCM
    input_ptr = fopen("4k.pcm", "rb");
    if (!input_ptr) {
        printf("open input stream fail\n");
        return -1;
    }
    fseek(input_ptr, 0, SEEK_END);
    inputdata_length = (int)ftell(input_ptr);
    //char->1 byte
    printf("inputdata_length:%d\n", inputdata_length);
    rewind(input_ptr);
    Nframes=inputdata_length/OVERLAP_LEN-1;
    printf("Nframes:%d\n", Nframes);
    int count = (int)fread(all_in_dat, sizeof(unsigned char), INPUT_LEN, input_ptr);
    printf("count:%d\n", count);
    rewind(input_ptr);
    //========================= Start Processing ===============================
for(frame_idx=0, k=0; frame_idx
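Comparing the results. The 4k source file:
The file after framing, overlap-add, and reassembly:
Notes:
A. For the overlap-add method:
output of the current frame = first half of the current processed frame + second half of the previous processed frame
B. The last output of length OVERLAP_LEN is missing. If all_out_dat is not zeroed, the values in that region are indeterminate. Comment out this line:
/* Without this clear, the last OVERLAP_LEN values are indeterminate and show up as wildly abnormal values */
//memset(all_out_dat, 0, INPUT_LEN);
After running the program, the output file looks like this:
Zooming in on that last part, 128 values in total:
The values show a roughly "symmetric" trend that resembles spectral leakage, but it is not leakage; it is missing information.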
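Notes A and B can be reproduced with a short overlap-add sketch. It is not the author's loop, which is not recoverable from this page. The function name `overlap_add`, the `pad_last_frame` flag, and the divide-by-two "processing" are all assumptions: halving each frame stands in for a real algorithm so that the two overlapping halves sum back to the input.

```c
#include <assert.h>
#include <string.h>

#define FRAME_LEN   (256)
#define STEP_LEN    (128)
#define OVERLAP_LEN (FRAME_LEN - STEP_LEN)

/* Overlap-add with a placeholder algorithm (each sample halved).
 * pad_last_frame = 0: method 1, the last OVERLAP_LEN outputs are never written.
 * pad_last_frame = 1: method 2, one extra zero-padded frame fills them.
 * Returns the number of output samples written. Hypothetical sketch. */
int overlap_add(const short *x, int input_len, short *y, int pad_last_frame)
{
    short frame[FRAME_LEN];
    short prev_tail[OVERLAP_LEN];
    int nframes, f, n;

    assert(input_len >= FRAME_LEN && input_len % STEP_LEN == 0);
    nframes = input_len / OVERLAP_LEN - (pad_last_frame ? 0 : 1);
    memset(prev_tail, 0, sizeof(prev_tail));

    for (f = 0; f < nframes; f++) {
        int start = f * STEP_LEN;
        int avail = input_len - start;
        if (avail > FRAME_LEN) {
            avail = FRAME_LEN;
        }
        /* take the frame, zero-padding whatever lies past the input end */
        memset(frame, 0, sizeof(frame));
        memcpy(frame, x + start, (size_t)avail * sizeof(short));

        for (n = 0; n < FRAME_LEN; n++) {
            frame[n] = (short)(frame[n] / 2);   /* placeholder processing */
        }
        for (n = 0; n < STEP_LEN; n++) {
            /* first half of this frame + second half of the previous one */
            y[start + n] = (short)(frame[n] + prev_tail[n]);
        }
        memcpy(prev_tail, frame + STEP_LEN, sizeof(prev_tail));
    }
    return nframes * STEP_LEN;
}
```

With a 2048-sample input, method 1 writes 1920 samples and leaves the final 128 untouched, while method 2 writes all 2048. In both cases the first 128 outputs are only half scale, because no previous frame exists to add to them.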
Test code: only a small change to method 1 is needed.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define INPUT_LEN (4096)
#define FRAME_LEN (256)
#define STEP_LEN (128)
//#define OVERLAP_LEN (FRAME_LEN-STEP_LEN)
#define OVERLAP_LEN (128)
int main(void)
{
    FILE* input_ptr = NULL;
    FILE* output_ptr = NULL;
    char all_in_dat[INPUT_LEN];
    char all_out_dat[INPUT_LEN];
    // char* all_in_dat = (char*)calloc(INPUT_LEN, sizeof(char));
    // char* all_out_dat = (char*)calloc(INPUT_LEN, sizeof(char));
    char in_dat[FRAME_LEN];
    char prev_out_data[OVERLAP_LEN];
    char out_dat[OVERLAP_LEN];
    memset(all_in_dat, 0, INPUT_LEN);
    /* Without this clear, the last OVERLAP_LEN values are indeterminate and show up as wildly abnormal values */
    memset(all_out_dat, 0, INPUT_LEN);
    memset(in_dat, 0, FRAME_LEN);
    memset(prev_out_data, 0, OVERLAP_LEN);
    memset(out_dat, 0, OVERLAP_LEN);
    int inputdata_length, Nframes, frame_idx, i, j, k;
    // the file has a fixed length of 4096 bytes; open in binary mode because it holds raw PCM
    input_ptr = fopen("4k.pcm", "rb");
    if (!input_ptr) {
        printf("open input stream fail\n");
        return -1;
    }
    fseek(input_ptr, 0, SEEK_END);
    inputdata_length = (int)ftell(input_ptr);
    //char->1 byte
    printf("inputdata_length:%d\n", inputdata_length);
    rewind(input_ptr);
    //Nframes=inputdata_length/OVERLAP_LEN-1;
    Nframes=inputdata_length/OVERLAP_LEN; // split off one extra frame
    printf("Nframes:%d\n", Nframes);
    int count = (int)fread(all_in_dat, sizeof(unsigned char), INPUT_LEN, input_ptr);
    printf("count:%d\n", count);
    rewind(input_ptr);
    //========================= Start Processing ===============================
for(frame_idx=0, k=0; frame_idx
Printed output once the last frame is handled correctly:
inputdata_length:4096
Nframes:32
count:4096
process done
Resulting audio file:
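In a real system the algorithm's input and output are both short data, so a preprocessing step comes first: convert the char data to short, then split it into frames. The conversion merges two bytes into one sample. Although char and unsigned char both occupy one byte, the top bit of a signed char is the sign bit, so a byte of 0x80 or above is sign-extended to a negative value when it is promoted for the shift-and-OR, and that corrupts the merged sample. The audio stream should therefore be stored as unsigned char rather than char. In practice, audio paths in embedded systems typically store and transfer the stream as unsigned char.
When reading, the audio stream can be read with different element sizes; when writing the output, the data can likewise be written to the file as a stream in different ways.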
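The byte merging and splitting can be sketched as a pair of helpers. This is a minimal sketch that assumes little-endian 16-bit PCM (low byte first); the names `bytes_to_shorts` and `shorts_to_bytes` are hypothetical and not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Merge little-endian byte pairs into signed 16-bit samples.
 * Using unsigned char avoids sign extension of bytes >= 0x80.
 * Assumes a two's-complement target, as on common platforms. */
void bytes_to_shorts(const unsigned char *in, size_t n_samples, short *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n_samples; i++) {
        unsigned int lo = in[2 * i];        /* low 8 bits  */
        unsigned int hi = in[2 * i + 1];    /* high 8 bits */
        out[i] = (short)(unsigned short)(lo | (hi << 8));
    }
}

/* Split signed 16-bit samples back into little-endian byte pairs. */
void shorts_to_bytes(const short *in, size_t n_samples, unsigned char *out)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n_samples; i++) {
        unsigned short v = (unsigned short)in[i];
        out[2 * i]     = (unsigned char)(v & 0xff);         /* low 8 bits  */
        out[2 * i + 1] = (unsigned char)((v >> 8) & 0xff);  /* high 8 bits */
    }
}
```

A 4096-byte buffer converts to 2048 samples with `bytes_to_shorts(buf, 2048, samples)`, and the reverse call restores the same bytes.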
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define INPUT_LEN (4096)
#define FRAME_LEN (256)
#define STEP_LEN (128)
#define OVERLAP_LEN (128)
int main(void)
{
    FILE* input_ptr = NULL;
    FILE* output_ptr = NULL;
#if 0 //
    char all_in_dat[INPUT_LEN];
    char all_out_dat[INPUT_LEN];
#endif
    unsigned char all_in_dat[INPUT_LEN];
    unsigned char all_out_dat[INPUT_LEN];
    short input_dat[INPUT_LEN/2];
    short output_dat[INPUT_LEN/2];
    short in_dat[FRAME_LEN];
    short prev_out_data[OVERLAP_LEN];
    short out_dat[OVERLAP_LEN];
    //init input
    // sizeof gives the size in bytes, so the short buffers are cleared completely
    memset(all_in_dat, 0, sizeof(all_in_dat));
    memset(all_out_dat, 0, sizeof(all_out_dat));
    memset(input_dat, 0, sizeof(input_dat));
    memset(output_dat, 0, sizeof(output_dat));
    memset(in_dat, 0, sizeof(in_dat));
    memset(prev_out_data, 0, sizeof(prev_out_data));
    memset(out_dat, 0, sizeof(out_dat));
    int inputdata_length, input_len, Nframes, frame_idx, i, j, k;
    // the file has a fixed length of 4096 bytes; open in binary mode because it holds raw PCM
    input_ptr = fopen("4k.pcm", "rb");
    if (!input_ptr) {
        printf("open input stream fail\n");
        return -1;
    }
    fseek(input_ptr, 0, SEEK_END);
    inputdata_length = (int)ftell(input_ptr);
    //char->1 byte
    printf("inputdata_length:%d\n", inputdata_length);
    rewind(input_ptr);
    int count = (int)fread(all_in_dat, sizeof(unsigned char), INPUT_LEN, input_ptr);
    printf("count:%d\n", count);
    rewind(input_ptr);
    input_len=INPUT_LEN/2;
    Nframes=input_len/OVERLAP_LEN;
    printf("Nframes:%d\n", Nframes);
    // preprocess the input data
for(i=0, j=0; i>8)&0xff); //high 8bit
        j=j+2;
    }
    output_ptr = fopen("4k-output.pcm", "wb");
    if (!output_ptr) {
        printf("open output stream fail\n");
        return -1;
    }
    fwrite(all_out_dat, sizeof(unsigned char), INPUT_LEN, output_ptr);
#else /* write the audio stream to the file as short samples */
    output_ptr = fopen("4k-output.pcm", "wb");
    if (!output_ptr) {
        printf("open output stream fail\n");
        return -1;
    }
    /* INPUT_LEN/2 elements of sizeof(short) bytes each, so all INPUT_LEN bytes are written */
    fwrite(output_dat, sizeof(short), INPUT_LEN/2, output_ptr);
#endif
    fclose(input_ptr);
    fclose(output_ptr);
    printf("process done\n");
    return 0;
}
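Result file:
Wrap the framing algorithm in an API so that it can be called directly.
void voice_frame(short *x, int Nframes, short *xout);
Treat naturally captured speech (ADC-MIC-IN) as an audio file of unlimited length. Each call to the algorithm then has to "split" the stream: feed it in one small block at a time, then splice the outputs back together.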
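The equal-block splitting can be sketched as a driver loop. Everything named here is an assumption made for illustration: `run_in_blocks`, the `block_fn` callback type, and the pass-through callback are hypothetical, and the block size assumes 4096 bytes of 16-bit samples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK_SAMPLES (2048)   /* 4096 bytes of 16-bit samples per block */

/* Hypothetical per-block processing callback. */
typedef void (*block_fn)(const short *in, short *out, int n, int blk_index);

/* Example callback: copies the block unchanged. */
void passthrough_block(const short *in, short *out, int n, int blk_index)
{
    (void)blk_index;   /* unused in this placeholder */
    memcpy(out, in, (size_t)n * sizeof(short));
}

/* Feed a long recording to `process` one fixed-size block at a time and
 * write each block's output at the same offset, so the outputs are
 * spliced back in order. A tail shorter than one block is left untouched.
 * Returns the number of blocks processed. */
int run_in_blocks(const short *all_in, short *all_out, size_t total,
                  block_fn process)
{
    size_t n_blocks = total / BLK_SAMPLES;
    size_t b;

    assert(all_in != NULL && all_out != NULL && process != NULL);
    for (b = 0; b < n_blocks; b++) {
        process(all_in + b * BLK_SAMPLES, all_out + b * BLK_SAMPLES,
                BLK_SAMPLES, (int)b);
    }
    return (int)n_blocks;
}
```

An 81920-byte file holds 40960 samples, which this loop would process as 20 blocks.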
Example code
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include "baselib.h"
#include "win_fun.h"
typedef unsigned char uint8_t;
#define FRAME_LEN (256)
#define STEP_LEN (128)
#define OVERLAP_LEN (128)
#define BLK_INPUT_LEN (4096)
//#define BLK_INPUT_LEN (8192)
static const float win[FRAME_LEN]={
0.080000, 0.080140, 0.080558, 0.081256, 0.082232, 0.083487, 0.085018, 0.086825, 0.088908, 0.091264, 0.093893, 0.096793, 0.099962, 0.103398, 0.107099, 0.111063, 0.115287, 0.119769, 0.124506, 0.129496, 0.134734, 0.140219, 0.145946, 0.151913, 0.158115, 0.164549, 0.171211, 0.178097, 0.185203, 0.192524, 0.200056, 0.207794,
0.215734, 0.223871, 0.232200, 0.240716, 0.249413, 0.258287, 0.267332, 0.276542, 0.285912, 0.295437, 0.305110, 0.314925, 0.324878, 0.334960, 0.345168, 0.355493, 0.365931, 0.376474, 0.387117, 0.397852, 0.408674, 0.419575, 0.430550, 0.441591, 0.452691, 0.463845, 0.475045, 0.486285, 0.497557, 0.508854, 0.520171, 0.531500,
0.542834, 0.554166, 0.565489, 0.576797, 0.588083, 0.599340, 0.610560, 0.621738, 0.632866, 0.643938, 0.654946, 0.665885, 0.676747, 0.687527, 0.698216, 0.708810, 0.719301, 0.729684, 0.739951, 0.750097, 0.760115, 0.770000, 0.779745, 0.789345, 0.798793, 0.808084, 0.817212, 0.826172, 0.834958, 0.843565, 0.851988, 0.860222,
0.868261, 0.876100, 0.883736, 0.891163, 0.898377, 0.905373, 0.912148, 0.918696, 0.925015, 0.931100, 0.936947, 0.942554, 0.947916, 0.953030, 0.957894, 0.962504, 0.966857, 0.970952, 0.974785, 0.978353, 0.981656, 0.984690, 0.987455, 0.989948, 0.992168, 0.994113, 0.995782, 0.997175, 0.998290, 0.999128, 0.999686, 0.999965,
0.999965, 0.999686, 0.999128, 0.998290, 0.997175, 0.995782, 0.994113, 0.992168, 0.989948, 0.987455, 0.984690, 0.981656, 0.978353, 0.974785, 0.970952, 0.966857, 0.962504, 0.957894, 0.953030, 0.947916, 0.942554, 0.936947, 0.931100, 0.925015, 0.918696, 0.912148, 0.905373, 0.898377, 0.891163, 0.883736, 0.876100, 0.868261,
0.860222, 0.851988, 0.843565, 0.834958, 0.826172, 0.817212, 0.808084, 0.798793, 0.789345, 0.779745, 0.770000, 0.760115, 0.750097, 0.739951, 0.729684, 0.719302, 0.708810, 0.698216, 0.687527, 0.676747, 0.665885, 0.654946, 0.643938, 0.632866, 0.621738, 0.610560, 0.599340, 0.588083, 0.576797, 0.565489, 0.554166, 0.542833,
0.531500, 0.520171, 0.508854, 0.497557, 0.486285, 0.475045, 0.463845, 0.452692, 0.441591, 0.430550, 0.419575, 0.408674, 0.397852, 0.387117, 0.376474, 0.365931, 0.355493, 0.345168, 0.334960, 0.324877, 0.314925, 0.305110, 0.295437, 0.285912, 0.276542, 0.267331, 0.258287, 0.249413, 0.240716, 0.232200, 0.223871, 0.215734,
0.207794, 0.200056, 0.192524, 0.185203, 0.178097, 0.171211, 0.164549, 0.158115, 0.151913, 0.145947, 0.140219, 0.134734, 0.129496, 0.124506, 0.119769, 0.115287, 0.111063, 0.107099, 0.103398, 0.099962, 0.096793, 0.093893, 0.091265, 0.088908, 0.086825, 0.085018, 0.083487, 0.082232, 0.081256, 0.080558, 0.080140, 0.080000,
};
float winGain;
void voice_frame(short *x, int Nframes, short *xout, int blk_index);
int main(void)
{
    int i, j;
    int inputdata_length;
    FILE* input_ptr = NULL;
    FILE* output_ptr = NULL;
    /* binary mode: the file holds raw PCM */
    input_ptr = fopen("80k.pcm", "rb");
    if (!input_ptr) {
        printf("open input stream fail\n");
        return -1;
    }
    fseek(input_ptr, 0, SEEK_END);
    inputdata_length = (int)ftell(input_ptr);
    printf("inputdata_length:%d\n", inputdata_length);
    rewind(input_ptr);
    uint8_t* all_in_dat = (uint8_t*)calloc(inputdata_length, sizeof(uint8_t));
    uint8_t* all_out_dat = (uint8_t*)calloc(inputdata_length, sizeof(uint8_t));
    uint8_t* blk_input_dat = (uint8_t*)calloc(BLK_INPUT_LEN, sizeof(uint8_t));
    uint8_t* blk_output_dat = (uint8_t*)calloc(BLK_INPUT_LEN, sizeof(uint8_t));
    int count = (int)fread(all_in_dat, sizeof(uint8_t), inputdata_length, input_ptr);
    printf("count:%d\n", count);
    rewind(input_ptr);
    int in_dat_len=BLK_INPUT_LEN/2;
    int out_dat_len=in_dat_len;
    printf("in_dat_len:%d\n", in_dat_len);
    short* in_dat = (short*)calloc(in_dat_len, sizeof(short));
    short* out_dat = (short*)calloc(out_dat_len, sizeof(short));
    int all_block=inputdata_length/BLK_INPUT_LEN;
    printf("all_block:%d\n", all_block);
    int Nframes= in_dat_len/OVERLAP_LEN;
    printf("a block can div to Nframes:%d\n", Nframes);
for(i=0; i>8)&0xff); //high 8bit
j=j+2;
}
// merge the block outputs
for (arr_index=0; arr_index
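Serial output:
inputdata_length:81920
count:81920
in_dat_len:2048
all_block:20
a block can div to Nframes:16
the first block, do somthing init
Notes:
(1) The input is an audio file with a fixed length of 80k.
(2) After each small block is split into frames, a window is also applied.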
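The comparison is shown below:
With overlapping blocks, the head of each input keeps the tail of the previous large block. The principle is the same as the framing done inside each large block.
Example: consecutive blocks overlap by 128 samples, as illustrated below: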
Example code
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include "baselib.h"
#include "win_fun.h"
typedef unsigned char uint8_t;
#define FRAME_LEN (256)
#define STEP_LEN (128)
#define OVERLAP_LEN (128)
#define BLK_INPUT_LEN (4096)
//#define BLK_INPUT_LEN (8192)
#define BLK_OVERLAP_LEN (128)
//#define BLK_OVERLAP_LEN (256)
static const float win[FRAME_LEN]={
0.080000, 0.080140, 0.080558, 0.081256, 0.082232, 0.083487, 0.085018, 0.086825, 0.088908, 0.091264, 0.093893, 0.096793, 0.099962, 0.103398, 0.107099, 0.111063, 0.115287, 0.119769, 0.124506, 0.129496, 0.134734, 0.140219, 0.145946, 0.151913, 0.158115, 0.164549, 0.171211, 0.178097, 0.185203, 0.192524, 0.200056, 0.207794,
0.215734, 0.223871, 0.232200, 0.240716, 0.249413, 0.258287, 0.267332, 0.276542, 0.285912, 0.295437, 0.305110, 0.314925, 0.324878, 0.334960, 0.345168, 0.355493, 0.365931, 0.376474, 0.387117, 0.397852, 0.408674, 0.419575, 0.430550, 0.441591, 0.452691, 0.463845, 0.475045, 0.486285, 0.497557, 0.508854, 0.520171, 0.531500,
0.542834, 0.554166, 0.565489, 0.576797, 0.588083, 0.599340, 0.610560, 0.621738, 0.632866, 0.643938, 0.654946, 0.665885, 0.676747, 0.687527, 0.698216, 0.708810, 0.719301, 0.729684, 0.739951, 0.750097, 0.760115, 0.770000, 0.779745, 0.789345, 0.798793, 0.808084, 0.817212, 0.826172, 0.834958, 0.843565, 0.851988, 0.860222,
0.868261, 0.876100, 0.883736, 0.891163, 0.898377, 0.905373, 0.912148, 0.918696, 0.925015, 0.931100, 0.936947, 0.942554, 0.947916, 0.953030, 0.957894, 0.962504, 0.966857, 0.970952, 0.974785, 0.978353, 0.981656, 0.984690, 0.987455, 0.989948, 0.992168, 0.994113, 0.995782, 0.997175, 0.998290, 0.999128, 0.999686, 0.999965,
0.999965, 0.999686, 0.999128, 0.998290, 0.997175, 0.995782, 0.994113, 0.992168, 0.989948, 0.987455, 0.984690, 0.981656, 0.978353, 0.974785, 0.970952, 0.966857, 0.962504, 0.957894, 0.953030, 0.947916, 0.942554, 0.936947, 0.931100, 0.925015, 0.918696, 0.912148, 0.905373, 0.898377, 0.891163, 0.883736, 0.876100, 0.868261,
0.860222, 0.851988, 0.843565, 0.834958, 0.826172, 0.817212, 0.808084, 0.798793, 0.789345, 0.779745, 0.770000, 0.760115, 0.750097, 0.739951, 0.729684, 0.719302, 0.708810, 0.698216, 0.687527, 0.676747, 0.665885, 0.654946, 0.643938, 0.632866, 0.621738, 0.610560, 0.599340, 0.588083, 0.576797, 0.565489, 0.554166, 0.542833,
0.531500, 0.520171, 0.508854, 0.497557, 0.486285, 0.475045, 0.463845, 0.452692, 0.441591, 0.430550, 0.419575, 0.408674, 0.397852, 0.387117, 0.376474, 0.365931, 0.355493, 0.345168, 0.334960, 0.324877, 0.314925, 0.305110, 0.295437, 0.285912, 0.276542, 0.267331, 0.258287, 0.249413, 0.240716, 0.232200, 0.223871, 0.215734,
0.207794, 0.200056, 0.192524, 0.185203, 0.178097, 0.171211, 0.164549, 0.158115, 0.151913, 0.145947, 0.140219, 0.134734, 0.129496, 0.124506, 0.119769, 0.115287, 0.111063, 0.107099, 0.103398, 0.099962, 0.096793, 0.093893, 0.091265, 0.088908, 0.086825, 0.085018, 0.083487, 0.082232, 0.081256, 0.080558, 0.080140, 0.080000,
};
float winGain;
void voice_frame(short *x, int Nframes, short *xout, int blk_index);
int main(void)
{
    int i, j;
    int inputdata_length;
    FILE* input_ptr = NULL;
    FILE* output_ptr = NULL;
    /* binary mode: the file holds raw PCM */
    input_ptr = fopen("80k.pcm", "rb");
    if (!input_ptr) {
        printf("open input stream fail\n");
        return -1;
    }
    fseek(input_ptr, 0, SEEK_END);
    inputdata_length = (int)ftell(input_ptr);
    printf("inputdata_length:%d\n", inputdata_length);
    rewind(input_ptr);
    uint8_t* all_in_dat = (uint8_t*)calloc(inputdata_length, sizeof(uint8_t));
    uint8_t* all_out_dat = (uint8_t*)calloc(inputdata_length, sizeof(uint8_t));
    uint8_t* blk_input_dat = (uint8_t*)calloc(BLK_INPUT_LEN, sizeof(uint8_t));
    uint8_t* blk_output_dat = (uint8_t*)calloc(BLK_INPUT_LEN, sizeof(uint8_t));
    int count = (int)fread(all_in_dat, sizeof(uint8_t), inputdata_length, input_ptr);
    printf("count:%d\n", count);
    rewind(input_ptr);
    int in_dat_len=BLK_INPUT_LEN/2;
    int out_dat_len=in_dat_len;
    printf("in_dat_len:%d\n", in_dat_len);
    short* in_dat = (short*)calloc(in_dat_len, sizeof(short));
    short* out_dat = (short*)calloc(out_dat_len, sizeof(short));
    int all_block=inputdata_length/BLK_INPUT_LEN;
    printf("all_block:%d\n", all_block);
    int Nframes= in_dat_len/OVERLAP_LEN;
    printf("a block can div to Nframes:%d\n", Nframes);
for(i=0; i>8)&0xff); //high 8bit
j=j+2;
}
// merge the block outputs
for (arr_index=0; arr_index
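In real-time processing, how the joins between one "block" and the next are handled also deserves attention. Overlapped splitting keeps the values on either side of each join from differing too much. With equal, non-overlapping splitting, the tail of the current block and the head of the next one can differ greatly at the split point, which makes the blocks "discontinuous". In the frequency domain this shows up as an abrupt change, much like the cause of musical noise, and it is heard as clicking.
Basic flow: each interrupt delivers 4k of data. The first 128 samples of the algorithm input are filled from tmp_buf, which holds raw data from the previous interrupt. The data is passed to the algorithm, and once the output is available tmp_buf is updated with the last 128 samples delivered by this interrupt.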
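The tmp_buf handling can be sketched as follows. This is one possible reading of the flow, with assumptions stated plainly: 4k is taken to mean 4096 bytes (2048 short samples), the saved tail is placed in front of the new data so the algorithm input grows to BLK_OVERLAP_LEN + BLK_SAMPLES samples, and `build_overlapped_input` is a hypothetical name. Because tmp_buf stores raw input, this sketch updates it while building the input instead of waiting for the output.

```c
#include <assert.h>
#include <string.h>

#define BLK_SAMPLES     (2048)  /* assumed: 4096 bytes of 16-bit samples */
#define BLK_OVERLAP_LEN (128)

/* Raw tail of the previous interrupt's data; all zeros before the first block. */
static short tmp_buf[BLK_OVERLAP_LEN];

/* Build the algorithm input for one interrupt: the saved tail of the
 * previous block followed by the new samples. algo_in must hold
 * BLK_OVERLAP_LEN + BLK_SAMPLES samples. Hypothetical helper. */
void build_overlapped_input(const short *new_blk, short *algo_in)
{
    assert(new_blk != NULL && algo_in != NULL);

    /* head: last 128 raw samples of the previous block */
    memcpy(algo_in, tmp_buf, sizeof(tmp_buf));
    /* body: the samples delivered by this interrupt */
    memcpy(algo_in + BLK_OVERLAP_LEN, new_blk, BLK_SAMPLES * sizeof(short));
    /* remember the last 128 raw samples for the next interrupt */
    memcpy(tmp_buf, new_blk + BLK_SAMPLES - BLK_OVERLAP_LEN, sizeof(tmp_buf));
}
```

In a DMA callback this would be called once per interrupt before handing `algo_in` to the framing routine.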
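A minimal example: the DMA half-full interrupt delivers 4k of data. The callback takes the data out as the algorithm input, the algorithm frames it and produces the output, and the output can then go to headphones or to another path. The simplest framework is shown in the figure below:
In this experiment the processed data is sent to a PC over the serial port, and a Python script on the PC captures the audio data. One of the results is excerpted below: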