Introduction
Gender classification based on speech signals is an essential component of many audio systems, such as automatic speech recognition, speaker recognition, and content-based multimedia indexing.
This example uses a long short-term memory (LSTM) network, a type of recurrent neural network (RNN) well suited to studying sequence and time-series data. An LSTM network can learn long-term dependencies between the time steps of a sequence. An LSTM layer (lstmLayer) can look at the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) can look at the time sequence in both forward and backward directions. This example uses bidirectional LSTM layers.
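For reference, here is a minimal sketch contrasting the two layer types (the hidden size of 50 matches the network defined later in this example; the variable names are illustrative):
forwardLSTM = lstmLayer(50,"OutputMode","sequence"); % processes the sequence in the forward direction only
bidirectionalLSTM = bilstmLayer(50,"OutputMode","sequence"); % processes the sequence forward and backward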
This example trains the LSTM network with sequences of gammatone cepstral coefficients (gtcc (Audio Toolbox)), pitch estimates (pitch (Audio Toolbox)), harmonic ratio (harmonicRatio (Audio Toolbox)), and several spectral shape descriptors (Spectral Descriptors (Audio Toolbox)).
To speed up the training process, run this example on a machine with a GPU. If your machine has both a GPU and Parallel Computing Toolbox™, MATLAB® automatically uses the GPU for training; otherwise, it uses the CPU.
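If you want to confirm the execution environment before training, a minimal check such as the following works (this assumes Parallel Computing Toolbox is installed; gpuDeviceCount with the "available" option is supported in recent MATLAB releases):
if gpuDeviceCount("available") > 0 % requires Parallel Computing Toolbox
disp("Supported GPU detected; trainNetwork uses it automatically.")
else
disp("No supported GPU detected; training runs on the CPU.")
end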
Classify Gender with a Pretrained Network
Before going into the training process in detail, you will use a pretrained network to classify the gender of the speaker in two test signals.
Load the pretrained network along with precomputed vectors for feature normalization.
load('genderIDNet.mat', 'genderIDNet', 'M', 'S');
Load a test signal with a male speaker.
[audioIn, Fs] = audioread('maleSpeech.flac');
sound(audioIn, Fs)
Isolate the speech region of the signal.
boundaries = detectSpeech(audioIn, Fs);
audioIn = audioIn(boundaries(1):boundaries(2));
Create an audioFeatureExtractor (Audio Toolbox) to extract features from the audio data. You will use the same object to extract features for training.
extractor = audioFeatureExtractor( ...
"SampleRate",Fs, ...
"Window",hamming(round(0.03*Fs),"periodic"), ...
"OverlapLength",round(0.02*Fs), ...
...
"gtcc",true, ...
"gtccDelta",true, ...
"gtccDeltaDelta",true, ...
...
"SpectralDescriptorInput","melSpectrum", ...
"spectralCentroid",true, ...
"spectralEntropy",true, ...
"spectralFlux",true, ...
"spectralSlope",true, ...
...
"pitch",true, ...
"harmonicRatio",true);
Extract features from the signal and normalize them.
features = extract(extractor, audioIn);
features = (features.' - M)./S;
Classify the signal.
gender = classify(genderIDNet, features)
gender = categorical
male
Classify another signal, this one from a female speaker.
[audioIn, Fs] = audioread('femaleSpeech.flac');
sound(audioIn, Fs)
boundaries = detectSpeech(audioIn, Fs);
audioIn = audioIn(boundaries(1):boundaries(2));
features = extract(extractor, audioIn);
features = (features.' - M)./S;
classify(genderIDNet, features)
ans = categorical
female
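For convenience, these classification steps can be collected into a helper function. The following is a minimal sketch, not part of the shipped example: classifySpeakerGender is a hypothetical name, and the function assumes a single detected speech region, as in the two test signals above.
function gender = classifySpeakerGender(filename,net,extractor,M,S)
% Read audio, isolate speech, extract and normalize features, and classify
[audioIn,Fs] = audioread(filename);
boundaries = detectSpeech(audioIn,Fs);
audioIn = audioIn(boundaries(1):boundaries(2)); % assumes one speech region
features = extract(extractor,audioIn);
features = (features.' - M)./S;
gender = classify(net,features);
end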
Preprocess the Training Audio Data
The BiLSTM network used in this example works best when given sequences of feature vectors. To illustrate the preprocessing pipeline, this example walks through the steps for a single audio file.
Read the contents of an audio file that contains speech. The gender of the speaker is male.
[audioIn,Fs] = audioread('Counting-16-44p1-mono-15secs.wav');
labels = {'male'};
Plot the audio signal and then listen to it using the sound command.
timeVector = (1/Fs) * (0:size(audioIn,1)-1);
figure
plot(timeVector,audioIn)
ylabel("Amplitude")
xlabel("Time (s)")
title("Sample Audio")
grid on
sound(audioIn,Fs)
The speech signal contains silent segments that carry no useful information about the speaker's gender. Use detectSpeech (Audio Toolbox) to locate the segments of speech in the audio signal.
speechIndices = detectSpeech(audioIn,Fs);
Create an audioFeatureExtractor (Audio Toolbox) to extract features from the audio data. A speech signal is dynamic in nature and changes over time. Speech signals are assumed to be stationary on short time scales, so their processing is usually done in windows of 20-40 ms. Specify a 30 ms window with 20 ms overlap.
extractor = audioFeatureExtractor( ...
"SampleRate",Fs, ...
"Window",hamming(round(0.03*Fs),"periodic"), ...
"OverlapLength",round(0.02*Fs), ...
...
"gtcc",true, ...
"gtccDelta",true, ...
"gtccDeltaDelta",true, ...
...
"SpectralDescriptorInput","melSpectrum", ...
"spectralCentroid",true, ...
"spectralEntropy",true, ...
"spectralFlux",true, ...
"spectralSlope",true, ...
...
"pitch",true, ...
"harmonicRatio",true);
Extract features from each audio segment. The output of audioFeatureExtractor is a numFeatureVectors-by-numFeatures array. The sequenceInputLayer used in this example requires time to be along the second dimension. Permute the output array so that time is along the second dimension.
featureVectorsSegment = {};
for ii = 1:size(speechIndices,1)
featureVectorsSegment{end+1} = ( extract(extractor,audioIn(speechIndices(ii,1):speechIndices(ii,2))) )';
end
numSegments = size(featureVectorsSegment)
numSegments = 1×2
1 11
[numFeatures,numFeatureVectorsSegment1] = size(featureVectorsSegment{1})
numFeatures = 45
numFeatureVectorsSegment1 = 124
Replicate the labels so that they are in one-to-one correspondence with the segments.
labels = repelem(labels,size(speechIndices,1))
labels = 1×11 cell
{'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'}
When using a sequenceInputLayer, it is usually best to use sequences of consistent length. Convert the arrays of feature vectors into sequences of feature vectors. Use 20 feature vectors per sequence with an overlap of 5 feature vectors.
featureVectorsPerSequence = 20;
featureVectorOverlap = 5;
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
featuresTrain = {};
sequencePerSegment = zeros(numel(featureVectorsSegment),1);
for ii = 1:numel(featureVectorsSegment)
sequencePerSegment(ii) = max(floor((size(featureVectorsSegment{ii},2) - featureVectorsPerSequence)/hopLength) + 1,0);
idx2 = 1;
for j = 1:sequencePerSegment(ii)
featuresTrain{idx1,1} = featureVectorsSegment{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1);
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
For conciseness, the helper function HelperFeatureVector2Sequence encapsulates the above processing and is used throughout the rest of this example.
Replicate the labels so that they are in one-to-one correspondence with the training set.
labels = repelem(labels,sequencePerSegment);
The result of the preprocessing pipeline is a NumSequence-by-1 cell array of NumFeatures-by-FeatureVectorsPerSequence matrices. labels is a NumSequence-by-1 array.
NumSequence = numel(featuresTrain)
NumSequence = 27
[NumFeatures,FeatureVectorsPerSequence] = size(featuresTrain{1})
NumFeatures = 45
FeatureVectorsPerSequence = 20
NumSequence = numel(labels)
NumSequence = 27
The following figure provides an overview of the feature extraction used for each detected speech region.
Create Training and Test Datastores
This example uses a subset of the Mozilla Common Voice dataset [1]. The dataset contains 48 kHz recordings of subjects speaking short sentences. Download the dataset and unzip the downloaded file. Set dataFolder to the location of the data.
url = 'http://ssd.mathworks.com/supportfiles/audio/commonvoice.zip';
downloadFolder = tempdir;
dataFolder = fullfile(downloadFolder,'commonvoice');
if ~exist(dataFolder,'dir')
disp('Downloading data set (956 MB) ...')
unzip(url,downloadFolder)
end
Use audioDatastore to create datastores for the training and validation sets. Use readtable to read the metadata associated with the audio files.
loc = dataFolder;
adsTrain = audioDatastore(fullfile(loc,'train'),'IncludeSubfolders',true);
metadataTrain = readtable(fullfile(loc,'train','train.tsv'),"FileType","text");
adsTrain.Labels = metadataTrain.gender;
adsValidation = audioDatastore(fullfile(loc,'validation'),'IncludeSubfolders',true);
metadataValidation = readtable(fullfile(loc,'validation','validation.tsv'),"FileType","text");
adsValidation.Labels = metadataValidation.gender;
Use countEachLabel (Audio Toolbox) to inspect the gender breakdown of the training and validation sets.
countEachLabel(adsTrain)
ans=2×2 table
Label Count
______ _____
female 1000
male 1000
countEachLabel(adsValidation)
ans=2×2 table
Label Count
______ _____
female 200
male 200
To train the network with the entire dataset and achieve the highest possible accuracy, set reduceDataset to false. To run this example quickly, set reduceDataset to true.
reduceDataset = false;
if reduceDataset
% Reduce the training dataset by a factor of 20
adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / 2 / 20));
% Reduce the validation dataset to 20 files per label
adsValidation = splitEachLabel(adsValidation,20);
end
Create the Training and Validation Sets
Determine the sample rate of the audio files in the dataset, and then update the sample rate, window, and overlap length of the audio feature extractor.
[~,adsInfo] = read(adsTrain);
Fs = adsInfo.SampleRate;
extractor.SampleRate = Fs;
extractor.Window = hamming(round(0.03*Fs),"periodic");
extractor.OverlapLength = round(0.02*Fs);
To speed up processing, distribute the computation across multiple workers. If you have Parallel Computing Toolbox™, the example partitions the datastore so that feature extraction runs in parallel across the available workers. Determine the optimal number of partitions for your system. If you do not have Parallel Computing Toolbox™, the example uses a single worker.
if ~isempty(ver('parallel')) && ~reduceDataset
pool = gcp;
numPar = numpartitions(adsTrain,pool);
else
numPar = 1;
end
In a loop:
Read from the audio datastore.
Detect regions of speech.
Extract feature vectors from the regions of speech.
Replicate the labels so that they are in one-to-one correspondence with the feature vectors.
labelsTrain = [];
featureVectors = {};
% Loop over optimal number of partitions
parfor ii = 1:numPar
% Partition datastore
subds = partition(adsTrain,numPar,ii);
% Preallocation
featureVectorsInSubDS = {};
segmentsPerFile = zeros(numel(subds.Files),1);
% Loop over files in partitioned datastore
for jj = 1:numel(subds.Files)
% 1. Read in a single audio file
audioIn = read(subds);
% 2. Determine the regions of the audio that correspond to speech
speechIndices = detectSpeech(audioIn,Fs);
% 3. Extract features from each speech segment
segmentsPerFile(jj) = size(speechIndices,1);
features = cell(segmentsPerFile(jj),1);
for kk = 1:size(speechIndices,1)
features{kk} = ( extract(extractor,audioIn(speechIndices(kk,1):speechIndices(kk,2))) )';
end
featureVectorsInSubDS = [featureVectorsInSubDS;features(:)];
end
featureVectors = [featureVectors;featureVectorsInSubDS];
% Replicate the labels so that they are in one-to-one correspondence
% with the feature vectors.
repedLabels = repelem(subds.Labels,segmentsPerFile);
labelsTrain = [labelsTrain;repedLabels(:)];
end
In classification applications, it is good practice to normalize all features to have zero mean and unit standard deviation.
Compute the mean and standard deviation of each coefficient, and use them to normalize the data.
allFeatures = cat(2,featureVectors{:});
allFeatures(isinf(allFeatures)) = nan; % treat Inf values as missing so they do not skew the statistics
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featureVectors = cellfun(@(x)(x-M)./S,featureVectors,'UniformOutput',false);
% Replace any NaN values that remain after normalization with zeros
for ii = 1:numel(featureVectors)
idx = find(isnan(featureVectors{ii}));
if ~isempty(idx)
featureVectors{ii}(idx) = 0;
end
end
Buffer the feature vectors into sequences of 20 feature vectors with an overlap of 5. If a sequence has fewer than 20 feature vectors, it is dropped.
[featuresTrain,trainSequencePerSegment] = HelperFeatureVector2Sequence(featureVectors,featureVectorsPerSequence,featureVectorOverlap);
Replicate the labels so that they are in one-to-one correspondence with the sequences.
labelsTrain = repelem(labelsTrain,[trainSequencePerSegment{:}]);
labelsTrain = categorical(labelsTrain);
Create the validation set using the same steps used to create the training set.
labelsValidation = [];
featureVectors = {};
valSegmentsPerFile = [];
parfor ii = 1:numPar
subds = partition(adsValidation,numPar,ii);
featureVectorsInSubDS = {};
valSegmentsPerFileInSubDS = zeros(numel(subds.Files),1);
for jj = 1:numel(subds.Files)
% 1. Read in a single audio file
audioIn = read(subds);
% 2. Determine the regions of the audio that correspond to speech
speechIndices = detectSpeech(audioIn,Fs);
% 3. Extract features from each speech segment
numSegments = size(speechIndices,1);
valSegmentsPerFileInSubDS(jj) = numSegments;
features = cell(numSegments,1);
for kk = 1:numSegments
features{kk} = ( extract(extractor,audioIn(speechIndices(kk,1):speechIndices(kk,2))) )';
end
featureVectorsInSubDS = [featureVectorsInSubDS;features(:)];
end
repedLabels = repelem(subds.Labels,valSegmentsPerFileInSubDS);
labelsValidation = [labelsValidation;repedLabels(:)];
featureVectors = [featureVectors;featureVectorsInSubDS];
valSegmentsPerFile = [valSegmentsPerFile;valSegmentsPerFileInSubDS];
end
% Normalize the validation features using the statistics (M and S) computed from the training set
featureVectors = cellfun(@(x)(x-M)./S,featureVectors,'UniformOutput',false);
for ii = 1:numel(featureVectors)
idx = find(isnan(featureVectors{ii}));
if ~isempty(idx)
featureVectors{ii}(idx) = 0;
end
end
[featuresValidation,valSequencePerSegment] = HelperFeatureVector2Sequence(featureVectors,featureVectorsPerSequence,featureVectorOverlap);
labelsValidation = repelem(labelsValidation,[valSequencePerSegment{:}]);
labelsValidation = categorical(labelsValidation);
Define the LSTM Network Architecture
LSTM networks can learn long-term dependencies between the time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to look at the sequence in both forward and backward directions.
Specify the input size to be sequences of size NumFeatures. Specify a hidden bidirectional LSTM layer with an output size of 50 and output a sequence. Then, specify a bidirectional LSTM layer with an output size of 50 and output the last element of the sequence. This command instructs the bidirectional LSTM layer to map its input into 50 features and then prepares the output for the fully connected layer. Finally, specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.
layers = [ ...
sequenceInputLayer(size(featuresTrain{1},1))
bilstmLayer(50,"OutputMode","sequence")
bilstmLayer(50,"OutputMode","last")
fullyConnectedLayer(2)
softmaxLayer
classificationLayer];
Next, specify the training options for the classifier. Set MaxEpochs to 4 so that the network makes 4 passes through the training data. Set MiniBatchSize to 256 so that the network looks at 256 training signals at a time. Specify Plots as "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to false to disable printing the table output that corresponds to the data shown in the plot. Specify Shuffle as "every-epoch" to shuffle the training sequences at the beginning of each epoch. Specify LearnRateSchedule as "piecewise" to decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (1) has passed.
This example uses the adaptive moment estimation (ADAM) solver. ADAM performs better with recurrent neural networks (RNNs) like LSTMs than the default stochastic gradient descent with momentum (SGDM) solver.
miniBatchSize = 256;
validationFrequency = floor(numel(labelsTrain)/miniBatchSize);
options = trainingOptions("adam", ...
"MaxEpochs",4, ...
"MiniBatchSize",miniBatchSize, ...
"Plots","training-progress", ...
"Verbose",false, ...
"Shuffle","every-epoch", ...
"LearnRateSchedule","piecewise", ...
"LearnRateDropFactor",0.1, ...
"LearnRateDropPeriod",1, ...
"ValidationData",{featuresValidation,labelsValidation}, ...
"ValidationFrequency",validationFrequency);
Train the LSTM Network
Train the LSTM network with the specified training options and layer architecture using trainNetwork. Because the training set is large, the training process can take several minutes.
net = trainNetwork(featuresTrain,labelsTrain,layers,options);
The top subplot of the training-progress plot represents the training accuracy, which is the classification accuracy on each mini-batch. When training progresses successfully, this value typically increases toward 100%. The bottom subplot displays the training loss, which is the cross-entropy loss on each mini-batch. When training progresses successfully, this value typically decreases toward zero.
If the training is not converging, the plots might oscillate between values without trending upward or downward. This oscillation means that the training accuracy is not improving and the training loss is not decreasing. This situation can occur at the start of training, or after some preliminary improvement in training accuracy. In many cases, changing the training options can help the network achieve convergence. Decreasing MiniBatchSize or decreasing InitialLearnRate might result in a longer training time, but it can help the network learn better; one such variant is sketched below.
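For illustration, a hypothetical variant of the training options with a smaller mini-batch size and a lower initial learning rate might look like the following (the values shown are illustrative, not tuned for this dataset):
optionsSlower = trainingOptions("adam", ...
"MaxEpochs",4, ...
"MiniBatchSize",128, ... % half the original mini-batch size
"InitialLearnRate",1e-4, ... % lower than the adam default of 1e-3
"Plots","training-progress", ...
"Verbose",false, ...
"Shuffle","every-epoch");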
Visualize the Training Accuracy
Calculate the training accuracy, which represents the accuracy of the classifier on the signals on which it was trained. First, classify the training data.
prediction = classify(net,featuresTrain);
Plot the confusion matrix. Display the precision and recall for the two classes by using column and row summaries.
figure
cm = confusionchart(categorical(labelsTrain),prediction,'title','Training Accuracy');
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
Visualize the Validation Accuracy
Calculate the validation accuracy. First, classify the validation data.
[prediction,probabilities] = classify(net,featuresValidation);
Plot the confusion matrix. Display the precision and recall for the two classes by using column and row summaries.
figure
cm = confusionchart(categorical(labelsValidation),prediction,'title','Validation Set Accuracy');
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
The example generated multiple sequences from each speech file. Higher accuracy can be achieved by considering the output classes of all sequences corresponding to the same file and applying a "max rule" decision: select the class with the highest mean confidence score over all sequences from that file.
Determine the number of sequences generated per file in the validation set.
sequencePerFile = zeros(size(valSegmentsPerFile));
valSequencePerSegmentMat = cell2mat(valSequencePerSegment);
idx = 1;
for ii = 1:numel(valSegmentsPerFile)
sequencePerFile(ii) = sum(valSequencePerSegmentMat(idx:idx+valSegmentsPerFile(ii)-1));
idx = idx + valSegmentsPerFile(ii);
end
Predict the gender of the speaker in each validation file by considering the output classes of all sequences generated from the same file.
numFiles = numel(adsValidation.Files);
actualGender = categorical(adsValidation.Labels);
predictedGender = actualGender;
scores = cell(1,numFiles);
counter = 1;
cats = unique(actualGender);
for index = 1:numFiles
scores{index} = probabilities(counter: counter + sequencePerFile(index) - 1,:);
m = max(mean(scores{index},1),[],1);
if m(1) >= m(2)
predictedGender(index) = cats(1);
else
predictedGender(index) = cats(2);
end
counter = counter + sequencePerFile(index);
end
Visualize the confusion matrix for the max-rule predictions.
figure
cm = confusionchart(actualGender,predictedGender,'title','Validation Set Accuracy - Max Rule');
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
References
[1] Mozilla Common Voice Dataset. https://commonvoice.mozilla.org/
Supporting Functions
function [sequences,sequencePerSegment] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
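% Buffer each feature array (numFeatures-by-numFeatureVectors) into
% overlapping sequences of featureVectorsPerSequence feature vectors,
% advancing by featureVectorsPerSequence - featureVectorOverlap each time.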
if featureVectorsPerSequence <= featureVectorOverlap
error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerSegment = cell(numel(features),1);
for ii = 1:numel(features)
sequencePerSegment{ii} = max(floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1,0);
idx2 = 1;
for j = 1:sequencePerSegment{ii}
sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
end