MATLAB LSTM Classification: Classify Gender Using LSTM Networks

Introduction

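Gender classification from speech signals is an essential component of many audio systems, such as automatic speech recognition, speaker recognition, and content-based multimedia indexing.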

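This example uses long short-term memory (LSTM) networks, a type of recurrent neural network (RNN) well suited to sequence and time-series data. An LSTM network can learn long-term dependencies between the time steps of a sequence. An LSTM layer (lstmLayer) processes the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) processes it in both the forward and backward directions. This example uses bidirectional LSTM layers.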

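This example trains the LSTM network on sequences of gammatone cepstral coefficients (gtcc (Audio Toolbox)), pitch estimates (pitch (Audio Toolbox)), harmonic ratios (harmonicRatio (Audio Toolbox)), and several spectral shape descriptors (Spectral Descriptors (Audio Toolbox)).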

To speed up training, run this example on a machine with a GPU. If your machine has both a GPU and Parallel Computing Toolbox™, MATLAB® uses the GPU for training automatically; otherwise, it uses the CPU.

Classify Gender with a Pretrained Network

Before going through the training process in detail, use a pretrained network to classify the gender of the speaker in two test signals.

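Load the pretrained network together with the precomputed vectors used for feature normalization (M holds the per-feature means and S the per-feature standard deviations).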

load('genderIDNet.mat', 'genderIDNet', 'M', 'S');

Load a test signal with a male speaker.

[audioIn, Fs] = audioread('maleSpeech.flac');

sound(audioIn, Fs)

Isolate the speech region in the signal.

boundaries = detectSpeech(audioIn, Fs);

audioIn = audioIn(boundaries(1):boundaries(2));

Create an audioFeatureExtractor (Audio Toolbox) to extract features from the audio data. You will use the same object to extract features for training.

extractor = audioFeatureExtractor( ...

"SampleRate",Fs, ...

"Window",hamming(round(0.03*Fs),"periodic"), ...

"OverlapLength",round(0.02*Fs), ...

...

"gtcc",true, ...

"gtccDelta",true, ...

"gtccDeltaDelta",true, ...

...

"SpectralDescriptorInput","melSpectrum", ...

"spectralCentroid",true, ...

"spectralEntropy",true, ...

"spectralFlux",true, ...

"spectralSlope",true, ...

...

"pitch",true, ...

"harmonicRatio",true);

Extract features from the signal and normalize them.

features = extract(extractor, audioIn);

features = (features.' - M)./S;

Classify the signal.

gender = classify(genderIDNet, features)

gender = categorical

male

Classify a second signal, this one with a female speaker.

[audioIn, Fs] = audioread('femaleSpeech.flac');

sound(audioIn, Fs)

boundaries = detectSpeech(audioIn, Fs);

audioIn = audioIn(boundaries(1):boundaries(2));

features = extract(extractor, audioIn);

features = (features.' - M)./S;

classify(genderIDNet, features)

ans = categorical

female

Preprocess the Training Audio Data

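The BiLSTM network used in this example works best with sequences of feature vectors. To illustrate the preprocessing pipeline, this example walks through the steps for a single audio file.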

Read the contents of an audio file containing speech. The speaker is male.

[audioIn,Fs] = audioread('Counting-16-44p1-mono-15secs.wav');

labels = {'male'};

Plot the audio signal, then listen to it using the sound command.

timeVector = (1/Fs) * (0:size(audioIn,1)-1);

figure

plot(timeVector,audioIn)

ylabel("Amplitude")

xlabel("Time (s)")

title("Sample Audio")

grid on

[Figure 1: waveform of the sample audio signal]

sound(audioIn,Fs)

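The speech signal has silent segments that carry no useful information about the speaker's gender. Use detectSpeech (Audio Toolbox) to locate the speech segments in the audio signal.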

speechIndices = detectSpeech(audioIn,Fs);

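Create an audioFeatureExtractor (Audio Toolbox) to extract features from the audio data. A speech signal is dynamic and changes over time. It is assumed to be stationary on short time scales, so it is usually processed in windows of 20-40 ms. Specify 30 ms windows with 20 ms of overlap.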
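To see what a 30 ms window with 20 ms of overlap means in samples, the following minimal sketch works through the frame arithmetic for a 44.1 kHz signal (the rate suggested by the sample file name). All variable names are illustrative, and the frame-count formula assumes that the final partial window is not padded, which may differ from what audioFeatureExtractor does internally.

```matlab
% Frame arithmetic for a 30 ms window with 20 ms overlap (illustrative names).
demoFs = 44100;                                        % assumed sample rate in Hz
demoWindowLength  = round(0.03*demoFs);                % samples per analysis window
demoOverlapLength = round(0.02*demoFs);                % samples shared by neighboring windows
demoHopLength = demoWindowLength - demoOverlapLength;  % advance between windows

% Number of complete windows in a 2 second signal, assuming no padding.
demoNumSamples = 2*demoFs;
demoNumFrames = floor((demoNumSamples - demoWindowLength)/demoHopLength) + 1;

fprintf('window = %d, overlap = %d, hop = %d samples\n', ...
    demoWindowLength, demoOverlapLength, demoHopLength);
fprintf('complete windows in 2 s: %d\n', demoNumFrames);
```

Under these assumptions, the hop of 441 samples corresponds to one feature vector every 10 ms.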

extractor = audioFeatureExtractor( ...

"SampleRate",Fs, ...

"Window",hamming(round(0.03*Fs),"periodic"), ...

"OverlapLength",round(0.02*Fs), ...

...

"gtcc",true, ...

"gtccDelta",true, ...

"gtccDeltaDelta",true, ...

...

"SpectralDescriptorInput","melSpectrum", ...

"spectralCentroid",true, ...

"spectralEntropy",true, ...

"spectralFlux",true, ...

"spectralSlope",true, ...

...

"pitch",true, ...

"harmonicRatio",true);

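Extract features from each audio segment. The output of audioFeatureExtractor is a numFeatureVectors-by-numFeatures array. The sequenceInputLayer used in this example requires time to run along the second dimension, so transpose the output array to put time along the second dimension.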

featureVectorsSegment = {};

for ii = 1:size(speechIndices,1)

featureVectorsSegment{end+1} = ( extract(extractor,audioIn(speechIndices(ii,1):speechIndices(ii,2))) )';

end

numSegments = size(featureVectorsSegment)

numSegments = 1×2

1 11

[numFeatures,numFeatureVectorsSegment1] = size(featureVectorsSegment{1})

numFeatures = 45

numFeatureVectorsSegment1 = 124

Replicate the labels so that they are in one-to-one correspondence with the audio segments.

labels = repelem(labels,size(speechIndices,1))

labels = 1×11 cell

{'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'} {'male'}

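When using a sequenceInputLayer, it is usually best to use sequences of consistent length. Convert the arrays of feature vectors into sequences of feature vectors. Use 20 feature vectors per sequence, with 5 feature vectors of overlap between consecutive sequences.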

featureVectorsPerSequence = 20;

featureVectorOverlap = 5;

hopLength = featureVectorsPerSequence - featureVectorOverlap;

idx1 = 1;

featuresTrain = {};

sequencePerSegment = zeros(numel(featureVectorsSegment),1);

for ii = 1:numel(featureVectorsSegment)

sequencePerSegment(ii) = max(floor((size(featureVectorsSegment{ii},2) - featureVectorsPerSequence)/hopLength) + 1,0);

idx2 = 1;

for j = 1:sequencePerSegment(ii)

featuresTrain{idx1,1} = featureVectorsSegment{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1);

idx1 = idx1 + 1;

idx2 = idx2 + hopLength;

end

end
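The buffering loop above can be checked on a synthetic segment. The sketch below uses a made-up 45-by-124 matrix (the feature count and segment length reported earlier for the first speech segment) and confirms how many 20-column sequences a hop of 15 produces. The names prefixed with demo are illustrative only.

```matlab
% Buffer one synthetic segment into fixed-length, overlapping sequences.
demoSegment = reshape(1:45*124, 45, 124);   % stand-in for a 45-by-124 feature matrix
demoPerSequence = 20;                       % feature vectors per sequence
demoOverlap = 5;                            % feature vectors shared by neighbors
demoHop = demoPerSequence - demoOverlap;    % 15 columns between sequence starts

% Same count formula as the loop above; segments that are too short yield 0.
demoNumSequences = max(floor((size(demoSegment,2) - demoPerSequence)/demoHop) + 1, 0);

demoSequences = cell(demoNumSequences,1);
for k = 1:demoNumSequences
    firstColumn = (k-1)*demoHop + 1;
    demoSequences{k} = demoSegment(:, firstColumn:firstColumn + demoPerSequence - 1);
end

% A segment with only 10 feature vectors produces no sequence at all.
demoShortCount = max(floor((10 - demoPerSequence)/demoHop) + 1, 0);
fprintf('sequences: %d, from a short segment: %d\n', demoNumSequences, demoShortCount);
```

The last sequence ends at column 110, so the trailing 14 feature vectors of this segment are not used.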

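For conciseness, the helper function HelperFeatureVector2Sequence (listed at the end of this example) encapsulates the processing above and is used throughout the rest of the example.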

Replicate the labels so that they are in one-to-one correspondence with the training sequences.

labels = repelem(labels,sequencePerSegment);
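As a small illustration of this replication step, the sketch below repeats each entry of a made-up label list according to a made-up per-segment sequence count. The values and the demo names are hypothetical.

```matlab
% Repeat each segment label once per sequence generated from that segment.
demoLabels = {'male','female','male'};   % hypothetical labels, one per segment
demoCounts = [2 1 3];                    % hypothetical sequences per segment
demoExpanded = repelem(demoLabels, demoCounts);
disp(demoExpanded)
```

After the call, the number of labels equals the total number of sequences, which is what the training function needs.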

The result of the preprocessing pipeline is a NumSequence-by-1 cell array of NumFeatures-by-FeatureVectorsPerSequence matrices. labels is an array with NumSequence elements.

NumSequence = numel(featuresTrain)

NumSequence = 27

[NumFeatures,FeatureVectorsPerSequence] = size(featuresTrain{1})

NumFeatures = 45

FeatureVectorsPerSequence = 20

NumSequence = numel(labels)

NumSequence = 27

The following figure gives an overview of the feature extraction used for each detected speech region.

[Figure 2: overview of feature extraction for each detected speech region]

Create Training and Validation Datastores

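This example uses a subset of the Mozilla Common Voice dataset [1]. The dataset contains 48 kHz recordings of subjects speaking short sentences. Download the dataset and unzip the downloaded file. Set dataFolder to the location of the data.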

url = 'http://ssd.mathworks.com/supportfiles/audio/commonvoice.zip';

downloadFolder = tempdir;

dataFolder = fullfile(downloadFolder,'commonvoice');

if ~exist(dataFolder,'dir')

disp('Downloading data set (956 MB) ...')

unzip(url,downloadFolder)

end

Use audioDatastore to create datastores for the training and validation sets. Use readtable to read the metadata associated with the audio files.

loc = fullfile(dataFolder);

adsTrain = audioDatastore(fullfile(loc,'train'),'IncludeSubfolders',true);

metadataTrain = readtable(fullfile(fullfile(loc,'train'),"train.tsv"),"FileType","text");

adsTrain.Labels = metadataTrain.gender;

adsValidation = audioDatastore(fullfile(loc,'validation'),'IncludeSubfolders',true);

metadataValidation = readtable(fullfile(fullfile(loc,'validation'),"validation.tsv"),"FileType","text");

adsValidation.Labels = metadataValidation.gender;

Use countEachLabel (Audio Toolbox) to inspect the gender breakdown of the training and validation sets.

countEachLabel(adsTrain)

ans=2×2 table
    Label     Count
    ______    _____

    female    1000
    male      1000

countEachLabel(adsValidation)

ans=2×2 table
    Label     Count
    ______    _____

    female    200
    male      200

To train the network on the entire dataset and achieve the highest possible accuracy, set reduceDataset to false. To run this example quickly, set reduceDataset to true.

reduceDataset = false;

if reduceDataset

% Reduce the training dataset by a factor of 20

adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / 2 / 20));

adsValidation = splitEachLabel(adsValidation,20);

end
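The effect of the reduction can be worked out from the label counts shown above (1000 training and 200 validation files per gender). This sketch only reproduces the arithmetic; the demo names are illustrative.

```matlab
% Files kept per label when reduceDataset is true.
demoNumTrainFiles = 2000;   % 1000 female + 1000 male, from countEachLabel
demoNumLabels = 2;
demoTrainPerLabel = round(demoNumTrainFiles / demoNumLabels / 20);  % training files per label

demoValPerLabelBefore = 200;   % validation files per label, from countEachLabel
demoValPerLabel = 20;          % value passed to splitEachLabel for validation
demoValFactor = demoValPerLabelBefore / demoValPerLabel;

fprintf('training files per label: %d\n', demoTrainPerLabel);
fprintf('validation reduction factor: %d\n', demoValFactor);
```

Note that the factor of 20 mentioned in the code comment applies to the training set; with these counts the validation set shrinks by a factor of 10.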

Create Training and Validation Sets

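Determine the sample rate of the audio files in the dataset, then update the sample rate, window, and overlap length of the audio feature extractor.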

[~,adsInfo] = read(adsTrain);

Fs = adsInfo.SampleRate;

extractor.SampleRate = Fs;

extractor.Window = hamming(round(0.03*Fs),"periodic");

extractor.OverlapLength = round(0.02*Fs);

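To speed up processing, distribute the computation across multiple workers. If you have Parallel Computing Toolbox™, the example partitions the datastore so that feature extraction runs in parallel across the available workers. Determine the optimal number of partitions for your system. If you do not have Parallel Computing Toolbox™ (or if reduceDataset is true), the example uses a single partition.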

if ~isempty(ver('parallel')) && ~reduceDataset

pool = gcp;

numPar = numpartitions(adsTrain,pool);

else

numPar = 1;

end

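In a loop:

1. Read from the audio datastore.

2. Detect the regions of speech.

3. Extract feature vectors from the regions of speech.

4. Replicate the labels so that they are in one-to-one correspondence with the feature vectors.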

labelsTrain = [];

featureVectors = {};

% Loop over optimal number of partitions

parfor ii = 1:numPar

% Partition datastore

subds = partition(adsTrain,numPar,ii);

% Preallocation

featureVectorsInSubDS = {};

segmentsPerFile = zeros(numel(subds.Files),1);

% Loop over files in partitioned datastore

for jj = 1:numel(subds.Files)

% 1. Read in a single audio file

audioIn = read(subds);

% 2. Determine the regions of the audio that correspond to speech

speechIndices = detectSpeech(audioIn,Fs);

% 3. Extract features from each speech segment

segmentsPerFile(jj) = size(speechIndices,1);

features = cell(segmentsPerFile(jj),1);

for kk = 1:size(speechIndices,1)

features{kk} = ( extract(extractor,audioIn(speechIndices(kk,1):speechIndices(kk,2))) )';

end

featureVectorsInSubDS = [featureVectorsInSubDS;features(:)];

end

featureVectors = [featureVectors;featureVectorsInSubDS];

% Replicate the labels so that they are in one-to-one correspondence

% with the feature vectors.

repedLabels = repelem(subds.Labels,segmentsPerFile);

labelsTrain = [labelsTrain;repedLabels(:)];

end

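In classification applications, it is good practice to normalize all features to have zero mean and unit standard deviation.

Compute the mean and standard deviation of each coefficient and use them to normalize the data.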

allFeatures = cat(2,featureVectors{:});

allFeatures(isinf(allFeatures)) = nan;

M = mean(allFeatures,2,'omitnan');

S = std(allFeatures,0,2,'omitnan');

featureVectors = cellfun(@(x)(x-M)./S,featureVectors,'UniformOutput',false);

for ii = 1:numel(featureVectors)

idx = find(isnan(featureVectors{ii}));

if ~isempty(idx)

featureVectors{ii}(idx) = 0;

end

end
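The normalization above can be traced on a tiny synthetic example with two features and two segments. One deliberate difference, which is an assumption of this sketch and not part of the original code: infinite entries inside the individual segments are also mapped to NaN before normalizing, so they end up at zero (the feature mean) instead of remaining infinite. All demo names and values are made up.

```matlab
% Two synthetic segments (features in rows, time in columns), one with an Inf.
demoFeatureVectors = {[1 2 3; 10 Inf 30], [5 7; 50 70]};

% Pool all columns and compute per-feature statistics, ignoring missing values.
demoAll = cat(2, demoFeatureVectors{:});
demoAll(isinf(demoAll)) = NaN;
demoM = mean(demoAll, 2, 'omitnan');
demoS = std(demoAll, 0, 2, 'omitnan');

% Normalize each segment; non-finite entries become 0 after normalization.
demoNormalized = cell(size(demoFeatureVectors));
for ii = 1:numel(demoFeatureVectors)
    x = demoFeatureVectors{ii};
    x(~isfinite(x)) = NaN;        % sketch-only choice: treat Inf as missing too
    x = (x - demoM)./demoS;       % implicit expansion over the time dimension
    x(isnan(x)) = 0;              % a missing value maps to the feature mean
    demoNormalized{ii} = x;
end
disp(demoM.')
```

Replacing a missing value with zero after normalization is equivalent to imputing the feature mean, which keeps the sequence length unchanged.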

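Buffer the feature vectors into sequences of 20 feature vectors with an overlap of 5 (the value of featureVectorOverlap set earlier). A segment with fewer than 20 feature vectors yields no sequence and is dropped.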

[featuresTrain,trainSequencePerSegment] = HelperFeatureVector2Sequence(featureVectors,featureVectorsPerSequence,featureVectorOverlap);

Replicate the labels so that they are in one-to-one correspondence with the sequences.

labelsTrain = repelem(labelsTrain,[trainSequencePerSegment{:}]);

labelsTrain = categorical(labelsTrain);

Create the validation set using the same steps used to create the training set.

labelsValidation = [];

featureVectors = {};

valSegmentsPerFile = [];

parfor ii = 1:numPar

subds = partition(adsValidation,numPar,ii);

featureVectorsInSubDS = {};

valSegmentsPerFileInSubDS = zeros(numel(subds.Files),1);

for jj = 1:numel(subds.Files)

audioIn = read(subds);

speechIndices = detectSpeech(audioIn,Fs);

numSegments = size(speechIndices,1);

features = cell(numSegments,1);

for kk = 1:numSegments

features{kk} = ( extract(extractor,audioIn(speechIndices(kk,1):speechIndices(kk,2))) )';

end

featureVectorsInSubDS = [featureVectorsInSubDS;features(:)];

valSegmentsPerFileInSubDS(jj) = numSegments;

end

repedLabels = repelem(subds.Labels,valSegmentsPerFileInSubDS);

labelsValidation = [labelsValidation;repedLabels(:)];

featureVectors = [featureVectors;featureVectorsInSubDS];

valSegmentsPerFile = [valSegmentsPerFile;valSegmentsPerFileInSubDS];

end

featureVectors = cellfun(@(x)(x-M)./S,featureVectors,'UniformOutput',false);

for ii = 1:numel(featureVectors)

idx = find(isnan(featureVectors{ii}));

if ~isempty(idx)

featureVectors{ii}(idx) = 0;

end

end

[featuresValidation,valSequencePerSegment] = HelperFeatureVector2Sequence(featureVectors,featureVectorsPerSequence,featureVectorOverlap);

labelsValidation = repelem(labelsValidation,[valSequencePerSegment{:}]);

labelsValidation = categorical(labelsValidation);

Define the LSTM Network Architecture

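LSTM networks can learn long-term dependencies between the time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to process each sequence in both the forward and backward directions.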

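Specify the input as sequences of size NumFeatures. Specify a hidden bidirectional LSTM layer with 50 hidden units that outputs a sequence. Then specify a second bidirectional LSTM layer with 50 hidden units that outputs the last element of the sequence. Because a bidirectional layer concatenates its forward and backward states, each of these layers emits 100 features per output step, and the last one feeds the fully connected layer. Finally, specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.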

layers = [ ...

sequenceInputLayer(size(featuresTrain{1},1))

bilstmLayer(50,"OutputMode","sequence")

bilstmLayer(50,"OutputMode","last")

fullyConnectedLayer(2)

softmaxLayer

classificationLayer];
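For a sense of the model size, the sketch below counts the learnable parameters of this architecture by hand. It assumes the standard LSTM parameterization (four gates per direction, so input weights of size 8H-by-D, recurrent weights of size 8H-by-H, and a bias of length 8H for a bidirectional layer with H hidden units and input size D) and the 45 features reported earlier. The demo names are illustrative.

```matlab
% Hand count of learnable parameters, assuming the standard LSTM gate layout.
demoNumFeatures = 45;   % input size, from the preprocessing output above
demoHidden = 50;        % hidden units per direction
demoNumClasses = 2;

% Bidirectional LSTM: input weights + recurrent weights + bias, both directions.
demoBilstmParams = @(inputSize, hidden) ...
    8*hidden*inputSize + 8*hidden*hidden + 8*hidden;

demoLayer1 = demoBilstmParams(demoNumFeatures, demoHidden);
demoLayer2 = demoBilstmParams(2*demoHidden, demoHidden);       % input is 100 features wide
demoFullyConnected = demoNumClasses*(2*demoHidden) + demoNumClasses;
demoTotal = demoLayer1 + demoLayer2 + demoFullyConnected;

fprintf('bilstm 1: %d, bilstm 2: %d, fc: %d, total: %d\n', ...
    demoLayer1, demoLayer2, demoFullyConnected, demoTotal);
```

The second layer is larger than the first because its input is the 100-wide concatenated output of the first bidirectional layer rather than the 45 raw features.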

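Next, specify the training options for the classifier. Set MaxEpochs to 4 so that the network makes 4 passes through the training data. Set MiniBatchSize to 256 so that the network looks at 256 training sequences at a time. Specify Plots as "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to false to disable printing the table output that corresponds to the data shown in the plot. Specify Shuffle as "every-epoch" to shuffle the training sequences at the beginning of each epoch. Specify LearnRateSchedule as "piecewise" to multiply the learning rate by the specified factor (0.1) each time the specified number of epochs (1) has passed.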

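This example uses the adaptive moment estimation (ADAM) solver. ADAM tends to perform better with recurrent neural networks (RNNs) such as LSTMs than the default stochastic gradient descent with momentum (SGDM) solver.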

miniBatchSize = 256;

validationFrequency = floor(numel(labelsTrain)/miniBatchSize);

options = trainingOptions("adam", ...

"MaxEpochs",4, ...

"MiniBatchSize",miniBatchSize, ...

"Plots","training-progress", ...

"Verbose",false, ...

"Shuffle","every-epoch", ...

"LearnRateSchedule","piecewise", ...

"LearnRateDropFactor",0.1, ...

"LearnRateDropPeriod",1, ...

'ValidationData',{featuresValidation,labelsValidation}, ...

'ValidationFrequency',validationFrequency);
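The piecewise schedule and the validation frequency above reduce to simple arithmetic, sketched below. Two values are assumptions rather than facts from this example: the initial learning rate of 0.001 (the documented default for the adam solver when InitialLearnRate is not set) and the training-set size of 10000 sequences, which is a made-up number used only to show the calculation.

```matlab
% Learning rate per epoch for a piecewise schedule (drop factor 0.1, period 1).
demoInitialLearnRate = 1e-3;   % assumed: default for the adam solver
demoDropFactor = 0.1;
demoDropPeriod = 1;
demoEpochs = 1:4;
demoLearnRate = demoInitialLearnRate * ...
    demoDropFactor.^floor((demoEpochs - 1)/demoDropPeriod);

% Iterations between validation passes, for a hypothetical training-set size.
demoNumSequences = 10000;      % hypothetical count, not taken from this example
demoMiniBatchSize = 256;
demoValidationFrequency = floor(demoNumSequences/demoMiniBatchSize);

fprintf('%g ', demoLearnRate); fprintf('\n');
fprintf('validate every %d iterations\n', demoValidationFrequency);
```

Because the validation frequency equals the number of whole mini-batches in the training set, validation runs roughly once per epoch.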

Train the LSTM Network

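Use trainNetwork to train the LSTM network with the specified training options and layer architecture. Because the training set is large, training can take several minutes.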

net = trainNetwork(featuresTrain,labelsTrain,layers,options);

[Figure 3: training progress plot]

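The top subplot of the training-progress plot shows the training accuracy, which is the classification accuracy on each mini-batch. When training progresses successfully, this value typically increases toward 100%. The bottom subplot shows the training loss, which is the cross-entropy loss on each mini-batch. When training progresses successfully, this value typically decreases toward zero.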

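If training does not converge, the plots might oscillate between values without trending upward or downward. This oscillation means that the training accuracy is not improving and the training loss is not decreasing. It can occur at the start of training or after some initial improvement in training accuracy. In many cases, changing the training options helps the network converge. Decreasing MiniBatchSize or decreasing InitialLearnRate can lengthen training, but it may help the network learn better.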

Visualize the Training Accuracy

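Compute the training accuracy, which represents the accuracy of the classifier on the signals on which it was trained. First, classify the training data.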

prediction = classify(net,featuresTrain);

Plot the confusion matrix. Display the precision and recall for the two classes using the column and row summaries.

figure

cm = confusionchart(categorical(labelsTrain),prediction,'title','Training Accuracy');

cm.ColumnSummary = 'column-normalized';

cm.RowSummary = 'row-normalized';

[Figure 4: confusion chart, training accuracy]

Visualize the Validation Accuracy

Compute the validation accuracy. First, classify the validation data.

[prediction,probabilities] = classify(net,featuresValidation);

Plot the confusion matrix. Display the precision and recall for the two classes using the column and row summaries.

figure

cm = confusionchart(categorical(labelsValidation),prediction,'title','Validation Set Accuracy');

cm.ColumnSummary = 'column-normalized';

cm.RowSummary = 'row-normalized';

[Figure 5: confusion chart, validation set accuracy]

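The example generates multiple sequences from each speech file. Higher accuracy can be achieved by considering the output classes of all sequences that come from the same file and applying a "max-rule" decision, in which the class of the segment with the highest confidence score is selected.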

Determine the number of sequences generated per file in the validation set.

sequencePerFile = zeros(size(valSegmentsPerFile));

valSequencePerSegmentMat = cell2mat(valSequencePerSegment);

idx = 1;

for ii = 1:numel(valSegmentsPerFile)

sequencePerFile(ii) = sum(valSequencePerSegmentMat(idx:idx+valSegmentsPerFile(ii)-1));

idx = idx + valSegmentsPerFile(ii);

end

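Predict the gender of each validation file by considering the output classes of all sequences generated from the same file. As written, the code compares the class scores averaged over a file's sequences: the max call operates along dimension 1 of a 1-by-2 row vector and therefore leaves it unchanged.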

numFiles = numel(adsValidation.Files);

actualGender = categorical(adsValidation.Labels);

predictedGender = actualGender;

scores = cell(1,numFiles);

counter = 1;

cats = unique(actualGender);

for index = 1:numFiles

scores{index} = probabilities(counter: counter + sequencePerFile(index) - 1,:);

m = max(mean(scores{index},1),[],1);

if m(1) >= m(2)

predictedGender(index) = cats(1);

else

predictedGender(index) = cats(2);

end

counter = counter + sequencePerFile(index);

end
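The file-level decision can be illustrated on made-up numbers. The sketch below averages per-sequence class probabilities within each file and picks the class with the larger mean, which is the comparison the loop above carries out. The probabilities, the counts, and the assumption that the columns follow the order of demoClasses are all hypothetical.

```matlab
% Hypothetical per-sequence class probabilities (one row per sequence).
% Column order is assumed to match demoClasses.
demoProbabilities = [0.9 0.1; 0.4 0.6; 0.8 0.2; 0.3 0.7; 0.2 0.8];
demoSequencePerFile = [3; 2];                  % sequences contributed by each file
demoClasses = categorical({'female','male'});

% Row ranges of each file inside the stacked probability matrix.
demoLastRow  = cumsum(demoSequencePerFile);
demoFirstRow = demoLastRow - demoSequencePerFile + 1;

% Average the scores of each file, then choose the class with the larger mean.
demoNumFiles = numel(demoSequencePerFile);
demoMeanScores = zeros(demoNumFiles, numel(demoClasses));
for f = 1:demoNumFiles
    rows = demoFirstRow(f):demoLastRow(f);
    demoMeanScores(f,:) = mean(demoProbabilities(rows,:), 1);
end
[~, demoWinner] = max(demoMeanScores, [], 2);
demoPredicted = demoClasses(demoWinner);

disp(demoMeanScores)
disp(demoPredicted)
```

In the first file, one of the three sequences favors the second class, yet the averaged scores still favor the first, which shows how pooling sequences can correct isolated per-sequence errors.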

Visualize the confusion matrix for the max-rule predictions.

figure

cm = confusionchart(actualGender,predictedGender,'title','Validation Set Accuracy - Max Rule');

cm.ColumnSummary = 'column-normalized';

cm.RowSummary = 'row-normalized';

[Figure 6: confusion chart, validation set accuracy with the max rule]

References

Supporting Functions

function [sequences,sequencePerSegment] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)

if featureVectorsPerSequence <= featureVectorOverlap

error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')

end

hopLength = featureVectorsPerSequence - featureVectorOverlap;

idx1 = 1;

sequences = {};

sequencePerSegment = cell(numel(features),1);

for ii = 1:numel(features)

sequencePerSegment{ii} = max(floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1,0);

idx2 = 1;

for j = 1:sequencePerSegment{ii}

sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1); %#ok

idx1 = idx1 + 1;

idx2 = idx2 + hopLength;

end

end

end
