By extracting useful features, machine learning can often give you results comparable to or better than deep learning, on far less data. We then apply four methods to anomaly detection: one-class SVM, isolation forest (new in R2021b), robust covariance with Mahalanobis distance, and DBSCAN clustering.
On this data, the SVM performs best, the isolation forest is acceptable, and the Mahalanobis distance works poorly.
While deep learning requires large datasets, machine learning methods need far less data. Most of the time the machine is healthy and produces "normal" data; typically much less "anomalous" data is available. Fortunately, most anomaly detection methods assume exactly this: that the data are mostly normal.
load("FeatureSmall.mat");
rng("default"); % fix the random seed for reproducibility
% show distribution in this small data set
st = groupcounts(featureSmall,"label")
outlierFraction = st.Percent(1)/100; % remember fraction of anomalies in data
% divide subset into train and (held out) test
idxSmall = cvpartition(featureSmall.label,'holdout',0.2);
featSmallTrain = featureSmall(idxSmall.training,:);
featSmallTest = featureSmall(idxSmall.test,:);
trueAnomaliesTest = featSmallTest.label;
% identify subset of "normal" (healthy) data in training
healthyIdx = featSmallTrain.label=="Normal";
trainHealthy = featSmallTrain(healthyIdx,2:13); % can safely skip labels column since we only have one class
trainHealthyN = size(trainHealthy,1);
% train one-class SVM. On this data, training just on the "normal" (healthy) data works better
mdlSVM = fitcsvm(trainHealthy,ones(trainHealthyN,1),'Standardize',true,'OutlierFraction',0);
% this variant trains on all data (including anomalies), with the estimated fraction
%mdlSVM = fitcsvm(featSmallTrain(:,2:13),ones(size(featSmallTrain,1),1),'Standardize',true,'OutlierFraction',outlierFraction);
% apply to "small" test data (which has both anomalies and healthy data)
[~,scoreSVM] = predict(mdlSVM,featSmallTest(:,2:13));
isanomalySVM = scoreSVM<0;
predSVM = categorical(isanomalySVM, [1, 0], ["Anomaly", "Normal"]);
confusionchart(trueAnomaliesTest,predSVM,Title="Anomaly Detection with One-class SVM",Normalization="row-normalized");
% train on normal (healthy) data only. The ContaminationFraction needs tuning!
[mdlIF,~,scoreTrainIF] = iforest(trainHealthy,'ContaminationFraction',0.05);
[isanomalyIF,scoreTestIF] = isanomaly(mdlIF,featSmallTest(:,2:13));
predIF = categorical(isanomalyIF, [1, 0], ["Anomaly", "Normal"]);
confusionchart(trueAnomaliesTest,predIF,Title="Anomaly Detection with Isolation Forest",Normalization="row-normalized");
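The `ContaminationFraction` above was set to 0.05 by hand. A minimal tuning sketch, assuming labeled data is available to score against (here we reuse the held-out test labels purely for illustration; a separate validation split would be cleaner):

```matlab
% sweep candidate contamination fractions and keep the one with the
% highest F1 score against the labeled held-out data
candidates = [0.01 0.02 0.05 0.10];
bestF1 = -inf; bestCF = candidates(1);
for cf = candidates
    mdl = iforest(trainHealthy, 'ContaminationFraction', cf);
    isAnom = isanomaly(mdl, featSmallTest(:,2:13));
    tp = sum(isAnom  & trueAnomaliesTest == "Anomaly");
    fp = sum(isAnom  & trueAnomaliesTest == "Normal");
    fn = sum(~isAnom & trueAnomaliesTest == "Anomaly");
    f1 = 2*tp / (2*tp + fp + fn);
    if f1 > bestF1
        bestF1 = f1; bestCF = cf;
    end
end
```

The candidate values and the use of F1 are our assumptions; any metric that balances missed anomalies against false alarms would do.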
% To better understand how the isolation forest works, visualize the
% distributions of its anomaly scores for the training vs test data
figure;
histogram(scoreTestIF,Normalization="probability");
hold on
histogram(scoreTrainIF,Normalization="probability");
xline(mdlIF.ScoreThreshold,"k-", join(["Threshold =" mdlIF.ScoreThreshold]))
legend("Test data","Training data",Location="southeast")
title("Histograms of Anomaly Scores for Isolation Forest")
hold off
featSmallTrainM = table2array(featSmallTrain(:,2:13)); % robustcov expects input as matrix, not table
featSmallTestM = table2array(featSmallTest(:,2:13));
% robustcov assumes the data are mostly normal with only a small fraction
% of anomalies - an assumption that does not hold well for this data set!
[sigma,mu,mahalDistRobust,~] = robustcov(featSmallTrainM, OutlierFraction=outlierFraction);
sigRobustThresh = sqrt(chi2inv(1-outlierFraction,size(featSmallTrainM,2)));
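The threshold comes from a standard result: for $p$-dimensional Gaussian data, the squared Mahalanobis distance to the mean follows a chi-square distribution with $p$ degrees of freedom, so the $(1-\alpha)$ quantile gives a cutoff that a fraction $1-\alpha$ of normal points should fall under:

```latex
d^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu) \sim \chi^2_p,
\qquad
\text{flag } x \text{ as anomalous if } d(x) > \sqrt{\chi^2_{p,\,1-\alpha}}
```

Here $p = 12$ (the number of features) and $\alpha$ is `outlierFraction`, which is why the code takes the square root of `chi2inv(1-outlierFraction, 12)`.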
% identify anomalies on new (test) data with Mahalanobis distance
scoreTestMahal = pdist2(featSmallTestM,mu,"mahalanobis",sigma);
isanomalyMahal = scoreTestMahal > sigRobustThresh;
predMahal = categorical(isanomalyMahal, [1, 0], ["Anomaly", "Normal"]);
confusionchart(trueAnomaliesTest,predMahal,Title="Anomaly Detection with Mahalanobis / Robust Covariance",Normalization="row-normalized");
% use t-SNE to reduce the 12 features into a two-dimensional space
T = tsne(featSmallTestM);
% compare the various anomaly detection methods visually, starting with
% "ground truth" (labels)
figure;
tiledlayout(2,2)
nexttile
gscatter(T(:,1),T(:,2),trueAnomaliesTest,"kr",".o",[],"off");
title("Ground Truth")
legend("Normal","Anomalous");
nexttile(2)
gscatter(T(:,1),T(:,2),predIF,"kr",".o",[],"off")
title("Isolation Forest")
nexttile(3)
gscatter(T(:,1),T(:,2),predSVM,"kr",".o",[],"off")
title("One-class SVM")
nexttile(4)
gscatter(T(:,1),T(:,2),predMahal,"kr",".o",[],"off")
title("Robust Mahalanobis Distance")
trainSmallSNE = tsne(featSmallTrainM);
mdlDB = dbscan(trainSmallSNE,5,20);
figure
gscatter(trainSmallSNE(:,1),trainSmallSNE(:,2),mdlDB);
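The scatter plot shows the clusters DBSCAN found, but by itself it does not produce anomaly predictions. One heuristic (our addition, not part of the original example) is to treat the points DBSCAN labels as noise, i.e. cluster label -1, as anomalies:

```matlab
% DBSCAN assigns -1 to points that belong to no cluster (noise);
% treat those as anomalies and compare against the training labels
isanomalyDB = (mdlDB == -1);
predDB = categorical(isanomalyDB, [1, 0], ["Anomaly", "Normal"]);
confusionchart(featSmallTrain.label, predDB, ...
    Title="Anomaly Detection with DBSCAN (noise = anomaly)", ...
    Normalization="row-normalized");
```

Note that this evaluates on the training set, since DBSCAN has no `predict` step for new points; whether noise points coincide with true anomalies depends heavily on the epsilon and minpts values chosen above.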