【鲁棒】使用概率轨迹的鲁棒集成聚类研究(Matlab代码实现)

欢迎来到本博客❤️❤️

博主优势:博客内容尽量做到思维缜密,逻辑清晰,为了方便读者。

⛳️座右铭:行百里者,半于九十。

本文目录如下:

目录

1 概述

2 运行结果​编辑

3 文献来源

4 Matlab代码实现


1 概述

摘要:

尽管近年来已经开发了许多成功的集成聚类方法,但大多数现有方法仍然存在两个局限性。首先,它们大多忽视了不确定环节的问题,这可能会误导整个共识进程。其次,它们通常缺乏整合全球信息以完善本地联系的能力。为了解决这两个限制,本文提出了一种基于稀疏图表示和概率轨迹分析的新型集成聚类方法。特别是,我们提出了精英邻居选择策略,通过局部自适应阈值识别不确定链接,并构建具有少量可能可靠链接的稀疏图。我们认为,与使用所有图链接相比,少量可能可靠的链接可以带来明显更好的共识结果,而不管它们的可靠性如何。利用新的转移概率矩阵驱动的随机游走过程来探索图中的全局信息。通过分析随机游走者的概率轨迹,从稀疏图中推导出一种新颖而密集的相似性度量,并在此基础上进一步提出了两个共识函数。在多个真实数据集上的实验结果证明了我们方法的有效性和效率。

原文摘要:

Abstract:

Although many successful ensemble clustering approaches have been developed in recent years, there are still two limitations to most of the existing approaches. First, they mostly overlook the issue of uncertain links, which may mislead the overall consensus process. Second, they generally lack the ability to incorporate global information to refine the local links. To address these two limitations, in this paper, we propose a novel ensemble clustering approach based on sparse graph representation and probability trajectory analysis. In particular, we present the elite neighbor selection strategy to identify the uncertain links by locally adaptive thresholds and build a sparse graph with a small number of probably reliable links. We argue that a small number of probably reliable links can lead to significantly better consensus results than using all graph links regardless of their reliability. The random walk process driven by a new transition probability matrix is utilized to explore the global information in the graph. We derive a novel and dense similarity measure from the sparse graph by analyzing the probability trajectories of the random walkers, based on which two consensus functions are further proposed. Experimental results on multiple real-world datasets demonstrate the effectiveness and efficiency of our approach.

2 运行结果【鲁棒】使用概率轨迹的鲁棒集成聚类研究(Matlab代码实现)_第1张图片

主函数代码:

clear all;
close all;
clc;


%% Load the base clustering pool.
% We have generated a pool of 200 candidate base clusterings for each dataset. 
% Please uncomment the dataset that you want to use and comment the other ones.

dataName = 'MF';
% dataName = 'IS';
% dataName = 'MNIST';
% dataName = 'ODR';
% dataName = 'LS';
% dataName = 'PD';
% dataName = 'USPS';
% dataName = 'FC';
% dataName = 'KDD99_10P';
% dataName = 'KDD99';

members = [];
gt = [];
load(fullfile('..','data',['bc_pool_',dataName,'.mat']),'members','gt');

[N, poolSize] = size(members);
trueK = numel(unique(gt));

%% Settings
% Ensemble size M
M = 10;
% How many times the PTA and PTGP algorithms will be run.
cntTimes = 20; 
% You can set cntTimes to a greater (or smaller) integer if you want to run
% the algorithms more (or less) times.

% For each run, M base clusterings will be randomly drawn from the pool.
% Each row in bcIdx corresponds to an ensemble of M base clusterings.
bcIdx = zeros(cntTimes, M);
for i = 1:cntTimes
    tmp = randperm(poolSize);
    bcIdx(i,:) = tmp(1:M);
end

%% Run PTA and PTGP repeatedly.
% Test different numbers of clusters.
clsNums = [2:20, 25:5:50];
clsNums = unique([clsNums,trueK]);

% In general, you can also simply set the number of clusters to the true number of classses.
% clsNums = trueK;

% Scores
outDir = fullfile('..','results');
mkdir(outDir);
nmiScoresBestK_PTA = zeros(cntTimes, 3);
nmiScoresTrueK_PTA = zeros(cntTimes, 3);
nmiScoresBestK_PTGP = zeros(cntTimes, 1);
nmiScoresTrueK_PTGP = zeros(cntTimes, 1);
for runIdx = 1:cntTimes
    disp('**************************************************************');
    disp(['Run ', num2str(runIdx),':']);
    disp('**************************************************************');
    
    %% Construct the ensemble of M base clusterings
    % baseCls is an N x M matrix, each row being a base clustering.
    baseCls = members(:,bcIdx(runIdx,:));
    
    %% Produce microclusters
    disp('Produce microclusters ... ');
    tic; [mcBaseCls, mcLabels] = computeMicroclusters(baseCls); toc;
    tilde_N = size(mcBaseCls,1);
    disp('--------------------------------------------------------------');
    
    %% Compute the microcluster based co-association matrix.
    disp('Compute the MCA matrix ... ');
    tic; MCA = computeMCA(mcBaseCls); toc;
    disp('--------------------------------------------------------------');
    
    %% Set parameters K and T.
    para.K = floor(sqrt(tilde_N)/2);
    para.T = floor(sqrt(tilde_N)/2);
    
    %% Compute PTS
    disp('Compute PTS ... ');
    tic; PTS = computePTS_fast(MCA,mcLabels,para); toc;
    disp('--------------------------------------------------------------');
    
    %% Perform PTA
    disp('Run the PTA algorithm ... '); 
    [mcResultsAL,mcResultsCL,mcResultsSL] = runPTA(PTS, clsNums);
    % The i-th column in results_al\results_cl\results_sl represents the
    % consensus clustering with clsNums(i) clusters by PTA-AL\CL\SL.
    disp('--------------------------------------------------------------');
    
    %% Perform PTGP 
    disp('Run the PTGP algorithm ... '); 
    mcResultsPTGP = runPTGP(mcBaseCls, PTS, clsNums);     
    disp('--------------------------------------------------------------'); 

    %% Display the clustering results.
    disp('Map microclusters back to objects ... '); tic;
    resultsAL = mapMicroclustersBackToObjects(mcResultsAL, mcLabels);
    resultsCL = mapMicroclustersBackToObjects(mcResultsCL, mcLabels);
    resultsSL = mapMicroclustersBackToObjects(mcResultsSL, mcLabels);
    resultsPTGP = mapMicroclustersBackToObjects(mcResultsPTGP, mcLabels);toc;
    disp('--------------------------------------------------------------');
    
    disp('##############################################################'); 
    scoresAL = computeNMI(resultsAL,gt);
    scoresCL = computeNMI(resultsCL,gt);
    scoresSL = computeNMI(resultsSL,gt);
    scoresPTGP = computeNMI(resultsPTGP,gt);
    
    trueKidx = find(clsNums==trueK);
    nmiScoresBestK_PTA(runIdx,:) = [max(scoresAL),max(scoresCL),max(scoresSL)];
    nmiScoresTrueK_PTA(runIdx,:) = [scoresAL(trueKidx),scoresCL(trueKidx),scoresSL(trueKidx)];
    
    nmiScoresBestK_PTGP(runIdx) = max(scoresPTGP);
    nmiScoresTrueK_PTGP(runIdx) = scoresPTGP(trueKidx);
    
    disp(['The Scores at Run ',num2str(runIdx)]);
    disp('    ---------- The NMI scores w.r.t. best-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresBestK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresBestK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresBestK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresBestK_PTGP(runIdx))]);
    disp('    ---------- The NMI scores w.r.t. true-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresTrueK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresTrueK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresTrueK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresTrueK_PTGP(runIdx))]);
    disp('##############################################################'); 

    %% Save results
    save(fullfile(outDir,['results_',dataName,'_M',num2str(M),'_',num2str(cntTimes),'runs.mat']),'bcIdx','nmiScoresBestK_PTA','nmiScoresTrueK_PTA','nmiScoresBestK_PTGP','nmiScoresTrueK_PTGP');  
end

disp('**************************************************************');
disp(['   ** Average Performance over ',num2str(cntTimes),' runs on the ',dataName,' dataset **']);
disp(['Data size:     ', num2str(N)]);
disp(['Ensemble size: ', num2str(M)]);
disp('   ---------- Average NMI scores w.r.t. best-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresBestK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresBestK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresBestK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresBestK_PTGP))]);
disp('   ---------- Average NMI scores w.r.t. true-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresTrueK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresTrueK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresTrueK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresTrueK_PTGP))]);
disp('**************************************************************');
disp('**************************************************************');

clear all;
close all;
clc;


%% Load the base clustering pool.
% We have generated a pool of 200 candidate base clusterings for each dataset. 
% Please uncomment the dataset that you want to use and comment the other ones.

dataName = 'MF';
% dataName = 'IS';
% dataName = 'MNIST';
% dataName = 'ODR';
% dataName = 'LS';
% dataName = 'PD';
% dataName = 'USPS';
% dataName = 'FC';
% dataName = 'KDD99_10P';
% dataName = 'KDD99';

members = [];
gt = [];
load(fullfile('..','data',['bc_pool_',dataName,'.mat']),'members','gt');

[N, poolSize] = size(members);
trueK = numel(unique(gt));

%% Settings
% Ensemble size M
M = 10;
% How many times the PTA and PTGP algorithms will be run.
cntTimes = 20; 
% You can set cntTimes to a greater (or smaller) integer if you want to run
% the algorithms more (or less) times.

% For each run, M base clusterings will be randomly drawn from the pool.
% Each row in bcIdx corresponds to an ensemble of M base clusterings.
bcIdx = zeros(cntTimes, M);
for i = 1:cntTimes
    tmp = randperm(poolSize);
    bcIdx(i,:) = tmp(1:M);
end

%% Run PTA and PTGP repeatedly.
% Test different numbers of clusters.
clsNums = [2:20, 25:5:50];
clsNums = unique([clsNums,trueK]);

% In general, you can also simply set the number of clusters to the true number of classses.
% clsNums = trueK;

% Scores
outDir = fullfile('..','results');
mkdir(outDir);
nmiScoresBestK_PTA = zeros(cntTimes, 3);
nmiScoresTrueK_PTA = zeros(cntTimes, 3);
nmiScoresBestK_PTGP = zeros(cntTimes, 1);
nmiScoresTrueK_PTGP = zeros(cntTimes, 1);
for runIdx = 1:cntTimes
    disp('**************************************************************');
    disp(['Run ', num2str(runIdx),':']);
    disp('**************************************************************');
    
    %% Construct the ensemble of M base clusterings
    % baseCls is an N x M matrix, each row being a base clustering.
    baseCls = members(:,bcIdx(runIdx,:));
    
    %% Produce microclusters
    disp('Produce microclusters ... ');
    tic; [mcBaseCls, mcLabels] = computeMicroclusters(baseCls); toc;
    tilde_N = size(mcBaseCls,1);
    disp('--------------------------------------------------------------');
    
    %% Compute the microcluster based co-association matrix.
    disp('Compute the MCA matrix ... ');
    tic; MCA = computeMCA(mcBaseCls); toc;
    disp('--------------------------------------------------------------');
    
    %% Set parameters K and T.
    para.K = floor(sqrt(tilde_N)/2);
    para.T = floor(sqrt(tilde_N)/2);
    
    %% Compute PTS
    disp('Compute PTS ... ');
    tic; PTS = computePTS_fast(MCA,mcLabels,para); toc;
    disp('--------------------------------------------------------------');
    
    %% Perform PTA
    disp('Run the PTA algorithm ... '); 
    [mcResultsAL,mcResultsCL,mcResultsSL] = runPTA(PTS, clsNums);
    % The i-th column in results_al\results_cl\results_sl represents the
    % consensus clustering with clsNums(i) clusters by PTA-AL\CL\SL.
    disp('--------------------------------------------------------------');
    
    %% Perform PTGP 
    disp('Run the PTGP algorithm ... '); 
    mcResultsPTGP = runPTGP(mcBaseCls, PTS, clsNums);     
    disp('--------------------------------------------------------------'); 

    %% Display the clustering results.
    disp('Map microclusters back to objects ... '); tic;
    resultsAL = mapMicroclustersBackToObjects(mcResultsAL, mcLabels);
    resultsCL = mapMicroclustersBackToObjects(mcResultsCL, mcLabels);
    resultsSL = mapMicroclustersBackToObjects(mcResultsSL, mcLabels);
    resultsPTGP = mapMicroclustersBackToObjects(mcResultsPTGP, mcLabels);toc;
    disp('--------------------------------------------------------------');
    
    disp('##############################################################'); 
    scoresAL = computeNMI(resultsAL,gt);
    scoresCL = computeNMI(resultsCL,gt);
    scoresSL = computeNMI(resultsSL,gt);
    scoresPTGP = computeNMI(resultsPTGP,gt);
    
    trueKidx = find(clsNums==trueK);
    nmiScoresBestK_PTA(runIdx,:) = [max(scoresAL),max(scoresCL),max(scoresSL)];
    nmiScoresTrueK_PTA(runIdx,:) = [scoresAL(trueKidx),scoresCL(trueKidx),scoresSL(trueKidx)];
    
    nmiScoresBestK_PTGP(runIdx) = max(scoresPTGP);
    nmiScoresTrueK_PTGP(runIdx) = scoresPTGP(trueKidx);
    
    disp(['The Scores at Run ',num2str(runIdx)]);
    disp('    ---------- The NMI scores w.r.t. best-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresBestK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresBestK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresBestK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresBestK_PTGP(runIdx))]);
    disp('    ---------- The NMI scores w.r.t. true-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresTrueK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresTrueK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresTrueK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresTrueK_PTGP(runIdx))]);
    disp('##############################################################'); 

    %% Save results
    save(fullfile(outDir,['results_',dataName,'_M',num2str(M),'_',num2str(cntTimes),'runs.mat']),'bcIdx','nmiScoresBestK_PTA','nmiScoresTrueK_PTA','nmiScoresBestK_PTGP','nmiScoresTrueK_PTGP');  
end

disp('**************************************************************');
disp(['   ** Average Performance over ',num2str(cntTimes),' runs on the ',dataName,' dataset **']);
disp(['Data size:     ', num2str(N)]);
disp(['Ensemble size: ', num2str(M)]);
disp('   ---------- Average NMI scores w.r.t. best-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresBestK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresBestK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresBestK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresBestK_PTGP))]);
disp('   ---------- Average NMI scores w.r.t. true-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresTrueK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresTrueK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresTrueK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresTrueK_PTGP))]);
disp('**************************************************************');
disp('**************************************************************');
 

3 文献来源

部分理论来源于网络,如有侵权请联系删除。

[1]Dong Huang, Jian-Huang Lai, Chang-Dong Wang. Robust Ensemble Clustering Using Probability Trajectories. Ransactions on Knowledge and Data Engineering, 2016, vol.28, no.5, pp.1312-1326. 

4 Matlab代码实现

你可能感兴趣的:(鲁棒优化,matlab,聚类,开发语言)