Although many successful ensemble clustering approaches have been developed in recent years, there are still two limitations to most of the existing approaches. First, they mostly overlook the issue of uncertain links, which may mislead the overall consensus process. Second, they generally lack the ability to incorporate global information to refine the local links. To address these two limitations, in this paper, we propose a novel ensemble clustering approach based on sparse graph representation and probability trajectory analysis. In particular, we present the elite neighbor selection strategy to identify the uncertain links by locally adaptive thresholds and build a sparse graph with a small number of probably reliable links. We argue that a small number of probably reliable links can lead to significantly better consensus results than using all graph links regardless of their reliability. The random walk process driven by a new transition probability matrix is utilized to explore the global information in the graph. We derive a novel and dense similarity measure from the sparse graph by analyzing the probability trajectories of the random walkers, based on which two consensus functions are further proposed. Experimental results on multiple real-world datasets demonstrate the effectiveness and efficiency of our approach.
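The pipeline described above (sparsify the similarity graph, run a random walk on it, then compare probability trajectories) can be sketched in a few lines of MATLAB. This is a simplified illustration, not the paper's implementation: the toy matrix `S`, the plain mutual top-K neighbor rule, the row-normalized transition matrix, and the cosine similarity are all assumptions chosen for brevity. The paper defines its own elite neighbor thresholds and transition probabilities.

```matlab
% Toy similarity matrix among 6 nodes (assumed values, two loose groups).
S = [0   .9  .8  .1  0   0;
     .9  0   .7  0   .1  0;
     .8  .7  0   .2  0   0;
     .1  0   .2  0   .9  .8;
     0   .1  0   .9  0   .7;
     0   0   0   .8  .7  0];
n = size(S, 1);
K = 2;   % neighbors kept per node (assumed)
T = 3;   % trajectory length in steps (assumed)

% 1) Sparsify: keep a link only if each endpoint is among the other's
%    K strongest neighbors (a simple stand-in for elite neighbor selection).
keep = false(n);
for i = 1:n
    [~, order] = sort(S(i, :), 'descend');
    keep(i, order(1:K)) = true;
end
keep = keep & keep';
W = S .* keep;

% 2) Random walk: row-normalize the sparse weights into a transition matrix.
P = bsxfun(@rdivide, W, sum(W, 2));

% 3) Probability trajectory of node i: its t-step distributions for
%    t = 1..T, concatenated into one row vector of length n*T.
traj = zeros(n, n * T);
Pt = eye(n);
for t = 1:T
    Pt = Pt * P;
    traj(:, (t-1)*n + (1:n)) = Pt;
end

% 4) Dense similarity: cosine similarity between trajectories.
nrm = sqrt(sum(traj .^ 2, 2));
PTS = (traj * traj') ./ (nrm * nrm');
disp(PTS);
```

In this toy case the weak cross-group links (weights 0.1 and 0.2) are dropped, so nodes from different groups get a similarity of 0, while every pair within a group gets a positive similarity even though the sparse graph kept only six links.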

2 Results

[Figure 1: Robust ensemble clustering using probability trajectories (MATLAB implementation)]


clear all;
close all;

%% Load the base clustering pool.
% We have generated a pool of 200 candidate base clusterings for each dataset. 
% Please uncomment the dataset that you want to use and comment the other ones.

dataName = 'MF';
% dataName = 'IS';
% dataName = 'MNIST';
% dataName = 'ODR';
% dataName = 'LS';
% dataName = 'PD';
% dataName = 'USPS';
% dataName = 'FC';
% dataName = 'KDD99_10P';
% dataName = 'KDD99';

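members = [];
gt = [];
% Load 'members' (N x poolSize label matrix) and 'gt' (ground-truth labels).
% ASSUMPTION: the file name pattern below is a guess; adjust it to match
% wherever the pool for dataName is actually stored.
load(['bc_pool_', dataName, '.mat'], 'members', 'gt');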

[N, poolSize] = size(members);
trueK = numel(unique(gt));

%% Settings
% Ensemble size M
M = 10;
% How many times the PTA and PTGP algorithms will be run.
cntTimes = 20; 
% You can set cntTimes to a greater (or smaller) integer if you want to run
% the algorithms more (or less) times.

% For each run, M base clusterings will be randomly drawn from the pool.
% Each row in bcIdx corresponds to an ensemble of M base clusterings.
bcIdx = zeros(cntTimes, M);
for i = 1:cntTimes
    tmp = randperm(poolSize);
    bcIdx(i,:) = tmp(1:M);
end

%% Run PTA and PTGP repeatedly.
% Test different numbers of clusters.
clsNums = [2:20, 25:5:50];
clsNums = unique([clsNums,trueK]);

% In general, you can also simply set the number of clusters to the true number of classes.
% clsNums = trueK;

% Scores
outDir = fullfile('..','results');
nmiScoresBestK_PTA = zeros(cntTimes, 3);
nmiScoresTrueK_PTA = zeros(cntTimes, 3);
nmiScoresBestK_PTGP = zeros(cntTimes, 1);
nmiScoresTrueK_PTGP = zeros(cntTimes, 1);
for runIdx = 1:cntTimes
    disp(['Run ', num2str(runIdx),':']);
    %% Construct the ensemble of M base clusterings
    % baseCls is an N x M matrix, each column being a base clustering.
    baseCls = members(:,bcIdx(runIdx,:));
    %% Produce microclusters
    disp('Produce microclusters ... ');
    tic; [mcBaseCls, mcLabels] = computeMicroclusters(baseCls); toc;
    tilde_N = size(mcBaseCls,1);
    %% Compute the microcluster based co-association matrix.
    disp('Compute the MCA matrix ... ');
    tic; MCA = computeMCA(mcBaseCls); toc;
    %% Set parameters K and T.
    para.K = floor(sqrt(tilde_N)/2);
    para.T = floor(sqrt(tilde_N)/2);
    %% Compute PTS
    disp('Compute PTS ... ');
    tic; PTS = computePTS_fast(MCA,mcLabels,para); toc;
    %% Perform PTA
    disp('Run the PTA algorithm ... '); 
    [mcResultsAL,mcResultsCL,mcResultsSL] = runPTA(PTS, clsNums);
    % The i-th column in mcResultsAL/mcResultsCL/mcResultsSL represents the
    % consensus clustering with clsNums(i) clusters by PTA-AL/CL/SL.
    %% Perform PTGP 
    disp('Run the PTGP algorithm ... '); 
    mcResultsPTGP = runPTGP(mcBaseCls, PTS, clsNums);     

    %% Display the clustering results.
    disp('Map microclusters back to objects ... '); tic;
    resultsAL = mapMicroclustersBackToObjects(mcResultsAL, mcLabels);
    resultsCL = mapMicroclustersBackToObjects(mcResultsCL, mcLabels);
    resultsSL = mapMicroclustersBackToObjects(mcResultsSL, mcLabels);
    resultsPTGP = mapMicroclustersBackToObjects(mcResultsPTGP, mcLabels);toc;
    scoresAL = computeNMI(resultsAL,gt);
    scoresCL = computeNMI(resultsCL,gt);
    scoresSL = computeNMI(resultsSL,gt);
    scoresPTGP = computeNMI(resultsPTGP,gt);
    trueKidx = find(clsNums==trueK);
    nmiScoresBestK_PTA(runIdx,:) = [max(scoresAL),max(scoresCL),max(scoresSL)];
    nmiScoresTrueK_PTA(runIdx,:) = [scoresAL(trueKidx),scoresCL(trueKidx),scoresSL(trueKidx)];
    nmiScoresBestK_PTGP(runIdx) = max(scoresPTGP);
    nmiScoresTrueK_PTGP(runIdx) = scoresPTGP(trueKidx);
    disp(['The Scores at Run ',num2str(runIdx)]);
    disp('    ---------- The NMI scores w.r.t. best-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresBestK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresBestK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresBestK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresBestK_PTGP(runIdx))]);
    disp('    ---------- The NMI scores w.r.t. true-k: ----------    ');
    disp(['PTA-AL : ',num2str(nmiScoresTrueK_PTA(runIdx,1))]);
    disp(['PTA-CL : ',num2str(nmiScoresTrueK_PTA(runIdx,2))]);
    disp(['PTA-SL : ',num2str(nmiScoresTrueK_PTA(runIdx,3))]);
    disp(['PTGP   : ',num2str(nmiScoresTrueK_PTGP(runIdx))]);

    %% Save results
    % Results would be written under outDir here; that step is not shown.
end

disp(['   ** Average Performance over ',num2str(cntTimes),' runs on the ',dataName,' dataset **']);
disp(['Data size:     ', num2str(N)]);
disp(['Ensemble size: ', num2str(M)]);
disp('   ---------- Average NMI scores w.r.t. best-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresBestK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresBestK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresBestK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresBestK_PTGP))]);
disp('   ---------- Average NMI scores w.r.t. true-k: ----------   ');
disp(['PTA-AL : ',num2str(mean(nmiScoresTrueK_PTA(:,1)))]);
disp(['PTA-CL : ',num2str(mean(nmiScoresTrueK_PTA(:,2)))]);
disp(['PTA-SL : ',num2str(mean(nmiScoresTrueK_PTA(:,3)))]);
disp(['PTGP   : ',num2str(mean(nmiScoresTrueK_PTGP))]);
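
The scores printed above are normalized mutual information (NMI) values between each consensus clustering and the ground truth. `computeNMI` itself is not shown here, so the following is only a minimal sketch of one common NMI definition, I(A;B) / sqrt(H(A) H(B)), on two toy label vectors. The variable names, the toy data, and the choice of normalization are assumptions and may differ from the actual helper.

```matlab
% Two toy label vectors over the same 6 objects (assumed example data).
a = [1 1 1 2 2 2]';
b = [1 1 2 2 3 3]';
n = numel(a);

% Map labels to 1..k and build the contingency table of joint counts.
[~, ~, ia] = unique(a);
[~, ~, ib] = unique(b);
C = accumarray([ia, ib], 1);

Pab = C / n;          % joint distribution over (label in a, label in b)
Pa = sum(Pab, 2);     % marginal of a (column vector)
Pb = sum(Pab, 1);     % marginal of b (row vector)

% Mutual information, skipping empty cells to avoid log(0).
PaPb = Pa * Pb;
nz = Pab > 0;
MI = sum(Pab(nz) .* log(Pab(nz) ./ PaPb(nz)));

% Entropies of the two labelings.
Ha = -sum(Pa .* log(Pa));
Hb = -sum(Pb .* log(Pb));

nmi = MI / sqrt(Ha * Hb);
disp(nmi);
```

With this normalization, identical labelings score 1 and statistically independent labelings score 0, which is why the script can compare results obtained with different numbers of clusters on the same scale.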

[1] Dong Huang, Jian-Huang Lai, Chang-Dong Wang. Robust Ensemble Clustering Using Probability Trajectories. IEEE Transactions on Knowledge and Data Engineering, 2016, vol. 28, no. 5, pp. 1312-1326.

