The k-means clustering algorithm in MATLAB

function varargout = kmeans(X, k, varargin)
%KMEANS K-means clustering.
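% IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix X into
% K clusters. The partition minimizes the sum, over all clusters, of the
% within-cluster sums of point-to-cluster-centroid distances.
% Rows of X correspond to points, columns correspond to variables.
% Note: when X is a vector, KMEANS ignores its orientation and treats it as
% an N-by-1 data matrix.
% KMEANS returns an N-by-1 vector IDX containing the cluster index of each
% point. By default, KMEANS uses squared Euclidean distances.
% KMEANS treats NaNs as missing data and ignores any rows of X that contain NaNs.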
%
%
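% [IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in the
% K-by-P matrix C.
%
% [IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
% point-to-centroid distances in the K-by-1 vector SUMD.
%
% [IDX, C, SUMD, D] = KMEANS(X, K) returns the distances from each point to
% every centroid in the N-by-K matrix D.
%
% [ ... ] = KMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies optional
% parameter name/value pairs to control the iterative algorithm.
% Parameters are: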
%
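% 'Distance' - Distance measure, in P-dimensional space, that KMEANS
%              should minimize with respect to.
%   Choices are:
%     'sqeuclidean' - Squared Euclidean distance (the default).
%     'cityblock'   - Manhattan distance: the sum of absolute differences
%                     over all dimensions.
%     'cosine'      - One minus the cosine of the included angle between
%                     points (treated as vectors).
%     'correlation' - One minus the sample correlation between points
%                     (treated as sequences of values).
%     'hamming'     - Hamming distance: the percentage of bits that differ
%                     (only suitable for binary data).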
%
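% 'Start' - Method used to choose the initial cluster centroid positions,
%           sometimes known as "seeds".
%   Choices are:
%     'plus'    - The default. Select K observations from X with the
%                 k-means++ algorithm: the first centroid is chosen uniformly
%                 at random from X; each subsequent centroid is drawn at
%                 random from the remaining points with probability
%                 proportional to the point's distance from the closest
%                 centroid already chosen.
%     'sample'  - Select K observations from X at random.
%     'uniform' - Select K points uniformly at random from the range of X.
%                 Not valid for Hamming distance.
%     'cluster' - Perform a preliminary clustering phase on a random 10%
%                 subsample of X. This preliminary phase is itself
%                 initialized using 'sample'.
%     matrix    - A K-by-P matrix of starting locations. In this case you
%                 can pass [] for K, and KMEANS infers K from the first
%                 dimension of the matrix. You can also supply a 3D array,
%                 whose third dimension implies the value of 'Replicates'.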
%
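% 'Replicates' - Number of times to repeat the clustering, each time with a
%                new set of initial centroids. Default is 1.
%
% 'EmptyAction' - Action to take if a cluster loses all of its members.
%   Choices are:
%     'singleton' - The default. Create a new cluster consisting of the one
%                   observation furthest from its centroid.
%     'error'     - Treat an empty cluster as an error.
%     'drop'      - Remove any clusters that become empty, and set the
%                   corresponding values in C and D to NaN.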
%
%
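% 'Options' - Options for the iterative algorithm used to minimize the
%             fitting criterion, as created by STATSET.
%   Choices of STATSET parameters are:
%
%     'Display' - Level of display output. Choices are 'off' (the default),
%                 'iter', and 'final'.
%     'MaxIter' - Maximum number of iterations allowed. Default is 100.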
%
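%     'UseParallel'   - If TRUE and the required conditions are met, compute
%                       in parallel; otherwise compute serially. Default is
%                       serial computation.
%     'UseSubstreams' - Not used by default.
%     'Streams'       - Together, these fields specify whether to perform
%                       clustering from multiple 'Start' values in parallel,
%                       and how to use random numbers when generating the
%                       starting points. For details see PARALLELSTATS.
%     Note: if 'UseParallel' is TRUE and 'UseSubstreams' is FALSE, then the
%     length of 'Streams' must equal the number of workers used by KMEANS.
%     If a parallel pool is already open, this is the size of the pool. If
%     no pool is open, MATLAB may open one automatically (depending on your
%     installation and preferences). For the most predictable results,
%     create the parallel pool explicitly with the PARPOOL command before
%     calling KMEANS with 'UseParallel' set to TRUE.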
%
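% 'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
%                 update" phase in addition to a "batch update" phase. The
%                 on-line phase can be time consuming for large data sets.
%                 Default is 'off'.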
%
% Example:
%
%       X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
%       opts = statset('Display','final');
%       [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
%                             'Replicates',5, 'Options',opts);
%       plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
%            X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
%
% See also LINKAGE, CLUSTERDATA, SILHOUETTE.

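% KMEANS uses a two-phase iterative algorithm to minimize the sum of
% point-to-centroid distances, summed over all K clusters.
% The first phase uses what the literature often describes as "batch"
% updates: each iteration reassigns all points to their nearest cluster
% centroid at once, and then recalculates the cluster centroids. This phase
% occasionally (especially for small data sets) does not converge to a
% local minimum, that is, a partition in which moving any single point to a
% different cluster increases the total sum of distances. The batch phase
% can therefore be thought of as providing a fast but possibly only
% approximate solution as a starting point for the second phase.
% The second phase uses what the literature often describes as "on-line"
% updates: points are reassigned individually if doing so reduces the sum
% of distances, and the cluster centroids are recomputed after each
% reassignment. Each iteration of the second phase is one pass through all
% the points. The on-line phase converges to a local minimum, although
% there may be other local minima with a lower total sum of distances.
% In general, the global minimum can only be found by an exhaustive (or
% lucky) choice of starting points, but running several replicates with
% random starting points typically yields a solution that is a global
% minimum.
%
% References: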
%
% [1] Seber, G.A.F. (1984) Multivariate Observations, Wiley, New York.
% [2] Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN
% Programs, Examples, translated by J. Goldschmidt, Halsted Press,
% New York.

% Require at least two input arguments
if nargin < 2
    error(message('stats:kmeans:TooFewInputs'));
end
% X must be real-valued
if ~isreal(X)
    error(message('stats:kmeans:ComplexData'));
end
% Find rows that contain NaNs; if there are any, remove them from X
wasnan = any(isnan(X),2);
hadNaNs = any(wasnan);
if hadNaNs
    warning(message('stats:kmeans:MissingDataRemoved'));
    X = X(~wasnan,:);
end

% Get the dimensions of X
[n, p] = size(X);
% Parameter names and their default values
pnames = {'distance' 'start' 'replicates' 'emptyaction' 'onlinephase' 'options' 'maxiter' 'display'};
dflts = {'sqeuclidean' 'plus' [] 'singleton' 'off' [] [] []};
[distance,start,reps,emptyact,online,options,maxit,display] ...
    = internal.stats.parseArgs(pnames, dflts, varargin{:});

distNames = {'sqeuclidean','cityblock','cosine','correlation','hamming'};
distance = internal.stats.getParamVal(distance,distNames,'''Distance''');

switch distance
    case 'cosine'
        Xnorm = sqrt(sum(X.^2, 2)); % Euclidean norm of each row
        if any(min(Xnorm) <= eps(max(Xnorm)))
            error(message('stats:kmeans:ZeroDataForCos'));
        end
        X = bsxfun(@rdivide,X,Xnorm); % normalize each row to unit length
    case 'correlation'
        X = bsxfun(@minus, X, mean(X,2)); % center each row
        Xnorm = sqrt(sum(X.^2, 2));
        if any(min(Xnorm) <= eps(max(Xnorm)))
            error(message('stats:kmeans:ConstantDataForCorr'));
        end
        X = bsxfun(@rdivide,X,Xnorm);
    case 'hamming'
        if ~all( X(:) ==0 | X(:)==1)
            error(message('stats:kmeans:NonbinaryDataForHamm'));
        end
end

Xmins = [];
Xmaxs = [];
CC = [];
if ischar(start)
    startNames = {'uniform','sample','cluster','plus','kmeans++'};
    j = find(strncmpi(start,startNames,length(start)));
    if length(j) > 1
        error(message('stats:kmeans:AmbiguousStart', start));
    elseif isempty(j)
        error(message('stats:kmeans:UnknownStart', start));
    elseif isempty(k)
        error(message('stats:kmeans:MissingK'));
    end
    start = startNames{j};
    if strcmp(start, 'uniform')
        if strcmp(distance, 'hamming')
            error(message('stats:kmeans:UniformStartForHamm'));
        end
        Xmins = min(X,[],1); % minimum of each column
        Xmaxs = max(X,[],1); % maximum of each column
    end
elseif isnumeric(start) % explicit numeric starting centroids
    CC = start;
    start = 'numeric';
    if isempty(k)
        k = size(CC,1); % K is empty: infer it from the rows of the starting centroids
    elseif k ~= size(CC,1) % the number of rows must equal K
        error(message('stats:kmeans:StartBadRowSize'));
    elseif size(CC,2) ~= p % the number of columns must equal P
        error(message('stats:kmeans:StartBadColumnSize'));
    end
    if isempty(reps)
        reps = size(CC,3); % 'Replicates' is empty: infer it from the third dimension
    elseif reps ~= size(CC,3)
        error(message('stats:kmeans:StartBadThirdDimSize'));
    end

    % Need to center explicit starting points for 'correlation'. (Re)normalization
    % for 'cosine'/'correlation' is done at each iteration.
    if isequal(distance, 'correlation')
        CC = bsxfun(@minus, CC, mean(CC,2)); % center each starting centroid
    end

else
    error(message('stats:kmeans:InvalidStart'));
end

emptyactNames = {'error','drop','singleton'};
emptyact = internal.stats.getParamVal(emptyact,emptyactNames,'''EmptyAction''');

[~,online] = internal.stats.getParamVal(online,{'on','off'},'''OnlinePhase''');
online = (online==1);

% 'maxiter' and 'display' are grandfathered as separate param name/value pairs
if ~isempty(display)
    options = statset(options,'Display',display);
end
if ~isempty(maxit)
    options = statset(options,'MaxIter',maxit);
end

options = statset(statset('kmeans'), options);
display = find(strncmpi(options.Display, {'off','notify','final','iter'},...
    length(options.Display))) - 1;
maxit = options.MaxIter; % maximum number of iterations

if ~(isscalar(k) && isnumeric(k) && isreal(k) && k > 0 && (round(k)==k))
    error(message('stats:kmeans:InvalidK'));
    % elseif k == 1
    % this special case works automatically
elseif n < k
    error(message('stats:kmeans:TooManyClusters'));
end

% Assume one replicate; otherwise validate the number of replicates
if isempty(reps)
    reps = 1;
elseif ~internal.stats.isScalarInt(reps,1)
    error(message('stats:kmeans:BadReps'));
end

[useParallel, RNGscheme, poolsz] = ...
    internal.stats.parallel.processParallelAndStreamOptions(options,true);

usePool = useParallel && poolsz>0; % whether a parallel pool is in use

% Prepare for in-progress
if display > 1 % 'iter' or 'final'
    if usePool
        % If we are running on a parallel pool, each worker will generate
        % a separate periodic report. Before starting the loop, we
        % seed the parallel pool so that each worker will have an
        % identifying label (eg, index) for its report.
        internal.stats.parallel.distributeToPool( ...
            'workerID', num2cell(1:poolsz) );

        % Periodic reports behave differently in parallel than they do
        % in serial computation (which is the baseline).
        % We advise the user of the difference.

        if display == 3 % 'iter' only
            warning(message('stats:kmeans:displayParallel2'));
            fprintf('    worker\t  iter\t phase\t     num\t         sum\n' );
        end
    else
        if useParallel
            warning(message('stats:kmeans:displayParallel'));
        end
        if display == 3 % 'iter' only
            fprintf('  iter\t phase\t     num\t         sum\n');
        end
    end

end

if issparse(X) || ~isfloat(X) || strcmp(distance,'cityblock') || ...
        strcmp(distance,'hamming')
    [varargout{1:nargout}] = kmeans2(X,k, distance, emptyact,reps,start,...
        Xmins,Xmaxs,CC,online,display, maxit,useParallel, RNGscheme,usePool,...
        wasnan,hadNaNs,varargin{:});
    return;
end

emptyErrCnt = 0;

% Define the function that will perform one iteration of the
% loop inside smartFor
loopbody = @loopBody;

% Initialize nested variables so they will not appear to be functions here
totsumD = 0;
iter = 0;

% Transpose the data so that each column is one point
X = X';
Xmins = Xmins';
Xmaxs = Xmaxs';

% Perform KMEANS replicates on separate workers.
ClusterBest = internal.stats.parallel.smartForReduce(...
    reps, loopbody, useParallel, RNGscheme, 'argmin');

% Extract the best solution
varargout{1} = ClusterBest{5};  % cluster indices (idx) of the best solution
varargout{2} = ClusterBest{6}'; % cluster centroids (C) of the best solution
varargout{3} = ClusterBest{3};  % within-cluster sums of distances (sumD)
totsumDbest = ClusterBest{1};   % total of all within-cluster sums of distances

if nargout > 3
    varargout{4} = ClusterBest{7}; % distances from each point to every centroid (D)
end

if display > 1 % 'final' or 'iter'
    fprintf('%s\n',getString(message('stats:kmeans:FinalSumOfDistances',sprintf('%g',totsumDbest))));
end

if hadNaNs
    varargout{1} = statinsertnan(wasnan, varargout{1}); % idxbest
    if nargout > 3
        varargout{4} = statinsertnan(wasnan, varargout{4}); % Dbest
    end
end

function cellout = loopBody(rep,S) % body of one replicate
    
    if isempty(S)
        S = RandStream.getGlobalStream;
    end
    
    if display > 1 % 'iter'
        if usePool
            dispfmt = '%8d\t%6d\t%6d\t%8d\t%12g\n';
            labindx = internal.stats.parallel.workerGetValue('workerID');
        else
            dispfmt = '%6d\t%6d\t%8d\t%12g\n';
        end
    end

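    % Output cell array for this replicate
    cellout = cell(7,1);  % cellout{1} total sum of distances
                          % cellout{2} replicate number
                          % cellout{3} within-cluster sums of distances
                          % cellout{4} number of iterations
                          % cellout{5} cluster indices
                          % cellout{6} cluster centroids
                          % cellout{7} point-to-centroid distances
    
    % Populating total sum of distances to Inf. This is used in the
    % reduce operation if update fails due to empty cluster.
    cellout{1} = Inf;
    cellout{2} = rep;

    % Initialize the cluster centroids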
    switch start
        case 'uniform'
            %C = Xmins(:,ones(1,k)) + rand(S,[p,k]).*(Xmaxs(:,ones(1,k))-Xmins(:,ones(1,k)));
            C = Xmins(:,ones(1,k)) + rand(S,[k,p])'.*(Xmaxs(:,ones(1,k))-Xmins(:,ones(1,k)));
            % For 'cosine' and 'correlation', these are uniform inside a subset
            % of the unit hypersphere. Still need to center them for
            % 'correlation'. (Re)normalization for 'cosine'/'correlation' is
            % done at each iteration.

            if isequal(distance, 'correlation')
                C = bsxfun(@minus, C, mean(C,1));
            end
            if isa(X,'single')
                C = single(C);
            end
        case 'sample'
            C = X(:,randsample(S,n,k));
        case 'cluster'
            Xsubset = X(:,randsample(S,n,floor(.1*n)));
            % Turn display off for the initialization
            optIndex = find(strcmpi('options',varargin));
            if isempty(optIndex)
                opts = statset('Display','off');
                varargin = [varargin,'options',opts];
            else
                varargin{optIndex+1}.Display = 'off';
            end
            [~, C] = kmeans(Xsubset', k, varargin{:}, 'start','sample', 'replicates',1);
            C = C';
        case 'numeric'
            C = CC(:,:,rep)';
            if isa(X,'single')
                C = single(C);
            end
        case {'plus','kmeans++'}
            % Select the first seed by sampling uniformly at random
            index = zeros(1,k);
            [C(:,1), index(1)] = datasample(S,X,1,2);
            minDist = inf(n,1);
       
            % Select the rest of the seeds by a probabilistic model
           for ii = 2:k                    
                minDist = min(minDist,distfun(X,C(:,ii-1),distance));
                denominator = sum(minDist);
                if denominator==0 || isinf(denominator) || isnan(denominator)
                    C(:,ii:k) = datasample(S,X,k-ii+1,2,'Replace',false);
                    break;
                end
                sampleProbability = minDist/denominator;
                [C(:,ii), index(ii)] = datasample(S,X,1,2,'Replace',false,...
                    'Weights',sampleProbability);        
            end
    end
    if ~isfloat(C)      % X may be logical
        C = double(C);
    end
    
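    % Compute the distance from every point to each cluster centroid and the
    % initial assignment of points to clusters
    D = distfun(X, C, distance, 0, rep, reps);
    [d, idx] = min(D, [], 2); % assign each point to its nearest centroid
    m = accumarray(idx,1,[k,1])'; % number of points in each cluster
    
    try % catch empty cluster errors and move on to next rep
        
        % Begin phase one: batch reassignments
        converged = batchUpdate();
        
        % Begin phase two: single reassignments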
        if online
            converged = onlineUpdate();
        end
        
        
        if display == 2 % 'final'
            fprintf('%s\n',getString(message('stats:kmeans:IterationsSumOfDistances',rep,iter,sprintf('%g',totsumD) )));
        end
        
        if ~converged
            if reps==1
                warning(message('stats:kmeans:FailedToConverge', maxit));
            else
                warning(message('stats:kmeans:FailedToConvergeRep', maxit, rep));
            end
        end
        
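        % Calculate cluster-wise sums of distances
        nonempties = find(m>0); % indices of the non-empty clusters
        D(:,nonempties) = distfun(X, C(:,nonempties), distance, iter, rep, reps);
        d = D((idx-1)*n + (1:n)');
        sumD = accumarray(idx,d,[k,1]); % within-cluster sums of distances
        totsumD = sum(sumD(nonempties)); % total over all non-empty clusters
        
        % Save this replicate's solution
        cellout = {totsumD,rep,sumD,iter,idx,C,D}';
       
        % If an empty cluster error occurred in one of multiple replicates, catch
        % it, warn, and move on to next replicate. Error only when all replicates
        % fail. Rethrow any other kind of error.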
    catch ME
        if reps == 1 || (~isequal(ME.identifier,'stats:kmeans:EmptyCluster')  && ...
                     ~isequal(ME.identifier,'stats:kmeans:EmptyClusterRep'))
            rethrow(ME);
        else
            emptyErrCnt = emptyErrCnt + 1;
            warning(message('stats:kmeans:EmptyClusterInBatchUpdate', rep, iter));
            if emptyErrCnt == reps
                error(message('stats:kmeans:EmptyClusterAllReps'));
            end
        end
    end % catch
    
    %------------------------------------------------------------------
    
    function converged = batchUpdate()
        
        % Initially every point counts as moved and every cluster needs an update
        moved = 1:n;
        changed = 1:k;
        previdx = zeros(n,1);
        prevtotsumD = Inf;
        
        %
        % Begin phase one: batch reassignments
        %
        
        iter = 0;
        converged = false;
        while true
            iter = iter + 1;
            
            % Calculate the new cluster centroids and counts, and update the
            % distance from every point to those new cluster centroids
            [C(:,changed), m(changed)] = gcentroids(X, idx, changed, distance);
            D(:,changed) = distfun(X, C(:,changed), distance, iter, rep, reps);
            
            % Deal with clusters that have just lost all their members
            empties = changed(m(changed) == 0);
            if ~isempty(empties)
                if strcmp(emptyact,'error')
                    if reps==1
                        error(message('stats:kmeans:EmptyCluster', iter));
                    else
                        error(message('stats:kmeans:EmptyClusterRep', iter, rep));
                    end
                end
                switch emptyact
                    case 'drop'
                        if reps==1
                            warning(message('stats:kmeans:EmptyCluster', iter));
                        else
                            warning(message('stats:kmeans:EmptyClusterRep', iter, rep));
                        end
                        % Remove the empty cluster from any further processing
                        D(:,empties) = NaN;
                        changed = changed(m(changed) > 0);
                    case 'singleton'
                        for i = empties
                            d = D((idx-1)*n + (1:n)'); % use newly updated distances
                            
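                            % Find the point furthest away from its current cluster
                            % and use it to create a new singleton cluster that
                            % replaces the empty one.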
                            [~, lonely] = max(d);
                            from = idx(lonely); % taking from this cluster
                            if m(from) < 2
                                % In the very unusual event that the cluster had only
                                % one member, pick any other non-singleton point.
                                from = find(m>1,1,'first');
                                lonely = find(idx==from,1,'first');
                            end
                            C(:,i) = X(:,lonely);
                            m(i) = 1;
                            idx(lonely) = i;
                            D(:,i) = distfun(X, C(:,i), distance, iter, rep, reps);
                            
                            % Update clusters from which points are taken
                            [C(:,from), m(from)] = gcentroids(X, idx, from, distance);
                            D(:,from) = distfun(X, C(:,from), distance, iter, rep, reps);
                            changed = unique([changed from]);
                        end
                end
            end
            
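            % Compute the total sum of distances for the current configuration.
            totsumD = sum(D((idx-1)*n + (1:n)'));
            % Test for a cycle: if the objective did not decrease, back out
            % the last step and move on to the single update phase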
            if prevtotsumD <= totsumD
                idx = previdx;
                [C(:,changed), m(changed)] = gcentroids(X, idx, changed, distance);
                iter = iter - 1;
                break;
            end
            if display > 2 % 'iter'
                if usePool
                    fprintf(dispfmt,labindx,iter,1,length(moved),totsumD);
                else
                    fprintf(dispfmt,iter,1,length(moved),totsumD);
                end
            end
            if iter >= maxit
                break;
            end
            
            % Determine the closest cluster for each point and reassign points to clusters
            previdx = idx;
            prevtotsumD = totsumD;
            [d, nidx] = min(D, [], 2);
            
            % Determine which points moved
            moved = find(nidx ~= previdx);
            if ~isempty(moved)
                % Resolve ties in favor of not moving
                moved = moved(D((previdx(moved)-1)*n + moved) > d(moved));
            end
            if isempty(moved)
                converged = true;
                break;
            end
            idx(moved) = nidx(moved);
            
            % Find clusters that gained or lost members
            changed = unique([idx(moved); previdx(moved)])';
            
        end % phase one
        
    end % nested function
    
    %------------------------------------------------------------------
    
    function converged = onlineUpdate()
                   
        %
        % Begin phase two: single reassignments
        %
        changed = find(m > 0);
        lastmoved = 0;
        nummoved = 0;
        iter1 = iter;
        converged = false;
        Del = NaN(n,k); % reassignment criterion
        while iter < maxit
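            % Calculate distances to each cluster from each point, and the
            % potential change in total sum of errors for adding or removing
            % each point from each cluster. Clusters that have not changed
            % membership need not be updated.
            %
            % Singleton clusters are a special case for the sum of distances
            % calculation. Removing their only point is never best, so the
            % reassignment criterion had better guarantee that a singleton
            % point will stay in its own cluster. Happily, we get
            % Del(i,idx(i)) == 0 automatically for them.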
            switch distance
                case 'sqeuclidean'
                    for i = changed
                        mbrs = (idx == i)';
                        sgn = 1 - 2*mbrs; % -1 for members, 1 for nonmembers
                        if m(i) == 1
                            sgn(mbrs) = 0; % prevent divide-by-zero for singleton mbrs
                        end
                      Del(:,i) = (m(i) ./ (m(i) + sgn)) .* sum((bsxfun(@minus, X, C(:,i))).^2, 1);
                    end
                  case {'cosine','correlation'}
                    % The points are normalized, centroids are not, so normalize them
                    normC = sqrt(sum(C.^2, 1));
                    if any(normC < eps(class(normC))) % small relative to unit-length data points
                        if reps==1
                            error(message('stats:kmeans:ZeroCentroid', iter));
                        else
                            error(message('stats:kmeans:ZeroCentroidRep', iter, rep));
                        end
                        
                    end
                    % This can be done without a loop, but the loop saves memory allocations
                    for i = changed
                        XCi =  C(:,i)'*X;
                        mbrs = (idx == i)';
                        sgn = 1 - 2*mbrs; % -1 for members, 1 for nonmembers
                        Del(:,i) = 1 + sgn .*...
                            (m(i).*normC(i) - sqrt((m(i).*normC(i)).^2 + 2.*sgn.*m(i).*XCi + 1));
                    end
            end
            
            % Determine the best possible move, if any, for each point. Next we
            % will pick one from those that actually did move.
            previdx = idx;
            prevtotsumD = totsumD;
            [minDel, nidx] = min(Del, [], 2);
            moved = find(previdx ~= nidx);
            moved(m(previdx(moved))==1)=[];
            if ~isempty(moved)
                % Resolve ties in favor of not moving
                moved = moved(Del((previdx(moved)-1)*n + moved) > minDel(moved));
            end
            if isempty(moved)
                % Count an iteration if phase 2 did nothing at all, or if we're
                % in the middle of a pass through all the points
                if (iter == iter1) || nummoved > 0
                    iter = iter + 1;
                    if display > 2 % 'iter'
                        if usePool
                            fprintf(dispfmt,labindx,iter,2,length(moved),totsumD);
                        else
                            fprintf(dispfmt,iter,2,length(moved),totsumD);
                        end
                    end
                end
                converged = true;
                break;
            end
            
            % Pick the next move in cyclic order
            moved = mod(min(mod(moved - lastmoved - 1, n) + lastmoved), n) + 1;
            
            % A pass through all the points is complete: count one iteration
            if moved <= lastmoved
                iter = iter + 1;
                if display > 2 % 'iter'
                    if usePool
                        fprintf(dispfmt,labindx,iter,2,length(moved),totsumD);
                    else
                        fprintf(dispfmt,iter,2,length(moved),totsumD);
                    end
                end
                if iter >= maxit, break; end
                nummoved = 0;
            end
            nummoved = nummoved + 1;
            lastmoved = moved;
            
            oidx = idx(moved);
            nidx = nidx(moved);
            totsumD = totsumD + Del(moved,nidx) - Del(moved,oidx);
            
            % Update the cluster index vector, and the old and new cluster
            % counts and centroids
            idx(moved) = nidx;
            m(nidx) = m(nidx) + 1;
            m(oidx) = m(oidx) - 1;
            switch distance
                case {'sqeuclidean','cosine','correlation'}
                    C(:,nidx) = C(:,nidx) + (X(:,moved) - C(:,nidx)) / m(nidx);
                    C(:,oidx) = C(:,oidx) - (X(:,moved) - C(:,oidx)) / m(oidx);
            end
            changed = sort([oidx nidx]);
        end % phase two
        
    end % nested function
    
end

end % main function

%------------------------------------------------------------------

function D = distfun(X, C, dist, iter,rep, reps)
%DISTFUN Calculate point-to-cluster-centroid distances.

switch dist
    case 'sqeuclidean'
        D = pdist2mex(X,C,'sqe',[],[],[]);
    case {'cosine','correlation'}
        % The points are normalized, centroids are not, so normalize them
        normC = sqrt(sum(C.^2, 1));
        if any(normC < eps(class(normC))) % small relative to unit-length data points
            if reps==1
                error(message('stats:kmeans:ZeroCentroid', iter));
            else
                error(message('stats:kmeans:ZeroCentroidRep', iter, rep));
            end

        end
        C = bsxfun(@rdivide,C,normC);
        D = pdist2mex(X,C,'cos',[],[],[]);

end
end % function

%------------------------------------------------------------------
function [centroids, counts] = gcentroids(X, index, clusts, dist)
%GCENTROIDS Centroids and counts stratified by group.
% Computes the member count and the centroid of each requested cluster.
p = size(X,1);
num = length(clusts);

centroids = NaN(p,num,'like',X);
counts = zeros(1,num,'like',X);

for i = 1:num
    members = (index == clusts(i));
    if any(members)
        counts(i) = sum(members);
        switch dist
            case {'sqeuclidean','cosine','correlation'}
                centroids(:,i) = sum(X(:,members),2) / counts(i);
        end
    end
end
end % function
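
To make the structure of the listing easier to follow, here is a minimal NumPy sketch of its two central pieces: the k-means++ seeding (the `'plus'` case) and the batch update loop, for squared Euclidean distance only. It is not the MATLAB implementation. The function names `sq_dists`, `plus_seeds` and `batch_kmeans` are made up for this illustration, and the on-line phase, the `'EmptyAction'` handling, replicates and the "ties favor not moving" rule are all left out.

```python
import numpy as np


def sq_dists(X, C):
    """Squared Euclidean distance from each row of X to each row of C (n-by-k)."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)


def plus_seeds(X, k, rng):
    """k-means++ seeding (hypothetical helper mirroring the 'plus' branch)."""
    n = X.shape[0]
    seeds = [rng.integers(n)]          # first seed: uniform at random
    min_dist = np.full(n, np.inf)      # distance to the closest seed so far
    for _ in range(1, k):
        min_dist = np.minimum(min_dist, sq_dists(X, X[[seeds[-1]]])[:, 0])
        total = min_dist.sum()
        if total == 0 or not np.isfinite(total):
            # Degenerate weights: fill the remaining seeds uniformly from the
            # unused points (a simplification of the MATLAB fallback).
            rest = np.setdiff1d(np.arange(n), seeds)
            seeds.extend(rng.choice(rest, size=k - len(seeds), replace=False))
            break
        # Draw the next seed with probability proportional to min_dist
        seeds.append(rng.choice(n, p=min_dist / total))
    return X[seeds].astype(float)


def batch_kmeans(X, k, max_iter=100, seed=0):
    """Batch phase only: reassign all points at once, then update centroids."""
    rng = np.random.default_rng(seed)
    C = plus_seeds(X, k, rng)
    idx = sq_dists(X, C).argmin(axis=1)
    for _ in range(max_iter):
        for j in range(k):
            members = idx == j
            if members.any():          # empty clusters keep their old centroid
                C[j] = X[members].mean(axis=0)
        new_idx = sq_dists(X, C).argmin(axis=1)
        if np.array_equal(new_idx, idx):
            break                      # no point moved: converged
        idx = new_idx
    D = sq_dists(X, C)                 # plays the role of output D
    sumd = np.bincount(idx, weights=D[np.arange(len(X)), idx], minlength=k)
    return idx, C, sumd, D
```

The four return values correspond roughly to `IDX`, `C`, `SUMD` and `D` in the MATLAB help text, except that cluster indices here start at 0 and the data is kept with one point per row rather than transposed.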

K-means in Python

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
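
The snippet above stops after fitting. As a rough guide to how the fitted scikit-learn object relates to the MATLAB outputs described earlier, the sketch below reads back the labels, centroids and distances. The mapping in the comments is an informal correspondence, not an exact equivalence, and `n_init=5` is chosen here only to mirror `'Replicates',5` from the MATLAB example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# n_init plays a role similar to 'Replicates'; k-means++ seeding is the default
km = KMeans(n_clusters=2, n_init=5, random_state=0).fit(X)

labels = km.labels_             # like IDX, but 0-based
centers = km.cluster_centers_   # like C, a K-by-P array
total = km.inertia_             # like sum(SUMD) for squared Euclidean distance
dist = km.transform(X)          # Euclidean (not squared) distances to each center

print(labels)
print(centers)
print(total)

# Assign new points to the nearest fitted centroid
print(km.predict(np.array([[0, 0], [5, 3]])))
```

Note that `transform` returns plain Euclidean distances, so squaring them gives the counterpart of MATLAB's default `D`.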





Reposted from: https://blog.csdn.net/xholes/article/details/52911781
