Sample Set Partitioning with the KS and SPXY Algorithms (MATLAB)

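When building a model, how the training set samples are chosen matters a great deal. Below is a brief introduction to two of the most widely used selection methods, the KS algorithm and the SPXY algorithm.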

Kennard-Stone algorithm (KS algorithm)

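How KS works: every sample is treated as a candidate for the training set, and samples are picked into the training set one at a time. First, the two samples with the largest Euclidean distance between them are selected. Then, for each remaining sample, the Euclidean distances to every sample already in the training set are computed and the smallest of them is kept; the candidate whose smallest distance is the largest (the max-min distance) is added to the training set. This is repeated until the required number of samples is reached. The advantage of the method is that the training samples are spread evenly over the data space according to their distances. The drawback is that it requires data transformation and the computation of all pairwise distances between samples, so the computational cost is high.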
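The max-min rule is easier to follow on a toy case. The sketch below is only an illustration, not the program given later in this post: it picks 3 out of 4 made-up one-variable samples, and the names D, selected, rest and dmin are my own. bsxfun is used so that it also runs on older MATLAB releases.

```matlab
% Four hypothetical samples with a single variable each (illustrative values)
X = [0; 1; 4; 10];

% 4x4 matrix of pairwise distances |X(i) - X(j)|
D = abs(bsxfun(@minus, X, X'));

% Step 1: the two samples that are farthest apart enter the training set
[~, idx] = max(D(:));
[r, c] = ind2sub(size(D), idx);
selected = [r c];

% Step 2: for each remaining sample keep the distance to its nearest
% selected sample, then pick the sample for which that distance is largest
rest = setdiff(1:numel(X), selected);
dmin = min(D(rest, selected), [], 2);
[~, k] = max(dmin);
selected(end+1) = rest(k);

disp(X(selected)');   % the three selected sample values
```

Here 0 and 10 are chosen first. Of the two remaining samples, 1 lies only 1 away from its nearest selected sample while 4 lies 4 away, so 4 is chosen third.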

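Euclidean distance: the Euclidean metric (also called the Euclidean distance) is a commonly used definition of distance. It is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is the actual distance between the two points.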

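Euclidean distance formula:

$$d_x(p,q) = \sqrt{\sum_{j=1}^{N} \left[ X_p(j) - X_q(j) \right]^2}$$

Xp and Xq are two different samples, and N is the number of wavelength points in the spectrum.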
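The Euclidean distance between two samples can be checked on a tiny example. The snippet below is a minimal sketch with two made-up spectra of N = 3 wavelength points; it computes the distance once term by term and once with the built-in norm, which is the call the KS program uses.

```matlab
% Two hypothetical spectra with N = 3 wavelength points (illustrative values)
xp = [1 2 3];
xq = [4 6 3];

% Euclidean distance written out: square root of the sum of squared differences
d_manual = sqrt(sum((xp - xq).^2));

% The same distance using the built-in 2-norm
d_norm = norm(xp - xq);

fprintf('manual = %g, norm = %g\n', d_manual, d_norm);   % both are 5
```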

SPXY algorithm

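How SPXY works: it was developed from the KS algorithm. When computing the distance between samples, SPXY takes the x variables and the y variable into account at the same time.
Distance formula:

$$d_{xy}(p,q) = \frac{d_x(p,q)}{\max_{p,q} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q} d_y(p,q)}$$

Here d_x(p,q) is the Euclidean distance between samples p and q in the x space, d_y(p,q) is the corresponding distance in the y space, and each is divided by its maximum over all pairs of samples so that the two terms carry equal weight.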
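To make the joint distance concrete, here is a small sketch on made-up data (3 samples, 2 x variables, 1 y value). The names Dx, Dy and Dxy are my own, and unlike the SPXY program further down, this sketch does not auto-scale Y first.

```matlab
X = [0 0; 3 4; 6 8];   % hypothetical spectra: 3 samples x 2 variables
Y = [1; 2; 5];         % hypothetical reference values, one per sample

M  = size(X, 1);
Dx = zeros(M);         % pairwise distances in the x space
Dy = zeros(M);         % pairwise distances in the y space
for p = 1:M
    for q = 1:M
        Dx(p,q) = norm(X(p,:) - X(q,:));
        Dy(p,q) = norm(Y(p,:) - Y(q,:));
    end
end

% Each distance matrix is divided by its own maximum before the two are added
Dxy = Dx / max(Dx(:)) + Dy / max(Dy(:));
disp(Dxy);
```

Each normalized term lies between 0 and 1, so the joint distance lies between 0 and 2. In this example samples 1 and 3 are the farthest apart in both x and y, so their joint distance is exactly 2.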

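When using this algorithm, you may wonder what the X and the Y in SPXY actually stand for. Here is an example that I hope will help.
Example: in March 2019, Wang Shifang (王世芳) and colleagues at the Beijing Research Center for Agricultural Standards and Testing (北京农业质量标准与检测技术研究中心) published a paper in Spectroscopy and Spectral Analysis (《光谱学与光谱分析》) titled 《SPXY算法的西瓜可溶性固形物近红外光谱检测》 (near-infrared spectroscopic detection of soluble solids in watermelon with the SPXY algorithm). In that paper, the sample set partitioning based on joint X-Y distances (SPXY) algorithm is used to partition the sample sets taken from different detection positions on the watermelon, with the soluble solids content as the y variable and the spectrum as the x variable. Using both variables to compute the distance between samples ensures that the sample distribution is characterized as fully as possible, covers the multidimensional vector space effectively, increases the diversity and representativeness of the samples, and improves model stability.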

Code

I use MATLAB 2014a, so the MATLAB programs for the KS and SPXY algorithms are given below.
KS code

function [XSelected,XRest,vSelectedRowIndex]=ks(X,Num)
%  ks selects the samples XSelected which are uniformly distributed in the space of the experimental data X
%  Input
%         X: the matrix of the sample spectra
%         Num: the number of sample spectra you want to select
%  Output
%         XSelected: the sample spectra selected from X
%         XRest: the sample spectra remaining in X after selection
%         vSelectedRowIndex: the row indices of the selected samples in the X matrix
%  Programmer: zhimin zhang @ central south university on oct 28 ,2007
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% start of the kennard-stone step one
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
X=xlsread('X.xlsx'); % obtain the data (note: this overwrites the input argument X)
[nR,nC]=size(X); % obtain the size of the X matrix
mDistance=zeros(nR,nR); % allocate a matrix for the distance storage
vAllofSample=1:nR;
for i=1:nR-1

    vRX=X(i,:); % get one row of X

    for j=i+1:nR

        vRX1=X(j,:); % get another row of X
        mDistance(i,j)=norm(vRX-vRX1); % Euclidean distance (upper triangle only)

    end

end
[vMax,vIndexOfmDistance]=max(mDistance); % largest distance in each column and its row

[nMax,nIndexOfvMax]=max(vMax); % column holding the overall largest distance

vSelectedSample(1)=nIndexOfvMax;
vSelectedSample(2)=vIndexOfmDistance(nIndexOfvMax);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end of the kennard-stone step one
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% start of the kennard-stone step two
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for i=3:Num
    vNotSelectedSample=setdiff(vAllofSample,vSelectedSample);
    vMinDistance=zeros(1,nR-i+1);

    for j=1:(nR-i+1)
        nIndexOfNotSelected=vNotSelectedSample(j);
        vDistanceNew=zeros(1,i-1);

        for k=1:(i-1)
            nIndexOfSelected=vSelectedSample(k);
            % mDistance is upper triangular, so the smaller index goes first
            if(nIndexOfSelected<=nIndexOfNotSelected)
                vDistanceNew(k)=mDistance(nIndexOfSelected,nIndexOfNotSelected);
            else
                vDistanceNew(k)=mDistance(nIndexOfNotSelected,nIndexOfSelected);
            end
        end

        vMinDistance(j)=min(vDistanceNew); % distance to the nearest selected sample
    end

    [nUseless,nIndexOfvMinDistance]=max(vMinDistance); % max-min criterion
    vSelectedSample(i)=vNotSelectedSample(nIndexOfvMinDistance);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% end of the kennard-stone step two
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% start of export the result
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
vSelectedRowIndex=vSelectedSample;

for i=1:length(vSelectedSample)

    XSelected(i,:)=X(vSelectedSample(i),:);  % training set data
end

vNotSelectedSample=setdiff(vAllofSample,vSelectedSample);
for i=1:length(vNotSelectedSample)

    XRest(i,:)=X(vNotSelectedSample(i),:);  % prediction set data
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% end of export the result
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The X.xlsx file contains the data I want to partition.

SPXY code

% function [m,dminmax] = spxy(X,Y,N)   (run here as a script, with N fixed at 32)
 
% Algorithm for Sample set Partitioning based on joint X-Y distances 
% [m,dminmax] = spxy(X,Y,N); 
% 
% X --> Matrix of instrumental responses 
% Y --> Matrix of parameters 
% N --> Number of samples to be selected (minimum of 2) 
% 
% m --> Indexes of the selected samples 
% 
% dminmax(1) = 0; 
% dminmax(2) = Joint XY distance between the two first samples selected by the algorithm 
% dminmax(i) = Smallest joint XY distance between the i-th selected sample and the previously selected ones (i > 2) 
% 
% Reference: 
% R. K. H. Galvao, M. C. U. Araujo, G. E. Jose, M. J. C. Pontes, E. C. Silva, T. C. B. Saldanha 
% A method for calibration and validation subset partitioning 
% Talanta, vol. 67, pp. 736-740, 2005. 
% 
% Web site: www.ele.ita.br/~kawakami/spa/ 
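X=xlsread('X.xlsx'); % Spectra (x variables), one sample per row
Y=xlsread('X.xlsx'); % NOTE: this reads the same file as X; Y should hold the reference values (y variable), so point it at the file or columns that contain them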

dminmax = zeros(1,32); % Initializes the vector of minimum distances 
M = size(X,1); % Number of rows in X (samples) 
samples = 1:M; 

% Auto-scales the Y matrix 
for i=1:size(Y,2) % For each parameter in Y 
    yi = Y(:,i); 
    Y(:,i) = (yi - mean(yi))/std(yi); 
end 

D = zeros(M,M); % Initializes the matrix of X distances 
Dy = zeros(M,M); % Initializes the matrix of Y distances 

for i=1:M-1 
    xa = X(i,:); 
    ya = Y(i,:); 
    for j = i+1:M 
      xb = X(j,:); 
      yb = Y(j,:); 
      D(i,j) = norm(xa - xb); 
      Dy(i,j) =  norm(ya - yb); 
    end 
end 

Dmax = max(max(D)); 
Dymax = max(max(Dy)); 

D = D/Dmax + Dy/Dymax; % Combines the distances in X and Y 

[maxD,index_row] = max(D); % maxD = Row vector containing the largest element of each column in D 

[dummy,index_column] = max(maxD); % index_column = column corresponding to the largest element in matrix D 

m(1) = index_row(index_column); 
m(2) = index_column; 

dminmax(2) = D(m(1),m(2)); 

for i = 3:32 
    % This routine determines the distances between each sample still available for selection and each of the samples already selected 
    pool = setdiff(samples,m); % pool = Samples still available for selection 
    dmin = zeros(1,M-i+1); % Initializes the vector of minimum distances between each sample in pool and the samples already selected 
    for j = 1:(M-i+1) % For each sample xa still available for selection 
        indexa = pool(j); % indexa = index of the j-th sample in pool (still available for selection) 
        d = zeros(1,i-1); % Initializes the vector of distances between the j-th sample in pool and the samples already selected 
        for k = 1:(i-1) % The distance with respect to each sample already selected is analyzed 
            indexb =  m(k); % indexb = index of the k-th sample already selected 
            if indexa < indexb 
                d(k) = D(indexa,indexb); 
            else 
                d(k) = D(indexb,indexa); 
            end 
        end 
        dmin(j) = min(d); 
    end 
    % The selected sample corresponds to the largest dmin 
    [dminmax(i),index] = max(dmin); 
    m(i) = pool(index); 
end 
               

The SPXY computation is largely the same as that of the KS algorithm.
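One practical difference is that the SPXY script ends with the index vector m rather than with the split data, so the two sets still have to be extracted. A minimal sketch of that last step follows, with random stand-in data and a made-up m; the names rest, Xcal, Ycal, Xval and Yval are my own and do not come from the original program.

```matlab
% Stand-in data: 10 samples, 5 x variables, 1 y value (random, for illustration only)
X = rand(10, 5);
Y = rand(10, 1);

% Hypothetical result of SPXY: indices of the selected samples
m = [3 7 1 9];

% Everything that was not selected goes to the other set
rest = setdiff(1:size(X,1), m);

Xcal = X(m, :);     Ycal = Y(m, :);      % training (calibration) set
Xval = X(rest, :);  Yval = Y(rest, :);   % prediction (validation) set

fprintf('%d training samples, %d prediction samples\n', size(Xcal,1), size(Xval,1));
```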
