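When building a model, the choice of training-set samples matters a great deal. Below I briefly introduce two widely used selection methods: the KS (Kennard-Stone) algorithm and the SPXY algorithm.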
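KS algorithm principle: all samples are treated as candidates for the training set, and samples are picked from them one at a time. First, the two samples with the largest Euclidean distance between them are placed in the training set. Then, for every remaining sample, the Euclidean distance to each sample already in the training set is computed and the smallest of these distances is kept; the candidate whose smallest distance is the largest (the max-min distance) is added to the training set. This is repeated until the required number of samples is reached. The advantage of the method is that the training samples are spread evenly with respect to spatial distance. The disadvantage is that it requires data conversion and the distance between every pair of samples, so the computational cost is high.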
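To make the max-min rule concrete, here is a minimal sketch on toy data. The five one-dimensional values and the names `nSelect`, `selected`, `pool` and `dMin` are assumptions made for illustration only; this is not the program listed later in this post.

```matlab
% Toy data: five 1-D samples (hypothetical values chosen for illustration)
X = [0; 1; 2; 6; 10];
nSelect = 3;                      % number of training samples wanted

n = size(X,1);
D = zeros(n,n);                   % full symmetric Euclidean distance matrix
for p = 1:n
    for q = 1:n
        D(p,q) = norm(X(p,:) - X(q,:));
    end
end

% Step 1: start with the two samples that are farthest apart
[~, linearIdx] = max(D(:));
[p0, q0] = ind2sub(size(D), linearIdx);
selected = [p0, q0];

% Step 2: keep adding the candidate whose nearest selected sample is farthest away
while numel(selected) < nSelect
    pool = setdiff(1:n, selected);            % samples not yet selected
    dMin = min(D(pool, selected), [], 2);     % distance to the nearest selected sample
    [~, best] = max(dMin);                    % max-min criterion
    selected(end+1) = pool(best);
end
disp(selected)
```

With these values the first two picks are the extremes (0 and 10), and the third pick is the sample with value 6, because its nearest selected neighbour is 4 away, farther than for the samples at 1 and 2.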
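Euclidean distance: the Euclidean metric (also called the Euclidean distance) is a commonly used definition of distance. It is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is simply the actual distance between two points. For two spectra it can be written as
$$d_x(p,q)=\sqrt{\sum_{j=1}^{N}\left[X_p(j)-X_q(j)\right]^2}$$
where X_p and X_q denote two different samples and N is the number of wavelength points in a sample's spectrum.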
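SPXY algorithm principle: SPXY was developed from the KS algorithm. When computing the distance between samples, SPXY takes both the x variables and the y variables into account.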
Distance calculation
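The joint distance can be written out as follows. This is a reconstruction based on the SPXY program listed later in this post (the line `D = D/Dmax + Dy/Dymax`), not a copy of the original figure; the symbols d_x, d_y and d_xy are notation assumed here, with X_p the spectrum of sample p over N wavelength points and Y_p its property value(s).

```latex
d_x(p,q) = \sqrt{\sum_{j=1}^{N}\left[X_p(j)-X_q(j)\right]^2}, \qquad
d_y(p,q) = \lVert Y_p - Y_q \rVert

d_{xy}(p,q) = \frac{d_x(p,q)}{\max_{p,q} d_x(p,q)} + \frac{d_y(p,q)}{\max_{p,q} d_y(p,q)}
```

Dividing each term by its maximum over all sample pairs gives the x distance and the y distance equal weight. The selection then proceeds as in KS, with d_xy in place of d_x.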
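When using this algorithm, you may wonder what the X and Y in SPXY actually stand for. The following example should help.
Example: in March 2019, Wang Shifang (王世芳) and colleagues at the Beijing Research Center for Agricultural Standards and Testing (北京农业质量标准与检测技术研究中心) published an article in Spectroscopy and Spectral Analysis (《光谱学与光谱分析》) titled 《SPXY算法的西瓜可溶性固形物近红外光谱检测》 (near-infrared spectroscopic detection of soluble solids in watermelon using the SPXY algorithm). In that article, the spectrum-physicochemical value joint distance (SPXY) algorithm is used to partition the sample sets from different measurement positions on the watermelon. The soluble solids content is the y variable and the spectrum is the x variable. Both variables are used together to compute the distance between samples, so that the sample distribution is represented as fully as possible, the multidimensional vector space is covered effectively, the diversity and representativeness of the samples increase, and model stability improves.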
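The roles of X and Y can be illustrated with a small sketch. The numbers below are hypothetical (four samples with three wavelength points each, and made-up soluble solids values), and the names `Dx`, `Dy` and `Dxy` are assumptions for illustration. Each row of X is one sample's spectrum, and the matching row of Y is that sample's measured value.

```matlab
% Hypothetical toy data: 4 samples, 3 wavelength points each
X = [0.10 0.20 0.30;
     0.10 0.20 0.31;
     0.50 0.60 0.70;
     0.52 0.61 0.69];
Y = [8.0; 11.0; 8.1; 11.2];   % hypothetical soluble solids content, one value per sample

M  = size(X,1);
Dx = zeros(M,M);              % Euclidean distances between spectra
Dy = zeros(M,M);              % distances between y values
for p = 1:M
    for q = 1:M
        Dx(p,q) = norm(X(p,:) - X(q,:));
        Dy(p,q) = norm(Y(p,:) - Y(q,:));
    end
end

% Joint distance: each part is divided by its own maximum so both weigh equally
Dxy = Dx/max(Dx(:)) + Dy/max(Dy(:));
disp(Dxy)
```

Samples 1 and 2 have almost identical spectra, so KS alone would treat them as near-duplicates. Their y values differ strongly, so their joint distance is large and SPXY treats them as distinct samples.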
I use MATLAB R2014a, so the MATLAB programs for the KS and SPXY algorithms are given below.
KS code
function [XSelected,XRest,vSelectedRowIndex]=ks(X,Num)
% ks selects the samples XSelected that are uniformly distributed in the space of the experimental data X
% Input
%   X:   the matrix of sample spectra (one sample per row)
%   Num: the number of sample spectra to select (at least 2)
% Output
%   XSelected: the sample spectra selected from X
%   XRest: the sample spectra remaining in X after selection
%   vSelectedRowIndex: the row indices of the selected samples in X
% Programmer: zhimin zhang @ central south university on oct 28 ,2007
% Usage: X=xlsread('X.xlsx'); [XSelected,XRest,vSelectedRowIndex]=ks(X,Num);
%   (the data are loaded by the caller so that the input X is not overwritten)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% start of the kennard-stone step one
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
nR=size(X,1);            % number of samples (rows) in X
mDistance=zeros(nR,nR);  % distance storage; only the upper triangle is filled
vAllofSample=1:nR;
for i=1:nR-1
    vRX=X(i,:);          % one row of X
    for j=i+1:nR
        vRX1=X(j,:);     % another row of X
        mDistance(i,j)=norm(vRX-vRX1);  % Euclidean distance
    end
end
% find the pair of samples with the largest distance
[vMax,vIndexOfmDistance]=max(mDistance);
[nMax,nIndexofvMax]=max(vMax);
vSelectedSample(1)=nIndexofvMax;
vSelectedSample(2)=vIndexOfmDistance(nIndexofvMax);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end of the kennard-stone step one
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% start of the kennard-stone step two
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for i=3:Num
    vNotSelectedSample=setdiff(vAllofSample,vSelectedSample);
    vMinDistance=zeros(1,nR-i+1);
    for j=1:(nR-i+1)
        nIndexofNotSelected=vNotSelectedSample(j);
        vDistanceNew=zeros(1,i-1);
        for k=1:(i-1)
            nIndexofSelected=vSelectedSample(k);
            % read from the upper triangle: row index must be the smaller one
            if(nIndexofSelected<=nIndexofNotSelected)
                vDistanceNew(k)=mDistance(nIndexofSelected,nIndexofNotSelected);
            else
                vDistanceNew(k)=mDistance(nIndexofNotSelected,nIndexofSelected);
            end
        end
        vMinDistance(j)=min(vDistanceNew);  % distance to the nearest selected sample
    end
    % pick the candidate with the largest minimum distance
    [nUseless,nIndexofvMinDistance]=max(vMinDistance);
    vSelectedSample(i)=vNotSelectedSample(nIndexofvMinDistance);
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end of the kennard-stone step two
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% start of export the result
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
vSelectedRowIndex=vSelectedSample;
XSelected=X(vSelectedSample,:);      % training-set data
vNotSelectedSample=setdiff(vAllofSample,vSelectedSample);
XRest=X(vNotSelectedSample,:);       % prediction-set data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% end of export the result
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Here, X.xlsx is the file containing the data I want to partition.
SPXY algorithm
% function [m,dminmax] = spxy(X,Y,N)   (original signature; run below as a script with N fixed at 32)
% Algorithm for Sample set Partitioning based on joint X-Y distances
% [m,dminmax] = spxy(X,Y,N);
%
% X --> Matrix of instrumental responses
% Y --> Matrix of parameters
% N --> Number of samples to be selected (minimum of 2)
%
% m --> Indexes of the selected samples
%
% dminmax(1) = 0;
% dminmax(2) = Joint XY distance between the two first samples selected by the algorithm
% dminmax(i) = Smallest joint XY distance between the i-th selected sample and the previously selected ones (i > 2)
%
% Reference:
% R. K. H. Galvao, M. C. U. Araujo, G. E. Jose, M. J. C. Pontes, E. C. Silva, T. C. B. Saldanha
% A method for calibration and validation subset partitioning
% Talanta, vol. 67, pp. 736-740, 2005.
%
% Web site: www.ele.ita.br/~kawakami/spa/
X=xlsread('X.xlsx'); % Instrumental responses (spectra), one sample per row
Y=xlsread('X.xlsx'); % Parameters (y values); NOTE: read from the same file as X here, point this at the data that actually holds the y values
N = 32; % Number of samples to be selected
dminmax = zeros(1,N); % Initializes the vector of minimum distances
M = size(X,1); % Number of rows in X (samples)
samples = 1:M;
% Auto-scales the Y matrix
for i=1:size(Y,2) % For each parameter in Y
    yi = Y(:,i);
    Y(:,i) = (yi - mean(yi))/std(yi);
end
D = zeros(M,M); % Initializes the matrix of X distances
Dy = zeros(M,M); % Initializes the matrix of Y distances
for i=1:M-1
    xa = X(i,:);
    ya = Y(i,:);
    for j = i+1:M
        xb = X(j,:);
        yb = Y(j,:);
        D(i,j) = norm(xa - xb);
        Dy(i,j) = norm(ya - yb);
    end
end
Dmax = max(max(D));
Dymax = max(max(Dy));
D = D/Dmax + Dy/Dymax; % Combines the distances in X and Y
[maxD,index_row] = max(D); % maxD = Row vector containing the largest element of each column in D
[dummy,index_column] = max(maxD); % index_column = column corresponding to the largest element in matrix D
m(1) = index_row(index_column);
m(2) = index_column;
dminmax(2) = D(m(1),m(2));
for i = 3:N
    % This routine determines the distances between each sample still available for selection and each of the samples already selected
    pool = setdiff(samples,m); % pool = Samples still available for selection
    dmin = zeros(1,M-i+1); % Initializes the vector of minimum distances between each sample in pool and the samples already selected
    for j = 1:(M-i+1) % For each sample xa still available for selection
        indexa = pool(j); % indexa = index of the j-th sample in pool (still available for selection)
        d = zeros(1,i-1); % Initializes the vector of distances between the j-th sample in pool and the samples already selected
        for k = 1:(i-1) % The distance with respect to each sample already selected is analyzed
            indexb = m(k); % indexb = index of the k-th sample already selected
            if indexa < indexb
                d(k) = D(indexa,indexb);
            else
                d(k) = D(indexb,indexa);
            end
        end
        dmin(j) = min(d);
    end
    % The selected sample corresponds to the largest dmin
    [dminmax(i),index] = max(dmin);
    m(i) = pool(index);
end
The computation in the SPXY algorithm follows largely the same procedure as in the KS algorithm.