function rate = KNN(Train_data, Train_label, Test_data, Test_label, k, Distance_mark)
% K-Nearest-Neighbor classifier (K-NN classifier)
%Input:
%     Train_data, Test_data are the training set and the test set,
%     respectively. (Each row is a data point.)
%     Train_label, Test_label are column vectors containing the labels of
%     the training set and the test set, respectively.
%     k is the number of nearest neighbors.
%     Distance_mark: ['Euclidean' | 'L2' | 'L1' | 'Cos']
%     'Cos' denotes the cosine (angular) distance; 'Euclidean' and 'L2'
%     both denote the Euclidean distance.
%Output:
%     rate: Accuracy of the K-NN classifier
%
% Examples:
%
% %Classification problem with three classes
% A = rand(50,300);
% B = rand(50,300)+2;
% C = rand(50,300)+3;
% % label vector for the three classes
% gnd = [ones(300,1);2*ones(300,1);3*ones(300,1)];
% fea = [A B C]';
% trainIdx = [1:150,301:450,601:750]';
% testIdx = [151:300,451:600,751:900]';
% fea_Train = fea(trainIdx,:);
% gnd_Train = gnd(trainIdx);
% fea_Test = fea(testIdx,:);
% gnd_Test = gnd(testIdx);
% rate = KNN(fea_Train,gnd_Train,fea_Test,gnd_Test,1)
%
%
%
%Reference:
%
% If you use my MATLAB code, we would appreciate it very much if you cite the following papers:
% Jie Gui, Tongliang Liu, Dacheng Tao, Zhenan Sun, Tieniu Tan, "Representative Vector Machines:
% A unified framework for classical classifiers", IEEE Transactions on Cybernetics (Accepted).
% Jie Gui et al., "Group sparse multiview patch alignment framework with view consistency
% for image classification", IEEE Transactions on Image Processing, vol. 23, no. 7,
% pp. 3126-3137, 2014
% Jie Gui et al., "How to estimate the regularization parameter for spectral regression
% discriminant analysis and its kernel version?", IEEE Transactions on Circuits and
% Systems for Video Technology, vol. 24, no. 2, pp. 211-223, 2014
% Jie Gui, Zhenan Sun, Wei Jia, Rongxiang Hu, Yingke Lei and Shuiwang Ji, "Discriminant
% Sparse Neighborhood Preserving Embedding for Face Recognition", Pattern Recognition,
% vol. 45, no. 8, pp. 2884-2893, 2012
% Jie Gui, Wei Jia, Ling Zhu, Shuling Wang and Deshuang Huang,
% "Locality Preserving Discriminant Projections for Face and Palmprint Recognition,"
% Neurocomputing, vol. 73, no. 13-15, pp. 2696-2707, 2010
% Jie Gui et al., "Semi-supervised learning with local and global consistency",
% International Journal of Computer Mathematics (Accepted)
% Jie Gui, Shu-Lin Wang, and Ying-ke Lei, "Multi-step Dimensionality Reduction and
% Semi-Supervised Graph-Based Tumor Classification Using Gene Expression Data,"
% Artificial Intelligence in Medicine, vol. 50, no. 3, pp. 181-191, 2010
%This code was written by Jie Gui on the evening of 2009/03/11.
%If you find any bugs in the code, feel free to contact me.
if nargin < 5
    error('Not enough arguments!');
elseif nargin < 6
    Distance_mark = 'L2';
end
[n, dim] = size(Test_data);      % number of test samples and feature dimension
train_num = size(Train_data, 1); % number of training samples
% Normalize each feature to have zero mean and unit variance.
% If you need this normalization, uncomment the following four lines.
% M = mean(Train_data); % mean & std of the training data set
% S = std(Train_data);
% Train_data = (Train_data - ones(train_num, 1) * M)./(ones(train_num, 1) * S); % normalize training data set
% Test_data = (Test_data-ones(n,1)*M)./(ones(n,1)*S); % normalize data
U = unique(Train_label);    % class labels
nclasses = length(U);       % number of classes
Result = zeros(n, 1);       % predicted label for each test sample
Count = zeros(nclasses, 1); % votes per class among the k nearest neighbors
dist = zeros(train_num, 1);
for i = 1:n
    % compute distances between the i-th test sample and all training
    % samples, then sort them
    test = Test_data(i, :);
    for j = 1:train_num
        train = Train_data(j, :);
        V = test - train;
        switch Distance_mark
            case {'Euclidean', 'L2'}
                dist(j, 1) = norm(V, 2); % Euclidean (L2) distance
            case 'L1'
                dist(j, 1) = norm(V, 1); % L1 distance
            case 'Cos'
                % angular (cosine) distance: the angle in [0, pi] between the vectors
                dist(j, 1) = acos(test*train' / (norm(test, 2)*norm(train, 2)));
            otherwise
                dist(j, 1) = norm(V, 2); % default: Euclidean distance
        end
    end
    [dummy, Inds] = sort(dist);
    % count the class labels of the k nearest samples
    Count(:) = 0;
    for j = 1:k
        ind = find(Train_label(Inds(j)) == U); % index of the label of the j-th nearest neighbor
        Count(ind) = Count(ind) + 1;
    end % Count: number of the k nearest neighbors belonging to each class
    % assign the test sample to the majority class
    [dummy, ind] = max(Count);
    Result(i) = U(ind);
end
correctnumbers = length(find(Result == Test_label));
rate = correctnumbers / n;
-------------------------------------------------- The code ends here --------------------------------------------------
The difference between cosine distance and cosine similarity
Cosine similarity is the basic measure most commonly used in traditional document classification to gauge the distance between documents. It measures the distance between two d-dimensional vectors by their angular difference, with the resulting value lying between 0 and 1: the closer the angle between the two vectors, the closer the value is to 1; otherwise, the closer to 0. Suppose there are two points a = [a1, a2, ..., ad] and b = [b1, b2, ..., bd] in d-dimensional space; then their cosine similarity can be written as:
cosineSimilarity(a,b) = dot(a,b) / (norm(a)*norm(b)) [I think this should be called cosineSimilarity rather than cosineDistance: the greater the similarity, the smaller the distance should be. For example, when the angle between a and b is 0 they are most similar, so the similarity is largest and the distance smallest.]
dot(a,b) denotes the inner product of a and b. Because the inner product is defined as a·b = |a| × |b| × cosθ (in general θ ∈ [0, π], http://baike.baidu.com/view/1485493.htm), the value defined this way does not lie in 0~1 but in -1~1. There are two ways to turn it into a distance (see the sketch after the example below):
(1) The acos in my code above is correct: it converts the cosine into an angle in [0, π]. The value need not be restricted to 0~1; my code maps it to [0, π], where a larger value means a larger distance;
(2) cosineDistance(a,b) = 1 - cosineSimilarity(a,b) = 1 - dot(a,b) / (norm(a)*norm(b)), whose range is [0, 2].
Example:
a = [1 1 1]; b = [1 0 0];
cosineSimilarity = dot(a,b) / (norm(a)*norm(b))
cosineSimilarity =
0.5774 [http://neural.cs.nthu.edu.tw/jang/books/dcpr/doc/02%E8%B7%9D%E9%9B%A2%E8%88%87%E7%9B%B8%E4%BC%BC%E5%BA%A6.pdf, saved locally as 距离与相似度.pdf]
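To make the two conversions concrete, here is a minimal MATLAB sketch of both options for the same a and b (the variable names cosSim, angularDist, and cosDist are illustrative, not from the code above):
a = [1 1 1]; b = [1 0 0];
cosSim = dot(a,b) / (norm(a)*norm(b)); % cosine similarity, in [-1, 1]; here 0.5774
angularDist = acos(cosSim);            % option (1): angle in [0, pi], as in the 'Cos' branch above; here 0.9553
cosDist = 1 - cosSim;                  % option (2): 1 - similarity, in [0, 2]; here 0.4226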
(1) My junior colleague Lin Zhu pointed out that replacing the loops with a distance-matrix computation saves time, because loops are slow in MATLAB; for large sample sizes, however, the loops are still necessary, since the full matrix runs out of memory. I recall that the KNN code jinsong he handed out in class was also implemented with loops. MATLAB has a built-in function knnclassify; the code SPP_1NN.m for the paper "Sparsity preserving projections" uses that function. On ASLAN, my KNN and knnclassify give exactly the same recognition rate. A sketch of the vectorized alternative follows below.
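A minimal sketch of that vectorized alternative, assuming Train_data and Test_data as defined above and enough memory for the full n-by-train_num matrix (the variable name D is illustrative, not from the original code):
% squared Euclidean distances between every test and training sample:
% D(i,j) = ||Test_data(i,:) - Train_data(j,:)||^2, with no explicit loops
D = bsxfun(@plus, sum(Test_data.^2, 2), sum(Train_data.^2, 2)') ...
    - 2 * Test_data * Train_data';
[dummy, Inds] = sort(D, 2); % row i of Inds lists training samples by distance; Inds(i,1:k) are the k nearest
% D takes O(n*train_num) memory, which is why large sample sizes still need the loop version.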
(2) Extremely important note: near the end of the program, do not write Result(i) = ind instead of Result(i) = U(ind). That works for datasets such as Yale whose labels happen to be 1, 2, 3, ..., but it fails for binary classification with labels 1 and -1. SRC_QC and SRC_QC2 have the same issue: their third-to-last line must use Result(i) = classLabel(index), not Result(i) = index. Originally only that one spot was fixed; in fact line 50 of SRC_QC2 and line 42 of SRC_QC also need ii changed to classLabel(ii). It was precisely this bug that produced the wrong result that SRC has a 50% error rate with zero variance on ASLAN. The corrected SRC_QC2 and SRC_QC programs are in the ASLAN directory. The sketch below illustrates the bug.
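A minimal illustration of why the winning index must be mapped back through the label vector (the label and vote values are made up for the example):
U = [-1; 1];               % unique class labels of a binary problem, as returned by unique()
Count = [5; 3];            % votes for each class among the k nearest neighbors
[dummy, ind] = max(Count); % ind = 1 is the *index* of the winning class, not its label
wrong = ind;               % wrong: predicts label 1, but the winning class is labeled -1
right = U(ind);            % right: predicts label -1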