[Pattern Recognition Mini-Assignment] K-means clustering + MATLAB implementation + UCI Iris and Seeds data sets + classification


    • 1.Introduction
    • 2.The characteristics of the data sets
    • 3.Data preprocessing
    • 4.The code
    • 5.Limitations and improvements

The complete code, readme, and report files can be found in my downloads.

1.Introduction

In this assignment, I implemented and tested a k-means clustering method in MATLAB to perform three-class classification tasks on two data sets. The data sets, Iris and Seeds, were downloaded from the UCI Machine Learning Repository.

Each program consists of one .m file, which assigns each sample to a class. Since each data set contains 3 classes, the number of clusters is fixed at k = 3. Because the clustering result depends strongly on the choice of initial cluster centers, and to keep the computational cost manageable, candidate center points are drawn every five samples from each class range, and every combination of three such points is tried; the combination that yields the highest classification accuracy is kept as the initial cluster centers.
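To make the search concrete, here is a minimal sketch (not part of the original script) of the candidate grid implied by the loops in Section 4; the index ranges 1:5:50, 51:5:100 and 101:5:150 correspond to the three class ranges of the relabeled Iris data.

cand1 = 1:5:50;    % candidate rows from the Iris-setosa range
cand2 = 51:5:100;  % candidate rows from the Iris-versicolor range
cand3 = 101:5:150; % candidate rows from the Iris-virginica range
% In total, 10 x 10 x 10 = 1000 combinations of initial centers are evaluated.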

Experimental results show that the k-means clustering model does not perform particularly well on these three-class problems. I believe this is because the clustering process depends heavily on the number of clusters k and on the initial cluster centers: the processing is not fast, and the accuracy varies with the initialization. The results on the Seeds data set are similar to those on Iris, even though Seeds contains 7 features.

2.The characteristics of the data sets

2.1 Iris
The data set contains 150 samples. Every sample has 4 attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. The three classes are Iris Setosa, Iris Versicolour and Iris Virginica. This is one of the most popular data sets in the UCI repository; its clear structure and plentiful samples make it suitable for this assignment.

2.2 Seeds
The data set contains 210 samples. Every sample has 7 attributes: area A, perimeter P, compactness C = 4πA/P², length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. All of these attributes are real-valued and continuous. The data set comprises kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian. It is often used for classification and cluster analysis tasks.
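As a quick check of the compactness formula, it can be reproduced in MATLAB from the area and perimeter (a one-line sketch with illustrative values in the style of the Seeds data):

A = 15.26; P = 14.84;  % illustrative area and perimeter of a kernel
C = 4*pi*A/P^2;        % compactness; approximately 0.871, close to 1 for round kernels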

3.Data preprocessing

Taking Iris as an example, the first step is to convert the class names in the data set into numeric representations.
Read the data set file with MATLAB, extract all the attributes into an array, and append a numeric label to each sample: Iris-setosa is marked as 1, Iris-versicolor as 2, and Iris-virginica as 3. Then copy the processed data set for the subsequent clustering, and write the data to txt files for saving and later use.

4.The code

%This program is based on the data set Iris
%The original data set contains three kinds of flowers
%Here, the K-means method is applied to the three-class problem
%The flower name in the data set is changed to a numeric symbol of 1 2 3

clear;
clc;

%%%%%%%%%%%%%%Data preprocessing section%%%%%%%%%%%%%%%%
% Read data
f=fopen('iris.data');%Open dataset file
data=textscan(f,'%f,%f,%f,%f,%s'); %Read file content

D=[];% Used to store attribute values
for i=1:length(data)-1
    D=[D data{1,i}];
end
fclose(f);

label=data{1,length(data)};
n1=0;n2=0;n3=0;
% Find the index of each type of data
for j=1:length(label)
    if strcmp(label{j,1},'Iris-setosa')
        n1=n1+1;
        index_1(n1)=j;% Record the index belonging to the "Iris-setosa" class
        
    elseif strcmp(label{j,1},'Iris-versicolor')
        n2=n2+1;
        index_2(n2)=j;% Record the index belonging to the "Iris-versicolor" class
        
    elseif strcmp(label{j,1},'Iris-virginica')
        n3=n3+1;
        index_3(n3)=j;% Record the index belonging to the "Iris-virginica" class
        
    end
end

% Retrieve each type of data according to the index
class_1=D(index_1,:);
class_2=D(index_2,:);
class_3=D(index_3,:);
Attributes=[class_1;class_2;class_3];

%Iris-setosa is marked as 1; Iris-versicolor is marked as 2
%Iris-virginica is marked as 3
I=[1*ones(n1,1);2*ones(n2,1);3*ones(n3,1)];
Iris=[Attributes I];% Change the name of the flower to a number tag

save Iris.mat Iris % Save all data as a mat file

%Save all data as a txt file
f=fopen('iris1.txt','w');
[m,n]=size(Iris);
for i=1:m
    for j=1:n
        if j==n
            fprintf(f,'%g \n',Iris(i,j));
        else
            fprintf(f,'%g,',Iris(i,j));
        end
    end
end
fclose(f);

%Directly take the number of categories k = 3
Iris_test=Iris;
[m,n]=size(Iris_test);
acc_rateqian=0;% Best accuracy found so far ("qian" = previous)

%Traverse candidate samples (every 5th sample in each class range) to find
%the 3 sample points best suited as the initial cluster centers
for bianli1=1:5:50
    for bianli2=51:5:100
        for bianli3=101:5:150
            
            u1=Iris(bianli1,:);% Candidate initial centers (rows still carry the label column)
            u2=Iris(bianli2,:);
            u3=Iris(bianli3,:);
            
            u1_qian=u1(1:4);% Centers from the previous iteration ("qian" = previous);
            u2_qian=u2(1:4);% only the 4 attribute columns are kept
            u3_qian=u3(1:4);
            
            diedai=0;% Iteration counter ("diedai" = iteration)
            
            while 1
                
                %Divide each sample into different classes
                for i=1:m
                    d1=0;
                    d2=0;
                    d3=0;
                    
                    for j=1:n-1
                        d1=d1+(Iris_test(i,j)-u1(1,j))^2;
                        d2=d2+(Iris_test(i,j)-u2(1,j))^2;
                        d3=d3+(Iris_test(i,j)-u3(1,j))^2;
                    end
                    %Determine which sample center is closest to each point
                    if (d1<=d2)&&(d1<=d3)
                        Iris_test(i,n)=1;
                    elseif (d2<=d1)&&(d2<=d3)
                        Iris_test(i,n)=2;
                    else
                        Iris_test(i,n)=3;
                    end
                end
                
                %Save the results after the first cluster to show the difference
                if diedai==0
                    f_first=fopen('iris_first.txt','w');
                    [m_first,n_first]=size(Iris_test);
                    for i=1:m_first
                        for j=1:n_first
                            if j==n_first
                                fprintf(f_first,'%g \n',Iris_test(i,j));
                            else
                                fprintf(f_first,'%g,',Iris_test(i,j));
                            end
                        end
                    end
                    fclose(f_first);
                end
                
                
                %Update new cluster center
                Iris_1=zeros(1,n-1);% Attribute sums for cluster 1
                geshu1=0;           % Sample count for cluster 1 ("geshu" = count)
                Iris_2=zeros(1,n-1);% Attribute sums for cluster 2
                geshu2=0;
                Iris_3=zeros(1,n-1);% Attribute sums for cluster 3
                geshu3=0;
                for i=1:m
                    if Iris_test(i,n)==1
                        for j=1:n-1
                            Iris_1(1,j)=Iris_1(1,j)+Iris_test(i,j);
                        end
                        geshu1=geshu1+1;
                    end
                    if Iris_test(i,n)==2
                        for j=1:n-1
                            Iris_2(1,j)=Iris_2(1,j)+Iris_test(i,j);
                        end
                        geshu2=geshu2+1;
                    end
                    if Iris_test(i,n)==3
                        for j=1:n-1
                            Iris_3(1,j)=Iris_3(1,j)+Iris_test(i,j);
                        end
                        geshu3=geshu3+1;
                    end
                end
                
                % Keep the old center if a cluster becomes empty (avoids division by zero)
                if geshu1>0, u1=Iris_1/geshu1; end
                if geshu2>0, u2=Iris_2/geshu2; end
                if geshu3>0, u3=Iris_3/geshu3; end
                
                %If the cluster centers have not changed,
                %the clustering process stops
                if isequal(u1_qian,u1) && isequal(u2_qian,u2) && isequal(u3_qian,u3)
                    break;
                end
                u1_qian=u1;
                u2_qian=u2;
                u3_qian=u3;
                
                %Limit the number of cluster iterations
                if diedai>1000
                    break
                end
                diedai=diedai+1;
                
            end
            
            %Calculate clustering accuracy (direct label comparison works here
            %because each initial center is drawn from its own class range)
            [m_result,n_result]=size(Iris_test);
            error=0;
            for i=1:m_result
                if Iris(i,n_result)~=Iris_test(i,n_result)
                    error=error+1;
                end
            end
            
            acc_rate=(m_result-error)/m_result;
            
            if acc_rate>=acc_rateqian
                
                %Save clustering results obtained by the last optimal
                %initial clustering center
                f_result=fopen('iris_result.txt','w');
                for i=1:m_result
                    for j=1:n_result
                        if j==n_result
                            fprintf(f_result,'%g \n',Iris_test(i,j));
                        else
                            fprintf(f_result,'%g,',Iris_test(i,j));
                        end
                    end
                end
                fclose(f_result);
                
                acc_rateqian=acc_rate;% Update the best accuracy so far (bug fix: without this, every combination would overwrite the saved result)
                acc_final=acc_rate;
                yangben1=bianli1;
                yangben2=bianli2;
                yangben3=bianli3;
                
                diedai_best=diedai;
            end
            
        end
    end
end
fprintf('acc_rate is %f\n',acc_final);
fprintf('Initial cluster centers are samples %d, %d and %d\n',yangben1,yangben2,yangben3);
fprintf('Number of iterations is %d\n',diedai_best);
fprintf('The results of the clustering are saved\n');
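To run the script, place iris.data (downloaded from the UCI repository) in the MATLAB working directory and execute the .m file. It writes iris1.txt (the relabeled data), iris_first.txt (the assignment after the first pass) and iris_result.txt (the best clustering found), and prints the best accuracy, the chosen initial centers and the iteration count.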


5.Limitations and improvements

  1. This K-means implementation can be applied to the Seeds and Iris data sets, but the results are not good compared with the neural-network and logistic-regression methods. It deserves further improvement, perhaps through additional data preprocessing; see the sketch after this list.
  2. Only three-class tasks are considered in this assignment; applying the K-means clustering model to multi-class tasks and to large-scale data analysis should be tried in the future. Perhaps the application of clustering to big-data analysis has advantages over the neural-network method.
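As one possible improvement along these lines, the clustering could be run on standardized features with multiple random restarts. Below is a minimal sketch, assuming the Statistics and Machine Learning Toolbox and the Iris.mat file produced in Section 4; in recent MATLAB releases, kmeans uses k-means++ seeding by default, and 'Replicates' keeps the solution with the lowest total distance.

load Iris.mat                         % Iris: 150 x 5, last column is the true label
X = zscore(Iris(:,1:4));              % standardize each attribute (zero mean, unit variance)
idx = kmeans(X, 3, 'Replicates', 10); % 10 restarts; best clustering kept
% Note: cluster indices are arbitrary, so they must be matched to the true
% labels (e.g., by majority vote within each cluster) before computing accuracy.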

The complete code, readme, and report files can be found in my downloads.
