Clustering the iris dataset with kmeans: a worked example of the kmeans clustering algorithm

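For a class assignment, the teacher asked us to cluster the iris dataset with kmeans, so I did it. It is simple: I set the number of clusters to three. The interesting part is how to choose the initial centers. I tried two approaches, random selection and a farthest-point trick, and random initialization gave worse results than the trick. The trick works like this: pick a random point as center1, then pick the point farthest from center1 as center2, then pick the point whose combined distance to center1 and center2 is largest as center3.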
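The initialization described above can be sketched in vectorized NumPy. This is only an illustration under assumptions: the function name `farthest_point_init` and the toy points are made up for this example, and squared Euclidean distance is assumed, as in the script further down. One caveat worth knowing: with a summed-distance criterion the third pick can coincide with an earlier center on some data; a common alternative is to maximize the minimum distance to the centers chosen so far.

```python
import numpy as np

def farthest_point_init(points, rng):
    """Pick 3 initial centers: one at random, then two by largest squared distance.

    Hypothetical helper for illustration; not part of the original script.
    """
    first = rng.integers(0, len(points))       # random index in [0, len(points))
    c1 = points[first]
    d1 = np.sum((points - c1) ** 2, axis=1)    # squared distance of every point to c1
    c2 = points[np.argmax(d1)]                 # farthest point from c1
    d2 = np.sum((points - c2) ** 2, axis=1)    # squared distance of every point to c2
    c3 = points[np.argmax(d1 + d2)]            # largest summed distance to c1 and c2
    return np.stack([c1, c2, c3])

rng = np.random.default_rng(0)
toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
centers = farthest_point_init(toy, rng)
print(centers)
```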

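When updating the centroids during training, I did not use online mode, where the centroid is updated after every single point. Instead I used a mini_batch mode: feed in mini_batch samples, assign each one to its nearest center, and only then update the centers. Enough talk; here is the code: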

import numpy as np
import csv
import random

features = np.loadtxt('iris.csv',delimiter=',',usecols=(1,2,3,4))  #read features
z_min, z_max = features.min(axis=0), features.max(axis=0)          #features normalized
features = (features - z_min)/(z_max - z_min)

csv_file = open('iris.csv')   #transform string to num label
csv_reader_lines = csv.reader(csv_file)
classes_list = []
for i in csv_reader_lines:
    classes_list.append(i[-1])
labels = []
for i in classes_list:
    if i=='setosa':
        labels.append(0)
    elif i=='versicolor':
        labels.append(1)
    else:
        labels.append(2)

labels = np.array(labels)
labels = labels.reshape((150,1))   # reshape the label array into a column vector

data_index = np.arange(features.shape[0])
np.random.shuffle(data_index)

train_input = features[ data_index[0:120] ]
train_label = labels[ data_index[0:120] ]

test_input = features[ data_index[120:150] ]
test_label = labels[ data_index[120:150] ]

train_length = 120
K = 3
center_1_pos = random.randint(0, train_length - 1)  # randint includes both endpoints
center1 = train_input[ center_1_pos ]
# center1 = train_input[0]
# center2 = train_input[1]
# center3 = train_input[2]
# print(center1)
# print(center2)
# print(center3)

biggest_distance = 0.0
center_2_pos = 0

for i in range(train_length):  # choose center2: the point farthest from center1
    dist = np.sum(pow( (center1 - train_input[i]),2 ))
    if dist > biggest_distance:
        biggest_distance = dist
        center_2_pos = i

center2 = train_input[center_2_pos]


biggest_distance = 0.0
center_3_pos = 0

for i in range(train_length):  # choose center3: largest summed distance to center1 and center2
    dist = np.sum(pow( (center1 - train_input[i]), 2 )) + np.sum(pow( (center2 - train_input[i]) , 2))
    if dist > biggest_distance:
        biggest_distance = dist
        center_3_pos = i

center3 = train_input[center_3_pos]
mini_batch = 20

for epoch in range(10):  # make 10 passes over the training set
    for i in range(6):
        belong1 = []
        belong2 = []
        belong3 = []
        for j in range(mini_batch):  # assign each sample in this mini-batch
            temp_index = mini_batch * i + j
            belong = 1
            dist_1 = np.sum(pow( ( center1 - train_input[mini_batch*i+j] ),2 ))
            temp_dist = dist_1

            dist_2 = np.sum(pow((center2 - train_input[mini_batch * i + j]), 2))
            dist_3 = np.sum(pow((center3 - train_input[mini_batch * i + j]), 2))

            if(dist_2 < temp_dist):
                temp_dist = dist_2
                belong = 2
            if(dist_3 < temp_dist):
                belong = 3

            if belong==1:
                belong1.append( temp_index )
            elif belong == 2:
                belong2.append(temp_index)
            else:
                belong3.append(temp_index)


        for k in belong1:
            center1 = center1 + train_input[k]
        center1 = center1 / (1 + len(belong1))

        for k in belong2:
            center2 = center2 + train_input[k]
        center2 = center2 / (1 + len(belong2))

        for k in belong3:
            center3 = center3 + train_input[k]
        center3 = center3 / (1 + len(belong3))

    b_1=[]
    b_2=[]
    b_3=[]
    for l in range(test_input.shape[0]):  # evaluate on the test set
        belong = 1
        dist_1 = np.sum(pow((center1 - test_input[l]), 2))
        temp_dist = dist_1

        dist_2 = np.sum(pow((center2 - test_input[ l ]), 2))
        dist_3 = np.sum(pow((center3 - test_input[ l ]), 2))

        if (dist_2 < temp_dist):
            temp_dist = dist_2
            belong = 2
        if (dist_3 < temp_dist):
            belong = 3



        if belong == 1:
            b_1.append(test_label[l][0])
        elif belong == 2:
            b_2.append(test_label[l][0])
        else:
            b_3.append(test_label[l][0])
    print()
    print('epoch : {} / 10' .format(epoch+1))
    print('center1: ',b_1)
    print('center2',b_2)
    print('center3: ',b_3)
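The mini-batch step in the script can also be written in vectorized form. The sketch below assumes the same update rule as the loops above: each new center is the mean of the old center and the batch points assigned to it. The function name `minibatch_step` and the toy arrays are hypothetical, introduced only for this illustration.

```python
import numpy as np

def minibatch_step(centers, batch):
    """Assign each batch point to its nearest center, then update each center
    as the mean of the old center and the points assigned to it.

    Hypothetical helper mirroring the script's update rule.
    """
    # Squared distances between every batch point and every center: shape (batch_size, K)
    d = np.sum((batch[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    assign = np.argmin(d, axis=1)              # index of the nearest center per point
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = batch[assign == k]           # may be empty; the sum is then zero
        new_centers[k] = (centers[k] + members.sum(axis=0)) / (1 + len(members))
    return new_centers, assign

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
batch = np.array([[0.2, 0.0], [0.0, 0.4], [1.0, 0.8]])
centers, assign = minibatch_step(centers, batch)
print(assign)    # [0 0 1]
print(centers)
```

Because the old center counts as one extra point in the average, a center with no assigned points in a batch stays where it is.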

Here is the output from running the program:

epoch : 1 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 2 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 3 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 4 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 5 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 6 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

epoch : 7 / 10
center1:  [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3:  [2, 2, 2, 2, 2]

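The results on this small dataset are quite good: setosa (label 0) is separated cleanly, although a few virginica samples (label 2) end up in the versicolor cluster.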
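To put a number on the printed result, one option is cluster purity: the fraction of test points that carry the majority label of their cluster. The sketch below copies the label lists from the epoch output above; purity is my choice of metric here, not something the original script computes.

```python
from collections import Counter

# Test-set labels per cluster, copied from the printed output
clusters = {
    "center1": [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1],
    "center2": [0] * 10,
    "center3": [2] * 5,
}

majority_total = 0
for name, labels in clusters.items():
    label, count = Counter(labels).most_common(1)[0]   # most frequent label in the cluster
    majority_total += count
    print(f"{name}: majority label {label}, {count}/{len(labels)} points")

total = sum(len(v) for v in clusters.values())
purity = majority_total / total
print(f"purity = {purity:.3f}")   # purity = 0.867
```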

A download link for the iris dataset: https://pan.baidu.com/s/1PpIwncqbtQbEuGKSxKyBMg  password: 7ins

Note: before reading the iris data, I deleted the first row (the attribute names) from the file; otherwise it is a hassle to handle.
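As an alternative to editing the file by hand, `np.loadtxt` accepts a `skiprows` argument that skips leading lines. The sketch below uses a made-up three-row excerpt held in memory; the column names and layout (an index column, four measurements, then the species name) are assumptions based on the `usecols=(1,2,3,4)` call in the script.

```python
import io
import numpy as np

# Hypothetical excerpt with the header row kept; not the real file contents.
sample = io.StringIO(
    "id,sepal_length,sepal_width,petal_length,petal_width,species\n"
    "1,5.1,3.5,1.4,0.2,setosa\n"
    "2,7.0,3.2,4.7,1.4,versicolor\n"
    "3,6.3,3.3,6.0,2.5,virginica\n"
)

# skiprows=1 drops the header line; usecols keeps only the four numeric columns
features = np.loadtxt(sample, delimiter=",", usecols=(1, 2, 3, 4), skiprows=1)
print(features.shape)   # (3, 4)
```

If the header is kept in the file, the `csv.reader` loop that collects the labels would also need to skip its first row, for example by calling `next()` on the reader once before iterating.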

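This little assignment was actually fun to do. I learned how to normalize data with NumPy, how to read a csv file with numpy, and how to convert the string classes in a csv file into numeric labels.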
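For reference, both steps mentioned above can be done without explicit loops. This is a minimal sketch on made-up arrays, not the code used in the script: `np.unique` with `return_inverse=True` maps each distinct string to an integer in sorted order, which for the three iris species matches the 0, 1, 2 labels used in the script.

```python
import numpy as np

# String classes to integer labels (toy input, assumed for illustration)
species = np.array(["setosa", "virginica", "versicolor", "setosa"])
names, labels = np.unique(species, return_inverse=True)
print(names)    # ['setosa' 'versicolor' 'virginica']
print(labels)   # [0 2 1 0]

# Min-max normalization per column, the same formula as in the script
x = np.array([[1.0, 10.0], [3.0, 20.0], [2.0, 40.0]])
x_min, x_max = x.min(axis=0), x.max(axis=0)
x_norm = (x - x_min) / (x_max - x_min)
print(x_norm)
```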
