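In class, the teacher assigned some homework: use k-means to cluster the iris dataset, so I did it. The task is simple, and I set the number of clusters to three. The interesting part was choosing the initial centers. I tried two approaches, random selection and a farthest-point trick, and random initialization gave worse results than the trick. The trick works like this: pick one sample at random as center1, then pick the sample farthest from center1 as center2, then pick the sample farthest from both center1 and center2 as center3 (in the code, the sample with the largest sum of squared distances to the two).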
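The farthest-point trick can be sketched in vectorized form. This is a minimal sketch under assumptions: the helper name `farthest_point_init` and the synthetic 2-D data are made up for illustration and are not part of the original script. It follows the same sum-of-squared-distances rule for center3.

```python
import numpy as np


def farthest_point_init(points, rng):
    """Pick 3 initial centers: one random point, then two far-away points."""
    # Hypothetical helper, not part of the original script.
    first = rng.integers(len(points))  # random index in [0, len(points))
    center1 = points[first]

    # center2: the point with the largest squared distance to center1.
    d1 = np.sum((points - center1) ** 2, axis=1)
    center2 = points[np.argmax(d1)]

    # center3: the point with the largest summed squared distance to both.
    d2 = np.sum((points - center2) ** 2, axis=1)
    center3 = points[np.argmax(d1 + d2)]
    return np.stack([center1, center2, center3])


# Synthetic 2-D data (an assumption for illustration, not the iris features).
rng = np.random.default_rng(0)
points = rng.random((50, 2))
centers = farthest_point_init(points, rng)
print(centers.shape)  # (3, 2)
```

One caveat with the summed criterion: when the data lie roughly along a line, the sum can be largest at center2 itself, so center3 may coincide with center2. A common alternative is to maximize the minimum distance to the centers chosen so far, which avoids picking the same point twice in most cases.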
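When updating the centroids during training, I did not use the online mode, where a centroid is updated after every single sample. Instead I used a mini-batch mode: feed in mini_batch samples, assign each one to its nearest center, and only then update the centers. Enough talk; here is the code.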
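Before the full script, the mini-batch update on its own can be sketched as a single function. The name `minibatch_step` and the toy data are assumptions made for illustration. The update rule mirrors the script: each center becomes the average of its old value and the batch samples assigned to it, which is not quite the plain cluster mean of standard k-means.

```python
import numpy as np


def minibatch_step(centers, batch):
    """One mini-batch update: assign the batch, then move each center once."""
    # Squared distance from every batch sample to every center: shape (n, K).
    dists = np.sum((batch[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    nearest = np.argmin(dists, axis=1)

    new_centers = centers.copy()
    for k in range(len(centers)):
        members = batch[nearest == k]
        # Same rule as the script: average the old center with its assigned samples.
        new_centers[k] = (centers[k] + members.sum(axis=0)) / (1 + len(members))
    return new_centers, nearest


# Toy data, assumed for illustration: two centers and a batch of four points.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
batch = np.array([[0.2, 0.0], [0.0, 0.4], [1.0, 0.8], [0.8, 1.0]])
centers, nearest = minibatch_step(centers, batch)
print(nearest)  # [0 0 1 1]
print(centers)
```

A center with no assigned samples in a batch is left unchanged by this rule. The full script comes next.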
import csv
import random

import numpy as np

# Read the four feature columns (columns 1-4 of the file).
features = np.loadtxt('iris.csv', delimiter=',', usecols=(1, 2, 3, 4))

# Min-max normalize each feature to [0, 1].
z_min, z_max = features.min(axis=0), features.max(axis=0)
features = (features - z_min) / (z_max - z_min)

# Read the class names from the last column and turn them into numeric labels.
with open('iris.csv') as csv_file:
    classes_list = [row[-1] for row in csv.reader(csv_file)]

labels = []
for name in classes_list:
    if name == 'setosa':
        labels.append(0)
    elif name == 'versicolor':
        labels.append(1)
    else:
        labels.append(2)
labels = np.array(labels).reshape((150, 1))  # convert the list to a NumPy column vector

# Shuffle the samples, then split them into 120 for training and 30 for testing.
data_index = np.arange(features.shape[0])
np.random.shuffle(data_index)
train_input = features[data_index[0:120]]
train_label = labels[data_index[0:120]]
test_input = features[data_index[120:150]]
test_label = labels[data_index[120:150]]

train_length = 120
K = 3

# center1: a random training sample. randrange excludes train_length, whereas
# randint(0, train_length) includes it and could produce an out-of-range index.
center_1_pos = random.randrange(train_length)
center1 = train_input[center_1_pos]

# Alternative initialization: use the first three shuffled samples.
# center1 = train_input[0]
# center2 = train_input[1]
# center3 = train_input[2]
# print(center1)
# print(center2)
# print(center3)

# Choose center2: the training sample farthest from center1.
biggest_distance = 0.0
center_2_pos = 0
for i in range(train_length):
    dist = np.sum(pow((center1 - train_input[i]), 2))
    if dist > biggest_distance:
        biggest_distance = dist
        center_2_pos = i
center2 = train_input[center_2_pos]

# Choose center3: the sample with the largest summed squared distance
# to center1 and center2.
biggest_distance = 0.0
center_3_pos = 0
for i in range(train_length):
    dist = np.sum(pow((center1 - train_input[i]), 2)) + np.sum(pow((center2 - train_input[i]), 2))
    if dist > biggest_distance:
        biggest_distance = dist
        center_3_pos = i
center3 = train_input[center_3_pos]

mini_batch = 20
for epoch in range(10):  # train on the whole training set 10 times
    for i in range(train_length // mini_batch):  # 6 mini-batches per epoch
        belong1 = []
        belong2 = []
        belong3 = []
        for j in range(mini_batch):  # assign each sample in the batch to its nearest center
            temp_index = mini_batch * i + j
            belong = 1
            dist_1 = np.sum(pow((center1 - train_input[temp_index]), 2))
            temp_dist = dist_1
            dist_2 = np.sum(pow((center2 - train_input[temp_index]), 2))
            dist_3 = np.sum(pow((center3 - train_input[temp_index]), 2))
            if dist_2 < temp_dist:
                temp_dist = dist_2
                belong = 2
            if dist_3 < temp_dist:
                belong = 3
            if belong == 1:
                belong1.append(temp_index)
            elif belong == 2:
                belong2.append(temp_index)
            else:
                belong3.append(temp_index)
        # After the batch, move each center to the average of its old value
        # and the samples assigned to it.
        for k in belong1:
            center1 = center1 + train_input[k]
        center1 = center1 / (1 + len(belong1))
        for k in belong2:
            center2 = center2 + train_input[k]
        center2 = center2 / (1 + len(belong2))
        for k in belong3:
            center3 = center3 + train_input[k]
        center3 = center3 / (1 + len(belong3))

    # Evaluate on the test set after each epoch: for every center, collect the
    # true labels of the test samples assigned to it.
    b_1 = []
    b_2 = []
    b_3 = []
    for l in range(test_input.shape[0]):
        belong = 1
        dist_1 = np.sum(pow((center1 - test_input[l]), 2))
        temp_dist = dist_1
        dist_2 = np.sum(pow((center2 - test_input[l]), 2))
        dist_3 = np.sum(pow((center3 - test_input[l]), 2))
        if dist_2 < temp_dist:
            temp_dist = dist_2
            belong = 2
        if dist_3 < temp_dist:
            belong = 3
        if belong == 1:
            b_1.append(test_label[l][0])
        elif belong == 2:
            b_2.append(test_label[l][0])
        else:
            b_3.append(test_label[l][0])
    print()
    print('epoch : {} / 10' .format(epoch+1))
    print('center1: ', b_1)
    print('center2', b_2)
    print('center3: ', b_3)
Here is the output from my run of the program:
epoch : 1 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 2 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 3 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 4 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 5 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 6 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
epoch : 7 / 10
center1: [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1]
center2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
center3: [2, 2, 2, 2, 2]
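The results are quite good for such a small dataset: every setosa test sample (label 0) lands in its own cluster, and the only mistakes are four virginica samples (label 2) grouped with versicolor (label 1).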
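To put a number on "quite good", one option is cluster purity: the fraction of test samples that carry the majority label of their cluster. The sketch below copies the label lists from the run output; purity is a metric chosen here for illustration, not something the original script computes.

```python
from collections import Counter

# True labels per cluster, copied from the run output above.
clusters = {
    'center1': [2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1],
    'center2': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'center3': [2, 2, 2, 2, 2],
}

# For each cluster, count the samples that carry its most common label.
majority_total = sum(Counter(lbls).most_common(1)[0][1] for lbls in clusters.values())
total = sum(len(lbls) for lbls in clusters.values())
purity = majority_total / total
print(majority_total, total, round(purity, 3))  # 26 30 0.867
```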
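The iris dataset can be downloaded here: https://pan.baidu.com/s/1PpIwncqbtQbEuGKSxKyBMg (password: 7ins)
Note: before reading iris.csv I deleted its first row, which holds the column names; otherwise the file is a hassle to process.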
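If deleting the header by hand is inconvenient, `np.loadtxt` can skip it through its `skiprows` parameter. The sketch below uses a tiny in-memory stand-in for the file; the column names and the leading index column are assumptions, so the real file layout may differ.

```python
import io

import numpy as np

# A tiny stand-in for iris.csv WITH a header row. The column names and the
# leading index column are assumptions; the real file layout may differ.
csv_text = (
    'id,sepal_length,sepal_width,petal_length,petal_width,species\n'
    '1,5.1,3.5,1.4,0.2,setosa\n'
    '2,7.0,3.2,4.7,1.4,versicolor\n'
    '3,6.3,3.3,6.0,2.5,virginica\n'
)

# skiprows=1 drops the header line, so the file itself can stay untouched.
features = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1, usecols=(1, 2, 3, 4))
names = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1, usecols=5, dtype=str)
print(features.shape)  # (3, 4)
print(names)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file name.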
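Doing this little assignment was actually fun. I learned how to normalize data with NumPy, how to read a CSV file with NumPy, and how to convert the string class names in a CSV file into numeric labels.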
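Two of those pieces, min-max normalization and the label mapping, can be sketched in a reusable form. The function name `min_max_normalize`, the zero-range guard, and the toy values are assumptions added for illustration; the script itself divides by `z_max - z_min` directly and uses an if/elif chain for the labels.

```python
import numpy as np


def min_max_normalize(x):
    """Scale each column of x to [0, 1]; constant columns map to 0."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid dividing by zero
    return (x - x_min) / span


# Toy values assumed for illustration.
x = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
normalized = min_max_normalize(x)
print(normalized)  # columns become [0, 0.5, 1] and [0, 0, 0]

# Map class-name strings to integer labels with an explicit dictionary.
name_to_label = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
names = ['setosa', 'virginica', 'versicolor', 'setosa']
labels = np.array([name_to_label[n] for n in names])
print(labels)  # [0 2 1 0]
```

Unlike the `else` branch in the script, the dictionary lookup raises a `KeyError` on an unexpected class name, which makes a stray header row or a typo easier to spot.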