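The idea is as follows: pick N centroids at random, compute the distance from every sample to every centroid, and assign each sample to its nearest centroid. Then recompute each centroid as the mean of the samples assigned to it, and repeat these steps until the result stabilises.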
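As a rough illustration of a single assign-and-update step on one-dimensional data, here is a minimal NumPy sketch. The sample values, the starting centroids, and the name `kmeans_step` are made up for this example, and absolute difference is assumed as the distance.

```python
import numpy as np


def kmeans_step(samples, centroids):
    """One k-means iteration on 1-D data (illustrative helper, not from the post)."""
    samples = np.asarray(samples, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Distance from every sample to every centroid: shape (num_samples, num_centroids).
    distances = np.abs(samples[:, None] - centroids[None, :])
    # Each sample joins the cluster of its nearest centroid.
    labels = distances.argmin(axis=1)
    # New centroid = mean of its members; an empty cluster keeps its old centroid.
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = samples[labels == k]
        if members.size > 0:
            new_centroids[k] = members.mean()
    return new_centroids, labels


samples = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # made-up values
new_centroids, labels = kmeans_step(samples, centroids=[1.0, 12.0])
print(new_centroids, labels)
```

With these values the first three samples fall to the first centroid and the last three to the second, so the centroids move to the means of those two groups.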
import numpy as np
import matplotlib.pyplot as plt


def compute_centroids(list_path, n, loss_convergence, iterations_num):
    # Read one area value per line, skipping blank lines.
    area_list = []
    with open(list_path) as f:
        for line in f:
            temp = line.strip()
            if len(temp) >= 1:
                area_list.append(float(temp))

    # Pick n distinct samples as the initial centroids
    # (replace=False avoids drawing the same index twice).
    centroid_indices = np.random.choice(len(area_list), n, replace=False)
    centroids = []
    for centroid_index in centroid_indices:
        centroids.append(area_list[centroid_index])

    # The first pass provides the reference loss for the convergence test.
    centroids, groups, old_loss = do_kmeans(n, area_list, centroids)
    iterations = 1
    i = 0
    while True:
        i = 1 + i
        centroids, groups, loss = do_kmeans(n, area_list, centroids)
        iterations = iterations + 1
        print("number:", i, "************loss = %f" % loss)
        # Stop when the loss has stabilised or the iteration budget is used up.
        if abs(old_loss - loss) < loss_convergence or iterations > iterations_num:
            break
        old_loss = loss
        for centroid in centroids:
            print(centroid, '\n')

    # print result
    j = 0
    groups_list = []
    print("--------------------------------")
    for centroid in centroids:
        print("k-means result:")
        print(centroid, "-------numbers:", len(groups[j]))
        groups_list.append(len(groups[j]))
        j = j + 1
    print("--------------------------------")
    plot_(centroids, groups_list)
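This is the code that computes the centroids. It opens the file at list_path, reads it line by line, and stores the numbers in a list. It then randomly picks n values from that list as the initial centroids and passes the number of centroids, the list, and the centroid values to do_kmeans to start iterating. do_kmeans is called repeatedly, and the loop ends in one of two ways: either the loss changes by less than loss_convergence between two iterations, or the number of iterations exceeds iterations_num.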
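One detail of the random initialisation is worth a short sketch: np.random.choice samples with replacement unless replace=False is passed, so the same index can otherwise be drawn twice and two centroids can start at the same point. The area values below are made up for illustration.

```python
import numpy as np

area_list = [4.0, 9.0, 16.0, 25.0, 36.0]  # made-up sample areas
n = 3

# replace=False guarantees n distinct indices (it requires n <= len(area_list)).
centroid_indices = np.random.choice(len(area_list), n, replace=False)
centroids = [area_list[i] for i in centroid_indices]
print(centroids)
```

Distinct indices still do not guarantee distinct values if the data itself contains duplicates, so this reduces the risk of identical starting centroids rather than removing it.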
def do_kmeans(n, area_list, centroids):
    loss = 0
    groups = []
    new_centroids = []
    for i in range(n):
        groups.append([])
        new_centroids.append(0)
    # Assign every sample to its nearest centroid.
    for area in area_list:
        min_distance = float('inf')
        group_index = 0
        for centroid_index, centroid in enumerate(centroids):
            # dis() is not shown in this post; it returns the distance between two values.
            distance = dis(area, centroid)
            if distance < min_distance:
                min_distance = distance
                group_index = centroid_index
        groups[group_index].append(area)
        loss += min_distance
        new_centroids[group_index] += area
    # Turn each running sum into a mean.
    for i in range(n):
        print(new_centroids[i])
        print(len(groups[i]))
        if groups[i]:
            new_centroids[i] /= len(groups[i])
        else:
            # Empty cluster: keep the previous centroid instead of dividing by zero.
            new_centroids[i] = centroids[i]
    return new_centroids, groups, loss
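do_kmeans is the function that does the actual computation. Its parameters are the number of centroids, the data, and the current centroid values.
The loss variable is used to evaluate the quality of the clustering. For each sample, the function computes the distance to every centroid; whenever a distance is smaller than the smallest one seen so far, that centroid becomes the sample's candidate cluster. Once the nearest centroid is known, the sample is added to its cluster and the current min_distance is added to the loss.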
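The helper dis called in do_kmeans, and the sort key take_first used by the plotting function, are not defined anywhere in this post. A minimal sketch of what they might look like follows, assuming the data are scalar areas compared by absolute difference and that take_first simply returns the first item of a pair. Both bodies are guesses, not the original implementations.

```python
def dis(area, centroid):
    """Assumed distance for scalar areas: absolute difference."""
    return abs(area - centroid)


def take_first(elem):
    """Assumed sort key: the first item of a (centroid, count) pair."""
    return elem[0]


print(dis(30.0, 42.5))

pairs = [(50, 3), (10, 7), (30, 2)]   # made-up (centroid, count) pairs
pairs.sort(key=take_first)
print(pairs)
```

With these definitions, sorting by take_first orders the bars in the chart by centroid value.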
The plotting function:
def plot_(centroids, groups_list):
    # Truncate the centroids to integers so they can serve as tick labels.
    for i in range(len(centroids)):
        centroids[i] = int(centroids[i])
    # Sort the (centroid, count) pairs by centroid value.
    # take_first is not shown in this post; it presumably returns the first element of a pair.
    z = list(zip(centroids, groups_list))
    z.sort(key=take_first)
    for i in range(len(centroids)):
        centroids[i] = z[i][0]
        groups_list[i] = z[i][1]
    plt.bar(np.arange(len(centroids)), groups_list, align='center', color='c', tick_label=centroids)
    plt.title('Distribution', fontsize=20)
    plt.xlabel('Area_Categories', fontsize=18)
    plt.ylabel('Numbers', fontsize=18)
    plt.tick_params(axis='both', labelsize=14)
    plt.tight_layout()
    # Write each count above its bar.
    for a, b in zip(np.arange(len(centroids)), groups_list):
        plt.text(a, b, str(b), ha='center', va='baseline', fontsize=14, fontstyle='italic')
    plt.show()