Learning to Program k-means Clustering

The idea: first pick N centroids at random and compute each sample's distance to every centroid. Assign each sample to its nearest centroid, then recompute each centroid as the mean of the samples assigned to it, and repeat these two steps until convergence.
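To make the two steps concrete, here is a minimal sketch of a single assignment-and-update pass on toy one-dimensional data; the numbers and names are purely illustrative and are not part of the code that follows.

samples = [1.0, 2.0, 9.0, 10.0]   # toy 1-D data (illustrative only)
centroids = [2.0, 9.0]            # two randomly chosen starting centroids

# Assignment step: each sample goes to its nearest centroid.
clusters = [[], []]
for x in samples:
    distances = [abs(x - c) for c in centroids]
    clusters[distances.index(min(distances))].append(x)

# Update step: each centroid becomes the mean of its cluster.
centroids = [sum(c) / len(c) for c in clusters]
print(clusters)    # [[1.0, 2.0], [9.0, 10.0]]
print(centroids)   # [1.5, 9.5]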

import numpy as np
import matplotlib.pyplot as plt


def compute_centroids(list_path, n, loss_convergence, iterations_num):
    # Read one number per line from the file into a list.
    area_list = []
    f = open(list_path)
    for line in f:
        temp = line.strip()
        if len(temp) >= 1:
            area_list.append(float(temp))
    f.close()
    # Randomly pick n distinct samples as the initial centroids.
    centroid_indices = np.random.choice(len(area_list), n, replace=False)
    centroids = []
    for centroid_index in centroid_indices:
        centroids.append(area_list[centroid_index])
    # One initial pass, then iterate until the loss stops changing
    # (within loss_convergence) or the iteration budget is used up.
    centroids, groups, old_loss = do_kmeans(n, area_list, centroids)
    iterations = 1
    while True:
        centroids, groups, loss = do_kmeans(n, area_list, centroids)
        iterations = iterations + 1
        print("iteration:", iterations, "************ loss = %f" % loss)
        if abs(old_loss - loss) < loss_convergence or iterations > iterations_num:
            break
        old_loss = loss
        for centroid in centroids:
            print(centroid, '\n')
    # Print the final centroids and the size of each cluster.
    groups_list = []
    print("--------------------------------")
    print("k-means result:")
    for j, centroid in enumerate(centroids):
        print(centroid, "------- numbers:", len(groups[j]))
        groups_list.append(len(groups[j]))
    print("--------------------------------")
    plot_(centroids, groups_list)

compute_centroids opens the file at list_path, reads it line by line, stores the numbers as a list, and randomly selects n of them as the initial centroids. It then passes the number of clusters, the data list, and the centroid values to do_kmeans and iterates. The loop stops in one of two ways: either the change in loss falls below loss_convergence, or the maximum number of iterations iterations_num is reached.
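A minimal usage sketch is shown below; the file name and parameter values are hypothetical, and the call assumes the do_kmeans, plot_, and helper functions shown later in this post are already defined.

# Hypothetical call: cluster the areas in areas.txt (one number per line)
# into 5 groups, stopping once the loss changes by less than 1e-6
# or after at most 100 iterations.
compute_centroids("areas.txt", n=5, loss_convergence=1e-6, iterations_num=100)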

def do_kmeans(n, area_list, centroids):
    loss = 0
    groups = []
    new_centroids = []
    for i in range(n):
        groups.append([])
        new_centroids.append(0)
    # Assignment step: put each sample into the cluster of its nearest centroid.
    for area in area_list:
        min_distance = float('inf')
        group_index = 0
        for centroid_index, centroid in enumerate(centroids):
            distance = dis(area, centroid)
            if distance < min_distance:
                min_distance = distance
                group_index = centroid_index
        groups[group_index].append(area)
        loss += min_distance
        new_centroids[group_index] += area
    # Update step: each new centroid is the mean of its assigned samples;
    # keep the old centroid if a cluster ended up empty.
    for i in range(n):
        if len(groups[i]) > 0:
            new_centroids[i] /= len(groups[i])
        else:
            new_centroids[i] = centroids[i]
    return new_centroids, groups, loss

do_kmeans does the actual clustering work; its parameters are the number of centroids, the data list, and the current centroid values.

A loss variable is kept to evaluate the clustering quality. For each sample, the distance to every centroid is computed; whenever a distance is smaller than the current minimum, the sample is reassigned to that centroid's cluster. The smallest distance found is added to the loss, and after all samples are assigned, each centroid is recomputed as the mean of the samples in its cluster.
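The dis function called inside do_kmeans is not shown in the original post; for one-dimensional data a plausible definition (an assumption, not the author's original helper) is simply the absolute difference:

def dis(a, b):
    # Assumed helper: distance between two scalar values.
    return abs(a - b)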

The plotting function:

def plot_(centroids, groups_list):
    # Round the centroids to integers so they can serve as tick labels.
    for i in range(len(centroids)):
        centroids[i] = int(centroids[i])
    # Sort the (centroid, cluster size) pairs by centroid value so the bars appear in order.
    z = list(zip(centroids, groups_list))
    z.sort(key=take_first)
    for i in range(len(centroids)):
        centroids[i] = z[i][0]
        groups_list[i] = z[i][1]
    plt.bar(np.arange(len(centroids)), groups_list, align='center', color='c', tick_label=centroids)
    plt.title('Distribution ', fontsize=20)
    plt.xlabel('Area_Categories ', fontsize=18)
    plt.ylabel('Numbers', fontsize=18)
    plt.tick_params(axis='both', labelsize=14)
    plt.tight_layout()
    for a, b in zip(np.arange(len(centroids)), groups_list):
        plt.text(a, b, b, ha='center', va='baseline', fontsize=14, fontstyle='italic')
    plt.show()
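The take_first key function used in the sort above is also not shown in the original post; a minimal definition consistent with how it is used (again, an assumption) would be:

def take_first(elem):
    # Assumed helper: sort key that returns the first element of a (centroid, count) pair.
    return elem[0]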
