sklearn.cluster.k_means_._k

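Parameters

X: array or sparse matrix, shape (n_samples, n_features). The input must be np.float64.

n_clusters: integer, the number of seeds to choose.

x_squared_norms: array, shape (n_samples,). The squared Euclidean norm of each sample.

random_state: the random number generator used to draw the seeds.

n_local_trials: the number of candidate centers drawn at each step (except the first); the best of them is kept.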
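As a rough illustration of what these arguments look like, the sketch below builds them with NumPy alone. The toy data and the variable names are assumptions made for the example; the function itself expects the squared norms to be computed by the caller and passed in.

```python
import numpy as np

# Toy data (an assumption for this example): 6 samples, 2 features, float64.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]], dtype=np.float64)

n_clusters = 2

# Squared Euclidean norm of every row, shape (n_samples,).
x_squared_norms = (X ** 2).sum(axis=1)

# A seeded generator makes the choice of seeds reproducible.
random_state = np.random.RandomState(0)
```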

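Procedure

1. Get the shape of X: (n_samples, n_features).

2. Allocate centers with shape (n_clusters, n_features).

3. Check whether x_squared_norms is None; if it is, an AssertionError is raised.

4. If n_local_trials is None, compute a default. The value is then fixed: it is the number of candidate centers drawn each time a new center is chosen.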
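For reference, the scikit-learn source this walkthrough follows derives the default as 2 plus the natural logarithm of n_clusters, truncated to an integer. The helper name below is hypothetical; only the formula comes from the library.

```python
import math

def default_n_local_trials(n_clusters):
    # Hypothetical helper: the default grows logarithmically with n_clusters.
    return 2 + int(math.log(n_clusters))
```

With n_clusters = 8, for instance, this gives 4 candidates per step.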

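5. Pick the first center at random: random_state.randint(n_samples) returns its index, center_id.

6. Check whether X is a sparse matrix, convert X[center_id] to a uniform (dense) form if so, and assign it to centers[0].

7. Compute the squared Euclidean distance from every point in X to centers[0] and assign the result to closest_dist_sq. The distances come from euclidean_distances(), which uses dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)), where x and y are vectors; because squared distances are wanted here, the square root is skipped. This formulation has two advantages: (1) it is efficient for sparse matrices, and (2) when only one of the two vectors changes, the dot product of the other with itself can be reused. See the documentation of euclidean_distances() for details.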
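The expansion in step 7 can be sketched in plain NumPy as follows. This is not the library's euclidean_distances() itself; the function name is hypothetical, it handles dense input only, and the clipping of negative rounding errors is an assumption added for safety.

```python
import numpy as np

def squared_distances_to_center(X, center, x_squared_norms):
    # ||x - c||^2 = ||x||^2 - 2 * x.c + ||c||^2, evaluated for every row x of X.
    cross = X @ center                       # shape (n_samples,)
    d2 = x_squared_norms - 2.0 * cross + center @ center
    # Rounding can push tiny distances slightly below zero; clip them.
    return np.maximum(d2, 0.0)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
x_squared_norms = (X ** 2).sum(axis=1)       # computed once, reused for every center
closest_dist_sq = squared_distances_to_center(X, X[0], x_squared_norms)
```

Because x_squared_norms is computed once up front, each new center costs only one matrix-vector product plus the center's own norm.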

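8. Sum the squared distances from all points to centers[0] and assign the total to current_pot.

9. Choose the remaining n_clusters - 1 centers.

(1) Draw n_local_trials random values and assign them to rand_vals, where (0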

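(2) Take the cumulative sum of closest_dist_sq to obtain a running-total sequence, then look up the position at which each value in rand_vals would be inserted into that sequence. These positions are the indices of the candidate centers, n_local_trials in total.

(3) Compute the squared distances from every sample in X to each candidate center and assign them to distance_to_candidates.

(4) Now search for the best candidate. First define three variables: best_candidate, best_pot and best_dist_sq.

(5) For each candidate, update each sample's distance to its nearest center: take the element-wise minimum of closest_dist_sq and distance_to_candidates[trial] as new_dist_sq, then sum it to get new_pot. The candidate with the smallest new_pot becomes the next center.

(6) Whenever best_candidate is None or new_pot is smaller than best_pot, update best_candidate, best_pot and best_dist_sq. Once all candidates have been tried, this yields one center.

(7) Assign X[best_candidate] to centers[c], converting the row to dense form if X is a sparse matrix. Then update current_pot and closest_dist_sq.

(8) Repeat the steps above until all centers have been chosen.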
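The sub-steps of step 9 for a single new center can be sketched like this. It is a simplified illustration under stated assumptions, not the library code: X is dense, the function name pick_next_center is hypothetical, rand_vals is taken to be uniform random numbers scaled by the current potential, distances are computed directly rather than through euclidean_distances(), and the index clipping is an added guard against rounding.

```python
import numpy as np

def pick_next_center(X, closest_dist_sq, n_local_trials, random_state):
    """Return (index, new closest_dist_sq, new potential) for one new center."""
    current_pot = closest_dist_sq.sum()

    # (1)-(2): sample candidates with probability proportional to squared distance.
    rand_vals = random_state.random_sample(n_local_trials) * current_pot
    candidate_ids = np.searchsorted(np.cumsum(closest_dist_sq), rand_vals)
    # Guard against rounding at the upper end of the cumulative sum.
    candidate_ids = np.minimum(candidate_ids, len(closest_dist_sq) - 1)

    # (3): squared distances from every sample to every candidate,
    # shape (n_local_trials, n_samples).
    diff = X[candidate_ids][:, None, :] - X[None, :, :]
    distance_to_candidates = (diff ** 2).sum(axis=2)

    # (4)-(6): keep the candidate that yields the smallest potential.
    best_candidate, best_pot, best_dist_sq = None, None, None
    for trial in range(n_local_trials):
        new_dist_sq = np.minimum(closest_dist_sq, distance_to_candidates[trial])
        new_pot = new_dist_sq.sum()
        if best_candidate is None or new_pot < best_pot:
            best_candidate = candidate_ids[trial]
            best_pot = new_pot
            best_dist_sq = new_dist_sq

    return best_candidate, best_dist_sq, best_pot

# Usage on toy data: the first center is X[0], so the far-away pair is favoured.
rng = np.random.RandomState(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
closest_dist_sq = ((X - X[0]) ** 2).sum(axis=1)
center_id, new_closest_dist_sq, pot = pick_next_center(X, closest_dist_sq, 3, rng)
```

Points that are already close to a chosen center contribute little to the cumulative sum, so they are rarely drawn as candidates; trying several candidates and keeping the one with the lowest potential makes the choice less sensitive to a single unlucky draw.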

10. Return centers, which holds all the chosen centers.

sklearn.cluster.k_means_._k_init
