Mean-Shift Clustering Algorithm
As discussed earlier, Mean-Shift is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it makes no assumptions about the shape or number of the clusters; hence it is a non-parametric algorithm.
The Mean-Shift algorithm assigns the datapoints to clusters iteratively by shifting candidate points towards the region of highest density of datapoints, i.e. the cluster centroid.
The difference between the K-Means algorithm and Mean-Shift is that the latter does not need the number of clusters to be specified in advance, because the number of clusters is determined by the algorithm with respect to the data.
We can understand the working of the Mean-Shift clustering algorithm with the help of the following steps −
Step 1 − First, start with each data point assigned to a cluster of its own.
Step 2 − Next, the algorithm computes the centroids.
Step 3 − In this step, the locations of the new centroids are updated.
Step 4 − Now, the process is iterated and the centroids move towards the regions of higher density.
Step 5 − Finally, the algorithm stops once the centroids reach a position from which they cannot move further.
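The steps above can be sketched in plain NumPy. This is an illustrative flat-kernel implementation, not scikit-learn's; the function name, parameters, and merging rule are assumptions made for the sketch.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, max_iter=100, tol=1e-4):
    """Illustrative flat-kernel mean shift; returns the density modes."""
    # Step 1: every data point starts as its own candidate centroid
    centroids = X.astype(float).copy()
    for _ in range(max_iter):
        new_centroids = np.empty_like(centroids)
        for i, c in enumerate(centroids):
            # Steps 2-3: the new centroid is the mean of all points
            # inside the flat-kernel window of radius `bandwidth`
            in_window = X[np.linalg.norm(X - c, axis=1) < bandwidth]
            new_centroids[i] = in_window.mean(axis=0)
        # Step 5: stop once no centroid can move further
        if np.linalg.norm(new_centroids - centroids, axis=1).max() < tol:
            centroids = new_centroids
            break
        # Step 4: otherwise iterate, moving towards higher density
        centroids = new_centroids
    # Merge candidates that converged to (nearly) the same mode
    modes = []
    for c in centroids:
        if not any(np.linalg.norm(c - m) < bandwidth / 2 for m in modes):
            modes.append(c)
    return np.array(modes)

# Two well-separated synthetic blobs as a quick check
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])
modes = mean_shift(X, bandwidth=1.5)
print(modes)  # two modes, one near (0, 0) and one near (5, 5)
```

Each candidate follows the density gradient of its window until it settles on a mode; candidates that land on the same mode form one cluster.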
The following is a simple example to understand how the Mean-Shift algorithm works. In this example, we first generate a dataset containing three blobs, and then apply the Mean-Shift algorithm to see the result.
%matplotlib inline
import numpy as np
from sklearn.cluster import MeanShift
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.datasets import make_blobs
centers = [[3,3,3],[4,5,5],[3,10,10]]
X, _ = make_blobs(n_samples = 700, centers = centers, cluster_std = 0.5)
plt.scatter(X[:,0], X[:,1])   # plot the first two of the three dimensions
plt.show()
ms = MeanShift()
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Estimated clusters:", n_clusters_)
colors = 10*['r.','g.','b.','c.','k.','y.','m.']
for i in range(len(X)):
   plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 3)
plt.scatter(cluster_centers[:,0],cluster_centers[:,1],
marker=".",color='k', s=20, linewidths = 5, zorder=10)
plt.show()
Output
[[ 2.98462798 9.9733794 10.02629344]
[ 3.94758484 4.99122771 4.99349433]
[ 3.00788996 3.03851268 2.99183033]]
Estimated clusters: 3
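Once fitted, the model can also label unseen points with `predict`, which assigns each point to the nearest learned mode. A short sketch using the same blobs as above (the query points below are made up for illustration, and a seed is added so the run is repeatable):

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Same three blob centers as in the example above
X, _ = make_blobs(n_samples=700,
                  centers=[[3, 3, 3], [4, 5, 5], [3, 10, 10]],
                  cluster_std=0.5, random_state=0)
ms = MeanShift().fit(X)

# Hypothetical query points placed near two different blob centers
new_points = np.array([[3.0, 3.0, 3.0], [3.0, 10.0, 10.0]])
print(ms.predict(new_points))  # two different cluster labels
```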
The following are some advantages of the Mean-Shift clustering algorithm −
It does not need to make any model assumptions, unlike K-means or Gaussian mixture models.
It can also model complex clusters that have non-convex shapes.
It needs only one parameter, the bandwidth, which then determines the number of clusters automatically.
Unlike K-means, it does not suffer from the problem of local minima.
Outliers do not cause problems for it.
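In practice the bandwidth is usually not hand-tuned: scikit-learn provides an `estimate_bandwidth` helper that picks a value from the data, which is also what `MeanShift` does internally when no bandwidth is given. A quick sketch, on made-up two-blob data:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [6, 6]],
                  cluster_std=0.6, random_state=0)

# Estimate a bandwidth from pairwise distances; `quantile` trades
# fewer, larger clusters (high) against more, smaller ones (low)
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=200,
                               random_state=0)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
print(bandwidth, len(ms.cluster_centers_))
```

`bin_seeding=True` seeds the search from a coarse grid instead of every point, which speeds up fitting on larger datasets.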
The following are some disadvantages of the Mean-Shift clustering algorithm −
The Mean-Shift algorithm does not work well in high dimensions, where the number of clusters can change abruptly.
We do not have any direct control over the number of clusters, but in some applications we need a specific number of clusters.
It cannot differentiate between meaningful and meaningless modes.
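A common workaround for the lack of direct control over the cluster count is to sweep the bandwidth and pick the value that yields the desired number of clusters: smaller bandwidths produce more modes, larger ones fewer. A rough sketch, with made-up data:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 8]],
                  cluster_std=0.6, random_state=0)

# Try a narrow and a very wide window and record the cluster counts
counts = {}
for bw in (1.5, 20.0):
    labels = MeanShift(bandwidth=bw).fit(X).labels_
    counts[bw] = len(np.unique(labels))
print(counts)  # a window wider than the whole dataset merges everything
```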
Translated from: https://www.tutorialspoint.com/machine_learning_with_python/clustering_algorithms_mean_shift_algorithm.htm