Algorithm Design & Analysis: Single-linkage Clustering Algorithm & Proof

Hi peers,

In this essay, I will talk about the single-linkage clustering (SLC) algorithm. I will first define the clustering problem and introduce the concepts it relies on. I will then give pseudocode for the SLC algorithm, and finally prove that the algorithm is correct.

Problem Definition – Clustering Algorithm
Intuition: Given a set of points P: {p_1, p_2, …, p_n} and a predefined integer k, we want to output a set of k clusters C: {C_1, C_2, …, C_k} such that the spacing of this set of clusters is maximized.
Important Concepts:

  1. Cluster: a group of points
  2. Distance between points: the Euclidean distance between two points. D: {d_12, d_13, …, d_(n-1)n} denotes the set of distances between every pair of points in P, and d_ij denotes the distance between point p_i and point p_j.
  3. Distance between two clusters C_m and C_n: the minimum distance over all pairs of points p_x, p_y, where p_x belongs to C_m and p_y belongs to C_n.
  4. Spacing: the minimum distance over all pairs of clusters in C (a short Python sketch of these quantities follows this list).
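
To make these definitions concrete, here is a minimal Python sketch of how these quantities could be computed. The point coordinates and the example clustering below are made-up illustrative values, and helper names such as cluster_distance and spacing are my own:

import math
from itertools import combinations

def cluster_distance(A, B):
    # Distance between two clusters: the minimum point-to-point distance
    # over all pairs with one point in A and one point in B.
    return min(math.dist(a, b) for a in A for b in B)

def spacing(clusters):
    # Spacing: the minimum distance over all pairs of distinct clusters.
    return min(cluster_distance(A, B) for A, B in combinations(clusters, 2))

# Illustrative example with made-up points:
example = [[(0, 0), (1, 0)], [(5, 0), (6, 1)], [(0, 9)]]
print(spacing(example))  # the smallest gap between any two of the three clusters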

Pseudocode:

Sort D in ascending order;
Initialize X so that every point p_i starts as its own cluster (X.size() == n);
While(X.size() > k)
	d_ij = next smallest unprocessed distance in D;
	if p_i and p_j are not in the same cluster:
		merge the cluster containing p_i with the cluster containing p_j;
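
The pseudocode translates into a small runnable Python sketch, shown below. This is only an illustration under my own naming (single_linkage_clusters is not a library function); it uses a union-find structure to keep track of which cluster each point currently belongs to:

import math
from itertools import combinations

def single_linkage_clusters(points, k):
    # Merge the closest pair of points in different clusters first,
    # Kruskal-style, until only k clusters remain.
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, sorted in ascending order (the set D).
    dists = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    clusters = len(points)              # every point starts as its own cluster
    for d, i, j in dists:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                    # only merge points in different clusters
            parent[ri] = rj
            clusters -= 1

    # Group point indices by their cluster representative.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Illustrative usage with made-up points; expect three clusters.
pts = [(0, 0), (1, 0), (5, 0), (6, 1), (0, 9)]
print(single_linkage_clusters(pts, k=3))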

Proof of Correctness:
To prove that the above algorithm outputs a set of clusters whose spacing is maximized, we need the following fact:
Fact 1: If two points p_x and p_y are merged directly at any point during the run of the algorithm, then their distance d_xy is at most S_C, the spacing of the output set of clusters C.
This holds because the algorithm processes distances in ascending order. By definition, S_C is the smallest distance between two points that end up in different output clusters, and the algorithm never merges such a pair (clusters only grow, so two points that are ever merged stay together until the end). Every distance that triggers a merge therefore comes no later in the sorted order than the distance realizing S_C, and so d_xy <= S_C.

With this fact in hand, we can set about proving the correctness of the algorithm. We first introduce some notation.

C: {C_1, C_2, …, C_k} denotes the set of clusters that the algorithm outputs;
C’: {C’_1, C’_2, …, C’_k} denotes an arbitrary set of k clusters that is different from C;
S_C: the spacing of C;
S_C’: the spacing of C’.

Because C’ ≠ C, there exist two points p_x and p_y that belong to the same cluster in C but to two different clusters in C’, say C’_x and C’_y respectively. (If no such pair existed, every cluster of C would be contained in some cluster of C’; since both partitions consist of exactly k non-empty clusters, they would have to be identical.)

Because p_x and p_y belong to the same cluster in C, they were merged at some point while the algorithm was running. There are two possible cases for how they can be merged.
Case 1: p_x and p_y are merged directly.
Because they are merged directly, by Fact 1,

               d_xy <= S_C				(1) 

Also, p_x and p_y form one of the pairs of points with one point in C’_x and the other in C’_y, and the distance between two clusters is defined as the minimum over all such pairs, so the distance between C’_x and C’_y is at most d_xy:

			Distance(C’_x, C’_y) <= d_xy     	(2)

Moreover, C’_x and C’_y are just one pair of clusters in C’, and the spacing S_C’ is the minimum distance over all pairs of clusters in C’. Therefore S_C’ is at most the distance between C’_x and C’_y:

			S_C’ <= Distance(C’_x, C’_y)		(3)

Combining inequalities (1), (2), and (3), we have:

			S_C’ <= Distance(C’_x, C’_y) <= d_xy <= S_C

Case 2: p_x and p_y are merged indirectly.
Suppose p_x and p_y end up in the same cluster through merges that involve other points. The following is a sketch of such a merging process:

		p_x – p_k – p_k+1 – p_k+2 – … – p_k+n – p_y

The above represents a chain of direct merges: every pair of adjacent points in the chain is merged directly, and through this series of direct merges p_x and p_y end up in the same cluster of C.
Notice, however, that p_x and p_y reside in different clusters of C’. Since the chain starts at p_x in C’_x and ends at p_y in C’_y, the chain must break somewhere in the middle: there must be two adjacent points in it, call them p_a and p_b, that lie in different clusters of C’.
But p_a and p_b were merged directly by the algorithm, so Case 2 reduces to Case 1 applied to this pair.
Thus, S_C’ <= d_ab <= S_C.

Now we combine the conclusions of Case 1 and Case 2: for any set of k clusters C’ that differs from the output set C, we have S_C’ <= S_C. Thus S_C is the maximum spacing over all possible sets of k clusters, and the algorithm indeed maximizes the spacing of its output. Q.E.D.
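
If you would like an empirical sanity check on top of the proof, here is a small brute-force sketch. It reuses the spacing and single_linkage_clusters helpers defined in the earlier snippets, enumerates every partition of a tiny made-up point set into k clusters, and confirms that none of them has a larger spacing than the SLC output:

from itertools import product

def partitions_into_k(items, k):
    # Brute force: try every labelling of the items with k labels and keep
    # only those labellings that use all k labels, i.e. every partition
    # of the items into exactly k non-empty groups.
    for labels in product(range(k), repeat=len(items)):
        if len(set(labels)) != k:
            continue
        groups = [[] for _ in range(k)]
        for item, label in zip(items, labels):
            groups[label].append(item)
        yield groups

pts = [(0, 0), (1, 0), (5, 0), (6, 1), (0, 9)]
k = 3
slc = [[pts[i] for i in group] for group in single_linkage_clusters(pts, k)]
best = max(spacing(p) for p in partitions_into_k(pts, k))
# The SLC spacing should equal the best spacing over all partitions.
assert abs(spacing(slc) - best) < 1e-9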

Through the above journey, you might notice that single-linkage clustering (SLC) is very similar to Kruskal’s algorithm. Indeed, we can view the SLC algorithm as Kruskal’s algorithm terminated early, as soon as only k components remain. The subtle differences between the two algorithms are left for your own exploration. Hope you enjoyed this one. Thanks.
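
To make that connection concrete, here is a hedged sketch of the minimum-spanning-tree view (using the networkx library, which is an assumption of mine; any MST routine would do): build an MST on the complete distance graph, delete the k-1 heaviest tree edges, and read off the connected components. When all pairwise distances are distinct, this yields the same partition as the greedy merging above:

import math
from itertools import combinations
import networkx as nx  # assumed to be installed

def slc_via_kruskal(points, k):
    # Equivalent view of SLC: minimum spanning tree, then cut the k-1 heaviest edges.
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        G.add_edge(i, j, weight=math.dist(points[i], points[j]))
    mst = nx.minimum_spanning_tree(G)
    if k > 1:
        heaviest = sorted(mst.edges(data="weight"), key=lambda e: e[2])[-(k - 1):]
        mst.remove_edges_from((u, v) for u, v, _ in heaviest)
    return [sorted(c) for c in nx.connected_components(mst)]

# Illustrative usage; should match the single_linkage_clusters output above.
pts = [(0, 0), (1, 0), (5, 0), (6, 1), (0, 9)]
print(slc_via_kruskal(pts, k=3))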

Best,
Ben
