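In 1963, J. H. Ward proposed using the error sum of squares (ESS), or equivalently the amount of information lost, as the objective function that decides how small clusters should be merged, step by step, into larger ones. In his paper he argued that the error sum of squares after a merge should be as small as possible; in other words, the best merge is the one that loses the least information.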
(1) The first step in grouping is to select two of these n subsets which, when united, will reduce by one the number of subsets while producing the least impairment of the optimal value of the objective function.
(2) The n-1 resulting subsets then are examined to determine if a third member should be united with the first pair or another pairing made in order to secure the optimal value of the objective function for n-2 groups.
(3) This procedure can be continued until all n members of the original array are in one group.
(4) As each union is considered in turn, the value of the corresponding objective function is computed and hypothesized to be "equal to or better than" that of any preceding union.
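The four steps above can be sketched as a brute-force loop. This is a minimal illustration that assumes the objective is the total error sum of squares; the names `ess`, `merge_increase`, and `ward_greedy` are hypothetical helpers, not taken from Ward's paper.

```python
from itertools import combinations


def ess(cluster):
    """Error sum of squares of one cluster (a list of equal-length tuples)."""
    dim = len(cluster[0])
    means = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum((p[d] - means[d]) ** 2 for p in cluster for d in range(dim))


def merge_increase(a, b):
    """How much the total ESS grows if clusters a and b are united."""
    return ess(a + b) - ess(a) - ess(b)


def ward_greedy(points):
    """Unite clusters pairwise until one remains; return the merge history.

    Each history entry is (merged_cluster, total_ess_after_merge).
    """
    clusters = [[tuple(p)] for p in points]  # start with n singleton subsets
    history = []
    while len(clusters) > 1:
        # Steps (1)-(2): examine every pair, keep the least harmful union.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step (4): record the objective value after this union.
        history.append((merged, sum(ess(c) for c in clusters)))
    return history  # step (3): loop ends when all points share one group


history = ward_greedy([(0.0,), (1.0,), (5.0,), (6.0,)])
for merged, total in history:
    print(merged, total)
```

Re-evaluating every pair at every step is expensive, which is why update formulas that reuse the previous distances are useful in practice.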
In 1967, G. N. Lance and W. T. Williams, building on Ward's work, proposed five methods for computing cluster merges in their paper:
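Suppose cluster $S_k = S_p \cup S_q$, where $p$, $q$, and $k$ have sample sizes $n_p$, $n_q$, and $n_k$, so that $n_p + n_q = n_k$. If the centroid of cluster $p$ is $x_p$ and the centroid of cluster $q$ is $x_q$, then the centroid of cluster $k$ is $(n_p x_p + n_q x_q)/n_k$, and the distance between cluster $k$ and another cluster $h$ is: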
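In 1969, D. Wishart, building on the work above and drawing on the notion of information gain, proposed Ward's method, which is what linkage="ward" refers to today. The basic idea of the method is: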
Suppose the data form an $n \times m$ two-dimensional matrix:
The two clusters whose fusion results in the minimum increase in the error sum of squares are combined.
The first fusion clearly involves those two points which are closest.
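Let $X_{ijt}$ denote the value of the $i$-th variable for the $j$-th sample of cluster $S_t$, which contains $k_t$ samples, and let $\bar{X}_{it}$ denote the mean of the $i$-th variable over $S_t$. The error sum of squares of $S_t$ is then:

$$E_t = \sum_{i} \sum_{j=1}^{k_t} \left( X_{ijt} - \bar{X}_{it} \right)^2$$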
Once $S_q$ and $S_p$ have been merged into $S_k$, any further merge of $S_k$ with another cluster $S_h$ increases the objective function by:
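For squared Euclidean distance, the increase caused by uniting two clusters has a well-known closed form: the product of the two cluster sizes divided by their sum, times the squared distance between the two centroids. The sketch below checks this numerically against the direct definition of the error sum of squares. The function names are illustrative, and the closed form is stated here as a standard identity rather than as the exact expression used in the cited paper.

```python
def centroid(cluster):
    """Per-variable mean of a cluster given as a list of tuples."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def ess(cluster):
    """Error sum of squares: squared deviations from the cluster centroid."""
    c = centroid(cluster)
    return sum((p[d] - c[d]) ** 2 for p in cluster for d in range(len(c)))


def increase_direct(a, b):
    """ESS of the union minus the ESS of the two parts."""
    return ess(a + b) - ess(a) - ess(b)


def increase_closed_form(a, b):
    """n_a * n_b / (n_a + n_b) times the squared distance between centroids."""
    n_a, n_b = len(a), len(b)
    c_a, c_b = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(c_a, c_b))
    return n_a * n_b / (n_a + n_b) * sq_dist


s_k = [(0.0, 0.0), (2.0, 0.0)]   # hypothetical cluster S_k
s_h = [(5.0, 4.0)]               # hypothetical cluster S_h
print(increase_direct(s_k, s_h), increase_closed_form(s_k, s_h))
```

The closed form only needs the sizes and centroids of the two clusters, so the individual samples do not have to be revisited for every candidate merge.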
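In summary, all six of the methods above can be obtained from $d_{hk} = \alpha_p d_{hp} + \alpha_q d_{hq} + \beta d_{pq} + \gamma |d_{hp} - d_{hq}|$ by choosing different coefficients.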
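As a rough sketch, the recurrence can be written as one function plus a table of coefficients. The six method names used here (single, complete, average, centroid, median, ward) are the usual textbook set and are an assumption about which six the text means. The centroid, median, and ward rows assume that the $d$ values are squared Euclidean distances. `coefficients` and `update_distance` are hypothetical names.

```python
def coefficients(method, n_p, n_q, n_h):
    """Return (alpha_p, alpha_q, beta, gamma) for one linkage method."""
    n_k = n_p + n_q
    if method == "single":      # nearest neighbour: min(d_hp, d_hq)
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":    # furthest neighbour: max(d_hp, d_hq)
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":     # size-weighted group average
        return n_p / n_k, n_q / n_k, 0.0, 0.0
    if method == "centroid":    # assumes squared Euclidean distances
        return n_p / n_k, n_q / n_k, -n_p * n_q / n_k ** 2, 0.0
    if method == "median":      # assumes squared Euclidean distances
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":        # assumes squared Euclidean distances
        total = n_h + n_k
        return (n_h + n_p) / total, (n_h + n_q) / total, -n_h / total, 0.0
    raise ValueError(f"unknown method: {method}")


def update_distance(method, d_hp, d_hq, d_pq, n_p=1, n_q=1, n_h=1):
    """Distance from cluster h to the new cluster k = p U q."""
    a_p, a_q, beta, gamma = coefficients(method, n_p, n_q, n_h)
    return a_p * d_hp + a_q * d_hq + beta * d_pq + gamma * abs(d_hp - d_hq)


# Singletons on a line: p = 0, q = 2, h = 6, so the squared distances are
# d_hp = 36, d_hq = 16, d_pq = 4.
for name in ("single", "complete", "average", "centroid", "median", "ward"):
    print(name, update_distance(name, 36.0, 16.0, 4.0))
```

In this example the centroid row gives 25, the squared distance from h = 6 to the merged centroid at 1, which is a quick sanity check on the coefficients.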
