Clustering

Hierarchical clustering can be represented as a dendrogram (tree diagram).

Top-down (divisive): all points start in one cluster, which is then split level by level.

Bottom-up (agglomerative): each point starts as its own cluster, and clusters are then merged level by level.

Two distance functions are needed to quantify "similarity":

1) metric: measures similarity between individual points, e.g. an Lp norm or the angle between high-dimensional vectors.

2) linkage: measures similarity between clusters:

 2.1) complete linkage: max{d(x,y): x in A, y in B}

 2.2) single linkage: min{d(x,y): x in A, y in B}

 2.3) average linkage: Σ d(x,y) / (|A|·|B|), the mean of all distances between points across the two clusters

 The following criteria are still unclear to me:

  • The sum of all intra-cluster variance.
  • The increase in variance for the cluster being merged (Ward's criterion).
  • The probability that candidate clusters spawn from the same distribution function (V-linkage).
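The three pairwise criteria 2.1–2.3 can be written out directly. A minimal pure-Python sketch (Euclidean metric assumed; the function names are my own):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def complete_linkage(A, B):
    """2.1) max{d(x, y): x in A, y in B}."""
    return max(dist(x, y) for x in A for y in B)

def single_linkage(A, B):
    """2.2) min{d(x, y): x in A, y in B}."""
    return min(dist(x, y) for x in A for y in B)

def average_linkage(A, B):
    """2.3) sum of d(x, y) over all cross-cluster pairs, divided by |A|*|B|."""
    return sum(dist(x, y) for x in A for y in B) / (len(A) * len(B))
```

Any metric could be substituted for `dist`; the linkage only decides how the pointwise distances are aggregated into a cluster-to-cluster distance.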

    Each level of the tree corresponds to one clustering result, with its own number of clusters.
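This one-clustering-per-level idea can be seen in a tiny agglomerative loop (a sketch only: single linkage and Euclidean distance assumed, O(n^3) brute force, helper names are my own):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(A, B):
    """min{d(x, y): x in A, y in B}."""
    return min(dist(x, y) for x in A for y in B)

def agglomerate(points, linkage=single_linkage):
    """Bottom-up clustering: each point starts as its own cluster; at every
    step the two closest clusters (under the chosen linkage) are merged.
    Returns the clustering at every tree level: n clusters, n-1, ..., 1."""
    clusters = [[p] for p in points]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels
```

Cutting the dendrogram at level t of `levels` yields n - t clusters, so the desired cluster count selects the level.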

    http://en.wikipedia.org/wiki/Hierarchical_clustering

    Partition-based

    k-means: its strengths are that the algorithm is simple and fast, and it can handle large data sets.

    Its drawbacks: the result of each run is not necessarily the same, since it depends on the k randomly chosen initial centroids; it minimizes intra-cluster variance but does not guarantee the globally minimal variance; and it requires the mean to be well defined and meaningful, because centroids are computed as means. (When the mean is not meaningful, k-medoids can be used instead; it picks a medoid, an actual data point, as each cluster's center.)
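A minimal k-means sketch (pure Python; `kmeans` and its parameters are my own naming). Note that the result hinges on the random initial centroids and that convergence is only to a local optimum:

```python
import math
import random

def kmeans(points, k, iters=100, seed=None):
    """Minimal k-means sketch. Centroids are coordinate-wise means, so the
    metric must make averaging meaningful; otherwise use k-medoids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the result depends on this random init
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        # update step: move each centroid to the mean of its group
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:
            break  # converged -- to a local optimum, not necessarily the global one
        centroids = new
    return centroids, groups
```

Running with different seeds can give different partitions, which is exactly the sensitivity to initialization described above.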

    Fuzzy c-means: a point can belong to several clusters probabilistically (soft membership).
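A sketch of the standard fuzzy c-means updates (fuzzifier m = 2 by default; the naive deterministic initialization and the function name are my own, just for illustration):

```python
import math

def fuzzy_cmeans(points, c, m=2.0, iters=50):
    """Minimal fuzzy c-means sketch: u[i][j] is the degree to which point i
    belongs to cluster j, and each point's memberships sum to 1."""
    centers = [tuple(p) for p in points[:c]]  # naive deterministic init (a sketch)
    u = [[0.0] * c for _ in points]
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        for i, p in enumerate(points):
            d = [max(math.dist(p, ctr), 1e-12) for ctr in centers]  # avoid /0
            for j in range(c):
                u[i][j] = 1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
        # center update: mean of all points weighted by u_ij^m
        centers = [
            tuple(
                sum(u[i][j] ** m * points[i][dim] for i in range(len(points)))
                / sum(u[i][j] ** m for i in range(len(points)))
                for dim in range(len(points[0]))
            )
            for j in range(c)
        ]
    return centers, u
```

Unlike k-means, every point contributes to every center, weighted by its membership degree.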

    QT clustering (quality threshold). Algorithm:

    • The user chooses a maximum diameter for clusters.
    • Build a candidate cluster for each point by iteratively including the point that is closest to the group, until the diameter of the cluster surpasses the threshold.
    • Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. (Still to clarify: what happens if more than one candidate cluster has the maximum number of points?)
    • Recurse with the reduced set of points.

    The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
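The steps above can be sketched as follows (pure Python, Euclidean metric, distinct points assumed; ties on candidate size are broken arbitrarily here, which is exactly the open question noted above):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def point_to_group(p, group):
    """Complete linkage: the maximum distance from p to any member of group."""
    return max(dist(p, q) for q in group)

def qt_cluster(points, max_diameter):
    """QT clustering sketch following the steps above."""
    points = list(points)
    clusters = []
    while points:
        best = None
        for seed in points:
            # build a candidate cluster around each point ...
            candidate = [seed]
            remaining = [p for p in points if p != seed]
            # ... by repeatedly adding the closest remaining point,
            # stopping before the diameter surpasses the threshold
            while remaining:
                nearest = min(remaining, key=lambda p: point_to_group(p, candidate))
                if point_to_group(nearest, candidate) > max_diameter:
                    break
                candidate.append(nearest)
                remaining.remove(nearest)
            if best is None or len(candidate) > len(best):
                best = candidate
        clusters.append(best)
        # recurse with the reduced set of points
        points = [p for p in points if p not in best]
    return clusters
```

Because new points are admitted only while their complete-linkage distance stays within the threshold, every returned cluster has diameter at most `max_diameter`.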

    spectral clustering:
