College of Electrical and Information Engineering
Translation of Foreign Literature
English title: Data Mining - Clustering
Translated title: 数据挖掘—聚类分析 (Data Mining - Cluster Analysis)
Major: Automation
Name: ****
Class and student ID: ****
Supervisor: ******
Source: Data Mining, by Ian H. Witten and Eibe Frank
April 26, 2010
Clustering
5.1 INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.
As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.
Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.
When clustering is applied to a real-world database, many interesting problems occur:
• Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
• Dynamic data in the database implies that cluster membership may change over time.
• Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
• There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
• Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.
We can then summarize some basic features of clustering (as opposed to classification):
• The (best) number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, $k$. The actual content (and interpretation) of each cluster, $K_j$, $1 \le j \le k$, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as a set of clusters being created: $K = \{K_1, K_2, \ldots, K_k\}$.
DEFINITION 5.1. Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an integer value $k$, the clustering problem is to define a mapping $f: D \rightarrow \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. A cluster $K_j$ contains precisely those tuples mapped to it; that is, $K_j = \{t_i \mid f(t_i) = K_j, 1 \le i \le n, \text{ and } t_i \in D\}$.
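To make Definition 5.1 concrete, the following minimal sketch (in Python, with purely illustrative data and function names) represents a clustering as a mapping from tuples to one of $k$ labels and recovers each cluster $K_j$ from that mapping.

```python
# A minimal sketch of Definition 5.1: a clustering is a mapping f from tuples
# to one of k cluster labels; each cluster K_j collects the tuples mapped to it.
# Data and names here are illustrative only.

def clusters_from_mapping(D, f, k):
    """Build K = {K_1, ..., K_k} from a database D and a mapping f: D -> {1..k}."""
    K = {j: [] for j in range(1, k + 1)}
    for t in D:
        K[f(t)].append(t)        # each t_i is assigned to exactly one cluster K_j
    return K

# Example: three 1-D "tuples" assigned to k = 2 clusters by a simple threshold rule.
D = [1.0, 1.2, 9.7]
f = lambda t: 1 if t < 5 else 2
print(clusters_from_mapping(D, f, 2))   # {1: [1.0, 1.2], 2: [9.7]}
```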
A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.
The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.
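As a rough illustration of the bottom-up (agglomerative) style, the sketch below starts with every item in its own cluster and repeatedly merges the two closest clusters until one remains; single link is assumed here only for concreteness, and all names and data are illustrative rather than taken from the text.

```python
# Sketch of an agglomerative (bottom-up) hierarchical clustering pass.
def single_link(A, B, dis):
    # smallest element-to-element distance between two clusters
    return min(dis(a, b) for a in A for b in B)

def agglomerate(points, dis):
    """Yield each level of the hierarchy, from all-singletons up to one cluster."""
    clusters = [[p] for p in points]          # lowest level: each item in its own cluster
    yield [list(c) for c in clusters]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]], dis),
        )
        clusters[i].extend(clusters.pop(j))   # merge the two closest clusters (j > i)
        yield [list(c) for c in clusters]

if __name__ == "__main__":
    dis = lambda a, b: abs(a - b)             # simple 1-D distance for the example
    for level in agglomerate([1.0, 1.5, 5.0, 5.2], dis):
        print(level)
```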
We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.
5.2 SIMILARITY AND DISTANCE MEASURES
There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, $\mathrm{sim}(t_i, t_l)$, defined between any two tuples, $t_i, t_l \in D$. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.
A distance measure, $\mathrm{dis}(t_i, t_j)$, as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster $K_j$, $\forall t_{jl}, t_{jm} \in K_j$ and $t_i \notin K_j$, $\mathrm{dis}(t_{jl}, t_{jm}) \le \mathrm{dis}(t_{jl}, t_i)$.
Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster $K_m$ of $N$ points $\{t_{m1}, t_{m2}, \ldots, t_{mN}\}$, we make the following definitions [ZRL96]:

$$\text{centroid} = C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$$

$$\text{radius} = R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$$

$$\text{diameter} = D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N(N-1)}}$$
Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, while the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation $M_m$ to indicate the medoid for cluster $K_m$.
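As a small, self-contained sketch of these characteristic values (assuming Euclidean distance over numeric tuples and made-up 2-D data), the centroid, radius, diameter, and a medoid of a toy cluster can be computed as follows.

```python
# Centroid, radius, diameter (in the [ZRL96] sense), and medoid of a numeric cluster.
import math

def euclidean(a, b):
    # Euclidean distance is a metric, so the triangle inequality holds
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n, dim = len(points), len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))

def radius(points):
    c = centroid(points)
    return math.sqrt(sum(euclidean(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    n = len(points)
    return math.sqrt(
        sum(euclidean(p, q) ** 2 for p in points for q in points) / (n * (n - 1))
    )

def medoid(points):
    # the actual cluster member with the smallest total distance to the others
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

K_m = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (6.0, 6.0)]
print(centroid(K_m), radius(K_m), diameter(K_m), medoid(K_m))
```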
Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters $K_i$ and $K_j$, there are several standard alternatives to calculate the distance between clusters. A representative list follows (a short code sketch after the list illustrates these alternatives):
Single link: Smallest distance between an element in one cluster and an element in the other. We thus have $\mathrm{dis}(K_i, K_j) = \min\{\mathrm{dis}(t_{il}, t_{jm}) \mid t_{il} \in K_i \text{ and } t_{jm} \in K_j\}$.
Complete link: Largest distance between an element in one cluster and an element in the other. We thus have $\mathrm{dis}(K_i, K_j) = \max\{\mathrm{dis}(t_{il}, t_{jm}) \mid t_{il} \in K_i \text{ and } t_{jm} \in K_j\}$.
Average: Average distance between an element in one cluster and an element in the other. We thus have $\mathrm{dis}(K_i, K_j) = \mathrm{mean}\{\mathrm{dis}(t_{il}, t_{jm}) \mid t_{il} \in K_i \text{ and } t_{jm} \in K_j\}$.
Centroid: If clusters have representative centroids, then the centroid distance is defined as the distance between the centroids. We thus have $\mathrm{dis}(K_i, K_j) = \mathrm{dis}(C_i, C_j)$, where $C_i$ is the centroid for $K_i$ and similarly for $C_j$.
Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: $\mathrm{dis}(K_i, K_j) = \mathrm{dis}(M_i, M_j)$.
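The following sketch restates the alternatives above as small Python helpers; the element distance `dis`, and the `centroid` and `medoid` helpers it expects, are caller-supplied, and all names and data are illustrative rather than drawn from the text.

```python
# Illustrative inter-cluster distance alternatives over an element distance `dis`.
def single_link(Ki, Kj, dis):
    return min(dis(a, b) for a in Ki for b in Kj)

def complete_link(Ki, Kj, dis):
    return max(dis(a, b) for a in Ki for b in Kj)

def average_link(Ki, Kj, dis):
    return sum(dis(a, b) for a in Ki for b in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj, dis, centroid):
    # distance between the two cluster centroids
    return dis(centroid(Ki), centroid(Kj))

def medoid_link(Ki, Kj, dis, medoid):
    # distance between the two cluster medoids
    return dis(medoid(Ki), medoid(Kj))

# Example with 1-D points and absolute difference as the element distance.
dis = lambda a, b: abs(a - b)
Ki, Kj = [1.0, 2.0], [5.0, 7.0]
print(single_link(Ki, Kj, dis), complete_link(Ki, Kj, dis), average_link(Ki, Kj, dis))
```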
5.3 OUTLIERS
As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.
Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here, if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.
Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently, and when compared with the normal water level values may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.
Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data because real-world data values may not follow well-defined data distributions. Also, most of these tests assume a single attribute value, whereas many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.
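As one hedged example of a distance-based alternative, the sketch below flags a point as an outlier when too few other points lie within a given radius of it; the radius, the neighbor threshold, and the height data are arbitrary illustrative choices, not values from the text.

```python
# A simple distance-based outlier detector: a point is flagged as an outlier
# if fewer than `min_neighbors` other points lie within distance `r` of it.
def distance_outliers(points, dis, r, min_neighbors):
    outliers = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and dis(p, q) <= r)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

heights = [1.62, 1.70, 1.68, 1.75, 1.73, 2.50]   # the 2.5 m person stands apart
dis = lambda a, b: abs(a - b)
print(distance_outliers(heights, dis, r=0.10, min_neighbors=2))   # -> [2.5]
```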