DBSCAN 简单来说就是一种基于密度的聚类算法。
数据输入支持weka数据格式,里面有一个例子数据,结果与weka比较过,是相同的。
网上有一个DBSCAN的C#的源码,但是错的。
现无偿贡献出来,没有任何协议,可以随意引用,当然了,能发封信给我或者发个明信片啥的更好了,嘿嘿。
下载
From Weikpedia
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaovei Xui in 1996. It is a density based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes. DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature. The basic DBSCAN algorithm has been used as a base for many other developments. Such as paralellisation by Domenica Arlia and Massimo Coppola or an enhancement of the data set background to support uncertain data presented by Dirk Habich and Peter Benjamin Volk.
Steps
DBScan requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbour points within distance eps of the starting point.
If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbours recursively.
If the number of neighbors is less than minPts, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.
Pseudocode
C = 0 for each unvisited point P in dataset D N = getNeighbors (P, epsilon) if (sizeof(N) < minPts) mark P as NOISE else ++C mark P as visited add P to cluster C recurse (N)
Advantages
- DBScan does not require you to know the number of clusters in the data a priori. Compare this with k-means.
- DBScan does not have a bias towards a particular cluster shape or size. Compare this with k-means.
- DBScan is resistant to noise and provides a means of filtering for noise if desired.
Disadvantages
- DBSCAN can only result in a good clustering as good as its distance measure is in the function getNeighbors(P,epsilon). The most common distance metric used is the euclidean distance measure. Also any other arbitrary distance measure can be used to influence the behaviour of the resulting cluster.
- DBScan does not respond well to data sets with varying densities so called hirachical data sets.
Compare to other clustering algorithms, such as K-Means, DBSCAN does not scale very well in data volumen. If more data points are added to the clustering data set the slower the algorithms gets due to the nature of the getNeighbors(P,epsilon) function.
Nearest Neighborhood Detection
A nearest neighborhood detection takes place in the getNeighbors(P,epsilon) funktion. For a data point P all data points must be detected that are within the range of epsilon, based on the distance function used in the algorithm. The detection requires that a distance matrix is calculated for the whole data set. The generation of the distance matrix has a complexity of O((n2-n)/2) since only an upper matrix is needed. Within the distance matrix the nearest neighbors can be detected by selecting a tuple with minimums functions over the rows and columns. Research has been pushing the neighborhood detection into traditional databases to enhance the speed of the detection. Databases solve the neighborhood problem with indexes specifically designed for this type of application
Parameter Estimation
Every data mining task has the problem of parameters. Every parameter influences the algorithm in sepcifc ways. For DBSCAN the parameters epsilon and MinPnts are needed. The parameters must be specified by the user of the algorithms since other data sets and other questions require differnt parameters. An initial value for epsilon can be determined by a k-distance graph. As a thumbrule k is determined with D (number of dimensions in the data set): k=D+1.
Although this parameter estimation gives a sufficient inital parameter set the resulting clustering can turn out to be not the expected partitioning. Therfore research has been performed on incrementally optimising the parameters agains a specific target value.
References
- "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". Proceedings of 2nd International Conference on KDD, AAAI Press. Retrieved on 2007-10-15.
- "Experiments in Parallel Clustering with DBSCAN". Euro-Par 2001: Parallel Processing: 7th International Euro-Par Conference Manchester, UK August 28-31, 2001, Proceedings, Springer Berlin. Retrieved on 2004-02-19.