Key definitions in DBSCAN:
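ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object.
Core object: if the ε-neighborhood of a given object contains at least MinPts sample points (counting the object itself), the object is called a core object.
Directly density-reachable: for a sample set D, if sample point q lies in the ε-neighborhood of p and p is a core object, then q is directly density-reachable from p.
Density-reachable: for a sample set D, given a chain of sample points p1, p2, ..., pn with p = p1 and q = pn, if each pi is directly density-reachable from pi-1 (for i = 2, ..., n), then q is density-reachable from p.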
Density-connected: for a sample set D, if there exists an object o in D such that both p and q are density-reachable from o, then p and q are density-connected.
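Density-reachability is therefore the transitive closure of direct density-reachability, and the relation is asymmetric: a non-core object can be reached from a core object, but nothing is density-reachable from a non-core object. Density-connectedness is symmetric. The goal of DBSCAN is to find maximal sets of density-connected objects.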
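The first three definitions can be sketched directly in code. The class below is only an illustration: the name NeighborhoodSketch, the method names, and the four sample coordinates are assumptions, and the distance test uses "<=" so that points exactly at radius ε count as neighbors.

```java
import java.util.ArrayList;
import java.util.List;

class NeighborhoodSketch {

    // Indices of all points within distance eps of points[i], including i itself.
    static List<Integer> epsNeighborhood(double[][] points, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            double dx = points[i][0] - points[j][0];
            double dy = points[i][1] - points[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    // Core object: the eps-neighborhood holds at least minPts points.
    static boolean isCore(double[][] points, int i, double eps, int minPts) {
        return epsNeighborhood(points, i, eps).size() >= minPts;
    }

    // q is directly density-reachable from p: p is a core object and q lies in p's neighborhood.
    static boolean directlyReachable(double[][] points, int p, int q, double eps, int minPts) {
        return isCore(points, p, eps, minPts) && epsNeighborhood(points, p, eps).contains(q);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: three points close together and one further out.
        double[][] points = {{0, 0}, {1, 0}, {0, 1}, {2.2, 0}};
        double eps = 1.5;
        int minPts = 3;
        System.out.println(isCore(points, 0, eps, minPts));               // true
        System.out.println(isCore(points, 3, eps, minPts));               // false
        System.out.println(directlyReachable(points, 1, 3, eps, minPts)); // true
        System.out.println(directlyReachable(points, 3, 1, eps, minPts)); // false
    }
}
```

The last two lines show the asymmetry: point 3 can be reached from the core object 1, but not the other way round, because point 3 has only two points in its neighborhood.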
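Example: suppose the radius ε = 3 and MinPts = 3. The ε-neighborhood of p contains {m, p, p1, p2, o}; that of m contains {m, q, p, m1, m2}; that of q contains {q, m}; that of o contains {o, p, s}; and that of s contains {o, s, s1}.
Then the core objects are p, m, o, and s (q is not a core object, because its ε-neighborhood contains only 2 points, fewer than MinPts = 3);
m is directly density-reachable from p, because m lies in the ε-neighborhood of p and p is a core object;
q is density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p;
q and s are density-connected, because q is density-reachable from p and s is also density-reachable from p (via o).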
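The example can be checked mechanically. The sketch below hard-codes the five neighborhoods listed above and derives the relations by a breadth-first search that expands only core objects. The class and method names are illustrative, and it assumes that points whose neighborhoods are not listed (p1, p2, m1, m2, s1) are not core objects.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReachabilityExample {
    static final int MIN_PTS = 3;
    static final Map<String, List<String>> NEIGHBORHOODS = new HashMap<>();
    static {
        NEIGHBORHOODS.put("p", Arrays.asList("m", "p", "p1", "p2", "o"));
        NEIGHBORHOODS.put("m", Arrays.asList("m", "q", "p", "m1", "m2"));
        NEIGHBORHOODS.put("q", Arrays.asList("q", "m"));
        NEIGHBORHOODS.put("o", Arrays.asList("o", "p", "s"));
        NEIGHBORHOODS.put("s", Arrays.asList("o", "s", "s1"));
    }

    // Assumption: a point with no listed neighborhood is treated as a non-core object.
    static boolean isCore(String x) {
        List<String> n = NEIGHBORHOODS.get(x);
        return n != null && n.size() >= MIN_PTS;
    }

    // All points density-reachable from start (empty if start is not a core object).
    static Set<String> reachableFrom(String start) {
        Set<String> reached = new HashSet<>();
        if (!isCore(start)) {
            return reached;
        }
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        reached.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String neighbor : NEIGHBORHOODS.get(current)) {
                // Every neighbor of a core object is reached; only core objects are expanded.
                if (reached.add(neighbor) && isCore(neighbor)) {
                    queue.add(neighbor);
                }
            }
        }
        return reached;
    }

    // a and b are density-connected if some object o reaches both of them.
    static boolean connected(String a, String b) {
        for (String o : NEIGHBORHOODS.keySet()) {
            Set<String> r = reachableFrom(o);
            if (r.contains(a) && r.contains(b)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCore("q"));                      // false
        System.out.println(reachableFrom("p").contains("q")); // true
        System.out.println(reachableFrom("q").contains("p")); // false
        System.out.println(connected("q", "s"));              // true
    }
}
```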
Below is a simple Java implementation of the algorithm.
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.StringTokenizer;

// Point comes from the repository linked below; it is expected to provide getX(), getY(),
// isAccessed()/setAccessed(boolean) and getClusterId()/setClusterId(int).
public class DBScanBuilder {

    // Neighborhood radius
    public static double Epislon = 2;
    // Density threshold: minimum number of points in a neighborhood
    public static int MinPts = 5;

    // Reads one point per line (two whitespace-separated numbers) from dbscan.txt on the classpath
    public List<Point> initData() {
        List<Point> points = new ArrayList<Point>();
        try (InputStream in = DBScanBuilder.class.getClassLoader().getResourceAsStream("dbscan.txt");
             BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
            String line = br.readLine();
            while (null != line && !"".equals(line)) {
                StringTokenizer tokenizer = new StringTokenizer(line);
                double x = Double.parseDouble(tokenizer.nextToken());
                double y = Double.parseDouble(tokenizer.nextToken());
                points.add(new Point(x, y));
                line = br.readLine();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return points;
    }

    // Euclidean distance between two points
    public double euclideanDistance(Point a, Point b) {
        double sum = Math.pow(a.getX() - b.getX(), 2) + Math.pow(a.getY() - b.getY(), 2);
        return Math.sqrt(sum);
    }

    // Neighbors of the current point: all points within the radius, including the point itself
    public List<Point> obtainNeighbors(Point current, List<Point> points) {
        List<Point> neighbors = new ArrayList<Point>();
        for (Point point : points) {
            double distance = euclideanDistance(current, point);
            if (distance <= Epislon) {
                neighbors.add(point);
            }
        }
        return neighbors;
    }

    // Grows the cluster outward from the core object point
    public void mergeCluster(Point point, List<Point> neighbors, int clusterId, List<Point> points) {
        point.setClusterId(clusterId);
        Deque<Point> queue = new ArrayDeque<Point>(neighbors);
        while (!queue.isEmpty()) {
            Point neighbor = queue.poll();
            // Unclassified points and points provisionally marked as noise join the current cluster
            if (neighbor.getClusterId() <= 0) {
                neighbor.setClusterId(clusterId);
            }
            // An unvisited neighbor that is itself a core object extends the cluster further
            if (!neighbor.isAccessed()) {
                neighbor.setAccessed(true);
                List<Point> nneighbors = obtainNeighbors(neighbor, points);
                if (nneighbors.size() >= MinPts) {
                    queue.addAll(nneighbors);
                }
            }
        }
    }

    public void cluster(List<Point> points) {
        // clusterId 0 means unclassified, a positive value is a cluster number, -1 means noise
        int clusterId = 0;
        // A single pass suffices: every point is visited either here or inside mergeCluster
        for (Point point : points) {
            if (point.isAccessed()) {
                continue;
            }
            point.setAccessed(true);
            List<Point> neighbors = obtainNeighbors(point, points);
            if (neighbors.size() >= MinPts) {
                // An unvisited core object starts a new cluster
                mergeCluster(point, neighbors, ++clusterId, points);
            } else {
                // Not a core object: treat it as noise for now; it may still join a cluster later
                point.setClusterId(-1);
            }
        }
    }

    // Prints the points sorted by cluster id
    public void print(List<Point> points) {
        points.sort(Comparator.comparingInt(Point::getClusterId));
        for (Point point : points) {
            System.out.println(point.getClusterId() + " - " + point);
        }
    }

    public void build() {
        List<Point> points = initData();
        cluster(points);
        print(points);
    }

    public static void main(String[] args) {
        DBScanBuilder builder = new DBScanBuilder();
        builder.build();
    }
}

Code hosted at: https://github.com/fighting-one-piece/repository-datamining.git
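The listing relies on a Point class that is not shown here. The following is a minimal sketch of what such a class might look like, inferred only from the calls the listing makes (getX, getY, isAccessed, setAccessed, getClusterId, setClusterId); the actual class in the repository may differ, and the toString format is an assumption.

```java
class Point {
    private final double x;
    private final double y;
    // Whether the clustering loop has already visited this point
    private boolean accessed;
    // 0 = unclassified, -1 = noise, positive = cluster number
    private int clusterId;

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }

    public double getX() { return x; }

    public double getY() { return y; }

    public boolean isAccessed() { return accessed; }

    public void setAccessed(boolean accessed) { this.accessed = accessed; }

    public int getClusterId() { return clusterId; }

    public void setClusterId(int clusterId) { this.clusterId = clusterId; }

    // Assumed output format, used when the builder prints "clusterId - point"
    @Override
    public String toString() {
        return "(" + x + ", " + y + ")";
    }
}
```

With a class like this on the classpath, dbscan.txt only needs one point per line, written as two whitespace-separated numbers, which is the format initData parses.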