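Limitations of K-means
K must be set by hand, and different values of K give different results; often the right number of clusters is not known in advance.
Sensitive to the initial cluster centres: different initialisations can give different results.
Sensitive to outliers.
Each sample is assigned to exactly one cluster, so K-means does not suit tasks where a sample may belong to several groups.
Poorly suited to widely scattered data, to clusters of very unequal size, and to non-convex cluster shapes.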
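The dependence on K can be made concrete by refitting with several values and comparing the inertia (within-cluster sum of squared distances). This is a minimal sketch of that common heuristic, not something the notebook itself does; the synthetic data, the seed, and the range of K are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D data (an assumption for illustration): three tight groups
# centred near 2, 5 and 8.
rng = np.random.default_rng(0)
X = np.concatenate(
    [rng.normal(loc, 0.2, size=30) for loc in (2.0, 5.0, 8.0)]
).reshape(-1, 1)

inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances
    print(k, round(model.inertia_, 2))
```

On data like this the inertia drops sharply up to K=3 and only slightly afterwards, which is the "elbow" people look for. On real data the elbow is often much less clear, so it is a hint rather than a rule.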
from IPython.display import Image
As the figure above shows, K-means does not separate the groups well.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data1 = pd.read_csv('./lianjia1.csv',encoding = 'gbk')
data2 = pd.read_csv('./lianjia2.csv',encoding = 'gbk')
data3 = pd.read_csv('./lianjia3.csv',encoding = 'utf-8')
data4 = pd.read_csv('./lianjia4.csv',encoding = 'utf-8')
data5 = pd.read_csv('./lianjia5.csv',encoding = 'utf-8')
data6 = pd.read_csv('./lianjia6.csv',encoding = 'utf-8')
data7 = pd.read_csv('./lianjia7.csv',encoding = 'utf-8')
data = pd.concat([data1,data2,data3,data4,data5,data6,data7])
data = data.dropna()
data['cjdanjia'] = data.cjdanjia.str.replace('元/平','').astype(np.float32).map(lambda x: round(x/10000,2))
data1 = data[data.cjxiaoqu.str.contains('龙锦苑东一区 3室2厅 124平')]
plt.scatter(range(1,len(data1)+1), data1.cjdanjia) # range excludes its end point, hence len(data1) + 1
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=2).fit_predict(data1[['cjdanjia']]) # the number of clusters must be given by hand; double brackets pass a 2-D DataFrame
y_pred
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
# 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
plt.scatter(range(1,len(data1)+1), data1.cjdanjia, c = y_pred)
data2 = data[data.cjxiaoqu.str.contains('龙锦苑东五区 3室2厅 124平')] # look at how the sale records for '龙锦苑东五区 3室2厅 124平' are distributed
plt.scatter(range(1,len(data2)+1), data2.cjdanjia)
y_pred = KMeans(n_clusters=2).fit_predict(data2[['cjdanjia']])
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
Here K-means handles the outliers poorly. The hope was that the unusually high unit prices would be grouped into a cluster of their own.
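The same effect can be reproduced without the lianjia CSV files. The sketch below uses hypothetical prices (an evenly spread bulk plus two high values; none of these numbers come from the notebook's data). For this particular data, a partition that isolates only the two high prices is not a stable K-means solution, so part of the bulk ends up in the same cluster as them.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unit prices (10k yuan per square metre): an evenly spread
# bulk from 3.0 to 4.0 plus two unusually high sales.
bulk = np.linspace(3.0, 4.0, 21)
outliers = np.array([4.4, 4.5])
prices = np.concatenate([bulk, outliers]).reshape(-1, 1)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(prices)

# Cluster that the two high prices were assigned to, and how many points share it.
high_label = labels[-1]
print("cluster sizes:", np.bincount(labels))
print("points sharing a cluster with the high prices:",
      int((labels == high_label).sum()))
```

K-means minimises squared distance to the cluster means, so with a wide bulk it is cheaper to cut the bulk than to spend a whole cluster on two points.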
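The DBSCAN algorithm (density-based)
Reference: 详解DBSCAN聚类 (DBSCAN clustering explained), Zhihu (zhihu.com)
eps (epsilon) set very large: all points are merged into a single cluster.
eps set too small: no point has enough neighbours to form a dense region, so the points stay isolated (with scikit-learn's default min_samples=5 they are all labelled as noise, -1).
Advantage: it can identify noise points and leave them out of every cluster.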
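The effect of eps can be checked on a small example. This is a sketch on hypothetical prices (a dense bulk plus two isolated values), with min_samples left at scikit-learn's default of 5; the three eps values are arbitrary choices meant to show the two extremes and one value in between.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical unit prices: a dense bulk from 3.0 to 3.6 plus two isolated sales.
prices = np.concatenate([np.linspace(3.0, 3.6, 13), [5.0, 5.2]]).reshape(-1, 1)

results = {}
for eps in (0.001, 0.4, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(prices)
    # Label -1 marks noise, so it is not counted as a cluster.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With this data the smallest eps marks every point as noise, the largest merges everything into one cluster, and eps=0.4 keeps the bulk as one cluster while flagging the two isolated sales.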
from sklearn.cluster import DBSCAN
y_pred = DBSCAN().fit_predict(data2[['cjdanjia']]) # defaults: eps=0.5, min_samples=5
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
y_pred = DBSCAN(eps=0.4).fit_predict(data2[['cjdanjia']])
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
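Beyond colouring the scatter plot, the labels can be used to pull the flagged rows out of the DataFrame. A minimal sketch, assuming a small hypothetical frame in place of data2 (only the column name cjdanjia is taken from the notebook):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for data2; the values are made up for illustration.
df = pd.DataFrame({"cjdanjia": [3.0, 3.1, 3.15, 3.2, 3.25, 3.3, 3.4, 5.0, 5.2]})

labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(df[["cjdanjia"]])

noise = df[labels == -1]      # rows DBSCAN could not attach to any dense region
clustered = df[labels != -1]  # rows that belong to some cluster
print(noise)
```

Whether the noise rows are then dropped or inspected by hand depends on the goal; for sale prices they may be exactly the records worth a closer look.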