26. 日月光华 Python Data Analysis - Machine Learning - Clustering

Limitations of K-means

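K must be chosen by hand, and different values of K give different results; in practice you often do not know how many clusters there are.
It is sensitive to the initial cluster centers: different initializations can produce different results.
It is sensitive to outliers.
Each sample is assigned to exactly one cluster, so it does not suit tasks where a sample can belong to several categories.
It works poorly on widely scattered data, on clusters of very unequal size, and on non-convex cluster shapes.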
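The outlier sensitivity listed above can be seen on a tiny example. This is a minimal sketch on made-up one-dimensional values (not the Lianjia data used later): two obvious groups plus a single extreme point. `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight groups (near 1 and near 5) and one extreme outlier.
values = np.array([1.0, 1.1, 0.9, 1.05, 5.0, 5.1, 4.9, 5.05, 100.0]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(values)

# Label numbers are arbitrary, so compare group membership rather than label values.
regular_labels = set(labels[:8].tolist())
outlier_label = int(labels[8])

print("labels:", labels)
print("the two real groups were merged:", len(regular_labels) == 1)
print("the outlier sits alone:", outlier_label not in regular_labels)
```

With K fixed at 2, the total squared distance is smallest when the outlier gets a cluster to itself, so the two real groups end up merged.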

[Figure: image.png, an example where K-means clusters the data poorly]

As the figure above shows, K-means does not cluster this data well.
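A similar failure can be reproduced without the original image. The sketch below assumes scikit-learn's synthetic `make_moons` data as a stand-in for a non-convex shape, and uses the adjusted Rand index (ARI) to compare the K-means labels with the true groups; neither choice comes from the original notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic non-convex data: two interleaving half circles.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means the clusters match the true shapes exactly.
ari = adjusted_rand_score(y_true, y_kmeans)
print(f"K-means ARI on two moons: {ari:.2f}")
```

K-means separates the points with a straight boundary between the two centers, so it cuts across both half circles instead of following them.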

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data1 = pd.read_csv('./lianjia1.csv',encoding = 'gbk')
data2 = pd.read_csv('./lianjia2.csv',encoding = 'gbk')
data3 = pd.read_csv('./lianjia3.csv',encoding = 'utf-8')
data4 = pd.read_csv('./lianjia4.csv',encoding = 'utf-8')
data5 = pd.read_csv('./lianjia5.csv',encoding = 'utf-8')
data6 = pd.read_csv('./lianjia6.csv',encoding = 'utf-8')
data7 = pd.read_csv('./lianjia7.csv',encoding = 'utf-8')

data = pd.concat([data1,data2,data3,data4,data5,data6,data7])
data = data.dropna()

data['cjdanjia'] = data.cjdanjia.str.replace('元/平','').astype(np.float32).map(lambda x: round(x/10000,2))
data1 = data[data.cjxiaoqu.str.contains('龙锦苑东一区 3室2厅 124平')]
plt.scatter(range(1,len(data1)+1), data1.cjdanjia)    # range excludes its end value, hence len(data1) + 1
[Figure: scatter plot of data1 unit price (cjdanjia) against record index]
from sklearn.cluster import KMeans

y_pred = KMeans(n_clusters=2).fit_predict(data1[['cjdanjia']])    # the number of clusters must be set by hand; double brackets select a DataFrame (2-D input)
y_pred
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
#       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

plt.scatter(range(1,len(data1)+1), data1.cjdanjia, c = y_pred)
[Figure: scatter plot of data1 colored by K-means cluster label]
data2 = data[data.cjxiaoqu.str.contains('龙锦苑东五区 3室2厅 124平')]   # inspect how the transaction records for '龙锦苑东五区 3室2厅 124平' are distributed
plt.scatter(range(1,len(data2)+1), data2.cjdanjia)
[Figure: scatter plot of data2 unit price (cjdanjia) against record index]
y_pred = KMeans(n_clusters=2).fit_predict(data2[['cjdanjia']])
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
[Figure: scatter plot of data2 colored by K-means cluster label]

Here K-means handles the outliers poorly. What we wanted was for the unusually high unit prices to end up in a cluster of their own.

The DBSCAN algorithm (density-based)


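Further reading: 详解DBSCAN聚类 - 知乎 (zhihu.com)
eps (epsilon) set too large: all points are merged into a single cluster.
eps set too small: points have too few neighbors to form a dense region, so most or all of them are labeled as noise (-1 in scikit-learn) instead of being clustered.
Advantage: DBSCAN can identify noise points and keep them out of every cluster.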
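The effect of eps can be checked on a small made-up data set. This is a sketch under stated assumptions: the values are hypothetical, `min_samples=5` is scikit-learn's default written out explicitly, and `label_set` is a helper introduced only for this illustration. In scikit-learn, points that belong to no cluster receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups of 10 points each, far apart.
values = np.concatenate(
    [np.linspace(3.0, 3.2, 10), np.linspace(5.0, 5.2, 10)]
).reshape(-1, 1)

def label_set(eps):
    """Return the distinct labels DBSCAN assigns for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(values)
    return set(labels.tolist())

too_large = label_set(10.0)    # every point is a neighbor of every other point
too_small = label_set(0.001)   # no point has any neighbor but itself
moderate = label_set(0.1)      # neighbors exist only within each group

print("eps=10.0 :", too_large)
print("eps=0.001:", too_small)
print("eps=0.1  :", moderate)
```

A very large eps yields one cluster, a very small eps yields only noise, and a value between the within-group spacing and the gap between groups recovers both groups.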

from sklearn.cluster import DBSCAN

y_pred = DBSCAN().fit_predict(data2[['cjdanjia']])     # defaults: eps=0.5, min_samples=5
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
[Figure: scatter plot of data2 colored by DBSCAN label, default eps=0.5]
y_pred = DBSCAN(eps=0.4).fit_predict(data2[['cjdanjia']])
plt.scatter(range(1,len(data2)+1), data2.cjdanjia, c = y_pred)
[Figure: scatter plot of data2 colored by DBSCAN label, eps=0.4]
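To read the result as numbers rather than colors, the labels can be counted. The sketch below uses made-up unit prices in place of the Lianjia records (the CSV files are not available here), so the values and the resulting counts are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical unit prices: 60 ordinary records plus 3 unusually high ones.
ordinary = np.linspace(3.0, 3.5, 60)
high = np.array([5.0, 5.6, 6.2])
prices = np.concatenate([ordinary, high]).reshape(-1, 1)

labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(prices)

# scikit-learn marks noise points with the label -1.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})

print("clusters found:", n_clusters)
print("noise points:", n_noise)
print("prices flagged as noise:", prices[labels == -1].ravel())
```

Here the isolated high prices are flagged as noise instead of being forced into a cluster, which is the behavior K-means could not provide.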
