Three clustering methods are used: KMeans, DBSCAN, and hierarchical clustering.
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)
Key parameter: n_clusters, the number of clusters to form.
Attributes and methods: see the class reference.
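As a minimal sketch of what n_clusters controls, the snippet below fits KMeans on a tiny made-up array. The `points` data, the variable name `toy_kmeans`, and the fixed `random_state` are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# n_clusters sets how many clusters are formed.
# n_init and random_state are fixed only to make the run reproducible.
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(toy_kmeans.labels_)           # one cluster index per sample
print(toy_kmeans.cluster_centers_)  # one centroid per cluster
```

The fitted model exposes `labels_` and `cluster_centers_`, the two attributes the later cells rely on.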
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
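Key parameters: eps, the neighborhood radius, and min_samples, the minimum number of samples required within that radius.
Attributes and methods: see the class reference.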
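To make the roles of eps and min_samples concrete, here is a small sketch on made-up points. The `points` array and the two parameter settings are assumptions chosen for illustration; they are unrelated to the dataset analysed later.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups plus one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [3.0, 3.0], [3.1, 3.0], [3.0, 3.1],
                   [10.0, 10.0]])

# A radius that covers each dense group: two clusters, the isolated point is noise.
toy_dbscan = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(toy_dbscan.labels_)  # noise points carry the label -1

# A radius smaller than any pairwise distance: no point has enough neighbours.
tight_dbscan = DBSCAN(eps=0.05, min_samples=3).fit(points)
print(tight_dbscan.labels_)
```

Note that min_samples counts the point itself, so a group of three mutually close points already satisfies min_samples=3.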
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward')
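Key parameter: n_clusters, the number of clusters to form.
Attributes and methods: see the class reference.
Hierarchical clustering can work top-down (divisive) or bottom-up (agglomerative); AgglomerativeClustering is the bottom-up kind.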
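Evaluation uses the sklearn.metrics library. The function chosen here is homogeneity_completeness_v_measure, which returns homogeneity, completeness, and v_measure, in that order; v_measure is the harmonic mean of homogeneity and completeness.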
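The relationship between the three scores can be checked directly. The sketch below uses made-up label lists (`true_labels` and `pred_labels` are hypothetical) and assumes the default `beta=1.0`, under which v_measure reduces to the plain harmonic mean.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical labels, for illustration only.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2]

# The tuple order is (homogeneity, completeness, v_measure).
h, c, v = homogeneity_completeness_v_measure(true_labels, pred_labels)

# With the default beta=1.0, v_measure is the harmonic mean of h and c.
harmonic_mean = 2 * h * c / (h + c)
print(h, c, v, harmonic_mean)
```

Because the result is an ordered tuple, slicing it with `[0:2]` yields homogeneity and completeness, while `[1:3]` yields completeness and v_measure.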
import pandas as pd
import numpy as np
rawdata = pd.read_csv(r"..\Data\cluster_N.csv")
X = rawdata.iloc[:, 0:-1]  # features: every column except the last
Y = rawdata.iloc[:, -1]    # ground-truth labels: the last column
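The data has no missing values or outliers, so no feature selection is done for now.
The data comes with ground-truth labels, so the clustering results are evaluated against them.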
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import homogeneity_completeness_v_measure
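def run_kmeans_model():
    # Fit KMeans with 3 clusters on the feature matrix X
    model = KMeans(n_clusters=3).fit(X)
    # The metric returns (homogeneity, completeness, v_measure); keep the first two
    result = homogeneity_completeness_v_measure(Y, model.labels_)[0:2]
    center = model.cluster_centers_
    return result, center
run_kmeans_model()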
Use TSNE to reduce the data to two dimensions and visualize the clustering result.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
def draw(model):
    # Project X to 2-D with t-SNE, keeping the original row index
    embedding = TSNE().fit_transform(X)
    points = pd.DataFrame(embedding, index=X.index)
    # One marker style per label; -1 is the DBSCAN noise label
    r = points[model.labels_ == 0]
    plt.plot(r[0], r[1], "r.")
    g = points[model.labels_ == 1]
    plt.plot(g[0], g[1], "go")
    b = points[model.labels_ == 2]
    plt.plot(b[0], b[1], "b*")
    q = points[model.labels_ == -1]
    plt.plot(q[0], q[1], "cx")
    plt.show()
draw(KMeans(n_clusters=3).fit(X))
homogeneity = 0.7
completeness = 0.69
Explore how eps and min_samples affect the clustering.
def run_dbscan():
    result = {
        "min_samples": [],
        "eps": [],
        "n_clusters": [],
        "homogeneity": [],
        "completeness": []
    }
    for i in range(5, 11):
        for j in np.arange(0.5, 1.1, 0.1):
            # Fit DBSCAN for this (eps, min_samples) pair
            model = DBSCAN(eps=j, min_samples=i, metric="euclidean").fit(X)
            # Indices 0 and 1 of the result are homogeneity and completeness
            homo, com = homogeneity_completeness_v_measure(Y, model.labels_)[0:2]
            # Number of distinct labels, counting the noise label -1 as one group
            n_cluster = len(set(model.labels_))
            result["min_samples"].append(i)
            result["eps"].append(round(j, 1))  # round away float noise for display
            result["homogeneity"].append(homo)
            result["completeness"].append(com)
            result["n_clusters"].append(n_cluster)
    df = pd.DataFrame(result)
    return df
run_dbscan()
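The target is 3 clusters, so look for rows where n_clusters is 4: DBSCAN marks outliers with the label -1, and that noise label is counted as an extra group.
Choose eps=0.9 and min_samples=5.
Fit the model and visualize it.
With these settings, homogeneity = 0.51 and completeness = 0.47.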
draw(DBSCAN(eps=0.9,min_samples=5,metric="euclidean").fit(X))
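Since the noise label inflates the group count by one, it can help to separate the two numbers explicitly. This is a minimal sketch on a made-up label array (`labels` is hypothetical and stands in for `model.labels_`).

```python
import numpy as np

# Hypothetical DBSCAN output: three clusters (0, 1, 2) plus noise (-1).
labels = np.array([0, 0, 1, 1, 2, 2, -1, -1])

# Distinct labels, noise included.
n_groups = len(set(labels.tolist()))
# Real clusters only: drop the noise label if it is present.
n_clusters = n_groups - (1 if -1 in labels else 0)
# How many samples were left unassigned.
n_noise = int(np.sum(labels == -1))

print(n_groups, n_clusters, n_noise)
```

Reporting the noise count next to the cluster count also shows how much of the data a given eps and min_samples setting leaves unclustered.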
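def run_agg():
    # Agglomerative clustering into 3 clusters (default Ward linkage)
    model = AgglomerativeClustering(n_clusters=3).fit(X)
    # The metric returns (homogeneity, completeness, v_measure); keep the first two
    result = homogeneity_completeness_v_measure(Y, model.labels_)[0:2]
    return result
run_agg()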
homogeneity = 0.74
completeness = 0.73
draw(AgglomerativeClustering(n_clusters=3).fit(X))
# Dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.preprocessing import MinMaxScaler
def drawc():
    # Scale each feature to [0, 1] before building the Ward linkage tree
    df = MinMaxScaler().fit_transform(X)
    ss = linkage(df, method="ward")
    dendrogram(ss)
    plt.show()
drawc()
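A dendrogram can also be cut into a fixed number of flat clusters, which ties the tree back to the n_clusters=3 setting used above. The sketch below is one possible way to do this with SciPy's `fcluster`; the `blobs` data is synthetic and the names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs of 10 points each.
blobs = np.vstack([rng.normal(loc=center, scale=0.1, size=(10, 2))
                   for center in (0.0, 5.0, 10.0)])

# Build the same kind of Ward linkage tree that dendrogram() draws.
tree = linkage(blobs, method="ward")
# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
flat = fcluster(tree, t=3, criterion="maxclust")

print(flat)
```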