I.cluster
2.Usage
(2)Functions:
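Run "Affinity Propagation Clustering":[<cluster_centers_indices>,<labels>,<n_iter>=]sklearn.cluster.affinity_propagation(<S>[,preference=None,convergence_iter=15,max_iter=200,damping=0.5,copy=True,verbose=False,return_n_iter=False,random_state='warn'])
#Parameters:
S:the "matrix of similarities" between data points;an n_samples×n_samples array-like
preference:the "preference" of each data point;a float/1×n_samples array-like
#Points with a larger preference are more likely to be chosen as "exemplars" (cluster centers)
#If None,the median of the input similarities is used
convergence_iter:the number of iterations with no change in the number of estimated clusters after which iteration stops;an int
max_iter:the maximum number of iterations;an int
damping:the "damping factor";a float with 0.5<=float<1
copy:whether to copy the "affinity matrix" (S);a bool
#Setting it to False is more memory-efficient
verbose:whether to print the computation log;a bool
return_n_iter:whether to return the number of iterations;a bool
random_state:the random state used for initialization;an int/RandomState instance/None
#In k_means() and related functions it determines random number generation for centroid initialization
cluster_centers_indices:returns the indices of the cluster centers;a 1×n_clusters ndarray
labels:returns the cluster each data point belongs to;a 1×n_samples ndarray
n_iter:returns the number of iterations;an int
#Only returned if return_n_iter=True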
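A minimal sketch of the call above. The toy points and the choice of negative squared Euclidean distance as the similarity are assumptions made for illustration, not part of the notes:

```python
import numpy as np
from sklearn.cluster import affinity_propagation
from sklearn.metrics import pairwise_distances

# Toy data (made up): two well-separated groups of three 2-D points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.4, 0.3],
              [10.0, 10.0], [11.0, 10.0], [10.4, 10.3]])

# Similarity matrix S: negative squared Euclidean distance (n_samples x n_samples)
S = -pairwise_distances(X, metric="sqeuclidean")

# preference=None means the median of S is used as every point's preference
cluster_centers_indices, labels = affinity_propagation(
    S, preference=None, damping=0.5, random_state=0
)

print("exemplar indices:", cluster_centers_indices)
print("labels:", labels)
```

Passing return_n_iter=True would add the iteration count as a third return value.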
######################################################################################################################
Run "DBSCAN extraction":[<labels_>=]sklearn.cluster.cluster_optics_dbscan(<reachability>,<core_distances>,<ordering>,<eps>)
#Parameters:labels_ as labels in sklearn.cluster.affinity_propagation() (noisy samples are labeled -1)
reachability:the "reachability distances" computed by OPTICS;a 1×n_samples ndarray
core_distances:the distance at which each point becomes a core point;a 1×n_samples ndarray
ordering:the OPTICS-ordered point indices;a 1×n_samples ndarray
eps:the maximum distance between neighbors;a float
#This is the eps parameter of class cluster.DBSCAN();samples farther apart than this value are not considered neighbors
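A small sketch of DBSCAN extraction. It assumes the three input arrays are taken from a fitted OPTICS estimator (one convenient source; the toy data and eps=0.5 are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Toy data (made up): two tight squares of points plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [20.0, 20.0]])

# Fit OPTICS once to obtain reachability, core distances and ordering
optics = OPTICS(min_samples=3).fit(X)

# Extract DBSCAN-like labels for a given eps without refitting
labels_ = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.5,
)
print(labels_)  # noise points are labeled -1
```

Calling the function again with a different eps reuses the same OPTICS result.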
######################################################################################################################
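Extract clusters with the "Xi-steep method":[<labels>,<clusters>=]sklearn.cluster.cluster_optics_xi(<reachability>,<predecessor>,<ordering>,<min_samples>[,min_cluster_size=None,xi=0.05,predecessor_correction=True])
#Parameters:the rest as in sklearn.cluster.cluster_optics_dbscan()
predecessor:the predecessor of each point as computed by OPTICS;a 1×n_samples ndarray
min_samples:the minimum number of samples in the neighborhood of a point (including the point itself) for it to be considered a core point;an int>1/a float with 0<=float<=1
#Points with fewer samples in their neighborhood are not considered core points;should be the same value that was given to OPTICS
min_cluster_size:the minimum number of samples in a cluster;an int>1/a float with 0<=float<=1
#If None,the value of min_samples is used
xi:the minimum "steepness" on the "reachability plot" that constitutes a cluster boundary;a float with 0<=float<=1
predecessor_correction:whether to correct the clusters based on the predecessors computed by OPTICS;a bool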
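A sketch of Xi extraction, again assuming the inputs come from a fitted OPTICS estimator; the random blobs and min_samples=5 are illustrative choices:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_xi

# Toy data (made up): two Gaussian blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

optics = OPTICS(min_samples=5).fit(X)

# min_samples should match the value given to OPTICS
labels, clusters = cluster_optics_xi(
    reachability=optics.reachability_,
    predecessor=optics.predecessor_,
    ordering=optics.ordering_,
    min_samples=5,
    xi=0.05,
)
print(labels)    # -1 marks samples outside every extracted cluster
print(clusters)  # each row is a [start, end] interval in the OPTICS ordering
```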
######################################################################################################################
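Compute the "OPTICS reachability graph":[<ordering_>,<core_distances_>,<reachability_>,<predecessor_>=]sklearn.cluster.compute_optics_graph(<X>,<min_samples>,<max_eps>,<metric>,<p>,<metric_params>,<algorithm>,<leaf_size>,<n_jobs>)
#Parameters:the rest as in sklearn.cluster.cluster_optics_xi()
X:the feature array;an n_samples×n_features ndarray
  if metric="precomputed",the distances between samples;an n_samples×n_samples ndarray
max_eps:the maximum distance between neighboring points;a float
#Points farther apart than this value are not considered neighbors
metric:the distance metric;a str/callable
p:the power parameter of the Minkowski metric used to compute distances between data points;a float
metric_params:additional arguments passed to metric;a dict
algorithm:the algorithm used to find nearest neighbors;one of "auto"/"ball_tree"/"kd_tree"/"brute"
leaf_size:the "leaf size" passed to BallTree/KDTree;an int
n_jobs:the number of jobs used for parallel computation;an int
ordering_:returns the cluster-ordered list of sample indices;a 1×n_samples ndarray
core_distances_:returns the distance at which each sample becomes a "core" point;a 1×n_samples ndarray
reachability_:returns the reachability distance of each sample;a 1×n_samples ndarray
predecessor_:returns the point from which each sample was reached;a 1×n_samples ndarray
#Seed points have a predecessor of -1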
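A hedged sketch of calling compute_optics_graph() directly. All arguments after X are passed by keyword, since recent scikit-learn versions accept them only that way; the data and the specific values (Minkowski metric with p=2, leaf_size=30) are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import compute_optics_graph

# Toy data (made up): two Gaussian blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

ordering_, core_distances_, reachability_, predecessor_ = compute_optics_graph(
    X,
    min_samples=5,
    max_eps=np.inf,        # no upper limit on the neighborhood radius
    metric="minkowski",
    p=2,                   # p=2 gives the Euclidean distance
    metric_params=None,
    algorithm="auto",
    leaf_size=30,
    n_jobs=None,
)

# The first point in the ordering is a seed: infinite reachability, predecessor -1
first = ordering_[0]
print(reachability_[first], predecessor_[first])
```

The four arrays can then be passed to cluster_optics_dbscan() or cluster_optics_xi().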
######################################################################################################################
Run "DBSCAN Clustering":[<core_samples>,<labels>=]sklearn.cluster.dbscan(<X>[,eps=0.5,min_samples=5,metric='minkowski',metric_params=None,algorithm='auto',leaf_size=30,p=2,sample_weight=None,n_jobs=None])
#Parameters:eps is the same as max_eps of sklearn.cluster.compute_optics_graph()
#           labels as in sklearn.cluster.affinity_propagation() (noisy samples are labeled -1)
#           the rest as in sklearn.cluster.compute_optics_graph()
X:the feature array;an n_samples×n_features ndarray/n_samples×n_features sparse CSR matrix
  if metric="precomputed",the distances between samples;an n_samples×n_samples ndarray/n_samples×n_samples sparse CSR matrix
sample_weight:the weight of each sample;a 1×n_samples array-like
core_samples:returns the indices of the "core samples";a 1×n_core_samples ndarray
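A minimal sketch of the functional DBSCAN interface on made-up data; eps=0.5 and min_samples=3 are illustrative values:

```python
import numpy as np
from sklearn.cluster import dbscan

# Toy data (made up): two tight squares of points plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [20.0, 20.0]])

# min_samples counts the point itself
core_samples, labels = dbscan(X, eps=0.5, min_samples=3)

print(core_samples)  # indices of the core samples
print(labels)        # the isolated point is labeled -1 (noise)
```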
######################################################################################################################
Estimate the "bandwidth" used by the mean shift algorithm:[<bandwidth>=]sklearn.cluster.estimate_bandwidth(<X>[,quantile=0.3,n_samples=None,random_state=0,n_jobs=None])
#Parameters:the rest as in sklearn.cluster.affinity_propagation()
X:the data;an n_samples×n_features array-like/n_samples×n_features sparse matrix
quantile:the quantile of the pairwise distances to use;a float with 0<=float<=1
#0.5 means that the median of all pairwise distances is used
n_samples:the number of samples to use;an int
#If None,all samples are used
bandwidth:returns the "bandwidth";a float
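A short sketch of bandwidth estimation on made-up data; quantile=0.3 is simply the default written out:

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Toy data (made up): two Gaussian blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

# A smaller quantile gives a smaller bandwidth (a more local kernel)
bandwidth = estimate_bandwidth(X, quantile=0.3, random_state=0)
print(bandwidth)
```

The result can be passed as the bandwidth argument of mean_shift().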
######################################################################################################################
Run "K-Means clustering":[<centroid>,<labels>,<inertia>,<best_n_iter>=]sklearn.cluster.k_means(<X>,<n_clusters>[,sample_weight=None,init='k-means++',precompute_distances='deprecated',n_init=10,max_iter=300,verbose=False,tol=1e-4,random_state=None,copy_x=True,n_jobs='deprecated',algorithm="auto",return_n_iter=False])
#Parameters:X as in sklearn.cluster.estimate_bandwidth()
#           sample_weight as in sklearn.cluster.dbscan()
#           the rest as in sklearn.cluster.affinity_propagation()
n_clusters:the number of clusters/cluster centers;an int
init:the initial cluster centers;one of "k-means++"/"random"/n_clusters×n_features array-like/callable
precompute_distances:whether to precompute distances;"auto"/a bool (deprecated)
#Precomputing distances reduces computation time but increases memory usage
n_init:the number of times the algorithm is run (the cluster centers are re-initialized each time);an int
#The best result is returned
tol:the relative tolerance used to declare convergence;a float
#Iteration stops when the change between 2 consecutive iterations falls below this value
copy_x:whether to work on a copy of the original data when centering it;a bool
#When pre-computing distances it is more numerically accurate to center the data first
#If True (default),the original data is not modified
#If False,the original data is modified and put back before the function returns,but small numerical differences may be introduced by subtracting and then adding the data mean
#If the original data is not C-contiguous,or is sparse but not in CSR format,a copy is made even if copy_x=False
n_jobs:the number of OpenMP threads used for the computation;an int (deprecated)
algorithm:the K-means algorithm to use;one of "auto"/"full"/"elkan"
centroid:returns the cluster centers;an n_clusters×n_features ndarray
inertia:returns the sum of squared distances from every sample to its closest cluster center;a float
#This is the final value of the inertia criterion
best_n_iter:returns the number of iterations of the best run;an int
#Only returned if return_n_iter=True
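A minimal sketch of the functional K-means interface on made-up data. Only non-deprecated parameters are passed; n_init=10 and random_state=0 are illustrative:

```python
import numpy as np
from sklearn.cluster import k_means

# Toy data (made up): two tight squares of four points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])

centroid, labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)

print(centroid)  # one row per cluster center
print(labels)    # cluster index of each sample
print(inertia)   # sum of squared distances to the closest center
```

With return_n_iter=True a fourth value, best_n_iter, is returned as well.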
######################################################################################################################
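Choose the initial cluster centers with the "k-means++ algorithm":[<centers>,<indices>=]sklearn.cluster.kmeans_plusplus(<X>,<n_clusters>[,x_squared_norms=None,random_state=None,n_local_trials=None])
#Parameters:the rest as in sklearn.cluster.k_means()
x_squared_norms:the "squared Euclidean norm" (squared L2 norm) of each data point;a 1×n_samples array-like
n_local_trials:the number of seeding trials for each center (except the first);an int
#Among the candidates of each trial,the one that reduces the inertia the most is chosen
centers:returns the initial cluster centers;an n_clusters×n_features ndarray
indices:returns the indices of the chosen centers;a 1×n_clusters ndarray
#Each element is the index of a cluster center in X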
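A brief sketch of k-means++ seeding on made-up data (this function requires a scikit-learn version that provides kmeans_plusplus, 0.24 or later):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

# Toy data (made up): two tight squares of four points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])

centers, indices = kmeans_plusplus(X, n_clusters=2, random_state=0)

print(indices)  # positions of the chosen seeds in X
print(centers)  # the seeds themselves, i.e. X[indices]
```

The returned centers can be passed as the init argument of k_means().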
######################################################################################################################
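Run mean shift clustering using a "flat kernel":[<cluster_centers>,<labels>=]sklearn.cluster.mean_shift(<X>[,bandwidth=None,seeds=None,bin_seeding=False,min_bin_freq=1,cluster_all=True,max_iter=300,n_jobs=None])
#Parameters:the rest as in sklearn.cluster.k_means()
bandwidth:the "bandwidth" of the kernel;a float
#If None,it is estimated with sklearn.cluster.estimate_bandwidth()
seeds:the initial kernel locations;an n_seeds×n_features array-like
bin_seeding:whether to use a binned (discretized) version of the points as the initial kernel locations;a bool
#If True,the initial kernel locations are the locations of the points binned onto a grid whose coarseness corresponds to the bandwidth
#If False,the initial kernel locations are the locations of all points
#Setting it to True speeds up the computation because fewer seeds are initialized;ignored if seeds is not None
min_bin_freq:the minimum number of points a bin must contain to be accepted as a seed;an int
#Bins with fewer points are rejected,which can be used to speed up the computation
cluster_all:whether to cluster all data points (including orphans that are not within any kernel);a bool
#If True,orphans are assigned to the nearest kernel;if False,orphans are labeled -1
cluster_centers:returns all cluster centers;an n_clusters×n_features ndarray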
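A minimal sketch of mean shift on made-up data; bandwidth=1.0 is an illustrative value chosen so that each group fits inside one kernel:

```python
import numpy as np
from sklearn.cluster import mean_shift

# Toy data (made up): two tight squares of four points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])

# With seeds=None and bin_seeding=False every point is used as a seed
cluster_centers, labels = mean_shift(X, bandwidth=1.0)

print(cluster_centers)
print(labels)
```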
######################################################################################################################
Apply clustering to a projection of the "normalized Laplacian":[<labels>=]sklearn.cluster.spectral_clustering(<affinity>[,n_clusters=8,n_components=None,eigen_solver=None,random_state=None,n_init=10,eigen_tol=0.0,assign_labels='kmeans',verbose=False])
#Parameters:the rest as in sklearn.cluster.k_means()
affinity:the "affinity matrix" between samples;a symmetric n_samples×n_samples array-like/symmetric n_samples×n_samples sparse matrix
n_components:the number of eigenvectors used for the "spectral embedding";an int
#If None,n_clusters is used
eigen_solver:the eigenvalue decomposition strategy;one of None/"arpack"/"lobpcg"/"amg"
eigen_tol:the stopping criterion for the eigendecomposition of the Laplacian matrix;a float
#Only used when eigen_solver="arpack"
assign_labels:the strategy used to assign labels in the "embedding space";one of "kmeans"/"discretize"
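A sketch of spectral clustering on a precomputed affinity matrix. Building the affinity with an RBF kernel (gamma=1.0) is an assumption made for illustration; any symmetric, non-negative affinity could be used instead:

```python
import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.metrics.pairwise import rbf_kernel

# Toy data (made up): two Gaussian blobs of 20 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Symmetric n_samples x n_samples affinity matrix with entries in (0, 1]
affinity = rbf_kernel(X, gamma=1.0)

labels = spectral_clustering(affinity, n_clusters=2, random_state=0)
print(labels)
```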
######################################################################################################################
Run "Ward Clustering" based on a "feature matrix":[<children>,<n_connected_components>,<n_leaves>,<parents>,<distances>=]sklearn.cluster.ward_tree(<X>[,connectivity=None,n_clusters=None,return_distance=False])
#Parameters:the rest as in sklearn.cluster.k_means()
connectivity:the "connectivity matrix";a sparse matrix
return_distance:whether to return <distances>;a bool
children:returns the children of each non-leaf node;an (n_nodes-1)×2 ndarray
n_connected_components:returns the number of connected components in the graph;an int
n_leaves:returns the number of leaves in the tree;an int
parents:returns the parent of each node;a 1×n_nodes ndarray/None
#Only returned when a connectivity matrix is specified;otherwise None is returned
distances:returns the distances between the centers of the nodes;a 1×(n_nodes-1) ndarray
#Only returned if return_distance=True
#distances[i] corresponds to a weighted Euclidean distance between the nodes children[i,0] and children[i,1]
#If the nodes refer to leaves of the tree,then distances[i] is their unweighted Euclidean distance
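A minimal sketch of building a Ward tree without a connectivity matrix; the five points are made up:

```python
import numpy as np
from sklearn.cluster import ward_tree

# Toy data (made up): points 0 and 1 are the closest pair (distance 1.0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [10.0, 0.0]])

children, n_connected_components, n_leaves, parents, distances = ward_tree(
    X, return_distance=True
)

print(children)   # row i lists the two nodes merged at step i
print(distances)  # merge distance of each step
print(parents)    # None, because no connectivity matrix was given
```

Without return_distance=True only the first four values are returned.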
3.cluster.bicluster
(1)Introduction:
This module is used for biclustering
(2)Usage:
"Spectral biclustering":class sklearn.cluster.bicluster.SpectralBiclustering([n_clusters=3,method='bistochastic',n_components=6,n_best=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None])
#Parameters:the rest as in sklearn.cluster.k_means()
n_clusters:the number of row and column clusters in the checkerboard structure;an int/a tuple of the form (n_row_clusters,n_column_clusters)
method:the method of normalizing the singular vectors and converting them into "biclusters";one of "bistochastic"/"scale"/"log"
n_components:the number of singular vectors to check;an int
n_best:the number of best singular vectors onto which the data is projected for clustering;an int
svd_method:the algorithm used to find the singular vectors;one of "randomized"/"arpack"
n_svd_vecs:the number of vectors used in calculating the SVD;an int
mini_batch:whether to use mini-batch K-means;a bool
######################################################################################################################
"Spectral Co-Clustering":class sklearn.cluster.bicluster.SpectralCoclustering([n_clusters=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None])
#Parameters:the rest as in class sklearn.cluster.bicluster.SpectralBiclustering()
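A hedged sketch of spectral co-clustering on a made-up block matrix. The class is imported from sklearn.cluster here, which is an assumption about the installed version: recent scikit-learn releases expose it there rather than under sklearn.cluster.bicluster:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy data (made up): a 6x6 matrix with two dense diagonal blocks
data = np.full((6, 6), 0.1)
data[:3, :3] = 5.0
data[3:, 3:] = 5.0

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(data)

print(model.row_labels_)     # bicluster index of each row
print(model.column_labels_)  # bicluster index of each column
```

Rows and columns that share a label form one bicluster.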