Python Third-Party Modules / Machine Learning / Scikit-Learn / Unsupervised Learning 1: Clustering 2

I. cluster
2. Usage
(2) Functions:

执行"近邻传播聚类算法"(Affinity Propagation Clustering):[<cluster_centers_indices>,<labels>,<n_iter>=]sklearn.cluster.affinity_propagation(<S>[,preference=None,convergence_iter=15,max_iter=200,damping=0.5,copy=True,verbose=False,return_n_iter=False,random_state='warn'])
  #参数说明:
	S:指定数据点间的"相似度矩阵"(Matrix of similarities);为n_samples×n_samples array-like
	preference:指定各个数据点的"偏好值/参考度"(preference);float/1×n_samples array-like
	  #偏好值越大的点越可能被选为"聚类中心"(exemplar)
	convergence_iter:指定停止迭代前簇的数量没有变化的迭代次数;int
	max_iter:指定算法的最大迭代次数;int
	damping:指定"阻尼系数"(Damping factor);0.5<=float<=1
	copy:指定是否复制"关联矩阵/亲和度矩阵"(affinity matrix);bool
	  #为False可提高内存效率
	verbose:指定是否输出计算日志;bool
	return_n_iter:指定是否返回迭代次数;bool
	random_state:指定用于初始化聚类中心的随机数;int/RandomState instance/None
	  #Determines random number generation for centroid initialization
	cluster_centers_indices:返回距离中心的索引;1×n_clusters ndarray
	labels:返回各数据点所属的簇;1×n_samples ndarray
	n_iter:返回迭代次数;int
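
A minimal usage sketch (the toy data, the negative-squared-distance similarity matrix, and random_state=0 are illustrative assumptions; the random_state keyword is only accepted in newer scikit-learn versions):

import numpy as np
from sklearn.cluster import affinity_propagation

# Toy data: two well-separated groups of points
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 8.0], [8.2, 7.9]])
# Similarity matrix: negative squared Euclidean distances between samples
S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
cluster_centers_indices, labels = affinity_propagation(S, random_state=0)
print(cluster_centers_indices)  # indices of the exemplars in X
print(labels)                   # cluster assignment of each sample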

######################################################################################################################

执行"DBSCAN提取"(DBSCAN extraction):[<labels_>=]sklearn.cluster.cluster_optics_dbscan(<reachability>,<core_distances>,<ordering>,<eps>)
  #参数说明:同sklearn.cluster.affinity_propagation()
	reachability:指定"可达距离"(Reachability distance);1×n_samples ndarray
	core_distances:Distances at which points become core;1×n_samples ndarray
	ordering:OPTICS ordered point indices;1×n_samples ndarray
	eps:指定近邻的最远距离;float
	  #即class cluster.DBSCAN()的eps参数;距离超出该值的样本不会被视为近邻
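
A minimal sketch that extracts flat DBSCAN-style clusters from a fitted OPTICS model (the toy data and eps=1.5 are illustrative assumptions):

import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 6])  # two blobs
opt = OPTICS(min_samples=5).fit(X)  # builds the reachability graph
labels = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=1.5,
)
print(labels)  # -1 marks noise, other integers are cluster ids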

######################################################################################################################

通过"Xi-steep method"来提取簇:[<labels>,<clusters>=]sklearn.cluster.cluster_optics_xi(<reachability>,<predecessor>,<ordering>,<min_samples>[,min_cluster_size=None,xi=0.05,predecessor_correction=True])
  #参数说明:其他参数同sklearn.cluster.cluster_optics_dbscan()
	predecessor:指定OPTICS之前得到的结果;1×n_samples ndarray
	min_samples:指定被视为聚类中心的数据点的邻域内的最小样本数(包括该点本身);int>1/0<=float<=1
	  #邻域内数据点数少于该值的点不会被视为聚类中心
	min_cluster_size:指定簇中的最小样本数;int>1/0<=float<=1
	xi:指定构成簇的边界的"可达性图"(reachability plot)的最小"陡度"(steepness);0<=float<=1
	predecessor_correction:指定是否根据OPTICS之前得到的结果修正簇;bool
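
A minimal sketch reusing a fitted OPTICS model's attributes (min_samples=5 and xi=0.05 are illustrative assumptions):

import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_xi

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 6])
opt = OPTICS(min_samples=5).fit(X)
labels, clusters = cluster_optics_xi(
    reachability=opt.reachability_,
    predecessor=opt.predecessor_,
    ordering=opt.ordering_,
    min_samples=5,
    xi=0.05,
)
print(clusters)  # [start, end] index pairs of clusters in the OPTICS ordering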

######################################################################################################################

Compute the OPTICS reachability graph:[<ordering_>,<core_distances_>,<reachability_>,<predecessor_>=]sklearn.cluster.compute_optics_graph(<X>,<min_samples>,<max_eps>,<metric>,<p>,<metric_params>,<algorithm>,<leaf_size>,<n_jobs>)
  #Parameters: the other parameters are the same as in sklearn.cluster.cluster_optics_xi()
    X: the feature array; n_samples×n_features ndarray
      If metric="precomputed", the distances between samples; n_samples×n_samples ndarray
    max_eps: the maximum distance between two samples for them to be considered neighbors; float
      #Samples farther apart than this value are not considered neighbors
    metric: the distance metric to use; str/callable
    p: the power of the Minkowski metric used to compute distances between points; float
    metric_params: additional keyword arguments to pass to metric; dict
    algorithm: the algorithm used to find nearest neighbors; "auto"/"ball_tree"/"kd_tree"/"brute"
    leaf_size: the leaf size of the BallTree/cKDTree; int
    n_jobs: the number of parallel jobs to run; int
    ordering_: returns the cluster-ordered list of sample indices; 1×n_samples ndarray
    core_distances_: returns the distance at which each sample becomes a core point; 1×n_samples ndarray
    reachability_: returns the reachability distance of each sample; 1×n_samples ndarray
    predecessor_: returns the predecessor of each sample; 1×n_samples ndarray
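
A minimal sketch that builds the graph directly and then extracts DBSCAN-style labels from it (all argument values shown are illustrative assumptions):

import numpy as np
from sklearn.cluster import compute_optics_graph, cluster_optics_dbscan

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 6])
ordering_, core_distances_, reachability_, predecessor_ = compute_optics_graph(
    X, min_samples=5, max_eps=np.inf, metric="minkowski", p=2,
    metric_params=None, algorithm="auto", leaf_size=30, n_jobs=None,
)
labels = cluster_optics_dbscan(
    reachability=reachability_, core_distances=core_distances_,
    ordering=ordering_, eps=1.5,
)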

######################################################################################################################

执行"DBSCAN聚类算法"(DBSCAN Clustering):[<core_samples>,<labels>=]sklearn.cluster.dbscan(<X>[,eps=0.5,min_samples=5,metric='minkowski',metric_params=None,algorithm='auto',leaf_size=30,p=2,sample_weight=None,n_jobs=None])
  #参数说明:eps同sklearn.cluster.compute_optics_graph()的max_eps
  #        同sklearn.cluster.affinity_propagation()
  #        其他参数同sklearn.cluster.compute_optics_graph()
	X:指定特征数组;为n_samples×n_features ndarray/n_samples×n_features sparse CSR matrix 
	  若metric="precomputed",则指定样本间的距离;为n_samples×n_samples ndarray/n_samples×n_samples sparse CSR matrix
	sample_weight:指定各样本的权重;1×n_samples array-like
	core_samples:返回"中心样本"(core samples);1×n_core_samples ndarray
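
A minimal usage sketch (the toy data and eps=1.0 are illustrative assumptions):

import numpy as np
from sklearn.cluster import dbscan

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 6])
core_samples, labels = dbscan(X, eps=1.0, min_samples=5)
print(core_samples)  # indices of the core samples
print(labels)        # -1 marks noise points that belong to no cluster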

######################################################################################################################

Estimate the bandwidth used by the mean-shift algorithm:[<bandwidth>=]sklearn.cluster.estimate_bandwidth(<X>[,quantile=0.3,n_samples=None,random_state=0,n_jobs=None])
  #Parameters: the other parameters are the same as in sklearn.cluster.affinity_propagation()
    X: the input data; n_samples×n_features array-like/n_samples×n_features sparse matrix
    quantile: the quantile of the pairwise distances to use (0.5 means the median of all pairwise distances is used); 0<=float<=1
    n_samples: the number of samples to use; int
    bandwidth: returns the bandwidth; float
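
A minimal usage sketch (the toy data and quantile=0.3 are illustrative assumptions):

import numpy as np
from sklearn.cluster import estimate_bandwidth

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 6])
bandwidth = estimate_bandwidth(X, quantile=0.3, random_state=0)
print(bandwidth)  # a float, typically passed to mean_shift()/MeanShift()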

######################################################################################################################

执行"K-均值聚类算法"(K-Means clustering):[<centroid>,<labels>,<inertia>,<best_n_iter>=]sklearn.cluster.k_means(<X>,<n_clusters>[,sample_weight=None,init='k-means++',precompute_distances='deprecated',n_init=10,max_iter=300,verbose=False,tol=1e-4,random_state=None,copy_x=True,n_jobs='deprecated',algorithm="auto",return_n_iter=False])
  #参数说明:同cluster.estimate_bandwidth()
  #        sample_weight/同sklearn.cluster.dbscan()
  #        其他参数同sklearn.cluster.affinity_propagation()
	n_clusters:指定簇/聚类中心的数量;int
	init:指定初始聚类中心;"k-means++"/"random"/n_clusters×n_features array-like/callable
	precompute_distances:指定是否预计算距离;"auto"/bool
	  #预计算距离将降低计算时间,但会增加内存开销
	n_init:指定算法重复次数(每次均重新初始化聚类中心);int
	  #将返回最好的结果
	tol:指定相对终止精度;float
	  #2次迭代间的改进小于此值时终止计算
	copy_x:When pre-computing distances it is more numerically accurate to center the data first
	       If copy_x is True (default), then the original data is not modified
	       If False, the original data is modified, and put back before the function returns
	       but small numerical differences may be introduced by subtracting and then adding the data mean
	       Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False
	       If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False
	n_jobs:指定用于计算的OpenMP线程数;int
	algorithm:指定使用的算法类型;"auto"/"full"/"elkan"
	centroid:返回聚类中心;为n_clusters × n_features ndarray
	inertia:返回全部样本到其聚类中心的距离的和;float
	  #The final value of the inertia criterion (sum of squared distances to the closest centroid for all observations in the training set)
	best_n_iter:返回最优结果对应的迭代次数;int
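
A minimal usage sketch (the toy data, n_clusters=2, and random_state=0 are illustrative assumptions):

import numpy as np
from sklearn.cluster import k_means

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 6])
centroid, labels, inertia = k_means(X, n_clusters=2, random_state=0)
# With return_n_iter=True a fourth value, best_n_iter, is also returned
print(centroid)  # 2 x 2 array of cluster centers
print(inertia)   # sum of squared distances to the closest centroid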

######################################################################################################################

通过"k-means++算法"(K-Means Plus Plus Algorithm)来选择初始聚类中心:[<centers>,<indices>=]sklearn.cluster.kmeans_plusplus(<X>,<n_clusters>[,x_squared_norms=None,random_state=None,n_local_trials=None])
  #参数说明:其他参数同sklearn.cluster.k_means()
	x_squared_norms:指定各数据点处的"平方欧氏范数/L2范数"(Squared Euclidean norm/L2 norm);1×n_samples array-like
	n_local_trials:指定每个中心(1个除外)以不同初始值进行尝试的次数;int
	centers:返回初始聚类中心;为n_clusters×n_features ndarray
	indices:返回最终选择的聚类中心;1×n_clusters ndarray
	  #元素为各聚类中心在中的索引
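
A minimal usage sketch (kmeans_plusplus is available in scikit-learn 0.24 and later; the toy data and n_clusters=2 are illustrative assumptions):

import numpy as np
from sklearn.cluster import kmeans_plusplus  # scikit-learn >= 0.24

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 6])
centers, indices = kmeans_plusplus(X, n_clusters=2, random_state=0)
print(centers)  # initial centers chosen by k-means++
print(indices)  # positions of those centers in X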

######################################################################################################################

Perform mean-shift clustering using a flat kernel:[<cluster_centers>,<labels>=]sklearn.cluster.mean_shift(<X>[,bandwidth=None,seeds=None,bin_seeding=False,min_bin_freq=1,cluster_all=True,max_iter=300,n_jobs=None])
  #Parameters: the other parameters are the same as in sklearn.cluster.k_means()
    bandwidth: the bandwidth of the kernel; float
    seeds: the initial kernel locations; n_seeds×n_features array-like
    bin_seeding: if True, the initial kernel locations are the locations of the discretized version of the points,
                where points are binned onto a grid whose coarseness corresponds to the bandwidth
                If False, the initial kernel locations are the locations of all points
      #Setting it to True speeds up the algorithm because fewer seeds need to be initialized; ignored if seeds is not None
    min_bin_freq: the minimum number of points per bin for that bin to be accepted as a seed; int
      #Bins with fewer points than this are not accepted, which can be used to speed up the algorithm
    cluster_all: whether to cluster all points, including orphan points that fall within no kernel; bool
    cluster_centers: returns all cluster centers; n_clusters×n_features ndarray
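
A minimal usage sketch that first estimates the bandwidth (the toy data and quantile=0.3 are illustrative assumptions):

import numpy as np
from sklearn.cluster import estimate_bandwidth, mean_shift

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 6])
bandwidth = estimate_bandwidth(X, quantile=0.3)
cluster_centers, labels = mean_shift(X, bandwidth=bandwidth, bin_seeding=True)
print(cluster_centers)  # one row per discovered cluster center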

######################################################################################################################

Apply clustering to a projection of the normalized Laplacian:[<labels>=]sklearn.cluster.spectral_clustering(<affinity>[,n_clusters=8,n_components=None,eigen_solver=None,random_state=None,n_init=10,eigen_tol=0.0,assign_labels='kmeans',verbose=False])
  #Parameters: the other parameters are the same as in sklearn.cluster.k_means()
    affinity: the affinity matrix between samples; symmetric n_samples×n_samples array-like/symmetric n_samples×n_samples sparse matrix
    n_components: the number of eigenvectors to use for the spectral embedding; int
    eigen_solver: the eigenvalue decomposition strategy to use; "arpack"/"lobpcg"/"amg"
    eigen_tol: the stopping criterion for the eigendecomposition of the Laplacian matrix; float
      #Only used when eigen_solver="arpack"
    assign_labels: the strategy used to assign labels in the embedding space; "kmeans"/"discretize"
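
A minimal sketch that builds an affinity matrix first (the RBF kernel affinity, gamma=1.0, and the toy data are illustrative assumptions):

import numpy as np
from sklearn.cluster import spectral_clustering
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 6])
affinity = rbf_kernel(X, gamma=1.0)  # symmetric n_samples x n_samples affinity matrix
labels = spectral_clustering(affinity, n_clusters=2, random_state=0)
print(labels)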

######################################################################################################################

基于"特征矩阵"(Feature Matrix)执行"Ward聚类算法"(Ward Clustering):[<children>,<n_connected_components>,<n_leaves>,<parents>,<distances>=]sklearn.cluster.ward_tree(<X>[,connectivity=None,n_clusters=None,return_distance=False])
  #参数说明:其他参数同sklearn.cluster.k_means()
    connectivity:指定"连接矩阵"(connectivity matrix);为sparse matrix
    return_distance:指定是否返回<distances>;bool
    children:返回各非叶节点的子节点;(n_nodes-1)×2 ndarray
    n_connected_components:返回图中连通部件的数量;int
    n_leaves:返回叶节点的数量;int
    parents:返回各节点的父节点;1×n_nodes ndarray/None
      #仅在指定了连接矩阵时才返回,否则返回None
    distances:返回簇中心间的距离;1×(n_nodes-1) ndarray
      #distances[i] corresponds to a weighted euclidean distance between the nodes children[i,1] and children[i,2]
      #If the nodes refer to leaves of the tree, then distances[i] is their unweighted euclidean distance
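
A minimal usage sketch (the toy data is an illustrative assumption; with connectivity=None, parents is returned as None):

import numpy as np
from sklearn.cluster import ward_tree

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2), rng.randn(10, 2) + 6])
children, n_connected_components, n_leaves, parents, distances = ward_tree(
    X, return_distance=True
)
print(children.shape)  # (n_nodes-1, 2): the two nodes merged at each step
print(distances)       # merge distance for each step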

3. cluster.bicluster
(1) Overview:

This module is used for biclustering

(2) Usage:

"光谱双聚类算法"(Spectral biclustering):class sklearn.cluster.bicluster.SpectralBiclustering([n_clusters=3,method='bistochastic',n_components=6,n_best=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None])
  #参数说明:其他参数同sklearn.cluster.k_means()
	n_clusters:指定棋盘结构中的行和列的簇数;int/tuple,格式为(n_row_clusters,n_column_clusters)
	method:指定将奇异向量归一化并转换为"双聚类"(bicluster)的方法;"bistochastic"/"scale"/"log"
	n_components:指定要检查的奇异向量数;int
	n_best:指定将数据投影到其上的最好的奇异向量的个数;int
	  #Number of best singular vectors to which to project the data for clustering
	svd_method:指定用于寻找奇异向量的算法;"randomized"/"arpack"
	n_svd_vecs:指定用于SVD的向量数;int
	mini_batch:指定是否使用小批量K-均值算法;bool
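
A minimal usage sketch on synthetic checkerboard data (the make_checkerboard parameters are illustrative assumptions; note that in newer scikit-learn releases the class is imported from sklearn.cluster directly, as the sklearn.cluster.bicluster path has been removed):

from sklearn.datasets import make_checkerboard
from sklearn.cluster import SpectralBiclustering  # sklearn.cluster.bicluster.SpectralBiclustering in old releases

# Synthetic data with a 3 x 2 checkerboard structure
data, rows, cols = make_checkerboard(shape=(30, 30), n_clusters=(3, 2),
                                     noise=0.5, random_state=0)
model = SpectralBiclustering(n_clusters=(3, 2), method="bistochastic", random_state=0)
model.fit(data)
print(model.row_labels_)     # row cluster assigned to each row
print(model.column_labels_)  # column cluster assigned to each column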

######################################################################################################################

"光谱协同聚类算法"(Spectral Co-Clustering):class sklearn.cluster.bicluster.SpectralCoclustering([n_clusters=3,svd_method='randomized',n_svd_vecs=None,mini_batch=False,init='k-means++',n_init=10,n_jobs=1,random_state=None])
  #参数说明:其他参数同class sklearn.cluster.bicluster.SpectralBiclustering()
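
A minimal usage sketch on synthetic block-diagonal data (the make_biclusters parameters are illustrative assumptions; the same import-path note as above applies):

from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering  # sklearn.cluster.bicluster.SpectralCoclustering in old releases

# Synthetic data containing 3 block-diagonal biclusters
data, rows, cols = make_biclusters(shape=(30, 30), n_clusters=3,
                                   noise=5, random_state=0)
model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)
print(model.row_labels_)
print(model.column_labels_)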
