Single-cell clustering methods

Partitioning-based clustering

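k-means
Paper link

library(mclust)  # adjustedRandIndex()
library(Rtsne)

# data: genes x cells expression matrix; meta$label: reference cell labels
res <- kmeans(t(data), centers = 9)
adjustedRandIndex(res$cluster, meta$label)

# t-SNE embedding of the cells, coloured by k-means cluster
tsne_out <- Rtsne(t(data))
plot(tsne_out$Y, col = topo.colors(9)[res$cluster], pch = 20)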

SAIC: combines k-means and ANOVA within an iterative clustering procedure.

SCUBA: k-means; uses gap statistics to identify bifurcation events.

scVDMC: single-cell variance-driven multi-task clustering.

pcaReduce
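Paper link

library(pcaReduce)
library(mclust)  # adjustedRandIndex()

# nbt: number of repeated runs; q: number of principal components to start from,
# so each run returns partitions for K = q + 1 down to K = 2 clusters
res <- PCAreduce(t(data),
                 nbt = 1,
                 q = 7,
                 method = "S")  # "S": sampling-based merging
res[[1]]
# first column of the first run = partition with q + 1 clusters
adjustedRandIndex(res[[1]][, 1], meta$label)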

k-medoids

library(fpc)  # pamk()

res <- pamk(data = t(data), krange = 7)  # PAM (k-medoids) with k = 7
adjustedRandIndex(res$pamobject$clustering, meta$label)

Hierarchical clustering

BackSPIN: a two-way biclustering algorithm.

cellTree: builds a minimum spanning tree.

CIDR: missing-value (dropout) imputation.
Paper link

#rows correspond to features (genes, transcripts, etc) and the columns correspond to cells
library(cidr)
load("/Biase.Rdata")

cellType <- factor(meta$label)
types <- levels(cellType)

scols <-
  c("red",
    "blue",
    "green",
    "brown",
    "pink",
    "purple",
    "darkgreen",
    "grey")
# one colour per cell, looked up by cell type
cols <- scols[as.integer(cellType)]

# nPC: number of principal coordinates (4 by default)
# nCluster: number of clusters
sdata <- as.matrix(data)
sdata <- scDataConstructor(sdata)            # create the scData object
sdata <- determineDropoutCandidates(sdata)   # determine the dropout candidates
sdata <- wThreshold(sdata)                   # determine the imputation weighting threshold
sdata <- scDissim(sdata)                     # compute the dissimilarity matrix

sdata <- scPCA(sdata)  # principal coordinates analysis (PCoA)
sdata <- nPC(sdata)    # determine the number of principal coordinates
n_pc <- sdata@nPC      # selected number of principal coordinates

nCluster(sdata)        # plot used to choose the number of clusters

sdata <- scCluster(sdata, nPC = n_pc)  # CIDR hierarchical clustering
adjustedRandIndex(sdata@clusters, meta$label)

sdata@nCluster
plot(
  sdata@PC[, c(1, 2)],
  col = cols,
  pch = sdata@clusters,
  main = "CIDR",
  xlab = "PC1",
  ylab = "PC2"
)

RCA: reference component analysis
Paper link

Mixture models

GMM: Gaussian mixture model

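library(mclust)  # Mclust(), adjustedRandIndex()

pc_res <- prcomp(t(data))$x              # PCA with cells as rows
tmp_pca_mat <- pc_res[, 1:10]            # keep the first 10 principal components
res <- Mclust(tmp_pca_mat, G = 2:10)     # number of components selected by BIC
clusterid <- apply(res$z, 1, which.max)  # most probable component per cell
adjustedRandIndex(clusterid, meta$label)

TSCAN: uses a GMM and a minimum spanning tree (MST) to infer pseudotime ordering.
Paper link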

Graph-based clustering

TCC: transcript compatibility counts.

  1. Build the affinity matrix.
  2. Compute the Jensen-Shannon distance.
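
To make the Jensen-Shannon step concrete, here is a minimal base-R sketch. The toy count matrix `tcc`, the helper name `js_distance`, and the `exp(-js)` conversion from distance to affinity are assumptions made for illustration; they are not taken from the TCC paper or its code.

```r
# Jensen-Shannon distance between two non-negative count vectors.
# With log base 2 the distance lies between 0 and 1.
js_distance <- function(p, q) {
  p <- p / sum(p)                     # normalise counts to probabilities
  q <- q / sum(q)
  m <- (p + q) / 2                    # mixture distribution
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  sqrt(max(0, (kl(p, m) + kl(q, m)) / 2))
}

# Hypothetical toy data: rows = cells, columns = equivalence classes
tcc <- rbind(cell1 = c(10, 0, 5, 1),
             cell2 = c(9, 1, 4, 2),
             cell3 = c(0, 12, 0, 7))

n <- nrow(tcc)
js <- matrix(0, n, n, dimnames = list(rownames(tcc), rownames(tcc)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    js[i, j] <- js_distance(tcc[i, ], tcc[j, ])
  }
}

# One possible distance-to-affinity conversion (an assumption, not the paper's choice)
affinity <- exp(-js)
print(round(js, 3))
```

The resulting matrix can be passed to any clustering method that accepts a precomputed distance or affinity matrix.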

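SIMLR: learns a similarity measure from single-cell RNA-seq data for dimensionality reduction, clustering, and visualization.
Paper link

library(SIMLR)
library(Seurat)
library(mclust)  # adjustedRandIndex()

# Keep the Seurat object separate so that `data` stays a genes x cells matrix.
# ElbowPlot(seu) needs a PCA reduction, so call it only after RunPCA() on `seu`.
seu <- CreateSeuratObject(data)

SIMLR_res <- SIMLR(as.matrix(data), c = 3)  # c: number of clusters
adjustedRandIndex(SIMLR_res$y$cluster, meta$label)

# 2-D embedding coloured by reference label
plot(SIMLR_res$ydata,
     col = topo.colors(7)[as.factor(meta$label)],
     pch = 20)

heatmap(SIMLR_res$S)  # learned cell-to-cell similarity matrix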

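SNN-Cliq: clique detection on a shared-nearest-neighbour (SNN) graph.

① Compute the primary similarity between data points (Euclidean distance).

② From the similarity matrix, list the k nearest neighbours (KNN) of each data point.

③ Compute a secondary similarity matrix from the neighbours shared by each pair of data points (SNN).

④ Build the SNN graph, in which nodes are data points and edges carry the similarity between them.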
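
The four steps can be sketched in base R as follows. This is a simplified illustration on hypothetical toy data: the secondary similarity here is just the number of shared neighbours, whereas SNN-Cliq itself uses a rank-based weighting and then searches the graph for quasi-cliques, neither of which is reproduced here.

```r
set.seed(42)
# Hypothetical toy data: two groups of 10 points (rows = cells, columns = features)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 8), ncol = 2))
k <- 3
n <- nrow(x)

# Step 1: primary similarity from Euclidean distances
d <- as.matrix(dist(x))

# Step 2: kNN list of each point (a point counts as its own first neighbour)
knn <- t(apply(d, 1, function(row) order(row)[1:k]))

# Step 3: secondary similarity = number of shared neighbours
snn <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    snn[i, j] <- snn[j, i] <- length(intersect(knn[i, ], knn[j, ]))
  }
}

# Step 4: SNN graph as a weighted edge list (one row per connected pair)
edges <- which(snn > 0 & upper.tri(snn), arr.ind = TRUE)
edge_list <- data.frame(from = edges[, 1],
                        to = edges[, 2],
                        weight = snn[edges])
head(edge_list)
```

Pairs with no shared neighbour get no edge, so well-separated groups end up in different connected components of the graph.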

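Louvain: clusters cells with a community detection algorithm. A network is first built from the scRNA-seq data, with nodes representing cells and edges representing the similarity between cells; the community detection algorithm then partitions this network. The result depends heavily on how the similarity network is constructed.
Paper link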
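
As a rough illustration of this workflow, the sketch below builds a symmetric kNN graph from hypothetical toy data and partitions it with `cluster_louvain()` from the igraph package. It assumes igraph is installed; the value of `k` and the unweighted kNN construction are arbitrary choices made here, and a different graph construction can give a different partition.

```r
library(igraph)

set.seed(1)
# Hypothetical toy data: two groups of 20 cells (rows = cells, columns = features)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
n <- nrow(x)
k <- 5                                   # neighbours per cell (assumed value)

# Build the similarity network: connect each cell to its k nearest neighbours
d <- as.matrix(dist(x))
adj <- matrix(0, n, n)
for (i in seq_len(n)) {
  nn <- order(d[i, ])[2:(k + 1)]         # position 1 is the cell itself
  adj[i, nn] <- 1
}
adj <- pmax(adj, t(adj))                 # make the graph undirected

# Partition the network with Louvain community detection
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
comm <- cluster_louvain(g)
mem <- as.integer(membership(comm))
table(mem)
```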

Density-based clustering

DBSCAN

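① Start from an arbitrary unvisited data point x and find all neighbouring points within radius eps.

② If the eps-neighbourhood of x contains at least minPts points, a new cluster is started and x becomes its first core point. Otherwise x is labelled as noise (it may later be absorbed into a cluster as a border point). In either case x is marked as "visited".

③ For each core point x in the new cluster, all points in its eps-neighbourhood are assigned to the same cluster. The same procedure is then repeated for every point newly added to the cluster.

④ Repeat steps ② and ③ until all points of the cluster have been determined, that is, until every point in the eps-neighbourhoods of the cluster has been visited and labelled.

⑤ Once the cluster is complete, continue with the next unvisited point, which yields either a new cluster or noise. Repeat until every point has been marked as visited, at which point all points have been clustered.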
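
The steps above can be sketched in base R as follows. This is a minimal illustration rather than an optimized implementation: the function name `dbscan_sketch`, the toy data, and the parameter values are all hypothetical, and the full distance matrix is computed up front, which only suits small inputs.

```r
# Minimal DBSCAN sketch: returns one label per row of x (0 = noise).
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(x)
  cluster <- rep(NA_integer_, n)           # NA = not yet visited
  current <- 0L
  for (i in seq_len(n)) {
    if (!is.na(cluster[i])) next           # already visited
    neighbours <- which(d[i, ] <= eps)     # eps-neighbourhood, includes i
    if (length(neighbours) < min_pts) {
      cluster[i] <- 0L                     # noise for now; may become a border point
      next
    }
    current <- current + 1L                # i is a core point: start a new cluster
    cluster[i] <- current
    queue <- setdiff(neighbours, i)
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (!is.na(cluster[j]) && cluster[j] > 0L) next  # already in a cluster
      was_noise <- !is.na(cluster[j])      # previously labelled noise, so not a core point
      cluster[j] <- current
      if (was_noise) next                  # border point: do not expand from it
      nb_j <- which(d[j, ] <= eps)
      if (length(nb_j) >= min_pts) {       # j is a core point: expand the cluster
        queue <- union(queue, nb_j)
      }
    }
  }
  cluster
}

set.seed(1)
# Hypothetical toy data: two tight groups of 20 points plus one distant outlier
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
           c(20, 20))
labels <- dbscan_sketch(x, eps = 1, min_pts = 5)
table(labels)
```

Because a border point keeps the label of the first cluster that reaches it, the result can depend on the order in which points are visited.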

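library(dbscan)
library(mclust)  # adjustedRandIndex()

kNNdistplot(t(data), k = 5)  # the "knee" of this curve suggests a value for eps
res <- dbscan::dbscan(t(data), minPts = 5, eps = 340)  # eps is dataset-specific

res$cluster  # 0 marks noise points
adjustedRandIndex(res$cluster, meta$label)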

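GiniClust: discovers rare subpopulations.
Paper link

Monocle
Paper link

Density peak clustering: works with the distances between data points rather than a density threshold, and assumes that cluster centres are local maxima of the data-point density.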
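
A minimal base-R sketch of this idea follows. For each point it computes a local density `rho` and the distance `delta` to the nearest point of higher density; points where both are large are taken as cluster centres. The toy data, the cutoff `dc`, the Gaussian kernel, and the choice of two centres are assumptions made for illustration.

```r
set.seed(7)
# Hypothetical toy data: two groups of 30 points (rows = cells, columns = features)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))
n <- nrow(x)
d <- as.matrix(dist(x))
dc <- 0.5                                 # cutoff distance (assumed value)

# Local density with a Gaussian kernel; subtract 1 to drop the self-contribution
rho <- rowSums(exp(-(d / dc)^2)) - 1

# delta: distance to the nearest point with higher density
# (for the densest point, the largest distance to any other point)
delta <- sapply(seq_len(n), function(i) {
  higher <- which(rho > rho[i])
  if (length(higher) == 0) max(d[i, ]) else min(d[i, higher])
})

# Cluster centres: the points with the largest rho * delta
n_centres <- 2
centres <- order(rho * delta, decreasing = TRUE)[1:n_centres]

# Assign the remaining points, in order of decreasing density,
# to the cluster of their nearest higher-density neighbour
cluster <- rep(NA_integer_, n)
cluster[centres] <- seq_along(centres)
for (i in order(rho, decreasing = TRUE)) {
  if (!is.na(cluster[i])) next
  higher <- which(rho > rho[i])
  cluster[i] <- cluster[higher[which.min(d[i, higher])]]
}
table(cluster)
```

In practice the centres are usually picked by inspecting a plot of `delta` against `rho` rather than by fixing their number in advance.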

Neural networks

SOM: competitive learning for clustering; trained by stochastic gradient descent; sensitive to parameter tuning (e.g. the learning rate).
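
To show what competitive learning looks like in code, here is a small base-R sketch of a self-organizing map with units on a 1-D grid. At each step one random cell is drawn, its best-matching unit is found, and that unit and its grid neighbours are pulled towards the cell. The toy data, the number of units, and the decay schedules for the learning rate and neighbourhood width are all assumptions chosen for illustration.

```r
set.seed(3)
# Hypothetical toy data: two groups of 30 points (rows = cells, columns = features)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

n_units <- 4                                       # units on a 1-D grid
w <- x[sample(nrow(x), n_units), , drop = FALSE]   # initialise weights from the data
n_iter <- 2000
lr0 <- 0.5                                         # initial learning rate
sigma0 <- 1                                        # initial neighbourhood width

for (iter in seq_len(n_iter)) {
  frac <- 1 - (iter - 1) / n_iter                  # decays linearly from 1 towards 0
  lr <- lr0 * frac
  sigma <- max(sigma0 * frac, 0.1)
  xi <- x[sample(nrow(x), 1), ]                    # one random cell per update
  bmu <- which.min(colSums((t(w) - xi)^2))         # best-matching unit
  h <- exp(-(seq_len(n_units) - bmu)^2 / (2 * sigma^2))  # neighbourhood on the grid
  # move every unit towards xi, scaled by learning rate and neighbourhood
  w <- w + lr * h * (matrix(xi, n_units, ncol(x), byrow = TRUE) - w)
}

# Cluster label of each cell = index of its best-matching unit
som_cluster <- apply(x, 1, function(xi) which.min(colSums((t(w) - xi)^2)))
table(som_cluster)
```

Changing `lr0` or the decay schedule can change which units win which cells, which is the parameter sensitivity noted above.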

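SCRAT: single-cell R-analysis tools; visualizes 2-D heatmaps that represent the correlation between genes across single cells.

SOMSC: compresses high-dimensional gene expression data to two dimensions for cellular state transition identification and pseudotemporal ordering of cells.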

Ensemble clustering(consensus clustering)

SC3
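Paper link

library(SingleCellExperiment)
library(SC3)
library(scater)  # runPCA(), plotPCA()
library(mclust)  # adjustedRandIndex()

sce <- SingleCellExperiment(assays = list(counts = as.matrix(data),
                                          logcounts = log2(as.matrix(data) + 1)))

# define feature names in feature_symbol column
rowData(sce)$feature_symbol <- rownames(sce)
# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]

sce <- runPCA(sce)

# fixed number of clusters
res <- sc3(sce, ks = 3)
# alternative: let SC3 estimate k (stored separately so `res` keeps the k = 3 run)
res_est <- sc3(sce, k_estimator = TRUE)

# plots must use a k that was actually computed
sc3_plot_consensus(res, k = 3)
sc3_plot_silhouette(res, k = 3)

adjustedRandIndex(res$sc3_3_clusters, meta$label)

plotPCA(res, colour_by = "sc3_3_clusters")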

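Random-forest-based methods

RAFSIL: first constructs features from the data and then learns cell-to-cell similarities. It can be used for typical exploratory data analysis tasks such as dimensionality reduction, visualization, and clustering.
Paper link

library(RAFSIL)
library(mclust)  # adjustedRandIndex()

# embedding_data: input matrix; label: reference cell labels
# RAFSIL1 variant
cluster_result <- RAFSIL(data = embedding_data,
                         NumC = 6,
                         method = "RAFSIL1")$lab
# RAFSIL2 variant; this overwrites the RAFSIL1 result above
cluster_result <- RAFSIL(data = t(embedding_data),
                         NumC = 6,
                         method = "RAFSIL2")$lab
final_ARI <- adjustedRandIndex(cluster_result, label)
print(final_ARI)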

Others

LAK
Paper link

library(mclust)                # adjustedRandIndex()
library(SingleCellExperiment)  # assays(), colData()

setwd("/LAK-master")
source("LAK.R")

#Biase <-  readRDS("Single Cell Data/biase.rds")
yan <- readRDS("/yan.rds")
m <- assays(yan)[[1]][, -(50:56)]  # drop cells 50-56
LAK_ann <- LAK(m, 3)

# reference labels for the same cells, recoded as integers
yan_ann <- colData(yan)$cell_type1[-(50:56)]
yan_ann_numeric <- match(yan_ann, names(table(yan_ann)))
adjustedRandIndex(LAK_ann[[1]]$Cs, yan_ann_numeric)
