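In early January 2021 a new method for inferring CNVs in tumor cells was published: copyKAT. Like earlier tools, it uses single-cell transcriptome data to infer the chromosomal ploidy of each cell, and from that whether the cell is normal (diploid) or tumor (aneuploid). It can also cluster the tumor cells further to identify distinct subpopulations.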
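Let's start with the principle and the workflow figure. First, gene expression is normalized and its variance is stabilized (a). Unlike inferCNV (see my earlier post: https://www.jianshu.com/p/1fa1fd4f97ff), copyKAT can automatically identify diploid cells to use as the normal reference (b). For each non-normal cell, it uses MCMC to find CNV breakpoints and derive segments (c). Normal and tumor cells can then be separated because their gene expression distributions differ (d). Tumor cells can usually be clustered further into subpopulations (e).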
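To make step (a) more concrete, here is a minimal sketch in base R. It assumes, based on my reading of the copyKAT paper and code, that the variance-stabilizing step is a log Freeman-Tukey transformation followed by centering each cell. Treat that as an assumption rather than the package's exact implementation. The toy count matrix and the helper names `ft_transform` and `center_cells` are made up for illustration.

```r
# Toy UMI count matrix: genes in rows, cells in columns (made-up numbers).
set.seed(1)
counts <- matrix(rpois(20 * 6, lambda = 5), nrow = 20,
                 dimnames = list(paste0("gene", 1:20), paste0("cell", 1:6)))

# Log Freeman-Tukey transformation, a common variance stabilizer for counts.
# (Hypothetical helper; assumed to mirror what copyKAT does internally.)
ft_transform <- function(x) log(sqrt(x) + sqrt(x + 1))

# Center each cell (column) at zero so that cells are comparable.
center_cells <- function(m) sweep(m, 2, colMeans(m), "-")

norm.mat <- center_cells(ft_transform(counts))
round(colMeans(norm.mat), 10)  # column means are 0 after centering
```

After this step every cell has mean zero, so the later steps compare relative expression along the genome rather than sequencing depth.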
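First, import the example data. I used TNBC1, one of the datasets from the paper (download link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4476486). Running copyKAT is very simple; the downside is that it is slow. 1,907 cells took a bit over 5 minutes, so I recommend submitting it as a job on a server.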
library(copykat)
# Raw UMI count matrix: genes in rows, cells in columns
exp.rawdata <- read.table("./copykat/data/GSM4476486_filtered_UMIcount_TNBC1.txt", header=TRUE, sep='\t', check.names=FALSE)
copykat.test <- copykat(rawmat=exp.rawdata,
                        id.type="S",
                        cell.line="no",
                        ngene.chr=5,
                        win.size=25,
                        KS.cut=0.15,
                        sam.name="TNBC1",
                        distance="euclidean",
                        n.cores=4)
pred.test <- data.frame(copykat.test$prediction)
CNA.test <- data.frame(copykat.test$CNAmat)
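Now look at the results. There are two main outputs: 1) the prediction of whether each cell is normal (diploid) or tumor (aneuploid), and 2) the value of each CNV segment in each cell. This is where the difference from inferCNV shows up: the values are based on genomic coordinates rather than gene-level expression. My personal impression is that this quantification is less noisy and gives a cleaner heatmap.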
head(pred.test)
#                 cell.names       copykat.pred
#AAACCTGCACCTTGTC AAACCTGCACCTTGTC aneuploid
#AAACGGGAGTCCTCCT AAACGGGAGTCCTCCT diploid
#AAACGGGTCCAGAGGA AAACGGGTCCAGAGGA aneuploid
#AAAGATGCAGTTTACG AAAGATGCAGTTTACG aneuploid
#AAAGCAACAGGAATGC AAAGCAACAGGAATGC aneuploid
#AAAGCAATCGGAATCT AAAGCAATCGGAATCT aneuploid
head(CNA.test[,1:5])
#  chrom chrompos abspos AAACCTGCACCTTGTC AAACGGGAGTCCTCCT
#1 1 1042457 1042457 -0.03206638 0.03170166
#2 1 1265484 1265484 -0.03206638 0.03170166
#3 1 1519859 1519859 -0.03206638 0.03170166
#4 1 1826619 1826619 -0.03206638 0.03170166
#5 1 2058465 2058465 -0.03206638 0.03170166
#6 1 2280372 2280372 -0.03206638 0.03170166
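Because the CNA matrix is indexed by genomic position, it is easy to summarize. As a rough sketch, the base R code below averages each cell's values within each chromosome. The data frame `cna.toy` is a made-up miniature that only imitates the layout of `CNA.test` (three annotation columns followed by one column per cell); the numbers are not real results.

```r
# Hypothetical miniature of CNA.test: three annotation columns, then one column per cell.
cna.toy <- data.frame(
  chrom    = c(1, 1, 1, 2, 2, 2),
  chrompos = c(100, 200, 300, 100, 200, 300),
  abspos   = c(100, 200, 300, 1100, 1200, 1300),
  cellA    = c(0.30, 0.30, 0.30, -0.20, -0.20, -0.20),
  cellB    = c(0.01, 0.01, 0.01, 0.02, 0.02, 0.02)
)

# Everything that is not an annotation column is a cell.
cell.cols <- setdiff(colnames(cna.toy), c("chrom", "chrompos", "abspos"))

# Average the values of each cell within each chromosome.
chr.mean <- aggregate(cna.toy[, cell.cols], by = list(chrom = cna.toy$chrom), FUN = mean)
print(chr.mean)
```

On the real data the same call should work with `CNA.test` in place of `cna.toy`, giving one row per chromosome and one column per cell.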
Let's draw a heatmap. The normal cells and tumor cells are clearly separated.
my_palette <- colorRampPalette(rev(RColorBrewer::brewer.pal(n = 3, name = "RdBu")))(n = 999)
chr <- as.numeric(CNA.test$chrom) %% 2+1
rbPal1 <- colorRampPalette(c('black','grey'))
CHR <- rbPal1(2)[as.numeric(chr)]
chr1 <- cbind(CHR,CHR)
rbPal5 <- colorRampPalette(RColorBrewer::brewer.pal(n = 8, name = "Dark2")[2:1])
com.preN <- pred.test$copykat.pred
pred <- rbPal5(2)[as.numeric(factor(com.preN))]
cells <- rbind(pred,pred)
col_breaks = c(seq(-1,-0.4,length=50),seq(-0.4,-0.2,length=150),seq(-0.2,0.2,length=600),seq(0.2,0.4,length=150),seq(0.4, 1,length=50))
heatmap.3(t(CNA.test[,4:ncol(CNA.test)]), dendrogram="r",
          distfun = function(x) parallelDist::parDist(x, threads=4, method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D2"),
          ColSideColors=chr1, RowSideColors=cells, Colv=NA, Rowv=TRUE,
          notecol="black", col=my_palette, breaks=col_breaks, key=TRUE,
          keysize=1, density.info="none", trace="none",
          cexRow=0.1, cexCol=0.1, cex.main=1, cex.lab=0.1,
          symm=F, symkey=F, symbreaks=T, cex=1, cex.main=4, margins=c(10,10))
legend("topright", paste("pred.",names(table(com.preN)),sep=""), pch=15,col=RColorBrewer::brewer.pal(n = 8, name = "Dark2")[2:1], cex=0.6, bty="n")
Next, cluster the tumor cells again and draw another heatmap. They split into two groups.
tumor.cells <- pred.test$cell.names[which(pred.test$copykat.pred=="aneuploid")]
tumor.mat <- CNA.test[, which(colnames(CNA.test) %in% tumor.cells)]
hcc <- hclust(parallelDist::parDist(t(tumor.mat),threads =4, method = "euclidean"), method = "ward.D2")
hc.umap <- cutree(hcc,2)
rbPal6 <- colorRampPalette(RColorBrewer::brewer.pal(n = 8, name = "Dark2")[3:4])
subpop <- rbPal6(2)[as.numeric(factor(hc.umap))]
cells <- rbind(subpop,subpop)
heatmap.3(t(tumor.mat), dendrogram="r",
          distfun = function(x) parallelDist::parDist(x, threads=4, method="euclidean"),
          hclustfun = function(x) hclust(x, method="ward.D2"),
          ColSideColors=chr1, RowSideColors=cells, Colv=NA, Rowv=TRUE,
          notecol="black", col=my_palette, breaks=col_breaks, key=TRUE,
          keysize=1, density.info="none", trace="none",
          cexRow=0.1, cexCol=0.1, cex.main=1, cex.lab=0.1,
          symm=F, symkey=F, symbreaks=T, cex=1, cex.main=4, margins=c(10,10))
legend("topright", c("c1","c2"), pch=15,col=RColorBrewer::brewer.pal(n = 8, name = "Dark2")[3:4], cex=0.9, bty='n')
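If parallelDist is not installed, the same subclone clustering idea can be tried with base R alone. The sketch below uses `dist()` instead of `parallelDist::parDist()` (slower on large matrices, but it needs no extra package) on a made-up matrix in which two groups of cells differ in half of the genomic bins. `toy.mat` and `clones` are illustrative names, not objects from the analysis above.

```r
set.seed(42)
# Toy matrix: 40 genomic bins (rows) x 10 tumor cells (columns), made-up values.
toy.mat <- matrix(rnorm(40 * 10, sd = 0.02), nrow = 40,
                  dimnames = list(NULL, paste0("cell", 1:10)))
# Cells 1-5 carry a gain in bins 1-20; cells 6-10 carry a loss there.
toy.mat[1:20, 1:5]  <- toy.mat[1:20, 1:5]  + 0.3
toy.mat[1:20, 6:10] <- toy.mat[1:20, 6:10] - 0.3

# dist() compares rows, so transpose to put cells in rows.
hc <- hclust(dist(t(toy.mat), method = "euclidean"), method = "ward.D2")

# Cut the tree into two groups; the result is a named vector (cell -> cluster).
clones <- cutree(hc, k = 2)
table(clones)
```

The choice of two clusters is made by eye from the heatmap, so it is worth trying other values of `k` on real data.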
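Finally, project the CNV results onto the single-cell clustering to check whether they make sense. Run the standard Seurat workflow, then overlay both the Seurat clusters and the copyKAT groups on the t-SNE.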
library(Seurat)
# Standard Seurat workflow: normalize, select variable genes, scale, PCA, t-SNE, cluster
standard10X = function(dat, nPCs=50, res=1.0, verbose=FALSE){
  srat = CreateSeuratObject(dat)
  srat = NormalizeData(srat, verbose=verbose)
  # Select variable features before scaling, as in the usual Seurat order
  srat = FindVariableFeatures(srat, verbose=verbose)
  srat = ScaleData(srat, verbose=verbose)
  srat = RunPCA(srat, verbose=verbose)
  srat = RunTSNE(srat, dims=seq(nPCs), verbose=verbose)
  srat = FindNeighbors(srat, dims=seq(nPCs), verbose=verbose)
  srat = FindClusters(srat, resolution=res, verbose=verbose)
  return(srat)
}
TNBC1 <- standard10X(exp.rawdata, nPCs=30, res=0.6)
# Match by cell barcode rather than by position, in case copyKAT dropped some cells
TNBC1@meta.data$copykat.pred <- pred.test$copykat.pred[match(rownames(TNBC1@meta.data), pred.test$cell.names)]
TNBC1@meta.data$copykat.tumor.pred <- rep("normal", nrow(TNBC1@meta.data))
TNBC1@meta.data$copykat.tumor.pred[rownames(TNBC1@meta.data) %in% names(hc.umap[hc.umap==1])] <- "tumor cluster 1"
TNBC1@meta.data$copykat.tumor.pred[rownames(TNBC1@meta.data) %in% names(hc.umap[hc.umap==2])] <- "tumor cluster 2"
p1 <- DimPlot(TNBC1, label = T)
p2 <- DimPlot(TNBC1, group.by = "copykat.pred")
p3 <- DimPlot(TNBC1, group.by = "copykat.tumor.pred")
p1 + p2 + p3
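Besides looking at the plots, the agreement between expression clusters and copyKAT calls can be put into numbers with a simple cross-tabulation. The sketch below uses a made-up metadata table, `meta.toy`, whose two columns only imitate the `seurat_clusters` and `copykat.pred` columns of the Seurat metadata; the values are invented for illustration.

```r
# Hypothetical per-cell metadata with a cluster label and a copyKAT call.
meta.toy <- data.frame(
  seurat_clusters = factor(c(0, 0, 1, 1, 2, 2, 2, 0)),
  copykat.pred    = c("aneuploid", "aneuploid", "diploid", "diploid",
                      "aneuploid", "aneuploid", "diploid", "aneuploid")
)

# Count cells per (cluster, prediction) pair.
tab <- table(cluster = meta.toy$seurat_clusters, copykat = meta.toy$copykat.pred)
print(tab)

# Fraction of aneuploid cells within each cluster (rows sum to 1).
frac.aneuploid <- round(prop.table(tab, margin = 1)[, "aneuploid"], 2)
print(frac.aneuploid)
```

Clusters with a fraction close to 0 or 1 are consistent with the copyKAT calls, while mixed clusters deserve a closer look.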
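Judging from the expression of the immune cell marker PTPRC and the tumor (epithelial) cell marker EPCAM, copyKAT correctly distinguishes normal cells from tumor cells.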
FeaturePlot(TNBC1,features=c("PTPRC", "EPCAM"), order = T)
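Finally, the authors raise one caveat: not all tumors carry CNVs. Pediatric tumors and hematological tumors typically have few copy number events, so these methods (copyKAT or inferCNV) are not suitable for identifying tumor cells in them.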
References:
https://www.nature.com/articles/s41587-020-00795-2#data-availability
https://github.com/navinlabcode/copykat
Author: 夜凉如水中
Link: https://www.jianshu.com/p/086e266af03d