inferCNV
https://github.com/broadinstitute/inferCNV/wiki
Infers copy number variation (CNV) in tumor cells from single-cell expression data of tumor tissue.
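- In brief:
(1) Copy number variation is the amplification or deletion of a DNA segment longer than 1 kb on a chromosome, and it strongly affects the expression of the genes involved (raising or lowering it). Malignant tumor cells usually carry copy number variations, which promote tumorigenesis by altering the expression of the affected genes.
(2) In single-cell analysis of tumor data, tumor cells can be annotated by checking whether tumor-related marker genes are highly expressed. inferCNV offers a further check on that annotation from the copy-number angle.
(3) inferCNV runs after the cell types of the tumor microenvironment have been annotated. Using the gene expression of "Normal" cells as the reference, it tests whether genes in particular chromosomal regions of the "tumor"-annotated cells are markedly over- or under-expressed, infers a CNV profile for each cell (which can then be clustered), and so validates the earlier annotation.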
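(4) Computationally, inferCNV has two main stages. The first compares against the Normal cells to obtain the CNV profile of the tumor-like cells (the preliminary infercnv object). The second is optional and comprises denoising and HMM prediction, each of which yields its own set of results.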
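The idea in (3) can be illustrated with a toy calculation in base R. This is a deliberately simplified sketch, not inferCNV's actual implementation: the simulated counts, the 11-gene window, and the helper `smooth_along_genome` are all assumptions made for illustration. It subtracts the average expression of the normal reference from each tumor cell and then smooths along the genome, so that a block of consistently higher expression stands out.

```r
# Toy illustration only: a simplified version of the idea, not inferCNV's real algorithm.
set.seed(1)
n_genes <- 60
# Genes are assumed to be already sorted by chromosomal position.
normal <- matrix(rpois(n_genes * 5, lambda = 10), nrow = n_genes)
tumor  <- matrix(rpois(n_genes * 5, lambda = 10), nrow = n_genes)
# Simulate an amplified region: genes 21-40 have doubled expression in tumor cells.
tumor[21:40, ] <- matrix(rpois(20 * 5, lambda = 20), nrow = 20)

# Log-transform, then subtract the per-gene mean of the normal reference.
log_normal <- log2(normal + 1)
log_tumor  <- log2(tumor + 1)
relative   <- log_tumor - rowMeans(log_normal)

# Smooth along the genome with a centered moving average (window of 11 genes).
smooth_along_genome <- function(x, window = 11) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}
smoothed <- apply(relative, 2, smooth_along_genome)

# Compare the simulated amplified block with the flanking regions.
inside  <- mean(smoothed[26:35, ], na.rm = TRUE)
outside <- mean(smoothed[c(6:15, 46:55), ], na.rm = TRUE)
```

Smoothing over neighboring genes is what turns noisy per-gene differences into a region-level signal: a single gene may be up or down for many reasons, but a long run of genes shifted in the same direction suggests a copy number change.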
1. Input files (3 files)
(1) raw_counts_matrix
A matrix of genes (rows) vs. cells (columns) containing assigned read counts (note: sparse matrices are also supported). In short, this is the usual single-cell count matrix, with low-quality cells already filtered out.
- For a Seurat object, the matrix can be extracted directly:
library(Seurat)
counts_matrix = GetAssayData(seurat_obj, slot="counts")
The matrix can also be provided as a tab-delimited file.
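(2) Sample annotation file
- Cell type information in two columns: the first is the cell ID, the second is the cell type.
- At least two kinds of cell type are needed (based on the annotation results): types known to be normal (immune cells, endothelial cells, ...) and types that may be tumor cells (tumor cells, epithelial cells, fibroblasts, ...).
- Because tumors are heterogeneous across patients, the CNV profiles of tumor cells from different patients can differ widely. The tumor cell types in the second column can therefore be labeled by patient of origin, for example tumor_P1 and tumor_P2 for tumor cells from patients P1 and P2.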
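As a minimal sketch of the per-patient labeling described above, the following base R snippet builds a two-column annotation table and writes it out tab-separated without a header. The metadata table and its column names (`cell_id`, `celltype`, `patient`) are hypothetical; in practice they would come from your own annotated object.

```r
# Hypothetical per-cell metadata; the column names are assumptions for this example.
cells <- data.frame(
  cell_id  = c("c1", "c2", "c3", "c4", "c5"),
  celltype = c("T_cell", "tumor", "tumor", "Endo", "tumor"),
  patient  = c("P1", "P1", "P2", "P2", "P2"),
  stringsAsFactors = FALSE
)

# Append the patient ID to tumor cells only; reference cell types stay pooled.
is_tumor <- cells$celltype == "tumor"
cells$group <- ifelse(is_tumor,
                      paste(cells$celltype, cells$patient, sep = "_"),
                      cells$celltype)

# Two columns, no header, tab-separated: the layout described above.
anno <- cells[, c("cell_id", "group")]
anno_path <- file.path(tempdir(), "annotations_example.txt")
write.table(anno, file = anno_path, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

Only the tumor cells are split by patient here; whether the normal reference cells should also be split is a judgment call that depends on the dataset.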
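(3) Gene coordinate information
- Coordinate annotations for the genes (row names) of the expression matrix.
- Four columns: gene name, chromosome, start position, end position.
- https://data.broadinstitute.org/Trinity/CTAT/cnv/ provides human gene coordinate files; gencode_v19 corresponds to hg19/GRCh37 and gencode_v21 to hg38/GRCh38.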
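Before building the infercnv object it can help to check that the three inputs agree with each other. The snippet below is a sketch using toy data; `check_inputs` is a hypothetical helper written for this example and is not part of the infercnv package.

```r
# Toy inputs; all names and values are made up for illustration.
counts <- matrix(c(5, 0, 3, 2, 1, 0, 0, 4, 7, 1, 0, 2), nrow = 4,
                 dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                                 c("cell1", "cell2", "cell3")))
anno <- data.frame(cell  = c("cell1", "cell2", "cell3"),
                   group = c("Normal", "tumor_P1", "tumor_P1"),
                   stringsAsFactors = FALSE)
gene_order <- data.frame(gene  = c("GENE_A", "GENE_B", "GENE_C"),
                         chr   = c("chr1", "chr1", "chr2"),
                         start = c(100, 5000, 200),
                         end   = c(900, 6500, 1500),
                         stringsAsFactors = FALSE)

# Hypothetical helper: report mismatches between the three inputs.
check_inputs <- function(counts, anno, gene_order) {
  list(
    cells_without_annotation = setdiff(colnames(counts), anno$cell),
    genes_without_position   = setdiff(rownames(counts), gene_order$gene),
    n_groups                 = length(unique(anno$group)),
    bad_coordinates          = gene_order$gene[gene_order$start > gene_order$end]
  )
}
report <- check_inputs(counts, anno, gene_order)
```

In this toy case `GENE_D` has no coordinates, so it is the kind of gene you would expect to drop out of the analysis; the report makes that visible before a long run starts.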
2. R package workflow
2.1 Installing the package
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("infercnv")
library(infercnv)
2.2 Building the infercnv object
- CreateInfercnvObject()
infercnv_obj = CreateInfercnvObject(
  # raw count matrix
  raw_counts_matrix=raw_counts_matrix,
  # cell type annotations
  annotations_file=annotations_file,
  # cell types in annotations_file that are treated as normal cells
  ref_group_names=c("celltype1","celltype2"),
  # gene coordinate information
  gene_order_file=gene_order_file,
  # delimiter used by the two files above
  delim="\t")
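2.3 Computing the CNV signal
- infercnv::run() produces the preliminary infercnv object by default; separate arguments control whether to go on to denoising and to HMM-based CNV prediction.
infercnv_obj = infercnv::run(
  # the infercnv object built in the previous step
  infercnv_obj,
  # gene filtering threshold: mean expression of a gene across all cells
  # use 1 for smart-seq, 0.1 for 10x-genomics
  cutoff=1,
  # output directory (intermediate files from every step are saved here)
  out_dir="output_dir",
  # whether to cluster tumor cells within each annotated group (e.g. per patient, given inter-patient heterogeneity)
  cluster_by_groups=TRUE,
  # whether to denoise
  denoise=TRUE,
  # whether to predict CNV states with the HMM
  HMM=TRUE,
  # number of threads
  num_threads=8)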
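To make the effect of `cutoff` concrete, here is a toy sketch of filtering genes by their mean count across cells. It assumes the filter simply keeps genes whose mean reaches the threshold; the exact comparison inferCNV uses internally may differ, and `filter_by_mean` is a hypothetical helper, not a package function.

```r
# Toy count matrix: 4 genes x 5 cells (values invented for illustration).
counts <- matrix(c(0, 0, 1, 0, 0,
                   3, 2, 4, 1, 5,
                   0, 1, 0, 0, 0,
                   9, 7, 8, 6, 10),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

# Assumed simplification of `cutoff`: keep genes whose mean count reaches the threshold.
filter_by_mean <- function(mat, cutoff) {
  mat[rowMeans(mat) >= cutoff, , drop = FALSE]
}

kept_strict  <- rownames(filter_by_mean(counts, cutoff = 1))    # Smart-seq style threshold
kept_lenient <- rownames(filter_by_mean(counts, cutoff = 0.1))  # 10x style threshold
```

The lower threshold suggested for 10x data reflects its sparser counts: with `cutoff = 1` the two sparsely detected genes in this toy matrix are dropped, while `cutoff = 0.1` keeps them.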
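3. Interpreting the results
inferCNV results are usually shown as a heatmap (the denoised version), which has three parts: the upper heatmap, the lower heatmap, and the legend in the top-left corner.
- Legend (top left): the values (0, 0.5, 1, 1.5, 2) give the expression of genes in a chromosomal region relative to the Normal cells. Red means expression in that region is relatively higher, blue means it is relatively lower. The height of each bar shows how many values fall at that level.
- Upper heatmap: the expression profile of the cells designated as Normal. It should normally be almost entirely white, with no clearly concentrated CNV regions.
- Lower heatmap: the CNV profile of each tumor-like cell, computed relative to the Normal cells in the upper part. The cells are then clustered into a dendrogram by similarity.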
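The legend scale can be read as in the following toy sketch, where 1 means "same as the normal reference". The matrix values, the thresholds 1.1 and 0.9, and the helper `call_state` are all assumptions chosen for illustration; they are not inferCNV defaults, and this is not how the HMM assigns states. Cells are placed in rows here to mirror the heatmap layout.

```r
# Toy matrix on the heatmap scale, where 1 means "same as the normal reference".
expr <- matrix(c(1.00, 1.02, 0.98, 1.01,
                 1.20, 1.25, 1.18, 1.00,
                 0.80, 0.78, 1.00, 0.99),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("normal_cell", "tumor_cell_1", "tumor_cell_2"),
                               paste0("gene", 1:4)))

# Hypothetical thresholds chosen for this example only.
call_state <- function(x, gain = 1.1, loss = 0.9) {
  ifelse(x > gain, "gain", ifelse(x < loss, "loss", "neutral"))
}
states <- call_state(expr)

# Per-cell deviation from the neutral value 1; larger means a stronger CNV signal.
cnv_signal <- rowMeans((expr - 1)^2)
```

A normal cell stays close to 1 everywhere and would appear white in the heatmap, while the two toy tumor cells show a red (gain) and a blue (loss) stretch respectively.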
4. Worked examples
4.1 Example data from the infercnv package
(1) Preparing the input files
setwd("~/inferCNV/example1/")
library(infercnv)
# count expression matrix
mat_dir = system.file("extdata", "oligodendroglioma_expression_downsampled.counts.matrix.gz", package = "infercnv")
raw_counts = data.table::fread(mat_dir, data.table = F)
rownames(raw_counts) = raw_counts[,1]
raw_counts = raw_counts[,-1]
raw_counts[1:4,1:4]
dim(raw_counts)
# cell type annotations
anno_file_dir = system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv")
anno_file = data.table::fread(anno_file_dir, data.table = F, header = F)
head(anno_file)
dim(anno_file)
anno_file$V2 = stringr::str_split(anno_file$V2,"_",simplify = T)[,1]
write.table(anno_file, quote = F, row.names = F, col.names = F,
sep = "\t", file = "anno_file.txt")
# gene coordinate information
gene_order_file_dir = system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv")
gene_order_file = data.table::fread(gene_order_file_dir, data.table = F, header = F)
head(gene_order_file)
table(gene_order_file$V1 %in% rownames(raw_counts))
write.table(gene_order_file, quote = F, row.names = F, col.names = F,
sep = "\t", file = "gene_order_file.txt")
(2) Analysis workflow
infercnv_obj = CreateInfercnvObject(raw_counts_matrix=raw_counts,
annotations_file="anno_file.txt",
delim="\t",
gene_order_file="gene_order_file.txt",
ref_group_names=c("Microglia/Macrophage",
"Oligodendrocytes (non-malignant)"))
infercnv_obj = infercnv::run(infercnv_obj,
cutoff=1,
out_dir="infer_out",
cluster_by_groups=TRUE,
denoise=T,
HMM=F)
4.2 Example data from the scCancer package
(0) Upstream Seurat analysis
# wget http://lifeome.net/software/sccancer/KC-example.tar.gz
library(Seurat)
library(ggplot2)
counts = Read10X("KC-example/filtered_feature_bc_matrix/")
scRNA=CreateSeuratObject(counts = counts)
dim(scRNA)
#[1] 32738 10227
scRNA@assays$RNA@counts[1:4, 1:4]
head(scRNA@meta.data)
feats <- c("nFeature_RNA", "nCount_RNA")
VlnPlot(scRNA, features = feats, pt.size = 0.01, ncol = 2) +
NoLegend()
# filtering criteria
retained_c_umi <- scRNA$nFeature_RNA > 300
retained_f_low <- Matrix::rowSums(scRNA@assays$RNA@counts>0) > 3
# percentage of counts from mitochondrial genes
mito_genes=rownames(scRNA)[grep("^MT-", rownames(scRNA))]
mito_genes
scRNA=PercentageFeatureSet(scRNA, "^MT-", col.name = "percent_mito")
fivenum(scRNA@meta.data$percent_mito)
# percentage of counts from ribosomal genes
ribo_genes=rownames(scRNA)[grep("^Rp[sl]", rownames(scRNA),ignore.case = T)]
ribo_genes
scRNA=PercentageFeatureSet(scRNA, "^RP[SL]", col.name = "percent_ribo")
fivenum(scRNA@meta.data$percent_ribo)
# percentage of counts from hemoglobin (red blood cell) genes
rownames(scRNA)[grep("^Hb[^(p)]", rownames(scRNA),ignore.case = T)]
scRNA=PercentageFeatureSet(scRNA, "^HB[^(P)]", col.name = "percent_hb")
fivenum(scRNA@meta.data$percent_hb)
feats <- c("percent_mito","percent_ribo", "percent_hb")
VlnPlot(scRNA, group.by = "orig.ident", features = feats, pt.size = 0.01, ncol = 3) +
NoLegend()
retained_c_mito <- scRNA$percent_mito < 15
retained_c_ribo <- scRNA$percent_ribo > 3
retained_c_hb <- scRNA$percent_hb < 0.1
retained_c <- retained_c_umi & retained_c_mito & retained_c_ribo & retained_c_hb
table(retained_c)
# FALSE TRUE
# 1549 8678
retained_f <- retained_f_low
table(retained_f)
# FALSE TRUE
# 14636 18102
scRNA_filt=scRNA[retained_f, retained_c]
dim(scRNA_filt)
#[1] 18102 8678
# normalize, find highly variable genes, scale
scRNA=scRNA_filt
scRNA <- NormalizeData(scRNA, normalization.method = "LogNormalize", scale.factor = 10000)
scRNA <- FindVariableFeatures(scRNA, selection.method = "vst", nfeatures = 2000)
scRNA <- ScaleData(scRNA, features = VariableFeatures(scRNA),
vars.to.regress = c("nFeature_RNA","percent_mito"))
top10 <- head(VariableFeatures(scRNA), 10)
plot1=VariableFeaturePlot(scRNA)
LabelPoints(plot = plot1, points = top10, repel = TRUE, size=2.5) +
theme(legend.position = c(0.1,0.8))
scRNA <- RunPCA(scRNA, features = VariableFeatures(scRNA))
ElbowPlot(scRNA, ndims=30, reduction="pca")
pc.num=1:20
# scRNA = RunTSNE(scRNA, dims = pc.num)
# DimPlot(scRNA, reduction = "tsne")
scRNA = RunUMAP(scRNA, dims = pc.num)
DimPlot(scRNA, reduction = "umap")
scRNA <- FindNeighbors(scRNA, dims = pc.num)
scRNA <- FindClusters(scRNA, resolution = c(0.01,0.05,0.1,0.2,0.5,0.7,0.9))
library(clustree)
library(cowplot)
library(patchwork)
clustree(scRNA@meta.data, prefix = "RNA_snn_res.")
Idents(scRNA) = scRNA$RNA_snn_res.0.2
table(scRNA@active.ident)
p_umap = DimPlot(scRNA, reduction = "umap" ,label = T)
p_umap
#B cell : 9
cg=c("CD79A","CD79B","IGKC","CD19","MZB1","MS4A1")
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
#T cell : 6
cg=c("CD3D",'CD3E','TRAC','CD3G','CD2')
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
#Cancer cell / Epithelial : 0,3,4,5,10
cg=c("EPCAM","PAX8","KRT18","CD24","KRT19","SCGB2A2","KRT5","KRT15" )
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
#Myeloid cell: 1
cg=c("CD68","LYZ","MARCO","AIF1","TYROBP","MS4A6A","CD1E","IL3RA","LAMP3")
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
#Endothelial cell: 2,8
cg=c("CLDN5","PECAM1","VWF","FLT1","RAMP2")
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
#Fibroblasts(CAF): 7
cg=c("COL1A1","COL1A2","COL3A1","BGN","DCN","POSTN","C1R")
DotPlot(scRNA, assay = "RNA",
features = cg) + coord_flip() + p_umap
cgs = list(
Epi = c("EPCAM","PAX8","KRT18","CD24","KRT19","SCGB2A2","KRT5","KRT15"),
Meyloid = c("CD68","LYZ","MARCO","AIF1","TYROBP","MS4A6A","CD1E","IL3RA","LAMP3"),
T_cell = c("CD3D",'CD3E','TRAC','CD3G','CD2'),
B_cell = c("CD79A","CD79B","IGKC","CD19","MZB1","MS4A1"),
Endo = c("CLDN5","PECAM1","VWF","FLT1","RAMP2"),
Fibro = c("COL1A1","COL1A2","COL3A1","BGN","DCN","POSTN","C1R")
)
# reorder the cluster levels so that the dot plot groups related clusters together
scRNA@active.ident = factor(scRNA@active.ident, levels = c(0,3,4,5,10,1,6,9,2,8,7))
DotPlot(scRNA, features = cgs, assay = "RNA") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
p_umap + plot_layout(widths = c(2, 1))
scRNA$celltype = ifelse(scRNA@active.ident %in% c(0,3,4,5,10), "Epi",
ifelse(scRNA@active.ident %in% c(1), "Meyloid",
ifelse(scRNA@active.ident %in% c(6), "T_cell",
ifelse(scRNA@active.ident %in% c(9), "B_cell",
ifelse(scRNA@active.ident %in% c(2,8), "Endo","Fibro")))))
Idents(scRNA) = scRNA$celltype
table(scRNA@active.ident)
scRNA@active.ident = factor(scRNA@active.ident, levels = c("Epi","Meyloid","T_cell",
"B_cell","Endo","Fibro"))
p_umap = DimPlot(scRNA, reduction = "umap")
DotPlot(scRNA, features = cgs, assay = "RNA") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
p_umap + plot_layout(widths = c(2, 1))
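The nested ifelse() calls used above to turn cluster IDs into cell types can also be written as a named lookup vector, which is easier to read and to extend. The sketch below uses the same cluster assignments and labels as the annotation above; the `clusters` vector is a made-up stand-in for the per-cell cluster identities.

```r
# Cluster-to-cell-type assignments taken from the marker-gene annotation above.
cluster_to_celltype <- c(
  "0" = "Epi", "3" = "Epi", "4" = "Epi", "5" = "Epi", "10" = "Epi",
  "1" = "Meyloid", "6" = "T_cell", "9" = "B_cell",
  "2" = "Endo", "8" = "Endo", "7" = "Fibro"
)

# Stand-in for the per-cell cluster labels (a factor, as Seurat stores them).
clusters <- factor(c(0, 1, 2, 7, 10, 6, 9, 3, 8))

# Index by name, so convert the factor to character first;
# indexing with the factor itself would use its integer codes instead.
celltype <- unname(cluster_to_celltype[as.character(clusters)])
```

Unlike the nested version, a cluster missing from the lookup yields NA rather than silently falling into the final "Fibro" branch, which makes an incomplete annotation easier to spot.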
(1) Preparing the input files
library(Seurat)
counts_matrix = GetAssayData(scRNA, slot="counts")
dim(counts_matrix)
counts_matrix[1:4,1:4]
gene_pos = data.table::fread("gencode_v19_gen_pos.complete.txt", data.table = F)
head(gene_pos)
gene_pos$SYMBOL = stringr::str_split(gene_pos$V1, "[|]",simplify = T)[,1]
gene_pos = gene_pos[!duplicated(gene_pos$SYMBOL),]
gene_pos = gene_pos[gene_pos$SYMBOL %in% rownames(counts_matrix), ]
table(rownames(counts_matrix) %in% gene_pos$SYMBOL)
counts_matrix = counts_matrix[match(gene_pos$SYMBOL, rownames(counts_matrix)),]
identical(rownames(counts_matrix), gene_pos$SYMBOL)
counts_matrix[1:4,1:4]
head(gene_pos)
gene_pos = gene_pos[,c(5,2,3,4)]
table(duplicated(gene_pos$SYMBOL))
dim(counts_matrix)
save(counts_matrix, file = "counts_matrix.rda")
write.table(gene_pos, file = "gene_pos.txt",
col.names = F, row.names = F, quote = F, sep = "\t")
annotations_file=data.table::fread(system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"))
head(annotations_file)
meta = scRNA@meta.data[,"celltype",drop=F]
head(meta)
meta$ID = rownames(meta)
meta = meta[,c("ID","celltype")]
table(meta$celltype)
head(meta)
meta_sle = subset(meta, celltype %in% c("Endo","Meyloid","Epi"))
head(meta_sle)
write.table(meta_sle, file = "celltype_sle.txt",
col.names = F, row.names = F, quote = F, sep = "\t")
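The key step in preparing these files is aligning the count matrix with the gene position table: keep only the genes present in both, then reorder the matrix rows to follow the table. A toy sketch of that %in% plus match() pattern, with invented gene names and coordinates, looks like this:

```r
# Toy data: the matrix rows are in arbitrary order, gene_pos is in genomic order.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("GENE_C", "GENE_A", "GENE_X", "GENE_B"),
                                 c("cell1", "cell2", "cell3")))
gene_pos <- data.frame(SYMBOL = c("GENE_A", "GENE_B", "GENE_C", "GENE_Y"),
                       chr    = c("chr1", "chr1", "chr2", "chr3"),
                       start  = c(100, 5000, 200, 50),
                       end    = c(900, 6500, 1500, 400),
                       stringsAsFactors = FALSE)

# Keep only genes present in both, then reorder matrix rows to follow gene_pos.
gene_pos <- gene_pos[gene_pos$SYMBOL %in% rownames(counts), ]
counts   <- counts[match(gene_pos$SYMBOL, rownames(counts)), , drop = FALSE]

# Both objects now list the same genes in the same order.
aligned <- identical(rownames(counts), gene_pos$SYMBOL)
```

Genes found in only one of the two inputs (`GENE_X` and `GENE_Y` here) are dropped, and each remaining row keeps its own counts after the reordering.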
(2) Analysis workflow
rm(list = ls())
load("counts_matrix.rda")
library(infercnv)
infercnv_obj = CreateInfercnvObject(raw_counts_matrix=counts_matrix,
annotations_file="celltype_sle.txt",
delim="\t",
gene_order_file="gene_pos.txt",
ref_group_names=c("Endo","Meyloid"))
infercnv_obj = infercnv::run(infercnv_obj,
cutoff=0.1,
# cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
out_dir="CNV_infer2",
cluster_by_groups=TRUE,
denoise=TRUE,
HMM=F)