inferCNV

https://github.com/broadinstitute/inferCNV/wiki
inferCNV infers copy number variation in tumor cells from single-cell expression data of tumor tissue.
In brief:

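(1) Copy number variation (CNV) is the amplification or deletion of DNA segments larger than 1 kb on a chromosome. It strongly affects gene expression, raising it for amplified regions and lowering it for deleted ones. Malignant tumor cells usually carry CNVs, which promote tumorigenesis by altering the expression of the affected genes.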

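(2) When analyzing tumor single-cell data, tumor cell types can be annotated from the expression of tumor-related marker genes (whether they are highly expressed). inferCNV can then further validate that annotation from the perspective of copy number variation.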

[Figure: copy number alterations]

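(3) The inferCNV algorithm starts from a completed cell type annotation of the tumor microenvironment. Using the gene expression of the "Normal" cells as a control, it determines whether gene expression in particular chromosomal regions of the "tumor"-annotated cells is clearly increased or decreased. From this it infers a copy number variation profile for each cell (which can be clustered further), and that profile is used to validate the earlier annotation.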
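The idea in (3) can be illustrated with a toy sketch in base R. This is only a simplified illustration under stated assumptions, not inferCNV's actual implementation: the data are simulated, genes are assumed to be already sorted by genomic position, and the window size of 11 genes is an arbitrary choice.

```r
# Toy illustration only: NOT inferCNV's actual implementation.
set.seed(1)
n_genes <- 60

# Simulated log-expression (genes x cells); genes assumed sorted by position
normal <- matrix(rnorm(n_genes * 5, mean = 5, sd = 0.3), nrow = n_genes)
tumor  <- matrix(rnorm(n_genes * 5, mean = 5, sd = 0.3), nrow = n_genes)
tumor[21:40, ] <- tumor[21:40, ] + 1   # simulate an amplified region (genes 21-40)

# 1) Center each gene on its average across the reference (normal) cells
ref_mean <- rowMeans(normal)
rel_expr <- tumor - ref_mean

# 2) Smooth along the gene order with a moving average to suppress gene-level noise
window <- 11
smooth_cell <- function(x) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}
smoothed <- apply(rel_expr, 2, smooth_cell)

# 3) Compare the simulated amplified region with an unchanged region
amp_signal  <- mean(smoothed[26:35, ])
flat_signal <- mean(smoothed[46:55, ])
c(amplified = amp_signal, unchanged = flat_signal)
```

In this sketch the smoothed signal inside the simulated amplification sits clearly above zero, while the unchanged region stays close to zero.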


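(4) Computationally, inferCNV has two main stages. The first compares against the Normal cells to obtain the CNV profile of the tumor-like cells (the preliminary infercnv object). The second is optional and comprises de-noising and HMM-based prediction, each of which produces its own result.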


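1. Input files (3 files)

(1) raw_counts_matrix

A matrix of genes (rows) vs. cells (columns) containing assigned read counts (note: sparse matrices are also supported). In short, this is the usual single-cell count matrix, with low-quality cells already filtered out.

For a Seurat object, it can be extracted directly as follows: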

library(Seurat)
counts_matrix = GetAssayData(seurat_obj, slot="counts")

The matrix can be provided as a tab-delimited file.
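As a rough sketch of what such a tab-delimited file might look like, the following writes a tiny count matrix to disk and reads it back. The gene and cell names are hypothetical, and the layout (cell names in the header, gene names in the first column) is an assumption based on the genes-by-cells description above.

```r
# Hypothetical toy count matrix: genes in rows, cells in columns
counts <- matrix(c(0, 3, 1, 0, 5, 2), nrow = 3,
                 dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                 c("cell_1", "cell_2")))

# Write as tab-delimited text; col.names = NA adds an empty header cell
# above the row names so that header and rows have the same number of fields
mat_path <- file.path(tempdir(), "counts_example.txt")
write.table(counts, file = mat_path, sep = "\t", quote = FALSE, col.names = NA)

# Read it back as a numeric matrix with gene names as row names
counts_back <- as.matrix(read.table(mat_path, sep = "\t", header = TRUE,
                                    row.names = 1, check.names = FALSE))
counts_back
```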

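(2) Sample annotation file

Cell type information in two columns: the first column is the cell ID, the second is the cell type.
At least two kinds of cell types are required (based on the cell annotation): cell types known to be normal (immune cells, endothelial cells, ...) and cell types that may be tumor cells (tumor cells, epithelial cells, fibroblasts, ...).
In addition, because tumors are heterogeneous between patients, the copy number variation of tumor cells from different patients can differ greatly. The tumor cell types in the second column can therefore be labeled by patient of origin, for example tumor_P1 and tumor_P2 for tumor cells from patients P1 and P2.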
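A minimal sketch of building such an annotation file in base R follows. The cell IDs, cell type labels, and column names are hypothetical and only illustrate the two-column layout and the per-patient tumor labels described above.

```r
# Hypothetical cell IDs and labels, for illustration only
anno <- data.frame(
  cell_id  = c("cell_1", "cell_2", "cell_3", "cell_4", "cell_5", "cell_6"),
  celltype = c("Immune", "Endothelial", "Tumor", "Tumor", "Tumor", "Immune"),
  patient  = c("P1", "P1", "P1", "P2", "P2", "P2"),
  stringsAsFactors = FALSE
)

# Append the patient ID to tumor cells only; normal cells keep a shared label
anno$group <- ifelse(anno$celltype == "Tumor",
                     paste0("tumor_", anno$patient),
                     anno$celltype)

# Two columns, tab-delimited, no header and no row names
anno_path <- file.path(tempdir(), "annotations_example.txt")
write.table(anno[, c("cell_id", "group")], file = anno_path,
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
table(anno$group)
```

Keeping the normal cell types unsplit means they can be passed together as reference groups, while each patient's tumor cells form their own group.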

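(3) Gene coordinates

Coordinate annotation for the genes (row names) of the expression matrix file.
It has four columns: gene name, chromosome, start position, end position.
https://data.broadinstitute.org/Trinity/CTAT/cnv/ provides gene coordinates for human; gencode_v19 corresponds to hg19/GRCh37 and gencode_v21 to hg38/GRCh38.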
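Before running the analysis it can help to check how well the coordinate table covers the matrix. The sketch below uses hypothetical gene names and made-up positions; the plain alphabetical chromosome sort is a simplification that would place chr10 before chr2 on real data.

```r
# Hypothetical gene coordinates (not real positions), for illustration only
gene_pos <- data.frame(
  gene  = c("GENE_C", "GENE_A", "GENE_B", "GENE_D"),
  chr   = c("chr2", "chr1", "chr1", "chr2"),
  start = c(500, 1000, 200, 100),
  end   = c(900, 1500, 700, 450),
  stringsAsFactors = FALSE
)

# Toy count matrix containing one gene without coordinates (GENE_X)
counts <- matrix(1:10, nrow = 5,
                 dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_X"),
                                 c("cell_1", "cell_2")))

# How many matrix genes have coordinates?
table(rownames(counts) %in% gene_pos$gene)

# Keep shared genes, sort by chromosome then start, and align the matrix rows
gene_pos <- gene_pos[gene_pos$gene %in% rownames(counts), ]
gene_pos <- gene_pos[order(gene_pos$chr, gene_pos$start), ]
counts   <- counts[gene_pos$gene, , drop = FALSE]
identical(rownames(counts), gene_pos$gene)
```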

2. R package workflow

2.1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("infercnv")
library(infercnv)

2.2 Creating the infercnv object

CreateInfercnvObject()

infercnv_obj = CreateInfercnvObject(
                      # raw count matrix
                      raw_counts_matrix=raw_counts_matrix,
                      # cell type annotation file
                      annotations_file=annotations_file,
                      # cell types in annotations_file that are treated as normal cells
                      ref_group_names=c("celltype1","celltype2"),
                      # gene coordinate file
                      gene_order_file=gene_order_file,
                      # delimiter used in the two files above
                      delim="\t")

2.3 Computing the CNV signal

By default this produces the preliminary infercnv object; parameters control whether to additionally run de-noising or HMM-based CNV prediction.

infercnv::run()

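infercnv_obj = infercnv::run(
                      # infercnv object created in the previous step
                      infercnv_obj,
                      # gene filtering threshold: minimum average read count per gene among reference cells
                      # use 1 for smart-seq, 0.1 for 10x-genomics
                      cutoff=1,
                      # output directory (intermediate files of every step are saved here)
                      out_dir="output_dir",
                      # whether to cluster tumor cells separately per annotated group (e.g. per patient, given between-patient heterogeneity)
                      cluster_by_groups=TRUE,
                      # whether to de-noise
                      denoise=TRUE,
                      # whether to predict CNV states with the HMM
                      HMM=TRUE,
                      # number of threads
                      num_threads = 8)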

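3 Interpreting the results

inferCNV results are usually shown as a heatmap like the one below (de-noised), which has three parts: the upper heatmap, the lower heatmap, and the legend at the top left.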

[Figure: inferCNV heatmap (de-noised)]

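Legend (top left): the values (0, 0.5, 1, 1.5, 2) are the fold change in gene expression of a chromosomal region relative to the Normal cells. Red means expression in the region is relatively increased, blue means it is relatively decreased. The bar heights show how much of the data falls at each value.
Upper heatmap: the expression distribution of the cells designated as Normal. It should normally be almost entirely white, with no clearly concentrated CNV regions.
Lower heatmap: the CNV profile of each tumor-like cell, computed relative to the Normal cells in the upper heatmap. The cells are then clustered hierarchically by similarity and shown with a dendrogram.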
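To make the reading of the heatmap concrete, the sketch below summarizes a toy relative-expression matrix with one number per cell. The score used here (mean squared deviation from the neutral value 1) is a hypothetical summary chosen for illustration, not an inferCNV output, and the data are simulated.

```r
# Toy matrix of relative expression (genes x cells); values near 1 mean "no change"
set.seed(42)
rel <- matrix(rnorm(200 * 6, mean = 1, sd = 0.02), nrow = 200,
              dimnames = list(NULL, c(paste0("normal_", 1:3), paste0("tumor_", 1:3))))
rel[51:100, 4:6]  <- rel[51:100, 4:6] + 0.3    # simulated gain (red in the heatmap)
rel[151:180, 4:6] <- rel[151:180, 4:6] - 0.3   # simulated loss (blue in the heatmap)

# Hypothetical per-cell score: mean squared deviation from the neutral value 1
cnv_score <- colMeans((rel - 1)^2)
round(cnv_score, 4)
```

Cells that would appear mostly white in the heatmap get a score close to zero, while cells with extended red or blue blocks get a visibly larger score.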

4 Worked examples

4.1 Example data from the infercnv package

(1) Preparing the input files

setwd("~/inferCNV/example1/")
library(infercnv)
# count expression matrix
mat_dir = system.file("extdata", "oligodendroglioma_expression_downsampled.counts.matrix.gz", package = "infercnv")
raw_counts = data.table::fread(mat_dir, data.table = F)
rownames(raw_counts) = raw_counts[,1]
raw_counts = raw_counts[,-1]
raw_counts[1:4,1:4]
dim(raw_counts)

# cell type annotation
anno_file_dir = system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv")
anno_file = data.table::fread(anno_file_dir, data.table = F, header = F)
head(anno_file)
dim(anno_file)
anno_file$V2 = stringr::str_split(anno_file$V2,"_",simplify = T)[,1]
write.table(anno_file, quote = F, row.names = F, col.names = F,
            sep = "\t", file = "anno_file.txt")

# gene coordinates
gene_order_file_dir = system.file("extdata", "gencode_downsampled.EXAMPLE_ONLY_DONT_REUSE.txt", package = "infercnv")
gene_order_file = data.table::fread(gene_order_file_dir, data.table = F, header = F)
head(gene_order_file)
table(gene_order_file$V1 %in% rownames(raw_counts))
write.table(gene_order_file, quote = F, row.names = F, col.names = F,
            sep = "\t", file = "gene_order_file.txt")

(2) Analysis workflow

infercnv_obj = CreateInfercnvObject(raw_counts_matrix=raw_counts,
                                    annotations_file="anno_file.txt",
                                    delim="\t",
                                    gene_order_file="gene_order_file.txt",
                                    ref_group_names=c("Microglia/Macrophage",
                                                      "Oligodendrocytes (non-malignant)")) 

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=1, 
                             out_dir="infer_out",
                             cluster_by_groups=TRUE,
                             denoise=T,
                             HMM=F)

4.2 Example data from the scCancer package

(0) Preliminary Seurat analysis

# wget http://lifeome.net/software/sccancer/KC-example.tar.gz
library(Seurat) 
library(ggplot2)
counts = Read10X("KC-example/filtered_feature_bc_matrix/")
scRNA=CreateSeuratObject(counts = counts)
dim(scRNA) 
#[1] 32738 10227
scRNA@assays$RNA@counts[1:4, 1:4]
head(scRNA@meta.data)
feats <- c("nFeature_RNA", "nCount_RNA")
VlnPlot(scRNA, features = feats, pt.size = 0.01, ncol = 2) + 
  NoLegend()

# Filtering criteria
retained_c_umi <- scRNA$nFeature_RNA > 300
retained_f_low <- Matrix::rowSums(scRNA@assays$RNA@counts>0) > 3
# Percentage of mitochondrial genes
mito_genes=rownames(scRNA)[grep("^MT-", rownames(scRNA))] 
mito_genes 
scRNA=PercentageFeatureSet(scRNA, "^MT-", col.name = "percent_mito")
fivenum(scRNA@meta.data$percent_mito)
# Percentage of ribosomal genes
ribo_genes=rownames(scRNA)[grep("^Rp[sl]", rownames(scRNA),ignore.case = T)]
ribo_genes
scRNA=PercentageFeatureSet(scRNA, "^RP[SL]", col.name = "percent_ribo")
fivenum(scRNA@meta.data$percent_ribo)
# Percentage of hemoglobin (red blood cell) genes
rownames(scRNA)[grep("^Hb[^(p)]", rownames(scRNA),ignore.case = T)]
scRNA=PercentageFeatureSet(scRNA, "^HB[^(P)]", col.name = "percent_hb")
fivenum(scRNA@meta.data$percent_hb)

feats <- c("percent_mito","percent_ribo", "percent_hb")
VlnPlot(scRNA, group.by = "orig.ident", features = feats, pt.size = 0.01, ncol = 3) +
  NoLegend()

retained_c_mito <- scRNA$percent_mito < 15
retained_c_ribo <- scRNA$percent_ribo > 3
retained_c_hb <- scRNA$percent_hb < 0.1


retained_c <- retained_c_umi & retained_c_mito & retained_c_ribo & retained_c_hb 
table(retained_c)
# FALSE  TRUE 
# 1549  8678
retained_f <- retained_f_low 
table(retained_f)
# FALSE  TRUE 
# 14636 18102

scRNA_filt=scRNA[retained_f, retained_c]
dim(scRNA_filt) 
#[1] 18102  8678


# Normalize, find highly variable genes, scale
scRNA=scRNA_filt
scRNA <- NormalizeData(scRNA, normalization.method = "LogNormalize", scale.factor = 10000)
scRNA <- FindVariableFeatures(scRNA, selection.method = "vst", nfeatures = 2000) 
scRNA <- ScaleData(scRNA, features = VariableFeatures(scRNA), 
                   vars.to.regress = c("nFeature_RNA","percent_mito"))

top10 <- head(VariableFeatures(scRNA), 10) 
plot1=VariableFeaturePlot(scRNA) 
LabelPoints(plot = plot1, points = top10, repel = TRUE, size=2.5) +
  theme(legend.position = c(0.1,0.8))

scRNA <- RunPCA(scRNA, features = VariableFeatures(scRNA)) 

ElbowPlot(scRNA, ndims=30, reduction="pca") 
pc.num=1:20
# scRNA = RunTSNE(scRNA, dims = pc.num)
# DimPlot(scRNA, reduction = "tsne")
scRNA = RunUMAP(scRNA, dims = pc.num)
DimPlot(scRNA, reduction = "umap")

scRNA <- FindNeighbors(scRNA, dims = pc.num) 
scRNA <- FindClusters(scRNA, resolution = c(0.01,0.05,0.1,0.2,0.5,0.7,0.9))
library(clustree)
library(cowplot)
library(patchwork)
clustree(scRNA@meta.data, prefix = "RNA_snn_res.")
Idents(scRNA) = scRNA$RNA_snn_res.0.2
table(scRNA@active.ident)
p_umap = DimPlot(scRNA, reduction = "umap" ,label = T)
p_umap
#B cell : 9
cg=c("CD79A","CD79B","IGKC","CD19","MZB1","MS4A1")
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip() + p_umap
#T cell : 6
cg=c("CD3D",'CD3E','TRAC','CD3G','CD2')
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip()  + p_umap
#Cancer cell / Epithelial : 0,3,4,5,10
cg=c("EPCAM","PAX8","KRT18","CD24","KRT19","SCGB2A2","KRT5","KRT15" )
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip()  + p_umap

#Myeloid cell: 1
cg=c("CD68","LYZ","MARCO","AIF1","TYROBP","MS4A6A","CD1E","IL3RA","LAMP3")
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip()  + p_umap

#Endothelial cell: 2,8
cg=c("CLDN5","PECAM1","VWF","FLT1","RAMP2")
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip()  + p_umap

#Fibroblasts(CAF): 7
cg=c("COL1A1","COL1A2","COL3A1","BGN","DCN","POSTN","C1R")
DotPlot(scRNA, assay = "RNA",
        features = cg) + coord_flip()  + p_umap
 



cgs = list(
  Epi = c("EPCAM","PAX8","KRT18","CD24","KRT19","SCGB2A2","KRT5","KRT15"),
  Meyloid = c("CD68","LYZ","MARCO","AIF1","TYROBP","MS4A6A","CD1E","IL3RA","LAMP3"),
  T_cell = c("CD3D",'CD3E','TRAC','CD3G','CD2'),
  B_cell = c("CD79A","CD79B","IGKC","CD19","MZB1","MS4A1"),
  Endo = c("CLDN5","PECAM1","VWF","FLT1","RAMP2"),
  Fibro = c("COL1A1","COL1A2","COL3A1","BGN","DCN","POSTN","C1R")
)
# Reorder the cluster levels so that the dot plot displays neatly
scRNA@active.ident = factor(scRNA@active.ident, levels = c(0,3,4,5,10,1,6,9,2,8,7))
DotPlot(scRNA, features = cgs, assay = "RNA") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  p_umap + plot_layout(widths = c(2, 1))

scRNA$celltype = ifelse(scRNA@active.ident %in% c(0,3,4,5,10), "Epi",
                        ifelse(scRNA@active.ident %in% c(1), "Meyloid",
                               ifelse(scRNA@active.ident %in% c(6), "T_cell",
                                      ifelse(scRNA@active.ident %in% c(9), "B_cell",
                                             ifelse(scRNA@active.ident %in% c(2,8), "Endo","Fibro")))))

Idents(scRNA) = scRNA$celltype
table(scRNA@active.ident)
scRNA@active.ident = factor(scRNA@active.ident, levels = c("Epi","Meyloid","T_cell",
                                                           "B_cell","Endo","Fibro"))
p_umap = DimPlot(scRNA, reduction = "umap")
DotPlot(scRNA, features = cgs, assay = "RNA") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  p_umap + plot_layout(widths = c(2, 1))

(1) Preparing the input files

library(Seurat)
counts_matrix = GetAssayData(scRNA, slot="counts")
dim(counts_matrix)
counts_matrix[1:4,1:4]

gene_pos = data.table::fread("gencode_v19_gen_pos.complete.txt", data.table = F)
head(gene_pos)
gene_pos$SYMBOL = stringr::str_split(gene_pos$V1, "[|]",simplify = T)[,1]
gene_pos = gene_pos[!duplicated(gene_pos$SYMBOL),]
gene_pos = gene_pos[gene_pos$SYMBOL %in% rownames(counts_matrix), ]
table(rownames(counts_matrix) %in% gene_pos$SYMBOL)
counts_matrix = counts_matrix[match(gene_pos$SYMBOL, rownames(counts_matrix)),]
identical(rownames(counts_matrix), gene_pos$SYMBOL)
counts_matrix[1:4,1:4]
head(gene_pos)
gene_pos = gene_pos[,c(5,2,3,4)]
table(duplicated(gene_pos$SYMBOL))
dim(counts_matrix)
save(counts_matrix, file = "counts_matrix.rda")
write.table(gene_pos, file = "gene_pos.txt", 
            col.names = F, row.names = F, quote = F, sep = "\t")

annotations_file=data.table::fread(system.file("extdata", "oligodendroglioma_annotations_downsampled.txt", package = "infercnv"))
head(annotations_file)
meta = scRNA@meta.data[,"celltype",drop=F]
head(meta)
meta$ID = rownames(meta)
meta = meta[,c("ID","celltype")]
table(meta$celltype)
head(meta)
meta_sle = subset(meta, celltype %in% c("Endo","Meyloid","Epi"))
head(meta_sle)
write.table(meta_sle, file = "celltype_sle.txt", 
            col.names = F, row.names = F, quote = F, sep = "\t")

(2) Analysis workflow

rm(list = ls())
load("counts_matrix.rda")
library(infercnv)
infercnv_obj = CreateInfercnvObject(raw_counts_matrix=counts_matrix,
                                    annotations_file="celltype_sle.txt",
                                    delim="\t",
                                    gene_order_file="gene_pos.txt",
                                    ref_group_names=c("Endo","Meyloid")) 

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=0.1, 
                             # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                             out_dir="CNV_infer2", 
                             cluster_by_groups=TRUE,
                             denoise=TRUE,
                             HMM=F)
