公众号：生信诊断所

本文是一则学习笔记，但有干货
这里的数据是保密的，做了哪些处理也不能说。就单纯为了测试liger整合数据和Seurat对象直接的转换，还有scCATCH工具的细胞类型注释。

LIGER（Linked Inference of Genomic Experimental Relationships）基因组实验关系关联推断,用一种整合非负矩阵分解算法。
主要的步骤分为：数据预处理和归一化，联合因子分解，分位数归一化和联合聚类，以及可视化。矩阵分解的概念，非负矩阵分解 NMF算法可以参照官方文献。

先找两个数据：一个起名

# install.packages('rliger')
library(rliger)
library(dplyr)
library(patchwork)
library(Seurat)

setwd("/home/data/analysis")
直接指定到10x cellranger分析结果目录下的filtered_feature_bc_matrix目录
folder1 = "../sample1/filtered_feature_bc_matrix"
folder2 = "../sample2/filtered_feature_bc_matrix"

step1: preprocessing and normalization

matrix_list <- rliger::read10X(sample.dirs = c(folder1,folder2),
                       sample.names = c('TU_CL1','tumor_digest'),merge=FALSE)

# make sure all cell names are unique
colnames(matrix_list$TU_CL1[["Gene Expression"]]) <- paste0(colnames(matrix_list$TU_CL1[["Gene Expression"]]),'_1')
colnames(matrix_list$tumor_digest[["Gene Expression"]]) <- paste0(colnames(matrix_list$tumor_digest[["Gene Expression"]]),'_2')

# 创建liger对象
liger_obj <- createLiger(list(TU_CL1=matrix_list$TU_CL1[["Gene Expression"]],
                              tumor_digest = matrix_list$tumor_digest[["Gene Expression"]]))
# Removing 14185 genes not expressing in TU_CL1.
# Removing 10058 genes not expressing in tumor_digest.

# 
liger_obj <- rliger::normalize(liger_obj) %>%
  rliger::selectGenes() %>% # select high variable genes
  rliger::scaleNotCenter() # scale but not center by subtracting mean; not log-transform

Step 2 : joint matrix factorization

这一步做联合矩阵分解，k设置20，也就是20个factor，为20个metagene，比较耗时，多核会好一点，15min in 16cores

# 选做：suggestK非常耗时，可以不做，默认用k=20跑
rliger::suggestK(liger_obj, k.test = seq(20,30,2),lambda = 5, plot.log2 = F,num.cores = 72)
#直接用k=20跑
liger_obj <- rliger::optimizeALS(liger_obj, k=20)
Finished in 32.68537 mins, 30 iterations.
Max iterations set: 30.
Final objective delta: 7.5549e-06.
Best results with seed 1.

Step 3: quantile normalization and joint clustering

liger_obj <- rliger::quantile_norm(liger_obj)

# alignment score ranges from 0 (no alignment) to 1 (perfect alignment)
rliger::calcAlignment(liger_obj) # 评估数据整合的效果。数字越大说明两个数据集比对上的程度越高。
[1] 0.2550047 
这里结果是0.255看来也不怎么好，

Step 4: visualization and downstream analysis


liger_obj <- runUMAP(liger_obj)
Spectral initialization failed to converge, using random initialization instead
dim_plts <- plotByDatasetAndCluster(liger_obj,
                                    axis.labels = c("UMAP 1",'UMAP 2'),
                                    return.plots = TRUE)
dim_plts[[1]] + dim_plts[[2]]

这个函数返回的是一个包含了两个ggplot2对象的list：这里思考一下如何在这个基础上修改图。比如pt.size，label.size等参数

Image.png

loading_plts <- plotGeneLoadings(liger_obj, axis.labels = c("UMAP 1","UMAP 2"), return.plots = TRUE)

Image.png

loading_plts: 每个factor中权重排名靠前的gene
loading_plts[[3]]

Image.png

Step 5: differential expression analysis and marker gene selection

# Setting compare.method to 'clusters' compares each feature's normalized counts between clusters combining all datasets.
# Setting compare.method to 'datasets' give us the features most differentially expressed across datasets for every cluster

dge_cluster <- runWilcoxon(liger_obj, compare.method = 'clusters')
dge_dataset <- runWilcoxon(liger_obj, compare.method = 'datasets')
dge_cluster_sig <- dge_cluster[dge_cluster$padj <0.05 & abs(dge_cluster$logFC)>1,]
dge_dataset_sig <- dge_dataset[dge_dataset$padj <0.05 & abs(dge_dataset$logFC)>1,]

plt_genes <- plotGene(liger_obj,
                      gene = dge_dataset_sig[which.max(dge_dataset_sig$logFC),'feature'],
                      keep.scale = TRUE,
                      return.plots = TRUE)
plt_genes[[1]] + plt_genes[[2]]

Image.png

plt_genes_v <- plotGeneViolin(liger_obj,
                              gene = dge_dataset_sig[which.max(dge_dataset_sig$logFC),'feature'],
                              return.plots = TRUE))
plt_genes_v[[1]] / plt_genes_v[[2]]

Image.png

step 6 转换成Seurat 对象，

# liger to Seurat object
library(Seurat)
seurat_obj = rliger::ligerToSeurat(liger_obj, use.liger.genes = TRUE, by.dataset = T, renormalize = TRUE)
DimPlot(seurat_obj, pt.size = 1, label=T, label.size = 4, repel=T)

Image.png

step 7 用scCATCH注释cluster

library(scCATCH)
liger_clusters = as.character(liger_obj@clusters, levels=1:length(levels(liger_obj@clusters)))
scCATCH_obj <- scCATCH::createscCATCH(seurat_obj[["RNA"]]@data, cluster = liger_clusters)
clu_markers <- scCATCH::findmarkergene(scCATCH_obj, species = "Human",
                                       cluster = "All", cancer = c("Normal"),
                                       marker = cellmatch,
                                       tissue = c("Brain"),
                                       use_method = "2", cell_min_pct = 0.25,
                                       logfc = 0.25, pvalue = 0.25
)

clu_ann <- scCATCH::findcelltype(clu_markers)
scCATCH_ann_data = merge(clu_ann@meta, clu_ann@celltype,by = "cluster") %>% tibble::column_to_rownames("cell")
seurat_obj = AddMetaData(seurat_obj, scCATCH_ann_data)
DimPlot(seurat_obj, pt.size = 1, label = T, label.size = 4, repel = T, group.by = "cell_type")

Image.png

工具鉴定出来的不是很准确，那么来探究一下，到底这两个样本中有没有肿瘤细胞，它们在哪里呢？
这时候就需要人工去鉴定了。
先借助数据库找一下肿瘤marker：
目前单细胞扩展的marker库有：CellMarker，PanglaoDB，SCSig，Immuno-cell-guide等
收集了几个肿瘤marker，也不知道准不准，

Image.png

cancer_marker <- c("PARP1","EGFR","SOX9","SOX2","POU3F2","OLIG2","SALL2","NFIB")
FeaturePlot(seurat_obj,features = cancer_marker)

Image.png

明天再用肿瘤marker自动注释工具试试。
未完待续。关注公众号获取更多内容。

单细胞数据整合鉴定肿瘤细胞

step1: preprocessing and normalization

Step 2 : joint matrix factorization

Step 3: quantile normalization and joint clustering

Step 4: visualization and downstream analysis

Step 5: differential expression analysis and marker gene selection

step 6 转换成Seurat 对象，

step 7 用scCATCH注释cluster

你可能感兴趣的:(单细胞数据整合鉴定肿瘤细胞)