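Hello everyone. Today's post covers a method for removing doublets from 10X single-cell data. The paper is "Chord: Identifying Doublets in Single-Cell RNA Sequencing Data by an Ensemble Machine Learning Algorithm". The software integrates several existing methods so that the strengths of one offset the weaknesses of another, with the aim of removing doublets more reliably. For earlier posts on doublet removal, see: DoubletFinder; the doublet-removal module for single-cell analysis in Python; and doublet removal part 3, the R package DoubletDecon.
A quick review of the paper: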
Summary
- Most of these methods perform well on some datasets but lack stability on others (existing methods do not generalize).
- It is difficult to regard a single method as the gold standard for each scenario (several criteria are needed).
- Chord implements a machine learning algorithm that integrates multiple doublet detection methods (it combines several methods).
- Chord had higher accuracy and stability than the individual approaches on different datasets containing real and synthetic data (it performs well).
Introduction
- According to their composition, doublets can be divided into two major classes: homotypic doublets, which originate from the same cell type, and heterotypic doublets, which arise from transcriptionally distinct cells and generate an artificial hybrid transcriptome (two types of doublets).
- Compared to homotypic doublets, heterotypic doublets are considered to have more impact on downstream analyses, including dimensionality reduction, cell clustering, differential expression and cell developmental trajectories (heterotypic doublets interfere more with downstream analysis).
- The authors propose Chord, which implements an ensemble algorithm that aggregates the results from multiple representative methods to accurately identify doublets (several methods detect doublets together, which raises accuracy).
- Compared to the individual methods, Chord demonstrated improved doublet detection accuracy and stability across different datasets of real and synthetic data (good performance).
Principles of doublet detection: two categories
The first strategy uses the distance between simulated artificial doublets and the observed cells to identify doublets (DoubletFinder takes this approach).
The second strategy, used by cxds in the scds package, is based on co-expressed 'marker' genes that are not simultaneously expressed in the same singlet cell but can appear together in doublet cells (a single barcode expresses markers of two cell types).
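The two ideas can be illustrated with a toy sketch in base R. This is not the code of DoubletFinder or scds: the count matrix, the gene names and the helper functions `simulate_doublets` and `flag_coexpression` are all made up for this example.

```r
# Toy count matrix: rows are genes, columns are barcodes (all values made up).
counts <- matrix(
  c(5, 0, 1,   # cellA expresses markerA only
    0, 4, 1,   # cellB expresses markerB only
    3, 2, 1),  # mixed expresses both markers
  nrow = 3,
  dimnames = list(c("markerA", "markerB", "housekeeping"),
                  c("cellA", "cellB", "mixed"))
)

# Strategy 1 (simulation): build artificial doublets by summing the counts
# of two randomly chosen, distinct barcodes.
simulate_doublets <- function(counts, n_doublets) {
  pairs <- vapply(seq_len(n_doublets),
                  function(i) sample(ncol(counts), 2),
                  integer(2))
  sim <- counts[, pairs[1, ], drop = FALSE] + counts[, pairs[2, ], drop = FALSE]
  colnames(sim) <- paste0("sim_doublet", seq_len(n_doublets))
  sim
}

# Strategy 2 (co-expression): flag barcodes that express two markers that
# are assumed to be mutually exclusive in singlets.
flag_coexpression <- function(counts, marker_a, marker_b) {
  counts[marker_a, ] > 0 & counts[marker_b, ] > 0
}

set.seed(1)
sim <- simulate_doublets(counts, n_doublets = 4)
flags <- flag_coexpression(counts, "markerA", "markerB")
```

In a real workflow the simulated doublets would be embedded together with the observed cells, and cells lying close to many simulated doublets would receive a high score; that distance step is left out here to keep the sketch short.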
Current methods are neither very stable nor very accurate, and their performance changes with the type of data.
The authors describe a new strategy for doublet identification based on an ensemble machine learning algorithm (this is the software's strategy).
Chord integrates four representative computational doublet detection methods (DoubletFinder, doubletCells, bcds and cxds) in the R environment to improve doublet detection (four methods are combined to detect doublets together).
Figure caption: First, preliminarily predicted doublets are filtered using bcds, cxds, doubletCells and DoubletFinder, and then the processed dataset is randomly sampled to generate simulated doublets that are added to the training dataset. The second step is to fit the weights of the integrated methods through the AdaBoost algorithm (look it up if you are interested) on the training dataset. In the third step, the ensemble model is used to evaluate the original expression matrix and the doublets are identified by the expectation threshold value.
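The last step of the caption can be sketched as follows. This is a guess at what "expectation threshold" means, not Chord's code: it assumes the threshold keeps as many top-scoring droplets as the expected doublet rate implies. The function `call_doublets` and the scores are hypothetical.

```r
# Assumption: "expectation threshold" = call the top n droplets as doublets,
# where n is the expected doublet rate times the number of droplets.
call_doublets <- function(scores, doublet_rate) {
  n_expected <- round(length(scores) * doublet_rate)
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(n_expected)]
}

# Made-up ensemble scores for five barcodes.
scores <- c(c1 = 0.1, c2 = 0.9, c3 = 0.4, c4 = 0.8, c5 = 0.2)
called <- call_doublets(scores, doublet_rate = 0.4)
```

With five droplets and an expected rate of 0.4, the two highest-scoring barcodes are called as doublets.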
How doublets are estimated
The program first roughly estimates the doublets in the input droplet data with the four built-in methods, so that likely doublets are filtered out of the original data before artificial doublets are simulated (the four methods give a first estimate). The authors therefore propose a key step and an adjustable parameter called overkillrate to preliminarily delete doublets (a first pass of doublet removal). Choosing this parameter well could improve the accuracy of the program (we come back to it in the code).
A simulated training set is then generated from the high-quality singlet data that remain after these doublets are removed (generation of the simulated dataset).
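A minimal sketch of such a pre-filter, under stated assumptions: it removes every barcode that any method ranks among its top `doublet_rate * overkill_rate` fraction. Whether Chord applies overkillrate in exactly this way is an assumption; the function `overkill_filter` and the score table are hypothetical.

```r
# Hypothetical pre-filter: drop every barcode that ANY method ranks among
# its top (doublet_rate * overkill_rate) fraction of cells.
overkill_filter <- function(scores, doublet_rate, overkill_rate = 1) {
  n_remove <- round(nrow(scores) * doublet_rate * overkill_rate)
  flagged <- rep(FALSE, nrow(scores))
  for (method in colnames(scores)) {
    top <- order(scores[[method]], decreasing = TRUE)[seq_len(n_remove)]
    flagged[top] <- TRUE
  }
  rownames(scores)[!flagged]  # barcodes kept as likely singlets
}

# Made-up scores from two methods for ten barcodes.
scores <- data.frame(
  method1 = c(0.1, 0.2, 0.9, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.3),
  method2 = c(0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.8, 0.1, 0.2, 0.1),
  row.names = paste0("cell", 1:10)
)
kept <- overkill_filter(scores, doublet_rate = 0.1)
```

Taking the union across methods is deliberately aggressive: losing a few true singlets costs little, whereas a doublet left in the pool would contaminate the simulated training doublets.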
Finally, the AdaBoost algorithm is used to integrate these doublet detection methods during model training. The AdaBoost model then calculates doublet scores for the input droplet data (each droplet gets a doublet score).
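To give a feel for how boosting can combine per-method scores, here is a small, self-contained discrete AdaBoost with threshold stumps in base R. It is a teaching sketch, not Chord's implementation: the functions `fit_stump`, `adaboost_fit` and `adaboost_score`, the two toy "methods" and all the numbers are made up.

```r
# Fit one decision stump: choose the method (column) and threshold with the
# lowest weighted error, predicting "doublet" (+1) when score >= threshold.
fit_stump <- function(X, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(X))) {
    for (thr in unique(X[, j])) {
      pred <- ifelse(X[, j] >= thr, 1, -1)
      err <- sum(w[pred != y])
      if (err < best$err) best <- list(feature = j, threshold = thr, err = err)
    }
  }
  best
}

# Discrete AdaBoost: reweight misclassified cells after every round.
adaboost_fit <- function(X, y, n_rounds = 3) {
  w <- rep(1 / nrow(X), nrow(X))
  model <- vector("list", n_rounds)
  for (m in seq_len(n_rounds)) {
    stump <- fit_stump(X, y, w)
    err <- min(max(stump$err, 1e-10), 1 - 1e-10)  # keep alpha finite
    stump$alpha <- 0.5 * log((1 - err) / err)
    pred <- ifelse(X[, stump$feature] >= stump$threshold, 1, -1)
    w <- w * exp(-stump$alpha * y * pred)
    w <- w / sum(w)
    model[[m]] <- stump
  }
  model
}

# Ensemble score: weighted vote of all stumps (positive means doublet).
adaboost_score <- function(model, X) {
  score <- numeric(nrow(X))
  for (stump in model) {
    vote <- ifelse(X[, stump$feature] >= stump$threshold, 1, -1)
    score <- score + stump$alpha * vote
  }
  score
}

# Toy training set: each method misranks one singlet, so no single
# threshold on one method separates singlets (-1) from doublets (+1).
X <- cbind(method_a = c(0.1, 0.2, 0.9, 0.8, 0.7, 0.95),
           method_b = c(0.2, 0.8, 0.1, 0.9, 0.7, 0.75))
y <- c(-1, -1, -1, 1, 1, 1)

model <- adaboost_fit(X, y, n_rounds = 3)
ensemble <- adaboost_score(model, X)
```

On this toy data the weighted vote of three stumps classifies every cell correctly, although each method alone gets one cell wrong. That is the intuition behind combining detectors.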
How the approach was evaluated
The authors evaluated these methods on ground-truth scRNA-seq datasets in which doublets are labeled by experimental strategies (the results are good).
They used random sampling to proportionally sample singlets and doublets in the dataset to build a doublet rate gradient. In short, Chord outperformed the other methods at many doublet rates.
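A doublet rate gradient of this kind can be sketched in a few lines of base R. The barcode names, the pool sizes and the function `sample_at_rate` are hypothetical, and the paper's actual sampling scheme may differ in detail.

```r
# Draw a subset of n_cells barcodes that contains a chosen doublet rate.
sample_at_rate <- function(singlets, doublets, n_cells, doublet_rate) {
  n_doublets <- round(n_cells * doublet_rate)
  c(sample(singlets, n_cells - n_doublets), sample(doublets, n_doublets))
}

# Made-up labeled barcodes: 200 singlets and 50 doublets.
singlets <- paste0("s", 1:200)
doublets <- paste0("d", 1:50)

set.seed(1)
rates <- c(0.02, 0.1, 0.2)
gradient <- lapply(rates, function(r) {
  sample_at_rate(singlets, doublets, n_cells = 100, doublet_rate = r)
})
# Number of doublets in each subset of the gradient.
n_doublets <- vapply(gradient, function(b) sum(startsWith(b, "d")), integer(1))
```

Each element of `gradient` is a set of 100 barcodes with a known doublet fraction, so a detection method can be scored at every rate against the known labels.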
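The performance of doublet detection approaches on ground-truth datasets (datasets in which doublets are labeled experimentally): several tools are compared and, unsurprisingly, the authors' software performs best.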
The performance of doublet detection approaches in DEG and pseudotime analysis: the effect on downstream analyses
Detected differential genes: the differential gene analysis results of Chord and bcds were more similar to those on the clean data (data without doublets), indicating that the improvement in differential expression analysis obtained with Chord and bcds was closer to the true situation (differential gene detection improves).
Effect on downstream trajectory analysis
Chord and Scrublet had cell trajectories similar to those of the clean data in the two pseudotime analysis methods; fewer doublets remained and no new branches were generated. Thus, Chord was equivalent to or even outperformed other methods in DEG detection and pseudotime analysis on the synthetic scRNA-seq datasets (trajectory analysis improves to some extent).
Application to real data
Obviously, doublet removal by Chord can have a greater impact on the proportion of cells to avoid these imbalanced distributions and numerous doublets from becoming noise contamination for the quantitative statistics of the proportion of cell types.
The authors believe that Chord's doublet processing of real data can improve the purity of cell populations, allowing researchers to obtain more accurate cell type identification results, accurately identify DEGs between cell types and obtain better pseudotime analysis results (the results are very good; of course, had they not been, we would never have seen this paper).
Sharing the code (it is simple)
Quick start:
library(Chord)
chord(seu = "input seurat object", doubletrat = "estimated doubletrate", overkill = TRUE, outname = "the name you want")
Q: How do I estimate doubletrate?
A: It depends on the number of cells in the sample. For 10X data, a rough guide is doubletrate = ~0.9% per 1,000 cells.
Q: How do I remove the doublets?
A: The doublets' barcodes are written to the file "outname_doublets.csv".
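The two answers above can be turned into a short base R sketch. The helper names `estimate_doublet_rate` and `read_doublet_barcodes` are hypothetical, and the layout of the CSV (a header row with the barcodes in the last column) is an assumption, so check the real file before relying on it. The barcodes below are made up.

```r
# Rule of thumb from the Q&A: about 0.9% doublets per 1,000 cells.
estimate_doublet_rate <- function(n_cells, rate_per_1000 = 0.009) {
  n_cells / 1000 * rate_per_1000
}

# Assumption: the doublet CSV has a header row and holds the barcodes in
# its last column. Adjust this if the real file is laid out differently.
read_doublet_barcodes <- function(path) {
  tab <- read.csv(path, stringsAsFactors = FALSE)
  as.character(tab[[ncol(tab)]])
}

rate <- estimate_doublet_rate(5000)  # about 4.5% for 5,000 cells

# Stand-in for "outname_doublets.csv", written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c("AAAC-1", "TTTG-1")), csv_path)

all_barcodes <- c("AAAC-1", "CCCA-1", "GGGT-1", "TTTG-1")
doublet_barcodes <- read_doublet_barcodes(csv_path)
singlet_barcodes <- setdiff(all_barcodes, doublet_barcodes)
```

The vector `singlet_barcodes` can then be used to subset the expression object to the barcodes that were not called as doublets.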
Boost more methods:
1. Use any method to evaluate the dataset "overkilled.robj" and add the resulting scores to "simulated_data.scores.csv".
2. Use any method to evaluate the dataset "seu.robj" and add the resulting scores to "real_data.scores.csv".
3. In the same directory, run the following code:
load("seu.robj")
load("sce.robj")
chord(seu = seu, sce = sce, doubletrat = "estimated doubletrate 2", overkill = TRUE, outname = "the name you want 2", addmethods1 = "real_data.scores.csv", addmethods2 = "simulated_data.scores.csv")
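4. The doublets' barcodes are written to the file "outname2_doublets.csv".
The method is a good one: it combines the strengths of several tools and is worth a try.
Life is good, and it is waiting for you to go further.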