2021-09-28

https://doi.org/10.1186/s13059-021-02356-5

Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.

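Joint sequencing technologies (single-cell)

· CITE-seq: integrates the transcriptome and the surface proteome

· SNARE-seq: transcriptome and chromatin accessibility

· ECCITE-seq: transcriptome, surface proteins, TCR/BCR gene sequences, and sgRNAs from CRISPR perturbations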

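To integrate single-cell multi-omics data for dimensionality reduction, the method wraps the t-SNE loss function and uses gradient descent to find both the weight of each omics modality and the mapping from high to low dimensions.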

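SNE: a mapping from high to low dimensions

Assume the neighbours of a data point follow a normal distribution. The conditional probability that j is a neighbour of i, centred at point i, is $p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$.

For the assumed low-dimensional map, the variance is fixed to 0.5 for simplicity: $q_{j|i} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}$.

To make the conditional distributions p and q similar, their mismatch is measured with the KL divergence (relative entropy, which measures how closely a model distribution matches the true distribution): $C = \sum_i \mathrm{KL}(P_i \,\Vert\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$.

Minimizing this loss function (iteratively, by gradient descent) yields a low-dimensional distribution that approximates the high-dimensional one.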
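The SNE steps above (Gaussian neighbour probabilities, a low-dimensional map with variance 0.5, a KL-divergence loss, gradient descent) can be sketched in NumPy. This is a minimal illustration, not the j-SNE implementation: it assumes a single fixed `sigma` for all points instead of a per-point bandwidth, and the function names, step size, and toy data are hypothetical.

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """p_{j|i}: Gaussian neighbour probabilities (assumes one shared sigma)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # a point is not its own neighbour
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)     # every row sums to 1

def conditional_q(Y):
    """q_{j|i} in the embedding, variance fixed to 0.5 so that 2*sigma^2 = 1."""
    return conditional_p(Y, sigma=np.sqrt(0.5))

def kl_loss(P, Q, eps=1e-12):
    """Sum over i of KL(P_i || Q_i)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

def sne_gradient(P, Q, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) * (y_i - y_j)."""
    M = (P - Q) + (P - Q).T
    return 2.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)

# Toy data: two groups of points in 5 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)), rng.normal(4.0, 1.0, (10, 5))])
P = conditional_p(X)
Y = rng.normal(0.0, 1e-2, (20, 2))              # small random initial embedding
for _ in range(100):                            # plain gradient-descent steps
    Y = Y - 0.05 * sne_gradient(P, conditional_q(Y), Y)
print("loss after descent:", kl_loss(P, conditional_q(Y)))
```

Practical t-SNE implementations add per-point bandwidths chosen by perplexity, momentum, and a heavy-tailed low-dimensional kernel; none of that is shown here.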

j-SNE assigns a weight to each omics modality (learned by gradient descent).
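To make the weighting idea concrete, here is a hedged sketch of a joint loss over several modalities. The objective is an assumption, not something stated in these notes: a weighted sum of per-modality KL divergences plus an entropy-style regularizer, `sum_k alpha_k * KL(P_k || Q) + lam * sum_k alpha_k * log(alpha_k)`, with the weights summing to 1. The notes say the weights are found by gradient descent; this sketch instead uses the closed-form minimizer of the assumed objective for a fixed embedding. All names (`joint_loss`, `update_alpha`, `lam`) are hypothetical.

```python
import numpy as np

def neighbour_probs(X, sigma):
    """Row-normalized Gaussian neighbour probabilities with a shared sigma."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

def kl(P, Q, eps=1e-12):
    """Sum over rows of KL(P_i || Q_i)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

def joint_loss(P_list, Q, alpha, lam):
    """Assumed objective: sum_k alpha_k*KL(P_k||Q) + lam*sum_k alpha_k*log(alpha_k)."""
    alpha = np.asarray(alpha, dtype=float)      # all weights must be > 0
    fit = sum(a * kl(P, Q) for a, P in zip(alpha, P_list))
    return fit + lam * float(np.sum(alpha * np.log(alpha)))

def update_alpha(P_list, Q, lam):
    """Minimizer of the assumed objective over the simplex for a fixed Q.

    Setting the Lagrangian derivative to zero gives alpha_k proportional to
    exp(-KL_k / lam), i.e. a softmax over the negative divergences.
    """
    kls = np.array([kl(P, Q) for P in P_list])
    w = np.exp(-(kls - kls.min()) / lam)        # shift by the minimum for stability
    return w / w.sum()

# Toy example: modality A agrees with the embedding, modality B is unrelated noise.
rng = np.random.default_rng(1)
Y = rng.normal(size=(15, 2))
Q = neighbour_probs(Y, np.sqrt(0.5))
P_a = neighbour_probs(Y, np.sqrt(0.5))
P_b = neighbour_probs(rng.normal(size=(15, 4)), 1.0)
print("weights:", update_alpha([P_a, P_b], Q, lam=1.0))
```

Under this assumed form, a modality whose neighbourhoods disagree with the shared embedding receives a smaller weight, and `lam` controls how strongly the weights are pulled back towards uniform.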

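Results

SNARE-seq (scRNA-seq + scATAC-seq): in the transcriptome data the BJ and K562 cells are scattered, whereas the joint embedding picks up the variation.

Transcriptome + proteome

Protein velocity (based on ECCITE-seq (mRNA + surface proteome + TCR + ...))

For protein velocity, see "Protein velocity and acceleration from single-cell multiomics experiments":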


https://doi.org/10.1186/s13059-020-1945-3

The simultaneous quantification of protein and RNA makes possible the inference of past, present, and future cell states from single experimental snapshots. To enable such temporal analysis from multimodal single-cell experiments, we introduce an extension of the RNA velocity method that leverages estimates of unprocessed transcript and protein abundances to extrapolate cell states. We apply the model to six datasets and demonstrate consistency among cell landscapes and phase portraits. The analysis software is available as the protaccel Python package.


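For each gene, the steady state is the point where splicing and degradation are in balance, so that unspliced abundance is proportional to spliced abundance; the coefficient is obtained by quantile regression. T is the similarity between the difference between two cells and the imbalance of RNA splicing and degradation (estimated from neighbouring cells). Together with the unit vector u from cell i to cell j, it gives the direction in which a cell's RNA velocity changes, which reflects the cell trajectory. The protein case is analogous.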
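A rough sketch of the two steps in this paragraph: fitting the steady-state coefficient, then turning velocities into a direction in the embedding. Several choices here are assumptions rather than details taken from these notes or from the protaccel package: the coefficient is fitted by least squares through the origin on cells in the extreme expression quantiles (a stand-in for the quantile regression mentioned above), T is taken as a normalized exponential of the correlation between cell-to-cell differences and the velocity, and the arrow subtracts a uniform baseline. All function names are hypothetical.

```python
import numpy as np

def fit_gamma(spliced, unspliced, q=0.05):
    """Steady-state slope for one gene, fitted on cells in the extreme quantiles."""
    total = spliced + unspliced
    lo, hi = np.quantile(total, [q, 1.0 - q])
    ext = (total <= lo) | (total >= hi)          # lowest and highest expressing cells
    s, u = spliced[ext], unspliced[ext]
    return float(s @ u / (s @ s))                # least squares through the origin

def velocity(spliced, unspliced, gamma):
    """Deviation from the steady state: positive means the gene is being induced."""
    return unspliced - gamma * spliced

def embedding_arrow(i, expr, vel, emb, neighbours):
    """Direction of cell i in the embedding from its neighbours (i not included)."""
    diffs = expr[neighbours] - expr[i]           # cell-to-cell expression differences
    sims = np.array([np.corrcoef(d, vel[i])[0, 1] for d in diffs])
    T = np.exp(sims)
    T = T / T.sum()                              # assumed similarity weights
    units = np.asarray(emb[neighbours] - emb[i], dtype=float)
    units = units / np.linalg.norm(units, axis=1, keepdims=True)  # unit vectors
    return ((T - 1.0 / len(neighbours))[:, None] * units).sum(axis=0)

# Toy example: one gene sitting exactly at a steady state with slope 0.5.
s = np.arange(1.0, 101.0)
gamma = fit_gamma(s, 0.5 * s)
print("gamma:", gamma, "max |velocity|:", np.abs(velocity(s, 0.5 * s, gamma)).max())
```

The same pattern would apply to the protein layer, with spliced RNA playing the role of the precursor and protein abundance the role of the product, under the same caveats.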

https://www.nature.com/articles/s41586-018-0414-6

https://doi.org/10.1186/s13059-021-02386-z

The incorporation of unique molecular identifiers (UMIs) in single-cell RNA-seq assays makes possible the identification of duplicated molecules, thereby facilitating the counting of distinct molecules from sequenced reads. However, we show that the naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox,” and we propose an improved quantification method based on unseen species modeling. Our correction called BUTTERFLY uses a zero truncated negative binomial estimator implemented in the kallisto bustools workflow. We demonstrate its efficacy across cell types and genes and show that in some cases it can invert the relative abundance of genes.

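During library construction, copies of the same mRNA (same UMI) are amplified to different degrees by PCR, while at the same time only part of the molecules is amplified and some molecules are lost. Existing methods that correct for UMI amplification therefore give biased count estimates, because the mRNA is incompletely sampled.

Negative binomial model

The probability mass function of a negative binomial distribution is fitted to the distribution of UMI copy numbers, i.e. the probability $f(k)$ that a UMI has $k$ copies if every UMI could be observed; the number of UMIs with zero copies is unknown. The total number of molecules is then $N / (1 - f(0))$, where $N$ is the number of observed molecules, and the number of undetected (lost) molecules follows from the fitted function as $N f(0) / (1 - f(0))$. From this the value of μ can be updated iteratively.

As for the second parameter s of the probability function (the probability associated with a single copy making a UMI observable), s is chosen to maximize the log-likelihood ll, which gives the updated s.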
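The estimation described above can be sketched with a zero-truncated negative binomial in plain Python. This is not the BUTTERFLY code from the kallisto bustools workflow: the parameterization by mean `mu` and dispersion `size` is an assumption and may differ from the μ and s used in the notes, and a coarse grid search stands in for the iterative updates. The only formula relied on is that the observed molecules are the fraction `1 - P(0)` of all molecules.

```python
import math

def nb_logpmf(k, mu, size):
    """Log-probability of k copies under a negative binomial (mean mu, dispersion size)."""
    p = size / (size + mu)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1.0 - p))

def ztnb_loglik(hist, mu, size):
    """Log-likelihood of a copy-number histogram {k: number of UMIs seen k times}, k >= 1."""
    log_p0 = nb_logpmf(0, mu, size)
    log_norm = math.log1p(-math.exp(log_p0))     # log(1 - P(0)): the zero truncation
    return sum(n * (nb_logpmf(k, mu, size) - log_norm) for k, n in hist.items())

def fit_ztnb(hist, mus, sizes):
    """Grid search for the best (mu, size); a stand-in for iterative updates."""
    return max(((m, s) for m in mus for s in sizes),
               key=lambda ms: ztnb_loglik(hist, ms[0], ms[1]))

def estimate_total(hist, mu, size):
    """Total molecules = observed / (1 - P(0)); the difference is the unseen count."""
    n_obs = sum(hist.values())
    p0 = math.exp(nb_logpmf(0, mu, size))
    return n_obs / (1.0 - p0)

# Toy histogram: 3000 UMIs seen once, 2000 seen twice (illustrative numbers).
hist = {1: 3000, 2: 2000}
mu, size = fit_ztnb(hist, mus=[0.5, 1.0, 2.0], sizes=[1.0, 2.0, 4.0])
total = estimate_total(hist, mu, size)
print("estimated total:", total, "estimated unseen:", total - sum(hist.values()))
```

In this sketch the correction is applied per histogram; how molecules are grouped into histograms (per gene, per cell, or pooled) is a separate design decision not covered here.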


https://doi.org/10.1186/s13059-021-02368-1

treeclimbR is for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.

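1. Data aggregation (each node is the sum of its children/leaves)

2. Differential analysis at every node (between groups)

3. For every node, compute a significance score U_i(t), where q_k(t) is derived from its P value p_k and estimated direction sign(θ_k), under a tuning parameter t.

4. Along each path starting from the root, stop at the first node with U_i(t) = 1 and p_i < 0.05, or on reaching a leaf.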
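Steps 1 and 4 of the list above can be sketched in plain Python. This is an illustration only, not the treeclimbR API: the tree is a hypothetical dictionary mapping each internal node to its children, and the scores `U` and P values `p` are taken as given inputs, because the score formula of step 3 is not reproduced in these notes.

```python
def aggregate(tree, leaf_counts, node):
    """Step 1: the count of a node is the sum of the counts of its children."""
    children = tree.get(node, [])
    if not children:                             # a leaf carries the observed count
        return {node: leaf_counts[node]}
    out, total = {}, 0
    for child in children:
        out.update(aggregate(tree, leaf_counts, child))
        total += out[child]
    out[node] = total
    return out

def select_nodes(tree, root, U, p, alpha=0.05):
    """Step 4: walk down from the root; stop at U == 1 and p < alpha, or at a leaf."""
    selected, stack = [], [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        if (U[node] == 1 and p[node] < alpha) or not children:
            selected.append(node)                # the path ends here
        else:
            stack.extend(children)               # keep descending
    return selected

# Toy tree with two branches of two leaves each (all values are made up).
tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
counts = aggregate(tree, {"a1": 1, "a2": 2, "b1": 3, "b2": 4}, "root")
U = {"root": 0.5, "A": 1, "B": 0.5, "a1": 1, "a2": 1, "b1": 1, "b2": 0}
p = {"root": 0.2, "A": 0.01, "B": 0.3, "a1": 0.02, "a2": 0.03, "b1": 0.01, "b2": 0.6}
print(counts)
print(sorted(select_nodes(tree, "root", U, p)))
```

In the toy example the walk stops at branch A as a whole and descends to the leaves under B. A leaf where the walk stops is returned whether or not it is significant, so a real analysis would still filter the result by P value.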

R package TreeHeatmap (https://github.com/fionarhuang/TreeHeatmap).


One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.


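The model can be viewed as a confounder-adjusted regression fitted to each omics feature.

The null hypothesis is H0: α = 0, and the estimate of α is obtained by least squares

(with adjustment)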
