Manifold learning-based methods for analyzing single-cell RNA-sequencing data

https://doi.org/10.1016/j.coisb.2017.12.008 

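Published by Yale University in December 2017, this paper covers manifold-learning-based methods from machine learning for dimensionality reduction and denoising of single-cell data.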

Manifold learning:

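Assume the data are sampled uniformly from a low-dimensional manifold embedded in a high-dimensional Euclidean space. Manifold learning recovers that low-dimensional structure from the high-dimensional samples: it finds the low-dimensional manifold in the high-dimensional space, together with the corresponding embedding map, in order to reduce dimensionality or visualize the data. In other words, it looks past the observed phenomena for the essence of things, the underlying rules that generate the data.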
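To make the "low-dimensional manifold in a high-dimensional space" idea concrete, here is a minimal NumPy sketch of my own (not from the paper). It samples a 1-D latent coordinate `t`, bends it into a half circle, and embeds that in a 10-D space with a random orthonormal basis. The names `t`, `basis` and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent 1-D coordinate (the "intrinsic" variable), sampled uniformly.
t = rng.uniform(0.0, np.pi, size=200)

# A half circle: a 1-D manifold that lives in a 2-D plane.
plane = np.column_stack([np.cos(t), np.sin(t)])

# Embed the plane in a 10-D ambient space with a random orthonormal basis.
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # shape (10, 2)
X = plane @ basis.T                                # shape (200, 10)

# PCA via SVD of the centered data: count the directions that carry variance.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
n_significant = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(n_significant)
```

The 10-D data only occupy two linear directions, and within those two directions the points are still governed by the single variable `t`. A linear method can find the plane, but recovering `t` itself needs a nonlinear method.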

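Common manifold learning (MFL) methods include PCA, MDS, and diffusion maps. The figure below briefly summarizes the strengths and weaknesses of the different methods.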

[Figure 1: strengths and weaknesses of common manifold learning methods]
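Of the methods listed, diffusion maps are probably the least familiar, so here is a rough NumPy sketch of the basic recipe as I understand it: Gaussian kernel, row-normalize into a Markov matrix, then take its leading eigenvectors. The function name `diffusion_map`, the fixed bandwidth `sigma`, and the toy half-circle data are my own assumptions, not anything taken from the paper.

```python
import numpy as np

def diffusion_map(X, sigma, n_components=2):
    """Return the leading non-trivial eigenvalues and diffusion coordinates."""
    # Pairwise squared Euclidean distances between all points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian affinity: nearby points get weights close to 1.
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = K.sum(axis=1)
    # Symmetric matrix with the same eigenvalues as the Markov matrix P = D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of P.
    psi = eigvecs / np.sqrt(d)[:, None]
    # Skip component 0 (eigenvalue 1, constant vector): it carries no structure.
    keep = slice(1, n_components + 1)
    return eigvals[keep], psi[:, keep] * eigvals[keep]

# Toy data: 100 points evenly spaced along a half circle.
t = np.linspace(0.0, np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t)])

eigvals, coords = diffusion_map(X, sigma=0.3)
corr = np.corrcoef(coords[:, 0], t)[0, 1]
print(eigvals, abs(corr))
```

On this toy curve the first diffusion coordinate tracks the position along the arc, which is the kind of nonlinear structure PCA cannot unroll. Real implementations usually add kNN sparsification and density normalization, which this sketch leaves out.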

Keywords of the paper: MFL ("Manifold models can also be useful for analyzing data generated from disparate dynamics or profiles as the data can be modeled with several disconnected manifolds"), DPT (diffusion pseudotime: "a pseudotime trajectory through the data to describe a latent axis of development or cell state transition"), and the DPT method ("to find a major axis of variability in the data, DPT defines a distance from a source cell to all other cells over a modified transition operator that includes only non-trivial diffusion components. This produces trajectories of nonlinear variation across a dataset").
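The DPT description quoted above can be sketched in a few lines of NumPy. This is my own simplified reading, not the reference implementation: build a symmetric-normalized Gaussian kernel, drop the trivial eigen-component, weight each remaining component by `lam / (1 - lam)` (which sums the diffusion over all walk lengths), and measure the distance from a root cell. The function name `diffusion_pseudotime`, the parameters `root` and `sigma`, and the toy data are assumptions.

```python
import numpy as np

def diffusion_pseudotime(X, root, sigma):
    """Distance from the root cell to every cell over non-trivial diffusion components."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    d = K.sum(axis=1)
    # Symmetric-normalized transition operator.
    T = K / np.sqrt(np.outer(d, d))
    lam, vec = np.linalg.eigh(T)
    order = np.argsort(lam)[::-1]
    lam, vec = lam[order], vec[:, order]
    # Drop the trivial component (eigenvalue 1, the stationary state).
    lam, vec = lam[1:], vec[:, 1:]
    # Summing T^k over all k >= 1 turns each eigenvalue into lam / (1 - lam).
    weights = lam / (1.0 - lam)
    diff = vec - vec[root]
    return np.sqrt(np.sum((weights * diff) ** 2, axis=1))

# Toy "developmental trajectory": cells evenly spaced along a half circle.
t = np.linspace(0.0, np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t)])

pseudotime = diffusion_pseudotime(X, root=0, sigma=0.3)
corr = np.corrcoef(pseudotime, t)[0, 1]
print(pseudotime[0], corr)
```

No tree or bifurcation shape is assumed anywhere: the ordering comes straight from distances in diffusion space, measured from whichever cell is chosen as the root.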

In the paper's workflow for analyzing scRNA-seq data, MFL is used in the second step:

Gene selection

Manifold learning

Cell organization

Dimensionality reduction and visualization

Density estimation and clustering

The first three steps together are referred to as pseudotime methods.
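As a toy illustration of the three pseudotime stages (gene selection, manifold learning, cell organization), here is a hedged NumPy sketch on simulated data. Everything in it is an assumption of mine: the simulated expression matrix `expr`, variance-based gene selection, and plain PCA standing in for the manifold learning step. The methods compared in the paper each make their own choices at every stage.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes, n_informative = 150, 200, 20

# Simulated data: the first 20 genes follow a hidden "latent time", the rest are noise.
latent_time = np.sort(rng.uniform(0.0, 1.0, size=n_cells))
expr = rng.normal(scale=0.3, size=(n_cells, n_genes))
slopes = rng.uniform(2.0, 4.0, size=n_informative)
expr[:, :n_informative] += latent_time[:, None] * slopes

# Stage 1, gene selection: keep the most variable genes.
variances = expr.var(axis=0)
selected = np.argsort(variances)[::-1][:n_informative]

# Stage 2, manifold learning: PCA via SVD as a simple linear stand-in.
Xc = expr[:, selected] - expr[:, selected].mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embedding = U[:, :2] * S[:2]

# Stage 3, cell organization: order cells along the first embedding axis.
pseudotime_order = np.argsort(embedding[:, 0])

corr = np.corrcoef(embedding[:, 0], latent_time)[0, 1]
print(sorted(selected.tolist()), abs(corr))
```

The sign of a principal component is arbitrary, so the recovered ordering may run from late to early. That is why the check uses the absolute correlation.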

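The figure below lays out the paper's analysis workflow clearly, and it is a really beautiful figure too. I think I need more practice before I can make figures like this; analyzing the workflow is more my strong suit, haha: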

[Figure 2: overview of the paper's analysis workflow]

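Each plot has its own subtitle, which is enough to understand what the authors are doing.

Of these, the algorithmic core of the paper is in the figure below: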

[Figure 3: comparison of pseudotime methods]

Comparison of pseudotime methods. Pseudotime methods (four kinds are compared) may generally be broken down into three stages: gene selection, manifold learning, and cell organization.

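From this, the authors point out a limitation of existing methods:

"A current limitation of these methods is their reliance to varying degrees on assumptions about the underlying shape of the data (e.g. a tree, bifurcating trajectory, etc.)". In other words, the geometry assumed for the data has a large influence on the downstream cell typing.

DPT, the last of the methods compared, is described as providing two significant advantages over other pseudotemporal techniques. First, working directly on a diffusion map does not require any greedy computational steps (classic hierarchical clustering algorithms are greedy at every step, i.e. each step is locally optimal rather than globally optimal for the tree). Second and most importantly, because DPT operates directly on the diffusion space, it features the least coarse graining or over-fitting of data into low-dimensional assumptions (DPT works on the whole diffusion space rather than on a bifurcating or tree structure, so it does the least coarse-graining and over-fitting of the data to low-dimensional assumptions).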

The validation at the end of the paper:

 

[Figure 4: validation results]

Validation of three dimensionality-reduction analyses, plus a Jaccard index similarity validation on simulated data points using a Jaccard graph, which I mentioned in a previous blog post.
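I have not reproduced the paper's exact validation protocol, but one common way to use the Jaccard index for this kind of check is to compare each point's k-nearest-neighbor set before and after an embedding. The sketch below does that in NumPy; the function names `knn_sets` and `mean_jaccard` and the random test data are my own assumptions.

```python
import numpy as np

def knn_sets(X, k):
    """Return, for every point, the set of indices of its k nearest neighbors."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    idx = np.argsort(sq_dists, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def mean_jaccard(X_a, X_b, k=10):
    """Average Jaccard index |A & B| / |A | B| between matching kNN sets."""
    sets_a, sets_b = knn_sets(X_a, k), knn_sets(X_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return float(np.mean(scores))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
Y = rng.normal(size=(100, 5))  # unrelated point cloud

same = mean_jaccard(X, X, k=5)
scaled = mean_jaccard(X, 2.0 * X, k=5)
unrelated = mean_jaccard(X, Y, k=5)
print(same, scaled, unrelated)
```

A score near 1 means the embedding keeps local neighborhoods intact, while a score near 0 means the neighborhoods have been scrambled.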

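The whole paper is a narrative introduction to the algorithms, with no formulas or code at all. In my humble opinion, the most important machine learning ideas in it are manifold learning, pseudotime methods, and the dimensionality-reduction methods derived from MFL.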

Here is a CSDN blog post on MFL that explains it really well.

https://blog.csdn.net/chl033/article/details/6107042

 


 

Reposted from: https://www.cnblogs.com/beckygogogo/p/9195248.html
