Issue 13
Motivation
It is well known that the integration among different data sources is reliable because of its potential of unveiling new functionalities of the genomic expressions, which might be dormant in a single-source analysis. Moreover, different studies have justified the more powerful analyses of multi-platform data. Toward this, in this study, we consider the circadian genes’ omics profile, such as copy number changes and RNA-sequence data along with their survival response. We develop a Bayesian structural equation modeling coupled with linear regressions and log normal accelerated failure-time regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler using the conditional distributions of them.
Results
Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings.
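Background: Circadian oscillation is a fundamental process that regulates many physiological and metabolic processes. Disruption of the circadian rhythm is associated with serious physiological consequences, including metabolic disorders and cancer. Circadian genes are implicated in the pathogenesis of cancers such as glioblastoma and breast cancer; this study focuses on circadian genes and their effect on patient survival.
Methods: The paper proposes combining Bayesian structural equation modeling with a Bayesian accelerated failure-time (AFT) model to integrate RNA-seq and DNA CNV data and predict prognosis.
Results: In simulations, the method outperforms ordinary regression models; applied to TCGA cancer datasets, integrating CNV and RNA-seq gives a better fit to survival.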
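As a rough illustration of how a structural equation and a log-normal AFT regression can be chained, one plausible form is sketched below. The symbols (r for RNA-seq expression, c for copy number, T for survival time, and all coefficients) are assumptions made for this sketch; the paper's exact parameterization may differ.

```latex
% Structural part (assumed form): expression of gene g in subject i regressed on its copy number
r_{ig} = \alpha_g + \beta_g\, c_{ig} + \epsilon_{ig}, \qquad \epsilon_{ig} \sim \mathcal{N}(0, \sigma_g^2)

% Survival part (assumed form): log-normal AFT regression using both platforms
\log T_i = \gamma_0 + \sum_{g} \gamma_g\, r_{ig} + \sum_{g} \delta_g\, c_{ig} + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \tau^2)
```

Because both parts are Gaussian on the appropriate scale, conjugate priors on the regression coefficients yield closed-form conditional distributions, which is what makes a Gibbs sampler convenient here.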
Motivation
The matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal the tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide clear bicluster structure. Furthermore, these algorithms are based on the assumption of linear combination, which may not be sufficient to capture the coregulation patterns.
Results
We presented a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrix with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world application demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy.
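Background: Clustering samples can reveal cellular heterogeneity, and co-expression clustering of genes can reveal relationships between transcription factors and their target genes. Earlier BMF methods place many restrictions on the Boolean factors.
Methods: The paper proposes a new BMF algorithm that does not need to assume the size of the Boolean factors.
Results: Breast cancer subtype classification, cell-type deconvolution from single-cell sequencing, and segmentation of spatial transcriptomics data.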
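To make the difference between a Boolean and a linear combination concrete, here is a minimal sketch of the Boolean matrix product and the reconstruction error used to compare BMF methods. It does not implement BEM itself; the function names and the toy matrices are illustrative assumptions.

```python
def boolean_product(A, B):
    """Boolean matrix product: out[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    return [
        [int(any(a_ik and B[k][j] for k, a_ik in enumerate(row)))
         for j in range(len(B[0]))]
        for row in A
    ]


def reconstruction_error(X, A, B):
    """Number of entries where X differs from the Boolean product of A and B."""
    R = boolean_product(A, B)
    return sum(x != r for row_x, row_r in zip(X, R) for x, r in zip(row_x, row_r))


# Toy example: two overlapping biclusters (rows = genes, columns = samples).
A = [[1, 0],
     [1, 1],
     [0, 1]]          # gene membership in each pattern
B = [[1, 1, 0],
     [0, 1, 1]]       # sample membership in each pattern

X = boolean_product(A, B)
error = reconstruction_error(X, A, B)
```

In the overlapping cell (second gene, second sample) the Boolean product stays at 1, whereas an ordinary matrix product would give 2. That saturation is what gives each factor pair a clean bicluster interpretation.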
Motivation
Many ordinary differential equation (ODE) models have been introduced to replace linear regression models for inferring gene regulatory relationships from time-course gene expression data. But, since the observed data are usually not direct measurements of the gene products or there is an unknown time lag in gene regulation, it is problematic to directly apply traditional ODE models or linear regression models.
Results
We introduce a lagged ODE model to infer lagged gene regulatory relationships from time-course measurements, which are modeled as linear transformation of the gene products. A time-course microarray dataset from a yeast cell-cycle study is used for simulation assessment of the methods and real data analysis. The results show that our method, by considering both time lag and measurement scaling, performs much better than other linear and ODE models. It indicates the necessity of explicitly modeling the time lag and measurement scaling in ODE gene regulatory models.
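Background: Inferring gene regulatory networks is a central task in systems biology, and ODE models are often used to describe dynamical systems. In gene regulatory network studies, however, two problems arise: 1. a linear scaling is introduced between the experimental measurements and the original data; 2. there is a time lag.
Methods: A stochastic nonlinear regression method is applied to estimate all parameters of the ODE model simultaneously.
Results: On simulated datasets, accuracy improves over linear models, over ODE models that ignore linear scaling, and over ODE models that ignore a continuous time lag. The method also performs well on real data.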
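The two ingredients named above, a time lag in regulation and a linear scaling of the measurements, can be written down as a pair of equations. This is only a sketch of one possible form; the symbols (x for the unobserved gene product, y for the measurement, tau for the lag, a, mu, and s for coefficients) are assumptions, not the paper's notation.

```latex
% Dynamics (assumed form): gene i is driven by its regulators j, evaluated tau time units earlier
\frac{dx_i(t)}{dt} = \sum_{j} a_{ij}\, x_j(t - \tau)

% Observation (assumed form): the measurement is a linear transformation of the gene product plus noise
y_i(t_k) = \mu_i + s_i\, x_i(t_k) + e_{ik}
```

Setting tau to zero and fixing mu_i = 0, s_i = 1 recovers a traditional linear ODE model applied directly to the measurements, which is the comparison the results describe.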
Issue 11
Motivation
Single-cell sequencing (SCS) data provide unprecedented insights into intratumoral heterogeneity. With SCS, we can better characterize clonal genotypes and reconstruct phylogenetic relationships of tumor cells/clones. However, SCS data are often error-prone, making their computational analysis challenging.
Results
To infer the clonal evolution in tumor from the error-prone SCS data, we developed an efficient computational framework, termed RobustClone. It recovers the true genotypes of subclones based on the extended robust principal component analysis, a low-rank matrix decomposition method, and reconstructs the subclonal evolutionary tree. RobustClone is a model-free method, which can be applied to both single-cell single nucleotide variation (scSNV) and single-cell copy-number variation (scCNV) data. It is efficient and scalable to large-scale datasets. We conducted a set of systematic evaluations on simulated datasets and demonstrated that RobustClone outperforms state-of-the-art methods in large-scale data both in accuracy and efficiency. We further validated RobustClone on two scSNV and two scCNV datasets and demonstrated that RobustClone could recover genotype matrix and infer the subclonal evolution tree accurately under various scenarios. In particular, RobustClone revealed the spatial progression patterns of subclonal evolution on the large-scale 10X Genomics scCNV breast cancer dataset.
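Background: Understanding the evolutionary mechanisms that drive cancer progression and characterizing intratumoral heterogeneity can guide the prediction and control of cancer progression, metastasis, and treatment response. Single-cell sequencing captures cellular heterogeneity better, but it is error-prone; the errors include false positives (FP), false negatives (FN), missing bases (MB), and overlapping cells.
Methods: PCA is commonly used to recover low-rank data from data contaminated by heavy noise; the paper extends it to robust PCA (RPCA) for added robustness. It uses scSNV and scCNV data to recover the true genotypes of cells, identify subclones, and reconstruct the subclonal evolutionary tree.
Results: On both simulated and real datasets, the method has an advantage over existing methods in accuracy and efficiency.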
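For orientation, the standard robust PCA problem (principal component pursuit) splits an observed matrix M into a low-rank part L and a sparse part S. The formulation below is the textbook version; the "extended" variant used by RobustClone, which also has to cope with missing entries, may differ in its details.

```latex
% Standard RPCA: nuclear norm promotes low rank, entrywise l1 norm promotes sparsity
\min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{subject to} \quad M = L + S
```

Read against the note above, M would be the noisy cell-by-locus genotype matrix, L the recovered genotypes shared within subclones (few distinct rows, hence low rank), and S the sparse set of erroneous calls.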
Motivation
In the analysis of high-throughput omics data from tissue samples, estimating and accounting for cell composition have been recognized as important steps. High cost, intensive labor requirements and technical limitations hinder the cell composition quantification using cell-sorting or single-cell technologies. Computational methods for cell composition estimation are available, but they are either limited by the availability of a reference panel or suffer from low accuracy.
Results
We introduce TOols for the Analysis of heterogeneouS Tissues TOAST/-P and TOAST/+P, two partial reference-free algorithms for estimating cell composition of heterogeneous tissues based on their gene expression profiles. TOAST/-P and TOAST/+P incorporate additional biological information, including cell-type-specific markers and prior knowledge of compositions, in the estimation procedure. Extensive simulation studies and real data analyses demonstrate that the proposed methods provide more accurate and robust cell composition estimation than existing methods.
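Background: Cell composition (cell types and their proportions) can be obtained experimentally with techniques such as immunohistochemistry, flow cytometry, and single-cell sequencing. Reference-based (RB) methods need reference data from purified tissue as predictors; reference-free (RF) methods need no reference but require a large number of samples; partial reference-free (PRF) methods need additional data to improve the predictions.
Methods: The paper proposes two partial reference-free deconvolution methods, TOAST/-P and TOAST/+P, which use gene expression data together with cell-type-specific markers and prior knowledge of composition to guide cell composition estimation. TOAST/-P uses no prior on cell composition; TOAST/+P does.
Results: The methods outperform existing methods in accuracy and robustness.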
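The deconvolution problem behind all three method families (RB, RF, PRF) is usually stated as a constrained matrix factorization. The sketch below uses assumed symbols: Y is the genes-by-samples expression matrix, W the genes-by-cell-types profile matrix, H the cell-types-by-samples proportion matrix, and K the number of cell types.

```latex
% Bulk expression as a mixture of cell-type profiles weighted by proportions
Y \approx W H, \qquad W \ge 0,\; H \ge 0,\; \sum_{k=1}^{K} H_{kj} = 1 \;\; \text{for every sample } j
```

In this notation a reference-based method treats W as known and solves only for H, a reference-free method estimates both, and a partial reference-free method estimates both while using markers or prior proportions to constrain the solution.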
Motivation
Cell-type-specific surface proteins can be exploited as valuable markers for a range of applications including immunophenotyping live cells, targeted drug delivery and in vivo imaging. Despite their utility and relevance, the unique combination of molecules present at the cell surface is not yet described for most cell types. A significant challenge in analyzing ‘omic’ discovery datasets is the selection of candidate markers that are most applicable for downstream applications.
Results
Here, we developed GenieScore, a prioritization metric that integrates a consensus-based prediction of cell surface localization with user-input data to rank-order candidate cell-type-specific surface markers. In this report, we demonstrate the utility of GenieScore for analyzing human and rodent data from proteomic and transcriptomic experiments in the areas of cancer, stem cell and islet biology. We also demonstrate that permutations of GenieScore, termed IsoGenieScore and OmniGenieScore, can efficiently prioritize co-expressed and intracellular cell-type-specific markers, respectively.
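Background: Cell-type-specific surface proteins can serve as valuable markers for many applications, including immunophenotyping of live cells, targeted drug delivery, and in vivo imaging. However, the surface proteins of most cell types are not yet known.
Methods: The authors developed GenieScore, a prioritization metric that integrates a consensus-based prediction of cell surface localization with user-supplied quantitative data (proteomic or transcriptomic) to rank candidate cell-type-specific surface markers. GenieScore is the product of three terms: whether the protein is a surface protein, whether its abundance differs, and its signal strength (whether it can be detected by a specific antibody).
Results: The authors built SurfaceGenie, a web interface that computes GenieScore and provides ontology annotations.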
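The "product of three terms" idea can be illustrated with a toy score. This is not the published GenieScore formula: the function `toy_score`, the use of a Gini coefficient for abundance differences, the log2 signal term, and the integer surface score are all assumptions made for this sketch.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values: 0 means even, values near 1 mean concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def toy_score(surface_score, abundances):
    """Hypothetical score: surface evidence x abundance specificity x signal strength."""
    specificity = gini(abundances)            # assumed stand-in for "abundance differs"
    signal = math.log2(max(abundances) + 1)   # assumed stand-in for "signal strength"
    return surface_score * specificity * signal


# Hypothetical candidates: (surface evidence count, abundance across four cell types).
candidates = {
    "marker_A": (3, [0, 0, 0, 7]),   # surface-localized and specific to one cell type
    "marker_B": (3, [7, 7, 7, 7]),   # surface-localized but uniformly expressed
    "marker_C": (0, [0, 0, 0, 7]),   # specific but with no surface evidence
}
ranked = sorted(candidates, key=lambda name: toy_score(*candidates[name]), reverse=True)
```

Because the terms are multiplied, a candidate that fails any one criterion scores zero, so only proteins that are surface-localized, cell-type-specific, and detectable rise to the top of the ranking.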
Background
Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem.
Results
In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows.
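Background: A major challenge of the post-genomic era is to characterize every gene in the genome by the cellular pathways it participates in and the multifactorial traits and diseases it is associated with. Computationally predicting associations between genes and pathways, traits, or diseases (the task called "gene classification" here) is essential to this goal.
Methods: The paper proposes supervised learning for network-based gene classification.
Results: Supervised learning outperforms label propagation and node embeddings.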
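Label propagation, the baseline in this comparison, can be sketched as a random walk with restart from the known positive genes. This is one common form of label propagation and not necessarily the exact variant benchmarked in the paper; the function name, the restart probability of 0.5, and the four-node path graph are assumptions for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, iterations=100):
    """Propagate seed labels over a graph; returns one score per node (scores sum to 1)."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # Restart distribution: uniform over the seed (known positive) nodes.
    start = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    scores = start[:]
    for _ in range(iterations):
        # Each node passes its current score evenly to its neighbours.
        spread = [
            sum(adjacency[i][j] * scores[j] / degree[j] for j in range(n) if degree[j])
            for i in range(n)
        ]
        scores = [restart * start[i] + (1 - restart) * spread[i] for i in range(n)]
    return scores


# Toy network: a path 0 - 1 - 2 - 3, with gene 0 as the only known positive.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(path, seeds={0})
```

Genes closer to the seed receive higher scores, which is how label propagation uses network topology directly. The supervised alternative described in the abstract instead treats each gene's row of the adjacency matrix as a feature vector for a classifier.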
Motivation
Cell-to-cell variation has uncovered associations between cellular phenotypes. However, it remains challenging to address the cellular diversity of such associations.
Results
Here, we do not rely on the conventional assumption that the same association holds throughout the entire cell population. Instead, we assume that associations may exist in a certain subset of the cells. We developed CEllular Niche Association (CENA) to reliably predict pairwise associations together with the cell subsets in which the associations are detected. CENA does not rely on predefined subsets but only requires that the cells of each predicted subset would share a certain characteristic state. CENA may therefore reveal dynamic modulation of dependencies along cellular trajectories of temporally evolving states. Using simulated data, we show the advantage of CENA over existing methods and its scalability to a large number of cells. Application of CENA to real biological data demonstrates dynamic changes in associations that would be otherwise masked.
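Background: Cell trajectories have been used to describe temporal changes in gene expression, but not yet to study temporal changes in associations (dependencies and interactions between cellular phenotypes). These associations are masked when the whole cell population is analyzed, so the analysis is carried out on specific cell subsets.
Methods: The authors developed CENA, a new method for capturing associations that exist mainly within a subset of cells. CENA recasts the original problem of identifying the cell subset that carries an association as a heavy-subnetwork detection problem.
Results: CENA addresses an important task in single-cell genomics: exploring how associations change across the cell-state space.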