Bioinformatics
Volume 36, Issue 15, 1 August 2020
Motivation
Single-cell RNA-sequencing (scRNA-seq) has become an important tool to unravel cellular heterogeneity, discover new cell (sub)types, and understand cell development at single-cell resolution. However, one major challenge to scRNA-seq research is the presence of ‘drop-out’ events, which are usually due to extremely low mRNA input or the stochastic nature of gene expression. In this article, we present a novel single-cell RNA-seq drop-out correction (scDoc) method, which imputes drop-out events by borrowing information for the same gene from highly similar cells.
Results
scDoc is the first method that directly incorporates drop-out information into cell-to-cell similarity estimation, which is crucial in scRNA-seq drop-out imputation but has not been appropriately examined. We evaluated the performance of scDoc using both simulated data and real scRNA-seq studies. Results show that scDoc outperforms the existing imputation methods with respect to data visualization, cell subpopulation identification and differential expression detection in scRNA-seq data.
Key: estimating cell-to-cell similarity directly from drop-out information
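One way to picture "similarity that accounts for drop-out" is a weighted cosine similarity in which genes likely to be drop-outs contribute less. The sketch below is only an illustration of that idea, not scDoc's actual model: the function name, the per-gene drop-out probabilities and the weighting scheme (1 - p_a) * (1 - p_b) are all assumptions made for this example.

```python
import math

def weighted_cosine(x, y, drop_x, drop_y):
    """Cosine similarity with per-gene weights (hypothetical scheme).

    x, y           : expression values of two cells over the same genes
    drop_x, drop_y : assumed probabilities that each value is a drop-out
    A gene's weight is (1 - drop_x[g]) * (1 - drop_y[g]), so a gene that is
    almost certainly a drop-out in either cell is nearly ignored.
    """
    num = sx = sy = 0.0
    for a, b, pa, pb in zip(x, y, drop_x, drop_y):
        w = (1.0 - pa) * (1.0 - pb)
        num += w * a * b
        sx += w * a * a
        sy += w * b * b
    if sx == 0.0 or sy == 0.0:
        return 0.0  # no informative genes left
    return num / math.sqrt(sx * sy)

# Toy data (made up): cell_a has a zero at the second gene that we assume
# is a drop-out with probability 1.
cell_a = [5.0, 0.0, 3.0]
cell_b = [5.0, 4.0, 3.0]
no_drop = [0.0, 0.0, 0.0]

plain = weighted_cosine(cell_a, cell_b, no_drop, no_drop)
aware = weighted_cosine(cell_a, cell_b, [0.0, 1.0, 0.0], no_drop)
```

With the suspected drop-out masked, the two toy cells look identical on the remaining genes, whereas the unweighted similarity is pulled down by the zero.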
Motivation
One of the main challenges in applying graph convolutional neural networks (CNNs) on gene-interaction data is the lack of understanding of the vector space to which they belong, and also the inherent difficulties involved in representing those interactions on a significantly lower dimension, viz. Euclidean spaces. The challenge becomes more prevalent when dealing with various types of heterogeneous data. We introduce a systematic, generalized method, called iSOM-GSN, used to transform multi-omic data with higher dimensions onto a 2D grid. Afterwards, we apply a CNN to predict disease states of various types. Based on the idea of Kohonen's self-organizing map, we generate a 2D grid for each sample for a given set of genes that represent a gene similarity network.
Results
We have tested the model to predict breast and prostate cancer using gene expression, DNA methylation and copy number alteration. Prediction accuracies in the 94–98% range were obtained for tumor stages of breast cancer and calculated Gleason scores of prostate cancer with just 14 input genes for both cases. The scheme not only outputs nearly perfect classification accuracy, but also provides an enhanced scheme for representation learning, visualization, dimensionality reduction and interpretation of multi-omic data.
Key: gene-interaction data; graph convolutional neural networks (CNNs)
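The general idea of a Kohonen self-organizing map (SOM), and of turning one sample into a 2D grid once genes have grid positions, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the iSOM-GSN procedure: the function names, the linear decay schedule, and the choice to average genes that land on the same node are all hypothetical.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid node whose weight vector is closest to x."""
    return min(weights, key=lambda n: sum((w - v) ** 2
                                          for w, v in zip(weights[n], x)))

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # linear decay (assumption)
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 1e-3)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

def sample_to_grid(sample_values, gene_nodes, rows, cols):
    """Place one sample's gene values on the grid; colliding genes are averaged."""
    sums = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    for value, (r, c) in zip(sample_values, gene_nodes):
        sums[r][c] += value
        counts[r][c] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(cols)] for r in range(rows)]

# Toy gene profiles (made up): each gene is described by two features.
genes = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(genes, rows=2, cols=2)
gene_nodes = [best_matching_unit(som, g) for g in genes]
grid = sample_to_grid([2.0, 4.0, 6.0, 8.0], gene_nodes, 2, 2)
```

The resulting `grid` is a small image-like array per sample, which is the kind of input a CNN expects.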
Motivation
T-cell receptors (TCRs) function to recognize antigens and play vital roles in T-cell immunology. Surveying TCR repertoires by characterizing complementarity-determining region 3 (CDR3) is a key issue. Due to the high diversity of CDR3 and technological limitations, accurate characterization of CDR3 repertoires remains a great challenge.
Results
We propose a computational method named CATT for ultra-sensitive and precise detection of TCR CDR3 sequences. CATT can be applied to TCR sequencing, RNA-Seq and single-cell TCR(RNA)-Seq data to characterize CDR3 repertoires. CATT integrates a de Bruijn graph-based micro-assembly algorithm, a data-driven error correction model and a Bayesian inference algorithm to characterize CDR3 repertoires self-adaptively and ultra-sensitively with high performance. Benchmark results on in silico and experimental datasets demonstrated that CATT showed superior recall and precision compared with existing tools, especially for data with short read length and small size and for single-cell sequencing data. Thus, CATT will be a useful tool for TCR analysis in cancer and immunology research.
Key: TCR analysis; CDR3 sequence detection
Motivation
Methylation and transcription factors (TFs) are part of the mechanisms regulating gene expression. However, the numerous mechanisms regulating the interactions between methylation and TFs remain unknown. We employ machine-learning techniques to discover the characteristics of TFs that bind to methylation sites.
Results
The classical machine-learning analysis process focuses on improving the performance of the analysis method. Conversely, we focus on the functional properties of the TF sequences. We obtain the principal properties of TFs, namely, the basic polar and hydrophobic Ile amino acids affecting the interaction between TFs and methylated DNA. The recall of the positive instances is 0.878 when their basic polar value is >0.1743. Both basic polar and hydrophobic Ile amino acids distinguish 74% of TFs bound to methylation sites. Therefore, we infer that basic polar amino acids affect the interactions of TFs with methylation sites. Based on our results, the role of the hydrophobic Ile residue is consistent with that described in previous studies, and the basic polar amino acids may also be a key factor modulating the interactions between TFs and methylation.
Key: interactions between methylation sites and TFs
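To make the reported figure concrete, recall for a single-threshold rule is the fraction of true positives whose feature value exceeds the cut-off. The sketch below shows that calculation on made-up numbers. It assumes, purely for illustration, that the "basic polar value" is the fraction of basic residues (Lys, Arg, His) in a TF sequence; the abstract does not define the feature, and both function names are hypothetical.

```python
BASIC_RESIDUES = set("KRH")  # lysine, arginine, histidine

def basic_fraction(protein_seq):
    """Fraction of basic residues in a protein sequence (assumed feature)."""
    seq = protein_seq.upper()
    if not seq:
        return 0.0
    return sum(aa in BASIC_RESIDUES for aa in seq) / len(seq)

def recall_above_threshold(values, labels, threshold):
    """Share of positive instances whose value is strictly above threshold."""
    positives = [v for v, is_pos in zip(values, labels) if is_pos]
    if not positives:
        return 0.0
    return sum(v > threshold for v in positives) / len(positives)

# Toy feature values and labels (made up, not data from the study).
values = [0.20, 0.10, 0.30, 0.25]
labels = [True, True, True, False]
toy_recall = recall_above_threshold(values, labels, threshold=0.1743)
```

Here two of the three toy positives exceed the threshold, so the toy recall is 2/3; the 0.878 in the abstract is the same quantity computed on the authors' data.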
Motivation
Transposable element (TE) classification is an essential step to decode the roles of TEs in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.
Results
We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks (CNNs). DeepTE converts sequences into input vectors based on k-mer counts. A tree-structured classification process was used, in which eight models were trained to classify TEs into superfamilies and orders. DeepTE also detects domains inside TEs to correct false classifications. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24 and 16 superfamilies in plants, metazoans and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages CNNs for TE classification, and can be used to precisely classify TEs in newly sequenced eukaryotic genomes.
Key: classifying transposable elements with convolutional neural networks; k-mer counts
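Turning a sequence into a fixed-length vector of k-mer counts can be sketched in a few lines. This is a generic illustration, not DeepTE's code: the function name, the default k and the decision to skip k-mers containing non-ACGT characters are assumptions.

```python
from itertools import product

def kmer_count_vector(seq, k=3, alphabet="ACGT"):
    """Count every k-mer over `alphabet` in seq; returns a list of length 4**k.

    Positions follow the lexicographic order produced by itertools.product
    (AA, AC, AG, AT, CA, ... for k=2). Windows containing other characters,
    such as N, are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec

vec = kmer_count_vector("ACGTAC", k=2)
# Windows: AC, CG, GT, TA, AC -> AC is counted twice, the others once.
```

Because the vector length depends only on k, sequences of different lengths map to inputs of the same shape, which is what a fixed-size network input layer needs.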
Motivation
Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell's position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes.
Results
In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method's versatility to identify key genes for a variety of biological processes.
Key: dimensionality reduction for scRNA-seq analysis; PCA
Motivation
Intercellular communication plays an essential role in multicellular organisms and several algorithms to analyze it from single-cell transcriptional data have been recently published, but the results are often hard to visualize and interpret.
Results
We developed Cell cOmmunication exploration with MUltiplex NETworks (COMUNET), a tool that streamlines the interpretation of the results from cell–cell communication analyses. COMUNET uses multiplex networks to represent and cluster all potential communication patterns between cell types. The algorithm also enables the search for specific patterns of communication and can perform comparative analysis between two biological conditions. To exemplify its use, here we apply COMUNET to investigate cell communication patterns in single-cell transcriptomic datasets from mouse embryos and from an acute myeloid leukemia patient at diagnosis and after treatment.
Key: analysing and visualizing cell-cell communication from scRNA-seq data
Motivation
High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.
Results
The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods 'hunting' and 'learner of functional enrichment' should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach 'synthetic features' is not recommended because of its high false discovery rate.
Key: evaluation of high-throughput characterization techniques