Paper title: Deep Learning for Genomics: A Concise Overview
Google Scholar citations: 19
Pages: 40
Published: 2018.05
Venue: Genomics
Authors: Tianwei Yue, Haohan Wang
Abstract:
This data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out current challenges and potential research directions for future genomics applications.
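Conclusions:
- we have limited abilities to interpret the genomic information but expect from deep learning a superhuman intelligence that explores beyond our knowledge. I used to think this was unrealistic, but considering where artificial intelligence is heading, I now think humans should play the role of providing the initial spark, while computers really can arrive at things beyond our knowledge.
- deep learning applications slightly lag behind traditional statistical inferences in terms of interpretation. Judging by the outlook, deep learning is expected to eventually surpass traditional methods in interpretability, which is why so many people are working in this direction.
- current applications have not brought about a watershed revolution in genomic research. That is, there has been no breakthrough yet, such as a method that changes the research paradigm. By comparison, computer vision has already gone through such a paradigm shift: nearly everything there is now done with deep learning.
- The predictive performances in most problems have not reached the expectation for real-world applications, neither have the interpretations of these abstruse models elucidated insightful knowledge. This is the black-box problem of deep learning. Many people around the world are studying it, so it is worth following their work for breakthroughs in interpretability; borrowing their ideas may also help in explaining biological problems.
- By careful selection of data sources and features, or appropriate design of model structures, deep learning can be driven towards a bright direction. This is really about combining traditional methods with machine learning more effectively: discard the dross and keep the essence.
- we need to bear in mind numerous challenges beyond simply improving predictive accuracy. I think keeping these challenges in mind is the most basic requirement for a researcher.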
Introduction:
- Genomic research aims to understand the genomes of different species.
- In addition to recognizing these patterns in DNA sequences, models can take other genetic and genomic information as input to build systems to help understand the biological mechanisms of underlying genes.
- Drug-related applications, as possible future research directions: precision medicine, pharmacy
- medicine: medical research and its applications such as gene therapies, molecular diagnostics, and personalized medicine could be revolutionized by tailoring high-performance computing methods to analyzing available genomic datasets.
- match the candidate protein identified by researchers with their known drug molecules.
Outline of the main text:
1. Introduction
2. Deep Learning Architectures: Genomic Perspective
2.1 Convolutional Neural Networks
2.2 Recurrent Neural Networks
2.3 Autoencoders
2.4 Emergent Deep Architectures
2.4.1 Beyond Classic Models
2.4.2 Hybrid Architectures
3. Deep Learning Architectures: Insights and Remarks
3.1 Model Interpretation
3.2 Transfer Learning and Multitask
3.3 Multi-view Learning
4. Genomic Applications
4.1 Gene Expression
4.1.1 Gene Expression Characterization
4.1.2 Gene Expression Prediction
4.2 Regulatory Genomics
4.2.1 Promoters and Enhancers
4.2.2 Splicing
4.2.3 Transcription Factors and RNA-binding Proteins
4.3 Functional Genomics
4.3.1 Mutations and Functional Activities
4.3.2 Subcellular Localization
4.4 Structural Genomics
4.4.1 Structural Classification of Proteins
4.4.2 Protein Secondary Structure
4.4.3 Protein Tertiary Structure and Quality Assessment
4.4.4 Contact Map
5. Challenges and Opportunities
5.1 The Nature of Data
5.1.1 Class-Imbalanced Data
5.1.2 Various Data Types
5.1.3 Heterogeneity and Confounding Correlations
5.2 Feature Extraction
5.2.1 Mathematical Feature Extraction
5.2.2 Feature Representation
6. Conclusion and Outlook
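Excerpts and notes from the main text:
2. Deep Learning Architectures: Genomic Perspective
- I am somewhat more familiar with how deep learning is applied in many other fields, so how are the corresponding methods applied to genomics? The matching tasks in the two fields must share some similarity.
- CNNs are very mature for image classification; in genomics they are used to automatically learn local and global characterization of genomic data.
- RNNs are widely used for speech recognition and are good at handling sequential data, so in genomics they are often used to process DNA sequences.
- Autoencoders are often used for pre-training models, denoising, and pre-processing the input data.
- When designing or choosing a deep learning model, one should combine the model's strengths, such as its ability to extract reliable features, with the biological process.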
2.1 Convolutional Neural Networks
- CNN的特点是outstanding capacity to analyze spatial information.
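- Genomic sequence motifs: in biology, a motif is a data-driven mathematical and statistical model, typically of a stretch of sequence but possibly of a structure, used to predict sequences belonging to a particular group.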
- skillfully match a CNN architecture to each particular given task
- researchers should have an in-depth understanding of CNN architectures as well as take the biological background into consideration. The same really holds when applying any kind of model.
- simply changing the network depth would not account for much improvement of model performance; researchers should pay more attention to particular techniques that can be used in CNNs, such as the kernel size, the number of feature maps, the design of pooling or convolution kernels, the choice of window size of input DNA sequences, etc., or include prior genomic information if possible. Do not simply increase the depth of the network, which only makes the model more complex; the earlier "ten tips for machine learning" paper likewise advises choosing the simplest algorithm where possible.
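The CNN notes above (kernel size, input window, pooling) can be made concrete with a small sketch. The pure-Python example below is not from the paper: it one-hot encodes a DNA window and slides one convolution filter over it, followed by global max-pooling. The function names and the hand-written `TATA` filter are hypothetical, and a real CNN would learn its filter weights from data.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence with stride 1 and no padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores

def global_max_pool(scores):
    """Return the strongest activation and the position where it occurs."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return scores[best], best

# Hand-written filter (an assumption for illustration): it rewards the motif "TATA".
tata_kernel = one_hot("TATA")
sequence = "GGCTATAAGC"
scores = conv1d_scan(one_hot(sequence), tata_kernel)
best_score, best_pos = global_max_pool(scores)
print(scores)
print(best_score, best_pos)
```

The kernel width here plays the role of the motif length, and the length of `sequence` is the input window; both are the kind of design choices the notes single out as mattering more than depth.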
2.2 Recurrent Neural Networks
- To address the vanishing gradient problem, LSTM was proposed in 1997 and GRU in 2014.
- Genomics data are typically sequential and often considered languages of biological nature.
- For example, protein function prediction can be viewed as a machine translation problem, i.e. translation between protein sequences and the language of Gene Ontology terms.
- A seq-to-seq RNN can take a variable-length sequence as input and output either a sequence or a fixed-size prediction, which is also promising for genomics.
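To illustrate how a recurrent model turns a variable-length sequence into a fixed-size representation, here is a minimal GRU sketch in pure Python. It is an assumption-laden toy rather than anything from the paper: the class name `TinyGRU` is hypothetical, the weights are random and untrained, and bias terms are omitted for brevity.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class TinyGRU:
    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One input matrix W and one recurrent matrix U per gate:
        # update gate (z), reset gate (r), candidate state (h).
        self.W = {g: random_matrix(hidden_size, input_size, rng) for g in "zrh"}
        self.U = {g: random_matrix(hidden_size, hidden_size, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], gated))]
        # Interpolate between the previous state and the candidate state.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def encode(self, inputs):
        """Consume a sequence of any length and return the final hidden state."""
        h = [0.0] * self.hidden_size
        for x in inputs:
            h = self.step(x, h)
        return h

def one_hot_dna(seq):
    return [[1.0 if base == b else 0.0 for b in "ACGT"] for base in seq]

gru = TinyGRU(input_size=4, hidden_size=3)
short_state = gru.encode(one_hot_dna("ACG"))
long_state = gru.encode(one_hot_dna("ACGTTGCAAC"))
print(len(short_state), len(long_state))
```

Both inputs end up as a vector of the same size, which a downstream layer could map to a fixed-size prediction; emitting one output per step instead would give the sequence-to-sequence variant.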
2.3 Autoencoders
- Autoencoders have proved successful for feature extraction.
- Autoencoders have been applied to gene clustering tasks and to dimension reduction in gene expression.
- When applying autoencoders, one should be aware that better reconstruction accuracy does not necessarily lead to model improvement.
- Example: a two-step VAE-based model for drug response prediction
2.4 Emergent Deep Architectures
2.4.1 Beyond Classic Models
- Researchers began to leverage more genomic intuitions to fit each particular problem with a more advanced and suitable model.
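2.4.2 Hybrid Architectures
- The fact that each type of deep neural network has its own strength inspires researchers to develop hybrid architectures.
- Question: are such complex structures really better? Wouldn't simpler structures be more suitable, ideally ones able to solve a whole class of problems?
- Some complex structures do use LSTM + attention, but none of them mention the transformer, so it may be worth searching the related literature later. Is the transformer unsuitable for bioinformatics? Probably not.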
3. Deep Learning Architectures: Insights and Remarks
- visualization techniques that bring about insights into deep learning architectures, and add remarks on model design
3.1 Model Interpretation
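- In image classification, the popular account of interpretability is that shallow layers learn simple features such as edges, while deeper layers learn increasingly specific features; this is also part of the theoretical support for transfer learning.
- Note that works conducted appropriately by classic models do not need additional techniques to visualize features. I do not fully understand this, but the cited example trained an 11-layer CNN to predict protein subcellular localization from microscopy images, and the features extracted by each layer could be interpreted well. It is worth checking later how that paper does the interpretation.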
3.2 Transfer Learning and Multitask
- Transfer learning has been done from ImageNet to the controlled-vocabulary domain, and also from unsupervised to supervised settings, among others.
- The idea of sharing information learnt across multiple related tasks (transfer learning) or sub-tasks (multitask learning) could extend the power of the limited data, especially for genomic data that is costly to obtain.
3.3 Multi-view Learning
- Multi-view learning can be achieved by concatenating features, ensemble methods, or multi-modal learning and so on.
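Two of the options listed above can be sketched in a few lines. The example below is only an illustration under assumed data: `concatenate_views` shows feature concatenation (early fusion) and `average_predictions` shows a very simple ensemble (late fusion). Both function names and all numbers are hypothetical.

```python
def concatenate_views(*views):
    """Early fusion: concatenate each sample's feature vectors across views."""
    n_samples = len(views[0])
    if any(len(view) != n_samples for view in views):
        raise ValueError("every view must describe the same samples")
    return [[value for view in views for value in view[i]] for i in range(n_samples)]

def average_predictions(*per_view_scores):
    """Late fusion: average the per-sample scores produced by one model per view."""
    n_views = len(per_view_scores)
    return [sum(scores) / n_views for scores in zip(*per_view_scores)]

# Two made-up views of the same two samples.
sequence_view = [[0.1, 0.9], [0.8, 0.2]]   # e.g. features derived from sequence
expression_view = [[5.0], [1.5]]           # e.g. one expression measurement

fused = concatenate_views(sequence_view, expression_view)
ensemble_scores = average_predictions([0.75, 0.5], [0.25, 1.0])
print(fused)
print(ensemble_scores)
```

Concatenation lets a single model see interactions between views, while averaging keeps one model per view and only combines their outputs.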
4. Genomic Applications
4.1 Gene Expression
- Gene expression is the process by which genetic information from a gene is used to synthesize a functional gene product. The products are usually proteins, but for non-protein-coding genes such as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA.
4.1.1 Gene Expression Characterization
- Denoising autoencoders come in handy since they do not merely retain the information of raw data, but also generalize meaningful and important properties of the input distribution across all input samples.
- Some other works moved to variational inference in autoencoders, which is assumed to be more skillful to capture the internal dependencies among data.
- Another thread for utilizing deep learning to characterize gene expression is to describe the pairwise relationship.
4.1.2 Gene Expression Prediction
- it could be more efficient to pre-analyze the contextual information in DNA sequences than directly making prediction.
- instead of only using sequence information, combining epigenetic data into the model might add to the explanatory power of the model.
- Generative models were also adopted due to the ability to capture high-order, latent correlations.
4.2 Regulatory Genomics
- The underlying interdependencies behind the sequences limit the flexibility of conventional methods, but deep networks that could model over-representation of sequence information have the potential to allow regulatory motifs to be identified according to their target sequences.
4.2.1 Promoters and Enhancers
- transcriptional level, occurs at the early stage of gene regulation
- Enhancers and promoters are two of the most well-characterized types of functional elements in the regions of non-coding DNA.
- A 2018 CNN-based model used a transfer learning setting on different species/datasets; another highlight is the design of adversarial training data.
- PEDLA (2016) has an embedded mechanism to handle the class-imbalanced problem. I note the year each model was proposed because, if a model is reasonably good, its group will probably keep optimizing it, so by 2019 there may already be improvements or new breakthroughs.
- DeepEnhancer (2016) reports that max-pooling and batch normalization are both very helpful for improving classification accuracy, while blindly adding more network layers does not help.
- when back-propagation does not perform well for deep networks, people can resort to stacked contractive autoencoder (ScA) and DBN based on DFS models that are pre-trained layer-wise in a greedy way before being fine-tuned by back-propagation.
- Enhancer-promoter interaction predictions are always based on non-sequence features from functional genomic signals.
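For reference, the two operations credited in the DeepEnhancer note above, max-pooling and batch normalization, can be written out as a rough pure-Python sketch. This is not DeepEnhancer's code: the function names are hypothetical, and the learnable scale and shift parameters of real batch normalization layers are left out.

```python
import math

def max_pool1d(values, size):
    """Non-overlapping max-pooling: keep the strongest activation in each window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def batch_norm(batch, eps=1e-5):
    """Normalize one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(variance + eps) for v in batch]

pooled = max_pool1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], size=2)
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
print(pooled)
print(normalized)
```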
4.2.2 Splicing
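- Splicing is a form of genetic recombination; in molecular biology it mainly refers to a modification of genetic information in the cell nucleus during or after transcription, namely the removal of introns and the joining of exons (the terms intron and exon apply both to the DNA of a coding gene and to its transcribed RNA). It is one of the steps by which eukaryotic pre-mRNA becomes mRNA.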
- alternative splicing is a key post-transcriptional regulatory mechanism that affects gene expression and contributes to proteomic diversity.
4.2.3 Transcription Factors and RNA-binding Proteins
- In molecular biology, a transcription factor is a protein that binds to specific nucleotide sequences upstream of a gene and regulates the transcription of that gene. It does so by regulating the binding of RNA polymerase (also called RNA synthase) to the DNA template.
- Transcription factors (TFs) and RNA-binding proteins are both crucial regulatory elements in biological processes.
- Many existing deep learning methods approach transcription factor binding site (TFBS) prediction tasks through convolutional kernels.
- Models appropriately designed according to the specific problem have proved powerful.
4.3 Functional Genomics
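- Functional genomics research is often also called postgenomics research; it uses the information and products provided by structural genomics to comprehensively analyze gene function at the genome or system level.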
4.3.1 Mutations and Functional Activities
- one of the shortcomings of previous approaches for predicting the functional activities is the insufficient utilization of positional information.
- deep learning methods naturally account for positional relationships between sequence signals and are computationally efficient.
- the effects of mutations are usually predicted by site independent or pairwise models, but these approaches do not sufficiently model higher-order dependencies.
4.3.2 Subcellular Localization
- Subcellular localization refers to the specific location within the cell where a given protein or expression product resides.
- High-throughput microscopy images are a rich source of biological data that remains to be better exploited. One important use of microscopy images is the automatic detection of the cellular compartment.
4.4 Structural Genomics
4.4.1 Structural Classification of Proteins
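- Protein folding is the process by which a protein acquires its functional structure and conformation. Through this physical process, a protein folds from a random coil into a specific functional three-dimensional structure. When first translated from an mRNA sequence into a linear chain of amino acids, every protein exists as an unfolded polypeptide or random coil.
- Proteins usually share structural similarities with other proteins, some of which have a common evolutionary origin.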
- Early methods for similarity measures mostly rely on sequence properties.
- LSTM for homology detection
- one drawback of homology based approaches for fold recognition is the lack of direct relationship between the protein sequence and the fold.
- gene function annotation to perform protein classification
4.4.2 Protein Secondary Structure
- Protein secondary structure refers to the 3D form of local segments of proteins, which is informative for studying protein structure, function as well as evolution.
- Emergent deep architectures for protein SS prediction have been widely explored with more prior knowledge and various features available.
- DeepCNF (proposed in 2016) took a large step, improving Q3 accuracy above 80% by extending conditional neural fields (CNFs) to include convolutional designs.
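Q3 accuracy, the metric quoted for DeepCNF, is the fraction of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed one. A minimal sketch, with a hypothetical function name and made-up label strings:

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label (H, E or C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Made-up labels for a 10-residue protein.
observed = "CHHHHEEECC"
predicted = "CHHHCEEECH"
q3 = q3_accuracy(predicted, observed)
print(q3)
```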
4.4.3 Protein Tertiary Structure and Quality Assessment
- The prediction of protein tertiary structure has proven crucial to human's understanding of protein functions and can be applied to, for instance, drug designs.
- Two essential challenges in protein structure prediction include the sampling and the ranking of protein structural models.
- Quality Assessment (QA) is to predict the absolute or relative quality of the protein models before the native structure is available so as to rank them.
4.4.4 Contact Map
- Protein contact map is a binary 2D matrix denoting the spatial closeness of any two residues in the folded 3D protein structure.
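The definition above translates directly into code. The sketch below builds a binary contact map from one 3D coordinate per residue; the 8 Å cutoff is a commonly used convention assumed here rather than taken from these notes, and the coordinates are made up.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: entry (i, j) is 1 when residues i and j are closer than the threshold."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical coordinates in angstroms: four residues spaced 3.8 apart on a line.
residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(residues)
for row in cmap:
    print(row)
```

The matrix is symmetric with ones on the diagonal, so a predictor only needs to model the upper triangle.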
5. Challenges and Opportunities
5.1 The Nature of Data
- An inevitable challenge: the unavailability of true labels due to the lack of knowledge of genetic process, the imbalanced case and control samples due to the rarity of certain disease, and the heterogeneity of data due to the expensiveness of large-scale data collection.
5.1.1 Class-Imbalanced Data
- Large-scale biological data gathered from assorted sources are usually inherently class-imbalanced
- machine learning: ensemble methods appear to be powerful
- deep learning: boosting to propose an instance-transfer model to reduce the class-imbalanced influence; leveraging data from an auxiliary domain; through model parameters or training process
- The idea of boosting is to optimize the overall model through forward stagewise training of weak classifiers (such as CART), thereby achieving high classification accuracy.
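The forward stagewise idea behind boosting can be sketched with AdaBoost, used here as one concrete instance; the methods cited in the paper may differ. The weak learners below are one-dimensional threshold stumps (depth-1 trees), and the toy data set and all function names are hypothetical.

```python
import math

def best_stump(xs, ys, weights):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Add one stump per round, each trained on re-weighted samples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
errors_one = sum(predict(ensemble[:1], x) != y for x, y in zip(xs, ys))
errors_all = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
print(errors_one, errors_all)
```

Because each round re-weights the samples the previous stumps got wrong, hard or rare examples receive progressively more attention, which is one reason boosting is attractive for class-imbalanced data.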
5.1.2 Various Data Types
- Intuitively, integrating diverse types of data as discriminating features will lead to more predictive power of the models.
5.1.3 Heterogeneity and Confounding Correlations
- The data in most genomic applications involving medical or clinical settings are heterogeneous due to population subgroups or regional environments. One of the resulting problems is the underlying interdependencies among these heterogeneous data.
- once the confounder has been identified, domain adversarial learning, select-additive learning, and confounder filtering can be re-used.
5.2 Feature Extraction
- it is unfortunately time-consuming to directly learn features from genomic sequences when complex interdependencies and long-range interactions are taken into consideration.
5.2.1 Mathematical Feature Extraction
- Techniques borrowed from mathematics have great potential to interpret the complex biological structures behind data that otherwise will hinder the generalization of deep learning.
5.2.2 Feature Representation
- By conceptual analogy of the fact that humans communicate through languages, biological organisms convey information within and between cells through information encoded in biological sequences.
- Also in 2016, doc2vec (an extension of word2vec) was used to propose a distributed representation of complete protein sequences.
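Treating a biological sequence as text usually starts by cutting it into k-mer "words" that an embedding method such as word2vec or doc2vec can consume. The sketch below shows only that tokenization step, with k = 3 and overlapping windows as assumptions; the function names and the example sequence are hypothetical, and no embedding is trained.

```python
from collections import Counter

def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers, the 'words' of the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def bag_of_kmers(sequence, k=3):
    """Count how often each k-mer occurs, a simple fixed-vocabulary representation."""
    return Counter(kmer_tokens(sequence, k))

protein = "MKVLAAGMKV"   # made-up amino-acid sequence
tokens = kmer_tokens(protein)
counts = bag_of_kmers(protein)
print(tokens)
print(counts["MKV"])
```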