机器学习在生物信息学中的应用

recap

Machine learning methods

general-purpose approaches to learn functional relationships from data without the need to define them a priori

Advantage

derive predictive models without the need for strong assumptions about underlying mechanisms, which are frequently unknown or insufficiently defined (especially for genomic data)

Example

he most accurate prediction of gene expression levels is currently made from a broad set of epigenetic features using linear models or random forests

However, how the selected features determine the transcript levels remains an active research topic

four steps

Most of machine learning applications in genomics can be described within the canonical machine learning workflow with four steps

data cleaning and pre-processing
feature extraction
model fitting/training
evaluation

It is customary to denote one data sample, including all covariates and features as input x (usually a vector of numbers), and label it with its response variable or output value y (usually a single number) when available.

Supervised machine learning model

img

Unsupervised machine learning

approaches aim to discover patterns from the data samples x themselves, without the need for output labels y.

Methods such as clustering and principal component analysis are typical examples of unsupervised models applied to biological data

深度神经网络

img

Neural networks

An artificial neural network, initially inspired by neural networks in the brain, consists of layers of interconnected compute units (neurons).

The depth of a neural network corresponds to the number of hidden layers, and the width to the maximum number of neurons in one of itslayers.

When training networks with larger numbers of hidden layers, artificial neural networks were rebranded as “deep networks”.

The term “neural network” largely refers to the hypothesis class part of a machine learning algorithm:

Hypothesis: non-linear hypothesis function, which involve compositions of multiple linear operators (e.g. matrix multiplications) and element wise nonlinear functions
Loss: “Typical” loss functions for classification and regression: logistic, softmax (multiclass logistic), etc.
Optimization: Gradient descent

architectures

img

Recurrent neural networks

Maintain hidden state over time, hidden state is a function of currentinput and previous hidden state
Traditional RNNs have trouble capturing long-term dependencies
Problem has to do with vanishing gradient, for many activations like sigmoid, gradients get smaller and smaller over subsequent layers
One solution, long short term memory (LSTM) (Hochreiter and Schmidhuber 1997), has more complex structure that specifically encodes memory and pass-through features, able to model long-term dependencies

application

img

DeepBind

CNN architectures to predict specificities of DNA-binding and RNAbinding proteins

Outperformed existing methods, to recover known and novel sequence motifs, and could quantify the effect of sequence alterations and identify functional SNVs

The neurons in the convolutional layer scan for motif sequences and combinations thereof, similar to conventional PWMs

The learning signal from deeper layers informs the convolutional layer which motifs are the most relevant. The motifs recovered by the model can then be visualized as sequence logos

paper

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, doi:10.1038/nbt.3300

img

Important considerations related to data

Need sufficient amount of labelled data

Need to be trained, selected and tested on independent data sets to avoid overfitting and assure that the model will generalize

The training set is used to learn models with different parameters, which are then assessed on the validation set.
The model with best performance, for example prediction accuracy or meansquared error, is selected and further evaluated on the test set to quantify the performance on unseen data and for comparison to other methods.
Typical data set proportions are 60% for training, 10% for validation and 30% for model testing. If the data set is small, k-fold cross-validation can be used

Categorical features such as DNA nucleotides need to be encoded numerically. Typically represented as binary vectors with all but one entry set to zero (one-hot coding).

The four bits of each encoded base are commonly considered analogously to color channels of an image to preserve the entity of a nucleotide

加入靠谱熊基地，和大家一起交流