A random-forest-based algorithm for enhancer prediction

RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State

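I have recently been studying machine learning algorithms, and they are easier to absorb alongside a concrete application in bioinformatics. So here is a rough walkthrough of a paper on predicting enhancers with random forests, with the well-known Chinese scientist Bing Ren as corresponding author: https://doi.org/10.1371/journal.pcbi.1002968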

Background

What is entropy?

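Entropy is a quantity that measures the uncertainty of a random variable. The less probable an outcome, the more information it carries when it occurs (because more alternatives were possible), i.e. the larger -ln P is; conversely, stating something that is certain carries zero information. Acquiring information is the process of removing entropy. Noise interferes with acquiring information: data is information plus noise, and knowledge is needed to separate the two. Probability takes a microstate as its input, whereas entropy takes a macrostate (a sum over microstates).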

Reference: https://www.zhihu.com/question/22178202

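In machine learning, P usually denotes the true distribution of a sample, e.g. [1,0,0] means the sample belongs to the first class. Q denotes the distribution predicted by the model, e.g. [0.7,0.2,0.1].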

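Intuitively, describing the sample with P is perfect. Describing it with Q works roughly but not perfectly: some information is missing, and an extra "information increment" is needed to match the description given by P. If Q is trained until it also describes the sample perfectly, no extra "information increment" is needed and Q is equivalent to P.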
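The "information increment" described here corresponds to what is usually called relative entropy (KL divergence), which is the cross-entropy of Q with respect to P minus the entropy of P. Below is a minimal sketch using the P and Q from the example above; the function names are illustrative, and natural logarithms are assumed.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q): cost of describing samples from P using Q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra information needed when Q is used in place of P."""
    return cross_entropy(p, q) - entropy(p)

P = [1.0, 0.0, 0.0]   # true distribution: the sample is in class 1
Q = [0.7, 0.2, 0.1]   # model prediction

print(kl_divergence(P, Q))  # equals -ln(0.7), about 0.357, because P is one-hot
print(kl_divergence(P, P))  # no extra information needed when Q equals P
```

As Q moves towards P during training, the divergence shrinks to zero, which is why cross-entropy is a common classification loss.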

https://blog.csdn.net/tsyccnh/article/details/79163834

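What is a random forest? A random forest can be seen as an extension of the decision tree. In a decision tree, each node picks the best feature so as to maximize the information gain, i.e. to achieve the largest reduction in entropy.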
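To make "largest reduction in entropy" concrete, here is a small sketch of the information gain of one candidate split. The labels and the two example splits are made up for illustration, and entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["enhancer"] * 4 + ["other"] * 4

# A split that separates the classes perfectly removes all uncertainty.
perfect = information_gain(parent, ["enhancer"] * 4, ["other"] * 4)

# A split that leaves both children as mixed as the parent gains nothing.
useless = information_gain(parent,
                           ["enhancer", "enhancer", "other", "other"],
                           ["enhancer", "enhancer", "other", "other"])
print(perfect, useless)
```

A tree-growing routine would evaluate this gain for every candidate feature at a node and keep the split with the highest value.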

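A random forest is a method that trains many decision trees on the sample data and uses them for classification and prediction. Besides classifying the data, it gives an importance score for each variable (e.g. each gene), which estimates that variable's contribution to the classification. Construction goes roughly as follows. First, the bootstrap method draws n samples from the original training set with replacement and n decision trees are built, one per bootstrap sample. Next, supposing the training data have m features, each split considers a random subset of them and splits on the best one; each tree keeps splitting this way until all training examples at a node belong to the same class. Each tree is thus grown to its full size without any pruning. Finally, the trees are combined into a random forest, which is used to classify new data or to do regression. For classification, the final class is decided by a vote among the trees; for regression, the final prediction is the mean of the trees' predictions.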

Reference: https://baijiahao.baidu.com/s?id=1612329431904493042&wfr=spider&for=pc

The notes below were taken from the StatQuest videos: https://statquest.org/video-index/

Building, using and evaluating random forest

Decision trees are sometimes not flexible when it comes to classifying new samples

Random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy

  1. Create a bootstrapped dataset by randomly selecting samples (with replacement); the bootstrapped dataset is the same size as the original
  2. Create a decision tree using the bootstrapped dataset, but only consider a random subset of variables (or columns) at each step, e.g. at each step consider two variables and choose one of them as the node
  3. Repeat to generate a wide variety of trees
  4. When we get a new patient, we run this case through each tree we built and see which answer gets the most votes (yes, has heart disease, or no), and we conclude based on the majority vote

Bootstrapping the data plus using the aggregate to make a decision is called "Bagging"

Note: because duplicate entries are allowed in the bootstrapped dataset, some samples are left out of it; these form the "Out-of-Bag Dataset", and we can check whether the random forest labels them correctly

Ultimately, we can measure how accurate our random forest is by the proportion of Out-of-Bag samples that were correctly classified by the random forest

We can then change the number of variables used per step, do this a bunch of times, and choose the most accurate random forest
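The bootstrap, voting and Out-of-Bag ideas in these notes can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping, not a full random forest: no trees are grown, and the votes passed to `oob_accuracy` are made-up values standing in for the predictions of trees that never saw a given sample.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (duplicates allowed)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Indices that never made it into the bootstrapped dataset."""
    seen = set(in_bag)
    return [i for i in range(n) if i not in seen]

def majority_vote(votes):
    """Aggregate the predictions of several trees."""
    return Counter(votes).most_common(1)[0][0]

def oob_accuracy(labels, oob_votes):
    """Share of samples whose out-of-bag majority vote matches the label.

    oob_votes[i] holds the votes for sample i from trees that did not
    train on it; samples without any such vote are skipped.
    """
    scored = [i for i, votes in oob_votes.items() if votes]
    correct = sum(majority_vote(oob_votes[i]) == labels[i] for i in scored)
    return correct / len(scored)

rng = random.Random(0)
n_samples = 10
in_bag = bootstrap_indices(n_samples, rng)
oob = out_of_bag(n_samples, in_bag)
print(sorted(set(in_bag)), oob)

# Hypothetical votes for three samples, for illustration only.
labels = ["yes", "no", "yes"]
oob_votes = {0: ["yes", "yes", "no"], 1: ["yes"], 2: []}
print(oob_accuracy(labels, oob_votes))
```

Repeating this for several values of "variables per step" and keeping the setting with the best Out-of-Bag accuracy is the tuning loop described in the last note.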

Intro

The paper predicts enhancers with a random forest and looks for the optimal set of histone modifications for enhancer prediction in different cell types. In other words: which histone modifications give the best enhancer predictions?

Enhancers are hard to identify because:

  • They share no common sequence features and are very diverse
  • They can lie very far from the genes they regulate, so where should one look?

Earlier work on enhancer prediction relied on motif clustering and comparative analysis, but has problems:

Computational techniques relying on transcription factor motif clustering or comparative analyses have had some success in identifying enhancers, but these predictions are neither comprehensive nor tissue-specific

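Another important approach to identifying enhancers genome-wide is ChIP-seq, but it requires prior knowledge of the relevant TFs, and identifying enhancers genome-wide across different tissues this way is an enormous amount of work.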

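A further approach is mapping the binding sites of transcriptional co-activators such as p300 and CBP. These co-activators are recruited by many sequence-specific TFs, and hence to a large number of enhancers. In practice, though, not all enhancers are bound by this set of co-activators, and ChIP-grade antibodies against them are not always available.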

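The last approach is to look for open chromatin regions, but this lacks specificity: open chromatin also includes silencers/repressors, insulators, promoters and sequence elements bound by unknown nuclear proteins, so it cannot specifically pick out the enhancers we want.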

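Certain histone modifications form a consistent signature of enhancers. It is on this approach that the present work is focused. The principle behind the paper is that the distribution patterns of histone modification marks should form signatures of functional elements, so that candidate enhancer elements can be inferred from those signatures.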

As for existing computational tools:

The supervised machine learning techniques include HMM, neural networks and genetic algorithm-optimized SVM based approaches, and have proved to be improvements over the profile-based method.

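These methods still have limitations: they were constrained by small training sets, and they tested the predictive power of only a subset of chromatin marks. As more marks are discovered, enhancer identification can become more accurate. In other words, a method that determines the best set of marks and makes full use of the additional marks should help us predict enhancers better.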

First, enough training data is needed. As part of the Roadmap project, the authors profiled 24 chromatin modifications in two cell types, H1 and IMR90, and used ChIP-seq to map the p300 binding sites in both.

Additionally, we have experimentally determined a large number of promoter-distal p300 binding sites in each cell type, providing a rich training set for development of accurate and robust enhancer prediction algorithms. That is, they define promoter-distal p300 binding sites (which indicate potential enhancers).

Results

Prediction of enhancers using random forest and multiple chromatin marks

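Random forests are widely used in bioinformatics thanks to their efficiency on large datasets and their resistance to over-fitting. (The algorithm is a bagging, vote-based ensemble in which both the samples and the features are randomized; these two sources of randomness make the forest less prone to over-fitting and give it some robustness to noise, and the OOB data serve as a validation set.) Importantly, a random forest can also measure feature importance. Reference: https://blog.csdn.net/SMF0504/article/details/51939064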

Random forests have recently become a popular machine learning technique in biology due to their ability to run efficiently on large datasets without over-fitting, and their inherently non-parametric structure. Since random forests use a single variable at a time, they can give an automatic measure of feature importance

The initial definition of an enhancer:

We selected p300 binding sites overlapping DNase-I hypersensitive sites and distal to annotated TSS as active p300 binding sites representative of enhancers

The authors then ran unsupervised clustering on the "enhancers" defined this way; each cluster has a distinct chromatin state pattern:

p300_BS.jpg

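In the figure above, these clusters show strong H3K4me1 signal and weak H3K4me3 signal, consistent with the long-held view of how enhancer and promoter marks differ. But how should the differences in the other marks be explained?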

Different clusters were characterized by varying levels of histone acetylation, H3K4me2 or H3K27me3. Clusters with presence or absence of H3K36me3 may represent genic and intergenic enhancers respectively.

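On one hand, differences in active chromatin marks may reflect subclasses within enhancers, each with its own "state". (PS, a personal thought: could these distinct "states" be related to differences in long-range interactions, marking different "blocks"?) On the other hand, clusters with or without H3K36me3 signal may represent enhancers lying within genes and in intergenic regions, respectively.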

Choice of training set: active distal p300 binding sites form the enhancer class; TSS and random distal 100 bp bins form the non-enhancer class

To train the forest, active and distal p300-binding sites (BS) were selected as representative of the enhancer class. As nonenhancer classes, we considered annotated transcription start sites (TSS) that overlap DNase-I, and random 100 bp bins that are distal to known p300 or TSS (see Methods)

Training and prediction proceed in two steps:

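  1. Define the two classes: p300 binding sites versus TSS and random background sequences
  2. Prediction: the data are handled in 100 bp bins, and each feature of an element is a 20-dimensional vector (each feature being a 20-dimensional vector of 100 bp bins from -1 to +1 kb along the genomic element). A linear discriminant is applied at each node: the authors use a special kind of random forest in which each node is a linear classifier built with the Fisher discriminant approach. This is a form of discriminant analysis closely related to LDA. Fisher's method looks for a projection space in which the within-class distance of the projected samples is minimal and the between-class distance is maximal. How is that space found? Much as in PCA, it is spanned by the eigenvectors with the largest eigenvalues, reference: https://blog.csdn.net/u013943841/article/details/45080889. In the end, every bin is assigned an enhancer or non-enhancer status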
tree.jpg
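To make the Fisher discriminant idea concrete, here is a minimal sketch for two classes and two features in plain Python. For two classes the direction has the closed form w = S_w^{-1}(mu_1 - mu_0), where S_w is the within-class scatter matrix. The toy points, the function names and the midpoint threshold are assumptions made for illustration; the paper's nodes work on 20-dimensional bin vectors and its exact implementation is not reproduced here.

```python
def mean_vec(rows):
    """Column-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter(rows, mu):
    """2x2 scatter matrix of the points around their mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Return (w, threshold): projection direction and midpoint cut."""
    mu0, mu1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, mu0), scatter(class1, mu1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumed non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Cut halfway between the projected class means (a simple choice).
    threshold = 0.5 * (project(w, mu0) + project(w, mu1))
    return w, threshold

def project(w, point):
    """Scalar projection of a point onto the direction w."""
    return w[0] * point[0] + w[1] * point[1]

# Toy signal values for two marks at non-enhancer and enhancer sites.
non_enhancer = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.0)]
enhancer = [(4.0, 5.0), (5.0, 4.0), (4.5, 4.5), (5.0, 5.0)]

w, threshold = fisher_direction(non_enhancer, enhancer)
print(w, threshold)
print([project(w, p) > threshold for p in enhancer])
```

With more than two features the same formula applies, but S_w has to be inverted with a linear-algebra routine rather than the 2x2 closed form used here.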

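In panel B of the figure above, the number of votes drops with distance from the center of the p300 binding site. Bins with more than 50% of the votes are kept for downstream analysis, and bins within a certain distance of a p300 binding site are assumed to belong to the same enhancer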
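The step of keeping bins with more than 50% of the votes and grouping nearby bins into one enhancer could look roughly like the sketch below. The function name, the `max_gap` merging distance and the vote values are hypothetical; the merging distance actually used in the paper is not given in these notes.

```python
def call_enhancers(bin_starts, vote_fractions, cutoff=0.5, bin_size=100, max_gap=0):
    """Merge bins whose vote fraction exceeds the cutoff into regions.

    Bins are assumed to be sorted by start coordinate. Selected bins that
    lie within max_gap bp of the previous region are merged into it.
    Returns (region_start, region_end, start_of_top_voted_bin) tuples.
    """
    regions = []  # each entry: [start, end, peak_start, peak_votes]
    for start, votes in zip(bin_starts, vote_fractions):
        if votes <= cutoff:
            continue
        if regions and start - regions[-1][1] <= max_gap:
            region = regions[-1]
            region[1] = start + bin_size
            if votes > region[3]:
                region[2], region[3] = start, votes
        else:
            regions.append([start, start + bin_size, start, votes])
    return [(s, e, peak) for s, e, peak, _ in regions]

bin_starts = list(range(0, 1000, 100))
votes = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.55, 0.3, 0.1, 0.8]

print(call_enhancers(bin_starts, votes))               # only adjacent bins merge
print(call_enhancers(bin_starts, votes, max_gap=200))  # a looser merging distance
```

Reporting the top-voted bin of each region gives a single coordinate per enhancer, which is convenient when distances to other genomic features are computed later.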

Feature importance based on OOB error: https://blog.csdn.net/m0_37770941/article/details/78330795

the importance of individual histone modifications for making enhancer predictions, we use an out-of-bag measure of variable importance implemented in Matlab as the function oobVarImp.
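The general idea behind an out-of-bag importance measure is permutation: shuffle one feature among the held-out samples and see how much the accuracy drops. The sketch below illustrates that idea only; it is not the Matlab routine used in the paper, and the toy classifier, data and function names are assumptions.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature column.

    In a random forest the rows would be the out-of-bag samples of a
    tree, and the drop would be averaged over all trees.
    """
    baseline = accuracy(predict, rows, labels)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return baseline - accuracy(predict, permuted, labels)

def predict(row):
    """Toy classifier that only looks at feature 0."""
    return "enhancer" if row[0] > 0.5 else "other"

rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
labels = ["enhancer", "enhancer", "other", "other"]

rng = random.Random(0)
print(permutation_importance(predict, rows, labels, 0, rng))  # feature the model uses
print(permutation_importance(predict, rows, labels, 1, rng))  # feature it ignores
```

A feature the classifier never consults always scores zero, while a feature it depends on scores zero or higher depending on how the shuffle lands.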

Parameter optimization: the classic ROC curve

We used Receiver Operating Characteristic (ROC) curves to determine optimal parameters for our classification algorithm

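The authors raise one issue here, however: the specificity behind the x-axis of the ROC curve is not a reliable value, because there is no guarantee that the randomly selected elements of the non-p300 classes are all true negatives.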

Hence, in addition to the ROC curves generated using 5-fold cross-validation, we also verified parameter selection by comparing the percentage of predicted enhancers at each cutoff that overlap markers of active enhancers (validation rate) or TSS (misclassification rate)

So, as quoted above, when choosing parameters the authors additionally examined, at different cutoffs (i.e. how many votes an element needs to be called an enhancer), what fraction of predicted enhancers overlap markers of known active enhancers or TSS. Known active enhancer markers include distal DNase-I hypersensitivity sites (HS), p300 binding sites (excluding those used in training), and occupancy by CBP or sequence specific TF known to act at embryonic stem cell enhancers such as NANOG, OCT4 and SOX2.

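For the random forest itself, the main parameter to choose is the number of trees. Note that non-enhancer sequence far outnumbers enhancer elements in the genome (the ratio in the test set), so the training set should also contain more non-p300 sites than p300 sites (the ratio of the two classes has to be considered). Another parameter is the window length (-1 kb to +1 kb, as mentioned above; this is also a hyperparameter, and windows of 0.5 to 1.0 kb perform about the same on the ROC curve)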

peak_width.jpg

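As the figure shows, although the ROC curves for 0.5 kb and 1 kb are about the same, the validation rate is lower and the misclassification rate is higher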

Panel B: Percentage of enhancers validated by true positive markers at different numbers of enhancers determined by various cutoffs (Validation rate or VR curve), i.e. the validation accuracy for the training set

Enhancer predictions in H1 and IMR90 cells

Choosing the number of trees: beyond 45 trees the AUC is essentially unchanged

number_trees.jpg

In the end, we selected 65 trees for training the random forest as it appeared to be optimal for both cases. The training-set ratio of p300 to non-p300 was set at 1:7 since the ROC curve did not appear to change much beyond this ratio.

The figure below shows the tuning of the ratio of p300 to non-p300 sites, mentioned earlier

class_ratio.jpg

With the parameters tuned, the authors applied the trained RFECS algorithm to the chromatin profiles of H1 and IMR90 (24 histone marks)

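The prediction accuracy is computed as the fraction of all predicted enhancers that overlap DNase-I hypersensitivity sites, p300 binding sites and the binding sites of certain specific transcription factors (which looks like a fairly strict criterion)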

We then calculated the validation rate as the percentage of predicted enhancers overlapping with DNase-I hypersensitivity sites and binding sites of p300 and a few sequence specific transcription factors known to function in each cell type (true positive markers).

Validation of enhancer predictions

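Within a given cell type, a classifier performs well if it correctly picks up more enhancers rather than TSS. The specific rules for judging whether a prediction is correct are: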

True Positive Markers (TPM) refer to DNase-I hypersensitivity sites, p300, CBP and transcription factor binding sites

  1. If the nearest TPM lies within 2.5 kb of the enhancer and the nearest TSS is greater than 1 kb away from the TPM, the enhancer is "validated"
  2. If a TSS lies within 2.5 kb of the enhancer, and the nearest TPM is either greater than 2.5 kb away from the enhancer or within 1 kb of the TSS, the enhancer is "misclassified"
  3. If there is no TPM or TSS within 2.5 kb of the enhancer, it is "unknown".
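The three rules above can be written as a small function. This is a sketch under simplifying assumptions: every enhancer, TPM and TSS is reduced to a single coordinate on the same chromosome, the function and parameter names are hypothetical, and any case the rules do not cover falls through to "unknown".

```python
def nearest(pos, sites):
    """Return the site closest to pos, or None if there are no sites."""
    return min(sites, key=lambda s: abs(s - pos)) if sites else None

def classify_prediction(enhancer, tpm_sites, tss_sites, near=2500, tss_margin=1000):
    """Label one predicted enhancer as validated, misclassified or unknown."""
    tpm = nearest(enhancer, tpm_sites)
    tss = nearest(enhancer, tss_sites)
    tpm_near = tpm is not None and abs(tpm - enhancer) <= near
    tss_near = tss is not None and abs(tss - enhancer) <= near

    # Rule 1: a TPM is close, and no TSS sits within 1 kb of that TPM.
    if tpm_near:
        tss_of_tpm = nearest(tpm, tss_sites)
        if tss_of_tpm is None or abs(tss_of_tpm - tpm) > tss_margin:
            return "validated"

    # Rule 2: a TSS is close, and the TPM is either far away or next to the TSS.
    if tss_near and (not tpm_near or abs(tpm - tss) <= tss_margin):
        return "misclassified"

    # Rule 3, plus any case the rules leave open (an assumption of this sketch).
    return "unknown"

print(classify_prediction(10000, [10500], [50000]))  # TPM nearby, TSS far away
print(classify_prediction(10000, [20000], [11000]))  # only a TSS nearby
print(classify_prediction(10000, [10800], [11000]))  # TPM sits right at a TSS
print(classify_prediction(10000, [20000], [30000]))  # nothing nearby
```

The validation rate is then the share of predictions labelled "validated", and the misclassification rate is the share labelled "misclassified".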

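Besides the validation rate and misclassification rate, the authors also computed the distance from the enhancers predicted in each cell type to DHS and p300 binding sites: almost all are within 200 bp, which indicates good predictions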

distance_marker.jpg

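Beyond the computational validation, the authors also checked the activating effect on the expression of genes at proximal TSS: RNA-seq data are needed to identify cell-type-specific TSS, and the distribution of predicted enhancers centered on these TSS is then computed. The predicted enhancers of each cell type are enriched around that cell type's own TSS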

distance_from_tss.jpg

Random forest trained on one cell-type can accurately predict enhancers in other cell-types

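The classifier performs well within each cell type (H1, IMR90), but the next question is whether enhancer prediction can be extended to other cell types. Training and testing a separate classifier for every cell type takes a lot of time and effort. If one cell type could serve as the training set and another as the test set with good results, that would save a lot of trouble.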

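It is a good idea, so the authors simply used H1 as the training set and IMR90 as the test set (and vice versa) to assess how well the classifier generalizes: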

To evaluate the feasibility of such approach, we first trained a random-forest using chromatin modification profiles obtained in H1, and then applied it to the IMR90 cells.

Similarly, we also developed a random forest using the IMR90 data as the training set and then applied it to H1

validation_rate.jpg

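In the figure above, red dots show validation within the same cell type and black dots show validation across cell types (e.g. trained in H1 and then applied in IMR90). The classifier does generalize across cell types to some extent, although the validation rate is slightly lower than within the same cell type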

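The next question is whether this "slightly lower" performance falls within technical error or reflects real biological differences between cell types.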

We sought to examine if this moderate decrease in performance was largely due to cell-type specific differences or was within the limits of technical or biological variability between replicates

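This part may take a moment to follow. The authors also tested the replicates of each cell type, to see whether batch differences between replicates could account for the drop described above. In the figure above, the blue dot and blue asterisk show same-cell-type and cross-cell-type validation on replicate 1, and the green dot and green asterisk show the same for replicate 2. The two differ a lot for replicate 1 but very little for replicate 2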

we trained a random forest on one replicate of a cell-type, and made predictions on the other replicate of the same cell type. RFECS trained on IMR90 and then applied to the replicate 1 of the H1 profiles (blue dot vs asterisk) actually showed a higher validation rate and lower misclassification rate than RFECS trained using replicate 2 of H1 (fig. 2C,E)

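The authors thus find that different replicates can produce large differences, and conclude that the earlier "slightly lower" performance is probably due to batch effects rather than to the classifier itself (though I think there are rather few replicates and more would be better; perhaps the authors only showed two in the figure)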

In conclusion, predicting enhancers using the random forest built from a different cell type exhibits a modest decrease in performance compared to a same-cell training set. However, this decrease in performance is comparable to the decrease that can arise due to variability between two replicates of the same cell-type.

Optimal set of chromatin marks required for enhancer prediction

The next question is feature importance: which histone marks actually matter for predicting enhancers? Feature importance can in fact be measured while the random forest is being built:

In both H1 and IMR90, the variable importance was assessed for random forests trained on 5 cross-sections of data for each of the 2 sets of replicates individually as well as the set of averaged replicates. Upon ranking histone modifications by variable importance, it is apparent that H3K4me1 and H3K4me3 are the top 2 most robust modifications across replicates and cross-sectional samples in both cell types, followed by H3K4me2

feature_importance.jpg

In the figure above, the top three features are H3K4me3, H3K4me1 and H3K4me2 (the order differs between H1 and IMR90)

Note, however, that feature importance still differs between cell types, as in the correlation heatmaps below:

feature_cor.jpg

Differences observed in correlation clustering of the same 24 modifications in C) H1 and D) IMR90 explain some of the differences in ordering of variables in the two cell types. Same non-black colors of modifications indicate clusters that co-occur in both cell-types.

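Black labels mark what is unique to H1 or to IMR90, and show how mark importance differs between cell types. In H1 many marks are highly correlated, meaning the information they carry is relatively redundant (in other words, these marks have concentrated, consistent distributions). In IMR90 the correlations are weaker, so the marks are distributed more independently (the information is less redundant, and each mark contributes more evenly to variable importance)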

Details of the correlation heatmap are in the Methods:

The Pearson correlation coefficient between any two modifications was computed for RPKM-normalized histone modification reads between -1 to +1 kb for all elements within the selected training set. The correlation patterns of each histone modification was used to cluster the modifications and order them using MATLAB tools
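As a small illustration of the correlation step, the sketch below computes pairwise Pearson coefficients between marks in plain Python. The signal values are made-up numbers standing in for RPKM-normalized reads over the training elements, and the clustering and ordering that the authors did with MATLAB tools is not reproduced.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(signals):
    """Pairwise Pearson correlations between named signal vectors."""
    names = list(signals)
    return {a: {b: pearson(signals[a], signals[b]) for b in names} for a in names}

# Hypothetical signal of three marks over four genomic elements.
signals = {
    "H3K4me1": [1.0, 2.0, 3.0, 4.0],
    "H3K4me2": [2.0, 4.0, 6.0, 8.0],
    "H3K27me3": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(signals)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in signals])
```

Marks that are highly correlated with each other carry redundant information, which is the reading of the H1 heatmap given above.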

Next, the authors tried predicting enhancers with different sets of histone modifications and measured the accuracy

mark_set_validate.jpg

The minimal set of H3K4me1-3 clearly performs well (legend below)

mark_set_validate2.jpg

In both cases the use of the minimal set of 3 modifications shows a much closer resemblance in performance to all 24 modifications than to the set of 2 marks H3K4me1 and H3K4me3

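The authors also tested this minimal set on replicates and on chromosome 1 (panel E above), where it again performs well. H3K27ac also does well here, which means it can substitute for H3K4me2 when no H3K4me2 data are available; it is also the more commonly used marker for active enhancers (H3K4me2 data are relatively scarce and the mark gets less attention than H3K27ac, and the more common a feature the better). In the end the authors chose H3K4me1, H3K4me3 and H3K27ac as the final features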

Comparison of RFECS with other enhancer prediction methods

Near the end, the method is compared with other enhancer prediction algorithms (a comparison that naturally involves ROC) to show its value:

Again this is measured by validation rate and misclassification rate:

The validation rate of RFECS predictions is around 70%, which is considerably higher than the other three methods (57% ChromaGenSVM, 51% CSIANN, 60% Chromia). Further, the misclassification rates of RFECS is less than 7%, much lower than the 27%, 35% and 15% rates of ChromaGenSVM, CSIANN and Chromia, respectively.

The comparison does require that the upstream processing of the training data be consistent:

we ran the various enhancer prediction methods on H3K4me1, H3K4me2 and H3K4me3 datasets of H1.

We tried to make the pre-processing stages of the various algorithms as consistent as possible by merging several replicates of each histone modification files and input files into single bed files and randomly selecting a smaller subset of p300 peaks for training, since these were the requirements of the other algorithms such as CSIANN and ChromaGenSVM.

comparison.jpg

Prediction of enhancers in multiple human cell-types

Predicting enhancers in other cell types from the ENCODE project:

We trained our random forest on the p300 ENCODE data in H1 and made enhancer predictions in 12 ENCODE cell-types using the three marks H3K4me1, H3K4me3 and H3K27ac since these were available for all the cell-types. Validation rates were assessed based on overlap with existing DNAse-I hypersensitivity data while misclassification rates were calculated based on overlap with UCSC TSS.

Prediction_in_12cell_line.jpg

The authors suggest that the low validation rate in some cell types may be because those cell types have relatively few DHS

Discussion

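The authors suggest that, beyond p300, incorporating information on other enhancer-binding proteins might further improve the ability of RFECS to predict regulatory elements. Random-forest-based prediction also generalizes well and can select features, so other types of features could be used for training in the future, for example adding sequence, motif and DNA methylation information to predict other genomic elements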

References:

RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State: https://doi.org/10.1371/journal.pcbi.1002968

Machine learning concepts:

https://www.zhihu.com/question/22178202

https://blog.csdn.net/tsyccnh/article/details/79163834

https://baijiahao.baidu.com/s?id=1612329431904493042&wfr=spider&for=pc
