CLC number: TP301    Document code: A
Deep learning for chemoinformatics
XU Youjun, PEI Jianfeng
Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
Abstract: Deep learning has been used successfully in computer vision, speech recognition, and natural language processing, driving the rapid development of artificial intelligence. Key deep learning technologies have also been applied to chemoinformatics, speeding up the implementation of artificial intelligence in chemistry. Since developing quantitative structure-activity relationship (QSAR) models is one of the major tasks of chemoinformatics, this review focuses on the application of deep learning to QSAR research. It discusses how three kinds of deep learning frameworks, namely deep neural networks, convolutional neural networks, and recurrent or recursive neural networks, have been applied to QSAR, and offers a perspective on the future impact of deep learning on chemoinformatics.
Key words: deep learning, artificial intelligence, quantitative structure-activity relationship, chemoinformatics
Citation: XU Y J, PEI J F. Deep learning for chemoinformatics[J]. Big Data Research, 2017, 3(2): 45-66.
4 Comparison and analysis of deep learning frameworks
Table 1 summarizes applications of deep neural network frameworks in QSAR. As the table shows, QSAR research under deep learning frameworks currently has the following characteristics.
● As datasets grow in number and diversity, researchers increasingly adopt multi-task training strategies. The idea of transfer learning within multi-task learning has been applied to tasks with scarce data, improving predictive performance on those tasks. Most multi-task models are evaluated with AUC, which indicates that they are currently suited mainly to classification problems; for multi-task regression, better training methods and strategies remain to be developed.
● ReLU is currently the most commonly used training technique in QSAR and is adopted in essentially all of the DNN and CNN frameworks. Developing better and faster training techniques remains an open direction.
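The two ingredients above, the ReLU activation used in most DNN/CNN QSAR models and the AUC metric used to evaluate multi-task classifiers, can be sketched in a few lines of plain Python. The helper functions below are illustrative only and do not come from any particular QSAR package.

```python
# Illustrative sketch (not from any specific QSAR library):
# the ReLU activation and an AUC computation via the rank-sum identity.

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return x if x > 0.0 else 0.0

def auc(labels, scores):
    """AUC as the probability that a random positive is scored
    above a random negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating model scores an AUC of 1.0, while a model that assigns identical scores to every compound scores 0.5, which is why AUC is a convenient per-task metric in multi-task classification settings.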
From the perspective of molecular encoding in deep learning, we find that atom-level feature inputs are gradually replacing inputs based on molecular descriptors or fingerprints. This indicates that deep learning is capable of extracting, from the atomic level, information sufficient to support molecule-level prediction, confirming its powerful feature-extraction ability. What remains lacking, however, is deeper analysis of these learned features. Researchers currently rely mainly on specially designed experiments to visualize the molecular fragments in hidden layers that relate to the target property, rather than analyzing hidden-layer features directly from the high-quality QSAR models themselves; research in this direction needs to be strengthened.
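To make "atom-level feature input" concrete, the sketch below one-hot encodes each atom by element over a toy vocabulary; the `ELEMENTS` list and function names are assumptions for illustration, and real atom-level encodings (as in graph-convolution models) add degree, formal charge, aromaticity, and similar properties on top of such vectors.

```python
# Minimal sketch of atom-level input encoding over a toy element vocabulary.
# ELEMENTS is an illustrative assumption, not a standard list.

ELEMENTS = ["C", "N", "O", "S", "F", "Cl"]

def atom_features(symbol):
    """One-hot encode an atom by element; unknown elements map to all zeros."""
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]

def molecule_features(symbols):
    """Stack per-atom feature vectors for a molecule given as element symbols."""
    return [atom_features(s) for s in symbols]
```

A molecule is then represented not by a single hand-crafted descriptor vector but by a variable-length set of per-atom vectors, from which the network itself learns molecule-level features.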
5 Summary and perspective
In summary, because chemical molecules are numerous and structurally complex and diverse, traditional algorithms often fall short, and deep learning outperforms traditional machine learning methods. The main reason is that deep learning is representation learning with multiple levels of description, built by composing simple nonlinear modules, each of which transforms a description at one level (starting from raw or near-raw input) into a higher-level, more abstract description. The key point is that these abstract features are not hand-designed but learned automatically by the model from large amounts of data. This capability is especially well suited, and more intelligent, when facing the large volumes of experimental data in chemistry.
Judging from current applications, although deep learning is already widely used in speech processing, computer vision, and natural language processing, its application in QSAR, and in chemoinformatics more broadly, is still at a preliminary stage. Yet the successes achieved so far suggest a bright future for deep learning in chemistry. In terms of problem complexity, developing multi-task QSAR models used to be very difficult, but with deep learning it becomes relatively simple, and the resulting models perform outstandingly. In encoding QSAR models, preliminary evidence shows that features designed with chemical expertise (such as molecular descriptors) are no longer so important: high-quality QSAR models can be built from very simple atom-level information alone. This is undoubtedly due to deep learning's powerful feature-learning ability. Moreover, such features can even be transformed in the hidden layers into concepts corresponding to real compound substructures, such as the toxicophores involved in DeepTox and the property-related fragments identified by the NGF method, which advances research on the interpretability of deep learning in QSAR. Deep neural networks are a framework well suited to "perception"; combining perception-oriented deep learning with reasoning-centered Bayesian neural networks to form a "perception-reasoning-decision" paradigm would accelerate the development of deep-learning-based drug design.
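The NGF idea mentioned above can be illustrated schematically: each atom repeatedly absorbs its neighbors' feature vectors through a small nonlinear layer, and the molecule-level fingerprint is obtained by sum-pooling over atoms. The graph, feature dimensions, and weight values below are arbitrary toy assumptions, not a trained model.

```python
# Schematic neural-graph-fingerprint pass over a toy 3-atom chain molecule.
# Atom features are 2-dimensional; `adjacency` lists each atom's neighbors.
# Weights are fixed toy values (an assumption), not learned parameters.

def relu_vec(v):
    return [x if x > 0.0 else 0.0 for x in v]

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def ngf_layer(features, adjacency, weights):
    """One message-passing step: each atom's new features are
    relu(W . (own features + sum of neighbor features))."""
    new = []
    for i, f in enumerate(features):
        msg = list(f)
        for j in adjacency[i]:
            msg = vec_add(msg, features[j])
        new.append(relu_vec(matvec(weights, msg)))
    return new

def fingerprint(features):
    """Molecule-level fingerprint: sum-pool the atom feature vectors."""
    fp = [0.0] * len(features[0])
    for f in features:
        fp = vec_add(fp, f)
    return fp

# Toy linear chain A-B-C with 2-dimensional atom features.
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
adjacency = [[1], [0, 2], [1]]
weights = [[0.5, -0.5], [0.5, 0.5]]
fp = fingerprint(ngf_layer(atoms, adjacency, weights))
```

Because each pooled dimension is a differentiable function of local substructure, one can trace which atoms contribute most to a fingerprint component, which is the mechanism behind the fragment-level interpretability discussed above.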
Several key scientific problems remain to be solved in applying deep learning to chemoinformatics, including how to further mitigate overfitting and speed up the training of deep neural networks; how to develop encoding methods and network architectures better suited to two- and three-dimensional molecular structure information, along with hyperparameter optimization algorithms and multi-objective deep learning algorithms; and how to accurately predict compounds' interactions with biological networks and their biological activities. Processing unstructured chemistry-related text and image data rapidly and effectively is another key problem to be solved. Deep learning's powerful capacity to process and understand data also offers a possible new route toward a better understanding of the physicochemical nature of molecular structure.
References:
[1] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6):82-97.
[2] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. Imagenet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25(2):1097-1105.
[3] COLLOBERT R, WESTON J. A unified architecture for natural language processing: deep neural networks with multitask learning[C]//The 25th International Conference on Machine Learning, July 5-9, 2008, Helsinki, Finland. New York: ACM Press, 2008: 160-167.
[4] GAWEHN E, HISS J A, SCHNEIDER G. Deep learning in drug discovery[J]. Molecular Informatics, 2016, 35(1):3-14.
[5] RAGHU M, POOLE B, KLEINBERG J, et al. On the expressive power of deep neural networks[J]. 2016: arXiv:1606.05336.
[6] HINTON G E, OSINDERO S, TEH YW. A fast learning algorithm for deep belief nets[J]. Neural Computation, 2006, 18(7):1527-1554.
[7] SRIVASTAVA N, HINTON G E, KRIZHEVSKY A, et al. Dropout: a simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1):1929-1958.
[8] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[J]. 2015: arXiv:1502.03167.
[9] GLOROT X, BORDES A, BENGIO Y. Deep sparse rectifier neural networks[C]//The 14th International Conference on Artificial Intelligence and Statistics, April 11-13, 2011, Fort Lauderdale, USA. [S.l.:s.n.], 2011: 315-323.
[10] DUCHI J, HAZAN E, SINGER Y. Adaptive subgradient methods for online learning and stochastic optimization[J]. Journal of Machine Learning Research, 2011, 12(7):2121-2159.
[11] ZEILER M D. ADADELTA: an adaptive learning rate method[J]. 2012: arXiv:1212.5701.
[12] KINGMA D, BA J. Adam: a method for stochastic optimization[J]. 2014: arXiv:1412.6980.
[13] MIKOLOV T, KARAFIÁT M, BURGET L, et al. Recurrent neural network based language model[C]//The 11th Annual Conference of the International Speech Communication Association, September 26-30, 2010, Makuhari, Chiba, Japan. [S.l.:s.n.], 2010: 1045-1048.
[14] WU Y, SCHUSTER M, CHEN Z, et al. Google's neural machine translation system: bridging the gap between human and machine translation[J]. 2016: arXiv:1609.08144.
[15] VINCENT P, LAROCHELLE H, LAJOIE I, et al. Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion[J]. Journal of Machine Learning Research, 2010, 11(12):3371-3408.
[16] SOCHER R. Recursive deep learning for natural language processing and computer vision[D]. Stanford: Stanford University, 2014.
[17] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780.
[18] SUN T L, PEI J F. Drug design and drug information in the era of big data[J]. Chinese Science Bulletin, 2015(8): 689-693. (in Chinese)
[19] SVETNIK V, LIAW A, TONG C, et al. Random forest: a classification and regression tool for compound classification and QSAR modeling[J]. Journal of Chemical Information and Computer Sciences, 2003, 43(6):1947-1958.
[20] RUPP M, TKATCHENKO A, MÜLLER KR, et al. Fast and accurate modeling of molecular atomization energies with machine learning[J]. Physical Review Letters, 2012, 108(5):3125-3130.
[21] RACCUGLIA P, ELBERT K C, ADLER P D F, et al. Machine-learning-assisted materials discovery using failed experiments[J]. Nature, 2016, 533(7601):73-76.
[22] DU H, WANG J, HU Z, et al. Prediction of fungicidal activities of rice blast disease based on least-squares support vector machines and project pursuit regression[J]. Journal of Agricultural and Food Chemistry, 2008, 56(22):10785-10792.
[23] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553):436-444.
[24] JAITLY N, NGUYEN P, SENIOR A W, et al. Application of pretrained deep neural networks to large vocabulary speech recognition[C]//The 13th Annual Conference of the International Speech Communication Association, September 9-13, 2012, Portland, OR, USA. [S.l.:s.n.], 2012: 1-4.
[25] DAHL G E, YU D, DENG L, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1):30-42.
[26] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 26-31, 2013, Vancouver, BC, Canada. New Jersey: IEEE Press, 2013: 6645-6649.
[27] DENG L, YU D, DAHL G E. Deep belief network for large vocabulary continuous speech recognition: 8972253[P]. 2015-03-03.
[28] GAO J, HE X, DENG L. Deep learning for web search and natural language processing[R]. Redmond:Microsoft Research, 2015.
[29] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013: arXiv:1310.4546.
[30] SOCHER R, LIN C C, MANNING C, et al. Parsing natural scenes and natural language with recursive neural networks[C]//The 28th International Conference on Machine Learning (ICML-11), June 28-July 2, 2011, Bellevue, Washington, USA. [S.l.:s.n.], 2011: 129-136.
[31] HE K, ZHANG X, REN S, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification[C]//The IEEE International Conference on Computer Vision, December 13-16, 2015, Santiago, Chile. New Jersey: IEEE Press, 2015: 1026-1034.
[32] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//The IEEE Conference on Computer Vision and Pattern Recognition, June 7-12, 2015, Boston, MA, USA. New Jersey: IEEE Press, 2015: 1-9.
[33] RUSSAKOVSKY O, DENG J, SU H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3):211-252.
[34] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//The IEEE Conference on Computer Vision and Pattern Recognition, June 27-30, 2016, Las Vegas, NV, USA. New Jersey: IEEE Press, 2016: 770-778.
[35] MARKOFF J. Scientists see promise in deep-learning programs[N]. New York Times, 2012-10-25.
[36] CARHART R E, SMITH D H, VENKATARAGHAVAN R. Atom pairs as molecular features in structure-activity studies: definition and applications[J]. Journal of Chemical Information and Computer Sciences, 1985, 25(2):64-73.
[37] KEARSLEY S K, SALLAMACK S, FLUDER E M, et al. Chemical similarity using physiochemical property descriptors[J]. Journal of Chemical Information and Computer Sciences, 1996, 36(1):118-127.
[38] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning representations by back-propagating errors[J]. Cognitive Modeling, 1988, 5(3):1.
[39] MA J, SHERIDAN R P, LIAW A, et al. Deep neural nets as a method for quantitative structure-activity relationships[J]. Journal of chemical information and modeling, 2015, 55(2):263-274.
[40] DAHL G E, JAITLY N, SALAKHUTDINOV R. Multi-task neural networks for QSAR predictions[J]. 2014: arXiv:1406.1231.
[41] EVGENIOU T, PONTIL M. Regularized multi-task learning[C]//The 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA. New York: ACM Press, 2004: 109-117.
[42] MAURI A, CONSONNI V, PAVAN M, et al. Dragon software: an easy approach to molecular descriptor calculations[J]. Match, 2006, 56(2):237-248.
[43] SNOEK J, LAROCHELLE H, ADAMS R P. Practical bayesian optimization of machine learning algorithms[J]. Advances in Neural Information Processing Systems, 2012: arXiv:1206.2944.
[44] SNOEK J, SWERSKY K, ZEMEL R S, et al. Input warping for bayesian optimization of non-stationary functions[C]//International Conference on Machine Learning, June 21-26, 2014, Beijing, China. [S.l.:s.n.], 2014: 1674-1682.
[45] FRIEDMAN J H. Greedy function approximation: a gradient boosting machine[J]. Annals of Statistics, 2001, 29(5):1189-1232.
[46] UNTERTHINER T, MAYR A, KLAMBAUER G, et al. Multi-task deep networks for drug target prediction[J]. Neural Information Processing System, 2014: 1-4.
[47] GAULTON A, BELLIS L J, BENTO A P, et al. ChEMBL: a large-scale bioactivity database for drug discovery[J]. Nucleic Acids Research, 2012, 40(D1):D1100-D1107.
[48] ROGERS D, HAHN M. Extended-connectivity fingerprints[J]. Journal of Chemical Information and Modeling, 2010, 50(5):742-754.
[49] HARPER G, BRADSHAW J, GITTINS J C, et al. Prediction of biological activity for high-throughput screening using binary kernel discrimination[J]. Journal of Chemical Information and Computer Sciences, 2001, 41(5):1295-1300.
[50] LOWE R, MUSSA H Y, NIGSCH F, et al. Predicting the mechanism of phospholipidosis[J]. Journal of Cheminformatics, 2012, 4(1):2.
[51] XIA X, MALISKI E G, GALLANT P, et al. Classification of kinase inhibitors using a Bayesian model[J]. Journal of Medicinal Chemistry, 2004, 47(18):4463-4470.
[52] KEISER M J, ROTH B L, ARMBRUSTER B N, et al. Relating protein pharmacology by ligand chemistry[J]. Nature Biotechnology, 2007, 25(2):197-206.
[53] WANG Y, SUZEK T, ZHANG J, et al. PubChem bioassay: 2014 update[J]. Nucleic Acids Research, 2014, 42(Database Issue):1075-1082.
[54] ROHRER S G, BAUMANN K. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data[J]. Journal of Chemical Information and Modeling, 2009, 49(2):169-184.
[55] MYSINGER M M, CARCHIA M, IRWIN J J, et al. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking[J]. Journal of medicinal chemistry, 2012, 55(14):6582-6594.
[56] RAMSUNDAR B, KEARNES S, RILEY P, et al. Massively multitask networks for drug discovery[J]. 2015: arXiv:1502.02072.
[57] MAYR A, KLAMBAUER G, UNTERTHINER T, et al. DeepTox: toxicity prediction using deep learning[J]. Frontiers in Environmental Science, 2016, 3(8):80.
[58] KAZIUS J, MCGUIRE R, BURSI R. Derivation and validation of toxicophores for mutagenicity prediction[J]. Journal of medicinal chemistry, 2005, 48(1):312-320.
[59] FRIEDMAN J, HASTIE T, TIBSHIRANI R. Regularization paths for generalized linear models via coordinate descent[J]. Journal of Statistical Software, 2010, 33(1):1.
[60] SIMON N, FRIEDMAN J, HASTIE T, et al. Regularization paths for Cox’s proportional hazards model via coordinate descent[J]. Journal of Statistical Software, 2011, 39(5):1.
[61] DUVENAUD D K, MACLAURIN D, IPARRAGUIRRE J, et al. Convolutional networks on graphs for learning molecular fingerprints[J]. Advances in Neural Information Processing Systems, 2015: arXiv:1509.09292.
[62] GRAVES A, WAYNE G, DANIHELKA I. Neural turing machines[J]. 2014: arXiv:1410.5401.
[63] MORGAN H L. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service[J]. Journal of Chemical Documentation, 1965, 5(2):107-113.
[64] DELANEY J S. ESOL: estimating aqueous solubility directly from molecular structure[J]. Journal of Chemical Information and Computer Sciences, 2004, 44(3):1000-1005.
[65] GAMO F-J, SANZ L M, VIDAL J, et al. Thousands of chemical starting points for antimalarial lead identification[J]. Nature, 2010, 465(7296):305-310.
[66] HACHMANN J, OLIVARES-AMAYA R, ATAHAN-EVRENK S, et al. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid[J]. The Journal of Physical Chemistry Letters, 2011, 2(17):2241-2251.
[67] KEARNES S, MCCLOSKEY K, BERNDL M, et al. Molecular graph convolutions: moving beyond fingerprints[J]. Journal of Computer-Aided Molecular Design, 2016, 30(8):595-608.
[68] HUGHES T B, MILLER G P, SWAMIDASS S J. Modeling epoxidation of drug-like molecules with a deep machine learning network[J]. ACS Central Science, 2015, 1(4):168-180.
[69] HUGHES T B, MILLER G P, SWAMIDASS S J. Site of reactivity models predict molecular reactivity of diverse chemicals with glutathione[J]. Chemical research in toxicology, 2015, 28(4):797-809.
[70] WALLACH I, DZAMBA M, HEIFETS A. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery[J]. 2015: arXiv:1510.02855.
[71] KOES D R, BAUMGARTNER M P, CAMACHO C J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise[J]. Journal of Chemical Information and Modeling, 2013, 53(8):1893-1904.
[72] GABEL J, DESAPHY J R M, ROGNAN D. Beware of machine learning-based scoring functions: on the danger of developing black boxes[J]. Journal of Chemical Information and Modeling, 2014, 54(10):2807-2815.
[73] SPITZER R, JAIN A N. Surflex-Dock: docking benchmarks and real-world application[J]. Journal of Computer-Aided Molecular Design, 2012, 26(6):687-699.
[74] COLEMAN R G, STERLING T, WEISS D R. SAMPL4 & DOCK3. 7: lessons for automated docking procedures[J]. Journal of Computer-Aided Molecular Design, 2014, 28(3):201-209.
[75] ALLEN W J, BALIUS T E, MUKHERJEE S, et al. DOCK 6: impact of new features and current docking performance[J]. Journal of Computational Chemistry, 2015, 36(15):1132-1156.
[76] PEREIRA J C, CAFFARENA E R, DOS SANTOS C N. Boosting docking-based virtual screening with deep learning[J]. Journal of Chemical Information and Modeling, 2016: arXiv:1608.04844.
[77] LUSCI A, POLLASTRI G, BALDI P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules[J]. Journal of Chemical Information and Modeling, 2013, 53(7):1563-1575.
[78] JAIN N, YALKOWSKY S H. Estimation of the aqueous solubility I: application to organic nonelectrolytes[J]. Journal of Pharmaceutical Sciences, 2001, 90(2):234-252.
[79] LOUIS B, AGRAWAL V K, KHADIKAR P V. Prediction of intrinsic solubility of generic drugs using MLR, ANN and SVM analyses[J]. European Journal of Medicinal Chemistry, 2010, 45(9):4018-4025.
[80] AZENCOTT C-A, KSIKES A, SWAMIDASS S J, et al. One-to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties[J]. Journal of Chemical Information and Modeling, 2007, 47(3):965-974.
[81] FRÖHLICH H, WEGNER J K, ZELL A. Towards optimal descriptor subset selection with support vector machines in classification and regression[J]. QSAR & Combinatorial Science, 2004, 23(5):311-318.
[82] XU Y, DAI Z, CHEN F, et al. Deep learning for drug-induced liver injury[J]. Journal of Chemical Information and Modeling, 2015, 55(10):2085-2093.
[83] LAKE B M, SALAKHUTDINOV R, TENENBAUM J B. Human-level concept learning through probabilistic program induction[J]. Science, 2015, 350(6266):1332-1338.
[84] ALTAE-TRAN H, RAMSUNDAR B, PAPPU A S, et al. Low data drug discovery with one-shot learning[J]. 2016: arXiv:1611.03199.
[85] KUHN M, LETUNIC I, JENSEN L J, et al. The SIDER database of drugs and side effects[J]. Nucleic Acids Research,2015, 44(D1):D1075.
[86] GÓMEZ-BOMBARELLI R, DUVENAUD D, HERNÁNDEZ-LOBATO J M, et al. Automatic chemical design using a data-driven continuous representation of molecules[J]. 2016: arXiv:1610.02415.
[87] SEGLER M H S, KOGEJ T, TYRCHAN C, et al. Generating focussed molecule libraries for drug discovery with recurrent neural networks[J]. 2017: arXiv:1701.01329.
XU Youjun (1990- ), male, Ph.D. candidate at the Academy for Advanced Interdisciplinary Studies, Peking University. His research interests include drug design and drug informatics.
PEI Jianfeng (1975- ), male, Ph.D., distinguished research fellow at the Academy for Advanced Interdisciplinary Studies, Peking University. His research interests include drug design and drug informatics.