Tagging Genes and Proteins with BioBERT

I. Introduction

Text mining in the clinical domain has become increasingly important given the sheer number of biomedical documents now available, each holding valuable information waiting to be extracted and put to use with NLP techniques. With the accelerated progress in NLP, pre-trained language models now carry millions (or even billions) of parameters and can leverage massive amounts of textual knowledge for downstream tasks such as question answering, natural language inference, and, in the case we will work through here, biomedical text tagging via named-entity recognition. All of the code can be found on my GitHub.

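To make "tagging" concrete before diving in: a named-entity recognition model assigns a label to every token, typically in the BIO scheme. The snippet below is a purely illustrative sketch of that output format (the sentence, labels, and label names are made up for this example, not taken from the dataset used later).

```python
# Illustrative token-level output of a gene/protein NER tagger (BIO scheme):
# "B-" marks the beginning of an entity, "I-" its continuation, "O" everything else.
tokens = ["The", "BRCA1", "protein", "interacts", "with", "RAD51", "."]
labels = ["O",   "B-GENE", "O",       "O",         "O",    "B-GENE", "O"]

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```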

II. Background

As a state-of-the-art breakthrough in NLP, Google researchers developed a language model known as BERT (Devlin et al., 2018), designed to learn deep representations by jointly conditioning on the bidirectional context of text in all layers of its architecture¹. These representations are valuable for sequential data such as text, which relies heavily on context, and the advent of transfer learning in this field helps carry the encoded knowledge over to strengthen an individual’s smaller tasks across domains. In transfer learning, we call this step “fine-tuning”, meaning that the pre-trained model is now adapted to the particular task we have in mind. The original English-language model was pre-trained on two corpora: Wikipedia and BooksCorpus. For a deeper intuition behind transformers like BERT, I would suggest a series of blogs on their architecture and fine-tuned tasks.

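As a minimal sketch of what “fine-tuning” looks like in practice, the snippet below loads a pre-trained BERT checkpoint with a fresh token-classification head and runs one training step. It assumes the Hugging Face `transformers` library and PyTorch; the checkpoint name, label set, and toy labels are placeholders, not the configuration used later in this article.

```python
# Fine-tuning sketch, assuming Hugging Face `transformers` + PyTorch (placeholder config).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-cased"   # pre-trained, general-domain checkpoint
num_labels = 3                   # e.g. O, B-GENE, I-GENE

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

# One toy training step: the pre-trained encoder weights are reused, and both the
# encoder and the newly initialized classification head are updated on our labels.
encoding = tokenizer("BRCA1 interacts with RAD51", return_tensors="pt")
seq_len = encoding["input_ids"].shape[1]
labels = torch.zeros((1, seq_len), dtype=torch.long)  # all "O" here; real labels align BIO tags to wordpieces

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs = model(**encoding, labels=labels)
outputs.loss.backward()
optimizer.step()
```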

Figure 1: BERT Architecture (Devlin et al., 2018)

BioBERT (Lee et al., 2019) is a variation of the aforementioned model, developed at Korea University and Clova AI. The researchers extended the original BERT pre-training corpora with PubMed and PMC. PubMed is a database of biomedical citations and abstracts, whereas PMC is an electronic archive of full-text journal articles. Their contribution is a biomedical language representation model that can handle tasks such as relation extraction and drug discovery, to name a few. With a pre-trained model that covers both general and biomedical domain corpora, developers and practitioners can now capture biomedical terms that would have been incredibly difficult for a general language model to comprehend.

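For readers who want to try BioBERT directly, the sketch below assumes the DMIS-lab weights are available on the Hugging Face Hub; the checkpoint name `dmis-lab/biobert-base-cased-v1.1` and the three-label scheme are assumptions for illustration, not necessarily the exact setup used in this article. Swapping the general-domain checkpoint for the biomedical one is essentially a one-line change.

```python
# Loading BioBERT in place of vanilla BERT, assuming the DMIS-lab release on the
# Hugging Face Hub (checkpoint names may differ between BioBERT versions).
from transformers import AutoTokenizer, AutoModelForTokenClassification

biobert_checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed Hub name

tokenizer = AutoTokenizer.from_pretrained(biobert_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    biobert_checkpoint,
    num_labels=3,  # e.g. O, B-GENE, I-GENE for gene/protein tagging
)

# Sanity check: the additional biomedical pre-training (not a new vocabulary) is what
# helps downstream models with mentions like "BRCA1" or "p53".
print(tokenizer.tokenize("The p53 protein regulates BRCA1 expression."))
```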