Google's REALM: A Knowledge Base Augmented Language Model

Google has published a new way of pre-training a language model that is augmented with a knowledge retrieval mechanism, which looks up existing knowledge from an external Wikipedia corpus. This makes the outputs of the trained language model more grounded in facts and broader in coverage. The model is trained with masked language modeling and learns to retrieve and attend over millions of Wikipedia documents.

This article has five sections:

  • Intro to Knowledge Base

  • Intro to REALM

  • Training Process

  • Knowledge Retrieval In Detail

  • Future Speculation

OpenAI recently released GPT-3, a model trained with 175 billion parameters that generates some truly impressive results; you can read more here. Google itself has previously released transformer-based models such as BERT and T5, which have proven useful for a variety of tasks.

I have been playing with https://learnfromanyone.com/, an app in development that uses the GPT-3 API. McKay Wrigley posted some conversations users have had with the app. For instance, I am learning about samurai history from Dan Carlin himself (or rather, the GPT-3 actor playing Dan Carlin), which is fun because I don't know much about Japanese history, and it's a unique way to learn that is only possible because of GPT-3! But as Brady commented on Twitter, there is a valid concern: is what GPT-3 says accurate? Currently there is no way to fact-check GPT-3's output against an external source, so anyone with even cursory knowledge of a topic can point out inaccuracies, which is a flaw of current GPT-3-based systems.

Language models have come a long way in the last couple of years, after the invention of Transformers and the attention mechanism was applied to deep learning architectures trained with back-propagation. These models can not only master human language syntax and grammar, producing news articles and conversations that are indistinguishable from human-written ones, but they also understand a lot of context when given prompts in plain English and demonstrate meta-learning. Such a model has already mastered English and several other languages, along with features of language such as structure and syntax, but it is only useful in practical applications if it can relate language to the environment and the world and generate output that makes sense in that world; understanding the structure of language alone means it can just as easily output outlandish gibberish presented as fact. Scaling these language models has shown, however, that they can also output facts about almost any topic, since they have been trained on humongous datasets scraped from the internet and millions of books.

Language Models as Knowledge Bases

Knowledge bases are important for any organisation to function efficiently. Current knowledge bases are hand-engineered databases designed in a relational way, with limited graph-based connections that allow engineers to query the knowledge. But designing these databases, writing queries, and so on needs a lot of human supervision, takes a lot of time and resources, and is by design made to work for specific use cases within a domain: an EHR system for a hospital, a plant genomics repository, or company data (payroll, legal, etc.). All of this data is siloed into different data warehouses, and it is generally pretty hard for the knowledge to be shared within a domain, let alone with people from other domains.

More recently, KBs have started extracting knowledge from existing documents using complex NLP operations such as entity extraction, coreference resolution, entity linking, and relation extraction. This is again fairly resource-intensive work that requires effort from highly paid engineers and is currently only used by a few companies.

Wouldn't it be great if the language model or AI could make sense of this data simply by being fed raw, unprocessed information?

This is what GPT-3 has shown with its huge models trained on general internet data, and many are seeing it as a primitive form of an intelligent system, although some are skeptical. There are some limitations to these models, though:

  • Complexity: They are huge models that are hard to train, requiring a lot of computational power (money), very large datasets, and lots of time and engineering resources. Admittedly you only need to train them once, and they then perform pretty well on diverse tasks; OpenAI has demonstrated that by giving API access to its big model, which many developers are using for a wide range of tasks. But the data it is trained on is not updated, so it is like talking to someone who only knows about the world before 2019 (the latest point in time it has data on). As you know, a lot has happened since 2019, and a proper intelligent system has to keep its knowledge of the world up to date, which means it needs to be updated frequently.

  • Small context: The context of the input "prompts" you can provide the model is really small, so it behaves like someone with only short-term memory and cannot remember anything said to it 10 minutes ago in a conversation. So don't expect it to finish a Game of Thrones novel anytime soon.

  • Common-sense reasoning: Although it can do some kinds of reasoning, its reasoning is still very random and inconsistent and doesn't really show that it is trying to understand the logic behind some of the prompts.

  • Knowledge: It can spit out knowledge about any topic, that's true, but this knowledge is stored in a very abstract form (in the learned weights of the model) and in a compressed form: even 175 billion parameters are not enough to encode every bit of knowledge about the world. There is also no way to prod the network to see where and how the knowledge it spits out is stored, so it cannot be used as a reliable model for any application that needs factual, verifiable knowledge. A reliable knowledge base would also be necessary for reasoning, especially the kind of reasoning needed for harder problems like solving maths or physics questions.

Google's new model, REALM, seems like a step in the right direction towards solving some of the problems I just mentioned.

Retrieval-Augmented Language Model Pre-training (REALM)

Humans also have limited capacity to store information; most of our knowledge is stored externally, and we just use the right tool (Google!) to access it, for instance by simply doing a search. So why not teach the language model to do the same thing? That's where Google's new knowledge retrieval mechanism comes in, working alongside the transformer model. When trying to generate an output, the model doesn't just rely on its internal weights (like most transformers until now) but goes to an external document repository (a Wikipedia corpus) to support its predictions, which makes the outputs much more accurate and reliable than just looking at its own encoded weights.
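
The REALM paper frames this retrieve-then-predict process as marginalizing over the retrieved document, which is treated as a latent variable (in practice the sum is approximated using only the top few retrieved documents):

```latex
% z is a document from the knowledge corpus \mathcal{Z} (Wikipedia),
% treated as a latent variable that the model retrieves and then conditions on.
\[
  p(y \mid x) \;=\; \sum_{z \in \mathcal{Z}}
    \underbrace{p(y \mid z, x)}_{\text{knowledge-augmented encoder}}
    \;
    \underbrace{p(z \mid x)}_{\text{neural knowledge retriever}}
\]
```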

So during the pre-training step, this model is not just encoding the knowledge inside itself; it is learning to search for the right document in the external repository. This process of externalizing knowledge retrieval has many advantages:

  • It doesn't need billions of parameters to store world knowledge. They have demonstrated that their model with only 300 million parameters performs as well as the 11 billion parameters used in the earlier T5 language model.
  • The external knowledge repository works asynchronously and is independent of the language model. Right now they are using Wikipedia, but in the future they could use the power of their search engine, so REALM could google just like us to provide an answer to a question. This also solves the problem of keeping the model up to date without needing to retrain the big model again and again, saving money and time.
  • It's much easier to debug where the model is getting its answers from, as it can now provide the index of the document it referred to in support of the claims it makes. This would be great for applications that need fact-checking and more reliable knowledge access.

Results

REALM outperforms the largest T5-11B model while being 30 times smaller. They test the model on open-domain QA, one of the hardest knowledge-based benchmarks, where no supporting documents are provided alongside the questions, making it a perfect test bed for the knowledge retriever's capabilities. It also performs well on the FEVER fact-verification dataset, the Jeopardy trivia questions dataset, and several other knowledge-testing datasets. (These are the claims as published in the paper.)

Model Architecture

  • Neural Knowledge Retriever (KR): This component is also a neural network, trained along with the transformer in an end-to-end differentiable fashion just like any other deep learning model. Its job is to attend over millions of documents (Wikipedia) and choose the document that best supports the given input sentence; the "real" factual sentence retrieved from the knowledge base is then appended to the input sentence. (A toy sketch of the retriever's scoring appears after this list.)

  • Transformer: This is the same transformer model used before, with many layers of self-attention blocks that attend over every word in the input and output the predicted sequence. The input context is larger, though, because the model has to attend not only to the input sentence but also to the sentence extracted from the wiki document.
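
To make the retriever side a bit more concrete, here is a minimal, hypothetical sketch of how a dense retriever can score documents against a query and turn those scores into a retrieval distribution p(z|x). The embed function below is a toy stand-in for REALM's BERT-style encoders; only the inner-product-plus-softmax structure reflects the actual method.

```python
import numpy as np

def embed(text, dim=128):
    """Stand-in encoder: REALM uses a BERT-style transformer to map text to a
    fixed-size vector; here we fake it with a hash-seeded random projection."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieval_distribution(query, documents):
    """p(z | x): softmax over inner products between the query embedding and
    every (normally pre-computed) document embedding."""
    q = embed(query)
    doc_matrix = np.stack([embed(d) for d in documents])  # pre-computed in practice
    scores = doc_matrix @ q                               # inner products f(x, z)
    scores -= scores.max()                                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

docs = ["Buckingham Palace is the London residence of the monarch.",
        "The pound sterling is the official currency of the United Kingdom.",
        "Almond milk is a plant-based beverage."]
print(retrieval_distribution("Prices in London are listed in [MASK].", docs))
```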

Pre-training process for the REALM model

Training Process: This is the simplest way I can explain how they augment the language model. You will need a little knowledge about how transformers are trained to follow it (a toy, end-to-end sketch of one training step follows this list).

  • The first step is getting a tokenized sentence from a raw dataset (any internet corpus). You then randomly mask a word (or a span of words) and give this tokenized sentence (in embedded vector form), along with positional embeddings, to the REALM model.
  • The job of the REALM model is to predict which words the masked tokens should be. It does that by assigning a probability to every word in the vocabulary, and the word with the highest probability is the one it chooses. In simpler words, the model is trained to fill in the blanks.

  • Based on the output and how close it is to the real tokens (before masking), a loss is calculated. This loss is back-propagated through the model and used to train the weights/parameters, and the goal of training is to minimize this loss.
  • In the classical transformer model we only have one transformer, which has many layers of language and knowledge representation hidden in its parameters. The layers work together, using self-attention (which attends over all the words in the sentence), to predict what the model thinks the masked word should be. In fact, if you look at the top 10 words with the highest probability, they all make some sense given the input sentence, and the highest-probability one is simply the best word the model can choose for that input. But because of their limited parameters and the vast, effectively unlimited knowledge they would need to attend to, classic transformers often fail to give the exact word you would want, especially for specific world facts.
  • In REALM, the embedded input tokens are first given to the KR. The KR's job is to search (through millions of documents) for the document that most closely matches the given input tokens. The documents are first pre-computed using neural network weights and assigned a vector that gives a location and direction in a multi-dimensional (128-dim) space. So it's like a big search index, similar in spirit to what the Google search algorithm uses when you search for a term on their website.
  • The best match is then found using a method called Maximum Inner Product Search (MIPS). MIPS calculates the inner product between the query vector and each target document vector; the document vector that yields the maximum inner product is the one whose document contains the best supporting sentence for the input.

  • The document vectors are pre-computed (meaning the most related documents end up clustered together), so the query vector just needs to find the neighborhood in the vector space that best matches what it's looking for. Think of it like grocery shopping: if you're looking for almond milk, you go to the milk section, because the store arranges its products in clusters. But if the products were placed randomly anywhere in the store, you would have a hard time finding anything you want.
  • So during training, the KR has to recompute this arrangement of the target vectors so that the next query vector has an easier time finding what it wants. It does this by taking the loss value from the transformer output and updating its gradients.

  • Once the document sentence is retrieved, we append the document text to the input sentence and provide both to the transformer. The transformer layers perform rich cross-attention between the input text and the supporting document text, and if the attention weights are learned correctly, the model learns to map the document text onto the input text and predict the missing word more accurately. If the retrieved document text doesn't help it predict the word correctly, that loss signal is passed back through the retriever so that it learns to do its job better.

  • And if you have the right configuration and size and train this for a while, you end up with a model that not only learns to retrieve the correct document but also predicts output sequences correctly. As a bonus, you get the index of the document in the corpus for debugging purposes.
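
Putting the steps in this list together, here is a heavily simplified, hypothetical sketch of one pre-training step. The encoders and the masked-word predictor are toy stand-ins (REALM uses BERT-style transformers and back-propagates the loss through both the encoder and the retriever); the sketch only illustrates the flow: mask, retrieve, marginalize, compute the loss.

```python
import numpy as np

VOCAB = ["the", "bank", "of", "england", "pounds", "dollars", "palace", "is", "in", "london"]

def embed(text, dim=128):
    # Stand-in for a BERT-style encoder (deterministic random projection).
    r = np.random.default_rng(abs(hash(text)) % (2**32))
    v = r.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve_top_k(query, corpus, k=2):
    """Score every document by inner product (MIPS) and keep the top k."""
    q = embed(query)
    scores = np.array([embed(doc) @ q for doc in corpus])
    top = np.argsort(scores)[::-1][:k]
    logits = scores[top]
    p_z = np.exp(logits - logits.max())
    p_z /= p_z.sum()                      # p(z | x) over the retrieved subset
    return [(corpus[i], p) for i, p in zip(top, p_z)]

def predict_masked_word(query, document):
    """Stand-in for the knowledge-augmented encoder p(y | z, x): a toy
    heuristic that up-weights vocabulary words appearing in the document."""
    logits = np.array([3.0 if w in document.lower() else 0.0 for w in VOCAB])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def training_step(masked_sentence, target_word, corpus):
    # 1. Retrieve supporting documents and their retrieval probabilities.
    retrieved = retrieve_top_k(masked_sentence, corpus)
    # 2. Marginalize the masked-word prediction over the retrieved documents:
    #    p(y | x) = sum_z p(y | z, x) * p(z | x)
    p_y = sum(p_z * predict_masked_word(masked_sentence, doc) for doc, p_z in retrieved)
    # 3. Negative log-likelihood of the true word; in REALM this loss is
    #    back-propagated through BOTH the encoder and the retriever.
    loss = -np.log(p_y[VOCAB.index(target_word)] + 1e-12)
    return loss, retrieved

corpus = [
    "The Bank of England issues banknotes denominated in pounds sterling.",
    "Buckingham Palace is the London residence of the monarch.",
]
loss, evidence = training_step("prices at the shop are listed in [MASK]", "pounds", corpus)
print(loss, [doc for doc, _ in evidence])
```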

The main advantage here is that the transformer doesn't have to hard-code a lot of general facts; that task is off-loaded. As you will see later, this results in a much lighter transformer model with fewer parameters that still performs well on knowledge-intensive tasks such as question answering.

In the example above, the raw input sentence has "pounds" as the masked word. The retriever fetches the top-k documents (k can be any number, depending on your transformer size) that most closely match the input sentence. So it might find the Wikipedia pages for Buckingham Palace and England, and then surface the exact sentence from the page that supports the input sentence. This extracted sentence is then used to predict the masked token, and the model also outputs the indexes of the documents it used. You can imagine how powerful this makes the model at generating very specific world facts and knowledge without requiring it to encode so much of the world's knowledge inside itself.
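
For completeness, the retrieved text is combined with the masked input simply by concatenation before being fed to the transformer; in the paper this is done with BERT-style special tokens, roughly like this hypothetical helper:

```python
def build_encoder_input(masked_query, retrieved_passage):
    """Join the masked query and a retrieved passage the way a BERT-style
    encoder expects: [CLS] query [SEP] passage [SEP]."""
    return f"[CLS] {masked_query} [SEP] {retrieved_passage} [SEP]"

print(build_encoder_input(
    "The official currency of the United Kingdom is the [MASK].",
    "The pound sterling is the official currency of the United Kingdom.",
))
```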

Knowledge Retriever Training

This section goes into a little more detail about what's happening inside the KR. Each document is encoded using a dense passage retrieval technique that gives a bidirectional encoding of the text in each document. You can think of this as a form of compressed encoding that captures the important features of each document in a form you can match, in a high-dimensional space, against an input query vector. The MIPS technique mentioned above then does a nearest-neighbor search, using the neural network weights, to match the query against the target documents.
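
As an illustration of the search step (not Google's actual implementation), a brute-force MIPS over a pre-computed embedding matrix looks like the sketch below; real systems swap the exhaustive matrix product for an approximate MIPS index so the search scales to millions of documents.

```python
import numpy as np

# Hypothetical pre-computed document embeddings: one 128-dim row per document.
# In REALM these come from a BERT-style document encoder; here they are random.
rng = np.random.default_rng(42)
doc_embeddings = rng.standard_normal((100_000, 128)).astype(np.float32)

def mips_top_k(query_vec, k=5):
    """Brute-force maximum inner product search: score every document and
    return the indices of the k best matches."""
    scores = doc_embeddings @ query_vec            # one inner product per document
    top_k = np.argpartition(scores, -k)[-k:]       # unordered top-k, O(n)
    return top_k[np.argsort(scores[top_k])[::-1]]  # sort those k by score

query = rng.standard_normal(128).astype(np.float32)
print(mips_top_k(query))
```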

The major challenge is that there are millions of documents to search through, which creates two problems: encoding millions of documents and keeping them in memory requires a lot of memory, and training the neural weights over all of them would take ages even with the fastest computation available. Google solves this by:

  • training and matching the input vector against only a small number of encoded documents at a time.
  • recomputing the document encodings (the search index) asynchronously, so the transformer continues training against a snapshot of the knowledge base while the index updates itself in the background, since re-encoding millions of documents takes time.

So two jobs run in parallel in the KR: a primary trainer job, which updates the embedding parameters at every step by listening to the training signal from the transformer, and an index builder, which indexes the documents based on those embedding parameters. At each step the trainer uses a snapshot of the index, while the index builder takes its own time to re-index based on the latest model parameters and then, when it's done re-indexing, updates the snapshot.

They have shown that training the transformer isn't hurt even if the index is only rebuilt every 500 steps. This means that for up to 500 steps the trainer and the transformer are using a "stale" knowledge index, but it is still enough for the transformer to make accurate predictions.
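
A toy sketch of these two asynchronous jobs might look like the following; the refresh interval, the data structures, and the placeholder embedding function are all illustrative assumptions, not Google's actual training infrastructure.

```python
import threading
import time

class IndexSnapshot:
    """Immutable view of the document index that the trainer reads from."""
    def __init__(self, version, doc_embeddings):
        self.version = version
        self.doc_embeddings = doc_embeddings

class AsyncIndexBuilder(threading.Thread):
    """Periodically re-embeds the corpus with the latest retriever parameters
    and swaps in a fresh snapshot, while training continues in parallel."""
    def __init__(self, get_params, corpus, refresh_seconds=1.0):
        super().__init__(daemon=True)
        self.get_params = get_params
        self.corpus = corpus
        self.refresh_seconds = refresh_seconds
        self.snapshot = IndexSnapshot(0, self._embed_all(get_params()))

    def _embed_all(self, params):
        # Placeholder: recompute every document embedding with current params.
        return [hash((params, doc)) % 1000 for doc in self.corpus]

    def run(self):
        version = 0
        while True:
            time.sleep(self.refresh_seconds)          # re-indexing takes time
            version += 1
            embeddings = self._embed_all(self.get_params())
            self.snapshot = IndexSnapshot(version, embeddings)  # atomic swap

# Trainer loop: always uses whatever snapshot is current, even if it is a few
# hundred steps stale, which the paper reports does not hurt accuracy.
params = 0
builder = AsyncIndexBuilder(lambda: params, corpus=["doc a", "doc b"])
builder.start()
for step in range(5):
    snap = builder.snapshot          # read the stale-but-consistent index
    params += 1                      # pretend gradient update to retriever params
    time.sleep(0.5)
    print(f"step {step}: using index version {snap.version}")
```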

One more advantage of re-indexing asynchronously is that the knowledge corpus can be kept up to date with the latest events, with document vectors recalculated when the model is eventually put into production, without changing the model or retraining it.

Future Speculation

My favorite thing to do is speculate about the future! There are lots of problems that seem unsolvable; one of them is how to counter the misinformation being spread online. Many companies are struggling with it and are under pressure to solve it. A knowledge-augmented intelligent AI system would be one way to tackle this misinformation problem. What if we had a decentralized pool of different knowledge bases, each with its own trust-based community, systems, and protocols, maintained by everyone globally? Then AI systems like REALM could tap into these different KBs, extract knowledge, and provide answers to whatever questions you ask them. The trust placed in an AI system would then depend on how trustworthy the knowledge bases it draws on are.

Another major roadblock is that current AI systems lack reasoning skills. Gary Marcus has talked about this recently in his harsh critique of the GPT-3 system. Will off-loading the memorization of facts to an external base help big language models use their parameters more efficiently to perform better reasoning? This remains to be seen, and I am sure Google and others must be training their own GPT-3-like big systems. If nothing else, having a solid knowledge base might allow a model to think more clearly about complex topics in fields as diverse as science, philosophy, and governance.

AI as oracles. My favorite idea about using AI is to create the perfect prediction system, one that can predict complex events and be an oracle of truth and fate. Can it learn from the data it is provided and predict future events? Will big hedge funds use it to create the most efficient stock market, one that always goes up? Will it help predict natural disasters more accurately? Can it predict the future of an individual? This is the stuff of fantasies for sure, but it's fun to speculate.

As for language models, I see three diverse paths emerging:

  • Closed-off and opaque: A path of continuing to scale models (without an external knowledge module), where maybe the model learns to discard extraneous, unneeded facts and still manages to recover knowledge from its encoded, abstract weights. Would this be a more "truthful" fact machine, since it infers knowledge from the world knowledge it has assimilated instead of relying on a knowledge base curated by humans, with their own biases? OpenAI has proved to the detractors that scaling a model does produce some mind-blowing things, and maybe scaling from billions to trillions of parameters and increasing the dataset size is the only thing you need. Please read this article by Gwern, where he goes deep into the idea of scaling and how the major AI companies keep ignoring it.

  • Open and decentralized: This plays into the idea of having several parties create and maintain their own knowledge bases, with a rating system emerging where you get what you pay for. This might also be useful for companies and specific domains, where certain parties build their own knowledge bases and sell them to companies running their own AI systems, for instance EHR systems in hospitals.

  • "Free" and a monopoly: Where the world is ruled by only one "truth", and that truth is whatever knowledge base Google provides you. You just trust the knowledge, because can Google ever be wrong about a fact? Can Wikipedia be wrong about a fact? Can you really believe and trust in this one-system, one-rule methodology? I am not saying Google wants this, but it is a possibility.

I am kidding! I don't!

Let's leave the speculation world and jump back to reality. Next steps: there is still lots of work left to do. Google plans to test this model architecture on more reasoning-intensive tasks. They also want to explore retrieving images and knowledge graphs, not just text.

Although Google hasn't released the big model to the public yet, they have released a smaller version along with the code (thanks to the developers!) so researchers can play with it. I tried running the code, but it uses the TensorFlow 2 estimator library, which I am not too familiar with. If anyone has managed to run the code, please do message me. Also, message me with any feedback, as I am pretty new to writing and appreciate criticism. And please tell me your own speculations! I love reading speculations!

Find me on Twitter at @swapp19902 or email me at [email protected]. Let's connect and make this world a fun place to live in.

Originally published at: https://levelup.gitconnected.com/googles-realm-a-knowledge-base-augmented-language-model-bc1a9c9b3d09
