GPT-2 Text Generation: fine-tuning GPT-2 for Magic the Gathering flavour text generation

A template for fine tuning your own GPT-2 model.

GPT-3 has dominated the NLP news cycle recently with its borderline magical performance in text generation, but for everyone without $1,000,000,000 of Azure compute credits there are still plenty of ways to experiment with language models on your own. Hugging Face is a company focused on open-source NLP tooling, and it provides one of the easiest ways to access pre-trained models and tokenizers for NLP experiments. In this article, I will share a method for fine-tuning the 117M-parameter GPT-2 model on a corpus of Magic the Gathering card flavour texts to create a flavour text generator. This will all be captured in a Colab notebook, so you can copy and edit it to create generators for your own tasks!

Starting Point

Generative language models require billions of data points and millions of dollars in compute power to train successfully from scratch. For example, GPT-3 cost an estimated $4.6 million and 355 years of compute time to train. However, fine-tuning many of these models for custom tasks is easily within reach of anyone with access to even a single GPU. For this project we will be using Colab, which comes with many common data science packages pre-installed, including PyTorch, and offers free access to GPU resources.

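Before installing anything, it is worth confirming that the notebook actually has a GPU attached. This quick check is not from the original notebook, just a minimal sketch assuming a GPU runtime has been selected in Colab:

```python
import torch

# Confirm that Colab has attached a GPU and that PyTorch can see it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))
```
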
First, we will install the Hugging Face transformers library, which will also fetch the excellent (and fast) tokenizers library. Although Hugging Face provides a resource for text datasets in its nlp library, I will be sourcing my own data for this project. If you don't have a dataset or application in mind, the nlp library is an excellent starting place for easy data acquisition.

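The install is a single notebook cell. The exact command from the original post is not reproduced here, but a minimal equivalent for Colab would be:

```python
# In a Colab cell: install the Hugging Face transformers library,
# which pulls in the fast tokenizers package as a dependency.
!pip install transformers
```
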
This will install the Hugging Face transformers library and the tokenizer dependencies.

The Hugging Face libraries will give us access to the GPT-2 model as well as its pretrained weights and biases, a configuration class, and a tokenizer to convert each word in our text dataset into a numerical representation to feed into the model for training. Tokenization is important because the model can't work with text directly, so the text needs to be encoded into something more manageable. Below is an example of tokenizing some sample text to give a small, representative picture of what the encoding produces.

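A minimal sketch of that encoding step, assuming the pretrained gpt2 tokenizer from transformers (the sample sentence is illustrative, not from the flavour text dataset):

```python
from transformers import GPT2Tokenizer

# Load the tokenizer that matches the 117M-parameter GPT-2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sample = "When the last spark fades, the plane falls silent."

# encode() maps the text to a list of integer token ids.
token_ids = tokenizer.encode(sample)
print(token_ids)

# The ids correspond to sub-word pieces rather than whole words;
# decode() reverses the mapping back to the original string.
print(tokenizer.convert_ids_to_tokens(token_ids))
print(tokenizer.decode(token_ids))
```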
