Fine-tune a non-English GPT-2 Model with Huggingface

Originally published at https://www.philschmid.de on September 6, 2020.

Introduction

Unless you’re living under a rock, you probably have heard about OpenAI’s GPT-3 language model. You might also have seen all the crazy demos, where the model writes JSX or HTML code, or shows off its capabilities in the area of zero-shot / few-shot learning. Simon O'Regan wrote an article with excellent demos and projects built on top of GPT-3.

A downside of GPT-3 is its 175 billion parameters, which result in a model size of around 350GB. For comparison, the biggest implementation of the GPT-2 iteration has 1.5 billion parameters, less than 1/116 of that size.

In fact, with close to 175B trainable parameters, GPT-3 is much bigger than any other model out there. Here is a comparison of the number of parameters of recent popular NLP models; GPT-3 clearly stands out.

Parameter counts of recent popular NLP models (created by author)

This is all magnificent, but you do not need 175 billion parameters to get good results in text-generation.

There are already tutorials on how to fine-tune GPT-2, but many of them are outdated. In this tutorial, we are going to use the transformers library by Huggingface in its newest version (3.1.0). We will use the new Trainer class and fine-tune our GPT-2 model with German recipes from chefkoch.de.

You can find everything we are doing in this colab notebook.

Transformers Library by Huggingface

Transformers logo

The Transformers library provides state-of-the-art machine learning architectures like BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, and T5 for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It also provides thousands of pre-trained models in 100+ different languages and is deeply interoperable between PyTorch & TensorFlow 2.0. It enables developers to fine-tune machine learning models for different NLP tasks like text classification, sentiment analysis, question answering, or text generation.

Tutorial

In the tutorial, we fine-tune a German GPT-2 from the Huggingface model hub. As data, we use the German Recipes Dataset, which consists of 12,190 German recipes with metadata crawled from chefkoch.de.

We will use the recipe Instructions to fine-tune our GPT-2 model and afterwards let it write recipes that we can cook.

Image created by author

We use a Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.

What are we going to do:

  • load the dataset from Kaggle
  • prepare the dataset and build a TextDataset
  • initialize Trainer with TrainingArguments and GPT-2 model
  • train and save the model
  • test the model

You can find everything we do in this colab notebook.

Load the dataset from Kaggle

As already mentioned in the introduction of the tutorial, we use the “German Recipes Dataset” from Kaggle. The dataset consists of 12,190 German recipes with metadata crawled from chefkoch.de. In this example, we only use the Instructions of the recipes. We download the dataset by using the “Download” button and upload it to our colab notebook, since it only has a zipped size of 4.7MB.

Screenshot of the Kaggle competition

After we uploaded the file, we use unzip to extract the recipes.json.
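
In the notebook this is a single unzip call; an equivalent sketch in plain Python could look like this (the archive name recipes.zip is an assumption, adjust it to whatever you uploaded):

```python
import zipfile

# Extract everything from the uploaded Kaggle archive into the working directory.
# "recipes.zip" is a placeholder name; use the file you actually uploaded.
with zipfile.ZipFile("recipes.zip") as archive:
    archive.extractall(".")
```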

You could also use the kaggle CLI to download the dataset, but be aware that you need your Kaggle credentials in the colab notebook.

Here is an example of a recipe.

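If you want to print one yourself, a small sketch like the following works, assuming recipes.json is a JSON array of recipe objects with an Instructions field (the field we use throughout this tutorial):

```python
import json

# Load all recipes and look at the first one.
with open("recipes.json", encoding="utf-8") as f:
    recipes = json.load(f)

print(len(recipes))                # expected: 12190 recipes
print(recipes[0]["Instructions"])  # the instruction text we will fine-tune on
```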

Prepare the dataset and build a TextDataset

The next step is to extract the instructions from all recipes and build a TextDataset. The TextDataset is a custom implementation of the PyTorch Dataset class implemented by the transformers library. If you want to know more about Dataset in PyTorch, you can check out this youtube video.

First, we split the recipes.json into a train and test section. Then we extract the Instructions from the recipes and write them into a train_dataset.txt and test_dataset.txt.
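
A sketch of this preprocessing step could look like the following; the 85/15 split ratio and the whitespace clean-up are assumptions, not a prescription:

```python
import json
import re
from sklearn.model_selection import train_test_split

with open("recipes.json", encoding="utf-8") as f:
    recipes = json.load(f)

# Keep only the instruction text of every recipe.
instructions = [recipe["Instructions"] for recipe in recipes]

# Hold back a small test set for evaluation.
train_texts, test_texts = train_test_split(instructions, test_size=0.15, random_state=42)

def build_text_file(texts, dest_path):
    # One recipe per line, with runs of whitespace collapsed.
    with open(dest_path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(re.sub(r"\s+", " ", str(text)).strip() + "\n")

build_text_file(train_texts, "train_dataset.txt")
build_text_file(test_texts, "test_dataset.txt")
```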

The next step is to download the tokenizer. We use the tokenizer from the german-gpt2 model.

Now we can build our TextDataset. To do so, we create a TextDataset instance with the tokenizer and the path to our datasets. We also create our data_collator, which is used in training to form a batch from our dataset.
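
A sketch of this step with the TextDataset and DataCollatorForLanguageModeling classes from transformers; the block_size of 128 tokens is an assumption you can tune:

```python
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    # TextDataset tokenizes the text files and cuts them into blocks of block_size tokens.
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128,
    )
    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128,
    )
    # mlm=False -> plain causal language modeling, which is what GPT-2 uses.
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(
    "train_dataset.txt", "test_dataset.txt", tokenizer
)
```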

Initialize Trainer with TrainingArguments and GPT-2 model

The Trainer class provides an API for feature-complete training. It is used in most of the example scripts from Huggingface. Before we can instantiate our Trainer we need to download our GPT-2 model and create TrainingArguments. The TrainingArguments are used to define the hyperparameters we use in the training process, such as the learning_rate, num_train_epochs, or per_device_train_batch_size. You can find a complete list here.
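
A sketch of this setup; the model hub id, the output_dir name, and all hyperparameter values below are illustrative rather than the exact settings used in the original notebook:

```python
from transformers import AutoModelWithLMHead, Trainer, TrainingArguments

# Download the pre-trained German GPT-2 model (same hub id assumed as for the tokenizer above).
model = AutoModelWithLMHead.from_pretrained("anonymous-german-nlp/german-gpt2")

training_args = TrainingArguments(
    output_dir="./gpt2-gerchef",      # directory for checkpoints and the final model
    overwrite_output_dir=True,
    num_train_epochs=3,               # hyperparameter values here are illustrative
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
```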

Train and Save the model

To train the model, we can simply run trainer.train().

After training is done, you can save the model by calling save_model(). This will save the trained model to our output_dir from our TrainingArguments.
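
Putting the two sentences above into code:

```python
# Fine-tune GPT-2 on the recipe instructions; this takes a while on a Colab GPU.
trainer.train()

# Write the fine-tuned model and its config to training_args.output_dir.
trainer.save_model()
```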

Test the model

To test the model we use another highlight of the transformers library called pipeline. Pipelines are objects that offer a simple API dedicated to several tasks, text-generation among others.
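
A sketch of generating a recipe with the fine-tuned model; the prompt and max_length value are illustrative, and the paths and hub id mirror the assumptions made earlier:

```python
from transformers import pipeline

# Build a text-generation pipeline from the fine-tuned model saved in output_dir.
chef = pipeline(
    "text-generation",
    model="./gpt2-gerchef",
    tokenizer="anonymous-german-nlp/german-gpt2",
)

# Hand the model the beginning of a recipe and let it write the rest.
print(chef("Zuerst Tomaten", max_length=100)[0]["generated_text"])
```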

result:

Zuerst Tomaten dazu geben und 2 Minuten kochen lassen. Die Linsen ebenfalls in der Brühe anbrühen. Die Tomaten auspressen. Mit der Butter verrühren. Den Kohl sowie die Kartoffeln andünsten, bis sie weich sind.

Well, that’s it. We’ve done it. We have successfully fine-tuned our GPT-2 model to write us recipes.

To improve our results, we could train it longer and adjust our TrainingArguments, or enlarge the dataset.

You can find everything in this colab notebook.

Thanks for reading. If you have any questions, feel free to contact me or comment on this article. You can also connect with me on Twitter or LinkedIn.

Translated from: https://towardsdatascience.com/fine-tune-a-non-english-gpt-2-model-with-huggingface-9acc2dc7635b
