Train GPT-2 in your own language

We all know that modern Natural Language Processing (NLP) has progressed by leaps and bounds in the past couple of years, following the development of attention networks and Transformers. This paved the way for a plethora of new algorithms achieving state-of-the-art (SOTA) results on the different tasks of NLP.

OpenAI has been one of the leaders in releasing its own language models (most recently GPT-3), trained on a huge corpus of internet data. Since GPT-3 is recent, is English-only at the moment, and is accessible only through the API provided by OpenAI, we shift our focus to its earlier version, GPT-2. To learn about the internal nuts and bolts of GPT-2, I'd suggest you go through this link. For more depth on attention and Transformers, here are some excellent links:

  • The Illustrated Transformer by Jay Alammar

  • The Annotated Transformer by Harvard NLP

GPT-2 was also released only for English, which makes it difficult for anyone trying to generate text in a different language.

So why not train your own GPT-2 model on your favourite language for text generation? That is exactly what we are going to do. So, without further ado, let us jump in.

For the demo, I have chosen a language with a non-Latin script (Bengali here), because why not! I have used Hugging Face's implementation of the model.

1. Gathering the data

Gathering good-quality data is one of the most important stages, as all data scientists would agree. So, we are going to assume that you already have a folder of .txt files containing all the cleaned data. For ease, you can use Wikipedia article data, which can be downloaded with the following code.

import tensorflow as tf
from gensim.corpora import WikiCorpus
import os
import argparse


# lang = 'bn'


def store(corpus, lang):
    base_path = os.getcwd()
    store_path = os.path.join(base_path, '{}_corpus'.format(lang))
    if not os.path.exists(store_path):
        os.mkdir(store_path)
    file_idx=1
    for text in corpus.get_texts():
        current_file_path = os.path.join(store_path, 'article_{}.txt'.format(file_idx))
        with open(current_file_path, 'w' , encoding='utf-8') as file:
            file.write(' '.join(text))
        #endwith
        file_idx += 1
    #endfor


def tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list:
    return [token for token in text.split() if token_min_len <= len(token) <= token_max_len]


def run(lang):
    origin='https://dumps.wikimedia.org/{}wiki/latest/{}wiki-latest-pages-articles.xml.bz2'.format(lang,lang)
    fname='{}wiki-latest-pages-articles.xml.bz2'.format(lang)
    file_path = tf.keras.utils.get_file(origin=origin, fname=fname, untar=False, extract=False)
    # gensim >= 4.0 no longer supports the `lemmatize` argument; omit it on those versions
    corpus = WikiCorpus(file_path, lemmatize=False, lower=False, tokenizer_func=tokenizer_func)
    store(corpus, lang)


if __name__ == '__main__':
    ARGS_PARSER = argparse.ArgumentParser()
    ARGS_PARSER.add_argument(
        '--lang',
        default='en',
        type=str,
        help='language code to download from wikipedia corpus'
    )
    ARGS = ARGS_PARSER.parse_args()
    run(**vars(ARGS))

Save the script as wikipedia_download.py and run it, here for Bengali:

python wikipedia_download.py --lang bn

This will create a folder containing all Wikipedia files looking like:

[Screenshot of the file list]

Note: due to resource constraints, and since this is for demo purposes, I have trained the model on a small subset of books by Satyajit Ray, especially his detective Feluda series.
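
Whichever corpus you use, it can help to check the folder before training on it. Here is a minimal sketch, assuming a folder of UTF-8 .txt files; the helper name `collect_text_files` and the `min_chars` threshold are my own illustrative choices, not part of the original scripts.

```python
from pathlib import Path
import tempfile


def collect_text_files(folder, min_chars=20):
    """Return sorted paths of non-trivial UTF-8 .txt files under `folder`."""
    kept = []
    for path in sorted(Path(folder).glob("**/*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        # Skip files that are empty or too short to be useful for training.
        if len(text) >= min_chars:
            kept.append(str(path))
    return kept


# Tiny self-contained demo on a temporary folder (hypothetical file contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "article_1.txt").write_text("আমি বাংলায় গান গাই। " * 3, encoding="utf-8")
    Path(tmp, "article_2.txt").write_text("   ", encoding="utf-8")  # effectively empty
    demo_paths = collect_text_files(tmp)
    demo_names = [Path(p).name for p in demo_paths]

print(demo_names)
```

The returned list has the same shape as the `paths` list used later when training the tokenizer, so it could be passed on directly.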

2. Tokenization

Now, the second step will be to tokenize the data. For that, we use the following class.

import os
from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


class BPE_token(object):
    def __init__(self):
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([
            NFKC()
        ])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()


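    def bpe_train(self, paths):
        # special tokens: begin-of-sequence, padding, end-of-sequence, unknown, mask
        trainer = BpeTrainer(vocab_size=50000, show_progress=True, initial_alphabet=ByteLevel.alphabet(), special_tokens=[
            "<s>",
            "<pad>",
            "</s>",
            "<unk>",
            "<mask>"
        ])
        # the positional order of (files, trainer) differs between tokenizers releases,
        # so pass both by keyword
        self.tokenizer.train(files=paths, trainer=trainer)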


    def save_tokenizer(self, location, prefix=None):
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)

Some notes on the tokenization:

  • We use BPE (Byte Pair Encoding), a subword encoding, which generally avoids treating different forms of a word as unrelated. For example, 'greatest' may be split into two tokens, 'great' and 'est'; this is advantageous because it retains the similarity between 'great' and 'greatest', while the extra token 'est' marks the difference. It is also not as low-level as character-level encoding, which retains nothing of the meaning of a particular word.

  • Another small but subtle point is NFKC (Normalization Form Compatibility Composition) in line 13 of the code, where the tokenizer's normalizer is set. It is one of the standard Unicode normalization forms. It would not matter much if the language were English, but since we are using Bengali, whose script has characters that can be written with more than one code-point sequence, we use this specific one. More on it can be found in this link
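
To make the NFKC point concrete, here is a small sketch using Python's standard `unicodedata` module rather than the tokenizers library. The Bengali syllable and the Latin ligature are examples I picked for illustration; the tokenizer's `NFKC()` normalizer is expected to apply the same Unicode normalization form.

```python
import unicodedata

# The same visible Bengali syllable "কো" stored two different ways:
one_sign = "\u0995\u09cb"          # KA + vowel sign O (a single code point)
two_signs = "\u0995\u09c7\u09be"   # KA + vowel sign E + vowel sign AA


def nfkc(text):
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


raw_equal = one_sign == two_signs                      # compared as stored
normalized_equal = nfkc(one_sign) == nfkc(two_signs)   # compared after NFKC
ligature = nfkc("\ufb01")  # compatibility mapping of the Latin "fi" ligature

print(raw_equal, normalized_equal, ligature)
```

Without normalization, the two spellings would be different strings to the tokenizer and could end up as different tokens even though a reader sees the same text.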

So what we do here is train the tokenizer on our data and save it in a folder. Two files (merges.txt and vocab.json) will be created in the specified directory. To train and save the tokenizer, use the following code:

from tokenise import BPE_token
from pathlib import Path
import os

# the folder 'text' contains all the files
paths = [str(x) for x in Path("./text/").glob("**/*.txt")]
tokenizer = BPE_token()

# train the tokenizer model
tokenizer.bpe_train(paths)

# save the trained tokenizer files in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
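
To give a feel for where the merge rules in merges.txt come from, here is a toy, pure-Python sketch of BPE merge learning. It is not how the tokenizers library is implemented (that one works on bytes and is heavily optimized); the function name `learn_bpe_merges` and the two-word corpus are made up for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


toy_words = ["great"] * 5 + ["greatest"] * 2
toy_merges, toy_corpus = learn_bpe_merges(toy_words, num_merges=4)
print(toy_merges)
print(dict(toy_corpus))
```

After four merges the frequent stem 'great' has become a single symbol that both words share, which is the behaviour the note on 'great' and 'greatest' describes.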

3. Model Initialization

Before the real magic begins, we need to make sure the artillery is ready. Let us start with some initializations.

import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer

# loading tokenizer from the saved model path
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

# creating the configurations from which the model can be made
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)

We also create a single string from all our documents and tokenize it.

single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)

After we have encoded the whole string, we move on to making a TensorFlow dataset, slicing the data into equal-sized blocks so that our model can learn. Here we use a block size of 100 (the number of tokens in each example) and a batch size of 16. These are kept low so that we can run the training with ease on an RTX 2060 GPU.

examples = []
block_size = 100
BATCH_SIZE = 16
BUFFER_SIZE = 1000

for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
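
The slicing and shifting can be easier to follow on a tiny example. The sketch below repeats the same logic in plain Python on ten made-up token ids with a block size of 4; the function name `make_examples` and the toy ids are illustrative only.

```python
def make_examples(token_ids, block_size):
    """Cut a flat list of token ids into (input, label) pairs."""
    inputs, labels = [], []
    # Step by block_size so blocks do not overlap; a short tail is dropped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        inputs.append(block[:-1])   # everything except the last token
        labels.append(block[1:])    # the same block shifted left by one
    return inputs, labels


toy_ids = list(range(10))  # stand-in for string_tokenized
toy_inputs, toy_labels = make_examples(toy_ids, block_size=4)
print(toy_inputs)
print(toy_labels)
```

Each label is the token that follows the corresponding input position, which is what a language model is trained to predict. Note that every example ends up with block_size - 1 tokens, and the leftover ids 8 and 9 are discarded.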

4. Model Training

Now comes the part we've been waiting for: building and training the model. We define our optimizer, loss function and metric, and start training.

# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
# defining our loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# defining our metric which we want to observe
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
# compiling the model: the loss applies to the logits only; the None entries
# skip the per-layer cached states that the model also returns
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

Now, let's train the model:

num_epoch = 10
history = model.fit(dataset, epochs=num_epoch)
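
For intuition about the two numbers Keras reports during training, here is a pure-Python sketch of what a sparse categorical cross-entropy on logits and a sparse categorical accuracy compute for single token positions. The helper names and the toy logits are mine; Keras additionally averages over all positions in a batch.

```python
import math


def sparse_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def accuracy(batch_logits, targets):
    """Fraction of positions where the highest logit is the target id."""
    hits = 0
    for logits, target in zip(batch_logits, targets):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        hits += int(predicted == target)
    return hits / len(targets)


toy_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]  # two positions, vocabulary of 3
toy_targets = [0, 1]                              # true next-token ids
toy_losses = [sparse_cross_entropy(l, t) for l, t in zip(toy_logits, toy_targets)]
toy_accuracy = accuracy(toy_logits, toy_targets)
print(toy_losses, toy_accuracy)
```

The first position puts most of its probability on the right token and gets a small loss; the second favours a wrong token and is penalised much more heavily.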

5. Prediction

To predict, we simply encode the input text and pass it to the model.

text = "লালমোহনবাবু "

# encoding the input text
input_ids = tokenizer.encode(text, return_tensors='tf')

# getting the output
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
)

# decoding the generated token ids back into text
print(tokenizer.decode(beam_output[0]))

[Screenshot of the output]

Now, if you read Bengali, you may point out that although the output is syntactically correct, it doesn't look cohesive. True, but I have kept this demo as minimal as possible.
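
One of the generation arguments used above, no_repeat_ngram_size=2, keeps any two-token sequence from appearing twice in the output. Here is a rough pure-Python sketch of the idea, not the library's actual implementation; the function name `banned_next_tokens` and the token ids are made up.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token
        # must not be generated next.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


bigram_ban = banned_next_tokens([5, 7, 9, 5], ngram_size=2)
trigram_ban = banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3)
print(bigram_ban, trigram_ban)
```

In the first call the sequence already contains the bigram (5, 7) and currently ends in 5, so 7 is ruled out as the next token. During generation, the scores of such banned tokens would be pushed down so that they cannot be chosen.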

6. Save the Model

Well, after a long training run, what good is it if we close our session and the trained model is simply lost, so that we have to train it again from scratch? So, let's save the model and the tokenizer so that we can resume training from where we left off.

from transformers import WEIGHTS_NAME, CONFIG_NAME
import os

output_dir = './model_bn_custom/'
# creating directory if it is not present
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)
# save tokenizer
tokenizer.save_pretrained(output_dir)

Bonus

We have already done all the hard work, so to load the saved model and tokenizer, we only need to execute two lines of code and we’re all set.

tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)

Voila! Now you can train your own model in your own language, and with enough data and training, create content that can rival some of the best literary works in any language.

Future scope:

This blog gives a framework for how one can train a GPT-2 model in any language. The result is not on par with some of the pre-trained models available, but to reach that state, we need a lot of training data and computational power.

Translated from: https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171
