Generate Fresh Movie Stories for Your Favorite Genre with Deep Learning
After discovering time travel, the Earth’s inhabitants now live in futuristic cities, which are controlled by the government, for the duration of a decade. The government plans to send two elite teams of scientists to the city, in order to investigate the origin of these machines and discover the existence of the “God”.
Wouldn’t it be fun to generate stories for your favorite genre? That’s what we’ll be doing today. We’ll learn how to build a story generator that creates stories based on genres, just like the one that created the sci-fi story above (the user-provided input is shown in bold in the story above).
You can also use this post as a launching pad to develop your own text generator. For example, you can generate headlines of topics like Tech, Science, and Politics. Or generate lyrics of your favorite artists.
Why Story Generation?
As an avid movies and TV-Shows fan, I loved the idea of a story-generator that would generate stories based on genres, input prompts, or even titles. After learning about GPT-2, I wanted to bring that idea to life. That’s what led me to build this model.
Intended use: To have fun and test this idea of fusion storytelling where we can generate stories by blending our creativity (by providing a prompt) with the model’s creativity (by generating the rest of story using the prompt).
Trailer Time: Test out the Story Generator!
Before taking a deep dive into building the generator, let’s first give story generation a shot. Check out my story generator at this Huggingface link or run the cells in this Colab notebook to generate stories. The model input format is of the form:

<BOS> <genre> Optional starting text for the story

e.g.

<BOS> <sci_fi> After discovering time travel,

where <genre> is one of: <superhero>, <sci_fi>, <action>, <drama>, <horror>, <thriller>
The model will generate its story using this prompt. There’s also a more intuitive way to generate stories: a web-app using my model in action (keep in mind this app’s generation is slower than the Huggingface link).
Now that you’re done experimenting with the model, let’s explore the idea: We’re fine-tuning the OpenAI GPT-2 model on a dataset that contains movie plots of different genres. This walkthrough follows a Three-Act structure:
Act I: What is GPT-2?
Act II: Fine-Tuning Time…
Act III: Generation Time!
Act I: What is GPT-2?
If you’re familiar with the key ideas of GPT-2, you can fast-forward to the next act. Otherwise, get a quick refresher on GPT-2 below.
Recently, we saw the humongous hype behind text-generation model GPT-3 and its dazzling performance for tasks like generating code using zero-shot learning. GPT-2 is the predecessor to GPT-3.
Wait, why are we using GPT-2 and not GPT-3? Well, it’s in Beta. And…
Just like its hype, GPT-3’s size is also gigantic (rumored to require 350 GB of RAM). While still bulky, GPT-2 takes up much less space (largest variant takes up 6GB disk space). And it even comes in different sizes. So, we can use it on many devices (including iPhones).
If you’re looking for a detailed overview of GPT-2, you can get it straight from the horse’s mouth.
But, If you just want a quick summary of GPT-2, here’s a distilled rundown:
GPT-2 is a text-generation language model using a Decoder-only Transformer (a variant of Transformers architecture). If this seems like mumbo-jumbo, just know that Transformer is a state-of-the-art architecture for NLP models.
It was pre-trained (on 40GB of text data) on the task of predicting the next word (more formally, token) at each time-step, given the preceding text as input.
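To make this objective concrete, here’s a minimal sketch (assuming transformers and torch are installed, as in the setup later in this post) of computing GPT-2’s next-token prediction loss on a made-up sentence:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Passing labels=input_ids makes the model score its own next-token
# prediction at every time-step, which is exactly the pre-training objective.
inputs = tokenizer("The dragon guarded its gold.", return_tensors="pt")
with torch.no_grad():
    output = model(**inputs, labels=inputs["input_ids"])
print(output.loss)  # average next-token prediction loss over the sequence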
Act II: Fine-Tuning Time…
We can fine-tune (further train) pre-trained models like GPT-2 on a chosen dataset to tailor their performance towards that dataset. To fine-tune and use the GPT-2 pre-trained model, we will use the huggingface/transformers library. It does all the heavy-lifting for us.
For this idea, I created the dataset files by cleaning, transforming and combining the Kaggle Wikipedia Movie Plots dataset along with scraped superhero comic book plots from Wikipedia.
The training file has over 30,000 stories. Each line in the file is a story of this format:

<BOS> <genre> Story plot goes here... <EOS>

The genres: <superhero>, <sci_fi>, <horror>, <action>, <drama>, <thriller>
To create your own dataset for another task, like generating research paper abstracts based on subjects, each example can follow a similar format:

<BOS> <subject> Abstract goes here... <EOS>

The subjects: <physics>, <chemistry>, <computer_science>, etc.
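As a sketch of what building such a training file could look like (the example plots and file name below are made up for illustration):

examples = [
    ("sci_fi", "After discovering time travel, a physicist tries to undo her first trip."),
    ("horror", "A lighthouse keeper slowly realizes the fog never lifts for a reason."),
]

# One story per line: <BOS> <genre> ... <EOS>
with open("my_training_data.txt", "w", encoding="utf-8") as f:
    for genre, plot in examples:
        f.write(f"<BOS> <{genre}> {plot} <EOS>\n")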
I recommend that you train the model on Google Colab (Set runtime to GPU). The link to the colab notebook for fine-tuning our model is here. You can create a copy of this notebook and modify it for your own dataset as well.
Here’s the 6_genre_clean_training_data.txt (training file) and 6_genre_eval_data.txt (evaluation file) that you need to upload to your Colab notebook’s environment before running the code.
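If you’d rather upload the files from code than through the Colab sidebar, here’s a quick sketch using Colab’s built-in files helper:

# Run inside a Colab cell: opens a file picker for uploading local files
from google.colab import files

uploaded = files.upload()
print(list(uploaded.keys()))  # names of the uploaded files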
I will go through some of the core code from the Colab notebook in this post.
!pip install transformers torch
Above, we are installing the core libraries. We need the transformers library to load the pre-trained GPT-2 checkpoint and fine-tune it. We need torch because it is used for training.
Note: I took the code from this transformers example file and modified it.
def main():
    # Arguments below follow the transformers language-modeling example script
    model_args = ModelArguments(
        model_name_or_path="gpt2", model_type="gpt2"
    )
    data_args = DataTrainingArguments(
        train_data_file="6_genre_clean_training_data.txt",
        eval_data_file="6_genre_eval_data.txt",
        line_by_line=True,   # each line of the file is a separate story
        block_size=512,      # truncate examples to 512 tokens
        overwrite_cache=True,
    )
    training_args = TrainingArguments(
        output_dir="story_generator_checkpoint",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        evaluate_during_training=False,
        logging_steps=500,
        per_device_train_batch_size=4,
        num_train_epochs=3,
        save_total_limit=1,   # keep only the latest intermediate checkpoint
        save_steps=1000,
    )
In the code snippet above, we specify the model’s arguments (ModelArguments), data arguments (DataTrainingArguments), and training arguments (TrainingArguments). Let’s quickly go through the key arguments for these parameters. To properly check out all parameters, look here.
ModelArguments
model_name_or_path: Used to specify the model name or its path. The model is then downloaded to our cache directory. Check here for all available models. Here, we specify the model_name_or_path as gpt2. We also have other options like gpt2-medium or gpt2-xl.
model_type: We are specifying that we want a gpt2 model. This is different from the above parameter because, we only specify the model type, not the name (name refers to gpt2-xl, gpt2-medium, etc.).
DataTrainingArguments
train_data_file: We provide the training file.
eval_data_file: We provide a data file to evaluate the model (using perplexity). Perplexity measures how likely our language model is to generate the text in eval_data_file; the lower the perplexity, the better (see the short sketch after this list).
line_by_line: Set to true since each line in our file is a new story.
block_size: Truncate each training example so that it has a maximum of block_size tokens.
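As mentioned under eval_data_file, perplexity comes from the average evaluation loss. Here’s a minimal sketch, assuming trainer is the Trainer object we build later in the notebook:

import math

# Perplexity is the exponential of the average next-token loss on eval data
eval_output = trainer.evaluate()
perplexity = math.exp(eval_output["eval_loss"])
print(perplexity)  # lower is better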
TrainingArguments
Below, n refers to the value of these parameters.
output_dir: Where to save the final model checkpoint after fine-tuning.
do_train, do_eval: Set to true since we are training and evaluating.
logging_steps: Log the model’s loss after every n optimization steps.
per_device_train_batch_size: Each training optimization step involves n number of training examples.
num_train_epochs: Number of full passes through the training dataset.
save_steps: Save an intermediate checkpoint after every n optimization steps (recommended if Colab keeps disconnecting after a couple of hours).
save_total_limit: Number of intermediate checkpoints to store at any point.
Now, it’s time to load the model, its configuration and the tokenizer.
Loading Model, Tokenizer and Config
# Inside main function
config = AutoConfig.from_pretrained(
    model_args.model_name_or_path, cache_dir=model_args.cache_dir
)
tokenizer = AutoTokenizer.from_pretrained(
    model_args.model_name_or_path, cache_dir=model_args.cache_dir
)
model = GPT2LMHeadModel.from_pretrained(
    model_args.model_name_or_path,
    from_tf=bool(".ckpt" in model_args.model_name_or_path),
    config=config,
    cache_dir=model_args.cache_dir,
)
The classes AutoConfig, AutoTokenizer, and GPT2LMHeadModel load the respective configuration (GPT2Config), tokenizer (GPT2Tokenizer), and model (GPT2LMHeadModel) based on the model name.
The GPT2Tokenizer object tokenizes text (converts text into lists of tokens) and encodes these tokens into numbers. Note that tokens can be words, or even subwords (GPT-2 uses Byte Pair Encoding to create tokens). Below is an example of tokenization (without encoding).
"""
The GPT2 Tokenizer tokenizes text into tokens (where token can be a word or subword)
"""
# Let tokenizer be a GP2Tokenizer object
print(tokenizer.tokenize("This is working"))
Output: ["This", "Ġis", "Ġworking"] # Example where all tokens are words
# "Ġ" indicates start of new word following a space.
print(tokenizer.tokenize("Tokenizer is working"))
Output: ["Token", "izer", "Ġis", "Ġworking"] # Example with token "izer" being a subword
After this, we have to encode the tokens into numbers since computers only deal with numbers.
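For example, a quick sketch of the encoding and decoding steps with the same tokenizer object:

# Encode text into token ids (numbers), then decode the ids back into text
ids = tokenizer.encode("Tokenizer is working")
print(ids)                    # a list of integer ids, one per token
print(tokenizer.decode(ids))  # "Tokenizer is working"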
The GPT2Config object is used to load the configuration of the GPT-2 model based on the model name.
Finally, the GPT2LMHeadModel object loads the model based on the model name and the GPT2Config we have.
Some of you might be wondering: what exactly is an “LMHeadModel”? Simply put, for text-generation, we need a language model (a model that assigns probability distribution to tokens in vocabulary). GPT2LMHeadModel is a language model (it assigns scores to each token in the vocabulary). So, we can use this model for text-generation.
The next main step is adding special tokens to integrate the model with our data.
# Inside our main function (these are only snippets)
special_tokens_dict = {
    "bos_token": "<BOS>",
    "eos_token": "<EOS>",
    "pad_token": "<PAD>",
    "additional_special_tokens": [
        "<superhero>",
        "<sci_fi>",
        "<horror>",
        "<action>",
        "<drama>",
        "<thriller>",
    ],
}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
# Resize the embeddings so the newly added tokens get embedding vectors
model.resize_token_embeddings(len(tokenizer))
The datasets have some special tokens, and we need to specify to the model what these tokens are. We can pass in a special_tokens_dict which can have some keys like “bos_token”. To see what keys are allowed, visit this link.
bos_token (“<BOS>”) is the token that appears at the start of each story. eos_token (“<EOS>”) is the token that appears at the end of each story. pad_token (“<PAD>”) refers to the padding token used to pad shorter outputs to a fixed length.
However, we have some extra special tokens (our genre tokens in this case) that don’t have their own keys in our special_tokens_dict. These extra tokens can be placed as a list under one collective key “additional_special_tokens”.
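Once added, the tokenizer treats each genre tag as a single, indivisible token rather than splitting it into subwords. A quick sketch of checking this (the sentence is made up, and the exact subword split of the remaining words may vary):

# The genre tag survives as one token instead of being split into subwords
print(tokenizer.tokenize("<superhero> The city needs a hero"))
# e.g. ["<superhero>", "ĠThe", "Ġcity", "Ġneeds", "Ġa", "Ġhero"]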
Replace the genre tokens with your own categories if you’re working with your own dataset.
We’re ready for training, so let’s create the Trainer object.
# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    prediction_loss_only=True,
)

# Training
if training_args.do_train:
    model_path = (
        model_args.model_name_or_path
        if model_args.model_name_or_path is not None
        and os.path.isdir(model_args.model_name_or_path)
        else None
    )
    trainer.train(model_path=model_path)
    trainer.save_model()
    tokenizer.save_pretrained(training_args.output_dir)
The Trainer object can be used for training and evaluating our models.
To create the Trainer object, we specify the model, and datasets along with training arguments (as shown above). The data_collator parameter is used for batching training and evaluation examples.
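The snippet above assumes train_dataset, eval_dataset, and data_collator already exist. Here’s a sketch of how they can be built, mirroring the transformers example script this code was adapted from:

from transformers import DataCollatorForLanguageModeling, LineByLineTextDataset

# Each line of the file becomes one training example, truncated to block_size
train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=data_args.train_data_file,
    block_size=data_args.block_size,
)
eval_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=data_args.eval_data_file,
    block_size=data_args.block_size,
)
# mlm=False means causal language modeling (predict the next token), not masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)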
We then call the Trainer’s train method and save the model to our output directory after training. It’s time to relax and let the machine train the model.
After training the model, download the model checkpoint folder from Colab onto your computer. You can deploy it as a Huggingface community model, develop a web app, or even develop a mobile app using the model.
Act III: Generation Time!
Here’s the exciting part! Run this cell below in your Colab notebook if you’re itching to generate stories!
# Slightly simplified from the Colab cell to show in the post
checkpoint = "story_generator_checkpoint"
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
story_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)

input_prompt = "<BOS> <superhero> Batman"
story = story_generator(input_prompt, max_length=100, do_sample=True,
                        repetition_penalty=1.1, temperature=1.2,
                        top_p=0.95, top_k=50)
print(story)
Here, we are using the TextGenerationPipeline object, which simplifies text-generation with our model since it abstracts away the input pre-processing and output post-processing steps.
story_generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)

The story_generator is an object of the TextGenerationPipeline class. How are we able to use this object as a function? This class has a __call__ method that allows its objects to behave like functions (this method is called when the object is used as a function). Here, we use this object like a function to generate text by providing the input_prompt. That’s all there is to it.
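To see __call__ in isolation, here’s a tiny standalone sketch, unrelated to transformers:

class Doubler:
    # Defining __call__ lets instances be invoked like functions
    def __call__(self, x):
        return 2 * x

double = Doubler()
print(double(21))  # prints 42; Python invokes double.__call__(21)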
We still haven’t explained some text-generation parameters like top_p. Stick around after the end-credits if you’d like to learn about these parameters.
Epilogue and End Credits
To recap, we took a dataset of movie plots of various genres and fed it to the GPT2LMHeadModel to fine-tune our model to generate genre-specific stories. Using these ideas, you can also create other datasets to generate text based on those datasets as well.
To test my movie story generator, click here (or use the web app here).
A huge thank you to the Huggingface Team for their transformers library and detailed examples, and to this article by Raymond Cheng for helping me create my model.
Optional: Exploring Text-Generation Parameters
First, let’s learn what the output of the GPT2LMHeadModel is:
The model produces logits for all the possible tokens in its vocabulary when predicting the next token. To simplify, think of these logits as scores for each token: tokens with higher logits are more likely to be an appropriate next token. We then apply a softmax operation on the logits of all tokens, giving each token a softmax score between 0 and 1; the softmax scores of all tokens sum to 1. These softmax scores can be treated as probabilities (although they technically aren’t) of a token being an appropriate next token given the preceding text.
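Here’s a short sketch of inspecting those logits and softmax scores directly (the prompt is made up):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The knight drew his", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for every vocabulary token

probs = torch.softmax(logits, dim=-1)  # softmax scores sum to 1
top = torch.topk(probs, 5)             # the five highest-scoring tokens
for score, idx in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(idx))), round(float(score), 3))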
Below is a basic outline of the parameters:
max_length [int]: Specifies the maximum number of tokens to generate.
do_sample [bool]: Specifies whether to use sampling (such as top_p) or greedy search. Greedy search selects the most probable token at each time step (not recommended, as the text can get repetitive).
repetition_penalty [float]: Specifies the penalty for repetition. Increase the repetition_penalty parameter to reduce repetition.
temperature [float]: Used to increase or decrease reliance on high-logit tokens. Increasing the temperature value reduces the chance that only a few tokens have very high softmax scores (see the sketch after this list).
top_p [float]: Specifies that we only consider the most probable tokens whose cumulative probability (formally, softmax) score does not exceed the value of top_p.
top_k [int]: Tells the model to only consider the top_k tokens (ranked by their softmax scores) when generating text.
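To see temperature’s effect numerically, here’s a small standalone sketch with made-up logits for four tokens:

import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])  # made-up scores for four tokens
for t in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / t, dim=-1)
    print(t, [round(float(p), 3) for p in probs])
# Higher temperature flattens the distribution, shifting probability mass
# away from the highest-logit token and making sampling more adventurous.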
And that’s it.
Original article: https://towardsdatascience.com/generate-fresh-movie-stories-for-your-favorite-genre-with-deep-learning-143da14b29d6