Writing a Production-Level Machine Learning Framework: Lessons Learned
My wonderful colleagues at Atomwise and I have written a production-level PyTorch framework for training and running deep learning models. Our application centers on drug discovery — predicting whether a molecule inhibits the activity of a protein by binding to a pocket. The goal was to have a stable yet flexible platform to support both machine learning scientists in training and experimenting with models, and medicinal chemists in applying those models as part of their production workflow.
Iterative improvement and “continuous refactoring” are ubiquitous in software development, and we were no exception. The first version was far from perfect; but looking at use cases, feedback, and extended functionality, we have gone through several rounds of refinement. Our design goals were to provide a tool that allows easy experimentation without the need to write code, as well as use within an automation pipeline. This meant we had to strike a balance between allowing enough knobs to conduct meaningful research and limiting the potential for unintentional misconfiguration and confusion. We also optimized for performance and cost-effectiveness within a cloud environment. In this post, I would like to reflect on some general, domain-independent lessons we learned along the way.
1. Do not reinvent the wheel
When we took a first stab at it, we designed the application from scratch, using torch.Tensor arithmetic but few of the other provided utility classes. This turned out to be a mistake: We ran into issues with multi-processing and data queueing that we hadn’t anticipated and that torch.utils.data.DataLoader already addresses. While this Python class might not seem to do much on the surface, it is actually quite elaborate under the hood! And it is flexible enough that it can be adopted without changes in most use cases. It is much simpler to customize by inheriting from the lightweight torch.utils.data.Dataset and torch.utils.data.Sampler instead. For the latter, please see my previous post about our vintage TreeSampler design.
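For illustration, here is a minimal sketch of that customization pattern; ToyDataset and EveryOtherSampler are made-up stand-ins, not classes from our framework:

import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class ToyDataset(Dataset):
    """Hypothetical stand-in for a real molecule dataset."""
    def __init__(self, n):
        self.data = torch.randn(n, 8)           # fake per-example features
        self.labels = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return self.data[i], self.labels[i]

class EveryOtherSampler(Sampler):
    """Toy sampler: yields every second index."""
    def __init__(self, data_source):
        self.n = len(data_source)

    def __iter__(self):
        return iter(range(0, self.n, 2))

    def __len__(self):
        return (self.n + 1) // 2

ds = ToyDataset(100)
# The stock DataLoader takes care of multi-processing and queueing:
loader = DataLoader(ds, batch_size=16, sampler=EveryOtherSampler(ds), num_workers=2)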
Most of our architecture follows common standards — see below for a high-level sketch. The training and scoring loops are encapsulated in an Engine class. Training iterates through minibatches of data; after each epoch, a fixed number of minibatches from a separate test set are used to estimate learning curves and detect possible overfitting. Whenever we improve on (a smoothed average over the most recent) test metrics, we write out a checkpoint as the current best model, which can be used to simulate early stopping. In addition, at (possibly different) configurable intervals, we write checkpoints, record learning statistics into a metrics logger (e.g., Tensorboard or MLFlow), and print messages to log files and to the console. I described the auxiliary Meter class here.
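A condensed sketch of what such an engine loop might look like; the function and file names are illustrative, not our production code:

import torch

def run_training(model, optimizer, loss_fn, train_loader, test_loader,
                 num_epochs=10, test_batches=50):
    """Sketch of the epoch loop: train, test, checkpoint on improvement."""
    best = float('inf')
    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        metric = evaluate(model, loss_fn, test_loader, test_batches)
        if metric < best:                       # improvement on the (smoothed) test metric
            best = metric
            torch.save(model.state_dict(), 'best.pt')       # current best, for early stopping
        torch.save(model.state_dict(), f'ckpt_{epoch}.pt')  # periodic checkpoint

def evaluate(model, loss_fn, loader, n_batches):
    """Estimate the test metric on a fixed number of minibatches."""
    model.eval()
    with torch.no_grad():
        losses = [loss_fn(model(x), y).item()
                  for _, (x, y) in zip(range(n_batches), loader)]
    return sum(losses) / max(len(losses), 1)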
We found that a large part of the engine code was concerned with initializing all the objects needed for a training run (model, data sets, optimizer, scheduler, loss functions, …), so we later factored out a LaunchManager class that is only used during startup.
The transform pipeline typically refers to a sequence of data preprocessing steps, such as selection, format conversion, and random augmentations. It is most commonly suggested to call them in the __getitem__() method of the Dataset. We departed from that by making the transform pipeline part of the collation phase, when a list of examples gets concatenated into a minibatch Tensor. For this purpose, the DataLoader allows passing in an optional collate_fn argument. Thus, Tensor operations can be vectorized over the entire minibatch, resulting in a significant speedup compared to looping over individual examples.
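Here is a sketch of that idea, assuming a dataset that yields (features, label) tuples; the noise augmentation is just a placeholder for a real transform pipeline:

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))  # stand-in dataset

def collate_with_transforms(batch):
    """Concatenate examples into minibatch tensors, then run vectorized transforms."""
    features = torch.stack([f for f, _ in batch])
    labels = torch.stack([y for _, y in batch])
    # Example augmentation applied to the whole minibatch at once,
    # instead of per-example inside __getitem__():
    features = features + 0.01 * torch.randn_like(features)
    return features, labels

loader = DataLoader(ds, batch_size=32, collate_fn=collate_with_transforms)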
2. Use a configuration file
If shell commands have more than a handful of possible options, they become a burden to remember and to type. In practice, users then often end up copy/pasting wrapper shell scripts that essentially serve as configurations. A better solution is to distinguish settings from code by using configuration files right from the start. These come with several added benefits: They allow hierarchical organization in sections, comments, a clean way of managing defaults, and automatic validation tools.
While an interpreted language like Python makes it easy to write configurations in Python itself, I prefer to keep the distinction from code explicit by using a format like yaml or json. In our case, we chose the yaml format, with validation using the ConfigObj package. There are also alternative configuration packages, such as schema, or Facebook’s recently introduced hydra.
ConfigObj manages a separate specification file containing the type and value constraints for each parameter. It is convenient to specify defaults here — the configuration can then omit certain options. For example, when running in scoring mode, we can leave the training section blank — and vice versa. During validation, remember to also check for extra parameters — those specified by the user but unmentioned in the spec. In many cases these are typos, and they would easily go unnoticed otherwise.
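A minimal sketch of this validation pattern with ConfigObj; the file names are made up, and the spec syntax is simplified (see the ConfigObj documentation for details):

from configobj import ConfigObj, get_extra_values
from validate import Validator

config = ConfigObj('experiment.cfg', configspec='configspec.cfg')  # hypothetical paths
result = config.validate(Validator(), preserve_errors=True)
if result is not True:
    raise ValueError(f'Invalid configuration: {result}')
extra = get_extra_values(config)   # parameters in the config but absent from the spec
if extra:
    raise ValueError(f'Unknown parameters (typos?): {extra}')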
We constrain the entire configuration to a single file. While file inclusion, override mechanisms, or having multiple sub-configurations could be simpler to manage when running experiments, we felt that it would not serve the longer term goals of consistency and reproducibility.
We found it useful to implement an intelligent diff-like functionality for configuration files. The function breaks out added, removed, and changed options. This helps to quickly see how new models differ from previously trained ones, and what was changed in fine-tuning runs. In contrast to line-based (unix) diff, it takes into consideration immaterial changes in ordering and implicit default values, and it can be told to ignore certain differences, such as in logging.
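Such a structural diff could look roughly like this; a simplified sketch over plain nested dicts that ignores the default-imputation step:

def config_diff(old, new, prefix=''):
    """Recursively compare two nested config dicts; report added/removed/changed keys."""
    added, removed, changed = [], [], []
    for key in sorted(set(old) | set(new)):
        path = f'{prefix}{key}'
        if key not in old:
            added.append(path)
        elif key not in new:
            removed.append(path)
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            a, r, c = config_diff(old[key], new[key], path + '.')
            added += a; removed += r; changed += c
        elif old[key] != new[key]:
            changed.append(f'{path}: {old[key]!r} -> {new[key]!r}')
    return added, removed, changed

Because the comparison walks keys rather than lines, reordering sections or spelling out a default produces no spurious differences.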
It is easy to have heated debates about the right way to configure and define model architectures — and we had our share as well. First question: What is the appropriate level of specification granularity? The Caffe framework specifies the complete model declaratively, as a list of all the layers with their settings. On the other end of the spectrum, PyTorch usually takes a more procedural approach: we can define the model as a subclass of torch.nn.Module and write out its forward method containing the sequence of computation steps. Both approaches have advantages and disadvantages in terms of flexibility, susceptibility to inadvertent misspecification, amount of code, and cross-experiment tracking and comparison. We wanted to avoid a proliferation of model files that only differ in minor ways, so we opted for a compromise solution: We write out one file per model family, but allow for some variation controlled by options (e.g., the depth and width of layers). This is further facilitated by providing a library of common building blocks, such as multi-layer perceptrons with different activations and optional dropout and normalization layers.
Next question: How do we retrieve these models in the configuration by name (surely we don’t want to refer to the file path)? To this end, we designed a light-weight model registry (you could also call it a factory). When the model code is decorated with a @register tag, it will be recognized and subsequently be accessible in the configuration by its class name, or by a custom given name. Each model file comes with an associated parameter specification. The registry reads and validates these parameters before creating an instance at the start of training.
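The registry pattern itself only takes a few lines; the following is a hedged sketch of the general idea (our real version also attaches the parameter specification to each entry):

import torch

MODEL_REGISTRY = {}

def register(cls=None, *, name=None):
    """Class decorator: make a model constructible by name from the configuration."""
    def wrap(cls):
        MODEL_REGISTRY[name or cls.__name__] = cls
        return cls
    return wrap if cls is None else wrap(cls)

@register
class FooNet(torch.nn.Module):
    def __init__(self, num_layers=5):
        super().__init__()
        self.layers = torch.nn.Sequential(
            *[torch.nn.Linear(8, 8) for _ in range(num_layers)])

    def forward(self, x):
        return self.layers(x)

def build_model(config):
    """Look up the class by its 'type' key and pass the remaining (validated) parameters."""
    params = dict(config)
    cls = MODEL_REGISTRY[params.pop('type')]
    return cls(**params)

model = build_model({'type': 'FooNet', 'num_layers': 3})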
Some models we used are composed of others — multi-task, Siamese, and ensemble networks. For these, it proved useful to allow for multiple named models in the configuration, some of which could be currently unused. For example, an ensemble network could have a structure as follows:
models:
  use_model: ensemblenet_0  # the main model for training
  foonet_0:
    type: FooNet
    # foonet-specific parameters
    num_layers: 5
    ...
  barnet_0:
    type: BarNet
    # barnet-specific parameters
    num_channels: 3
    ...
  ensemblenet_0:
    submodels:
      - foonet_0
      - barnet_0
    combiner:
      type: mlp
      ...
In other words, we are using the Fundamental Theorem of Software Engineering. We started using named components in other parts of the configuration as well. For example, different model types can require different transform pipelines. So if users switched models, they would need to modify the transform section accordingly, which could be tedious and error-prone (in practice, users would most likely tend to keep commenting and uncommenting sections). So we also refer by name to transform pipelines and minibatch sampling settings.
Overall, our current configuration file has grown to about 500 lines. It is organized into the following top-level sections:
- Models: Discussed above
- Data: Location, format, and caching options. In most cases, the raw data will be too large to fit completely in memory. We assume a separate index file; each row contains training labels and possibly other useful meta-data, along with a reference to the data descriptor (usually a file path, though it is possible to use other data sources and protocols). We allow not only a single input, but a list that will be concatenated internally. This is convenient, e.g., for configuring cross-validation runs or for dealing with multiple different input formats
- Loss functions: There can be one loss function for training, and zero to many auxiliary loss functions for periodic testing. For multi-task objectives, we define a composite WeightedSumLoss with a list of atomic sub-losses, each associated with a weight and a column name in the input data (see the sketch after this list)
- Transform pipeline
- Minibatch sampling: See earlier post
- Training: optimizer, scheduler, number of iterations, frequency and number of test iterations, smoothing iterations for early stopping, checkpointing
- Scoring: Test-time augmentation and aggregation, frequency of checkpointing
- Logging: We incorporate Python logging.config in this subsection. As a side note: to capture and merge logging from the DataLoader workers, a torch.multiprocessing.Queue with a logging.handlers.QueueListener has to be used.
- General and operational parameters: CPU and GPU modes, number of worker threads, random seeds, what and how often to write to Tensorboard
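As referenced in the loss functions item above, a composite weighted-sum loss might look roughly like this; a sketch assuming each sub-loss reads its own target column from the minibatch, with hypothetical task names:

import torch

class WeightedSumLoss(torch.nn.Module):
    """Sketch of a composite loss: a weighted sum of per-task sub-losses."""
    def __init__(self, sub_losses):
        # sub_losses: list of (loss_module, weight, target_column) triples
        super().__init__()
        self.sub_losses = sub_losses

    def forward(self, predictions, targets):
        # predictions and targets: dicts mapping column name -> Tensor
        total = 0.0
        for loss_fn, weight, column in self.sub_losses:
            total = total + weight * loss_fn(predictions[column], targets[column])
        return total

# e.g., with made-up task columns 'activity' and 'affinity':
loss_fn = WeightedSumLoss([(torch.nn.BCEWithLogitsLoss(), 1.0, 'activity'),
                           (torch.nn.MSELoss(),           0.5, 'affinity')])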
“All problems in computer science can be solved by another level of indirection … except for the problem of too many layers of indirection.”
— David Wheeler
3. Archive and encapsulate all model artifacts, jointly
The practice of machine learning scientists involves continuously training and evaluating a large number of models in order to figure out the best model architectures, hyper-parameters, data, and sampling configurations. After evaluation, they keep model files around for the sake of comparisons and to be able to do more analysis in the future. You always want to be able to go back if you need to and have a good understanding of how exactly you arrived at an experimental result; and you don’t exactly know today which question you will ask tomorrow, in a month or in a year from now.
Archiving doesn’t only apply to the raw model files, but to all artifacts that an experimental run might produce: logging and Tensorboard output, checkpoints, visualizations, and so on. One way of keeping experiments organized is a systematic directory structure, such as using path names that are mnemonic of the parameter differences that were explored, compared to a base model. While many machine learning scientists naturally tend toward such an organization, why not do it for them and enforce it?
Therefore, we keep all training-related artifacts in a directory dedicated to a specific run:
- Model checkpoints: At regular intervals, plus the current best ones for early stopping, one for each test metric
- Train, test, and performance (time and memory) statistics in Tensorboard (or MLFlow, …) format
- A copy of the original configuration file, with and without the default values imputed according to the specification
- The output log file
- A copy of the checkpoint, if we started from a pretrained model
For a fresh training run, an empty directory is created. It is supposed to be modified only by our command-line interface script, never directly by the user. This reduces possible errors, and simplifies the engine code as we can assume a layout with predefined names and structure. Our etiquette is to treat the model directory like a black-box archive.
4. Architect around typical workflows
How do we design the command line interface so that it is simple and robust enough to be used both by a machine learning scientist conducting experiments, and as part of a standard workflow for a domain expert who is not necessarily a programmer (such as a medicinal chemist, in our case)? I discussed above the goal of controlling the program behavior declaratively using a configuration file. In contrast, and to avoid confusion, we want to keep the command line options minimal. Essentially, the script will need to know only three things:
- The path to the model directory
- The configuration file: It would be an undue burden on the user to always have to create a new directory containing a single file. We allow the configuration to be copied to the model directory from any original location.
- The execution mode: train or score. During scoring, an output directory is specified to contain the file with the model predictions. Certain input columns are copied from the input file. We integrated test set augmentation — the model can generate multiple output scores from a single input row. The configuration can specify columns to define aggregation groups, so that we end up with a single row but possibly multiple statistics over this set (e.g., the mean and standard deviation of scores — think pandas.DataFrame.groupby).
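A command-line surface along these lines might look as follows; the argument names are illustrative, not those of our actual script:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Train or score a model.')
    parser.add_argument('mode', choices=['train', 'score'])
    parser.add_argument('model_dir', help='dedicated directory for this run')
    parser.add_argument('--config', help='configuration file; copied into model_dir')
    parser.add_argument('--output-dir', help='where scoring writes predictions')
    return parser.parse_args()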
Sometimes we want to initialize model weights from an existing model instead of randomly. This is true for fine-tuning, but also if we just want to continue training with additional iterations. Should we track this process in the existing model directory, or create a new one? Or, should the behavior depend on the significance of the configuration change (e.g., allowing the user to add more iterations with otherwise unchanged configuration in the same directory)? This question is reminiscent of the Ship of Theseus. Our discussions ultimately settled on always requiring a new directory. For reference, the old model checkpoint is copied; but after initialization, the training run is treated exactly the same as any other.
We also had to take into consideration that training and scoring is mostly run within a cloud environment. We often provision spot instances for cost discounts, but that means we have to be ready for interruptions at any time. Evidently training should not start again from scratch, but resume from a point where it left off. We call this separate scenario auto-restart. It should be as frictionless and transparent to the user as possible; so it should work automatically by re-issuing the original command, without any changes.
At regular intervals, we save checkpoints comprising the model, optimizer, and scheduler state; note that for sampling purposes, it is necessary to save the states of all random number generators as well. Remember to delete older checkpoints only after successfully writing the current one, to deal with corruption. At auto-restart, we check for the most recent valid one.
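A sketch of what such a resumable checkpoint could contain; the key names are invented for illustration, and writing to a temporary file first is one way to guard against corruption:

import os
import random
import numpy as np
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    """Save everything needed to resume training transparently after an interruption."""
    state = {
        'step': step,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        # RNG states, so that sampling continues exactly where it left off:
        'rng_python': random.getstate(),
        'rng_numpy': np.random.get_state(),
        'rng_torch': torch.get_rng_state(),
        'rng_cuda': torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }
    torch.save(state, path + '.tmp')
    os.replace(path + '.tmp', path)  # only replace once the write has fully succeeded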
As mentioned above, we want to prevent users from accidentally overwriting previous experiments. So how do we distinguish that case from auto-restart? Configuration files outside of the model directory could have been modified in the meantime, and the model might even continue under a different absolute path. This is where a function for comparing configurations comes in handy again. If the contents of the configuration file specified on the command line agree with the one archived in the model directory, the script assumes it has been auto-restarted; otherwise, we warn the user and abort.
Auto-restart does not only apply to training; sometimes we need to score huge files with tens of millions of rows. To this end, the Scorer creates a temporary directory named after the output file and saves batches of scores at regular intervals. Finally, all these partial results are re-read, concatenated, aggregated, and formatted appropriately before being written back to a single output file.
5. Design for as much reproducibility as possible (but not more)
Capturing all model-related artifacts in a dedicated directory is one step towards reproducibility, but only the first.
Software version tracking
For later reference, it is helpful to print important meta-data to the log file. We use the versioneer package for tracking the software version; it retrieves the most recent git tag and hash string and makes it available as a Python string. We set our package __VERSION__ string accordingly, use it to mark our internal PyPI releases, and print it to the log output for every run. We also print the version numbers of important packages, such as torch.
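In its simplest form, this version logging amounts to a few lines at startup; mypackage is a placeholder for one's own versioneer-enabled package:

import logging
import torch
import mypackage  # hypothetical: our own package, versioned via versioneer

logging.info('mypackage version: %s', mypackage.__version__)
logging.info('torch version: %s', torch.__version__)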
Deterministic execution
It is well known that GPU computation is inherently non-deterministic. Even special CUDA execution flags cannot eliminate these effects of parallel execution in all cases. Nevertheless, it can be worthwhile to make training as reproducible as possible. To this end, we provide options to seed all random number generators: for the Python random package, for numpy, for torch CPU operation, and for each GPU instance. Note that the PyTorch DataLoader spawns multiple processes, each of which comes with its own independent set of random number generators. As described in more detail in this earlier post, we can control this effectively by pre-generating all required random numbers in the single main thread dedicated to sampling, and tacking them on to each example's data.
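The seeding options boil down to something like the following; the cuDNN flags are one common way to further reduce nondeterminism, at some speed cost:

import random
import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # numpy RNG
    torch.manual_seed(seed)           # torch CPU (and current GPU) RNG
    torch.cuda.manual_seed_all(seed)  # all GPU instances
    # Optional: trade speed for more deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False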
Testing
Test-driven development is an effective, generally applicable guideline for software development. Accordingly, we strive to cover most of our code base with appropriate unit tests. We also maintain a few integration tests that simulate the different envisioned cases of a complete workflow: training from scratch, auto-restart, scoring, and retuning, with empty or existing model directories. What is specific to machine learning is that it is very easy to introduce a subtle bug and not recognize it for some time: as long as it doesn't lead to a catastrophic breakdown, model training is designed to compensate for any data flaws to some degree. Therefore, we found it extremely useful to automate a nightly training regression job, running for a hundred thousand iterations with our latest commits and typical model architectures. Based on previous runs, we set expectations and tolerances for various metrics at specified iterations, to alert us to any change in behavior, be it positive or negative.
6. And finally: Don't let the perfect be the enemy of good
A common, well-known pitfall is premature optimization: optimizing a part of the software that is not really that critical in terms of either functionality or performance. Profiling and selective simplification (e.g., running with a trivial model or with trivial data) can hint at what is most important in the overall picture. Conversely, sometimes inconspicuous pieces can have outsize effects. As an example, profiling revealed that the following line in the training loop led to a slowdown due to unnecessary GPU synchronization:
if not torch.isfinite(y_hat).all():
    raise OverflowError('Training error is NaN')
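The reason is that evaluating .all() inside an if statement needs the boolean result on the host, so the CPU has to wait for all pending GPU kernels to finish. One possible mitigation, shown here purely for illustration, is to run such a guard only every so often:

if step % 100 == 0 and not torch.isfinite(y_hat).all():
    raise OverflowError('Training error is NaN')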
On our journey, we have gone down a few paths that in hindsight turned out to be rabbit holes. To name just one example: We went too far in trying to protect existing model directories from accidental overwriting, as discussed above. We had a scheme of write-protecting files, but we didn't want users to have to change permissions explicitly, and at the same time, it had to work within an automation framework where state was stored in permission-less S3 files. This required a lot of logic to cover the different use cases, but there are always new cases that you didn't anticipate. The design was complicated and brittle, and in the end just not worth it for the purported benefit.
If software gets too complicated to manage, users will get frustrated — but they will do so too if it doesn’t provide at least a basic version of their most desired features. It is a continual process of filtering and prioritizing requests, bundling them into a common, minimal interface.
As software grows and matures, there is always a delicate balance between robustness and convenience on the one side, and new features and flexibility on the other. As much as we developers strive to satisfy as many users as possible, we can never make everyone perfectly happy.
Acknowledgements
Thanks to Brandon Anderson, Bastiaan Bergman, Jared Thompson, and Greg Friedland!
Originally published at https://towardsdatascience.com/writing-a-production-level-machine-learning-framework-lessons-learned-195ce21dd437