Foreword: PaddlePaddle is committed to making it simpler to innovate with and apply deep learning. For single-machine training speed, it optimizes static-graph training performance through a highly parallel, low-overhead asynchronous execution strategy and highly efficient core operators. In the Paddle Fluid v1.5.0 benchmark, seven representative models were tested (five from computer vision, two from NLP): five of them run significantly faster than the competing framework (more than 15% faster), and the other two are on par (within 5%). If you want faster single-machine training, this document walks through the common optimizations in three areas: network construction, data preparation, and model training. First, a look at the test numbers.
| # | Model | Competing framework | PaddlePaddle throughput | Competitor throughput | Throughput gain, PaddlePaddle vs. competitor (%) |
|---|-------|---------------------|-------------------------|-----------------------|--------------------------------------------------|
| 1 | DeepLab V3+ | TensorFlow | 13.70 examples/s | 6.40 examples/s | +113.98% |
| 2 | YOLOv3 | MXNet | 29.90 examples/s | 18.58 examples/s | +60.95% |
| 3 | BERT | TensorFlow | 4.04 steps/s | 3.42 steps/s | +18.23% |
| 4 | Mask-RCNN | PyTorch | 3.81 examples/s | 3.24 examples/s | +17.62% |
| 5 | CycleGAN | TensorFlow | 7.51 examples/s | 6.45 examples/s | +16.44% |
| 6 | SE-ResNeXt50 | PyTorch | 168.33 examples/s | 163.13 examples/s | +3.19% |
| 7 | Transformer | TensorFlow | 4.87 examples/s | 4.75 examples/s | +2.42% |
PaddlePaddle version: 1.5.0
TensorFlow version: 1.12.0
PyTorch version: 1.1.0
MXNet version: 1.4.1
GPU: Tesla V100-SXM2, single-GPU mode
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 38 cores
NVIDIA driver: 418.39
cuDNN version: 7.4.2.24
CUDA version: 9.0.176
```python
import numpy as np

def data_reader(width, height):
    # Returns a generator that yields (feature, label) samples indefinitely
    def reader():
        while True:
            yield np.random.uniform(-1, 1, size=width * height), \
                  np.random.randint(0, 10)
    return reader

train_data_reader = data_reader(32, 32)
```
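The reader above yields one sample at a time; before training it is typically wrapped with paddle.batch, as in this usage sketch (the batch size of 128 is just an illustration):

```python
batched_reader = paddle.batch(train_data_reader, batch_size=128)
for data in batched_reader():
    # data is a list of 128 (feature, label) tuples
    ...
```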
```python
image = fluid.layers.data("image", ...)
label = fluid.layers.data("label", ...)
# Model definition
# ...
prediction = fluid.layers.fc(input=image, size=10)
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
# ...
# Data reading
# paddle.dataset.mnist.train() returns a Reader that yields one sample
# at a time; paddle.batch wraps it into batches of 128
train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)

end = time.time()
for batch_id, batch in enumerate(train_reader()):
    data_time = time.time() - end  # time spent reading this batch
    # Train the network
    executor.run(feed=[...], fetch_list=[...])
    batch_time = time.time() - end
    end = time.time()
```
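If data_time turns out to be a large share of batch_time, data preparation is the bottleneck in this synchronous scheme, and the asynchronous py_reader approach below is the natural next step.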
```python
train_py_reader = fluid.layers.py_reader(
    capacity=10,
    shapes=((-1, 784), (-1, 1)),
    dtypes=('float32', 'int64'),
    name="train_reader",
    use_double_buffer=True)
# Use read_file() to obtain the model inputs from the py_reader
image, label = fluid.layers.read_file(train_py_reader)
# Model definition
# ...
prediction = fluid.layers.fc(input=image, size=10)
loss = fluid.layers.cross_entropy(input=prediction, label=label)
avg_loss = fluid.layers.mean(loss)
# ...
# Data reading
train_reader = paddle.batch(paddle.dataset.mnist.train(), 128)
train_py_reader.decorate_paddle_reader(train_reader)
# Start the py_reader
train_py_reader.start()
try:
    end = time.time()
    while True:
        print("queue size: ", train_py_reader.queue.size())
        loss, = executor.run(fetch_list=[...])
        # ...
        batch_time = time.time() - end
        end = time.time()
        batch_id += 1
except fluid.core.EOFException:
    train_py_reader.reset()
```
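The queue size printout is a quick diagnostic: if it stays near zero, the Python-side reader cannot keep up with the network and the reader itself needs optimizing; if it stays near capacity, data reading is not the bottleneck.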
| Executor | Execution object | Execution strategy |
|----------|------------------|--------------------|
| Executor | Program | Runs the Operators one by one, in the order they are defined in the Program |
| ParallelExecutor | SSA Graph | Runs the Graph's nodes with multiple threads, following the dependencies between them |
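For contrast, a minimal sketch of the plain-Executor path (against the Fluid 1.5-era API; `avg_loss` is the loss variable from the model snippets above, and the feed values are placeholders):

```python
place = fluid.CUDAPlace(0)                 # or fluid.CPUPlace() for CPU training
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())   # initialize parameters once
# Each call runs the Program's Operators sequentially, in definition order
loss_val, = exe.run(fluid.default_main_program(),
                    feed={"image": ..., "label": ...},
                    fetch_list=[avg_loss.name])
```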
```python
build_strategy = fluid.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.fuse_all_optimizer_ops = True

exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_threads = 4

train_program = fluid.compiler.CompiledProgram(main_program).with_data_parallel(
    loss_name=loss.name,
    build_strategy=build_strategy,
    exec_strategy=exec_strategy)

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
# Data is read through py_reader, so no feed is needed at run time
fetch_outs = exe.run(train_program, fetch_list=[loss.name])
```
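Note that with_data_parallel uses every GPU visible to the process by default; set the CUDA_VISIBLE_DEVICES environment variable to restrict which cards are used.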
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| reduce_strategy | fluid.BuildStrategy.ReduceStrategy | fluid.BuildStrategy.ReduceStrategy.AllReduce | Whether data-parallel training runs in AllReduce mode or Reduce mode. |
| enable_backward_optimizer_op_deps | bool | False | Adds dependencies between backward ops and parameter-update ops, so that updates only start after all backward ops have finished. |
| fuse_all_optimizer_ops | bool | False | Fuses the model's parameter-update ops. |
| fuse_all_reduce_ops | bool | False | Fuses the all_reduce ops in multi-GPU training. |
| fuse_relu_depthwise_conv | bool | False | If the model contains directly connected relu and depthwise_conv ops (relu -> depthwise_conv), fuses each such pair into a single op. |
| fuse_broadcast_ops | bool | False | In Reduce mode, fuses the trailing Broadcast ops into one. |
| mkldnn_enabled_op_types | list | {} | For CPU training, specifies which ops may use the MKL-DNN library. By default, every op in the model that is on PaddlePaddle's list of MKL-DNN-capable ops calls the MKL-DNN implementation. |
| debug_graphviz_path | str | "" | Writes the Graph in graphviz format to the file specified by debug_graphviz_path. |
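The options are plain attributes on the strategy object. A short sketch of how they are set (the particular values here are illustrative, not recommendations):

```python
build_strategy = fluid.BuildStrategy()
build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
build_strategy.fuse_broadcast_ops = True             # only meaningful in Reduce mode
build_strategy.fuse_relu_depthwise_conv = True       # fuse relu -> depthwise_conv pairs
build_strategy.debug_graphviz_path = "./graph_dump"  # dump the Graph for inspection
```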
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| num_iteration_per_drop_scope | int | 1 | Number of iterations between cleanups of the local execution scope. |
| num_threads | int | Empirical value: 2 * dev_count for CPU; 4 * dev_count for GPU | Size of the thread pool that ParallelExecutor uses to run all Ops. |
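Applying the rules of thumb above could look like the following sketch (`use_gpu` is an assumed boolean; fluid.core.get_cuda_device_count() and multiprocessing.cpu_count() stand in for dev_count):

```python
import multiprocessing

exec_strategy = fluid.ExecutionStrategy()
if use_gpu:
    exec_strategy.num_threads = 4 * fluid.core.get_cuda_device_count()
else:
    exec_strategy.num_threads = 2 * multiprocessing.cpu_count()
# Clear the local execution scope only every 100 iterations instead of every one
exec_strategy.num_iteration_per_drop_scope = 100
```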
A minimal data-parallel CompiledProgram can also be built directly from the default main program:

```python
compiled_prog = compiler.CompiledProgram(
    fluid.default_main_program()).with_data_parallel(
        loss_name=loss.name)
```

Fluid also provides several FLAGS that help with performance optimization:

(1) FLAGS_cudnn_exhaustive_search makes cuDNN convolution calls exhaustively search the algorithm library, based on the input shape and other information, for a faster convolution algorithm, thereby speeding up the model's convolution ops. Note that the exhaustive search itself costs extra time and GPU memory.
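In Fluid 1.x these FLAGS are read from environment variables when the framework initializes, so one way to enable the search is the following sketch (the flag must be set before paddle is imported):

```python
import os
os.environ['FLAGS_cudnn_exhaustive_search'] = '1'  # must precede the paddle import
import paddle.fluid as fluid
```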
The optimization steps below were applied cumulatively, each on top of the previous ones:

(1) Baseline version
(2) Set exec_strategy.num_threads = device_count
(3) Set exec_strategy.num_iteration_per_drop_scope = 100
(4) Set build_strategy.enable_inplace = True and build_strategy.memory_optimize = False
(5) Set build_strategy.fuse_all_optimizer_ops = True
(6) Use py_reader for asynchronous data reading
(7) Configuration optimizations:
- Set inplace=True in reshape
- Replace multiple slice ops with a single split (before/after shown below)
Before:

```python
for index in range(len):
    input = layers.slice(input_embedding, axes=[1],
                         starts=[index], ends=[index + 1])
    ...
```
After:

```python
sliced_inputs = layers.split(input_embedding, num_or_sections=len, dim=1)
for index in range(len):
    input = sliced_inputs[index]
    ...
```
- Reduce the number of reshape ops

Before:

```python
for index in range(len):
    ...
    res.append(layers.reshape(input, shape=[1, -1, hidden_size]))
real_res = layers.concat(res, 0)
real_res = layers.transpose(x=real_res, perm=[1, 0, 2])
```
After:

```python
for index in range(len):
    ...
    res.append(input)
real_res = layers.concat(res, 0)
real_res = layers.reshape(real_res, shape=[len, -1, hidden_size], inplace=True)
real_res = layers.transpose(x=real_res, perm=[1, 0, 2])
```
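Both rewrites shrink the number of ops on the execution path: a single split replaces len slice ops, and one inplace reshape after the concat replaces len per-step reshapes while reusing the input's memory for the output.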
| # | Model | Repository |
|---|-------|------------|
| 1 | DeepLab V3+ | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleCV/deeplabv3%2B |
| 2 | YOLOv3 | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleCV/yolov3 |
| 3 | BERT | https://github.com/PaddlePaddle/ERNIE |
| 4 | Mask-RCNN | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleCV/rcnn |
| 5 | CycleGAN | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleCV/PaddleGAN/cycle_gan |
| 6 | SE-ResNeXt50 | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleCV/image_classification |
| 7 | Transformer | https://github.com/PaddlePaddle/models/tree/v1.5/PaddleNLP/models/neural_machine_translation/transformer |