Running into out-of-memory (OOM) problems when training knowledge distillation in a for loop with TensorFlow

Table of contents

      • Background
      • The problem
      • Solutions
        • Part 1
        • Part 2
        • Part 3
      • References

Background

I have recently been studying knowledge distillation with TensorFlow. Starting from a teacher model I trained on the cifar10 dataset, I distilled three student models of different parameter counts using the same alpha and temperature, and in every case the experimental results came out the opposite of what the paper reports (paper: Distilling the Knowledge in a Neural Network).
So I decided to use a for loop to sweep over combinations of alpha and temperature and compare the distillation results, plotting the training runs and the comparison with matplotlib.pyplot (I was done with hand-tuning; re-running the notebook after every manual parameter change and saving the data by hand was a pain).
That is when the bugs started: a TensorFlow out-of-memory problem, Process and Thread issues, and matplotlib.pyplot issues; I had not expected to hit all of them in one piece of code.

Local environment:
os: win10
GPU: GTX 1650 with 4 GB of dedicated memory
python: 3.8.3
tensorflow: 2.7.0

Server GPU configuration used in the end: four 12 GB 2080 Ti cards (only GPU:0 was actually used for testing)

What follows is a record of the process and the fixes.

The following modules, functions, and classes are needed:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from multiprocessing import Process
from threading import Thread
import time

# Subclass the keras.Model class
class Distiller(keras.Model):
    def __init__(self, student: keras.Sequential, teacher: keras.Sequential):
        super().__init__()
        self.teacher = teacher
        self.student = student

    def compile(self,
                optimizer: keras.optimizers.Optimizer,
                metrics,
                student_loss_fn,
                distillation_loss_fn,
                alpha=0.1,
                temperature=3):
        ......

    def train_step(self, data):
        x, y = data
        # teacher prediction
        ......
        return results

    def test_step(self, data):
        x, y = data
        ......
        return results

# Build a model
def build_model(name, conv1_size, conv2_size, conv3_size, dense_size):
    model = keras.Sequential([
        ......
    ], name=name)
    return model

The main function called by the loop:

def main_loop(alpha, T):
    # loop_param has to be a global variable, because there is no way to return it from here
    global loop_param
    loop_param = str(alpha) + '_' + str(T) + '_'
    # Print this iteration's parameters
    print('\n\n' + loop_param)
    # Build the student
    student = build_model('student', 8, 16, 16, 16)
    # Keep a copy of the student's initial structure
    student_scratch = keras.models.clone_model(student)
    ......

def draw_distill():
    ......

def draw_scratch():
    ......
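The elided Distiller internals above roughly follow the Keras knowledge-distillation example linked in the references. A minimal sketch of what compile/train_step/test_step can look like, assuming that example as the starting point (the exact loss bookkeeping in my actual code may differ):

class Distiller(keras.Model):
    def __init__(self, student, teacher):
        super().__init__()
        self.teacher = teacher
        self.student = student

    def compile(self, optimizer, metrics, student_loss_fn,
                distillation_loss_fn, alpha=0.1, temperature=3):
        # Keras handles optimizer/metrics; we only store the distillation knobs
        super().compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_fn = student_loss_fn
        self.distillation_loss_fn = distillation_loss_fn
        self.alpha = alpha
        self.temperature = temperature

    def train_step(self, data):
        x, y = data
        # Teacher prediction: forward pass only, no gradient is taken through it
        teacher_pred = self.teacher(x, training=False)
        with tf.GradientTape() as tape:
            student_pred = self.student(x, training=True)
            student_loss = self.student_loss_fn(y, student_pred)
            # Soften both distributions with the temperature before comparing them
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_pred / self.temperature, axis=1),
                tf.nn.softmax(student_pred / self.temperature, axis=1))
            loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss
        # Only the student's weights are updated
        grads = tape.gradient(loss, self.student.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.student.trainable_variables))
        self.compiled_metrics.update_state(y, student_pred)
        results = {m.name: m.result() for m in self.metrics}
        results.update({'student_loss': student_loss,
                        'distillation_loss': distillation_loss})
        return results

    def test_step(self, data):
        x, y = data
        y_pred = self.student(x, training=False)
        student_loss = self.student_loss_fn(y, y_pred)
        self.compiled_metrics.update_state(y, y_pred)
        results = {m.name: m.result() for m in self.metrics}
        results.update({'student_loss': student_loss})
        return results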

The problem

At first, the training and evaluation of the distiller and of student_scratch were done directly inside the for loop; after a few iterations this ran out of memory (OOM). Part of the output:

......
Epoch 2/20
2022-03-23 21:34:09.303394: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 242.0KiB (rounded to 247808)requested by op student/conv2d_2/Relu
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
......
2022-03-23 21:34:09.332826: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 2.10GiB
2022-03-23 21:34:09.332899: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 2255958784 memory_limit_: 2255958836 available bytes: 52 curr_region_allocation_bytes_: 4511918080
2022-03-23 21:34:09.332977: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit:                      2255958836
InUse:                      2255956224
MaxInUse:                   2255956224
NumAllocs:                    55811090
MaxAllocSize:                614400000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
......
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
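Incidentally, the log itself suggests trying TF_GPU_ALLOCATOR=cuda_malloc_async. That only helps when fragmentation is the cause, and here the allocator reports nearly all memory genuinely in use, so it was probably not the root cause; but if you want to try it, the variable has to be set before TensorFlow is imported:

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # must run before importing tensorflow
import tensorflow as tf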

So the idea was to isolate each iteration in its own Thread or Process, one thread or process per iteration. The thread/process created for an iteration terminates when the iteration ends, so the memory occupied by TensorFlow no longer keeps growing until OOM.

Solutions

I then tried using Thread and Process to give each iteration its own thread or process.

Part 1

Using Thread, with the model training, evaluation, and plt plotting of the training results all done inside the main_loop function, I ran into the warning:
UserWarning: Starting a Matplotlib GUI outside of the main thread will likely fail. Despite the "will likely fail", plt could still draw and save figures normally in my experiments.
Just in case, if you want to avoid this warning, see another article of mine about this UserWarning when using pyplot.plot.
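For completeness, the usual fix (if you do not need an interactive window) is to switch matplotlib to a non-GUI backend before any figure is created. A minimal sketch, not taken from the original code:

import matplotlib
matplotlib.use('Agg')              # non-GUI backend; set before pyplot draws anything
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
plt.savefig('result.png')          # save to file instead of opening a window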

if __name__ == '__main__':
    # Load the dataset
    (train_images, train_labels), (test_images, test_labels) = keras.datasets.cifar10.load_data()
    # Normalize pixel values
    train_images, test_images = train_images / 255.0, test_images / 255.0

    teacher = build_model('teacher', 32, 64, 64, 64)
    # Load the teacher model from SavedModel
    teacher = keras.models.load_model('teacher_model')

    # Loop over the hyperparameters to compare the distillation results
    for alpha in (0.1, 0.2, 0.3):
        for T in range(5, 21, 5):
            print(time.strftime('%H-%M-%S: '))
            t = Thread(target=main_loop, args=(alpha, T))
            t.start()
            t.join()
            draw_distill()
            draw_scratch()

Part 2

Using Process
The code is essentially the same as above: replace Thread with Process and pass everything the loop body needs through Process's args. With this, model training, evaluation, and plt plotting of the training results can again all be done inside the main_loop function.

A quick sketch of the changed code:
if __name__ == '__main__':
    # Load the dataset
    ......

    # Loop over the hyperparameters to compare the distillation results
    for alpha in (0.1, 0.2, 0.3):
        for T in range(5, 21, 5):
            print(time.strftime('%H-%M-%S: '))
            # A child process is isolated from its parent, so unlike a child thread it
            # cannot read the main thread's variables directly; everything it needs has
            # to be passed in through Process's args
            p = Process(target=main_loop,
                        args=(alpha, T, teacher, train_images, train_labels,
                              test_images, test_labels))
            p.start()
            # Block here so the main process only continues after process p has finished
            p.join()
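One detail the sketch glosses over: the isolation cuts both ways, so the child cannot write results back into the parent's globals either (the global loop_param trick from the Thread version no longer works). A minimal sketch of passing results back with multiprocessing.Queue; the result_queue parameter and the result dict are hypothetical, not from the original code:

import time
from multiprocessing import Process, Queue

def main_loop(alpha, T, result_queue):
    # ... build and train distiller / student_scratch here ...
    # Hypothetical payload; the real code would collect accuracy/loss curves
    result_queue.put({'param': str(alpha) + '_' + str(T), 'done_at': time.time()})

if __name__ == '__main__':
    q = Queue()
    results = []
    for alpha in (0.1, 0.2, 0.3):
        for T in range(5, 21, 5):
            p = Process(target=main_loop, args=(alpha, T, q))
            p.start()
            results.append(q.get())  # blocks until the child puts its result
            p.join()
    print(results)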

A more detailed comparison of Thread and Process, together with basic usage of multi-process and multi-thread code, is written up in another article of mine on the differences between python's multiprocessing.Process and threading.Thread.

Part 3

In addition, I tried training with the GPU on the local machine, but it was very slow, and a huge amount of per-op information was printed. Part of the output:

 376/1563 [======>.......................] - ETA: 34s - loss: 2.1089 - accuracy: 0.20432022-03-23 11:18:02.870737: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.871020: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.871797: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.879381: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.879595: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.880326: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0
......
2022-03-23 11:18:02.925364: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0383/1563 [======>.......................] - ETA: 34s - loss: 2.1056 - accuracy: 0.20602022-03-23 11:18:02.931965: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
......
2022-03-23 11:18:02.971101: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0391/1563 [======>.......................] - ETA: 33s - loss: 2.1030 - accuracy: 0.20772022-03-23 11:18:02.977031: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
......
2022-03-23 11:18:03.020212: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0400/1563 [======>.......................] - ETA: 32s - loss: 2.0986 - accuracy: 0.20922022-03-23 11:18:03.025223: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0......

Just as I was about to finish this post, I followed a method from another blog and trained on the GPU again: before import tensorflow as tf, set
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' (values of 1 and above filter out these INFO-level messages; the default 0 prints everything). Sure enough, it worked, and the flood of useless messages was gone.
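A minimal sketch of the setting (assuming the usual semantics of this variable; the key point is that it must be set before TensorFlow is imported):

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # 0 = all, 1 = filter INFO, 2 = also WARNING, 3 = also ERROR
import tensorflow as tf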
I also tried training with the GPU on the server (I had not tested on the server earlier precisely because of all the redundant output described above); it was much faster than the local GPU:

Server: GPU total train time: 62.842313051223755 s
Local: GPU total train time: 104.92916321754456 s

References

https://keras.io/examples/vision/knowledge_distillation/
https://www.tensorflow.org/guide/gpu?hl=zh-cn
