Transformers model loading and the seed random state

Loading a model with Transformers changes the random seed state.

The test below uses a small library I wrote myself: py-seeds (installable directly via pip; currently at version 0.0.2).
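I won't paste the library's source here, but judging from the keys used in state_str below, it essentially bundles the four standard RNG states (Python's random, NumPy, torch CPU, torch CUDA) into one dict. A minimal sketch of the idea (a simplified reconstruction, not the library's exact code):

import random

import numpy as np
import torch


def get_seed_state():
    # Snapshot all four global RNG states in one dict.
    return {
        "random": random.getstate(),                   # Python's built-in RNG
        "numpy": np.random.get_state(),                # NumPy global RNG
        "torch": torch.get_rng_state(),                # PyTorch CPU RNG (a uint8 tensor)
        "torch_cuda": torch.cuda.get_rng_state_all(),  # one tensor per CUDA device
    }


def set_seed_state(state_dict):
    # Restore all four global RNG states from the dict.
    random.setstate(state_dict["random"])
    np.random.set_state(state_dict["numpy"])
    torch.set_rng_state(state_dict["torch"])
    torch.cuda.set_rng_state_all(state_dict["torch_cuda"])

The test script itself: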

import py_seeds
from transformers import AutoModel


def state_str(state_dict):
    # Serialize every RNG component to a plain string so two states
    # can be compared with simple string equality.
    random_state = str(state_dict["random"])
    numpy_state = str(state_dict["numpy"][0]) + str(state_dict["numpy"][1].tolist())
    torch_state = str(state_dict["torch"].numpy().tolist())
    torch_cuda_state = "".join([str(i.numpy().tolist()) for i in state_dict["torch_cuda"]])
    return random_state + numpy_state + torch_state + torch_cuda_state


# before loading any model
state = py_seeds.get_seed_state()
now_state = py_seeds.get_seed_state()
print(state_str(state) == state_str(now_state))
# True: merely reading the state does not change it

# load a model
model = AutoModel.from_pretrained('roberta-base')
now_state = py_seeds.get_seed_state()
print(state_str(state) == state_str(now_state))
# False: loading a transformers model changes the random state

# restore the saved state after loading
py_seeds.set_seed_state(state)
now_state = py_seeds.get_seed_state()
print(state_str(state) == state_str(now_state))
# True: the state must be re-set after loading to keep it consistent
Output:
True
False
True

State comparison here uses a very simple method: convert every variable that makes up a state to a list, then to a string, concatenate the pieces, and check whether the two resulting strings are equal (hashing won't work, since the states contain NumPy arrays and torch tensors, which can't be hashed meaningfully).
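The string trick works, but comparing each component with an equality check suited to its type is cleaner. A sketch of such a helper (illustrative only; not part of py-seeds as of 0.0.2):

import numpy as np
import torch


def states_equal(a, b):
    # Compare each RNG component with the appropriate equality check.
    return (
        a["random"] == b["random"]                        # plain tuple comparison
        and a["numpy"][0] == b["numpy"][0]                # algorithm name, e.g. 'MT19937'
        and np.array_equal(a["numpy"][1], b["numpy"][1])  # the MT19937 key array
        and a["numpy"][2:] == b["numpy"][2:]              # pos / has_gauss / cached_gaussian
        and torch.equal(a["torch"], b["torch"])           # CPU RNG state tensor
        and len(a["torch_cuda"]) == len(b["torch_cuda"])
        and all(torch.equal(x, y) for x, y in zip(a["torch_cuda"], b["torch_cuda"]))
    )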

Summary

A transformers model's from_pretrained changes the random state. If you need the state to stay consistent, it is best to re-set it once after loading. This matters when resuming training from a checkpoint: restoring the state after loading keeps the resumed run's random sequence consistent with the original one.
(A tokenizer's from_pretrained does not affect the random state.)
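Why does loading change the state? A likely explanation (my reading, not something spelled out in the source above) is that from_pretrained first instantiates the model with randomly initialized weights, which consumes draws from the torch RNG, and only then copies the pretrained weights over them. The effect is easy to reproduce with any plain torch module:

import torch

# Constructing any module samples initial weights from the global RNG,
# so the RNG state advances even though those random weights are later
# overwritten by pretrained ones.
before = torch.get_rng_state()
_ = torch.nn.Linear(4, 4)          # reset_parameters() draws random weights
after = torch.get_rng_state()
print(torch.equal(before, after))  # False: the RNG state has advanced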

Author's note

This is a really small detail. I had assumed only the training loop itself touched the random state; I never expected that merely loading a model would affect it too, and it took a long time of trial and error to find. I hope this helps you avoid the same pitfall~
I may add a more convenient built-in state-comparison feature in the next version of py-seeds~
