Summary of experiment bugs

RuntimeError: DataLoader worker (pid 39567) is killed by signal: Killed. 

The PyTorch program runs fine at first, but at some step mid-training it throws this error. The cause is that the server ran out of host memory (RAM), which triggered the OS's out-of-memory protection and killed the worker process. I tried lowering num_workers in the DataLoader from 16 to 8, but the error still occurred; in the end I set num_workers to 0 (all data loading happens in the main process, which is slower but avoids the crash). A minimal sketch of this workaround is shown below.
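A minimal sketch of the workaround, assuming a toy TensorDataset in place of the real dataset (the dataset and batch size here are placeholders); the only relevant change is num_workers=0:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset just for illustration; replace with your own Dataset.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# num_workers=0 loads batches in the main process: slower, but every extra
# worker is a separate process with its own memory footprint, so fewer
# workers means less pressure on host RAM.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # training step goes here
```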

In my case, I found that it was caused by the machine running out of memory (note: host memory, not GPU memory). To verify this, run your code and use the `top` command to check whether RAM and swap are used up. If so, it is a memory problem. You can use a smaller batch size, a smaller num_workers, expand your swap space, etc. Anything that decreases your memory load will help.
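Besides watching `top`, the same check can be done from inside the script. This is a sketch assuming the third-party psutil package is installed (pip install psutil):

```python
import psutil

# Snapshot host RAM and swap usage; call this periodically during training
# to see whether memory is climbing toward exhaustion.
vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM : {vm.percent:.1f}% used ({vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB)")
print(f"Swap: {sw.percent:.1f}% used ({sw.used / 1e9:.1f} / {sw.total / 1e9:.1f} GB)")
```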

Searching further, another common cause turned up: accumulating the loss (or other network outputs) across iterations while they still reference the computation graph, so the graph keeps growing and memory keeps climbing. Solution: inside the training loop, when you only need the loss value, take loss.data[0] (on PyTorch 0.4 and later, use loss.item() instead) so that no reference to the graph is retained.
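A sketch of the difference, with a hypothetical model, loss, and loader standing in for the real ones:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data just to make the snippet runnable.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
                    batch_size=8)

running_loss = 0.0
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # Wrong: `running_loss += loss` keeps the graph of every iteration alive,
    # so memory grows without bound.
    # Right: convert to a plain Python float first.
    running_loss += loss.item()   # loss.data[0] on PyTorch < 0.4

print(running_loss / len(loader))
```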
