Jupyter keeps crashing when running fit for training

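When I run AI programs in Jupyter, a message keeps appearing: "The server appears to have died, but it will restart immediately", as shown in the screenshot: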


image.png

Running the same code as a .py file usually fails as well:

(py37) twsm@twsm-PR4904P:~/project/paper$ python train_merge_kashgari.py 
2020-05-20 11:48:56.445877: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 16914055168
Aborted (core dumped)

In most cases the cause is GPU memory that was never released. The nvidia-smi command shows GPU usage and the PIDs of the processes holding it:


image.png
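
For scripting the same check, here is a minimal Python sketch that asks nvidia-smi for its per-process list in CSV form and parses it. The function names `parse_compute_apps` and `list_gpu_processes` are hypothetical helpers introduced for illustration, and the sketch assumes nvidia-smi is on the PATH and supports the `--query-compute-apps` option.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse CSV rows of the form "pid, process_name, used_memory".

    Hypothetical helper: expects output produced with
    --format=csv,noheader,nounits, so used_memory is a bare MiB number.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)  # the process name may itself contain commas
        used = used.strip()
        rows.append({
            "pid": int(pid),
            "name": name.strip(),
            # nvidia-smi can report a non-numeric value such as [N/A]; store None then
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes currently holding GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

On a machine in the state described here, the list returned by `list_gpu_processes()` would presumably include the PID of the stale notebook kernel.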

The following command shows which processes have the GPU devices open:

(py37) twsm@twsm-PR4904P:~/project/paper$ fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        twsm       9087 F...m ZMQbg/1
/dev/nvidia1:        twsm       9087 F...m ZMQbg/1
/dev/nvidia2:        twsm       9087 F...m ZMQbg/1
/dev/nvidia3:        twsm       9087 F...m ZMQbg/1
/dev/nvidiactl:      twsm       9087 F...m ZMQbg/1
/dev/nvidia-uvm:     twsm       9087 F.... ZMQbg/1

Use ps -ef to look at the Python processes:

(py37) twsm@twsm-PR4904P:~/project/paper$ ps -ef | grep python
twsm       3535   2093  0 5月19 pts/9   00:00:14 /home/twsm/anaconda3/envs/py37/bin/python /home/twsm/anaconda3/envs/py37/bin/jupyter-notebook
twsm       9087   3535 99 10:49 ?        00:09:46 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-4380b8f8-ed1b-44f7-b15b-499849b9ef77.json
twsm       9515   3535  0 10:53 ?        00:00:00 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-7b392a0b-8b3a-49de-a921-83cbb9839bb3.json
twsm       9532   2029  0 10:57 pts/8    00:00:00 grep --color=auto python
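
To tie a PID back to a specific notebook kernel, the ps output can be parsed for ipykernel processes. The sketch below is one way to do that, assuming the kernel connection file is named `kernel-<id>.json` as in the listing above; `find_kernels` is a hypothetical helper name.

```python
import re

# Matches an ipykernel command line and captures the kernel id from kernel-<id>.json
KERNEL_RE = re.compile(r"ipykernel_launcher\b.*?kernel-([0-9a-f-]+)\.json")


def find_kernels(ps_output):
    """Map PID -> kernel id for every ipykernel process in `ps -ef` output."""
    kernels = {}
    for line in ps_output.splitlines():
        match = KERNEL_RE.search(line)
        if not match:
            continue  # not a notebook kernel (e.g. the notebook server or grep itself)
        fields = line.split()
        kernels[int(fields[1])] = match.group(1)  # second ps -ef column is the PID
    return kernels


# The two kernel lines from the ps -ef listing above
sample = """\
twsm       9087   3535 99 10:49 ?        00:09:46 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-4380b8f8-ed1b-44f7-b15b-499849b9ef77.json
twsm       9515   3535  0 10:53 ?        00:00:00 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-7b392a0b-8b3a-49de-a921-83cbb9839bb3.json
"""
print(find_kernels(sample))
# {9087: '4380b8f8-ed1b-44f7-b15b-499849b9ef77', 9515: '7b392a0b-8b3a-49de-a921-83cbb9839bb3'}
```

Intersecting these PIDs with the ones reported by nvidia-smi or fuser narrows the search to the kernel that is actually holding the GPU.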

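Sure enough, process 9087 is there, even though its code is no longer actually running in Jupyter.
In most cases this process is the kernel of a notebook you ran a little earlier. Select that notebook in Jupyter Notebook, shut it down, and then run the original program again: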


image.png
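
If the notebook interface is not reachable, one possible fallback is to stop the stale kernel process by PID. The following is only a sketch under assumptions: `stop_process` is a hypothetical helper, it relies on POSIX signals, and it only works on processes owned by the current user. The notebook server may start a fresh kernel in place of the one that was stopped; a fresh kernel does not hold GPU memory until code runs in it.

```python
import os
import signal
import time


def stop_process(pid, timeout=5.0):
    """Send SIGTERM to `pid` and wait up to `timeout` seconds for it to exit.

    Returns True if the process is gone, False if it is still alive.
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # the process had already exited
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks that the PID exists
        except ProcessLookupError:
            return True
        time.sleep(0.1)
    return False
```

For the case described here the call would be `stop_process(9087)`, followed by a second look at nvidia-smi to confirm that the memory was released.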
