!nvidia-smi
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/Colab"
%ls
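With this setup, temporary files are still kept if the network is unstable or the session time limit expires. Reportedly, training can then resume after reconnecting, but this is still unconfirmed.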
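Whether a reconnected session can pick up where it left off is unconfirmed above. One way to make resuming possible either way is to write a small checkpoint into the mounted Drive folder after every epoch. The following is a minimal sketch, assuming a hypothetical checkpoint file and a plain dict as the training state (neither comes from the original notes). With PyTorch, the dict could hold `model.state_dict()` instead.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to a temp file first, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # On a local POSIX filesystem this replace is a single step; whether the
    # Drive mount gives the same guarantee is an assumption.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def train(ckpt_path, total_epochs):
    # Hypothetical training loop: the real per-epoch work is elided.
    state = load_checkpoint(ckpt_path, {"epoch": 0})
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of training would run here ...
        state["epoch"] = epoch + 1
        save_checkpoint(state, ckpt_path)
    return state
```

Calling `train(path, 12)` again after an interruption starts from the last saved epoch rather than from zero.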
!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
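The same information can be read from Python instead of from the table, which is handy for logging which GPU a session was given. This is a minimal sketch that uses the `--query-gpu` and `--format` options of nvidia-smi. The helper names and the dict keys are hypothetical.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse lines like 'Tesla K80, 11441, 0' into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, total, used = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_total_mib": int(total),
            "memory_used_mib": int(used),
        })
    return gpus


def query_gpus():
    """Return GPU info from nvidia-smi, or an empty list if it cannot be run."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=name,memory.total,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)
```

On a CPU-only runtime `query_gpus()` returns an empty list, so a notebook can check for a GPU before starting a long job.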
Alternatively, click Files in the left sidebar and choose Mount Google Drive.
A code cell like the following then appears; press Ctrl+Enter to run it.
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/Colab"
%ls
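`%cd` reports an error when the target folder does not exist yet, for example on a fresh Drive. A minimal sketch of a helper that creates the folder first and then lists it, mirroring `%cd` followed by `%ls`. The function name is hypothetical, and it is meant to be called after `drive.mount(...)` has succeeded.

```python
import os


def enter_workdir(path):
    """Create `path` if needed, switch into it, and return its sorted listing."""
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return sorted(os.listdir("."))
```

In Colab this would be used as `enter_workdir("/content/drive/My Drive/Colab")`.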
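To click the reconnect button automatically, press F12 in the browser, open the Console tab, paste the code below, and press Enter to run it. If the page is refreshed, run the code again. See the original post for reference.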
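// Every second, click the connect button when its label shows 重新连接
// ("Reconnect" in the Chinese-language Colab UI).
setInterval(()=>{
    if(Array.from(document.getElementById("connect").children[0].children[2].innerHTML).splice(3,4).toString() === '重,新,连,接'){
        document.getElementById("connect").children[0].children[2].click()
    }
},1000)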
# Send SIGKILL to every process in the VM, which resets the runtime.
!kill -9 -1
Use a monospaced font so that indentation lines up and the code is easier to read.
Reference: How to change the font size in colab
!pip install mxnet-cu101
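Update 2019/11/22: on Colab, mxnet-cu100 now installs but fails on import; only the matching mxnet-cu101 build works. The reason is unknown.
Earlier note (superseded by the update above): although the reported CUDA version is 10.1, testing at the time showed that mxnet only worked with CUDA 10.0; other builds installed successfully but failed on import. This may be an environment variable issue, and setting .bashrc might help. At that time it was simpler to use mxnet-cu100 directly in Colab, since the package has to be reinstalled after every reconnect.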
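One possible source of the confusion above (an assumption, not something these notes confirm): the "CUDA Version" field printed by nvidia-smi is the highest CUDA version the driver supports, while the installed toolkit version is what `nvcc --version` reports. Here is a minimal sketch that reads the toolkit version and derives a package name. It assumes that the `mxnet-cu<major><minor>` naming seen in `mxnet-cu100` and `mxnet-cu101` holds and that a wheel exists for that version. The helper names are hypothetical.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def mxnet_package(cuda_version):
    """Map (10, 1) to 'mxnet-cu101'; use the CPU build when CUDA is unknown."""
    if cuda_version is None:
        return "mxnet"
    major, minor = cuda_version
    # Assumed naming rule; not every CUDA version has a published wheel.
    return f"mxnet-cu{major}{minor}"


def detect_cuda_toolkit():
    """Run nvcc and parse its version; returns None when nvcc cannot be run."""
    try:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(out)
```

`mxnet_package(detect_cuda_toolkit())` then gives the name to pass to `pip install`, which avoids hard-coding a version that may change when the Colab image is updated.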
The Colab environment comes with the GPU build of PyTorch pre-installed, so it can be used right away.
import torch
from torch import nn
print(f"\n cuda is available: {torch.cuda.is_available()}",
f"\n device count : {torch.cuda.device_count()}",
f"\n device name : {torch.cuda.get_device_name(0)}",)
device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.tensor([1, 2, 3], device=device)
print(x)  # on a GPU runtime prints: tensor([1, 2, 3], device='cuda:0')
cuda
y = torch.tensor([1, 2, 3]).cuda()
net = nn.Linear(3, 1).cuda()
!wget -O [filename] [url]
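The same download can be done from Python, which makes it easy to skip the transfer when the file is already on disk. This is a minimal sketch using only the standard library. The skip-if-present behavior is an addition that `wget -O` does not have, and the function name is hypothetical.

```python
import os
import shutil
import urllib.request


def download(url, filename, overwrite=False):
    """Save `url` to `filename`; skip the transfer if the file already exists."""
    if os.path.exists(filename) and not overwrite:
        return filename
    # Stream the response to disk in chunks instead of loading it into memory.
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
    return filename
```

Pass `overwrite=True` to get the always-replace behavior of the shell command.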
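Warning: do not upload large files to Drive and then train on them through the mount. The cloud VM and Drive are not in the same cluster, so reading data during training is very slow. In my log from fine-tuning ResNet on the FashionMNIST dataset on a Tesla K80 (screenshot), each epoch takes about 24 min (1440 sec), which is too slow to be practical.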
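If a dataset already lives on Drive, one workaround is to copy it to the VM's local disk once, as a single archive, and train from the local copy. The sketch below assumes hypothetical paths and an `extracted` subfolder name. Whether this removes the slowdown is not measured in these notes.

```python
import os
import shutil


def stage_archive(drive_archive, local_dir):
    """Copy one archive from the Drive mount to local disk and unpack it there."""
    os.makedirs(local_dir, exist_ok=True)
    local_archive = os.path.join(local_dir, os.path.basename(drive_archive))
    if not os.path.exists(local_archive):
        # One sequential read from Drive instead of many small reads per epoch.
        shutil.copy2(drive_archive, local_archive)
    extract_dir = os.path.join(local_dir, "extracted")
    if not os.path.isdir(extract_dir):
        shutil.unpack_archive(local_archive, extract_dir)
    return extract_dir
```

The data loader is then pointed at the returned directory rather than at a path under `/content/drive`.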
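Copy the test code and run it directly in a cell. The output is shown below: after the first epoch (17 sec), each epoch takes only 9 sec, which is blazing fast!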
Epoch 1/12
60000/60000 [==============================] - 17s 291us/step - loss: 0.2623 - acc: 0.9180 - val_loss: 0.0537 - val_acc: 0.9829
Epoch 2/12
60000/60000 [==============================] - 9s 147us/step - loss: 0.0883 - acc: 0.9736 - val_loss: 0.0397 - val_acc: 0.9860
Epoch 3/12
60000/60000 [==============================] - 9s 147us/step - loss: 0.0668 - acc: 0.9798 - val_loss: 0.0354 - val_acc: 0.9886
Epoch 4/12
60000/60000 [==============================] - 9s 147us/step - loss: 0.0542 - acc: 0.9835 - val_loss: 0.0351 - val_acc: 0.9881
Epoch 5/12
60000/60000 [==============================] - 9s 146us/step - loss: 0.0473 - acc: 0.9853 - val_loss: 0.0277 - val_acc: 0.9903
Epoch 6/12
60000/60000 [==============================] - 9s 146us/step - loss: 0.0414 - acc: 0.9875 - val_loss: 0.0276 - val_acc: 0.9914
Epoch 7/12
60000/60000 [==============================] - 9s 147us/step - loss: 0.0365 - acc: 0.9890 - val_loss: 0.0268 - val_acc: 0.9911
Epoch 8/12
60000/60000 [==============================] - 9s 146us/step - loss: 0.0335 - acc: 0.9894 - val_loss: 0.0249 - val_acc: 0.9919
Epoch 9/12
60000/60000 [==============================] - 9s 146us/step - loss: 0.0327 - acc: 0.9900 - val_loss: 0.0242 - val_acc: 0.9911
Epoch 10/12
60000/60000 [==============================] - 9s 146us/step - loss: 0.0299 - acc: 0.9909 - val_loss: 0.0258 - val_acc: 0.9911
Epoch 11/12
60000/60000 [==============================] - 9s 148us/step - loss: 0.0278 - acc: 0.9914 - val_loss: 0.0265 - val_acc: 0.9918
Epoch 12/12
60000/60000 [==============================] - 9s 147us/step - loss: 0.0263 - acc: 0.9918 - val_loss: 0.0259 - val_acc: 0.9912
Test loss: 0.025935380621160586
Test accuracy: 0.9912
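The per-epoch times can be pulled out of such a log automatically. Here is a minimal sketch, assuming the Keras-style progress format shown above; the function name is hypothetical.

```python
import re

# Matches the per-epoch timing in lines such as
# "60000/60000 [====] - 9s 147us/step - loss: ...".
EPOCH_TIME = re.compile(r"\] - (\d+)s ")


def epoch_seconds(log_text):
    """Return the wall-clock seconds of each epoch found in the log text."""
    return [int(m.group(1)) for m in EPOCH_TIME.finditer(log_text)]
```

Applied to the log above, this yields 17 followed by eleven 9s, for a total of 116 sec across the 12 epochs. That run and the ResNet fine-tuning run use different models and datasets, so the two epoch times are not a like-for-like comparison.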