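Platform: OpenI community (启智社区): https://openi.pcl.ac.cn/
云燧T20 is a second-generation AI training accelerator card for data centers, built on the 邃思2.0 chip. It offers broad model coverage, strong performance, and an open software ecosystem, and supports a wide range of AI training scenarios. It also scales flexibly, providing an industry-leading solution for AI compute clusters.
Advantages and features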
https://openi.pcl.ac.cn/Enflame/GCU_Pytorch1.10.0_Example
ResNet + imagenet_raw
Single card, 1 epoch
"model": "resnet50",
"local_rank": 0,
"batch_size": 256,
"epochs": 1,
"training_step_per_epoch": -1,
"eval_step_per_epoch": -1,
"acc1": 6.467013835906982,
"acc5": 20.52951431274414,
"device": "dtu",
"skip_steps": 2,
"train_fps_mean": 706.7805865954374,
"train_fps_min": 668.1171056579481,
"train_fps_max": 755.529550208285,
"training_time": "0:12:27"
fps_mean: 706.78
acc1: 6.47
Run time: 12 min 27 s
8 cards, 1 epoch
"model": "resnet50",
"local_rank": 5,
"batch_size": 256,
"epochs": 1,
"training_step_per_epoch": -1,
"eval_step_per_epoch": -1,
"acc1": 3.02734375,
"acc5": 12.5,
"device": "dtu",
"skip_steps": 2,
"train_fps_mean": 704.4055937610347,
"train_fps_min": 702.2026238348252,
"train_fps_max": 706.744240295003,
"training_time": "0:07:04"
fps_mean: 704.41
acc1: 3.03
Run time: 7 min 04 s
8-card linearity: 99.72%
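As a rough check on the linearity figure, the sketch below divides the per-card throughput of the 8-card run by the single-card throughput. This definition of linearity is an assumption, and `scaling_linearity` is a hypothetical helper name. Only the `local_rank` 5 record of the 8-card run is shown above, so the result differs slightly from the reported 99.72%, which presumably uses figures from all eight ranks.

```python
def scaling_linearity(single_card_fps, per_card_fps_multi):
    """Ratio of per-card throughput in the multi-card run to single-card throughput.

    Assumed definition: 1.0 would mean perfectly linear scaling.
    """
    return per_card_fps_multi / single_card_fps


# train_fps_mean values copied from the two 1-epoch records above.
single_card = 706.7805865954374   # single card, local_rank 0
eight_card_rank5 = 704.4055937610347  # 8 cards, local_rank 5 only

ratio = scaling_linearity(single_card, eight_card_rank5)
print(f"{ratio:.2%}")  # prints 99.66%
```

With the other seven ranks' `train_fps_mean` values, the same function could be applied to their average instead of a single rank.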
8 cards, 50 epochs
"model": "resnet50",
"local_rank": 0,
"batch_size": 256,
"epochs": 50,
"training_step_per_epoch": -1,
"eval_step_per_epoch": -1,
"acc1": 70.01953125,
"acc5": 94.140625,
"device": "dtu",
"skip_steps": 2,
"train_fps_mean": 578.1979771887603,
"train_fps_min": 242.95998405582057,
"train_fps_max": 728.2789276288373,
"training_time": "0:43:04"
fps_mean: 578.20
acc1: 70.02
Run time: 43 min 04 s
8 cards, 100 epochs
"model": "resnet50",
"local_rank": 0,
"batch_size": 64,
"epochs": 100,
"training_step_per_epoch": -1,
"eval_step_per_epoch": -1,
"acc1": 82.2265625,
"acc5": 96.875,
"device": "dtu",
"skip_steps": 2,
"train_fps_mean": 481.25022732778297,
"train_fps_min": 267.4726081053424,
"train_fps_max": 509.6326762775301,
"training_time": "1:18:22"
fps_mean: 481.25
acc1: 82.23
Run time: 1 h 18 min 22 s
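The short summary lines after each record can be derived mechanically from the logged fields. The following is a minimal sketch, assuming the record is available as a Python dict. `summarize` is a hypothetical helper, not part of the example repository. It uses the 100-epoch record, where 82.2265625 rounds to 82.23 at two decimals.

```python
def summarize(record):
    """Round the headline metrics and convert training_time (H:MM:SS) to seconds."""
    hours, minutes, seconds = (int(part) for part in record["training_time"].split(":"))
    return {
        "fps_mean": f'{record["train_fps_mean"]:.2f}',
        "acc1": f'{record["acc1"]:.2f}',
        "seconds": hours * 3600 + minutes * 60 + seconds,
    }


# Fields copied from the 8-card, 100-epoch record above.
record = {
    "train_fps_mean": 481.25022732778297,
    "acc1": 82.2265625,
    "training_time": "1:18:22",
}

summary = summarize(record)
print(summary)  # {'fps_mean': '481.25', 'acc1': '82.23', 'seconds': 4702}
```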
https://openi.pcl.ac.cn/OpenIOSSG/MNIST_PytorchExample_GCU/src/branch/master/train_for_c2net.py
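Takeaways
Reading the code examples is a quick way to learn how to migrate code from CPU/GPU to run on the GCU. Besides running the code provided by Enflame (燧原科技), I also tried porting code to the GCU while working through Li Mu's d2l PyTorch code a while ago. Overall, most of it migrated without problems. Some of my older torch-based notebooks also ran successfully on the GCU after I modified them following the examples.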
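The first step of such a migration is usually choosing which backend to run on. The sketch below illustrates only that step, and only with the standard library. The module name `torch_gcu` is an assumption about the Enflame PyTorch plugin used in the linked examples, and `pick_backend` and `module_available` are hypothetical helpers. The GCU-specific device and optimizer calls are not shown in this post and are not reproduced here.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(available=module_available):
    """Choose 'gcu', 'cuda', or 'cpu' depending on what is installed.

    'torch_gcu' is an assumed plugin name; adjust it to match the README
    of the example repository you are following.
    """
    if available("torch_gcu"):
        return "gcu"
    if available("torch"):
        import torch  # imported lazily so the sketch also runs without PyTorch

        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


print(pick_backend())
```

Keeping the choice in one function means the rest of a notebook only refers to the returned backend name, so switching between CPU, GPU, and GCU touches a single place.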
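The only problem I ran into is that a run sometimes prints a long stream of compilation messages. I do not know what causes this, and such runs usually take somewhat longer than on a GPU. Something in my code may be wrong, and I will look into it later. This does not happen often.
Overall, the GCU's running speed feels good, and running the demo code by following the README is very convenient.
Suggestions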