Running MXNet (Distributed + Shared Library)

Environment

Installation

Starting from incubator-mxnet/python, change into the example directory:

cd ../example/image-classification

hosts file

echo -e "node14\nnode15\nnode16\nnode17\nnode19" > hosts

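Run (the other machines already have the environment set up)

-s is the number of servers and -n is the number of workers; see the docs.
Hosts are taken from the hosts file in order: servers are assigned first, then workers.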

python ../../tools/launch.py -s 1 -n 2 --launcher ssh -H hosts  \
python train_mnist.py --network lenet --kv-store dist_sync
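
As a rough illustration of the assignment rule described above, the following sketch pairs hosts with roles in file order. The function name assign_roles is hypothetical and is not part of launch.py, and wrapping around to the top of the list when there are more processes than hosts is an assumption made here.

```python
def assign_roles(hosts, num_servers, num_workers):
    """Pair each process with a host: servers first, then workers."""
    roles = ["server"] * num_servers + ["worker"] * num_workers
    # Assumption: wrap around to the top of the host list when there
    # are more processes than hosts.
    return [(hosts[i % len(hosts)], role) for i, role in enumerate(roles)]


# Same hosts as written to the hosts file above.
hosts = ["node14", "node15", "node16", "node17", "node19"]
for host, role in assign_roles(hosts, num_servers=1, num_workers=2):
    print(host, role)
```

Under this reading, -s 1 -n 2 would use only the first three hosts in the file, and the remaining two would stay idle.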

or

Synchronize Directory & run (the other machines do not need MXNet installed)

They do, however, need the following packages:

sudo yum install python-pip
sudo pip install numpy
sudo yum install lapack-devel openblas-devel opencv-devel
sudo yum install gtk2-devel

The library files need to be placed in the directory that will be synchronized:

rm -rf mxnet # the copy under example/image-classification
cp -r ../../python/mxnet .
cp -r ../../lib/libmxnet.so mxnet
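
The same staging step could also be scripted, for example to rerun it after each rebuild. This is a minimal sketch that mirrors the three shell commands above; the helper name stage_mxnet is hypothetical, and it assumes the repository layout used in this post (python/mxnet and lib/libmxnet.so under the repository root).

```python
import shutil
from pathlib import Path


def stage_mxnet(repo_root, sync_dir):
    """Copy the Python package and libmxnet.so into sync_dir/mxnet."""
    repo_root, sync_dir = Path(repo_root), Path(sync_dir)
    target = sync_dir / "mxnet"
    # Equivalent of `rm -rf mxnet`
    shutil.rmtree(target, ignore_errors=True)
    # Equivalent of `cp -r ../../python/mxnet .`
    shutil.copytree(repo_root / "python" / "mxnet", target)
    # Equivalent of `cp -r ../../lib/libmxnet.so mxnet`
    shutil.copy2(repo_root / "lib" / "libmxnet.so", target)
    return target / "libmxnet.so"
```

From example/image-classification, the call would be stage_mxnet("../..", "."), which returns the path of the copied library so its presence can be checked before launching.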

Then synchronize the directory and run on the distributed nodes:

export DMLC_INTERFACE='ib0';
python ../../tools/launch.py -n 2 -s 1 --launcher ssh -H hosts --sync-dst-dir /home/xugb/image-classification_test/ \
python train_mnist.py --network lenet --kv-store dist_sync

Run the following under incubator-mxnet/tools/bandwidth/:

username@hostname:~/incubator-mxnet/tools/bandwidth$ python  measure.py --kv-store local  --network lenet --num-classes 10
INFO:root:Namespace(disp_batches=1, gc_type='none', gpus='0,1', image_shape='3,224,224', kv_store='local', network='lenet', num_batches=5, num_classes=10, num_layers=152, optimizer='None', test_results=1)
INFO:root:num of arrays = 8, total size = 281.028320 MB
INFO:root:iter 1, 0.155938 sec, 1.802175 GB/sec per gpu, error 0.000000
INFO:root:iter 2, 0.156968 sec, 1.790356 GB/sec per gpu, error 0.000000
INFO:root:iter 3, 0.156240 sec, 1.798699 GB/sec per gpu, error 0.000000
INFO:root:iter 4, 0.156754 sec, 1.792804 GB/sec per gpu, error 0.000000
INFO:root:iter 5, 0.156649 sec, 1.794004 GB/sec per gpu, error 0.000000

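The output shows the model size: the "total size = 281.028320 MB" line is the combined size of the 8 parameter arrays.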
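
To relate the total size to the reported bandwidth, the sketch below parses the log above and recomputes the per-GPU figure. The formula size * 2 * (n - 1) / n / time is an assumption: it reproduces the logged values for the two GPUs used here (gpus='0,1'), but it has not been checked against measure.py for other settings. The names parse_log and per_gpu_bandwidth are hypothetical.

```python
import re

LOG = """\
INFO:root:num of arrays = 8, total size = 281.028320 MB
INFO:root:iter 1, 0.155938 sec, 1.802175 GB/sec per gpu, error 0.000000
INFO:root:iter 2, 0.156968 sec, 1.790356 GB/sec per gpu, error 0.000000
INFO:root:iter 3, 0.156240 sec, 1.798699 GB/sec per gpu, error 0.000000
INFO:root:iter 4, 0.156754 sec, 1.792804 GB/sec per gpu, error 0.000000
INFO:root:iter 5, 0.156649 sec, 1.794004 GB/sec per gpu, error 0.000000
"""


def parse_log(log):
    """Return (total size in MB, [(seconds, logged GB/sec), ...])."""
    size_mb = float(re.search(r"total size = ([\d.]+) MB", log).group(1))
    pairs = re.findall(r"iter \d+, ([\d.]+) sec, ([\d.]+) GB/sec", log)
    return size_mb, [(float(t), float(bw)) for t, bw in pairs]


def per_gpu_bandwidth(size_mb, seconds, num_gpus):
    # Assumed formula; with two GPUs the factor 2 * (n - 1) / n equals 1.
    return size_mb * 2 * (num_gpus - 1) / num_gpus / seconds / 1e3


size_mb, iters = parse_log(LOG)
for seconds, logged in iters:
    recomputed = per_gpu_bandwidth(size_mb, seconds, num_gpus=2)
    print(f"{seconds:.6f} s  logged {logged:.6f}  recomputed {recomputed:.6f}")
```

With two GPUs the recomputed value is simply the model size divided by the iteration time, which is why the size line is enough to sanity-check the bandwidth numbers.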
