MXNet/Gluon Learning Series (2) — Lazy Evaluation in MXNet


Lazy evaluation in MXNet

import numpy as np
import time
import mxnet.ndarray as nd

print("===============start using mxnet ndarray============")
start = time.time()
x = nd.random.uniform(shape=(2000, 2000))
y = nd.dot(x, x)  # only queued in the backend here; not computed yet
print('workloads are queued:\t%.4f sec' % (time.time() - start))
print(y)          # printing needs the values, so it waits for the computation
print('workloads are finished:\t%.4f sec' % (time.time() - start))

# For comparison, the same workload with NumPy (imperative, executes immediately)
print("===============start using numpy============")
start_np = time.time()
x = np.random.uniform(size=(2000, 2000))
y = np.dot(x, x)
print('workloads are queued:\t%.4f sec' % (time.time() - start_np))
print(y)
print('workloads are finished:\t%.4f sec' % (time.time() - start_np))

Output:

====================start using mxnet ndarray====================
workloads are queued:	0.0009 sec

[[501.1584  508.2972  495.65228 ... 492.84705 492.69086 490.04797]
 [508.81046 507.18216 495.17426 ... 503.1053  497.2932  493.67917]
 [489.566   499.47006 490.17715 ... 490.9994  488.05002 483.2883 ]
 ...
 [484.00186 495.7179  479.92142 ... 493.6995  478.89175 487.20737]
 [499.64926 507.651   497.5939  ... 493.04742 500.74512 495.82703]
 [516.0143  519.17145 506.35403 ... 510.08868 496.3561  495.42523]]
<NDArray 2000x2000 @cpu(0)>
workloads are finished:	0.0855 sec
====================start using numpy===========================
workloads are queued:	0.2381 sec
[[499.78141775 500.20765179 482.10959684 ... 495.70733609 494.62182745
  496.71387707]
 [499.84924786 498.85308693 489.2123237  ... 502.57251308 502.91773143
  501.31651937]
 [476.24190127 482.92743953 470.31281454 ... 487.23095596 483.80486825
  477.8116057 ]
 ...
 [497.14541701 499.29581685 487.46588491 ... 497.46870583 502.95987923
  504.09474747]
 [507.14380432 517.89740016 507.72814195 ... 517.49853163 516.01229087
  517.79480039]
 [505.88017037 506.28039536 498.74898774 ... 504.91917801 511.03318716
  508.53549885]]
workloads are finished:	0.2390 sec

Process finished with exit code 0

As the timings show, MXNet's NDArray only performs the actual computation at the print(y) step, whereas NumPy executes each statement imperatively, in order.

Immediate execution

In MXNet you can also make the front-end thread wait until a result is ready before continuing.

Call wait_to_read() on an NDArray to wait until that particular result is finished,

or call nd.waitall() to wait until all pending results are finished.

For example:

import time
import mxnet.ndarray as nd

start = time.time()
x = nd.random.uniform(shape=(2000, 2000))
y = nd.dot(x, x)
y.wait_to_read()   # block until this particular result is computed
print('%.4f' % (time.time() - start))

start = time.time()
x = nd.random.uniform(shape=(2000, 2000))
y = nd.dot(x, x)
z = nd.dot(x, x)
nd.waitall()       # block until all queued work is finished
print('%.4f' % (time.time() - start))

Result:

0.0836
0.0964

Process finished with exit code 0

Any operation that moves data from an NDArray into a data structure that does not support lazy evaluation also triggers a wait,

for example asnumpy(), asscalar(), and similar operations.
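
A minimal timing sketch (reusing the 2000×2000 workload from above) that shows asnumpy() and asscalar() blocking until the result is ready:

import time
import mxnet.ndarray as nd

start = time.time()
x = nd.random.uniform(shape=(2000, 2000))
y = nd.dot(x, x)
y_np = y.asnumpy()   # copying into NumPy forces the computation to finish first
print('asnumpy:  %.4f sec' % (time.time() - start))

start = time.time()
s = nd.dot(x, x).sum().asscalar()   # asscalar() on a 1-element NDArray also blocks
print('asscalar: %.4f sec' % (time.time() - start))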

Pros and cons of lazy evaluation

  • Convenience
    For example, in a for loop where some variable keeps being assigned but is never used, lazy evaluation can skip past that part and keep going, which speeds up the computation.
  • Other effects
    Take the following code as an example: within one batch of a training run, we compute the loss for that batch.
from mxnet import gluon
from mxnet.gluon import nn
from mxnet import ndarray as nd
from time import time
import os
import subprocess

def get_data():
    # yields 60 synthetic batches of shape (1024, 1024)
    start = time()
    batch_size = 1024
    for i in range(60):
        if i % 10 == 0 and i != 0:
            print('batch %d, time %f sec' % (i, time() - start))
        x = nd.ones((batch_size, 1024))
        y = nd.ones((batch_size,))
        yield x, y

net = nn.Sequential()
with net.name_scope():
    net.add(
        nn.Dense(1024, activation='relu'),
        nn.Dense(1024, activation='relu'),
        nn.Dense(1),
    )
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {})
loss = gluon.loss.L2Loss()

def get_mem():
    # read this process's resident memory (KB) from `ps u` and convert to MB
    res = subprocess.check_output(['ps', 'u', '-p', str(os.getpid())])
    return int(str(res).split()[15]) / 1e3

# warm up: run one batch so parameters and buffers are allocated
for x, y in get_data():
    break
loss(y, net(x)).wait_to_read()

mem = get_mem()
for x, y in get_data():
    loss(y, net(x)).wait_to_read()   # synchronize once per batch
nd.waitall()
print('Increased memory %f MB' % (get_mem() - mem))

Note the final loop:

for x,y in get_data():
    loss(y,net(x)).wait_to_read()
nd.waitall()

Waiting at the end of each iteration keeps us from pushing a huge amount of data into memory all at once.
If we only called loss(y, net(x)) without waiting, all of the batches would be pushed into the backend within a very short time.
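
For comparison, a sketch of the same loop without any per-batch synchronization, reusing net, loss, get_data and get_mem from the example above; every batch is queued in the backend almost immediately, so the reported memory increase is typically much larger:

mem = get_mem()
for x, y in get_data():
    loss(y, net(x))   # only queued; nothing forces the result to be ready
nd.waitall()          # a single wait at the very end
print('Increased memory %f MB' % (get_mem() - mem))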

Summary

Lazy evaluation gives the system more room to optimize performance, but it is recommended to include at least one synchronization function in every batch, for example evaluating the loss, to avoid throwing too many tasks into the backend at once.
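
As a rough sketch of that recommendation (reusing net, trainer, loss and get_data defined above, not code from the original post), a full training step with one synchronization point per batch might look like this:

from mxnet import autograd

for x, y in get_data():
    with autograd.record():
        l = loss(y, net(x))
    l.backward()
    trainer.step(x.shape[0])
    # converting the batch loss to a Python float synchronizes once per batch,
    # so the backend never accumulates more than about one batch of pending work
    batch_loss = l.mean().asscalar()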
