在模型训练前为什么要把数据打包为.npy文件,和普通文件格式有什么区别?

看看下面两段代码的运行时间!
代码一:

import numpy as np
import time

# 100万个数据
n_samples=1000000

# 将随机浮点数作为字符串写入本地TXT文件
with open('fdata.txt', 'w') as fdata:
    for _ in range(n_samples):
        fdata.write(str(10*np.random.random())+',')

# 读取TXT文件的数据,转换为1000x1000的ndarray,并计时
t1=time.time()
with open('fdata.txt','r') as fdata:
    datastr=fdata.read()
lst = datastr.split(',')
lst.pop()
array_lst=np.array(lst,dtype=float).reshape(1000,1000)
t2=time.time()

print(array_lst)
print('\nShape: ',array_lst.shape)
print(f"Time took to read: {t2-t1} seconds.")
#输出结果:
>> [[0.32614787 6.84798256 2.59321025 ... 5.02387324 1.04806225 2.80646522]
 [0.42535168 3.77882315 0.91426996 ... 8.43664343 5.50435042 1.17847223]
 [1.79458482 5.82172793 5.29433626 ... 3.10556071 2.90960252 7.8021901 ]
 ...
 [3.04453929 1.0270109  8.04185826 ... 2.21814825 3.56490017 3.72934854]
 [7.11767505 7.59239626 5.60733328 ... 8.33572855 3.29231441 8.67716649]
 [4.2606672  0.08492747 1.40436949 ... 5.6204355  4.47407948 9.50940101]]

>> Shape:  (1000, 1000)
>> Time took to read: 1.018733024597168 seconds.

代码二:

import numpy as np
import time
np.save('fnumpy.npy', array_lst)

t1=time.time()
array_reloaded = np.load('fnumpy.npy')
t2=time.time()

print(array_reloaded)
print('\nShape: ',array_reloaded.shape)
print(f"Time took to load: {t2-t1} seconds.")
#输出结果:
>> [[0.32614787 6.84798256 2.59321025 ... 5.02387324 1.04806225 2.80646522]
 [0.42535168 3.77882315 0.91426996 ... 8.43664343 5.50435042 1.17847223]
 [1.79458482 5.82172793 5.29433626 ... 3.10556071 2.90960252 7.8021901 ]
 ...
 [3.04453929 1.0270109  8.04185826 ... 2.21814825 3.56490017 3.72934854]
 [7.11767505 7.59239626 5.60733328 ... 8.33572855 3.29231441 8.67716649]
 [4.2606672  0.08492747 1.40436949 ... 5.6204355  4.47407948 9.50940101]]

>> Shape:  (1000, 1000)
>> Time took to load: 0.009010076522827148 seconds.

如果要以其他形状读取也可以

t1=time.time()
array_reloaded = np.load('fnumpy.npy').reshape(10000,100)
t2=time.time()

print(array_reloaded)
print('\nShape: ',array_reloaded.shape)
print(f"Time took to load: {t2-t1} seconds.")
#输出结果:
>> [[0.32614787 6.84798256 2.59321025 ... 3.01180325 2.39479796 0.72345778]
 [3.69505384 4.53401889 8.36879084 ... 9.9009631  7.33501957 2.50186053]
 [4.35664074 4.07578682 1.71320519 ... 8.33236349 7.2902005  5.27535724]
 ...
 [1.11051629 5.43382324 3.86440843 ... 4.38217095 0.23810232 1.27995629]
 [2.56255361 7.8052843  6.67015391 ... 3.02916997 4.76569949 0.95855667]
 [6.06043577 5.8964256  4.57181929 ... 5.6204355  4.47407948 9.50940101]]

>> Shape:  (10000, 100)
>> Time took to load: 0.010006189346313477 seconds.

在模型训练前为什么要把数据打包为.npy文件,和普通文件格式有什么区别?_第1张图片
总结: 从txt或其他文件读取1000*1000的数据,直接读取需要1s,而转成.npy后,读取只需要约0.01s,快100倍。如果需要多次读取相同的数据文件,这是一个有用的技巧,而且如上图所示,数据量越大,速度提升越明显!

你可能感兴趣的:(人工智能)