Reading large binary data with numpy

2021.01.15 LHQ

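When reading large binary data (for example GrADS-format files, which hold regularly gridded data), I wanted to read only part of a file, directly, without loading all of it. A search online turned up two ways to do this in numpy: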

numpy.fromfile (file, dtype=float, count=-1, sep='', offset=0)
numpy.memmap

numpy.fromfile

numpy.fromfile (file, dtype=float, count=-1, sep='', offset=0)

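file: file name (an open file object also works)
dtype: data type of the values (GrADS data is usually float32; I am still working out the other settings)
count: number of items to read in one call (-1 reads the whole file). Note that this is a number of items, not a size in bytes, i.e. how many floats there are. For example, a single horizontal level of a GrADS variable has nlat*nlon grid points, so count is nlat*nlon.
sep: separator between items. An empty separator ("") means the file is binary. A space (" ") inside a separator matches zero or more whitespace characters, while a separator consisting only of spaces must match at least one whitespace character.
offset: offset in bytes from the file's current position. Defaults to 0 and is only allowed for binary files. With a file name or a freshly opened file object the current position is the start of the file; if the file object has already been read from, the offset counts from wherever the file pointer currently is, not from the beginning.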
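
To make the difference between count (items) and offset (bytes) concrete, here is a minimal sketch. The file demo.bin and its contents are made up for illustration and have nothing to do with the GrADS file used in this note.

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file: 10 float32 values, 0.0 .. 9.0 (40 bytes in total).
values = np.arange(10, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
values.tofile(path)

itemsize = np.dtype(np.float32).itemsize  # 4 bytes per float32

# count is a number of items, offset is a number of bytes:
# skip the first 3 values (3 * 4 bytes), then read 4 values.
part = np.fromfile(path, dtype=np.float32, count=4, offset=3 * itemsize)
print(part)  # [3. 4. 5. 6.]

# With an open file object, offset is counted from the current position.
with open(path, "rb") as f:
    first = np.fromfile(f, dtype=np.float32, count=2)  # pointer now at byte 8
    later = np.fromfile(f, dtype=np.float32, count=2, offset=2 * itemsize)
print(first, later)  # [0. 1.] [4. 5.]
```

The second read skips two more values starting from byte 8, which is why it returns 4 and 5 rather than 2 and 3.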

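Test code:

import time

import numpy as np

fdir="./database/postvar202006020004400"  # path to the binary data file
nlat=501
nlon=751

# Read a single level: skip the first nlat*nlon float32 values (4 bytes each)
with open(fdir,'rb') as f:
    ts1=time.time()
    data=np.fromfile(f,dtype='float32',count=nlat*nlon,offset=nlat*nlon*4)
    te1=time.time()
    t1=te1-ts1
print('Elapsed:',t1)

# Read the whole file for comparison
with open(fdir,'rb') as f:
    ts1=time.time()
    data=np.fromfile(f,dtype='float32')
    te1=time.time()
    t1=te1-ts1
print('Elapsed:',t1)

Elapsed: 0.03650185585021973
Elapsed: 1.6820800304412842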

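For comparison, reading the corresponding variable with xgrads:

# gv and ctlname come from my own setup and are not defined in this note
ts=time.time()
ds = gv.Grads_data(ctlname)
ds = ds.getsinglelevel("u",0,24)
te =time.time()
print("Elapsed: ", te-ts)

Elapsed:  0.08249235153198242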

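xgrads returns a full dataset object (an xarray.Dataset, I believe, rather than a pandas type), while numpy.fromfile only returns the raw values of the variable. With that in mind, about 0.08 s against about 0.04 s is not much of a difference in practice for the current use case.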
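
The array returned by numpy.fromfile in the test above is flat, so it still has to be reshaped into a 2-D field. A minimal sketch of a helper for that follows. The name read_level is hypothetical, and the layout is an assumption: levels stored back to back, nlat*nlon values each, longitude varying fastest, no record headers, native byte order. A big-endian file would need a dtype such as '>f4'.

```python
import os
import tempfile

import numpy as np


def read_level(path, level, nlat, nlon, dtype=np.float32):
    """Read one horizontal level from a flat binary file as a (nlat, nlon) array.

    Assumes levels are stored back to back, each as nlat*nlon values
    in row-major order, with no record headers.
    """
    npts = nlat * nlon
    nbytes = npts * np.dtype(dtype).itemsize
    flat = np.fromfile(path, dtype=dtype, count=npts, offset=level * nbytes)
    if flat.size != npts:
        raise ValueError("file is too short for the requested level")
    return flat.reshape(nlat, nlon)


# Hypothetical small file: 3 levels on a 4 x 5 grid.
nlev, nlat, nlon = 3, 4, 5
full = np.arange(nlev * nlat * nlon, dtype=np.float32).reshape(nlev, nlat, nlon)
path = os.path.join(tempfile.mkdtemp(), "levels.bin")
full.tofile(path)

level1 = read_level(path, 1, nlat, nlon)
print(level1.shape)                     # (4, 5)
print(np.array_equal(level1, full[1]))  # True
```

The size check is there because a request that runs past the end of the file may otherwise come back shorter than expected.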

numpy.memmap

This one looks a bit more involved. The official documentation describes it as follows:

Create a memory-map to an array stored in a binary file on disk.
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
This subclass of ndarray has some unpleasant interactions with some operations, because it doesn’t quite fit properly as a subclass. An alternative to using this subclass is to create the mmap object yourself, then create an ndarray with ndarray.__new__ directly, passing the object created in its ‘buffer=’ parameter.
This class may at some point be turned into a factory function which returns a view into an mmap buffer.
Delete the memmap instance to close the memmap file.
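
The alternative mentioned in the documentation (creating the mmap object yourself and passing it through the buffer parameter) might look roughly like the sketch below. The demo file and grid sizes are made up for illustration, and the ndarray is built by calling np.ndarray(...) rather than ndarray.__new__ explicitly.

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical demo file: 3 levels on a 4 x 5 grid of float32 values.
nlev, nlat, nlon = 3, 4, 5
full = np.arange(nlev * nlat * nlon, dtype=np.float32).reshape(nlev, nlat, nlon)
path = os.path.join(tempfile.mkdtemp(), "mmap_demo.bin")
full.tofile(path)

level = 2
nbytes = nlat * nlon * np.dtype(np.float32).itemsize

with open(path, "rb") as f:
    # Map the whole file read-only; length 0 means "the entire file".
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Plain ndarray view on the mmap buffer, starting at the chosen level.
    view = np.ndarray((nlat, nlon), dtype=np.float32, buffer=mm,
                      offset=level * nbytes)
    result = view.copy()  # copy out so the data outlives the mapping
    del view              # drop the view before closing the mapping
    mm.close()

print(np.array_equal(result, full[2]))  # True
```

Copying the level before closing the mapping keeps the result usable afterwards, at the cost of holding that one level in memory.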

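Test script:

#coding=utf-8
import numpy as np
import time


fdir="./database/postvar202006020004400"  # path to the binary data file
nlat=501
nlon=751

# Read one level eagerly with fromfile
with open(fdir,'rb') as f:
    ts1=time.time()
    data=np.fromfile(f,dtype='float32',count=nlat*nlon,offset=nlat*nlon*4)
    te1=time.time()
    t1=te1-ts1
print('Elapsed:',t1)

# Map the same level; mode='r' opens the file read-only (the default is 'r+')
ts1=time.time()
data=np.memmap(fdir,dtype=np.float32,mode='r',offset=nlat*nlon*4,shape=(nlat,nlon) )
te1=time.time()
t1=te1-ts1
print('Elapsed:',t1)
del data  # deleting the memmap instance closes the mapping

Elapsed: 0.0011439323425292969
Elapsed: 0.0006759166717529297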

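memmap does not load the data into memory when the map is created, so it is indeed faster here. Keep in mind that the timing above covers only setting up the mapping; the values themselves are read from disk later, when the array is actually indexed.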
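
To see where the actual read happens with memmap, here is a minimal sketch that maps a whole file, pulls out one level, and checks it against numpy.fromfile. The demo file and the 3 x 4 x 5 layout are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file: 3 levels on a 4 x 5 grid of float32 values.
nlev, nlat, nlon = 3, 4, 5
full = np.arange(nlev * nlat * nlon, dtype=np.float32).reshape(nlev, nlat, nlon)
path = os.path.join(tempfile.mkdtemp(), "memmap_demo.bin")
full.tofile(path)

# Creating the map is cheap: no array data is loaded at this point.
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(nlev, nlat, nlon))

# Indexing selects a window of the file; np.array() forces the actual read
# and returns an ordinary in-memory copy of just that level.
level1 = np.array(mm[1])
del mm  # release the mapping

# The same level read eagerly with fromfile, for comparison.
nbytes = nlat * nlon * np.dtype(np.float32).itemsize
eager = np.fromfile(path, dtype=np.float32, count=nlat * nlon,
                    offset=1 * nbytes).reshape(nlat, nlon)

print(type(level1).__name__)          # ndarray
print(np.array_equal(level1, eager))  # True
```

For a fair timing comparison with fromfile, the np.array(...) step would need to be inside the timed region, since that is where the data is actually read.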
