Problem description:
I use python pandas to read a number of large CSV files and store them in a single HDF5 file; the resulting HDF5 file is about 10 GB. The problem occurs when reading it back: even though I try to read it back in chunks, I still get a MemoryError.
Here is how I create the HDF5 file:
import glob, os
import pandas as pd

hdf = pd.HDFStore('raw_sample_storage2.h5')
os.chdir("C:/RawDataCollection/raw_samples/PLB_Gate")
for filename in glob.glob("RD_*.txt"):
    raw_df = pd.read_csv(filename,
                         sep=' ',
                         header=None,
                         names=['time', 'GW_time', 'node_id', 'X', 'Y', 'Z',
                                'status', 'seq', 'rssi', 'lqi'],
                         dtype={'GW_time': uint32, 'node_id': uint8,
                                'X': uint16, 'Y': uint16, 'Z': uint16,
                                'status': uint8, 'seq': uint8,
                                'rssi': int8, 'lqi': uint8},
                         parse_dates=['time'],
                         date_parser=dateparse,
                         chunksize=50000,
                         skip_blank_lines=True)
    for chunk in raw_df:
        hdf.append('raw_sample_all', chunk, format='table', data_columns=True,
                   index=True, compression='blosc', complevel=9)
Here is how I try to read it back in chunks:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', chunksize=300000):
    print(df.head(1))
Here is the error message I get:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', chunksize=300000):
      2     print(df.head(1))

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_hdf(path_or_buf, key, **kwargs)
    321     store = HDFStore(path_or_buf, **kwargs)
    322     try:
--> 323         return f(store, True)
    324     except:
    325

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store, auto_close)
    303
    304     f = lambda store, auto_close: store.select(
--> 305         key, auto_close=auto_close, **kwargs)
    306
    307     if isinstance(path_or_buf, string_types):

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    663                            auto_close=auto_close)
    664
--> 665         return it.get_result()
    666
    667     def select_as_coordinates(

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in get_result(self, coordinates)
   1346                 "can only use an iterator or chunksize on a table")
   1347
-> 1348         self.coordinates = self.s.read_coordinates(where=self.where)
   1349
   1350         return self

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_coordinates(self, where, start, stop, **kwargs)
   3545         self.selection = Selection(
   3546             self, where=where, start=start, stop=stop, **kwargs)
-> 3547         coords = self.selection.select_coords()
   3548         if self.selection.filter is not None:
   3549             for field, op, filt in self.selection.filter.format():

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select_coords(self)
   4507             return self.coordinates
   4508
-> 4509         return np.arange(start, stop)
   4510
   4511 # utilities ###

MemoryError:
My Python environment:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
Edit 1:
The MemoryError occurred about half an hour after read_hdf() started. In the meantime I watched taskmgr: there was very little CPU activity, and total memory in use never exceeded 2.2 GB; before running the code it was about 2.1 GB. So whatever pandas read_hdf() loaded into RAM was less than 100 MB. (I have 4 GB of RAM, of which my 32-bit Windows system can only use 2.7 GB; I use the rest as a RAM disk.)
Here is the HDF file info:
In [2]: hdf = pd.HDFStore('raw_sample_storage2.h5')
        hdf
Out[2]:
File path: C:/RawDataCollection/raw_samples/PLB_Gate/raw_sample_storage2.h5
/raw_sample_all    frame_table  (typ->appendable, nrows->308581091, ncols->10, indexers->[index], dc->[time,GW_time,node_id,X,Y,Z,status,seq,rssi,lqi])
Also, I can read part of the HDF file by specifying 'start' and 'stop' instead of 'chunksize':
%%time
df = pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000)
print df.info()
print(df.head(5))
The execution took only 4 seconds, and the output was:
Int64Index: 300000 entries, 0 to 49999
Data columns (total 10 columns):
time       300000 non-null datetime64[ns]
GW_time    300000 non-null uint32
node_id    300000 non-null uint8
X          300000 non-null uint16
Y          300000 non-null uint16
Z          300000 non-null uint16
status     300000 non-null uint8
seq        300000 non-null uint8
rssi       300000 non-null int8
lqi        300000 non-null uint8
dtypes: datetime64[ns](1), int8(1), uint16(3), uint32(1), uint8(4)
memory usage: 8.9 MB
None
                 time   GW_time  node_id      X      Y      Z  status  seq  \
0 2013-10-22 17:20:58  39821761        3  20010  21716  22668       0   33
1 2013-10-22 17:20:58  39821824        4  19654  19647  19241       0   33
2 2013-10-22 17:20:58  39821888        1  16927  21438  22722       0   34
3 2013-10-22 17:20:58  39821952        2  17420  22882  20440       0   34
4 2013-10-22 17:20:58  39822017        3  20010  21716  22668       0   34

   rssi  lqi
0   -43   49
1   -72   47
2   -46   48
3   -57   46
4   -42   50
Wall time: 4.26 s
Noting that 300000 rows take up only 8.9 MB of RAM, I tried using chunksize together with start and stop:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all', start=0, stop=300000, chunksize=3000):
    print df.info()
    print(df.head(5))
The same MemoryError occurs.
I don't understand what is happening here. If the internal mechanism somehow ignores chunksize/start/stop and tries to load the whole thing into RAM, why is there almost no increase in RAM usage (only about 100 MB) when the MemoryError occurs? And why does the execution take half an hour just to reach an error at the very start of the process, with no noticeable CPU usage?
Solution:
So the iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True. These are row numbers. In this case there is no where clause, but we still use the indexer, which in this case is simply np.arange over the full list of rows.
An index for 300 million rows takes 2.2 GB, which is too much for 32-bit Windows (which generally maxes out around 1 GB). On 64-bit this would be no problem.
In [1]: np.arange(0, 300000000).nbytes / (1024 * 1024 * 1024.0)
Out[1]: 2.2351741790771484
So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. Issue opened here.
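As a rough sketch of the difference (this is not the pandas internals verbatim; nrows and chunksize are taken from the numbers above), compare materializing a row number for every row in the table with materializing row numbers only for the slice being read:

import numpy as np

nrows = 308581091        # total rows in the table, from the store info above
chunksize = 300000

# What read_coordinates effectively does when there is no where clause:
# one integer row number for every row in the table, allocated up front
# (roughly the 2.2 GB computed above).
# coords = np.arange(0, nrows)

# What slicing semantics would allow: row numbers for the current chunk only,
# a few MB at most.
coords_for_first_chunk = np.arange(0, chunksize)
print(coords_for_first_chunk.nbytes)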
So I would suggest this. Here the indexer is computed directly and this provides iterator semantics.
In [1]: df = DataFrame(np.random.randn(1000, 2), columns=list('AB'))

In [2]: df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

In [3]: store = pd.HDFStore('test.h5')

In [4]: nrows = store.get_storer('df').nrows

In [6]: chunksize = 100

In [7]: for i in xrange(nrows // chunksize + 1):
   ...:     chunk = store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
   ...:     # work on the chunk

In [8]: store.close()
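If you need this pattern in more than one place, the same start/stop loop can be wrapped in a small generator. This is only a convenience sketch around the workaround above: the helper name chunked_select is made up here, and it assumes the key refers to a table-format node so that positional start/stop selection works.

import pandas as pd

def chunked_select(path, key, chunksize=100000):
    # Hypothetical helper (not part of pandas): yield the table in row slices,
    # computing start/stop offsets directly instead of asking read_hdf for a
    # chunksize iterator, so no full row-number index is ever materialized.
    store = pd.HDFStore(path, mode='r')
    try:
        nrows = store.get_storer(key).nrows
        for start in range(0, nrows, chunksize):
            yield store.select(key, start=start, stop=start + chunksize)
    finally:
        store.close()

for chunk in chunked_select('raw_sample_storage2.h5', 'raw_sample_all', chunksize=300000):
    print(chunk.head(1))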