Why do I still get a MemoryError when reading a huge HDF5 file with pandas.read_hdf(), even though I read it in chunks by specifying chunksize?

Problem description:

I use Python pandas to read a few large CSV files and store them in an HDF5 file; the resulting HDF5 file is about 10 GB.

The problem happens when reading it back. Even though I tried to read it back in chunks, I still get MemoryError.

Here is how I created the HDF5 file:

import glob, os

import numpy as np
import pandas as pd

hdf = pd.HDFStore('raw_sample_storage2.h5')

os.chdir("C:/RawDataCollection/raw_samples/PLB_Gate")

for filename in glob.glob("RD_*.txt"):
    # dateparse is the date-parsing function for the 'time' column (its definition is not shown here)
    raw_df = pd.read_csv(filename,
                         sep=' ',
                         header=None,
                         names=['time', 'GW_time', 'node_id', 'X', 'Y', 'Z', 'status', 'seq', 'rssi', 'lqi'],
                         dtype={'GW_time': np.uint32, 'node_id': np.uint8, 'X': np.uint16, 'Y': np.uint16, 'Z': np.uint16, 'status': np.uint8, 'seq': np.uint8, 'rssi': np.int8, 'lqi': np.uint8},
                         parse_dates=['time'],
                         date_parser=dateparse,
                         chunksize=50000,
                         skip_blank_lines=True)
    # append each 50000-row chunk to the same table
    for chunk in raw_df:
        hdf.append('raw_sample_all', chunk, format='table', data_columns = True, index = True, compression='blosc', complevel=9)

Here is how I try to read it back in chunks:

for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
    print(df.head(1))

Here is the error message I got:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in ()
----> 1 for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', chunksize=300000):
      2     print(df.head(1))

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_hdf(path_or_buf, key, **kwargs)
    321     store = HDFStore(path_or_buf, **kwargs)
    322     try:
--> 323         return f(store, True)
    324     except:
    325

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in <lambda>(store, auto_close)
    303
    304     f = lambda store, auto_close: store.select(
--> 305         key, auto_close=auto_close, **kwargs)
    306
    307     if isinstance(path_or_buf, string_types):

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    663             auto_close=auto_close)
    664
--> 665         return it.get_result()
    666
    667     def select_as_coordinates(

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in get_result(self, coordinates)
   1346                 "can only use an iterator or chunksize on a table")
   1347
-> 1348             self.coordinates = self.s.read_coordinates(where=self.where)
   1349
   1350             return self

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in read_coordinates(self, where, start, stop, **kwargs)
   3545         self.selection = Selection(
   3546             self, where=where, start=start, stop=stop, **kwargs)
-> 3547         coords = self.selection.select_coords()
   3548         if self.selection.filter is not None:
   3549             for field, op, filt in self.selection.filter.format():

C:\Anaconda\lib\site-packages\pandas\io\pytables.pyc in select_coords(self)
   4507             return self.coordinates
   4508
-> 4509         return np.arange(start, stop)
   4510
   4511 # utilities ###

MemoryError:

My Python environment:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None

Edit 1:

It took about half an hour for the MemoryError to appear after I executed read_hdf(). In the meantime I checked taskmgr: there was little CPU activity, and total memory used never exceeded 2.2 GB. It was about 2.1 GB before I executed the code, so whatever pandas read_hdf() loaded into RAM was less than 100 MB. (I have 4 GB of RAM; my 32-bit Windows system can only use 2.7 GB of it, and I use the rest as a RAM disk.)

Here's the hdf file info:

In [2]:
hdf = pd.HDFStore('raw_sample_storage2.h5')
hdf

Out[2]:
File path: C:/RawDataCollection/raw_samples/PLB_Gate/raw_sample_storage2.h5
/raw_sample_all    frame_table  (typ->appendable,nrows->308581091,ncols->10,indexers->[index],dc->[time,GW_time,node_id,X,Y,Z,status,seq,rssi,lqi])

Moreover, I can read a portion of the HDF file by specifying 'start' and 'stop' instead of 'chunksize':

%%time
df = pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000)
print df.info()
print(df.head(5))

The execution only took 4 seconds, and the output is:

Int64Index: 300000 entries, 0 to 49999
Data columns (total 10 columns):
time       300000 non-null datetime64[ns]
GW_time    300000 non-null uint32
node_id    300000 non-null uint8
X          300000 non-null uint16
Y          300000 non-null uint16
Z          300000 non-null uint16
status     300000 non-null uint8
seq        300000 non-null uint8
rssi       300000 non-null int8
lqi        300000 non-null uint8
dtypes: datetime64[ns](1), int8(1), uint16(3), uint32(1), uint8(4)
memory usage: 8.9 MB
None
                  time   GW_time  node_id      X      Y      Z  status  seq  \
0  2013-10-22 17:20:58  39821761        3  20010  21716  22668       0   33
1  2013-10-22 17:20:58  39821824        4  19654  19647  19241       0   33
2  2013-10-22 17:20:58  39821888        1  16927  21438  22722       0   34
3  2013-10-22 17:20:58  39821952        2  17420  22882  20440       0   34
4  2013-10-22 17:20:58  39822017        3  20010  21716  22668       0   34

   rssi  lqi
0   -43   49
1   -72   47
2   -46   48
3   -57   46
4   -42   50
Wall time: 4.26 s

Noticing 300000 rows only took 8.9 MB RAM, I tried to use chunksize together with start and stop:

for df in pd.read_hdf('raw_sample_storage2.h5','raw_sample_all', start=0,stop=300000,chunksize = 3000):
    print df.info()
    print(df.head(5))

The same MemoryError happens.

I don't understand what's happening here. If the internal mechanism somehow ignores chunksize/start/stop and tries to load the whole thing into RAM, how come there's almost no increase in RAM usage (only 100 MB) when the MemoryError happens? And why does the execution take half an hour just to reach the error at the very beginning of the process, without noticeable CPU usage?

Solution

The iterator is built mainly to deal with a where clause. PyTables returns a list of the indices where the clause is True; these are row numbers. In this case there is no where clause, but we still use the indexer, which here is simply np.arange over the full range of rows. That is the np.arange(start, stop) call at the bottom of the traceback.

300MM rows take 2.2 GB, which is too much for 32-bit Windows (it generally maxes out around 1 GB). On 64-bit this would be no problem.

In [1]: np.arange(0,300000000).nbytes/(1024*1024*1024.0)
Out[1]: 2.2351741790771484

So this should be handled by slicing semantics, which would make this take only a trivial amount of memory. An issue has been opened for this.
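To make the difference concrete, here is a rough sketch (not pandas' actual implementation) that compares the size of a fully materialized row-number array with slice bounds produced one at a time. The helper names coordinate_array_bytes and slice_bounds are made up for illustration, and the 8 bytes per row assumes 64-bit integers, as in the np.arange calculation above.

```python
def coordinate_array_bytes(nrows, itemsize=8):
    """Bytes needed to hold one integer row number per row (assumes int64)."""
    return nrows * itemsize


def slice_bounds(nrows, chunksize):
    """Lazily yield (start, stop) pairs that cover rows 0 .. nrows-1."""
    for start in range(0, nrows, chunksize):
        yield start, min(start + chunksize, nrows)


nrows = 308581091  # row count reported by the HDFStore in the question
gib = coordinate_array_bytes(nrows) / float(1024 ** 3)
print("full coordinate array: %.2f GiB" % gib)

# With slicing, only one small (start, stop) tuple is needed at a time,
# no matter how many rows the table has.
first_three = []
for bounds in slice_bounds(nrows, 300000):
    first_three.append(bounds)
    if len(first_three) == 3:
        break
print(first_three)
```

The point of the sketch is only the contrast: the first quantity grows linearly with the number of rows, while the second approach needs a constant, tiny amount of memory per chunk.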

So I would suggest this. Here the indexer is computed directly and this provides iterator semantics.

In [1]: df = pd.DataFrame(np.random.randn(1000,2),columns=list('AB'))

In [2]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)

In [3]: store = pd.HDFStore('test.h5')

In [4]: nrows = store.get_storer('df').nrows

In [6]: chunksize = 100

In [7]: for i in xrange(nrows//chunksize + 1):
   ...:     chunk = store.select('df',
   ...:                          start=i*chunksize,
   ...:                          stop=(i+1)*chunksize)
   ...:     # work on the chunk

In [8]: store.close()
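The loop above could also be wrapped in a generator so that it reads like the chunksize iterator. This is only a sketch: iter_table_chunks is a hypothetical helper, not a pandas function. It relies only on get_storer(key).nrows and select(key, start=..., stop=...), the same calls used above, and it steps with range(0, nrows, chunksize), so no empty trailing chunk is produced when nrows is an exact multiple of chunksize.

```python
def iter_table_chunks(store, key, chunksize):
    """Yield consecutive row slices of the table stored under `key`.

    `store` is expected to be an open pandas.HDFStore (or any object with
    the same get_storer/select methods). This helper is hypothetical and
    is not part of pandas.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    nrows = store.get_storer(key).nrows
    for start in range(0, nrows, chunksize):
        stop = min(start + chunksize, nrows)
        # Only rows [start, stop) are requested; no full array of
        # row numbers is built up front.
        yield store.select(key, start=start, stop=stop)
```

With the file from the question, this might be used by opening store = pd.HDFStore('raw_sample_storage2.h5'), looping with for chunk in iter_table_chunks(store, 'raw_sample_all', 300000), and calling store.close() afterwards.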
