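HDF5 is a high-performance file format designed for storing large arrays of data, including tabular data. pandas' HDFStore class stores a DataFrame in an HDF5 file so that it can be accessed efficiently while preserving column dtypes and other metadata. It is a dict-like class, so you can read from and write to it much as you would a Python dict object.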
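HDF5 supports compressed storage. pandas selects the compression library through the complib argument: blosc is one of the fast options, while zlib is what pandas uses by default when complevel is set and complib is omitted. Compression saves disk space, and the main cost is that reads and writes get slightly slower.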
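The benefit of compression is barely noticeable on small data and only shows up as the data grows. One caveat when writing: while a store is open you can call put and append as often as you like, but a file opened with mode 'w' is truncated, so reopening a closed file with 'w' discards everything written earlier. To keep writing to an existing file after closing it, reopen it with mode 'a', which is the HDFStore default.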
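As a minimal sketch of that write, close, and reopen cycle: the file name and the two small frames below are made up for illustration, and the example assumes PyTables is installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "reopen_demo.h5")

first = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])
second = pd.DataFrame(np.arange(6, 12).reshape(3, 2), columns=["A", "B"])

# First session: mode="w" creates (or truncates) the file.
with pd.HDFStore(path, mode="w") as store:
    store.put("key1", first, format="table")

# Second session: mode="a" keeps the existing content and allows more writes.
with pd.HDFStore(path, mode="a") as store:
    store.append("key1", second)               # add rows to the existing table
    store.put("key2", second, format="table")  # add a new key
    reopened_keys = sorted(store.keys())
    total_rows = len(store["key1"])

print(reopened_keys)
print(total_rows)
```

Using the store as a context manager closes the file at the end of each block, so the second block really is a fresh open of a closed file.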
# Read from the store, close it if we opened it.
read_hdf(path_or_buf[, key, mode])
# Store object in HDFStore
HDFStore.put(self, key, value[, format, append])
# Append to Table in file.
HDFStore.append(self, key, value[, format, …])
# Retrieve pandas object stored in file
HDFStore.get(self, key)
# Retrieve pandas object stored in file, optionally based on where criteria
HDFStore.select(self, key[, where, start, …])
# Print detailed information on the store.
HDFStore.info(self)
# Return a (potentially unordered) list of the keys corresponding to the objects stored in the HDFStore.
HDFStore.keys(self)
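The select and read_hdf entries above are easier to follow with a small query. This is only a sketch under assumptions: the file name and the frame are invented for the demo, and the where string relies on the columns being stored as data columns in table format.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "select_demo.h5")
df = pd.DataFrame({"A": np.arange(10), "B": np.arange(10) * 2.0})

with pd.HDFStore(path, mode="w") as store:
    # data_columns=True makes every column usable in a where expression.
    store.put("key1", df, format="table", data_columns=True)
    subset = store.select("key1", where="A > 6")         # filter on a column
    first_rows = store.select("key1", start=0, stop=3)   # read a row range

# read_hdf opens the file, reads the key, and closes the file again.
whole = pd.read_hdf(path, key="key1")

print(subset)
print(len(first_rows), len(whole))
```

Filtering inside select means only the matching rows are loaded, which matters once the table no longer fits comfortably in memory.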
import numpy as np
import pandas as pd
# Open an HDF5 file for writing ('w' overwrites any existing file)
hdf = pd.HDFStore('test.hdf','w')
df1 = pd.DataFrame(np.random.standard_normal((3,2)), columns=['A','B'])
df2 = pd.DataFrame(np.random.standard_normal((3,2)), columns=['A','B'])
hdf.put(key='key1', value=df1, format='table', data_columns=True)
hdf.put(key='key2', value=df2, format='table', data_columns=True)
print(hdf.info())
File path: test.hdf
/key1 frame (shape->[3,2])
/key2 frame (shape->[3,2])
print(hdf.keys())
['/key1', '/key2']
print(hdf.get('key1')) # equal to hdf['key1']
print(hdf.get('key2')) # equal to hdf['key2']
A B
0 0.257239 1.684300
1 0.076235 -0.071744
2 -0.266105 -0.874081
A B
0 1.178982 -0.517734
1 0.713010 -0.484248
2 0.741703 -0.650327
# append df2 to the dataset of key1
hdf.append(key='key1', value=df2, format='table', data_columns=True)
print(hdf.get('key1'))
A B
0 0.257239 1.684300
1 0.076235 -0.071744
2 -0.266105 -0.874081
0 1.178982 -0.517734
1 0.713010 -0.484248
2 0.741703 -0.650327
large_data = pd.DataFrame(np.random.standard_normal((90000000,4)))
# Store without compression:
hdf1 = pd.HDFStore('test1.h5','w')
hdf1.put(key='data', value=large_data)
hdf1.close()
# Store with compression:
hdf2 = pd.HDFStore('test2.h5','w', complevel=4, complib='blosc')
hdf2.put(key='data', value=large_data)
hdf2.close()
Judging by the results, test2.h5 is 700 MB smaller than test1.h5, so compression does save storage space.
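To check the size difference on your own machine without allocating a multi-gigabyte frame, a scaled-down comparison along these lines may help. It is a sketch, not a benchmark: the file names are hypothetical, and the data is deliberately repetitive so the effect is visible at a small size. Random floats like the ones above typically compress much less.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file names in a temporary directory.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

# Deliberately repetitive data (an assumption made for this demo).
data = pd.DataFrame(np.tile(np.arange(4.0), (200_000, 1)))

# Same put call as above, first without compression...
with pd.HDFStore(plain_path, mode="w") as store:
    store.put("data", data)

# ...then with blosc at level 4, matching the settings used for test2.h5.
with pd.HDFStore(packed_path, mode="w", complevel=4, complib="blosc") as store:
    store.put("data", data)

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(f"uncompressed: {plain_size} bytes, compressed: {packed_size} bytes")
```

Comparing os.path.getsize on the two files gives the saving directly, and swapping in your real data shows how well it compresses in practice.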
https://glowingpython.blogspot.com/2014/08/quick-hdf5-with-pandas.html
https://pandas.pydata.org/pandas-docs/stable/reference/io.html#hdfstore-pytables-hdf5
https://realpython.com/fast-flexible-pandas/#this-tutorial