Building a Deep Learning Dataset with h5py

Introduction

Deep learning datasets often contain hundreds of thousands of samples, and trying to load an entire dataset at once frequently crashes the program with a MemoryError. After looking into many options, I found that Python's h5py package is very convenient for this: data stored in an HDF5 file can be read lazily, so opening the file does not pull the whole dataset into memory.
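For example, opening an HDF5 file with h5py only creates a handle; bytes are read from disk only when a dataset is actually sliced. A minimal sketch (the file name and dataset name here are just placeholders):

import h5py

h5f = h5py.File("data/train_data.h5", "r")  # opening reads no data
x_train = h5f["x_train"]   # a lazy handle into the file, still nothing in memory
batch = x_train[0:32]      # only these 32 samples are read from disk
print(x_train.shape, batch.shape)
h5f.close()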


Example

While doing transfer learning with ResNet for image classification, I initially loaded all the images straight into memory, and the program failed with an out-of-memory error. After trying many approaches, h5py finally solved the problem. The core code for building the dataset is as follows:

import h5py
import numpy as np
from scipy import ndimage  # ndimage.imread is available in older SciPy releases

# image_path (a list of image file paths), label (one-hot labels), and
# preprocess_input (e.g. from keras.applications) are assumed to be
# defined earlier in the script.

for times in range(539):  # more than 50,000 images in total, stored 100 per batch
    if times == 0:
        h5f = h5py.File("data/train_data.h5", "w")  # create the file
        # resizable datasets: maxshape=None on axis 0 allows growing them later
        x = h5f.create_dataset("x_train", (100, 299, 299, 3),
                               maxshape=(None, 299, 299, 3),
                               dtype=np.float32)  # build the x_train dataset
        y = h5f.create_dataset("y_train", (100, 80),
                               maxshape=(None, 80),
                               dtype=np.int32)    # build the y_train dataset
    else:
        h5f = h5py.File("data/train_data.h5", "a")  # append mode
        x = h5f["x_train"]
        y = h5f["y_train"]

    # read the next batch of images from disk and preprocess them
    image = np.array(list(map(lambda p: ndimage.imread(p, mode='RGB'),
                              image_path[times*100:(times+1)*100]))).astype(np.float32)
    image = preprocess_input(image)
    ytem = label[times*100:(times+1)*100]  # the matching labels

    if times != 538:  # a full batch of 100
        x.resize([times*100 + 100, 299, 299, 3])
        y.resize([times*100 + 100, 80])
        x[times*100:times*100 + 100] = image
        y[times*100:times*100 + 100] = ytem  # write the batch into the file
        print('batch %d processed' % times)
    else:  # the last batch holds only 79 images, since the total is not divisible by 100
        x.resize([times*100 + 79, 299, 299, 3])
        y.resize([times*100 + 79, 80])
        x[times*100:times*100 + 79] = image
        y[times*100:times*100 + 79] = ytem
        print('batch %d processed' % times)

    h5f.close()  # close so the file can be reopened cleanly on the next iteration
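The key point above is creating the datasets with maxshape=(None, ...), which makes axis 0 resizable, so each batch can be appended with resize followed by a slice assignment. To train on the result without loading everything at once, the file can then be read back batch by batch. Below is a minimal sketch assuming a Keras-style model; hdf5_batches, batch_size, and the fit_generator call are illustrative, not part of the original code:

import h5py

def hdf5_batches(path, batch_size=100):
    # Yield (x, y) batches forever; each slice reads only that batch from disk.
    h5f = h5py.File(path, "r")
    x, y = h5f["x_train"], h5f["y_train"]
    n = x.shape[0]
    while True:
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield x[start:end], y[start:end]

# e.g. with a Keras model:
# model.fit_generator(hdf5_batches("data/train_data.h5"),
#                     steps_per_epoch=539, epochs=10)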
