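Python: processing a dataset by merging multiple folders and randomly splitting it into train/valid/test

I needed to reshape a dataset to fit my own requirements.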

import os
import shutil
import random
# Randomly pick images from xxx_dataset to build a new dataset:
# 60% train
# 20% valid
# 20% test
path = '/remote-home/my/Database/xxx_dataset/all/images'
train = '/remote-home/my/Database/xxx_new/train'
valid = '/remote-home/my/Database/xxx_new/valid'
test = '/remote-home/my/Database/xxx_new/test'
dirpath = '/remote-home/my/Database/xxx_new/all'
filenames = os.listdir(path)
print(filenames)
print(len(filenames))

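os.listdir() returns the names of the entries (files and subdirectories) in the given directory.

Here my images folder contains three subfolders, A, B and C, so the output is: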

['A', 'B', 'C']
3
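Note that os.listdir() only returns bare names from the top level, so the count above is the number of subfolders, not the number of images. A minimal sketch of telling the two apart, assuming a throwaway temporary tree that mimics my layout (the helper name list_entries and the file 0001.png are made up for illustration):

```python
import os
import tempfile

def list_entries(directory):
    """Return (subdirectory names, file names) for the top level of `directory`."""
    subdirs, files = [], []
    for name in sorted(os.listdir(directory)):
        # os.listdir() yields bare names, so join them with the parent before testing.
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            subdirs.append(name)
        else:
            files.append(name)
    return subdirs, files

# Throwaway tree: <tmp>/A, <tmp>/B, <tmp>/C, with one fake image inside A.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("A", "B", "C"):
        os.makedirs(os.path.join(tmp, sub))
    open(os.path.join(tmp, "A", "0001.png"), "w").close()
    demo_subdirs, demo_files = list_entries(tmp)

print(demo_subdirs)  # ['A', 'B', 'C']
print(demo_files)    # [] because the image sits one level down, inside A
```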

I. Merge the contents of folders A, B and C and copy them to the target path:

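os.makedirs(dirpath, exist_ok=True)  # shutil.copy() will not create the target folder
for root, dirs, files in os.walk(path):
    for name in files:
        # Only copy PNG images; anything else is skipped.
        if name.endswith('.png'):
            file_path = os.path.join(root, name)
            # Files that share a name across A/B/C overwrite one another here.
            new_file_path = os.path.join(dirpath, name)
            shutil.copy(file_path, new_file_path)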

1:for root, dirs, files in os.walk():

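root is the directory currently being visited, dirs is the list of subdirectory names inside it, and files is the list of file names inside it.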
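To see what the three values look like in practice, here is a small sketch on a temporary tree. The folder and file names (a1.png, b1.png, notes.txt) and the helper collect_png_paths are invented for the demo and are not part of my dataset:

```python
import os
import tempfile

def collect_png_paths(top):
    """Walk `top` recursively and return the sorted full paths of all .png files."""
    found = []
    for root, dirs, files in os.walk(top):
        # root:  the directory being visited on this iteration
        # dirs:  names of its immediate subdirectories
        # files: names of the files directly inside it
        for name in files:
            if name.lower().endswith(".png"):
                found.append(os.path.join(root, name))
    return sorted(found)

with tempfile.TemporaryDirectory() as tmp:
    for sub, name in (("A", "a1.png"), ("B", "b1.png"), ("C", "notes.txt")):
        os.makedirs(os.path.join(tmp, sub))
        open(os.path.join(tmp, sub, name), "w").close()
    # os.walk() visits the top folder itself plus every subfolder.
    walk_roots = sorted(os.path.relpath(root, tmp) for root, _, _ in os.walk(tmp))
    png_names = [os.path.relpath(p, tmp) for p in collect_png_paths(tmp)]

print(walk_roots)  # ['.', 'A', 'B', 'C']
print(png_names)   # the two PNG files, as paths relative to the top folder
```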

2:shutil.copy()

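When copying with this function, pay close attention to the types of the arguments: both src and dst must be paths (a string or path-like object), not a list of file names. If dst is an existing directory, the file is copied into it under its original name.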
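The two ways of passing the destination can be sketched as follows. This is only a demo on a temporary folder; cat.png, renamed.png and out are placeholder names I picked for illustration:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "cat.png")
    with open(src, "w") as f:
        f.write("fake image bytes")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)  # shutil.copy() does not create missing directories

    # dst is a full file path: the copy gets exactly that name.
    copied_as_file = shutil.copy(src, os.path.join(out_dir, "renamed.png"))
    # dst is an existing directory: the copy keeps the source's base name.
    copied_into_dir = shutil.copy(src, out_dir)

    copied_names = sorted(os.listdir(out_dir))

print(copied_names)  # ['cat.png', 'renamed.png']
```

Both calls return the path of the newly created file, which is handy for logging what was copied where.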

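II. Build train/valid/test. Note: shuffle the dataset first, then split it.

3: How to shuffle

Shuffle the list of image indices, then copy each image to its destination folder according to the shuffled order.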

files = sorted(os.listdir(dirpath))  # list the merged folder once, in a fixed order
num = len(files)
num_train = int(num * 0.6)
num_valid = int(num * 0.2) + 1  # 20% for validation, plus one extra file
index_list = list(range(num))
print(index_list)

random.shuffle(index_list)
print(index_list)
for dest in (train, valid, test):
    os.makedirs(dest, exist_ok=True)  # shutil.copy() will not create these folders
# The position in the shuffled list decides the split, not the index value itself;
# comparing the index value against num_train would ignore the shuffle entirely.
for pos, i in enumerate(index_list):
    if pos < num_train:
        dest = train
    elif pos < num_train + num_valid:
        dest = valid
    else:
        dest = test
    file_path = os.path.join(dirpath, files[i])
    new_file_path = os.path.join(dest, files[i])
    shutil.copy(file_path, new_file_path)
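The shuffle-then-split idea can also be kept separate from the file copying, which makes it easy to check on a plain list of names. The following is a rough sketch rather than the script I actually ran: split_names, its seed parameter and the img_000.png style names are all assumptions made up for the example.

```python
import random

def split_names(names, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle `names` and return (train, valid, test); test takes the remainder."""
    shuffled = sorted(names)  # fixed starting order, independent of listing order
    # A private Random instance keeps the split reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train_part = shuffled[:n_train]
    valid_part = shuffled[n_train:n_train + n_valid]
    test_part = shuffled[n_train + n_valid:]
    return train_part, valid_part, test_part

names = [f"img_{i:03d}.png" for i in range(10)]
train_names, valid_names, test_names = split_names(names)
print(len(train_names), len(valid_names), len(test_names))  # 6 2 2
```

Each returned list can then be passed to a copy loop with shutil.copy(). Slicing the shuffled list guarantees that the three parts are disjoint and together cover every file.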

 
