Splitting Pascal VOC data into training, validation, and test sets by sampling with pandas.DataFrame.sample

First, look at how the sample function is used.

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
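The method is available on both Series and DataFrame objects.
n and frac serve the same purpose and cannot be passed together: n is an integer giving the number of items to sample, and frac is a float giving the fraction of items to sample. If neither is given, one item is sampled.
replace controls whether sampling is done with replacement. With replace=False (the default) each item can be drawn at most once; with replace=True the same item can be drawn more than once.
weights assigns a sampling weight to each element along the axis being sampled, so the length of weights must match the length of that axis, otherwise an error is raised. Missing weights are treated as 0. If the weights do not sum to 1, they are normalized so that they do. inf and -inf are not allowed.
axis is the direction of sampling: axis=0 samples rows, axis=1 samples columns.
random_state fixes the random seed so that a result can be reproduced.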
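As a small illustration of n, frac, replace, and random_state, here is a minimal sketch on a made-up ten-row frame. The name `demo` and its `value` column are assumptions for the example only and are not part of the VOC script.

```python
import pandas as pd

# A made-up ten-row frame used only for this illustration
demo = pd.DataFrame({'value': range(10)})

# n draws an exact number of rows; frac draws a proportion of them
by_count = demo.sample(n=3, random_state=0)
by_fraction = demo.sample(frac=0.3, random_state=0)

# Passing the same random_state reproduces the same draw
repeat = demo.sample(n=3, random_state=0)

# Without replacement every sampled row is distinct; with replacement
# a row may be drawn more than once, so n may exceed the number of rows
without_replacement = demo.sample(n=10, random_state=0)
with_replacement = demo.sample(n=20, replace=True, random_state=0)

print(len(by_count), len(by_fraction))
print(by_count.index.equals(repeat.index))
print(without_replacement.index.is_unique, with_replacement.index.is_unique)
```

Drawing 20 rows from a 10-row frame only works with replace=True, and the result then necessarily contains repeated index labels.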

import numpy as np
import pandas as pd
np.random.seed(99999)
df = pd.DataFrame(np.random.randn(90, 9), columns=list('ABCDEFGHK'))
df.head()
Out: 
          A         B         C    ...            G         H         K
0  0.624094  1.274963 -1.659604    ...    -0.769793 -0.563945  0.643137
1 -1.856903  0.065747 -0.334846    ...     0.914018 -0.540344 -1.127328
2 -2.245613 -0.277435  0.177650    ...    -1.921343  0.851318 -0.344258
3 -0.140641  0.532826 -0.289172    ...     0.739484  0.129123 -1.975035
4  0.957110  0.602097  0.521329    ...    -0.943183 -1.005068  1.044988
[5 rows x 9 columns]
df.sample(frac=0.1, replace=True)
Out: 
           A         B         C    ...            G         H         K
44  1.862082 -0.742141  1.131788    ...     0.442055  0.809571 -0.686909
68 -0.378466  0.506277 -0.740573    ...    -0.711384 -0.552085 -0.066600
42  1.315511  0.221039 -0.861310    ...    -2.074313  2.001360  1.495279
78 -0.260060  1.182463  0.763482    ...     1.713880 -1.699808  1.064469
11  0.122260  0.302464  1.244212    ...     0.716719  0.090439 -0.345576
36  1.337259  0.305582 -0.607342    ...     1.858619 -0.478186 -0.229918
49 -0.026310  0.411402  0.244130    ...     1.116335  0.644520  0.212567
33 -0.082600 -0.745143  1.002488    ...     1.726909  0.866196  0.978496
3  -0.140641  0.532826 -0.289172    ...     0.739484  0.129123 -1.975035
[9 rows x 9 columns]
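The weights and axis parameters can be tried in the same way. The sketch below is only an illustration: the frame `small`, the list `row_weights`, and the seed values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# A made-up 4 x 3 frame used only for this illustration
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('ABC'))

# One weight per row; a weight of 0 means the row is never drawn.
# These weights sum to 2, so pandas rescales them to sum to 1.
row_weights = [1, 1, 0, 0]
picked_rows = small.sample(n=2, weights=row_weights, random_state=1)

# axis=1 samples columns instead of rows
picked_cols = small.sample(n=2, axis=1, random_state=1)

print(sorted(picked_rows.index))
print(picked_cols.shape)
```

Because only rows 0 and 1 have a non-zero weight, sampling two rows without replacement can only return those two rows.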

 

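The overall approach is to read the image names from the JPEGImages folder of the VOC-format dataset, put them into a DataFrame, sample from it with DataFrame.sample, and save the results to the Main folder.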

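# encoding=utf-8
import os
import pandas as pd

# os.listdir() and open() do not expand '~', so expand it explicitly
path = os.path.expanduser('~/VOCdevkit/VOC2007/JPEGImages')

# Read the file names in JPEGImages, dropping the extension
lis = [os.path.splitext(i)[0] for i in os.listdir(path)]
df = pd.DataFrame(lis, columns=['name'])            # build the pandas table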
temp1 = df.sample(100)                              # randomly sample 100 names as the training set
train = temp1['name'].values.tolist()
print(len(train))
# Save as train.txt, one image name per line as in the VOC ImageSets files
with open(os.path.expanduser('~/VOCdevkit/VOC2007/ImageSets/Main/train.txt'), 'w') as f:
    f.write('\n'.join(train))

# Remove the names already in train.txt and sample the test set from the rest
dft = pd.DataFrame(list(set(df.name.values) - set(temp1.name.values)), columns=['name'])
print(len(dft))
temp2 = dft.sample(10)
test = temp2['name'].values.tolist()
print(len(test))
pathi = os.path.expanduser('~/VOCdevkit/VOC2007/JPEGImages')
# Save as test.txt, one image name per line
with open(os.path.expanduser('~/VOCdevkit/VOC2007/ImageSets/Main/test.txt'), 'w') as f:
    f.write('\n'.join(test))
# Copy every test image, including the last one, into the test folder
for i in test:
    os.system('cp %s %s' % (os.path.join(pathi, i + '.jpg'), '/home/xin/dogfacedetect/models/data/hashiqi/test'))

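# Everything left after removing the test names becomes the validation set
val = list(set(dft.name.values) - set(temp2.name.values))
print(len(val))
assert len(val) == 10   # holds only when JPEGImages contains exactly 120 images
# Save as val.txt, one image name per line
with open(os.path.expanduser('~/VOCdevkit/VOC2007/ImageSets/Main/val.txt'), 'w') as f:
    f.write('\n'.join(val))

print(len(set(train) & set(val)))   # overlap between train and val; should be 0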
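The same three-way split can also be expressed with frac and a fixed random_state, which makes it reproducible and independent of the exact number of images. This is only a sketch of an alternative, not the script above: the 80/10/10 ratio, the seed 42, the frame `ids`, and the generated image IDs are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical image IDs standing in for the names read from JPEGImages
ids = pd.DataFrame({'name': ['%06d' % i for i in range(1, 121)]})

# Draw 80% for training, then split the remainder evenly into test and val
train = ids.sample(frac=0.8, random_state=42)
rest = ids.drop(train.index)
test = rest.sample(frac=0.5, random_state=42)
val = rest.drop(test.index)

print(len(train), len(test), len(val))

# The three subsets should be disjoint and together cover every image
names = [set(part['name']) for part in (train, test, val)]
print(names[0] & names[1], names[0] & names[2], names[1] & names[2])
print(len(names[0] | names[1] | names[2]) == len(ids))
```

Dropping the sampled rows by index plays the role of the set difference used in the script, while keeping everything in a DataFrame.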

 
