Original text:
58388 Train, 2000 Test, 2000 Validation images, 224 X 224 X 3, jpg format
Data set of 400 bird species: 58388 training images, 2000 test images (5 images per species) and 2000 validation images (5 images per species). This is a very high quality dataset where there is only one bird in each image and the bird typically takes up at least 50% of the pixels in the image. As a result, even a moderately complex model will achieve training and test accuracies in the mid 90% range.
All images are 224 X 224 X 3 color images in jpg format. The data set includes a train set, a test set and a validation set. Each set contains 400 subdirectories, one for each bird species. This structure is convenient if you use the Keras ImageDataGenerator.flow_from_directory method to create the train, test and valid data generators. The data set also includes a file, Bird Species.csv. This csv file contains three columns; the sample rows further down also show a leading class index column. The filepaths column contains the file path to an image file. The labels column contains the class name associated with the image file. The data set column, judging by the sample rows, names the set the image belongs to. Reading the file with df = pandas.read_csv("Bird Species.csv") creates a pandas dataframe, which can then be split into traindf, testdf and validdf dataframes to create your own partitioning of the data into train, test and valid data sets.
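As a minimal sketch of the csv workflow described above, the snippet below loads a few rows laid out like the sample table and splits them on the data set column. It assumes the file is comma-separated with exactly these header names, and that the test and validation rows are marked "test" and "valid" with paths like the ones shown; only the "train" rows appear in the sample, so the other two rows are illustrative. The helper name split_by_data_set is made up for this example.

```python
import io

import pandas as pd

# A few rows in the layout of the sample table. The "test" and "valid"
# rows are assumptions: the sample only shows rows marked "train".
SAMPLE_CSV = """class index,filepaths,labels,data set
0,train/ABBOTTS BABBLER/001.jpg,ABBOTTS BABBLER,train
0,train/ABBOTTS BABBLER/002.jpg,ABBOTTS BABBLER,train
0,test/ABBOTTS BABBLER/1.jpg,ABBOTTS BABBLER,test
0,valid/ABBOTTS BABBLER/1.jpg,ABBOTTS BABBLER,valid
"""


def split_by_data_set(df):
    """Return (train_df, test_df, valid_df) based on the 'data set' column."""
    parts = {
        name: group.reset_index(drop=True)
        for name, group in df.groupby("data set")
    }
    return parts["train"], parts["test"], parts["valid"]


# For the real file this would be: df = pd.read_csv("Bird Species.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
train_df, test_df, valid_df = split_by_data_set(df)
print(len(train_df), len(test_df), len(valid_df))
```

Splitting on the existing column reproduces the author's partition; a different partition would ignore that column and sample rows per label instead.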
NOTE: The test and validation images in the data set were hand selected to be the "best" images, so your model will probably score higher on those sets than on test and validation sets you create yourself. However, the latter gives a more accurate picture of model performance on unseen images.
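One way to build your own test and validation sets is to pool all files and draw a fixed number per species at random. The sketch below does this in plain Python on (filepath, label) pairs, which could come from the filepaths and labels columns of the csv file. The function name stratified_split and its parameters are hypothetical, and drawing the same count from every species is an assumption that mirrors the 5-per-species layout of the supplied sets.

```python
import random
from collections import defaultdict


def stratified_split(records, n_test, n_valid, seed=0):
    """Split (filepath, label) pairs so that every label contributes
    n_test test files and n_valid validation files; the rest are training."""
    by_label = defaultdict(list)
    for filepath, label in records:
        by_label[label].append(filepath)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test, valid = [], [], []
    for label in sorted(by_label):
        files = sorted(by_label[label])
        if len(files) <= n_test + n_valid:
            raise ValueError(f"not enough images for {label!r}")
        rng.shuffle(files)
        test += [(f, label) for f in files[:n_test]]
        valid += [(f, label) for f in files[n_test:n_test + n_valid]]
        train += [(f, label) for f in files[n_test + n_valid:]]
    return train, test, valid
```

Because each file lands in exactly one of the three lists, no image is shared between the sets.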
Images were gathered from internet searches by species name. Once the image files for a species were downloaded, they were checked for duplicates using a Python duplicate image detector program I developed. All duplicates detected were deleted in order to prevent images being common to the training, test and validation sets.
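The author's duplicate detector is not shown, so the following is only a stand-in that illustrates the idea: it groups jpg files by a SHA-256 hash of their bytes. This catches byte-identical copies only; resized or re-encoded copies of the same photo would need a perceptual comparison, which this sketch does not attempt. The function name find_exact_duplicates is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Return groups of .jpg files under root whose bytes are identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path)
    # Only digests seen more than once indicate duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Running it over the folder that holds all three sets, rather than each set separately, is what reveals images shared between train, test and validation.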
After that, the images were cropped so that the bird occupies at least 50% of the pixels in the image. Then the images were resized to 224 X 224 X 3 in jpg format. The cropping ensures that when processed by a CNN there is adequate information in the images to create a highly accurate classifier. Even a moderately robust model should achieve training, validation and test accuracies in the high 90% range. Because of the large size of the dataset, I recommend that if you train a model you use an image size of 150 X 150 X 3 in order to reduce training time. All files were also numbered sequentially starting from one for each species. So test images are named 1.jpg to 5.jpg, and similarly for validation images. Training images are also numbered sequentially, with zero padding. For example: 001.jpg, 002.jpg ... 010.jpg, 011.jpg ... 099.jpg, 100.jpg, 102.jpg etc. The zero padding preserves the file order when used with Python file functions and Keras flow_from_directory.
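The effect of zero padding on file order can be seen with a plain string sort, which is how most Python file listings end up ordered once sorted. The helper padded_name below is a hypothetical illustration, with the width of 3 taken from the 001.jpg style naming described above.

```python
def padded_name(index, width=3, ext=".jpg"):
    """Build a zero-padded file name such as 001.jpg."""
    return f"{index:0{width}d}{ext}"


indices = (1, 2, 10, 100)
unpadded = sorted(f"{i}.jpg" for i in indices)
padded = sorted(padded_name(i) for i in indices)

print(unpadded)  # ['1.jpg', '10.jpg', '100.jpg', '2.jpg']
print(padded)    # ['001.jpg', '002.jpg', '010.jpg', '100.jpg']
```

Without padding, 10.jpg and 100.jpg sort ahead of 2.jpg; with padding, string order matches numeric order. The test and validation names 1.jpg to 5.jpg are single digits, so they sort correctly as they are.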
The training set is not balanced, having a varying number of files per species. However, each species has at least 120 training image files. This imbalance did not affect my kernel classifier, as it achieved over 98% accuracy on the test set.
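The author reports that the imbalance did not hurt accuracy, so the following is optional. If you want to check the imbalance or compensate for it, one common approach is inverse-frequency class weights, sketched here under the assumption that you have a list with one label per training image (for example the labels column of the train rows). The function name class_weights is hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by total / (number_of_classes * class_count),
    so rarer classes get weights above 1 and common ones below 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}
```

With 120 images of one species and 180 of another, the weights come out as 1.25 and about 0.83, so each class contributes equally in aggregate.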
One significant imbalance in the data set is the ratio of male images to female images. About 85% of the images are of the male and 15% of the female. Males are typically far more diversely colored, while the females of a species are typically bland. Consequently, male and female images may look entirely different. Almost all test and validation images are taken from the male of the species. Consequently, the classifier may not perform as well on images of females.
class index | filepaths | labels | data set
0 | train/ABBOTTS BABBLER/001.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/002.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/003.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/004.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/005.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/006.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/007.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/008.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/009.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/010.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/011.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/012.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/013.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/014.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/015.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/016.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/017.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/018.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/019.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/020.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/021.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/022.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/023.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/024.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/025.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/026.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/027.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/028.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/029.jpg | ABBOTTS BABBLER | train
0 | train/ABBOTTS BABBLER/030.jpg | ABBOTTS BABBLER | train
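You can download the dataset from the official site; I have also shared a copy on Baidu Netdisk. Follow my WeChat official account and reply "202204" to get the download link.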