Kept in sync with: https://github.com/zhulf0804/Detection-Segmentation-PointCloud-Papers/blob/master/Segmentation.md
This is only a brief introduction to some semantic segmentation datasets. More detailed information and more semantic segmentation datasets will be updated in the future.
Overview: The PASCAL VOC 2012 dataset contains 20 object classes, grouped as follows:
Group | Classes |
---|---|
Person | person |
Animal | bird, cat, cow, dog, horse, sheep |
Vehicle | aeroplane, bicycle, boat, bus, car, motorbike, train |
Indoor | bottle, chair, dining table, potted plant, sofa, tv/monitor |
The dataset supports three main object recognition competitions: classification, detection, and segmentation.
The dataset directory is as follows:
|--VOCdevkit
    |--VOC2012(trainval)
        |--Annotations (17,125)
            |--*.xml
        |--ImageSets
            |--Action (33)
                |--*.txt
            |--Layout
                |--train.txt
                |--trainval.txt
                |--val.txt
            |--Main (63)
                |--*.txt
            |--Segmentation
                |--train.txt (1464 lines)
                |--trainval.txt (2913 lines)
                |--val.txt (1449 lines)
        |--JPEGImages (17,125)
            |--*.jpg
        |--SegmentationClass (2913)
            |--*.png
        |--SegmentationObject (2913)
            |--*.png
    |--VOC2012(test)
        |--Annotations (5,138)
            |--*.xml
        |--ImageSets
            |--Action (11)
                |--*.txt
            |--Layout
                |--test.txt
            |--Main (21)
                |--*.txt
            |--Segmentation
                |--test.txt (1456 lines)
        |--JPEGImages (16,135)
            |--*.jpg
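The split files under ImageSets/Segmentation can be used to pair each image with its segmentation mask. The following is a minimal sketch, assuming that each line of a split file holds one image ID and that the mask in SegmentationClass shares that ID; the helper name `list_voc_segmentation_pairs` is hypothetical and not part of the VOC devkit.

```python
from pathlib import Path


def list_voc_segmentation_pairs(voc_root, split="train"):
    """Return (image_path, mask_path) pairs for one segmentation split.

    Assumes one image ID per line in ImageSets/Segmentation/<split>.txt.
    """
    voc_root = Path(voc_root)
    split_file = voc_root / "ImageSets" / "Segmentation" / f"{split}.txt"
    pairs = []
    for line in split_file.read_text().splitlines():
        image_id = line.strip()
        if not image_id:
            continue  # skip blank lines
        image_path = voc_root / "JPEGImages" / f"{image_id}.jpg"
        mask_path = voc_root / "SegmentationClass" / f"{image_id}.png"
        pairs.append((image_path, mask_path))
    return pairs
```

Given the line counts listed for the trainval directory, the `train` split should yield 1,464 pairs and the `val` split 1,449.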
Overview: This dataset is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL semantic segmentation task by providing annotations for the whole scene. The training and validation sets contain 10,103 images, while the test set contains 9,637 images. The statistics section has a full list of 400+ labels. See pascal-voc.txt.
Overview: The NYU-Depth V2 dataset comprises video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect.
It features:
The dataset has several components:
Overview: The dataset is captured by four different sensors and contains 10,335 RGB-D images, a scale similar to PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset makes it possible to train data-hungry algorithms for scene-understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias.
Semantic segmentation in the 2D image domain is currently the most popular task for RGB-D scene understanding. In this task, the algorithm outputs a semantic label for each pixel in the RGB-D image. Evaluation uses the standard average accuracy across object categories.
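Average accuracy across categories can be computed from a confusion matrix: the accuracy of a class is the number of its pixels labeled correctly divided by the number of its ground-truth pixels, and these per-class values are then averaged. The following is a minimal NumPy sketch. The function name, the `ignore_label` value of 255, and the choice to skip classes absent from the ground truth are assumptions, not taken from the benchmark's official evaluation code.

```python
import numpy as np


def mean_class_accuracy(gt, pred, num_classes, ignore_label=255):
    """Average of per-class pixel accuracies.

    gt, pred: integer label arrays of the same shape, with predictions
    in [0, num_classes). Pixels whose ground truth equals ignore_label
    are excluded (assumed convention).
    """
    gt = np.asarray(gt, dtype=np.int64).ravel()
    pred = np.asarray(pred, dtype=np.int64).ravel()
    valid = gt != ignore_label
    gt, pred = gt[valid], pred[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    index = gt * num_classes + pred
    confusion = np.bincount(index, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)

    correct = np.diag(confusion).astype(np.float64)
    total = confusion.sum(axis=1).astype(np.float64)
    present = total > 0  # skip classes with no ground-truth pixels
    return float((correct[present] / total[present]).mean())
```

Unlike overall pixel accuracy, this metric weights every category equally, so small or rare objects count as much as large surfaces such as walls and floors.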
Overview: COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
Overview: The Cityscapes Dataset focuses on semantic understanding of urban street scenes. It contains 5,000 images with fine annotations and 20,000 images with coarse annotations. The dataset directory is as follows:
|--cityscape
    |--leftImg8bit_trainvaltest
        |--leftImg8bit (5,000)
            |--train (2975)
            |--val (500)
            |--test (1525)
    |--gtFine_trainvaltest (5000 * 4)
        |--gtFine
            |--train (2975 * 4)
            |--val (500 * 4)
            |--test (1525 * 4)
    |--leftImg8bit_trainextra
        |--leftImg8bit (20,000)
            |--train_extra (20000)
    |--gtCoarse
        |--gtCoarse (23475 * 4)
            |--train (2975 * 4)
            |--train_extra (20000 * 4)
            |--val (500 * 4)
The numbers in parentheses indicate the number of images. Each image in the annotation directories (gtFine_trainvaltest, gtCoarse) has 4 annotation files: color, instanceIds, labelIds and polygons. The test set in the annotation directory doesn't have labels (ground truth).
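The four annotation files for an image can be located from the image file name. The following is a minimal sketch, assuming the usual Cityscapes naming, in which an image `<city>_<seq>_<frame>_leftImg8bit.png` stored under `<split>/<city>/` has annotations named `<city>_<seq>_<frame>_gtFine_<type>.png` (with `.json` for polygons) under the same `<split>/<city>/` subdirectories. The helper name `cityscapes_annotation_paths` is hypothetical.

```python
from pathlib import Path


def cityscapes_annotation_paths(image_path, gt_root, gt_kind="gtFine"):
    """Map a leftImg8bit image path to its four annotation paths.

    Assumes images are stored as <split>/<city>/<name>_leftImg8bit.png and
    annotations mirror that layout under gt_root.
    """
    image_path = Path(image_path)
    suffix = "_leftImg8bit.png"
    if not image_path.name.endswith(suffix):
        raise ValueError(f"unexpected file name: {image_path.name}")
    stem = image_path.name[: -len(suffix)]
    city = image_path.parent.name
    split = image_path.parent.parent.name
    base = Path(gt_root) / split / city
    return {
        "color": base / f"{stem}_{gt_kind}_color.png",
        "instanceIds": base / f"{stem}_{gt_kind}_instanceIds.png",
        "labelIds": base / f"{stem}_{gt_kind}_labelIds.png",
        "polygons": base / f"{stem}_{gt_kind}_polygons.json",
    }
```

Passing `gt_kind="gtCoarse"` with the gtCoarse root gives the corresponding coarse annotation paths under the same assumptions.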
The dataset defines 30 object classes. The class definitions are as follows (* marks classes with instance-level annotations; + marks classes that are not included in evaluation):
Group | Classes |
---|---|
flat | road · sidewalk · parking+ · rail track+ |
human | person* · rider* |
vehicle | car* · truck* · bus* · on rails* · motorcycle* · bicycle* · caravan*+ · trailer*+ |
construction | building · wall · fence · guard rail+ · bridge+ · tunnel+ |
object | pole · pole group+ · traffic sign · traffic light |
nature | vegetation · terrain |
sky | sky |
void | ground+ · dynamic+ · static+ |
Overview: The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.
Group | Classes |
---|---|
Moving object | Animal · Pedestrian · Child · Rolling cart/luggage/pram · Bicyclist · Motorcycle/scooter · Car (sedan/wagon) · SUV / pickup truck · Truck / bus · Train · Misc |
Road | Road == drivable surface · Shoulder · Lane markings drivable · Non-Drivable |
Ceiling | Sky · Tunnel · Archway |
Fixed objects | Building · Wall · Tree · Vegetation misc. · Fence · Sidewalk · Parking block · Column/pole · Traffic cone · Bridge · Sign / symbol · Misc text · Traffic light · Other |
Labeled Images (701 so far)
Note: Some links on the official website are broken, including the raw image download link; the data can be obtained from link2 instead.
The dataset directory is as follows:
|--CamVid
    |--701_StillsRaw_full (701)
        |--*.png
    |--LabeledApproved_full (701)
        |--*.png
    |--label_colors.txt (32 lines)
        |-- color(like 64, 128, 64) classes(like Animal)
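label_colors.txt can be turned into a lookup from RGB color to class index, which in turn converts a color-coded label image into a class-index map. The following is a minimal sketch, assuming each line holds three integers (R, G, B) followed by the class name, separated by whitespace or commas. The function names and the `unknown` value of 255 are hypothetical choices.

```python
import re

import numpy as np


def parse_label_colors(lines):
    """Parse lines like '64 128 64 Animal' into [((r, g, b), name), ...]."""
    pattern = re.compile(r"\s*(\d+)[\s,]+(\d+)[\s,]+(\d+)[\s,]+(\S.*?)\s*$")
    entries = []
    for line in lines:
        match = pattern.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        r, g, b, name = match.groups()
        entries.append(((int(r), int(g), int(b)), name))
    return entries


def rgb_to_class_index(label_rgb, entries, unknown=255):
    """Convert an (H, W, 3) color label image to an (H, W) class-index map.

    Pixels whose color is not listed in entries are set to `unknown`.
    """
    label_rgb = np.asarray(label_rgb)
    index_map = np.full(label_rgb.shape[:2], unknown, dtype=np.uint8)
    for class_index, (color, _name) in enumerate(entries):
        mask = np.all(label_rgb == np.array(color), axis=-1)
        index_map[mask] = class_index
    return index_map
```

The class index here is simply the line's position in the file, so the same label_colors.txt must be used for training and evaluation.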
For the train/val/test split, refer to the site; the directory can be organized as follows:
|--CamVid
    |--train (367)
        |--*.png
    |--trainannot (367)
        |--*.png
    |--val (101)
        |--*.png
    |--valannot (101)
        |--*.png
    |--test (233)
        |--*.png
    |--testannot (233)
        |--*.png
    |--train.txt (367 lines)
    |--val.txt (101 lines)
    |--test.txt (233 lines)
Overview: The dataset is intended for scene parsing and semantic segmentation tasks. The training set contains 20,210 images and the validation set contains 2,000 images; the test set is to be released later.
Each folder contains images separated by scene category (the same scene categories as the Places Database). For each image, the object and part segmentations are stored in two different PNG files. All object and part instances are annotated separately.
For each image there are the following files:
*.jpg: RGB image.
*_seg.png: object segmentation mask. This image contains information about the object class segmentation masks and also separates each class into instances. The channels R and G encode the objects class masks. The channel B encodes the instance object masks. The function loadAde20K.m extracts both masks.
*_seg_parts_N.png: parts segmentation mask, where N is a number (1,2,3,…) indicating the level in the part hierarchy.
*_.txt: text file describing the content of each image (describing objects and parts). This information is redundant with other files.
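The `*_seg.png` encoding can be decoded with a few array operations. The following is a minimal sketch, assuming the formula used by the official loader, class = (R / 10) * 256 + G, with instance IDs obtained by relabeling the distinct values of the B channel. It takes an already loaded (H, W, 3) array, and the function name `decode_ade20k_seg` is hypothetical.

```python
import numpy as np


def decode_ade20k_seg(seg_rgb):
    """Split an (H, W, 3) *_seg.png array into class and instance masks.

    Assumed encoding: class index = (R // 10) * 256 + G, and each distinct
    value of B marks a separate instance.
    """
    seg_rgb = np.asarray(seg_rgb)
    r = seg_rgb[..., 0].astype(np.int32)  # widen to avoid uint8 overflow
    g = seg_rgb[..., 1].astype(np.int32)
    b = seg_rgb[..., 2]

    class_mask = (r // 10) * 256 + g

    # Relabel distinct B values as consecutive instance IDs 0, 1, 2, ...
    _, inverse = np.unique(b, return_inverse=True)
    instance_mask = inverse.reshape(b.shape)
    return class_mask, instance_mask
```

An image loaded with any library that returns an RGB array (for example `np.array(PIL.Image.open(path))`) can be passed in directly.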