
Kept in sync with: https://github.com/zhulf0804/Detection-Segmentation-PointCloud-Papers/blob/master/Segmentation.md

Segmentation Dataset

This is a brief introduction to some semantic segmentation datasets. More details and more datasets will be added in the future.

  • PASCAL VOC(Visual Object Classes) 2012
  • PASCAL-Context
  • NYUDv2
  • SUN-RGBD
  • Microsoft COCO
  • Cityscapes
  • CamVid
  • ADE20K

PASCAL VOC(Visual Object Classes) 2012: [link] [paper] [leaderboard] [download]

Overview There are 20 object classes in the dataset.

Group   | Classes
Person  | person
Animal  | bird, cat, cow, dog, horse, sheep
Vehicle | aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor  | bottle, chair, dining table, potted plant, sofa, tv/monitor

The dataset supports three main object recognition competitions: classification, detection, and segmentation.

The dataset directory is as follows:

            |--Annotations (17,125)
                    |--Action (33)
                    |--Main   (63)
                            |--train.txt (1464 lines)
                            |--trainval.txt (2913 lines)
                            |--val.txt (1449 lines)
            |--JPEGImages  (17,125)
            |--SegmentationClass  (2913)
            |--SegmentationObject (2913)
            |--Annotations (5,138)
                    |--Action (11)
                    |--Main   (21)
                            |--test.txt (1456 lines)
            |--JPEGImages  (16,135)
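The split files listed above can be turned into image/mask path pairs with a few lines of Python. This is a minimal sketch, assuming each split file holds one image ID per line and that an ID maps to `JPEGImages/<id>.jpg` and `SegmentationClass/<id>.png`. The helper names, the `VOC2012` root, and the two sample IDs are illustrative, not part of the dataset tooling.

```python
from pathlib import PurePosixPath


def parse_split(text):
    """Return the non-empty image IDs listed one per line in a split file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pair_paths(root, image_id):
    """Build (image, mask) paths for one ID under a VOC-style root (assumed layout)."""
    root = PurePosixPath(root)
    image = root / "JPEGImages" / f"{image_id}.jpg"
    mask = root / "SegmentationClass" / f"{image_id}.png"
    return str(image), str(mask)


# Illustrative content standing in for a real train.txt.
sample_split = "2007_000032\n2007_000039\n\n"
ids = parse_split(sample_split)
pairs = [pair_paths("VOC2012", image_id) for image_id in ids]
```

As a sanity check on the counts in the tree, the train (1464) and val (1449) lists together account for the 2913 lines of trainval.txt and the 2913 segmentation masks.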

PASCAL-Context: [link] [paper]

Overview This dataset is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL semantic segmentation task by providing annotations for the whole scene. The training and validation sets contain 10,103 images, while the test set contains 9,637 images. The statistics section has a full list of 400+ labels. See pascal-voc.txt

NYUDv2: [link] [paper]

Overview The NYU-Depth V2 data set is comprised of video sequences from a variety of indoor scenes as recorded by both the RGB and Depth cameras from the Microsoft Kinect.

It features:

  • 1449 densely labeled pairs of aligned RGB and depth images
  • 464 new scenes taken from 3 cities
  • 407,024 new unlabeled frames
  • Each object is labeled with a class and an instance number (cup1, cup2, cup3, etc.)

The dataset has several components:

  • Labeled: A subset of the video data accompanied by dense multi-class labels. This data has also been preprocessed to fill in missing depth labels.
  • Raw: The raw RGB, depth, and accelerometer data as provided by the Kinect.
  • Toolbox: Useful functions for manipulating the data and labels.

SUN-RGBD: [link] [paper]

Overview The dataset is captured by four different sensors and contains 10,335 RGB-D images, at a similar scale as PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset enables us to train data-hungry algorithms for scene-understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias.

Semantic segmentation in the 2D image domain is currently the most popular task for RGB-D scene understanding. In this task, the algorithm outputs a semantic label for each pixel in the RGB-D image. Evaluation uses the standard average accuracy across object categories.
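Average accuracy across object categories differs from plain pixel accuracy because each class contributes equally regardless of how many pixels it covers. The following is a minimal sketch of that metric, assuming predictions and ground truth are flat sequences of integer class indices. The function name and the `ignore_label` parameter are illustrative and not taken from the SUN-RGBD toolbox.

```python
def mean_class_accuracy(pred, target, num_classes, ignore_label=None):
    """Average the per-class pixel accuracy over classes present in the target."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip unlabeled pixels
        total[t] += 1
        if p == t:
            correct[t] += 1
    # Only classes that actually occur in the ground truth are averaged.
    per_class = [c / n for c, n in zip(correct, total) if n > 0]
    return sum(per_class) / len(per_class) if per_class else 0.0


# Toy example: class 0 covers three pixels, class 1 covers one pixel.
target = [0, 0, 0, 1]
pred = [0, 0, 1, 1]
score = mean_class_accuracy(pred, target, num_classes=2)
```

In the toy example, class 0 is 2/3 correct and class 1 is fully correct, so the class average is 5/6, while plain pixel accuracy would be 3/4.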

Microsoft COCO: [link] [paper]

Overview COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Cityscapes: [link] [paper] [git repository]

Overview The Cityscapes Dataset focuses on semantic understanding of urban street scenes. It contains 5,000 images with fine annotations and 20,000 images with coarse annotations. The dataset directory is as follows:

                 |--leftImg8bit (5,000)
                       |--train (2975)
                       |--val   (500)
                       |--test  (1525)
      |--gtFine_trainvaltest    (5000 * 4)
                       |--train (2975 * 4)
                       |--val   (500 * 4)
                       |--test  (1525 * 4)
                 |--leftImg8bit (20,000)
                       |--train_extra (20000)
                 |--gtCoarse    (23475 * 4)
                       |--train (2975 * 4)
                       |--train_extra (20000 * 4)
                       |--val   (500 * 4)                 

The numbers in parentheses are image counts. The annotation directories (gtFine_trainvaltest, gtCoarse) hold 4 files per image: color, instanceIds, labelIds, and polygons. The test set in the annotation directory does not have ground-truth labels.
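The four annotation files for an image can be located from the image file name. The sketch below assumes the usual Cityscapes naming pattern, in which an image ends in `_leftImg8bit.png` and its annotations end in `_gtFine_<type>.png` (or `.json` for polygons). The helper name and the sample file name are illustrative, so check them against the files you actually downloaded.

```python
# Annotation type -> file extension (assumed naming convention).
ANNOTATION_TYPES = {
    "color": "png",
    "instanceIds": "png",
    "labelIds": "png",
    "polygons": "json",
}

IMAGE_SUFFIX = "_leftImg8bit.png"


def annotation_names(image_name, mode="gtFine"):
    """Map one image file name to its four annotation file names."""
    if not image_name.endswith(IMAGE_SUFFIX):
        raise ValueError(f"unexpected image name: {image_name}")
    stem = image_name[: -len(IMAGE_SUFFIX)]
    return {
        kind: f"{stem}_{mode}_{kind}.{ext}"
        for kind, ext in ANNOTATION_TYPES.items()
    }


names = annotation_names("aachen_000000_000019_leftImg8bit.png")
coarse = annotation_names("aachen_000000_000019_leftImg8bit.png", mode="gtCoarse")
```

For semantic segmentation training, the `labelIds` file is typically the one to load; the `color` file is meant for visualization.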

The dataset defines 30 object classes. The class definitions are as follows:

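Group        | Classes
flat         | road · sidewalk · parking+ · rail track+
human        | person* · rider*
vehicle      | car* · truck* · bus* · on rails* · motorcycle* · bicycle* · caravan*+ · trailer*+
construction | building · wall · fence · guard rail+ · bridge+ · tunnel+
object       | pole · pole group+ · traffic sign · traffic light
nature       | vegetation · terrain
sky          | sky
void         | ground+ · dynamic+ · static+

* Instance-level annotations are available. + The label is not included in evaluation and is treated as void.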

CamVid: [link, link2] [paper1, paper2]

Overview The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.

Group         | Classes
Moving object | Animal · Pedestrian · Child · Rolling cart/luggage/pram · Bicyclist · Motorcycle/scooter · Car (sedan/wagon) · SUV / pickup truck · Truck / bus · Train · Misc
Road          | Road == drivable surface · Shoulder · Lane markings drivable · Non-Drivable
Ceiling       | Sky · Tunnel · Archway
Fixed objects | Building · Wall · Tree · Vegetation misc. · Fence · Sidewalk · Parking block · Column/pole · Traffic cone · Bridge · Sign / symbol · Misc text · Traffic light · Other

Labeled Images (701 so far)

Note: Some links on the official website are invalid, including the raw image download link. Use link2 to get the data.

The dataset directory is as follows:

      |--701_StillsRaw_full   (701)
      |--LabeledApproved_full (701)
      |--label_colors.txt (32 lines)
              |-- color(like 64, 128, 64) classes(like Animal)
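Since the labeled images are color-coded, label_colors.txt is what maps each RGB color back to a class name. The following is a minimal parsing sketch, assuming each line holds three integers (separated by spaces or commas) followed by the class name. The function name is illustrative, and the second sample line is only there to exercise the parser.

```python
import re

# Three integers separated by commas and/or whitespace, then the class name.
_LINE = re.compile(r"^\s*(\d+)[,\s]+(\d+)[,\s]+(\d+)\s+(\S.*?)\s*$")


def parse_label_colors(text):
    """Return a dict mapping (R, G, B) tuples to class names."""
    color_to_class = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        r, g, b, name = match.groups()
        color_to_class[(int(r), int(g), int(b))] = name
    return color_to_class


# Sample content standing in for the real 32-line file.
sample = "64 128 64\tAnimal\n192, 0, 128 Archway\n\n"
colors = parse_label_colors(sample)
```

Inverting this dictionary gives a class-to-index table, which is what most training pipelines need when converting the color masks to single-channel label maps.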

For the train/val/test split, refer to the site. The directory can be organized as follows:

     |--train       (367)
     |--trainannot  (367)
     |--val         (101)
     |--valannot    (101)
     |--test        (233)
     |--testannot   (233)
     |--train.txt   (367 lines)
     |--val.txt     (101 lines)
     |--test.txt    (233 lines)

ADE20K: [link] [paper1, paper2]

Overview The dataset targets scene parsing and semantic segmentation. The training set contains 20,210 images and the validation set contains 2,000 images; the test set is to be released later.

Each folder contains images separated by scene category (the same scene categories as the Places Database). For each image, the object and part segmentations are stored in two different png files. All object and part instances are annotated separately.

For each image there are the following files:

  • *.jpg: RGB image.

  • *_seg.png: object segmentation mask. This image contains information about the object class segmentation masks and also separates each class into instances. The R and G channels encode the object class masks. The B channel encodes the object instance masks. The function loadAde20K.m extracts both masks.

  • *_seg_parts_N.png: parts segmentation mask, where N is a number (1,2,3,…) indicating the level in the part hierarchy.

  • *_.txt: text file describing the content of each image (describing objects and parts). This information is redundant with other files.
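The R/G/B encoding of `*_seg.png` can be decoded without MATLAB. The sketch below assumes the commonly used decoding, class index = (R / 10) * 256 + G, with instance IDs given by ranking the distinct B values. That formula is an assumption here and should be verified against loadAde20K.m. The function works on nested lists of `(r, g, b)` tuples so that it stays independent of any image library, and its name is illustrative.

```python
def decode_seg(pixels):
    """Split rows of (r, g, b) pixels into a class mask and an instance mask."""
    # Assumed encoding: R holds the high part (in steps of 10), G the low part.
    class_mask = [[(r // 10) * 256 + g for r, g, _ in row] for row in pixels]

    # Rank the distinct B values; if B == 0 (background) is present it gets 0.
    blue_values = sorted({b for row in pixels for _, _, b in row})
    rank = {value: index for index, value in enumerate(blue_values)}
    instance_mask = [[rank[b] for _, _, b in row] for row in pixels]
    return class_mask, instance_mask


# Toy 2x2 image: one background pixel and three object pixels.
toy = [
    [(0, 0, 0), (10, 5, 40)],
    [(10, 5, 80), (20, 1, 40)],
]
class_mask, instance_mask = decode_seg(toy)
```

In the toy image, the two pixels with R=10, G=5 share class 261 but have different B values, so they end up in different instances.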
