[Dataset Notes] PKU-MMD: A Multi-Modal Human Action Detection Dataset

PKU-MMD

Notes

Dataset URL: PKU-MMD

Dataset Details

  1. This is a large-scale multi-modality action detection dataset.
  2. It contains 1,076 long video sequences, each lasting about 3–4 minutes (recorded at 30 FPS) and containing approximately 20 action instances.
  3. 51 action categories: 41 daily actions (drinking, waving hand, putting on glasses, etc.) and 10 interaction actions (hugging, shaking hands, etc.).
  4. Performed by 66 subjects; each subject takes part in 4 daily-action videos and 2 interaction videos. The subjects are aged between 18 and 40.
  5. 3 camera views.
  6. Almost 20,000 action instances.
  7. 5,312,580 frames, about 3,000 minutes in total (see the consistency check after this list).
  8. 4 raw modalities: RGB frames (1920 × 1080), depth maps (512 × 424), skeleton data, and infrared (512 × 424).
  9. To improve the sequential continuity of the long action sequences, the daily actions are designed in a weakly connected mode. For example, one action sequence consists of taking off a shirt, taking off a hat, drinking water, and sitting down, describing a scene that occurs after coming back home. Note that each video contains only one kind of action, either daily actions or interaction actions. We design 54 sequences and divide the subjects into 9 groups; each group randomly chooses 6 sequences to perform.
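As a quick consistency check, the headline numbers above agree with each other. The minimal sketch below verifies this using only the figures quoted in the list (no dataset files involved):

```python
# Sanity-check the headline statistics quoted in the list above.
TOTAL_FRAMES = 5_312_580
FPS = 30
NUM_VIDEOS = 1076

total_minutes = TOTAL_FRAMES / FPS / 60      # ~2,951 min, i.e. roughly 3,000 min
avg_minutes = total_minutes / NUM_VIDEOS     # ~2.7 min per video

print(f"total duration: {total_minutes:,.0f} min")
print(f"average video length: {avg_minutes:.1f} min")
```

The per-video average works out to roughly 2.7 minutes, slightly under the quoted 3–4 minutes, so the latter presumably describes typical rather than mean length.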


How the Data Was Collected

Recording Multi-Modality Videos.

  1. We choose a daily-life indoor environment to capture the video samples.
  2. Since temperature changes cause drift in the infrared sequences, we carefully arrange the distances among the action area, the windows, and the Kinect devices.
  3. Windows are covered for illumination consistency.
  4. We use three cameras at fixed angles and heights to simultaneously capture three different horizontal views.
  5. The action area is 180 cm long and 120 cm wide.
  6. Each subject performs every action instance in a long sequence toward a randomly chosen camera; performing two consecutive actions toward different cameras is acceptable.
  7. The horizontal angles of the three cameras are −45°, 0°, and +45°, each at a height of 120 cm (see the configuration sketch after this list).
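For reference, the recording setup described in this list can be collected into a small configuration structure. This is purely illustrative; in particular, which sign of the horizontal angle corresponds to the left or right camera is an assumption.

```python
# Illustrative summary of the recording setup described above.
RECORDING_SETUP = {
    "fps": 30,
    "views": {  # angle-to-side mapping is an assumption; heights are from the list
        "left":   {"angle_deg": -45, "height_cm": 120},
        "middle": {"angle_deg":   0, "height_cm": 120},
        "right":  {"angle_deg": +45, "height_cm": 120},
    },
    "action_area_cm": {"length": 180, "width": 120},
    "modalities": {
        "rgb":      (1920, 1080),
        "depth":    (512, 424),
        "infrared": (512, 424),
        "skeleton": None,  # 3D joint coordinates, no fixed image resolution
    },
}
```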

Localizing Temporal Intervals.

  1. At this stage, the captured videos are labeled at the frame level.
  2. We employ only proficient volunteers who have experience in labeling temporal actions.
  3. We divide the actions into several groups, and the actions in each group are labeled by a single person.

Verifying and Enhancing Labels.

  1. First, we design a basic evaluation protocol for each video, with checks such as "Is there any overlap between actions?" and "Is the length of an action reasonable?" (a sketch of such checks follows this list).
  2. We then use a cross-view method to evaluate and verify the labels.
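A minimal sketch of such per-video checks, assuming labels are given as `(action_id, start_frame, end_frame)` tuples (the tuple layout and the length bounds are illustrative assumptions, not the authors' actual tooling):

```python
def check_video_labels(intervals, min_len=15, max_len=3600):
    """Basic sanity checks on one video's frame-level labels.

    `intervals` is a list of (action_id, start_frame, end_frame) tuples;
    the tuple format and the length bounds are illustrative assumptions.
    """
    problems = []
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by start frame
    for i, (act, start, end) in enumerate(ordered):
        # Is the length of the action reasonable?
        if not (min_len <= end - start <= max_len):
            problems.append(f"interval {i} ({act}): suspicious length {end - start} frames")
        # Is there overlap with the next action?
        if i + 1 < len(ordered) and end > ordered[i + 1][1]:
            problems.append(f"interval {i} overlaps interval {i + 1}")
    return problems
```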


Evaluation Protocols

Dataset Partition Setting:

1) Cross-View Evaluation: the video sequences from the middle and right Kinect devices are chosen as the training set, and those from the left device as the testing set. For this evaluation, the training and testing sets have 717 and 359 video samples, respectively.

2) Cross-Subject Evaluation: we split the subjects into training and testing groups, which consist of 57 and 9 subjects, respectively. For this evaluation, the training and testing sets have 944 and 132 long video samples, respectively.
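Assuming each video is described by a record with `view` and `subject_id` fields (the metadata layout and the view codes `L`/`M`/`R` are assumptions for illustration), the two partitions can be expressed as:

```python
def cross_view_split(videos):
    """Cross-view: middle + right views for training, left view for testing."""
    train = [v for v in videos if v["view"] in ("M", "R")]
    test  = [v for v in videos if v["view"] == "L"]
    return train, test

def cross_subject_split(videos, test_subjects):
    """Cross-subject: 57 training subjects, 9 testing subjects.

    `test_subjects` is the set of 9 held-out subject IDs; the concrete
    IDs are published with the dataset, so hard-coding them here would
    be a guess and they are passed in instead.
    """
    train = [v for v in videos if v["subject_id"] not in test_subjects]
    test  = [v for v in videos if v["subject_id"] in test_subjects]
    return train, test
```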

Average Precision Protocols:

A detected interval is counted as correct when the IoU between the predicted action interval $I$ and the ground-truth interval $I^*$, $\text{IoU}(I, I^*) = \frac{|I \cap I^*|}{|I \cup I^*|}$, is larger than a threshold $\theta$. Given $\theta$, the precision $p(\theta)$ and recall $r(\theta)$ can be calculated.
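In code, the temporal IoU and the resulting precision/recall at a threshold θ can be computed along the following lines. The greedy one-to-one matching between predictions and ground truth is an assumption; the source does not spell out the matching rule.

```python
def temporal_iou(pred, gt):
    """IoU of two 1-D intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, theta):
    """p(theta) and r(theta) for one action class in one video.

    `preds` are (start, end) intervals sorted by descending confidence;
    each ground-truth interval may be matched at most once (greedy
    matching is an assumption here).
    """
    matched = set()
    tp = 0
    for pred in preds:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = temporal_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou > theta:
            matched.add(best_j)
            tp += 1
    p = tp / len(preds) if preds else 0.0
    r = tp / len(gts) if gts else 0.0
    return p, r
```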

1) F1-Score: the F1-score is a basic evaluation criterion that ignores confidence scores:

$$F1(\theta) = \frac{2\, p(\theta)\, r(\theta)}{p(\theta) + r(\theta)}$$

2) Interpolated Average Precision (AP): interpolated average precision is a well-known evaluation score that uses confidence information for ranked retrieval results. The interpolated precision $p_{\text{interp}}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \ge r$:

$$p_{\text{interp}}(r) = \max_{r' \ge r} p(r')$$

As the confidence threshold varies, the precision and recall values trace out a precision-recall curve. The interpolated average precision is the arithmetic mean of the interpolated precision at each recall level:

$$\text{AP} = \frac{1}{|R|} \sum_{r \in R} p_{\text{interp}}(r),$$

where $R$ is the set of recall levels at which a ground-truth instance is retrieved.
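A compact sketch of interpolated AP, assuming the ranked detections have already been reduced to hit/miss flags in descending confidence order (that flag representation is an assumption):

```python
import numpy as np

def interpolated_ap(hits, num_gt):
    """Interpolated AP from a confidence-ranked list of detections.

    `hits[k]` is 1 if the k-th ranked detection matches a ground-truth
    interval at the chosen IoU threshold, else 0. `num_gt` is the total
    number of ground-truth instances.
    """
    hits = np.asarray(hits, dtype=float)
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    # p_interp(r) = max precision at any recall level r' >= r:
    # a reverse running maximum over the precision curve.
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    # Mean of the interpolated precision at each recall level where a
    # new true positive is found.
    return float(p_interp[hits == 1].sum() / num_gt) if num_gt else 0.0
```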

3) Mean Average Precision (mAP): with the retrieval set split into several parts $Q$, where each part $q_j \in Q$ proposes $m_j$ action occurrences and $R_{jk}$ is the set of ranked retrieval results from the top result down to the $k$-th relevant occurrence, mAP is formulated as

$$\text{mAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \text{Precision}(R_{jk})$$

Note that when the retrieval set is split into several parts, the AP score above is discretely formulated. We design two splitting protocols (a code sketch follows the list below):

  • mean average precision over different actions ($\text{mAP}_a$), where each action category forms one part;
  • mean average precision over different videos ($\text{mAP}_v$), where each video forms one part.
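Continuing the sketch above, both splittings reduce to averaging the AP over the parts; `parts` is keyed by action class for mAP_a or by video for mAP_v (this reuses `interpolated_ap` from the previous snippet):

```python
def mean_average_precision(parts):
    """mAP over a retrieval set split into parts.

    `parts` maps a part key (an action class for mAP_a, a video id for
    mAP_v) to a (hits, num_gt) pair as used by interpolated_ap above.
    """
    aps = [interpolated_ap(hits, num_gt) for hits, num_gt in parts.values()]
    return sum(aps) / len(aps) if aps else 0.0
```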

4) 2D Interpolated Average Precision: though several protocols have been designed for information retrieval, none of them takes the overlap ratio into consideration. Every AP and mAP score above is tied to a single threshold $\theta$. To evaluate precision across different overlap ratios, the 2D-AP score takes both the ranked retrieval results and the overlap ratio of the detection into consideration: the interpolated precision is extended to two dimensions,

$$p_{\text{interp}}(r, \theta) = \max_{r' \ge r,\; \theta' \ge \theta} p(r', \theta'),$$

and the 2D-AP is the arithmetic mean of $p_{\text{interp}}(r, \theta)$ over the 2D grid of recall levels and overlap thresholds.
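The 2D interpolated maximum can be realized with two reverse cumulative-max passes over a precision grid; the grid layout below (`p_grid[i, j]` = precision at the i-th recall level and j-th threshold, both axes increasing) is an assumption for illustration.

```python
import numpy as np

def two_d_ap(p_grid):
    """2D-AP from a precision grid.

    `p_grid[i, j]` is the precision at the i-th recall level and j-th
    IoU threshold (both axes in increasing order). The 2D interpolated
    precision at (r, theta) is the max of p over all r' >= r and
    theta' >= theta, obtained by reverse cumulative maxima along both
    axes; 2D-AP is then its mean over the grid.
    """
    p = np.asarray(p_grid, dtype=float)
    p = np.maximum.accumulate(p[::-1, :], axis=0)[::-1, :]   # max over r' >= r
    p = np.maximum.accumulate(p[:, ::-1], axis=1)[:, ::-1]   # max over theta' >= theta
    return float(p.mean())
```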
