Object Detection, Instance Segmentation and Person Keypoint Detection
The models subpackage contains definitions for the following model architectures for detection:
Faster R-CNN ResNet-50 FPN
Mask R-CNN ResNet-50 FPN
The pre-trained models for detection, instance segmentation and keypoint detection are initialized with the classification models in torchvision.
The models expect a list of Tensor[C, H, W] with values in the range 0-1. The models internally resize the images so that they have a minimum size of 800. This value can be changed by passing min_size to the constructor of the models.
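To make the resizing rule concrete, here is a minimal sketch of how the rescaled image size could be computed. It assumes the shorter side is scaled to min_size while the longer side is capped at a max_size of 1333; that cap and its value are assumptions not stated above. resized_shape is a hypothetical helper, not a torchvision function, and the library's exact rounding may differ by a pixel.

```python
def resized_shape(height, width, min_size=800, max_size=1333):
    """Return the (height, width) after aspect-preserving resizing.

    Assumption: the shorter side is scaled to `min_size`, unless that would
    push the longer side past `max_size`, in which case the longer side is
    scaled to `max_size` instead.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    # Pick the smaller of the two candidate scale factors.
    scale = min(min_size / short_side, max_size / long_side)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))   # shorter side becomes 800
print(resized_shape(400, 1200))  # longer side is capped at 1333
```

Passing a smaller min_size (for example 512) to the helper shows how the same rule shrinks the working resolution, which is the effect of changing min_size on the model constructor.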
For object detection and instance segmentation, the pre-trained models return the predictions of the following classes:
COCO_INSTANCE_CATEGORY_NAMES = [
'__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
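As an illustration of how this list is typically used, the sketch below turns predicted label indices into readable names. It assumes that predicted labels are integer indices into the list and that each prediction comes with a confidence score. label_names, the 0.5 threshold, and the sample values are hypothetical, and only the first entries of the list are repeated so that the snippet runs on its own.

```python
# First entries of COCO_INSTANCE_CATEGORY_NAMES, repeated here so the
# sketch is self-contained; use the full list in practice.
CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign',
]


def label_names(labels, scores, names, score_threshold=0.5):
    """Map label indices to names, dropping weak and placeholder entries."""
    kept = []
    for label, score in zip(labels, scores):
        if score < score_threshold:
            continue  # low-confidence prediction
        name = names[label]
        if name in ('__background__', 'N/A'):
            continue  # placeholder slots carry no real category
        kept.append((name, score))
    return kept


# Made-up predictions for illustration only.
labels = [1, 3, 12, 13]
scores = [0.98, 0.42, 0.90, 0.77]
print(label_names(labels, scores, CATEGORY_NAMES))
# [('person', 0.98), ('stop sign', 0.77)]
```

The 'N/A' entries keep the list aligned with the original COCO category ids, so indexing must use the list as given rather than a compacted copy.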
Here is a summary of the accuracies for the models trained on the instances set of COCO train2017 and evaluated on COCO val2017:
Network | box AP | mask AP | keypoint AP |
---|---|---|---|
Faster R-CNN ResNet-50 FPN | 37.0 | - | - |
Mask R-CNN ResNet-50 FPN | 37.9 | 34.6 | - |
For person keypoint detection, the accuracies for the pre-trained models are as follows:
Network | box AP | mask AP | keypoint AP |
---|---|---|---|
Keypoint R-CNN ResNet-50 FPN | 54.6 | - | 65.0 |
For person keypoint detection, the pre-trained model returns the keypoints in the following order:
COCO_PERSON_KEYPOINT_NAMES = [
'nose',
'left_eye',
'right_eye',
'left_ear',
'right_ear',
'left_shoulder',
'right_shoulder',
'left_elbow',
'right_elbow',
'left_wrist',
'right_wrist',
'left_hip',
'right_hip',
'left_knee',
'right_knee',
'left_ankle',
'right_ankle'
]
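Because the keypoints come back in a fixed order, they can be paired with these names by position. The sketch below shows one way to do that for a single detected person. It assumes each keypoint is an (x, y, visibility) triple. name_keypoints and the sample coordinates are hypothetical and for illustration only.

```python
COCO_PERSON_KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]


def name_keypoints(keypoints, names=COCO_PERSON_KEYPOINT_NAMES):
    """Pair one person's keypoints with their names, by position."""
    if len(keypoints) != len(names):
        raise ValueError(
            f'expected {len(names)} keypoints, got {len(keypoints)}'
        )
    return {name: tuple(point) for name, point in zip(names, keypoints)}


# Made-up coordinates for one person: (x, y, visibility) per keypoint.
person = [(100.0 + i, 50.0 + 2 * i, 1) for i in range(17)]
named = name_keypoints(person)
print(named['nose'])           # (100.0, 50.0, 1)
print(named['left_shoulder'])  # (105.0, 60.0, 1)
```

With tensor outputs, the same pairing would apply after converting one person's keypoints to a nested list.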