1. Existing network (such as AlexNet pre-trained on ImageNet)
2. SPP --> region-level descriptor
3. (1) class score for each region --> recognition
(2) probability distribution over regions (which region contains the most salient image structure) --> detection
4. Aggregate the recognition and detection scores to predict the class of the image (image-level supervision)
1. MIL: Use the appearance model itself to perform region selection
WSDDN: detection branch is independent of recognition branch
2. Bilinear architecture: two streams are symmetric
WSDDN: detection branch is explicitly designed
1. Pre-trained network
2. Weakly supervised deep detection network
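(1) Region-level descriptor:
Region proposals: Selective Search Windows (SSW), Edge Boxes (EB)
(2) Classification data stream: fc + softmax (normalized over classes, separately for each region)
(3) Detection data stream: fc + softmax (defined differently: normalized over regions, separately for each class)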
(4) Combined region scores and detection
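Final score of each region: $x^R = \sigma_{class}(x^c) \odot \sigma_{det}(x^d)$ (element-wise product of the classification-stream and detection-stream softmax outputs)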
Then rank regions for each class independently.
Then apply non-maximum suppression (NMS) with an IoU threshold of 0.4.
(5) Image-level classification scores
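Image-level class score: $y_c = \sum_{r=1}^{|R|} x^R_{cr}$ (sum of the final region scores for class c over all regions), with $y_c \in (0, 1)$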
Softmax is not applied at this stage because one image can have multiple labels.
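Steps (2) to (5) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names (`wsddn_scores`, `nms`), the C x R nested-list layout, the `(x1, y1, x2, y2)` box format, and the toy numbers are all hypothetical and do not come from the paper's code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def wsddn_scores(x_cls, x_det):
    """Combine the two streams (hypothetical helper).

    x_cls, x_det: C x R nested lists (classes x regions) of raw fc outputs.
    Returns (region_scores, image_scores).
    """
    num_classes, num_regions = len(x_cls), len(x_cls[0])
    # Classification stream: softmax over classes, one region at a time.
    columns = [softmax([x_cls[c][r] for c in range(num_classes)])
               for r in range(num_regions)]
    # Detection stream: softmax over regions, one class at a time.
    det = [softmax(row) for row in x_det]
    # Final region score: element-wise product of the two normalized maps.
    region_scores = [[columns[r][c] * det[c][r] for r in range(num_regions)]
                     for c in range(num_classes)]
    # Image-level score: sum over regions, with no softmax across classes.
    image_scores = [sum(row) for row in region_scores]
    return region_scores, image_scores


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4):
    """Greedy NMS for one class: returns kept indices, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy input: 2 classes, 3 regions (made-up numbers, not from the paper).
x_cls = [[2.0, 0.0, -1.0], [0.0, 1.0, 3.0]]
x_det = [[3.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
region_scores, image_scores = wsddn_scores(x_cls, x_det)
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, region_scores[0])  # rank and suppress for class 0 only
```

Because each detection row sums to 1 and each classification score lies in (0, 1), every image-level score is a weighted average that also lies in (0, 1), which matches the range noted above.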
3. Training WSDDN
A collection of images $x_i$, $i = 1, 2, \dots, n$
Image-level labels $y_i \in \{-1, 1\}^C$
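With labels in {-1, 1} and image-level scores in (0, 1), one natural training objective is a per-class binary log loss. The sketch below assumes that form and leaves out any weight-decay term; the function name `image_level_log_loss` is hypothetical, and the notes themselves do not state the exact loss.

```python
import math


def image_level_log_loss(image_scores, labels):
    """Sum of binary log losses over classes for one image.

    image_scores: per-class scores in (0, 1).
    labels: per-class labels in {-1, +1}.
    Assumption: a plain binary log loss, with regularisation omitted.
    """
    total = 0.0
    for score, label in zip(image_scores, labels):
        # Maps to `score` when label is +1 and to `1 - score` when it is -1.
        prob_of_label = label * (score - 0.5) + 0.5
        total -= math.log(prob_of_label)
    return total


# Made-up scores for a 2-class image where only class 0 is present.
good = image_level_log_loss([0.9, 0.1], [1, -1])
bad = image_level_log_loss([0.1, 0.9], [1, -1])
```

Each class is treated as an independent binary problem, which is consistent with skipping the softmax across classes for multi-label images.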
4. Spatial regulariser
Penalize the feature map discrepancies between the highest scoring region and the regions that have at least 60% IoU with it during training.
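A simplified sketch of this penalty for a single class in a single image follows. It assumes the discrepancy is a squared L2 distance between per-region feature vectors and that the pairwise IoU values are precomputed; the function name, the `iou_matrix` argument, and the omission of any per-score weighting or normalisation are assumptions for illustration only.

```python
def spatial_regulariser(features, scores, iou_matrix, min_iou=0.6):
    """Penalty pulling overlapping regions towards the top region's features.

    features: one feature vector (list of floats) per region.
    scores: one score per region for the class being considered.
    iou_matrix: hypothetical precomputed R x R table of pairwise IoU values.
    """
    # Index of the highest scoring region for this class.
    top = max(range(len(scores)), key=lambda r: scores[r])
    penalty = 0.0
    for r, feat in enumerate(features):
        # Only regions that overlap the top region enough are penalized.
        if r == top or iou_matrix[top][r] < min_iou:
            continue
        penalty += 0.5 * sum((a - b) ** 2 for a, b in zip(features[top], feat))
    return penalty


# Toy example with 3 regions and 2-dimensional features (made-up values).
features = [[1.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
scores = [0.7, 0.2, 0.1]
iou_matrix = [[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.0],
              [0.1, 0.0, 1.0]]
penalty = spatial_regulariser(features, scores, iou_matrix)
```

In the toy example only region 1 overlaps the top region by at least 60%, so region 2 contributes nothing even though its features are far away.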
CorLoc: the percentage of images that contain at least one instance of the target object class for which the most confident detected bounding box overlaps by at least 0.5 with one of these instances.
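The CorLoc definition can be made concrete with a short sketch for one target class. It assumes boxes are `(x1, y1, x2, y2)` tuples, that overlap is measured as IoU, and that every listed image contains at least one ground-truth instance of the class; the function names and the toy boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes_per_image, min_iou=0.5):
    """CorLoc (in percent) for one class.

    top_boxes[i]: most confident detection of the class in image i.
    gt_boxes_per_image[i]: ground-truth boxes of the class in image i.
    """
    hits = sum(
        1 for box, gts in zip(top_boxes, gt_boxes_per_image)
        if any(iou(box, gt) >= min_iou for gt in gts)
    )
    return 100.0 * hits / len(top_boxes)


# Two images: the first detection matches its instance, the second misses.
top_boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
gt_boxes = [[(0, 0, 10, 12)], [(20, 20, 30, 30)]]
score = corloc(top_boxes, gt_boxes)
```

Only the single most confident box per image counts, so CorLoc measures localization on images known to contain the class rather than full detection performance.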
Problems: (1) groups multiple object instances into a single bounding box
(2) focuses on parts rather than the whole object
Result: