In practice, many business cases for deep learning come with datasets that have only image-level labels and no information about where objects are located in the image. At the same time, using only this type of annotation, we have to not only classify an image but also localize the object in it.
Convolutional Neural Networks (CNNs) have shown state-of-the-art results in various visual recognition tasks, and it turns out that a CNN trained only to classify images also learns to localize objects. In other words, a CNN trained for image classification performs weakly supervised object localization essentially for free. This means our business goal doesn't strictly require rich annotations: image-level labels are enough to say where the object is. In this post I want to cover the most useful approaches to weakly supervised object localization using image-level labels.
Class Activation Map (CAM)
The main idea of CAM is to apply global average pooling to the convolutional feature maps just before the fully connected layer that produces the scores for the classification loss.
Suppose we have S ∈ ℝ^(H×H×K), where H × H is the output feature map size and K is the number of output feature maps. Global average pooling simply takes the average value of the kth feature map S_k:

s_k = (1/H²) · Σ_{i,j} S_{i,j,k}
The weight matrix of the fully connected layer is W^fc ∈ ℝ^(K×C), where C is the number of target classes.
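To make the shapes concrete, here is a minimal PyTorch sketch of this head (the values of K, H and C and the names `gap` and `fc` are my own examples, not from the paper):

```python
import torch
import torch.nn as nn

K, H, C = 512, 14, 10             # number of feature maps, spatial size, number of classes (example values)

S = torch.randn(1, K, H, H)       # feature maps of the last conv layer, i.e. S ∈ ℝ^(H×H×K)

gap = nn.AdaptiveAvgPool2d(1)     # global average pooling: one scalar s_k per feature map
fc = nn.Linear(K, C, bias=False)  # fully connected layer corresponding to W^fc ∈ ℝ^(K×C)

s = gap(S).flatten(1)             # shape (1, K): s_k is the spatial average of S_k
logits = fc(s)                    # shape (1, C): inputs of the softmax layer
```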
When we want the localization map for a certain class, what do we have? We have the weights that were used to calculate the input of the softmax layer for that class. After a forward pass we can take these weights and do something with them to obtain the activation map for class c. To understand what that "something" is, we need to know how the input of the softmax layer for class c is calculated:

y_c = Σ_k s_k · W^fc_{k,c}
where W^fc_{k,c} ∈ ℝ is the element of the matrix W^fc in the kth row and cth column. Now we know the weights used for the considered class c, and the paper proposes to weight all the feature maps S by W^fc_{k,c} to get the object localization map:

M_c = Σ_k W^fc_{k,c} · S_k
The intuition behind this is that we weight the features according to how much each of them matters for the chosen class: feature maps that respond to this class are amplified, while those that have no correlation with it are suppressed.
The paper also explains why Global Average Pooling (GAP) is better than Global Max Pooling (GMP): max pooling only pays attention to the neurons with the highest values, so GMP encourages the loss to react to a single most discriminative part, while GAP encourages the network to identify the full extent of the object. They show that GAP and GMP give roughly the same classification results, but GAP outperforms GMP in localization.
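Just to illustrate the difference between the two pooling operations (a toy snippet, not from the paper's code):

```python
import torch

S = torch.randn(1, 512, 14, 14)   # feature maps (batch, K, H, H)

s_gap = S.mean(dim=(2, 3))        # GAP: every spatial location contributes to the class score
s_gmp = S.amax(dim=(2, 3))        # GMP: only the single strongest location contributes
```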
But this method needs an extra step to get a localization map:
- Do a forward pass.
- Take the weights of the considered class and the feature maps of the last convolutional layer, and compute the localization map.
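A minimal sketch of these two steps, assuming a hypothetical `backbone` that returns the last convolutional feature maps and the `fc` layer from the snippet above (this is my illustration, not the authors' code):

```python
import torch

def class_activation_map(backbone, fc, image, target_class):
    # Step 1: forward pass to get the feature maps of the last conv layer
    S = backbone(image)                           # (1, K, H, H)

    # Step 2: weight the feature maps with the fc weights of the considered class
    w_c = fc.weight[target_class]                 # (K,) column of W^fc for class c
    M_c = (w_c[:, None, None] * S[0]).sum(dim=0)  # (H, H): M_c = Σ_k W^fc_{k,c} · S_k

    # Normalize to [0, 1]; upsample to the image size for visualization if needed
    M_c = (M_c - M_c.min()) / (M_c.max() - M_c.min() + 1e-8)
    return M_c
```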
This approach really can localize an object in an image, but a CVPR 2018 paper shows that such localization is not good enough: not all parts of the object are covered. It proposes a new approach that localizes the complementary parts of the object. The new method follows similar logic but does it in an adversarial learning manner.
Adversarial Complementary Learning (ACoL)
The main issue this method addresses is that existing approaches don't localize the entire object; they find a few patterns for each class and then react only to those patterns. For instance, a duck has a long beak and a rounded head, so the neurons memorize these views and react only to them, while other parts of the object are never recognized.
ACoL addresses two issues:
- No extra step is needed to compute the localization map, only a forward pass.
- It learns to localize the complementary parts of the object.
Instead of a fully connected layer, the authors propose to use a 1 × 1 convolutional layer with stride 1 and a number of filters equal to the number of classes. At its output we get C (number of classes) feature maps of size H × H, which means that at this step we already have a localization map for every trained class:

M_c = Σ_k W^conv_{k,c} · S_k
where W^conv ∈ ℝ^(K×C).
As you can see, the formula for the localization map is essentially the same; the difference is that instead of the fully connected layer used by CAM there is a convolutional layer.
In ACoL the convolutional layer is also followed by GAP and a softmax layer. The input of the cth softmax item for target class c is y^conv_c:

y^conv_c = (1/H²) · Σ_{i,j} M_{i,j,c}
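A sketch of such a classifier head in PyTorch (again with example sizes and my own names; an illustration of the idea, not the official implementation):

```python
import torch
import torch.nn as nn

K, H, C = 512, 14, 10

head = nn.Conv2d(K, C, kernel_size=1, stride=1)  # 1×1 convolution, one filter per class (W^conv ∈ ℝ^(K×C))

S = torch.randn(1, K, H, H)   # feature maps from the backbone
M = head(S)                   # (1, C, H, H): localization map M_c for every class, no extra step needed
y = M.mean(dim=(2, 3))        # (1, C): GAP over each map gives the softmax inputs y^conv_c
```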
Everything described so far is not radically different from CAM, but it already solves the first problem: the extra step for computing the localization map. ACoL shows that no extra step is needed; we get the localization map directly in the forward pass with the same result. At the same time, it also tackles the second problem: finding the entire object in the image.
Complementary parts of object
ACoL proposes two branches that are added before the GAP layer in the main network architecture: branch A (ClassifierA) and branch B (ClassifierB).
- ClassifierA takes the feature maps S, applies a convolutional layer to get the localization maps for each class (as described above), and then prepares the output for a cross-entropy loss.
- ClassifierB is a complementary branch that works collaboratively with ClassifierA. It also takes feature maps as input, but they are first modified using ClassifierA: with a threshold δ (a hyperparameter), ClassifierA defines a discriminative region R in the localization map M^A of the ground-truth class, R = M^A > δ. The feature maps S are then erased using R: the regions that were discriminative for ClassifierA are set to zero, and the rest are kept as they are. These erased feature maps go to ClassifierB, which has a similar architecture to ClassifierA and its own loss function (a code sketch of this step follows below).
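Here is a minimal sketch of the erasing step (hypothetical code: `classifier_a` and `classifier_b` stand for the two convolutional branches, `labels` are the ground-truth classes, and the default value of the threshold `delta` is arbitrary):

```python
import torch

def acol_forward(S, classifier_a, classifier_b, labels, delta=0.6):
    # Branch A: per-class localization maps and GAP logits from the original features
    M_A = classifier_a(S)                            # (B, C, H, H)
    y_A = M_A.mean(dim=(2, 3))                       # (B, C) logits for the cross-entropy loss

    # Discriminative region R of the ground-truth class, after normalizing the map to [0, 1]
    M_gt = M_A[torch.arange(S.size(0)), labels]      # (B, H, H)
    lo = M_gt.amin(dim=(1, 2), keepdim=True)
    hi = M_gt.amax(dim=(1, 2), keepdim=True)
    R = ((M_gt - lo) / (hi - lo + 1e-8) > delta).unsqueeze(1)   # (B, 1, H, H) boolean mask

    # Branch B sees the erased features: zeros where ClassifierA was already confident
    S_erased = S.masked_fill(R, 0.0)
    M_B = classifier_b(S_erased)
    y_B = M_B.mean(dim=(2, 3))
    return y_A, y_B, M_A, M_B
```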
The intuition is that ClassifierB is forced to find the complementary parts of the object that ClassifierA couldn't find, so in the end we cover the entire object. The final localization map is a fusion of the ClassifierA and ClassifierB maps (the paper takes the element-wise maximum of the two).
Note: in all of these operations the localization maps are normalized to values between 0 and 1. The trainable variables of the two classifiers are not shared, and the total loss is the sum of the cross-entropy losses of ClassifierA and ClassifierB.
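Continuing the sketch above with toy stand-in tensors, the total loss and the fused map look roughly like this:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the outputs of the two branches
M_A = torch.rand(1, 10, 14, 14)          # normalized maps from ClassifierA
M_B = torch.rand(1, 10, 14, 14)          # normalized maps from ClassifierB
y_A, y_B = M_A.mean(dim=(2, 3)), M_B.mean(dim=(2, 3))
labels = torch.tensor([3])

loss = F.cross_entropy(y_A, labels) + F.cross_entropy(y_B, labels)  # total loss: sum of both branches
M_fused = torch.max(M_A, M_B)            # final localization map: element-wise maximum of A and B
```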
Conclusion
In the ACoL paper the authors show that they get better localization results. But I tried both methods, and training ACoL took longer and required more time to reach good classification accuracy and, hence, a good localization result. CAM really does produce a good localization map and can be quite enough for production; it also requires no modification of the network architecture, you just train the network as it is. Unfortunately, at work we couldn't get impressive results with ACoL, but it seems like a very interesting approach with potential, and it keeps me curious for further investigation. In the paper the authors describe the architecture of both branches: each contains three convolutional layers. I tried that, but I think one convolution could be quite enough; this is something I haven't tested yet but plan to. Maybe it will reduce training time and give better localization results. Nevertheless, both approaches really do give you a localization map, but I would advise starting with the first one, CAM. If somebody got better results with ACoL, you are welcome to share your experience!
P.S. All pictures and their descriptions were taken from the original papers and are used to show the original papers' results.