The class activation mapping, or CAM, has been the cornerstone of feature attribution methods for multiple vision tasks. Its simplicity and effectiveness have led to wide applications in the explanation of visual predictions and weakly-supervised localization tasks. However, CAM has its own shortcomings. The computation of attribution maps relies on ad-hoc calibration steps that are not part of the training computational graph, making it difficult for us to understand the real meaning of the attribution values. In this paper, we improve CAM by explicitly incorporating a latent variable encoding the location of the cue for recognition in the formulation, thereby subsuming the attribution map into the training computational graph. The resulting model, class activation latent mapping, or CALM, is trained with the expectation-maximization algorithm. Our experiments show that CALM identifies discriminative attributes for image classifiers more accurately than CAM and other visual attribution baselines.
In this paper, we focus on class activation mapping (CAM), which has been the cornerstone of feature attribution research. For CNN models, it answers the question "which pixels are responsible for the prediction?" Overview of CAM is...
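As a rough illustration of what CAM computes, here is a minimal NumPy sketch. It assumes the standard CAM setup (a convolutional feature map followed by global average pooling and a linear classifier) and one common calibration variant: clip negative scores, then divide by the per-image maximum. The function name `cam_attribution` and its arguments are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def cam_attribution(features, class_weights, target_class):
    """Sketch of a CAM map for one image (illustrative, not the authors' code).

    features:      (C, H, W) feature map from the last conv layer.
    class_weights: (K, C) weights of the linear classifier applied after GAP.
    target_class:  index of the class to explain.
    """
    # Class-specific score map, taken before global average pooling and softmax.
    score_map = np.tensordot(class_weights[target_class], features, axes=([0], [0]))
    # Ad-hoc calibration (assumed variant): clip negatives, divide by the image-wide max.
    clipped = np.maximum(score_map, 0.0)
    peak = clipped.max()
    return clipped / peak if peak > 0 else clipped

# Tiny hand-made example: 2 channels, 2 classes, a 2x2 spatial grid.
features = np.array([[[1.0, 2.0], [3.0, 4.0]],
                     [[2.0, 1.0], [1.0, 1.0]]])
class_weights = np.array([[1.0, -1.0],
                          [0.5, 0.5]])
print(cam_attribution(features, class_weights, target_class=0))
```

Because the map is divided by its own maximum, multiplying all features by a positive constant leaves the output unchanged. The calibration happens outside the training computational graph, so the resulting values carry only relative meaning within one image.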
However, CAM has a remaining weakness: the meaning of its attribution values is hard to pin down. The most faithful reading of a CAM value is:
"The pixel-wise pre-GAP, pre-softmax feature value at (h, w), measured on a relative scale within the range of values [0, A], where A is the maximum of the feature values in the entire image."
This is where we address these issues with probabilistic machine learning: a principled way to normalize a quantity is to express it as a probability.
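To make the probabilistic view concrete, the sketch below treats the cue location z as a latent variable, as the abstract describes, and uses the chain rule p(y, z | x) = p(y | x, z) p(z | x). The softmax parameterization of both factors, the function names `softmax` and `calm_attribution`, and the array shapes are assumptions made for illustration; the paper's actual architecture and EM training procedure are not reproduced here.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def calm_attribution(class_logits, location_logits, target_class):
    """Latent-location attribution sketch (illustrative, not the authors' code).

    class_logits:    (K, H, W) per-location class scores.
    location_logits: (H, W) per-location scores for where the cue is.
    """
    # p(y | x, z): at each location z, a distribution over the K classes.
    p_y_given_xz = softmax(class_logits, axis=0)
    # p(z | x): a single distribution over all H*W locations.
    p_z_given_x = softmax(location_logits.reshape(-1), axis=0).reshape(location_logits.shape)
    # Joint p(y, z | x) for the target class, by the chain rule.
    joint = p_y_given_xz[target_class] * p_z_given_x
    # Marginalizing the latent location gives the class prediction p(y | x).
    p_y_given_x = joint.sum()
    # Posterior over locations p(z | x, y), by Bayes' rule.
    posterior = joint / p_y_given_x
    return joint, p_y_given_x, posterior

rng = np.random.default_rng(0)
joint, p_y, posterior = calm_attribution(rng.normal(size=(3, 4, 5)),
                                         rng.normal(size=(4, 5)),
                                         target_class=1)
print(p_y, posterior.sum())
```

Unlike a max-normalized score map, every quantity here is a probability, and the map is produced by the same computation that yields the class prediction, so no separate calibration step is needed.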
CALM has several benefits. One is that each attribution value has a direct probabilistic interpretation:
"The probability that the cue for recognition was at position z when the image x is predicted as y"
One caveat is...
In conclusion, keep CALM and improve your visual feature attribution!