TITLE: Weakly Supervised Attention Pyramid Convolutional Neural Network for Fine-Grained Visual Classification
AUTHOR: Yifeng Ding, Shaoguo Wen, Jiyang Xie, Dongliang Chang, Zhanyu Ma, Zhongwei Si, Haibin Ling
ASSOCIATION: Beijing University of Posts and Telecommunications, Stony Brook University
FROM: arXiv:2002.03353
CONTRIBUTION
- A novel attention pyramid convolutional neural network (AP-CNN) is propsed by building an enhanced pyramidal hierarchy, which combines a top-down pathway of features and a bottom-up pathway of attentions, and thus learns both high-level semantic and low-level detailed feature representations.
- ROI guided refinement is proposed which consists of ROI guided dropblock and ROI guided zoom-in to further refine the features. The dropblock operation helps to locate more discriminative local regions, and the zoom-in operation aligns features with background noises eliminated.
METHOD
AP-CNN is a two-stage network, raw-stage and refined-stage, that respectively takes coarse full images and refined features as input. An overview of the proposed AP-CNN is illustrated in the following figure.
First, the feature and attention pyramid structure takes coarse images as input, which generates the pyramidal features and the pyramidal attentions by establishing hierarchy on the basic CNN following a top-down feature pathway and a bottom-up attention pathway.
Second, once the spatial attention pyramid has been obtained from the raw input, the region proposal network (RPN) proceeds to generate the pyramidal regions of interest (ROIs) in a weakly supervised way. Then the ROI guided refinement is conducted on low-level features with a) the ROI guided dropblock which erases the most discriminative regions selected from small-scaled ROIs, and b) the ROI guided zoom-in which locates the major regions merged from all ROIs.
Third, the refined features are sent into the refined-stage to distill more discriminative information. Both stages set individual classifiers for each pyramid level, and the final classification result is averaged over the raw-stage predictions and the refined-stage predictions.
The Attention Pyramid consists of two types of attentions, Spatial Attention and Channel Attention. The following figure shows the data-flow.
Spatial Attention Pyramid is a set of feature maps of different resolutions and is generated from feature pyramid. Then ROI pyramid is generated from the spatial activations using RPN. At training stage, a ROI is selected to be droped, erasing the informative part and encouraging the network to find more discriminative regions. At testing stage, this operation is skipped. The following figure shows the ROI guided refinement.
The following table shows the comparison between this work and prior work.