TITLE: ResNeSt: Split-Attention Networks
AUTHOR: Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola
ASSOCIATION: Amazon, University of California, Davis
FROM: arXiv:2004.08955
CONTRIBUTION
- A simple architectural modification of the ResNet is explored, incorporating feature-map split attention within the individual network blocks.
- Models using a ResNeSt backbone achieve state-of-the-art performance on several tasks, namely image classification, object detection, instance segmentation, and semantic segmentation.
METHOD
Split-Attention Block
In this work, a Split-Attention block is explored. The radix-major view is presented here, since it is the more convenient one for implementation. The following figure illustrates the Split-Attention block.
As shown in the radix-major implementation of the ResNeSt block, feature-map groups with the same radix index but different cardinality indices are physically adjacent. This layout can be easily accelerated, because the $1 \times 1$ convolutional layers can be unified into a single layer and the $3 \times 3$ convolutional layers can be implemented as a single grouped convolution with the number of groups equal to $RK$.
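The sketch below illustrates this radix-major layout in PyTorch: one unified $1 \times 1$ convolution, one $3 \times 3$ grouped convolution with $RK$ groups, and a split-attention step over the radix splits. It is a minimal sketch following the description above, not the official ResNeSt code; the module name `SplitAttentionConv` and parameters such as `reduction` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttentionConv(nn.Module):
    """Minimal sketch of a radix-major Split-Attention unit (illustrative, not the official code)."""

    def __init__(self, in_channels, channels, radix=2, cardinality=1, reduction=4):
        super().__init__()
        self.radix, self.cardinality, self.channels = radix, cardinality, channels
        inter_channels = max(channels * radix // reduction, 32)
        # Unified layers: one 1x1 conv for all splits, one 3x3 grouped conv
        # with groups = radix * cardinality (the "RK groups" mentioned above).
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, channels * radix, 1, bias=False),
            nn.BatchNorm2d(channels * radix), nn.ReLU(inplace=True),
            nn.Conv2d(channels * radix, channels * radix, 3, padding=1,
                      groups=radix * cardinality, bias=False),
            nn.BatchNorm2d(channels * radix), nn.ReLU(inplace=True),
        )
        # Attention layers operating per cardinal group (grouped 1x1 convs).
        self.fc1 = nn.Conv2d(channels, inter_channels, 1, groups=cardinality)
        self.bn1 = nn.BatchNorm2d(inter_channels)
        self.fc2 = nn.Conv2d(inter_channels, channels * radix, 1, groups=cardinality)

    def forward(self, x):
        x = self.conv(x)                               # (B, R*C, H, W), radix-major layout
        b = x.size(0)
        splits = torch.split(x, self.channels, dim=1)  # R tensors of shape (B, C, H, W)
        gap = F.adaptive_avg_pool2d(sum(splits), 1)    # fuse splits, then global average pool
        att = self.fc2(F.relu(self.bn1(self.fc1(gap))))
        if self.radix > 1:
            # Softmax over the radix dimension within each cardinal group.
            att = att.view(b, self.cardinality, self.radix, -1).transpose(1, 2)
            att = F.softmax(att, dim=1).reshape(b, -1, 1, 1)
            att_splits = torch.split(att, self.channels, dim=1)
            return sum(a * s for a, s in zip(att_splits, splits))
        # With radix = 1 the attention reduces to a sigmoid channel gate.
        return torch.sigmoid(att).view(b, -1, 1, 1) * x
```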
Comparison with Prior Work
The following figure shows the comparison. The Split-Attention block is shown in the cardinality-major view.
SE-Net introduces squeeze-and-attention (called excitation in the original paper), which employs global context to predict channel-wise attention factors. With radix = 1, the Split-Attention block applies a squeeze-and-attention operation to each cardinal group, whereas SE-Net operates on the entire block regardless of the groups. Previous models such as SK-Net introduce feature attention between two network branches, but their operation is not optimized for training efficiency or for scaling to large neural networks. This work generalizes prior work on feature-map attention to the cardinal-group setting, and its implementation remains computationally efficient.
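To make the radix = 1 correspondence concrete, here is a small usage example of the sketch above (the module name `SplitAttentionConv` is the assumed name from that sketch): with radix = 1 and cardinality = 1 the block behaves as an SE-style sigmoid channel gate, while radix > 1 attends across the feature-map splits.

```python
# Quick shape check of the sketch above (illustrative only).
x = torch.randn(2, 64, 32, 32)
block_se_like = SplitAttentionConv(64, 64, radix=1, cardinality=1)  # SE-style sigmoid gate
block_resnest = SplitAttentionConv(64, 64, radix=2, cardinality=2)  # attention over R=2 splits
print(block_se_like(x).shape, block_resnest(x).shape)               # both: torch.Size([2, 64, 32, 32])
```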