Reading Note: R-FCN: Object Detection via Region-based Fully Convolutional Networks

TITLE: R-FCN: Object Detection via Region-based Fully Convolutional Networks

AUTHER: Jifeng Dai, Yi Li, Kaiming He, Jian Sun

ASSOCIATION: MSRA, Tsinghua University

FROM: arXiv:1605.06409

A framework called Region-based Fully Convolutional Network (R-FCN) is develpped for object detection, which consists of shared, fully convolutional architectures.
A set of position-sensitive score maps are introduced to enalbe FCN representing translation variance.
A unique ROI pooling method is proposed to shepherd information from metioned score maps.

The image is processed by a FCN manner network.
At the end of FCN, a RPN (Region Proposal Network) is used to generate ROIs.
On the other hand, a score map of $k^{2}(C+1)$ channels is generated using a bank of specialized convolutional layers.
For each ROI, a selective ROI pooling is utilized to generate a $C+1$ channel score map.
The scores in the score map are averaged to vote for category.
Another $ 4k^2 $ dim convolutional layer is learned for bounding box regression.

Training Details

R-FCN is trained end-to-end with pre-computed region proposals. Both category and position are learnt with the loss function $L(s,t{x,y,w,h})=L{cls}(s{c})+\lambda[c>0]L{reg}(t)$
For each image, N proposals are generated and B out of N proposals are selected to train weights according to the highest losses. B is set to 128 in this work.
4-step alternating training is utilized to realizing feature sharing between R-FCN and RPN.

It is fast (170ms/image, 2.5-20x faster than Faster R-CNN).
End-to-end training is easier to process.
All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection.