Reading Note: SSD: Single Shot MultiBox Detector

TITLE: SSD: Single Shot MultiBox Detector

AUTHOR: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

FROM: arXiv:1512.02325v2

SSD, a single-shot detector for multiple categories, is introduced. It is both fast and accurate.
The network is simple to train end-to-end and reaches high accuracy even with relatively low-resolution input images, which further improves the speed vs. accuracy trade-off.

Network structure:

Multi-scale feature maps from different layers are used in order to handle objects of different sizes.
On each feature map used for detection, a separate small network (filter) learns to predict category scores and location offsets.
Each feature map corresponds to a fixed set of default boxes. These default boxes have different aspect ratios.
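To get a feel for the size of the prediction on one feature map, the count of output values can be sketched as below. This assumes a square feature map and that every default box receives one score per category plus four location offsets. The function name `num_outputs` and its parameters are hypothetical, chosen only for this illustration.

```python
def num_outputs(f_k, boxes_per_location, num_classes):
    """Number of values predicted on one square f_k x f_k feature map.

    Assumption of this sketch: each default box gets `num_classes`
    category scores and 4 location offsets.
    """
    per_box = num_classes + 4
    return f_k * f_k * boxes_per_location * per_box
```

The count grows with the square of the feature map size, so most predictions come from the higher-resolution maps.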

Training:

Default boxes and ground truth boxes are matched in two steps. First, each ground truth box is matched to the default box with the best jaccard overlap. Then, default boxes are matched to any ground truth with jaccard overlap higher than a threshold (0.5 in the paper), so one ground truth box can have several matched default boxes.
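The matching can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the names `jaccard` and `match` are hypothetical, and a default box that exceeds the threshold for several ground truths is assigned to the one it overlaps most (a simplification made here). The default threshold of 0.5 is the value used in the paper.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two boxes
    given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """Return, for each default box, the index of its matched ground
    truth box, or None if the default box is unmatched."""
    if not defaults:
        return []
    assigned = [None] * len(defaults)
    # Threshold rule: a default box takes the ground truth it overlaps
    # most, provided that overlap exceeds the threshold.
    for d, dbox in enumerate(defaults):
        best_t, best_iou = None, threshold
        for t, tbox in enumerate(truths):
            iou = jaccard(dbox, tbox)
            if iou > best_iou:
                best_t, best_iou = t, iou
        assigned[d] = best_t
    # Best-overlap rule: every ground truth claims its best default box,
    # overriding the threshold rule so no ground truth is left unmatched.
    for t, tbox in enumerate(truths):
        best_d = max(range(len(defaults)), key=lambda d: jaccard(defaults[d], tbox))
        assigned[best_d] = t
    return assigned
```

Applying the best-overlap rule last guarantees that each ground truth box ends up with at least one default box, even when none of its overlaps passes the threshold.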
The training objective is a weighted sum of the localization loss (loc) and the confidence loss (conf):
$L(x,c,l,g)= \frac{1}{N}(L_{conf}(x,c)+ \alpha L_{loc}(x,l,g))$
where $N$ is the number of matched default boxes, $x$ indicates which default boxes are matched to which ground truth boxes, and $\alpha$ weights the localization term. The localization loss is the Smooth L1 loss between the predicted box $(l)$ and the ground truth box $(g)$ parameters. The confidence loss is the softmax loss over the class confidences $(c)$.
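A minimal sketch of this objective in plain Python is given below. It is an illustration under stated assumptions rather than the paper's code: label 0 is taken to mean background (unmatched), the box parameters are assumed to be already encoded as 4-tuples, `alpha` defaults to 1, and the loss is set to 0 when there are no matched boxes. The function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_loss(scores, label):
    """Negative log of the softmax probability of `label`,
    computed with the max subtracted for numerical stability."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def ssd_loss(conf_scores, labels, loc_preds, loc_targets, alpha=1.0):
    """Weighted sum of confidence and localization loss, divided by N.

    conf_scores: per-box list of class scores
    labels:      per-box class index, 0 = background (assumption)
    loc_preds, loc_targets: per-box 4-tuples of encoded box parameters;
                 only matched boxes (label != 0) contribute to L_loc
    """
    n = sum(1 for y in labels if y != 0)
    if n == 0:
        return 0.0  # choice made in this sketch when nothing is matched
    l_conf = sum(softmax_loss(s, y) for s, y in zip(conf_scores, labels))
    l_loc = sum(
        smooth_l1(p - g)
        for pred, tgt, y in zip(loc_preds, loc_targets, labels)
        if y != 0
        for p, g in zip(pred, tgt)
    )
    return (l_conf + alpha * l_loc) / n
```

Only matched boxes enter the localization term, while both matched boxes and the selected negatives enter the confidence term.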
The scale of the default boxes for the $k$-th of the $m$ feature maps used for prediction is computed as:
$s_{k}=s_{min}+ \frac{s_{max}-s_{min}}{m-1}(k-1)$
where $s_{min}=0.2$ and $s_{max}=0.95$. The width of a default box is $s_{k}\sqrt{a_{r}}$ and the height is $s_{k}/\sqrt{a_{r}}$, where $a_{r}$ is the aspect ratio. The centre of a default box at location $(i, j)$ in the $k$-th feature map is $(\frac{i+0.5}{|f_{k}|}, \frac{j+0.5}{|f_{k}|})$, where $|f_{k}|$ is the size of the $k$-th square feature map and $i, j \in [0, |f_{k}|)$.
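The scale formula and the box geometry can be sketched as below. The function names `scales` and `default_boxes` are hypothetical, boxes are returned as `(cx, cy, width, height)` in normalized image coordinates, and the handling of `m == 1` is a choice made here to avoid dividing by zero.

```python
import math


def scales(m, s_min=0.2, s_max=0.95):
    """Scale s_k for k = 1..m, linearly spaced from s_min to s_max."""
    if m == 1:
        return [s_min]  # assumption of this sketch: a single map uses s_min
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_boxes(f_k, s_k, aspect_ratios):
    """Default boxes (cx, cy, w, h) for a square f_k x f_k feature map."""
    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            # Centre of cell (i, j), normalized to [0, 1].
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            for a_r in aspect_ratios:
                w = s_k * math.sqrt(a_r)
                h = s_k / math.sqrt(a_r)
                boxes.append((cx, cy, w, h))
    return boxes
```

Because the width and height scale by $\sqrt{a_{r}}$ and $1/\sqrt{a_{r}}$, every box at a given scale has the same area $s_{k}^{2}$ regardless of its aspect ratio.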
Hard negatives are mined. The unmatched default boxes are sorted by confidence and only the top ones are kept as hard negatives, so that the ratio of negatives to positives is at most 3:1.
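The selection step can be sketched as follows. The per-box `scores` list stands for whatever confidence-based ranking value is used to sort unmatched boxes; its exact definition is an assumption here, as is the function name `select_hard_negatives`.

```python
def select_hard_negatives(scores, matched, neg_pos_ratio=3):
    """Return indices of the unmatched boxes kept as hard negatives.

    scores:  per-box ranking value (assumed: higher means harder)
    matched: per-box bool, True if the box is matched to a ground truth
    """
    num_pos = sum(1 for m in matched if m)
    negatives = [i for i, m in enumerate(matched) if not m]
    # Highest-scoring unmatched boxes first.
    negatives.sort(key=lambda i: scores[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive.
    return negatives[: neg_pos_ratio * num_pos]
```

Capping the negatives this way keeps the large number of background boxes from dominating the confidence loss.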
For data augmentation, each training image is either used in its entirety or replaced by a sampled patch whose minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.

It is fast because detection takes a single pass through one network and the input is of lower resolution.
Multi-scale feature maps are used so that it can handle objects of different sizes.
Training is end-to-end.