TITLE: Pelee: A Real-Time Object Detection System on Mobile Devices
AUTHOR: Robert J. Wang, Xiang Li, Shuang Ao, Charles X. Ling
ASSOCIATION: University of Western Ontario
FROM: arXiv:1804.06882
CONTRIBUTION
- A variant of DenseNet architecture called PeleeNet for mobile devices is proposed.
- The network architecture of the Single Shot MultiBox Detector (SSD) is optimized for speed and then combined with PeleeNet.
METHOD
BUILDING BLOCKS
Two-Way Dense Layer. A two-way dense layer is used to obtain receptive fields at different scales. One branch uses a single small kernel (3x3) to capture small objects. The other branch stacks two 3x3 convolution layers to capture larger objects. The structure is shown in the following figure.
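To make the difference between the two branches concrete, here is a minimal sketch that computes the receptive field of a stack of convolutions. It only considers the 3x3 layers described above and assumes stride 1; the function name `receptive_field` is illustrative and not taken from the paper.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels, per side) of a stack of conv layers."""
    strides = strides or [1] * len(kernel_sizes)  # assumption: stride 1 by default
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # a stride enlarges the step seen by later layers
    return rf


small_branch = receptive_field([3])     # one 3x3 convolution
large_branch = receptive_field([3, 3])  # two stacked 3x3 convolutions
print(small_branch, large_branch)       # prints: 3 5
```

Under these assumptions the first branch sees a 3x3 window and the second a 5x5 window, which is why concatenating them mixes two receptive-field scales.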
Stem Block. This block is placed before the first dense layer for cost efficiency. It effectively improves the feature expression ability without adding much computational cost. The structure is shown as follows.
Dynamic Number of Channels in Bottleneck Layer. The number of channels in the bottleneck layer varies according to the input shape to make sure the number of output channels does not exceed the number of its input channels.
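The rule above can be sketched as a simple cap. The version below is only one possible reading, not the authors' exact formula: it assumes the DenseNet-style bottleneck width of 4 times the growth rate as the upper bound and clips it to the input width. The function name, the growth rate of 32, and the example channel counts are illustrative assumptions.

```python
def bottleneck_channels(in_channels, growth_rate=32, max_multiplier=4):
    """Hypothetical dynamic bottleneck width.

    Starts from the DenseNet-style width (max_multiplier * growth_rate) and
    caps it so the bottleneck never outputs more channels than it receives.
    """
    return min(max_multiplier * growth_rate, in_channels)


# Early layers have few input channels, so the bottleneck shrinks with them;
# later layers fall back to the fixed DenseNet-style width.
for in_channels in (32, 64, 128, 256):
    print(in_channels, "->", bottleneck_channels(in_channels))
```

With a fixed width, a layer with only 32 input channels would first be expanded to 128 channels, which costs computation in exactly the early layers where the feature maps are largest.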
Transition Layer without Compression. Experiments show that the compression factor proposed by DenseNet hurts the feature expression, so the number of output channels is kept the same as the number of input channels in transition layers.
Composite Function. Post-activation (Convolution - Batch Normalization - ReLU) is used for speed. With post-activation, every batch normalization layer can be merged into the preceding convolution layer at the inference stage. To compensate for the negative impact of this change on accuracy, a shallow and wide network structure is designed. In addition, a 1x1 convolution layer is added after the last dense block to get a stronger representational ability.
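The merge works because, at inference, batch normalization is a fixed per-channel affine map applied directly to the convolution output, so the two can be collapsed into one convolution. The sketch below illustrates this with a 1x1 convolution evaluated at a single pixel, written in plain Python; all function names and numeric values are made up for illustration and are not from the paper.

```python
import math


def conv1x1(x, weights, bias):
    """1x1 convolution at one pixel: one weight row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that reproduce conv followed by BN."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-output-channel scale
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


# Illustrative values: 3 input channels, 2 output channels.
x = [1.0, -2.0, 0.5]
weights = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.05, -0.1], [0.9, 1.2]

reference = batch_norm(conv1x1(x, weights, bias), gamma, beta, mean, var)
folded = conv1x1(x, *fold_bn(weights, bias, gamma, beta, mean, var))
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, folded)))
```

In a pre-activation ordering (Batch Normalization - ReLU - Convolution), the nonlinearity sits between the normalization and the next convolution, so the same folding is not available.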
ARCHITECTURE
The overall network architecture is given in the following table.
OPTIMIZATION FOR SSD
Feature Map Selection. Feature maps at five scales (19x19, 10x10, 5x5, 3x3, and 1x1) are selected. Larger-resolution feature maps are discarded to speed up inference.
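A rough way to see the saving is to count the spatial locations at which predictions are made. The sketch below assumes the discarded map is a 38x38 one, as used by the original SSD with 300x300 input; the number of default boxes per location is left out because it is not stated here, and the function name is illustrative.

```python
def num_locations(feature_map_sizes):
    """Total number of spatial positions across square feature maps."""
    return sum(size * size for size in feature_map_sizes)


selected_maps = [19, 10, 5, 3, 1]        # scales listed above
with_larger_map = [38] + selected_maps   # assumption: a 38x38 map is the one dropped

kept = num_locations(selected_maps)
total = num_locations(with_larger_map)
print(kept, total)  # prints: 496 1940
print(f"share of locations removed: {1 - kept / total:.1%}")
```

Under this assumption, the single largest map accounts for roughly three quarters of all prediction locations, so dropping it removes most of the per-location prediction work at the cost of fewer predictions for small objects.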
Residual Prediction Block. For each feature map used for detection, a residual block (ResBlock) is applied before the prediction is made, as shown in the following figure.
PERFORMANCE
The classification performance on ILSVRC2012 is shown in the following table.
The detection performance on VOC2007 is shown in the following table.
The detection performance on COCO2015 is shown in the following table.
SOME IDEAS
From my own experience, depthwise (DW) convolution is not pruning-friendly, so recent pruning methods such as ThiNet and Net-Trim work poorly on it. This work uses conventional convolutional layers, so those pruning methods may be applicable here.