
TITLE: Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection

AUTHOR: Alexander Wong, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl

ASSOCIATION: University of Waterloo, DarwinAI

FROM: arXiv:1802.06488

CONTRIBUTION

  1. A single-shot detection deep convolutional neural network, Tiny SSD, is designed specifically for real-time embedded object detection.
  2. A non-uniform Fire module is proposed based on SqueezeNet.
  3. The network achieves 61.3% mAP on the VOC 2007 dataset with a model size of 2.3MB.

METHOD

DESIGN STRATEGIES

Tiny SSD network for real-time embedded object detection is composed of two main sub-network stacks:

  1. A non-uniform Fire sub-network stack.
  2. A non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers.

The first sub-network stack feeds into the second sub-network stack. Both sub-networks need careful design to run on an embedded device. The first sub-network serves as the backbone, which directly affects the detection performance. The second sub-network must balance accuracy against model size and inference speed.

Three key design strategies are:

  1. Reduce the number of $3 \times 3$ filters as much as possible.
  2. Reduce the number of input channels to $3 \times 3$ filters where possible.
  3. Perform downsampling at a later stage in the network.
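
A minimal PyTorch-style sketch of a Fire module that embodies these strategies is given below. The channel counts are illustrative assumptions, not the exact Tiny SSD configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze layer cuts the channels fed to the expand
    layers (strategy 2), and a large share of the expand filters are 1x1
    rather than 3x3 (strategy 1). Downsampling is left to pooling layers
    placed late in the network (strategy 3)."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# A "non-uniform" stack simply uses different (hand-tuned) channel counts per module.
fire = Fire(in_ch=64, squeeze_ch=16, expand1x1_ch=54, expand3x3_ch=54)
out = fire(torch.randn(1, 64, 75, 75))  # -> (1, 108, 75, 75)
```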

NETWORK STRUCTURE

Fire

Auxiliary Layers

PERFORMANCE

Performance

SOME THOUGHTS

The paper uses half-precision floating point to store the model, which halves the model size. From my own experience, several methods can be tried to deploy a deep learning model on embedded devices, including

  1. Architecture design, as illustrated by this work.
  2. Model pruning, such as decomposition, filter pruning and connection pruning.
  3. BLAS library optimization.
  4. Algorithm optimization. Taking SSD as an example, the Prior-Box layer needs only one forward pass as long as the input image size does not change (see the sketch below).
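
As a hedged sketch of point 4: the prior boxes depend only on the input resolution and the anchor configuration, so they can be computed once and reused across frames. The helper below is hypothetical and not tied to any particular SSD implementation.

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def prior_boxes(img_size, feat_size, min_size, max_size):
    """Generate (cx, cy, w, h) priors, normalized to [0, 1], for one square
    feature map. lru_cache makes repeated calls with an unchanged input size
    return the stored result instead of recomputing it every frame."""
    step = img_size / feat_size
    boxes = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx, cy = (j + 0.5) * step / img_size, (i + 0.5) * step / img_size
            boxes.append([cx, cy, min_size / img_size, min_size / img_size])
            s = np.sqrt(min_size * max_size) / img_size
            boxes.append([cx, cy, s, s])
    return np.asarray(boxes, dtype=np.float32)

priors = prior_boxes(300, 38, 30, 60)  # computed once, cached afterwards
```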

TITLE: $S^3FD$: Single Shot Scale-invariant Face Detector

AUTHOR: Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li

ASSOCIATION: Chinese Academy of Sciences

FROM: arXiv:1708.05237

CONTRIBUTION

  1. Proposing a scale-equitable face detection framework with a wide range of anchor-associated layers and a series of reasonable anchor scales so as to handle different scales of faces well.
  2. Presenting a scale compensation anchor matching strategy to improve the recall rate of small faces.
  3. Introducing a max-out background label to reduce the high false positive rate of small faces.
  4. Achieving state-of-the-art results on AFW, PASCAL face, FDDB and WIDER FACE with real-time speed.

METHOD

There are mainly three reasons why the performance of anchor-based detectors drops dramatically as objects become smaller:

  1. Biased Framework. Firstly, the stride of the lowest anchor-associated layer is too large, so few features are reliable for small faces. Secondly, the anchor scale mismatches the receptive field, and both are too large to fit small faces.
  2. Anchor Matching Strategy. Anchor scales are discrete while face scales are continuous. Faces whose scale lies far from the anchor scales, such as tiny and outer faces, cannot match enough anchors.
  3. Background from Small Anchors. Small anchors lead to a sharp increase in the number of negative anchors on the background, bringing about many false positive faces.

The architecture of Single Shot Scale-invariant Face Detector is shown in the following figure.

Framework

Scale-equitable framework

Constructing Architecture

  • Base Convolutional Layers: layers of VGG16 from conv1_1 to pool5 are kept.
  • Extra Convolutional Layers: fc6 and fc7 of VGG16 are converted to convolutional layers. Then extra convolutional layers are added, which is similar to SSD.
  • Detection Convolutional Layers: conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected as the detection layers.
  • Normalization Layers: L2 normalization is applied to conv3_3, conv4_3 and conv5_3 to rescale their norms to 10, 8 and 5, respectively. The scales are then learned during backpropagation.
  • Predicted Convolutional Layers: for each anchor, 4 offsets relative to its coordinates and $N_{s}$ scores for classification are predicted, where $N_s=N_m+1$ ($N_m$ is the number of max-out background scores) for the conv3_3 detection layer and $N_s=2$ for the other detection layers.
  • Multi-task Loss Layer: Softmax loss for classification and smooth L1 loss for regression.

Designing scales for anchors

  • Effective receptive field: anchors should be significantly smaller than the theoretical receptive field in order to match the effective receptive field.
  • Equal-proportion interval principle: the scale of each anchor is 4 times the interval (stride) of its detection layer, which guarantees that anchors of different scales have the same density on the image, so that faces of various scales can match approximately the same number of anchors (a short sketch follows this list).
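
Concretely, with the detection-layer strides used by S³FD, the equal-proportion interval principle assigns the anchor scales as follows (strides taken from the paper's layer configuration):

```python
# Stride (anchor interval) of each detection layer and the anchor scale that
# the equal-proportion interval principle assigns to it: scale = 4 * stride.
strides = {"conv3_3": 4, "conv4_3": 8, "conv5_3": 16,
           "conv_fc7": 32, "conv6_2": 64, "conv7_2": 128}
anchor_scales = {layer: 4 * stride for layer, stride in strides.items()}
# -> {'conv3_3': 16, 'conv4_3': 32, 'conv5_3': 64,
#     'conv_fc7': 128, 'conv6_2': 256, 'conv7_2': 512}
```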

Scale compensation anchor matching strategy

To solve two problems, namely 1) the average number of matched anchors is about 3, which is not enough to recall faces with high scores, and 2) the number of matched anchors is highly related to the anchor scales, a scale compensation anchor matching strategy with two stages is proposed:

  • Stage One: decrease the matching threshold from 0.5 to 0.35 in order to increase the average number of matched anchors.
  • Stage Two: first pick out the anchors whose Jaccard overlap with tiny or outer faces is higher than 0.1, then sort them and select the top-N as matched anchors, where N is the average number of matched anchors from Stage One (see the sketch below).
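
A NumPy sketch of one plausible reading of this two-stage strategy, assuming a precomputed Jaccard-overlap matrix between anchors and ground-truth faces:

```python
import numpy as np

def scale_compensation_match(iou, stage1_thresh=0.35, stage2_thresh=0.1):
    """iou: (num_anchors, num_faces) Jaccard overlaps.
    Returns, for each face, the indices of its matched anchors."""
    # Stage one: a lowered threshold (0.5 -> 0.35) raises the average match count.
    matches = [np.where(iou[:, f] >= stage1_thresh)[0] for f in range(iou.shape[1])]
    n_avg = max(1, int(round(np.mean([len(m) for m in matches]))))
    # Stage two: for faces that are still under-matched (e.g. tiny/outer faces),
    # take the top-N anchors whose overlap exceeds 0.1, with N the stage-one average.
    for f, m in enumerate(matches):
        if len(m) < n_avg:
            cand = np.where(iou[:, f] > stage2_thresh)[0]
            cand = cand[np.argsort(-iou[cand, f])][:n_avg]
            matches[f] = cand
    return matches
```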

Max-out background label

For the conv3_3 detection layer, a max-out background label is applied: for each of the smallest anchors, $N_m$ scores are predicted for the background label and the highest one is chosen as its final background score (a sketch follows).
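
A sketch of the max-out operation, assuming the conv3_3 prediction branch emits $N_m$ background logits plus one face logit per anchor:

```python
import torch

def maxout_background(scores, n_m):
    """scores: (num_anchors, n_m + 1) logits; the first n_m columns are
    background candidates, the last column is the face score."""
    bg = scores[:, :n_m].max(dim=1, keepdim=True).values  # best background logit
    face = scores[:, n_m:]
    return torch.cat([bg, face], dim=1)  # (num_anchors, 2), ready for softmax
```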

Training

  1. Training dataset and data augmentation, including color distortion, random crop and horizontal flip.
  2. Loss function is a multi-task loss defined in RPN.
  3. Hard negative mining.

The experiment result on WIDER FACE is illustrated in the following figure.

Experiment

TITLE: Single-Shot Refinement Neural Network for Object Detection

AUTHOR: Shifeng Zhang, LongyinWen, Xiao Bian, Zhen Lei, Stan Z. Li

ASSOCIATION: CASIA, GE Global Research

FROM: arXiv:1711.06897

CONTRIBUTION

  1. A novel one-stage framework for object detection is introduced, composed of two inter-connected modules, i.e., the ARM (Anchor Refinement Module) and the ODM (Object Detection Module). This leads to performance better than the two-stage approach while maintaining high efficiency of the one-stage approach.
  2. To ensure the effectiveness, TCB (Transfer Connection Block) is designed to transfer the features in the ARM to handle more challenging tasks, i.e., predict accurate object locations, sizes and class labels, in the ODM.
  3. RefineDet achieves the latest state-of-the-art results on generic object detection.

METHOD

The idea of this work can be seen as an improvement over the DSSD method. DSSD uses multi-scale feature maps to predict categories and regress bounding boxes, and additionally uses deconvolution to increase the resolution of the last feature maps. In this work, a binary classifier and a coarse regressor are added to the downsampling stages, and their outputs serve as inputs to the multi-category classifier and the fine regressor. The framework of this single-shot refinement neural network is illustrated in the following figure.

Framework

Anchor Refinement Module

The ARM is designed to (1) identify and remove negative anchors to reduce search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor.

In the training phase, a refined anchor box whose negative confidence is larger than a preset threshold θ (θ = 0.99, set empirically) is discarded when training the ODM, as sketched below.
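
A sketch of this filtering step, assuming arm_neg_conf holds the ARM's softmax background probability for each refined anchor:

```python
import torch

def filter_refined_anchors(refined_anchors, arm_neg_conf, theta=0.99):
    """Drop refined anchors that the ARM is almost certain are background,
    so the ODM trains only on harder, more informative anchors."""
    keep = arm_neg_conf <= theta          # boolean mask over anchors
    return refined_anchors[keep], keep
```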

Object Detection Module

The ODM takes the refined anchors produced by the ARM as input to further improve the regression and predict multi-class labels.

Transfer Connection Block

The TCB is introduced to convert features of different layers from the ARM into the form required by the ODM, so that the ODM can share features with the ARM. Another function of the TCBs is to integrate large-scale context by adding high-level features to the transferred features, which improves detection accuracy. An illustration of the TCB can be found in the following figure.
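
A PyTorch-style sketch of a TCB under common assumptions (three 3x3 convolutions with 256 channels, and a deconvolution that upsamples the deeper TCB output before the element-wise sum); the exact layer arrangement follows the paper's figure only approximately.

```python
import torch
import torch.nn as nn

class TCB(nn.Module):
    """Transfer Connection Block: converts an ARM feature map into the form
    used by the ODM and injects higher-level context via a deconvolution."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.conv3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, 2, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, arm_feat, deeper_feat=None):
        x = self.conv2(self.relu(self.conv1(arm_feat)))
        if deeper_feat is not None:
            x = x + self.deconv(deeper_feat)   # add upsampled large-scale context
        return self.relu(self.conv3(self.relu(x)))
```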

TCB

Training

The training method is much like that of SSD. The experiment results and comparisons with other methods can be found in the following table.

Results

TITLE: Panoptic Segmentation

AUTHOR: Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar

ASSOCIATION: FAIR, Heidelberg University

FROM: arXiv:1801.00868

CONTRIBUTION

  1. A novel ‘Panoptic Segmentation’ (PS) task is proposed and studied.
  2. A panoptic quality (PQ) measure is introduced to measure performance on the task.
  3. A basic algorithmic approach to combine instance and semantic segmentation outputs into panoptic outputs is proposed.

PROBLEM DEFINITION

Panoptic refers to a unified, global view of segmentation. Each pixel of an image must be assigned a semantic label and an instance id. Pixels with the same label and id belong to the same object; for stuff labels the instance id is ignored.

Panoptic Segmentation

Given a predetermined set of $L$ semantic categories encoded by $\mathcal{L} := \{1, \dots, L\}$, the task requires a panoptic segmentation algorithm to map each pixel $i$ of an image to a pair $(l_{i}, z_{i}) \in \mathcal{L} \times \mathbb{N}$, where $l_{i}$ represents the semantic class of pixel $i$ and $z_{i}$ represents its instance id.

The semantic label set consists of the subsets $\mathcal{L}^{St}$ and $\mathcal{L}^{Th}$, such that $\mathcal{L} = \mathcal{L}^{St} \cup \mathcal{L}^{Th}$ and $\mathcal{L}^{St} \cap \mathcal{L}^{Th} = \emptyset$. These subsets correspond to stuff labels and thing labels, respectively.

Panoptic Quality (PQ)

For each class, the unique matching splits the predicted and ground truth segments into three sets: true positives (TP), false positives (FP), and false negatives (FN), representing matched pairs of segments, unmatched predicted segments, and unmatched ground truth segments, respectively. Given these three sets, PQ is defined as:

$$PQ = \frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$

A predicted segment and a ground truth segment can match only if their intersection over union (IoU) is strictly greater than 0.5.

PQ can be seen as the multiplication of a Segmentation Quality (SQ) term and a Detection Quality (DQ) term:

$$PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_{\text{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{DQ}}$$

where the first factor can be seen as SQ (the average IoU of matched segments) and the second factor as DQ (an F1-like detection score).
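
A minimal helper that evaluates PQ, SQ and DQ for one class from the matched segments, as a sketch of the formulas above:

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU values of matched (TP) prediction/ground-truth pairs."""
    tp = len(tp_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp if tp else 0.0           # average IoU of matched segments
    dq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)    # F1-style detection term
    return sq * dq, sq, dq

# Example: two matches with IoU 0.8 and 0.6, one unmatched prediction, one missed segment.
pq, sq, dq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)  # sq = 0.7, dq ≈ 0.667, pq ≈ 0.467
```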

Human vs. Machine

Human vs. Machine

TITLE: Progressive Growing of GANs for Improved Quality, Stability, and Variation

AUTHOR: Tero Karras, Timo Aila, Samuli Laine, Jaakko Lehtinen

ASSOCIATION: NVIDIA

FROM: ICLR2018

CONTRIBUTION

A training methodology is proposed for GANs which starts with low-resolution images, and then progressively increases the resolution by adding layers to the networks. This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.

METHOD

PROGRESSIVE GROWING OF GANS

The following figure illustrates the training procedure of this work.

Framework

The training starts with both the generator $G$ and discriminator $D$ having a low spatial resolution of $4 \times 4$ pixels. As the training advances, successive layers are incrementally added to $G$ and $D$, thus increasing the spatial resolution of the generated images. All existing layers remain trainable throughout the process. Here $N \times N$ refers to convolutional layers operating on $N \times N$ spatial resolution. This allows stable synthesis in high resolutions and also speeds up training considerably.

Framework

Fade-in is adopted when new layers are added, in order to double the resolution of the generator $G$ and discriminator $D$ smoothly. This example illustrates the transition from $16 \times 16$ images (a) to $32 \times 32$ images (c). During the transition (b), the layers that operate on the higher resolution work like a residual block whose weight $\alpha$ increases linearly from 0 to 1. Here 2x and 0.5x refer to doubling and halving the image resolution using nearest-neighbor filtering and average pooling, respectively. toRGB represents a layer that projects feature vectors to RGB colors and fromRGB does the reverse; both use $1 \times 1$ convolutions. When training the discriminator, the real images are downscaled to match the current resolution of the network. During a resolution transition, interpolation is carried out between two resolutions of the real images, similar to how the generator output combines two resolutions.
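
The fade-in itself is a simple linear blend between the nearest-neighbor-upsampled output of the old top layer and the output of the newly added layer; a sketch for the generator side:

```python
import torch
import torch.nn.functional as F

def generator_fade_in(old_rgb, new_rgb, alpha):
    """old_rgb: toRGB output at the previous (lower) resolution,
    new_rgb: toRGB output of the newly added layer at double resolution,
    alpha: blending weight, grown linearly from 0 to 1 during the transition."""
    up = F.interpolate(old_rgb, scale_factor=2, mode="nearest")  # 2x upsample
    return (1.0 - alpha) * up + alpha * new_rgb
```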

INCREASING VARIATION USING MINIBATCH STANDARD DEVIATION

  1. Compute the standard deviation for each feature in each spatial location over the minibatch.
  2. Average these estimates over all features and spatial locations to arrive at a single value.
  3. Construct one additional (constant) feature map by replicating the value, and concatenate it at all spatial locations and over the minibatch (see the sketch below).
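
A sketch of the resulting minibatch standard-deviation layer:

```python
import torch

def minibatch_stddev(x, eps=1e-8):
    """x: (N, C, H, W). Appends one constant feature map holding the average
    per-feature, per-location standard deviation over the minibatch."""
    std = torch.sqrt(x.var(dim=0, unbiased=False) + eps)   # (C, H, W)
    mean_std = std.mean()                                   # single scalar
    extra = mean_std * x.new_ones(x.shape[0], 1, x.shape[2], x.shape[3])
    return torch.cat([x, extra], dim=1)                     # (N, C + 1, H, W)
```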

NORMALIZATION IN GENERATOR AND DISCRIMINATOR

EQUALIZED LEARNING RATE. A trivial $\mathcal{N}(0, 1)$ initialization is used, and the weights are then explicitly scaled at runtime: $\hat{w}_i = w_i/c$, where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer. The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance of commonly used adaptive stochastic gradient descent methods.
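
In most public implementations this runtime scaling is realized by storing $\mathcal{N}(0,1)$ weights and multiplying them by the per-layer He constant in every forward pass; a sketch:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EqualizedConv2d(nn.Module):
    """Convolution with equalized learning rate: weights are stored as N(0, 1)
    draws and rescaled by the He constant at runtime, so every layer sees a
    comparable effective learning rate under adaptive optimizers."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.scale = math.sqrt(2.0 / (in_ch * kernel_size * kernel_size))  # He constant
        self.padding = padding

    def forward(self, x):
        return F.conv2d(x, self.weight * self.scale, self.bias, padding=self.padding)
```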

PIXELWISE FEATURE VECTOR NORMALIZATION IN GENERATOR. To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, the feature vector at each pixel is normalized to unit length in the generator after every convolutional layer, using a variant of "local response normalization":

$$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a_{x,y}^{j}\right)^{2} + \epsilon}}$$

where $\epsilon=10^{-8}$, $N$ is the number of feature maps, and $a_{x,y}$ and $b_{x,y}$ are the original and normalized feature vectors at pixel $(x,y)$, respectively.
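
A direct sketch of this normalization:

```python
import torch

def pixel_norm(a, eps=1e-8):
    """a: (N, C, H, W). Normalize the C-dimensional feature vector at every
    pixel to roughly unit length, as in the formula above."""
    return a / torch.sqrt((a * a).mean(dim=1, keepdim=True) + eps)
```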

TITLE: Be Your Own Prada: Fashion Synthesis with Structural Coherence

AUTHOR: Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy

ASSOCIATION: The Chinese University of Hong Kong, University of Toronto, Vector Institute, Uber Advanced Technologies Group

FROM: ICCV2017

CONTRIBUTION

A method that can generate new outfits onto existing photos is developed so that it can

  1. retain the body shape and pose of the wearer,
  2. produce regions and the associated textures that conform to the language description,
  3. enforce coherent visibility of body parts.

METHOD

Given an input photograph of a person and a sentence description of a new desired outfit, the model first generates a segmentation map $\tilde{S}$ using the generator from the first GAN. Then the new image is rendered with another GAN, with the guidance from the segmentation map generated in the previous step. At test time, the final rendered image is obtained with a forward pass through the two GAN networks. The workflow of this work is shown in the following figure.

Framework

The first generator $G_{shape}$ aims to generate the desired semantic segmentation map $\tilde{S}$ by conditioning on the spatial constraint $\downarrow m(S_{0})$, the design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_{S}$. Here $S_{0}$ is the segmentation map of the original image, with a height of $m$, a width of $n$ and $L$ channels, where $L$ is the number of labels. $\downarrow m(S_{0})$ downsamples and merges $S_{0}$ so that it is agnostic of the clothing worn in the original image and only captures information about the user's body. Thus $G_{shape}$ can generate a segmentation map $\tilde{S}$ with sleeves from a segmentation map $S_{0}$ without sleeves.

The second generator $G_{image}$ renders the final image $\tilde{I}$ based on the generated segmentation map $\tilde{S}$, design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_I$.

TITLE: Detect to Track and Track to Detect

AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

ASSOCIATION: Graz University of Technology, University of Oxford

FROM: arXiv:1710.03958

CONTRIBUTION

  1. A ConvNet architecture is set up for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.
  2. Correlation features that represent object co-occurrences across time are introduced to aid the ConvNet during tracking.
  3. Frame-level detections are linked to produce high accuracy detections at the video-level based on across-frame tracklets.

METHOD

For frame-level detections, this work adopts R-FCN as the base framework to detect objects in a single frame. Inter-frame correlation features are extracted from the feature maps of the two frames. A multi-task loss of localization, classification and displacement is used to train the network. The workflow of this work is shown in the following figure.

Framework

The key innovation of this work is an operation denoted ROI tracking. The input to this operation is the concatenation of the bounding-box regression features of the two frames and the correlation features between them. The correlation layer performs a point-wise feature comparison of the two feature maps $\textbf{x}_{l}^{t}$ and $\textbf{x}_{l}^{t+\tau}$,

$$x_{corr}^{t,t+\tau}(i,j,p,q) = \left\langle \textbf{x}_{l}^{t}(i,j),\ \textbf{x}_{l}^{t+\tau}(i+p, j+q) \right\rangle$$

where $-d \leq p \leq d$ and $-d \leq q \leq d$ are offsets to compare features in a square neighbourhood around the locations $i$, $j$ in the feature map, defined by the maximum displacement $d$.
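
A dense, unoptimized sketch of such a correlation layer (real implementations usually restrict it to a strided neighbourhood for efficiency):

```python
import torch
import torch.nn.functional as F

def correlation(feat_t, feat_tau, d):
    """feat_t, feat_tau: (N, C, H, W) feature maps from frames t and t + tau.
    Returns (N, (2d+1)^2, H, W): the dot product <x_t(i, j), x_tau(i+p, j+q)>
    for every location (i, j) and every offset |p|, |q| <= d."""
    n, c, h, w = feat_t.shape
    padded = F.pad(feat_tau, (d, d, d, d))  # zero-pad so shifted views stay in range
    maps = []
    for p in range(-d, d + 1):
        for q in range(-d, d + 1):
            shifted = padded[:, :, d + p:d + p + h, d + q:d + q + w]
            maps.append((feat_t * shifted).sum(dim=1, keepdim=True))
    return torch.cat(maps, dim=1)
```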

The loss function is a multi-task combination of the frame-level classification and bounding-box regression terms and the across-frame track (displacement) regression term mentioned above.

A class-wise linking score is defined to combine detections and tracks across time:

$$s_{c}\left(D_{i}^{t}, D_{j}^{t+\tau}, T^{t,t+\tau}\right) = p_{i,c}^{t} + p_{j,c}^{t+\tau} + \phi\left(D_{i}^{t}, D_{j}^{t+\tau}, T^{t,t+\tau}\right)$$

where the pairwise term $\phi$ evaluates to 1 if the IoU overlap of the track correspondence $T^{t,t+\tau}$ with the detection boxes $D_{i}^{t}$ and $D_{j}^{t+\tau}$ is larger than 0.5, and 0 otherwise. $p_{i,c}^{t}$ and $p_{j,c}^{t+\tau}$ are the softmax probabilities for class $c$. The optimal path across a video can be found by maximizing the scores over the duration $T$ of the video. Once the optimal tube is found, the detections corresponding to that tube are removed, and the detection scores within the tube are reweighted by adding the mean of the 50% highest scores in that tube. The procedure is then applied again to the remaining detections.

TITLE: Interpretable Convolutional Neural Networks

AUTHOR: Quanshi Zhang, Ying Nian Wu, Song-Chun Zhu

ASSOCIATION: UCLA

FROM: arXiv:1710.00935

CONTRIBUTION

  1. Slightly revised CNNs are proposed to improve their interpretability, which can be broadly applied to CNNs with different network structures.
  2. No annotations of object parts and/or textures are needed to ensure each high-layer filter to have a certain semantic meaning. Each filter automatically learns a meaningful object-part representation without any additional human supervision.
  3. When a traditional CNN is modified into an interpretable CNN, the experimental settings need not be changed for learning, i.e., the interpretable CNN does not change the loss function on the top layer and uses exactly the same training samples.
  4. The design for interpretability may decrease the discriminative power of the network a bit, but such a decrease is limited within a small range.

METHOD

The loss for a filter is illustrated in the following figure.

Framework

A feature map is expected to be strongly activated by images of a certain category and to keep silent on other images. Therefore, a number of templates are used to evaluate the fitness between the current feature map and the ideal distribution of activations w.r.t. its semantics. A template is an ideal distribution of activations over spatial locations. The loss for a filter is formulated in terms of the mutual information between the feature map $\textbf{X}$ and the templates $\textbf{T}$.

The loss can be rewritten as a sum of three terms:

The first term is a constant denoting the prior entropy of $\textbf{T}^{+}$. The second term encourages a low conditional entropy of inter-category activations, which means that a well-learned filter needs to be exclusively activated by a certain category and keep silent on other categories. The third term encourages a low conditional entropy of the spatial distribution of activations: a well-learned filter should only be activated by a single region of the feature map, instead of repetitively appearing at different locations.

SOME THOUGHTS

This loss can reduce the redundancy among filters, which may be used to compress the model.