TITLE: Deformable Part-based Fully Convolutional Network for Object Detection

AUTHOR: Taylor Mordan, Nicolas Thome, Matthieu Cord, Gilles Henaff

FROM: arXiv:1707.06175

CONTRIBUTIONS

  1. Deformable Part-based Fully Convolutional Network (DP-FCN), an end-to-end model integrating ideas from DPM into region-based deep ConvNets for object detection, is proposed.
  2. A new deformable part-based RoI pooling layer is introduced, which explicitly selects discriminative elements of objects around region proposals by simultaneously optimizing latent displacements of all parts.
  3. A deformation-aware localization module is designed, which exploits the configuration information from part displacements to refine localization.

METHOD

R-FCN is the work closest to DP-FCN. Both are developed on the basis of Faster R-CNN, in which an RPN generates object proposals and a dedicated pooling layer extracts features for classification and localization. The architecture of DP-FCN is illustrated in the following figure. A deformable part-based RoI pooling layer follows an FCN backbone, and two sibling branches then predict the category and the location respectively. The output of the backbone FCN is similar to that of R-FCN: it has $ k^2(C+1) $ channels corresponding to $ k \times k $ parts and $ C $ categories plus background.

DP-FCN

Deformable part-based RoI pooling

For each input channel, just as in DPM, a transformation is carried out to spread high responses to nearby locations, taking the deformation costs into account.

Deformable part-based RoI pooling

In my understanding, the region proposal from the RPN works like the root filter in DPM. The proposal is evenly divided into $ k \times k $ sub-regions (parts), and each sub-region is then displaced, taking the deformation cost into account. The displacements computed during the forward pass are stored and used to backpropagate gradients at the same locations.
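To make the mechanism concrete, here is a toy NumPy sketch of the pooling for a single class channel. Each part's displacement is searched over a small window with a quadratic deformation cost; the weight `lam`, the search range `max_disp` and the within-part mean are simplifications of mine, not the paper's exact formulation:

```python
import numpy as np

def deformable_part_roi_pool(score_map, roi, k=3, lam=0.1, max_disp=2):
    """Toy deformable part-based RoI pooling for one class channel.

    score_map: (H, W) array of part responses.
    roi: (x0, y0, x1, y1) proposal in map coordinates (ints).
    lam, max_disp: hypothetical deformation weight and search range.
    """
    H, W = score_map.shape
    x0, y0, x1, y1 = roi
    pw, ph = (x1 - x0) // k, (y1 - y0) // k
    pooled = np.full((k, k), -np.inf)
    disp = np.zeros((k, k, 2), dtype=int)
    for i in range(k):            # part row
        for j in range(k):        # part column
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    ys = min(max(y0 + i * ph + dy, 0), H - ph)
                    xs = min(max(x0 + j * pw + dx, 0), W - pw)
                    cell = score_map[ys:ys + ph, xs:xs + pw]
                    # displaced part response minus quadratic deformation cost
                    score = cell.mean() - lam * (dx * dx + dy * dy)
                    if score > pooled[i, j]:
                        pooled[i, j] = score
                        disp[i, j] = (dy, dx)   # kept for the backward pass
    return pooled, disp
```

The stored displacements are exactly what the localization module described below consumes.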

Classification and localization predictions with deformable parts

Predictions are performed with two sibling branches for classification and relocalization of region proposals, as is common practice. The classification branch is simply composed of average pooling followed by a softmax layer.

Deformation-aware localization refinement

As for location prediction, every part predicts 4 bounding-box regression values. In addition, the part displacements are fed to two fully connected layers, and the result is element-wise multiplied with those regression values to yield the final localization output for each class.
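A minimal PyTorch sketch of this module, assuming each of the $ k \times k $ parts contributes a 2-D displacement; the hidden width and tensor shapes are assumptions, not from the paper:

```python
import torch
import torch.nn as nn

class DeformationAwareRefinement(nn.Module):
    """Sketch: gate the averaged part-wise box regression with a signal
    computed from the part displacements (hidden size is an assumption)."""

    def __init__(self, k=3, num_classes=21, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(k * k * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 4 * num_classes),
        )

    def forward(self, bbox_pred, disp):
        # bbox_pred: (N, 4*num_classes) regression values averaged over parts
        # disp: (N, k, k, 2) latent displacements from the pooling layer
        gate = self.fc(disp.flatten(1).float())
        return bbox_pred * gate   # element-wise multiplication

refine = DeformationAwareRefinement()                      # 21 classes -> 84 outputs
out = refine(torch.randn(2, 84), torch.zeros(2, 3, 3, 2))
```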

TITLE: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

AUTHOR: Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun

ASSOCIATION: Megvii Inc (Face++)

FROM: arXiv:1707.01083

CONTRIBUTIONS

  1. Two operations, pointwise group convolution and channel shuffle, are proposed to greatly reduce computation cost while maintaining accuracy.

Motivation

In MobileNet and other works, efficient depthwise separable convolutions or group convolutions strike an excellent trade-off between representation capability and computational cost. However, neither design fully takes the $ 1 \times 1 $ convolutions (also called pointwise convolutions in MobileNet) into account, and these require considerable complexity.

Channel Shuffle for Group Convolutions

To address this issue, a straightforward solution is to apply group convolutions to the $ 1 \times 1 $ layers as well, as MobileNet does for the $ 3 \times 3 $ layers. However, stacking multiple group convolutions has a side effect: outputs from a certain channel are only derived from a small fraction of input channels. This property blocks information flow between channel groups and weakens representation.

To allow a group convolution to obtain input data from different groups, the channels of the feature map generated by the previous group layer can first be divided into several subgroups within each group, and each group in the next layer is then fed with different subgroups. This can be implemented by reshaping the output channel dimension into $ (g, n) $, transposing, and flattening it back as the input of the next layer, which is called the channel shuffle operation and is illustrated in the following figure.

Channel Shuffle
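In code, channel shuffle is just a reshape, a transpose, and a reshape back. A minimal PyTorch sketch (the function name and NCHW layout are my own choices):

```python
import torch

def channel_shuffle(x, g):
    """Shuffle channels of an NCHW tensor across g groups:
    reshape channels to (g, n), transpose to (n, g), flatten back."""
    n_batch, c, h, w = x.shape
    x = x.view(n_batch, g, c // g, h, w)   # split channels into g groups
    x = x.transpose(1, 2).contiguous()     # interleave the groups
    return x.view(n_batch, c, h, w)

x = torch.arange(8.).view(1, 8, 1, 1)      # channels 0..7
print(channel_shuffle(x, 2).flatten())     # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```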

ShuffleNet Unit

The following figure shows the ShuffleNet Unit.

ShuffleNet Unit

In the figure, (a) is the building block in ResNeXt, and (b) is the building block in ShuffleNet. Given the input size $ c \times h \times w $, the bottleneck channels $ m $ and $ g $ groups, ResNeXt has $ hw(2cm+9m^2/g) $ FLOPs, while ShuffleNet needs only $ hw(2cm/g+9m) $ FLOPs.
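To see the magnitude of the saving, plug in some hypothetical sizes (mine, not the paper's): with $ c = 256 $, $ m = 64 $, $ g = 8 $ and a $ 28 \times 28 $ map, the ShuffleNet unit is exactly 8 times cheaper.

```python
h = w = 28
c, m, g = 256, 64, 8                                  # hypothetical sizes
resnext_flops = h * w * (2 * c * m + 9 * m * m // g)  # ~29.3M
shufflenet_flops = h * w * (2 * c * m // g + 9 * m)   # ~3.7M
print(resnext_flops / shufflenet_flops)               # 8.0
```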

Network Architecture

Network Architecture

Comparison

Comparison

TITLE: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

AUTHOR: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam

ASSOCIATION: Google

FROM: arXiv:1704.04861

CONTRIBUTIONS

  1. A class of efficient models called MobileNets for mobile and embedded vision applications is proposed, which are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks.
  2. Two simple global hyper-parameters that efficiently trade off between latency and accuracy are introduced.

MobileNet Architecture

The core layer of MobileNet is the depthwise separable convolution. The network structure is another factor that boosts performance. Finally, the width and resolution multipliers can be tuned to trade off latency against accuracy.

Depthwise Separable Convolution

Depthwise separable convolution is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a $ 1 \times 1 $ convolution called a pointwise convolution. In MobileNet, the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a $ 1 \times 1 $ convolution to combine the outputs of the depthwise convolution. The following figure illustrates the difference between standard convolution and depthwise separable convolution.

Difference between Standard Convolution and Depthwise Separable Convolution

The standard convolution has a computational cost of

$ D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F $

where $ D_K $ is the kernel size, $ M $ and $ N $ are the numbers of input and output channels, and $ D_F $ is the spatial size of the output feature map. Depthwise separable convolution costs

$ D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F $

which is a reduction of $ \frac{1}{N} + \frac{1}{D_K^2} $; with $ 3 \times 3 $ kernels this is about 8 to 9 times less computation.
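As a concrete sketch, here is one MobileNet block in PyTorch, following the depthwise conv + BN + ReLU then pointwise conv + BN + ReLU pattern from the paper (the helper name is mine):

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    """MobileNet-style block: 3x3 depthwise conv + 1x1 pointwise conv,
    each followed by BatchNorm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```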

MobileNet Structure

The following table shows the structure of MobileNet.

MobileNet Structure

Width and Resolution Multiplier

The width multiplier $ \alpha $ is used to thin the number of channels in each layer, while the resolution multiplier $ \rho $ reduces the input resolution of the network. With both applied, the cost of a depthwise separable layer becomes $ D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F $, so computation scales roughly with $ \alpha^2 $ and $ \rho^2 $.
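As a quick back-of-the-envelope check (layer sizes are my own choice), setting $ \alpha = \rho = 0.5 $ reduces the cost of a typical layer by roughly $ 16\times $:

```python
alpha, rho = 0.5, 0.5
M, N, DF, DK = 512, 512, 14, 3
full = DK * DK * M * DF * DF + M * N * DF * DF
reduced = DK * DK * (alpha * M) * (rho * DF) ** 2 \
        + (alpha * M) * (alpha * N) * (rho * DF) ** 2
print(full / reduced)   # ~15.7x fewer multiply-adds
```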

Comparison

Comparison

This weekend was so hot that everyone seemed to be bad-tempered and have difficulty breathing. However, I helped myself feel the beauty of life by cooking my own food and going out to watch a movie. Recently, I’ve always been hiding in an air-conditioned room, feeling bored with almost everything. Perhaps, when it is hot, it is time to sweat. Sweating let me release a lot of pressure and feel refreshed. The cold udon noodles and bolognese calmed me down and gave me energy.

Weekend

TITLE: Optimizing Deep CNN-Based Queries over Video Streams at Scale

AUTHOR: Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia

ASSOCIATION: Stanford InfoLab

FROM: arXiv:1703.02529

CONTRIBUTIONS

  1. NOSCOPE, the first data management system that accelerates CNN-based classification queries over video streams at scale.
  2. CNN-specific techniques for difference detection across frames and model specialization for a given stream and query, as well as a cost-based optimizer that can automatically identify the best combination of these filters for a given accuracy target.
  3. An evaluation of NOSCOPE on fixed-angle binary classification showing up to 3,200x speedups on real-world data.

METHOD

The workflow of NoScope can be viewed in the following figure. Briefly, NoScope’s optimizer selects a different configuration of difference detectors and specialized models for each video stream to perform binary classification as quickly as possible, calling the full target CNN only when necessary.

Overall Framework of NoScope

There are mainly three components in this system: Difference Detectors, Specialized Models and a Cost-based Optimizer.

  1. Difference Detectors attempt to detect differences between frames. They are used to determine whether the considered frame is significantly different from another image with known labels. Two forms of difference detectors are supported: difference detection against a fixed reference image for the video stream that is known to contain no objects, and difference detection against an earlier frame, some configured time into the past.
  2. Specialized Models are small CNNs specialized for each video and query. They are designed using different combinations of numbers of channels and layers, and can be thought of as expert classifiers or detectors for different videos. For static cameras, a specialized model does not need to handle samples that would only appear on other cameras.
  3. The Cost-based Optimizer combines difference detectors and model specialization into a cascade that maximizes throughput subject to given conditions, e.g. false positive and false negative rates. A sketch of the resulting cascade is given below.
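A rough Python sketch of how the cascade fits together at inference time; the thresholds and callables are hypothetical stand-ins, since the real system tunes them per query with the optimizer:

```python
DIFF_THRESHOLD = 0.05   # hypothetical; chosen by the cost-based optimizer
C_LOW, C_HIGH = 0.2, 0.8  # confidence thresholds, also tuned per query

def classify_frame(frame, ref_frame, ref_label,
                   diff_detector, specialized_cnn, target_cnn):
    """Sketch of NoScope's inference cascade for a binary query."""
    # 1. Difference detection: reuse the known label if nothing changed.
    if diff_detector(frame, ref_frame) < DIFF_THRESHOLD:
        return ref_label
    # 2. Specialized model: trust the cheap per-video CNN when confident.
    p = specialized_cnn(frame)
    if p <= C_LOW:
        return False
    if p >= C_HIGH:
        return True
    # 3. Otherwise fall back to the full, expensive target CNN.
    return target_cnn(frame)
```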

DISADVANTAGES

  1. This scheme is suitable for fixed views, but if the input changes frequently, this scheme may work less efficiently or effectively.

TITLE: Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification

AUTHOR: Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang

ASSOCIATION: University of Science and Technology of China, University of Sydney, The Chinese University of Hong Kong

FROM: arXiv:1702.05891

CONTRIBUTIONS

  1. An end-to-end deep neural network for multi-label image classification is proposed, which exploits both semantic and spatial relations of labels by training learnable convolutions on the attention maps of labels. Such relations are learned with only image-level supervision. Investigation and visualization of the learned models demonstrate that the model can effectively capture semantic and spatial relations of labels.
  2. The proposed algorithm has great generalization capability and works well on data with different types of labels.

METHOD

The proposed Spatial Regularization Net (SRN) takes visual features from the main net as inputs and learns to regularize spatial relations between labels. Such relations are exploited based on the learned attention maps for the multiple labels. Label confidences from both main net and SRN are aggregated to generate final classification confidences. The whole network is a unified framework and is trained in an end-to-end manner.

The scheme of SRN is illustrated in the following figure.

Overall Framework of SRN

To train the network,

  1. Finetune only the main net on the target dataset. Both $ f_{cnn} $ and $ f_{cls} $ are learned with cross-entropy loss for classification.
  2. Fix $ f_{cnn} $ and $ f_{cls} $. Train $ f_{att} $ and $ conv1 $ with cross-entropy loss for classification.
  3. Train $ f_{sr} $ with cross-entropy loss for classification by fixing all other sub-networks.
  4. The whole network is jointly finetuned with the joint loss, as sketched below.
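In PyTorch terms, this staged schedule amounts to toggling which sub-networks receive gradients; the modules below are tiny placeholders for $ f_{cnn} $, $ f_{cls} $, $ f_{att} $, $ conv1 $ and $ f_{sr} $, not the real architectures:

```python
import torch.nn as nn

# Placeholder modules standing in for the SRN sub-networks.
f_cnn = nn.Conv2d(3, 2048, 3)     # main-net feature extractor
f_cls = nn.Linear(2048, 20)       # main-net classifier
f_att = nn.Conv2d(2048, 20, 1)    # attention sub-network
conv1 = nn.Conv2d(2048, 20, 1)    # confidence maps
f_sr = nn.Linear(20, 20)          # spatial regularization net

def set_trainable(modules, flag):
    """Freeze or unfreeze a list of sub-networks."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad = flag

set_trainable([f_cnn, f_cls], False)   # stage 2: fix the main net,
set_trainable([f_att, conv1], True)    #          train f_att and conv1
set_trainable([f_att, conv1], False)   # stage 3: additionally fix them,
set_trainable([f_sr], True)            #          train only f_sr
set_trainable([f_cnn, f_cls, f_att, conv1, f_sr], True)  # stage 4: joint finetune
```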

The main network follows the structure of ResNet-101 and is finetuned on the target dataset. The outputs of the attention map and the confidence map have $ C $ channels, which is the same as the number of categories. The two outputs are merged by element-wise multiplication and average-pooled to a feature vector in step 2. In step 3, instead of average pooling, $ f_{sr} $ follows. $ f_{sr} $ is implemented as three convolution layers with ReLU nonlinearity followed by one fully-connected layer, as shown in the following figure.

Structure of $ f_{sr} $

$ conv4 $ is composed of single-channel filters. In Caffe, it can be implemented using the “group” parameter. This design reflects that one label may only semantically relate to a small number of other labels, so measuring spatial relations with unrelated attention maps is unnecessary.
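A PyTorch analogue of such single-channel filters via grouped convolution (PyTorch rather than Caffe; the channel count and kernel size are hypothetical):

```python
import torch
import torch.nn as nn

C = 80  # hypothetical number of labels / channel groups
# groups=C makes every output map depend on exactly one input map,
# i.e. each filter is single-channel, as described for conv4.
conv4 = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)

x = torch.randn(1, C, 14, 14)  # per-label spatial maps
y = conv4(x)                   # shape preserved: (1, C, 14, 14)
```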

There’s been less input to me recently, so there’s been less output from me. This month is really busy. It’s time to keep up!

I think I am a single-thread processor. It’s really hard for me to handle multiple tasks simultaneously. I’d become worried about another task while working on one, which means I might mess up the current one. I don’t know whether anyone can do it better. Besides the single-thread thing, sometimes I become too anxious to be in the mood to do anything, like keeping a diary, reading papers or picking up a habit. Maybe I’m too narrow-minded??