0%

This weekend was so hot that everyone seems to be bad tempered and have difficulty to breathe. However, I helped myself to feel the beauty of life by cooking myself food and going out to watch a movie. Recently, I’ve been always hiding in a room with air-conditioning, feeling bored to almost anything. Perhaps, when it is hot, it is time to sweat. Sweating made me loose much pressure and feel re-freshed. The cold udon noodles and bolognese calmed me down and gave me energy.

Weekend

TITLE: Optimizing Deep CNN-Based Queries over Video Streams at Scale

AUTHOR: Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia

ASSOCIATION: Stanford InfoLab

FROM: arXiv:1703.02529

CONTRIBUTIONS

  1. NOSCOPE, the first data management system that accelerates CNN-based classification queries over video streams at scale.
  2. CNN-specific techniques for difference detection across frames and model specialization for a given stream and query, as well as a cost-based optimizer that can automatically identify the best combination of these filters for a given accuracy target.
  3. An evaluation of NOSCOPE on fixed-angle binary classification showing up to 3,200x speedups on real-world data.

METHOD

The work flow of NoScope can be viewed in the following figure. Brefiely, it can be explained that NoScope’s optimizer selects a different configuration of difference detectors and specialized models for each video stream to perform binary classification as quickly as possible without calling the full target CNN, which will be called only when necessary.

Overall Framework of NoScope

There are mainly three compoments in this system, Difference Detectors, Specialized Models and Cost-based Optimizer.

  1. Difference Detectors consider attempts to detect differences between images. They are used to determine whether the considered frame is significantly different from another image with known labels. There are two forms of difference detectors supported: difference detection against a fixed reference image for the video stream that is known to contain no objects and difference detection against an earlier frame, some configured time into the past.
  2. Specialized Models are small CNNs specified for each video and query. They are designed using different combinations of numbers of channels and layers. This can be thought as expert classifiers or detectors for different videos. For static cameras, one specifialized model does not need to consider samples that would only appear in other camers.
  3. Cost-based Optimizer brings difference detectors and model specialization together that maximizes the throughput subject to a certain condition, e.g. FP and FN rate.

DISADVANTAGES

  1. This scheme is suitable for fixed views, but if the input changes frequently, this scheme may work less efficiently or effectively.

TITLE: Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification

AUTHOR: Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang

ASSOCIATION: University of Science and Technology of China, University of Sydney, The Chinese University of Hong Kong

FROM: arXiv:1702.05891

CONTRIBUTIONS

  1. An end-to-end deep neural network for multi-label image classification is proposed, which exploits both semantic and spatial relations of labels by training learnable convolutions on the attention maps of labels. Such relations are learned with only image-level supervisions. Investigation and visualization of learned models demonstrate that our model can effectively capture semantic and spatial relations of labels.
  2. The proposed algorithm has great generalization capability and works well on data with different types of labels.

METHOD

The proposed Spatial Regularization Net (SRN) takes visual features from the main net as inputs and learns to regularize spatial relations between labels. Such relations are exploited based on the learned attention maps for the multiple labels. Label confidences from both main net and SRN are aggregated to generate final classification confidences. The whole network is a unified framework and is trained in an end-to-end manner.

The scheme of SRN is illustrated in the following figure.

Overall Framework of SRN

To train the network,

  1. Finetune only the main net on the target dataset. Both $ f_{cnn} $ and $ f_{cls} $ are learned with cross-entropy loss for classification.
  2. Fix $ f_{cnn} $ and $ f_{cls} $. Train $ f_{att} $ and $ conv1 $ with cross-entropy loss for classification.
  3. Train $ f_{sr} $ with cross-entropy loss for classification by fixing all other sub-networks.
  4. The whole network is jointly finetuned with joint loss.

The main network follows the structure of ResNet-101. And it is finetuned on the target dataset. The output of Attention Map and Confidence Map has $ C $ channels which is same with the number of categories. Their outputs are merged by element-wise multiplication and average-pooled to a feature vector in step 2. In step 3, instead of an average-pooling, $ f_{sr} $ follows. $ f_{sr} $ is implemented as three convolution layers with ReLU nonlinearity followed by one fully-connected layer as shown in the following figure.

Structure of fsr

$ conv4 $ is composed of single-channel filters. In Caffe, it can be implemnted using “group”. Such design is because one label may only semantically relate to a small number of other labels, and measuring spatial relations with those unrelated attention maps is unnecessary.

It’s been less input to me recently so that less output from me. This month is really busy. It’s time to keep up!

I think I am a single-thread processor. It’s really hard for me to handle multiple tasks simultaneously. I’d become worried about another one if I’m working on one task, which means that I might mess up with the current one. I don’t know whether anyone can do it better. Besides the single-thread thing, sometimes I become too anxious to having any mood to do anything, like keeping a diary, reading paper or picking up a habit. Maybe I’m too narrow-minded??

1. Content

Brief Revisit to the “Ancient” Algorithm

  • HOG (before *2007)
  • DPM (*2010~2014)

Epochal Evolution of R-CNN

  • R-CNN *2014
  • Fast-RCNN *2015
  • Faster-RCNN *2015

Efficient One-shot Methods

  • YOLO
  • SSD

Others

Goal of Object Detection

2. Brief Revisit to the “Ancient” Algorithm

2.1 Histograms of Gradients (HOG)

Histograms of Gradients

  • Calculate gradient for each pixel
  • For each Cell, a histogram of gradient is computed
  • For each Block, a HOG feature is extracted by concatenating histograms of each Cell

If Block size = 16*16, Block stride = 8, Cell size = 8*8, Bin size = 9, Slide-window size = 128*64, then HOG feature is a 3780-d feature. #Block=((64-16)/8+1)*((128-16)/8+1)=105, #Cell=(16/8)*(16/8)=4, 105*4*9=3780

2.2 Deformable Part Models (DPM)

DPM

$$ D_{i,l}(x,y) = \max \limits_{dx,dy} (R_{i,l}(x+dx, y+dy)-d_{i}\cdot \phi_{d}(dx,dy)) $$

This transformation spreads high filter scores to nearby locations, taking into account the deformation costs.

$$ score(x_{0},y_{0},l_{0}) = R_{0,l_{0}}(x_{0},y_{0})+ \sum_{i=1}^{n} D_{i, l_{0}-\lambda}(2(x_{0},y_{0})+v_{i})+b $$

The overall root scores at each level can be expressed by the sum of the root filter response at that level, plus shifted versions of transformed and sub-sampled part responses.

3. Epochal Evolution of R-CNN

3.1 RCNN

3.1.1 Regions with CNN Features

RCNN

  • Region proposals (Selective Search, ~2k)
  • CNN features (AlexNet, VGG-16, warped region in image)
  • Classifier (Linear SVM per class)
  • Bounding box (Class-specific regressor)
  • Run-time speed (VGG-16, 47 s/img on single K40 GPU)

3.1.2 Experiment Result (AlexNet)

RCNN_Result

  • Without FT, fc7 is worse than fc6, pool5 is quite competitive. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers.
  • With FT, The boost from fine-tuning is much larger for fc6 and fc7 than for pool5. Pool5 features are general. Learning domain-specific non-linear classifiers helps a lot.
  • Bounding box regression helps reduce localization errors.

3.1.3 Interesting Details – Training

  • Pre-trained on ILSVRC2012 classification task

  • Fine-tuned on proposals with N+1 classes without any modification to the network

    1. IOU>0.5 over ground-truth as positive samples, others as negative samples
    2. Each mini-batch contains 32 positive samples and 96 background samples
  • SVM for each category

    1. Ground-truth window as positive samples
    2. IOU<0.3 over ground-truth as negative samples
    3. Hard negative mining is adopted
  • Bounding-box regression

    1. Class-specific
    2. Features computed by CNN
    3. Only the proposals IOU>0.6 overlap ground-truth
    4. Coordinates in pixel

3.1.4 Interesting Details – FP Error Types

RCNN_Error

  • Loc: poor localization, 0.1 < IOU < 0.5
  • Sim: confusion with a similar category
  • Oth: confusion with a dissimilar object category
  • BG: a FP that fired on background

3.2 Fast-RCNN

3.2.1 What’s Wrong with RCNN

  • Training is a multi-stage pipeline (Proposal, Fine-tune, SVMs, Regressors)
  • Training is expensive in space and time (Extract feature from every proposal, Need to save to disk)
  • Oject detection is slow (47 s/img on K40)

3.2.2 R-CNN with ROI Pooling

Fast_RCNN

  • Region proposals (Selective Search, ~2k)
  • CNN features (AlexNet, VGG-16, ROI in feature map)
  • Classifier (sub-network softmax)
  • Bounding box (sub-network regressor)
  • Run-time speed (VGG-16, 0.32 s/img on single K40 GPU)

3.2.3 ROI Pooling

SPP

  • Inspired by Spatial Pyramid Pooling (SPPNet)

  • Convert arbitrary input size to fixed length

    1. The input is an ROI area in feature map
    2. The input is divided into grids
    3. In each grid, pooling is used to extract features

3.2.4 Experiment Result (VGG16)

Fast_RCNN_Result

3.2.5 Interesting Details – Training

  • Pre-trained on ILSVRC2012 classification task

  • Fine-tuned with N+1 classes and two sibling layers

    Fast_RCNN_Finetune

    1. Fine-tune the whole network

    2. Each mini-batch has 2 images and 64 ROIs from each images

    3. 25% of the ROIs have IOU>0.5 with ground-truth as positive samples

    4. The rest of the ROIs have IOU [0.1, 0.5) with ground-truth as background samples

      $$ L(p,u,l^u,v) = L_{cls}(p,u) + \lambda [u \geq 1] L_{loc}(t^u,v) $$

    5. Multi-task loss, one loss for classification and one for bounding box regression

    6. ROI pooling back-propagation is similar with max-pooling

  • Accelerate using truncated SVD

    Fast_RCNN_SVD

    Implemented by using two FCs without non-linear activation

  • Training time

    1. 146x faster than R-CNN
    2. If accelerated with truncated SVD, 213x faster than R-CNN

3.2.6 Interesting Details – Design evaluation

  • Does multi-task training help? Yes, it does!

Multi-task Training

  • Test with multiple scales? Yes but with cost.

Multi-scale vs. single scale

  • Do SVMs outperform softmax? Interesting…

softmax vs. SVM

3.3 Faster-RCNN

3.3.1 Room to improve Fast-RCNN

  • Region proposal has become the bottleneck
  • 2s for Selective Search, 0.320s for Fast-RCNN
  • Why not a unified end-to-end framework

3.3.2 Fast-RCNN with RPN (Region Proposal Network)

softmax vs. SVM

  • Region proposals (RPN, ~300)
  • Classifier (sub-network softmax)
  • Bounding box (RPN regressor, sub-network regressor)
  • Run-time speed (VGG-16, 0.198 s/img on single K40 GPU)

3.3.3 Region Proposal Network

Anchors

  • Anchors

    1. reference box, prior box, default box
    2. Works in a slide-window way
    3. Implemented by 3*3 kernel convolution
    4. Centered at the slide-window
  • Translation-Invariant

    1. If objects translated, proposal should be translated
    2. Translated along slide-window
  • Multi-Scale and Multi-Ratio

    1. A pyramid of anchors
    2. A set of different ratios
    3. Relies on single scale feature map
  • Objectness and Localization

    1. Two siblings
    2. Objectness score
    3. Bounding box regression

3.3.4 Experiment Result

Result

3.3.5 Interesting Details – Training

  • Sharing Features for RPN and Fast-RCNN

    1. Alternating training
    2. Approximate joint training
    3. Non-approximate joint training
  • Alternating Training

    1. Train RPN(@) using pre-trained model
    2. Train Fast-RCNN(#) using pre-trained model and @’s proposal
    3. Train RPN($) using #’s weight with shared layers fixed
    4. Train Fast-RCNN using $’s proposal with shared layers fixed
  • Training RPN

    1. Each mini-batch arises from a single image
    2. Positive samples: the anchors with (1) the highest IOU and (2) IOU>0.7 overlap with any ground-truth
    3. Negative samples: IOU<0.3 overlap with all ground-truth
    4. Randomly sample 256 anchors, pos:neg = 1:1
    5. Loss function with Ncls=256, Nreg=~2400, λ=10
    6. Anchors cross image boundaries do not contribute at training stage

3.3.6 Interesting Details – Design Evaluation

  • Ablation on RPN

    Ablation on RPN

    1. Sharing: Detector feature helps RPN
    2. RPN generate quite good proposals
    3. No cls, randomly selecting proposal, score matters
    4. No reg, worse localization error
  • Timing

    Timing

    1. Nearly cost free
    2. Less proposal
  • Anchors

    Setting of Anchors

    1. Scale is more effective

4 Efficient One-shot Methods

4.1 YOLO

4.1.1 You Only Look Once

YOLO

  • A simple forward on the full image (almost same with a classification task)
  • Frame object detection as a regression problem (bounding box coordinates, class probabilities)
  • Extremely fast (45 fps for base network, or 150 fps for fast version)
  • Reasoning globally on the full context (no slide-window or region proposals)
  • Generalizable representations of objects (stable from natural images to artwork)

4.1.2 Unified Detection

YOLO model

  • The input is divided into S x S grid
  • Each grid cell predicts B bounding boxes

YOLO architecture

  • 5 predictions for one bounding box: x, y, w, h, score

    1. (x, y) center of the box relative to the bounds of grid
    2. w, h are width and height relative to the whole image
    3. the score is a measure of objectness
  • Each grid cell predicts C conditional class probabilities

  • Class-specific confidence score is defined as

  • One predictor is “responsible” for an object having the highest IOU with the ground-truth

  • The output is an SxSx(Bx5+C) tensor

4.1.3 Experiment Result

YOLO Result

  • Most effective among real-time detectors
  • Most efficient among near real-time detectors

4.1.4 Limitations

  • Too few bounding boxes

    1. Nearby objects
    2. Small objects
  • Data driven

    1. Sensitive to new or rare ration

4.1.5 Interesting Details – Training

  • Pre-train the first 20 layers on ImageNet

  • Pre-train on 224*224 images

  • Fine-tune 24 layers on detection dataset

  • Fine-tune on 448*448 images

  • Tricks to balance loss

    1. Weight: localization vs. classification
    2. Weight: positive vs. negative of objectness
    3. Square root: large object vs. small object

    YOLO Loss Function

  • “Warm up” to start training

    1. For first epoch, raise 0.001 to 0.01
    2. 0.01 for 75 epochs
    3. 0.001 for 30 epochs
    4. 0.0001 for 30 epochs

4.2 SSD

4.2.1 Single Shot MultiBox Detector

SSD framework

  • Combine anchor and one-shot prediction
  • Extract multi-scale features
  • Refine multi-scale and multi-ratio anchors
  • Dilated convolution

SSD Dilated convolution

4.2.2 Multi-scale Prediction

  • Multi-scale and Multi-ratio anchors

    SSD Model

    1. Each feature map cell corresponds to k anchors
    2. Similar to Faster-RCNN, but in multi-scale feature map and directly output category info
  • Multi-scale feature maps for detection

    1. Additional layers are added to the base network
    2. Different filters are applied to different scale/ratio anchors
    3. (c+4)k filters for k anchors and c categories in one cell, (c+4)kmn outputs for an m*n feature map

4.2.3 Experiment Result

  • VOC2012

SSD VOC2012

  • COCO2015

SSD COCO2015

4.2.4 Interesting Details – Training

  • Matching anchors with ground-truth

    1. Match each ground-truth to a default box with the best IOU
    2. Match the anchor to any ground truth with IOU higher than a threshold
  • Training objective

    SSD Training objective

  • Scales and aspect ratios for anchors

    1. Regularly spaced scales
    2. {1,2,3,1/2,1/3} – 6 ratios
  • Hard negative mining

    1. Sort the anchors using the highest confidence loss
    2. Pick the top ones so that neg:pos = 3:1
  • Data augmentation

    1. Sample randomly from training images
    2. Entire input image
    3. Sample a patch so that min IOU with object is 0.1, 0.3, 0.5, 0.7 or 0.9
    4. Randomly sample a patch, [0.1, 1] of the image, aspect ration [1/2, 2], randomly flip
  • Comparison

5 Others

  • PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection (Reading Note)

    1. Variant of Faster-RCNN
    2. Design of architecture
  • Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks (Reading Note)

    1. Both local and global information are take into account
    2. Skip pooling uses the information of different scales
  • R-FCN: Object Detection via Region-based Fully Convolutional Networks (Reading Note)

    1. Position-sensitive RoI pooling
  • Feature Pyramid Networks for Object Detection (Reading Note)

    1. lateral connections is developed for building high-level semantic feature maps at all scales
  • Beyond Skip Connections: Top-Down Modulation for Object Detection (Reading Note)

    1. Similar with FPN
  • YOLO9000: Better, Faster, Stronger (Reading Note)

    1. Better, Faster, Stronger
  • DSSD: Deconvolutional Single Shot Detector (Reading Note)

Maybe when we are alone we could have an opportunity to look into our deepest corners in our hearts. And we would find that much nasty and evil thoughts there, hatred, jealousy, prejudices and so on. When and how could we find true peace and settlement? And how to be less anxious in this wold of fierce competition.

何者