
TITLE: Mask R-CNN

AUTHOR: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

ASSOCIATION: Facebook AI Research

FROM: arXiv:1703.06870

CONTRIBUTIONS

  1. A conceptually simple, flexible, and general framework for object instance segmentation is presented.
  2. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

METHOD

Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this a third branch is added that outputs the object mask. The idea is illustrated in the following image.

To avoid competition across classes, the mask branch has a $ Km^{2} $-dimensional output for each ROI, which encodes $ K $ binary masks of resolution $ m \times m $, one for each of the $ K $ classes. During training, for an ROI associated with ground-truth class $ k $, the loss is computed only on the $k$-th mask.
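As a concrete illustration, here is a minimal PyTorch sketch of this per-class mask loss; the tensor names and shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """Per-ROI mask loss: only the ground-truth class's mask contributes.

    mask_logits: (N, K, m, m) raw logits, one m x m mask per class.
    gt_masks:    (N, m, m) binary ground-truth masks resized to m x m.
    gt_classes:  (N,) ground-truth class index k for each ROI.
    """
    n = mask_logits.shape[0]
    # Select the k-th mask of each ROI; the other K-1 masks receive no
    # gradient, so masks do not compete across classes.
    selected = mask_logits[torch.arange(n), gt_classes]  # (N, m, m)
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())
```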

An $ m \times m $ mask is predicted from each ROI by a small FCN. The input to this FCN is an RoIAlign feature, which uses bilinear interpolation to compute the exact values of the input feature at four regularly sampled locations in each ROI bin.
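The sketch below illustrates this sampling scheme, assuming a single $ (C, H, W) $ feature map and coordinates that lie inside it; the function names and the four fixed sampling fractions are illustrative.

```python
import math
import torch

def bilinear_sample(feature, y, x):
    """Sample a (C, H, W) feature map at a continuous location (y, x).

    Unlike RoIPool, which snaps (y, x) to the nearest cell, RoIAlign
    interpolates from the four surrounding integer locations.
    """
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, feature.shape[1] - 1)
    x1 = min(x0 + 1, feature.shape[2] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feature[:, y0, x0]
            + (1 - wy) * wx * feature[:, y0, x1]
            + wy * (1 - wx) * feature[:, y1, x0]
            + wy * wx * feature[:, y1, x1])

def roi_align_bin(feature, bin_y0, bin_x0, bin_h, bin_w):
    """Average four regularly spaced bilinear samples inside one ROI bin."""
    points = [(bin_y0 + fy * bin_h, bin_x0 + fx * bin_w)
              for fy in (0.25, 0.75) for fx in (0.25, 0.75)]
    return torch.stack([bilinear_sample(feature, y, x)
                        for y, x in points]).mean(dim=0)
```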

TITLE: FastMask: Segment Multi-scale Object Candidates in One Shot

AUTHOR: Hexiang Hu, Shiyi Lan, Yuning Jiang, Zhimin Cao, Fei Sha

ASSOCIATION: UCLA, Fudan University, Megvii Inc.

FROM: arXiv:1612.08843

CONTRIBUTIONS

  1. A novel weight-shared residual neck module is proposed to zoom out CNN feature maps while preserving calibrated feature semantics, enabling efficient multi-scale training and inference.
  2. A novel scale-tolerant head module is proposed that takes advantage of an attention model and significantly reduces the impact of background noise caused by unmatched receptive fields.
  3. A framework capable of one-shot segment proposal, named FastMask, is built from these modules. It achieves state-of-the-art accuracy while running in near real time on the MS COCO benchmark.

METHOD

Network Architecture

The network architecture is illustrated in the following figure.

With the base feature map, a shared neck module is applied recursively to build feature maps at different scales. These feature maps are fed to a $1 \times 1$ convolution to reduce their feature dimensionality. Dense sliding windows are then extracted from those feature maps, and batch normalization is applied across all windows to calibrate and redistribute the window features. On a feature map downscaled by a factor of $m$, a sliding window of size $(k, k)$ corresponds to a patch of $(m \times k, m \times k)$ in the original image. Finally, a unified head module decodes these window features and produces the output confidence score as well as the object mask.
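A hedged PyTorch-style sketch of this pipeline follows; `neck`, `reduce_1x1`, `window_bn`, and `head` are hypothetical stand-ins for the modules described above.

```python
import torch.nn.functional as F

def fastmask_forward(base_feat, neck, reduce_1x1, window_bn, head,
                     num_scales=4, k=10):
    """One-shot multi-scale forward pass (module names are assumptions).

    base_feat: (1, C, H, W) feature map from the backbone CNN.
    neck:      weight-shared neck, applied recursively to halve the scale.
    """
    feats = [base_feat]
    for _ in range(num_scales - 1):
        feats.append(neck(feats[-1]))        # same neck weights at every scale
    outputs = []
    for i, f in enumerate(feats):
        m = 2 ** i                           # downscale factor of this level
        f = reduce_1x1(f)                    # 1x1 conv: reduce channels
        # Dense k x k sliding windows; each covers an (m*k, m*k) image patch.
        windows = F.unfold(f, kernel_size=k)                  # (1, C*k*k, L)
        windows = windows.transpose(1, 2).reshape(-1, f.shape[1], k, k)
        windows = window_bn(windows)         # BN across all windows
        outputs.append(head(windows))        # confidence score + object mask
    return outputs
```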

Residual Neck

The neck module downscales the feature maps so that features at different scales can be extracted.

Two alternative necks exist. One is the max pooling neck, which produces uncalibrated features during encoding, pushing the mean of the downscaled feature higher than the original. The other is the average pooling neck, which smooths out discriminative features during encoding, making the top feature maps appear blurry.

The residual neck is therefore proposed: a learned, parametric neck that preserves feature semantics. The following figure illustrates the method.
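One plausible realization of such a neck, assuming PyTorch, is sketched below; only the pooling-plus-learned-residual structure follows the description here, while the exact convolution sizes are assumptions.

```python
import torch.nn as nn

class ResidualNeck(nn.Module):
    """Downscale a feature map by 2x: average pooling plus a learned residual.

    The pooling path keeps the feature mean calibrated (unlike max pooling),
    while the learned residual restores the discriminative detail that plain
    average pooling would smooth away.
    """
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Works for even spatial sizes; a sketch, not the exact configuration.
        return self.pool(x) + self.residual(x)
```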

Attentional Head

Given the feature map of a sliding window as input, a spatial attention map is generated through a fully connected layer, which takes the entire window feature and produces an attention score for each spatial location on the feature map. The spatial attention is then applied to the window feature map via element-wise multiplication across channels. This operation lets the head module enhance features in the salient region, which is assumed to be the rough location of the target object. Finally, the enhanced feature map is fed into a fully connected layer to decode the segmentation mask of the object. This module is illustrated in the following figure.
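A minimal sketch of this head, assuming PyTorch; the layer sizes and the softmax normalization of the attention scores are assumptions.

```python
import torch
import torch.nn as nn

class AttentionalHead(nn.Module):
    """Decode a (C, k, k) window feature into a score and a k x k mask."""
    def __init__(self, channels, k):
        super().__init__()
        flat = channels * k * k
        self.attention = nn.Linear(flat, k * k)    # one score per location
        self.mask_decoder = nn.Linear(flat, k * k)
        self.objectness = nn.Linear(flat, 1)

    def forward(self, x):                          # x: (N, C, k, k)
        n, _, k, _ = x.shape
        att = torch.softmax(self.attention(x.flatten(1)), dim=1)
        x = x * att.view(n, 1, k, k)               # enhance the salient region
        mask = self.mask_decoder(x.flatten(1)).view(n, k, k)
        score = self.objectness(x.flatten(1))
        return score, mask
```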

The feature pyramid in this work is sparse because of the downscale operation. A sparse pyramid raises the probability that no suitable feature map exists for decoding an object, and also raises the risk of introducing background noise when the object is decoded from an unsuitable feature map with too large a receptive field. The salient-region attention is therefore introduced in this head. By attending to the salient region, the decoding head can suppress noise from the background of a sliding window and thus produce high-quality segmentation even when the receptive field does not match the scale of the object. The salient-region attention is also tolerant to shift disturbance.

SOME IDEAS

  1. This work shares a similar idea with most one-shot algorithms: extract sliding windows from the feature map and encode them with a subsequent network.
  2. How to extract sliding windows?

This photo reflects my life.

At first I just wanted to take a picture of the flowers. When I took a better look at this photo, it turned out to be very interesting: it happened to record an epitome of my life.

There are two books: one is about algorithms, while the other introduces how to sketch. I need to develop my own core ability to survive in this fiercely competitive world. On the other hand, a hobby is needed to enjoy life; it helps me forget my troubles for a while and look into my own heart to become a better man. With those two, beautiful flowers bloom in my life.

How about this? Maybe I'd like to start my own bakery business if I were unemployed. LOL…

Life is an endless series of choices: to pick up some things, you must put down others. Open-minded people often ask themselves what they have gained, rather than what they have lost. Choosing one road naturally means missing the scenery along another; rather than gazing into the distance, cherish what is in front of you, for flowers bloom in spring everywhere.

—— Yang Jianhua, Encountering Germany (《遇见德国》)

I have been reading Encountering Germany these past two days, originally to hunt for curiosities and amusing culture clashes. But when I read the passage above, it struck a deep chord in me. Sometimes there are many choices, but we need to know ourselves well enough to pick the one that suits us best, and taking that first step also requires considerable courage and skill; as the saying goes, the first step is always the hardest.

The topic of this weekend is watching movies. For each movie, I wrote a one-sentence comment.

CRIMSON TIDE: Outcome justice rests on procedural justice; there is no conflict between them.

THE SALESMAN (فروشنده): Life is like a play, and good people always touch you.

PATRIOTS DAY: Evil encourages us to care and to love.

THE JUNGLE BOOK: Know who we are and act as who we are.

PASSENGERS: Humans are social creatures.

TITLE: Deep Image Matting

AUTHOR: Ning Xu, Brian Price, Scott Cohen, Thomas Huang

ASSOCIATION: Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Adobe Research

FROM: arXiv:1703.03872

CONTRIBUTIONS

  1. A novel deep-learning-based algorithm is proposed that can predict the alpha matte of an image based on both low-level features and high-level context.

METHOD

The proposed deep model has two parts.

  1. The first part is a CNN-based encoder-decoder network, similar to the typical FCN networks used for semantic segmentation. This part takes the RGB image and its corresponding trimap as input, and outputs the alpha matte of the image.
  2. The second part is a small convolutional network that is used to refine the output of the first part. The input of this part is the original image and the predicted alpha matte from the first part.

The method is illustrated in the following figure.
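In code, the two-stage data flow might look like the following sketch, where `encoder_decoder` and `refiner` are hypothetical stand-ins for the two parts described above.

```python
import torch

def deep_matting_forward(encoder_decoder, refiner, image, trimap):
    """Two-stage matting forward pass (module names are assumptions).

    image:  (N, 3, H, W) RGB input; trimap: (N, 1, H, W).
    Stage 1 predicts a raw alpha from RGB + trimap (a 4-channel input);
    stage 2 refines it from RGB + raw alpha (again 4 channels).
    """
    raw_alpha = encoder_decoder(torch.cat([image, trimap], dim=1))
    refined_alpha = refiner(torch.cat([image, raw_alpha], dim=1))
    return raw_alpha, refined_alpha
```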

Matting encoder-decoder stage

The first network leverages two losses: the alpha-prediction loss and the compositional loss.

The alpha-prediction loss is the absolute difference between the ground-truth alpha values and the predicted alpha values at each pixel, defined as

$$ \mathcal{L}_{\alpha}^{i} = \sqrt{ \left( \alpha_{p}^{i} - \alpha_{g}^{i} \right)^{2} + \epsilon^{2} } $$

where $ \alpha_{p}^{i} $ is the output of the prediction layer at pixel $ i $ and $ \alpha_{g}^{i} $ is the ground-truth alpha value at pixel $ i $. $ \epsilon $ is a small value, equal to $ 10^{-6} $, used to ensure differentiability.

The compositional loss is the absolute difference between the ground-truth RGB colors and the predicted RGB colors, where the predicted colors are composited from the ground-truth foreground, the ground-truth background, and the predicted alpha matte. The loss is defined as

$$ \mathcal{L}_{c}^{i} = \sqrt{ \left( c_{p}^{i} - c_{g}^{i} \right)^{2} + \epsilon^{2} } $$

where $ c $ denotes the RGB channel, $ p $ denotes the image composited with the predicted alpha, and $ g $ denotes the image composited with the ground-truth alpha.

Since only the alpha values inside the unknown region of the trimap need to be inferred, weights are set on the two losses according to pixel location, which helps the network pay more attention to the important areas. Specifically, $ w_{i} = 1 $ if pixel $ i $ is inside the unknown region of the trimap, and $ w_{i} = 0 $ otherwise.
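Putting the two losses and the trimap weighting together, a sketch assuming PyTorch tensors (names are illustrative):

```python
import torch

def matting_losses(alpha_p, alpha_g, fg, bg, image, unknown, eps=1e-6):
    """Weighted alpha-prediction and compositional losses.

    alpha_p, alpha_g: (N, 1, H, W) predicted / ground-truth alpha in [0, 1].
    fg, bg, image:    (N, 3, H, W) ground-truth foreground, background, input.
    unknown:          (N, 1, H, W) binary mask of the trimap's unknown region.
    """
    w = unknown.float()
    # eps keeps the square root differentiable at zero difference.
    l_alpha = torch.sqrt((alpha_p - alpha_g) ** 2 + eps ** 2)
    composited = alpha_p * fg + (1 - alpha_p) * bg
    l_comp = torch.sqrt((composited - image) ** 2 + eps ** 2)
    # Only pixels in the unknown region contribute (w_i = 1 there, else 0).
    l_alpha = (w * l_alpha).sum() / w.sum().clamp(min=1)
    l_comp = (w * l_comp).sum() / w.sum().clamp(min=1)
    return l_alpha, l_comp
```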

Matting refinement stage

The input to the second stage of the network is the concatenation of an image patch and its alpha prediction from the first stage, giving a 4-channel input. This part is trained after the first part has converged. After the refinement part has also converged, the whole network is finally fine-tuned together. Only the alpha-prediction loss is used in this stage.

SOME IDEAS

  1. The trimap is a very strong prior. The question is how to get it.

TITLE: A Pursuit of Temporal Accuracy in General Activity Detection

AUTHOR: Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, Xiaoou Tang

ASSOCIATION: The Chinese University of Hong Kong, ETH Zurich

FROM: arXiv:1703.02716

CONTRIBUTIONS

  1. A novel proposal scheme is proposed that can efficiently generate candidates with accurate temporal boundaries.
  2. A cascaded classification pipeline is introduced that explicitly distinguishes between relevance and completeness of a candidate instance.

METHOD

The proposed action detection framework starts by evaluating the actionness of the snippets of the video. A set of temporal action proposals (shown in orange) is generated with temporal actionness grouping (TAG). The proposals are evaluated against cascaded classifiers that verify their relevance and completeness. Only proposals that are complete instances are output by the framework; non-complete proposals and background proposals are rejected by the cascaded classification pipeline. The framework is illustrated in the following figure.

Temporal Region proposals

The temporal region proposals are generated with a bottom-up procedure, which consists of three steps: extract snippets, evaluate snippet-wise actionness, and finally group them into region proposals.

  1. To evaluate the actionness, a binary classifier is learnt based on the Temporal Segment Network proposed in "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition".
  2. To generate temporal region proposals, the basic idea is to group consecutive snippets with high actionness scores. The scheme first obtains a number of action fragments by thresholding: a fragment is a consecutive sub-sequence of snippets whose actionness scores are all above a certain threshold, referred to as the actionness threshold.
  3. Then, to generate a region proposal, a fragment is picked as a starting point and expanded recursively by absorbing succeeding fragments. The expansion terminates when the fraction of low-actionness snippets exceeds a threshold, a positive value referred to as the tolerance threshold. Beginning with different fragments, a collection of different region proposals is obtained, as in the sketch below.
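A plain-Python sketch of this grouping, for one combination of the two thresholds (function and variable names are assumptions):

```python
def temporal_actionness_grouping(scores, tau_action, tau_tolerance):
    """Group snippet actionness scores into temporal proposals.

    scores:        per-snippet actionness scores for one video.
    tau_action:    actionness threshold that defines fragments.
    tau_tolerance: max allowed fraction of low-actionness snippets.
    Returns (start, end) snippet index pairs, end exclusive.
    """
    # Step 1: fragments are maximal runs of snippets scoring above tau_action.
    fragments, i = [], 0
    while i < len(scores):
        if scores[i] > tau_action:
            j = i
            while j < len(scores) and scores[j] > tau_action:
                j += 1
            fragments.append((i, j))
            i = j
        else:
            i += 1
    # Step 2: expand each fragment by absorbing succeeding fragments while
    # the fraction of low-actionness snippets stays within tolerance.
    proposals = set()
    for start, first_end in fragments:
        end, high = first_end, first_end - start
        proposals.add((start, end))
        for s2, e2 in fragments:
            if s2 < end:
                continue  # skip fragments at or before the current span
            if 1 - (high + e2 - s2) / (e2 - start) > tau_tolerance:
                break     # too many low-actionness snippets: stop expanding
            end, high = e2, high + e2 - s2
            proposals.add((start, end))
    return sorted(proposals)
```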

Note that this scheme is controlled by two design parameters: the actionness threshold and the tolerance threshold. The final proposal set is the union of the proposals derived from each combination of the two values. This scheme, called Temporal Actionness Grouping and illustrated in the figure above, has several advantages:

  1. Thanks to the actionness classifier, the generated proposals mostly focus on action-related content, which greatly reduces the number of proposals needed.
  2. Action fragments are sensitive to temporal transitions. Hence, as a bottom-up method that relies on merging action fragments, it often yields proposals with more accurate temporal boundaries.
  3. With the multi-threshold design, it can cover a broad range of actions without case-specific parameter tuning. With these properties, the proposed method achieves high recall with just a moderate number of proposals, which also benefits the training of the classifiers in the next stage.

Detecting Action Instances

This is accomplished by a cascaded pipeline with two steps: activity classification and completeness filtering.

Activity Classification

A classifier is trained based on TSN. During training, region proposals that overlap a ground-truth instance with an IoU above 0.7 are used as positive samples. A proposal is considered a negative sample only when less than 5% of its time span overlaps with any annotated instance. Only the proposals classified as non-background classes are retained for completeness filtering. The probability from the activity classifier is denoted $ P_{a} $.

Completeness Filtering

To evaluate the completeness, a simple feature representation is extracted and used to train class-specific SVMs. The feature comprises: (1) a two-level temporal pyramid, where the first level pools the snippet scores within the proposed region, and the second level splits the region into two parts and pools the snippet scores inside each part; and (2) the average classification scores of two short periods, the ones immediately before and after the proposed region. The method is illustrated in the following figure.
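A NumPy sketch of assembling this feature; the length of the short before/after periods is an assumption, since the summary does not specify it.

```python
import numpy as np

def completeness_feature(snippet_scores, start, end):
    """Build the completeness feature for one proposal (a sketch).

    snippet_scores: (T, C) per-snippet classification scores for the video.
    [start, end):   snippet indices of the proposed region.
    """
    margin = max(1, (end - start) // 5)  # assumed short-period length
    mid = start + (end - start) // 2
    num_classes = snippet_scores.shape[1]

    def pooled(a, b):
        seg = snippet_scores[max(0, a):min(len(snippet_scores), b)]
        return seg.mean(axis=0) if len(seg) else np.zeros(num_classes)

    return np.concatenate([
        pooled(start, end),             # pyramid level 1: whole region
        pooled(start, mid),             # pyramid level 2: first half
        pooled(mid, end),               # pyramid level 2: second half
        pooled(start - margin, start),  # short period before the region
        pooled(end, end + margin),      # short period after the region
    ])
```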

The output of the SVMs for one class is denoted as $S_{c}$.

The final detection confidence for each proposal is then obtained by combining $ P_{a} $ and $ S_{c} $.