TITLE: Learning to Detect Human-Object Interactions

AUTHOR: Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, Jia Deng

ASSOCIATION: University of Michigan Ann Arbor, Washington University in St. Louis

FROM: arXiv:1702.05448

CONTRIBUTIONS

  1. HICO-DET is introduced, a dataset that provides more than 150K annotated instances of human-object pairs covering the 600 HOI categories in HICO
  2. A novel DNN-based framework for HOI detection is proposed. Human-Object Region-based Convolutional Neural Networks (HO-RCNN) outputs a pair of bounding boxes for each detected HOI instance. At the core of HO-RCNN is the Interaction Pattern, a novel DNN input that characterizes the spatial relations between two bounding boxes.

METHOD

HO-RCNN detects HOIs in two steps.

  1. Human-object region-pair proposals are generated using human and object detectors.
  2. Each human-object proposal is passed into a ConvNet to generate HOI classification scores.

The network adopts a multi-stream architecture to extract features from the detected humans, objects, and human-object spatial relations, as the following figure illustrates.

Human-Object Proposals

Assuming a list of HOI categories of interest (e.g. “riding a horse”, “eating an apple”) is given beforehand, bounding boxes for humans and the object categories of interest (e.g. “horse”, “apple”) are generated by detectors. The human-object proposals are then generated by pairing the detected humans with the detected objects of interest.
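The pairing step can be sketched as follows (a minimal sketch; the `(box, score)` detection format and the function name are illustrative assumptions, not from the paper):

```python
# Sketch of human-object proposal generation. Detections are assumed to be
# lists of (box, score) tuples with box = (x1, y1, x2, y2).
from itertools import product

def pair_proposals(human_dets, object_dets):
    """Pair every detected human with every detected object of interest."""
    return [(h_box, o_box)
            for (h_box, _), (o_box, _) in product(human_dets, object_dets)]
```

With H human detections and O object detections this yields H x O proposals, each scored by the downstream network.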

Multi-stream Architecture

The multi-stream architecture is composed of three streams:

  1. The human stream extracts local features from the detected humans.
  2. The object stream extracts local features from the detected objects.
  3. The pairwise stream extracts features which encode pairwise spatial relations between the detected human and object.

The last layer of each stream is a binary classifier that outputs a confidence score for the HOI. The final confidence score is obtained by summing the scores over all streams.

Human and Object Stream

An image patch is cropped according to the bounding box (human/object) and resized to a fixed size. The patch is then fed into a CNN, which outputs a confidence score for the HOI.

Pairwise Stream

Given a pair of bounding boxes, its Interaction Pattern is a binary image with two channels: The first channel has value 1 at pixels enclosed by the first bounding box, and value 0 elsewhere; the second channel has value 1 at pixels enclosed by the second bounding box, and value 0 elsewhere. In this work, the first bounding box is for humans, and the second bounding box is for objects.

The Interaction Pattern should be invariant to any joint translation of the bounding box pair. The pixels outside the “attention window”, i.e. the tightest window enclosing the two bounding boxes, are therefore removed from the Interaction Pattern. The aspect ratio of Interaction Patterns should also be fixed; two methods are used: one warps the patch, and the other extends the shorter side of the patch to meet the required ratio.
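Putting these steps together, constructing an Interaction Pattern might look like the following (a minimal numpy sketch using the “warp” variant; the output size and function name are assumptions for illustration):

```python
import numpy as np

def interaction_pattern(human_box, object_box, size=64):
    """Build the two-channel binary Interaction Pattern for a box pair.

    Boxes are (x1, y1, x2, y2) in image coordinates; `size` is illustrative.
    Channel 0 marks the human box, channel 1 the object box.
    """
    # Attention window: tightest window enclosing both boxes.
    x1 = min(human_box[0], object_box[0])
    y1 = min(human_box[1], object_box[1])
    x2 = max(human_box[2], object_box[2])
    y2 = max(human_box[3], object_box[3])
    w, h = x2 - x1, y2 - y1

    pattern = np.zeros((2, size, size), dtype=np.uint8)
    for ch, (bx1, by1, bx2, by2) in enumerate([human_box, object_box]):
        # Translate into the attention window, then warp to the fixed size.
        c1 = int((bx1 - x1) / w * size)
        r1 = int((by1 - y1) / h * size)
        c2 = int((bx2 - x1) / w * size)
        r2 = int((by2 - y1) / h * size)
        pattern[ch, r1:r2, c1:c2] = 1
    return pattern
```

Cropping to the attention window makes the pattern translation-invariant, and warping to a fixed size makes it a valid fixed-shape CNN input.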

To extend to multiple HOI classes, one binary classifier is trained for each HOI class at the last layer of each stream. The final score for each HOI class is obtained by summing that class's scores over all streams.
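The per-class score fusion across streams can be sketched as (illustrative shapes and names):

```python
import numpy as np

def fuse_scores(human_scores, object_scores, pairwise_scores):
    """Sum per-class scores over the three streams for one proposal.

    Each argument is a length-K vector of per-HOI-class scores emitted by
    one stream; the result is the final length-K score vector.
    """
    return (np.asarray(human_scores)
            + np.asarray(object_scores)
            + np.asarray(pairwise_scores))
```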

SOME IDEAS

  1. The method is now very similar to R-CNN; perhaps Fast R-CNN could also be used, which would save much of the time spent extracting features.
  2. How can the requirement of knowing the HOI categories beforehand be removed?

TITLE: Adversarial Discriminative Domain Adaptation

AUTHOR: Eric Tzeng, Judy Hoffman, Kate Saenko, Trevor Darrell

ASSOCIATION: UC Berkeley, Stanford University, Boston University

FROM: arXiv:1702.05464

CONTRIBUTIONS

  1. A novel unified framework for adversarial domain adaptation is proposed, which is called Adversarial Discriminative Domain Adaptation (ADDA).
  2. Design choices such as weight-sharing, base models, and adversarial losses are unified in a framework that subsumes previous work.

METHOD

The main idea of this work is to find a mapping function that projects target data (the data used for testing) into the source data domain (the data used for training). The training procedure is illustrated in the following figure.

As the figure shows, first pre-train a source encoder CNN using labeled source image examples. Next, perform adversarial adaptation by learning a target encoder CNN such that a discriminator that sees encoded source and target examples cannot reliably predict their domain label. During testing, target images are mapped with the target encoder to the shared feature space and classified by the source classifier. Dashed lines indicate fixed network parameters.

The ADDA method can be formalized as:

$$ \min \limits_{M_{s}, C} \mathcal{L}_{cls}(\mathbf{X}_{s}, \mathbf{Y}_{s}) =
-\mathbb{E}_{(\mathbf{x}_{s}, y_{s})\sim(\mathbf{X}_{s}, \mathbf{Y}_{s})}\left[\sum_{k=1}^{K} \mathbf{1}_{[k=y_{s}]} \log C(M_{s}(\mathbf{x}_{s}))\right]$$

$$ \min \limits_{D} \mathcal{L}_{adv_{D}}(\mathbf{X}_{s}, \mathbf{X}_{t}, M_{s}, M_{t}) =
-\mathbb{E}_{\mathbf{x}_{s}\sim \mathbf{X}_{s}} [\log D(M_{s}(\mathbf{x}_{s}))]
-\mathbb{E}_{\mathbf{x}_{t}\sim \mathbf{X}_{t}} [\log (1-D(M_{t}(\mathbf{x}_{t})))]$$

$$ \min \limits_{M_{s},M_{t}} \mathcal{L}_{adv_{M}}(\mathbf{X}_{s}, \mathbf{X}_{t}, D) =
-\mathbb{E}_{\mathbf{x}_{t}\sim \mathbf{X}_{t}} [\log D(M_{t}(\mathbf{x}_{t}))] $$

The first formula is typical supervised learning. The second formula follows the GAN objective: it learns a discriminator to tell target data from source data. The third formula is used to learn $M_t$, which maps data from the target domain into the source feature space; its source term is constant and can be ignored because $M_s$ is pre-trained and fixed.
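The three objectives above can be sketched numerically (a minimal numpy sketch operating on pre-computed probabilities; the function names and plain-array interface are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def cls_loss(probs, labels):
    """Supervised cross-entropy on source data.

    probs: (N, K) classifier outputs C(M_s(x_s)); labels: (N,) class indices.
    """
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def disc_loss(d_source, d_target):
    """Discriminator loss L_advD.

    d_source / d_target: D's probability that encoded source / target
    features come from the source domain.
    """
    return -np.mean(np.log(d_source)) - np.mean(np.log(1.0 - d_target))

def map_loss(d_target):
    """Target-encoder loss L_advM: fool D into labeling target features
    as source (the inverted-label GAN loss)."""
    return -np.mean(np.log(d_target))
```

Training alternates: `disc_loss` updates D with both encoders fixed, then `map_loss` updates $M_t$ with D fixed.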

I’ve been a great fan of ancient history and warfare since my childhood. Perhaps my first enlightenment came from the computer game series Age of Empires. I didn’t know Joan of Arc until I played her campaign in Age of Empires II when I was 11 or 12 years old. The story in the game was so attractive that I searched the Internet for who Joan of Arc was and what she did. I was touched by her patriotic acts and sacrifices. I even became interested in French history and wanted to study French at university, though I finally chose EE as my major and became an engineer in AI.

I played Age of Empires II HD for a while this weekend because I found it was on sale on Steam. It brought me back to my childhood. The memories of playing this game with my friends came flooding back. We had fun playing it, read the stories of the heroes, and quarrelled about who was the greatest one in history. This is a classic computer game.

I seem to be even more tired after Spring Festival.

A pile of unsolved problems has built up at work: either the code has bugs, or the samples have problems, or the model has issues. Anyway, every day is a rush, patching one thing here and another there. I haven't even had time to read papers properly, and once again I feel like I'm falling behind the trend.

Besides work, the other thing wearing me out is house hunting. The lottery for the subsidized self-occupied housing is coming soon; I really hope I win it. If I do, I won't hesitate any more: I'll just buy one and settle down in Shunyi. The other task is looking at rentals. Luckily I was quick and snapped up a decent one-bedroom apartment, though cleaning up and moving will be a hassle. Starting next week I'll move like an ant: carry some things from my current place to the office every day, take them to the new apartment after work, and drag the empty suitcase back to where I live now. It sounds exhausting, but there's no other way. I hope that after moving in I'll still have the energy to decorate the new place; having spent so much money on rent, I should at least make myself comfortable.

Good luck in the new year!

TITLE: DSSD: Deconvolutional Single Shot Detector

AUTHOR: Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg

FROM: arXiv:1701.06659

CONTRIBUTIONS

  1. A combination of a state-of-the-art classifier (Residual-101) with a fast detection framework (SSD) is proposed.
  2. Deconvolution layers are applied to introduce additional large-scale context in object detection and improve accuracy, especially for small objects.

METHOD

This is a follow-up work to SSD. Compared with the original SSD, DSSD (Deconvolutional Single Shot Detector) adds additional deconvolutional layers and a more sophisticated structure for category classification and bounding-box coordinate regression. As shown in the following figure, the part up to the blue feature maps is the same as the original SSD. Then the Deconvolution Module and Prediction Module are applied.

Recent works such as Beyond Skip Connections: Top-Down Modulation for Object Detection and Feature Pyramid Networks for Object Detection propose to incorporate fine details into the detection framework using deconvolutional layers and skip connections. DSSD utilizes this idea as well through its Deconvolution Module, shown in the following figure.
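As a rough illustration of the fusion idea in the Deconvolution Module (not the exact module: the learned deconvolution layer and the extra conv/BN layers are replaced here by nearest-neighbor upsampling, and the elementwise product is one of the fusion variants considered):

```python
import numpy as np

def fuse(coarse, skip):
    """Combine a coarse top-down feature map with a skip feature map.

    coarse: (C, H, W) feature map from the deconvolution path;
    skip:   (C, 2H, 2W) feature map from the bottom-up (SSD) path.
    """
    # 2x nearest-neighbor upsample stands in for a learned deconvolution.
    up = coarse.repeat(2, axis=1).repeat(2, axis=2)
    # Elementwise-product fusion of the two paths.
    return up * skip
```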

Several different structures for the Prediction Module are proposed. These structures borrow the idea of residual connections from ResNet, as illustrated in the following figure.

SOME IDEAS

  1. Using ResNet-101 and a more sophisticated prediction structure helps improve performance, but the computation cost is high.
  2. The idea of using deconvolutional layers to enlarge the feature maps and skip connections to bring in fine details is becoming popular.

After rewatching the classic movies Master and Commander: The Far Side of the World and A Beautiful Mind, I was surprised to find that Russell Crowe and Paul Bettany appeared in both of them. They had very interesting chemistry in both movies and acted well. I must say I like these two actors very much because of their marvelous acting.