Reading Note: SuperCNN: A Superpixelwise Convolutional Neural Network for Salient Object Detection

TITLE: SuperCNN: A Superpixelwise Convolutional Neural Network for Salient Object Detection

AUTHER: Shengfeng He, Rynson W.H. Lau, Wenxi Liu, Zhe Huang, Qingxiong Yang Reid, Ian

FROM: IJCV2015

CONTRIBUTIONS

A novel superpixel-wise convolutional neural network approach is proposed.
Two kinds of sequence code are designed as input feature to CNN.

METHOD

Superpixels are extracted via some methods such as oversegmentation.
Extract Color Uniqueness Sequences (CU) for each superpixels to describe the color contrast between regions.
Extract Color Distribution Sequences (CD) for each superpixels to measure the color compactness of colors.
The two sequences are fed into a CNN to generate two saliency maps.
A regressor is used to merge the two predicted saliency maps

SOME DETAILS

Color Uniqueness Sequence is used to describe the color contrast of a Region. Given an image $I$ and the superpixels or regions $R=\lbrace r{1},…,r{x},…,r{N} \rbrace$, each region $r{x}$ contains a color uniqueness sequence $Q{x}^{C} = \lbrace q{1}^{c},…,q{j}^{c},…,q{N}^{c} \rbrace$. Each element, $q_{x}^{c}$ is defined as

$q_{x}^{c} = t(r_{j})\cdot \vert C(r_{x})-C(r_{j}) \vert \cdot w(P(r_{x}),P(r_{j}))$

where $t(r{j})$ counts the total number of pixels in region $r{j}$. $\vert C(r{x})-C(r{j}) \vert$ is a 3D vector storing the absolute differences of each color channel. $P(r{x}$ is the mean position of region $r{j}$ and $w(P(r{x}),P(r{j}))$ is defined as

$w(P(r_{x}),P(r_{j}))=exp(- \frac{1}{2\sigma_{s}^{2}}\Vert P(r_{x})-P(r_{j}) \Vert^{2} )$

The sequence $Q{x}^{C}$ is sorted by the spatial distance to region $r{x}$.

Color Distribution Sequence is a sequence $Q{x}^{D} = \lbrace q{1}^{d},…,q{j}^{d},…,q{N}^{d} \rbrace$ with the element $q_{j}^{d}$ defined as:

$q_{j}^{d} = t(r_{j})\cdot \vert P(r_{x})-P(r_{j}) \vert \cdot w(C(r_{x}),C(r_{j}))$

where

$w(C(r_{x}),C(r_{j}))=exp(- \frac{1}{2\sigma_{s}^{2}}\Vert C(r_{x})-C(r_{j}) \Vert^{2} )$

the sequence is also sorted by the spatial distance.

Network Structure is briefly illustrated as below:

Saliency Inference is first to get the $N$ predicted saliency scores of the $N$ regions. Because of the two kinds of sequences, two sets of scores $S{1} and S{2}$ are predicted. The final saliency map can be obtained by:

$S(r_{x})=\prod_{u\in \lbrace 1,2\rbrace}v_c^{u} \cdot S_{u}(r_{x})$

ADVANTAGES

It is fast when infering.
Large context are encoded in the sequences.

DISADVANTAGES

The CNN is of a very light-weight structure. Deeper network may provide better performance.
As the sequences are used to describe contrast information, which may lead to failure with the foreground and background having similar colors.