
TITLE: Chained Predictions Using Convolutional Neural Networks

AUTHOR: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

ASSOCIATION: UC Berkeley, Google

FROM: arXiv:1605.02346


  1. A chain model is proposed for structured outputs such as human pose estimation. The output network is a multiscale deconvolution, which the authors call deception because of its relationship to deconvolution and inception models.
  2. Two formulations of the chain model are proposed: one without weight sharing between the different predictors (poses in images) and one with weight sharing (poses in videos).


This work gives two formulations of the chain model. The single-image version is used as the example here; the video version follows a similar procedure.

The inference stage is illustrated in the figure. The input image is first fed to a CNN denoted CNNx. At each stage, a CNN denoted CNNy localizes one joint of the person; its output is labeled “Prediction@0” in the figure. Both the input and the output of CNNy are then used to predict the next joint in the next stage. The procedure can be formalized as:

where $h_0 = \text{CNNx}(x)$, $e(\cdot)$ is a full neural net, $m_t$ is the operation of CNNy on $h_t$, and $P$ is the probability of the location of a joint.
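The formula itself is not reproduced in these notes. Assuming a recurrence of the form $h_t = \sigma(W^h_t h_{t-1} + \sum_{i<t} W^e_{i,t}\, e(y_i))$ with $P(y_t \mid x, y_0, \dots, y_{t-1}) = \mathrm{softmax}(m_t(h_t))$, the greedy inference loop can be sketched in NumPy. Dense matrices stand in for the convolutions, and the names `W_h`, `W_e`, `M`, `E` and the choice of ReLU and tanh are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def chained_inference(h0, W_h, W_e, M, E):
    """Greedily decode T joints, one per stage (illustrative sketch).

    h0  : (d,)          stand-in for CNNx(x)
    W_h : (T, d, d)     hidden-to-hidden weights (W_h[0] is unused)
    W_e : (T, T, d, k)  W_e[i, t] feeds e(y_i) into stage t
    M   : (T, L, d)     stand-in for CNNy at each stage (m_t)
    E   : (L, k)        embedding table, e(y) = tanh(E[y]) (assumed form)
    """
    T = M.shape[0]
    h, ys, probs = h0, [], []
    for t in range(T):
        if t > 0:
            # Condition on every joint predicted so far.
            ctx = sum(W_e[i, t] @ np.tanh(E[ys[i]]) for i in range(t))
            h = np.maximum(0.0, W_h[t] @ h + ctx)   # ReLU plays the role of sigma
        p = softmax(M[t] @ h)                       # distribution over L locations
        probs.append(p)
        ys.append(int(np.argmax(p)))                # greedy choice of joint t
    return ys, probs

# Toy run with random weights: 3 joints, 16 candidate locations.
rng = np.random.default_rng(0)
T, d, k, L = 3, 8, 4, 16
ys, probs = chained_inference(
    rng.normal(size=d),
    rng.normal(size=(T, d, d)),
    rng.normal(size=(T, T, d, k)),
    rng.normal(size=(T, L, d)),
    rng.normal(size=(L, k)),
)
```

Each stage sees the hidden state of the previous stage plus embeddings of all earlier predictions, which is what lets the model avoid a fixed assumption about the joint distribution of the outputs.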


  1. Using chain models allows us to sidestep any assumptions about the joint distribution of the output variables.
  2. Conditioning each prediction on the other, already predicted structures can lead to better performance.
  3. Hand-crafted features are replaced by CNN features, which are learnt end-to-end.


  1. $e(\cdot)$ is not explained in this work.

TITLE: R-FCN: Object Detection via Region-based Fully Convolutional Networks

AUTHOR: Jifeng Dai, Yi Li, Kaiming He, Jian Sun

ASSOCIATION: MSRA, Tsinghua University

FROM: arXiv:1605.06409


  1. A framework called Region-based Fully Convolutional Network (R-FCN) is developed for object detection. It consists of shared, fully convolutional architectures.
  2. A set of position-sensitive score maps is introduced to enable the FCN to represent translation variance.
  3. A unique ROI pooling method is proposed to shepherd information from these score maps.


  1. The image is processed by a fully convolutional network (FCN).
  2. At the end of the FCN, a Region Proposal Network (RPN) generates ROIs.
  3. In parallel, a score map of $k^{2}(C+1)$ channels is generated by a bank of specialized convolutional layers.
  4. For each ROI, selective ROI pooling produces a $C+1$ channel score map: each of the $k \times k$ bins pools from only one of the $k^{2}$ position-sensitive maps.
  5. The scores in the score map are averaged to vote for the category.
  6. Another convolutional layer with $4k^{2}$ channels is learned for bounding box regression.
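Steps 3 to 5 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the channel layout (bin-major, then class), the ROI given in feature-map pixels with exclusive upper bounds, and the function names `ps_roi_pool` and `vote` are all assumptions.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling (illustrative sketch).

    score_maps : (k*k*(C+1), H, W), assumed layout: bin (i, j) first, then class
    roi        : (x0, y0, x1, y1) in feature-map pixels, inside the map
    returns    : (k, k, C+1) pooled scores
    """
    c1 = num_classes + 1
    x0, y0, x1, y1 = roi
    maps = score_maps.reshape(k, k, c1, *score_maps.shape[1:])
    xs = np.linspace(x0, x1, k + 1)   # bin edges along x
    ys = np.linspace(y0, y1, k + 1)   # bin edges along y
    pooled = np.zeros((k, k, c1))
    for i in range(k):                # bin row
        for j in range(k):            # bin column
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1_ = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            r1, c1_ = max(r1, r0 + 1), max(c1_, c0 + 1)   # keep bins non-empty
            # Bin (i, j) reads only from its own group of C+1 maps.
            pooled[i, j] = maps[i, j, :, r0:r1, c0:c1_].mean(axis=(1, 2))
    return pooled

def vote(pooled):
    """Average the k*k bin scores per class, then apply softmax."""
    scores = pooled.mean(axis=(0, 1))
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Toy check: every map belonging to class 2 is set to 1, the rest to 0.
k, C = 3, 4
maps = np.zeros((k, k, C + 1, 12, 12))
maps[:, :, 2] = 1.0
pooled = ps_roi_pool(maps.reshape(-1, 12, 12), (2, 2, 11, 11), k, C)
probs = vote(pooled)
```

Because each bin reads from a different group of maps, the same object yields different responses depending on where it falls inside the ROI, which is how a fully convolutional network can encode position.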

Training Details

  1. R-FCN is trained end-to-end with pre-computed region proposals. Both category and position are learnt with the loss function $L(s, t_{x,y,w,h}) = L_{cls}(s_{c^{*}}) + \lambda [c^{*} > 0] L_{reg}(t, t^{*})$, where $c^{*}$ is the ground-truth label of the ROI ($c^{*} = 0$ for background) and $t^{*}$ is the ground-truth box.
  2. For each image, N proposals are generated and the B proposals with the highest losses are selected to train the weights. B is set to 128 in this work.
  3. 4-step alternating training is used to share features between R-FCN and RPN.
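The loss and the selection of hard proposals in items 1 and 2 can be sketched as below. The sketch assumes cross-entropy for $L_{cls}$ and smooth L1 for $L_{reg}$; the function names and the default `lam=1.0` are illustrative choices rather than details taken from these notes.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: quadratic near zero, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def roi_losses(scores, labels, t_pred, t_gt, lam=1.0):
    """Per-ROI loss L_cls + lam * [c* > 0] * L_reg.

    scores : (N, C+1) raw class scores
    labels : (N,) ground-truth labels, 0 means background
    t_pred, t_gt : (N, 4) predicted and ground-truth box parameters
    """
    z = scores - scores.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_cls = -log_p[np.arange(len(labels)), labels]      # cross-entropy
    l_reg = smooth_l1(t_pred - t_gt).sum(axis=1)
    return l_cls + lam * (labels > 0) * l_reg           # no box loss on background

def select_hard_examples(losses, B=128):
    """Indices of the B proposals with the highest loss."""
    return np.argsort(-losses)[:B]

# Toy batch: 4 ROIs, 2 object classes plus background.
scores = np.zeros((4, 3))
labels = np.array([0, 1, 2, 0])
t_gt = np.zeros((4, 4))
t_pred = np.array([[0.0, 0, 0, 0], [2.0, 0, 0, 0], [0.5, 0, 0, 0], [3.0, 0, 0, 0]])
losses = roi_losses(scores, labels, t_pred, t_gt)
hard = select_hard_examples(losses, B=2)
```

The last ROI has the largest box error but is background, so the indicator $[c^{*} > 0]$ removes its regression term and it is not picked as a hard example.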


  1. It is fast (170ms/image, 2.5-20x faster than Faster R-CNN).
  2. End-to-end training is straightforward.
  3. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection.


  1. Compared with single-shot methods, more computation is needed.





TITLE: SSD: Single Shot MultiBox Detector

AUTHOR: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

FROM: arXiv:1512.02325v2


  1. SSD, a fast and accurate single-shot detector for multiple categories, is introduced.
  2. The network is simple to train end-to-end and keeps high accuracy even with relatively low-resolution input images, further improving the speed vs accuracy trade-off.


Network structure:

  1. Multi-scale feature maps from different layers are used to handle objects of different sizes.
  2. On each feature map used for detection, a separate small network (filter) learns to predict category scores and location offsets.
  3. Each feature map corresponds to a fixed set of default boxes. These default boxes have different aspect ratios.


  1. Default and ground truth boxes are matched. Each ground truth box is matched to the default box with the best Jaccard overlap. In addition, default boxes are matched to any ground truth box with Jaccard overlap higher than a threshold.
  2. The training objective is a weighted sum of the localization loss (loc) and the confidence loss (conf):

    where N is the number of matched default boxes, and the localization loss is the Smooth L1 loss between the predicted box $(l)$ and the ground truth box $(g)$ parameters. The confidence loss is the softmax loss over the class confidences $(c)$.

  3. The scale of the default boxes for the $k$-th feature map is computed as:

    where $s_{min}=0.2$ and $s_{max}=0.95$. The width of a default box is $s_{k}\sqrt{a_{r}}$ and the height is $s_{k}/\sqrt{a_{r}}$, where $a_{r}$ is the aspect ratio. The centre of a default box at location $(i, j)$ in the $k$-th feature map is $(\frac{i+0.5}{|f_{k}|}, \frac{j+0.5}{|f_{k}|})$, where $|f_{k}|$ is the size of the $k$-th feature map.

  4. Hard negatives are extracted. The unmatched default boxes are sorted by confidence and the top ones are used as hard negatives, so that the ratio between negatives and positives is at most 3:1.
  5. For data augmentation, each training image is either used in its entirety or replaced by a sampled patch whose minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
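The formulas in items 2 and 3 are not reproduced in these notes. Assuming they follow the usual SSD form, the objective is $L = \frac{1}{N}(L_{conf} + \alpha L_{loc})$ and the scales are linearly spaced, $s_k = s_{min} + \frac{s_{max}-s_{min}}{m-1}(k-1)$ for $k \in [1, m]$. Under that assumption, default box generation, matching, and hard negative selection can be sketched as follows. The aspect ratio set, the 0.5 matching threshold, and all function names are hypothetical.

```python
import numpy as np

def box_scales(m, s_min=0.2, s_max=0.95):
    """Scales s_1..s_m, linearly spaced (assumed form of the formula)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_boxes(f_k, s_k, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) in relative units for an f_k x f_k map."""
    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            for ar in aspect_ratios:
                boxes.append((cx, cy, s_k * np.sqrt(ar), s_k / np.sqrt(ar)))
    return np.array(boxes)

def jaccard(a, b):
    """Pairwise Jaccard overlap of boxes a (n, 4) and b (m, 4) in (cx, cy, w, h)."""
    a_lo, a_hi = a[:, None, :2] - a[:, None, 2:] / 2, a[:, None, :2] + a[:, None, 2:] / 2
    b_lo, b_hi = b[None, :, :2] - b[None, :, 2:] / 2, b[None, :, :2] + b[None, :, 2:] / 2
    wh = np.clip(np.minimum(a_hi, b_hi) - np.maximum(a_lo, b_lo), 0, None)
    inter = wh[..., 0] * wh[..., 1]
    union = a[:, None, 2] * a[:, None, 3] + b[None, :, 2] * b[None, :, 3] - inter
    return inter / union

def match(defaults, gts, threshold=0.5):
    """Matched ground truth index per default box, or -1 if unmatched."""
    iou = jaccard(defaults, gts)
    matched = np.where(iou.max(axis=1) > threshold, iou.argmax(axis=1), -1)
    matched[iou.argmax(axis=0)] = np.arange(len(gts))   # best default box per ground truth
    return matched

def hard_negatives(conf_loss, matched, ratio=3):
    """Unmatched boxes with the highest confidence loss, at most ratio:1."""
    neg = np.flatnonzero(matched < 0)
    n_keep = min(len(neg), ratio * int((matched >= 0).sum()))
    return neg[np.argsort(-conf_loss[neg])[:n_keep]]

# Toy run: second of 6 feature maps, 4 x 4 cells, one ground truth box.
scales = box_scales(6)
defaults = default_boxes(4, scales[1])
gts = np.array([[0.125, 0.125, scales[1], scales[1]]])
matched = match(defaults, gts)
negatives = hard_negatives(np.arange(len(defaults)) / len(defaults), matched)
```

In the toy run the ground truth coincides with the first default box, so that box is always matched; the negatives are then drawn from the unmatched boxes in order of decreasing loss.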


  1. It is fast because detection takes a single shot through the network and the input is of lower resolution.
  2. Multi-scale feature maps let it handle objects of different sizes.
  3. End-to-end training.



