
TITLE: Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

AUTHOR: Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick

ASSOCIATION: Cornell University, Microsoft Research

FROM: arXiv:1512.04143

CONTRIBUTIONS

  1. The ION architecture is introduced, which leverages context and multi-scale skip pooling for object detection, using information both inside and outside the ROI to determine the detection result.

METHOD

The main steps of the method are shown in the following figure.

  1. The image is first fed into a CNN, e.g. VGG16.
  2. ROI proposals are generated in the same way as in Fast R-CNN.
  3. Information within the ROI is extracted by ROI pooling on feature maps from convolutional layers at different scales.
  4. Information outside the ROI is extracted by two successive 4-direction IRNNs, and ROI pooling is then used to extract the context features.
  5. The pooled features are L2-normalized and concatenated; a 1x1 conv layer is then used to reduce the dimension.
  6. Two branches are learned to predict category and location.
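Steps 3-5 can be sketched in plain Python. This is a minimal toy, not the paper's implementation: the feature vectors, the per-layer rescaling factors (which ION learns), and the final 1x1 convolution (here only noted in a comment) are illustrative assumptions.

```python
import math

def l2_normalize(v, scale=1.0, eps=1e-12):
    """L2-normalize a feature vector, then rescale (ION learns the scale)."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [scale * x / norm for x in v]

def skip_pool(roi_features, scales):
    """Concatenate L2-normalized ROI features pooled from several layers.

    roi_features: list of per-layer feature vectors for one ROI.
    scales: one rescaling factor per layer (assumed values here).
    """
    fused = []
    for feat, s in zip(roi_features, scales):
        fused.extend(l2_normalize(feat, s))
    return fused  # a 1x1 conv would then project this back down

# Toy features pooled from conv3, conv4, conv5 and the IRNN context layer
pooled = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [2.0, 0.0]]
fused = skip_pool(pooled, scales=[1.0, 1.0, 1.0, 1.0])
print(len(fused))  # 8
```

Normalizing before concatenation matters because features from different layers have very different magnitudes; without it the largest-scale layer dominates the fused descriptor.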

Some Details

A 4-direction IRNN contains four independent IRNNs, each moving in a different direction (left, right, up, and down). The internal IRNN computations are split into separate logical layers: the input-to-hidden transition is implemented as a 1x1 convolution, so its computation can be shared across directions.
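A minimal sketch of this recurrence in plain Python, assuming the input-to-hidden 1x1 convolution has already been applied and the recurrent weights stay at their identity initialization, so each step reduces to h_t = ReLU(h_{t-1} + x_t) on scalars:

```python
def irnn_sweep(row):
    """One IRNN direction over a 1-D sequence of (pre-computed) inputs.

    With the recurrent weight fixed at the identity, the update is
    h_t = ReLU(h_{t-1} + x_t); the learned 1x1 input transition is
    assumed to have produced the x values already.
    """
    h, out = 0.0, []
    for x in row:
        h = max(0.0, h + x)  # ReLU(identity * h + x)
        out.append(h)
    return out

def four_direction_irnn(grid):
    """Run independent left/right/up/down sweeps over a 2-D grid."""
    left = [irnn_sweep(r) for r in grid]
    right = [irnn_sweep(r[::-1])[::-1] for r in grid]
    cols = [list(c) for c in zip(*grid)]
    down = [list(r) for r in zip(*[irnn_sweep(c) for c in cols])]
    up = [list(r) for r in zip(*[irnn_sweep(c[::-1])[::-1] for c in cols])]
    return left, right, down, up

grid = [[1.0, -1.0], [2.0, 3.0]]
l, r, d, u = four_direction_irnn(grid)
print(l)  # [[1.0, 0.0], [2.0, 5.0]]
```

After one such 4-direction pass, every output cell has seen an entire row or column; stacking a second pass (as ION does) lets every cell depend on the whole image.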

ADVANTAGES

  1. The proposed detector works better on small objects than previous methods.
  2. Both local and global information are taken into account.
  3. Skip pooling uses information from different scales.
  4. Two successive 4-direction IRNNs cover information from the whole image.

TITLE: Semantic Object Parsing with Graph LSTM

AUTHOR: Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, Shuicheng Yan

ASSOCIATION: National University of Singapore, Sun Yat-sen University, Adobe Research

FROM: arXiv:1603.07063

CONTRIBUTIONS

  1. A novel Graph LSTM structure is proposed to handle general graph-structured data; it effectively exploits global context via superpixels extracted by over-segmentation.
  2. A confidence-driven scheme is proposed to select the starting node and the order of updating sequences.
  3. In each Graph LSTM unit, different forget gates for the neighboring nodes are learned to dynamically incorporate the local contextual interactions in accordance with their semantic relations.

METHOD

The main steps of the method are shown in the following figure.

  1. The input image first passes through a stack of convolutional layers to generate the convolutional feature maps.
  2. The convolutional feature maps are further used to generate an initial semantic confidence map for each pixel.
  3. The input image is over-segmented to multiple superpixels. For each superpixel, a feature vector is extracted from the upsampled convolutional feature maps.
  4. The first Graph LSTM takes the feature vector of every superpixel as input to compute a better state.
  5. The second Graph LSTM takes the feature vector of every superpixel and the output of first Graph LSTM as input.
  6. The update sequence of the superpixels follows the initial confidences of the superpixels.
  7. Several 1×1 convolution filters are employed to produce the final parsing results.
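The confidence-driven ordering in step 6 can be sketched as a simple sort. This toy assumes a single scalar confidence per superpixel, a simplification of the paper's per-class confidence maps:

```python
def update_order(confidences):
    """Confidence-driven scheme (a sketch): the superpixel with the
    highest initial confidence is updated first, then the rest in
    descending order of confidence."""
    return sorted(range(len(confidences)),
                  key=lambda i: confidences[i], reverse=True)

# Toy initial confidences from the convolutional maps, one per superpixel
print(update_order([0.2, 0.9, 0.5]))  # [1, 2, 0]
```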

some details

A graph structure is built on the superpixels: the nodes are the superpixels, and two nodes are linked when the corresponding superpixels are adjacent. The history information used by the Graph LSTM for one superpixel comes from its adjacent superpixels.
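The graph construction described above can be sketched as follows; 4-connectivity between pixels is an assumption made here for illustration:

```python
def superpixel_graph(labels):
    """Build the Graph LSTM topology: superpixels are nodes, and two
    nodes are linked when their pixels touch (4-connectivity assumed).

    labels: 2-D list mapping each pixel to its superpixel id.
    Returns the set of undirected edges as (smaller_id, larger_id)."""
    h, w = len(labels), len(labels[0])
    edges = set()
    for i in range(h):
        for j in range(w):
            for di, dj in ((0, 1), (1, 0)):  # right and down neighbours
                ni, nj = i + di, j + dj
                if ni < h and nj < w and labels[i][j] != labels[ni][nj]:
                    a, b = sorted((labels[i][j], labels[ni][nj]))
                    edges.add((a, b))
    return edges

# Toy over-segmentation with three superpixels 0, 1, 2
labels = [[0, 0, 1],
          [0, 2, 1],
          [2, 2, 1]]
print(sorted(superpixel_graph(labels)))  # [(0, 1), (0, 2), (1, 2)]
```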

ADVANTAGES

  1. Constructed on superpixels generated by oversegmentation, the Graph LSTM is more naturally aligned with the visual patterns in the image.
  2. Adaptively learning the forget gates with respect to different neighboring nodes when updating the hidden states of a certain node is beneficial to model various neighbor connections.

TITLE: Object Detection from Video Tubelets with Convolutional Neural Networks

AUTHOR: Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang

ASSOCIATION: The Chinese University of Hong Kong

FROM: arXiv:1604.04053

CONTRIBUTIONS

  1. A complete multi-stage framework is proposed for object detection in videos.
  2. A special temporal convolutional neural network is proposed to incorporate temporal information into object detection from video.

METHOD

The main steps of the method are shown in the following figure.

  1. Image object proposal. Regions are generated in each frame by Selective Search and classified by an AlexNet over 200 categories, a method similar to R-CNN. Regions with scores lower than a threshold are removed, and the rest are kept as proposals.
  2. Object proposal scoring. The proposals are scored by a 30-category classifier derived from GoogLeNet, and the proposals with higher scores are kept.
  3. High-confidence proposal tracking. The proposals with higher scores are tracked, and overlapping proposals are suppressed using IOU. The resulting tracks are the tubelet proposals.
  4. Tubelet box perturbation and max-pooling. As the tracking result may drift, multiple regions are generated around each tubelet box. All the regions are sent to the CNN from step 2 and sorted by score; the region with the highest score replaces the original one in the tubelet.
  5. Temporal convolution and re-scoring. A Temporal Convolutional Network (TCN) is proposed that takes 1-D temporal features, including detection scores, tracking scores and anchor offsets, and generates temporally dense predictions on every tubelet box. Tubelet boxes with high detection scores are taken as detection results. However, the TCN is not well explained in this work.
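The IOU-based suppression in step 3 can be sketched with standard greedy non-maximum suppression; this stands in for whatever exact suppression the paper uses at tracking time:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def suppress(proposals, scores, thresh=0.5):
    """Greedy suppression: visit proposals in descending score order and
    drop any proposal overlapping an already-kept one above thresh."""
    order = sorted(range(len(proposals)),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(proposals[i], proposals[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(suppress(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```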

ADVANTAGES

  1. The TCN helps reduce the negative effect caused by the large variations of detection scores along the same track.

DISADVANTAGES

  1. Too many stages.
  2. Too many CNN operations.

TITLE: Chained Predictions Using Convolutional Neural Networks

AUTHOR: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

ASSOCIATION: UC Berkeley, Google

FROM: arXiv:1605.02346

CONTRIBUTIONS

  1. A chain model for structured outputs, such as human pose estimation. The output convolutional neural network is a multi-scale deconvolution, called "deception" by the authors because of its relationship to deconvolution and inception models.
  2. Two formulations of the chain model are proposed: one without weight sharing between different predictors (poses in images) and one with weight sharing (poses in videos).

METHOD

There are two formulations of the chain model in this work. The single-image formulation is taken as the example here; the video version follows a similar procedure.

The inference stage is illustrated in the figure. The input image is first fed to a CNN denoted CNNx. At every stage, a joint of the person is localized by a CNN denoted CNNy (the first output is denoted "Prediction@0"). Then both the input and the output of CNNy are used to predict the next joint at the next stage. The procedure can be formalized as:

$$h_t=\sigma(w_t^h \ast h_{t-1}+\sum_{i=0}^{t-1}w_{i,t}^y \ast e(y_i))$$

$$P(Y_t=y_t|X,y_0,…,y_{t-1})=Softmax(m_t(h_t))$$

where $h_0$=CNNx(x), $e(\cdot)$ is a full neural net, $m_t$ is the operation of CNNy on $h_t$, and $P$ is the probability of the location of a joint.
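A scalar toy of one chain step, with hypothetical stand-ins `embed` for $e(\cdot)$ and `score` for $m_t$; the paper uses convolutions and small nets, and plain multiplications are used here only to make the recurrence concrete:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def chain_step(h_prev, past_outputs, w_h, w_y, embed, score):
    """One chain step: fold the hidden state and the embeddings of all
    previously predicted joints into a new state, then score locations.

    embed plays the role of e(.), score the role of m_t; both are toy
    stand-ins for the networks in the paper."""
    s = w_h * h_prev + sum(w * embed(y) for w, y in zip(w_y, past_outputs))
    h = sigmoid(s)
    return h, softmax(score(h))

h0 = 0.5                            # would come from CNNx(image)
embed = lambda y: float(y)          # hypothetical e(.)
score = lambda h: [h, 1.0 - h]      # hypothetical m_t over two locations
h1, probs = chain_step(h0, [1], w_h=1.0, w_y=[0.5], embed=embed, score=score)
print(round(sum(probs), 6))  # 1.0
```

The key property the sketch preserves is that step t conditions on all previous predictions $y_0, \ldots, y_{t-1}$, not just the most recent one.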

ADVANTAGES

  1. Using chain models allows us to sidestep any assumptions about the joint distribution of the output variables.
  2. Jointly considering other structures can lead to better performance.
  3. Hand-crafted features are replaced by CNN, which can be learnt end-to-end.

DISADVANTAGES

  1. $e(\cdot)$ is not explained in this work.

TITLE: R-FCN: Object Detection via Region-based Fully Convolutional Networks

AUTHOR: Jifeng Dai, Yi Li, Kaiming He, Jian Sun

ASSOCIATION: MSRA, Tsinghua University

FROM: arXiv:1605.06409

CONTRIBUTIONS

  1. A framework called Region-based Fully Convolutional Network (R-FCN) is developed for object detection, which consists of shared, fully convolutional architectures.
  2. A set of position-sensitive score maps is introduced to enable the FCN to represent translation variance.
  3. A unique ROI pooling method is proposed to shepherd information from the mentioned score maps.

METHOD

  1. The image is processed by a fully convolutional network (FCN).
  2. At the end of the FCN, an RPN (Region Proposal Network) is used to generate ROIs.
  3. In parallel, a score map of $k^{2}(C+1)$ channels is generated using a bank of specialized convolutional layers.
  4. For each ROI, position-sensitive ROI pooling is utilized to generate a $C+1$ channel score map.
  5. The scores in the score map are averaged to vote for the category.
  6. Another $4k^{2}$-channel convolutional layer is learned for bounding box regression.
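Steps 4-5 can be sketched for a single class. This toy assumes integer bin boundaries and average pooling; the real layer handles fractional bins and all $C+1$ classes at once:

```python
def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive ROI pooling for one class (a sketch).

    score_maps: k*k maps, where map (i, j) is responsible only for
    bin (i, j) of the ROI. Each bin is average-pooled from its own
    map, then the k*k bin scores are averaged to vote."""
    x0, y0, x1, y1 = roi
    bw, bh = (x1 - x0) / k, (y1 - y0) / k
    votes = []
    for i in range(k):
        for j in range(k):
            m = score_maps[i * k + j]  # the map owning bin (i, j)
            ys = range(int(y0 + i * bh), int(y0 + (i + 1) * bh))
            xs = range(int(x0 + j * bw), int(x0 + (j + 1) * bw))
            vals = [m[y][x] for y in ys for x in xs]
            votes.append(sum(vals) / len(vals))
    return sum(votes) / len(votes)  # average voting for this class

# k = 2: four 4x4 score maps, each constant for easy checking
maps = [[[float(c)] * 4 for _ in range(4)] for c in (1, 2, 3, 4)]
print(ps_roi_pool(maps, (0, 0, 4, 4), k=2))  # 2.5
```

Because each spatial bin reads from a different map, the per-class score only fires when the object's parts appear in the right relative positions, which is how the otherwise translation-invariant FCN recovers spatial sensitivity.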

Training Details

  1. R-FCN is trained end-to-end with pre-computed region proposals. Both category and position are learnt with the loss function $L(s,t_{x,y,w,h})=L_{cls}(s_{c})+\lambda[c>0]L_{reg}(t)$
  2. For each image, N proposals are generated and B out of N proposals are selected to train weights according to the highest losses. B is set to 128 in this work.
  3. 4-step alternating training is utilized to realize feature sharing between R-FCN and RPN.
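The hard-example selection in step 2 is a small top-B operation; a sketch, assuming the per-ROI losses have already been computed in a forward pass:

```python
def select_hard_examples(losses, B=128):
    """Online hard example mining as used to train R-FCN: keep the B
    proposals with the highest per-ROI loss and back-propagate only
    through those."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:B])  # indices of the selected proposals

print(select_hard_examples([0.1, 0.9, 0.4, 0.7], B=2))  # [1, 3]
```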

ADVANTAGES

  1. It is fast (170ms/image, 2.5-20x faster than Faster R-CNN).
  2. End-to-end training is easier to process.
  3. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection.

DISADVANTAGES

  1. Compared with single-shot methods, more computational resources are needed.

Having recently become hooked on Keigo Higashino, I read several of his works in one go: after The Devotion of Suspect X and Journey Under the Midnight Sun, I went on to Phantom Night (Genya) and After School. Last time, writing about Journey Under the Midnight Sun, I felt Yukiho was an utterly cold and heartless person, while Ryoji was a puppet with no self at all, doing everything at her direction. Only after reading Phantom Night did I realize that Yukiho may have had real feelings for Ryoji; it is Mifuyu who truly plays everyone in the palm of her hand while having no trace of humanity in her heart. I forget where I saw this assessment of the two companion works: if Journey Under the Midnight Sun is a "book of utmost evil", then Phantom Night goes a step further and is a thoroughgoing "book of despair". That really hits the mark. Higashino himself once said: "I did not want Phantom Night to be a sequel to Journey Under the Midnight Sun; I hoped to leave some room, so that readers who have finished both books can happily wander through all kinds of imaginings." Still, the two stories inevitably invite speculation: is Mifuyu actually Yukiho? At least in the TV drama, the screenwriters decided she is. Perhaps in Journey Under the Midnight Sun, Yukiho, for all the evil she does, still evokes pity, because everything was done so she could be with Ryoji; for his sake she kept some humanity. But when Ryoji died, she lost her sun, fell completely into the boundless phantom night, and became Mifuyu.

Unlike Phantom Night, After School, though people die in it too, feels much sunnier to me, perhaps because the story is set in a girls' high school, a place radiating youthful energy. Another reason may be that it is written in the first person: I was naturally drawn into the narrator, who is kind-hearted and a teacher loved by his students. Ironically, I even felt somewhat relieved that the apparent plot to kill the narrator turned out to be only a decoy. But his "wife's" behavior was unsettling, and just when the story seemed headed for a perfect ending, her betrayal tore the decoy away and turned the feigned act into the real thing. It felt like Baz Luhrmann's Romeo + Juliet and The Great Gatsby: the despair at the story's end is what cuts deepest.

I have not read many mystery novels, but from what I have read, Japanese mysteries pay more attention to carving out characters' inner feelings, with abundant psychological description and characterization, and events unfold from the people. Western mysteries emphasize logic, and their brilliance lies in how events develop. I cannot say which is better, but Japanese writers' characterization is remarkably fine-grained, always carrying a fresh, understated Japanese sensibility. While on the subject, two Japanese films I watched recently: Little Forest and The Teacher and the Stray Cat. Both look boring at first glance: Little Forest is about cooking from start to finish, and The Teacher and the Stray Cat is about looking for a stray cat, yet Japanese films have a charm that keeps you watching. Whether it is Ichiko finally finding herself and returning to Komori with peace of mind, or the teacher serenely reuniting with his late wife, both leave a warm feeling, and each story slowly unfolds along the protagonist's inner transformation.

For me personally, the other works may just be entertainment, but Little Forest gave me something to think about. Ichiko, having struggled in the big city, reluctantly returns to her hometown of Komori and lives an idyllic life, but she knows in her heart that she cannot be content with having merely fled back; she needs to make a real decision to return to Komori. Looking at myself, I sometimes say there is no need to push so hard, that life is fine as it is, but I know that is only because pushing is so hard. When will I genuinely feel that life is good?

TITLE: SSD: Single Shot MultiBox Detector

AUTHOR: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

FROM: arXiv:1512.02325v2

CONTRIBUTIONS

  1. SSD, a single-shot detector for multiple categories, is introduced; it is fast and accurate.
  2. The network is easy to train: simple end-to-end training achieves high accuracy, even with relatively low-resolution input images, further improving the speed vs. accuracy trade-off.

METHOD

Network structure:

  1. Multiple scale feature maps from different layers are used in order to handle objects with different sizes.
  2. On each feature map used for detection, a unique small network (filter) is utilized to learn to predict category scores and location offsets.
  3. Each feature map corresponds to a fixed set of default boxes. These default boxes have different aspect ratios.

Training:

  1. Default and ground truth boxes are matched. Each ground truth box is matched to the default box with the best jaccard overlap; in addition, default boxes are matched to any ground truth with jaccard overlap higher than a threshold.

  2. The training objective is a weighted sum of the localization loss (loc) and the confidence loss (conf):

    $$ L(x,c,l,g)= \frac{1}{N}(L_{conf}(x,c)+ \alpha L_{loc}(x,l,g)) $$

    where N is the number of matched default boxes, the localization loss is the Smooth L1 loss between the predicted box $(l)$ and the ground truth box $(g)$ parameters, and the confidence loss is the softmax loss over multiple class confidences $(c)$.

  3. The scale of the default boxes for the $k$-th feature map is computed as:

    $$s_{k}=s_{min}+ \frac{s_{max}-s_{min}}{m-1}(k-1)$$

    where $s_{min}=0.2$ and $s_{max}=0.95$. The width of a default box is $s_{k}\sqrt{a_{r}}$ and the height is $s_{k}/\sqrt{a_{r}}$, where $a_{r}$ is the aspect ratio. The centre of a default box at location $(i, j)$ in the $k$-th feature map is $(\frac{i+0.5}{|f_{k}|}, \frac{j+0.5}{|f_{k}|})$.

  4. Hard negatives are extracted. The unmatched default boxes are sorted according to confidence and top ones are used as hard negatives so that the ratio between the negatives and positives is at most 3:1.

  5. Data augmentation is done by using the entire original input image and sampling a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
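The default-box geometry in step 3 follows directly from the formulas, using the note's $s_{min}=0.2$ and $s_{max}=0.95$; the function name and arguments are illustrative:

```python
import math

def default_box(k, m, i, j, f_k, aspect, s_min=0.2, s_max=0.95):
    """Geometry of one SSD default box.

    k: 1-based index of the feature map, m: number of feature maps,
    (i, j): cell location, f_k: feature map size, aspect: aspect ratio.
    Returns (centre_x, centre_y, width, height) in relative coordinates."""
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)        # box scale
    w, h = s_k * math.sqrt(aspect), s_k / math.sqrt(aspect)  # box size
    cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k                # box centre
    return cx, cy, w, h

# First of six feature maps, top-left cell of an 8x8 map, square box
print(default_box(k=1, m=6, i=0, j=0, f_k=8, aspect=1.0))
# (0.0625, 0.0625, 0.2, 0.2)
```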

ADVANTAGES

  1. It is fast because only a single network evaluation is needed and the input is of lower resolution.
  2. Multiple scale feature maps are used so that it can handle objects with different sizes.
  3. End-to-end training.

The past week was a bit busy. Not that there was much to do, but my program hit a bug whose cause I simply could not find. I worked overtime every day, including all of Saturday, and only roughly located where the problem was without understanding why. So my mood this week was poor, and I could not muster the spirit to read or draw. Today, with a colleague's help, I finally solved the problem; my mood instantly improved, I left work early, and I even watched a very long film: The Last Emperor.

I had long heard it was a classic and an Oscar-winning film, but I have never been keen on biographical movies, and the title sounded dull, so I never sat down to watch it. Yet once I started today, I found myself deeply drawn in, because it is not a dull film at all. The atmosphere is oppressive, but beneath the oppression runs a powerful undercurrent. Nearly three and a half hours long, it never felt tiring; instead it gripped me and immersed me naturally in the protagonist's world, looking back on that period of history from one person's perspective. When I reposted the film on Weibo I wrote one line of comment: this is the story of a "person". Exactly what it says: the story of a human being.