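It has been a long time since my last post. Every time I wanted to write something, I could not settle down. Thinking it over, I have not taken much in lately, so naturally there has not been much to put out.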

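I have not read any new papers recently, so there is not much to record on the professional side. I did, however, spend some time studying SSD (Single Shot MultiBox Detector) closely: reading the authors' code, running the examples, and then training a detector with a model of my own design. I now understand this detection pipeline much better, and I have some results to show for it. My biggest takeaway from these experiments is that the details decide everything. At first glance the algorithm looks simple, with a clear idea and a clear workflow, but the details hide many interesting tricks, such as how the parameters interact and how they should be set for a specific problem, and these need to be worked through carefully. When reading papers these days I often settle for a rough understanding. But to really solve a problem, or to fully digest an algorithm, you have to keep asking until you reach the bottom. Reading a paper means asking why, not simply accepting the conclusions.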

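I have not drawn for a long time either. I had planned to pick up a pen every weekend, but for the past two weeks it was either overtime or odds and ends, and once I got lazy I did not draw a single stroke. I am not particularly good at drawing. One reason I want to draw is to give my brain a change: I write code and read documentation at a computer all day, which is tiring, and I also study related professional topics after work, so the other parts of my brain are probably shrinking. I need to do something on the humanities side to adjust my mood. The second reason is that drawing calms me down. People say we live in a restless society, and in such an environment I doubt I can stay unaffected. It would be a lie to say I am not restless. But while I am drawing, everything outside can be forgotten for a while. I can focus on the flow of the lines and the combination of colors, and my mind settles on its own. In this rented room without air conditioning, I can at least enjoy the coolness that comes with a calm mind.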

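I recently watched two very interesting films, Key of Life (《盗钥匙的方法》) and Manny & Lo (《曼妮姐妹》). I first noticed both of them because of their lead actresses. Only after watching did I find that it was not just the actresses that were appealing; the films themselves are excellent.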

Key of Life is a 2012 Japanese film. The plot summary found online reads:

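Small-theater actor Takeshi Sakurai (Masato Sakai) has struggled for years without a break. Broke and unlucky in love, he is so desperate that he even fails at suicide. With nothing else to do, he goes to a public bathhouse, where by sheer accident a bar of soap sends a stranger slipping to the floor. The man is knocked unconscious, and Sakurai, on a wild impulse, steals his locker key. From then on he goes by the name Kondo and lives the wealthy life he never dared to hope for. What he could never have imagined is that Kondo (Teruyuki Kagawa) is a cold-blooded killer feared throughout the underworld. Sakurai now has to accept commissions from the underworld and grit his teeth through the business of killing. Meanwhile, Kondo wakes up in hospital with no memory and mistakenly believes he is the down-and-out Sakurai. With the help of Sanae Mizushima (Ryoko Hirosue), a beautiful woman he meets by chance, he relearns bit by bit who he is as an actor and struggles to find a way forward, while his and Sanae's feelings quietly change. The moment his memory returns, the fates of the three become entangled...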

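Seeing that the female lead was Ryoko Hirosue, I expected a heartwarming film and sat down in that mood. The murder at the opening caught me off guard, but as the story went on I found the plot really interesting, with plenty of twists. Although there is a hitman storyline mixed in, it is a warm comedy through and through, and in the end it met my expectations: a very heartwarming film. It has the attention to detail typical of Japanese cinema, combined with a lot of absurd humor, and that mix feels fresh. There are many small jokes too. The characters all seem to be talking nonsense and clowning around with completely straight faces, and that contrast is what makes you laugh.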

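The other film, Manny & Lo, is slower by comparison, but that does not stop the audience from feeling its humanity and the care people show for one another. The two films share a similar theme: both show strangers moving from suspicion to trust, until in the end everyone depends on each other and can no longer be separated. The plot summary found online reads: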

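Sisters Amanda, 11, and Lo, 16, run away from their separate foster homes and set off on a life of drifting. Along the way, Lo discovers that she is pregnant. In a panic, the two kidnap Elaine, a clerk at a baby supply store, convinced that she can help them get through it safely. After a long time together, Elaine gradually finds that she too has come to depend on and need the company of these two girls.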

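I clicked on this film because I saw that Scarlett Johansson made it when she was 13, and I was curious what she looked like as a child. In this film she is a little angel, full of love for both her sister and Elaine. Some of her small gestures, such as raising an eyebrow, are exactly the same as when she is an adult, but on the face of a young child they give a different impression. The film is delicately made, and it kept reminding me of Japanese cinema. I am not sure whether that is because I have been watching so many Japanese films lately.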

TITLE: Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

AUTHOR: Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick

ASSOCIATION: Cornell University, Microsoft Research

FROM: arXiv:1512.04143

CONTRIBUTIONS

  1. The ION architecture is introduced, which leverages context and multi-scale skip pooling for object detection. It uses information both inside and outside the ROI to determine the detection result.

METHOD

The main steps of the method are shown in the following figure.

  1. The image is first fed into a CNN, e.g. VGG16.
  2. ROI proposals are generated in the same way as in Fast R-CNN.
  3. The information within the ROI is extracted by ROI pooling on feature maps taken from several convolutional layers at different scales.
  4. The information outside the ROI is extracted by 2 successive 4-direction IRNNs, and ROI pooling is again used to extract the features.
  5. The pooled features are L2-normalized and concatenated. Then a 1x1 conv layer is used to reduce the dimension.
  6. Two branches are learned to predict category and location.

SOME DETAILS

A 4-direction IRNN contains 4 independent IRNNs, each moving in a different direction (left, right, up and down). The internal IRNN computations are split into separate logical layers. The input-to-hidden transition is implemented by a 1x1 convolution, and its computation can be shared across the different directions.
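
To make the four sweeps concrete, here is a minimal NumPy sketch rather than the authors' implementation. It assumes a ReLU recurrence whose hidden-to-hidden weight is fixed to the identity, and a single input-to-hidden weight `w_in` shared by all four directions. The names `sweep`, `four_direction_irnn` and `w_in` are made up for illustration.

```python
import numpy as np

def sweep(x, axis, reverse):
    """ReLU recurrence along one spatial axis.

    Assumption: the hidden-to-hidden weight is the identity, so each step is
    h[i] = max(h[i-1] + x[i], 0).
    """
    x = np.moveaxis(x, axis, 0)      # put the sweep axis first
    if reverse:
        x = x[::-1]
    h = np.zeros_like(x)
    prev = np.zeros_like(x[0])
    for i in range(x.shape[0]):
        prev = np.maximum(prev + x[i], 0.0)
        h[i] = prev
    if reverse:
        h = h[::-1]
    return np.moveaxis(h, 0, axis)   # restore the original layout

def four_direction_irnn(feat, w_in):
    """feat: (H, W, C_in); w_in: (C_in, C_hid), shared by all directions."""
    x = feat @ w_in                  # a 1x1 convolution is a per-pixel matrix product
    outs = [
        sweep(x, axis=1, reverse=False),  # left to right
        sweep(x, axis=1, reverse=True),   # right to left
        sweep(x, axis=0, reverse=False),  # top to bottom
        sweep(x, axis=0, reverse=True),   # bottom to top
    ]
    return np.concatenate(outs, axis=-1)  # (H, W, 4 * C_hid)
```

Stacking two such layers lets every output position receive information from the whole feature map: the first layer collects each row and column, and the second spreads that across the other axis.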

ADVANTAGES

  1. The proposed detector works better on smaller objects compared with other works.
  2. Both local and global information are taken into account.
  3. Skip pooling uses information from different scales.
  4. Two successive 4-direction IRNNs cover the information from the whole image.

TITLE: Semantic Object Parsing with Graph LSTM

AUTHOR: Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, Shuicheng Yan

ASSOCIATION: National University of Singapore, Sun Yat-sen University, Adobe Research

FROM: arXiv:1603.07063

CONTRIBUTIONS

  1. A novel Graph LSTM structure is proposed to handle general graph-structured data, which effectively exploits global context through superpixels extracted by over-segmentation.
  2. A confidence-driven scheme is proposed to select the starting node and the order of updating sequences.
  3. In each Graph LSTM unit, different forget gates for the neighboring nodes are learned to dynamically incorporate the local contextual interactions in accordance with their semantic relations.

METHOD

The main steps of the method are shown in the following figure.

  1. The input image first passes through a stack of convolutional layers to generate the convolutional feature maps.
  2. The convolutional feature maps are further used to generate an initial semantic confidence map for each pixel.
  3. The input image is over-segmented into multiple superpixels. For each superpixel, a feature vector is extracted from the upsampled convolutional feature maps.
  4. The first Graph LSTM takes the feature vector of every superpixel as input to compute a better state.
  5. The second Graph LSTM takes the feature vector of every superpixel and the output of the first Graph LSTM as input.
  6. The update order of the superpixels follows their initial confidence.
  7. Several 1×1 convolution filters are employed to produce the final parsing results.

SOME DETAILS

A graph structure is built on the superpixels. The nodes are the superpixels, and two nodes are linked when they are adjacent. The history information used by the G-LSTM for one superpixel comes from its adjacent superpixels.
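
As an illustration of the graph construction and the confidence-driven ordering described in these notes, here is a small sketch. It assumes the over-segmentation is already available as a 2-D integer label map and that each superpixel has a scalar confidence. The function names `superpixel_adjacency` and `update_order` are hypothetical, not taken from the paper.

```python
import numpy as np

def superpixel_adjacency(labels):
    """Return {node: set(neighbours)} for a 2-D integer superpixel label map.

    Two superpixels are linked when they share a horizontal or vertical
    pixel boundary (4-connectivity, an assumption of this sketch).
    """
    graph = {int(l): set() for l in np.unique(labels)}
    # Compare each pixel with its right neighbour, then with its lower neighbour.
    pairs = [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]
    for a, b in pairs:
        diff = a != b
        for u, v in zip(a[diff].tolist(), b[diff].tolist()):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def update_order(confidence):
    """Visit nodes from the highest to the lowest confidence.

    confidence: {node: float}, e.g. the mean initial confidence of its pixels.
    """
    return sorted(confidence, key=confidence.get, reverse=True)
```

When a node is updated, the states it reads are those of the entries in `graph[node]`, which is why the visiting order matters: neighbours updated earlier pass on fresher information.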

ADVANTAGES

  1. Constructed on superpixels generated by oversegmentation, the Graph LSTM is more naturally aligned with the visual patterns in the image.
  2. Adaptively learning the forget gates with respect to different neighboring nodes when updating the hidden states of a certain node is beneficial to model various neighbor connections.

TITLE: Object Detection from Video Tubelets with Convolutional Neural Networks

AUTHOR: Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang

ASSOCIATION: The Chinese University of Hong Kong

FROM: arXiv:1604.04053

CONTRIBUTIONS

  1. A complete multi-stage framework is proposed for object detection in videos.
  2. A special temporal convolutional neural network is proposed to incorporate temporal information into object detection from video.

METHOD

The main steps of the method are shown in the following figure.

  1. Image object proposal. Regions are generated in each frame by Selective Search and classified by a 200-category AlexNet, similar to R-CNN. Regions with scores lower than a threshold are removed and the rest are kept as proposals.
  2. Object proposal scoring. The proposals are scored by a 30-category classifier derived from GoogLeNet, and the proposals with higher scores are kept.
  3. High-confidence proposal tracking. The proposals with higher scores are tracked, and overlapping proposals are suppressed using IoU. The resulting tracks are the tubelet proposals.
  4. Tubelet box perturbation and max-pooling. As the tracking result may drift, multiple regions are generated around each tubelet box. All the regions are sent to the CNN from step 2 and sorted by score, and the region with the highest score replaces the box in the tubelet.
  5. Temporal convolution and re-scoring. A Temporal Convolutional Network (TCN) is proposed that takes 1-D sequential features, including detection scores, tracking scores and anchor offsets, and generates temporally dense predictions for every tubelet box. The tubelets with high detection scores are regarded as the detection results. However, the TCN is not explained well in this work.
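
The IoU-based suppression in step 3 can be sketched as a greedy filter. This is only an illustration under assumptions: boxes are `(x1, y1, x2, y2)` tuples, the 0.5 threshold is an arbitrary choice rather than the paper's setting, and the names `iou` and `suppress_overlaps` are made up here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def suppress_overlaps(boxes, scores, thresh=0.5):
    """Keep boxes in descending score order, dropping any box whose IoU
    with an already kept box exceeds `thresh` (assumed value)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `(0, 0, 10, 10)` and `(1, 1, 10, 10)` the IoU is 81 / 100, so only the higher-scoring one of the two survives.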

ADVANTAGES

  1. The TCN helps reduce the negative effect caused by the large variations of detection scores along the same track.

DISADVANTAGES

  1. Too many stages.
  2. Too many CNN operations.

TITLE: Chained Predictions Using Convolutional Neural Networks

AUTHOR: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

ASSOCIATION: UC Berkeley, Google

FROM: arXiv:1605.02346

CONTRIBUTIONS

  1. A chain model is proposed for structured outputs, such as human pose estimation. The output network is a multiscale deconvolution that the authors call deception because of its relationship to deconvolution and inception models.
  2. Two formulations of the chain model are proposed. One is without weight sharing between different predictors (poses in images) and the other is with weight sharing (poses in videos).

METHOD

There are two formulations of the chain model in this work. The one used for single images is taken as an example here; the video version follows a similar procedure.

The inference stage is illustrated in the figure. The input image is first fed to a CNN denoted as CNNx. At every stage, one joint of the person is localized by a CNN denoted as CNNy; the first output is labeled “Prediction@0”. Both the input and the output of CNNy are then used to predict the next joint in the next stage. The procedure can be formalized as:

$$h_t=\sigma(w_t^h \ast h_{t-1}+\sum_{i=0}^{t-1}w_{i,t}^y \ast e(y_i))$$

$$P(Y_t=y_t|X,y_0,\ldots,y_{t-1})=\mathrm{Softmax}(m_t(h_t))$$

where $h_0$=CNNx(x), $e(\cdot)$ is a full neural net, $m_t$ is the operation of CNNy on $h_t$, and $P$ is the probability of the location of a joint.
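
The recurrence above can be traced with a toy NumPy sketch. It is not the paper's network: every convolution is replaced by a plain matrix product on a feature vector, $\sigma$ is assumed to be ReLU, $e(\cdot)$ is stood in for by a lookup table `E`, $m_t$ by a matrix `M[t]`, and decoding is greedy. All names and shapes here are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def chain_decode(h0, W_h, W_y, M, E):
    """Greedy decoding of T chained predictions.

    h0      : (D,) initial state, standing in for CNNx(x)
    W_h[t]  : (D, D) weight applied to h_{t-1} (W_h[0] is unused)
    W_y[t][i]: (D, K_e) weight applied to e(y_i) at step t, for i < t
    M[t]    : (K, D) readout standing in for m_t
    E       : (K, K_e) embedding table standing in for e(.)
    """
    h, ys, probs = h0, [], []
    for t in range(len(M)):
        if t > 0:
            # Sum the contributions of all previous outputs y_0 .. y_{t-1}.
            ctx = sum(W_y[t][i] @ E[ys[i]] for i in range(t))
            h = np.maximum(W_h[t] @ h + ctx, 0.0)
        p = softmax(M[t] @ h)        # P(Y_t | X, y_0, ..., y_{t-1})
        ys.append(int(np.argmax(p)))
        probs.append(p)
    return ys, probs
```

The point of the sketch is the dependency structure: the prediction at step `t` sees the hidden state and an embedding of every earlier prediction, with separate weights per step as in the no-sharing formulation.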

ADVANTAGES

  1. Using chain models allows us to sidestep any assumptions about the joint distribution of the output variables.
  2. Jointly considering other structures can lead to better performance.
  3. Hand-crafted features are replaced by CNN, which can be learnt end-to-end.

DISADVANTAGES

  1. $e(\cdot)$ is not explained in this work.

TITLE: R-FCN: Object Detection via Region-based Fully Convolutional Networks

AUTHOR: Jifeng Dai, Yi Li, Kaiming He, Jian Sun

ASSOCIATION: MSRA, Tsinghua University

FROM: arXiv:1605.06409

CONTRIBUTIONS

  1. A framework called Region-based Fully Convolutional Network (R-FCN) is developed for object detection, which consists of shared, fully convolutional architectures.
  2. A set of position-sensitive score maps is introduced to enable the FCN to represent translation variance.
  3. A unique ROI pooling method is proposed to shepherd information from these score maps.

METHOD

  1. The image is processed by a fully convolutional network (FCN).
  2. At the end of the FCN, an RPN (Region Proposal Network) is used to generate ROIs.
  3. In parallel, a score map of $k^{2}(C+1)$ channels is generated using a bank of specialized convolutional layers.
  4. For each ROI, a selective (position-sensitive) ROI pooling is applied to generate a $C+1$ channel score map.
  5. The scores in the score map are averaged to vote for the category.
  6. Another $4k^2$-dimensional convolutional layer is learned for bounding box regression.
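
Steps 3 to 5 can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: the channel layout (bin `(i, j)` owns channels `(i*k + j)*(C+1)` onward), the bin rounding, and the names `ps_roi_pool` and `vote` are assumptions, and the ROI is assumed to lie inside the feature map.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling.

    score_maps: (k*k*(C+1), H, W)
    roi       : (x1, y1, x2, y2) in feature-map pixels, x2/y2 exclusive
    returns   : (k, k, C+1), one score vector per spatial bin
    """
    c1 = num_classes + 1
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1)   # bin edges along x
    ys = np.linspace(y1, y2, k + 1)   # bin edges along y
    pooled = np.zeros((k, k, c1))
    for i in range(k):                # bin row
        for j in range(k):            # bin column
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            q0, q1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            r1, q1 = max(r1, r0 + 1), max(q1, q0 + 1)  # at least one pixel
            # Each bin reads only from its own group of C+1 channels.
            group = score_maps[(i * k + j) * c1:(i * k + j + 1) * c1]
            pooled[i, j] = group[:, r0:r1, q0:q1].mean(axis=(1, 2))
    return pooled

def vote(pooled):
    """Average the k*k bins, then apply softmax over the C+1 categories."""
    scores = pooled.mean(axis=(0, 1))
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

Because each bin reads a different channel group, the same object shifted inside the ROI produces different pooled scores, which is how a fully convolutional network ends up encoding position.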

TRAINING DETAILS

  1. R-FCN is trained end-to-end with pre-computed region proposals. Both category and position are learnt with the loss function $L(s,t_{x,y,w,h})=L_{cls}(s_{c})+\lambda[c>0]L_{reg}(t)$
  2. For each image, N proposals are generated, and the B proposals with the highest losses are selected for backpropagation (online hard example mining). B is set to 128 in this work.
  3. 4-step alternating training is used to realize feature sharing between R-FCN and RPN.
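
The hard-example selection in item 2 amounts to a top-B pick over per-proposal losses. A minimal sketch, assuming the losses of all N proposals have already been computed in a forward pass (the function name `select_hard_examples` is hypothetical):

```python
import numpy as np

def select_hard_examples(losses, batch_size=128):
    """Return the indices of the `batch_size` proposals with the highest loss,
    ordered from highest to lowest loss."""
    losses = np.asarray(losses, dtype=float)
    b = min(batch_size, losses.size)          # fewer than B proposals: keep all
    order = np.argsort(-losses, kind="stable")
    return order[:b]
```

Only the selected proposals would contribute gradients in the backward pass; the rest are ignored for that iteration.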

ADVANTAGES

  1. It is fast (170ms/image, 2.5-20x faster than Faster R-CNN).
  2. End-to-end training is easier to process.
  3. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection.

DISADVANTAGES

  1. Compared with single-shot methods, more computational resources are needed.

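I have been hooked on Keigo Higashino lately and read several of his books in one go. After The Devotion of Suspect X (《嫌疑人X的献身》) and Journey Under the Midnight Sun (《白夜行》), I read Gen'ya (《幻夜》) and After School (《放学后》). When I last wrote about Journey Under the Midnight Sun, I felt that Yukiho was an extremely cold and ruthless person and that Ryoji was a puppet with no self at all, doing everything at Yukiho's command. After reading Gen'ya, I realized that Yukiho may have had real feelings for Ryoji after all. Mifuyu is the one who truly plays everyone like a toy, with no humanity left in her own heart. I do not remember where I saw this comment on the two companion books: if Journey Under the Midnight Sun is a "book of extreme evil", Gen'ya goes one step further and is an out-and-out "book of despair". That is spot on. Higashino himself once said: "I did not want Gen'ya to be a sequel to Journey Under the Midnight Sun. I hoped to leave some room so that readers who finish both books can happily wander among all kinds of imaginings." Still, the stories in the two books cannot help but make you wonder: is Mifuyu actually Yukiho? At least in the Japanese TV drama, the screenwriter takes Mifuyu to be Yukiho. Perhaps in Journey Under the Midnight Sun, Yukiho commits every kind of evil and yet still stirs pity, because everything she does is to be with Ryoji, and she still keeps the humanity she has for him. But when Ryoji dies, Yukiho loses her sun, falls completely into an endless night of illusion, and becomes Mifuyu.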

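Unlike Gen'ya, After School also has people losing their lives, yet it felt much sunnier to me. Perhaps that is because the story takes place in a girls' high school, a place full of youthful energy. Another reason may be that the story is told in the first person. I was naturally drawn into the "I" of the story, and "I" is a kind-hearted person and a teacher well liked by the students. Ironically, I was even somewhat relieved that the apparent attempts on "my" life turned out to be only a smokescreen. But the behavior of the "wife" is unsettling throughout, and just when the story seems to wrap up perfectly, her betrayal tears the smokescreen away and the pretense becomes real. It feels like Baz Luhrmann's Romeo + Juliet and The Great Gatsby: despair at the very end of the story is what hurts the most.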

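I have not read many mystery novels, but from what I have read, Japanese mysteries pay more attention to people's inner feelings. They are full of psychological description and character portrayal, and everything that happens starts from the people. Western mysteries focus more on logic, and the highlight is how the events unfold. I cannot say which is better, but I have to admit that Japanese writers portray their characters in remarkable detail, always with a certain fresh, gentle Japanese quality. On that note, let me mention two Japanese films I watched recently, Little Forest (《小森林》) and Teacher and Stray Cat (《老师与流浪猫》). Both look like very dull stories at first glance: Little Forest is about cooking from start to finish, and Teacher and Stray Cat is about looking for a stray cat. But Japanese films have a charm that keeps you watching. Whether it is Ichiko finally finding herself and returning to Komori with a clear conscience, or the teacher peacefully reuniting with his late wife, both leave you with a feeling of warmth at the end, and both stories unfold slowly along the main thread of the protagonist's change of heart.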

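For me personally, the other works were mostly entertainment, but Little Forest made me think. Ichiko returns to her hometown of Komori because things did not go well in the big city, and she lives what looks like an idyllic life there. But she knows that in her heart she is not content to have fled back to Komori like this. What she needs is a real decision to return to Komori. Looking at myself, I sometimes tell myself that there is no need to push so hard and that life is already fine as it is. But I know I say that because pushing on is so hard. When will I truly feel that my life is going well?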