
Although I am not a believer, the Christmas atmosphere is still quite infectious. I worked overtime all morning today, which felt rather bleak, and since the problem I wanted to solve never got solved, it felt even more dismal. In the afternoon, to fit the occasion, I watched Love Actually again, to soothe my wounded soul a little. It has been five or six years since I first saw this film, and I dig it out and rewatch it every year. Even though I can practically recite the story by now, every viewing still fills me with happiness.

Although the film is themed around Christmas, the feelings in it, the pursuit of loving and being loved, are surely universal values that people of every community can embrace. Beyond that, rewatching the film this year suddenly made me miss my master's days: the film opens and closes with airport scenes, and the data I used for intelligent surveillance during my master's was airport security footage. One of the events we needed to detect was "hugging". I watched a great many such samples back then, and they would remind me of Love Actually; the research felt a bit hard, yet watching those samples carried a little happiness. This year, watching the film made me miss running experiments, probably because it is simply good to be young.

TITLE: Beyond Skip Connections: Top-Down Modulation for Object Detection

AUTHOR: Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, Abhinav Gupta

ASSOCIATION: CMU, UC Berkeley, Google Research

FROM: arXiv:1612.06851

CONTRIBUTIONS

In this paper, top-down modulation is proposed as a way to incorporate fine details into the detection framework. The standard bottom-up, feedforward ConvNet is supplemented with a top-down modulation (TDM) network, connected via lateral connections. These connections are responsible for modulating lower-layer filters, while the top-down network handles the selection and integration of features.

METHOD

The idea of this work is very similar to that of Feature Pyramid Networks for Object Detection. An example of a Top-Down Modulation (TDM) network is illustrated in the following figure

TDM is integrated with the bottom-up network through lateral connections. $C_{i}$ are the bottom-up, feedforward feature blocks, and $L_{i}$ are the lateral modules that transform low-level features for the top-down contextual pathway. Finally, $T_{j,i}$ represents the flow of top-down information from index $j$ to $i$.

In this paper, the $T$ blocks are implemented as a single convolutional layer (with non-linear activation), optionally followed by an upsampling operation. The features from $C$ (processed by $L$) and from $T$ are concatenated and then sent to a convolutional layer for combination, as the following figure shows

At the training stage, one new pair of lateral and top-down modules is added at a time, and the network is trained again starting from a pre-trained model.
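The lateral/top-down combination above can be sketched in numpy, using 1x1 convolutions as stand-ins for the paper's convolutional layers; all channel counts, spatial sizes, and weights here are illustrative, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # 1x1 convolution as a channel-mixing matmul: (C_out, C_in) x (C_in, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def relu(x):
    return np.maximum(x, 0.0)

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (C, H, W) map
    return x.repeat(2, axis=1).repeat(2, axis=2)

c_ch, t_ch, out_ch = 256, 512, 256
c_i = rng.random((c_ch, 28, 28))   # bottom-up features C_i (finer resolution)
t_j = rng.random((t_ch, 14, 14))   # top-down features from index j (coarser)

w_l = rng.random((out_ch, c_ch)) * 0.01        # lateral module L_i
w_t = rng.random((out_ch, t_ch)) * 0.01        # top-down module T_{j,i}
w_out = rng.random((out_ch, 2 * out_ch)) * 0.01  # combination conv after concat

l = relu(conv1x1(c_i, w_l))               # L_i(C_i)
t = upsample2x(relu(conv1x1(t_j, w_t)))   # T_{j,i}, upsampled to match L_i
t_i = conv1x1(np.concatenate([l, t], axis=0), w_out)  # concat, then combine

print(t_i.shape)  # (256, 28, 28)
```

The output keeps the finer spatial resolution of $C_{i}$, which is how the fine details get carried into the top-down pathway.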

A while ago I watched Hacksaw Ridge. It felt like a very long time since I had seen a war film in such a classic style, and it was quite satisfying. On reflection, though, it may be less a war film than a religious one.

Jesus said, "I am the way, the truth, and the life," and the film unfolds in roughly those three parts. The "way" covers Desmond Doss before he enlists: how he received a revelation after accidentally injuring his brother, how he resolved never to touch a weapon after his father nearly shot his mother, and how he went on to become a devout believer in the church. The "truth" takes place in the army: Jesus tells us not to kill, but should the enemy be killed, and where does the truth lie? When Desmond passes the trial, a miracle descends: he saves dozens of his comrades. Three moments in the film left a deep impression on me. The first is Desmond washing the blood from his body, which is a rite of baptism. The second is Desmond coming down from the cliff while his comrades reach out to touch him, which must be a canonization, everyone welcoming a saint. The third is that he still gets lightly punished for working on Saturday.

TITLE: Feature Pyramid Networks for Object Detection

AUTHOR: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie

ASSOCIATION: Facebook AI Research, Cornell University and Cornell Tech

FROM: arXiv:1612.03144

CONTRIBUTIONS

A new top-down architecture with lateral connections is proposed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications.

METHOD

The idea of this work is simple and is illustrated in the following figure

Similar to SSD, predictions are made at different scales, but here connections exist between the scales. Although the author calls them lateral connections, they are effectively skip connections between different layers, much like those in ResNet.
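The top-down pathway with lateral connections can be sketched in numpy, again using 1x1 convolutions as stand-ins for the paper's lateral convs and nearest-neighbor 2x upsampling for the merge; the channel counts and sizes below are illustrative (e.g. ResNet-style C2..C5 widths), not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # 1x1 convolution as a channel-mixing matmul: (C_out, C_in) x (C_in, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of an (C, H, W) map
    return x.repeat(2, axis=1).repeat(2, axis=2)

in_channels = [256, 512, 1024, 2048]   # bottom-up widths, fine -> coarse
sizes = [64, 32, 16, 8]                # corresponding spatial sizes
out_ch = 256                           # common pyramid width

feats = [rng.random((c, s, s)) for c, s in zip(in_channels, sizes)]
lat_w = [rng.random((out_ch, c)) * 0.01 for c in in_channels]

# Lateral 1x1 convs reduce every level to the common channel width.
laterals = [conv1x1(f, w) for f, w in zip(feats, lat_w)]

# Top-down pathway: start from the coarsest map and merge downwards by
# element-wise addition, the skip-connection-style merge noted above.
for i in range(len(laterals) - 1, 0, -1):
    laterals[i - 1] = laterals[i - 1] + upsample2x(laterals[i])

print([p.shape for p in laterals])
```

Every pyramid level ends up with the same channel width, so a single prediction head can be shared across scales.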

Yesterday I was working on a mixed build of mxnet and caffe, which kept me up until almost five in the morning, and then I slept groggily until eleven or twelve today. When I got up and wondered what to eat, I found the Shin Ramyun I had bought a while ago piled in a corner, and thought, why not make a cheese tteokbokki hot pot? So I went to the supermarket and bought a pile of ingredients: kimchi, Korean chili paste, zucchini, rice cakes, enoki mushrooms, luncheon meat... I chopped everything up haphazardly and basically stewed it all together. It did not look great, but luckily the kimchi tasted authentic, so it turned out rather good.

TITLE: Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

AUTHOR: Zifeng Wu, Chunhua Shen, Anton van den Hengel

ASSOCIATION: The University of Adelaide

FROM: arXiv:1611.10080

CONTRIBUTIONS

  1. A further developed intuitive view of ResNets is introduced, which helps to understand their behaviour and find possible directions for further improvement.
  2. A group of relatively shallow convolutional networks is proposed based on this new understanding. Some of them achieve state-of-the-art results on the ImageNet classification dataset.
  3. The impact of using different networks on the performance of semantic image segmentation is evaluated; as pre-trained features, these networks can boost existing algorithms considerably.

SUMMARY

For residual unit $i$, let $y_{i-1}$ be the input, and let $f_{i}(\cdot)$ be its trainable non-linear mapping, also named Block $i$. The output of unit $i$ is recursively defined as

$$ y_{i} = f_{i}(y_{i-1}, \omega_{i})+y_{i-1} $$

where $\omega_{i}$ denotes the trainable parameters, and $f_{i}(\cdot)$ is often two or three stacked convolution stages in a ResNet building block. The top-left network can then be formulated as

$$ y_{2} = y_{1}+f_{2}(y_{1},\omega_{2}) $$

$$ = y_{0}+f_{1}(y_{0},\omega_{1})+f_{2}(y_{0}+f_{1}(y_{0}, \omega_{1}), \omega_{2}) $$

Thus, in an SGD iteration, the backward gradients are:

$$ \Delta \omega_{2}=\frac{df_{2}}{d\omega_{2}}\cdot \Delta y_{2} $$

$$ \Delta y_{1}= \Delta y_{2} + f_{2}' \cdot \Delta y_{2} $$

$$ \Delta \omega_{1} = \frac{df_{1}}{d \omega_{1}} \cdot \Delta y_{2}+ \frac{df_{1}}{d \omega_{1}} \cdot f_{2}' \cdot \Delta y_{2} $$

Ideally, when the effective depth $l\geq2$, both terms of $\Delta \omega_{1}$ are non-zero, as illustrated by the bottom-left case. However, when the effective depth $l=1$, the second term goes to zero, as illustrated by the bottom-right case. When this happens, we say that the ResNet is over-deepened, and that it cannot be trained in a fully end-to-end manner, even with the shortcut connections.
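The two-term decomposition of $\Delta \omega_{1}$ can be checked numerically with toy residual blocks $f_{i}(y, \omega) = \omega y$ (an illustrative simplification; the values below are arbitrary):

```python
# Two stacked residual units with scalar blocks f_i(y, w) = w * y.
def forward(y0, w1, w2):
    y1 = y0 + w1 * y0          # unit 1: y1 = y0 + f1(y0, w1)
    y2 = y1 + w2 * y1          # unit 2: y2 = y1 + f2(y1, w2)
    return y2

y0, w1, w2 = 1.5, 0.3, 0.7
dy2 = 1.0                      # upstream gradient Δy2

# Analytic gradient following the two-term decomposition:
# Δω1 = (df1/dω1)·Δy2 + (df1/dω1)·f2'·Δy2, with df1/dω1 = y0 and f2' = w2.
dw1_analytic = y0 * dy2 + y0 * w2 * dy2

# Central finite-difference estimate of dy2/dw1 as a cross-check.
eps = 1e-6
dw1_numeric = (forward(y0, w1 + eps, w2) - forward(y0, w1 - eps, w2)) / (2 * eps)

print(abs(dw1_analytic - dw1_numeric) < 1e-6)  # True
```

The first term is the gradient flowing through the shortcut, the second the one flowing through Block 2; when the effective depth is too small, only the first survives.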

To summarize, shortcut connections enable us to train wider and deeper networks. As they grow to some point, we face a dilemma between width and depth. From that point on, going deeper actually yields a wider network with extra features that are not completely end-to-end trained; going wider literally yields a wider network, without changing its end-to-end characteristic.

The authors designed three kinds of network structures, as illustrated in the following figure

and the classification performance on the ImageNet validation set is shown below

TITLE: Weakly Supervised Cascaded Convolutional Networks

AUTHOR: Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, Luc Van Gool

ASSOCIATION: KU Leuven, Sharif Tech., UMBC, ETH Zürich

FROM: arXiv:1611.08258

CONTRIBUTIONS

A new architecture of cascaded networks is proposed to learn a convolutional neural network that handles the task without expensive human annotations.

METHOD

This work trains a CNN to detect objects using only image-level annotations, which tell what is present in an image. At the training stage, the inputs of the network are 1) the original image, 2) image-level labels, and 3) object proposals. At the inference stage, the image-level labels are excluded. The object proposals can be generated by any method, such as Selective Search or EdgeBoxes. Two different cascaded network structures are proposed.

Two-stage Cascade

The two-stage cascade network structure is illustrated in the following figure.

The first stage is a location network: a fully-convolutional CNN with global average pooling or global maximum pooling. In order to learn multiple classes for a single image, an independent loss function is used for each class. The class activation maps are used to select candidate boxes.
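Class activation maps follow from the linearity of global average pooling; a rough numpy sketch, where all shapes and weights are made-up stand-ins for the network's last conv features and classifier weights:

```python
import numpy as np

rng = np.random.default_rng(1)
K, H, W, C = 8, 7, 7, 3           # feature channels, spatial size, classes
feat = rng.random((K, H, W))       # last conv feature maps
w = rng.random((C, K))             # per-class weights after global pooling

gap = feat.mean(axis=(1, 2))                  # global average pooling -> (K,)
logits = w @ gap                              # per-class scores
cam = np.tensordot(w, feat, axes=([1], [0]))  # (C, H, W) class activation maps

# By linearity, pooling each class's CAM reproduces that class's score,
# so high-activation regions explain the classification decision.
print(np.allclose(cam.mean(axis=(1, 2)), logits))  # True
```

Thresholding each map then yields the candidate boxes for the next stage.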

The second stage is a multiple-instance learning network. Given a bag of instances

$$ x_{c}=\{x_{j} \ | \ j=1,\ldots,n\} $$

and a label set

$$ y_{c}=\{y_{i} \ | \ y_{i} \in \{0,1\}, \ i=1,\ldots,C \} $$

where each $x$ is one of the candidate boxes, $n$ is the number of candidate boxes, $C$ is the number of categories, and $\sum_{i=1}^{C}y_{i}$ counts the categories present in the image, the probabilities and loss can be defined as

$$ Score(I,f_{i})=\max(f_{i1},\ldots,f_{in}) $$

$$ P(I, f_{i}) = \frac{\exp(Score(I,f_{i}))}{\sum_{k=1}^{C}\exp(Score(I,f_{k}))} $$

$$ L_{MIL}(P,y) = - \sum_{i=1}^{C}y_{i}\log(P(I, f_i)) $$

In my understanding, only the box with the highest confidence in each category is penalized if it is wrong. Besides, the equations in the paper contain some mistakes.
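The max-over-boxes scoring and the MIL loss above can be sketched in a few lines of numpy; the box scores here are random stand-ins for the network outputs $f_{ij}$:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n = 4, 6                        # categories, candidate boxes
f = rng.normal(size=(C, n))        # f[i, j]: score of box j for category i
y = np.array([1, 0, 1, 0])         # image-level labels

score = f.max(axis=1)                      # Score(I, f_i) = max_j f_{ij}
p = np.exp(score) / np.exp(score).sum()    # softmax over categories
loss = -(y * np.log(p)).sum()              # L_MIL(P, y)

# Only the argmax box per category contributes to Score, so only the
# highest-scoring box in each labeled category receives gradient.
print(loss > 0)  # True
```

This makes the remark above concrete: the gradient of `loss` with respect to `f` is non-zero only at the per-category argmax entries.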

Three-stage Cascade

The three-stage cascade network structure adds a weak segmentation network between the two stages of the two-stage cascade network. It is illustrated in the following figure.

The weak segmentation network uses the results of the first stage as its supervision signal. $s_{ic}$ is defined as the CNN score for pixel $i$ and class $c$ in image $I$. The score is normalized using the softmax

$$ S_{ic}= \exp(s_{ic})/\sum_{k=1}^{C}\exp(s_{ik}) $$

Considering $y$ as the label set for image $I$, the loss function for the weakly supervised segmentation network is given by

$$ L_{seg}(S,G,y)=-\sum_{c=1}^{C}y_{c}\log(S_{t_{c}c}) - \sum_{i \in I_{s}} \alpha_{i}\log(S_{t_{c}G_{i}}) $$

$$ \text{where} \ \ \ \ \ t_{c} = \mathop{argmax}_{i \in I} S_{ic} $$

and $G_{i}$ is the supervision map for the segmentation from the first stage.

SOME IDEAS

This work requires little annotation: the only annotation is the image-level label. However, this kind of training still needs complete annotation. For example, to detect 20 categories, we need a 20-d vector to annotate each image. What if we only know the status of 10 of the 20 categories in an image?

During this period, Silla on the Korean peninsula allied with the Tang dynasty, expanded its power, and pressed Baekje. The court decided to aid Baekje, so in 661 the empress personally traveled to Kyushu, but she passed away before the expedition set out. That year, Naka no Ōe ruled in the name of the crown prince without formally ascending the throne, and then in 663 came the Battle of Baekgang.

— Kojima Tsuyoshi, 《东大爸爸写给我的日本史》 (A History of Japan My Todai Dad Wrote for Me)

History is always written by people, so a truly objective history may not exist at all. Take the events described above: most accounts in China say Japan's aim was to invade Korea, but the Japanese account is that it was helping Baekje. Whether it counts as invasion may itself depend on taking modern national borders as the standard, much like the author's account of the "rebellion of Aterui":

But did Aterui really launch a "rebellion"? Resisting the "system of control" imposed by the emperor could perhaps be seen as rebellion, but the question is whether they had ever accepted the emperor's rule in the first place.

If today's Korean peninsula and the Japanese archipelago were one country, then today's history would say that the Tang dynasty invaded this "Korea-Japan". Recently people have also been debating whether the Yuan dynasty should count as part of Chinese history. Perhaps history should be viewed from a broader scope, rather than carving national histories along present-day borders; this is Kojima Tsuyoshi's view of history as well.

《东大爸爸写给我的日本史》 is probably the first book I could not fully understand yet insisted on finishing. I could not understand it because I knew almost none of the historical facts in it; the book reads more like a supplementary text for readers who already have some grounding in Japanese history. What kept me from putting it down was the author's own reflections on history, along with his thoughts on "benevolence and morality", Lu Xun, and "man eating man". These elements are very familiar to us Chinese, and seeing a foreigner's take on them is itself a very interesting thing.