
TITLE: Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

AUTHOR: Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick

ASSOCIATION: Cornell University, Microsoft Research

FROM: arXiv:1512.04143

CONTRIBUTIONS

  1. The ION architecture is introduced, which leverages context and multi-scale skip pooling for object detection, using information both inside and outside the ROI to determine the detection result.

METHOD

The main steps of the method are shown in the following figure.

  1. The image is first fed into a CNN, e.g. VGG16.
  2. ROI proposals are generated in the same way as in Fast R-CNN.
  3. Information within the ROI is extracted by ROI pooling on feature maps from convolutional layers at different scales.
  4. Information outside the ROI is extracted by two successive 4-direction IRNNs, and ROI pooling is then used to extract the context features.
  5. The pooled features are L2-normalized and concatenated; a 1x1 conv layer is then used to reduce the dimension.
  6. Two branches are learned to predict category and location.
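Steps 3-5 can be sketched in plain Python. This is a minimal toy, not the paper's implementation: the feature vectors, the per-layer rescaling factors (which ION learns), and the final 1x1 convolution (here only noted in a comment) are illustrative assumptions.

```python
import math

def l2_normalize(v, scale=1.0, eps=1e-12):
    """L2-normalize a feature vector, then rescale (ION learns the scale)."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [scale * x / norm for x in v]

def skip_pool(roi_features, scales):
    """Concatenate L2-normalized ROI features pooled from several layers.

    roi_features: list of per-layer feature vectors for one ROI.
    scales: one rescaling factor per layer (assumed values here).
    """
    fused = []
    for feat, s in zip(roi_features, scales):
        fused.extend(l2_normalize(feat, s))
    return fused  # a 1x1 conv would then project this back down

# Toy features pooled from conv3, conv4, conv5 and the IRNN context layer
pooled = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [2.0, 0.0]]
fused = skip_pool(pooled, scales=[1.0, 1.0, 1.0, 1.0])
print(len(fused))  # 8
```

Normalizing before concatenation matters because features from different layers have very different magnitudes; without it the largest-scale layer dominates the fused descriptor.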

Some Details

A 4-direction IRNN contains four independent IRNNs, each moving in a different direction (left, right, up, and down). The internal IRNN computations are split into separate logical layers: the input-to-hidden transition is implemented as a 1x1 convolution, so its computation can be shared across directions.
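A minimal sketch of this recurrence in plain Python, assuming the input-to-hidden 1x1 convolution has already been applied and the recurrent weights stay at their identity initialization, so each step reduces to h_t = ReLU(h_{t-1} + x_t) on scalars:

```python
def irnn_sweep(row):
    """One IRNN direction over a 1-D sequence of (pre-computed) inputs.

    With the recurrent weight fixed at the identity, the update is
    h_t = ReLU(h_{t-1} + x_t); the learned 1x1 input transition is
    assumed to have produced the x values already.
    """
    h, out = 0.0, []
    for x in row:
        h = max(0.0, h + x)  # ReLU(identity * h + x)
        out.append(h)
    return out

def four_direction_irnn(grid):
    """Run independent left/right/up/down sweeps over a 2-D grid."""
    left = [irnn_sweep(r) for r in grid]
    right = [irnn_sweep(r[::-1])[::-1] for r in grid]
    cols = [list(c) for c in zip(*grid)]
    down = [list(r) for r in zip(*[irnn_sweep(c) for c in cols])]
    up = [list(r) for r in zip(*[irnn_sweep(c[::-1])[::-1] for c in cols])]
    return left, right, down, up

grid = [[1.0, -1.0], [2.0, 3.0]]
l, r, d, u = four_direction_irnn(grid)
print(l)  # [[1.0, 0.0], [2.0, 5.0]]
```

After one such 4-direction pass, every output cell has seen an entire row or column; stacking a second pass (as ION does) lets every cell depend on the whole image.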

ADVANTAGES

  1. The proposed detector works better on small objects than previous methods.
  2. Both local and global information are taken into account.
  3. Skip pooling uses information from different scales.
  4. Two successive 4-direction IRNNs cover information from the whole image.

TITLE: Semantic Object Parsing with Graph LSTM

AUTHOR: Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, Shuicheng Yan

ASSOCIATION: National University of Singapore, Sun Yat-sen University, Adobe Research

FROM: arXiv:1603.07063

CONTRIBUTIONS

  1. A novel Graph LSTM structure is proposed to handle general graph-structured data; it effectively exploits global context via superpixels extracted by over-segmentation.
  2. A confidence-driven scheme is proposed to select the starting node and the order of updating sequences.
  3. In each Graph LSTM unit, different forget gates for the neighboring nodes are learned to dynamically incorporate the local contextual interactions in accordance with their semantic relations.

METHOD

The main steps of the method are shown in the following figure.

  1. The input image first passes through a stack of convolutional layers to generate the convolutional feature maps.
  2. The convolutional feature maps are further used to generate an initial semantic confidence map for each pixel.
  3. The input image is over-segmented to multiple superpixels. For each superpixel, a feature vector is extracted from the upsampled convolutional feature maps.
  4. The first Graph LSTM takes the feature vector of every superpixel as input to compute a better state.
  5. The second Graph LSTM takes the feature vector of every superpixel and the output of first Graph LSTM as input.
  6. The update sequence of the superpixels follows the initial confidences of the superpixels.
  7. Several 1×1 convolution filters are employed to produce the final parsing results.
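The confidence-driven ordering in step 6 can be sketched as a simple sort. This toy assumes a single scalar confidence per superpixel, a simplification of the paper's per-class confidence maps:

```python
def update_order(confidences):
    """Confidence-driven scheme (a sketch): the superpixel with the
    highest initial confidence is updated first, then the rest in
    descending order of confidence."""
    return sorted(range(len(confidences)),
                  key=lambda i: confidences[i], reverse=True)

# Toy initial confidences from the convolutional maps, one per superpixel
print(update_order([0.2, 0.9, 0.5]))  # [1, 2, 0]
```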

some details

A graph structure is built on the superpixels: the nodes are the superpixels, and two nodes are linked when the corresponding superpixels are adjacent. The history information used by the Graph LSTM for one superpixel comes from its adjacent superpixels.
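The graph construction described above can be sketched as follows; 4-connectivity between pixels is an assumption made here for illustration:

```python
def superpixel_graph(labels):
    """Build the Graph LSTM topology: superpixels are nodes, and two
    nodes are linked when their pixels touch (4-connectivity assumed).

    labels: 2-D list mapping each pixel to its superpixel id.
    Returns the set of undirected edges as (smaller_id, larger_id)."""
    h, w = len(labels), len(labels[0])
    edges = set()
    for i in range(h):
        for j in range(w):
            for di, dj in ((0, 1), (1, 0)):  # right and down neighbours
                ni, nj = i + di, j + dj
                if ni < h and nj < w and labels[i][j] != labels[ni][nj]:
                    a, b = sorted((labels[i][j], labels[ni][nj]))
                    edges.add((a, b))
    return edges

# Toy over-segmentation with three superpixels 0, 1, 2
labels = [[0, 0, 1],
          [0, 2, 1],
          [2, 2, 1]]
print(sorted(superpixel_graph(labels)))  # [(0, 1), (0, 2), (1, 2)]
```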

ADVANTAGES

  1. Constructed on superpixels generated by oversegmentation, the Graph LSTM is more naturally aligned with the visual patterns in the image.
  2. Adaptively learning the forget gates with respect to different neighboring nodes when updating the hidden states of a certain node is beneficial to model various neighbor connections.

TITLE: Object Detection from Video Tubelets with Convolutional Neural Networks

AUTHOR: Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang

ASSOCIATION: The Chinese University of Hong Kong

FROM: arXiv:1604.04053

CONTRIBUTIONS

  1. A complete multi-stage framework is proposed for object detection in videos.
  2. A special temporal convolutional neural network is proposed to incorporate temporal information into object detection from video.

METHOD

The main steps of the method are shown in the following figure.

  1. Image object proposal. Regions are generated in each frame by Selective Search and classified by an AlexNet over 200 categories, a method similar to R-CNN. Regions with scores lower than a threshold are removed, and the rest are kept as proposals.
  2. Object proposal scoring. The proposals are scored by a 30-category classifier derived from GoogLeNet, and the proposals with higher scores are kept.
  3. High-confidence proposal tracking. The proposals with higher scores are tracked, and overlapping proposals are suppressed using IOU. The resulting tracks are the tubelet proposals.
  4. Tubelet box perturbation and max-pooling. As the tracking result may drift, multiple regions are generated around each tubelet box. All the regions are sent to the CNN from step 2 and sorted by score; the region with the highest score replaces the original one in the tubelet.
  5. Temporal convolution and re-scoring. A Temporal Convolutional Network (TCN) is proposed that takes 1-D temporal features, including detection scores, tracking scores and anchor offsets, and generates temporally dense predictions on every tubelet box. Tubelet boxes with high detection scores are taken as detection results. However, the TCN is not well explained in this work.
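The IOU-based suppression in step 3 can be sketched with standard greedy non-maximum suppression; this stands in for whatever exact suppression the paper uses at tracking time:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def suppress(proposals, scores, thresh=0.5):
    """Greedy suppression: visit proposals in descending score order and
    drop any proposal overlapping an already-kept one above thresh."""
    order = sorted(range(len(proposals)),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(proposals[i], proposals[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(suppress(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```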

ADVANTAGES

  1. The TCN helps reduce the negative effect caused by the large variations of detection scores along the same track.

DISADVANTAGES

  1. Too many stages.
  2. Too many CNN operations.

TITLE: Chained Predictions Using Convolutional Neural Networks

AUTHOR: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

ASSOCIATION: UC Berkeley, Google

FROM: arXiv:1605.02346

CONTRIBUTIONS

  1. A chain model for structured outputs, such as human pose estimation. The output convolutional neural network is a multi-scale deconvolution, called "deception" by the authors because of its relationship to deconvolution and inception models.
  2. Two formulations of the chain model are proposed: one without weight sharing between different predictors (poses in images) and one with weight sharing (poses in videos).

METHOD

There are two formulations of the chain model in this work. The single-image formulation is taken as the example here; the video version follows a similar procedure.

The inference stage is illustrated in the figure. The input image is first fed to a CNN denoted CNNx. At every stage, a joint of the person is localized by a CNN denoted CNNy (the first output is denoted "Prediction@0"). Then both the input and the output of CNNy are used to predict the next joint at the next stage. The procedure can be formalized as:

$$h_t=\sigma(w_t^h \ast h_{t-1}+\sum_{i=0}^{t-1}w_{i,t}^y \ast e(y_i))$$

$$P(Y_t=y_t|X,y_0,…,y_{t-1})=Softmax(m_t(h_t))$$

where $h_0$=CNNx(x), $e(\cdot)$ is a full neural net, $m_t$ is the operation of CNNy on $h_t$, and $P$ is the probability of the location of a joint.
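A scalar toy of one chain step, with hypothetical stand-ins `embed` for $e(\cdot)$ and `score` for $m_t$; the paper uses convolutions and small nets, and plain multiplications are used here only to make the recurrence concrete:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def chain_step(h_prev, past_outputs, w_h, w_y, embed, score):
    """One chain step: fold the hidden state and the embeddings of all
    previously predicted joints into a new state, then score locations.

    embed plays the role of e(.), score the role of m_t; both are toy
    stand-ins for the networks in the paper."""
    s = w_h * h_prev + sum(w * embed(y) for w, y in zip(w_y, past_outputs))
    h = sigmoid(s)
    return h, softmax(score(h))

h0 = 0.5                            # would come from CNNx(image)
embed = lambda y: float(y)          # hypothetical e(.)
score = lambda h: [h, 1.0 - h]      # hypothetical m_t over two locations
h1, probs = chain_step(h0, [1], w_h=1.0, w_y=[0.5], embed=embed, score=score)
print(round(sum(probs), 6))  # 1.0
```

The key property the sketch preserves is that step t conditions on all previous predictions $y_0, \ldots, y_{t-1}$, not just the most recent one.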

ADVANTAGES

  1. Using chain models allows us to sidestep any assumptions about the joint distribution of the output variables.
  2. Jointly considering other structures can lead to better performance.
  3. Hand-crafted features are replaced by CNN, which can be learnt end-to-end.

DISADVANTAGES

  1. $e(\cdot)$ is not explained in this work.

TITLE: R-FCN: Object Detection via Region-based Fully Convolutional Networks

AUTHOR: Jifeng Dai, Yi Li, Kaiming He, Jian Sun

ASSOCIATION: MSRA, Tsinghua University

FROM: arXiv:1605.06409

CONTRIBUTIONS

  1. A framework called Region-based Fully Convolutional Network (R-FCN) is developed for object detection, which consists of shared, fully convolutional architectures.
  2. A set of position-sensitive score maps is introduced to enable the FCN to represent translation variance.
  3. A unique ROI pooling method is proposed to shepherd information from the mentioned score maps.

METHOD

  1. The image is processed by a fully convolutional network (FCN).
  2. At the end of the FCN, an RPN (Region Proposal Network) is used to generate ROIs.
  3. In parallel, a score map of $k^{2}(C+1)$ channels is generated using a bank of specialized convolutional layers.
  4. For each ROI, position-sensitive ROI pooling is utilized to generate a $C+1$ channel score map.
  5. The scores in the score map are averaged to vote for the category.
  6. Another $4k^{2}$-channel convolutional layer is learned for bounding box regression.
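Steps 4-5 can be sketched for a single class. This toy assumes integer bin boundaries and average pooling; the real layer handles fractional bins and all $C+1$ classes at once:

```python
def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive ROI pooling for one class (a sketch).

    score_maps: k*k maps, where map (i, j) is responsible only for
    bin (i, j) of the ROI. Each bin is average-pooled from its own
    map, then the k*k bin scores are averaged to vote."""
    x0, y0, x1, y1 = roi
    bw, bh = (x1 - x0) / k, (y1 - y0) / k
    votes = []
    for i in range(k):
        for j in range(k):
            m = score_maps[i * k + j]  # the map owning bin (i, j)
            ys = range(int(y0 + i * bh), int(y0 + (i + 1) * bh))
            xs = range(int(x0 + j * bw), int(x0 + (j + 1) * bw))
            vals = [m[y][x] for y in ys for x in xs]
            votes.append(sum(vals) / len(vals))
    return sum(votes) / len(votes)  # average voting for this class

# k = 2: four 4x4 score maps, each constant for easy checking
maps = [[[float(c)] * 4 for _ in range(4)] for c in (1, 2, 3, 4)]
print(ps_roi_pool(maps, (0, 0, 4, 4), k=2))  # 2.5
```

Because each spatial bin reads from a different map, the per-class score only fires when the object's parts appear in the right relative positions, which is how the otherwise translation-invariant FCN recovers spatial sensitivity.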

Training Details

  1. R-FCN is trained end-to-end with pre-computed region proposals. Both category and position are learnt with the loss function $L(s,t_{x,y,w,h})=L_{cls}(s_{c})+\lambda[c>0]L_{reg}(t)$
  2. For each image, N proposals are generated and B out of N proposals are selected to train weights according to the highest losses. B is set to 128 in this work.
  3. 4-step alternating training is utilized to realize feature sharing between R-FCN and RPN.
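The hard-example selection in step 2 is a small top-B operation; a sketch, assuming the per-ROI losses have already been computed in a forward pass:

```python
def select_hard_examples(losses, B=128):
    """Online hard example mining as used to train R-FCN: keep the B
    proposals with the highest per-ROI loss and back-propagate only
    through those."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:B])  # indices of the selected proposals

print(select_hard_examples([0.1, 0.9, 0.4, 0.7], B=2))  # [1, 3]
```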

ADVANTAGES

  1. It is fast (170ms/image, 2.5-20x faster than Faster R-CNN).
  2. End-to-end training is easier to process.
  3. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection.

DISADVANTAGES

  1. Compared with single-shot methods, more computational resources are needed.

Having recently become hooked on Keigo Higashino, I read several of his works in one go: after The Devotion of Suspect X and Journey Under the Midnight Sun, I went on to Phantom Night (Genya) and After School. Last time, writing about Journey Under the Midnight Sun, I felt Yukiho was an utterly cold and heartless person, while Ryoji was a puppet with no self at all, doing everything at her direction. Only after reading Phantom Night did I realize that Yukiho may have had real feelings for Ryoji; it is Mifuyu who truly plays everyone in the palm of her hand while having no trace of humanity in her heart. I forget where I saw this assessment of the two companion works: if Journey Under the Midnight Sun is a "book of utmost evil", then Phantom Night goes a step further and is a thoroughgoing "book of despair". That really hits the mark. Higashino himself once said: "I did not want Phantom Night to be a sequel to Journey Under the Midnight Sun; I hoped to leave some room, so that readers who have finished both books can happily wander through all kinds of imaginings." Still, the two stories inevitably invite speculation: is Mifuyu actually Yukiho? At least in the TV drama, the screenwriters decided she is. Perhaps in Journey Under the Midnight Sun, Yukiho, for all the evil she does, still evokes pity, because everything was done so she could be with Ryoji; for his sake she kept some humanity. But when Ryoji died, she lost her sun, fell completely into the boundless phantom night, and became Mifuyu.

Unlike Phantom Night, After School, though people die in it too, feels much sunnier to me, perhaps because the story is set in a girls' high school, a place radiating youthful energy. Another reason may be that it is written in the first person: I was naturally drawn into the narrator, who is kind-hearted and a teacher loved by his students. Ironically, I even felt somewhat relieved that the apparent plot to kill the narrator turned out to be only a decoy. But his "wife's" behavior was unsettling, and just when the story seemed headed for a perfect ending, her betrayal tore the decoy away and turned the feigned act into the real thing. It felt like Baz Luhrmann's Romeo + Juliet and The Great Gatsby: the despair at the story's end is what cuts deepest.

I have not read many mystery novels, but from what I have read, Japanese mysteries pay more attention to carving out characters' inner feelings, with abundant psychological description and characterization, and events unfold from the people. Western mysteries emphasize logic, and their brilliance lies in how events develop. I cannot say which is better, but Japanese writers' characterization is remarkably fine-grained, always carrying a fresh, understated Japanese sensibility. While on the subject, two Japanese films I watched recently: Little Forest and The Teacher and the Stray Cat. Both look boring at first glance: Little Forest is about cooking from start to finish, and The Teacher and the Stray Cat is about looking for a stray cat, yet Japanese films have a charm that keeps you watching. Whether it is Ichiko finally finding herself and returning to Komori with peace of mind, or the teacher serenely reuniting with his late wife, both leave a warm feeling, and each story slowly unfolds along the protagonist's inner transformation.

For me personally, the other works may just be entertainment, but Little Forest gave me something to think about. Ichiko, having struggled in the big city, reluctantly returns to her hometown of Komori and lives an idyllic life, but she knows in her heart that she cannot be content with having merely fled back; she needs to make a real decision to return to Komori. Looking at myself, I sometimes say there is no need to push so hard, that life is fine as it is, but I know that is only because pushing is so hard. When will I genuinely feel that life is good?

TITLE: SSD: Single Shot MultiBox Detector

AUTHOR: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

FROM: arXiv:1512.02325v2

CONTRIBUTIONS

  1. SSD, a single-shot detector for multiple categories, is introduced; it is fast and accurate.
  2. The network is easy to train: simple end-to-end training achieves high accuracy, even with relatively low-resolution input images, further improving the speed vs. accuracy trade-off.

METHOD

Network structure:

  1. Multiple scale feature maps from different layers are used in order to handle objects with different sizes.
  2. On each feature map used for detection, a unique small network (filter) is utilized to learn to predict category scores and location offsets.
  3. Each feature map corresponds to a fixed set of default boxes. These default boxes have different aspect ratios.

Training:

  1. Default and ground truth boxes are matched. Each ground truth box is matched to the default box with the best jaccard overlap; in addition, default boxes are matched to any ground truth with jaccard overlap higher than a threshold.

  2. The training objective is a weighted sum of the localization loss (loc) and the confidence loss (conf):

    $$ L(x,c,l,g)= \frac{1}{N}(L_{conf}(x,c)+ \alpha L_{loc}(x,l,g)) $$

    where N is the number of matched default boxes, the localization loss is the Smooth L1 loss between the predicted box $(l)$ and the ground truth box $(g)$ parameters, and the confidence loss is the softmax loss over multiple class confidences $(c)$.

  3. The scale of the default boxes for the $k$-th feature map is computed as:

    $$s_{k}=s_{min}+ \frac{s_{max}-s_{min}}{m-1}(k-1)$$

    where $s_{min}=0.2$ and $s_{max}=0.95$. The width of a default box is $s_{k}\sqrt{a_{r}}$ and the height is $s_{k}/\sqrt{a_{r}}$, where $a_{r}$ is the aspect ratio. The centre of a default box at location $(i, j)$ in the $k$-th feature map is $(\frac{i+0.5}{|f_{k}|}, \frac{j+0.5}{|f_{k}|})$.

  4. Hard negatives are extracted. The unmatched default boxes are sorted according to confidence and top ones are used as hard negatives so that the ratio between the negatives and positives is at most 3:1.

  5. Data augmentation is done by using the entire original input image and sampling a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
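The default-box geometry in step 3 follows directly from the formulas, using the note's $s_{min}=0.2$ and $s_{max}=0.95$; the function name and arguments are illustrative:

```python
import math

def default_box(k, m, i, j, f_k, aspect, s_min=0.2, s_max=0.95):
    """Geometry of one SSD default box.

    k: 1-based index of the feature map, m: number of feature maps,
    (i, j): cell location, f_k: feature map size, aspect: aspect ratio.
    Returns (centre_x, centre_y, width, height) in relative coordinates."""
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)        # box scale
    w, h = s_k * math.sqrt(aspect), s_k / math.sqrt(aspect)  # box size
    cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k                # box centre
    return cx, cy, w, h

# First of six feature maps, top-left cell of an 8x8 map, square box
print(default_box(k=1, m=6, i=0, j=0, f_k=8, aspect=1.0))
# (0.0625, 0.0625, 0.2, 0.2)
```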

ADVANTAGES

  1. It is fast because only a single network evaluation is needed and the input is of lower resolution.
  2. Multiple scale feature maps are used so that it can handle objects with different sizes.
  3. End-to-end training.

The past week was a bit busy. Not that there was much to do, but my program hit a bug whose cause I simply could not find. I worked overtime every day, including all of Saturday, and only roughly located where the problem was without understanding why. So my mood this week was poor, and I could not muster the spirit to read or draw. Today, with a colleague's help, I finally solved the problem; my mood instantly improved, I left work early, and I even watched a very long film: The Last Emperor.

I had long heard it was a classic and an Oscar-winning film, but I have never been keen on biographical movies, and the title sounded dull, so I never sat down to watch it. Yet once I started today, I found myself deeply drawn in, because it is not a dull film at all. The atmosphere is oppressive, but beneath the oppression runs a powerful undercurrent. Nearly three and a half hours long, it never felt tiring; instead it gripped me and immersed me naturally in the protagonist's world, looking back on that period of history from one person's perspective. When I reposted the film on Weibo I wrote one line of comment: this is the story of a "person". Exactly what it says: the story of a human being.