
This film was released in China a while ago, but the domestic cut was butchered: nearly every bloody shot was removed. For a film like this, cuts on that scale all but ruin it and drain the fun out of watching. After a long wait, I finally found time today to see the full version. How to put it: overall it feels somewhat mediocre; it passes, but it falls short of excellent.

In the previous Alien film, Prometheus, the director took on a whole series of philosophical questions: the creators, the origin of humanity, artificial intelligence. Why would the creators want to exterminate humanity? Humans challenge their gods in search of their own origin; the creators engineer the xenomorph to wipe out humanity and are destroyed by it instead; who is the creator above the creators? By building sophisticated artificial intelligence, humans become creators themselves; the android clashes with humans in pursuit of his own freedom; feminism, anarchism, antitrust... The plot had holes, but it opened up countless thought-provoking ideas and left me really looking forward to Covenant.

In Covenant, however, all of that thinking is gone, and the xenomorph seems to regress from a science-fiction creature back to a B-movie monster. The two layers of creator-and-created relationships (the creators versus humans, and humans versus the android) are collapsed into a single one. The relationship between the creators and humanity is never explored; humanity's gods are exterminated by the android, but the extermination is far too easy. That a civilization advanced enough to create humans could be wiped out by one small trick is a logic I simply cannot accept. The android then becomes a creator himself, cross-breeding ever more advanced xenomorphs. This exists only to force a connection to the first Alien film, and the higher-level discussion is lost entirely.

On the anniversary of Japan's defeat, I watched an animated film depicting life inside wartime Japan, directed by Sunao Katabuchi.

In the film, Suzu is born in Hiroshima. From childhood she is a slow, dreamy girl whose greatest love is drawing. Before she seems to have even grown up, she reaches marrying age; arranged by her parents, she marries Shusaku Hojo and moves over the mountains to Kure to become the Hojo family's daughter-in-law.

Shusaku's sister Keiko, having fallen out with her mother-in-law, has moved back to her parents' home with her daughter Harumi. Unlike the gentle, obedient Suzu, Keiko is assertive and sharp-tongued; although she is often exasperated by Suzu's meekness, the two are in fact very close. Then the war begins: more and more warplanes cross the sky, shrill air-raid sirens keep shattering the quiet of the night, and the shadow of war creeps bit by bit over the Hojo family's life. In one air raid Harumi is killed and Suzu loses her right hand, never to draw again. Yet even in such a cruel time, the family never gives up its will to live.

The whole film feels very much like To Live: against the backdrop of their era, ordinary people can only endure without end and struggle to stay alive. Watching Japanese wartime films about ordinary people always brings a peculiar feeling: I sympathize with what the characters go through, yet also feel they had it coming. Many people now say "there are no innocent souls under the atomic bomb," and as a Chinese viewer I too feel the film whitewashes the war in some details. For example, one line says "it was violence that defeated Japan"; why not "it was justice that defeated Japan"? But on reflection, if my own country launched a war abroad, to what extent would I believe my country was in the wrong, and would I ever wish for its defeat? Perhaps more often than not we stand on the high ground of history and judge the people caught in it with the benefit of hindsight.

In This Corner of the World

TITLE: DSOD: Learning Deeply Supervised Object Detectors from Scratch

AUTHOR: Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, Xiangyang Xue

ASSOCIATION: Fudan University, Tsinghua University, Intel Labs China

FROM: arXiv:1708.01241

CONTRIBUTIONS

  1. DSOD, a framework that can train object detection networks from scratch with state-of-the-art performance, is presented.
  2. A set of design principles for building efficient object detection networks trained from scratch is introduced and validated through step-by-step ablation studies.
  3. DSOD achieves state-of-the-art performance on three standard benchmarks (PASCAL VOC 2007, 2012 and MS COCO) with real-time processing speed and more compact models.

METHOD

The critical limitations when adopting the pre-trained networks in object detection include:

  1. Limited structure design space. The pre-trained network models mostly come from the ImageNet classification task and are usually very heavy, containing a huge number of parameters. Existing object detectors adopt the pre-trained networks directly, so there is little flexibility to control or adjust the network structure, and the computing-resource requirement is likewise dictated by these heavy structures.
  2. Learning bias. Because both the loss functions and the category distributions of the classification and detection tasks differ, the search/optimization spaces differ as well. Learning may therefore be biased towards a local minimum that is not the best one for the detection task.
  3. Domain mismatch. Fine-tuning can mitigate the gap due to a different target category distribution, but it remains a severe problem when the source domain (ImageNet) has a huge mismatch with the target domain, such as depth images or medical images.

DSOD Architecture

The following table shows the architecture of DSOD

Framework

The proposed DSOD method is a multi-scale proposal-free detection framework similar to SSD. The design principles are as follows:

  1. Proposal-free. The authors observe that only proposal-free methods such as SSD and YOLO can converge successfully without pre-trained models. This may be because the RoI pooling layer in proposal-based methods hinders gradients from being smoothly back-propagated from the region level to the convolutional feature maps.
  2. Deep Supervision. The central idea is to provide an integrated objective function as direct supervision to the earlier hidden layers, rather than only at the output layer. These "companion" or "auxiliary" objective functions at multiple hidden layers mitigate the vanishing-gradient problem.
  3. Stem Block. Motivated by Inception-v3 and v4, the authors define the stem block as a stack of three $3 \times 3$ convolution layers followed by a $2 \times 2$ max pooling layer (see the sketch after this list). This simple stem structure reduces the information loss from raw input images compared with the original design in DenseNet.
  4. Dense Prediction Structure. The following figure compares the plain structure (as in SSD) with the proposed dense structure in the front-end sub-network. The dense prediction structure fuses multi-scale information for each scale, and each scale outputs the same number of channels for the prediction feature maps. In DSOD, in each scale (except scale 1), half of the feature maps are learned from the previous scale with a series of conv-layers, while the remaining half are directly down-sampled from the contiguous high-resolution feature maps.

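As a concrete illustration of the stem block in principle 3, here is a minimal PyTorch sketch. The channel widths, strides and BN/ReLU placement are illustrative assumptions rather than the exact configuration from the paper's table.

```python
import torch
import torch.nn as nn

class StemBlock(nn.Module):
    """DSOD-style stem: three 3x3 convolutions followed by 2x2 max pooling.
    Channel widths (64, 64, 128) are assumptions for illustration."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.stem = nn.Sequential(
            # first 3x3 conv with stride 2 halves the spatial resolution
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            # 2x2 max pooling halves the resolution again
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.stem(x)

# a 300x300 input (as in SSD-style detectors) becomes a 75x75 feature map
print(StemBlock()(torch.randn(1, 3, 300, 300)).shape)  # [1, 128, 75, 75]
```
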
Framework

Ablation Study

The following table shows the effectiveness of various designs on the VOC 2007 test set.

Framework

transition w/o pooling increases the number of dense blocks without reducing the final feature-map resolution. hi-comp factor is the compression factor in the transition layers that controls how much the number of feature maps is reduced. wide bottleneck means more channels in the bottleneck layers. wide 1st conv-layer means that more channels are kept in the first conv-layer. big growth rate refers to the DenseNet parameter that controls how many channels each dense block adds.

Some Ideas

If we train Faster R-CNN with non-approximate joint training, maybe we can also train the detector from scratch.

TITLE: ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression

AUTHOR: Jian-Hao Luo, Jianxin Wu, Weiyao Lin

ASSOCIATION: Nanjing University, Shanghai Jiao Tong University

FROM: arXiv:1707.06342

CONTRIBUTIONS

  1. A simple yet effective framework, namely ThiNet, is proposed to simultaneously accelerate and compress CNN models.
  2. Filter pruning is formally established as an optimization problem, and statistics computed from the next layer are used to prune the filters of the current layer.

METHOD

Framework

The framework of the ThiNet compression procedure is illustrated in the following figure. The yellow dotted boxes are the weak channels and their corresponding filters that would be pruned.

Framework

  1. Filter selection. The output of layer $i + 1$ is used to guide the pruning in layer $i$. The key idea is: if a subset of channels in layer $(i + 1)$’s input can approximate the output in layer $i + 1$, the other channels can be safely removed from the input of layer $i + 1$. Note that one channel in layer $(i + 1)$’s input is produced by one filter in layer $i$, hence the corresponding filter in layer $i$ can be safely pruned.
  2. Pruning. The weak channels in layer $(i + 1)$'s input and their corresponding filters in layer $i$ are pruned away, leading to a much smaller model. Note that the pruned network has exactly the same structure, only with fewer filters and channels.
  3. Fine-tuning. Fine-tuning is a necessary step to recover the generalization ability damaged by filter pruning. To save time, the model is fine-tuned for only one or two epochs after each layer is pruned; once all layers have been pruned, additional epochs are run to obtain an accurate model.
  4. Iterate: go back to step 1 to prune the next layer.

Data-driven channel selection

Denote the convolution process in layer $i$ as a triplet

$$
\left\langle \mathscr{I}_{i}, \mathscr{W}_{i}, \ast \right\rangle
$$

where $\mathscr{I}_{i}$ is the input tensor, which has $C$ channels, $H$ rows and $W$ columns, $\mathscr{W}_{i}$ is a set of filters with $K \times K$ kernel size that generates a new tensor with $D$ channels, and $\ast$ denotes the convolution operation. Note that if a filter in $\mathscr{W}_{i}$ is removed, its corresponding channel in $\mathscr{I}_{i+1}$ and $\mathscr{W}_{i+1}$ is also discarded. However, since the number of filters in layer $i + 1$ is unchanged, the size of its output tensor, i.e., $\mathscr{I}_{i+2}$, stays exactly the same. If we can remove several filters that have little influence on $\mathscr{I}_{i+2}$ (which is also the output of layer $i + 1$), they will have little influence on the overall performance too.

Collecting training examples

The training set is randomly sampled from the tensor $\mathscr{I}_{i+2}$ as illustrated in the following figure.

Sampling

The convolution operation can be written in a simplified per-channel form

$$ \hat{y} = \sum_{c=1}^{C} \hat{x}_{c} $$

where $\hat{x}_{c}$ is the partial sum contributed by the $c$-th channel of the input.

A greedy algorithm for channel selection

Given a set of $m$ training examples $\left\{ (\hat{\mathrm{x}}_{i}, \hat{y}_{i}) \right\}$, channel selection can be formulated as an optimization problem,

Objective

which is equivalent to the following alternative objective,

Objective

where $S \cup T = \left\{ 1, 2, \dots, C \right\}$ and $S \cap T = \emptyset$. This problem can be solved with a greedy algorithm.

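The following is a minimal NumPy sketch of that greedy procedure, assuming the sampled training examples are arranged in an $(m, C)$ array whose entry $(i, c)$ is channel $c$'s contribution $\hat{x}_{i,c}$. The function name and the fixed number of channels to remove are illustrative choices, not from the paper's code.

```python
import numpy as np

def greedy_channel_selection(x_hat, num_to_remove):
    """Greedily build the subset T of channels to prune.

    x_hat: (m, C) array with x_hat[i, c] = contribution of channel c to the
           i-th sampled output, so that y_hat[i] = x_hat[i].sum().
    Each step adds the channel that keeps sum_i (sum_{j in T} x_hat[i, j])^2
    smallest, i.e. the channels whose joint contribution is most negligible.
    """
    m, C = x_hat.shape
    T = []                    # channels to prune
    partial = np.zeros(m)     # running sum over channels already in T
    for _ in range(num_to_remove):
        candidates = [c for c in range(C) if c not in T]
        costs = [np.sum((partial + x_hat[:, c]) ** 2) for c in candidates]
        best = candidates[int(np.argmin(costs))]
        T.append(best)
        partial += x_hat[:, best]
    S = [c for c in range(C) if c not in T]   # channels to keep
    return S, T

# toy usage: 5 channels, remove 2
rng = np.random.default_rng(0)
keep, drop = greedy_channel_selection(rng.normal(size=(100, 5)), num_to_remove=2)
print(keep, drop)
```
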
Minimize the reconstruction error

After the subset of channels to keep has been determined, a scaling factor is learned for each retained filter's weights to minimize the reconstruction error.

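This is an ordinary least-squares problem, so a minimal sketch could look as follows; `fit_channel_scales` is a hypothetical helper, and the learned factors would then be folded into the corresponding filter weights.

```python
import numpy as np

def fit_channel_scales(x_kept, y_hat):
    """Least-squares scaling factors for the kept channels.

    x_kept: (m, |S|) contributions of the kept channels;
    y_hat:  (m,) outputs sampled from layer i + 1.
    Returns w such that x_kept @ w best approximates y_hat.
    """
    w, *_ = np.linalg.lstsq(x_kept, y_hat, rcond=None)
    return w

rng = np.random.default_rng(0)
x_kept = rng.normal(size=(100, 3))
y_hat = x_kept @ np.array([1.2, 0.8, 1.0]) + 0.01 * rng.normal(size=100)
print(fit_channel_scales(x_kept, y_hat))  # close to [1.2, 0.8, 1.0]
```
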
Some Ideas

  1. Maybe fine-tuning can make the final step of minimizing the reconstruction error unnecessary.
  2. If this method is applied to non-classification tasks such as detection and segmentation, the performance remains to be checked.

TITLE: Deformable Part-based Fully Convolutional Network for Object Detection

AUTHOR: Taylor Mordan, Nicolas Thome, Matthieu Cord, Gilles Henaff

FROM: arXiv:1707.06175

CONTRIBUTIONS

  1. Deformable Part-based Fully Convolutional Network (DP-FCN), an end-to-end model integrating ideas from DPM into region-based deep ConvNets for object detection, is proposed.
  2. A new deformable part-based RoI pooling layer is introduced, which explicitly selects discriminative elements of objects around region proposals by simultaneously optimizing latent displacements of all parts.
  3. Another improvement is the design of a deformation-aware localization module, a specific module exploiting configuration information to refine localization.

METHOD

R-FCN is the work closest to DP-FCN. Both are developed on the basis of Faster R-CNN, in which an RPN is used to generate object proposals and a dedicated pooling layer is used to extract features for classification and localization. The architecture of DP-FCN is illustrated in the following figure. A deformable part-based RoI pooling layer follows an FCN backbone, and two branches then predict the category and the location respectively. The output of the backbone FCN is similar to that in R-FCN: it has $ k^2(C+1) $ channels corresponding to $ k \times k $ parts and $ C $ categories plus background.

DP-FCN

Deformable part-based RoI pooling

For each input channel, just like what has been done in DPM, a transformation is carried out to spread high responses to nearby locations, taking into account the deformation costs.

Deformable part-based RoI pooling

In my understanding, the region proposal from the RPN works like the root filter in DPM. The proposal is evenly divided into $ k \times k $ sub-regions, and each sub-region is then displaced with the deformation cost taken into account (see the sketch below). The displacements computed during the forward pass are stored and used to back-propagate gradients at the same locations.

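The following is a rough NumPy sketch of that forward computation for a single part. The displacement search window, the use of average pooling inside the box, and the quadratic cost weight `lam` are illustrative assumptions; the paper's layer learns its deformation trade-off end-to-end rather than using a fixed penalty.

```python
import numpy as np

def deform_part(score_map, part_box, max_disp=4, lam=0.1):
    """Try small displacements of one part's sub-region and keep the one that
    maximizes (mean response inside the shifted box) - lam * ||d||^2."""
    H, W = score_map.shape
    x0, y0, x1, y1 = part_box
    best_score, best_d = -np.inf, (0, 0)
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            nx0, ny0, nx1, ny1 = x0 + dx, y0 + dy, x1 + dx, y1 + dy
            if nx0 < 0 or ny0 < 0 or nx1 > W or ny1 > H:
                continue  # keep the displaced box inside the feature map
            response = score_map[ny0:ny1, nx0:nx1].mean()
            score = response - lam * (dx * dx + dy * dy)
            if score > best_score:
                best_score, best_d = score, (dx, dy)
    return best_score, best_d   # the displacement is stored for the backward pass

rng = np.random.default_rng(0)
print(deform_part(rng.normal(size=(40, 40)), part_box=(10, 10, 20, 20)))
```
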
Classification and localization predictions with deformable parts

Predictions are performed with two sibling branches for classification and relocalization of region proposals as is common practice. The classification branch is simply composed of an average pooling followed by a SoftMax layer.

Deformation-aware localization refinement

As for location prediction, every part predicts 4 localization values. In addition, the displacements of the parts are fed into two fully connected layers, and the result is element-wise multiplied with those values to yield the final localization output for each class.

TITLE: ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

AUTHOR: Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun

ASSOCIATION: Megvii Inc (Face++)

FROM: arXiv:1707.01083

CONTRIBUTIONS

  1. Two operations, pointwise group convolution and channel shuffle, are proposed to greatly reduce computation cost while maintaining accuracy.

Motivation

In MobileNet and other works, efficient depthwise separable convolutions or group convolutions strike an excellent trade-off between representation capability and computational cost. However, neither design fully takes the $ 1 \times 1 $ convolutions (also called pointwise convolutions in MobileNet) into account, and these require considerable complexity.

Channel Shuffle for Group Convolutions

In order to address this issue, a straightforward solution is to apply group convolutions on the $ 1 \times 1 $ layers, as is done on the $ 3 \times 3 $ layers in MobileNet. However, if multiple group convolutions are stacked together, there is one side effect: the outputs of a certain channel are derived from only a small fraction of the input channels. This property blocks information flow between channel groups and weakens representation. To allow a group convolution to obtain input data from different groups, the channels of the feature map generated by the previous group layer can first be divided, within each group, into several subgroups, and each group in the next layer is then fed with different subgroups. This can be implemented by reshaping the previous layer's output channel dimension into $ (g, n) $, transposing, and then flattening it back as the input of the next layer. This is called the channel shuffle operation and is illustrated in the following figure; a short code sketch follows the figure.

Channel Shuffle

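A minimal PyTorch sketch of this reshape-transpose-flatten operation; the toy example at the end just makes the reordering visible.

```python
import torch

def channel_shuffle(x, groups):
    """Reshape the channel dimension to (groups, n), transpose, and flatten
    back, so the next group convolution sees channels from every group."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)   # (N, g, n, H, W)
    x = x.transpose(1, 2).contiguous()         # (N, n, g, H, W)
    return x.view(n, c, h, w)                  # flatten back to (N, C, H, W)

x = torch.arange(12).view(1, 12, 1, 1)
print(channel_shuffle(x, groups=3).flatten().tolist())
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```
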
ShuffleNet Unit

The following figure shows the ShuffleNet Unit.

ShuffleNet Unit

In the figure, (a) is the building block in ResNeXt and (b) is the building block in ShuffleNet. Given the input size $ c \times h \times w $ and the bottleneck channels $ m $, ResNeXt has $ hw(2cm+9m^2/g) $ FLOPs, while ShuffleNet needs only $ hw(2cm/g+9m) $ FLOPs.

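As an illustration with arbitrarily chosen numbers, take $ h = w = 28 $, $ c = 240 $, $ m = 60 $ and $ g = 3 $:

$$ hw(2cm + 9m^{2}/g) = 784 \times (28800 + 10800) \approx 31.0\mathrm{M}, \qquad hw(2cm/g + 9m) = 784 \times (9600 + 540) \approx 7.9\mathrm{M}, $$

so at this setting the ShuffleNet unit needs roughly a quarter of the ResNeXt unit's FLOPs.
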
Network Architecture

Network Architecture

Comparison

Comparison

TITLE: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

AUTHOR: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam

ASSOCIATION: Google

FROM: arXiv:1704.04861

CONTRIBUTIONS

  1. A class of efficient models called MobileNets for mobile and embedded vision applications is proposed, based on a streamlined architecture that uses depthwise separable convolutions to build light-weight deep neural networks.
  2. Two simple global hyper-parameters that efficiently trade off between latency and accuracy are introduced.

MobileNet Architecture

The core layer of MobileNet is the depthwise separable filter, named the depthwise separable convolution. The network structure is another factor that boosts performance. Finally, the width and resolution multipliers can be tuned to trade off between latency and accuracy.

Depthwise Separable Convolution

Depthwise separable convolution is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a $1 \times 1$ convolution called a pointwise convolution. In MobileNet, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution then applies a $ 1 \times 1 $ convolution to combine the outputs of the depthwise convolution. The following figure illustrates the difference between a standard convolution and a depthwise separable convolution; a code sketch follows the figure.

Difference between Standard Convolution and Depthwise Separable Convolution

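A minimal PyTorch sketch of one depthwise separable block, assuming the usual conv-BN-ReLU ordering after both the depthwise and pointwise convolutions; the channel sizes in the usage line are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise convolution (one filter per input channel) followed by
    a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # depthwise: groups = in_channels gives one 3x3 filter per channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, 3, stride=stride,
                                   padding=1, groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        # pointwise: 1x1 convolution combining the depthwise outputs
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

print(DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 112, 112)).shape)
# torch.Size([1, 64, 112, 112])
```
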
The standard convolution has the computation cost of

$$ D_{k} \cdot D_{k} \cdot M \cdot N \cdot D_{F} \cdot D_{F} $$

Depthwise separable convolution costs

$$ D_{k} \cdot D_{k} \cdot M \cdot D_{F} \cdot D_{F} + M \cdot N \cdot D_{F} \cdot D_{F} $$

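Here $D_{K}$ is the kernel size, $M$ and $N$ are the numbers of input and output channels, and $D_{F}$ is the spatial size of the feature map. Dividing the two costs gives the reduction factor

$$ \frac{D_{K} \cdot D_{K} \cdot M \cdot D_{F} \cdot D_{F} + M \cdot N \cdot D_{F} \cdot D_{F}}{D_{K} \cdot D_{K} \cdot M \cdot N \cdot D_{F} \cdot D_{F}} = \frac{1}{N} + \frac{1}{D_{K}^{2}} $$

so with $3 \times 3$ kernels a depthwise separable convolution uses roughly 8 to 9 times less computation than a standard convolution.
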
MobileNet Structure

The following table shows the structure of MobileNet

MobileNet Structure

Width and Resolution Multiplier

The width multiplier is used to reduce the number of channels in each layer, and the resolution multiplier is used to reduce the input resolution of the network.

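In the paper, with width multiplier $ \alpha \in (0, 1] $ (thinning $M$ and $N$ to $\alpha M$ and $\alpha N$) and resolution multiplier $ \rho \in (0, 1] $ (shrinking $D_{F}$ to $\rho D_{F}$), the depthwise separable cost becomes

$$ D_{K} \cdot D_{K} \cdot \alpha M \cdot \rho D_{F} \cdot \rho D_{F} + \alpha M \cdot \alpha N \cdot \rho D_{F} \cdot \rho D_{F} $$

which amounts to roughly an $ \alpha^{2} \rho^{2} $ reduction in computation.
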
Comparison

Comparison