
TITLE: DetNet: A Backbone network for Object Detection

AUTHOR: Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun

ASSOCIATION: Tsinghua University, Face++

FROM: arXiv:1804.06215


  1. The inherent drawbacks of using traditional ImageNet pre-trained models to fine-tune recent object detectors are analyzed.
  2. A novel backbone, called DetNet, is proposed. It is designed specifically for the object detection task by maintaining the spatial resolution and enlarging the receptive field.



There are two problems with using a classification backbone for object detection tasks. (i) Recent detectors, e.g., FPN, involve extra stages compared with the backbone network for ImageNet classification in order to detect objects of various sizes. (ii) A traditional backbone obtains a larger receptive field through a large downsampling factor, which benefits visual classification. However, the spatial resolution is compromised, so the network fails to accurately localize large objects and to recognize small objects.

To summarize, there are 3 main problems with using current pre-trained models:

  1. The number of network stages is different. The extra layers that object detection adds on top of the classification network have not been pre-trained.
  2. Weak visibility of large objects. The feature maps with strong semantic information have large strides with respect to the input image, which is harmful for object localization.
  3. Invisibility of small objects. The information from small objects is easily weakened as the spatial resolution of the feature maps decreases and large context information is integrated.

To address these problems, DetNet has the following characteristics. (i) The number of stages is designed directly for object detection. (ii) Even though more stages are involved, high spatial resolution of the feature maps is maintained, while a large receptive field is kept by using dilated convolution.

DetNet Design

The main architecture of DetNet is based on ResNet-50. The first 4 stages are kept the same as in ResNet-50. The main differences are as follows:

  1. The extra stages are merged into the backbone and are later used for object detection as in FPN. Meanwhile, the spatial resolution stays fixed at 16x downsampling even after stage 4.
  2. Since the spatial size is fixed after stage 4, a dilated bottleneck with a $1 \times 1$ convolution projection is used at the beginning of each stage in order to introduce a new stage. The dilated convolution efficiently enlarges the receptive field.
  3. Since dilated convolution is still time consuming, stage 5 and stage 6 keep the same number of channels as stage 4 (256 input channels for the bottleneck block). This differs from traditional backbone design, which doubles the channels in each later stage.
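
The effect of fixing the stride after stage 4 can be checked with a little arithmetic. The sketch below is an illustration only: the 224-pixel input, the two stride lists (a backbone that keeps halving the resolution versus one that stops at 16x) and the helper names are assumptions chosen for the example, not values taken from the paper. It also uses the standard formula for the extent of a dilated kernel, $k + (k - 1)(d - 1)$, with dilation 2 picked as an example.

```python
def feature_map_sizes(input_size, strides):
    """Spatial size of each stage output for the given cumulative strides."""
    return [input_size // s for s in strides]


def effective_kernel(kernel, dilation):
    """Extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


# Assumed cumulative strides for five consecutive stages (illustrative values).
halving_strides = [4, 8, 16, 32, 64]   # resolution keeps shrinking
fixed_strides = [4, 8, 16, 16, 16]     # resolution fixed at 16x after stage 4

halving_sizes = feature_map_sizes(224, halving_strides)
fixed_sizes = feature_map_sizes(224, fixed_strides)

print("halving:", halving_sizes)
print("fixed  :", fixed_sizes)
print("3x3 kernel with dilation 2 covers", effective_kernel(3, 2), "pixels per side")
```

With the stride fixed, the last stages keep the same grid as stage 4, and the dilation is what makes up for the receptive field that the missing downsampling would otherwise have provided.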

The following figure shows the dilated bottleneck with $1 \times 1$ conv projection and the architecture of DetNet.




TITLE: Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks

AUTHOR: Xuepeng Shi, Shiguang Shan, Meina Kan, Shuzhe Wu, Xilin Chen

ASSOCIATION: Chinese Academy of Sciences

FROM: arXiv:1804.06039


  1. A real-time and accurate rotation-invariant face detector with progressive calibration networks (PCN) is proposed.
  2. PCN divides the calibration process into several progressive steps, each of which is an easy task, resulting in accurate calibration at low time cost. The range of rotation-in-plane (RIP) angles also decreases gradually, which helps distinguish faces from non-faces.
  3. In the first two stages of PCN, only coarse calibrations are conducted, such as calibrating from facing down to facing up, and from facing left to facing right. On the one hand, a robust and accurate RIP angle prediction for this coarse calibration is easier to attain without extra time cost, by learning the calibration task jointly with the classification task and the bounding box regression task in a multi-task manner. On the other hand, the calibration can be implemented simply by flipping the original image, at quite low time cost.



Given an image, all face candidates are obtained according to the sliding window and image pyramid principle, and each candidate window goes through the detector stage by stage. In each stage of PCN, the detector simultaneously rejects most candidates with low face confidences, regresses the bounding boxes of remaining face candidates, and calibrates the RIP orientations of the face candidates. After each stage, non-maximum suppression (NMS) is used to merge those highly overlapped candidates.
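
The NMS step after each stage can be sketched as follows. This is the common greedy variant, written under assumptions: the box format `(x1, y1, x2, y2)`, the threshold of 0.3 and the function names `iou` and `nms` are chosen for illustration and are not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates and one separate candidate.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first one strongly and has a lower score, so only the first and third candidates survive.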

PCN progressively calibrates the RIP orientation of each face candidate to upright for better distinguishing faces from non-faces.

  1. PCN-1 first identifies face candidates and calibrates those facing down to facing up, halving the range of RIP angles from [$-180^{\circ}$,$180^{\circ}$] to [$-90^{\circ}$, $90^{\circ}$].
  2. Then the rotated face candidates are further distinguished and calibrated to an upright range of [$-45^{\circ}$, $45^{\circ}$] in PCN-2, shrinking the RIP ranges by half again.
  3. Finally, PCN-3 makes the accurate final decision for each face candidate, determining whether it is a face and predicting the precise RIP angle.

The following figure illustrates the framework.


First Stage PCN-1

For each input window $x$, PCN-1 has three objectives: face or non-face classification, bounding box regression, and calibration, formulated as follows:

$$[f, t, g] = F_{1}(x)$$

where $F_{1}$ is the detector in the first stage, structured as a small CNN. Here $f$ is the face confidence score, $t$ is a vector representing the prediction of bounding box regression, and $g$ is the orientation score. Overall, the objective for PCN-1 in the first stage is defined as:

$$\min_{F_{1}} L = L_{cls} + \lambda_{reg} \cdot L_{reg} + \lambda_{cal} \cdot L_{cal}$$

where $\lambda_{reg}$ and $\lambda_{cal}$ are parameters that balance the different losses. The first objective, which is also the primary one, aims to distinguish faces from non-faces. The second objective attempts to regress the fine bounding box. The third objective aims to predict the coarse orientation of the face candidate as a binary classification, telling whether the candidate is facing up or facing down.

The PCN-1 can be used to filter all windows to get a small number of face candidates. For the remaining face candidates, firstly they are updated to the new regressed bounding boxes. Then the updated face candidates are rotated according to the predicted coarse RIP angles.

Second Stage PCN-2

Similar to PCN-1 in the first stage, PCN-2 in the second stage distinguishes faces from non-faces more accurately, regresses the bounding boxes, and calibrates the face candidates. The difference is that the coarse orientation prediction in this stage is a ternary classification of the RIP angle range, telling whether the candidate is facing left, right or front.

Third Stage PCN-3

After the second stage, all the face candidates are calibrated to an upright quarter of RIP angle range, i.e. [$-45^{\circ}$,$45^{\circ}$]. Therefore, the PCN-3 in the third stage can easily and accurately determine whether it is a face and regress the bounding box. Since the RIP angle has been reduced to a small range in previous stages, PCN-3 attempts to directly regress the precise RIP angles of face candidates instead of coarse orientations.
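
The three-stage calibration described above can be traced with a small simulation. This is a sketch under a strong assumption: each stage is replaced by an idealised oracle that always predicts the correct coarse orientation, whereas in PCN these decisions come from CNN outputs. The function names and the sign convention for angles are hypothetical.

```python
def normalize(angle):
    """Wrap an angle in degrees into the half-open range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def stage1_rotation(angle):
    """Binary decision: facing up (no rotation) or facing down (rotate by 180)."""
    return 0.0 if -90.0 <= angle <= 90.0 else 180.0


def stage2_rotation(angle):
    """Ternary decision: facing left, front or right."""
    if angle < -45.0:
        return -90.0
    if angle > 45.0:
        return 90.0
    return 0.0


def calibrate(true_angle):
    """Return the per-stage angles (theta1, theta2, theta3) for one candidate."""
    theta1 = stage1_rotation(true_angle)
    remaining = normalize(true_angle - theta1)
    theta2 = stage2_rotation(remaining)
    remaining = normalize(remaining - theta2)
    theta3 = remaining  # idealised fine regression of the residual angle
    return theta1, theta2, theta3


for angle in (-170.0, -100.0, 30.0, 100.0):
    print(angle, calibrate(angle))
```

After the two coarse steps the residual angle always falls within $[-45^{\circ}, 45^{\circ}]$, and the sum of the three per-stage angles recovers the original RIP angle up to a full turn.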

Accurate and Fast Calibration

The early stages only predict coarse RIP orientations, which is robust to the large diversity of faces and further benefits the prediction of successive stages.

The calibration based on the coarse RIP prediction can be achieved efficiently by flipping the original image three times, which brings almost no additional time cost. The original image is rotated by $-90^{\circ}$, $90^{\circ}$ and $180^{\circ}$ to get image-left, image-right, and image-down. The windows with $0^{\circ}$, $-90^{\circ}$, $90^{\circ}$ and $180^{\circ}$ can then be cropped from the original image, image-left, image-right, and image-down respectively, as the following figure shows.
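
Rotations by multiples of $90^{\circ}$ are cheap because they need no interpolation, only transposes and reversals. The following sketch shows this on a tiny image stored as a list of rows. The function names are hypothetical, and which direction corresponds to image-left or image-right in the paper is not assumed here.

```python
def rotate_cw(img):
    """Rotate a 2D list by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_ccw(img):
    """Rotate a 2D list by 90 degrees counter-clockwise: transpose, then reverse the rows."""
    return [list(row) for row in zip(*img)][::-1]


def rotate_180(img):
    """Rotate a 2D list by 180 degrees: reverse the rows and each row's pixels."""
    return [row[::-1] for row in img[::-1]]


img = [[1, 2, 3],
       [4, 5, 6]]

print(rotate_cw(img))
print(rotate_ccw(img))
print(rotate_180(img))
```

Each rotated copy is computed once per image, and candidate windows are then simply cropped from the copy that matches their predicted coarse orientation.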


CNN Architecture




TITLE: Pelee: A Real-Time Object Detection System on Mobile Devices

AUTHOR: Robert J. Wang, Xiang Li, Shuang Ao, Charles X. Ling

ASSOCIATION: University of Western Ontario

FROM: arXiv:1804.06882


  1. A variant of DenseNet architecture called PeleeNet for mobile devices is proposed.
  2. The network architecture of Single Shot MultiBox Detector (SSD) is optimized for speed and then combined with PeleeNet.



Two-Way Dense Layer. A 2-way dense layer is used to get different scales of receptive fields. One branch uses a small kernel size (3x3) to capture small-size objects. The other branch stacks two 3x3 convolution layers for larger objects. The structure is shown in the following figure.

Two-Way Dense Layer
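
Why stacking two 3x3 convolutions helps with larger objects can be seen from the receptive field. The sketch below uses the standard recurrence for the receptive field of a stack of convolutions. The function name and the `(kernel, stride)` layer encoding are assumptions made for the example, and only the 3x3 layers mentioned above are modeled.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance between neighbouring outputs, in input pixels
    return rf


small_branch = [(3, 1)]          # one 3x3 convolution
large_branch = [(3, 1), (3, 1)]  # two stacked 3x3 convolutions

print(receptive_field(small_branch))
print(receptive_field(large_branch))
```

The second branch sees a 5x5 neighbourhood while the first sees 3x3, so concatenating both gives each dense layer features at two scales.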

Stem Block. This block is placed before the first dense layer for the sake of cost efficiency. The stem block effectively improves the feature expression ability without adding much computational cost. The structure is shown as follows.

Stem Block

Dynamic Number of Channels in Bottleneck Layer. The number of channels in the bottleneck layer varies according to the input shape to make sure the number of output channels does not exceed the number of its input channels.

Transition Layer without Compression. Experiments show that the compression factor proposed by DenseNet hurts the feature expression, so the number of output channels is kept the same as the number of input channels in transition layers.

Composite Function. The post-activation order (Convolution - Batch Normalization - ReLU) is used for speed. With post-activation, all batch normalization layers can be merged into the convolution layers at the inference stage. To compensate for the negative impact of this change on accuracy, a shallow and wide network structure is designed. In addition, a 1x1 convolution layer is added to the last dense block to get a stronger representational ability.
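
Merging batch normalization into the preceding convolution is a standard inference-time trick. The sketch below shows it for a 1x1 convolution at a single spatial position, which reduces to a matrix-vector product. It is a minimal illustration with hypothetical function names and made-up numbers, not code from the Pelee implementation.

```python
import math


def linear(weight, bias, x):
    """1x1 convolution at one position: one weight row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization applied per channel."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and biases that already include the batch norm."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


# Made-up parameters for two output channels and two input channels.
weight, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [4.0, 0.25]
x = [1.0, 2.0]

reference = batchnorm(linear(weight, bias, x), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
folded = linear(folded_w, folded_b, x)
print(reference)
print(folded)
```

The two outputs agree up to floating-point error, so the batch normalization layer costs nothing at inference time. This folding works because nothing non-linear sits between the convolution and the batch normalization in the post-activation order.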


The framework of the work is illustrated in the following table.

PeleeNet Architecture


Feature Map Selection. Feature maps at 5 scales (19x19, 10x10, 5x5, 3x3, and 1x1) are selected. Larger-resolution feature maps are discarded for speed.
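
A quick count shows how much work a large feature map would add. The sketch below counts prediction locations only. The 38x38 map is included as an assumed example of a larger-resolution map (it is the size the original SSD300 uses), and the number of default boxes per location is left out because it is not given here.

```python
def num_locations(sizes):
    """Total number of spatial locations over square feature maps of the given sizes."""
    return sum(s * s for s in sizes)


selected = [19, 10, 5, 3, 1]   # the five scales listed above
with_large_map = [38] + selected  # assumed extra 38x38 map, for comparison only

print(num_locations(selected))
print(num_locations(with_large_map))
```

Under this assumption the single 38x38 map alone would contribute about three times as many locations as the five selected maps together, which is why dropping it speeds up prediction.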

Residual Prediction Block. For each feature map used for detection, a residual block (ResBlock) is constructed before conducting prediction, shown in the following figure.

PeleeNet SSD


The classification performance on ILSVRC2012 is shown in the following table.


The detection performance on VOC2007 is shown in the following table.


The detection performance on COCO2015 is shown in the following table.



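From my own experience, depthwise (DW) convolution is not pruning friendly, so recent pruning methods such as ThiNet and Net-Trim work poorly on it. This work uses conventional convolutional layers, so those pruning methods may be useful here.

Recently I have been packaging a dynamic library that needs to support the old Windows XP system. My development system is Windows 10, with Visual Studio 2013 as the IDE.


  1. In the project settings, under Configuration Properties -> General -> Platform Toolset, select Visual Studio 2013 - Windows XP (v120_xp).
  2. In the project settings, under Configuration Properties -> C/C++ -> Code Generation -> Runtime Library, select MT/MTd, for release and debug mode respectively.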





Debug Assertion Failed! Expression: __acrt_first_block == header

As this is a DLL, the problem might lie in different heaps used for allocation and deallocation (try to build the library statically and check if that will work).

The problem is that DLLs and templates do not go together very well. In general, depending on the linkage of the MSVC runtime, it can be a problem if memory is allocated in the executable and deallocated in the DLL, or vice versa (because they might have different heaps). That can happen with templates very easily. For example, you push_back() to the vector inside removeWhiteSpaces() in the DLL, so the vector memory is allocated inside the DLL. Then you use the output vector in the executable, and once it goes out of scope it is deallocated, but inside the executable, whose heap knows nothing about the heap the memory was allocated from. Bang, you’re dead.

This can be worked-around if both DLL and the executable use the same heap. To ensure this, both the DLL and the executable must use the dynamic MSVC runtime - so make sure, that both link to the runtime dynamically, not statically. In particular, the exe should be compiled and linked with /MD[d] and the library with /LD[d] or /MD[d] as well, neither one with /MT[d]. Note that afterwards the computer which will be running the app will need the MSVC runtime library to run (for example, by installing “Visual C++ Redistributable” for the particular MSVC version).

You could get that to work even with /MT, but it is more difficult: you would need to provide some interface that allows the objects allocated in the DLL to be deallocated there as well. For example, something like:

    #include <string>
    #include <vector>
    __declspec(dllexport) void deallocVector(std::vector<std::string> &x);

    void deallocVector(std::vector<std::string> &x) {
        std::vector<std::string> tmp;
        x.swap(tmp);  // tmp takes over the buffer and frees it inside the DLL
    }

(however this does not work very well in all cases, as this needs to be called explicitly so it will not be called e.g. in case of exception - to solve this properly, you would need to provide some interface from the DLL, which will cover the vector under the hood and will take care about the proper RAII)

EDIT: the final solution was actually to have all of the projects (the exe, the dll and the entire GoogleTest project) built as Multi-threaded Debug DLL (/MDd). The GoogleTest projects are built as Multi-threaded Debug (/MTd) by default.


Using Evermonkey

I’ve been using VSCode for a while and have gotten used to Markdown, which EverNote does not support yet. So I wondered whether there was an extension that could help. Luckily, evermonkey showed up.


There are 3 steps to use this extension.

  1. Get a developer token. Currently, EverNote does not accept applications for tokens on their official website, but a token can be obtained by emailing their customer service. I got a token in only one or two days.
  2. Install the evermonkey extension in VSCode.
  3. Set evermonkey.token and evermonkey.noteStoreUrl in settings.


Open the command panel with F1 or ctrl+shift+p, then type

  • ever new to start a new blank note.
  • ever open to open a note in a tree-like structure.
  • ever search to search notes using the EverNote search grammar.
  • ever publish to publish the note currently being edited to the EverNote server.
  • ever sync to synchronize the EverNote account.


Currently, third-party extensions only support synchronizing files; the files cannot be modified in the apps. For example, I can only modify a file in VSCode, not in the EverNote application.

  1. List the files that differ between two branches

    git diff branch1 branch2 --stat
  2. Show the differences between two branches in detail

    git diff branch1 branch2
  3. Replace one file in branch2 with the version from branch1

    git checkout branch2
    git checkout --patch branch1 filename
  4. Start a new branch

    git checkout -b NewBranch
  5. List all branches, including remote ones

    git branch -a

TITLE: MobileNetV2: Inverted Residuals and Linear Bottlenecks

AUTHOR: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen


FROM: arXiv:1801.04381


  1. The main contribution is a novel layer module: the inverted residual with linear bottleneck.



Depthwise Separable Convolutions. The basic idea is to replace a full convolutional operator with a factorized version that splits the convolution into two separate layers. The first layer, called a depthwise convolution, performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a $1 \times 1$ convolution, called a pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels.
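
The saving from this factorization can be worked out by counting multiply-adds. The sketch below uses the usual cost formulas, $h \cdot w \cdot d_i \cdot d_j \cdot k^2$ for a standard convolution and $h \cdot w \cdot d_i \cdot (k^2 + d_j)$ for the depthwise separable version. The function names and the example layer size are assumptions chosen for illustration.

```python
def standard_conv_cost(h, w, d_in, d_out, k):
    """Multiply-adds of a standard k x k convolution on an h x w x d_in input."""
    return h * w * d_in * d_out * k * k


def separable_conv_cost(h, w, d_in, d_out, k):
    """Multiply-adds of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * d_in * k * k
    pointwise = h * w * d_in * d_out
    return depthwise + pointwise


# Example layer (assumed): 56x56 feature map, 64 -> 64 channels, 3x3 kernel.
full = standard_conv_cost(56, 56, 64, 64, 3)
separable = separable_conv_cost(56, 56, 64, 64, 3)
print(full, separable, full / separable)
```

For a 3x3 kernel the ratio is $9 d_j / (9 + d_j)$, which approaches 9 as the number of output channels grows, so the separable version is roughly 8 to 9 times cheaper.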

Linear Bottlenecks. It has long been assumed that manifolds of interest in neural networks can be embedded in low-dimensional subspaces. Two properties are indicative of the requirement that the manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space:

  1. If the manifold of interest remains non-zero volume after ReLU transformation, it corresponds to a linear transformation.
  2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

Assuming the manifold of interest is low-dimensional we can capture this by inserting linear bottleneck layers into the convolutional blocks.

Inverted Residuals. Inspired by the intuition that the bottlenecks actually contain all the necessary information, while an expansion layer acts merely as an implementation detail that accompanies a non-linear transformation of the tensor, shortcuts are used directly between the bottlenecks. In residual networks the bottleneck layers are treated as low-dimensional supplements to high-dimensional “information” tensors.

The following figure shows the inverted residual block. The diagonally hatched texture indicates layers that do not contain non-linearities. This design provides a natural separation between the input/output domains of the building blocks (bottleneck layers) and the layer transformation, that is, a non-linear function that converts input to output. The former can be seen as the capacity of the network at each layer, whereas the latter can be seen as its expressiveness.


Inverted Residuals

And the following table gives the basic implementation structure.

Bottleneck residual block
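
The structure of the bottleneck residual block can be traced as a sequence of tensor shapes: a 1x1 expansion by a factor $t$, a 3x3 depthwise convolution with stride $s$, and a linear 1x1 projection. The sketch below only tracks shapes. The function name is hypothetical, the height and width are assumed to be divisible by the stride, and the shortcut rule (stride 1 with matching channel counts) is the commonly used condition, stated here as an assumption.

```python
def bottleneck_shapes(h, w, k, t, s, k_out):
    """Trace (height, width, channels) through one inverted residual block."""
    expanded = (h, w, t * k)             # 1x1 conv + ReLU6: expand channels by t
    filtered = (h // s, w // s, t * k)   # 3x3 depthwise conv, stride s, + ReLU6
    projected = (h // s, w // s, k_out)  # linear 1x1 conv, no non-linearity
    # A shortcut between bottlenecks is only possible when shapes match.
    use_shortcut = (s == 1 and k == k_out)
    return [expanded, filtered, projected], use_shortcut


shapes, shortcut = bottleneck_shapes(56, 56, 24, 6, 1, 24)
print(shapes, shortcut)

shapes_strided, shortcut_strided = bottleneck_shapes(56, 56, 24, 6, 2, 32)
print(shapes_strided, shortcut_strided)
```

The wide tensors exist only inside the block, while the tensors passed between blocks, and along the shortcut, stay narrow. This is the sense in which the residual is inverted.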





Object Detection

Semantic Segmentation

TITLE: Neural Aesthetic Image Reviewer

AUTHOR: Wenshan Wang, Su Yang, Weishan Zhang, Jiulong Zhang

ASSOCIATION: Fudan University, China University of Petroleum, Xi’an University of Technology

FROM: arXiv:1802.10240


  1. The problem studied is whether computer vision systems can perceive image aesthetics and also generate reviews or explanations as humans do. This is the first work to investigate this problem.
  2. By incorporating shared aesthetically semantic layers at a high level, an end-to-end trainable NAIR architecture is proposed, which approaches the goal of performing aesthetic prediction as well as generating natural-language comments related to aesthetics.
  3. To enable this research, the AVA-Reviews dataset is collected, which contains 52,118 images and 312,708 comments.


The framework of the work is illustrated in the following figure. The main idea of this work is to learn image aesthetic classification and vision-to-language generation using a multi-task framework.


The authors tried two designs, Model-I and Model-II. The difference between the two architectures is whether there are task-specific embedding layers for each task in addition to the shared layers. The potential limitation of Model-I is that some task-specific features cannot be captured by the shared aesthetically semantic layer, so Model-II introduces a task-specific embedding layer.

The image aesthetic classification part is a typical binary classification task. For the comment generation part, an LSTM is applied, whose input is the high-level visual feature vector of an image.





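Principle 1: Things are actually simple


  1. Look for simple solutions. Always ask yourself, "What is the simplest way to get this done?"
  2. Describe a matter clearly in no more than 25 words.
  3. If you find yourself adopting a complicated solution or line of thinking, you have probably gone down the wrong path. So how should "complicated" be defined?
  4. Ask only simple questions: Who? What? Why? Where? When? How did it happen? What was the result?
  5. Seek only simple answers.
  6. Remember to keep things simple and easy to understand: treat your audience as a 6-year-old child, then explain.
  7. Break out of habitual thinking and use lateral thinking.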


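Principle 2: Figure out what you are trying to do




  1. Truly understand what you want to do
  2. Find out whether this is also what other people want done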



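Principle 3: Everything has a sequence



  1. Plan from the very start
  2. Make the plan detailed and thorough
  3. State your intentions clearly
  4. Make good use of knowledge and assumptions
  5. Know how to use cause and effect
  6. Record what has already happened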





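Principle 4: If nobody does it, it never gets done



  1. Assign the work to specific people
  2. Dance cards
  3. Maximize the strength of the team




  1. Stars. These people like the particular job, have all the necessary skills, and will almost certainly get it done. Let them do the work their own way and interfere as little as possible.
  2. Dependable people. These people are willing to work and know how to do it. They may not be that enthusiastic about this job, but they will very likely complete it. Do not get in their way too much, but do not place one hundred percent confidence in them either.
  3. Uncertain people. For various reasons, these people may well not complete the task properly. They need to be sorted as soon as possible: assign them some work and judge their ability from the results. If they complete the work reasonably well, they belong in the second category; if not, in the fifth.
  4. Trainees. In any case they are newcomers. Until it is confirmed that they can do the work, they need hands-on guidance, formal training and detailed management. Make sure they can at least become second-category people.
  5. Hopeless cases. They will not complete the task, so another way to get the work done has to be found. These people need to be dealt with appropriately, including dismissing or retraining them.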

Principle 5: Things often turn out differently from what was expected


  1. Contingency measures
  2. Risk management




Principle 6: Define the outcome of things clearly



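Principle 7: Learn to see things from other people's point of view


  1. Try putting yourself in other people's shoes
  2. Satisfy the stakeholders' win conditions as far as possible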

