
Using Evermonkey

I’ve been using VSCode for a while and have grown used to Markdown, which EverNote does not support yet. So I wondered whether any extension could help, and luckily evermonkey showed up.

Installation

There are 3 steps to use this extension.

  1. Get a developer token. Currently, EverNote does not accept token applications on their official website, but you can get a token by emailing their customer service. I got mine within one or two days.
  2. Install the evermonkey extension in VSCode.
  3. Set evermonkey.token and evermonkey.noteStoreUrl in settings.
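For reference, step 3 might look like this in VSCode's settings.json; both values below are placeholders to be replaced with the token and note-store URL you receive from EverNote:

```json
{
  "evermonkey.token": "<your-developer-token>",
  "evermonkey.noteStoreUrl": "<your-note-store-url>"
}
```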

Usage

Open the command panel with F1 or Ctrl+Shift+P, then type

  • ever new to create a new blank note.
  • ever open to open a note from a tree-like structure.
  • ever search to search notes with EverNote grammar.
  • ever publish to publish the note being edited to the EverNote server.
  • ever sync to synchronize your EverNote account.

Limitations

Currently, third-party extensions only support synchronizing files, and the files cannot be modified in the apps. For example, I can only modify a note in VSCode, not in the EverNote application.

  1. List the files that differ between two branches

    git diff branch1 branch2 --stat
  2. List the differences between two branches in detail

    git diff branch1 branch2
  3. Replace a file in branch2 with the version from branch1

    git checkout branch2
    git checkout --patch branch1 filename
  4. Start a new branch

    git checkout -b NewBranch
  5. List all branches, including remote ones

    git branch -a

TITLE: MobileNetV2: Inverted Residuals and Linear Bottlenecks

AUTHOR: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

ASSOCIATION: Google

FROM: arXiv:1801.04381

CONTRIBUTION

  1. The main contribution is a novel layer module: the inverted residual with linear bottleneck.

METHOD

BUILDING BLOCKS

Depthwise Separable Convolutions. The basic idea is to replace a full convolutional operator with a factorized version that splits the convolution into two separate layers. The first layer is called a depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a $1 \times 1$ convolution, called a pointwise convolution, which is responsible for building new features by computing linear combinations of the input channels.
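As a back-of-the-envelope check of the savings, here is a small Python sketch (the helper names are mine) that counts the weights of a standard convolution versus its depthwise separable factorization:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) + pointwise 1 x 1."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Example: 3x3 kernel, 32 input channels, 64 output channels
standard = conv_params(3, 32, 64)                   # 18432
separable = depthwise_separable_params(3, 32, 64)   # 2336
print(standard, separable, round(standard / separable, 2))  # 18432 2336 7.89
```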

Linear Bottlenecks. It has long been assumed that the manifolds of interest in neural networks could be embedded in low-dimensional subspaces. Two properties are indicative of the requirement that the manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space:

  1. If the manifold of interest remains of non-zero volume after the ReLU transformation, it corresponds to a linear transformation.
  2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.

Assuming the manifold of interest is low-dimensional we can capture this by inserting linear bottleneck layers into the convolutional blocks.

Inverted Residuals. Inspired by the intuition that the bottlenecks actually contain all the necessary information, while the expansion layer acts merely as an implementation detail accompanying a non-linear transformation of the tensor, shortcuts are used directly between the bottlenecks. This inverts the classical design: in residual networks the bottleneck layers are treated as low-dimensional supplements to high-dimensional “information” tensors.

The following figure shows the inverted residual block. The diagonally hatched texture indicates layers that do not contain non-linearities. The design provides a natural separation between the input/output domains of the building blocks (the bottleneck layers) and the layer transformation, a non-linear function that converts input to output. The former can be seen as the capacity of the network at each layer, and the latter as its expressiveness.
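A minimal Python sketch of the block's layer sequence may help; the function name is mine, and the expansion factor of 6 follows the paper's default. The final 1×1 projection is linear, and the shortcut joins the bottlenecks only when the stride is 1 and input and output channels match:

```python
def inverted_residual(c_in, c_out, expansion=6, stride=1):
    """Layer spec (op, in_channels, out_channels) of one inverted residual
    block: 1x1 expand + ReLU6, 3x3 depthwise + ReLU6, 1x1 linear projection."""
    hidden = c_in * expansion
    layers = [
        ("1x1 conv (expand), ReLU6", c_in, hidden),
        (f"3x3 depthwise, stride {stride}, ReLU6", hidden, hidden),
        ("1x1 conv (linear projection)", hidden, c_out),
    ]
    # The shortcut connects the low-dimensional bottlenecks directly
    use_shortcut = stride == 1 and c_in == c_out
    return layers, use_shortcut

layers, shortcut = inverted_residual(24, 24)
print(shortcut)  # True
```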

Inverted Residuals

And the following table gives the basic implementation structure.

Bottleneck residual block

ARCHITECTURE

Architecture

PERFORMANCE

Classification

Object Detection

Semantic Segmentation

TITLE: Neural Aesthetic Image Reviewer

AUTHOR: Wenshan Wang, Su Yang, Weishan Zhang, Jiulong Zhang

ASSOCIATION: Fudan University, China University of Petroleum, Xi’an University of Technology

FROM: arXiv:1802.10240

CONTRIBUTION

  1. The problem addressed is whether computer vision systems can perceive image aesthetics and also generate reviews or explanations as humans do. This is the first work to investigate this problem.
  2. By incorporating shared aesthetically semantic layers at a high level, an end-to-end trainable NAIR architecture is proposed, which can approach the goal of performing aesthetic prediction as well as generating natural-language comments related to aesthetics.
  3. To enable this research, the AVA-Reviews dataset is collected, which contains 52,118 images and 312,708 comments.

METHOD

The framework of the work is illustrated in the following figure. The main idea of this work is to learn image aesthetic classification and vision-to-language generation using a multi-task framework.

Framework

The authors tried two designs, Model-I and Model-II. The difference between the two architectures is whether there are task-specific embedding layers for each task in addition to the shared layers. The potential limitation of Model-I is that some task-specific features can not be captured by the shared aesthetically semantic layer. Thus a task-specific embedding layer is introduced.

For image aesthetic classification part, it is a typical binary classification task. For comment generation part, LSTM is applied, the input of which is the high-level visual feature vector for an image.

PERFORMANCE

Performance

Since setting out to start a business, I have been in a rather chaotic state: how to manage a team, how to keep my technical skills sharp, how to relieve stress and balance work and life... Perhaps in the near future I still will not find satisfying answers, but at least from now on I will try to look for them.

There is a saying: "one kind of confusion is thinking too much and doing too little." It is true. When you keep examining yourself and do not know what to do, that is precisely the time to act. For example, I have been reading fewer papers lately and keep worrying about falling behind technically; the cure is simply to catch up and read more recently published papers. Likewise, to ease my other anxieties, I should read more books and recharge. This book, 《极简主义》, is a start.

Principle 1: Things Are Actually Simple

Many things are complex, and most of the time we go looking for complex solutions, but the simpler a solution is, the more feasible it is, and the more likely it is to be correct. Reflecting on myself: ever since my master's studies I have imagined that I should be doing complicated things, such as complex algorithms, complex code, complex systems... yet the solutions that actually solve problems are never that complex. The book's advice is:

  1. Look for the simple solution. Keep asking yourself, "What is the simplest way to get this done?"
  2. Describe the task clearly in no more than 25 words.
  3. If you find yourself adopting a complex solution or line of thought, you have probably taken a wrong path. And how do you define "complex"?
  4. Ask only simple questions: who? what? why? where? when? how did it happen? what was the result?
  5. Seek only simple answers.
  6. Keep things simple and easy to understand; explain them as you would to a six-year-old.
  7. Break out of habitual thinking and use lateral thinking.

The basic requirement of **"things are actually simple"** is that we ask and answer in simple terms, and seek simple answers and questions from others as well.

Principle 2: Figure Out What You Want to Do

If you do not know which port you are sailing for, no wind, southeast or northwest, makes any difference to you.

This principle feels complementary to "things are actually simple": once we can reduce the problems and answers to their simplest form, we are not far from knowing what we want to do. The simplest way to figure that out is to make a plan: visualize the goal we want to reach, and set the criteria that mark a task as finished, so we know at what point it counts as done.

Besides knowing what we want to do, two more important questions are:

  1. Truly understand what you want to do
  2. Find out whether it is also something others want done

For the first question, besides knowing what to do, we also need to know the reason behind it. We should ask ourselves: "What is this work ultimately meant to achieve?"

For the second question, we need to think about how to let every stakeholder benefit, that is, win-win. Keeping the stakeholders happy is the key to success: we should tell them what they will get, and what they get must be what they need.

Principle 3: There Is Always a Sequence of Events

When I first read this title, I thought it meant that everything has "inertia" that affects other things. It turns out it means that work should be carried out according to a plan: getting one thing done is a sequence of small, consecutive events.

To keep events flowing in sequence, we need to make some effort, including

  1. Plan from the very start
  2. Make the plan detailed and thorough
  3. State your intentions clearly
  4. Make good use of knowledge and assumptions
  5. Understand and use cause and effect
  6. Record what has already happened

These efforts build on one another, and the core is "make a good plan". What is a good plan? A detailed and thorough one. To get there, first we must be able to state the goal clearly, which is exactly what Principle 2 demands of us. Second, when planning, many things will not be encountered right away, so we must make predictions, grounded in existing knowledge and assumptions and in the causal relations between events. Finally, record past experience; that experience turns into the knowledge and assumptions of items 4 and 5.

One scene in the book resonated with me strongly: "firefighting". The book describes it like this:

After arriving at the office in the morning, you look through your to-do list. Just as you start on the first item, someone tells you to attend the 9:30 meeting. During the meeting someone knocks on the door looking for you: "Can I borrow a few minutes?" While you are talking with him, your mobile phone rings, so you have to take the call. Before you finish, the computer dings to announce a new email. Then your desk phone starts ringing too...

It really is like that. In the half year and more since starting the company, I have often been called over by one colleague, then stopped by an intern, then swallowed by chores... From now on I must make a plan and stick to it.

Principle 4: Things Don't Get Done If You Don't Do Them

First, the important thing said three times: start doing it! Start doing it! Start doing it!

The precondition for starting is to apply Principles 2 and 3 well; after that comes how to do it. The tools the book offers include

  1. Assign every piece of work to a person
  2. The dance card
  3. Maximize the strength of the team

The first tool says that every piece of work should have an owner; there must never be a situation where either you or I might do it. A specific piece of work may belong to a team, but one person must be responsible for it, and any piece of work can be broken down onto each member of the team.

The second tool is really a tool for estimating workload. When we find the workload completely beyond what we can bear, we must prioritize the work and give some of it up where appropriate. It reminds me of a saying people often quote: to gain, you must first give up. How to set the priorities brings us back to Principles 1, 2, and 3. My feeling from this book is that things spiral upward: planning a piece of work, or learning what this book teaches, likewise requires applying these tools in cycles to reach the goal.

The third tool is a method for maximizing the output of staff. People can be divided into five types:

  1. Stars. These people love the particular work, have all the necessary skills, and will almost certainly finish the job. Let them do it their own way and interfere as little as possible.
  2. Dependables. These people are quite willing to work and know how to do it. They may not be that enthusiastic about this particular job, but they will still very likely finish it. Do not get in their way, but do not place one hundred percent confidence in them either.
  3. Uncertain people. For various reasons, these people may well fail to do the job well. We need to sub-classify them as soon as possible: give them some work and judge their ability by the results. If they do it well, move them into the second type; if not, into the fifth.
  4. Trainees. They are newcomers no matter what; until we confirm that they are capable of the work, they need hands-on guidance, formal training, and detailed management. Make sure they can at least become the second type.
  5. Hopeless people. They will not finish the task, so we need to find another way to get the work done, and deal with them appropriately, which may mean letting them go or retraining them.

Principle 5: Things Rarely Turn Out as Expected

However well the plan is made, once execution starts some things will be beyond our control, and there will always be "surprises" springing on us. How should we handle them?

  1. Contingency measures
  2. Risk management

The first tool means we can foresee some problems in advance; what we need to do is plan what we will do if they really occur.

The second tool asks us to assess the problems that might occur, including their probability and the damage they would cause, which helps us steer clear of the most serious ones: as the saying goes, of two evils choose the lesser.

My own understanding is that we should make the plan complete: when applying Principle 2, already take the possible problems into account. For project management, foreseeing these risks may require accumulated experience, built up through constant practice and reflection.

Principle 6: Define the Outcome Clearly

Once we practice Principles 1 through 5, the basic framework for getting something done is in place, but some details still need attention. This principle tells us that while executing the plan, each task has only two states: done, or not done. How do we decide which it is? We must "define the outcome clearly".

In fact, if we have begun applying Principle 3, we have already begun Principle 6, because we have decomposed a goal into small tasks, and until those small tasks are completed, the goal is not done. With these small tasks we can effectively monitor how much of the big goal has been completed. As for judging whether a small task is done, we can lean on Principle 2 again: what exactly are we trying to do? Every refined task should have a clear goal and a concrete form for its result, something we can actually see and control; by checking that result we can judge whether the task is finished.

Principle 7: Learn to See Things from Others' Point of View

Here the author gives two tools:

  1. Try putting on someone else's shoes
  2. Satisfy the stakeholders' conditions for benefiting as far as possible

First, when you run into setbacks dealing with people at any level, putting yourself in their position as much as you can makes their views and actions much easier to understand. This is the oft-mentioned "be considerate to subordinates" and "see the problem from the leader's point of view"; both are easy to say and very hard to do. For the former, when responsibility and pressure weigh on us, it is hard to deal with problems calmly. For the latter, when our horizons are not yet broad enough, we can hardly have that kind of vision, or sometimes it is simply a case of "not in the position, not privy to its concerns".

Second, only when everyone gains something, and each person gets what he himself wants rather than what we want to give, will everyone cooperate smoothly.

TITLE: Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection

AUTHOR: Alexander Wong, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl

ASSOCIATION: University of Waterloo, DarwinAI

FROM: arXiv:1802.06488

CONTRIBUTION

  1. A single-shot detection deep convolutional neural network, Tiny SSD, is designed specifically for real-time embedded object detection.
  2. A non-uniform Fire module is proposed based on SqueezeNet.
  3. The network achieves 61.3% mAP on the VOC2007 dataset with a model size of 2.3MB.

METHOD

DESIGN STRATEGIES

Tiny SSD network for real-time embedded object detection is composed of two main sub-network stacks:

  1. A non-uniform Fire sub-network stack.
  2. A non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers.

The first sub-network stack feeds into the second. Both sub-networks need careful design to run on an embedded device. The first sub-network serves as the backbone, which directly affects detection performance; the second must balance performance against model size and inference speed.

Three key design strategies are:

  1. Reduce the number of $3 \times 3$ filters as much as possible.
  2. Reduce the number of input channels to $3 \times 3$ filters where possible.
  3. Perform downsampling at a later stage in the network.
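The first two strategies can be made concrete with a parameter count of a SqueezeNet-style Fire module; the channel numbers below are illustrative, not Tiny SSD's actual non-uniform configuration:

```python
def fire_params(c_in, s1x1, e1x1, e3x3):
    """Parameter count of a Fire module: a 1x1 squeeze layer followed by
    parallel 1x1 and 3x3 expand layers (biases ignored)."""
    squeeze = c_in * s1x1            # 1x1 squeeze (strategy 2: fewer inputs to 3x3)
    expand1 = s1x1 * e1x1            # 1x1 expand (strategy 1: shift filters to 1x1)
    expand3 = 3 * 3 * s1x1 * e3x3    # 3x3 expand
    return squeeze + expand1 + expand3

plain = 3 * 3 * 96 * 128                         # plain 3x3 conv, 96 -> 128 channels
fire = fire_params(96, s1x1=16, e1x1=64, e3x3=64)  # Fire module, 96 -> 128 channels
print(plain, fire)  # 110592 11776
```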

NETWORK STRUCTURE

Fire

Auxiliary Layers

PERFORMANCE

Performance

SOME THOUGHTS

The paper stores the model in half-precision floating point, which halves the model size. From my own experience, several methods can be tried to deploy a deep learning model on embedded devices, including

  1. Architecture design, just like this work illustrated.
  2. Model pruning, such as decomposition, filter pruning and connection pruning.
  3. BLAS library optimization.
  4. Algorithm optimization. Using SSD as an example, the Prior-Box layer needs only one forward pass as long as the input image size does not change.
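On the half-precision point, a quick Python sketch (using the struct module's 'e' format for IEEE 754 half floats) shows the factor-of-two storage saving; the parameter count is arbitrary:

```python
import struct

def model_bytes(n_params, fmt):
    """Storage needed for n_params weights; 'f' = float32, 'e' = float16."""
    return n_params * struct.calcsize(fmt)

n = 1_000_000  # an arbitrary parameter count
print(model_bytes(n, "f") // model_bytes(n, "e"))  # 2: fp16 halves storage

# A weight survives the round trip, just with reduced precision:
w = struct.unpack("e", struct.pack("e", 0.1234))[0]
print(abs(w - 0.1234) < 1e-3)  # True
```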

TITLE: $S^3FD$: Single Shot Scale-invariant Face Detector

AUTHOR: Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, Stan Z. Li

ASSOCIATION: Chinese Academy of Sciences

FROM: arXiv:1708.05237

CONTRIBUTION

  1. Proposing a scale-equitable face detection framework with a wide range of anchor-associated layers and a series of reasonable anchor scales so as to handle different scales of faces well.
  2. Presenting a scale compensation anchor matching strategy to improve the recall rate of small faces.
  3. Introducing a max-out background label to reduce the high false positive rate of small faces.
  4. Achieving state-of-the-art results on AFW, PASCAL face, FDDB and WIDER FACE with real-time speed.

METHOD

There are mainly three reasons why the performance of anchor-based detectors drops dramatically as objects become smaller:

  1. Biased Framework. Firstly, the stride size of the lowest anchor-associated layer is too large, thus few features are reliable for small faces. Secondly, anchor scale mismatches receptive field and both are too large to fit small faces.
  2. Anchor Matching Strategy. Anchor scales are discrete but face scales are continuous. Faces whose scales fall away from the anchor scales cannot match enough anchors, such as tiny and outer faces.
  3. Background from Small Anchors. Small anchors lead to sharp increase in the number of negative anchors on the background, bringing about many false positive faces.

The architecture of Single Shot Scale-invariant Face Detector is shown in the following figure.

Framework

Scale-equitable framework

Constructing Architecture

  • Base Convolutional Layers: layers of VGG16 from conv1_1 to pool5 are kept.
  • Extra Convolutional Layers: fc6 and fc7 of VGG16 are converted to convolutional layers. Then extra convolutional layers are added, which is similar to SSD.
  • Detection Convolutional Layers: conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 are selected as the detection layers.
  • Normalization Layers: L2 normalization is applied to conv3_3, conv4_3 and conv5_3 to rescale their norm to 10, 8 and 5 respectively. The scales are then learned during the back propagation.
  • Predicted Convolutional Layers: For each anchor, 4 offsets relative to its coordinates and $N_{s}$ scores for classification, where $N_s=N_m+1$ ($N_m$ is the maxout background label) for conv3_3 detection layer and $N_s=2$ for other detection layers.
  • Multi-task Loss Layer: Softmax loss for classification and smooth L1 loss for regression.

Designing scales for anchors

  • Effective receptive field: the anchor should be significantly smaller than theoretical receptive field in order to match the effective receptive field.
  • Equal-proportion interval principle: the scales of the anchors are 4 times its interval, which guarantees that different scales of anchor have the same density on the image, so that various scales face can approximately match the same number of anchors.
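The equal-proportion interval principle is easy to state in code; the strides below are those of the six detection layers from conv3_3 to conv7_2:

```python
def anchor_scales(strides, ratio=4):
    """Equal-proportion interval principle: each anchor scale is 4x the
    stride (interval) of its detection layer, so anchors of every scale
    tile the image with the same density."""
    return [s * ratio for s in strides]

strides = [4, 8, 16, 32, 64, 128]      # conv3_3 ... conv7_2
print(anchor_scales(strides))          # [16, 32, 64, 128, 256, 512]
```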

Scale compensation anchor matching strategy

To solve two problems, namely 1) the average number of matched anchors is about 3, which is not enough to recall faces with high scores, and 2) the number of matched anchors is highly related to the anchor scales, a scale compensation anchor matching strategy is proposed. There are two stages:

  • Stage One: decrease threshold from 0.5 to 0.35 in order to increase the average number of matched anchors.
  • Stage Two: first pick out the anchors whose Jaccard overlap with tiny or outer faces is higher than 0.1, then sort them and select the top-N as matched anchors, where N is the average number of matched anchors from stage one.
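The two stages above can be sketched in Python for a single face; this is a simplification of the paper's procedure, and the anchor indexing is mine:

```python
def match_anchors(ious, n_average):
    """Two-stage scale compensation matching for one face.
    ious: Jaccard overlap of every anchor with this face."""
    # Stage one: lower the matching threshold from 0.5 to 0.35
    matched = [i for i, iou in enumerate(ious) if iou >= 0.35]
    if len(matched) >= n_average:
        return matched
    # Stage two: among anchors with overlap > 0.1, keep the top-N,
    # where N is the average match count from stage one
    candidates = sorted(
        (i for i, iou in enumerate(ious) if iou > 0.1),
        key=lambda i: ious[i], reverse=True)
    return candidates[:n_average]

# A tiny face that no anchor overlaps by 0.35 still gets matches:
print(match_anchors([0.30, 0.12, 0.28, 0.05], n_average=3))  # [0, 2, 1]
```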

Max-out background label

For the conv3_3 detection layer, a max-out background label is applied. For each of the smallest anchors, $N_m$ scores are predicted for the background label and the highest one is chosen as its final background score.
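A minimal sketch of the max-out operation for one small anchor (the function name is mine):

```python
def maxout_background(bg_scores, face_score):
    """For a small anchor on conv3_3, predict N_m background scores and
    keep only the highest as the final background score, alongside the
    face score."""
    return [max(bg_scores), face_score]

# N_m = 3 background scores are reduced to one before classification:
print(maxout_background([0.2, 1.1, 0.7], face_score=0.9))  # [1.1, 0.9]
```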

Training

  1. Training dataset and data augmentation, including color distortion, random crop and horizontal flip.
  2. Loss function is a multi-task loss defined in RPN.
  3. Hard negative mining.

The experiment result on WIDER FACE is illustrated in the following figure.

Experiment

TITLE: Single-Shot Refinement Neural Network for Object Detection

AUTHOR: Shifeng Zhang, LongyinWen, Xiao Bian, Zhen Lei, Stan Z. Li

ASSOCIATION: CACIA, GE Global Research

FROM: arXiv:1711.06897

CONTRIBUTION

  1. A novel one-stage framework for object detection is introduced, composed of two inter-connected modules, i.e., the ARM (Anchor Refinement Module) and the ODM (Object Detection Module). This leads to performance better than the two-stage approach while maintaining high efficiency of the one-stage approach.
  2. To ensure the effectiveness, TCB (Transfer Connection Block) is designed to transfer the features in the ARM to handle more challenging tasks, i.e., predict accurate object locations, sizes and class labels, in the ODM.
  3. RefineDet achieves the latest state-of-the-art results on generic object detection.

METHOD

The idea of this work can be seen as an improvement over the DSSD method. DSSD uses multi-scale feature maps to predict categories and regress bounding boxes, and deconvolution to increase the resolution of the last feature maps. In this work, a binary classifier and a coarse regressor are added to the downsampling stages; their outputs are the inputs to the multi-category classifier and the fine regressor. The framework of this single-shot refinement neural network is illustrated in the following figure.

Framework

Anchor Refinement Module

The ARM is designed to (1) identify and remove negative anchors to reduce search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor.

In the training phase, a refined anchor box whose negative confidence is larger than a preset threshold θ (set to 0.99 empirically) is discarded when training the ODM.
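This negative anchor filtering is a one-line predicate; the dictionary layout below is illustrative:

```python
def filter_refined_anchors(anchors, theta=0.99):
    """Discard refined anchors whose negative (background) confidence
    exceeds theta; the rest are passed on to the ODM."""
    return [a for a in anchors if a["neg_conf"] <= theta]

anchors = [{"id": 0, "neg_conf": 0.995},   # easy negative: dropped
           {"id": 1, "neg_conf": 0.40}]    # kept for the ODM
print([a["id"] for a in filter_refined_anchors(anchors)])  # [1]
```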

Object Detection Module

The ODM takes the refined anchors as the input from the former to further improve the regression and predict multi-class labels.

Transfer Connection Block

TCB is introduced to convert features of different layers from the ARM, into the form required by the ODM, so that the ODM can share features from the ARM. Another function of the TCBs is to integrate large-scale context by adding the high-level features to the transferred features to improve detection accuracy. An illustration of TCB can be found in the following figure.

TCB

Training

The training method is much like SSD's. The experiment results and a comparison with other methods can be found in the following table.

TCB

TITLE: Panoptic Segmentation

AUTHOR: Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar

ASSOCIATION: FAIR, Heidelberg University

FROM: arXiv:1801.00868

CONTRIBUTION

  1. A novel ‘Panoptic Segmentation’ (PS) task is proposed and studied.
  2. A panoptic quality (PQ) measure is introduced to measure performance on the task.
  3. A basic algorithmic approach to combine instance and semantic segmentation outputs into panoptic outputs is proposed.

PROBLEM DEFINITION

Panoptic refers to a unified, global view of segmentation. Each pixel of an image must be assigned a semantic label and an instance id. Pixels with the same label and id belong to the same object; for stuff labels the instance id is ignored.

Panoptic Segmentation

Given a predetermined set of $L$ semantic categories encoded by $\mathcal{L} := \{1, \dots, L\}$, the task requires a panoptic segmentation algorithm to map each pixel $i$ of an image to a pair $(l_{i}, z_{i}) \in \mathcal{L} \times \mathbb{N}$, where $l_{i}$ represents the semantic class of pixel $i$ and $z_{i}$ represents its instance id.

The semantic label set consists of subsets $\mathcal{L}^{St}$ and $\mathcal{L}^{Th}$, such that $\mathcal{L} = \mathcal{L}^{St} \cup \mathcal{L}^{Th}$ and $\mathcal{L}^{St} \cap \mathcal{L}^{Th} = \emptyset$. These subsets correspond to stuff labels and thing labels, respectively.

Panoptic Quality (PQ)

For each class, the unique matching splits the predicted and ground truth segments into three sets: true positives (TP), false positives (FP), and false negatives (FN), representing matched pairs of segments, unmatched predicted segments, and unmatched ground truth segments, respectively. Given these three sets, PQ is defined as:

$$PQ=\frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|+\frac{1}{2}|FP|+\frac{1}{2}|FN|}$$

A predicted segment and a ground truth segment can match only if their intersection over union (IoU) is strictly greater than 0.5.

PQ can be seen as the product of a Segmentation Quality (SQ) term and a Detection Quality (DQ) term:

$$PQ=\frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|} \times \frac{|TP|}{|TP|+\frac{1}{2}|FP|+\frac{1}{2}|FN|}$$

where the first term can be seen as SQ and the second term can be seen as DQ.
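The PQ metric and its SQ × DQ factorization can be computed directly from the three sets; a small Python sketch for a single class:

```python
def panoptic_quality(tp_ious, n_fp, n_fn):
    """PQ for one class, given the IoUs of matched (TP) pairs and the
    counts of unmatched predictions (FP) and unmatched ground truth (FN)."""
    n_tp = len(tp_ious)
    if n_tp + n_fp + n_fn == 0:
        return 0.0
    sq = sum(tp_ious) / n_tp if n_tp else 0.0     # segmentation quality
    dq = n_tp / (n_tp + 0.5 * n_fp + 0.5 * n_fn)  # detection quality
    return sq * dq                                # PQ = SQ x DQ

# Two matched segments (IoU 0.8 and 0.6), one unmatched prediction,
# one unmatched ground-truth segment:
print(round(panoptic_quality([0.8, 0.6], n_fp=1, n_fn=1), 4))  # 0.4667
```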

Human vs. Machine

Human vs. Machine