
Spring, the sweet spring

BY THOMAS NASHE

Spring, the sweet spring, is the year’s pleasant king,
Then blooms each thing, then maids dance in a ring,
Cold doth not sting, the pretty birds do sing:
Cuckoo, jug-jug, pu-we, to-witta-woo!

The palm and may make country houses gay,
Lambs frisk and play, the shepherds pipe all day,
And we hear aye birds tune this merry lay:
Cuckoo, jug-jug, pu-we, to-witta-woo!

The fields breathe sweet, the daisies kiss our feet,
Young lovers meet, old wives a-sunning sit,
In every street these tunes our ears do greet:
Cuckoo, jug-jug, pu-we, to witta-woo!

Spring, the sweet spring!

2021.04.04

TITLE: Destruction and Construction Learning for Fine-grained Image Recognition

AUTHOR: Yue Chen, Yalong Bai, Wei Zhang, Tao Mei

ASSOCIATION: JD AI Research

FROM: arXiv:2003.14142

CONTRIBUTION

  1. A novel “Destruction and Construction Learning (DCL)” framework is proposed for fine-grained recognition. For destruction, the region confusion mechanism (RCM) forces the classification network to learn from discriminative regions, and the adversarial loss prevents over-fitting to the RCM-induced noisy patterns. For construction, the region alignment network restores the original region layout by modeling the semantic correlation among regions.
  2. State-of-the-art performances are reported on three standard benchmark datasets, where DCL consistently outperforms existing methods.
  3. Compared to existing methods, the proposed DCL requires no extra part/object annotations and introduces no computational overhead at inference time.

METHOD

The proposed method consists of four parts as the following figure shows.

Framework

At the training stage, three losses are used: the classification loss, the adversarial loss, and the region alignment loss. The total loss is defined as

$$L = \alpha L_{cls} + \beta L_{adv} + \gamma L_{loc} $$

The three losses play different roles in this work.

Classification Network

Only this part of the network is used at the inference stage. It introduces the classification loss $L_{cls}$.

Region Confusion Mechanism

Given an input image, the image is first uniformly partitioned into $N \times N$ sub-regions. The sub-regions are then rearranged within their neighbourhoods. This shuffling destroys the global structure while ensuring that each local region jitters inside a neighbourhood of tunable size. Since the global structure has been destroyed, the classification network has to find the discriminative local regions and learn the delicate differences among categories in order to recognize these randomly shuffled images.
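A minimal pure-Python sketch of this shuffling (the neighbourhood parameter $k$ follows the paper's description, but the code below is illustrative, not the authors' implementation): permutations are obtained by sorting indices jittered with uniform noise in $(-k, k)$, which bounds each region's displacement to less than $2k$ per axis.

```python
import random

def rcm_shuffle(grid, k=2, seed=0):
    """Destruct the global structure of an N x N grid of sub-regions.

    Each region's final position deviates from its original row/column
    index by less than 2k, so local detail survives while the global
    layout is destroyed (a sketch of RCM, not the paper's code).
    """
    rng = random.Random(seed)
    N = len(grid)
    # Permutation from sorting indices jittered by uniform(-k, k) noise.
    jitter = lambda: sorted(range(N), key=lambda i: i + rng.uniform(-k, k))
    # Shuffle columns within each row ...
    grid = [[row[j] for j in jitter()] for row in grid]
    # ... then rows within each column.
    perms = [jitter() for _ in range(N)]
    return [[grid[perms[c][r]][c] for c in range(N)] for r in range(N)]
```

Tagging each cell with its original coordinates makes it easy to check that the result is a permutation of the input with bounded jitter.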

Adversarial Learning Network

Destructing images with RCM does not always produce information beneficial to fine-grained classification, and features learned from the resulting noisy visual patterns can harm the classification task. The adversarial loss $L_{adv}$ is therefore introduced to prevent such overfitting: it encourages the filters to respond differently to original images and region-shuffled images, so that the network works reliably.

Region Alignment Network

The direct aim of the Region Alignment Network is to restore the original image from the scattered image. By end-to-end training, the region alignment loss $L_{loc}$ can help the classification backbone network to build deep understanding about objects and model the structure information, such as the shape of objects and semantic correlation among parts of object.
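Assuming an L1 penalty between the locations predicted for each shuffled sub-region and that region's original grid coordinates (consistent with the paper's description of $L_{loc}$; the prediction head itself is omitted), the loss can be sketched as:

```python
def region_alignment_loss(pred_locs, true_locs):
    # pred_locs / true_locs: per-region (row, col) coordinates; the loss
    # averages the L1 distance over all regions, pushing the network to
    # recover where each shuffled region originally came from.
    n = len(pred_locs)
    return sum(abs(pr - tr) + abs(pc - tc)
               for (pr, pc), (tr, tc) in zip(pred_locs, true_locs)) / n
```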

PERFORMANCE

The following table shows the comparison between this work and prior work.

performance

ablation

TITLE: Weakly Supervised Attention Pyramid Convolutional Neural Network for Fine-Grained Visual Classification

AUTHOR: Yifeng Ding, Shaoguo Wen, Jiyang Xie, Dongliang Chang, Zhanyu Ma, Zhongwei Si, Haibin Ling

ASSOCIATION: Beijing University of Posts and Telecommunications, Stony Brook University

FROM: arXiv:2002.03353

CONTRIBUTION

  1. A novel attention pyramid convolutional neural network (AP-CNN) is proposed by building an enhanced pyramidal hierarchy, which combines a top-down pathway of features and a bottom-up pathway of attentions, and thus learns both high-level semantic and low-level detailed feature representations.
  2. ROI guided refinement is proposed which consists of ROI guided dropblock and ROI guided zoom-in to further refine the features. The dropblock operation helps to locate more discriminative local regions, and the zoom-in operation aligns features with background noises eliminated.

METHOD

AP-CNN is a two-stage network, with a raw stage and a refined stage that take coarse full images and refined features as input, respectively. An overview of the proposed AP-CNN is illustrated in the following figure.

Overview

First, the feature and attention pyramid structure takes coarse images as input and generates the pyramidal features and the pyramidal attentions by establishing a hierarchy on the basic CNN, following a top-down feature pathway and a bottom-up attention pathway.

Second, once the spatial attention pyramid has been obtained from the raw input, the region proposal network (RPN) proceeds to generate the pyramidal regions of interest (ROIs) in a weakly supervised way. Then the ROI guided refinement is conducted on low-level features with a) the ROI guided dropblock which erases the most discriminative regions selected from small-scaled ROIs, and b) the ROI guided zoom-in which locates the major regions merged from all ROIs.

Third, the refined features are sent into the refined-stage to distill more discriminative information. Both stages set individual classifiers for each pyramid level, and the final classification result is averaged over the raw-stage predictions and the refined-stage predictions.

The Attention Pyramid consists of two types of attentions, Spatial Attention and Channel Attention. The following figure shows the data-flow.

attention

The Spatial Attention Pyramid is a set of feature maps of different resolutions generated from the feature pyramid. An ROI pyramid is then generated from the spatial activations using the RPN. At the training stage, an ROI is selected to be dropped, erasing the informative part and encouraging the network to find more discriminative regions. At the testing stage, this operation is skipped. The following figure shows the ROI guided refinement.

roi_dropblock
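The dropblock branch of the refinement can be sketched as follows (toy pure-Python version; the function name, the ROI format, and the random selection rule are illustrative assumptions):

```python
import random

def roi_guided_dropblock(feature, rois, training=True, seed=0):
    # feature: H x W grid (list of lists); rois: (x0, y0, x1, y1, score)
    # tuples. During training, erase one of the candidate ROIs to force
    # the network to look beyond its most discriminative region; at test
    # time the operation is skipped, as in AP-CNN.
    if not training or not rois:
        return feature
    rng = random.Random(seed)
    x0, y0, x1, y1, _ = rng.choice(rois)
    out = [row[:] for row in feature]  # copy so the input is untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = 0.0
    return out
```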

PERFORMANCE

The following table shows the comparison between this work and prior work.

performance

TITLE: ResNeSt: Split-Attention Networks

AUTHOR: Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola

ASSOCIATION: Amazon, University of California, Davis

FROM: arXiv:2004.08955

CONTRIBUTION

  1. A simple architectural modification of the ResNet is explored, incorporating feature-map split attention within the individual network blocks.
  2. Models utilizing a ResNeSt backbone are able to achieve state of the art performance on several tasks, namely: image classification, object detection, instance segmentation and semantic segmentation.

METHOD

Split-Attention Block

In this work, a Split-Attention Block is explored. For implementation convenience, the radix-major version is the easier one to understand. The following figure gives an illustration of the Split-Attention Block.

Split-Attention Block

As shown in the radix-major implementation of the ResNeSt block, feature-map groups with the same radix index but different cardinality indices are physically adjacent. This layout can be easily accelerated, because the $1 \times 1$ convolutional layers can be unified into a single layer and the $3 \times 3$ convolutional layers can be implemented as a group convolution with the number of groups equal to $RK$.
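For a single cardinal group, the split-attention computation can be sketched in plain Python. This toy version assumes global pooling has already been applied and omits the two dense layers of the real block, so the softmax over the radix axis acts directly on the pooled statistics:

```python
import math

def split_attention(splits):
    # splits: R pooled feature vectors (one per radix split) of length C.
    R, C = len(splits), len(splits[0])
    out = []
    for c in range(C):
        logits = [s[c] for s in splits]          # per-split channel statistic
        m = max(logits)
        e = [math.exp(l - m) for l in logits]
        attn = [x / sum(e) for x in e]           # softmax over the radix axis
        out.append(sum(a * s[c] for a, s in zip(attn, splits)))
    return out
```

With $R = 1$ the softmax degenerates to a weight of 1, recovering the SE-style per-group special case mentioned below.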

Comparison with Prior Work

The following figure shows the comparison. The Split-Attention Block is shown in the cardinality-major view.

Comparison

SE-Net introduces squeeze-and-attention (called excitation in the original paper) to employ a global context to predict channel-wise attention factors. With radix = 1, the Split-Attention block applies a squeeze-and-attention operation to each cardinal group, while SE-Net operates on top of the entire block regardless of the groups. Previous models like SK-Net introduced feature attention between two network branches, but their operation is not optimized for training efficiency or for scaling to large neural networks. This work generalizes prior work on feature-map attention within the cardinal-group setting, and its implementation remains computationally efficient.

PERFORMANCE

Classification

classification

classification-stoa

Detection

detection

Segmentation

instance-segmentation

semantic-segmentation

TITLE: NBDT: Neural-Backed Decision Trees

AUTHOR: Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, Joseph E. Gonzalez

ASSOCIATION: UC Berkeley, Boston University

FROM: arXiv:2004.00221

CONTRIBUTION

  1. A method is proposed for running any classification neural network as a decision tree by defining a set of embedded decision rules that can be constructed from the fully-connected layer. Induced hierarchies are designed that are easier for neural networks to learn.
  2. Tree supervision loss is proposed, which boosts neural network accuracy by 0.5% and produces high-accuracy NBDTs. NBDTs achieve accuracies comparable to neural networks on small, medium, and large-scale image classification datasets.
  3. Qualitative and quantitative evidence of semantic interpretations are illustrated.

METHOD

Steps for Converting CNN into a Decision Tree

  1. Build an induced hierarchy;
  2. Fine-tune the model with a tree supervision loss;
  3. For inference, featurize samples with the neural network backbone;
  4. Run the decision rules embedded in the fully-connected layer.

The following figure illustrates the main steps for converting a classification neural network into a decision tree:

Main Steps
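Steps 3 and 4 amount to a greedy traversal of the induced hierarchy; a sketch with a hypothetical 4-class tree (one-hot leaf vectors standing in for FC-layer rows, parent vectors averaged from children):

```python
# Hypothetical 4-class tree; leaf vectors play the role of FC-layer rows
# and parent vectors are the averages of their children's vectors.
TREE = {"root": ["A", "B"], "A": [0, 1], "B": [2, 3]}
VEC = {
    0: [1, 0, 0, 0], 1: [0, 1, 0, 0],
    2: [0, 0, 1, 0], 3: [0, 0, 0, 1],
    "A": [0.5, 0.5, 0, 0], "B": [0, 0, 0.5, 0.5],
}

def nbdt_infer(x, tree=TREE, vec=VEC):
    # Greedy ("hard") embedded decision rules: descend from the root,
    # always following the child whose representative vector has the
    # largest inner product with the featurized sample x.
    node = "root"
    while node in tree:
        node = max(tree[node],
                   key=lambda ch: sum(a * b for a, b in zip(x, vec[ch])))
    return node
```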

Building Induced Hierarchies

The following figure illustrates how to build induced hierarchies from the network’s final fully-connected layer. For the leaf nodes, the representative vectors are extracted from the weights of the FC layer. The parents’ representative vectors are computed by averaging their children’s.

Building Induced Hierarchies

In this work, the authors find a minimal subset of the WordNet hierarchy that includes all classes as leaves, pruning redundant leaves and single-child intermediate nodes. To leverage this source of labels, a hypothesis is generated for each intermediate node by finding the earliest ancestor of each subtree’s leaves.

Training with Tree Supervision Loss

A tree supervision loss is added to the final loss function to encourage the network to separate the representative vectors of each internal node. Two losses are proposed: the hard tree supervision loss and the soft tree supervision loss. The final loss is

$$
Loss=L_{original}+L_{hard/soft}
$$

where $L_{original}$ is the typical cross-entropy loss for classification, and $L_{hard/soft}$ stands for the hard or soft tree supervision loss.

The hard tree supervision loss is defined as

$$
L_{hard}=\frac{1}{N} \sum_{i=1}^{N} CrossEntropy( D(i)_{pred}, D(i)_{label} )
$$

where $N$ is the number of nodes in the tree, excluding leaves, $D(i)_{pred}$ is the predicted probability distribution at node $i$, and $D(i)_{label}$ is the label at node $i$. Note that nodes not included in the path from the label to the root have no defined losses.
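A toy illustration of the hard loss on a hypothetical 4-leaf hierarchy (node distributions are derived by summing leaf probabilities under each child, which is one plausible reading of the paper):

```python
import math

# Hypothetical 4-leaf hierarchy: root -> {A, B}, A -> {0, 1}, B -> {2, 3}.
TREE = {"root": ["A", "B"], "A": [0, 1], "B": [2, 3]}
LEAVES = {"root": {0, 1, 2, 3}, "A": {0, 1}, "B": {2, 3}}

def hard_tree_loss(leaf_probs, label):
    # Average the cross entropy of choosing the correct child over the
    # internal nodes on the root-to-label path; nodes off that path
    # contribute no loss, matching the definition above.
    loss, count, node = 0.0, 0, "root"
    while node in TREE:
        masses = [sum(leaf_probs[l] for l in LEAVES.get(ch, {ch}))
                  for ch in TREE[node]]
        probs = [m / sum(masses) for m in masses]
        target = next(i for i, ch in enumerate(TREE[node])
                      if label in LEAVES.get(ch, {ch}))
        loss -= math.log(probs[target])
        count += 1
        node = TREE[node][target]
    return loss / count
```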

The soft tree supervision loss is defined as

$$
L_{soft}=CrossEntropy(D_{pred}, D_{label})
$$

where $D_{pred}$ is the predicted distribution over leaves and $D_{label}$ is the target distribution.

The following figure gives an example of the hard and soft tree supervision losses.

Tree Supervision Loss

PERFORMANCE

On all CIFAR10, CIFAR100, TinyImageNet, and ImageNet datasets, NBDT outperforms competing decision-tree-based methods, even uninterpretable variants such as a decision forest, by up to 18%. On CIFAR10, CIFAR100, and TinyImageNet, NBDTs largely stay within 1% of neural network performance.

Performance

SOME THOUGHTS

  1. The performance seems promising. However, the ablation studies are confusing because they use different experiment settings with more than one variable.
  2. The method for constructing a reasonable hierarchy is not illustrated exhaustively. My best guess is that the authors force the tree to be a binary tree.
  3. Is it possible that the leaves have duplicated labels?

ncnn is a high-performance neural network inference framework optimized for the mobile platform.

I’ve been using ncnn for quite a while, and recently, after compiling the latest version, I was surprised that the network could not give the correct output. Besides, the program crashed randomly.

Digging into the source code, cropping seemed to be the reason. The cropping operation crops not only the 2D feature map but also the channel dimension when the input blob is a 3-dim tensor. So I modified _outc = ref_dims == 3 ? ref_channels : channels; to _outc = channels;. I’m not sure whether there is another way to avoid this operation; the modification temporarily works around the problem.

void Crop::resolve_crop_roi(const Mat &bottom_blob, const Mat &reference_blob, int &_woffset, int &_hoffset, int &_coffset, int &_outw, int &_outh, int &_outc) const
{
    int w = bottom_blob.w;
    int h = bottom_blob.h;
    int channels = bottom_blob.c;
    int dims = bottom_blob.dims;

    int ref_w = reference_blob.w;
    int ref_h = reference_blob.h;
    int ref_channels = reference_blob.c;
    int ref_dims = reference_blob.dims;

    if (dims == 1)
    {
        _woffset = woffset;
        _outw = ref_w;
    }
    if (dims == 2)
    {
        _woffset = woffset;
        _hoffset = hoffset;
        _outw = ref_w;
        _outh = ref_h;
    }
    if (dims == 3)
    {
        _woffset = woffset;
        _hoffset = hoffset;
        _coffset = coffset;
        _outw = ref_w;
        _outh = ref_h;
        // _outc = ref_dims == 3 ? ref_channels : channels;
        _outc = channels; // keep the full channel dim instead of cropping it
    }
}

The following image shows the result of a foreground segmentation network before and after the modification.

Output Comparison

In order to deploy an MXNet-based vision engine in projects developed in C++, we need to compile the MXNet CPP API. Though how to compile it is well illustrated in Build from Source and Build the C++ package, I still confronted some difficulties. This blog records some tips for compiling the MXNet CPP API.

  1. Modify Source Code

    By following the instructions, I could easily compile and get libmxnet. However, when compiling cpp-package, the op.h file could not be generated correctly. In issue #14116, Vigilans provided a solution.

    Here: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/tuple.h#L744

    namespace dmlc {
    /*! \brief description for optional TShape */
    DMLC_DECLARE_TYPE_NAME(optional<mxnet::TShape>, "Shape or None");
    DMLC_DECLARE_TYPE_NAME(optional<mxnet::Tuple<int>>, "Shape or None");
    // avoid low version of MSVC
    #if !defined(_MSC_VER) // <----------- Here !
    template<typename T>
    struct type_name_helper<mxnet::Tuple<T> > {
      static inline std::string value() {
        return "tuple of <" + type_name<T>() + ">";
      }
    };
    #endif
    } // namespace dmlc

    So the specialization of mxnet::Tuple was disabled for Visual Studio in the first place!
    I removed the #if block and recompiled, and then everything worked fine.

  2. Set the Environment Variables

    In my own case, I only needed to set OpenBLAS_HOME and OpenCV_DIR. Both of them can be set with the set command or with -D in the cmake configuration.

  3. Use CMake to generate VS solution

    cmake -G "Visual Studio 14 2015 Win64" -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_BLAS=open -DUSE_LAPACK=0 -DUSE_DIST_KVSTORE=0 -DUSE_CPP_PACKAGE=1 -DCMAKE_INSTALL_PREFIX=install ..

    The above command can be used to generate a solution without GPU support. By modifying the -DUSE_CUDA and -DUSE_CUDNN options, we can generate a solution with GPU support.

  4. Generate op.h

    After generating libmxnet, we should run python OpWrapperGenerator.py libmxnet.dll to generate op.h. Note that libmxnet.dll, libopenblas.dll and libopencv_world.dll should be placed together with OpWrapperGenerator.py.

  5. No mxnet_static.lib

    The cpp example project failed to link to mxnet_static.lib, which was actually named libmxnet.lib. I renamed the static library; I believe the project settings could instead be fixed to cope with this problem.