
While developing a feature that passes images from the Java side to the native layer, I ran into a strange phenomenon: some pixel values in the image received by the native layer changed randomly on every call. The original code looked roughly like this:

cv::Mat imgbuf2mat(JNIEnv *env, jbyteArray buf, int width, int height, int img_type) {
    jbyte *ptr = env->GetByteArrayElements(buf, 0);
    cv::Mat img(height, width, img_type, (unsigned char *)ptr);
    // BUG: the buffer backing img is released here, before img is ever used
    env->ReleaseByteArrayElements(buf, ptr, 0);
    return img;
}

static void nativeProcessImageBuff(JNIEnv *env, jobject thiz,
                                   jbyteArray img_buff,
                                   jint width,
                                   jint height)
{
    cv::Mat img = imgbuf2mat(env, img_buff, width, height, img_type);
    // do something to the image
    process(img);
}

In short, the image buffer is converted to a cv::Mat and then processed. But even when the exact same image was passed in, the processing result differed on every call. The cause: the cv::Mat constructor used here does not copy the pixel data, it only wraps the pointer, so once ReleaseByteArrayElements returns the buffer to the JVM, img points at memory the JVM is free to reuse. The image must be processed first and the array elements released afterwards. The fixed code looks like this:

cv::Mat imgbuf2mat(JNIEnv *env, jbyte *ptr, int width, int height) {
    cv::Mat img(height, width, img_type, (unsigned char *)ptr);
    return img;
}

static void nativeProcessImageBuff(JNIEnv *env, jobject thiz,
                                   jbyteArray img_buff,
                                   jint width,
                                   jint height)
{
    jbyte *ptr = env->GetByteArrayElements(img_buff, 0);
    cv::Mat img = imgbuf2mat(env, ptr, width, height);
    // do something to the image
    process(img);
    env->ReleaseByteArrayElements(img_buff, ptr, 0);
}

MTCNN Training Notes

I have recently been trying to reproduce MTCNN with Caffe. It has turned out to be full of pitfalls, so I am keeping a record of the training process; I do not have good results yet. Many others online are also trying to train MTCNN, and the general feedback is that TensorFlow gives fairly good results while Caffe is much less promising.

Problems observed so far

  • 2018.09.11: So far only the 12net has been trained, and its recall is low.

    Using blankWorld/MTCNN-Accelerate-Onet as the baseline; blankWorld's test performance on FDDB is shown in the figure below.

    FDDB Result

    The baseline result is quite good, but after generating my own samples and training the 12net, recall drops noticeably. The performance comparison is shown below.

    FDDB 12net Compare

    Leaving aside for now why the 12net test result is so poor, the performance gap between the two models is clear.

Training log

  • 2018.09.11

    Training data generation

    Data was generated following the prepare_data scripts from AITTSMD/MTCNN-Tensorflow. The datasets are summarized below.

    |                | Positive      | Negative      | Part          | Landmark      |
    | -------------- | ------------- | ------------- | ------------- | ------------- |
    | Training Set   | 156728/189530 | 470184/975229 | 156728/547211 | 313456/357604 |
    | Validation Set | 10000         | 10000         | 10000         | 10000         |

    The ratio Pos:Neg:Part:Landmark = 1:3:1:2 follows the original paper. Pos, Neg, and Part samples come from WiderFace, and Landmark samples from CelebA. The positive samples were manually filtered, because many of the positives generated from WiderFace are very low-quality images, with heavily occluded or badly blurred faces. The earlier poor recall came from training on the unfiltered set: since OHEM is used and only the samples whose loss is in the top 70% contribute to the gradient, my feeling is that when low-quality samples dominate, the network learns the wrong features, and the good images may never be learned sufficiently.
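For reference, the pos/neg/part split in this kind of data generation is typically decided by IoU against the ground-truth boxes. Here is a minimal sketch using the thresholds from the MTCNN paper (IoU >= 0.65 pos, 0.4 to 0.65 part, < 0.3 neg); the actual prepare_data scripts may differ in details:

```python
def iou(box, gt):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box[0], gt[0]); y1 = max(box[1], gt[1])
    x2 = min(box[2], gt[2]); y2 = min(box[3], gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / float(area_box + area_gt - inter)

def label_crop(crop, gt_boxes):
    """Assign pos/part/neg to a crop by its best IoU with any ground truth."""
    best = max(iou(crop, gt) for gt in gt_boxes)
    if best >= 0.65:
        return "pos"
    if best >= 0.4:
        return "part"
    if best < 0.3:
        return "neg"
    return None  # ambiguous crops (0.3 <= IoU < 0.4) are discarded
```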

    Training hyper-parameters

    The initial training parameters were:

    type: "Adam"
    momentum: 0.9
    momentum2: 0.999
    delta: 1e-8
    base_lr: 0.01
    weight_decay: 0.0005
    batch_size: 256

    The first round of training was stopped at 75000 iterations (17.5 epochs); the test log is:

    I0911 10:16:25.019253 21722 solver.cpp:347] Iteration 75000, Testing net (#0)
    I0911 10:16:28.057858 21727 data_layer.cpp:89] Restarting data prefetching from start.
    I0911 10:16:28.072748 21722 solver.cpp:414] Test net output #0: cls_Acc = 0.4638
    I0911 10:16:28.072789 21722 solver.cpp:414] Test net output #1: cls_loss = 0.096654 (* 1 = 0.096654 loss)
    I0911 10:16:28.072796 21722 solver.cpp:414] Test net output #2: pts_loss = 0.008529 (* 0.5 = 0.0042645 loss)
    I0911 10:16:28.072801 21722 solver.cpp:414] Test net output #3: roi_loss = 0.0221648 (* 0.5 = 0.0110824 loss)

    Note: the classification accuracy reads 0.4638 because the test set is not shuffled: samples 1-10000 are pos, 10001-20000 are neg, 20001-30000 are part, and 30001-40000 are landmark. Only the pos and neg halves carry classification labels, so the reported value is diluted by half; the actual classification accuracy is 0.4638 x 2 = 0.9276.

    After lowering the learning rate to 0.001, training was stopped at 135000 iterations (31.5 epochs); the test log is:

    I0911 13:14:36.482010 23543 solver.cpp:347] Iteration 135000, Testing net (#0)
    I0911 13:14:39.629933 23660 data_layer.cpp:89] Restarting data prefetching from start.
    I0911 13:14:39.645612 23543 solver.cpp:414] Test net output #0: cls_Acc = 0.4714
    I0911 13:14:39.645649 23543 solver.cpp:414] Test net output #1: cls_loss = 0.0765401 (* 1 = 0.0765401 loss)
    I0911 13:14:39.645656 23543 solver.cpp:414] Test net output #2: pts_loss = 0.00756469 (* 0.5 = 0.00378234 loss)
    I0911 13:14:39.645661 23543 solver.cpp:414] Test net output #3: roi_loss = 0.0201988 (* 0.5 = 0.0100994 loss)

    The actual classification accuracy is 0.9428. Training was stopped again after 260000 iterations; the test log is:

    I0911 16:58:47.514267 28442 solver.cpp:347] Iteration 260000, Testing net (#0)
    I0911 16:58:50.624385 28448 data_layer.cpp:89] Restarting data prefetching from start.
    I0911 16:58:50.639556 28442 solver.cpp:414] Test net output #0: cls_Acc = 0.471876
    I0911 16:58:50.639595 28442 solver.cpp:414] Test net output #1: cls_loss = 0.0750447 (* 1 = 0.0750447 loss)
    I0911 16:58:50.639602 28442 solver.cpp:414] Test net output #2: pts_loss = 0.0074394 (* 0.5 = 0.0037197 loss)
    I0911 16:58:50.639608 28442 solver.cpp:414] Test net output #3: roi_loss = 0.0199694 (* 0.5 = 0.00998469 loss)

    The actual classification accuracy is 0.943752.

    Problem: the training results look acceptable, but recall is low. With the threshold set to 0.3, recall barely reaches 90%, and the threshold has to be dropped to 0.05 to reach 97%-98% recall; the ROC curves are shown below. Strictly speaking this test is not rigorous, since the detector should be run directly on full images. For convenience I drew the ROC from performance on the val set, and the FDDB curve was obtained by cropping out the annotated face regions and classifying them.

    12net 1st ROC

  • 2018.09.14

    Positive and negative samples were mined from WiderFace with the 12net above; the results are:

    | Threshold | Positive | Negative | Part   |
    | --------- | -------- | -------- | ------ |
    | 0.05      | 85210    | 36745286 | 632861 |
    | 0.5       | 66224    | 6299420  | 354350 |
  • 2018.09.17

    Preparing the training samples for the 24net. Because the 12net detects only a limited number of positives, the pos samples for the 24net come from two sources: the positives used to train the 12net, plus filtered positives detected by the 12net. All neg and part samples come from the 12net's hard examples, and the landmark samples are shared with the 12net. After sampling to reach the 1:3:1:2 ratio, the sample counts are:

    |                | Positive | Negative | Part   | Landmark |
    | -------------- | -------- | -------- | ------ | -------- |
    | Training Set   | 225172   | 675516   | 225172 | 313456   |
    | Validation Set | 10000    | 10000    | 10000  | 10000    |

    Training proceeded as for the 12net, with the learning rate decayed from 0.01 to 0.0001. The final test log is:

    I0917 15:19:00.631140 36330 solver.cpp:347] Iteration 70000, Testing net (#0)
    I0917 15:19:03.305665 36335 data_layer.cpp:89] Restarting data prefetching from start.
    I0917 15:19:03.317827 36330 solver.cpp:414] Test net output #0: cls_Acc = 0.481501
    I0917 15:19:03.317865 36330 solver.cpp:414] Test net output #1: cls_loss = 0.0479137 (* 1 = 0.0479137 loss)
    I0917 15:19:03.317874 36330 solver.cpp:414] Test net output #2: pts_loss = 0.00631254 (* 0.5 = 0.00315627 loss)
    I0917 15:19:03.317879 36330 solver.cpp:414] Test net output #3: roi_loss = 0.0179083 (* 0.5 = 0.00895414 loss)

    The actual classification accuracy is 0.963. The ROC curve is shown below, again drawn from val-set performance.

    24net 1st ROC

  • 2018.09.18

    Positive and negative samples were mined from WiderFace with the 24net; the results are:

    | Threshold | Positive | Negative | Part   |
    | --------- | -------- | -------- | ------ |
    | 0.5, 0.5  | 86396    | 83212    | 225285 |

    These detections were used to build the 48net training set. Since the 24net produces a limited number of samples, data from the previous two rounds was combined to form the training set:

    |                | Positive | Negative | Part   | Landmark |
    | -------------- | -------- | -------- | ------ | -------- |
    | Training Set   | 283616   | 850848   | 283616 | 567232   |
    | Validation Set | 10000    | 10000    | 10000  | 10000    |
  • 2018.09.19

    While training the 48net I first tried Adam, which turned out to be very unstable. Switching to SGD improved things. The initial parameters were:

    type: "SGD"
    base_lr: 0.01
    momentum: 0.9
    weight_decay: 0.0005

    The 48net results are mediocre; performance is as follows:

    I0919 18:02:22.318362  3822 solver.cpp:347] Iteration 165000, Testing net (#0)
    I0919 18:02:25.877437 3827 data_layer.cpp:89] Restarting data prefetching from start.
    I0919 18:02:25.894898 3822 solver.cpp:414] Test net output #0: cls_Acc = 0.4662
    I0919 18:02:25.894937 3822 solver.cpp:414] Test net output #1: cls_loss = 0.0917524 (* 1 = 0.0917524 loss)
    I0919 18:02:25.894943 3822 solver.cpp:414] Test net output #2: pts_loss = 0.00566356 (* 1 = 0.00566356 loss)
    I0919 18:02:25.894948 3822 solver.cpp:414] Test net output #3: roi_loss = 0.0177907 (* 0.5 = 0.00889534 loss)

    The actual classification accuracy is 0.9324. Overall this roughly matches the validation-set performance of reference [19] in the paper; the comparison is below.

    | CNN   | 12-net | 24-net | 48-net |
    | ----- | ------ | ------ | ------ |
    | [19]  | 94.4%  | 95.1%  | 93.2%  |
    | MTCNN | 94.6%  | 95.4%  | 95.4%  |
    | Ours  | 94.3%  | 96.3%  | 93.2%  |
  • 2018.09.20

    After wiring up the whole pipeline and testing it, the face boxes jitter badly, which is presumably a problem introduced by the training procedure and the samples.

    A stranger problem: CPU inference in Caffe is extremely slow, with the 12net in particular running roughly 30x slower than expected. Inspecting the parameter distributions showed that a large number of kernels are entirely zero. My initial guess is that this is an interaction between Adam and the ignore label: samples carrying the ignore label produce zero-valued losses, and those losses still influence Adam's update. The exact mechanism needs a proper derivation. The current workaround is to randomly re-initialize the layers containing many all-zero kernels and retrain with SGD. The jitter problem still needs further analysis. The retrained model's performance is:

    |          | 12-net | 24-net | 48-net |
    | -------- | ------ | ------ | ------ |
    | Accuracy | 94.59% | 96.52% | 93.94% |
  • 2018.09.26

    As it stands, training the regressor is a complete failure. Possible causes:

    1. I modified the Euclidean loss layer to support the ignore label, and the code change may be buggy.
    2. The data itself may be wrong; its correctness needs to be verified.

    First, train a regressor without the ignore label and see whether the loss changes.

  • 2018.09.27

    I ran an experiment with a dataset built from 20 images: after training, the network could completely overfit it, so the data and the code are fine. But with more data, the bounding box regression was still poor. After discussing with a more experienced friend, I realized I had made a very basic mistake: a regression problem must not be naively mirror-augmented, must not be naively mirror-augmented, must not be naively mirror-augmented (important things get said three times). If the image is mirrored, the labels must be mirrored along with it.
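For the record, here is a minimal sketch of horizontal-flip augmentation that mirrors the labels along with the pixels; the box format and the landmark ordering (left-eye, right-eye, nose, left-mouth, right-mouth) are illustrative assumptions, not taken from my actual scripts:

```python
import numpy as np

def flip_sample(img, bbox, landmarks):
    """Mirror an image horizontally and mirror its regression labels too.
    img: HxWxC array; bbox: (x1, y1, x2, y2); landmarks: list of (x, y),
    ordered left-eye, right-eye, nose, left-mouth, right-mouth."""
    h, w = img.shape[:2]
    flipped = img[:, ::-1]                      # mirror the pixels
    x1, y1, x2, y2 = bbox
    bbox_f = (w - 1 - x2, y1, w - 1 - x1, y2)   # mirror the box x-coordinates
    pts = [(w - 1 - x, y) for x, y in landmarks]
    # semantically left/right points swap identity after mirroring
    pts[0], pts[1] = pts[1], pts[0]             # eyes
    pts[3], pts[4] = pts[4], pts[3]             # mouth corners
    return flipped, bbox_f, pts
```

Forgetting either the coordinate mirroring or the left/right identity swap silently corrupts every flipped regression target, which matches the symptom above: the network fits a tiny set but never generalizes.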

    The retrained model's performance is:

    |          | 12-net | 24-net | 48-net |
    | -------- | ------ | ------ | ------ |
    | Accuracy | 94.35% | 97.45% | 94.67% |
  • 2018.10.01

    The experiments are still not going smoothly, which is a bit discouraging. So far the 12net and 24net train relatively well: at least they remove a large number of false detections and produce reasonably accurate regressed boxes. The 48net training is a failure, with false detections, inaccurate box localization, and inaccurate landmark localization.

    When training the 12net and 24net, since the networks are small and neither is responsible for outputting landmarks, only the classification and box regression losses were trained. For the 48net all three losses matter, and my feeling is that using the loss weights from the paper leaves the regression tasks under-trained, so the regression loss weights need to be increased.

    Next steps:

    1. To validate ideas quickly, the earlier training did not strictly follow the principle that each stage's data should be the hard examples of the previous stage; this pipeline needs to be rerun properly.
    2. Explore the 48net training. Concretely: first, the hard example mining just mentioned; second, tuning the weights of the individual losses; finally, besides inter-stage hard example mining, intra-stage hard example mining also deserves attention.
    3. Run the 24net on the landmark dataset and use its output as the input for the 48net's landmark regression.
  • 2018.10.13

    The log of the most recent training run:

    12net

    The 12net samples come from random sampling.

    |                | Positive | Negative | Part   |
    | -------------- | -------- | -------- | ------ |
    | Training Set   | 156728   | 470184   | 156728 |
    | Validation Set | 10000    | 10000    | 10000  |

    Performance on the validation set:

    Test net output #0: cls_Acc = 0.9435
    Test net output #1: cls_loss = 0.0747717 (* 1 = 0.0747717 loss)
    Test net output #2: roi_loss = 0.0168385 (* 0.5 = 0.00841924 loss)

    24net

    All 24net samples come from the 12net's detections.

    |                | Positive | Negative | Part   |
    | -------------- | -------- | -------- | ------ |
    | Training Set   | 60149    | 180447   | 120298 |
    | Validation Set | 1500     | 1500     | 1500   |

    Performance on the validation set:

    Test net output #0: cls_Acc = 0.977588
    Test net output #1: cls_loss = 0.0648633 (* 1 = 0.0648633 loss)
    Test net output #2: roi_loss = 0.0192365 (* 5 = 0.0961826 loss)

    48net

    The 48net positives and part samples come from the 24net's detections on WiderFace, the negatives from the 24net's detections on WiderFace and CelebA, and the landmark samples from the 24net's detections on CelebA.

    |                | Positive | Negative | Part   | Landmark |
    | -------------- | -------- | -------- | ------ | -------- |
    | Training Set   | 242862   | 728586   | 242862 | 485724   |
    | Validation Set | 5000     | 5000     | 5000   | 5000     |

    Performance on the validation set:

    Test net output #0: cls_Acc = 0.978155
    Test net output #1: cls_loss = 0.0694968 (* 1 = 0.0694968 loss)
    Test net output #2: pts_loss = 0.00119616 (* 5 = 0.00598078 loss)
    Test net output #3: roi_loss = 0.0111277 (* 1 = 0.0111277 loss)

    With 0.5, 0.5, 0.5 as the thresholds, the discROC curve measured on FDDB is shown below.

    ROC 20181013

TITLE: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

AUTHOR: Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Yu Qiao

ASSOCIATION: Chinese Academy of Sciences, The Chinese University of Hong Kong

FROM: IEEE Signal Processing Letters

CONTRIBUTION

  1. A new cascaded CNNs based framework is proposed for joint face detection and alignment, and a lightweight CNN architecture is carefully designed for real time performance.
  2. An effective method is proposed to conduct online hard sample mining to improve the performance.
  3. Extensive experiments are conducted on challenging benchmarks, to show significant performance improvement of the proposed approach compared to the state-of-the-art techniques in both face detection and face alignment tasks.

METHOD

This work is much like a traditional sliding-window face detection method. A cascade of classifiers is used to classify each image patch: the first classifier is very simple so that it can efficiently remove easy negative patches, and progressively more complex classifiers remove harder ones. The last CNN also acts as a regressor to localize face landmarks. Alongside the classifiers, bounding boxes are regressed as well. To handle multiple scales, an image pyramid is built.

The pipeline of this work is shown as follows. The input image is first resized to different scales to build an image pyramid, which is sent to the three-stage cascaded framework.
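As a rough illustration, the pyramid scales are usually chosen so that the smallest face of interest maps onto P-Net's 12x12 input. The sketch below assumes a minimum face size of 20 pixels and the commonly used scale factor 0.709 (about 1/sqrt(2), halving image area per level); neither value is stated in this summary:

```python
def pyramid_scales(width, height, min_face=20, net_input=12, factor=0.709):
    """Scales at which a min_face-sized face maps to the net_input-sized
    P-Net receptive field. factor=0.709 halves the image area per level."""
    scales = []
    scale = net_input / float(min_face)
    min_side = min(width, height) * scale
    while min_side >= net_input:        # stop once the image is smaller than the net input
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```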

Framework

Stage 1: A fully convolutional network is proposed, called Proposal Network (P-Net), to obtain the candidate facial windows and their bounding box regression vectors. Then candidates are calibrated based on the estimated bounding box regression vectors. After that, non-maximum suppression (NMS) is utilized to merge highly overlapped candidates.
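The greedy NMS used to merge candidates can be sketched as follows (the 0.7 IoU threshold here is an illustrative default, not a value quoted from the paper):

```python
import numpy as np

def nms(boxes, scores, thresh=0.7):
    """Greedy non-maximum suppression. boxes: Nx4 (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # overlap of the top-scoring box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= thresh]           # drop boxes overlapping too much
    return keep
```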

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects numerous false candidates, performs calibration with bounding box regression, and conducts NMS.

Stage 3: This stage is similar to the second stage, but this stage aims to identify face regions with more supervision. In particular, the network will output five facial landmarks’ positions.

The architectures of the three networks are shown in the following figure.

Architecture

At the training stage, face classification is trained as a typical two-class classification task using cross-entropy loss. In each mini-batch, the samples are sorted by their losses and only the top 70% are selected to compute gradients. Bounding box regression and facial landmark localization, on the other hand, are trained using Euclidean loss.
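The top-70% selection can be sketched as a boolean mask over the per-sample losses of a mini-batch (a minimal illustration, not the paper's actual Caffe implementation):

```python
import numpy as np

def ohem_mask(losses, keep_ratio=0.7):
    """Online hard example mining: keep the hardest keep_ratio fraction of a
    mini-batch by loss; only kept samples contribute to the gradient."""
    losses = np.asarray(losses)
    n_keep = max(1, int(len(losses) * keep_ratio))
    hardest = np.argsort(losses)[::-1][:n_keep]   # indices of the largest losses
    mask = np.zeros(len(losses), dtype=bool)
    mask[hardest] = True
    return mask
```

In training, the loss would then be averaged only over `losses[mask]`, so easy samples neither contribute gradient nor dilute the signal from hard ones.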

PERFORMANCE

Face detection performance:

Detection

Face landmark regression performance:

Landmark

IDEAS

  1. As the author provided no training code, I implemented the training code based on Caffe by adding ignore labels to EuclideanLossLayer and supporting multiple labels in DataLayer.
  2. In this work, samples are assigned as different types. In my own experiments, I assign samples with multiple labels, which means that one sample can be used for different losses at the same time. I’m not sure whether this modification can help improve the performance.
  3. Further experiments are needed to understand the performance of this work.

Jupyter Notebook has two keyboard input modes. In edit mode you type code or text into a cell, and the cell border is green. In command mode keystrokes are commands that operate on the notebook, and the cell border is grey.

  1. Command mode (press Esc to enter)
  • Enter : enter edit mode
  • Shift-Enter : run cell, select cell below
  • Ctrl-Enter : run cell
  • Alt-Enter : run cell, insert new cell below
  • Y : change cell to code
  • M : change cell to Markdown
  • R : change cell to raw
  • 1 : change cell to heading 1
  • 2 : change cell to heading 2
  • 3 : change cell to heading 3
  • 4 : change cell to heading 4
  • 5 : change cell to heading 5
  • 6 : change cell to heading 6
  • Up : select cell above
  • K : select cell above
  • Down : select cell below
  • J : select cell below
  • Shift-K : extend selection above
  • Shift-J : extend selection below
  • A : insert cell above
  • B : insert cell below
  • X : cut selected cell
  • C : copy selected cell
  • Shift-V : paste cell above
  • V : paste cell below
  • Z : undo last cell deletion
  • D,D : delete selected cell
  • Shift-M : merge selected cells
  • Ctrl-S : save notebook
  • S : save notebook
  • L : toggle line numbers
  • O : toggle output
  • Shift-O : toggle output scrolling
  • Esc : close pager
  • Q : close pager
  • H : show keyboard shortcut help
  • I,I : interrupt kernel
  • 0,0 : restart kernel
  • Shift : ignore
  • Shift-Space : scroll up
  • Space : scroll down
  2. Edit mode (press Enter to enter)
  • Tab : code completion or indent
  • Shift-Tab : tooltip
  • Ctrl-] : indent
  • Ctrl-[ : dedent
  • Ctrl-A : select all
  • Ctrl-Z : undo
  • Ctrl-Shift-Z : redo
  • Ctrl-Y : redo
  • Ctrl-Home : go to cell start
  • Ctrl-Up : go to cell start
  • Ctrl-End : go to cell end
  • Ctrl-Down : go to cell end
  • Ctrl-Left : go one word left
  • Ctrl-Right : go one word right
  • Ctrl-Backspace : delete word before cursor
  • Ctrl-Delete : delete word after cursor
  • Esc : enter command mode
  • Ctrl-M : enter command mode
  • Shift-Enter : run cell, select cell below
  • Ctrl-Enter : run cell
  • Alt-Enter : run cell, insert cell below
  • Ctrl-Shift-Minus : split cell
  • Ctrl-Shift-Subtract : split cell
  • Ctrl-S : save notebook
  • Shift : ignore
  • Up : move cursor up, or to previous cell
  • Down : move cursor down, or to next cell

TITLE: DetNet: A Backbone network for Object Detection

AUTHOR: Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun

ASSOCIATION: Tsinghua University, Face++

FROM: arXiv:1804.06215

CONTRIBUTION

  1. The inherent drawbacks of traditional ImageNet pre-trained models for fine-tuning recent object detectors are analyzed.
  2. A novel backbone, called DetNet, is proposed, which is specifically designed for object detection task by maintaining the spatial resolution and enlarging the receptive field.

METHOD

Motivation

There are two problems with using a classification backbone for object detection tasks. (i) Recent detectors, e.g. FPN, involve extra stages compared with the backbone network used for ImageNet classification, in order to detect objects of various sizes. (ii) Traditional backbones obtain a large receptive field through a large downsampling factor, which benefits visual classification; however, the compromised spatial resolution makes it hard to accurately localize large objects and to recognize small objects.

To summarize, there are three main problems with using current pre-trained models:

  1. The number of network stages is different. The extra layers used for object detection, compared to classification, have not been pre-trained.
  2. Weak visibility of large objects: the feature maps with strong semantic information have large strides with respect to the input image, which is harmful for object localization.
  3. Invisibility of small objects: the information from small objects is easily weakened as the spatial resolution of the feature maps decreases and large context information is integrated.

To address these problems, DetNet has the following characteristics. (i) The number of stages is designed directly for object detection. (ii) Even though more stages are involved, a high spatial resolution is maintained in the feature maps, while a large receptive field is kept using dilated convolution.

DetNet Design

The main architecture of DetNet is based on ResNet-50, whose first 4 stages are kept unchanged. The main differences are as follows:

  1. The extra stages are merged into the backbone which will be later utilized for object detection as in FPN. Meanwhile, the spatial resolution is fixed as 16x downsampling even after stage 4.
  2. Since the spatial size is fixed after stage 4, in order to introduce a new stage, a dilated bottleneck with a $1 \times 1$ convolution projection is used at the beginning of each stage. Dilated convolution efficiently enlarges the receptive field.
  3. Since dilated convolution is still time consuming, stage 5 and stage 6 keep the same channels as stage 4 (256 input channels for bottleneck block). This is different from traditional backbone design, which will double channels in a later stage.

The following figure shows the dilated bottleneck with the $1 \times 1$ conv projection and the architecture of DetNet.

Framework

PERFORMANCE

Performance

TITLE: Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks

AUTHOR: Xuepeng Shi, Shiguang Shan, Meina Kan, Shuzhe Wu, Xilin Chen

ASSOCIATION: Chinese Academy of Sciences

FROM: arXiv:1804.06039

CONTRIBUTION

  1. A real-time and accurate rotation-invariant face detector with progressive calibration networks (PCN) is proposed.
  2. PCN divides the calibration process into several progressive steps, each of which is an easy task, resulting in accurate calibration with low time cost. The range of full rotation-in-plane (RIP) angles decreases gradually, which helps distinguish faces from non-faces.
  3. In the first two stages of PCN, only coarse calibrations are conducted, such as calibrations from facing down to facing up, and from facing left to facing right. On the one hand, a robust and accurate RIP angle prediction for this coarse calibration is easier to attain without extra time cost, by jointly learning calibration task with the classification task and bounding box regression task in a multi-task learning manner. On the other hand, the calibration can be easier to implement as flipping original image with quite low time cost.

METHOD

Framework

Given an image, all face candidates are obtained according to the sliding window and image pyramid principle, and each candidate window goes through the detector stage by stage. In each stage of PCN, the detector simultaneously rejects most candidates with low face confidences, regresses the bounding boxes of remaining face candidates, and calibrates the RIP orientations of the face candidates. After each stage, non-maximum suppression (NMS) is used to merge those highly overlapped candidates.

PCN progressively calibrates the RIP orientation of each face candidate to upright for better distinguishing faces from non-faces.

  1. PCN-1 first identifies face candidates and calibrates those facing down to facing up, halving the range of RIP angles from [$-180^{\circ}$,$180^{\circ}$] to [$-90^{\circ}$, $90^{\circ}$].
  2. Then the rotated face candidates are further distinguished and calibrated to an upright range of [$-45^{\circ}$, $45^{\circ}$] in PCN-2, shrinking the RIP ranges by half again.
  3. Finally, PCN-3 makes the accurate final decision for each face candidate: whether it is a face, and its precise RIP angle.
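The progressive narrowing above can be sketched as accumulating the three stages' predictions into one RIP angle. The sign conventions and the stage output encodings below are illustrative assumptions, not the paper's exact formulation:

```python
def final_rip_angle(stage1_up, stage2_dir, stage3_angle):
    """Accumulate the RIP angle predicted across PCN's three stages.
    stage1_up: True if PCN-1 decides the face is facing up (else it was
               calibrated by a 180-degree rotation).
    stage2_dir: 'left', 'front', or 'right' from PCN-2's ternary output.
    stage3_angle: fine residual angle in [-45, 45] regressed by PCN-3."""
    angle = 0 if stage1_up else 180
    angle += {"left": -90, "front": 0, "right": 90}[stage2_dir]
    angle += stage3_angle
    return angle
```

The point is that each stage only needs to resolve a coarse, easy decision, and the fine regression in PCN-3 operates on a range already reduced to a quarter turn.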

The following figure illustrates the framework.

Framework

First Stage PCN-1

For each input window $x$, PCN-1 has three objectives: face or non-face classification, bounding box regression, and calibration, formulated as follows:

$$[f, t, g] = F_{1}(x)$$

where $F_{1}$ is the detector in the first stage structured with a small CNN. The $f$ is face confidence score, $t$ is a vector representing the prediction of bounding box regression, and $g$ is orientation score. Overall, the objective for PCN-1 in the first stage is defined as:

$$\min L = L_{cls} +\lambda_{reg} \cdot L_{reg} + \lambda_{cal} \cdot L_{cal}$$

where $\lambda_{reg}$ and $\lambda_{cal}$ are parameters that balance the different losses. The first objective, which is also the primary one, distinguishes faces from non-faces. The second objective regresses the fine bounding box. The third objective predicts the coarse orientation of the face candidate as a binary classification, telling whether the candidate is facing up or facing down.

The PCN-1 can be used to filter all windows to get a small number of face candidates. For the remaining face candidates, firstly they are updated to the new regressed bounding boxes. Then the updated face candidates are rotated according to the predicted coarse RIP angles.

Second Stage PCN-2

Similar to PCN-1 in the first stage, PCN-2 in the second stage further distinguishes faces from non-faces more accurately, regresses the bounding boxes, and calibrates the face candidates. The difference is that the coarse orientation prediction in this stage is a ternary classification of the RIP angle range, telling whether the candidate is facing left, right, or front.

Third Stage PCN-3

After the second stage, all the face candidates are calibrated to an upright quarter of RIP angle range, i.e. [$-45^{\circ}$,$45^{\circ}$]. Therefore, the PCN-3 in the third stage can easily and accurately determine whether it is a face and regress the bounding box. Since the RIP angle has been reduced to a small range in previous stages, PCN-3 attempts to directly regress the precise RIP angles of face candidates instead of coarse orientations.

Accurate and Fast Calibration

The early stages only predict coarse RIP orientations, which is robust to the large diversity and further benefits the prediction of successive stages.

The calibration based on the coarse RIP prediction can be achieved efficiently by flipping the original image three times, which brings almost no additional time cost: rotating the original image by $-90^{\circ}$, $90^{\circ}$ and $180^{\circ}$ yields image-left, image-right, and image-down. Windows with RIP angles of $0^{\circ}$, $-90^{\circ}$, $90^{\circ}$ and $180^{\circ}$ can then be cropped from the original image, image-left, image-right, and image-down respectively, as the following figure shows.
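With numpy as a stand-in, the three derived images are plain flips and 90-degree rotations of the original, so no interpolation or resampling is involved (`np.rot90` itself is implemented as a transpose plus an axis flip):

```python
import numpy as np

img = np.arange(12).reshape(3, 4)   # stand-in for the original image
image_left = np.rot90(img, 1)       # rotate by 90 degrees counter-clockwise
image_right = np.rot90(img, -1)     # rotate by -90 degrees (90 degrees clockwise)
image_down = img[::-1, ::-1]        # rotate by 180 degrees = flip both axes
```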

Calibration

CNN Architecture

CNN Architecture

PERFORMANCE

Performance

TITLE: Pelee: A Real-Time Object Detection System on Mobile Devices

AUTHOR: Robert J. Wang, Xiang Li, Shuang Ao, Charles X. Ling

ASSOCIATION: University of Western Ontario

FROM: arXiv:1804.06882

CONTRIBUTION

  1. A variant of DenseNet architecture called PeleeNet for mobile devices is proposed.
  2. The network architecture of the Single Shot MultiBox Detector (SSD) is optimized for speed and then combined with PeleeNet.

METHOD

BUILDING BLOCKS

Two-Way Dense Layer. A 2-way dense layer is used to get different scales of receptive fields. One branch uses a small kernel size (3x3) to capture small-size objects. The other branch stacks two 3x3 convolution layers for larger objects. The structure is shown in the following figure.

Two-Way Dense Layer

Stem Block. This block is placed before the first dense layer for the sake of cost efficiency. This stem block can effectively improve the feature expression ability without adding computational cost too much. The structure is shown as follows.

Stem Block

Dynamic Number of Channels in Bottleneck Layer. The number of channels in the bottleneck layer varies according to the input shape to make sure the number of output channels does not exceed the number of its input channels.

Transition Layer without Compression. Experiments show that the compression factor proposed by DenseNet hurts the feature expression, so the number of output channels is kept the same as the number of input channels in transition layers.

Composite Function. The post-activation (Convolution - Batch Normalization - Relu) is used for speed acceleration. For post-activation, all batch normalization layers can be merged with convolution layer at the inference stage. To compensate for the negative impact on accuracy caused by this change, a shallow and wide network structure is designed. In addition, a 1x1 convolution layer is added to the last dense block to get a stronger representational ability.
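The BN-into-convolution merge works because BatchNorm at inference time is a fixed affine map per channel, which can be folded into the preceding conv's weights and bias. A minimal sketch (the tensor layout and names are assumptions for illustration):

```python
import numpy as np

def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm into the preceding convolution.
    weight: (out_ch, in_ch, kh, kw) conv kernels; bias: (out_ch,).
    BN computes gamma * (x - mean) / sqrt(var + eps) + beta per channel."""
    scale = gamma / np.sqrt(var + eps)            # per-output-channel factor
    w_folded = weight * scale[:, None, None, None]
    b_folded = (bias - mean) * scale + beta
    return w_folded, b_folded
```

This is why post-activation (Conv-BN-ReLU) ordering helps speed: the BN sits directly after the conv and disappears entirely at inference, whereas pre-activation orderings leave a standalone BN in the graph.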

ARCHITECTURE

The framework of the work is illustrated in the following table.

PeleeNet Architecture

OPTIMIZATION FOR SSD

Feature Map Selection. 5 scale feature maps (19x19, 10x10, 5x5, 3x3, and 1x1) are selected. Larger resolution features are discarded for speed acceleration.

Residual Prediction Block. For each feature map used for detection, a residual block (ResBlock) is constructed before conducting prediction, shown in the following figure.

PeleeNet SSD

PERFORMANCE

The classification performance on ILSVRC2012 is shown in the following table.

ILSVRC2012

The detection performance on VOC2007 is shown in the following table.

VOC2007

The detection performance on COCO2015 is shown in the following table.

COCO

SOME IDEAS

From my own experience, DW convolution is not pruning-friendly, so recent pruning methods such as ThiNet and Net-Trim work poorly on DW convolution. This work uses conventional convolutional layers, so those pruning methods may be able to play a role.

I have recently been packaging a dynamic library that has to support the ancient Windows XP, while my development machine runs Windows 10 with Visual Studio 2013 as the IDE.

After a round of googling and Baidu-ing, I adopted the most widely cited approach, which consists of two steps:

  1. In the project settings, under Configuration Properties -> General -> Platform Toolset, select Visual Studio 2013 - Windows XP (v120_xp).
  2. Under Configuration Properties -> C/C++ -> Code Generation -> Runtime Library, select MT or MTd, for release and debug builds respectively.

At first this seemed fine, but after I wrote an interface function, the release build worked while in the debug build the function calling that interface kept crashing during stack unwinding. The error was:

Error

Since I am really not familiar with this area, I tried a last-ditch fix: I changed every MT/MTd to MD/MDd and rebuilt all the dependencies along with my own library. Installed on the target test machine, it actually worked.

Some explanations I found online:

Debug Assertion Failed! Expression: __acrt_first_block == header

As this is a DLL, the problem might lie in different heaps used for allocation and deallocation (try to build the library statically and check if that will work).

The problem is, that DLLs and templates do not agree together very well. In general, depending on the linkage of the MSVC runtime, it might be problem if the memory is allocated in the executable and deallocated in the DLL and vice versa (because they might have different heaps). And that can happen with templates very easily, for example: you push_back() to the vector inside the removeWhiteSpaces() in the DLL, so the vector memory is allocated inside the DLL. Then you use the output vector in the executable and once it gets out of scope, it is deallocated, but inside the executable whose heap doesn’t know anything about the heap it has been allocated from. Bang, you’re dead.

This can be worked-around if both DLL and the executable use the same heap. To ensure this, both the DLL and the executable must use the dynamic MSVC runtime - so make sure, that both link to the runtime dynamically, not statically. In particular, the exe should be compiled and linked with /MD[d] and the library with /LD[d] or /MD[d] as well, neither one with /MT[d]. Note that afterwards the computer which will be running the app will need the MSVC runtime library to run (for example, by installing “Visual C++ Redistributable” for the particular MSVC version).

You could get that work even with /MT, but that is more difficult - you would need to provide some interface which will allow the objects allocated in the DLL to be deallocated there as well. For example something like:

__declspec(dllexport) void deallocVector(std::vector<std::string> &x);

void deallocVector(std::vector<std::string> &x) {
    std::vector<std::string> tmp;
    x.swap(tmp);  // swap the contents into a local, so the DLL's heap frees them
}

(however this does not work very well in all cases, as this needs to be called explicitly so it will not be called e.g. in case of exception - to solve this properly, you would need to provide some interface from the DLL, which will cover the vector under the hood and will take care about the proper RAII)

EDIT: the final solution was actually to have all of the projects (the exe, the dll and the entire googleTest project) built as Multi-threaded Debug DLL (/MDd) (the GoogleTest projects are built as Multi-threaded Debug (/MTd) by default).

Honestly, my understanding of how computers work under the hood is badly lacking. When I hit a slightly specialized problem, all I can do is try methods found online: if one works, I stop digging; if it doesn't, I have no idea why, and can only go try another one. :-(