0%

Reading Note: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

TITLE: Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

AUTHOR: Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li

ASSOCIATION: Chinese Academy of Sciences, The Chinese University of Hong Kong

FROM: IEEE Signal Processing Letters

CONTRIBUTION

  1. A new cascaded CNNs based framework is proposed for joint face detection and alignment, and a lightweight CNN architecture is carefully designed for real time performance.
  2. An effective method is proposed to conduct online hard sample mining to improve the performance.
  3. Extensive experiments are conducted on challenging benchmarks, to show significant performance improvement of the proposed approach compared to the state-of-the-art techniques in both face detection and face alignment tasks.

METHOD

This work is very much like a traditional slide-window based face detection method. A cascade classifier is used to classify an image patch. The first classifier is very simple so that it can efficiently remove easy negative patches. Then more complex classifiers are used to remove harder patches. The last CNN is used as a regressor to localize face landmarks. Along with the classifiers, bounding box is also localized. To handle multiple scales, an image pyramid is built.

The pipeline of this work is shown as follows. The input image is first resized to different scales to build an image pyramid, which is sent to the three-stage cascaded framework.

Framework

Stage 1: A fully convolutional network is proposed, called Proposal Network (P-Net), to obtain the candidate facial windows and their bounding box regression vectors. Then candidates are calibrated based on the estimated bounding box regression vectors. After that, non-maximum suppression (NMS) is utilized to merge highly overlapped candidates.

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further rejects numerous false candidates, performs calibration with bounding box regression, and conducts NMS.

Stage 3: This stage is similar to the second stage, but this stage aims to identify face regions with more supervision. In particular, the network will output five facial landmarks’ positions.

The architecture of the three networks are shown in the following figure.

Architecture

At training stage, the face classification is trained as typical two-class classification task using cross-entropy loss. In each mini-batch, the samples are sorted by their losses and only the top 70% of them are selected to compute gradients. On the other hand, Bounding box regression and facial landmark localization are trained using Euclidean loss.

PERFORMANCE

Face detection performance:

Detection

Face landmark regression performance:

Landmark

IDEAS

  1. As the author provided no training code, I implement the training code based on caffe by adding ignore labels to EuclideanLossLayer and surpporting multiple label to DataLayer.
  2. In this work, samples are assigned as different types. In my own experiments, I assign samples with multiple labels, which means that one sample can be used for different losses at the same time. I’m not sure whether this modification can help improve the performance.
  3. Further experiments are of need to realize the performance of the work.