Reading Note: YOLO9000: Better, Faster, Stronger
TITLE: YOLO9000: Better, Faster, Stronger
AUTHOR: Joseph Redmon, Ali Farhadi
ASSOCIATION: University of Washington, Allen Institute for AI
FROM: arXiv:1612.08242
- several improvements have been made for YOLO.
- a new method is proposed to harness the large amount of classification data and use it to expand the scope of detection systems.
- a joint training algorithm is proposed that trains object detectors on both detection and classification data.
The authors summarize the work as a better, faster, and stronger version of YOLO.
Batch Normalization
Batch Normalization is used in this work. The authors claim that it gives YOLO an improvement of more than 2% in mAP. Even so, based on my own experience with BN, I doubt it would help in real-world applications, and it might even hurt performance.
High Resolution Classifier
Instead of being fine-tuned on 224×224 images, the classification network is fine-tuned on 448×448 images, which helps the network perform better at higher resolution. This high-resolution classification network gives an increase of almost 4% mAP.
Convolutional With Anchor Boxes
YOLOv2 also adopts anchor boxes and a fully convolutional design. This enables YOLO to generate many more boxes, which improves recall from 81% (69.5 mAP) to 88% (69.2 mAP), at the cost of a slight drop in mAP.
Dimension Clusters
Prior works usually define the anchor boxes by hand, for example with aspect ratios of 1:1, 1:2 (2:1), or 1:3 (3:1) in SSD. In this work, the anchor boxes are obtained by clustering. K-means clustering is used with a distance metric based on IOU, which removes the dependence on the absolute size of the boxes: under Euclidean distance, larger boxes generate more error than smaller boxes.
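The clustering idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses the distance d(box, centroid) = 1 - IOU(box, centroid) from the paper, while the function names, the corner-aligned IOU on (width, height) pairs, the mean-based centroid update, and the random initialization are assumptions made for the example.

```python
import random

def iou_wh(box, centroid):
    """IOU of two boxes given as (width, height), aligned at a shared corner."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with the distance d = 1 - IOU (hypothetical helper)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The smallest 1 - IOU distance is the largest IOU.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))  # assumption: mean update
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Because IOU is a ratio of areas, a 10×10 box compared with a 20×20 centroid gets the same distance as a 1×1 box compared with a 2×2 centroid, which is the size independence the note refers to.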
Direct Location Prediction
Instead of predicting unconstrained offsets for the center of the bounding box, YOLO9000 predicts location coordinates relative to the location of the grid cell, which bounds the ground truth to fall between 0 and 1. This constrained location prediction is easier to learn.
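A small sketch of the decoding step may make this concrete. It follows the parameterization given in the paper, where a sigmoid keeps the predicted center inside its grid cell and the width and height scale an anchor (prior) box. The function and argument names are chosen for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box, in units of grid cells.

    (c_x, c_y) is the top-left corner of the grid cell,
    (p_w, p_h) is the size of the anchor (prior) box.
    """
    b_x = sigmoid(t_x) + c_x   # center stays within [c_x, c_x + 1]
    b_y = sigmoid(t_y) + c_y   # center stays within [c_y, c_y + 1]
    b_w = p_w * math.exp(t_w)  # width is a positive rescaling of the prior
    b_h = p_h * math.exp(t_h)  # height is a positive rescaling of the prior
    return b_x, b_y, b_w, b_h
```

For example, with all raw outputs equal to zero, the box is centered in the middle of its cell and keeps the size of the prior.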
Fine-Grained Features
In order to utilize finer-grained features for localizing smaller objects, the authors add a passthrough layer that brings in features from an earlier layer.
This is similar to what has been done in ResNet.
Multi-Scale Training
Data of different resolutions are used to train the network. This regime forces the network to learn to predict well across a variety of input dimensions. This means the same network can predict detections at different resolutions.
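As a rough illustration, the resolution schedule could look like the sketch below. The specifics are taken from the paper rather than from this note: a new input size is drawn every 10 batches from multiples of 32 between 320 and 608, since the network downsamples by a factor of 32. The function names and the seeding are assumptions for the example.

```python
import random

def pick_input_size(rng, min_size=320, max_size=608, stride=32):
    """Draw a square input size that is a multiple of the network stride."""
    return rng.choice(range(min_size, max_size + 1, stride))

def size_schedule(num_batches, every=10, seed=0):
    """Return the input size used for each batch (hypothetical helper)."""
    rng = random.Random(seed)
    sizes = []
    size = pick_input_size(rng)
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = pick_input_size(rng)  # switch resolution every `every` batches
        sizes.append(size)
    return sizes
```

In a real training loop, each batch would be resized to the scheduled size before the forward pass. This works because the network is fully convolutional.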
- Instead of using VGG-16, the YOLO framework uses a custom network, Darknet-19, which has 19 convolutional layers and 5 max-pooling layers.
Hierarchical Classification
Labels are organized into a hierarchical tree, with intermediate nodes added where needed. Each node defines a semantic category at some level of the hierarchy, so images of different objects may share one label when they belong to the same higher-level category.
Joint Classification and Detection
Two datasets are used to train the large-scale detector. One is a traditional classification dataset, which contains a large number of categories. The other is a detection dataset. When a detection image is seen, the loss is backpropagated as normal. When a classification image is seen, only the classification loss is backpropagated, and only at or above the corresponding level of the label.
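The "at or above the level of the label" rule can be sketched as a mask over the nodes of the label tree. The tree below is a toy example, and the helper names are hypothetical. The sketch only shows which nodes would receive classification loss for a given label.

```python
def nodes_at_or_above(label, parent):
    """Walk from the label up to the root and collect the nodes on the path."""
    path = []
    node = label
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path

def classification_loss_mask(label, parent):
    """True for nodes that receive loss, False for nodes below or beside the label."""
    active = set(nodes_at_or_above(label, parent))
    return {node: node in active for node in parent}

# Toy hierarchy (assumption for illustration): child -> parent, root maps to None.
parent = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "hound": "dog",
}
mask = classification_loss_mask("dog", parent)
```

With the label "dog", the nodes "dog", "animal", and "physical object" are active, while "terrier" and "hound" receive no error, since the label says nothing about which kind of dog is shown.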
Reading Note: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
TITLE: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
AUTHOR: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh
FROM: arXiv:1611.08050
- a method for multi-person pose estimation is proposed that approaches the problem in a bottom-up manner to maintain realtime performance and robustness to early commitment, but utilizes global contextual information in the detection of parts and their association.
- Part Affinity Fields (PAFs), a set of 2D vector fields, are presented, each of which encodes the location and orientation of a particular limb at each position in the image domain.
This work is the successor of Convolutional Pose Machines. The network structure, which predicts the part confidence heatmaps and the part affinity fields jointly, is illustrated in the following figure. We can compare it with previous work.
As in previous work, the network follows a sequential, multi-stage prediction scheme. One branch predicts confidence maps for part detection, while the other predicts part affinity fields for part association.
Confidence Maps for Part Detection
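At each location $ \mathbf{P} $, the value of the confidence $ S_{j}^{\ast}(\mathbf{P}) $ for a part type $ j $ is defined as
$$ S_{j}^{\ast}(\mathbf{P}) = \max_{k} S_{j,k}^{\ast}(\mathbf{P}), \qquad S_{j,k}^{\ast}(\mathbf{P}) = \exp\left( -\frac{ \| \mathbf{P} - \mathbf{x}_{j,k} \|_{2}^{2} }{ \sigma^{2} } \right) $$
where $ \mathbf{x}_{j,k} $ is the ground-truth position of part $ j $ for person $ k $ and $ \sigma $ controls the spread of each peak.
This means that for every part type, a heatmap is predicted with multiple peaks, each indicating the presence of a part instance.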
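A ground-truth heatmap of this kind could be generated as in the sketch below. It assumes, following the paper, one Gaussian-shaped peak per part instance, combined across persons with a maximum so that nearby peaks stay distinct. The function name, the list-of-lists layout, and the default sigma are choices made for this example.

```python
import math

def confidence_map(width, height, part_positions, sigma=1.0):
    """Build one heatmap for one part type.

    part_positions holds one (x, y) location per person for this part type.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for (px, py) in part_positions:
                d2 = (x - px) ** 2 + (y - py) ** 2
                peak = math.exp(-d2 / sigma ** 2)
                # Take the max over persons instead of the sum or the mean.
                heatmap[y][x] = max(heatmap[y][x], peak)
    return heatmap

# Two persons whose part of this type sits at (1, 1) and (3, 3).
hm = confidence_map(5, 5, [(1, 1), (3, 3)])
```

Each annotated location gets the value 1.0, and the response decays with distance, which produces the multiple highlighted areas mentioned above.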
Part Affinity Fields for Part Association
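If we consider a single limb, let $ \mathbf{x}_{j_1,k} $ and $ \mathbf{x}_{j_2,k} $ be the positions of body parts $ j_1 $ and $ j_2 $ from the limb class $ c $ for a person $ k $ on the image. $ l_{c,k} = \| \mathbf{x}_{j_2,k} - \mathbf{x}_{j_1,k} \|_{2} $ is the length of the limb, and $ \mathbf{v} = (\mathbf{x}_{j_2,k} - \mathbf{x}_{j_1,k}) / l_{c,k} $ is the unit vector in the direction of the limb. The ideal part affinity vector field, $ \mathbf{L}_{c,k}^{\ast} $, at an image point $ \mathbf{P} $ is defined as
$$ \mathbf{L}_{c,k}^{\ast}(\mathbf{P}) = \begin{cases} \mathbf{v} & \text{if } \mathbf{P} \text{ is on limb } c, k \\ \mathbf{0} & \text{otherwise} \end{cases} $$
Similar to confidence maps for part detection, part affinity fields are also predicted for all persons
$$ \mathbf{L}_{c}^{\ast}(\mathbf{P}) = \frac{1}{n_{c}(\mathbf{P})} \sum_{k} \mathbf{L}_{c,k}^{\ast}(\mathbf{P}) $$
where $ n_{c}(\mathbf{P}) $ is the number of non-zero vectors at point $ \mathbf{P} $. The confidence score of each limb candidate is measured by
$$ E = \int_{u=0}^{u=1} \mathbf{L}_{c}(\mathbf{P}(u)) \cdot \frac{ \mathbf{d}_{j_2} - \mathbf{d}_{j_1} }{ \| \mathbf{d}_{j_2} - \mathbf{d}_{j_1} \|_{2} } \, du, \qquad \mathbf{P}(u) = (1 - u)\, \mathbf{d}_{j_1} + u\, \mathbf{d}_{j_2} $$
where $ \mathbf{d}_{j_1} $ and $ \mathbf{d}_{j_2} $ are two detected body parts.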
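In practice the score of a limb candidate can be approximated by sampling the field along the segment between the two detected parts and averaging the alignment with the segment direction. The sketch below assumes the predicted field is available as a callable returning a 2D vector at a point. That interface, the function name, and the number of samples are assumptions for illustration.

```python
import math

def limb_score(paf, d1, d2, num_samples=10):
    """Approximate the line integral of the field along the segment d1 -> d2.

    paf: callable (x, y) -> (vx, vy), the predicted part affinity field.
    d1, d2: (x, y) locations of the two candidate body parts.
    num_samples: number of evenly spaced sample points (at least 2).
    """
    dx, dy = d2[0] - d1[0], d2[1] - d1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0  # degenerate candidate: both parts at the same location
    ux, uy = dx / length, dy / length  # unit vector from d1 to d2
    total = 0.0
    for i in range(num_samples):
        u = i / (num_samples - 1)
        px = (1 - u) * d1[0] + u * d2[0]
        py = (1 - u) * d1[1] + u * d2[1]
        vx, vy = paf(px, py)
        total += vx * ux + vy * uy  # alignment between field and limb direction
    return total / num_samples

# Toy field that points to the right everywhere.
def field(x, y):
    return (1.0, 0.0)
```

A candidate aligned with the field scores close to 1, a perpendicular one scores close to 0, and a reversed one scores close to -1, so the score favors part pairs that the field actually connects.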
Multi-Person Parsing using PAFs
The last problem is to select, among the candidate limbs scored by the PAFs, those that combine into each person's skeleton. This is a classical generalized maximum clique problem. I think that, in addition to the method proposed in this paper, many other optimization algorithms could be tried. These algorithms are well studied in the multi-object tracking problem.
Reading Note: Convolutional Pose Machines
TITLE: Convolutional Pose Machines
AUTHOR: Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh
FROM: arXiv:1602.00134
- learning implicit spatial models via a sequential composition of convolutional architectures
- a systematic approach to designing and training such an architecture to learn both image features and image-dependent spatial models for structured prediction tasks, without the need for any graphical model style inference.
The following figure shows a comparison of the traditional Pose Machine and the Convolutional Pose Machine.
Pose Machines
A pose machine consists of a sequence of multi-class predictors, $g_{t}(\cdot)$, that are trained to predict the location of each part in each level of the hierarchy. In each stage $t \in \{1 \dots T\}$, the classifiers $g_{t}$ predict beliefs for assigning a location to each part $Y_{p}=z, \forall z \in \mathbb{Z}$, where $\mathbb{Z}$ is the set of all locations in an image.
As illustrated in panels (a) and (b) of the figure, the image is first sent to Stage $1$ and a belief map is predicted. Then the belief map and the image features $x'$ are combined and sent to the following stage. As the procedure repeats, the final result is predicted by the last stage, Stage $T$.
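The data flow between stages can be sketched as a simple loop. The stage functions here are toy stand-ins, not real predictors, and the interface (a first stage that sees only the image, later stages that see the image features together with the previous beliefs) is an assumption made to mirror the description above.

```python
def run_pose_machine(image, context_features, stages):
    """Run the stages in sequence and return the belief map from every stage."""
    first_stage, later_stages = stages[0], stages[1:]
    beliefs = first_stage(image)  # Stage 1 sees only the image
    all_beliefs = [beliefs]
    for stage in later_stages:
        # Stage t combines the image features x' with the beliefs of stage t-1.
        beliefs = stage(context_features, beliefs)
        all_beliefs.append(beliefs)
    return all_beliefs

# Toy stand-ins for the predictors g_1 ... g_T (hypothetical).
toy_stages = [
    lambda image: [v * 0.5 for v in image],
    lambda x_prime, prev: [a + b for a, b in zip(x_prime, prev)],
    lambda x_prime, prev: [a + b for a, b in zip(x_prime, prev)],
]
outputs = run_pose_machine([2.0, 4.0], [1.0, 1.0], toy_stages)
```

Returning the beliefs of every stage, rather than only the last, makes it possible to supervise each stage separately.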
Convolutional Pose Machines
A convolutional neural network is naturally a sequence of stages if multiple losses and predictors are inserted at intermediate layers. Panels (c) and (d) in the figure illustrate a convolutional pose machine. The sub-network in (c) plays the role of the first stage. The shared network at the top-left corner of (d) is used to extract image features $x'$, which are combined with the output of every Stage $t-1$ and sent to Stage $t$. In addition, the receptive field of the stacked convolutional layers grows with depth, which means that more contextual information is taken into consideration, helping refine the output.
When training, every stage has its own loss function for predicting parts. These losses work similarly to the auxiliary classifiers in GoogLeNet, which helps alleviate the vanishing gradient problem. The network can be trained end-to-end. Compared with the traditional pose machine, CPM is much easier to train. The visualization of the network can be found here
Reading Note: Beyond Skip Connections: Top-Down Modulation for Object Detection
TITLE: Beyond Skip Connections: Top-Down Modulation for Object Detection
AUTHOR: Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, Abhinav Gupta
ASSOCIATION: CMU, UC Berkeley, Google Research
FROM: arXiv:1612.06851
In this paper, top-down modulation is proposed as a way to incorporate fine details into the detection framework. The standard bottom-up, feedforward ConvNet is supplemented with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of features.
The idea of this work is very similar to that of Feature Pyramid Networks for Object Detection. An example of a Top-Down Modulation (TDM) network is illustrated in the following figure.
The TDM network is integrated with the bottom-up network through lateral connections. $C_{i}$ are bottom-up, feedforward feature blocks, and $L_{i}$ are the lateral modules, which transform low-level features for the top-down contextual pathway. Finally, $T_{j,i}$ represents the flow of top-down information from index $j$ to $i$.
In this paper, the $T$ blocks are implemented using a single convolutional layer (with non-linear activation), optionally with an upsampling operation. The features from $C$ (processed by $L$) and $T$ are concatenated and then sent to a convolutional layer for combination, as the following figure shows.
During training, one new pair of lateral and top-down modules is added at a time, and training is repeated starting from a pre-trained model.
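The merge step can be illustrated with plain nested lists, where a feature map is a list of channels and each channel is a 2D grid. This is only a shape-level sketch: nearest-neighbor 2x upsampling is an assumption, the helper names are hypothetical, and the convolutions in the $L$ and $T$ modules as well as the final combining convolution are left out.

```python
def upsample2x(channel):
    """Nearest-neighbor 2x upsampling of one 2D channel (assumed upsampling mode)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out

def merge_lateral_topdown(lateral, top_down):
    """Concatenate lateral features with upsampled top-down features along channels."""
    upsampled = [upsample2x(ch) for ch in top_down]
    for ch in upsampled:
        # Spatial sizes must match before channel-wise concatenation.
        assert len(ch) == len(lateral[0]) and len(ch[0]) == len(lateral[0][0])
    return lateral + upsampled

lateral = [[[1, 2], [3, 4]]]  # 1 channel, 2x2 (finer, lower-level features)
top_down = [[[9]]]            # 1 channel, 1x1 (coarser, higher-level features)
merged = merge_lateral_topdown(lateral, top_down)
```

The merged tensor has the channels of both inputs at the finer resolution. In the paper's design a convolutional layer would then combine them.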