
TITLE: Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

AUTHOR: Kendall, Alex and Badrinarayanan, Vijay and Cipolla, Roberto

FROM: arXiv:1511.02680


  1. Extends deep convolutional encoder-decoder architectures to Bayesian convolutional neural networks that produce a probabilistic output.
  2. Bayesian SegNet outputs a measure of model uncertainty, which can be used as a confidence estimate for the segmentation.


The first half of the network (the encoder) is a conventional convolutional neural network (VGG-16 in this work). The second half (the decoder) roughly mirrors the first, applying upsampling layers to restore the output to the size of the input. The network is trained end-to-end. The probabilistic output is obtained from Monte Carlo samples of the model, with dropout kept active at test time.


  1. For each pixel, a softmax classifier predicts the class label.
  2. At test time, several stochastic forward passes with dropout enabled are run to simulate Monte Carlo sampling. The mean of the softmax outputs gives the class prediction, and their variance gives the uncertainty.
  3. Model uncertainty is high in three situations: 1) at boundaries between classes, 2) for objects that are difficult to identify because of occlusion or distance, and 3) for visually ambiguous classes such as dogs and cats, or chairs and tables.
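
The test-time sampling described above can be sketched in a few lines of NumPy. This is only a toy illustration, not the authors' implementation: the one-layer per-pixel classifier, the names `features` and `weights`, and the choice to average the variance over classes are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mc_dropout_predict(features, weights, n_samples=20, drop_prob=0.5, seed=0):
    """Run n_samples stochastic forward passes of a toy per-pixel classifier.

    features: (num_pixels, num_features), hypothetical decoder output.
    weights:  (num_features, num_classes), hypothetical classifier weights.
    Returns (labels, mean_probs, uncertainty).
    """
    rng = np.random.default_rng(seed)
    keep_prob = 1.0 - drop_prob
    samples = []
    for _ in range(n_samples):
        # Dropout stays active at test time: draw a fresh mask for every pass.
        mask = rng.random(features.shape) < keep_prob
        dropped = features * mask / keep_prob
        samples.append(softmax(dropped @ weights))
    samples = np.stack(samples)  # (n_samples, num_pixels, num_classes)
    mean_probs = samples.mean(axis=0)
    labels = mean_probs.argmax(axis=-1)
    # Assumption: summarize uncertainty as the per-class variance averaged over classes.
    uncertainty = samples.var(axis=0).mean(axis=-1)
    return labels, mean_probs, uncertainty
```

Each pass uses a different dropout mask, so pixels whose prediction flips between passes end up with a larger variance than pixels the model is sure about.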


  1. Monte Carlo sampling with dropout outperforms weight averaging after approximately 6 samples.
  2. The absence of fully connected layers makes the network easier to train.
  3. The network could run in real time if the samples are computed in parallel.
  4. It does not need to be applied in a sliding-window fashion, which contributes to its speed.


  1. Applying Bayesian weights to the lower layers does not improve performance, because low-level features are consistent across the distribution of models.
  2. Higher level features, such as shape and contextual relationships, are more effectively modeled with Bayesian weights.
  3. During training, dropout samples from a large number of thinned networks with reduced width. At test time, standard dropout approximates the effect of averaging the predictions of all these thinned networks by using the scaled weights of the unthinned network.
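
To see what "weight averaging" means here, the sketch below compares the exact average over every thinned network with a single pass through the unthinned network whose weights are scaled by the keep probability. It assumes a single linear unit, for which the two agree exactly; with nonlinear layers the equality no longer holds in general, and weight averaging is only an approximation. The function names are made up for this example.

```python
import itertools
import numpy as np

def average_over_thinned_networks(x, w, keep_prob):
    """Exact expected output of a linear unit over all dropout masks on its inputs."""
    n = x.shape[0]
    expected = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        mask = np.array(mask)
        kept = mask.sum()
        # Probability of this particular thinned network being sampled.
        prob = keep_prob ** kept * (1.0 - keep_prob) ** (n - kept)
        expected += prob * float((x * mask) @ w)
    return expected

def weight_averaging(x, w, keep_prob):
    """One pass through the unthinned unit with weights scaled by keep_prob."""
    return float(x @ (keep_prob * w))

x = np.array([1.0, -2.0, 0.5, 3.0])
w = np.array([0.2, 0.4, -1.0, 0.3])
print(average_over_thinned_networks(x, w, 0.5), weight_averaging(x, w, 0.5))
```

Enumerating all masks is only feasible for tiny inputs (2^n networks), which is exactly why practical networks rely on either the scaled-weight shortcut or a handful of Monte Carlo samples.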

The authors provide an online demo and the source code.

TITLE: Two-Stream Convolutional Networks for Action Recognition in Videos

AUTHOR: Simonyan, Karen and Zisserman, Andrew



  1. A two-stream ConvNet that combines a spatial and a temporal network.
  2. A ConvNet trained on multi-frame dense optical flow achieves good performance despite the small training dataset.
  3. A multi-task training procedure improves performance across different datasets.


Two-stream architecture convolutional network:

  1. Spatial stream ConvNet: takes a still frame as input and performs action recognition on this single frame.
  2. Temporal stream ConvNet: takes as input a 2L-channel stack of optical flow (or trajectory) fields corresponding to the still frame, i.e. the horizontal and vertical components of L consecutive flow fields, and performs action recognition on this multi-channel input.
  3. The outputs of the two streams are concatenated into a feature vector, on which an SVM classifier is trained to fuse them.
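
As a rough illustration of the temporal stream's input, the sketch below stacks L flow fields into a single 2L-channel array. The interleaved channel order (horizontal, then vertical, for each flow field) and the function name are assumptions made for the example.

```python
import numpy as np

def stack_optical_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) input.

    Channel 2k holds the horizontal component of flow field k,
    channel 2k + 1 holds its vertical component.
    """
    flows = np.asarray(flows)  # (L, H, W, 2)
    num_flows, height, width, _ = flows.shape
    # Move the L axis next to the component axis, then merge the two.
    return flows.transpose(1, 2, 0, 3).reshape(height, width, 2 * num_flows)

# Example: L = 10 flow fields for a 224 x 224 frame give a 20-channel input.
stacked = stack_optical_flow(np.zeros((10, 224, 224, 2)))
print(stacked.shape)
```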


  1. Mean flow subtraction is used to compensate for displacements caused by camera movement.
  2. At test time, 25 frames (time points) are sampled, and their corresponding 2L-channel stacks are fed to the network. In addition, 5 patches and their flips are extracted from each frame in the spatial domain.
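
Both tricks are easy to sketch with NumPy. The code below is a minimal illustration under stated assumptions: the 5 patches are taken to be the four corners plus the center, and the flip is a plain left-right flip of the array. The helper names are made up for the example.

```python
import numpy as np

def subtract_mean_flow(flow):
    """Subtract the mean displacement vector from an (H, W, 2) flow field."""
    return flow - flow.mean(axis=(0, 1), keepdims=True)

def ten_crops(frame, crop_size):
    """Return 5 crops (assumed: four corners and center) followed by their flips."""
    height, width = frame.shape[:2]
    top_lefts = [
        (0, 0),
        (0, width - crop_size),
        (height - crop_size, 0),
        (height - crop_size, width - crop_size),
        ((height - crop_size) // 2, (width - crop_size) // 2),
    ]
    crops = [frame[r:r + crop_size, c:c + crop_size] for r, c in top_lefts]
    flips = [crop[:, ::-1] for crop in crops]  # left-right flip of each crop
    return crops + flips
```

With 25 sampled frames and 10 crops each, a video is scored from 250 network inputs, and the class scores are then averaged.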


  1. Mimics the biological structure of the human visual cortex.
  2. Competitive with state-of-the-art representations despite the small size of the training dataset.
  3. Convolutional filters learned by a CNN can generalize hand-crafted features.


  1. Cannot localize actions in either the spatial or the temporal domain.

TITLE: Joint Tracking and Segmentation of Multiple Targets

AUTHOR: Milan, Anton and Leal-Taixe, Laura and Schindler, Konrad and Reid, Ian



  1. A new CRF model that takes advantage of both high-level detector responses and low-level superpixel information.
  2. Fully automated segmentation and tracking of an unknown number of targets.
  3. A complete state representation at every time step, which makes it possible to handle occlusions.


  1. Generate an overcomplete set of trajectory hypotheses.
  2. Solve the data association problem by optimizing an objective function defined by a multi-label conditional random field (CRF).


The goal is to find the most probable labeling \(\mathbf{v}^*\) for all nodes given the observations, which is equivalent to minimizing an energy

\[ \mathbf{v}^* = \arg\min_{\mathbf{v}} E(\mathbf{v}) \]

in which

\[ E(\mathbf{v}) = \sum_{s \in \mathcal{V}_S} \phi^{\mathcal{V}_S}(s) + \sum_{d \in \mathcal{V}_D} \phi^{\mathcal{V}_D}(d) + \sum_{(v,w) \in \mathcal{E}} \psi(v,w) + \psi^{\lambda} \]

where \(\phi^{\mathcal{V}_S}\) and \(\phi^{\mathcal{V}_D}\) are the unary potential functions for superpixel and detection nodes, respectively, measuring the cost of one superpixel node in \(\mathcal{V}_S\) or one detection node in \(\mathcal{V}_D\) belonging to a certain target; \(\psi(v,w)\) is the pairwise potential on edges among superpixels and detections, covering spatial and temporal relations among superpixels as well as relations between superpixels and detections in the same frame; \(\psi^{\lambda}\) is the trajectory cost, which contains several constraints on height, shape, dynamics, persistence, image likelihood and parsimony.
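
To make the structure of this objective concrete, here is a toy evaluation of such an energy for one labeling: unary costs for superpixels and detections, a pairwise term on edges, and a per-trajectory cost. Everything in it is an illustrative assumption, in particular the Potts-style pairwise penalty and the fixed cost per active trajectory, which stand in for the paper's actual potentials.

```python
def crf_energy(labels, superpixel_unary, detection_unary, edges,
               pairwise_cost, trajectory_cost):
    """Energy of one labeling: unary terms + pairwise terms + trajectory costs.

    labels:            dict node -> target id.
    superpixel_unary:  dict node -> dict target id -> cost.
    detection_unary:   dict node -> dict target id -> cost.
    edges:             iterable of (v, w) node pairs.
    pairwise_cost:     penalty when two connected nodes take different labels
                       (Potts-style stand-in, not the paper's potential).
    trajectory_cost:   dict target id -> cost of activating that trajectory.
    """
    energy = sum(costs[labels[node]] for node, costs in superpixel_unary.items())
    energy += sum(costs[labels[node]] for node, costs in detection_unary.items())
    energy += sum(pairwise_cost for v, w in edges if labels[v] != labels[w])
    # Each trajectory that explains at least one node pays its cost once.
    energy += sum(trajectory_cost[t] for t in set(labels.values()))
    return energy

superpixel_unary = {"s1": {0: 1.0, 1: 3.0}, "s2": {0: 2.0, 1: 0.5}}
detection_unary = {"d1": {0: 0.2, 1: 4.0}}
edges = [("s1", "s2"), ("s1", "d1")]
trajectory_cost = {0: 1.0, 1: 1.5}

one_target = {"s1": 0, "s2": 0, "d1": 0}
two_targets = {"s1": 0, "s2": 1, "d1": 0}
for labeling in (one_target, two_targets):
    print(crf_energy(labeling, superpixel_unary, detection_unary,
                     edges, 0.7, trajectory_cost))
```

In this toy case the single-target labeling has the lower energy: splitting off `s2` saves unary cost but pays both the pairwise penalty and the cost of a second trajectory.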


  1. Taking pixel-level (superpixel) information into account in addition to detection results makes it possible to handle partial occlusions, which leads to higher recall.
  2. Segments can provide considerable information even when no reliable detection exists.
  3. Modeling the multi-target tracking problem as a graphical model makes it possible to use existing optimization algorithms.


  1. Solving the CRF problem is slow, taking 12 seconds per frame.
  2. Cannot handle ID switches between two adjacent temporal sliding windows.


  1. Tracking-by-detection has proven to be the most successful strategy for the multi-target tracking problem.
  2. Noise and imprecise measurements, long-term occlusions, complicated dynamics and target interactions all contribute to the problem’s complexity.

TITLE: Learning to Segment Moving Objects in Videos

AUTHOR: Fragkiadaki, Katerina and Arbelaez, Pablo and Felsen, Panna and Malik, Jitendra



  1. Moving object proposals from multiple segmentations on optical flow boundaries
  2. A moving objectness detector for ranking per frame segments and tube proposals
  3. A method of extending per frame segments into spatial-temporal tubes


  1. Extract motion boundaries from optical flow.
  2. Generate segment proposals from the motion boundaries, called MOPs (Moving Object Proposals).
  3. Rank the MOPs using a CNN-based regressor.
  4. Combine per-frame MOPs into space-time tubes based on pixelwise trajectory clusters.
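
The first and third steps can be illustrated with a very rough sketch. Note the assumptions: the gradient magnitude of the flow field is used here as a crude stand-in for the motion-boundary detector, and the proposal scores are assumed to come from some hypothetical regressor. Neither is the paper's actual method.

```python
import numpy as np

def motion_boundary_strength(flow):
    """Crude motion-boundary map: gradient magnitude of an (H, W, 2) flow field."""
    strength = np.zeros(flow.shape[:2])
    for channel in range(2):
        # Spatial derivatives of one flow component along rows and columns.
        d_row, d_col = np.gradient(flow[..., channel])
        strength += d_row ** 2 + d_col ** 2
    return np.sqrt(strength)

def rank_proposals(proposals, scores):
    """Order proposals by descending score (scores from a hypothetical regressor)."""
    order = np.argsort(scores)[::-1]
    return [proposals[i] for i in order]

# Left half static, right half moving 3 pixels to the right.
flow = np.zeros((4, 6, 2))
flow[:, 3:, 0] = 3.0
print(motion_boundary_strength(flow)[0])
print(rank_proposals(["a", "b", "c"], [0.2, 0.9, 0.5]))
```

The strength is zero inside each region and nonzero only along the column where the motion changes, regardless of how textured the object itself is.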


  1. Using optical flow reduces the noise caused by texture inside an object. Optical flow is more suitable for detecting rigid objects.
  2. Using trajectory tracking makes it possible to deal with objects that are temporarily static.
  3. Segments are effective for tackling frequent occlusions and dis-occlusions.


  1. Too slow. Every stage takes seconds to process, which is not suitable for practical applications.
  2. Several independent methods are used to detect objects, so little computation is shared.
  3. The power of CNNs has not been fully exploited.


  1. R-CNN has excellent performance on object detection in static images.
  2. For sliding-window methods, too many patches need to be evaluated.
  3. MRF methods neglect the relations between nearby pixels and cannot separate adjacent instances.
  4. Methods for object detection in video can be categorized into two types: i) top-down tracking and ii) bottom-up segmentation.










  1. Install the NVIDIA GTX 970M driver
  2. Install the CUDA 7.0 Toolkit

Please refer to my previous blog post, Installation of NVIDIA GPU Driver and CUDA Toolkit.

Install OpenBLAS

  1. Download the source code from the official OpenBLAS website and extract the archive.
  2. (Optional) Install gfortran with sudo apt-get install gfortran
  3. Change to the extracted folder and compile with make FC=gfortran
  4. Install with make PREFIX=/your/path install
  5. Add the paths to the environment: PATH=/your/path/to/openblas/include:$PATH and LD_LIBRARY_PATH=/your/path/to/openblas/lib:$LD_LIBRARY_PATH, then export both variables.

Install Anaconda

  1. Download the installer script from http://continuum.io/downloads
  2. Make it executable: sudo chmod +x Anaconda*.sh
  3. Run the installer: bash Anaconda*.sh
  4. In ~/.bashrc, add

NEVER put it in /etc! Otherwise, you risk being unable to get into the GUI.

  5. Configure the HDF5 version:
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libhdf5.so.7 libhdf5.so.10
sudo ln -s libhdf5_hl.so.7 libhdf5_hl.so.10
sudo ldconfig

Install OpenCV

One can conveniently install OpenCV by running a shell script from a GitHub repository.

  1. Download the script. I use OpenCV 2.4.10.
  2. Make the script executable: sudo chmod +x opencv2_4_10.sh
  3. Run the script: sudo ./opencv2_4_10.sh. Note that you may need to modify the cmake settings, for example to disable Qt.

Install a Set of Dependencies

Following the Caffe installation guide, we can set up the dependencies with the command
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler

Compile Caffe

  1. Get Caffe from GitHub: git clone https://github.com/BVLC/caffe.git
  2. Edit Makefile.config to set the correct paths. First create Makefile.config with cp Makefile.config.example Makefile.config, then modify several paths. In my case, I set BLAS to OpenBLAS and set the BLAS paths to /opt/OpenBLAS/include and /opt/OpenBLAS/lib, where I installed OpenBLAS; Python is set to Anaconda, along with its paths.
  3. Compile Caffe with make -j and then make pycaffe
  4. At this point Caffe should compile without any problem. However, when running examples such as MNIST, some libraries might be reported as missing. My solution is to add their directories to the system library cache. For example, create a file called cuda.conf in /etc/ld.so.conf.d/, add the path “/usr/local/cuda-7.0/lib64” to this file, and run sudo ldconfig to refresh the cache.
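
To find out which libraries are missing before running an example, a small Python check along the following lines may help. It is only a sketch: it relies on the standard-library function `ctypes.util.find_library`, and the list of library names is my assumption based on the steps above, so adjust it to your setup.

```python
import ctypes.util

def find_missing_libraries(names):
    """Return the library names that the system cannot resolve."""
    return [name for name in names if ctypes.util.find_library(name) is None]

if __name__ == "__main__":
    # Assumed names; find_library expects them without the "lib" prefix or suffix.
    wanted = ["cudart", "openblas", "hdf5", "hdf5_hl"]
    missing = find_missing_libraries(wanted)
    print("missing:", missing if missing else "none")
```

If a library shows up as missing, check that its directory is listed under /etc/ld.so.conf.d/ and that ldconfig has been run since the last change.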