
TITLE: Be Your Own Prada: Fashion Synthesis with Structural Coherence

AUTHOR: Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy

ASSOCIATION: The Chinese University of Hong Kong, University of Toronto, Vector Institute, Uber Advanced Technologies Group

FROM: ICCV2017

CONTRIBUTION

A method is developed that can generate new outfits onto existing photos, such that it can

  1. retain the body shape and pose of the wearer,
  2. produce regions and the associated textures that conform to the language description,
  3. enforce coherent visibility of body parts.

METHOD

Given an input photograph of a person and a sentence description of a new desired outfit, the model first generates a segmentation map $\tilde{S}$ using the generator of the first GAN. The new image is then rendered with a second GAN, guided by the segmentation map generated in the previous step. At test time, the final rendered image is obtained with a forward pass through the two GAN networks. The workflow of this work is shown in the following figure.

Framework

The first generator $G_{shape}$ aims to generate the desired semantic segmentation map $\tilde{S}$ by conditioning on the spatial constraint $\downarrow m(S_0)$, the design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_S$. $S_0$ is the original pixel-wise one-hot segmentation map of the input image, with height $m$, width $n$, and $L$ channels, where $L$ is the number of labels. $\downarrow m(S_0)$ downsamples and merges $S_0$ so that it is agnostic of the clothing worn in the original image and only captures information about the user’s body. Thus $G_{shape}$ can generate a segmentation map $\tilde{S}$ with sleeves from a segmentation map $S_0$ without sleeves.
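As a rough illustration, the merge-and-downsample operator $\downarrow m(\cdot)$ can be sketched in NumPy. The label grouping, downsampling factor, and function name below are hypothetical choices for the sketch, not the paper's exact operator:

```python
import numpy as np

def downsample_merge(S0, merge_groups, factor=4):
    """Sketch of the down-arrow m(S_0) operator: merge clothing labels so the
    map keeps only body/background information, then downsample spatially.

    S0: one-hot segmentation map of shape (L, H, W).
    merge_groups: list of lists of label indices; each group is collapsed
                  into a single channel (e.g. all clothing labels into one).
    factor: spatial downsampling factor (H and W assumed divisible by it).
    """
    L, H, W = S0.shape
    # Collapse each label group into one channel; the result is still a
    # valid per-pixel distribution because the groups partition the labels.
    merged = np.stack([S0[g].sum(axis=0) for g in merge_groups])
    Lp = merged.shape[0]
    # Average-pool each (factor x factor) block to downsample.
    down = merged.reshape(Lp, H // factor, factor,
                          W // factor, factor).mean(axis=(2, 4))
    return down
```

Because the groups partition the label set and average pooling preserves sums, the output still sums to one over channels at every location, i.e. it remains a soft segmentation map.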

The second generator $G_{image}$ renders the final image $\tilde{I}$ based on the generated segmentation map $\tilde{S}$, design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_I$.

TITLE: Detect to Track and Track to Detect

AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

ASSOCIATION: Graz University of Technology, University of Oxford

FROM: arXiv:1710.03958

CONTRIBUTION

  1. A ConvNet architecture is set up for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.
  2. Correlation features that represent object co-occurrences across time are introduced to aid the ConvNet during tracking.
  3. Frame-level detections are linked to produce high accuracy detections at the video-level based on across-frame tracklets.

METHOD

For frame-level detections, this work adopts R-FCN as the base framework to detect objects in a single frame. The inter-frame correlation features are extracted from the feature maps of the two frames. A multi-task loss of localization, classification and displacement is used to train the network. The workflow of this work is shown in the following figure.

Framework

The key innovation of this work is an operation denoted as ROI tracking. The input of this operation is the bounding box regression features of the two frames $\textbf{x}_{reg}^{t}$, $\textbf{x}_{reg}^{t+\tau}$ and the correlation features $\textbf{x}_{corr}^{t,t+\tau}$, which are concatenated. The correlation layer performs point-wise feature comparison of two feature maps $\textbf{x}_{l}^{t}$, $\textbf{x}_{l}^{t+\tau}$:

$$ \textbf{x}_{corr}^{t,t+\tau} (i,j,p,q) = \langle \textbf{x}_{l}^{t} (i,j), \textbf{x}_{l}^{t+\tau} (i+p,j+q) \rangle $$

where $-d \leq p \leq d$ and $-d \leq q \leq d$ are offsets to compare features in a square neighbourhood around the locations $i$, $j$ in the feature map, defined by the maximum displacement $d$.
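The correlation layer above can be transcribed almost directly into NumPy. The function name, the channels-first array layout, and the default `d` are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def correlation(xa, xb, d=2):
    """Point-wise correlation of two feature maps x^t and x^{t+tau}.

    xa, xb: feature maps of shape (C, H, W).
    Returns an array of shape ((2d+1)**2, H, W): one channel per
    offset (p, q) with -d <= p, q <= d.
    """
    C, H, W = xa.shape
    # Zero-pad xb so every offset in the square neighbourhood is valid.
    pad = np.pad(xb, ((0, 0), (d, d), (d, d)))
    out = np.empty(((2 * d + 1) ** 2, H, W))
    k = 0
    for p in range(-d, d + 1):
        for q in range(-d, d + 1):
            shifted = pad[:, d + p:d + p + H, d + q:d + q + W]
            # Inner product over the channel dimension at each location.
            out[k] = (xa * shifted).sum(axis=0)
            k += 1
    return out
```

Stacking one output channel per offset $(p,q)$ yields $(2d+1)^2$ correlation maps, which are then concatenated with the bounding-box regression features for ROI tracking.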

The loss function is written as

$$ Loss(\{p_i\},\{b_i\},\{\Delta_i\}) = \frac{1}{N} \sum_{i=1}^{N} L_{cls}(p_i, c_i^{*}) + \lambda \frac{1}{N_{fg}} \sum_{i=1}^{N} [c_i^{*}>0] L_{reg}(b_i, b_i^{*}) + \lambda \frac{1}{N_{tra}} \sum_{i=1}^{N_{tra}} L_{tra}(\Delta_i^{t+\tau}, \Delta_i^{*,t+\tau}) $$

A class-wise linking score is defined to combine detections and tracks across time

$$ s_{c}(D_{i,c}^t,D_{j,c}^{t+\tau},T^{t,t+\tau})=p_{i,c}^t+p_{j,c}^{t+\tau}+\phi(D_{i}^{t},D_{j},T^{t,t+\tau}) $$

where the pairwise term $\phi$ evaluates to 1 if the IoU overlap of the track correspondence $T^{t,t+\tau}$ with the detection boxes $D_{i}^{t}$, $D_{j}^{t+\tau}$ is larger than 0.5, and to 0 otherwise. $p_{i,c}^{t}$ and $p_{j,c}^{t+\tau}$ are the softmax probabilities for class $c$. The optimal path across a video can be found by maximizing the scores over the duration $\mathcal{T}$ of the video. Once the optimal tube is found, the detections corresponding to that tube are removed, the detection scores in the tube are reweighted by adding the mean of the 50% highest scores in that tube, and the procedure is applied again to the remaining detections.
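The maximization over a video can be sketched as a Viterbi-style dynamic program over per-frame detection scores and pairwise link terms. The data layout below (dense score arrays, one link matrix per frame pair) is a simplifying assumption for illustration:

```python
import numpy as np

def best_tube(scores, link):
    """Find the highest-scoring path of detections across frames.

    scores: list over frames; scores[t] is a 1-D array of per-detection
            class scores p_{i,c}^t for a fixed class c.
    link:   list over consecutive frame pairs; link[t][i, j] is the
            pairwise term phi (1 if the track from detection i in frame t
            overlaps detection j in frame t+1 with IoU > 0.5, else 0).
    Returns (total_score, path) where path holds one detection index per frame.
    """
    T = len(scores)
    acc = scores[0].astype(float)      # best accumulated score ending at each detection
    back = []                          # backpointers for path recovery
    for t in range(1, T):
        # transition: accumulated + pairwise term + next-frame unary score
        trans = acc[:, None] + link[t - 1] + scores[t][None, :]
        back.append(trans.argmax(axis=0))
        acc = trans.max(axis=0)
    # Backtrack from the best final detection.
    path = [int(acc.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return float(acc.max()), path
```

After extracting the best tube, its detections would be removed, its scores reweighted, and the search repeated on the remainder, as described above.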

TITLE: Interpretable Convolutional Neural Networks

AUTHOR: Quanshi Zhang, Ying Nian Wu, Song-Chun Zhu

ASSOCIATION: UCLA

FROM: arXiv:1710.00935

CONTRIBUTION

  1. Slightly revised CNNs are proposed to improve their interpretability; the revision can be broadly applied to CNNs with different network structures.
  2. No annotations of object parts and/or textures are needed to ensure each high-layer filter to have a certain semantic meaning. Each filter automatically learns a meaningful object-part representation without any additional human supervision.
  3. When a traditional CNN is modified into an interpretable CNN, the experimental settings need not be changed for learning, i.e., the interpretable CNN keeps the previous loss function on the top layer and uses exactly the same training samples.
  4. The design for interpretability may decrease the discriminative power of the network a bit, but such a decrease is limited within a small range.

METHOD

The loss for a filter is illustrated in the following figure.

Framework

A feature map is expected to be strongly activated on images of a certain category and to keep silent on other images. Therefore, a number of templates are used to evaluate the fitness between the current feature map and the ideal distribution of activations w.r.t. its semantics. A template is an ideal spatial distribution of activations. The loss for a filter is formulated as the negative mutual information between the feature map $\textbf{X}$ and the templates $\textbf{T}$.

$$ Loss_{f} = - MI(\textbf{X}; \textbf{T}) $$

The loss can be rewritten as

$$ Loss_{f} = - H(\textbf{T}) + H(\textbf{T}' = \{\textbf{T}^{-}, \textbf{T}^{+}\} \mid \textbf{X}) + \sum_{x} p(\textbf{T}^{+}, x) H(\textbf{T}^{+} \mid X = x) $$

The first term is a constant denoting the prior entropy of $\textbf{T}$. The second term encourages a low conditional entropy of inter-category activations, which means that a well-learned filter needs to be exclusively activated by a certain category and keep silent on other categories. The third term encourages a low conditional entropy of the spatial distribution of activations: a well-learned filter should only be activated by a single region of the feature map, instead of repetitively appearing at different locations.
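Treating the filter loss simply as negative mutual information over a discrete joint distribution, a minimal NumPy sketch follows. The dense joint table `p_joint` is a simplifying assumption; the paper computes these quantities from feature maps and templates:

```python
import numpy as np

def filter_loss(p_joint):
    """Loss_f = -MI(X; T), computed from a joint distribution table.

    p_joint: array of shape (num_templates, num_feature_maps) holding
             p(T, X); entries are non-negative and sum to 1.
    """
    pT = p_joint.sum(axis=1, keepdims=True)   # marginal p(T)
    pX = p_joint.sum(axis=0, keepdims=True)   # marginal p(X)
    ratio = p_joint / (pT * pX)
    # Mask log at zero-probability cells (they contribute 0 to the sum).
    log_ratio = np.log(ratio, where=p_joint > 0, out=np.zeros_like(ratio))
    mi = np.sum(p_joint * log_ratio)
    return -mi
```

Minimizing this loss pushes the joint distribution away from independence, i.e. toward filters whose activation pattern is predictable from the template (and vice versa).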

SOME THOUGHTS

This loss can reduce the redundancy among filters, which may be used to compress the model.

This repo implements transmitting a cv::Mat to the front-end via winsock and displaying the image using WebSocket.

Server Code

The server code is implemented in websocket_server_c, which is written in C++ and based on winsock2 on Windows. The server first constructs the handshake and establishes a connection with the client over the TCP protocol. Once the connection is set up, a video is transmitted to the front-end frame by frame. The frames are extracted using OpenCV, encoded in JPEG format, and then encoded into strings using base64.

Note that many socket examples skip the step of constructing the WebSocket connection; this repo implements it.
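The handshake step mentioned above is fixed by the WebSocket protocol (RFC 6455): the server appends a magic GUID to the client's Sec-WebSocket-Key header, SHA-1 hashes the result, and returns the base64-encoded digest in the Sec-WebSocket-Accept header. A Python sketch of that computation (the repo's server does the same in C++):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value the server must return
    in response to the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

For the example key given in RFC 6455, `accept_key("dGhlIHNhbXBsZSBub25jZQ==")` yields `"s3pPLMBiTxaQ9kYGzzhZRbK+xOo="`.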

Client Code

The client code is a Django project in websocket_client_django. Its only function is to receive the messages from the server and display the frames on the web page.

Notice

This code is just a demo of using sockets in C++ and on the web. There must be better ways to do live video streaming.

TITLE: Pyramid Scene Parsing Network

AUTHOR: Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia

ASSOCIATION: The Chinese University of Hong Kong, SenseTime

FROM: arXiv:1612.01105

CONTRIBUTIONS

  1. A pyramid scene parsing network is proposed to embed difficult scenery context features in an FCN based pixel prediction framework.
  2. An effective optimization strategy is developed for deep ResNet based on deeply supervised loss.
  3. A practical system is built for state-of-the-art scene parsing and semantic segmentation where all crucial implementation details are included.

METHOD

The framework of PSPNet (Pyramid Scene Parsing Network) is illustrated in the following figure.

Framework

Important Observations

There are mainly three observations that motivated the authors to propose the pyramid pooling module as an effective global context prior.

  • Mismatched Relationship Context relationships are universal and important, especially for complex scene understanding, because visual patterns co-occur. For example, an airplane is likely to be on a runway or flying in the sky, but not over a road.
  • Confusion Categories Similar or easily confused categories should be disambiguated so that the whole object is covered by a single label rather than multiple labels. This problem can be remedied by utilizing the relationship between categories.
  • Inconspicuous Classes To improve performance on remarkably small or large objects, one should pay attention to different sub-regions that contain inconspicuous-category stuff.

Pyramid Pooling Module

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level, highlighted in red, is global pooling that generates a single-bin output. The following pyramid levels separate the feature map into different sub-regions and form pooled representations for different locations, so the outputs of the different levels are feature maps of varied sizes. To maintain the weight of the global feature, a 1×1 convolution layer is used after each pyramid level to reduce the dimension of the context representation to 1/N of the original if the pyramid has N levels. The low-dimension feature maps are then upsampled via bilinear interpolation to the same size as the original feature map. Finally, the different levels of features are concatenated as the final pyramid pooling global feature.
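A minimal NumPy sketch of the pooling-and-upsampling path follows, using the paper's 1, 2, 3, 6 bin setting. For brevity it omits the 1×1 convolutions and uses nearest-neighbour instead of bilinear upsampling, so it illustrates the structure rather than reproducing the module:

```python
import numpy as np

def pyramid_pooling(x, bins=(1, 2, 3, 6)):
    """Sketch of the pyramid pooling module (no 1x1 convs, nearest upsampling).

    x: feature map of shape (C, H, W); H and W are assumed divisible
       by every bin count in `bins`.
    Returns the original map concatenated with one upsampled pooled map
    per pyramid level, shape (C * (1 + len(bins)), H, W).
    """
    C, H, W = x.shape
    levels = [x]
    for n in bins:
        # Average-pool into an n x n grid of bins.
        pooled = x.reshape(C, n, H // n, n, W // n).mean(axis=(2, 4))
        # Nearest-neighbour upsample back to (H, W).
        up = pooled.repeat(H // n, axis=1).repeat(W // n, axis=2)
        levels.append(up)
    return np.concatenate(levels, axis=0)
```

The single-bin level (n = 1) reduces to global average pooling, matching the coarsest level described above.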

Deep Supervision for ResNet-Based FCN

Apart from the main branch using softmax loss to train the final classifier, another classifier is applied after the fourth stage. The auxiliary loss helps optimize the learning process, while the master branch loss takes the most responsibility.
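As a sketch, the combined objective is simply a weighted sum of the two branch losses; the weight 0.4 is the value the paper's ablation settles on, but it should be treated as a tunable hyper-parameter:

```python
def deeply_supervised_loss(main_loss, aux_loss, alpha=0.4):
    """Master-branch loss plus a down-weighted auxiliary loss applied
    after the fourth ResNet stage; alpha balances the two terms."""
    return main_loss + alpha * aux_loss
```

At test time the auxiliary branch is discarded; it only shapes optimization during training.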

Ablation Study

Framework


Microsoft's Windows 10 Anniversary Update preview build 14316 includes most of the announced features, among them an important one: native support for the Linux Bash command line. In other words, users can now use Bash on Windows 10 without running a Linux system or a Mac. So how do you enable the Linux Bash command line on Windows 10? You can try the following steps.

  1. First, upgrade Windows 10 to build 14316. Then go to Settings > Update & Security > For developers > select Developer mode.
  2. Search for "Programs and Features", choose "Turn Windows features on or off", enable Windows Subsystem for Linux (Beta), and restart the system.
  3. To install Bash, open the command prompt and type bash.
  4. The system directory is at C:\Users\a\AppData\Local\lxss\home\tztx\CODE\caffe-online-demo.

Install Node.js

Download the installer for your system from the official website and install it. After installation, run npm -v to check the npm version and confirm that the installation succeeded.

Use the node -v command to check the Node.js version and confirm that Node is installed.

Install the cnpm tool

Downloads from the official npm registry can be slow, so you can use cnpm, a command-line tool provided by Taobao, in place of the default npm. Install it with npm install -g cnpm --registry=https://registry.npm.taobao.org

Install Electron

Install it with cnpm install -g electron

I just watched Dunkirk. What saddened me most was the story of a pilot. Three of them set out on the mission: one was shot down and killed outright; another was also hit, but ditched at sea and was rescued. The last one reached the beach at Dunkirk with his fuel running out, yet he chose to keep fighting. When the fuel was finally exhausted and he was gliding, propeller no longer turning, he still shot down one more enemy plane, then glided to a landing on the beach. I thought he would be rescued, but instead he was taken prisoner. It felt as if everyone else was saved, and he alone was abandoned.

This film was released in China long ago, but the domestic cut was butchered beyond recognition: almost all the gory shots were removed. For a film like this, such cuts all but ruin the whole picture and drain the viewing of any pleasure. I waited and waited, and today finally found time to watch the full version. How to put it: overall it feels a bit average, a passing grade, but not excellent.

In the previous Alien film, Prometheus, the director explored a whole series of philosophical questions: the creators, the origin of humanity, artificial intelligence, and so on. Why would the creators want to exterminate humanity; humans challenge their gods in pursuit of their own origin; the creators engineer the xenomorph to wipe out humanity only to be killed by it in turn; who is the creator above the creators; humans become creators themselves by building sophisticated artificial intelligence; androids come into conflict with humans in pursuit of freedom; feminism, anarchism, antitrust… The plot had holes, but it opened countless deep rabbit holes and made me eagerly await the next film, Covenant.

In Covenant, however, all of that reflection is gone; it seems to regress to the B-movie xenomorph rather than the science-fiction one. The two layers of creator-and-created relationships, creators and humans, humans and androids, are cut down to a single layer. The relationship between the creators and humanity goes entirely unexplored; humanity's gods are exterminated by an android, but the process is far too simple. That a super-advanced civilization capable of creating humans could be wiped out by a petty trick is a logic I cannot accept. Afterwards the android becomes a creator himself, cross-breeding ever more advanced xenomorphs. This exists to force a connection to the first Alien film, but the higher-level discussion is lost entirely.

On the anniversary of Japan's surrender, I watched an animated film depicting conditions inside wartime Japan, directed by Sunao Katabuchi.

In the film, Suzu is born in Hiroshima. From childhood she is a slow, dreamy girl whose greatest love is drawing. Before she seems to have grown up at all, she reaches marrying age. Arranged by her parents, Suzu marries Shusaku Hojo and moves over the mountains to Kure to become the Hojo family's daughter-in-law.

Shusaku's sister Keiko, having fallen out with her mother-in-law, has moved back to her parents' home with her daughter Harumi. Unlike the gentle, honest Suzu, Keiko is flamboyant and sharp-tongued; though she is often dissatisfied with the timid Suzu, the two are in fact very close. Then the war begins: more and more fighter planes cross the sky, piercing air-raid sirens often shatter the quiet of the night, and the shadow of war creeps bit by bit over the Hojo family's life. In an air raid, Harumi is killed and Suzu loses her right hand, never to draw again. Yet even in such a cruel time, the family never gives up its will to live.

The whole film feels much like To Live: against the great backdrop of the era, ordinary people can only endure without end, struggling to stay alive. Watching Japanese wartime films about ordinary people always gives me a peculiar feeling: I sympathize with the characters' suffering, yet also feel they had it coming. Many people now say "no innocent souls under the atomic bomb." As a Chinese viewer, I also feel the film whitewashes the war in some details. For example, one line says "it was violence that defeated Japan"; why not "it was justice that defeated Japan"? But on reflection, if our own country waged war abroad, to what extent would I think my country was in the wrong, or would I even wish for its defeat? Perhaps more often we stand on the commanding heights of history, judging the people in it with the benefit of hindsight.

In This Corner of the World