
TITLE: Be Your Own Prada: Fashion Synthesis with Structural Coherence

AUTHOR: Shizhan Zhu, Sanja Fidler, Raquel Urtasun, Dahua Lin, Chen Change Loy

ASSOCIATION: The Chinese University of Hong Kong, University of Toronto, Vector Institute, Uber Advanced Technologies Group

FROM: ICCV2017

CONTRIBUTION

A method is developed that can generate new outfits onto existing photos, such that it can

  1. retain the body shape and pose of the wearer,
  2. produce regions and the associated textures that conform to the language description,
  3. enforce coherent visibility of body parts.

METHOD

Given an input photograph of a person and a sentence description of a new desired outfit, the model first generates a segmentation map $\tilde{S}$ using the generator of the first GAN. The new image is then rendered with a second GAN, guided by the segmentation map generated in the previous step. At test time, the final rendered image is obtained with a forward pass through the two GAN networks. The workflow of this work is shown in the following figure.

Framework

The first generator $G_{shape}$ aims to generate the desired semantic segmentation map $\tilde{S}$ by conditioning on the spatial constraint $\downarrow m(S_0)$, the design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_S$. $S_0$ is the original pixel-wise one-hot segmentation map of the input image, with height $m$, width $n$, and $L$ channels, where $L$ is the number of labels. $\downarrow m(S_0)$ downsamples and merges $S_0$ so that it is agnostic of the clothing worn in the original image and only captures information about the user’s body. Thus $G_{shape}$ can generate a segmentation map $\tilde{S}$ with sleeves from a segmentation map $S_0$ without sleeves.
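As a rough illustration, the merge-and-downsample operator $\downarrow m(\cdot)$ can be sketched in NumPy. The label grouping, downsampling factor, and function name below are hypothetical choices for the sketch, not the paper's exact operator:

```python
import numpy as np

def downsample_merge(S0, merge_groups, factor=4):
    """Sketch of the down-arrow m(S_0) operator: merge clothing labels so the
    map keeps only body/background information, then downsample spatially.

    S0: one-hot segmentation map of shape (L, H, W).
    merge_groups: list of lists of label indices; each group is collapsed
                  into a single channel (e.g. all clothing labels into one).
    factor: spatial downsampling factor (H and W assumed divisible by it).
    """
    L, H, W = S0.shape
    # Collapse each label group into one channel; the result is still a
    # valid per-pixel distribution because the groups partition the labels.
    merged = np.stack([S0[g].sum(axis=0) for g in merge_groups])
    Lp = merged.shape[0]
    # Average-pool each (factor x factor) block to downsample.
    down = merged.reshape(Lp, H // factor, factor,
                          W // factor, factor).mean(axis=(2, 4))
    return down
```

Because the groups partition the label set and average pooling preserves sums, the output still sums to one over channels at every location, i.e. it remains a soft segmentation map.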

The second generator $G_{image}$ renders the final image $\tilde{I}$ based on the generated segmentation map $\tilde{S}$, design coding $\textbf{d}$, and the Gaussian noise $\textbf{z}_I$.

TITLE: Detect to Track and Track to Detect

AUTHOR: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

ASSOCIATION: Graz University of Technology, University of Oxford

FROM: arXiv:1710.03958

CONTRIBUTION

  1. A ConvNet architecture is set up for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression.
  2. Correlation features that represent object co-occurrences across time are introduced to aid the ConvNet during tracking.
  3. Frame-level detections are linked to produce high accuracy detections at the video-level based on across-frame tracklets.

METHOD

For frame-level detections, this work adopts R-FCN as the base framework to detect objects in a single frame. The inter-frame correlation features are extracted from the feature maps of the two frames. A multi-task loss of localization, classification and displacement is used to train the network. The workflow of this work is shown in the following figure.

Framework

The key innovation of this work is an operation denoted as ROI tracking. The input of this operation is the bounding box regression features of the two frames $\textbf{x}_{reg}^{t}$, $\textbf{x}_{reg}^{t+\tau}$ and the correlation features $\textbf{x}_{corr}^{t,t+\tau}$, which are concatenated. The correlation layer performs point-wise feature comparison of two feature maps $\textbf{x}_{l}^{t}$, $\textbf{x}_{l}^{t+\tau}$:

$$ \textbf{x}_{corr}^{t,t+\tau} (i,j,p,q) = \langle \textbf{x}_{l}^{t} (i,j), \textbf{x}_{l}^{t+\tau} (i+p,j+q) \rangle $$

where $-d \leq p \leq d$ and $-d \leq q \leq d$ are offsets to compare features in a square neighbourhood around the locations $i$, $j$ in the feature map, defined by the maximum displacement $d$.
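The correlation layer above can be transcribed almost directly into NumPy. The function name, the channels-first array layout, and the default `d` are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def correlation(xa, xb, d=2):
    """Point-wise correlation of two feature maps x^t and x^{t+tau}.

    xa, xb: feature maps of shape (C, H, W).
    Returns an array of shape ((2d+1)**2, H, W): one channel per
    offset (p, q) with -d <= p, q <= d.
    """
    C, H, W = xa.shape
    # Zero-pad xb so every offset in the square neighbourhood is valid.
    pad = np.pad(xb, ((0, 0), (d, d), (d, d)))
    out = np.empty(((2 * d + 1) ** 2, H, W))
    k = 0
    for p in range(-d, d + 1):
        for q in range(-d, d + 1):
            shifted = pad[:, d + p:d + p + H, d + q:d + q + W]
            # Inner product over the channel dimension at each location.
            out[k] = (xa * shifted).sum(axis=0)
            k += 1
    return out
```

Stacking one output channel per offset $(p,q)$ yields $(2d+1)^2$ correlation maps, which are then concatenated with the bounding-box regression features for ROI tracking.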

The loss function is written as

$$ Loss(\{p_i\},\{b_i\},\{\Delta_i\}) = \frac{1}{N} \sum_{i=1}^{N} L_{cls}(p_i, c_i^{*}) + \lambda \frac{1}{N_{fg}} \sum_{i=1}^{N} [c_i^{*}>0] L_{reg}(b_i, b_i^{*}) + \lambda \frac{1}{N_{tra}} \sum_{i=1}^{N_{tra}} L_{tra}(\Delta_i^{t+\tau}, \Delta_i^{*,t+\tau}) $$

A class-wise linking score is defined to combine detections and tracks across time

$$ s_{c}(D_{i,c}^t,D_{j,c}^{t+\tau},T^{t,t+\tau})=p_{i,c}^t+p_{j,c}^{t+\tau}+\phi(D_{i}^{t},D_{j},T^{t,t+\tau}) $$

where the pairwise term $\phi$ evaluates to 1 if the IoU overlap of the track correspondence $T^{t,t+\tau}$ with the detection boxes $D_{i}^{t}$, $D_{j}^{t+\tau}$ is larger than 0.5, and to 0 otherwise. $p_{i,c}^{t}$ and $p_{j,c}^{t+\tau}$ are the softmax probabilities for class $c$. The optimal path across a video can be found by maximizing the scores over the duration $\mathcal{T}$ of the video. Once the optimal tube is found, the detections corresponding to that tube are removed, the detection scores in the tube are reweighted by adding the mean of the 50% highest scores in that tube, and the procedure is applied again to the remaining detections.
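The maximization over a video can be sketched as a Viterbi-style dynamic program over per-frame detection scores and pairwise link terms. The data layout below (dense score arrays, one link matrix per frame pair) is a simplifying assumption for illustration:

```python
import numpy as np

def best_tube(scores, link):
    """Find the highest-scoring path of detections across frames.

    scores: list over frames; scores[t] is a 1-D array of per-detection
            class scores p_{i,c}^t for a fixed class c.
    link:   list over consecutive frame pairs; link[t][i, j] is the
            pairwise term phi (1 if the track from detection i in frame t
            overlaps detection j in frame t+1 with IoU > 0.5, else 0).
    Returns (total_score, path) where path holds one detection index per frame.
    """
    T = len(scores)
    acc = scores[0].astype(float)      # best accumulated score ending at each detection
    back = []                          # backpointers for path recovery
    for t in range(1, T):
        # transition: accumulated + pairwise term + next-frame unary score
        trans = acc[:, None] + link[t - 1] + scores[t][None, :]
        back.append(trans.argmax(axis=0))
        acc = trans.max(axis=0)
    # Backtrack from the best final detection.
    path = [int(acc.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return float(acc.max()), path
```

After extracting the best tube, its detections would be removed, its scores reweighted, and the search repeated on the remainder, as described above.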

TITLE: Interpretable Convolutional Neural Networks

AUTHOR: Quanshi Zhang, Ying Nian Wu, Song-Chun Zhu

ASSOCIATION: UCLA

FROM: arXiv:1710.00935

CONTRIBUTION

  1. Slightly revised CNNs are proposed to improve their interpretability; the revision can be broadly applied to CNNs with different network structures.
  2. No annotations of object parts and/or textures are needed to ensure each high-layer filter to have a certain semantic meaning. Each filter automatically learns a meaningful object-part representation without any additional human supervision.
  3. When a traditional CNN is modified into an interpretable CNN, the experimental settings need not be changed for learning, i.e., the interpretable CNN keeps the previous loss function on the top layer and uses exactly the same training samples.
  4. The design for interpretability may decrease the discriminative power of the network a bit, but such a decrease is limited within a small range.

METHOD

The loss for a filter is illustrated in the following figure.

Framework

A feature map is expected to be strongly activated on images of a certain category and to keep silent on other images. Therefore, a number of templates are used to evaluate the fitness between the current feature map and the ideal distribution of activations w.r.t. its semantics. A template is an ideal spatial distribution of activations. The loss for a filter is formulated as the negative mutual information between the feature map $\textbf{X}$ and the templates $\textbf{T}$.

$$ Loss_{f} = - MI(\textbf{X}; \textbf{T}) $$

The loss can be rewritten as

$$ Loss_{f} = - H(\textbf{T}) + H(\textbf{T}' = \{\textbf{T}^{-}, \textbf{T}^{+}\} \mid \textbf{X}) + \sum_{x} p(\textbf{T}^{+}, x) H(\textbf{T}^{+} \mid X = x) $$

The first term is a constant denoting the prior entropy of $\textbf{T}$. The second term encourages a low conditional entropy of inter-category activations, which means that a well-learned filter needs to be exclusively activated by a certain category and keep silent on other categories. The third term encourages a low conditional entropy of the spatial distribution of activations: a well-learned filter should only be activated by a single region of the feature map, instead of repetitively appearing at different locations.
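Treating the filter loss simply as negative mutual information over a discrete joint distribution, a minimal NumPy sketch follows. The dense joint table `p_joint` is a simplifying assumption; the paper computes these quantities from feature maps and templates:

```python
import numpy as np

def filter_loss(p_joint):
    """Loss_f = -MI(X; T), computed from a joint distribution table.

    p_joint: array of shape (num_templates, num_feature_maps) holding
             p(T, X); entries are non-negative and sum to 1.
    """
    pT = p_joint.sum(axis=1, keepdims=True)   # marginal p(T)
    pX = p_joint.sum(axis=0, keepdims=True)   # marginal p(X)
    ratio = p_joint / (pT * pX)
    # Mask log at zero-probability cells (they contribute 0 to the sum).
    log_ratio = np.log(ratio, where=p_joint > 0, out=np.zeros_like(ratio))
    mi = np.sum(p_joint * log_ratio)
    return -mi
```

Minimizing this loss pushes the joint distribution away from independence, i.e. toward filters whose activation pattern is predictable from the template (and vice versa).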

SOME THOUGHTS

This loss can reduce the redundancy among filters, which may be used to compress the model.

This repo implements transmitting a cv::Mat to the front-end via winsock and displaying the image using WebSocket.

Server Code

The server code is implemented in websocket_server_c, which is written in C++ and based on winsock2 on Windows. The server first constructs the handshake and establishes a connection with the client over the TCP protocol. Once the connection is set up, a video is transmitted to the front-end frame by frame. The frames are extracted using OpenCV, encoded in JPEG format, and then encoded into strings using base64.

Note that many socket examples skip the step of constructing the WebSocket connection; this repo implements it.
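The handshake step mentioned above is fixed by the WebSocket protocol (RFC 6455): the server appends a magic GUID to the client's Sec-WebSocket-Key header, SHA-1 hashes the result, and returns the base64-encoded digest in the Sec-WebSocket-Accept header. A Python sketch of that computation (the repo's server does the same in C++):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value the server must return
    in response to the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

For the example key given in RFC 6455, `accept_key("dGhlIHNhbXBsZSBub25jZQ==")` yields `"s3pPLMBiTxaQ9kYGzzhZRbK+xOo="`.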

Client Code

The client code is a Django project in websocket_client_django. Its only function is to receive the messages from the server and display the frames on the web page.

Notice

This code is just a demo of using sockets in C++ and on the web. There must be better ways to do live video streaming.

TITLE: Pyramid Scene Parsing Network

AUTHOR: Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia

ASSOCIATION: The Chinese University of Hong Kong, SenseTime

FROM: arXiv:1612.01105

CONTRIBUTIONS

  1. A pyramid scene parsing network is proposed to embed difficult scenery context features in an FCN based pixel prediction framework.
  2. An effective optimization strategy is developed for deep ResNet based on deeply supervised loss.
  3. A practical system is built for state-of-the-art scene parsing and semantic segmentation where all crucial implementation details are included.

METHOD

The framework of PSPNet (Pyramid Scene Parsing Network) is illustrated in the following figure.

Framework

Important Observations

There are mainly three observations that motivated the authors to propose the pyramid pooling module as an effective global context prior.

  • Mismatched Relationship Context relationships are universal and important, especially for complex scene understanding, because visual patterns co-occur. For example, an airplane is likely to be on a runway or flying in the sky, but not over a road.
  • Confusion Categories Similar or easily confused categories should be disambiguated so that the whole object is covered by a single label rather than multiple labels. This problem can be remedied by utilizing the relationship between categories.
  • Inconspicuous Classes To improve performance on remarkably small or large objects, one should pay attention to different sub-regions that contain inconspicuous-category stuff.

Pyramid Pooling Module

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level, highlighted in red, is global pooling that generates a single-bin output. The following pyramid levels separate the feature map into different sub-regions and form pooled representations for different locations, so the outputs of the different levels are feature maps of varied sizes. To maintain the weight of the global feature, a 1×1 convolution layer is used after each pyramid level to reduce the dimension of the context representation to 1/N of the original if the pyramid has N levels. The low-dimension feature maps are then upsampled via bilinear interpolation to the same size as the original feature map. Finally, the different levels of features are concatenated as the final pyramid pooling global feature.
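A minimal NumPy sketch of the pooling-and-upsampling path follows, using the paper's 1, 2, 3, 6 bin setting. For brevity it omits the 1×1 convolutions and uses nearest-neighbour instead of bilinear upsampling, so it illustrates the structure rather than reproducing the module:

```python
import numpy as np

def pyramid_pooling(x, bins=(1, 2, 3, 6)):
    """Sketch of the pyramid pooling module (no 1x1 convs, nearest upsampling).

    x: feature map of shape (C, H, W); H and W are assumed divisible
       by every bin count in `bins`.
    Returns the original map concatenated with one upsampled pooled map
    per pyramid level, shape (C * (1 + len(bins)), H, W).
    """
    C, H, W = x.shape
    levels = [x]
    for n in bins:
        # Average-pool into an n x n grid of bins.
        pooled = x.reshape(C, n, H // n, n, W // n).mean(axis=(2, 4))
        # Nearest-neighbour upsample back to (H, W).
        up = pooled.repeat(H // n, axis=1).repeat(W // n, axis=2)
        levels.append(up)
    return np.concatenate(levels, axis=0)
```

The single-bin level (n = 1) reduces to global average pooling, matching the coarsest level described above.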

Deep Supervision for ResNet-Based FCN

Apart from the main branch using softmax loss to train the final classifier, another classifier is applied after the fourth stage. The auxiliary loss helps optimize the learning process, while the master branch loss takes the most responsibility.
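As a sketch, the combined objective is simply a weighted sum of the two branch losses; the weight 0.4 is the value the paper's ablation settles on, but it should be treated as a tunable hyper-parameter:

```python
def deeply_supervised_loss(main_loss, aux_loss, alpha=0.4):
    """Master-branch loss plus a down-weighted auxiliary loss applied
    after the fourth ResNet stage; alpha balances the two terms."""
    return main_loss + alpha * aux_loss
```

At test time the auxiliary branch is discarded; it only shapes optimization during training.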

Ablation Study

Framework


Microsoft's Windows 10 Anniversary Update preview build 14316 includes most of the announced features, among them an important one: native support for the Linux Bash command line. In other words, users can now use Bash on Windows 10 without running a Linux system or a Mac. So how do you enable the Linux Bash command line on Windows 10? You can try the following steps.

  1. First, upgrade Windows 10 to build 14316. Then go to Settings > Update & Security > For developers > select Developer mode.
  2. Search for "Programs and Features", choose "Turn Windows features on or off", enable Windows Subsystem for Linux (Beta), and restart the system.
  3. To install Bash, open the command prompt and type bash.
  4. The system directory is at C:\Users\a\AppData\Local\lxss\home\tztx\CODE\caffe-online-demo.

Install Node.js

Download the installer for your system from the official website and install it. After installation, run npm -v to check the npm version and confirm that the installation succeeded.

Use the node -v command to check the Node.js version and confirm that Node is installed.

Install the cnpm tool

Downloads from the official npm registry can be slow, so you can use cnpm, a command-line tool provided by Taobao, in place of the default npm. Install it with npm install -g cnpm --registry=https://registry.npm.taobao.org

Install Electron

Install it with cnpm install -g electron

I just watched Dunkirk. What saddened me most was the story of a pilot. Three of them set out on the mission: one was shot down and killed outright; another was also hit, but ditched at sea and was rescued. The last one reached the beach at Dunkirk with his fuel running out, yet he chose to keep fighting. When the fuel was finally exhausted and he was gliding, propeller no longer turning, he still shot down one more enemy plane, then glided to a landing on the beach. I thought he would be rescued, but instead he was taken prisoner. It felt as if everyone else was saved, and he alone was abandoned.

This film was released in China long ago, but the domestic cut was butchered beyond recognition: almost all the gory shots were removed. For a film like this, such cuts all but ruin the whole picture and drain the viewing of any pleasure. I waited and waited, and today finally found time to watch the full version. How to put it: overall it feels a bit average, a passing grade, but not excellent.

In the previous Alien film, Prometheus, the director explored a whole series of philosophical questions: the creators, the origin of humanity, artificial intelligence, and so on. Why would the creators want to exterminate humanity; humans challenge their gods in pursuit of their own origin; the creators engineer the xenomorph to wipe out humanity only to be killed by it in turn; who is the creator above the creators; humans become creators themselves by building sophisticated artificial intelligence; androids come into conflict with humans in pursuit of freedom; feminism, anarchism, antitrust… The plot had holes, but it opened countless deep rabbit holes and made me eagerly await the next film, Covenant.

In Covenant, however, all of that reflection is gone; it seems to regress to the B-movie xenomorph rather than the science-fiction one. The two layers of creator-and-created relationships, creators and humans, humans and androids, are cut down to a single layer. The relationship between the creators and humanity goes entirely unexplored; humanity's gods are exterminated by an android, but the process is far too simple. That a super-advanced civilization capable of creating humans could be wiped out by a petty trick is a logic I cannot accept. Afterwards the android becomes a creator himself, cross-breeding ever more advanced xenomorphs. This exists to force a connection to the first Alien film, but the higher-level discussion is lost entirely.

On the anniversary of Japan's surrender, I watched an animated film depicting conditions inside wartime Japan, directed by Sunao Katabuchi.

In the film, Suzu is born in Hiroshima. From childhood she is a slow, dreamy girl whose greatest love is drawing. Before she seems to have grown up at all, she reaches marrying age. Arranged by her parents, Suzu marries Shusaku Hojo and moves over the mountains to Kure to become the Hojo family's daughter-in-law.

Shusaku's sister Keiko, having fallen out with her mother-in-law, has moved back to her parents' home with her daughter Harumi. Unlike the gentle, honest Suzu, Keiko is flamboyant and sharp-tongued; though she is often dissatisfied with the timid Suzu, the two are in fact very close. Then the war begins: more and more fighter planes cross the sky, piercing air-raid sirens often shatter the quiet of the night, and the shadow of war creeps bit by bit over the Hojo family's life. In an air raid, Harumi is killed and Suzu loses her right hand, never to draw again. Yet even in such a cruel time, the family never gives up its will to live.

The whole film feels much like To Live: against the great backdrop of the era, ordinary people can only endure without end, struggling to stay alive. Watching Japanese wartime films about ordinary people always gives me a peculiar feeling: I sympathize with the characters' suffering, yet also feel they had it coming. Many people now say "no innocent souls under the atomic bomb." As a Chinese viewer, I also feel the film whitewashes the war in some details. For example, one line says "it was violence that defeated Japan"; why not "it was justice that defeated Japan"? But on reflection, if our own country waged war abroad, to what extent would I think my country was in the wrong, or would I even wish for its defeat? Perhaps more often we stand on the commanding heights of history, judging the people in it with the benefit of hindsight.

In This Corner of the World