Editor's Note: CVPR 2019, the top computer vision conference, was held in Long Beach, California, on June 15-21. A total of 21 papers from Microsoft Research Asia were accepted to this year's CVPR, covering topics such as pose estimation, object detection, object tracking, image editing, 3D shape generation, and efficient CNNs. This article introduces 7 of them. Our offline CVPR paper-sharing series has also reached its third session; readers who missed it can find it here.
Context-reinforced semantic segmentation
Context-Reinforced Semantic Segmentation
Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, Wenjun Zeng
The goal of image semantic segmentation is to classify every pixel of a given natural image semantically, yielding a fine-grained semantic description of the scene. The task plays an important role in applications such as autonomous driving and medical image analysis.
Much prior work on semantic segmentation has demonstrated the importance of scene context. One line of work uses existing segmentation predictions to perform coarse-to-fine segmentation: for example, post-processing the segmentation result with a conditional random field, or using a recurrent architecture that iteratively feeds the previous step's prediction back in as input to the current step.
This paper focuses on how to adaptively exploit the context information contained in a predicted segmentation map. Because the prediction inevitably contains noise such as misclassified regions, and it is impossible to define by hand which information in the prediction map would best help the segmentation network produce a better result, we believe a separate module should be learned that is responsible for explicitly extracting a useful subset of the prediction map as context. By formulating context extraction as a Markov decision process, we can optimize this module with reinforcement learning, without introducing any new supervision signal, to explicitly select context information that has a positive effect on segmentation.
As shown in the figure above, we perform segmentation prediction iteratively. At the n-th iteration, the segmentation network refers not only to the image feature map but also to the encoded context C^n, which the context network (Context Net) extracts from the segmentation prediction of the (n-1)-th iteration. Since the extracted context affects the segmentation predictions of all subsequent steps, and there is no annotation guiding what the context network should extract, we treat context extraction as an action and regard the image together with the previous iteration's segmentation prediction as the environment, forming a Markov decision process; maximizing future segmentation accuracy then indirectly guides the network to select the context with the greatest long-term benefit. We optimize this process end-to-end with the A3C (asynchronous advantage actor-critic) algorithm. Experiments show that the context selected by this context-reinforced method yields a 3.9% performance improvement over the baseline.
As shown in the figure above, white marks the regions of the segmentation prediction map that are not selected as context. Although it remains impossible to define what constitutes truly useful context, these adaptively selected regions largely match human intuition: the context network tends to select semantic information that represents the scene as context.
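The iterative prediction loop at the heart of this approach can be sketched as follows. This is a toy NumPy illustration only: in the paper both the segmentation network and Context Net are learned (the latter via A3C), whereas here a fixed threshold stands in for the learned context-selection action, and all shapes and functions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment(image_feat, context):
    # Toy stand-in for the segmentation network: logits from image
    # features plus the encoded context.
    return image_feat + 0.5 * context

def context_net(prev_pred):
    # Toy stand-in for Context Net: an "action" mask that keeps only a
    # subset of the previous prediction as context. In the paper this
    # selection is learned with reinforcement learning (A3C); here a
    # fixed threshold is used purely for illustration.
    mask = (prev_pred > prev_pred.mean()).astype(float)
    return mask * prev_pred

image_feat = rng.normal(size=(8, 8))
pred = segment(image_feat, np.zeros((8, 8)))  # iteration 0: no context
for n in range(1, 4):                         # iterations 1..3
    context = context_net(pred)               # action: extract context C^n
    pred = segment(image_feat, context)       # re-predict with context

print(pred.shape)
```

Each pass re-predicts using context extracted from the previous prediction; in the paper, the reward guiding the A3C agent is the resulting gain in future segmentation accuracy.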
Triangulation Learning Network: 3D object detection from monocular to stereo images
Triangulation Learning Network: from Monocular to Stereo 3D Object Detection
Zengyi Qin, Jinglu Wang, Yan Lu
3D object detection aims to localize the 3D bounding boxes of objects of particular categories in 3D space. The task is comparatively easy when active 3D scan data is available, but active scanning is costly and scales poorly. We instead address 3D detection from passive image data, which requires only low-cost hardware, adapts to objects of different sizes, and carries richer semantic features.
Because the mapping from 2D images to 3D geometry is ambiguous, monocular 3D detection from a single RGB image is very difficult, and additional input views can provide more information for 3D reasoning. Multi-view geometry estimates the 3D positions of points by first finding dense point correspondences and then triangulating; such geometric methods rely on local point features and ignore object-level semantic cues.
Stereo data with paired images is better suited to 3D detection, because the disparity between the left and right images reveals spatial variation, especially along the depth dimension. Although there is much deep-learning work on stereo matching, it focuses mainly on the pixel level rather than the object level. By properly placing 3D anchors and extending the Region Proposal Network (RPN) to 3D, we can already obtain good results using monocular images alone.
In this paper, we propose the Triangulation Learning Network (TLNet) for stereo 3D object detection, which can be easily integrated into a basic monocular detector without computing a pixel-level depth map. The key idea is to use a 3D anchor box to establish object-level geometric correspondence between its two projections on a pair of stereo images, from which the network learns to triangulate the target object near the anchor. In TLNet, we introduce an effective feature re-weighting strategy that strengthens informative feature channels by measuring left-right coherence. This re-weighting filters out signals from noisy, unmatched channels, easing the learning process and letting the network focus on the critical parts of the object. We first propose a basic monocular 3D detector, shown in the figure below.
3D Detector Overview
Combined with TLNet, we demonstrate significant improvements in 3D object detection across a variety of settings. We also quantitatively analyze the feature re-weighting strategy in TLNet to better understand its effects. In short, our contributions are threefold:
(1) A solid baseline 3D detector that takes only monocular images as input and performs comparably to today's state-of-the-art stereo detectors.
(2) A triangulation learning network that exploits the geometric correlation between stereo images to localize targeted 3D objects, outperforming the baseline model.
(3) A feature re-weighting strategy that strengthens the informative channels of each view's RoI features and benefits triangulation learning by focusing the network's attention on key parts of the object.
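The re-weighting idea can be illustrated with a small NumPy sketch. Here channel-wise cosine similarity between the left and right RoI features serves as the left-right coherence measure; the paper's exact formulation may differ, and all shapes and names are illustrative.

```python
import numpy as np

def reweight(left_feat, right_feat):
    """Re-weight channels of paired RoI features by their left-right
    cosine similarity, so coherent (well-matched) channels are kept
    and unmatched, noisy channels are suppressed. Shapes: (C, H, W)."""
    c = left_feat.shape[0]
    l = left_feat.reshape(c, -1)
    r = right_feat.reshape(c, -1)
    cos = (l * r).sum(1) / (
        np.linalg.norm(l, axis=1) * np.linalg.norm(r, axis=1) + 1e-8)
    w = cos.reshape(c, 1, 1)          # one weight per channel
    return left_feat * w, right_feat * w

rng = np.random.default_rng(1)
base = rng.normal(size=(4, 5, 5))     # shared structure seen in both views
left = base + 0.05 * rng.normal(size=(4, 5, 5))
right = base + 0.05 * rng.normal(size=(4, 5, 5))
lw, rw = reweight(left, right)
print(lw.shape)
```

Channels where the two views agree get weights near 1, while incoherent channels shrink toward zero, which is the filtering effect described above.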
SPM-Tracker: Series-parallel matching for real-time visual object tracking
SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking
Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun Zeng
Visual object tracking (VOT) is a classic problem in video analysis. Given an object in a video, the tracking task requires the algorithm to continuously report the object's position (usually as a rectangular box) in subsequent frames.
The key to VOT is twofold: the tracker must "keep up" with the target and must "not follow the wrong one." First, an object's appearance changes constantly throughout a video for many reasons, such as pose changes, camera angles, and lighting changes; "keeping up" requires the algorithm to find the object accurately under any appearance change. Second, the target does not always appear alone, and visually similar objects can strongly interfere with the result. For example, when tracking one person in a crowd, we want the algorithm to distinguish between individuals, that is, to "not follow the wrong one." In practice, we found these two requirements hard to satisfy simultaneously in a single model: on one hand we want the model to be insensitive to changes in the target's appearance, while on the other we require it to discriminate between similar-looking objects. The two demands are in tension.
To resolve this, we propose a series-parallel matching structure. It consists of two parts, which we call "coarse matching" and "fine matching."
The task of coarse matching is to find all objects in the frame that resemble the tracking target, i.e., to "keep up." For this part we adopt the SiamRPN framework, with one difference: to make the model as robust as possible to appearance changes, we train it using objects of the same category as positive sample pairs. Visualization results show that this training scheme lets the model find the object accurately even under large appearance changes.
Coarse matching passes a set of candidate boxes to the fine matching model, whose task is to distinguish among these similar objects, i.e., to "not follow the wrong one." To make the model more discriminative, we use a relation network to learn a distance metric between the tracking target and each candidate box. Experiments show this structure is more effective than the previous cross-correlation approach.
The two parts are combined in a series-parallel fashion to produce the final tracking result. They share the same convolutional features, so processing is very fast, reaching 120 FPS, far beyond real-time requirements. We achieve the best real-time tracking results on multiple benchmarks, including OTB, VOT, and LaSOT, further validating the model's effectiveness.
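A minimal sketch of the two-stage idea, with plain dot-product scores standing in for the SiamRPN coarse head and an L2 distance standing in for the learned relation-network metric (all features and values here are made up for illustration):

```python
import numpy as np

def coarse_match(target, search):
    # Coarse stage: recall-oriented similarity scores; keep top-3 boxes.
    scores = search @ target
    return np.argsort(scores)[::-1][:3]

def fine_match(target, cands):
    # Fine stage: a discriminative distance metric over the candidates
    # (a learned relation network in the paper; plain L2 here).
    return int(np.argmin(np.linalg.norm(cands - target, axis=1)))

target = np.array([1.0, 0.0, 1.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 1.1, 0.0],   # true target, appearance slightly changed
    [1.0, 1.0, 1.0, 1.0],   # distractor with an equally high coarse score
    [0.0, 0.0, 0.1, 0.0],   # background clutter
    [2.0, 0.0, 0.0, 0.0],   # another high-scoring distractor
])
top = coarse_match(target, candidates)          # "keep up": find look-alikes
best = int(top[fine_match(target, candidates[top])])  # "don't follow wrong"
print(best)  # 0
```

In this toy example the distractors tie with the true target in the coarse stage, and the fine stage resolves the ambiguity, mirroring the division of labor between the two matchers.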
Using multi-projection GANs to synthesize 3D shapes from unannotated image collections
Synthesizing 3D Shapes from Unannotated Image Collections using Multi-projection Generative Adversarial Networks
Xiao Li, Yue Dong, Pieter Peers, Xin Tong
3D shape generation is an important problem in computer vision. Traditional methods either train on existing large-scale 3D shape datasets, or reconstruct an object's 3D shape from multiple multi-view photos of that object with known viewpoint information. However, in many practical scenarios it remains difficult to obtain large amounts of high-quality 3D shape data, or multi-view images of the same object with known viewpoints. To address this, we propose a method that generates 3D shapes from a collection of 2D images with unknown viewpoints and no correspondence between images.
Because such unlabeled 2D images have no inter-image correspondence, we have no multi-view samples of any particular object. We observe, however, that a large collection of 2D images as a whole expresses the statistical distribution of 2D projections of 3D shapes under different viewpoints. We can therefore use a generative adversarial network (GAN) to learn this distribution, transforming the problem of multi-view reconstruction of a single object into one of training a generative network to produce 2D images whose statistics under multiple viewpoints match this distribution.
On the other hand, since these unlabeled 2D images carry no viewpoint information, we also cannot obtain 2D image statistics for any particular viewpoint. Solving this requires training a neural network to predict each image's viewpoint, yet training such a viewpoint predictor normally requires viewpoint annotations or 3D shape data to generate training data. To break this circular dependency, we propose jointly training 3D shape generation and viewpoint prediction in alternation, solving the 2D viewpoint prediction and 3D shape generation problems together.
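The projection step that links 3D generation to 2D supervision can be sketched as follows. This toy uses axis-aligned orthographic projection of a binary voxel grid; the actual method uses differentiable projection at predicted viewpoints, and all sizes here are illustrative.

```python
import numpy as np

def project_silhouette(voxels, axis):
    """Project a binary voxel grid to a 2D silhouette along one axis:
    a pixel is filled if any voxel along the viewing ray is occupied.
    A stand-in for projection at an arbitrary sampled viewpoint."""
    return (voxels.sum(axis=axis) > 0).astype(float)

# Toy 3D shape: a solid 2x2x2 cube in the corner of a 4^3 grid.
vox = np.zeros((4, 4, 4))
vox[:2, :2, :2] = 1.0

# Silhouettes of the same shape seen from three viewpoints.
views = [project_silhouette(vox, a) for a in range(3)]
print(views[0].shape)
```

In the full pipeline, silhouettes rendered from generated shapes at predicted viewpoints are fed to a discriminator that compares them against the real, unannotated image collection.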
In practice, our method needs only silhouette images of a given object class under multiple viewpoints, with no assumptions about viewpoint information, viewpoint distribution, or correspondence between images, and it trains a generative network that produces varied geometries for that class of objects. The training process also yields a viewpoint prediction network for silhouette images of that class.
We tested on public 2D silhouette datasets (Caltech-UCSD Birds, Pix3D cars) and obtained good shape generation and viewpoint prediction results. We also ran synthetic-data experiments on the ShapeNet dataset and performed a detailed analysis of the effect of each module of the algorithm.
Finally, this approach of training a high-dimensional data generator from multiple low-dimensional projections extends to other kinds of high-dimensional data. We applied it to material texture generation: given a large number of images of materials under different illuminations, we can train a network to generate varied textures for a given material category.
Mask-guided portrait editing with conditional GANs
Mask-Guided Portrait Editing with Conditional GANs
Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, Lu Yuan
Portrait editing is a popular and practical problem in computer vision. Previous work in this area suffers from at least one of the following problems: it focuses on a specific task (such as opening closed eyes), requires large amounts of annotated expression data (costly), or generates low-quality faces. In the paper "Mask-Guided Portrait Editing with Conditional GANs," we propose a universal, high-quality, controllable method for face portrait editing.
The following is a general framework of our algorithm:
Our network is divided into three main parts: Local Embedding Sub-Network, Mask-Guided Generative Sub-Network, Background Fusing Sub-Network.
The Local Embedding Sub-Network separately encodes five facial regions (left eye, right eye, skin, lips, hair), using an L_local constraint to preserve local features as much as possible through encoding and decoding. The Mask-Guided Generative Sub-Network fuses the encoded local features onto the target mask according to spatial position, producing a portrait without background. The Background Fusing Sub-Network then combines this foreground portrait with the background of the target mask to produce the final result. On the final result, we apply a GAN constraint L_GD so that it follows the distribution of real faces, and an L_GP constraint so that it conforms to the target mask. When the source and target images are the same image, an L_global constraint requires the reconstructed image to be identical to the input.
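The spatial fusion performed by the Mask-Guided Generative Sub-Network can be caricatured in a few lines, with scalar codes standing in for each component's embedding and an integer label map standing in for the target mask (names and shapes are purely illustrative, not the paper's implementation):

```python
import numpy as np

def fuse_components(codes, target_mask, n_regions):
    """Place each facial component's embedding at its region in the
    target mask; spatial fusion over component codes. codes[k] is a
    scalar stand-in for region k's feature embedding."""
    out = np.zeros_like(target_mask, dtype=float)
    for k in range(n_regions):
        out[target_mask == k] = codes[k]
    return out

# Toy target mask with 3 region labels (e.g., skin=0, eye=1, lips=2).
mask = np.array([[0, 0, 1],
                 [0, 2, 1],
                 [2, 2, 1]])
codes = {0: 0.1, 1: 0.5, 2: 0.9}   # per-region embeddings (scalars here)
fused = fuse_components(codes, mask, 3)
print(fused)
```

Because the codes and the mask are independent inputs, editing the mask or swapping in another face's component codes changes the output independently, which is what makes the editing controllable.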
We conducted ablation experiments to verify the effectiveness of the three sub-networks. For face portrait editing specifically, we can edit a face by modifying its mask (lower left figure), or transfer local component codes onto a target face (lower right) to give the input face new attributes. Experiments on face editing, face swapping, and face rendering demonstrate that the method is universal, high-quality, and controllable. Moreover, since the method produces faces paired one-to-one with their masks, it can also serve as data augmentation for face segmentation and yields better segmentation results. The paper also shows results under extreme conditions to demonstrate robustness.
Experimental results of portrait editing
Pyramid-context encoder network for high-quality image inpainting
Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting
Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo
Image inpainting requires an algorithm to fill in the missing region of a damaged image based on the image itself or an image library, so that the restored image looks natural and is hard to distinguish from an undamaged one. By the uncanny valley effect, even subtle inconsistency between the filled content and the undamaged region is highly conspicuous. High-quality image completion therefore requires not only that the generated content be semantically reasonable, but also that its texture be sufficiently sharp and realistic.
Current state-of-the-art inpainting methods fall into two main categories. One is classic texture synthesis, whose core is to sample similar pixels from the undamaged region to fill the missing area. The other is neural generative models, which encode the image into a high-dimensional latent feature and then decode that feature into a restored, complete image. Both approaches, however, have limitations in simultaneously guaranteeing semantics and texture sharpness.
Experimental results of different methods on the human face
Through extensive experimentation and observation, we propose using high-level semantic features as guidance and performing multiple rounds of completion from deep to shallow layers, so that the network maintains semantic consistency while generating richer, sharper texture details. The result is the Pyramid-Context Encoder Network (PEN-Net).
PEN-Net uses U-Net as its backbone. We observe that low-level features carry richer texture detail while high-level features carry more abstract semantics, and that high-level features can guide the completion of low-level features layer by layer. The core of PEN-Net is an attention mechanism that computes, on a high-level feature map, the similarity between the damaged and undamaged regions, and then applies those attention weights to complete the feature map one level below. Each completed feature map in turn guides the completion of the missing region at the next level down, until the shallowest, pixel-level layer is reached. In this process the network performs completion at multiple feature levels. Finally, the decoder combines the completed features with the high-level semantic features to generate the final image, which is not only semantically reasonable but also has sharper, richer texture details.
Pyramid-Context Encoder Network (PEN-Net)
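The cross-layer attention transfer at the core of PEN-Net can be sketched in one dimension: attention weights are computed from high-level (semantic) features, then used to copy low-level (texture) features from the known region into the hole. A hypothetical 1-D toy with 1x1 patches, not the paper's implementation:

```python
import numpy as np

def attention_transfer(high, low, hole):
    """For each hole position, compute softmax attention over known
    positions using high-level features, then fill the low-level
    feature with the same weights (cross-layer attention transfer)."""
    filled = low.copy()
    known = np.where(~hole)[0]
    for i in np.where(hole)[0]:
        sims = high[known] * high[i]        # similarity at the high level
        w = np.exp(sims - sims.max())
        w /= w.sum()                        # softmax attention weights
        filled[i] = (w * low[known]).sum()  # weighted copy at the low level
    return filled

high = np.array([1.0, 1.0, 0.9, -1.0, -1.0])   # abstract semantics
low  = np.array([5.0, 4.0, 0.0,  2.0,  1.0])   # rich texture detail
hole = np.array([False, False, True, False, False])

out = attention_transfer(high, low, hole)
print(out.round(2))
```

The hole is filled mostly from the positions whose high-level features are most similar, so the copied texture remains semantically consistent with its surroundings.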
SeerNet: Predicting the sparsity of convolutional neural network feature maps by low-bit quantization
SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization
Shijie Cao, Lingxiao Ma, Wencong Xiao, Chen Zhang, Yunxin Liu, Lintao Zhang, Lanshun Nie, Zhi Yang
Deep neural networks have made major breakthroughs in image, speech, and language tasks, relying heavily on ever larger and deeper networks. The growing size and computational complexity of these models makes it difficult to meet the low-latency, high-throughput, and energy-efficiency requirements of model inference, even with the most expensive, highest-performance devices (e.g., TPUs, GPUs).
In fact, current neural network models are built on dense matrix operations, wasting considerable compute and bandwidth on both GPUs and TPUs. Many researchers have recognized that neural networks contain substantial sparsity: with appropriate pruning, many networks can maintain accuracy while reducing computation. Moreover, very large sparse networks are emerging, such as the Mixture-of-Experts model proposed by Geoffrey Hinton and colleagues.
This paper focuses on the sparsity of output feature maps in convolutional neural networks. In a CNN, each convolutional layer is usually followed by a ReLU or max-pooling layer, after which most of the convolutional outputs are zeroed or discarded. From a computational standpoint, if the convolution computations leading to those zeroed or discarded outputs could be skipped, the cost of the convolutional layer would drop dramatically.
The paper presents SeerNet; a "seer" is one who foresees, and as the name suggests, we use a very low-bit network to predict output feature sparsity at very low cost, then accelerate the network by performing sparse full-precision computation. SeerNet applies directly to pre-trained models without any modification or retraining of the original model.
The figure below outlines the core idea. For each convolutional layer, a quantized low-bit (e.g., 4-bit, 2-bit, or 1-bit) network predicts the sparse distribution of the output features; this sparsity information then guides the full-precision inference, i.e., convolution is computed only for the valid (non-zero) outputs.
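The two-pass idea can be sketched with a 1-D convolution: a low-bit quantized pass predicts which outputs the following ReLU will zero out, and full precision is spent only on the rest. A toy NumPy sketch with made-up sizes; the real system uses custom quantized kernels.

```python
import numpy as np

def quantize(x, bits=4):
    # Uniform symmetric quantization: the cheap low-bit "seer".
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def conv1d(x, w):
    n = len(x) - len(w) + 1
    return np.array([float(np.dot(x[i:i + len(w)], w)) for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=64)
w = rng.normal(size=5)

# Pass 1 (cheap): predict which outputs ReLU would keep (> 0).
pred = conv1d(quantize(x), quantize(w)) > 0

# Pass 2 (expensive): full-precision conv only where the seer predicts > 0.
out = np.zeros(len(x) - len(w) + 1)
for i in np.where(pred)[0]:
    out[i] = max(0.0, float(np.dot(x[i:i + len(w)], w)))

exact = np.maximum(conv1d(x, w), 0.0)
print(pred.mean(), np.abs(out - exact).max())
```

Mispredictions can only occur where the true output is near zero, so the accuracy cost stays small while roughly half of the full-precision convolutions are skipped in this example.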
With sparse algorithms designed for hardware optimization, the paper achieves up to 5.79x speedup on individual convolutional layers on CPU, and 1.2x-1.4x speedup in end-to-end model inference. Moreover, as new AI hardware adds better support for mixed-precision computation, SeerNet will find broader use: NVIDIA's recently launched Turing architecture supports 16/8/4-bit mixed-precision tensor cores, and Xilinx and Altera FPGAs support arbitrary-precision integer computation. Such hardware support for low-bit operations can reduce the overhead of the prediction step, while custom algorithms and architectures can maximize the benefit of sparse computation.
SeerNet's sparsity and speedup at different layers of ResNet and VGG
The full list of papers accepted from Microsoft Research Asia is as follows: