The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) will be held in Long Beach, USA, in June. A total of 1,299 papers from around the world were accepted to the conference. Chinese teams performed well: 58 papers from Tencent were accepted at this year's CVPR, including 25 from Tencent Youtu Lab and 33 from Tencent AI Lab, a large increase over the past two years.
As the top conference in the field of computer vision, the papers accepted to CVPR 2019 represent the latest and most advanced work in the field and point to its future directions. According to the CVPR website, 5,165 papers were submitted this year, of which 1,299 were accepted. These results cover the frontiers of computer vision research. In 2019, 58 papers from Tencent were accepted, including 33 from Tencent AI Lab and 25 from Tencent Youtu Lab, compared with 31 in 2018 and 18 in 2017, a substantial increase over the previous two years.
Tencent's papers cover hot and cutting-edge areas such as deep learning optimization, adversarial learning in vision, face modeling and recognition, deep video understanding, person re-identification, and face detection. These leading research results demonstrate Tencent's strong talent pool and its research and innovation capabilities in computer vision. The novel algorithms not only have rich application scenarios that bring computer vision into daily life, but also provide valuable experience and directions for follow-up research.
Below are some of the Tencent Youtu papers accepted at CVPR 2019:
Unsupervised Person Re-identification by Soft Multilabel Learning
Unsupervised person re-identification based on soft multi-label learning
Compared with supervised person re-identification (RE-ID), unsupervised RE-ID has attracted increasing attention thanks to its better scalability. However, under non-overlapping multi-camera views, the absence of pairwise labels makes learning discriminative features challenging. To overcome this problem, we propose a deep model for soft multi-label learning in unsupervised RE-ID. The idea is to assign a soft label (a real-valued likelihood vector rather than a one-hot label) to each unlabeled person by comparing the person with a set of known reference persons from an auxiliary domain. Based on the consistency between the visual features and the soft labels of unlabeled target pairs, we propose a soft multi-label-guided hard negative mining method to learn a discriminative embedding. Since most target pairs come from different camera views, we further propose a cross-view soft multi-label consistency learning method to ensure that labels remain consistent across views. To make soft label learning efficient, we introduce reference agent learning. Our method is evaluated on Market-1501 and DukeMTMC-reID, and significantly outperforms the current best unsupervised RE-ID methods.
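The core soft-label idea can be sketched as follows: compare an unlabeled person's feature against a set of reference agents and read off a likelihood vector, then score pairs by how much their soft labels agree. This is a minimal numpy sketch assuming L2-normalized features; the function names, the temperature parameter, and the L1-based agreement score are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def soft_multilabel(feature, ref_agents, temperature=1.0):
    """Compare an unlabeled person's feature with reference agents
    (one per known reference person in the auxiliary domain) and
    return a likelihood vector over those references."""
    logits = ref_agents @ feature / temperature  # cosine similarities if inputs are L2-normalized
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()

def label_agreement(y1, y2):
    """Soft-label agreement between two persons: close to 1 when their
    likelihood distributions over the references are similar."""
    return 1.0 - 0.5 * np.abs(y1 - y2).sum()     # 1 minus half the L1 distance, in [0, 1]
```

A visually similar pair with low `label_agreement` would then be treated as a hard negative during embedding learning.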
Visual Tracking via Adaptive Spatially-Regularized Correlation Filters
Visual tracking via adaptively spatially regularized correlation filters
Adversarial Attacks Beyond the Image Space
Adversarial attacks beyond the image space
Generating adversarial examples is an important way to understand the working mechanisms of deep neural networks. Most existing methods produce perturbations in the image space, i.e. they modify each pixel independently. In this paper, we focus on a subset of adversarial examples that correspond to meaningful changes of three-dimensional physical properties, such as rotation, translation, and lighting conditions. Arguably these attacks raise a more serious concern, because they show that simply perturbing 3D objects and scenes in the real world can also cause neural networks to misclassify.
For classification and visual question answering tasks, we extend existing neural networks, which take 2D input, by placing a rendering module in front of them. Our pipeline renders a 3D scene (physical space) into a 2D image (image space), and then maps it through a neural network to a prediction (output space). Adversarial perturbations in this pipeline therefore go beyond the image space and have a clear meaning in the 3D physical world. While image-space attacks can be interpreted as changes in per-pixel albedo, we show that they cannot be well explained in physical space, where effects are usually non-local. Attacks in physical space are harder to carry out than those in image space: they have lower success rates and require larger perturbations.
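The physical-space attack can be illustrated with a toy pipeline: scene parameters are rendered into an "image", the image is classified, and gradient ascent is performed on the scene parameters rather than on the pixels. Everything below (the two-parameter scene, the linear "renderer" and "classifier", the finite-difference gradients) is an invented stand-in for the paper's differentiable renderer and deep network, included only to show where the perturbation lives.

```python
import numpy as np

# Toy stand-in for the paper's pipeline: physical scene parameters are
# rendered into an image, which a small "network" then classifies.
def render(params):
    angle, light = params                        # two physical properties
    return light * np.array([np.cos(angle), np.sin(angle)])

def classify(image):
    W = np.array([[2.0, 0.0], [0.0, 2.0]])       # toy classifier weights
    b = np.array([0.0, 1.0])                     # bias toward class 1
    return W @ image + b                         # class scores

def physical_attack(params, true_class, steps=30, lr=0.05, eps=1e-5):
    """Gradient ascent on the PHYSICAL parameters (via finite
    differences), so the perturbation lives in scene space rather
    than in pixel space."""
    p = params.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(p)
        for i in range(p.size):
            d = np.zeros_like(p)
            d[i] = eps
            hi = -classify(render(p + d))[true_class]
            lo = -classify(render(p - d))[true_class]
            grad[i] = (hi - lo) / (2 * eps)      # central difference
        p += lr * grad                           # increase the loss
        p[1] = max(p[1], 0.05)                   # keep lighting plausible
    return p
```

Starting from `[0.0, 1.0]`, where class 0 is predicted, the attack dims the lighting until the prediction flips to class 1 without ever editing pixels directly.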
Learning Context Graph for Person Search
A person search model based on a context graph network
This paper is a joint work led by Tencent Youtu Lab and Shanghai Jiao Tong University.
In recent years, deep neural networks have achieved great success in person search. However, these methods are usually based on the appearance of a single person, and still struggle with pose variations, illumination changes, occlusion, and so on. This paper presents a new person search model based on context information. The proposed model takes the other pedestrians in the scene as context and uses a graph convolutional network to model their influence on the target person. We set new records on the two well-known person search datasets, CUHK-SYSU and PRW, achieving the best top-1 retrieval results.
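One graph-convolution step of this context idea can be sketched directly: the target person and the co-occurring pedestrians form a small graph, and each node updates its feature by mixing in its neighbours'. A minimal numpy sketch; the adjacency construction, the single layer, and the identity-like weight used in the example are simplifications, not the paper's architecture.

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One graph-convolution step over a scene graph whose nodes are
    the target person plus the other pedestrians in the scene.
    Self-loops are added and the adjacency is row-normalized, so each
    node averages itself with its neighbours before the linear map."""
    a = adj + np.eye(adj.shape[0])            # add self-loops
    a = a / a.sum(axis=1, keepdims=True)      # row-normalize
    return np.maximum(a @ node_feats @ weight, 0.0)  # ReLU
```

Matching a query scene against a gallery scene would then compare the updated feature of the target node rather than its raw appearance feature alone.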
Underexposed Photo Enhancement using Deep Illumination Estimation
Low-light image enhancement based on deep illumination estimation
This paper introduces a new end-to-end network for enhancing underexposed photos. Instead of directly learning an image-to-image mapping as in previous work, we introduce an intermediate illumination map into the network to associate the input with the expected enhancement result, which strengthens the network's ability to learn complex photographic adjustments from expert-retouched input/output image pairs. Based on this model, we formulate a loss function with constraints and priors on the intermediate illumination. We prepare a new dataset of 3,000 underexposed image pairs and train the network to effectively learn adjustments under various lighting conditions. With these measures, our network can recover clear details, sharp contrast, and natural color in the enhanced results. Extensive experiments on the MIT-Adobe FiveK benchmark and our new dataset show that our network can effectively handle images that were previously difficult.
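The role of the intermediate illumination map follows the Retinex idea: the underexposed image is modeled as the desired image scaled by a per-pixel illumination, so enhancement is a pixel-wise division. A minimal numpy sketch, with the illumination map passed in directly rather than predicted by a network as in the paper.

```python
import numpy as np

def enhance(image, illumination, eps=1e-6):
    """Recover the enhanced image from an underexposed input I and a
    per-pixel illumination map S with values in (0, 1]:
    output = I / S, clipped back to the valid [0, 1] range."""
    return np.clip(image / np.maximum(illumination, eps), 0.0, 1.0)
```

A uniformly dark region (e.g. intensity 0.2 under illumination 0.4) is brightened to 0.5, while already-bright pixels are protected by the clip.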
Homomorphic Latent Space Interpolation for Unpaired Image-to-image Translation
Unpaired image-to-image translation based on homomorphic latent-space interpolation
Generative adversarial networks have achieved great success in unpaired image-to-image translation. Cycle consistency makes it possible to model the relationship between two different domains without paired data. In this paper, we propose an alternative framework, an extension of latent-space interpolation, which considers the intermediate region between the two domains during image translation. It builds on the fact that, in a flat and smooth latent space, multiple paths connect any two sample points. Properly selecting the interpolation path allows chosen image attributes to change along the way, which is very useful for generating intermediate images between the two domains. We also show that the framework applies to multi-domain and multi-modal translation. Extensive experiments show that the framework is general and applicable to a wide range of tasks.
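The simplest interpolation path the framework builds on is the straight line between two latent codes; decoding each point on the path would yield an intermediate image. A numpy sketch of just the path construction, with the encoder and decoder omitted; linear interpolation is only one of the many paths the paper considers.

```python
import numpy as np

def interpolate_path(z_a, z_b, n_steps):
    """Latent codes along the straight line from z_a (source domain)
    to z_b (target domain). In a flat, smooth latent space every
    point on the path decodes to a plausible intermediate image."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - t) * z_a + t * z_b for t in ts])
```

Choosing a curved path instead of this straight line is what lets the method control which attributes change between the two endpoints.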
A biplanar X-ray to CT generation system based on generative adversarial networks
CT imaging provides a three-dimensional panoramic view that helps doctors assess the condition of a patient's tissues and organs and diagnose disease. However, compared with X-ray imaging, CT delivers a higher radiation dose to the patient and costs more. Traditional 3D CT reconstruction collects a large number of X-ray projections around the center of the object, which a conventional X-ray machine cannot do. In this paper, we propose an innovative method based on generative adversarial networks that reconstructs realistic 3D CT images from only two orthogonal 2D X-ray images. Its core innovations include a dimension-raising generation network and a multi-view feature fusion algorithm. Experiments and quantitative analysis show that our method outperforms competing approaches on 2D X-ray to 3D CT reconstruction, and visualizations of the reconstructed CT volumes show that it recovers more realistic details. In practice, without changing the existing X-ray imaging workflow, our method can provide doctors with an additional CT-like 3D image to support diagnosis.
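The multi-view fusion step can be illustrated in miniature: each orthogonal 2D view is lifted into 3D by repeating it along the axis that view cannot resolve, and the two aligned volumes are averaged. A numpy sketch of the geometric bookkeeping only; in the actual system this fusion is learned inside the generative network rather than fixed as a simple average.

```python
import numpy as np

def fuse_biplanar(front, side):
    """Lift two orthogonal X-ray views into a common 3D volume.
    front: shape (H, W), viewed along the depth axis D;
    side:  shape (H, D), viewed along the width axis W.
    Each view is repeated along its missing axis, then the two
    aligned (H, D, W) volumes are averaged."""
    h, w = front.shape
    _, d = side.shape
    vol_front = np.repeat(front[:, np.newaxis, :], d, axis=1)  # (H, D, W)
    vol_side = np.repeat(side[:, :, np.newaxis], w, axis=2)    # (H, D, W)
    return 0.5 * (vol_front + vol_side)
```

A voxel then carries evidence from both projections, which is what makes two views enough to constrain the 3D structure.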