Tencent AI Lab's Machine Learning Center today announced PocketFlow, the world's first automated deep learning model compression framework.
With it, developers can quickly bring AI technology to mobile products and process user data efficiently on-device, without needing to understand the details of the underlying compression algorithms. The framework currently provides model compression and acceleration support for several of Tencent's mobile services and has been applied in a variety of mobile apps.
1. PocketFlow background
With the rapid development of AI technology, more and more companies want to build AI capabilities into their mobile products to improve the user experience. AI techniques represented by deep learning have greatly improved recognition accuracy in application fields such as image understanding and speech recognition. However, mainstream deep learning models often demand substantial computational resources and are hard to deploy directly on consumer mobile devices. A common workaround is to host the complex deep learning model in the cloud: the mobile client uploads data and waits for the result. But this approach depends on fast network transmission, gives users in areas with poor network coverage a degraded experience, and makes data privacy hard to guarantee once user data has been uploaded to the cloud.
Against this backdrop, a number of model compression and acceleration algorithms have emerged that can markedly improve the computational efficiency of CNN and RNN architectures with little (or even no) loss of accuracy, making on-device deployment of deep learning models feasible. However, choosing the right compression and acceleration algorithm, and the corresponding hyperparameter values, for a given application scenario usually requires considerable expertise and hands-on experience, which raises the bar for ordinary developers.
In this context, Tencent AI Lab's Machine Learning Center developed the open-source PocketFlow framework to automate deep learning model compression and acceleration and help AI technology reach more mobile products. By integrating multiple model compression algorithms and introducing a hyperparameter optimization component, PocketFlow greatly increases the degree of automation: developers do not need to choose a compression algorithm or tune its hyperparameters by hand. They only specify the desired performance targets, and PocketFlow produces a compressed model that meets the requirements and can be quickly deployed in mobile applications.
2. AI Lab's research progress behind PocketFlow
AI Lab's Machine Learning Center has been investing for some time in deep learning model compression and hyperparameter optimization, and has made substantial progress. On the model compression side, the team proposed a discrimination-aware channel pruning algorithm that greatly reduces the computational complexity of a CNN model with no loss of accuracy; the corresponding paper was published at NIPS 2018. To strengthen the discriminative power of each layer, the algorithm introduces several additional loss terms during training and then prunes channels layer by layer, minimizing both classification error and reconstruction error, so that redundant channels with little discriminative power are removed and the model is compressed losslessly. On the hyperparameter optimization side, the team developed AutoML, a framework for automatic hyperparameter optimization that integrates several algorithms, including Gaussian Processes (GP) and the Tree-structured Parzen Estimator (TPE). By fully automating the search, it eliminates the time-consuming and laborious manual tuning step and greatly improves algorithm development efficiency.
On the other hand, since training a deep learning model typically takes a long time, the team optimized multi-machine, multi-GPU training on top of TensorFlow to reduce the communication overhead of gradient exchange during distributed optimization, producing a distributed optimization framework named TF-Plus: with only a dozen or so lines of code changes, single-GPU training code can be extended to a multi-machine, multi-GPU version with near-linear speedup. In addition, the team proposed a quantized stochastic gradient descent algorithm with error compensation. By feeding the quantization error back into subsequent updates, it can compress gradients by one to two orders of magnitude without performance loss, reducing gradient traffic in distributed optimization and thereby speeding up training; this work was published at ICML 2018.
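The error-compensation idea can be illustrated with a minimal NumPy sketch (names are illustrative; the published ICML 2018 algorithm is considerably more refined): the quantization error left over at each step is carried forward and added to the next gradient before quantizing, so no information is permanently lost.

```python
import numpy as np

def quantize_with_error_feedback(grad, residual, num_levels=16):
    """Uniformly quantize a gradient to num_levels levels, carrying the
    quantization error over to the next step (error compensation)."""
    corrected = grad + residual                      # add error from previous step
    scale = max(np.max(np.abs(corrected)), 1e-12)    # avoid division by zero
    # snap each value to the nearest of num_levels uniform levels in [-scale, scale]
    q = np.round(corrected / scale * (num_levels - 1)) / (num_levels - 1) * scale
    new_residual = corrected - q                     # error to compensate next step
    return q, new_residual
```

In a distributed setting, only `q` (representable with a few bits per entry plus one scale) would be transmitted, while each worker keeps its own `new_residual` locally.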
During the development of PocketFlow, the team incorporated the self-developed algorithms described above, effectively reducing the accuracy loss of model compression, improving training efficiency, and greatly increasing the automation of hyperparameter tuning.
3. Introduction to the PocketFlow framework
The PocketFlow framework consists of two main components: a model compression/acceleration component and a hyperparameter optimization component. The overall structure is shown in the figure below.
Developers feed an uncompressed original model into PocketFlow and specify the desired performance targets, such as the model's compression and/or speedup ratio. In each iteration, the hyperparameter optimization component selects a set of hyperparameter values; the model compression/acceleration component then compresses the original model under that configuration to produce a candidate compressed model; based on the candidate's measured performance, the hyperparameter optimization component updates its internal model and selects a new hyperparameter configuration for the next iteration. When the search terminates, PocketFlow returns the best hyperparameter configuration found and the corresponding candidate model as the final output, ready for the developer to deploy on mobile devices.
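The iteration described above can be sketched as a simple search loop. All names here are illustrative, not the actual PocketFlow API:

```python
def search(original_model, optimizer, compressor, evaluate, num_iters=100):
    """Outer loop of an automated compression search (illustrative sketch):
    propose hyperparameters, compress, evaluate, feed the score back."""
    best_model, best_score = None, float("-inf")
    for _ in range(num_iters):
        hparams = optimizer.propose()                       # e.g. per-layer prune ratios
        candidate = compressor.compress(original_model, hparams)
        score = evaluate(candidate)                         # accuracy under the target budget
        optimizer.update(hparams, score)                    # refine the optimizer's surrogate model
        if score > best_score:
            best_model, best_score = candidate, score
    return best_model
```

The `optimizer` here stands in for a GP-, TPE-, or reinforcement-learning-based search strategy; the `evaluate` callback is where the compression/speedup constraint and the accuracy measurement would live.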
Specifically, PocketFlow achieves compression and acceleration of deep learning models with low accuracy loss and a high degree of automation by combining the following algorithm components:
A) Channel pruning component: in CNNs, pruning the channel dimension of feature maps reduces model size and computational complexity simultaneously, and the compressed model can be deployed directly with existing deep learning frameworks. On the CIFAR-10 image classification task, channel pruning of a ResNet-56 model achieves 2.5x speedup with a 0.4% drop in classification accuracy, and 3.3x speedup with a 0.7% drop.
B) Weight sparsification component: introducing sparsity constraints on network weights greatly reduces the number of non-zero weights; the compressed model's weights can then be stored and transmitted as sparse matrices, achieving model compression. For the MobileNet image classification model, removing 50% of the network weights costs only 0.6% in Top-1 classification accuracy on the ImageNet dataset.
C) Weight quantization component: introducing quantization constraints reduces the number of bits needed to represent each network weight. The team provides both uniform and non-uniform quantization algorithms, which can fully exploit hardware optimizations on ARM and FPGA devices to improve on-device computing efficiency, and which also provide software support for future neural network chip designs. Taking a ResNet-18 model on the ImageNet image classification task as an example, 8-bit fixed-point quantization achieves 4x compression with no loss of accuracy.
D) Network distillation component: for the compression components above, the outputs of the uncompressed original model can serve as additional supervisory signals to guide the training of the compressed model, improving accuracy by 0.5%-2.0% at the same compression/speedup ratio.
E) Multi-GPU training component: deep learning model training demands substantial computing resources, and a single GPU can rarely finish training in a short time, so the team provides comprehensive support for multi-machine, multi-GPU distributed training to speed up development. Both a ResNet-50 image classification model on ImageNet and a Transformer machine translation model on WMT14 can be trained within an hour.
F) Hyperparameter optimization component: most developers are not experts in model compression algorithms, yet hyperparameter values often have a large impact on the final result. The team therefore added a hyperparameter optimization component that uses algorithms such as reinforcement learning, together with AI Lab's AutoML automatic hyperparameter optimization framework, to determine the optimal hyperparameter configuration for the given performance requirements. For example, with the channel pruning algorithm, the component can automatically assign different pruning ratios to different layers according to each layer's redundancy, maximizing the compressed model's accuracy while preserving the overall compression ratio.
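To make component A) concrete, here is a minimal NumPy sketch of magnitude-based channel pruning. It simply keeps the output channels with the largest L1 norm; the discrimination-aware criterion from the team's NIPS 2018 paper is more sophisticated:

```python
import numpy as np

def prune_channels(weights, keep_ratio=0.5):
    """Prune output channels of a conv kernel with shape (H, W, C_in, C_out),
    keeping the channels with the largest L1 norm."""
    norms = np.abs(weights).sum(axis=(0, 1, 2))        # one L1 norm per output channel
    n_keep = max(1, int(weights.shape[-1] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-n_keep:])        # indices of the strongest channels
    return weights[..., keep], keep
```

In a full network, the `keep` indices would also be used to shrink the input dimension of the following layer, which is what makes channel pruning directly deployable without sparse kernels.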
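Component B) can be illustrated with a simple magnitude-based sparsification sketch in NumPy; the (index, value) pair returned at the end stands in for the sparse-matrix format in which a sparsified model would be stored and transmitted:

```python
import numpy as np

def sparsify(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights; return the pruned dense
    tensor plus a compact (indices, values) representation."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)                      # number of weights to drop
    if k > 0:
        threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
        weights = np.where(np.abs(weights) <= threshold, 0.0, weights)
    idx = np.flatnonzero(weights)
    return weights, idx, weights.ravel()[idx]
```

Note that ties at the threshold may zero slightly more than the requested fraction; production implementations typically also re-train with the sparsity mask fixed to recover accuracy.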
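A minimal sketch of the uniform quantization in component C): each weight tensor is mapped to signed 8-bit integers plus a single floating-point scale. Real schemes often use per-channel scales and tuned clipping ranges, so treat this as an illustration only:

```python
import numpy as np

def quantize_uniform(weights, num_bits=8):
    """Symmetric uniform fixed-point quantization: float weights become
    signed num_bits integers plus one float scale per tensor."""
    qmax = 2 ** (num_bits - 1) - 1                     # e.g. 127 for 8 bits
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Storing `q` instead of 32-bit floats is what yields the roughly 4x compression at 8 bits cited above; inference kernels on ARM can also run the integer arithmetic directly.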
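Component D) amounts to adding a teacher-matching term to the training loss. Below is a NumPy sketch of the classic soft-target distillation loss; PocketFlow's exact formulation may differ:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T                                          # temperature softening
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term pulling the
    student's softened outputs toward the (uncompressed) teacher's."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels])
    return np.mean(alpha * hard + (1 - alpha) * (T * T) * kl)
```

Here the compressed model plays the student and the uncompressed original model plays the teacher, matching the description above; `T` and `alpha` are hyperparameters the optimization component could tune.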
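As a baseline for component F), here is a plain random-search loop over a hyperparameter space. The GP and TPE optimizers integrated in PocketFlow play the same role but choose each trial far more sample-efficiently:

```python
import random

def random_search(objective, space, num_trials=50, seed=0):
    """Random search over a box-constrained hyperparameter space:
    the baseline any GP- or TPE-based optimizer should beat."""
    rng = random.Random(seed)
    best_hp, best_val = None, float("-inf")
    for _ in range(num_trials):
        hp = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        val = objective(hp)                            # e.g. compressed-model accuracy
        if val > best_val:
            best_hp, best_val = hp, val
    return best_hp, best_val
```

For channel pruning, `space` would hold one prune-ratio range per layer and `objective` would compress, fine-tune, and evaluate the model while penalizing configurations that miss the overall compression target.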
4. PocketFlow performance results
The hyperparameter optimization component not only removes the high barrier and tedium of manual tuning, but also lets PocketFlow outperform manual tuning across its compression algorithms. Taking image classification as an example, PocketFlow effectively compresses and accelerates ResNet and MobileNet models on the CIFAR-10 and ImageNet datasets.
On the CIFAR-10 dataset, PocketFlow prunes channels of a ResNet-56 baseline model, aided by hyperparameter optimization and training strategies such as network distillation, reaching 2.5x speedup with a 0.4% classification accuracy drop and 3.3x speedup with a 0.7% drop, significantly better than an uncompressed ResNet-44 model. On the ImageNet dataset, PocketFlow can further sparsify the weights of the already highly streamlined MobileNet model, achieving similar classification accuracy with a smaller model: compared with models such as Inception-V1 and ResNet-18, the model size is only about 20-40% of theirs while the classification accuracy is essentially the same (or even higher).
Compared with manual tuning, the AutoML hyperparameter optimization component in PocketFlow matches manually tuned performance within only 10 iterations; after 100 iterations, the hyperparameter configurations it finds reduce the accuracy loss by a further 0.6% or so. By using the component to automatically determine the quantization bit-width of each layer's weights, PocketFlow achieves consistent improvements when compressing a ResNet-18 model on ImageNet: at an average of 4 quantization bits, classification accuracy improves from 63.6% to 68.1% (the uncompressed model's accuracy is 70.3%).
5. PocketFlow powers real-world mobile services
Within Tencent, PocketFlow is already providing model compression and acceleration support for a number of production mobile services. For example, in mobile photography apps, a facial landmark localization model is a common pre-processing module: by detecting and locating more than a hundred facial feature points (eye corners, nose tip, and so on), it supplies the feature data needed by downstream applications such as face recognition and intelligent beautification. Using PocketFlow, the team compressed the facial landmark localization model, greatly reducing its computational cost while keeping localization accuracy unchanged, with speedups ranging from 25% to 50% on different mobile processors. The compressed model has been deployed in real products.
Compression and acceleration of deep learning models is a hot topic in current academic research and has broad applications in industry. With PocketFlow, developers need to know neither the details of compression algorithms nor how to select and tune hyperparameters; the automated framework quickly produces a streamlined model ready for mobile deployment, paving the way for AI capabilities in more mobile products.
References:
[1] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Jiezhang Cao, Qingyao Wu, Junzhou Huang, Jinhui Zhu. "Discrimination-aware Channel Pruning for Deep Neural Networks", NIPS 2018.
[2] Jiaxiang Wu, Weidong Huang, Junzhou Huang, Tong Zhang. "Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization", ICML 2018.