
Tencent's first AI open source project Angel released 3.0 milestone version: Towards a full stack machine learning platform

via: 博客园 | time: 2019/8/23 23:32:39

On August 22, 2019, Tencent's first AI open source project, Angel, was officially released in version 3.0. Angel 3.0 attempts to create a full-stack machine learning platform that covers all phases of machine learning: feature engineering, model training, hyper-parameter tuning, and model services.


Overview

Angel is a distributed machine learning platform open-sourced by Tencent and built on a parameter-server architecture. It is dedicated to sparse-data model training and large-scale graph data analysis. Jointly developed by Tencent and Peking University, it combines the high availability demanded by industry with academic innovation, and it is currently an incubation project of the LF AI Foundation (formerly the Linux Foundation Deep Learning Foundation). Compared with similar platforms such as TensorFlow, PyTorch, and Spark, it has the following characteristics:

  • Angel is a high-performance distributed machine learning platform built around the parameter server (PS) concept. Its flexible, customizable PS Function (PSF) mechanism can push parts of the computation down to the PS side, and the PS's mature scale-out capability allows Angel to efficiently handle models with billions of dimensions.

  • Angel has a math library specifically optimized for high-dimensional sparse features, with performance up to 10 times that of the Breeze math library. Angel's PS and built-in algorithm kernel are built on top of this math library.

  • Angel excels at recommendation models and graph-network models (such as social network analysis). The following figure compares Angel with several mainstream industry platforms in terms of sparse data, model dimensions, performance, deep models, and ecosystem. TensorFlow and PyTorch have clear advantages in deep learning and ecosystem building but are relatively weak at handling sparse data and high-dimensional models, which is exactly where Angel complements them. The PyTorch On Angel introduced in version 3.0 attempts to combine the advantages of PyTorch and Angel.


Angel's comparison with mainstream platforms in the industry

Since its launch inside Tencent in early 2016, Angel has been used in WeChat Pay, QQ, Tencent Video, Tencent social advertising, and user-profile mining.

In June 2017, Angel was quietly open-sourced on GitHub. Within two weeks of going open source, the project had gained 183 Watches, 1,693 Stars, and 389 Forks on GitHub, attracting the attention and contributions of many industry engineers.

In September 2018, Angel 2.0 was released, supporting the training of models with 100 billion dimensions and shipping a much richer algorithm library, with deep learning algorithms and graph algorithms introduced for the first time. In the same year, Angel joined the Linux Foundation's Deep Learning Foundation (since renamed the LF AI Foundation). Backed by the Foundation's well-established operations, the fully upgraded Angel 2.0 continues to engage with the international open-source community, committed to making machine learning technology easier to adopt in both research and applications.

Let's take a look at the new features that are worth paying attention to in the Angel 3.0 milestone.

Angel System Architecture

The Angel 3.0 system architecture is shown below:

Angel 3.0 architecture

Angel's self-developed high-performance math library is the foundation of the entire system. Angel's PS function and built-in algorithm kernel are based on this math library.

Angel PS provides efficient, stable and flexible parameter storage and exchange services. In version 3.0, we extended the Angel PS functionality so that it can store any type of object. A typical example is the use of Angel PS to store a large number of complex objects during the implementation of the graph algorithm.
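
The toy sketch below illustrates this idea in miniature: a key-value store whose values can be arbitrary objects (for example, adjacency information for a graph algorithm) alongside plain weight vectors. It is a single-process conceptual stand-in, not Angel's actual PS API:

class ToyParameterServer:
    # A single-process stand-in for one partition of a distributed PS.
    def __init__(self):
        self.store = {}  # key -> arbitrary Python object

    def push(self, key, value):
        self.store[key] = value

    def pull(self, key):
        return self.store.get(key)

ps = ToyParameterServer()
ps.push("w", [0.0] * 8)                        # a plain model weight vector
ps.push("adj:42", {"neighbors": [7, 13, 99]})  # a complex graph object for node 42
print(ps.pull("adj:42")["neighbors"])          # [7, 13, 99]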

MLcore is Angel's self-developed algorithm kernel. It supports automatic differentiation, and algorithms can be defined and run through JSON configuration files. In addition, version 3.0 integrates PyTorch as a compute engine. Above the compute-engine layer sit the computing frameworks, which can be thought of as containers for the compute engines; three are currently supported: native Angel, Spark On Angel (SONA), and PyTorch On Angel (PyTONA), which let Spark and PyTorch users switch to the Angel platform seamlessly. At the top are two common components: AutoML and model serving.

Angel 3.0 new features


An overview of Angel 3.0 (red: new features; white: existing features under continuous improvement)

The image above provides a holistic view of the Angel 3.0 features. Angel 3.0 attempts to build a full-stack machine learning platform with features that cover all phases of machine learning: feature engineering, model training, hyper-parameter tuning, and model services.

Angel's feature engineering module is developed on top of Spark. It enhances Spark's feature selection capabilities and adds feature crossing and re-indexing for automatic feature generation; these components can be seamlessly integrated into Spark pipelines. To make the whole system more intelligent, Angel 3.0 adds automatic hyperparameter tuning, currently supporting three algorithms: random search, grid search, and Bayesian optimization. For model serving, Angel 3.0 provides a cross-platform component, Angel Serving, which not only meets Angel's own needs but can also serve models from other platforms.

On the ecosystem side, Angel also extends its parameter server (PS) capability to other computing platforms, with two integrations completed so far: Spark On Angel and PyTorch On Angel. Each has its own strengths and focus. Spark On Angel uses Angel's built-in algorithm kernel and covers machine learning algorithms common in recommendation as well as basic graph algorithms. PyTorch On Angel uses PyTorch as its compute kernel and mainly covers deep learning algorithms for recommendation and graph deep learning algorithms.

Automatic feature engineering

Feature engineering, such as feature crossing and selection, is important for machine learning applications in industry. Spark provides some feature selection operators, but there are still some limitations. Angel provides more feature selection operators based on Spark:

  • Statistics-based operators, including VarianceSelector and FtestSelector (the variance-based idea is sketched after this list)
  • Model-based operators, including LassoSelector and RandomForestSelector
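
As a concrete illustration of the statistics-based category, here is a minimal numpy sketch of the idea behind a variance selector: keep the k features whose values vary most across samples. This shows the concept only, not Angel's VarianceSelector implementation:

import numpy as np

def variance_select(X, k):
    # Keep the indices of the k features whose values vary most across samples.
    return np.argsort(X.var(axis=0))[::-1][:k]

X = np.array([[1.0, 0.0, 3.0],
              [1.0, 0.1, -2.0],
              [1.0, 0.2, 5.0]])
print(variance_select(X, k=2))  # the constant first column is never chosen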

Most online recommendation systems choose linear algorithms such as logistic regression as the machine learning model, but logistic regression requires complex feature engineering to reach high accuracy, which makes automatic feature synthesis essential. Existing automatic high-order feature synthesis methods, however, suffer from the curse of dimensionality. To solve this problem, Angel implements a method that generates high-order synthetic features iteratively. Each iteration consists of two phases:

  • Amplification phase: Cartesian product of arbitrary features
  • Reduction phase: feature selection and feature reindexing

Here are the iteration steps:

  • First, composite features are generated by Cartesian products between arbitrary input features. After this step, the number of features grows quadratically.
  • Next, select the most important subset of features from the composite features (using, for example, VarianceSelector and RandomForestSelector)
  • Then, re-index the selected features to reduce feature space

Finally, the composite features are stitched together with the original features.
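
Putting the three steps together, here is a toy numpy sketch of one amplify-and-reduce iteration; it assumes dense in-memory features, whereas Angel operates on distributed sparse vectors:

import itertools
import numpy as np

def synthesize(X, k):
    # Amplification: pairwise (Cartesian) products; feature count grows quadratically.
    pairs = itertools.combinations(range(X.shape[1]), 2)
    crossed = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    # Reduction: select the k highest-variance synthetic features,
    # re-indexing them into a compact feature space.
    keep = np.argsort(crossed.var(axis=0))[::-1][:k]
    # Finally, stitch the synthetic features onto the originals.
    return np.hstack([X, crossed[:, keep]])

X = np.random.rand(100, 5)
print(synthesize(X, k=3).shape)  # (100, 8): 5 original + 3 synthetic features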


Automatic feature engineering process

As shown in the figure above, this feature synthesis method increases the number of features only linearly, avoiding the curse of dimensionality. Experiments on the Higgs dataset show that the synthesized features effectively improve model accuracy (see Table 1).


Table 1 Feature synthesis effect

Spark On Angel (SONA)

In Angel 3.0, we made major optimizations to Spark On Angel, adding these new features:

  • Feature engineering is integrated into Spark On Angel. The integration does not simply reuse Spark's feature engineering: all operations support long-indexed vectors, enabling the training of high-dimensional sparse models.
  • Seamless connection with automatic tuning
  • Spark users can switch to Angel effortlessly via a Spark-style API
  • Two new data formats are supported: LibFFM and Dummy


Spark On Angel architecture

Beyond these major features, we continue to improve Spark On Angel's algorithm library: new algorithms such as Deep & Cross Network (DCN) and Attentional Factorization Machines (AFM) have been added, and existing algorithms have been heavily optimized; for example, the LINE and K-Core algorithms were refactored, greatly improving their performance and stability.

As the figure below shows, the algorithms in Spark On Angel differ noticeably from those in Spark: Spark On Angel's algorithms mainly target the recommendation and graph domains, while Spark's algorithms are more general-purpose.


Comparison between Spark and Spark On Angel algorithms


Spark On Angel algorithm example

The above figure provides an example of a distributed algorithm based on Spark On Angel, which mainly includes the following steps:

  • Start the parameter server at the beginning of the program and close the parameter server at the end of the program
  • Load training and test sets as Spark DataFrame
  • Define an Angel model and set its parameters in Spark's parameter-setting style. In this example, the algorithm is a computational graph defined in JSON (see the sketch after this list)
  • Use the “fit” method to train the model
  • Use the “evaluate” method to evaluate trained models
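
SONA's API deliberately mirrors Spark ML conventions. As a rough analogy only (the sketch below uses a stock PySpark estimator and evaluator, not SONA's actual classes), the same load/fit/evaluate flow looks like this; in SONA, the estimator would be an Angel model defined by a JSON computational graph, with the parameter server started before training and closed afterwards:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("sona-style-flow").getOrCreate()

# Load training and test sets as Spark DataFrames.
train = spark.read.format("libsvm").load("train.libsvm")
test = spark.read.format("libsvm").load("test.libsvm")

# Define a model, set its parameters Spark-style, and train with "fit".
model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)

# "evaluate": score the held-out set, here with AUC.
auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
print(f"AUC = {auc:.4f}")

spark.stop()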

After training completes, Spark On Angel reports a variety of model metrics, such as accuracy, ROC curve, and AUC. The user can save the trained model for later use.


Spark On Angel and TensorFlow performance comparison

We compared the performance of Spark On Angel and TensorFlow on two popular recommendation algorithms, Wide & Deep and DeepFM, using the same resources and datasets. As shown in the figure above, Spark On Angel is 3 times faster than TensorFlow on Wide & Deep, while TensorFlow runs slightly faster on DeepFM.

PyTorch On Angel (PyTONA)

PyTorch On Angel is a new feature in Angel 3.0 that is designed to solve large-scale graph representation learning and deep learning model training.

In the past few years, graph neural networks (GNNs) have developed rapidly, and a series of research papers and algorithms have emerged, such as GCN, GraphSAGE, and GAT; research and benchmark results show that they extract graph features better than traditional graph-representation methods. Tencent operates large social networks (QQ and WeChat) and has a large demand for graph data analysis, and graph representation learning is the foundation of such analysis. Tencent therefore has a strong need for GNNs, which is one of the main reasons we developed PyTorch On Angel.

Large-scale graph representation learning faces two major challenges. The first comes from storing and accessing a hyperscale graph structure: the system must not only hold the graph but also provide efficient access interfaces, for example an interface to fetch the two-hop neighborhood of any node. The second comes from the GNN computation process, which requires an efficient automatic differentiation module.
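
The toy sketch below shows what the two-hop access amounts to on a single machine, with a plain Python dict standing in for the adjacency data that Angel partitions across PS nodes:

from collections import defaultdict

adj = defaultdict(set)
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]:
    adj[src].add(dst)
    adj[dst].add(src)  # undirected edges

def two_hop(node):
    # Everything reachable within two hops, excluding the node itself.
    one_hop = adj[node]
    reach = one_hop.union(*(adj[n] for n in one_hop))
    return reach - {node}

print(two_hop(1))  # {2, 3, 4}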

By analyzing Angel's own situation and the existing systems in the industry, we have the following conclusions:

  • TensorFlow and PyTorch have efficient automatic differentiation modules, but they are not good at handling high-dimensional models and sparse data
  • Angel is good at handling high-dimensional models and sparse data. Although Angel's self-developed computational graph framework (MLcore) also supports automatic differentiation, it is not as efficient or feature-complete as TensorFlow and PyTorch and cannot meet GNN requirements.

To combine the advantages of both, we developed the PyTorch On Angel platform on top of Angel PS. The basic idea is to use Angel PS to store large models and to use Spark as PyTorch's distributed scheduling platform, calling PyTorch inside Spark executors to perform the computation.

The architecture of PyTorch On Angel is shown below:

PyTorch On Angel System Architecture

PyTorch On Angel has 3 main components:

  • Angel PS: stores model parameters, graph structure information, node features, and so on, and provides access interfaces for model parameters and graph-related data structures, such as a two-hop adjacency access interface.
  • Spark Driver: the central control node, responsible for scheduling computing tasks and for global control functions such as creating matrices, initializing the model, saving the model, writing checkpoints, and restoring the model.
  • Spark Worker: reads the training data, pulls model parameters and graph structure from the PS, and passes them to PyTorch; PyTorch performs the actual computation and returns the gradients; finally, the Spark worker pushes the gradients to the PS to update the model.

Of course, these details are encapsulated, so algorithm developers and users do not need to deal with them. Developing a new algorithm on the PyTorch On Angel platform requires focusing only on the algorithm logic, which is not much different from developing a standalone PyTorch algorithm. An example implementation of a 2-layer GCN is given below:


Example of implementing GCN on PyTorch On Angel

After the algorithm is developed, save the code as a .pt file and submit it to the PyTorch On Angel platform for distributed training.
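
For readers without access to the figure, here is a minimal two-layer GCN in plain PyTorch, scripted and saved as a .pt file. It follows the standard GCN propagation rule (multiply node features by a normalized adjacency matrix between linear layers); the exact module interface expected by PyTorch On Angel is not shown here:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # a_hat: normalized adjacency matrix with self-loops (dense here for brevity)
        h = F.relu(a_hat @ self.w1(x))  # first graph convolution
        return a_hat @ self.w2(h)       # second graph convolution (class logits)

model = TwoLayerGCN(in_dim=16, hidden_dim=8, num_classes=2)
torch.jit.script(model).save("gcn.pt")  # the .pt file submitted for training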

We have implemented many algorithms on PyTorch On Angel, including algorithms commonly used in the recommendation field (FM, DeepFM, Wide & Deep, xDeepFM, AttentionFM, DCN, PNN, etc.) and GNN algorithms (GCN and GraphSAGE). We will further enrich PyTorch On Angel's algorithm library in subsequent releases.

Combining the advantages of PyTorch and Angel, PyTorch On Angel performs strongly: on deep learning algorithms common in recommendation, it can reach more than 4 times the performance of TensorFlow, and for GNN algorithms its performance is far better than comparable open-source platforms in the industry (detailed performance data will be published in the open-source community). The figure below shows a comparison on the public dataset Criteo Kaggle 2014 (45 million training samples, 1 million features):


PyTorch On Angel and TensorFlow performance comparison test

Besides performance, another big advantage of PyTorch On Angel is ease of use. As shown in the figure "PyTorch On Angel System Architecture", PyTorch runs inside Spark executors, enabling seamless integration of Spark graph-data preprocessing and PyTorch model training, with the entire computation completed in a single program.

Automatic hyperparameter tuning

There are two traditional ways to tune hyperparameters (as shown below):

  • Grid search: divides the whole search space into grids and evaluates each grid point, implicitly assuming all hyperparameters are equally important. Although intuitive, it has two obvious disadvantages: 1) the computational cost grows exponentially with the number of parameters; 2) hyperparameters usually differ in importance, so grid search may waste effort optimizing the less important ones
  • Random search: randomly samples hyperparameter combinations and evaluates them. Although this approach can devote more attention to the important hyperparameters, there is no guarantee that the best combination will be found (both strategies are sketched after the figure below)


Grid search and random search
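
To make the contrast concrete, here is a compact sketch of both strategies on a two-hyperparameter space, where `objective` is a stand-in for a full train-and-validate run:

import itertools
import random

def objective(lr, decay):  # placeholder for "train a model, return validation loss"
    return (lr - 0.1) ** 2 + (decay - 0.5) ** 2

# Grid search: 5 x 5 = 25 evaluations; cost grows exponentially with more parameters.
lrs = [0.001, 0.01, 0.1, 0.5, 1.0]
decays = [0.1, 0.3, 0.5, 0.7, 0.9]
best_grid = min(itertools.product(lrs, decays), key=lambda p: objective(*p))

# Random search: any evaluation budget, but no guarantee of finding the best point.
random.seed(0)
samples = [(random.uniform(0.001, 1.0), random.uniform(0.1, 0.9)) for _ in range(25)]
best_rand = min(samples, key=lambda p: objective(*p))

print(best_grid, best_rand)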

Unlike these traditional model-free methods, Bayesian optimization uses a computationally cheaper surrogate function to approximate the original objective function. The surrogate function produces a posterior mean and variance for each hyperparameter combination; an acquisition function then evaluates the expected loss or improvement of that combination. This probabilistic approach lets Bayesian optimization find good solutions of the objective function with far fewer evaluations.

Angel 3.0 includes the two traditional methods as well as Bayesian optimization. For Bayesian optimization, Angel implements the following:

  • Surrogate functions: besides the two commonly used models (Gaussian process and random forest), EM + L-BFGS is used to optimize the hyperparameters in the Gaussian-process kernel function.
  • Acquisition functions: PI (Probability of Improvement), EI (Expected Improvement), and UCB (Upper Confidence Bound) are implemented (see the sketch below)

Since each evaluation of the objective function can be expensive, candidate hyperparameter combinations that perform poorly in the first few iterations can be stopped early; this early-stopping strategy is implemented in the Angel 3.0 release.
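
The sketch below shows the shape of such a Bayesian-optimization loop, with a Gaussian-process surrogate from scikit-learn and the EI acquisition function named above. It mirrors the method, not Angel's implementation, and the one-dimensional `objective` stands in for an expensive training run:

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):  # stand-in for an expensive train-and-validate run
    return (x - 0.6) ** 2

def expected_improvement(mu, sigma, best):
    # EI for minimization: expected amount by which a point beats the best so far.
    z = (best - mu) / np.maximum(sigma, 1e-9)
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
X = np.array([[0.0], [1.0]])  # two initial evaluations
y = np.array([objective(x[0]) for x in X])

for _ in range(8):
    gp = GaussianProcessRegressor().fit(X, y)          # fit the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y.min())
    x_next = candidates[np.argmax(ei)]                  # most promising candidate
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

print("best hyperparameter:", X[np.argmin(y)][0])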

Table 2 shows an experiment with the logistic regression algorithm, tuning the learning rate and the learning-rate decay. The results show that Bayesian optimization outperforms both random search and grid search, while random search is slightly better than grid search.


Table 2 Comparison of different automatic hyperparameter tuning methods

Angel Serving

To meet the need for efficient model serving in production environments, we implemented the Angel Serving subsystem in Angel 3.0: a scalable, high-performance machine learning model serving system. As the serving entry point of the full-stack machine learning platform, it lets the Angel ecosystem form a closed loop. The figure below shows the architecture of Angel Serving.

data-ratio=0.7382769901853872

Angel Serving Architecture

The main features of Angel Serving include:

  1. Support for multiple API access methods, including gRPC and RESTful interfaces (an illustration of the REST access pattern follows this list)
  2. Angel Serving is a general-purpose machine learning serving framework; its pluggable design makes it easy to serve models from other third-party machine learning platforms. It currently supports three model formats: Angel, PyTorch, and PMML (the latter covering platforms such as Spark and XGBoost)
  3. Inspired by TensorFlow Serving, Angel Serving also offers fine-grained version-control strategies, including serving the earliest, latest, or a specified version of a model
  4. Angel Serving also provides a rich set of service monitoring metrics, including QPS (queries per second), total and successful request counts, the response-time distribution, and the average response time
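
As an illustration of the RESTful access pattern in point 1, the call below is modeled on TensorFlow Serving's request style, which Angel Serving cites as its inspiration; the host, port, endpoint path, and payload schema are all hypothetical, not Angel Serving's documented API:

import requests

resp = requests.post(
    "http://serving-host:8501/v1/models/deepfm:predict",  # hypothetical endpoint
    json={"instances": [{"features": [0.0, 1.0, 0.5]}]},  # hypothetical schema
    timeout=2.0,
)
print(resp.json())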


Table 3 Comparison of Angel Serving and TensorFlow Serving Performance

Table 3 compares the performance of Angel Serving and TensorFlow Serving. We used a DeepFM model with 1 million features and sent 100,000 prediction requests to each service. Angel Serving and TensorFlow Serving took 56 seconds and 59 seconds in total, respectively, and the average response time of both systems was 2 milliseconds. Angel Serving reached a QPS of 1,900 versus TensorFlow Serving's 1,800. These results show that Angel Serving's performance is comparable to, or even better than, TensorFlow Serving's.

Support Kubernetes

Angel 3.0 supports Kubernetes, so it can run in the cloud.

Angel usage

As shown in the chart below, Angel's task count inside Tencent has grown significantly over the past 12 months, up 150%. Notably, the number of Spark On Angel tasks has grown 10-fold; to make Spark On Angel easier to use, version 3.0 upgrades it substantially. Within Tencent, Angel's users include Tencent Video, Tencent News, and WeChat.


Tencent internal Angel tasks

Angel officially maintains a QQ group to communicate with external developers. Statistics on the group's members show:

  • The vast majority of Angel's users are from China, mainly in cities with relatively developed Internet industries such as Beijing, Shanghai, Hangzhou, Chengdu and Shenzhen.
  • More than 100 companies and research institutes use or test Angel, including China's top IT companies: Weibo, Huawei and Baidu.


Angel open-source users

Angel open source


Angel's GitHub statistics and published papers

Since it was open-sourced in June 2017, Angel has received growing attention. As of now, Angel has more than 4,200 Stars and more than 1,000 Forks on GitHub. The project has 38 code contributors in total, including 8 committers, who have together submitted more than 2,000 commits. Tencent now has more than 80 projects on GitHub overall, covering AI, cloud computing, security, and other fields, with more than 230,000 Stars in total.

From 1.0 to 3.0, Angel has changed enormously, evolving from a single model-training platform into a general-purpose computing platform covering the whole machine learning pipeline, with more than 500,000 lines of code. For ease of maintenance and use, Angel has been split into 8 sub-projects hosted under the Angel-ML organization (https://github.com/Angel-ML): angel, PyTorch On Angel, sona (Spark On Angel), serving, automl, mlcore, math2, and format; these sub-projects are described in detail above.

Applications

Tencent short video recommendation


Short video recommendation data processing flow

The figure above shows a use case from Tencent's short-video team. Users' video-playback logs and context information are forwarded to Kafka in real time, and the streaming engine Storm subscribes to the Kafka data. Storm acts as a real-time feature generator: it fetches user profiles and video information from an offline key-value store and joins them with the logs to generate features. The generated features are sent to the online training system to update the online model; at the same time, they are dumped to HDFS as input for offline training. The offline model is usually used to initialize the online training system, and when an anomaly occurs, the offline model can also be used to reset the online system.

The recommendation algorithm used in this case is FM, with 2.4 billion training samples and a feature dimension of 63,611. Training takes more than 10 hours on Spark, versus 1 hour with Angel.
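
For reference, FM (Factorization Machines) factorizes pairwise feature interactions into low-rank latent vectors, which is what keeps training tractable at this feature dimension. The standard second-order FM prediction is

    \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j

where v_i is the k-dimensional latent vector of feature i; with the well-known reformulation, the interaction term can be evaluated in O(kn) rather than O(n^2).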

Financial fraud


Financial anti-fraud data processing

Financial fraud detection is a typical large-scale graph learning case. Its network data is heterogeneous, containing several different types of edges:

  • Trading relationship: an edge between user A and user B indicates that there has been trading behavior between them.
  • Device relationship: an edge between user A and user B indicates that they have shared the same device.
  • Wi-Fi relationship: an edge between user A and user B indicates that they have connected to the Internet through the same Wi-Fi.

Financial scammers often share devices and Wi-Fi; expanding along these edge relationships reveals communities. The Fast Unfolding (Louvain) community detection algorithm on Angel can discover these communities effectively. Downstream fraud-risk models then take the user profiles and network features of these communities as input for learning, and feed the results into anti-fraud strategies. The graph contains 1.5 billion nodes and 20 billion edges; the Spark GraphX-based implementation takes 20 hours, while Angel takes only 5 hours.
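
Fast Unfolding is the Louvain community-detection method. As a single-machine illustration of what it computes (not Angel's distributed implementation), networkx 3.x exposes it directly; the small graph and its edges here are made up:

import networkx as nx

G = nx.Graph()
# Edge types from the case above, flattened into one graph.
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # shared device / Wi-Fi
                  ("D", "E"), ("E", "F"), ("D", "F"),
                  ("C", "D")])                           # weak link between groups

communities = nx.community.louvain_communities(G, seed=42)
print(communities)  # e.g. [{'A', 'B', 'C'}, {'D', 'E', 'F'}]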
