
Qihoo 360 open-sources XLearning, a deep learning scheduling platform that supports TensorFlow, MXNet and other frameworks

via: 博客园     time: 2017/12/7 14:15:25

Qihoo 360 today announced that it has open-sourced XLearning, its deep learning scheduling platform. The project is available at https://github.com/Qihoo360/XLearning.

Developed by the 360 Systems Big Data team in collaboration with the Artificial Intelligence Institute, XLearning integrates Hadoop Yarn with commonly used deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, and XGBoost. The platform has been online for nearly a year and has gone through multiple iterations. It gives users of all kinds of deep learning frameworks a unified and stable job submission platform, enables resource sharing, and greatly improves resource utilization. It also has good scalability and compatibility, and it is widely used in business search, the Artificial Intelligence Institute, commercialization, the data center and other business units.

We spoke with XLearning project lead Li Yuance to understand the background of the platform and the ideas behind its design. Li Yuance joined Qihoo 360 in 2013 and has worked in turn on the company's Hadoop, Spark and deep learning platforms. He experienced the rapid growth of the company's Hadoop platform and saw the Spark platform and the deep learning platform go from nothing to large-scale use. He has led the data warehouse indexing, MPI on Yarn, XLearning and other projects. His work focuses on fixing bugs in the platform and solving the problems users run into. He likes open source and is keen to learn and share. At present he focuses mainly on big data indexing, big data combined with deep learning, and related fields.

AI Frontline: What is the history of XLearning's development at 360? What problems was it initially meant to solve? What are its current application scenarios within 360? Why open-source it now?

Li Yuance: Artificial intelligence technology has developed rapidly over the past two years, and deep learning frameworks, such as Google's open source TensorFlow, keep emerging. To help artificial intelligence take hold within the company, our big data infrastructure team developed the XLearning platform together with the company's Artificial Intelligence Institute. XLearning was officially launched in April of this year (2017). After three iterations it is widely used in business search, the Artificial Intelligence Institute, commercialization, the big data center and other business lines. Running deep learning on a shared platform effectively improves the utilization of hardware resources such as GPUs and saves hardware investment. In addition, algorithm engineers can use the various deep learning technologies more conveniently, freeing themselves from tedious tasks such as operating and maintaining the runtime environment.

XLearning's design philosophy is to use Hadoop Yarn to schedule deep learning frameworks, a typical realization of "AI on Hadoop". Other companies in the industry will have similar needs, so we chose to open-source it, hoping it can serve as a reference for anyone building a big data plus artificial intelligence platform.

AI Frontline: Why integrate a variety of deep learning frameworks on top of a big data platform? What prerequisites does adopting XLearning involve, and how much work is it?

Li Yuance: When designing the architecture of the company's artificial intelligence platform, we did have several options to choose from. We weighed them mainly on the following points:

(1) Integration with existing platforms: most of the company's existing machine learning jobs use Spark MLlib and MPI frameworks, which are all scheduled uniformly by Hadoop Yarn. If the deep learning frameworks are also integrated into Yarn, with data stored in HDFS, the platform becomes unified.

(2) Operation and maintenance complexity: building a separate new platform would introduce new operation and maintenance work.

(3) The habits of the company's developers: many developers in the company are already familiar with the Hadoop ecosystem, so it is easier for them to submit deep learning jobs directly on Hadoop.

(4) Development workload: our team is familiar with the components of the Hadoop ecosystem and has previously implemented the "MPI on Yarn" system, so we already had the technical groundwork for "AI on Hadoop".

The open source version of XLearning is compatible with community Hadoop. A company that already has a Hadoop platform can use XLearning directly to schedule deep learning jobs. A company without a Hadoop platform needs to deploy one first. Deep learning training often relies on massive amounts of sample data, so a reliable big data storage system is a prerequisite for a training platform. Hadoop is easy to deploy, stable and reliable. It is the industry standard for big data platforms, and we recommend it.

AI Frontline: What are the key points of XLearning's functional design and architecture design?

Li Yuance: The XLearning system architecture consists of the following components:


  • Client: XLearning client, responsible for starting the job and obtaining job execution status;
  • ApplicationMaster (AM): Responsible for splitting the input data, starting and managing Containers, saving execution logs, and so on;
  • Container: The actual executor of the job, responsible for starting the Worker or PS (Parameter Server) process, monitoring and reporting process status to the AM, and uploading job output. For TensorFlow type jobs, it is also responsible for starting the TensorBoard service.
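
The input-splitting role of the ApplicationMaster can be illustrated with a minimal sketch. It is not XLearning's actual implementation: the function name `assign_input_splits`, the round-robin policy and the example paths are all assumptions made for illustration.

```python
from typing import Dict, List


def assign_input_splits(files: List[str], num_workers: int) -> Dict[int, List[str]]:
    """Distribute input files across workers round-robin.

    Hypothetical helper: the real ApplicationMaster may use a different
    splitting policy (for example, by block size or data locality).
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    assignment: Dict[int, List[str]] = {i: [] for i in range(num_workers)}
    # Sort first so the assignment is deterministic across runs.
    for index, path in enumerate(sorted(files)):
        assignment[index % num_workers].append(path)
    return assignment


if __name__ == "__main__":
    # Hypothetical HDFS paths, used only for illustration.
    files = [f"/data/train/part-{i:05d}" for i in range(5)]
    for worker, paths in assign_input_splits(files, 2).items():
        print(worker, paths)
```

Each Container would then read only the files assigned to its worker index, which keeps workers from processing the same samples twice.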

Although its structure is simple, XLearning offers a rich set of features to help users train their models, and it relies on Yarn for unified management of job resources.

(1) Support for multiple deep learning frameworks

XLearning supports the distributed and stand-alone modes of TensorFlow and MXNet, as well as stand-alone deep learning frameworks such as Caffe, Theano and PyTorch. It supports multiple versions and custom builds of the same framework, so individual users are not limited to the framework versions installed on the cluster machines.

(2) HDFS-based unified data management

XLearning provides several modes for data input and output, including streaming reads and writes and direct HDFS reads and writes. Which mode to use depends on the amount of data the job processes and the disk capacity of the cluster machines.

(3) Visual interface

To make job information easy to view, XLearning provides a visual interface that displays job execution progress and output logs. After the job is completed, you can still view the logs to analyze the training process. For TensorFlow jobs, the TensorBoard service is supported. The job interface is roughly divided into three parts:

  • All Containers: Displays the list of Containers in the current job and the information for each Container, such as Container ID, Container Host, Container Role, Container Status, Start Time, Finish Time, and Reporter Progress;
  • View TensorBoard: When the job type is TensorFlow, you can click the link to go directly to the TensorBoard page;
  • Save Model: Users can upload the current training model output to HDFS while the job is running, and the list of models uploaded so far is displayed.


(4) Compatibility with native framework code

XLearning automatically configures the ClusterSpec for TensorFlow's distributed mode. Code written for stand-alone mode, or for other deep learning frameworks, can be migrated to XLearning without modification, so users can get started quickly.
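
A minimal sketch of what automatic ClusterSpec construction could look like follows. The helper `build_cluster_spec`, the host names and the ports are assumptions for illustration and do not reflect XLearning's internal code. The returned dictionary maps each job name to a list of `host:port` addresses, which is the shape that TensorFlow's `tf.train.ClusterSpec` accepts.

```python
from typing import Dict, Iterable, List, Tuple

# (role, host, port) as a scheduler might record it for each started container.
ContainerInfo = Tuple[str, str, int]


def build_cluster_spec(containers: Iterable[ContainerInfo]) -> Dict[str, List[str]]:
    """Group container addresses by role ("ps" or "worker").

    Hypothetical helper: it only shows how a scheduler that knows where
    each container runs can assemble the cluster description, so that
    engineers do not have to list hosts by hand.
    """
    spec: Dict[str, List[str]] = {}
    for role, host, port in containers:
        spec.setdefault(role, []).append(f"{host}:{port}")
    return spec


if __name__ == "__main__":
    # Hypothetical hosts and ports, used only for illustration.
    containers = [
        ("ps", "node01", 2222),
        ("worker", "node02", 2223),
        ("worker", "node03", 2224),
    ]
    print(build_cluster_spec(containers))
```

In a TensorFlow job, a dictionary of this shape could be passed to `tf.train.ClusterSpec`, with the task index of each process given by its position in the list for its role.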

(5) Checkpoint function

By combining the deep learning frameworks' checkpoint mechanisms with the ability to read and write HDFS data directly, XLearning makes it easy for users to resume training.
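
The resume-from-checkpoint idea can be sketched as follows. This is an illustration under stated assumptions, not XLearning or framework code: a local temporary directory stands in for an HDFS path, the `ckpt-<step>` file naming is invented, and the "training step" is a placeholder.

```python
import os
import re
import tempfile
from typing import Optional

# Hypothetical naming scheme: one empty marker file per saved step.
_CKPT_PATTERN = re.compile(r"^ckpt-(\d+)$")


def latest_checkpoint_step(checkpoint_dir: str) -> Optional[int]:
    """Return the highest saved step in checkpoint_dir, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = _CKPT_PATTERN.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def train(checkpoint_dir: str, total_steps: int, save_every: int) -> int:
    """Run up to total_steps, resuming after the latest checkpoint.

    Returns the step that training resumed from (0 for a fresh start).
    """
    last = latest_checkpoint_step(checkpoint_dir)
    start = 0 if last is None else last
    os.makedirs(checkpoint_dir, exist_ok=True)
    for step in range(start + 1, total_steps + 1):
        # A real job would run one training step here.
        if step % save_every == 0:
            open(os.path.join(checkpoint_dir, f"ckpt-{step}"), "w").close()
    return start


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()  # stands in for a shared HDFS directory
    print("first run resumed from step", train(workdir, 5, 2))
    print("second run resumed from step", train(workdir, 10, 2))
```

Because the checkpoint directory lives on shared storage rather than on the machine that ran the job, a restarted job can pick up where it left off even if it is scheduled onto a different node.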

AI Frontline: What are XLearning's advantages in performance? Which usability needs have to be taken into account when designing a platform?

Li Yuance: XLearning is mainly responsible for scheduling and monitoring, so in principle training performance is the same as with native TensorFlow, Caffe and other frameworks. Ease of use is key to promoting a platform, and XLearning takes the following into account:

(1) Compatibility with the native frameworks: in addition to the consistent performance just mentioned, the code is also compatible, which effectively reduces the cost of migrating existing jobs.

(2) Web display: the XLearning scheduling home page shows the necessary scheduling information, job progress, logs and so on. It also provides the extra ability to save intermediate results at any time, so engineers can terminate their jobs early when the actual situation calls for it.

(3) Integrated TensorBoard: for TensorFlow jobs, XLearning automatically starts the TensorBoard service, which is easier than starting it manually.

(4) Automatic construction of the TensorFlow ClusterSpec: for TensorFlow jobs in distributed mode, engineers no longer need to specify worker and ps host information by hand. They only need to tell XLearning how many worker and ps nodes to use.

These points correspond to the design elements of the framework mentioned earlier.

AI Frontline: What lessons from developing XLearning do you think are worth sharing?

Li Yuance: Beyond the architecture just described, the experience I most want to share is this: every feature design must be based on actual needs, because imagined features tend to be flashy but impractical. At the start of XLearning's design and when planning each new version, we discuss functional requirements thoroughly with the actual users in the company, identify the pain points in their work, set priorities and deadlines, and then do the architecture design and development.

AI Frontline: Is there a functional difference between the open source version of XLearning and the version the company uses internally?

Li Yuance: Frankly, the open source XLearning is a simplified version, mainly because it is limited by what the community Yarn provides. The company's internal Yarn has many enhancements over the community version. For example, it supports GPU resource scheduling, GPU communication affinity awareness, Docker Containers and so on. Building on these capabilities, the version used inside the company has many additional features, such as GPU resource scheduling, running jobs in Docker, temporary GPU virtual machines, and visual charts of Container metrics. We will share these features later, either as Yarn patches or by open-sourcing our in-house Yarn version, and everyone is welcome to get in touch with us at any time.

AI Frontline: We know that you are not only responsible for the XLearning deep learning platform. You were also an early Spark researcher and evangelist, and you experienced the rapid growth of 360's Hadoop platform and saw the Spark platform go from nothing to large-scale use. Drawing on that experience, could you describe the evolution from a general big data platform to a deep learning platform?

Li Yuance: Computing frameworks such as MapReduce and Spark are widely used at most Internet companies and meet most data processing needs. Because Spark MLlib is limited in performance and scalability, the company also ran many MPI jobs on a dedicated scheduling system named Euler. To unify scheduling and share server resources, our team developed the Euclid (MPI on Yarn) system, a first step toward scheduling machine learning jobs and big data jobs together. When we reached the deep learning stage, we took some detours at first. For example, in the second half of 2016 we developed a system called SparkFlow (TensorFlow on Spark), which integrated TensorFlow into Spark and exchanged data through RDDs. Then the Yahoo Institute is an open source
