LinkedIn's open source TonY project lets users run YARN-based TensorFlow applications on a single node or on a large Hadoop cluster. Much as MapReduce provides the engine for running Pig and Hive scripts on Hadoop, TonY provides first-class support for running TensorFlow jobs. TonY consists of three main components: the Client, the ApplicationMaster, and the TaskExecutor. It offers four main features: GPU scheduling, fine-grained resource requests, TensorBoard support, and fault tolerance.
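To make the division of labor concrete: in distributed TensorFlow, each task learns its role from a cluster spec, conventionally passed through the `TF_CONFIG` environment variable, which is the kind of per-task configuration a coordinator such as TonY's ApplicationMaster hands to each TaskExecutor. The sketch below is illustrative only (the host names are hypothetical and the helper functions are not TonY APIs):

```python
import json
import os

def build_cluster_spec(ps_hosts, worker_hosts):
    """Map each job type ("ps", "worker") to its list of host:port addresses."""
    return {"ps": list(ps_hosts), "worker": list(worker_hosts)}

def tf_config_for(cluster_spec, job_name, task_index):
    """Serialize the TF_CONFIG value one task would receive: the shared
    cluster spec plus that task's own type and index."""
    return json.dumps({
        "cluster": cluster_spec,
        "task": {"type": job_name, "index": task_index},
    })

# Hypothetical hosts for illustration.
cluster = build_cluster_spec(["ps0:2222"], ["worker0:2222", "worker1:2222"])

# What the coordinator would set in worker 1's environment before launch.
os.environ["TF_CONFIG"] = tf_config_for(cluster, "worker", 1)

# The TensorFlow task reads TF_CONFIG back to learn its role in the cluster.
config = json.loads(os.environ["TF_CONFIG"])
print(config["task"]["type"], config["task"]["index"])  # worker 1
```

Each task gets the same cluster map but a distinct task entry, which is how every process can locate its peers while knowing which shard of the work is its own.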
With nearly 600 million members on the LinkedIn platform and the rise of deep learning, LinkedIn's AI engineers have been applying AI to many features, such as summaries and replies, many of which are built with Google's deep learning framework TensorFlow. Initially, TensorFlow use inside LinkedIn was limited to small applications running on unmanaged bare-metal machines. Over time, however, the teams increasingly needed TensorFlow to tap the compute and storage resources of the Hadoop big data platform. LinkedIn's Hadoop clusters, holding hundreds of petabytes of data, are an ideal foundation for developing deep learning applications.
Beyond running basic distributed TensorFlow on Hadoop, TonY adds capabilities for large-scale training. TonY supports GPU scheduling, using Hadoop's API to request GPU resources from the cluster. It also supports fine-grained resource requests: because TonY requests different entities (such as parameter servers and workers) as separate components, users can specify different resources for each entity type. This gives users control over the resources their applications consume, and it helps cluster administrators avoid wasting hardware.
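As a sketch of what such fine-grained requests look like, a TonY job configuration can assign different resources to parameter servers and workers. The property names below follow the `tony.<jobtype>.<resource>` pattern from TonY's documentation; the counts and sizes are illustrative values, not recommendations:

```xml
<configuration>
  <!-- Two parameter servers, CPU-only, with modest memory. -->
  <property>
    <name>tony.ps.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>2g</value>
  </property>
  <!-- Four workers, each with more memory and one GPU scheduled via YARN. -->
  <property>
    <name>tony.worker.instances</name>
    <value>4</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>8g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
</configuration>
```

Because GPUs are requested only for the worker job type, the CPU-only parameter servers never hold GPU containers, which is exactly the kind of waste the per-entity request model is meant to avoid.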