Dike: Revisiting Resource Management for Distributed Deep Learning

The recent adoption of deep learning for diverse applications has required scaling infrastructures both horizontally and vertically. As a result, efficient resource management for distributed deep learning (DDL) frameworks is becoming increasingly important. However, existing techniques for scaling DDL applications rely on general-purpose resource managers originally designed for data-intensive applications. DDL applications, by contrast, present unique resource-management challenges compared to traditional big data frameworks: a different master-slave communication paradigm, deeper ML models that are more compute- and network-bound than I/O-bound, and the use of heterogeneous resources (GPUs, TPUs, and variable memory). In addition, most DDL frameworks require data scientists to manually configure task placement and resource assignment to execute DDL models. In this paper, we present Dike, an application scheduler framework that transparently makes scheduling decisio...