Many-task computing: Bridging the gap between high-throughput computing and high-performance computing

Many-task computing aims to bridge the gap between two computing paradigms, high-throughput computing and high-performance computing. Many-task computing is reminiscent of high-throughput computing, but it differs in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are measured per second (e.g. tasks per second, I/O per second) rather than per month (e.g. jobs per month). Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Many-task computing includes loosely coupled applications that are generally communication-intensive but not naturally expressed using the message passing interface commonly found in high-performance computing, drawing attention to the many computations that are heterogeneous but not "happily" parallel. This dissertation explores fundamental issues in defining the many-task computing paradigm, as well as theoretical and practical issues in supporting both compute-intensive and data-intensive many-task computing on large-scale systems. We have defined an abstract model for data diffusion (an approach to supporting data-intensive many-task computing), defined data-aware scheduling policies with heuristics to optimize real-world performance, and developed a competitive online caching eviction policy. We also designed and implemented the necessary middleware, Falkon, to enable the support of many-task computing on clusters, grids, and supercomputers.
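As a rough illustration of the data-aware scheduling and caching ideas named above, the following minimal Python sketch dispatches each task to a worker that already caches the task's input file, and otherwise falls back to the least-loaded worker. It is not Falkon's implementation: the `Worker` class, `pick_worker`, `dispatch`, and the `cache_slots` parameter are hypothetical names introduced for this example, and plain LRU eviction stands in for the dissertation's own caching eviction policy, which is not described here.

```python
from collections import OrderedDict


class Worker:
    """Hypothetical executor node with a small LRU file cache."""

    def __init__(self, name, cache_slots):
        self.name = name
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # file name -> True, oldest entry first
        self.queued = 0             # number of tasks assigned so far

    def run(self, file_name):
        """Record a task on this worker; return True on a cache hit."""
        self.queued += 1
        hit = file_name in self.cache
        if hit:
            self.cache.move_to_end(file_name)   # mark as most recently used
        else:
            self.cache[file_name] = True        # stands in for a remote fetch
            if len(self.cache) > self.cache_slots:
                self.cache.popitem(last=False)  # evict least recently used
        return hit


def pick_worker(workers, file_name):
    """Prefer workers that already cache the file; break ties by load."""
    holders = [w for w in workers if file_name in w.cache]
    candidates = holders or workers
    return min(candidates, key=lambda w: w.queued)


def dispatch(workers, task_files):
    """Assign each task (identified by its input file); count cache hits."""
    hits = 0
    for file_name in task_files:
        if pick_worker(workers, file_name).run(file_name):
            hits += 1
    return hits


workers = [Worker("w0", cache_slots=2), Worker("w1", cache_slots=2)]
hits = dispatch(workers, ["a", "b", "a", "a", "c", "b"])
print(hits, [w.queued for w in workers])
```

In this toy workload, repeated requests for the same file are routed to the worker that already holds it, so later tasks avoid a second fetch. A real scheduler would also have to weigh load imbalance against locality, which this sketch ignores.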
Falkon, a Fast and Lightweight tasK executiON framework, addresses shortcomings in traditional resource management systems, which support high-throughput and high-performance computing but are not suitable or efficient for many-task computing applications. Falkon was designed to enable the rapid and efficient execution of many tasks on large-scale systems (i.e. through multi-level scheduling and streamlined distributed task dispatching), and to integrate novel data management capabilities (i.e. data diffusion, which uses data caching and data-aware scheduling to exploit data locality) that extend the scalability of data-intensive applications well beyond that of traditional shared or parallel file systems. As the size of scientific data sets and the resources required for their analysis increase, data locality becomes crucial to the efficient use of large-scale distributed systems for data-intensive many-task computing. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster response to subsequent requests that refer to the same data; as demand drops, resources are released. Depending on workload and resource characteristics, this approach provides the benefits of dedicated hardware without the associated high costs. Micro-benchmarks have shown Falkon to achieve throughputs of over 15K tasks/sec, scale to millions of queued tasks, execute billions of tasks per day, and achieve I/O rates of hundreds of Gb/s. Falkon has shown orders of magnitude improvements in performance and scalability across many diverse workloads (e.g. heterogeneous tasks from milliseconds to hours long, compute-intensive and data-intensive tasks, varying arrival rates) and applications (e.g. astronomy, medicine, chemistry, molecular dynamics, economic modeling, and data analytics) at scales of billions of tasks on hundreds of thousands of processors across Grids (e.g. TeraGrid