Task Scheduling in Cloud Using Deep Reinforcement Learning
Related papers
Cloud computing has become an attractive computing paradigm in both academia and industry. Through virtualization technology, Cloud Service Providers (CSPs) that own data centers can structure physical servers into Virtual Machines (VMs) to provide services, resources, and infrastructures to users. Profit-driven CSPs charge users for service access and VM rental, and reduce power consumption and electric bills so as to increase profit margin. The key challenge faced by CSPs is data center energy cost minimization. Prior works proposed various algorithms to reduce energy cost through Resource Provisioning (RP) and/or Task Scheduling (TS). However, they have scalability issues or do not consider TS with task dependencies, which is a crucial factor that ensures correct parallel execution of tasks. This paper presents DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day. A deep Q-learning-based two-stage RP-TS processor is designed to automatically generate the best long-term decisions by learning from the changing environment, such as user request patterns and realistic electric price. With training techniques such as target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves remarkably high energy cost efficiency, low reject rate, and low runtime with fast convergence. Compared with one of the state-of-the-art energy efficient algorithms, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining lower reject rate on average. For an example CSP setup with 5,000 servers and 200,000 tasks, compared to a fast round-robin baseline, the proposed DRL-Cloud achieves up to 144% runtime reduction.
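Two of the training techniques named in this abstract, experience replay and the exploration/exploitation trade-off, can be illustrated with a minimal Python sketch. It is not taken from DRL-Cloud: the class name `ReplayBuffer`, the function `epsilon_greedy`, and the transition layout are assumptions chosen for illustration, and the deep Q-network and target network are left out.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples.

    Hypothetical helper for illustration; not the DRL-Cloud implementation.
    """

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling weakens the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In a typical deep Q-learning loop, each scheduling decision would be pushed into the buffer and mini-batches sampled from it to update the network, while `epsilon` is decayed over time so that the agent explores early and exploits later.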
Deep and reinforcement learning for automated task scheduling in large‐scale cloud computing systems
Concurrency and Computation: Practice and Experience, 2020
Cloud computing is undeniably becoming the main computing and storage platform for today's major workloads. From Internet of things and Industry 4.0 workloads to big data analytics and decision-making jobs, cloud systems daily receive a massive number of tasks that need to be simultaneously and efficiently mapped onto the cloud resources. Therefore, deriving an appropriate task scheduling mechanism that can both minimize tasks' execution delay and cloud resources utilization is of prime importance. Recently, the concept of cloud automation has emerged to reduce the manual intervention and improve the resource management in large-scale cloud computing workloads. In this article, we capitalize on this concept and propose four deep and reinforcement learning-based scheduling approaches to automate the process of scheduling large-scale workloads onto cloud computing resources, while reducing both the resource consumption and task waiting time. These approaches are: reinforcement learning (RL), deep Q networks, recurrent neural network long short-term memory (RNN-LSTM), and deep reinforcement learning combined with LSTM (DRL-LSTM). Experiments conducted using real-world datasets from Google Cloud Platform revealed that DRL-LSTM outperforms the other three approaches. The experiments also showed that DRL-LSTM minimizes the CPU usage cost up to 67% compared with the shortest job first (SJF), and up to 35% compared with both the round robin (RR) and improved particle swarm optimization (PSO) approaches. Moreover, our DRL-LSTM solution decreases the RAM memory usage cost up to 72% compared with the SJF, up to 65% compared with the RR, and up to 31.25% compared with the improved PSO.
IEEE Transactions on Parallel and Distributed Systems
Big data frameworks such as Spark and Hadoop are widely adopted to run analytics jobs in both research and industry. Cloud offers affordable compute resources which are easier to manage. Hence, many organizations are shifting towards a cloud deployment of their big data computing clusters. However, job scheduling is a complex problem in the presence of various Service Level Agreement (SLA) objectives such as monetary cost reduction and job performance improvement. Most of the existing research does not address multiple objectives together and fails to capture the inherent cluster and workload characteristics. In this article, we formulate the job scheduling problem of a cloud-deployed Spark cluster and propose a novel Reinforcement Learning (RL) model to accommodate the SLA objectives. We develop the RL cluster environment and implement two Deep Reinforcement Learning (DRL) based schedulers in the TF-Agents framework. The proposed DRL-based scheduling agents work at a fine-grained level to place the executors of jobs while leveraging the pricing model of cloud VM instances. In addition, the DRL-based agents can also learn the inherent characteristics of different types of jobs to find a proper placement to reduce both the total cluster VM usage cost and the average job duration. The results show that the proposed DRL-based algorithms can reduce the VM usage cost up to 30%.
A REVIEW OF TASK OFFLOADING ALGORITHMS WITH DEEP REINFORCEMENT LEARNING
British Journal of Computer, Networking and Information Technology, 2024
The enormous volume of data generated by IoT devices is processed and stored through edge computing, a paradigm that allows tasks to be processed outside host devices. Task offloading, an important aspect of edge computing, is the movement of tasks from IoT devices to an edge or cloud server, where resources and processing capabilities are abundant, for processing. This paper reviews some task-offloading algorithms and the techniques used by each algorithm. Existing algorithms focus on latency, load, cost, energy, or delay. The deep reinforcement learning phase of a task offloading algorithm automates and optimizes the offloading decision process; it trains agents and defines rewards. The latency-aware phase then proceeds to obtain the best offload destination in order to significantly reduce the latency.
CAAI Transactions on Intelligence Technology, 2021
Many organizations around the world use cloud computing Testing as a Service (TaaS) for their services. Cloud computing is principally based on the idea of on-demand delivery of computation, storage, applications, and additional resources. It depends on delivering user services through Internet connectivity. In addition, it uses a pay-as-you-go business design to deliver user services. It offers some essential characteristics including on-demand service, resource pooling, rapid elasticity, virtualization, and measured services. There are various types of virtualization, such as full virtualization, para-virtualization, emulation, OS virtualization, and application virtualization. Resource scheduling in TaaS is among the most challenging jobs in resource allocation to mandatory tasks/jobs based on the required quality of applications and projects. Because of the cloud environment, uncertainty, and perhaps heterogeneity, resource allocation cannot be addressed with prevailing policies. This situation remains a significant concern for the majority of cloud providers, as they face challenges in selecting the correct resource scheduling algorithm for a particular workload. The authors use the emergent artificial intelligence algorithms deep RM2, deep reinforcement learning, and deep reinforcement learning for TaaS cloud scheduling to resolve the issue of resource scheduling in cloud TaaS. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Three-Tier Computing Platform Optimization: A Deep Reinforcement Learning Approach
Mobile Information Systems
The increasing number of computing platforms is critical with the increasing trend of delay-sensitive complex applications with enormous power consumption. These computing platforms attach themselves to multiple small base stations and macro base stations to optimize system performance if appropriately allocated. The arrival rate of computing tasks is often stochastic under time-varying wireless channel conditions in the mobile edge computing Internet of things (MEC IoT) network, making it challenging to implement an optimal offloading scheme. The user needs to choose the best computing platforms and base stations to minimize the task completion time and consume less power. In addition, the reliability of our system in terms of the number of computing resources (power, CPU cycles) each computing platform consumes to process the user’s task efficiently needs to be ascertained. This paper implements a computational task offloading scheme to a high-performance processor through a small...
Dependent Task Offloading for Edge Computing based on Deep Reinforcement Learning
IEEE Transactions on Computers, 2021
Edge computing is an emerging promising computing paradigm that brings computation and storage resources to the network edge, hence significantly reducing the service latency and network traffic. In edge computing, many applications are composed of dependent tasks where the outputs of some are the inputs of others. How to offload these tasks to the network edge is a vital and challenging problem which aims to determine the placement of each running task in order to maximize the Quality-of-Service (QoS). Most of the existing studies either design heuristic algorithms that lack strong adaptivity or learning-based methods but without considering the intrinsic task dependency. Different from the existing work, we propose an intelligent task offloading scheme leveraging off-policy reinforcement learning empowered by a Sequence-to-Sequence (S2S) neural network, where the dependent tasks are represented by a Directed Acyclic Graph (DAG). To improve the training efficiency, we combine a specific off-policy policy gradient algorithm with a clipped surrogate objective. We then conduct extensive simulation experiments using heterogeneous applications modelled by synthetic DAGs. The results demonstrate that: 1) our method converges fast and steadily in training; 2) it outperforms the existing methods and approximates the optimal solution in latency and energy consumption under various scenarios.
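The dependency constraint described in this abstract (a task may only run once the tasks producing its inputs have finished) can be sketched with Python's standard `graphlib` module. This only illustrates the DAG representation, not the paper's S2S-based offloading policy; the task names and the helper `valid_execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical application: each task maps to the set of tasks whose
# outputs it consumes (its predecessors in the DAG).
dependencies = {
    "decode": set(),
    "detect": {"decode"},
    "track": {"decode"},
    "render": {"detect", "track"},
}


def valid_execution_order(deps):
    """Return one order in which every task follows all of its predecessors."""
    return list(TopologicalSorter(deps).static_order())


order = valid_execution_order(dependencies)
```

Any offloading decision, learned or heuristic, has to respect such an ordering. Here `detect` and `track` have no dependency on each other, so they could be placed on different servers and run in parallel.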
Computing Offloading with Fairness Guarantee: A Deep Reinforcement Learning Method
IEEE Transactions on Circuits and Systems for Video Technology
Edge computing can reduce service latency and save backhaul bandwidth by completing services at network edges, providing support for diverse computation-intensive and delay-sensitive services. However, it is not practical to support all services at edge nodes due to the limited network resources. The decision of which services can be provided locally and which services should be offloaded to the cloud significantly impacts the user experience. Cloud-edge computing offloading has therefore become an important issue in edge computing. In this paper, we take fairness into the optimization objective of the computing offloading problem, and consider both computing capacity and storage space as problem constraints. The problem is formulated as a long-term average optimization problem to maximize the α-fair utility function of saved time, and further translated into a Markov decision process. Because the optimization problem has a fairness guarantee and a huge action space, we cannot solve it with traditional methods. Therefore, an innovative multi-update deep reinforcement learning algorithm is proposed which can optimize the objective with the α-fair utility function and dramatically reduce the size of the action space. We also prove the convergence of our algorithm theoretically. To the best of our knowledge, the long-term average optimization of computing offloading with fairness guarantee is rarely seen in the literature. Extensive simulation experiments show that our algorithm can converge quickly and has better performance in terms of service delay and fairness.
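For readers unfamiliar with α-fair utilities, the commonly used family is U(x) = x^(1-α) / (1-α) for α ≠ 1 and U(x) = log x for α = 1. The abstract does not give the exact form used in the paper, so the sketch below assumes this standard definition; the function name `alpha_fair_utility` is illustrative.

```python
import math


def alpha_fair_utility(x, alpha):
    """Standard alpha-fair utility of a positive quantity x.

    Assumed form (not confirmed by the abstract):
      alpha == 1 -> log(x)
      otherwise  -> x**(1 - alpha) / (1 - alpha)
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if math.isclose(alpha, 1.0):
        return math.log(x)
    return x ** (1.0 - alpha) / (1.0 - alpha)
```

With α = 0 the utility is simply x, so maximizing the sum ignores fairness; α = 1 gives proportional fairness; larger α values penalize small allocations more heavily and approach max-min fairness.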
IEEE Access
In recent years, computation offloading has become an effective way to overcome the constraints of mobile devices (MDs) by offloading delay-sensitive and computation-intensive mobile application tasks to remote cloud-based data centers. Smart cities can benefit from offloading to edge points in the framework of the so-called cyber-physical-social systems (CPSS), as for example in traffic violation tracking cameras. We assume that there are mobile edge computing networks (MECNs) in more than one region, and they consist of multiple access points, multi-edge servers, and N MDs, where each MD has M independent real-time massive tasks. The MDs can connect to a MECN through the access points or the mobile network. Each task can be processed locally by the MD itself or remotely. There are three offloading options: nearest edge server, adjacent edge server, and remote cloud. We propose a reinforcement-learning-based state-action-reward-state-action (RL-SARSA) algorithm to resolve the resource management problem in the edge server, and make the optimal offloading decision for minimizing system cost, including energy consumption and computing time delay. We call this method OD-SARSA (offloading decision-based SARSA). We compared our proposed method with reinforcement learning based Q learning (RL-QL), and it is concluded that the performance of the former is superior to that of the latter. INDEX TERMS Mobile devices, edge computing, mobile edge computing, edge cloud computing, virtual machines, access points.
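The difference between the two methods compared in this abstract lies in the bootstrap target of the update rule. The sketch below shows the textbook tabular SARSA and Q-learning updates in Python; it is not the OD-SARSA implementation, and the table layout (one row per state, one column for each of the three offloading options) and the default step size and discount are assumptions.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical Q-table: 2 states x 3 actions
# (0 = nearest edge server, 1 = adjacent edge server, 2 = remote cloud).
q_sarsa = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
q_ql = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]

sarsa_update(q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.5)
q_learning_update(q_ql, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
```

Because SARSA evaluates the action the current policy actually selects next, its estimates reflect the cost of exploratory moves, whereas Q-learning always assumes the greedy follow-up action.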
Cloud Resource Allocation from the User Perspective: A Bare-Bones Reinforcement Learning Approach
Web Information Systems Engineering – WISE 2016
Cloud computing enables effortless access to a seemingly infinite shared pool of resources, on a pay-per-use basis. As a result, a new challenge has emerged: designing control mechanisms to precisely meet the actual workload requirements of cloud applications in an online manner. To this end, a variety of complex resource management issues have to be addressed, because workloads in the cloud are of a dynamic and heterogeneous nature, and traditional algorithms do not cope well within this context. In this work, we adopt the point of view of the user of a cloud infrastructure and focus on the task of controlling leased resources. We formulate this task as a Reinforcement Learning problem and we simulate the decision-making process of a controller implementing the Q-learning algorithm. We conduct an experimental study, the outcomes of which offer valuable insight into the advantages and shortcomings of using Reinforcement Learning to implement such adaptive cloud resource controllers.