Automated agents for management and control of the ALICE Computing Grid

2010, Journal of Physics: Conference Series

A complex software environment such as the ALICE Computing Grid infrastructure requires permanent control and management of the large set of services involved. Automating control procedures reduces human interaction with the various components of the system and yields better availability of the overall system. In this paper we present how we used the MonALISA framework to gather, store and display the relevant metrics from central and remote site services across the entire system. We also show the automatic local and global procedures that are triggered by the monitored values. Decision-taking agents are used to restart remote services, alert the operators in case of problems that cannot be solved automatically, submit production jobs, replicate and analyze raw data, balance the load across resources, and drive other control mechanisms that optimize the overall workflow and simplify day-to-day operations. Synthetic graphical views of all operational parameters, correlations, and the state of services and applications, as well as the full history of all monitoring metrics, are available for the entire system, which now encompasses 85 sites all over the world, more than 14000 CPU cores and 10 PB of storage.
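
The behaviour of a decision-taking agent described above (restart a service when a monitored value indicates a failure, and alert the operators only when the problem cannot be solved automatically) can be illustrated with a minimal sketch. This is not the MonALISA API: the class `ServiceAgent`, its fields, the `max_restarts` limit and the metric name used in the usage note are all hypothetical, chosen only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServiceAgent:
    """Hypothetical agent: reacts to monitored values for one service."""

    name: str
    # Returns True when the monitored values indicate a healthy service.
    is_healthy: Callable[[Dict[str, float]], bool]
    # Action that attempts to restart the service (assumed to be supplied).
    restart: Callable[[], None]
    # Assumed policy: give up and alert after this many restart attempts.
    max_restarts: int = 3
    attempts: int = 0
    alerts: List[str] = field(default_factory=list)

    def on_metrics(self, metrics: Dict[str, float]) -> str:
        """Decide what to do for one batch of monitored values."""
        if self.is_healthy(metrics):
            self.attempts = 0  # service recovered, reset the counter
            return "ok"
        if self.attempts < self.max_restarts:
            self.attempts += 1
            self.restart()  # automatic recovery attempt
            return "restart"
        # Automatic recovery did not help: hand over to a human operator.
        self.alerts.append(
            f"{self.name}: still failing after {self.attempts} restarts"
        )
        return "alert"
```

An agent built as `ServiceAgent("storage", lambda m: m.get("up", 0) == 1, restart_fn)` would be fed each new set of monitored values through `on_metrics`; the restart counter resets as soon as the service reports healthy again, so operators are alerted only for persistent failures.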
