Galaxy CloudMan: delivering cloud compute clusters

Enis Afgan et al. BMC Bioinformatics. 2010.

Abstract

Background: Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of the computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is "cloud computing", which, in principle, offers on-demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate "as is" use by experimental biologists.

Results: We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon's EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add to or customize an otherwise available cloud system to better meet their needs.

Conclusions: The knowledge and effort required to deploy a compute cluster in the Amazon EC2 cloud are not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.

Figures

Figure 1

Main web interface for Galaxy CloudMan. Screenshot of the CloudMan cloud controller web interface running on the master instance of the cloud compute cluster. This interface is used to control the cloud cluster, including adding cluster services and scaling the number of worker instances and the size of the associated persistent data volume, and to provide an overview of the cluster status.

Figure 2

Scaling worker instances within CloudMan. A progression showing the number of worker instances associated with a given cloud cluster being scaled. Each icon represents an individual cloud instance. Within each icon, the load of the instance over the past 15 minutes is shown as a small glyph. Based on the load of the worker instances, the user can decide to scale the size of the cluster up or down.
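The scaling decision described in this caption can be sketched as a simple threshold rule. The following Python snippet is only an illustration and is not taken from CloudMan: the function name, the thresholds, and the assumption that load is normalized to a 0.0-1.0 range are all hypothetical. As the caption describes it, the decision is made by the user from the displayed load, not by code like this.

```python
from statistics import mean


def suggest_scaling(worker_loads, high=0.8, low=0.2):
    """Suggest a change in worker count from recent per-worker load.

    worker_loads: one normalized load value (0.0-1.0) per worker, for
    example an average over the past 15 minutes. The thresholds `high`
    and `low` are hypothetical values chosen for this example.
    Returns +1 (add a worker), -1 (remove a worker), or 0 (no change).
    """
    if not worker_loads:
        return 0  # no workers to judge by; leave the decision to the user
    average_load = mean(worker_loads)
    if average_load > high:
        return 1
    if average_load < low:
        return -1
    return 0
```

For example, `suggest_scaling([0.9, 0.95])` suggests adding a worker, while a cluster whose workers are mostly idle would get a suggestion to scale down.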

Figure 3

Modular architecture of CloudMan. The architecture of CloudMan is based on separation and subsequent coordination of otherwise independent components: the machine image, a persistent data repository, and persistent storage resources (i.e., snapshots). The machine image is characterized by simplicity; it consists only of the basic services required to initiate the application unit deployment process. The persistent data repository lives independently of the machine image and is used to provide instance contextualization details, such as boot-time scripts that define which services should be started. Lastly, persistent storage resources or snapshots are used as the storage medium for tools, libraries, or datasets required by the tools. Once instantiated, those components are aggregated by CloudMan into a cohesive operational unit. Because they are not modified during the life of a cluster, the persistent storage resources are deleted upon cluster termination. All of the cluster settings and user data are preserved in the user's account and will be reused on the next cluster instantiation. The Galaxy CloudMan developers maintain items in blue; items in green are (currently) maintained by the CloudBioLinux community (http://www.cloudbiolinux.com/), while items in gray are private to a user.
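The separation of components described in this caption can be illustrated with a small data model. This is a minimal sketch under stated assumptions, not CloudMan's implementation: every class, function, field, and identifier format below is hypothetical and stands in for the machine image, the boot-time service list from the persistent data repository, and the snapshots.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str  # hypothetical identifier
    contents: str     # e.g. "tools" or "reference datasets"


@dataclass
class Cluster:
    image_id: str        # the minimal machine image
    boot_services: list  # services named by the boot-time scripts
    volumes: list = field(default_factory=list)  # created from snapshots
    running: bool = False


def instantiate(image_id, boot_services, snapshots):
    """Aggregate the three independent components into one unit."""
    volumes = [f"vol-from-{s.snapshot_id}" for s in snapshots]
    return Cluster(image_id, list(boot_services), volumes, running=True)


def terminate(cluster):
    """Drop snapshot-derived volumes and return the settings to preserve."""
    settings = {
        "image_id": cluster.image_id,
        "boot_services": list(cluster.boot_services),
    }
    cluster.volumes.clear()  # unmodified during the run, so safe to delete
    cluster.running = False
    return settings
```

In this sketch the settings returned by `terminate` play the role of what is kept in the user's account, so a later call to `instantiate` can rebuild an equivalent cluster from the same image, services, and snapshots.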
