Beth Plale | Indiana University
Papers by Beth Plale
Workflow systems are an increasingly popular e-Science tool for executing complex sequences of tasks. The large volumes of data created during these computationally intense, data-driven scientific investigations drive research into techniques that automate metadata capture, relieving the user of the burden of manual annotation. In this paper we describe our experience to date in quantifying the limits of automated metadata collection in e-Science workflow systems.
The scientific knowledge discovery process has been aided recently by advances in cyberinfrastructure that automate the execution of data retrieval, modeling, and analysis tasks typically undertaken during scientific exploration. But these infrastructures lack a generalized programming model for integrating real-time data from sensors and instruments into the analysis process. In this ongoing work we examine two approaches to stream processing, a rule system and a query language-based system, and argue that either can be made suitable for our user base with enough hand-written code. But can either become as accepted as workflow systems in the science community? What will that take?
Clouds are increasingly being used for running data-intensive scientific applications. However, science applications need to contend with the I/O and network performance characteristics of cloud environments. Additionally, managing data effectively and efficiently over these cloud resources is challenging because of the myriad storage choices with different performance-cost trade-offs, complex application choices, and the complexity associated with elasticity and failure rates. In this paper, we evaluate various aspects of data management strategies in cloud environments. Our evaluation is performed in the context of two frameworks, Hadoop and FRIEDA, and conducted on four cloud testbeds: FutureGrid, ExoGeni, Grid5000, and Amazon. Our experiments highlight the different performance implications of storage, file system, provis