Scientific Workflows Research Papers - Academia.edu

Where there was once delineation between banking processes that a consumer could do from the comfort of their home or with the convenience of a smart phone and those that were done in a branch office, the use of digital functionality has finally become universal. No place is this more apparent than with the new account opening process, where features such as the camera phone OCR have improved both the mobile and branch account opening experience. With 70% of likely checking account applicants saying they would prefer to submit a digital application in 2015, it is clear that using digital functionality to improve the online, mobile and even the branch account opening process will eventually improve the onboarding and engagement process for new customers. Unfortunately, there is still the challenge of abandoned new account opening processes because of lengthy applications, unclear directions, the lack of mobile-first design, and the perception that branches have the edge when it comes to protecting personal data and getting advice. Surprisingly, most banks have not responded to this revolution in digital functionality. From hard-to-read screens to requiring signature cards and proof of identity at a branch, the process must improve. Even more surprising, while most banks offer online account opening, less than 20% offer a truly mobile new account opening process. The 57-page Digital Banking Report, Digital Account Opening, focuses on the digital account opening (DAO) experience for checking accounts, and the landscape of solutions and workflows that comprise the end-to-end account opening process. We focus primarily on account...

Co-authoring do-files can be challenging, as most Stata users have idiosyncratic preferences and methods for organizing and writing do-files. Which standards and practices can research teams adopt to improve the cohesion of this group work? This article proposes some best practices to overcome team research coordination issues, adapting methods from software engineering and data science along with personal experience with group research. We prioritize improvements that increase the efficiency of the team workflow by establishing global parameters and directories, standardizing communication between team members, and enabling reproducibility of results.
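
The same principles carry over to any scripting environment. As a hedged analogue (a Python sketch with hypothetical names, not code from the article), a single shared configuration module can pin the global parameters, project-relative directories, and seed that every coauthor's scripts import:

```python
# config.py - one place for team-wide parameters (hypothetical example,
# analogous to the global parameters and directories the article recommends).
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"
CLEAN_DATA_DIR = PROJECT_ROOT / "data" / "clean"
OUTPUT_DIR = PROJECT_ROOT / "output"
RANDOM_SEED = 20240101          # fixed seed so every coauthor reproduces results

def ensure_dirs() -> None:
    """Create the shared directory tree so scripts never hard-code paths."""
    for d in (RAW_DATA_DIR, CLEAN_DATA_DIR, OUTPUT_DIR):
        d.mkdir(parents=True, exist_ok=True)
```

Each analysis script then begins with `from config import CLEAN_DATA_DIR, RANDOM_SEED`, much as a master do-file would set globals before calling task-specific do-files.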

In the modern era, workflows have been adopted as a powerful and attractive paradigm for expressing and solving a variety of applications, such as scientific computing, data-intensive computing, and big data applications based on MapReduce and Hadoop. These complex applications are described using high-level representations in workflow methods. With the emergence of cloud computing technology, scheduling in the cloud has become an important research topic. Consequently, the workflow scheduling problem has been studied extensively over the past few years, from homogeneous clusters and grids to the most recent paradigm, cloud computing. The challenges that need to be addressed lie in task-resource mapping, QoS requirements, resource provisioning, performance fluctuation, failure handling, resource scheduling, and data storage. This work presents a comprehensive study of resource provisioning and scheduling algorithms in the cloud environment, focusing on Infrastructure as a Service (IaaS). We provide an overview of existing scheduling techniques and insight into research challenges that point to possible future directions for researchers.
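
As a concrete illustration of the task-resource mapping problem such surveys cover, the sketch below (hypothetical VM catalogue and task list, not from the paper) greedily assigns each task to the IaaS VM type that minimizes its estimated finish time, breaking ties on price:

```python
# Greedy earliest-finish-time mapping of workflow tasks onto heterogeneous VMs.
# Illustrative only: real schedulers also handle data transfer, deadlines, failures.

vm_types = [                      # hypothetical IaaS catalogue
    {"name": "small",  "speed": 1.0, "price_per_hour": 0.05},
    {"name": "medium", "speed": 2.0, "price_per_hour": 0.12},
    {"name": "large",  "speed": 4.0, "price_per_hour": 0.30},
]
tasks = [{"id": "t1", "work": 8.0}, {"id": "t2", "work": 2.0}, {"id": "t3", "work": 16.0}]

vm_ready = {vm["name"]: 0.0 for vm in vm_types}   # time each VM becomes free

schedule = []
for task in sorted(tasks, key=lambda t: -t["work"]):       # largest tasks first
    best = min(
        vm_types,
        key=lambda vm: (vm_ready[vm["name"]] + task["work"] / vm["speed"],
                        vm["price_per_hour"]),
    )
    start = vm_ready[best["name"]]
    finish = start + task["work"] / best["speed"]
    vm_ready[best["name"]] = finish
    schedule.append((task["id"], best["name"], start, finish))

print(schedule)   # inspect the resulting (task, vm, start, finish) tuples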

Weka is a mature and widely used set of Java software tools for machine learning, data-driven modelling and data mining, and is regarded as a current gold standard for the practical application of these techniques. This paper describes the integration and use of elements of the Weka open source machine learning toolkit within the cloud based data analytics e-Science Central platform. The purpose of this is to extend the data mining capabilities of the e-Science Central platform using trusted, widely used software components in such a way that the non-machine learning specialist can apply these techniques to their own data easily. To these ends, around 25 Weka blocks have been added to the e-Science Central workflow palette. These blocks encapsulate (1) a representative sample of supervised learning algorithms in Weka, (2) utility blocks for the manipulation and pre-processing of data, and (3) blocks that generate detailed model performance reports in PDF format. The blocks in the latter group were created to extend existing Weka functionality and allow the user to generate a single document so that model details and performance can be referenced outside of e-Science Central and Weka. Two real world examples are used to demonstrate Weka functionality in e-Science Central workflows: a regression modelling problem where the objective is to develop a model to predict a quality variable from an industrial distillation tower, and a classification problem, where the objective is to predict cancer diagnostics (tumours classified as 'Malignant' or 'Benign') based on measurements taken from lab cell nuclei imaging. Step-by-step methods are used to show how these data sets may be modelled, and the models evaluated, using blocks in e-Science Central workflows.
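
The classification example translates directly to other toolkits. As a hedged analogue (scikit-learn rather than the paper's Weka blocks and e-Science Central workflows), the same malignant/benign task on the Wisconsin breast-cancer data can be modelled and evaluated in a few lines:

```python
# Analogue of the malignant/benign classification workflow, using scikit-learn
# instead of Weka blocks (illustrative, not the e-Science Central setup).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)          # cell-nuclei measurements
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# A report like this is what the PDF-generating blocks externalise from Weka.
print(classification_report(y_test, model.predict(X_test),
                            target_names=["malignant", "benign"]))
```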

In the fast developing world of scholarly communication it is good to take a step back and look at the patterns and processes of innovation in this field. To this end, we have selected 101 innovations (in the form of tools & sites) and graphically displayed them by year and also according to six phases of the research workflow: collection of data & literature, analysis, writing, publishing & archiving, outreach and assessment. This overview facilitates discussion of processes of innovation, disruption, diffusion, consolidation, competition and success, but also of failure and stagnation, over the last three decades. We describe some of the trends, expectations, uncertainties, opportunities and challenges within each of the workflow phases. Also, based on the graphical overview, we present a juxtaposition of typical traditional, innovative and experimental workflows.

This paper offers an account of two Documentary Linguistics Workshops held in Tokyo based on the author's personal experience. The workshops have been held for nine consecutive years at the Research Institute for Languages and Cultures of Asia and Africa (ILCAA), Tokyo University of Foreign Studies (TUFS). The advantages and disadvantages of the courses are discussed in detail, and recommendations to students seeking similar programs are given.

Scientific workflows are abstractions used to model in silico scientific experiments. Cloud environments are still incipient in collecting and recording prospective and retrospective provenance. This paper presents an approach to support the collection of provenance metadata for in silico scientific experiments executed in public clouds. The strategy was implemented as a distributed and modular architecture named Matriohska. This paper also presents a provenance data model compatible with the PROV specification. We also show preliminary results that describe how provenance metadata was captured from the components running in the cloud.
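
To make the PROV compatibility concrete, here is a minimal sketch using the generic Python prov package (not the Matriohska implementation; the entity and activity names are hypothetical) that records retrospective provenance for a single workflow step:

```python
# Minimal retrospective-provenance record following the W3C PROV data model.
# Uses the generic `prov` package; names are hypothetical, not from Matriohska.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/workflow/")

raw = doc.entity("ex:raw-sequences")                  # input data set
aligned = doc.entity("ex:aligned-sequences")          # produced data set
align = doc.activity("ex:alignment-step")             # one workflow activity
engine = doc.agent("ex:cloud-workflow-engine")        # executing component

doc.used(align, raw)                                  # activity consumed the input
doc.wasGeneratedBy(aligned, align)                    # activity produced the output
doc.wasAssociatedWith(align, engine)                  # which component ran it

print(doc.get_provn())    # PROV-N serialization; PROV-JSON/XML also available
```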

The paper shows how error statistical theory can be deployed to grasp the deeper epistemic logic of the peer-review process. The intent is to provide the readers with a novel lens through which to make sense of the practices of academic publishing.

It is time to escape the constraints of the Systematics Wars narrative and pursue new questions that are better positioned to establish the relevance of the field in this time period to broader issues in the history of biology and history of science. To date, the underlying assumptions of the Systematics Wars narrative have led historians to prioritize theory over practice and the conflicts of a few leading theorists over the less-polarized interactions of systematists at large. We show how shifting to a practice-oriented view of methodology, centered on the trajectory of mathematization in systematics, demonstrates problems with the common view that one camp (cladistics) straightforwardly "won" over the other (phenetics). In particular, we critique David Hull's historical account in Science as a Process by demonstrating exactly the sort of intermediate level of positive sharing between phenetic and cladistic theories that undermines their mutually exclusive individuality as conceptual systems over time. It is misleading, or at least inadequate, to treat them simply as holistically opposed theories that can only interact by competition to the death. Looking to the future, we suggest that the concept of workflow provides an important new perspective on the history of mathematization and computerization in biology after World War II.

The paper describes a new cloud-oriented workflow system called Flowbster. It was designed to create efficient data pipelines in clouds by which large compute-intensive data sets can be processed efficiently. A Flowbster workflow can be deployed in the target cloud as a virtual infrastructure through which the data to be processed can flow, and as the data flows through the workflow it is transformed as the business logic of the workflow defines. Instead of using the enactor-based workflow concept, Flowbster applies the service choreography concept, where the workflow nodes communicate directly with each other. Workflow nodes are able to recognize whether they can be activated with a certain data set without the interaction of a central control service like the enactor in service orchestration workflows. As a result, Flowbster workflows implement a much more efficient data path through the workflow than service orchestration workflows. A Flowbster workflow works as a data pipeline, enabling the exploitation of pipeline parallelism, parallel-branch parallelism and node scalability parallelism. The Flowbster workflow can be deployed in the target cloud on demand based on the underlying Occopus cloud deployment and orchestrator tool. Occopus guarantees that the workflow can be deployed in several major types of IaaS clouds (OpenStack, OpenNebula, Amazon, CloudSigma). It takes care not only of deploying the nodes of the workflow but also of maintaining their health by using various health-checking options. Flowbster also provides an intuitive graphical user interface for end-user scientists. This interface hides the low-level cloud-oriented layers, so users can concentrate on the business logic of their data processing applications without detailed knowledge of the underlying cloud infrastructure.
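
The choreography idea, nodes passing data directly to their successors instead of reporting back to a central enactor, can be sketched with plain queues. This is a toy illustration of the pattern, not Flowbster's implementation:

```python
# Toy service-choreography pipeline: each node reads from its own input queue,
# transforms the data, and pushes directly to the next node's queue.
# No central enactor decides when a node fires; the arrival of data activates it.
import queue, threading

def node(name, transform, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:                 # poison pill shuts the node down
            if outbox is not None:
                outbox.put(None)
            break
        result = transform(item)
        if outbox is not None:
            outbox.put(result)           # hand off directly to the next node
        else:
            print(f"{name} produced {result}")

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=node, args=("clean", str.strip,   q1, q2)).start()
threading.Thread(target=node, args=("upper", str.upper,   q2, q3)).start()
threading.Thread(target=node, args=("emit",  lambda s: s, q3, None)).start()

for record in ["  alpha ", " beta", "gamma  "]:
    q1.put(record)                       # data flows through the pipeline
q1.put(None)
```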

The continuous quest for knowledge stimulates companies and research institutions not only to investigate new ways to improve the quality of scientific experiments, but also to reduce the time and costs needed for their implementation in distributed environments. The management of provenance descriptors collected during the life cycle of scientific experiments is an important goal to be achieved. This thesis presents a new strategy focused on aiding scientists in managing different kinds of provenance descriptors. It describes a computational approach that uses a well-founded ontology named OvO (Open proVenance Ontology) and a provenance infrastructure entitled Matriohska that can be attached to scientific workflows executed on distributed and heterogeneous environments such as computing clouds. The approach also allows scientists to perform semantic queries on provenance descriptors with distinct types of granularity.
This thesis was nominated by PESC/COPPE/UFRJ and awarded by the SAE (Strategic Affairs Secretariat of the Presidency of the Brazilian Republic) as the best Brazilian PhD thesis in Computer Science in 2011.

Modern computational experiments often use the resources of a cloud computing environment to solve a large number of tasks that differ only in the values of a relatively small set of simulation parameters. Such sets of tasks may arise in multivariate calculations aimed at finding the simulation parameter values that optimize certain characteristics of the computational model. Applications of this type make up a large percentage of the load on modern HPC systems, which implies a need for methods and algorithms that allocate resources efficiently for such problems. The aim of this work is to implement the PO-HEFT problem-oriented scientific workflow scheduling algorithm and to compare it with other workflow scheduling algorithms.
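
Since PO-HEFT builds on the HEFT family, a small sketch of the upward-rank computation that HEFT-style schedulers use to prioritize tasks may help. The DAG, execution costs and communication costs below are hypothetical, not the paper's benchmark:

```python
# Upward rank: rank(t) = avg_cost(t) + max over successors s of (comm(t,s) + rank(s)).
# Tasks are then scheduled in decreasing rank order, as in HEFT-style algorithms.
from functools import lru_cache

avg_cost = {"A": 4, "B": 3, "C": 5, "D": 2}              # mean execution times
comm = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 3, ("C", "D"): 1}
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

@lru_cache(maxsize=None)
def upward_rank(task):
    succ = successors[task]
    if not succ:
        return avg_cost[task]
    return avg_cost[task] + max(comm[(task, s)] + upward_rank(s) for s in succ)

priority = sorted(successors, key=upward_rank, reverse=True)
print(priority)     # entry task A ranks highest, exit task D lowest
```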

Background: Workflow engine technology represents a new class of software with the ability to graphically model step-based knowledge. We present an application of this novel technology to the domain of clinical decision support. Successful implementation of decision support within an electronic health record (EHR) remains an unsolved research challenge. Previous research efforts were mostly based on healthcare-specific representation standards and execution engines and did not reach wide adoption. We focus on two challenges in decision support systems: the ability to test decision logic on retrospective data prior to prospective deployment, and the challenge of user-friendly representation of clinical logic. Results: We present our implementation of a workflow engine technology that addresses the two above-described challenges in delivering clinical decision support. Our system is based on the cross-industry XML Process Definition Language (XPDL) standard. The core components of the system are a workflow editor for modeling clinical scenarios and a workflow engine for execution of those scenarios. We demonstrate, with an open-source and publicly available workflow suite, that clinical decision support logic can be executed on retrospective data. The same flowchart-based representation can also function in a prospective mode, where the system can be integrated with an EHR system and respond to real-time clinical events. We limit the scope of our implementation to decision support content generation (which can be EHR system vendor independent). We do not focus on supporting complex decision support content delivery mechanisms due to the lack of standardization of EHR systems in this area. We present results of our evaluation of the flowchart-based graphical notation as well as an architectural evaluation of our implementation using an established evaluation framework for clinical decision support architecture. Conclusions: We describe an implementation of a free workflow technology software suite (available at http://code.google.com/p/healthflow) and its application in the domain of clinical decision support. Our implementation seamlessly supports clinical logic testing on retrospective data and offers a user-friendly knowledge representation paradigm. With the presented software implementation, we demonstrate that workflow engine technology can provide a decision support platform which evaluates well against an established clinical decision support architecture evaluation framework. Due to cross-industry usage of workflow engine technology, we can expect significant future functionality enhancements that will further improve the technology's capacity to serve as a clinical decision support platform.
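
As a hedged illustration of the retrospective-testing idea (a toy rule with made-up thresholds and fields, not the HealthFlow XPDL suite), a single decision node can be expressed as a small predicate and replayed over historical encounters before being wired to live events:

```python
# Toy flowchart step replayed over retrospective records: the same predicate
# could later be attached to live EHR events in a prospective deployment.
# Fields and thresholds are illustrative only, not clinical guidance.
retrospective_encounters = [
    {"patient": "p1", "creatinine": 2.1, "on_nsaid": True},
    {"patient": "p2", "creatinine": 0.9, "on_nsaid": True},
    {"patient": "p3", "creatinine": 2.4, "on_nsaid": False},
]

def renal_dosing_alert(encounter):
    """Decision node: flag NSAID use when creatinine exceeds a cutoff."""
    return encounter["on_nsaid"] and encounter["creatinine"] > 1.5

alerts = [e["patient"] for e in retrospective_encounters if renal_dosing_alert(e)]
print(alerts)        # ['p1'] -> how often the rule would have fired historically
```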

A scientific workflow management system can be considered a binding agent that brings together scientists and distributed resources. A workflow graph plays the central role in such a system, as it is the component understood by both scientist and machine. Making sense of a scientific workflow graph is, undoubtedly, the first and foremost responsibility of a workflow management system. Typical systems include an orchestration engine which models a workflow and schedules individual components onto distributed resources. As part of WS-VLAM, we present an alternative orchestration engine which takes a different stand on interpreting the workflow graph. Whilst the current engine in WS-VLAM models the graph as a process network where components are tightly coupled through communication channels, the Datafluo engine models the graph as a dataflow network with farming capabilities. In this dissertation, we present the Datafluo architecture followed by a prototype implementation. The prototype is put through its paces using scientific workflow applications, and the generated results demonstrate the orchestration features. Through our results, we show how dataflow techniques reduce queue waiting times whilst farming techniques circumvent common workflow bottlenecks.
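
The farming idea, fanning one slow workflow component out over several identical workers so it stops being the bottleneck, can be sketched with a worker pool. This is a toy analogue, not the Datafluo engine:

```python
# Farming a bottleneck stage: instead of one worker draining the queue,
# N identical workers consume items concurrently.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_stage(item):
    time.sleep(0.1)              # stands in for an expensive workflow component
    return item * item

items = list(range(20))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # farm of 4 workers
    results = list(pool.map(slow_stage, items))
print(f"{len(results)} items in {time.perf_counter() - start:.2f}s "
      "(vs ~2.0s with a single worker)")
```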

Recently, the emergence of Function-as-a-Service (FaaS) has gained increasing attention from researchers. FaaS, also known as serverless computing, is a new concept in cloud computing in which code execution is triggered as a response to certain events. In this paper, we discuss various proposals related to scheduling tasks in clouds. These proposals are categorized according to their objective functions, namely minimizing execution time, minimizing execution cost, or multiple objectives (time and cost). The dependency relationships between tasks play a vital role in determining the efficiency of the scheduling approach. These dependencies may result in resource underutilization. FaaS is expected to have a significant impact on the process of scheduling tasks. This problem can be reduced by adopting a hybrid approach that combines the benefits of both FaaS and Infrastructure-as-a-Service (IaaS). Using FaaS, we can run the small tasks remotely and focus only on scheduling the large tasks. This helps increase the utilization of the resources because the small tasks will not be considered during the scheduling process. An extension of the restricted time limit by cloud vendors would allow running the complete workflow using the serverless architecture, avoiding the scheduling problem.
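
The hybrid idea can be sketched as a simple size-based split (the threshold and task list are hypothetical): tasks under the provider's execution-time limit are dispatched to FaaS, and only the remaining large tasks enter the IaaS scheduling queue:

```python
# Hybrid FaaS/IaaS split (illustrative): tasks under a runtime threshold go to
# serverless functions; only the remaining large tasks are scheduled onto VMs.
FAAS_LIMIT_SECONDS = 900          # hypothetical provider execution-time limit

tasks = [
    {"id": "parse-header", "est_runtime": 30},
    {"id": "align-genome", "est_runtime": 5400},
    {"id": "plot-summary", "est_runtime": 45},
]

faas_tasks = [t for t in tasks if t["est_runtime"] <= FAAS_LIMIT_SECONDS]
iaas_tasks = [t for t in tasks if t["est_runtime"] > FAAS_LIMIT_SECONDS]

print("run on FaaS:", [t["id"] for t in faas_tasks])       # small tasks, no scheduling
print("schedule on IaaS:", [t["id"] for t in iaas_tasks])  # only large tasks remain
```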

Distributed computing has always been a challenge due to the NP-completeness of finding optimal underlying management routines. The advent of big data increases the dimensionality of the problem, whereby data partitionability, processing complexity and locality play a crucial role in the effectiveness of distributed systems. The flexibility and control brought forward by virtualization mean that for the first time we control the whole stack from the application down to the network layer but, to a certain extent, the best way to exploit this level of programmability still eludes us.
Our research tackles this problem from both the data and the infrastructure fronts. We investigate the evolving dynamic infrastructure, whereby we research distributed computing on inter-clouds and web browsers. Dynamism in the infrastructure leads to more adaptable middleware; we investigate prediction- and fuzzy-based data processing scaling techniques for workflows and dataflows. The increasing complexity of data processing is a challenge; we address this complexity by introducing an automata-based modeling and coordination system.
Semantics will play an essential role in the future of data processing. For this reason we investigate how semantics can be used to inter-link globally distributed data processors. This forms the final layer in our research, which started from the dynamic infrastructure layer.

Laser scanners enable bridge inspectors to collect dense 3D point clouds, which capture detailed geometries of bridges. While these data sets contain rich geometric information, they bring unique challenges related to geometric information retrieval. This paper describes a case study to show the necessity and potential value of automating the manual data processing workflows being executed for extracting geometric data items (surveying goals) from 3D point clouds, and presents an approach for formalizing these workflows to enable such automation. We analyzed manual procedures for taking measurements on 3D point clouds and as-built models to obtain bridge inspection related surveying goals, and synthesized and categorized all data processing operations into nine generic operations. These nine categories of operations can be formalized using <operator, inputs, output, parameters, constraints> tuples. Using these tuples, we formalized an operation library and workflow construction mechanisms that enable inspectors to semi-automatically construct executable workflows. This formalism also incorporates several mechanisms for facilitating extensions to the existing operation library to accommodate additional surveying goals that have not been covered. The developed approach is validated for its generality to support workflows needed for all surveying goals required by the National Bridge Inventory (NBI) program, and for its extensibility to support workflows needed for a variety of other surveying goals identified in the Architecture/Engineering/Construction domain.
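
A hedged sketch of the <operator, inputs, output, parameters, constraints> formalism follows; the field names mirror the paper's tuple, but the concrete operations and values are hypothetical illustrations:

```python
# Each workflow step is an <operator, inputs, output, parameters, constraints> tuple;
# chaining tuples whose outputs feed later inputs yields an executable workflow.
from dataclasses import dataclass, field

@dataclass
class Operation:
    operator: str                     # one of the nine generic operation types
    inputs: list
    output: str
    parameters: dict = field(default_factory=dict)
    constraints: dict = field(default_factory=dict)

# Hypothetical workflow for one surveying goal (e.g., a vertical clearance measurement).
workflow = [
    Operation("segment", ["bridge_point_cloud"], "deck_points",
              parameters={"method": "region_growing"}),
    Operation("fit_plane", ["deck_points"], "deck_plane",
              constraints={"max_rmse_m": 0.01}),
    Operation("measure_distance", ["deck_plane", "road_surface"], "vertical_clearance",
              parameters={"direction": "vertical"}),
]

for step in workflow:
    print(step.operator, step.inputs, "->", step.output)
```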

Understanding the core function of the brain is one of the major challenges of our times. In the areas of neuroscience and education, several new studies try to correlate the learning difficulties faced by children and youth with behavioral and social problems. This work presents the challenges and opportunities of computational neuroscience research aimed at detecting people with learning disorders. We present a line of investigation based on three key areas: neuroscience, cognitive science and computer science, which considers young people between nine and eighteen years of age, with or without a learning disorder. The adoption of neural networks has proved consistent in dealing with pattern recognition problems, and they are shown to be effective for early detection in patients with these disorders. We argue that computational neuroscience can be used for identifying and analyzing young Brazilian people with several cognitive disorders.

Many new websites and online tools have come into existence to support scholarly communication in all phases of the research workflow. To what extent researchers are using these and more traditional tools has been largely unknown. This 2015-2016 survey aimed to fill that gap. Its results may help decision making by stakeholders supporting researchers and may also help researchers wishing to reflect on their own online workflows. In addition, information on tools usage can inform studies of changing research workflows.

Software Product Line (SPL) engineering is a paradigm shift towards modeling and developing software system families rather than individual systems. It focuses on the means of efficiently producing and maintaining multiple similar software products, exploiting what they have in common and managing what varies among them. This is analogous to what is practiced in the automotive industry, where the focus is on creating a single production line, out of which many customized but similar variations of a car model are produced. Feature models (FMs) are a fundamental formalism for specifying and reasoning about commonality and variability of SPLs. FMs are becoming increasingly complex, handled by several stakeholders or organizations, used to describe features at various levels of abstraction and related in a variety of ways. In different contexts and application domains, maintaining a single large FM is neither feasible nor desirable. Instead, multiple FMs are now used. In this thesis, we develop theoretical foundations and practical support for managing multiple FMs. We design and develop a set of composition and decomposition operators (aggregate, merge, slice) for supporting separation of concerns. The operators are formally defined, implemented with a fully automated algorithm and guarantee properties in terms of sets of configurations. We show how the composition and decomposition operators can be combined together or with other reasoning and editing operators to realize complex tasks. We propose a textual language, FAMILIAR (for FeAture Model scrIpt Language for manIpulation and Automatic Reasoning), which provides a practical solution for managing FMs on a large scale. An SPL practitioner can combine the different operators and manipulate a restricted set of concepts (FMs, features, configurations, etc.) using a concise notation and language facilities. FAMILIAR hides implementation details (e.g., solvers) and comes with a development environment. We report various applications of the operators and usages of FAMILIAR in different domains (medical imaging, video surveillance) and for different purposes (scientific workflow design, variability modeling from requirements to runtime, reverse engineering), showing the applicability of both the operators and the supporting language. Without the new capabilities brought by the operators and FAMILIAR, some analysis and reasoning operations would not be made possible in the different case studies. To conclude, we discuss different research perspectives in the medium term (regarding the operators, the language and validation elements) and in the long term (e.g., relationships between FMs and other models).
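
The semantics of the composition operators is stated over sets of configurations. As a toy sketch of that idea (an extensional representation, not the FAMILIAR implementation, which works on feature-model diagrams), a merge in intersection mode keeps exactly the configurations valid in both input models:

```python
# Toy configuration-set semantics: a feature model is represented extensionally
# by the set of its valid configurations (each a frozenset of selected features).
# Merge in intersection mode keeps only configurations valid in both models.
fm_a = {frozenset({"base"}), frozenset({"base", "gpu"}), frozenset({"base", "gpu", "cuda"})}
fm_b = {frozenset({"base"}), frozenset({"base", "gpu"}), frozenset({"base", "cluster"})}

merged_intersection = fm_a & fm_b          # configurations shared by both views
merged_union = fm_a | fm_b                 # union mode: valid in at least one view

print(sorted(sorted(c) for c in merged_intersection))
# [['base'], ['base', 'gpu']]
```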

The introduction of Next Generation Sequencing into the disciplines of plant systematics, ecology, and metagenomics, among others, has resulted in a phenomenal increase in the collecting and storing of tissue samples and their respective vouchers. This manual suggests standard practices that will ensure the quality and preservation of the tissue and vouchers and their respective data. Although written for use by Smithsonian Institution botanists, it suggests a framework for collecting tissues and vouchers that other research programs can adapt to their own needs. It includes information on collecting voucher specimens, collecting plant tissue intended for genomic analysis, how to manage these collections, and how to incorporate the data into a database management system. It also includes many useful references for collecting and processing collections. We hope it will be useful for a variety of botanists, especially those who know how to collect plants and want to collect tissue samples that will be useful for genomic research, and those who are skilled in lab work and want to know how to properly voucher and record their tissue collections.

When historical research questions are addressed with digital tools, the programs developed by historical research projects have so far been highly specialized. The software was tailored precisely to the historical research question and was built by a computer scientist solely for the purpose of answering one specific historical question. Many historians have little to no knowledge of programming. As a result, they often cannot adapt the code of the software they use to their new needs and questions themselves.

• Premise of the study: Internationally, gardens hold diverse living collections that can be preserved for genomic research. Workflows have been developed for genomic tissue sampling in other taxa (e.g., vertebrates), but are inadequate for plants. We outline a workflow for tissue sampling intended for two audiences: botanists interested in genomics research and garden staff who plan to voucher living collections.
• Methods and Results: Standard herbarium methods are used to collect vouchers, label information and images are entered into a publicly accessible database, and leaf tissue is preserved in silica and liquid nitrogen. A five-step approach for genomic tissue sampling is presented for sampling from living collections according to current best practices.
• Conclusions: Collecting genome-quality samples from gardens is an economical and rapid way to make tissue from the diversity of plants on Earth available for scientific research. The Global Genome Initiative will facilitate and lead this endeavor through international partnerships.

A significant amount of recent research in scientific workflows aims to develop new techniques, algorithms and systems that can overcome the challenges of efficient and robust execution of ever larger workflows on increasingly complex distributed infrastructures. Since the infrastructures, systems and applications are complex, and their behavior is difficult to reproduce using physical experiments, much of this research is based on simulation. However, there exists a shortage of realistic datasets and tools that can be used for such studies. In this paper we describe a collection of tools and data that have enabled research in new techniques, algorithms, and systems for scientific workflows. These resources include: 1) execution traces of real workflow applications from which workflow and system characteristics such as resource usage and failure profiles can be extracted, 2) a synthetic workflow generator that can produce realistic synthetic workflows based on profiles extracted from execution traces, and 3) a simulator framework that can simulate the execution of synthetic workflows on realistic distributed infrastructures. This paper describes how we have used these resources to investigate new techniques for efficient and robust workflow execution, as well as to provide improvements to the Pegasus Workflow Management System or other workflow tools. Our goal in describing these resources is to share them with other researchers in the workflow research community. All of the tools and data are freely available online for the community at http://www.workflowarchive.org. These data have already been leveraged for a number of studies.
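
A hedged sketch of what a synthetic workflow generator does follows: it builds a layered DAG and draws task runtimes from a profile (mean and standard deviation) of the kind that can be extracted from execution traces. This is a toy illustration, not the generator distributed at workflowarchive.org:

```python
# Toy synthetic-workflow generator: builds a layered DAG and draws task runtimes
# from a profile, roughly mimicking what trace-driven generators do.
import random

def generate_workflow(levels=4, width=3, runtime_mean=60.0, runtime_sd=15.0, seed=0):
    rng = random.Random(seed)
    tasks, edges = {}, []
    prev_level = []
    for level in range(levels):
        current = []
        for i in range(rng.randint(1, width)):
            tid = f"t{level}_{i}"
            tasks[tid] = max(1.0, rng.gauss(runtime_mean, runtime_sd))  # runtime (s)
            if prev_level:                                # link to 1+ random parents
                parents = rng.sample(prev_level, k=rng.randint(1, len(prev_level)))
                edges.extend((p, tid) for p in parents)
            current.append(tid)
        prev_level = current
    return tasks, edges

tasks, edges = generate_workflow()
print(len(tasks), "tasks,", len(edges), "dependencies")
```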

Cloud computing is a ubiquitous model that enables clients to access different services in a fast and easy manner. In this context, one of the most used models is Software as a Service (SaaS), which means that software is deployed and provisioned to the customer via the internet through a web browser on a pay-per-use basis. However, given its complexity and characteristics, such as reusability, scalability, flexibility and customization, SaaS may be defined by workflows, which consist of atomic services, or micro-services, hosted geographically in different places. SaaS execution under this type of composition may lead to abnormal behavior or failures in the end-user applications at runtime. This paper presents a new model of dynamic orchestration for SaaS, which aims to reduce failures and abnormal behavior of the services involved in the execution of business applications.
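
A minimal sketch of the failure-handling idea, retrying a micro-service call and then failing over to a replica hosted elsewhere, is shown below. The endpoints and failure behavior are hypothetical; this is not the paper's orchestration model:

```python
# Toy dynamic orchestration of a SaaS workflow step: retry a micro-service call,
# then fail over to a replica hosted in another region. Endpoints are hypothetical.
def call_service(endpoint, payload):
    if endpoint.startswith("eu-west"):        # simulate a failing primary region
        raise RuntimeError(f"{endpoint} unavailable")
    return f"{endpoint} processed {payload}"

def orchestrate_step(payload, endpoints, retries=2):
    for endpoint in endpoints:                # primary first, then replicas
        for attempt in range(retries):
            try:
                return call_service(endpoint, payload)
            except RuntimeError:
                continue                      # retry the same endpoint
    raise RuntimeError("all replicas exhausted")

print(orchestrate_step("invoice-123", ["eu-west/billing", "us-east/billing"]))
```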

A leading, international, engineering and construction company has carried out efforts to engage a new tool set and work process. Four-Dimensional Planning and Scheduling (4D-PS) is the new work process that aims toward better, more efficient planning and execution of large construction projects. This paper describes the case history and forecasts how this revitalized technique may ultimately impact the construction industry. Despite academic and practitioners' research and development efforts to leverage Information Technology (IT) in construction, the industry at large, being generally conservative, has adhered to the values of predictability and existing methods to minimize risk. 4D technology struggled to find its way into mainstream construction practice for several years, and only recently has it been shown that commercially available software and hardware can be applied effectively toward this end, greatly reducing investment risk. These relatively new tools promise new impetus to the use of 4D-PS in the construction industry. This paper describes how 4D-PS was applied on a major construction project, giving rise to a new work process that proved to be productive and cost effective. Emphasis is placed on the fact that those expected to use such technology must have the necessary training and, conversely, that near-future versions of computerized tools can be made more intuitive for more widespread use. The use of such techniques will necessarily draw engineering/design and construction entities closer together, essentially improving coordination among them.

We propose a new method for mining sets of patterns for classification, where patterns are represented as SPARQL queries over RDFS. The method contributes to so-called semantic data mining, a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies, rather than only purely empirical data. We have developed a tool that implements this approach. Using this tool we have conducted an experimental evaluation, including a comparison of our method to state-of-the-art approaches to classification of semantic data and an experimental study within the emerging subfield of meta-learning called semantic meta-mining. The most important research contributions of the paper to the state of the art are as follows. For pattern mining research, and relational learning in general, the paper contributes a new algorithm for the discovery of a new type of pattern. For Semantic Web research, it theoretically and empirically illustrates how semantic, structured data can be used in traditional machine learning methods through a pattern-based approach for constructing semantic features.
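
As a hedged illustration of a pattern expressed as a SPARQL query over RDFS-annotated data (a toy ontology and instances written with rdflib, not the paper's meta-mining dataset), a pattern acts as a binary feature indicating which examples it covers:

```python
# A classification "pattern" expressed as a SPARQL query over RDFS data:
# it matches runs whose algorithm is declared an instance of ex:TreeLearner.
# Toy ontology and instances; not the paper's dataset or tool.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/dm#")
g = Graph()
g.add((EX.C45, RDF.type, EX.TreeLearner))                 # domain ontology facts
g.add((EX.TreeLearner, RDFS.subClassOf, EX.Learner))
g.add((EX.run1, EX.usesAlgorithm, EX.C45))                # empirical examples
g.add((EX.run2, EX.usesAlgorithm, EX.SVM))

pattern = """
PREFIX ex: <http://example.org/dm#>
SELECT ?run WHERE {
    ?run ex:usesAlgorithm ?alg .
    ?alg a ex:TreeLearner .
}
"""
for row in g.query(pattern):
    print(row.run)            # each match: the pattern covers this example
```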

Schedulers for cloud computing determine on which processing resource the jobs of a workflow should be allocated. In hybrid clouds, jobs can be allocated either on a private cloud or on a public cloud on a pay-per-use basis. The capacity of the communication channels connecting these two types of resources impacts the makespan and the cost of workflow execution. This paper introduces the scheduling problem in hybrid clouds, presenting the main characteristics to be considered when scheduling workflows, as well as a brief survey of some of the scheduling algorithms used in these systems. To assess the influence of communication channels on job allocation, we compare and evaluate the impact of the available bandwidth on the performance of some of the scheduling algorithms.
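
To make the bandwidth effect concrete, here is a small hedged calculation (with hypothetical job sizes, runtimes, and link capacities) of when offloading a job to the public cloud shortens the makespan despite the data-transfer time over the inter-cloud link:

```python
# Bandwidth-aware placement check (illustrative numbers): offloading to the
# public cloud only shortens the makespan if the faster execution outweighs
# the time spent moving data over the inter-cloud link.
def offload_worth_it(data_gb, bandwidth_mbps, t_private_s, t_public_s):
    transfer_s = (data_gb * 8_000) / bandwidth_mbps     # GB -> megabits
    return transfer_s + t_public_s < t_private_s, transfer_s

for bw in (100, 1_000):                                  # Mbps link capacities
    better, transfer = offload_worth_it(data_gb=20, bandwidth_mbps=bw,
                                        t_private_s=2_000, t_public_s=600)
    print(f"{bw} Mbps: transfer {transfer:.0f}s -> offload better: {better}")
# 100 Mbps: the 1600s transfer erases the gain; 1000 Mbps: offloading wins.
```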