Ilkay Altintas | University of California, San Diego
Papers by Ilkay Altintas
Advances in satellite imagery present unprecedented opportunities for understanding natural and social phenomena at global and regional scales. Although the field of satellite remote sensing has addressed questions imperative to human and environmental sustainability, scaling those techniques to very high spatial resolutions at regional scales remains a challenge. Satellite imagery is now more accessible, with greater spatial, spectral and temporal resolution, creating a data bottleneck in identifying the content of images. Because satellite images are unlabeled, unsupervised methods allow us to organize images into coherent groups or clusters. However, the performance of unsupervised methods, like all other machine learning methods, depends on features. Recent studies using features from pre-trained networks have shown promise for learning on new datasets. This suggests that features from pre-trained networks can be used for learning in temporally and spatially dynamic data sources such as satellite imagery. It is not clear, however, which features from which layer and network architecture should be used for learning new tasks. In this paper, we present an approach to evaluate the transferability of features from pre-trained Deep Convolutional Neural Networks for satellite imagery. We explore and evaluate different features and feature combinations extracted from various deep network architectures, systematically evaluating over 2,000 network-layer combinations. In addition, we test the transferability of our engineered features and learned features from an unlabeled dataset to a different labeled dataset. Our feature engineering and learning are done on the unlabeled Draper Satellite Chronology dataset, and we test on the labeled UC Merced Land dataset, achieving near state-of-the-art classification results. These results suggest that even with minimal or no training, these networks can generalize well to other datasets. This method could be useful for clustering unlabeled images and other unsupervised machine learning tasks.
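As a rough illustration of the kind of pipeline this abstract describes, the sketch below extracts pooled features from a pre-trained CNN and clusters unlabeled image tiles. It is a minimal sketch, not the paper's evaluated configuration; the choice of ResNet-50, the tiles/ directory, and the cluster count are assumptions made here for illustration.

```python
# Minimal sketch (not the paper's exact pipeline): extract features from a
# pre-trained CNN and cluster unlabeled satellite image tiles. Model choice
# (ResNet-50), layer, image folder, and cluster count are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.cluster import KMeans
import numpy as np

# Pre-trained backbone; strip the classification head to expose pooled features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageFolder expects at least one subdirectory under tiles/.
loader = DataLoader(ImageFolder("tiles/", transform=transform), batch_size=64)

features = []
with torch.no_grad():
    for images, _ in loader:
        features.append(backbone(images).numpy())  # 2048-d pooled features
features = np.concatenate(features)

# Group tiles into coherent clusters without labels.
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(features)
```

In this setup the backbone is never fine-tuned; swapping in a different architecture or layer only changes how `features` is computed, which is the axis the paper evaluates.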
The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance related to a single run, mostly executed by a single user. However, a scientific discovery is often the result of methodical execution of many scientific workflows with many datasets produced at different times by one or more users. Further, to promote and facilitate exchange of information between multiple workflow systems supporting provenance, the Open Provenance Model (OPM) has been proposed by the scientific workflow community. In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps to answer collaborative queries, e.g., identifying combined workflows and contributions of users collaborating on a project based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable use-case scenario, where different stages of the workflow are executed in three different workflow environments: Kepler, Taverna, and WSVLAM. Through this use case, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.
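To make the notion of a collaborative query concrete, here is a toy example, written in plain Python with networkx rather than in QLP or an OPM serialization, of detecting an implicit collaboration: one user's run consumed a data product generated by another user's run. All node names and the graph encoding are hypothetical.

```python
# Toy illustration (not QLP or OPM syntax) of a collaborative provenance
# query: users are linked when one user's workflow run consumed a data
# product generated by another user's run. Node and edge names are made up.
import networkx as nx

prov = nx.DiGraph()
# "generated" points run -> data; "used" points data -> run.
prov.add_edge(("run", "wf1_r1", "alice"), ("data", "d1"), rel="generated")
prov.add_edge(("data", "d1"), ("run", "wf2_r1", "bob"), rel="used")
prov.add_edge(("run", "wf2_r1", "bob"), ("data", "d2"), rel="generated")

def collaborations(g):
    """Yield (producer, consumer, data product) triples linking two users."""
    for node in g.nodes:
        if node[0] != "data":
            continue
        producers = [u for u, _ in g.in_edges(node)]
        consumers = [v for _, v in g.out_edges(node)]
        for p in producers:
            for c in consumers:
                if p[2] != c[2]:
                    yield p[2], c[2], node[1]

print(list(collaborations(prov)))  # [('alice', 'bob', 'd1')]
```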
The ROADNet project concentrates real-time data from a wide variety of signal domains, providing a reliable platform to store and transport these data. Ptolemy is a general-purpose visual programming environment in which workflows on data streams can be constructed by connecting general-purpose components. The Kepler scientific workflow system extends Ptolemy to support the design and automation of scientific data analysis tasks.
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) is an eScience project to enable the microbial ecology community to manage the challenges of metagenomics analysis. CAMERA supports extensive metadata-based data acquisition and access, as well as execution of metagenomics experiments through standard and customized scientific workflows.
Background: As more and more microarray data sets become available to the public, the opportunities to conduct data meta-analyses increase. Meta-analyses of microarray data consist of general steps such as downloading, decompressing, classifying, quality controlling and normalizing. These steps are time-consuming if they are not automated. A workflow is needed to automate these general steps, improve efficiency and standardize the analyses.
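The sketch below strings those general steps into a small script to show how such a pipeline might be automated. File layouts and the placeholder quality-control and normalization hooks are assumptions for illustration, not the authors' actual workflow.

```python
# Schematic outline of the general steps (download, decompress, classify,
# quality-control, normalize) chained as a small pipeline. File names and
# the placeholder step hooks are illustrative, not the authors' workflow.
import gzip
import shutil
from pathlib import Path

def decompress(src: Path, dest_dir: Path) -> Path:
    out = dest_dir / src.stem          # e.g. GSM123.CEL.gz -> GSM123.CEL
    with gzip.open(src, "rb") as fin, open(out, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return out

def run_pipeline(raw_files, work_dir: Path):
    work_dir.mkdir(exist_ok=True)
    results = []
    for gz in raw_files:
        cel = decompress(gz, work_dir)
        # Placeholder hooks for the remaining steps; in practice these would
        # call array-platform-specific QC and normalization routines.
        results.append({"file": cel, "qc": "pending", "normalized": "pending"})
    return results

if __name__ == "__main__":
    files = sorted(Path("downloads").glob("*.CEL.gz"))
    print(run_pipeline(files, Path("work")))
```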
Data quality is an important component of modern scientific discovery. Many scientific discovery processes consume data from a diverse array of resources such as streaming sensor networks, web services, and databases. The validity of a scientific computation's results is highly dependent on the quality of these input data. Scientific workflow systems are being increasingly used to automate scientific computations by facilitating experiment design, data capture, integration, processing, and analysis.
The notion of sharing scientific data has only recently begun to gain ground in science, where data is still considered a private asset. There is growing evidence, however, that the benefits of scientific collaboration through early data sharing during the course of a science project may outweigh the risk of losing exclusive ownership of the data. As exemplar success stories make the headlines [1], principles of effective information sharing have become the subject of e-science research.
To execute workflows on a compute cluster, workflow engines can work with cluster resource manager software to distribute jobs to compute nodes on the cluster. We discuss how to interact with the traditional Oracle Grid Engine and more recent Hadoop cluster resource managers using a dataflow-based scheduling approach to balance compute resource load for data-parallel workflow execution.
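As a toy illustration of targeting different resource managers from the same workflow step, the wrapper below dispatches a job either through Grid Engine's qsub or as a Hadoop streaming job. The wrapper itself and all paths are hypothetical and do not reflect Kepler's actual scheduler.

```python
# Toy dispatcher showing the idea of targeting either a Grid Engine or a
# Hadoop resource manager from the same workflow step. The wrapper is
# hypothetical; the script and jar paths are placeholders.
import subprocess

def submit(job_script: str, engine: str = "sge"):
    if engine == "sge":
        # Grid Engine batch submission.
        cmd = ["qsub", job_script]
    elif engine == "hadoop":
        # Hadoop streaming job; mapper/reducer and I/O paths are illustrative.
        cmd = ["hadoop", "jar", "hadoop-streaming.jar",
               "-input", "in/", "-output", "out/",
               "-mapper", job_script, "-reducer", "cat"]
    else:
        raise ValueError(f"unknown engine: {engine}")
    return subprocess.run(cmd, capture_output=True, text=True)
```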
Due to the enormous complexity of computer systems, researchers use simulators to model system behavior and generate quantitative estimates of expected performance. Researchers also use simulators to model and assess the efficacy of future enhancements and novel systems. Arguably the most important tools available to computer architecture researchers, simulators offer a balance of cost, timeliness, and flexibility.
With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive, end-to-end data management solutions ranging from initial data acquisition to final analysis and visualization.
Distributed Data-Parallel (DDP) patterns such as MapReduce have become increasingly popular as solutions to facilitate data-intensive applications, resulting in a number of systems supporting DDP workflows. Yet applications or workflows built using these patterns are usually tightly coupled with the underlying DDP execution engine they select. We present a framework for distributed data-parallel execution in the Kepler scientific workflow system that enables users to easily switch between different DDP execution engines.
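The decoupling idea can be sketched as expressing an analysis once as map and reduce functions and binding it to an engine at run time. The LocalEngine below is a stand-in written for this illustration and is not Kepler's actual API.

```python
# Sketch of decoupling a map/reduce-style analysis from the engine that runs
# it. LocalEngine is a stand-in; a real deployment would bind the same
# functions to Hadoop, Spark, etc. This is not Kepler's actual API.
from collections import defaultdict

class LocalEngine:
    """Reference engine: runs map and reduce in-process."""
    def execute(self, data, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for item in data:
            for key, value in map_fn(item):
                grouped[key].append(value)
        return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# The analysis is written once, independent of any engine.
def word_map(line):
    for word in line.split():
        yield word.lower(), 1

def word_reduce(word, counts):
    return sum(counts)

engine = LocalEngine()   # swap in a different engine without touching the analysis
print(engine.execute(["a b a", "b c"], word_map, word_reduce))
# {'a': 2, 'b': 2, 'c': 1}
```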
With the increasing popularity of cloud computing, there are more and more requirements for scientific workflows to utilize cloud resources. In this paper, we present our preliminary work and experiences on enabling interaction between the Kepler scientific workflow system and the Amazon Elastic Compute Cloud (EC2). A set of EC2 actors and Kepler Amazon Machine Images are introduced, with a discussion of their different usage modes.
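For context, a present-day equivalent of what such an EC2 actor does might look like the boto3 sketch below: launch an instance from an image, wait for it to run, and record its address. The AMI ID, key pair, instance type, and region are placeholders, and boto3 is used here for illustration rather than whatever client the original actors wrapped.

```python
# Minimal sketch of the kind of EC2 interaction such actors perform: launch
# an instance from an image, wait until it is running, and record its
# address. AMI ID, key pair, instance type, and region are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-west-2")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Kepler AMI
    InstanceType="t3.medium",
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1, MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print("workflow host ready at", instance.public_dns_name)
```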
Comprehensive, end-to-end data and workflow management solutions are needed to handle the increasing complexity of processes and data volumes associated with modern distributed scientific problem solving, such as ultra-scale simulations and high-throughput experiments. The key to the solution is an integrated network-based framework that is functional, dependable, fault-tolerant, and supports data and process provenance.
Experimental science can be thought of as the exploration of a large research space in search of a few valuable results. While it is this “Golden Data” that gets published, the history of the exploration is often as valuable to the scientists as some of its outcomes. We envision an e-research infrastructure that is capable of systematically and automatically recording such history, an assumption that holds today for a number of workflow management systems routinely used in e-science.
Slide presentation: Ilkay Altintas, Director, Scientific Workflow Automation Technologies Laboratory, San Diego Supercomputer Center, UCSD. "Astrophysics Workflows in the Kepler System." The talk covers supporting the scientist (John Blondin, NC State; Terascale Supernova Initiative, SciDAC, DOE) and the path from conceptual workflows ("napkin drawings") to executable workflows.
Scientific discoveries are often the result of methodical execution of many interrelated scientific workflows, where workflows and datasets published by one set of users can be used by other users to perform subsequent analyses, leading to implicit or explicit collaboration. In this paper, we describe a data model for “collaborative provenance” that extends common workflow provenance models by introducing attributes for characterizing the nature of user collaborations as well as their strength (or weight).
Procedia Computer …, Jan 1, 2010
e-Science and Grid …, Jan 1, 2006