Ilkay Altintas | University of California, San Diego
Papers by Ilkay Altintas
Advances in satellite imagery present unprecedented opportunities for understanding natural and social phenomena at global and regional scales. Although the field of satellite remote sensing has addressed questions imperative to human and environmental sustainability, scaling those techniques to very high spatial resolutions at regional scales remains a challenge. Satellite imagery is now more accessible, with greater spatial, spectral and temporal resolution, creating a data bottleneck in identifying the content of images. Because satellite images are unlabeled, unsupervised methods allow us to organize images into coherent groups or clusters. However, the performance of unsupervised methods, like all other machine learning methods, depends on features. Recent studies using features from pre-trained networks have shown promise for learning on new datasets. This suggests that features from pre-trained networks can be used for learning in temporally and spatially dynamic data sources such as satellite imagery. It is not clear, however, which features from which layer and network architecture should be used for learning new tasks. In this paper, we present an approach to evaluate the transferability of features from pre-trained Deep Convolutional Neural Networks for satellite imagery. We explore and evaluate different features and feature combinations extracted from various deep network architectures, systematically evaluating over 2,000 network-layer combinations. In addition, we test the transferability of our engineered features and learned features from an unlabeled dataset to a different labeled dataset. Our feature engineering and learning are done on the unlabeled Draper Satellite Chronology dataset, and we test on the labeled UC Merced Land dataset, achieving near state-of-the-art classification results. These results suggest that even with minimal or no training, these networks can generalize well to other datasets. This method could be useful for clustering unlabeled images and other unsupervised machine learning tasks.
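As a rough illustration of the kind of pipeline this abstract describes, the sketch below extracts pooled features from a pre-trained CNN and clusters unlabeled image tiles. It is a minimal sketch, not the paper's evaluated configuration; the choice of ResNet-50, the tiles/ directory, and the cluster count are assumptions made here for illustration.

```python
# Minimal sketch (not the paper's exact pipeline): extract features from a
# pre-trained CNN and cluster unlabeled satellite image tiles. Model choice
# (ResNet-50), layer, image folder, and cluster count are illustrative.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.cluster import KMeans
import numpy as np

# Pre-trained backbone; strip the classification head to expose pooled features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# ImageFolder expects at least one subdirectory under tiles/.
loader = DataLoader(ImageFolder("tiles/", transform=transform), batch_size=64)

features = []
with torch.no_grad():
    for images, _ in loader:
        features.append(backbone(images).numpy())  # 2048-d pooled features
features = np.concatenate(features)

# Group tiles into coherent clusters without labels.
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(features)
```

In this setup the backbone is never fine-tuned; swapping in a different architecture or layer only changes how `features` is computed, which is the axis the paper evaluates.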
The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance related to a single run, mostly executed by a single user. However, a scientific discovery is often the result of methodical execution of many scientific workflows with many datasets produced at different times by one or more users. Further, to promote and facilitate exchange of information between multiple workflow systems supporting provenance, the Open Provenance Model (OPM) has been proposed by the scientific workflow community. In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps to answer collaborative queries, e.g., identifying combined workflows and contributions of users collaborating on a project based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable use-case scenario, where different stages of the workflow are executed in three different workflow environments: Kepler, Taverna, and WSVLAM. Through this use case, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.
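To make the notion of a collaborative query concrete, here is a toy example, written in plain Python with networkx rather than in QLP or an OPM serialization, of detecting an implicit collaboration: one user's run consumed a data product generated by another user's run. All node names and the graph encoding are hypothetical.

```python
# Toy illustration (not QLP or OPM syntax) of a collaborative provenance
# query: users are linked when one user's workflow run consumed a data
# product generated by another user's run. Node and edge names are made up.
import networkx as nx

prov = nx.DiGraph()
# "generated" points run -> data; "used" points data -> run.
prov.add_edge(("run", "wf1_r1", "alice"), ("data", "d1"), rel="generated")
prov.add_edge(("data", "d1"), ("run", "wf2_r1", "bob"), rel="used")
prov.add_edge(("run", "wf2_r1", "bob"), ("data", "d2"), rel="generated")

def collaborations(g):
    """Yield (producer, consumer, data product) triples linking two users."""
    for node in g.nodes:
        if node[0] != "data":
            continue
        producers = [u for u, _ in g.in_edges(node)]
        consumers = [v for _, v in g.out_edges(node)]
        for p in producers:
            for c in consumers:
                if p[2] != c[2]:
                    yield p[2], c[2], node[1]

print(list(collaborations(prov)))  # [('alice', 'bob', 'd1')]
```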
The ROADNet project concentrates real-time data from a wide variety of signal domains, providing a reliable platform to store and transport these data. Ptolemy is a general-purpose visual programming environment in which workflows on data streams can be constructed by connecting general-purpose components. The Kepler scientific workflow system extends Ptolemy to support the design and automation of scientific data analysis tasks.
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) is an eScience project to enable the microbial ecology community to manage the challenges of metagenomics analysis. CAMERA supports extensive metadata-based data acquisition and access, as well as execution of metagenomics experiments through standard and customized scientific workflows.
Background: As more and more microarray data sets become available to the public, the opportunities to conduct data meta-analyses increase. Meta-analyses of microarray data consist of general steps such as downloading, decompressing, classifying, quality controlling and normalizing. These steps are time-consuming if they are not automated. A workflow is needed to automate these general steps, improve efficiency and standardize the analyses.
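The sketch below strings those general steps into a small script to show how such a pipeline might be automated. File layouts and the placeholder quality-control and normalization hooks are assumptions for illustration, not the authors' actual workflow.

```python
# Schematic outline of the general steps (download, decompress, classify,
# quality-control, normalize) chained as a small pipeline. File names and
# the placeholder step hooks are illustrative, not the authors' workflow.
import gzip
import shutil
from pathlib import Path

def decompress(src: Path, dest_dir: Path) -> Path:
    out = dest_dir / src.stem          # e.g. GSM123.CEL.gz -> GSM123.CEL
    with gzip.open(src, "rb") as fin, open(out, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return out

def run_pipeline(raw_files, work_dir: Path):
    work_dir.mkdir(exist_ok=True)
    results = []
    for gz in raw_files:
        cel = decompress(gz, work_dir)
        # Placeholder hooks for the remaining steps; in practice these would
        # call array-platform-specific QC and normalization routines.
        results.append({"file": cel, "qc": "pending", "normalized": "pending"})
    return results

if __name__ == "__main__":
    files = sorted(Path("downloads").glob("*.CEL.gz"))
    print(run_pipeline(files, Path("work")))
```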
Data quality is an important component of modern scientific discovery. Many scientific discovery processes consume data from a diverse array of resources such as streaming sensor networks, web services, and databases. The validity of a scientific computation's results is highly dependent on the quality of these input data. Scientific workflow systems are being increasingly used to automate scientific computations by facilitating experiment design, data capture, integration, processing, and analysis.
The notion of sharing scientific data has only recently begun to gain ground in science, where data is still considered a private asset. There is growing evidence, however, that the benefits of scientific collaboration through early data sharing during the course of a science project may outweigh the risk of losing exclusive ownership of the data. As exemplar success stories make the headlines [1], principles of effective information sharing have become the subject of e-science research.
To execute workflows on a compute cluster, workflow engines can work with cluster resource manager software to distribute jobs to compute nodes on the cluster. We discuss how to interact with the traditional Oracle Grid Engine and more recent Hadoop cluster resource managers using a dataflow-based scheduling approach to balance compute resource load for data-parallel workflow execution.
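As a toy illustration of targeting different resource managers from the same workflow step, the wrapper below dispatches a job either through Grid Engine's qsub or as a Hadoop streaming job. The wrapper itself and all paths are hypothetical and do not reflect Kepler's actual scheduler.

```python
# Toy dispatcher showing the idea of targeting either a Grid Engine or a
# Hadoop resource manager from the same workflow step. The wrapper is
# hypothetical; the script and jar paths are placeholders.
import subprocess

def submit(job_script: str, engine: str = "sge"):
    if engine == "sge":
        # Grid Engine batch submission.
        cmd = ["qsub", job_script]
    elif engine == "hadoop":
        # Hadoop streaming job; mapper/reducer and I/O paths are illustrative.
        cmd = ["hadoop", "jar", "hadoop-streaming.jar",
               "-input", "in/", "-output", "out/",
               "-mapper", job_script, "-reducer", "cat"]
    else:
        raise ValueError(f"unknown engine: {engine}")
    return subprocess.run(cmd, capture_output=True, text=True)
```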
Due to the enormous complexity of computer systems, researchers use simulators to model system behavior and generate quantitative estimates of expected performance. Researchers also use simulators to model and assess the efficacy of future enhancements and novel systems. Arguably the most important tools available to computer architecture researchers, simulators offer a balance of cost, timeliness, and flexibility.
With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive, end-to-end data management solutions ranging from initial data acquisition to final analysis and visualization.
Distributed Data-Parallel (DDP) patterns such as MapReduce have become increasingly popular as solutions to facilitate data-intensive applications, resulting in a number of systems supporting DDP workflows. Yet applications or workflows built using these patterns are usually tightly coupled with the underlying DDP execution engine they select. We present a framework for distributed data-parallel execution in the Kepler scientific workflow system that enables users to easily switch between different DDP execution engines.
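The decoupling idea can be sketched as expressing an analysis once as map and reduce functions and binding it to an engine at run time. The LocalEngine below is a stand-in written for this illustration and is not Kepler's actual API.

```python
# Sketch of decoupling a map/reduce-style analysis from the engine that runs
# it. LocalEngine is a stand-in; a real deployment would bind the same
# functions to Hadoop, Spark, etc. This is not Kepler's actual API.
from collections import defaultdict

class LocalEngine:
    """Reference engine: runs map and reduce in-process."""
    def execute(self, data, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for item in data:
            for key, value in map_fn(item):
                grouped[key].append(value)
        return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# The analysis is written once, independent of any engine.
def word_map(line):
    for word in line.split():
        yield word.lower(), 1

def word_reduce(word, counts):
    return sum(counts)

engine = LocalEngine()   # swap in a different engine without touching the analysis
print(engine.execute(["a b a", "b c"], word_map, word_reduce))
# {'a': 2, 'b': 2, 'c': 1}
```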
With the increasing popularity of cloud computing, there are more and more requirements for scientific workflows to utilize cloud resources. In this paper, we present our preliminary work and experiences on enabling interaction between the Kepler scientific workflow system and the Amazon Elastic Compute Cloud (EC2). A set of EC2 actors and Kepler Amazon Machine Images are introduced, with a discussion of their different usage modes.
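For context, a present-day equivalent of what such an EC2 actor does might look like the boto3 sketch below: launch an instance from an image, wait for it to run, and record its address. The AMI ID, key pair, instance type, and region are placeholders, and boto3 is used here for illustration rather than whatever client the original actors wrapped.

```python
# Minimal sketch of the kind of EC2 interaction such actors perform: launch
# an instance from an image, wait until it is running, and record its
# address. AMI ID, key pair, instance type, and region are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-west-2")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder Kepler AMI
    InstanceType="t3.medium",
    KeyName="my-keypair",              # placeholder key pair
    MinCount=1, MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()
print("workflow host ready at", instance.public_dns_name)
```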
Comprehensive, end-to-end data and workflow management solutions are needed to handle the increasing complexity of processes and data volumes associated with modern distributed scientific problem solving, such as ultra-scale simulations and high-throughput experiments. The key to the solution is an integrated network-based framework that is functional, dependable, fault-tolerant, and supports data and process provenance.
Experimental science can be thought of as the exploration of a large research space in search of a few valuable results. While it is this “Golden Data” that gets published, the history of the exploration is often as valuable to the scientists as some of its outcomes. We envision an e-research infrastructure that is capable of systematically and automatically recording such history, an assumption that holds today for a number of workflow management systems routinely used in e-science.
Slide presentation: Ilkay Altintas, Director, Scientific Workflow Automation Technologies Laboratory, San Diego Supercomputer Center, UCSD. "Astrophysics Workflows in the Kepler System." The talk covers supporting the scientist (John Blondin, NC State; Terascale Supernova Initiative, SciDAC, DOE) and the path from conceptual workflows ("napkin drawings") to executable workflows.
Scientific discoveries are often the result of methodical execution of many interrelated scientific workflows, where workflows and datasets published by one set of users can be used by other users to perform subsequent analyses, leading to implicit or explicit collaboration. In this paper, we describe a data model for “collaborative provenance” that extends common workflow provenance models by introducing attributes for characterizing the nature of user collaborations as well as their strength (or weight).
Procedia Computer …, Jan 1, 2010
e-Science and Grid …, Jan 1, 2006