Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining
Related papers
Distributed data mining in the National Virtual Observatory
Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, 2003
The astronomy research community is about to become the beneficiary of huge multi-terabyte databases from a host of sky surveys. The rich and diverse information content within this "virtual sky" and the array of results to be derived therefrom will far exceed the current capacity of data search and research tools. The new digital surveys have the potential of facilitating a wide range of scientific discoveries about the Universe! To enable this to happen, the astronomical community is embarking on an ambitious endeavor, the creation of a National Virtual Observatory (NVO). This will in fact develop into a Global Virtual Observatory. To facilitate the new type of science enabled by the NVO, new techniques in data mining and knowledge discovery in large databases must be developed and deployed, and the next generation of astronomers must be trained in these techniques. This activity will benefit greatly from developments in the fields of information technology, computer science, and statistics. Aspects of the NVO initiative, including sample science user scenarios and user requirements, will be presented. The value of scientific data mining and some early test case results will be discussed in the context of the speaker's research interests in colliding and merging galaxies.
Mining knowledge in astrophysical massive data sets
2010
Modern scientific data mainly consist of huge datasets gathered by a very large number of techniques and stored in highly diversified and often incompatible data repositories. More generally, in the e-science environment, it is considered a critical and urgent requirement to integrate services across distributed, heterogeneous, dynamic "virtual organizations" formed by different resources within a single enterprise. In the last decade, Astronomy has become an immensely data-rich field due to the evolution of detectors (plates to digital to mosaics), telescopes and space instruments. The Virtual Observatory approach consists of the federation, under common standards, of all astronomical archives available worldwide, as well as data analysis, data mining and data exploration applications. The main drive behind this effort is that, once the infrastructure is completed, it will allow a new type of multi-wavelength, multi-epoch science which can only barely be imagined. Data Mining, or Knowledge Discovery in Databases, while being the main methodology for extracting the scientific information contained in such Massive Data Sets (MDS), poses crucial problems, since it has to orchestrate complex issues posed by transparent access to different computing environments, scalability of algorithms, reusability of resources, etc. In the present paper we summarize the present status of the MDS in the Virtual Observatory and what is currently being done and planned to bring advanced Data Mining methodologies to bear in the case of the DAME (DAta Mining & Exploration) project.
Extracting Knowledge from Massive Astronomical Data Sets
Astrostatistics and Data Mining, 2012
The exponential growth of astronomical data collected by both ground-based and space-borne instruments has fostered the growth of Astroinformatics: a new discipline lying at the intersection between astronomy, applied computer science, and information and communication technologies (ICT). At the very heart of Astroinformatics is a complex set of methodologies usually called Data Mining (DM) or Knowledge Discovery in Data Bases (KDD). In the astronomical domain, DM/KDD are still at a very early stage of usage, even though new methods and tools are being continuously deployed in order to cope with the Massive Data Sets (MDS) that can only grow in the future. In this paper, we briefly outline some general problems encountered when applying DM/KDD methods to astrophysical problems, and describe the DAME (DAta Mining & Exploration) web application. While specifically tailored to work on MDS, DAME can also be effectively applied to smaller data sets. As an illustration, we describe two applications of DAME to two different problems: the identification of candidate globular clusters in external galaxies, and the classification of active galactic nuclei (AGN). We believe that tools and services of this nature will become increasingly necessary for data-intensive astronomy (and indeed all sciences) in the 21st century.
DAMEWARE: A Web Cyberinfrastructure for Astrophysical Data Mining
Publications of the Astronomical Society of the Pacific, 2014
Astronomy is undergoing a methodological revolution triggered by an unprecedented wealth of complex and accurate data. The new panchromatic, synoptic sky surveys require advanced tools for discovering patterns and trends hidden behind data which are both complex and of high dimensionality. We present DAMEWARE (DAta Mining & Exploration Web Application & REsource): a general purpose, web-based, distributed data mining environment developed for the exploration of large data sets, and finely tuned for astronomical applications.
arXiv: Instrumentation and Methods for Astrophysics, 2020
We report the outcomes of a survey that explores the current practices, needs and expectations of the astrophysics community, concerning four research aspects: open science practices, data access and management, data visualization, and data analysis. The survey, involving 329 professionals from several research institutions, pinpoints significant gaps in matters such as results reproducibility, availability of visual analytics tools and adoption of Machine Learning techniques for data analysis. This research is conducted in the context of the H2020 NEANIAS project.
Scientific Data Mining in Astronomy
We describe the application of data mining algorithms to research problems in astronomy. We posit that data mining has always been fundamental to astronomical research, since data mining is the basis of evidence-based discovery, including classification, clustering, and novelty discovery. These algorithms represent a major set of computational tools for discovery in large databases, which will be increasingly essential in the era of data-intensive astronomy. Historical examples of data mining in astronomy are reviewed, followed by a discussion of one of the largest data-producing projects anticipated for the coming decade: the Large Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in astronomy, we envision a new data-oriented research paradigm for astronomy and astrophysics -- astroinformatics. Astroinformatics is described as both a research approach and an educational imperative for modern data-intensive astronomy. An important application area for large tim...
Distributed data mining for astronomy catalogs
SDM Workshop on …, 2006
The design, implementation, and archiving of very large sky surveys are playing an increasingly important role in today's astronomy research. However, these data archives will necessarily be geographically distributed. To fully exploit the potential of this data lode, we believe that capabilities ought to be provided allowing users a more communication-efficient alternative to multiple archive data analysis than first downloading the archives fully to a centralized site. In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs. The system is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully downloaded to a centralized server. To illustrate the potential effectiveness of our system, we develop communication-efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such as a reduction of a 3-dimensional data distribution (in astronomers' observed units) to a planar data distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through efficient access to distributed multi-dimensional data sets.
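The idea of running PCA across archives without centralizing the raw catalogs can be illustrated with a small sketch. This is not DEMAC's algorithm, which the abstract does not spell out. It is one common approach, assumed here for illustration, in which each site sends only its row count, column sums, and sum of outer products, and a coordinator pools these into a global covariance matrix. The function names (`local_summary`, `pooled_covariance`, `plane_normal`, `make_site`) and the synthetic plane z = 2x - y are hypothetical. The code uses only the Python standard library; in practice an eigen-solver such as `numpy.linalg.eigh` would be the natural choice.

```python
import random


def local_summary(rows):
    """Per-site sufficient statistics: count, column sums, sum of outer products."""
    d = len(rows[0])
    sums = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            sums[i] += r[i]
            for j in range(d):
                outer[i][j] += r[i] * r[j]
    return len(rows), sums, outer


def pooled_covariance(summaries):
    """Combine per-site summaries into the global mean and sample covariance."""
    d = len(summaries[0][1])
    n = sum(s[0] for s in summaries)
    sums = [sum(s[1][i] for s in summaries) for i in range(d)]
    outer = [[sum(s[2][i][j] for s in summaries) for j in range(d)]
             for i in range(d)]
    mean = [x / n for x in sums]
    cov = [[(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov


def plane_normal(cov, iters=500, seed=0):
    """Unit eigenvector of the smallest eigenvalue of cov.

    Uses power iteration on (trace * I - cov), whose dominant eigenvector
    is the direction of least variance, i.e. the normal of the best-fit plane.
    """
    d = len(cov)
    trace = sum(cov[i][i] for i in range(d))
    shifted = [[(trace if i == j else 0.0) - cov[i][j] for j in range(d)]
               for i in range(d)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start vector
    for _ in range(iters):
        w = [sum(shifted[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def make_site(rng, n):
    """Hypothetical synthetic catalog: points near the plane z = 2x - y."""
    rows = []
    for _ in range(n):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append([x, y, 2.0 * x - y + rng.gauss(0.0, 0.01)])
    return rows


# Three simulated archives; only the small summaries leave each site.
rng = random.Random(1)
sites = [make_site(rng, 200) for _ in range(3)]
mean, cov = pooled_covariance([local_summary(s) for s in sites])
print("estimated plane normal:", plane_normal(cov))
```

For d parameters, each site transmits on the order of d² numbers regardless of catalog size, which is the sense in which such a scheme is communication-efficient. The recovered normal should be parallel, up to sign, to (2, -1, -1), the normal of the synthetic plane.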
Data mining and knowledge discovery resources for astronomy in the web 2.0 age
Software and Cyberinfrastructure for Astronomy II, 2012
The emerging field of AstroInformatics, while on the one hand appearing crucial for facing the technological challenges, is on the other hand opening exciting new perspectives for astronomical discoveries through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for intensive use of ICT. The DAME (DAta Mining & Exploration) Program exposes a series of web-based services to perform scientific investigation on massive astronomical data sets. The engineering design and requirements, which have driven its development since the beginning of the project, are oriented towards a new paradigm of Web-based resources, reflecting the final goal of becoming a prototype of an efficient data mining framework in the data-centric era.
Virtual Data Cosmos -- Information Design in Modern Astronomy
2021
Where do cosmic X-rays come from? Every new unidentified X-ray source could potentially revolutionize our understanding of the universe. The international collaborative astronomy project EXTraS aimed at automatically classifying new sources of X-ray emission (e.g., stars or galaxies) in the large observation database of the X-ray satellite XMM-Newton. Because data archives have reached the dimensions of big data, astronomers used different machine-learning (ML) random forest decision tree algorithms to perform the classification process. In this bachelor thesis in information design, I was interested in the challenge of visualizing these big data sets and the results of the ML algorithms in an interactive and intuitive way, to facilitate the visual exploration of their internal structures and relationships. The VIRTUAL DATA COSMOS is an interactive data visualization tool in virtual reality (VR) for scientists to explore multidimensional data sets.