Distributed data mining in the National Virtual Observatory

Data Mining in Distributed Databases for Interacting Galaxies

Astronomical Data Analysis Software and Systems XIV, 2005

We present results from an exploratory data mining project to identify classification features of special classes of interacting galaxies (for example, infrared-luminous galaxies) within distributed astronomical databases. Using a variety of data mining techniques, interaction-specific features are learned, to distinguish this class of galaxies from a control sample of normal galaxies. Eventually, the corresponding rule-based feature model of that class of galaxies will be applied to the large multi-wavelength astronomical survey databases that are becoming available. This distributed data mining activity is a prototype science use case for the VO (Virtual Observatory). We specifically apply multi-archive multi-wavelength data to the problem. In a preliminary validation experiment, we recovered exactly the type of object that we hope to find automatically with our data mining tools: a distant hyper-luminous infrared galaxy (HyLIRG), the most luminous class of known galaxies. This particular galaxy was previously known, but we re-discovered it serendipitously.

Mining knowledge in astrophysical massive data sets

2010

Modern scientific data consist mainly of huge datasets gathered by a very large number of techniques and stored in highly diversified and often incompatible data repositories. More generally, in the e-science environment, integrating services across distributed, heterogeneous, dynamic "virtual organizations" formed by different resources within a single enterprise is considered a critical and urgent requirement. In the last decade, astronomy has become an immensely data-rich field due to the evolution of detectors (plates to digital to mosaics), telescopes and space instruments. The Virtual Observatory approach consists of the federation, under common standards, of all astronomical archives available worldwide, as well as data analysis, data mining and data exploration applications. The main driver behind this effort is that, once the infrastructure is completed, it will allow a new type of multi-wavelength, multi-epoch science which can barely be imagined. Data Mining, or Knowledge Discovery in Databases, is the main methodology for extracting the scientific information contained in such Massive Data Sets (MDS), but it poses crucial problems, since it has to orchestrate transparent access to different computing environments, scalability of algorithms, reusability of resources, etc. In the present paper we summarize the present status of the MDS in the Virtual Observatory and what is currently being done and planned to bring advanced Data Mining methodologies to it, in the case of the DAME (DAta Mining & Exploration) project.

Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining

The knowledge discovery potential of the new large astronomical databases is vast. When these are used in conjunction with the rich legacy data archives, the opportunities for scientific discovery multiply rapidly. A Virtual Observatory (VO) framework will enable transparent and efficient access, search, retrieval, and visualization of data across multiple data repositories, which are generally heterogeneous and distributed. Aspects of data mining that apply to a variety of science user scenarios with a VO are reviewed. The development of a VO should address the data mining needs of various astronomical research constituencies. By way of example, two user scenarios are presented which invoke applications and linkages of data across the catalog and image domains in order to address specific astrophysics research problems. These illustrate a subset of the desired capabilities and power of the VO, and as such they represent potential components of a VO Design Reference Mission.

Distributed data mining for astronomy catalogs

SDM Workshop on …, 2006

The design, implementation, and archiving of very large sky surveys are playing an increasingly important role in today's astronomy research. However, these data archives will necessarily be geographically distributed. To fully exploit the potential of this data lode, we believe that users should be given a more communication-efficient way to analyze multiple archives than first downloading the archives in full to a centralized site. In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs. The system is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully downloaded to a centralized server. To illustrate the potential effectiveness of our system, we develop communication-efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such as a reduction of a 3-dimensional data distribution (in astronomers' observed units) to a planar data distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through efficient access to distributed multi-dimensional data sets.
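
The idea of running PCA without centralizing the catalogs can be illustrated with a minimal sketch. It is not the DEMAC algorithm, whose details the abstract does not give. It uses the textbook pooled-covariance approach and assumes that each archive holds different rows (objects) with the same three columns. Each site sends only a count, a vector of column sums and a small matrix of cross-products. The function names, the synthetic data and the two-site split are all hypothetical.

```python
import numpy as np

def local_summary(X):
    """Run at each archive: row count, column sums, sum of outer products."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), X.T @ X

def combine_and_pca(summaries):
    """Run at the coordinator: pool the summaries, then eigen-decompose."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    scatter = sum(s[2] for s in summaries)
    mean = total / n
    # Pooled sample covariance, identical to the covariance of the full data.
    cov = (scatter - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    return mean, eigvals[order], eigvecs[:, order]

# Synthetic stand-in for a 3-parameter catalog whose points lie near a plane
# (assumed data, not a real fundamental-plane sample).
rng = np.random.default_rng(0)
uv = rng.normal(size=(600, 2))
X = np.column_stack([uv[:, 0], uv[:, 1], 1.2 * uv[:, 0] - 0.8 * uv[:, 1]])
X += 0.01 * rng.normal(size=X.shape)
site_a, site_b = X[:250], X[250:]              # hypothetical two-archive split

mean, eigvals, eigvecs = combine_and_pca(
    [local_summary(site_a), local_summary(site_b)]
)
explained = eigvals / eigvals.sum()            # variance fraction per component
plane_normal = eigvecs[:, -1]                  # direction of least variance
```

In this sketch each site transmits 1 + 3 + 9 numbers regardless of how many objects it holds, and a near-zero last entry in `explained` indicates that the three parameters lie close to a plane. Catalogs that hold different attributes of the same objects would need a different scheme, since the cross-products between columns stored at different sites cannot be formed locally.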

Extracting Knowledge from Massive Astronomical Data Sets

Astrostatistics and Data Mining, 2012

The exponential growth of astronomical data collected by both ground-based and space-borne instruments has fostered the growth of Astroinformatics: a new discipline lying at the intersection between astronomy, applied computer science, and information and computation (ICT) technologies. At the very heart of Astroinformatics is a complex set of methodologies usually called Data Mining (DM) or Knowledge Discovery in Data Bases (KDD). In the astronomical domain, DM/KDD are still at a very early stage of use, even though new methods and tools are being continuously deployed in order to cope with the Massive Data Sets (MDS) that can only grow in the future. In this paper, we briefly outline some general problems encountered when applying DM/KDD methods to astrophysical problems, and describe the DAME (DAta Mining & Exploration) web application. While specifically tailored to work on MDS, DAME can also be applied effectively to smaller data sets. As an illustration, we describe two applications of DAME to different problems: the identification of candidate globular clusters in external galaxies, and the classification of active galactic nuclei (AGN). We believe that tools and services of this nature will become increasingly necessary for data-intensive astronomy (and indeed all sciences) in the 21st century.

DAMEWARE: A Web Cyberinfrastructure for Astrophysical Data Mining

Publications of the Astronomical Society of the Pacific, 2014

Astronomy is undergoing a methodological revolution triggered by an unprecedented wealth of complex and accurate data. The new panchromatic, synoptic sky surveys require advanced tools for discovering patterns and trends hidden behind data which are both complex and of high dimensionality. We present DAMEWARE (DAta Mining & Exploration Web Application & REsource): a general purpose, web-based, distributed data mining environment developed for the exploration of large data sets, and finely tuned for astronomical applications.

Scientific Data Mining in Astronomy

We describe the application of data mining algorithms to research problems in astronomy. We posit that data mining has always been fundamental to astronomical research, since data mining is the basis of evidence-based discovery, including classification, clustering, and novelty discovery. These algorithms represent a major set of computational tools for discovery in large databases, which will be increasingly essential in the era of data-intensive astronomy. Historical examples of data mining in astronomy are reviewed, followed by a discussion of one of the largest data-producing projects anticipated for the coming decade: the Large Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in astronomy, we envision a new data-oriented research paradigm for astronomy and astrophysics -- astroinformatics. Astroinformatics is described as both a research approach and an educational imperative for modern data-intensive astronomy. An important application area for large tim...

Data mining and knowledge discovery resources for astronomy in the web 2.0 age

Software and Cyberinfrastructure for Astronomy II, 2012

The emerging field of AstroInformatics, while on the one hand crucial to facing the technological challenges, is on the other opening exciting new perspectives for astronomical discoveries through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for intensive use of ICT. The DAME (DAta Mining & Exploration) Program exposes a series of web-based services for performing scientific investigation on massive astronomical data sets. The engineering design and requirements, which have driven its development since the beginning of the project, are oriented towards a new paradigm of Web-based resources, reflecting the final goal of becoming a prototype of an efficient data mining framework in the data-centric era.

A communication efficient and scalable distributed data mining for the astronomical data

Astronomy and Computing, 2016

In 2020, ∼60 PB of archived data will be accessible to astronomers. Analyzing such a vast volume of data, however, will be a challenging task. This is mainly due to the computational model in use, in which data are downloaded from complex, geographically distributed archives to a central site and then analyzed on local systems. Because the data have to be downloaded to the central site, network bandwidth limitations will hinder scientific discovery. Analyzing PB-scale data on local machines in a centralized manner is also challenging. The virtual observatory (VO) is a step towards addressing this problem; however, it does not provide a data mining model. Adding a distributed data mining layer to the VO can be the solution: astronomers download the knowledge instead of the raw data, and can then either reconstruct the data from the downloaded knowledge or use the knowledge directly for further analysis. Therefore, in this paper, we present Distributed Load Balancing Principal Component Analysis, which optimally distributes the computation among the available nodes to minimize the transmission cost and the downloading cost for the end user. The experimental analysis is done with Fundamental Plane (FP) data, Gadotti data and complex Mfeat data. In terms of transmission cost, our approach performs better than those of Qi et al. and Yue et al. The analysis shows that, with the complex Mfeat data, the downloading cost for the end user can be reduced by ∼90% with negligible loss in accuracy.
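
The notion of downloading "knowledge" instead of raw data, and reconstructing the data from it, can be sketched with plain single-site PCA compression. This is an assumption-laden illustration, not the Distributed Load Balancing PCA of the paper: it ignores load balancing and distribution entirely and uses synthetic data with made-up sizes. The function names `compress` and `reconstruct` are hypothetical.

```python
import numpy as np

def compress(X, k):
    """Archive side: keep the mean, the top-k principal directions and scores."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]
    scores = (X - mean) @ components.T
    return mean, components, scores

def reconstruct(mean, components, scores):
    """User side: approximate the original table from the downloaded pieces."""
    return scores @ components + mean

# Synthetic table: 500 objects, 30 correlated features driven by 3 latent
# factors plus small noise (assumed sizes, chosen only for illustration).
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 30))

mean, components, scores = compress(X, k=3)
X_hat = reconstruct(mean, components, scores)

raw_size = X.size                                         # numbers in raw table
sent_size = mean.size + components.size + scores.size     # numbers downloaded
saving = 1 - sent_size / raw_size
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

With these made-up sizes the user downloads 1,620 numbers instead of 15,000. The saving depends entirely on how strongly the features are correlated, so it says nothing about the figures reported for the Mfeat data.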