Unifying biological image formats with HDF5

Communications of the ACM, 2009

https://doi.org/10.1145/1562764.1562781

Abstract

The biosciences need an image format capable of high performance and long-term maintenance. Is HDF5 the answer?

By Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, and Christoph Best

Related articles on queue.acm.org: Catching disk latency in the act, http://queue.acm.org/detail.cfm?id=1483106

Battle of the Defaults: Extracting Performance Characteristics of HDF5 under Production Load

2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2021

Popular parallel I/O libraries, such as HDF5, provide tuning parameters to obtain superior performance. However, selecting effective parameters on production systems is complex due to the interdependence of the I/O software and file system layers, so application developers typically use the default parameters and often experience poor I/O performance. This work conducts a benchmarking-based analysis of HDF5 behaviors with a wide variety of I/O patterns to extract performance characteristics under production workloads. To keep the analysis well controlled, we exercise I/O benchmarks on POSIX-IO, MPI-IO, and HDF5 using the same I/O patterns and in the same jobs. To address the high performance variability of production environments, we repeat the benchmarks across I/O patterns, storage devices, and time intervals. Based on the results, we identified consistent HDF5 behaviors showing that appropriate configuration of dataset layout and file-metadata placement can improve performance significantly. We apply our findings and evaluate the tuned I/O library on two supercomputers, Summit and Cori; our solution achieves more than 10× speedup over the defaults on both systems, suggesting its effectiveness, stability, and generality. In summary:

• We introduced a benchmarking approach to understand the I/O performance of large-scale production systems.
• We identified efficient HDF5 alignment and file-metadata optimizations and built the chosen configurations into HDF5 as the default. Our solution is adopted by OLCF for production use and is publicly available.
• We evaluated the tuned HDF5 on Summit and Cori, achieving more than 10× speedup on both systems, suggesting that our solution is consistently effective across systems.
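For concreteness, the alignment and file-metadata placement parameters discussed above are exposed through HDF5's file-access property list. Below is a minimal h5py sketch of setting them via the low-level API; the specific values are illustrative assumptions, not the tuned defaults the authors shipped. Recent h5py releases also expose these as File() keyword arguments (alignment_threshold, alignment_interval, meta_block_size), if available in your version.

```python
import numpy as np
import h5py

# File-access property list: alignment and metadata-aggregation knobs.
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_alignment(4096, 1024 * 1024)   # align objects >= 4 KiB on 1 MiB boundaries (illustrative)
fapl.set_meta_block_size(1024 * 1024)   # aggregate file metadata into 1 MiB blocks (illustrative)

# Create the file with the tuned property list, then use the high-level API.
fid = h5py.h5f.create(b"tuned.h5", h5py.h5f.ACC_TRUNC, fapl=fapl)
with h5py.File(fid) as f:
    f.create_dataset("frames", data=np.zeros((256, 1024), dtype="f4"))
```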

A Study of NetCDF as an Approach for High Performance Medical Image Storage

Journal of Physics: Conference Series, 2012

The spread of telemedicine systems increases every day, and systems and PACS based on DICOM images have become common. This rise reflects the need to develop new storage systems that are more efficient and have lower computational costs. With this in mind, this article presents a study of the NetCDF data format as a basic platform for storage of DICOM images. The case study compares an ordinary database, HDF5, and NetCDF for storing the medical images. Empirical results, using a real set of images, indicate that retrieving large images from NetCDF has higher latency than the other two methods. In addition, the latency is proportional to the file size, which represents a drawback for a telemedicine system characterized by a large number of large image files.
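To make the comparison concrete, here is a hedged sketch of the NetCDF side using the netCDF4 Python bindings; the pixel array and the single attribute are synthetic stand-ins for real DICOM pixel data and tags. Note that NetCDF-4 files are themselves stored in HDF5 under the hood, which makes latency differences between the two libraries particularly interesting.

```python
import numpy as np
from netCDF4 import Dataset

# Synthetic stand-in for a 16-bit DICOM pixel array.
img = np.random.randint(0, 4096, (512, 512), dtype=np.uint16)

with Dataset("images.nc", "w") as ds:      # default NETCDF4 format: HDF5-based
    ds.createDimension("y", img.shape[0])
    ds.createDimension("x", img.shape[1])
    var = ds.createVariable("image_0001", "u2", ("y", "x"), zlib=True)
    var[:] = img
    var.Modality = "CT"                    # selected DICOM tags stored as attributes
```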

H5Prov: I/O Performance Analysis of Science Applications Using HDF5 File-level Provenance

2019

Systematic capture of extensive, useful science metadata and provenance requires an easy-to-use strategy to automatically record information throughout the data life cycle, without posing significant performance impact. Toward that goal, we have developed a Virtual Object Layer (VOL) connector for HDF5, the most popular I/O middleware on HPC systems. The VOL connector, called H5Prov, transparently intercepts HDF5 calls and records operations at multiple levels, namely file, group, dataset, and data element levels. The provenance data produced can also be analyzed to reveal I/O patterns and correlations between application behaviors/semantics and I/O performance issues, which enables optimization opportunities. In this effort, we analyze captured provenance information from two application benchmarks to understand HDF5 file usage and to detect I/O patterns, with preliminary results showing good promise.
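The connector itself is C code that plugs into HDF5's Virtual Object Layer, with no direct Python equivalent. As a loose illustration of the idea of recording operations per file and per dataset, here is a hypothetical Python-level wrapper (all names are invented for this sketch, not part of H5Prov):

```python
import h5py

class ProvenanceFile:
    """Toy stand-in for H5Prov: records dataset reads at the Python level.
    The real connector intercepts calls inside the HDF5 library itself."""

    def __init__(self, path, mode="r"):
        self.f = h5py.File(path, mode)
        self.log = []                       # (operation, object, selection) records

    def read(self, name, sel=slice(None)):
        self.log.append(("read", name, repr(sel)))
        return self.f[name][sel]

    def close(self):
        self.f.close()
        for record in self.log:             # dump the captured provenance
            print(*record)
```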

OME-Zarr: a cloud-optimized bioimaging file format with international community support

A growing community is constructing a next-generation file format (NGFF) for bioimaging to overcome problems of scalability and heterogeneity. Organized by the Open Microscopy Environment (OME), individuals and institutes across diverse modalities facing these problems have designed a format specification process (OME-NGFF) to address these needs. This paper brings together a wide range of those community members to describe the cloud-optimized format itself – OME-Zarr – along with tools and data resources available today to increase FAIR access and remove barriers in the scientific process. The current momentum offers an opportunity to unify a key component of the bioimaging domain — the file format that underlies so many personal, institutional, and global data management and analysis tasks.
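A minimal sketch of the idea with the zarr v2-style Python API is below: one image-pyramid level stored as a chunked array plus a group-level attribute block. The multiscales metadata here is deliberately simplified and incomplete; the normative layout is defined by the OME-NGFF specification.

```python
import numpy as np
import zarr

# Toy 3D volume (z, y, x); chunking enables ranged/partial reads from object stores.
vol = np.zeros((64, 1024, 1024), dtype=np.uint16)

root = zarr.open_group("example.ome.zarr", mode="w")
root.create_dataset("0", data=vol, chunks=(16, 256, 256))   # resolution level 0

# Simplified multiscales block; the real spec requires axes, transforms, etc.
root.attrs["multiscales"] = [{"version": "0.4", "datasets": [{"path": "0"}]}]
```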

An architecture for DICOM medical images storage and retrieval adopting distributed file systems

International Journal of High Performance Systems Architecture, 2009

Conventional storage and retrieval of information in telemedicine environments are usually based on ordinary database systems, so aspects such as scalability, information distribution, high-performance techniques, and operational costs are well-known challenges to be overcome in the search for novel proposals. This work presents an architecture that targets high performance for storing and retrieving DICOM medical images, adopting a distributed approach in a cluster configuration. The proposal has two main components. The first is a data model based on the image hierarchy, built on the Hierarchical Data Format 5 (HDF5); the second is a distributed file system, the Parallel Virtual File System (PVFS), employed here as the distributed storage layer. As a result, this paper presents a differentiated approach for storage and retrieval of information in a telemedicine environment. Experimental results indicate a performance improvement of around 16% in the storage process compared to a conventional database system.
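A hedged sketch of such an image-hierarchy data model in HDF5 (via h5py) might look like the following; the patient/study/series group names and the UID are hypothetical, chosen to mirror DICOM's information model rather than to reproduce the paper's exact schema.

```python
import numpy as np
import h5py

with h5py.File("dicom_store.h5", "w") as f:
    # Intermediate groups are created automatically from the path.
    series = f.create_group("patient_001/study_001/series_001")
    img = np.zeros((512, 512), dtype=np.uint16)          # stand-in pixel data
    inst = series.create_dataset("instance_0001", data=img)
    inst.attrs["Modality"] = "MR"                        # selected DICOM tags
    inst.attrs["SOPInstanceUID"] = "1.2.840.0000"        # hypothetical UID
```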

Standardizing the next generation of bioinformatics software development with BioHDF (HDF5)

Advances in Computational Biology, 2010

Next Generation Sequencing technologies are limited by the lack of standard bioinformatics infrastructures that can reduce data storage, increase data processing performance, and integrate diverse information. HDF technologies address these requirements and have a long history of use in data-intensive science communities. They include general data file formats, libraries, and tools for working with the data. Compared to emerging standards, such as the SAM/BAM formats, HDF5-based systems demonstrate significantly better ...

Structures and Metrics For Image Storage & Interchange

There are hundreds of different image file specifications in existence. A recent informal survey recorded almost 100 formats in use by USENET readers alone. Thus, an imaging practitioner is faced with a large and sometimes bewildering range of image file standards to choose from, which, coupled with the sparsity of studies in the area, makes acquiring a general overview of the field a difficult task. This paper seeks to address this problem by reviewing the overall topic of image formats, describing the most notable standards, proposing a set of related metrics, and providing a source of further information.

Scientific data exchange: a schema for HDF5-based storage of raw and analyzed data

Journal of synchrotron radiation, 2014

Data Exchange is a simple data model designed to interface, or 'exchange', data among different instruments, and to enable sharing of data analysis tools. Data Exchange focuses on technique rather than instrument descriptions, and on provenance tracking of analysis steps and results. In this paper the successful application of the Data Exchange model to a variety of X-ray techniques, including tomography, fluorescence spectroscopy, fluorescence tomography and photon correlation spectroscopy, is described.
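A rough h5py sketch of a Data-Exchange-style file for tomography is shown below; the group and attribute names follow the paper's scheme only loosely, and the normative layout is given by the Data Exchange specification itself.

```python
import numpy as np
import h5py

with h5py.File("tomo.h5", "w") as f:
    # Raw measured data lives under /exchange.
    f.create_dataset("exchange/data",
                     data=np.zeros((180, 256, 256), dtype=np.float32))
    f.create_dataset("exchange/theta",
                     data=np.linspace(0, 180, 180, dtype=np.float32))
    # Provenance of an analysis step, recorded as attributes.
    step = f.create_group("process/reconstruction")
    step.attrs["software"] = "hypothetical-recon 1.0"
```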

References

  1. BioHDF; http://www.geospiza.com/research/biohdf/.
  2. DICOM (Digital Imaging and Communications in Medicine); http://medical.nema.org.
  3. MEDSBIO (Consortium for Management of Experimental Data in Structural Biology); http://www.medsbio.org.
  4. METS (Metadata Encoding and Transmission Standard); http://www.loc.gov/standards/mets/.
  5. MPEG (Moving Picture Experts Group); http://www.chiariglione.org/mpeg/.

Authors

Matthew T. Dougherty (matthewd@bcm.edu) is at the National Center for Macromolecular Imaging, specializing in cryo-electron microscopy, visualization, and animation.

Michael J. Folk (mfolk@hdfgroup.org) is president of The HDF Group.

Erez Zadok (ezk@cs.sunysb.edu) is an associate professor at Stony Brook University, specializing in computer storage systems performance and design.

Herbert J. Bernstein (yaya@dowling.edu) is professor of computer science at Dowling College, active in the development of IUCr standards.

Balancing performance and preservation: lessons learned with HDF5

Proceedings of the 2010 Roadmap for Digital Preservation Interoperability Framework Workshop (US-DPIF '10), 2010

Fifteen years ago, The HDF Group set out to re-invent the HDF format and software suite to address two conflicting challenges. The first was to enable exceptionally scalable, extensible storage and access for every kind of scientific and engineering data. The second was to facilitate access to data stored in HDF long into the future.

Tuning HDF5 for Lustre file systems

2012

HDF5 is a cross-platform parallel I/O library that is used by a wide variety of HPC applications for the flexibility of its hierarchical object-database representation of scientific data. We describe our recent work to optimize the performance of the HDF5 and MPI-IO libraries for the Lustre parallel file system. We selected three different HPC applications to represent the diverse range of I/O requirements, and measured their performance on three different systems to demonstrate the robustness of our optimizations across different file system configurations and to validate our optimization strategy. We demonstrate that the combined optimizations improve HDF5 parallel I/O performance by up to 33 times in some cases, running close to the achievable peak performance of the underlying file system, and demonstrate scalable performance up to 40,960-way concurrency.
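The collective-I/O path being tuned here is reachable from Python when h5py is built against a parallel HDF5; a minimal sketch follows. Lustre stripe settings themselves are typically applied outside the program (for example with lfs setstripe on the output directory), which this sketch assumes has already been done.

```python
# Run under MPI, e.g.: mpiexec -n 4 python write_parallel.py
# Requires h5py built against an MPI-enabled HDF5.
import numpy as np
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("parallel.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (size, 1024), dtype="f4")  # created collectively
    with dset.collective:                  # collective writes, one row per rank
        dset[rank, :] = np.full(1024, rank, dtype="f4")
```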

An overview of the HDF5 technology suite and its applications

Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases - AD '11, 2011

In this paper, we give an overview of the HDF5 technology suite and some of its applications. We discuss the HDF5 data model, the HDF5 software architecture and some of its performance-enhancing capabilities.
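For readers new to the suite, the heart of that data model (a hierarchy of groups, typed multidimensional datasets, attributes, and chunked storage as one of the performance-enhancing capabilities) fits in a few lines of h5py. This is a generic sketch, not an example from the paper:

```python
import numpy as np
import h5py

with h5py.File("model.h5", "w") as f:
    run = f.create_group("experiment_1")                 # groups form a hierarchy
    frames = run.create_dataset("frames",
                                shape=(1000, 512, 512), dtype="u2",
                                chunks=(1, 512, 512))    # chunked storage layout
    frames.attrs["exposure_ms"] = 10.0                   # attributes carry metadata
    frames[0] = np.zeros((512, 512), dtype="u2")         # this write touches one chunk
```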

A Plugin for HDF5 Using PLFS for Improved I/O Performance and Semantic Analysis

2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012

HDF5 is a data model, library and file format for storing and managing data. It is designed for flexible and efficient I/O for high-volume and complex data. Natively, it uses a single-file format where multiple HDF5 objects are stored in a single file. In a parallel HDF5 application, multiple processes access a single file, thereby resulting in a performance bottleneck in I/O. Additionally, a single-file format does not allow semantic post-processing on individual objects outside the scope of the HDF5 application. We have developed a new plugin for HDF5 using its Virtual Object Layer that serves two purposes: 1) it uses PLFS to convert the single-file layout into a data layout that is optimized for the underlying file system, and 2) it stores data in a unique way that enables semantic post-processing on the data. We measure the performance of the plugin and discuss work leveraging the new semantic post-processing functionality it enables. We further discuss the applicability of this approach for exascale burst buffer storage systems.

Image Management for Biological Data

Encyclopedia of Database Systems, 2009

Synonyms: databases for biomedical images; image management for life sciences. Definition: Image management for biological data refers to the organization of biological images and their associated metadata and annotations in a digital system so that they can be searched, retrieved and shared. Foundations: Content-Based Similarity Search. Similarity search by semantic content may be performed at the level of whole images (content-based image retrieval, CBIR) or of regions (region-based image retrieval, RBIR). Datta et al. provide an excellent survey [3]. Some important image retrieval systems are SIMPLIcity, WALRUS, Virage, QBIC, NeTra, Photobook, VisualSEEk, and Keyblock. The two important components of such image comparison systems are the image features and the distance metrics. Examples of general image features are the MPEG-7 standard features, color histograms, texture, wavelets, and shape descriptors. In addition, there are domain...
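As a concrete instance of the feature-plus-distance pairing described above, here is a small self-contained example: per-channel color histograms compared with a chi-square distance (a common CBIR choice; the bin count is an arbitrary assumption).

```python
import numpy as np

def color_histogram(img, bins=8):
    """Concatenated per-channel histograms, normalized to sum to 1."""
    h = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(h).astype(float)
    return h / h.sum()

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

a = np.random.randint(0, 256, (64, 64, 3))   # two synthetic RGB images
b = np.random.randint(0, 256, (64, 64, 3))
print(chi2_distance(color_histogram(a), color_histogram(b)))
```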

Tuning HDF5 subfiling performance on parallel file systems

2017

Subfiling is a technique used on parallel file systems to reduce locking and contention issues when multiple compute nodes interact with the same storage target node. Subfiling provides a compromise between the single-shared-file approach, which instigates lock contention on parallel file systems, and having one file per process, which generates a massive and unmanageable number of files. In this paper, we evaluate and tune the performance of the recently implemented subfiling feature in HDF5. Specifically, we explain the implementation strategy of the subfiling feature in HDF5, provide examples of using the feature, and evaluate and tune parallel I/O performance of this feature on the parallel file systems of the Cray XC40 system at NERSC (Cori), which include a burst buffer storage and a Lustre disk-based storage. We also evaluate I/O performance on the Cray XC30 system, Edison, at NERSC. Our results show performance benefits of 1.2× to 6× with subfiling...

SCIFIO: an extensible framework to support scientific image formats

BMC Bioinformatics, 2016

Background: No gold standard exists in the world of scientific image acquisition; a proliferation of instruments each with its own proprietary data format has made out-of-the-box sharing of that data nearly impossible. In the field of light microscopy, the Bio-Formats library was designed to translate such proprietary data formats to a common, open-source schema, enabling sharing and reproduction of scientific results. While Bio-Formats has proved successful for microscopy images, the greater scientific community was lacking a domain-independent framework for format translation. Results: SCIFIO (SCientific Image Format Input and Output) is presented as a freely available, open-source library unifying the mechanisms of reading and writing image data. The core of SCIFIO is its modular definition of formats, the design of which clearly outlines the components of image I/O to encourage extensibility, facilitated by the dynamic discovery of the SciJava plugin framework. SCIFIO is structured to support coexistence of multiple domain-specific open exchange formats, such as Bio-Formats' OME-TIFF, within a unified environment. Conclusions: SCIFIO is a freely available software library developed to standardize the process of reading and writing scientific image formats.

MIPortal: a high capacity server for molecular imaging research

Molecular imaging

The introduction of novel molecular tools in research and clinical medicine has created a need for more refined information management systems. This article describes the design and implementation of such a new information platform: the Molecular Imaging Portal (MIPortal). The platform was created to organize, archive, and rapidly retrieve large datasets using Web-based browsers as access points. The system has been implemented in a heterogeneous, academic research environment serving Macintosh, Unix, and Microsoft Windows clients and has been shown to be extraordinarily robust and versatile. In addition, it has served as a useful tool for clinical trials and collaborative multi-institutional small-animal imaging research.

BioImageIT: Integration of image data-management with analysis

HAL (Le Centre pour la Communication Scientifique Directe), 2021

HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

BD5: An open HDF5-based data format to represent quantitative biological dynamics data

PLOS ONE, 2020

BD5 is a new binary data format based on HDF5 (hierarchical data format version 5). It can be used for representing quantitative biological dynamics data obtained from bioimage informatics techniques and mechanobiological simulations. Biological Dynamics Markup Language (BDML) is an XML (Extensible Markup Language)-based open format that is also used to represent such data; however, it becomes difficult to access quantitative data in BDML files when the file size is large because parsing XML-based files requires large computational resources to first read the whole file sequentially into computer memory. BD5 enables fast random (i.e., direct) access to quantitative data on disk without parsing the entire file. Therefore, it allows practical reuse of data for understanding biological mechanisms underlying the dynamics.
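The random-access property BD5 inherits from HDF5 is easy to demonstrate with h5py: after writing a large chunked dataset, a slice can be read back without parsing or loading the rest of the file. The dataset name and sizes below are arbitrary illustrations, not the BD5 layout.

```python
import numpy as np
import h5py

# Write a large chunked dataset of, say, tracked-object coordinates.
with h5py.File("dynamics.h5", "w") as f:
    f.create_dataset("tracks", data=np.random.rand(100_000, 4), chunks=(1024, 4))

# Direct access: only the selected hyperslab is read from disk,
# in contrast to parsing an entire XML (BDML) file into memory first.
with h5py.File("dynamics.h5", "r") as f:
    window = f["tracks"][50_000:50_100]
```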

Cited by

OMERO: flexible, model-driven data management for experimental biology

Nature methods, 2012

Data-intensive research depends on tools that manage multi-dimensional, heterogeneous data sets. We have built OME Remote Objects (OMERO), a software platform that enables access to and use of a wide range of biological data. OMERO uses a server-based middleware application to provide a unified interface for images, matrices, and tables. OMERO's design and flexibility have enabled its use for light microscopy, high content screening, electron microscopy, and even nonimage genotype data. OMERO is open source software and available at http://openmicroscopy.org.

Towards data format standardization for X-ray absorption spectroscopy

Journal of Synchrotron Radiation, 2012

A working group on data format standardization for X-ray absorption spectroscopy (XAS) has recently formed under the auspices of the International X-ray Absorption Society and the XAFS Commission of the International Union of Crystallography. This group of beamline scientists and XAS practitioners has been tasked to propose data format standards to meet the needs of the world-wide XAS community. In this report, concepts for addressing three XAS data storage needs are presented: a single spectrum interchange format, a hierarchical format for multispectral X-ray experiments, and a relational database format for XAS data libraries.

mz5: Space- and Time-efficient Storage of Mass Spectrometry Data Sets

Molecular & Cellular Proteomics, 2012

Across a host of MS-driven -omics fields, researchers witness the acquisition of ever increasing amounts of high-throughput MS data and face the need for their compact yet efficiently accessible storage. Addressing the need for an open data exchange format, the Proteomics Standards Initiative (PSI) and the Seattle Proteome Center at the Institute for Systems Biology (ISB) independently developed the mzData and mzXML formats, respectively. In a subsequent joint effort they defined an ontology and associated controlled vocabulary that specifies the contents of MS data files, implemented as the newer mzML format. All three formats are based on XML and are thus not particularly efficient in either storage space requirements or read/write speed. This contribution introduces mz5, a complete reimplementation of the mzML ontology that is based on the efficient, industrial-strength storage back-end HDF5. Compared to the current mzML standard, this strategy yields an average file size reduction to ∼54% and increases linear read and write speeds ∼3-4 fold. The format is implemented as part of the ProteoWizard project and is available under a permissive Apache license. Additional information and download links are available from http://software.steenlab.org/mz5.
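mz5 itself is a C++ reimplementation inside ProteoWizard, but the storage effect it exploits (HDF5's built-in chunked compression filters) can be sketched in a few lines of h5py; the synthetic Poisson data and chunk sizes are assumptions for illustration.

```python
import os
import numpy as np
import h5py

data = np.random.poisson(5, (2000, 2000)).astype("u2")   # compressible count data

for name, opts in [("plain.h5", {}),
                   ("gzip.h5", dict(compression="gzip", shuffle=True))]:
    with h5py.File(name, "w") as f:
        f.create_dataset("spectra", data=data, chunks=(256, 256), **opts)
    print(name, os.path.getsize(name), "bytes")
```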

Methods and Technologies for Research- and Metadata Management in Collaborative Experimental Research

Applied Mechanics and Materials

Newly developed technologies and methods for the purpose of controlling uncertainty in technical systems must be proven and validated against reliable experimental studies. The availability of descriptive metadata is mandatory to enable long term usability and sharing of such experimental research data. This article introduces a concept for a software independent solution for managing data in collaborative research environments. The proposed approach leverages the advantages of capturing metadata in a uniform, modular data structure and providing software independent access to a centralized data repository as well as its contents by means of a web-application. The article presents a prototype implementation of the proposed approach and discusses its application on the demonstrator test rig of a collaborative research centre in the field of mechanical engineering.

Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

2019

Electrical energy consumption has been an ongoing research area since the coming of smart homes and Internet of Things devices. Consumption characteristics and usage profiles are directly influenced by building occupants and their interaction with electrical appliances. Extracted information from these data can be used to conserve energy and increase user comfort levels. Data analysis together with machine learning models can be utilized to extract valuable information for the benefit of occupants themselves, power plants, and grid operators. Public energy datasets provide a scientific foundation to develop and benchmark these algorithms and techniques. With datasets exceeding tens of terabytes, we present a novel study of five whole-building energy datasets with high sampling rates, their signal entropy, and how a well-calibrated measurement can have a significant effect on the overall storage requirements. We show that some datasets do not fully utilize the available measurement precision, therefore leaving potential accuracy and space savings untapped. We benchmark a comprehensive list of 365 file formats, transparent data transformations, and lossless compression algorithms. The primary goal is to reduce the overall dataset size while maintaining an easy-to-use file format and access API. We show that with careful selection of file format and encoding scheme, we can reduce the size of some datasets by up to 73%.
