The National Sleep Research Resource: towards a sleep data commons (original) (raw)
Journal Article
,
Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
Corresponding Author: Guo-Qiang Zhang, PhD, Multidisciplinary Science Building 230, 725 Rose Street, Lexington, KY 40536, USA (gq.zhang@uky.edu)
Search for other works by this author on:
,
Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
Search for other works by this author on:
,
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
,
Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
Search for other works by this author on:
,
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
,
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
,
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
,
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
Brigham and Women’s Hospital, Boston, Massachusetts, USA
Harvard Medical School, Harvard University, Boston, Massachusetts, USA
Search for other works by this author on:
Revision received:
05 April 2018
Cite
Guo-Qiang Zhang, Licong Cui, Remo Mueller, Shiqiang Tao, Matthew Kim, Michael Rueschman, Sara Mariani, Daniel Mobley, Susan Redline, The National Sleep Research Resource: towards a sleep data commons, Journal of the American Medical Informatics Association, Volume 25, Issue 10, October 2018, Pages 1351–1358, https://doi.org/10.1093/jamia/ocy064
Close
Navbar Search Filter Mobile Enter search term Search
Abstract
Objective
The gold standard for diagnosing sleep disorders is polysomnography, which generates extensive data about biophysical changes occurring during sleep. We developed the National Sleep Research Resource (NSRR), a comprehensive system for sharing sleep data. The NSRR embodies elements of a data commons aimed at accelerating research to address critical questions about the impact of sleep disorders on important health outcomes.
Approach
We used a metadata-guided approach, with a set of common sleep-specific terms enforcing uniform semantic interpretation of data elements across three main components: (1) annotated datasets; (2) user interfaces for accessing data; and (3) computational tools for the analysis of polysomnography recordings. We incorporated the process for managing dataset-specific data use agreements, evidence of Institutional Review Board review, and the corresponding access control in the NSRR web portal. The metadata-guided approach facilitates structural and semantic interoperability, ultimately leading to enhanced data reusability and scientific rigor.
Results
The authors curated and deposited retrospective data from 10 large, NIH-funded sleep cohort studies, including several from the Trans-Omics for Precision Medicine (TOPMed) program, into the NSRR. The NSRR currently contains data on 26 808 subjects and 31 166 signal files in European Data Format. Launched in April 2014, over 3000 registered users have downloaded over 130 terabytes of data.
Conclusions
The NSRR offers a use case and an example for creating a full-fledged data commons. It provides a single point of access to analysis-ready physiological signals from polysomnography obtained from multiple sources, and a wide variety of clinical data to facilitate sleep research.
INTRODUCTION
To advance sleep research,1 the National Institutes of Health (NIH) has funded a number of clinical trials and epidemiological cohort studies, such as the Sleep Heart Health Study,2,3 Childhood Adenotonsillectomy Trial,3–5 Heart Biomarker Evaluation in Apnea Treatment,6 Cleveland Family Study,7,8 Study of Osteoporotic Fractures,9,10 MrOS Sleep Study,3,11,12 Cleveland Children's Sleep and Health Study,13–15 Hispanic Community Health Study/Study of Latinos,16,17 Honolulu-Asia Aging Study of Sleep Apnea,18 and Multi-Ethnic Study of Atherosclerosis,19 and the Jackson Heart Study.20 Polysomnography recordings called polysomnograms (PSGs) from these studies were analyzed by a central Sleep Reading Center. Several of these studies are also part of the Trans-omics in Precision Medicine (TOPMed) initiative,21 a program aimed at generating -omics data on over 100 000 research participants. Collectively, these represent a largely untapped extensive data resource involving human physiology.
In this paper, we describe the design and implementation of the NSRR,22 a system for the structural and semantic harmonization of and web-based access to PSGs and associated clinical data generated from NIH-funded epidemiological cohort studies. The NSRR offers a use case and an example to guide the creation of a full-fledged data commons.23 It provides a single point of access to analysis-ready polysomnography and clinical data to facilitate sleep research. It also serves as an exemplar to promote the FAIR24 principles for advancing data-enabled clinical and translational research.
Sharing clinical research data
Proper sharing and reuse of data sets can help accelerate research. Since early 2000, the NIH mandated policies and regulations requiring the sharing of final research data for larger awards.
However, it is one thing to generate one’s own data and perform one’s own analysis; it is a different matter making data accessible in an analysis-ready form for analysis by an independent researcher. The state of affairs for sharing research data is at best uneven, and at worst underdeveloped, due to multiple reasons and challenges. Efforts needed to meet the challenges have been grossly underestimated. Some investigators may not have sufficient resources or expertise for the proper sharing of their data. Others may not be sharing data in a way that facilitates reuse. There is also a well-known gap in the lack of data standards to ensure interoperability and proper attribution of data collection efforts.
The NIH data commons
The NIH Data Commons25 (or Commons) is an ambitious vision for a shared virtual space to allow digital objects to be stored and computed upon by the scientific community. The Commons would allow investigators to find, manage, share, use, and reuse data, software, metadata, and workflows. It imagines an ecosystem that makes digital objects findable, accessible, interoperable, and reusable (FAIR24). Four components are considered integral parts of the Commons: a computing resource for accessing and processing of digital objects; a “digital object compliance model” that describes the properties of digital objects that enable them to be FAIR; datasets that adhere to the digital object compliance model; and software and services to facilitate access to and use of data.
Creating such a Commons could benefit from many strategies, including a bottom-up and domain-specific approach involving key stakeholders in iterative processes, because the complexities involved in realizing this vision cannot be fully anticipated, and the ultimate product needs to be responsive to the community’s needs. Sample bottom-up efforts include the bioCADDIE26 project aimed at tackling some of the inherent challenges in managing digital object identifiers that could serve the purpose of a Commons. Sample domain-specific efforts include the National Sleep Research Resource, demonstrating a FAIR-oriented digital object environment for the domain of sleep medicine that involves polysomnograms as unique types of digital objects.
The National Sleep Research Resource
Over the last several years, we have developed a system with a single point of access for sharing and reusing large-scale physiological signals for the NIH-funded National Sleep Research Resource (NSRR; R24HL114473). The NSRR offers free and open web access to de-identified data for more than 26 000 subjects, including PSGs and links to risk factor and outcome data for study participants.22
This national repository of sleep data, the first and largest of its kind, is significant because biophysiological data have not been previously made available at such a large and systematic scale. In fact, many sleep studies have been conducted using data from single laboratories, limiting scope, statistical power, and generalizability. The NSRR provides opportunities for investigators to address critical questions about the impact of sleep disorders, using data from scored summary data, annotations, and the actual physiological signal data, on important clinical outcomes, thereby enhancing clinical and translational work in human sleep medicine and physiology.
The NSRR makes data available in two ways. One is through the open access of a standard collection of study metadata (What, Who, When, Funding). The “What” section provides further details in subsection headings About, Data Overview, Protocols and Manuals, Analysis, Equipment, and Publication Links. The second is through the cross-cohort open search interface x-search.net, allowing a user to query and visualize data across multiple datasets. For registered users, a case-control exploration interface allows the user to specify a control cohort and a case cohort in a step-by-step manner to quickly assess the data support for a potential hypothesis.
Overview
This paper describes the contributions of the NSRR along several aspects of the Commons vision: metadata for sleep research digital objects; a collection of annotated sleep data sets; and interfaces and tools for accessing and analyzing such data. More importantly, the NSRR provides the design of a functional architecture for implementing a Sleep Data Commons. The NSRR also reveals complexities and challenges involved in making clinical sleep data conform to the FAIR principles.
APPROACH
We, the NSRR team of informaticians, data managers, computer scientists, and clinical and epidemiological researchers, developed the NSRR using a metadata-guided approach, in the sense that a set of common sleep terms (Sleep Common Data Elements - SCDEs) was created and used for data annotation and mapping, for user interfaces that support browsing and cross-cohort data exploration, and for sleep signal visualization and analysis. The NSRR consists of three main components: (1) annotated datasets; (2) user interfaces for accessing data; and (3) computational tools for the analysis of polysomnography recordings. The NSRR also embeds the management of the approval process for dataset-specific data access and use agreements (DAUAs), evidence of Institutional Review Board (IRB) review for accessing NSRR source data sets (part or whole), and the corresponding access control strategies in a single web portal housed at https://sleepdata.org.
Sleep metadata: common data elements and provenance information
The NSRR uses the metadata-guided approach to achieve uniform semantic interpretation of data elements across the entire spectrum of data integration activities: for annotating source data, for interfaces to query and search data, and for tools that access and assist in analysis.
Existing terminological systems do not cover the sleep domain in sufficient detail to meet the goals of the NSRR. For this reason, we developed SCDEs (Figure 1), which consist of more than 900 core sleep terms that capture demographic information, anthropometric parameters, physiologic measurements, medical history elements, sleep study data, sleep symptoms, polysomnogram sleep events, relevant medical history elements, laboratory data, and neurocognitive testing results. The provenance information on the terms includes assessment point (eg baseline visit, follow-up visit), method of data capture (eg direct measurement, calculated, reported by subject), and equipment (eg sensors used for data capture). Whenever possible, the SCDE terms have been linked to coded terms listed in established biomedical terminologies, including the SNOMED CT, the FDA Drug Classification system, the Ontology for Biomedical Investigations, International Classification of Sleep Disorders, and the effort on a Sleep Domain Ontology (SDO27) developed as part of the earlier Physio-MIMI project by the same NSRR team.28–31 Terms appearing as NIH Common Data Elements (CDEs) are also provided for cross-reference.
Figure 1.
Above: Screenshot of NSRR Sleep Common Data Elements with attributes consisting of Name, Type, Unit, Min, Max, and Version. Below: Screenshot of NSRR’s cross-cohort exploration system, guided by the Sleep Common Data Elements (left column). This system is openly accessible at x-search.net.
Functional architecture
We designed the NSRR functional architecture to flexibly accommodate the deposition of a growing set of new tools and data. To do so, the NSRR functional architecture consists of two main parts: Resource Construction and Resource Access (Figure 2). Resource Construction allows data from individual data sets to be curated, mapped, and integrated into NSRR on a cohort-by-cohort basis over time (retrospectively or prospectively) by the NSRR team. We process two broad categories of data in Resource Construction: study variables as defined by each cohort and transformed and mapped to NSRR’s sleep terms, and PSGs converted to European Data Format (EDF), a standard file format for polysomnography recordings, with precisely annotated sleep events. We also preprocess EDF files to generate derived data with established signal analysis tools such as those for spectral analyses adopted from PhysioNet,32 while keeping the raw data also available. Four categories of resources are made available incrementally: (a) PSGs in EDF and clinical covariates; (b) metadata for sleep research; (c) tools for data analytics and cross-cohort exploration; (d) user guides, technical guides, and study documents originated from individual study (Figure 2).
Figure 2.
Functional architecture representing the connections and interactions among NSRR components. Sleep Common Data Elements play a central role in coordinating and facilitating incremental resource construction (above) and resource access (below).
We developed and used a number of tools for Resource Construction including Spout,33 a data dictionary tool for maintaining and validating information on datasets and their clinical data elements, and Edfize,34 for testing the integrity of the converted EDF files. For Resource Access, the NSRR team provides a gem tool35 that simplifies the downloading of large source files by users with appropriate credentials of data access and use agreements and IRB reviews, which are also tracked and managed through the NSRR web portal.
Workflow for data curation and metadata annotation
We curate study-specific variables from individual cohorts and make them publicly available using a two-stage process. In stage one, our team performs the curation of datasets and their data dictionaries. All variables from each data dictionary are transformed into an NSRR-specific format using an automated script developed by the NSRR team. The script systematically checks the variables after their transformation, against source data for variable-specification quality control, such as conformation to the NSRR naming convention, value set or variable domain matching (value type, range gap, out of range, and outliers), and variable tagging by dataset-specific questionnaire forms where applicable. This systematic quality check against source data is performed iteratively by the NSRR team, resulting in progressively enhanced versions of the variable specifications that are tracked using version control. In stage two, a subset of variables shared across datasets are manually identified and mapped to the SCDE terms to facilitate cross-cohort search and meta-analysis.36 The versions of mappings are also systematically tracked.37
The import of a dataset into the NSRR involves annotation with two types of metadata: study variable metadata and polysomnography recording metadata. Each study variable that has been mapped to an SCDE term is also annotated with provenance metadata abstracted from documents on study processes and methodology. Provenance metadata include the source of the data, the time point at which the data were collected, the method used to collect or abstract the data, the equipment used to collect the data, and any formulas applied to calculate derived values.
Polysomnography recording metadata contain information such as study identifiers, signal channel names, signal units, and signal minimums and maximums that were acquired at the time of signal acquisition based on the configuration of the recording system. Such information is stored in the EDF header and is extracted and mapped using our EDF Editor and Translator38 for analyses. For details of the NSRR metadata extraction pipeline, see Supplementary Appendix A.
Tools for data exploration, visualization, and analysis
The NSRR team developed a number of tools for data exploration and analysis. X-search.net36 enables users to query across multiple datasets using the SCDE terms (Figure 1). The tool Altamira39 is used for web-based rendering of EDF signals, making use of the Edfize library.34 The NSRR also provides a downloading tool called the NSRR gem35 that facilitates the downloading of large datasets. The gem checks the integrity of files to be downloaded, recreates the local folder hierarchy, and manages downloading interruptions.
The data source for x-search.net is a mirrored copy of the NSRR clinical data after data mapping. The NSRR team handles mapping and coding inconsistencies before new datasets (excluding the EDF files) are integrated and become queryable using x-search.net. The data source for Altamira is the same imported copy of the EDF files.
We developed and used the Spout data dictionary management tool33 to generate dataset descriptions for access through the NSRR web portal. Spout generates and updates the specifications and histogram renderings of thousands of variables used in individual NSRR cohorts. An extended list of openly accessible tools appears in Supplementary Appendix A.
Data quality review
The NSRR team reviews new datasets for outlying and implausible values, which are tracked in a Known Issues file in each version-controlled data dictionary repository. Raw EDF files are processed in the EDF Editor and Translator tool to ensure that each file is readable and conforms to EDF specifications. Polysomnograms that cannot be converted successfully to EDF are noted in the dataset documentation.
Additional details about data protection and the hosting environment appear in Supplementary Appendix A.
RESULTS
Annotated and integrated data sets in NSRR
We have integrated data from 10 large, NIH-funded sleep cohort studies (Table 1). In total, the NSRR contains semantically annotated data from 26 808 subjects, and 51 435 linked files of raw or processed signals, including 31 166 EDF files that are available for downloading.
Table 1.
Description of data sets included in NSRR
Cohort/Study | N, subjects (n, PSGs)* | Objective Sleep Data | Main Study Outcomes | Present in TOPMed |
---|---|---|---|---|
Sleep Heart Health Study (SHHS: subsets of ARIC, CHS, FHS, Tucson) | 5600 (8080) | Full PSG | Incident cardiovascular disease | Selected samples (ARIC, FHS) |
40+ years | ||||
Childhood Adenotonsillectomy Trial | 1244 (1639) | Full PSG | Sleep apnea treatment effects on cognition, behavior, and growth | No |
5–10 yrs | ||||
Heart Biomarkers in Apnea Treatment, HeartBEAT | 305 (580) | Oximetry, NP, RIP; ECG | Sleep apnea treatment effects on 24-hour blood pressure and biomarkers | No |
45–75 yrs | ||||
Cleveland Family Study | 1600 (3200) | Oximetry; Thermistry; Chest effort, ECG; Full PSG in n = 700 | Genetics of sleep apnea | Yes |
4–96 yrs | ||||
Study of Osteoporetic Fractures in Older Women (SOF)-Sleep | 460 | Full PSG; actigraphy | Incident dementia, falls, and fractures. | No |
75+ yrs | ||||
Study of Osteoporetic Fractures in Older Men-Sleep (MrOS) | 2991 (4452) | Full PSG; actigraphy | Incident falls, fractures, and cardiovascular disease | No |
65+ yrs | ||||
Cleveland Children’s Sleep and Health Study | 850 (1603) | Oximetry; Thermistry; NP, RIP; ECG in all; Full PSG and actigraphy on n = 504 | Incident obesity and pediatric sleep disorders | No |
8–19 yrs | ||||
Hispanic Community Health Study (HCHS/SOL) | 15 000 | Oximetry, NP, snoring, movement; actigraphy on n = 2000 | Diabetes, cardiovascular disease, neurocognition, hearing loss | Yes |
18–74 yrs | ||||
Honolulu Asian American Asian Sleep Study, HAAS | 700 | Full PSG | Incident cognitive impairment | No |
85+ yrs | ||||
Multiethnic Study of Atherosclerosis (MESA)-Sleep Study | 2200 | Full PSG; actigraphy | Incident cardiovascular disease (including cardiac MRI) | Yes |
45–84 yrs |
Cohort/Study | N, subjects (n, PSGs)* | Objective Sleep Data | Main Study Outcomes | Present in TOPMed |
---|---|---|---|---|
Sleep Heart Health Study (SHHS: subsets of ARIC, CHS, FHS, Tucson) | 5600 (8080) | Full PSG | Incident cardiovascular disease | Selected samples (ARIC, FHS) |
40+ years | ||||
Childhood Adenotonsillectomy Trial | 1244 (1639) | Full PSG | Sleep apnea treatment effects on cognition, behavior, and growth | No |
5–10 yrs | ||||
Heart Biomarkers in Apnea Treatment, HeartBEAT | 305 (580) | Oximetry, NP, RIP; ECG | Sleep apnea treatment effects on 24-hour blood pressure and biomarkers | No |
45–75 yrs | ||||
Cleveland Family Study | 1600 (3200) | Oximetry; Thermistry; Chest effort, ECG; Full PSG in n = 700 | Genetics of sleep apnea | Yes |
4–96 yrs | ||||
Study of Osteoporetic Fractures in Older Women (SOF)-Sleep | 460 | Full PSG; actigraphy | Incident dementia, falls, and fractures. | No |
75+ yrs | ||||
Study of Osteoporetic Fractures in Older Men-Sleep (MrOS) | 2991 (4452) | Full PSG; actigraphy | Incident falls, fractures, and cardiovascular disease | No |
65+ yrs | ||||
Cleveland Children’s Sleep and Health Study | 850 (1603) | Oximetry; Thermistry; NP, RIP; ECG in all; Full PSG and actigraphy on n = 504 | Incident obesity and pediatric sleep disorders | No |
8–19 yrs | ||||
Hispanic Community Health Study (HCHS/SOL) | 15 000 | Oximetry, NP, snoring, movement; actigraphy on n = 2000 | Diabetes, cardiovascular disease, neurocognition, hearing loss | Yes |
18–74 yrs | ||||
Honolulu Asian American Asian Sleep Study, HAAS | 700 | Full PSG | Incident cognitive impairment | No |
85+ yrs | ||||
Multiethnic Study of Atherosclerosis (MESA)-Sleep Study | 2200 | Full PSG; actigraphy | Incident cardiovascular disease (including cardiac MRI) | Yes |
45–84 yrs |
RIP: inductance plethysmography; NP: nasal pressure.
Table 1.
Description of data sets included in NSRR
Cohort/Study | N, subjects (n, PSGs)* | Objective Sleep Data | Main Study Outcomes | Present in TOPMed |
---|---|---|---|---|
Sleep Heart Health Study (SHHS: subsets of ARIC, CHS, FHS, Tucson) | 5600 (8080) | Full PSG | Incident cardiovascular disease | Selected samples (ARIC, FHS) |
40+ years | ||||
Childhood Adenotonsillectomy Trial | 1244 (1639) | Full PSG | Sleep apnea treatment effects on cognition, behavior, and growth | No |
5–10 yrs | ||||
Heart Biomarkers in Apnea Treatment, HeartBEAT | 305 (580) | Oximetry, NP, RIP; ECG | Sleep apnea treatment effects on 24-hour blood pressure and biomarkers | No |
45–75 yrs | ||||
Cleveland Family Study | 1600 (3200) | Oximetry; Thermistry; Chest effort, ECG; Full PSG in n = 700 | Genetics of sleep apnea | Yes |
4–96 yrs | ||||
Study of Osteoporetic Fractures in Older Women (SOF)-Sleep | 460 | Full PSG; actigraphy | Incident dementia, falls, and fractures. | No |
75+ yrs | ||||
Study of Osteoporetic Fractures in Older Men-Sleep (MrOS) | 2991 (4452) | Full PSG; actigraphy | Incident falls, fractures, and cardiovascular disease | No |
65+ yrs | ||||
Cleveland Children’s Sleep and Health Study | 850 (1603) | Oximetry; Thermistry; NP, RIP; ECG in all; Full PSG and actigraphy on n = 504 | Incident obesity and pediatric sleep disorders | No |
8–19 yrs | ||||
Hispanic Community Health Study (HCHS/SOL) | 15 000 | Oximetry, NP, snoring, movement; actigraphy on n = 2000 | Diabetes, cardiovascular disease, neurocognition, hearing loss | Yes |
18–74 yrs | ||||
Honolulu Asian American Asian Sleep Study, HAAS | 700 | Full PSG | Incident cognitive impairment | No |
85+ yrs | ||||
Multiethnic Study of Atherosclerosis (MESA)-Sleep Study | 2200 | Full PSG; actigraphy | Incident cardiovascular disease (including cardiac MRI) | Yes |
45–84 yrs |
Cohort/Study | N, subjects (n, PSGs)* | Objective Sleep Data | Main Study Outcomes | Present in TOPMed |
---|---|---|---|---|
Sleep Heart Health Study (SHHS: subsets of ARIC, CHS, FHS, Tucson) | 5600 (8080) | Full PSG | Incident cardiovascular disease | Selected samples (ARIC, FHS) |
40+ years | ||||
Childhood Adenotonsillectomy Trial | 1244 (1639) | Full PSG | Sleep apnea treatment effects on cognition, behavior, and growth | No |
5–10 yrs | ||||
Heart Biomarkers in Apnea Treatment, HeartBEAT | 305 (580) | Oximetry, NP, RIP; ECG | Sleep apnea treatment effects on 24-hour blood pressure and biomarkers | No |
45–75 yrs | ||||
Cleveland Family Study | 1600 (3200) | Oximetry; Thermistry; Chest effort, ECG; Full PSG in n = 700 | Genetics of sleep apnea | Yes |
4–96 yrs | ||||
Study of Osteoporetic Fractures in Older Women (SOF)-Sleep | 460 | Full PSG; actigraphy | Incident dementia, falls, and fractures. | No |
75+ yrs | ||||
Study of Osteoporetic Fractures in Older Men-Sleep (MrOS) | 2991 (4452) | Full PSG; actigraphy | Incident falls, fractures, and cardiovascular disease | No |
65+ yrs | ||||
Cleveland Children’s Sleep and Health Study | 850 (1603) | Oximetry; Thermistry; NP, RIP; ECG in all; Full PSG and actigraphy on n = 504 | Incident obesity and pediatric sleep disorders | No |
8–19 yrs | ||||
Hispanic Community Health Study (HCHS/SOL) | 15 000 | Oximetry, NP, snoring, movement; actigraphy on n = 2000 | Diabetes, cardiovascular disease, neurocognition, hearing loss | Yes |
18–74 yrs | ||||
Honolulu Asian American Asian Sleep Study, HAAS | 700 | Full PSG | Incident cognitive impairment | No |
85+ yrs | ||||
Multiethnic Study of Atherosclerosis (MESA)-Sleep Study | 2200 | Full PSG; actigraphy | Incident cardiovascular disease (including cardiac MRI) | Yes |
45–84 yrs |
RIP: inductance plethysmography; NP: nasal pressure.
The existing NSRR datasets were derived from completed prospective research studies, and therefore the clinical data were obtained through direct data collection, questionnaires, or research procedures (eg adjudication of follow-up data and medical records performed during primary data collection at the cohort level). Future extensions of the repository may include data directly obtained from clinical records, which will involve additional efforts to achieve their reusability.
The actigraphy data (usually covering 5 to 7 days) in NSRR are collected from research/medical devices prospectively for a subset of the studies. We have made these data available as summary metrics (the processed output of the devices) and raw data (counts), and are in the process of sharing specific metrics such as the one derived from yielding measurements of diurnal rhythm. The NSRR can also accommodate mobile sensor data as they become available, although the actigraphy data were collected in a controlled setting.
Deployed tools
As of March 2018, the NSRR Tool section40 features 17 tools, 1 R script, and 5 tutorials. While most of these tools have been created by the NSRR team, we recently invited the extended sleep research community to share their signal analysis tools in open source on the NSRR. Tools that have been contributed by external researchers are SpiSOP,41 a standalone tool for sleep EEG analysis that includes slow wave and spindle detection, and a MATLAB algorithm42 for detecting rapid eye movements in REM sleep.
Registered users
A total of 3059 users have registered for the resource, of which 6 are NSRR designated Academic User Group members, and 15 are core team members. Of the remaining 3038 registered users, 901 DAUAs have been submitted, of which 819 (90.9%) have been approved by the NSRR team. On average, the resource approves 17 new DAUAs per month since inception. We noticed a growing trend in the number of approved DAUAs per month in each year: 8 per month for 2014; 14 per month for 2015; 17 per month for 2016; 22 per month for 2017; and 34 per month for 2018.
Evidence of usage
Supplementary Appendix B (Table A) summarizes the numbers of files and data download sizes by registered users for each dataset. Overall, the NSRR has served over 5.8 million files, covering 133 TB of data. On average, over the 47 months since its inception, the NSRR served up approximately 95 000 files per month, with recent data sharing of 1TB per week. This equates to a daily average of 3100 files covering 75 GB of data per day.
Evidence of NSRR usage for scientific research includes research proposals submitted and publications. So far, 13 research proposals using NSRR as a data resource have been submitted or funded. More than 35 publications, identified in the acknowledgements or references section, have appeared in scientific venues, including a recent publication characterizing sleep spindles.43 Additional user feedback information is provided in Supplementary Appendix B.
DISCUSSION
Related efforts on sharing medical time series data
Although NIH has consistently invested in data-sharing resources, such as BioLINCC,44 dbGAP,45 PhysioNet,32 and CVRG,46 there is no single repository for complex time-series physiological data such as those represented by PSGs together with relevant subject-level clinical data. Existing resources also do not contain query and processing tools needed to maximize the usability of the data and the user experience.
PhysioNet32 is an NIBIB/NIGMS supported resource for openly disseminating and exchanging biomedical signals and open-source analysis software. Compared to NSRR, PhysioNet has a limited set of PSGs and lacks comprehensive clinical data.
Measuring degrees of FAIR quality
The NSRR provides a use case on how large-scale domain-specific data sets can be semantically enriched using a metadata-guided approach, a necessary step to make sleep research data FAIR using the following strategies and techniques:
Findable. This FAIR principle requires data objects to be uniquely and persistently identifiable. The NSRR achieves this through the use of Uniform Resource Identifiers (URIs), a string of characters used to identify a resource. For example, “https://sleepdata.org/datasets/shhs/variables/angina” is a string that uniquely identifies NSRR’s Sleep Heart Health Study variable for the number of angina episodes since baseline; “https://sleepdata.org/datasets/shhs/files/polysomnography/edfs/shhs1/shhs1-200008.edf” is a string that uniquely identifies Sleep Heart Health Study subject 200008’s baseline polysomnography recording in EDF format. Effort is underway to incorporate the SCDEs into NIH CDEs, and unique identifiers will be created for terms in the SCDEs.
Accessible: The NSRR provides open access to the entire study metadata, including study provenance information and sleep concepts, events, and variable specifications (see https://github.com/nsrr/cross-dataset-mapping/tree/master/mappings). Access to subject-level data and polysomnogram signal files is facilitated through an online-tracked process involving data use agreement and IRB as a part of the functionalities of the NSRR web portal. Tools for downloading large datasets are also provided by the NSRR team to facilitate access to a large collection of EDF files.
Interoperable: The NSRR uses standard formats for both the polysomnography and clinical data: polysomnography data in EDF, annotation in XML, and clinical data and data dictionaries in Comma Separated Values (CSV). The SCDEs reuse standard terminology from a variety of sources such as SNOMED CT and NIH CDEs. The interoperability between different clinical datasets is achieved by mapping individual datasets to SCDEs, and the mapping files are managed through the publicly available GitHub repository (https://github.com/nsrr/cross-dataset-mapping). The SCDEs also facilitate the exchange of PSG data in EDF: each EDF file derived from the PSGs has an accompanying Extensible Markup Language (XML) file using terms defined in SCDEs, documenting a wide scope of sleep events in the PSGs. This provides standardized interpretation of sleep events across different cohorts. Downstream tools for EDF signal analysis and visualization can all take advantage of the standardized sleep event descriptions given in such XML files.
Reusable: The NSRR data, metadata, and tools are shared through 3 main web resources to promote their reuse: sleepdata.org, github.com/nsrr, and github.com/sleepepi. They are well described and rich enough to support reuse, citation, and automated linkage, as demonstrated through cross-reference with BioLINCC (https://biolincc.nhlbi.nih.gov/studies) and the number of proposals and publications that cite their use of the NSRR resource.
Lessons learned
The NIH Data Commons is a vision laying out a set of coherent higher-level requirements that emphasize FAIR as the required properties for the “digital object compliance model” to fully leverage the computational power afforded by the cloud. However, it leaves open important questions such as how such a vision is to be achieved, who are to implement it, and where the digital object content comes from.
We believe that the Data Commons vision is best achieved by instantiating it through disease- or domain-specific efforts, such as the NCI Genomic Data Commons (GDC47,48). The GDC addresses unique computational challenges and provides researchers uniform analytic pipelines for bioinformatics processes. The NSRR, on the other hand, offers pragmatic experiences in answering questions related to how digital objects in sleep medicine can be developed toward the goal of a sleep data commons, where the digital contents come from, and the type of team composition suited for such a development.
The NSRR functional architecture was designed and implemented to support continuous data integration and sharing. This metadata-guided approach is adaptable to support other similar data integration and sharing needs. For instance, it has been successfully adapted and further enhanced to support prospective data capture, integration, and sharing for an ongoing multi-center project for epilepsy research,49,50 and current efforts are underway to incorporate data from independent research groups (eg Wisconsin Sleep Cohort).
The agile development paradigm, with the hallmark that requirements and solutions evolve through collaborative team efforts, has successfully enabled involvement of the key stakeholders in all phases of software development to ensure the usability of the NSRR web-based application tools. “Version-control everything,” including documentation, metadata, and code repository using GitHub51 turned out to be an effective means for managing the project and facilitating collaboration.
One of the non-technical challenges is obtaining agreements to share data through the NSRR from the “owners” of individual study cohorts. Additional details about this aspect appear in Supplementary Appendix A.
In a way, the NSRR is a data engineering experiment that confirms the notion of a rough 80-20 split data science between data readiness work and data analysis work52,53: higher than expected effort is required to make data accessible and reusable. Two non-technical strategies emerged as indispensable for the success of a project such as the NSRR: one is the team-science, trusted, collaborative, results-driven spirit growing from a long-term partnership among a group of sleep investigators, computer scientists, and informaticians; and the other is the equal partnership, shared leadership roles, and proper intellectual property ownership among the domain-experts from distinct individual disciplines.
Future directions
Shared resources offered by emerging resources such as cloud instances provide promising platforms for the Data Commons. However, simply expanding storage or adding computational power may not allow us to cope with the rapidly expanding volume and increasing complexity of biomedical data. Concurrent efforts must be spent to address digital object organization challenges. To make our approach future proof, we need to continue advancing research in data representation and interfaces for human-data interaction.
A possible next phase of the NSRR is the creation of a universal self-descriptive sequential data format. The idea is to break large, unstructured, sequential data files into minimal, semantically meaningful fragments.54 Such fragments can be indexed, assembled, retrieved, rendered, or repackaged on-the-fly, for multitudes of application scenarios. Data points in such a fragment will be locally embedded with relevant metadata labels, governed by terminology and ontology. Potential benefits of such an approach may include precise levels of data access, increased analysis readiness with on-the-fly data conversion, multi-level data discovery, and support for effective web-based visualization of contents in large sequential files.
CONCLUSIONS
In this paper, we introduced the NSRR, a data-sharing system aimed at fully supporting the FAIR principles, for integrating clinical data and physiological signal data from NIH-funded epidemiological cohort studies in sleep research. We believe that several aspects of the NSRR can help inform progress towards the implementation of a future-proof NIH Data Commons from a domain-specific, usability-informed, bottom-up perspective.
FUNDING
This work is supported by the National Heart, Lung, and Blood Institute (NHLBI R24 HL114473), the National Science Foundation under MRI Grant No.1626364, and the University of Kentucky Center for Clinical and Translational Science (UL1TR001998).
CONTRIBUTORS
GQZ and SR conceptualized and designed this study. RM, LC, ST, SM, and SP implemented the tools. LC, MK, and GQZ developed the sleep metadata framework. MR, RM, DM, SM, and SR processed the data. GQZ wrote the manuscript with contributions from LC, RM, ST, MT, MR, SM, and DM. SR and LC reviewed and contributed to formatting the manuscript.
Conflict of interest
None declared.
REFERENCES
1
Altevogt
BM
,
Colten
HR
, ed.
Sleep Disorders and Sleep Deprivation: An Unmet Public Health Problem
. Washington, DC:
National Academies Press
;
2006
.
2
Nieto
EJ
,
O’Connor
GT
,
Rapoport
DM
, et al. .
The sleep heart health study: design, rationale, and methods
.
Sleep
1997
;
20
12
:
1077
–
85
.
3
Redline
S
,
Sanders
MH
,
Lind
BK
, et al. .
Methods for obtaining and analyzing unattended polysomnography data for a multicenter study
.
Sleep
1998
;
21
7
:
759
–
68
.
4
Redline
S
,
Amin
R
,
Beebe
D
, et al. .
The Childhood Adenotonsillectomy Trial (CHAT): rationale, design, and challenges of a randomized controlled trial evaluating a standard surgical procedure in a pediatric population
.
Sleep
2011
;
34
11
:
1509
–
17
.
5
Marcus
CL
,
Moore
RH
,
Rosen
CL
, et al. .
A randomized trial of adenotonsillectomy for childhood sleep apnea
.
N Engl J Med
2013
;
368
25
:
2366
–
76
.
6
Gottlieb
DJ
,
Punjabi
NM
,
Mehra
R
, et al. .
CPAP versus oxygen in obstructive sleep apnea
.
N Engl J Med
2014
;
370
24
:
2276
–
85
.
7
Redline
S
,
Tishler
PV
,
Tosteson
TD
, et al. .
The familial aggregation of obstructive sleep apnea
.
Am J Respir Crit Care Med
1995
;
151
(
3 Pt 1
):
682
–
7
.
8
Redline
S
,
Tishler
PV
,
Schluchter
M
, et al. .
Risk factors for sleep-disordered breathing in children: associations with obesity, race, and respiratory problems
.
Am J Respir Crit Care Med
1999
;
159
(
5 Pt 1
):
1527
–
32
.
9
Cummings
SR
,
Black
DM
,
Nevitt
MC
, et al. .
Appendicular bone density and age predict hip fracture in women
.
JAMA
1990
;
263
5
:
665
–
8
.
10
Spira
AP
,
Blackwell
T
,
Stone
KL
, et al. .
Sleep-disordered breathing and cognition in older women
.
J Am Geriatr Soc
2008
;
56
1
:
45
–
50
.
11
Blank
JB
,
Cawthon
PM
,
Carrion-Petersen
ML
, et al. .
Overview of recruitment for the osteoporotic fractures in men study (MrOS)
.
Contemp Clin Trials
2005
;
26
5
:
557
–
68
.
12
Orwoll
E
,
Blank
JB
,
Barrett-Connor
E
, et al. .
Design and baseline characteristics of the osteoporotic fractures in men (MrOS) study—a large observational study of the determinants of fracture in older men
.
Contemp Clin Trials
2005
;
26
5
:
569
–
85
.
13
Rosen
CL
,
Larkin
EK
,
Kirchner
HL
, et al. .
Prevalence and risk factors for sleep-disordered breathing in 8-to 11-year-old children: association with race and prematurity
.
J Pediatr
2003
;
142
4
:
383
–
9
.
14
Spilsbury
JC
,
Storfer-Isser
A
,
Drotar
D
, et al. .
Effects of the home environment on school-aged children's sleep
.
Sleep
2005
;
28
11
:
1419
–
27
.
15
Hibbs
AM
,
Storfer-Isser
A
,
Rosen
C
, et al. .
Advanced sleep phase in adolescents born preterm
.
Behav Sleep Med
2014
;
12
5
:
412
–
24
.
16
Redline
S
,
Sotres-Alvarez
D
,
Loredo
J
, et al. .
Sleep-disordered breathing in Hispanic/Latino individuals of diverse backgrounds. The Hispanic community health study/study of Latinos
.
Am J Respir Crit Care Med
2014
;
189
3
:
335
–
44
.
17
Patel
SR
,
Weng
J
,
Rueschman
M
, et al. .
Reproducibility of a standardized actigraphy scoring algorithm for sleep in a US Hispanic/Latino population
.
Sleep
2015
;
38
9
:
1497
–
503
.
18
Foley
DJ
,
Masaki
K
,
White
L
, et al. .
Sleep-disordered breathing and cognitive impairment in elderly Japanese-American men
.
Sleep
2003
;
26
5
:
596
–
9
.
19
Dean
DA
,
Goldberger
AL
,
Mueller
R
, et al. .
Scaling up scientific discovery in sleep medicine: the National Sleep Research Resource
.
Sleep
2016
;
39
5
:
1151
–
64
.
20
Sempos
CT
,
Bild
DE
,
Manolio
TA.
Overview of the Jackson Heart Study: a study of cardiovascular diseases in African American men and women
.
Am J Med Sci
1999
;
317
3
:
142
–
6
.
24
Wilkinson
MD
,
Dumontier
M
,
Aalbersberg
IJ
, et al. .
The FAIR Guiding Principles for scientific data management and stewardship
.
Sci Data
2016
;
3
:
26
biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE). https://biocaddie.org (Accessed July 25, 2017).
29
Arabandi
S
,
Ogbuji
C
,
Redline
S
, et al. .
Developing a sleep domain ontology
.
AMIA Jt Summits Transl Sci Proc
2010
:
12
–
3
.
30
Zhang
GQ
,
Siegler
T
,
Saxman
P
, et al. .
VISAGE: a query interface for clinical research
.
AMIA Jt Summits Transl Sci Proc
2010
;
2010
:
76
–
80
.
31
Mueller
R
,
Sahoo
S
,
Dong
X
, et al. .
Mapping multi-institution data sources to domain ontology for data federation: the Physio-MIMI approach
.
AMIA Jt Summits Transl Sci Proc
2011
: 38.
32
Costa
M
,
Moody
GB
,
Henry
I
, et al. .
PhysioNet: an NIH research resource for complex signals
.
J Electrocardiol
2003
;
36
:
139
–
44
.
38
Jayapandian
CP
,
Wang
W
,
Morrical
MG
, et al. .
RREV: reconfigurable rendering engine for visualization of clinically annotated polysomnograms
.
IEEE Int Conf Bioinformatics Biomed Proc
2015
:
309
–
16
.
43
Purcell
SM
,
Manoach
DS
,
Demanuele
C
, et al. .
Characterizing sleep spindles in 11,630 individuals from the National Sleep Research Resource
.
Nat Comms
2017
;
8
:
44
National Institutes of Health, National Heart, Lung, and Blood Institute
. Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). https://biolincc.nhlbi.nih.gov/home/ (Accessed July 25, 2017).
45
Mailman
MD
,
Feolo
M
,
Jin
Y
, et al. .
The NCBI dbGaP database of genotypes and phenotypes
.
Nat Genet
2007
;
39
10
:
46
Winslow
RL
,
Saltz
J
,
Foster
I
, et al. .
The CardioVascular Research Grid (CVRG) project
.
AMIA Jt Summits Transl Sci Proc
2011
:
77
–
81
.
47
Grossman
RL
,
Heath
AP
,
Ferretti
V
, et al. .
Toward a shared vision for cancer genomic data
.
N Engl J Med
2016
;
375
12
:
1109
–
12
.
48
Jensen
MA
,
Ferretti
V
,
Grossman
RL
, et al. .
The NCI Genomic Data Commons as an engine for precision medicine
.
Blood
2017
;
130
4
:
453
–
9
.
49
Zhang
GQ
,
Cui
L
,
Lhatoo
S
, et al. .
MEDCIS: multi-modality epilepsy data capture and integration system
.
AMIA Annu Symp Proc
2014
;
2014
:
1248
–
57
.
50
Cui
L
,
Huang
Y
,
Tao
S
, et al. .
ODaCCI: Ontology-guided Data Curation for Multisite Clinical Research Data Integration in the NINDS Center for SUDEP Research
.
AMIA Annu Symp Proc
2016
;
2016
:
441
–
50
.
54
Li
X
,
Cui
L
,
Tao
S
, et al. .
SpindleSphere: a web-based platform for large-scale sleep spindle analysis and visualization
.
AMIA Annu Symp Proc
2017
:
1159
–
68
.
© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contactjournals.permissions@oup.com
Supplementary data
Citations
Views
Altmetric
Metrics
Total Views 16,438
13,244 Pageviews
3,194 PDF Downloads
Since 5/1/2018
Month: | Total Views: |
---|---|
May 2018 | 12 |
June 2018 | 2,076 |
July 2018 | 1,851 |
August 2018 | 1,660 |
September 2018 | 204 |
October 2018 | 307 |
November 2018 | 189 |
December 2018 | 78 |
January 2019 | 46 |
February 2019 | 59 |
March 2019 | 95 |
April 2019 | 75 |
May 2019 | 76 |
June 2019 | 62 |
July 2019 | 66 |
August 2019 | 67 |
September 2019 | 72 |
October 2019 | 68 |
November 2019 | 83 |
December 2019 | 59 |
January 2020 | 98 |
February 2020 | 90 |
March 2020 | 59 |
April 2020 | 49 |
May 2020 | 57 |
June 2020 | 67 |
July 2020 | 66 |
August 2020 | 102 |
September 2020 | 113 |
October 2020 | 135 |
November 2020 | 127 |
December 2020 | 129 |
January 2021 | 94 |
February 2021 | 87 |
March 2021 | 149 |
April 2021 | 130 |
May 2021 | 115 |
June 2021 | 123 |
July 2021 | 114 |
August 2021 | 124 |
September 2021 | 201 |
October 2021 | 163 |
November 2021 | 137 |
December 2021 | 130 |
January 2022 | 135 |
February 2022 | 149 |
March 2022 | 227 |
April 2022 | 214 |
May 2022 | 166 |
June 2022 | 156 |
July 2022 | 132 |
August 2022 | 157 |
September 2022 | 230 |
October 2022 | 158 |
November 2022 | 189 |
December 2022 | 154 |
January 2023 | 209 |
February 2023 | 140 |
March 2023 | 213 |
April 2023 | 186 |
May 2023 | 179 |
June 2023 | 190 |
July 2023 | 147 |
August 2023 | 210 |
September 2023 | 158 |
October 2023 | 237 |
November 2023 | 190 |
December 2023 | 220 |
January 2024 | 211 |
February 2024 | 214 |
March 2024 | 256 |
April 2024 | 324 |
May 2024 | 249 |
June 2024 | 207 |
July 2024 | 205 |
August 2024 | 210 |
September 2024 | 233 |
October 2024 | 119 |
Citations
367 Web of Science
×
Email alerts
Citing articles via
More from Oxford Academic