Sharing individual level data from observational studies and clinical trials: a perspective from NHLBI (original) (raw)

A brief history of the NHLBI data repository

A formal data sharing policy, applicable to Institute-supported contract studies, was first established by the Institute in 1989 under then Director Claude L’Enfant. This first data-sharing policy timed release of data to within 3 years of ‘major publications’ and described generally the data that were to be removed from a study dataset prior to release. Although a formal data-sharing policy for contract-supported studies had been established, the availability of these datasets was not well-known, there was a lack of guidance on what process should be used to distribute datasets, and it was unclear as to what data should be made available. A revised policy, incorporating specific data release timelines, guidelines for data submission, and the data request process was developed in 1999. Since all of the data for the data repository were derived from human subjects in both ongoing and closed studies, the policy was submitted to the NHLBI Institutional Review Board (IRB) as a formal protocol and was first approved by the NHLBI IRB in 2000. The NHLBI IRB continues today in its oversight role to annually review the activities of the data repository.

The current data-sharing policy

The data-sharing protocol has evolved over time, expanding to include grant-supported studies as well as contract-supported studies. Clarification of the common rule regarding local IRB review for secondary analysis of existing data provided additional guidance on how requests for data should be reviewed, and studies depositing data into the repository are now required to certify that the data can be shared with outside investigators.

The data repository protocol can be divided into three aspects: rules for when data will be released and what data to include, rules for submitting data to the repository, and rules governing how data will be shared.

Data and timelines for release

The data released to the general research community through the data repository consist of all data collected during the conduct of the study including: baseline, interim visit(s), laboratory measurements, and outcome data. To minimize re-identification risks and provide reasonable protection for participant privacy, the de-identification procedures for data included in the repository generally follow the HIPAA privacy act [[14](/article/10.1186/1745-6215-14-201#ref-CR14 "Summary of the HIPAA Privacy Rule. U.S. Department of Health and Human Services: 2012, http://www.hhs.gov/ocr/privacy/hipaa/understanding/summary/

              ,")\], these include, but are not limited to: removal of all obvious identifiers (name, address, social security number, date of birth), removal of all dates and replacement with time intervals (that is, time since randomization or time since enrollment), and geography is generally removed. Other redactions are also encouraged such as top/bottom coding of height and weight, categorizing traits such as race/ethnicity, marital status, education, income, and so on, to reduce any re-identification risk. Sensitive items such as sexual behaviors or illicit drug use may be removed from the data when these data elements are not part of the primary focus of the study; therefore, as opposed to ‘rules’ for de-identification, the redaction process is more aptly described as following ‘guidelines’, in which considerable judgment on the part of the study is needed to maintain scientific utility and also protection of participant privacy. The goal is to minimize re-identification risk while maximizing secondary use of the redacted datasets. Over-redaction can result in a dataset with low utility.

Datasets are released following a standard timeline. Study investigators are given a protected period of time (2 years) to prepare key manuscripts from the study. In this manner the data are made available to the general research community within a reasonable time frame and provides the investigators who collected the data time to publish study results.

Data collection for clinical trials is generally predefined with specific starting and ending dates; however, observational studies may collect data over many years. Therefore, the protocol defines data release in a slightly different manner depending on the study type. Specifically, the protocol states that data from observational studies are made available: (1) 3 years after the last participant visit for a clinic exam cycle or close out date for event ascertainment; or (2) 2 years after the exam or surveillance data are made available for within study use. Ongoing observational studies will therefore tend to have periodic updates to their data. Data from clinical trials are made available 2 years after publication of the primary outcome paper or 3 years after the end of clinical activity.

It is important to note that most large clinical trials and observational studies have active publications/presentation committees that are open to collaborations with outside investigators. These studies are essentially open to immediate data-sharing using study-specific internal policies.

Submitting data to the repository

Release of data through the data repository is required for Institute-supported contract studies and may also be required for certain large grants or cooperative agreements. The inclusion of grant-supported studies was introduced in 2005 and applies to studies reviewed and initiated after October 1, 2005. In brief, grant-supported studies with direct costs equal to or greater than $500K in any 1 year and identified as being of high programmatic interest, along with cooperative agreements with 500 or more participants are required to submit data to the data repository as part of the grant award.

All data submitted to the data repository must be consistent with the informed consent. Participants explicitly requesting that their data not be shared must be removed from all repository datasets and any other consent restrictions must be noted. For example, the informed consent for the Hemochromatosis and Iron Overload Screening Study (HEIRS) indicated that participants’ data would only be used for ‘iron-related and hereditary hemochromatosis studies’. Therefore, any requests for HEIRS study data must have an iron-related focus.

Following on the database of Genotypes and Phenotypes (dbGaP) experience, institutions submitting data to the data repository are required to certify that the data can be shared. In brief, the certification affirms to the NHLBI that the informed consent explicitly permits sharing data with other investigators, or the depositing Institution’s IRB has approved the sharing of the participants’ data.

Accessing data in the repository

Applications for study datasets in the Data Repository are done online at http://www.biolincc.nhlbi.nih.gov. The website is maintained by the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) program. The online application for study datasets include: (1) a description of the research project; (2) local IRB review or institutional certification of exemption from IRB review; and (3) agreeing to the terms and conditions of a data use agreement (termed a Research Materials Distribution Agreement (RMDA) since both biospecimens and data can be obtained from the NHLBI repositories). Terms of the RMDA include provisions limiting use of the data to 3 years, prohibits transfer of the data to another party, requires appropriate computer security measures, and specifies the acknowledgment to be used in manuscripts. Both the investigator and the Institution are required to sign and agree to the terms of the RMDA. Applicants are not subject to a scientific review, but instead an administrative review is conducted to ensure the project is within the spirit and intent of the data agreement. Likewise, there is no requirement for screening manuscripts prior to submission to a journal. The absence of a scientific review or pre-submission review of manuscripts avoids any perception of blocking access or controlling research; however, this does allow for the possibility of misleading or incorrect analyses to be published. The possibility of intentionally misleading analyses is mitigated to some degree through the open nature of the repository in that replication is possible with the identical dataset. Avoiding unintentionally incorrect analyses is far more difficult and is largely only correctable post publication.

A summary of the clinical data and biospecimen resources available are provided in Additional file 1: Table S1.