Challenges in the Setup of Large-scale Next-Generation Sequencing Analysis Workflows

A Study of Next Generation Sequencing Data, Workflow, Application and Platform Comparison

IOP Conference Series: Materials Science and Engineering, 2018

DNA sequencing determines the precise order of nucleotides within a DNA molecule. Next-generation sequencing (NGS) is an efficient, massively parallel, high-throughput DNA sequencing technology that is revolutionizing genomic research. The high cost of earlier methods gave rise to a variety of sequence comparison techniques. NGS enables faster detection of variants in the human genome and thus accelerates the detection of diseases such as cancer and hepatitis. This paper discusses the data-analysis workflow of NGS; the different NGS platforms, their applications, and different sequence comparison techniques are also covered.
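
The abstract above surveys the NGS data-analysis workflow at a high level. As a minimal, illustrative sketch of the secondary-analysis stage such a workflow typically contains (alignment, sorting, variant calling), the following Python script shells out to a common open-source toolchain; the tool choices (bwa, samtools, bcftools) and all file names are assumptions for illustration, not taken from the paper.

```python
import subprocess

def run(cmd):
    """Run one pipeline step as a shell command and fail loudly on error."""
    print(f"[pipeline] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def call_variants(ref, fq1, fq2, prefix):
    # Align paired-end reads against the reference genome.
    run(f"bwa mem {ref} {fq1} {fq2} > {prefix}.sam")
    # Coordinate-sort and index the alignments.
    run(f"samtools sort -o {prefix}.bam {prefix}.sam")
    run(f"samtools index {prefix}.bam")
    # Pile up bases and call SNVs/indels into a VCF.
    run(f"bcftools mpileup -f {ref} {prefix}.bam | bcftools call -mv -o {prefix}.vcf")

if __name__ == "__main__":
    call_variants("ref.fa", "sample_R1.fastq", "sample_R2.fastq", "sample")
```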

Implementation of Cloud based Next Generation Sequencing data analysis in a clinical laboratory

BMC Research Notes, 2014

Background: The introduction of next generation sequencing (NGS) has revolutionized molecular diagnostics, though several challenges remain that limit the widespread adoption of NGS testing in clinical practice. One such difficulty is the development of a robust bioinformatics pipeline that can handle the volume of data generated by high-throughput sequencing in a cost-effective manner. Analysis of sequencing data typically requires a substantial level of computing power that is often cost-prohibitive for most clinical diagnostics laboratories. Findings: To address this challenge, our institution has developed a Galaxy-based data analysis pipeline that relies on a web-based, cloud-computing infrastructure to process NGS data and identify genetic variants. It provides the additional flexibility needed to control storage costs, resulting in a pipeline that is cost-effective on a per-sample basis. It does not require the use of an EBS disk to run a sample. Conclusions: We demonstrate the validation and feasibility of implementing this bioinformatics pipeline in a molecular diagnostics laboratory. Four samples were analyzed in duplicate pairs and showed 100% concordance in the mutations identified. The pipeline is currently being used in the clinic, and all identified pathogenic variants are confirmed using Sanger sequencing, further validating the software.
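
The validation described above hinges on duplicate samples yielding identical variant calls. A minimal sketch of such a concordance check, assuming plain VCF output and hypothetical file names (the paper does not specify how the comparison was computed):

```python
def load_variants(vcf_path):
    """Collect (chrom, pos, ref, alt) tuples from a VCF, skipping header lines."""
    variants = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            variants.add((chrom, int(pos), ref, alt))
    return variants

def concordance(vcf_a, vcf_b):
    """Fraction of the union of calls shared by both replicates (1.0 = 100%)."""
    a, b = load_variants(vcf_a), load_variants(vcf_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

print(f"concordance: {concordance('rep1.vcf', 'rep2.vcf'):.1%}")
```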

SMITH: a LIMS for handling next-generation sequencing workflows

BMC Bioinformatics, 2014

Life-science laboratories make increasing use of Next Generation Sequencing (NGS) for studying bio-macromolecules and their interactions. Array-based methods for measuring gene expression or protein-DNA interactions are being replaced by RNA-Seq and ChIP-Seq. Sequencing is generally performed by specialized facilities that have to keep track of sequencing requests, trace samples, ensure quality, and make data available according to predefined privileges. An integrated tool helps to troubleshoot problems, to maintain a high quality standard, and to reduce time and costs. Commercial and non-commercial tools called LIMS (Laboratory Information Management Systems) are available for this purpose. However, they often come at prohibitive cost and/or lack the flexibility and scalability needed to adjust seamlessly to the frequently changing protocols employed. In order to manage the flow of sequencing data produced at the Genomic Unit of the Italian Institute of Technology (IIT), we developed SMITH.
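
To make the sample-tracking role of a LIMS like SMITH concrete, here is a toy sketch of state tracking with an audit trail; the state names and API are invented for illustration, not SMITH's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Invented facility workflow states, in order.
STATES = ["requested", "received", "qc_passed", "sequenced", "delivered"]

@dataclass
class Sample:
    sample_id: str
    requester: str
    status: str = "requested"
    history: list = field(default_factory=list)

    def advance(self, new_status):
        """Move the sample one step forward, recording an audit trail."""
        if STATES.index(new_status) != STATES.index(self.status) + 1:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.history.append((self.status, datetime.now().isoformat()))
        self.status = new_status

s = Sample("NGS-0001", "lab_a")
s.advance("received")
s.advance("qc_passed")
print(s.status, s.history)
```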

Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses

Journal of Biomedical Informatics, 2014

With the coming deluge of genome data, the need to store and process large-scale genome data, provide easy access to biomedical analysis tools, and share and retrieve data efficiently presents significant challenges. The variability in data volume results in variable computing and storage requirements; biomedical researchers are therefore pursuing more reliable, dynamic, and convenient methods for conducting sequencing analyses. This paper proposes a Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses, which enables reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Our platform extends the existing Galaxy workflow system by adding data management capabilities for transferring large quantities of data efficiently and reliably (via Globus Transfer), domain-specific analysis tools preconfigured for immediate use by researchers (via user-specific tool integration), automatic deployment on the Cloud for on-demand resource allocation and pay-as-you-go pricing (via Globus Provision), a Cloud provisioning tool for auto-scaling (via the HTCondor scheduler), and support for validating the correctness of workflows (via semantic verification tools). Two bioinformatics workflow use cases as well as a performance evaluation are presented to validate the feasibility of the proposed approach.
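
One of the capabilities listed above is auto-scaling of Cloud workers against the job queue. Here is a toy sketch of the kind of scaling policy such a provisioner might apply; the thresholds, bounds, and simulated queue are invented, and the platform itself uses Globus Provision and the HTCondor scheduler rather than this loop.

```python
import random

JOBS_PER_WORKER = 4            # invented target load per worker node
MIN_WORKERS, MAX_WORKERS = 1, 20

def autoscale(queued):
    """Worker count the pool should converge to for the current queue depth;
    pay-as-you-go pricing means idle nodes above the target are released."""
    wanted = -(-queued // JOBS_PER_WORKER)        # ceiling division
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))

# Simulated control loop: a real system would query the scheduler and
# start/stop cloud instances instead of printing.
for tick in range(5):
    queued = random.randint(0, 60)
    print(f"t={tick}: {queued} queued jobs -> {autoscale(queued)} workers")
```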

Clinical Integration of Next-Generation Sequencing Technology

Clinics in Laboratory Medicine, 2012

Recent technological advances in Next Generation Sequencing (NGS) methods have substantially reduced cost and operational complexity, leading to the production of benchtop sequencers and commercial software solutions for implementation in small research and clinical laboratories. This chapter summarizes requirements and hurdles to the successful implementation of these systems, including 1) calibration, validation and optimization of the instrumentation, experimental paradigm and primary readout, 2) secure transfer, storage and secondary processing of the data, 3) implementation of software tools for targeted analysis, and 4) training of research and clinical personnel to evaluate data fidelity and interpret the molecular significance of the genomic output. In light of the commercial and technological impetus to bring NGS technology into the clinical domain, it is critical that novel tests incorporate rigid protocols with built-in calibration standards and that data transfer and processing occur under exacting security measures for interpretation by clinicians with specialized training in molecular diagnostics.
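
Point 2 above calls for secure, reliable transfer of sequencing data. One standard safeguard is checksum verification of every file before secondary processing; the sketch below shows this with SHA-256, where the function names and digest workflow are illustrative rather than taken from the chapter.

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream a potentially very large sequencing file through SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(local_file, expected_digest):
    """Confirm a transferred file arrived intact before it enters analysis."""
    actual = sha256sum(local_file)
    if actual != expected_digest:
        raise IOError(f"{local_file}: checksum mismatch ({actual} != {expected_digest})")

# verify_transfer("run_001/reads.fastq.gz", "<digest published by the sending site>")
```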

SLIMS: a LIMS for handling next-generation sequencing workflows

EMBnet.journal, 2013

Next-generation sequencing (NGS) is becoming a standard method in modern life-science laboratories for studying biomacromolecules and their interactions. Methods such as RNA-Seq and DNA resequencing are replacing array-based methods that dominated the last decade. A sequencing facility needs to keep track of requests, requester details, reagent barcodes, sample tracing and monitoring, quality controls, data delivery, creation of workflows for customised data analysis, privileges of access to the data, customised reports etc. An integrated software tool to handle these tasks helps to troubleshoot problems quickly, to maintain a high quality standard, and to reduce time and costs needed for data production. Commercial and non-commercial tools called LIMS (Laboratory Information Management Systems) are available for this purpose. However, they often come at prohibitive cost and/or lack the flexibility and scalability needed to adjust seamlessly to the frequently changing protocols employed. In order to manage the flow of sequencing data produced at the IIT Genomic Unit, we developed SLIMS (Sequencing LIMS).
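
Among the duties listed is enforcing privileges of access to the data. A minimal sketch of such a per-dataset access check, with an invented in-memory store standing in for the LIMS database:

```python
# Invented example records; a real LIMS would keep these in its database.
PRIVILEGES = {
    "run_042/sample_007.fastq.gz": {"owner": "lab_b", "shared_with": {"core_facility"}},
}

def can_access(user_group, dataset):
    """Grant access only to the owning group or groups the data was shared with."""
    entry = PRIVILEGES.get(dataset)
    if entry is None:
        return False
    return user_group == entry["owner"] or user_group in entry["shared_with"]

assert can_access("lab_b", "run_042/sample_007.fastq.gz")
assert not can_access("lab_c", "run_042/sample_007.fastq.gz")
```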

From sequencer to supercomputer: an automatic pipeline for managing and processing next generation sequencing data

AMIA Joint Summits on Translational Science Proceedings, 2012

Next Generation Sequencing is highly resource intensive. NGS tasks related to data processing, management and analysis require high-end computing servers or even clusters. Additionally, processing NGS experiments requires suitable storage space and significant manual interaction. At The Ohio State University's Biomedical Informatics Shared Resource, we designed and implemented a scalable architecture to address the challenges associated with the resource-intensive nature of NGS secondary analysis, built around Illumina Genome Analyzer II sequencers and Illumina's Gerald data processing pipeline. The software infrastructure includes a distributed computing platform consisting of a LIMS called QUEST (http://bisr.osumc.edu), an Automation Server, a computer cluster for processing NGS pipelines, and a network-attached storage device expandable up to 40TB. The system has been architected to scale to multiple sequencers without requiring additional computing or labor resources. Thi...
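
The core of such an automation server is noticing when a sequencer finishes writing a run folder and handing it to the cluster. A minimal polling sketch, assuming a marker file signals run completion; the marker name, paths, and submission stub are assumptions, and the completion signal varies by instrument and software version.

```python
import time
from pathlib import Path

RUNS_DIR = Path("/sequencer/runs")   # network share written by the instrument
MARKER = "RTAComplete.txt"           # assumed completion marker file
submitted = set()

def submit_secondary_analysis(run_dir):
    """Stand-in for submitting the run to the cluster (e.g. via the scheduler)."""
    print(f"submitting {run_dir.name} for processing")

while True:
    for run_dir in RUNS_DIR.iterdir():
        if run_dir.is_dir() and (run_dir / MARKER).exists() and run_dir.name not in submitted:
            submit_secondary_analysis(run_dir)
            submitted.add(run_dir.name)
    time.sleep(60)  # poll once a minute; a production system also handles retries
```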

Closha: bioinformatics workflow system for the analysis of massive sequencing data

BMC Bioinformatics, 2018

While next-generation sequencing (NGS) costs have fallen in recent years, the cost and complexity of computation remain substantial obstacles to the use of NGS in biomedical care and genomic research. The rapidly increasing amounts of data available from the new high-throughput methods have made data processing infeasible without automated pipelines. The integration of data and analytic resources into workflow systems provides a solution to the problem by simplifying the task of data analysis. To address this challenge, we developed a cloud-based workflow management system, Closha, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows making optimal use of high-performance computing clusters. Closha allows users to create multi-step analyses using drag-and-drop functionality and to modify the parameters of pipeline tools. Users can also import Galaxy pipelines into Closha. Closha is a hybrid system that enables users to use both a...
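
Multi-step pipelines of the kind Closha lets users assemble via drag and drop reduce to a dependency graph executed in topological order. A toy sketch of that idea, with invented steps and parameters rather than Closha's actual workflow format:

```python
# Each step lists the steps it depends on and its user-tunable parameters.
WORKFLOW = {
    "trim":  {"cmd": "trim reads",    "after": [],        "params": {"min_qual": 20}},
    "align": {"cmd": "align reads",   "after": ["trim"],  "params": {"threads": 8}},
    "call":  {"cmd": "call variants", "after": ["align"], "params": {"min_depth": 10}},
}

def run_workflow(workflow):
    """Run every step whose dependencies are done until the graph is exhausted."""
    done = set()
    while len(done) < len(workflow):
        for name, step in workflow.items():
            if name not in done and all(dep in done for dep in step["after"]):
                print(f"running {name}: {step['cmd']} with {step['params']}")
                done.add(name)

run_workflow(WORKFLOW)
```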

Next Generation Sequencing – Algorithms, and Software For Biomedical Applications

2016

Next Generation Sequencing (NGS) data have begun to appear in many clinically relevant applications, such as resequencing of cancer patients, disease-gene discovery and diagnostics for rare diseases, microbiome analyses, and gene expression profiling. The analysis of sequencing data is demanding because of the enormous data volume and the need for fast turnaround time, accuracy, reproducibility, and data security. This Dagstuhl Seminar aimed at a free and deep exchange of ideas and needs between the communities of algorithmicists and theoreticians on the one side and practitioners from the biomedical field on the other. It identified several relevant fields, such as data structures and algorithms for large data sets, hardware acceleration, and new problems in the upcoming age of genomes, which were discussed in breakout groups.

Seminar: August 28 to September 2, 2016 – http://www.dagstuhl.de/16351
1998 ACM Subject Classification: D.2.11 Software Architectures, D.2.13 Reusable Software, D.2.2 Design Tools and Techniques