From sequencer to supercomputer: an automatic pipeline for managing and processing next generation sequencing data

miCloud: a plug and play, on-premises bioinformatics cloud, providing seamless integration with Illumina genome sequencers

2017

Benchtop genome sequencers such as the Illumina MiSeq or MiniSeq [1], [2] are revolutionizing genomics research for smaller, independent laboratories by enabling in-house access to low-cost Next Generation Sequencing (NGS) technology. These benchtop instruments require only standard laboratory equipment and minimal sample-preparation time. However, post-sequencing bioinformatics data analysis still presents a significant bottleneck for research laboratories lacking specialized software and technical data analysis skills on their teams. While bioinformatics compute clouds offering solutions under a Software as a Service (SaaS) model are available ([3]-[6]; reviewed in [7]), currently only a few options are user-friendly for non-experts while also being low-cost or free. One primary example is Illumina BaseSpace [8], which is easy for non-experts to access and offers an integrated solution where data are streamed directly from the MiSeq sequencing instrument to the cloud. Once the data are on the BaseSpace cloud, users can access a range of bioinformatics applications with pre-installed algorithms through an intuitive web interface. Nonetheless, BaseSpace can be a costly solution, as a yearly subscription ranges in price from $999 to $4,999 depending on whether the user is associated with an academic or private institution. Frequent users who exhaust the base credit allowance included in the subscription may also need to purchase additional "iCredits" [9]. Considering the reduction in computer hardware costs in recent years, a multi-core Intel Xeon server with 64 gigabytes (GB) of memory and multiple terabytes (TB) of storage is priced at less than the yearly subscription to BaseSpace [10], and compares similarly against renting compute cycles from providers such as Amazon Web Services (AWS) [11].
Furthermore, the current generation of laptops usually comes with 6-10 gigabytes (GB) of memory and 1 terabyte (TB) of storage, providing enough computational capacity to analyze data from small NGS experiments [12] that include only a few samples. We developed miCloud, a bioinformatics platform for NGS data analysis, as a solution to bridge the gap between these low-cost, widely available computational resources and the lack of user-friendly bioinformatics software. Laboratories lacking NGS data analysis expertise can easily use this platform to analyze data generated by in-house sequencing instruments or external service providers. miCloud is highly modular and is based on Docker containers [13], with components such as the user interface, file manager, and pre-configured data analysis tools.
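The cost argument above can be made concrete with a back-of-the-envelope comparison. The function below is an illustrative sketch: the server price is a hypothetical round figure, not a quote, while the subscription tiers follow the $999-$4,999 range cited for BaseSpace.

```python
# Illustrative break-even estimate: cloud subscription vs. owning a server.
# The $4,500 server price is an assumed round number for this sketch.

def breakeven_years(server_cost, yearly_subscription):
    """Years of subscription fees needed to match a one-time server purchase."""
    return server_cost / yearly_subscription

low = breakeven_years(4500, 4999)   # vs. the high-end subscription tier
high = breakeven_years(4500, 999)   # vs. the academic subscription tier

print(f"Break-even after {low:.1f} to {high:.1f} years of subscription fees")
```

Under these assumed figures, an owned server pays for itself within one to five years of avoided subscription fees, before accounting for electricity and maintenance, which the sketch deliberately ignores.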

Challenges in the Setup of Large-scale Next-Generation Sequencing Analysis Workflows

Computational and Structural Biotechnology Journal, 2017

While Next-Generation Sequencing (NGS) can now be considered an established analysis technology for research applications across the life sciences, the analysis workflows still require substantial bioinformatics expertise. Typical challenges include the appropriate selection of analytical software tools, the speedup of the overall procedure using HPC parallelization and acceleration technology, the development of automation strategies, data storage solutions and finally the development of methods for full exploitation of the analysis results across multiple experimental conditions. Recently, NGS has begun to expand into clinical environments, where it facilitates diagnostics enabling personalized therapeutic approaches, but is also accompanied by new technological, legal and ethical challenges. There are probably as many overall concepts for the analysis of the data as there are academic research institutions. Among these concepts are, for instance, complex IT architectures develope...

Sustainable software development for next-gen sequencing (NGS) bioinformatics on emerging platforms

Background and Motivation The advent of Next Generation Sequencing (NGS) technology more than half a decade ago produced a deluge of DNA sequence data, completely altering the scale at which sequence data are generated and dramatically reducing sequencing costs. Both the increase in the sheer volume of data being produced, fed by the continually increasing throughput and decreasing costs of sequencing technologies, and the unique and evolving characteristics of NGS data itself continue to fuel the development of countless new analysis tools. At the same time, the pace of development of novel hardware platforms has seen a resurgence. As the 'Moore's Law' of computational progress plateaus in single-chip performance, we have seen the emergence of (or renewed focus on) accelerated platforms such as General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and massively multi-threaded, many-core processors. These accelerators supplement the democratized access to computing infrastructure made possible through the Cloud and commodity clusters. They offer several benefits over traditional single-processor machines, including large memory bandwidth, massive parallelism, elastic scaling, and a small energy footprint, although each technology presents its own trade-offs among these benefits. Accelerators are also becoming more affordable, to the extent that GPGPUs are routinely found in scientific workstations and are being embedded within processors. The rapid pace of hardware innovation, however, has not been adequately supported by accessible middleware and software kernels that would make this cyber-infrastructure more widely usable. The challenges stretch from the learning curve of specialized programming models and abstractions to the need for novel algorithms that leverage these accelerators effectively. Consequently, the impact of accelerators on scientific and 'Big Data' computing has not been as transformative as their potential suggests.
Indeed, for a rapidly changing domain such as NGS, where data often lies unused due to the lack of scalable analytical capabilities, the use of accelerated cyber-infrastructure supported by sustainable software infrastructure is vital to spur scientific innovation. Unlocking the inner workings of genomic sequences is at the heart of many grand challenge problems in biological research. The growth in available sequence data engendered by NGS technology holds promise of solutions to many of these problems; however, the

A case study for cloud based high throughput analysis of NGS data using the globus genomics system

Computational and Structural Biotechnology Journal, 2015

Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research.
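The elastic scaling behaviour described above can be sketched as a toy provisioning rule: size the cluster to the queue of pending workflows, up to a cap. The per-node capacity, cap, and queue model below are illustrative assumptions, not the actual Globus Genomics provisioner.

```python
import math

def nodes_needed(pending_workflows, workflows_per_node, max_nodes):
    """Toy elastic-scaling rule: provision enough nodes to drain the
    pending-workflow queue, capped at a maximum cluster size."""
    if pending_workflows <= 0:
        return 0  # scale to zero when the queue is empty
    return min(max_nodes, math.ceil(pending_workflows / workflows_per_node))

# With an assumed 4 parallel workflows per node and a 10-node cap:
print(nodes_needed(0, 4, 10))    # 0  -> idle, no nodes held
print(nodes_needed(9, 4, 10))    # 3  -> ceil(9 / 4)
print(nodes_needed(100, 4, 10))  # 10 -> capped at the cluster limit
```

A real provisioner also has to weigh spin-up latency and billing granularity (e.g. per-hour instances), which is why production systems typically keep a warm pool rather than scaling strictly to the instantaneous queue as this sketch does.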

An automated infrastructure to support high-throughput bioinformatics

2014 International Conference on High Performance Computing & Simulation (HPCS), 2014

The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters and ensuring reproducibility, error control and usability by nontechnical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.

Experiences building Globus Genomics: a next-generation sequencing analysis service using Galaxy, Globus, and Amazon Web Services

Concurrency and Computation: Practice and Experience, 2014

We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multistep processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large next-generation sequencing datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.

DDBJ Read Annotation Pipeline: A Cloud Computing-Based Pipeline for High-Throughput Analysis of Next-Generation Sequencing Data

DNA Research, 2013

High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. However, the immense amount of sequence data requires computational skills and suitable hardware resources that are a challenge for molecular biologists. The DNA Data Bank of Japan (DDBJ) of the National Institute of Genetics (NIG) has initiated a cloud computing-based analytical pipeline, the DDBJ Read Annotation Pipeline (DDBJ Pipeline), for high-throughput annotation of NGS reads. The DDBJ Pipeline offers a user-friendly graphical web interface and processes massive NGS datasets through decentralized processing on NIG supercomputers, currently free of charge. The pipeline consists of two analysis components: basic analysis for reference genome mapping and de novo assembly, and subsequent high-level analysis of structural and functional annotations. Users may smoothly switch between the two components, facilitating web-based operations on a supercomputer for high-throughput data analysis. Moreover, public NGS reads from the DDBJ Sequence Read Archive, located on the same supercomputer, can be imported into the pipeline by entering only an accession number. The pipeline will thus facilitate research by applying unified analytical workflows to NGS data. The DDBJ Pipeline is accessible at http://p.ddbj.nig.ac.jp/.
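Importing public reads by accession number, as described above, starts with validating that the user's input looks like a run accession before resolving it against the archive. The pattern below is a simplified sketch of the common DRR/SRR/ERR run-accession shape, not the DDBJ Pipeline's actual validation code.

```python
import re

# Run accessions in the DDBJ/NCBI/EBI read archives take the form
# DRR/SRR/ERR followed by at least six digits (e.g. DRR000001).
# This regex is a simplification for illustration.
ACCESSION_RE = re.compile(r"^(DRR|SRR|ERR)\d{6,}$")

def is_run_accession(accession):
    """Check whether a string looks like a sequence-read run accession."""
    return bool(ACCESSION_RE.match(accession))

print(is_run_accession("DRR000001"))  # True
print(is_run_accession("GCF_0001"))   # False
```

Accepting only a validated accession, rather than an arbitrary file path, is what lets a pipeline like this fetch the reads itself from the co-located archive.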

SLIMS: a LIMS for handling next-generation sequencing workflows

EMBnet.journal, 2013

Next-generation sequencing (NGS) is becoming a standard method in modern life-science laboratories for studying biomacromolecules and their interactions. Methods such as RNA-Seq and DNA resequencing are replacing array-based methods that dominated the last decade. A sequencing facility needs to keep track of requests, requester details, reagent barcodes, sample tracing and monitoring, quality controls, data delivery, creation of workflows for customised data analysis, privileges of access to the data, customised reports etc. An integrated software tool to handle these tasks helps to troubleshoot problems quickly, to maintain a high quality standard, and to reduce time and costs needed for data production. Commercial and non-commercial tools called LIMS (Laboratory Information Management Systems) are available for this purpose. However, they often come at prohibitive cost and/or lack the flexibility and scalability needed to adjust seamlessly to the frequently changing protocols employed. In order to manage the flow of sequencing data produced at the IIT Genomic Unit, we developed SLIMS (Sequencing LIMS).
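The tracking requirements listed above (requester details, barcodes, sample status, quality controls) map naturally onto a small record type with an explicit status lifecycle. The fields below are a hypothetical minimal schema for illustration, not SLIMS's actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SequencingSample:
    """Hypothetical minimal LIMS record for one sample in an NGS facility."""
    barcode: str
    requester: str
    received: date
    status: str = "registered"        # registered -> qc -> sequenced -> delivered
    qc_passed: Optional[bool] = None  # unknown until QC has run

# Register a sample, then record a QC result as it moves through the lab.
sample = SequencingSample("RB-00427", "lab-42", date(2013, 5, 1))
sample.status = "qc"
sample.qc_passed = True
print(sample.status, sample.qc_passed)  # qc True
```

A production LIMS would add access privileges, audit history, and links to runs and deliverables, but even this skeleton shows why an integrated tool beats spreadsheets: every status change is recorded against a barcoded, requester-attributed record.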

Cloud Computing for Next-Generation Sequencing Data Analysis

Cloud Computing - Architecture and Applications, 2017

High-throughput next-generation sequencing (NGS) technologies have evolved rapidly and are reshaping the scope of genomics research. The substantial decrease in the cost of NGS techniques in the past decade has led to its rapid adoption in biological research and drug development. Genomics studies of large populations are producing a huge amount of data, giving rise to computational issues around the storage, transfer, and analysis of the data. Fortunately, cloud computing has recently emerged as a viable option to quickly and easily acquire the computational resources for large-scale NGS data analyses. Some cloud-based applications and resources have been developed specifically to address the computational challenges of working with very large volumes of data generated by NGS technology. In this chapter, we will review some cloud-based systems and solutions for NGS data analysis, discuss the practical hurdles and limitations in cloud computing, including data transfer and security, and share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis.