HTSeq--a Python framework to work with high-throughput sequencing data - PubMed (original) (raw)

HTSeq--a Python framework to work with high-throughput sequencing data

Simon Anders et al. Bioinformatics. 2015.

Abstract

Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed.

Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Availability and implementation: HTSeq is released as an open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq.

© The Author 2014. Published by Oxford University Press.

PubMed Disclaimer

Figures

Fig. 1.

Fig. 1.

(a) The SAM_Alignment class as an example of an HTSeq data record: subsets of the content are bundled in object-valued fields, using classes (here SequenceWithQualities and GenomicInterval) that are also used in other data records to provide a common view on diverse data types. (b) The cigar field in a SAM_alignment object presents the detailed structure of a read alignment as a list of CigarOperation. This allows for convenient downstream processing of complicated alignment structures, such as the one given by the cigar string on top and illustrated in the middle. Five CigarOperation objects, with slots for the columns of the table (bottom) provide the data from the cigar string, along with the inferred coordinates of the affected regions in read (‘query’) and reference

Fig. 2.

Fig. 2.

Using the class GenomicArrayOfSets to represent overlapping annotation metadata. The indicated features are assigned to the array, which then represents them internally as steps, each step having as value a set whose elements are references to the features overlapping the step

Similar articles

Cited by

References

    1. Beazley DM, et al. Proceedings of the 4th USENIX Tcl/Tk workshop. 1996. SWIG: an easy to use tool for integrating scripting languages with C and C++ pp. 129–139.
    1. Behnel S, et al. Cython: the best of both worlds. Comput. Sci. Eng. 2011;13:31–39.
    1. Bolger AM, et al. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114–2120. - PMC - PubMed
    1. Cock PJ, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. - PMC - PubMed
    1. Dale RK, et al. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics. 2011;27:3423–3424. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources