GitHub - jdidion/seqio: A library for high-performance reading and writing of common NGS formats. (pre-alpha) (original) (raw)
Overview
seqio
is a python library (3.3+ only) for highly optimized reading/writing of raw (i.e. unaligned) NGS data in a variety of formats:
- FASTA
- FASTQ
- Interleaved FASTQ
- SAM/BAM/CRAM
- Common compression formats (gz, bz2, xz), which are handled automatically (via xphyle)
seqio
is currently in development (pre-alpha).
Details
seqio
achieves significant performance gains over similar libraries (e.g. screed
) in three ways:
- Implements critical portions of code in C (via Cython).
- Maintains data as bytes by default, rather than perform expensive string encoding/decoding operations.
- Uses system-level compression/decompression (via xphyle) when possible.
The generated sequences can be either immutable or mutable. With immutable sequences, slicing always returns a new sequence, and modifications to the sequence and qualities are not supported. With mutable sequences, modifications are applied in-place and each modification is recorded as an Edit
.
Usage
Here we will filter out reads with low mean quality
or which contain an adapter sequence, and write the
retained reads to an interleaved, compressed fastq file
from seqio import reader, writer
fq = ('reads{}.fq.gz'.format(i) for i in (1,2)) adapter = b'AGATCGAAT' writer = writer('output.fq.gz', interleaved=True)
for read1, read2 in reader(*fq): # qualities can also be retrieved as a list of # phred-scale integers mean_qual = sum(read1.qualities_phred(base=33)) / len(read1) # the following is equivalent and a bit faster: from seqio.util import mean_quality mean_qual = mean_quality(read1)
# for most operations (e.g. comparisons), byte array
# sequence and qualities can be used directly
has_adapter = read1.sequence.startswith(adapter)
# the following is equivalent and will take care of
# string conversion for us, but is slightly slower
from seqio.util import has_prefix
has_adapter = has_prefix(read1, 'AGATCGAAT')
# names are stored as unicode strings
print(read1.name)
# sequences and qualities are stored as byte arrays,
# so explicit conversion to string is required
print("{}\n{}\n\n{}\n{}".format(
read1.sequence_str, read1.qualities_str,
read2.sequence_str, read2.qualities_str))
print("Mean read1 quality: {:.3d}".format(mean_qual))
print("Contains adapter? {}".format(has_adapter))
if mean_qual >= 30 and not has_adapter:
writer.write(read1, read2)
Dependencies
These are all installable via pip:
- Cython 0.25+ (only required at build-time)
- xphyle 1.0+
- pysam (only required for SAM/BAM/CRAM support)
TODO
- Look at dnaio - maybe combine forces? https://github.com/marcelm/dnaio/
- Add FAST5 support (using poretools)
- Add xphyle plugins for sequence-specific compression formats:
- Treat BAM and CRAM as compression formats using native samtools/sambamba if available
- fqzcomp, quip, scalce http://www.nature.com.ezproxy.nihlibrary.nih.gov/nmeth/journal/v13/n12/pdf/nmeth.4037.pdf
- https://github.com/COMBINE-lab/quark
- https://github.com/rcanovas/libCSAM
- https://github.com/Zhuzxlab/LW-FQZip2
- Casava/BCL support might be useful to some; will wait to see if this is requested.
- Check out similar library in Rust: https://github.com/onecodex/needletail
- FaStore support? https://github.com/refresh-bio/FaStore
- Check out skbio.io: http://scikit-bio.org/docs/0.4.1/index.html
- Look at using pybam instead of pysam
- Look at using the numpy-based data structures in biotite