Sequence Compression Benchmark (SCB) database: A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences
Kirill Kryukov et al. Gigascience. 2020.
Abstract
Background: Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available.
Findings: We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results.
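As a rough illustration of the kind of measurement described above, the following minimal sketch times three general-purpose compressors from the Python standard library on a small synthetic FASTA file and reports compression ratio plus compression and decompression time. It is not the benchmark's actual harness: the helper names (`make_fasta`, `benchmark`, `COMPRESSORS`), the synthetic random DNA, and the choice of compressors are assumptions made for illustration, and peak memory is not measured here.

```python
import bz2
import gzip
import lzma
import random
import time


def make_fasta(n_records=50, bases_per_record=3000, line_width=60, seed=0):
    """Build a small synthetic FASTA file (random DNA) as bytes. Hypothetical test data."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_records):
        lines.append(f">seq{i} synthetic")
        seq = "".join(rng.choice("ACGT") for _ in range(bases_per_record))
        lines.extend(seq[j:j + line_width] for j in range(0, len(seq), line_width))
    return ("\n".join(lines) + "\n").encode("ascii")


def benchmark(name, compress, decompress, data):
    """Time one compress/decompress round trip and check that it is lossless."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError(f"{name}: round trip did not reproduce the input")
    return {
        "name": name,
        "ratio": len(data) / len(packed),  # original size / compressed size
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


# Illustrative selection only; the benchmark itself covers many more tools and settings.
COMPRESSORS = [
    ("gzip", gzip.compress, gzip.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("xz", lzma.compress, lzma.decompress),
]

if __name__ == "__main__":
    data = make_fasta()
    for name, compress, decompress in COMPRESSORS:
        row = benchmark(name, compress, decompress, data)
        print(f"{row['name']:5s} ratio={row['ratio']:.2f} "
              f"compress={row['compress_s']:.3f}s decompress={row['decompress_s']:.3f}s")
```

Random DNA is close to a worst case for these tools, so ratios on real genomes will differ; the sketch only shows how the measures relate to one another.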
Conclusion: We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
Keywords: DNA; RNA; benchmark; compression; database; genome; protein; sequence.
© The Author(s) 2020. Published by Oxford University Press.
Figures
Figure 1:
Comparison of 36 compressors on the human genome. The best settings of each compressor are selected on the basis of different aspects of performance: (A) compression ratio, (B) transfer + decompression speed, and (C) compression + transfer + decompression speed. The copy-compressor ("cat" command), shown in red, is included as a control. The selected settings of each compressor are shown in their names, after a hyphen. Multi-threaded compressors have "-1t" or "-4t" at the end of their names to indicate the number of threads used. Test data are the 3.31 GB reference human genome (accession number GCA_000001405.28). Benchmark CPU: Intel Xeon E5-2643v3 (3.4 GHz). A link speed of 100 Mbit/sec was used for estimating the transfer time.
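The combined speed measures named in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that transfer time is the compressed size divided by the link speed, that each combined speed is the original size divided by the total time of the steps involved, and that 1 Mbit is 1,000,000 bits. The function names and all input numbers are hypothetical.

```python
def transfer_seconds(compressed_bytes, link_mbit_per_s=100.0):
    """Estimated time to send the compressed file over the link (assumes decimal megabits)."""
    link_bytes_per_s = link_mbit_per_s * 1_000_000 / 8
    return compressed_bytes / link_bytes_per_s


def td_speed(original_bytes, compressed_bytes, decompress_s, link_mbit_per_s=100.0):
    """Transfer + decompression speed in MB/s, relative to the original size."""
    total_s = transfer_seconds(compressed_bytes, link_mbit_per_s) + decompress_s
    return original_bytes / total_s / 1e6


def ctd_speed(original_bytes, compressed_bytes, compress_s, decompress_s,
              link_mbit_per_s=100.0):
    """Compression + transfer + decompression speed in MB/s."""
    total_s = (compress_s
               + transfer_seconds(compressed_bytes, link_mbit_per_s)
               + decompress_s)
    return original_bytes / total_s / 1e6


if __name__ == "__main__":
    original = 1_000_000_000  # hypothetical 1 GB input

    # "cat"-style control: no size reduction, no decompression cost.
    print(td_speed(original, original, decompress_s=0.0))  # 12.5

    # Hypothetical compressor: ratio 4, 60 s to compress, 20 s to decompress.
    print(td_speed(original, 250_000_000, decompress_s=20.0))  # 25.0
    print(ctd_speed(original, 250_000_000, compress_s=60.0, decompress_s=20.0))  # 10.0
```

Under these assumptions the uncompressed control is bounded by the link itself (12.5 MB/s at 100 Mbit/sec), so a compressor improves on it only when the transfer time it saves exceeds the time it adds for compression and decompression.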
Figure 2:
Comparison of 334 settings of 36 compressors on the human genome. Each point represents a particular setting of some compressor. A, The relationship between compression ratio and decompression speed. B, The transfer + decompression speed plotted against compression + transfer + decompression speed. Test data are the 3.31 GB reference human genome (accession number GCA_000001405.28). Benchmark CPU: Intel Xeon E5-2643v3 (3.4 GHz). A link speed of 100 Mbit/sec was used for estimating the transfer time.
Figure 3:
Comparison of compressor settings to gzip. Genome datasets were used as test data. Each point shows the performance of a compressor setting on a specific genome test dataset. All values are shown relative to a representative setting of gzip. Only performances that are at least half as good as gzip on both axes are shown. A, Settings that performed best in transfer + decompression speed. B, Settings that performed best in compression + transfer + decompression speed. A link speed of 100 Mbit/sec was used for estimating the transfer time.
Figure 4:
Compressor memory consumption. The strongest setting of each compressor is shown. On the x-axis is the test data size. On the y-axis is the peak memory used by the compressor, for compression (A) and decompression (B).
Similar articles
- Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences.
Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Bioinformatics. 2019 Oct 1;35(19):3826-3828. doi: 10.1093/bioinformatics/btz144. PMID: 30799504 Free PMC article.
- CoGI: Towards Compressing Genomes as an Image.
Xie X, Zhou S, Guan J. IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1275-85. doi: 10.1109/TCBB.2015.2430331. PMID: 26671800
- LCQS: an efficient lossless compression tool of quality scores with random access functionality.
Fu J, Ke B, Dong S. BMC Bioinformatics. 2020 Mar 18;21(1):109. doi: 10.1186/s12859-020-3428-7. PMID: 32183707 Free PMC article.
- Pan-Genome Storage and Analysis Techniques.
Zekic T, Holley G, Stoye J. Methods Mol Biol. 2018;1704:29-53. doi: 10.1007/978-1-4939-7463-4_2. PMID: 29277862 Review.
- Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format.
Kryukov K, Jin L, Nakagawa S. Patterns (N Y). 2022 Sep 9;3(9):100562. doi: 10.1016/j.patter.2022.100562. Epub 2022 Jul 7. PMID: 35818472 Free PMC article. Review.
Cited by
- Compression rates of microbial genomes are associated with genome size and base composition.
Bohlin J, Pettersson JH. Genomics Inform. 2024 Oct 10;22(1):16. doi: 10.1186/s44342-024-00018-z. PMID: 39390533 Free PMC article.
- AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.
Silva JM, Qi W, Pinho AJ, Pratas D. Gigascience. 2022 Dec 28;12:giad101. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13. PMID: 38091509 Free PMC article.
- Bioinformatics tools for the sequence complexity estimates.
Orlov YL, Orlova NG. Biophys Rev. 2023 Sep 15;15(5):1367-1378. doi: 10.1007/s12551-023-01140-y. eCollection 2023 Oct. PMID: 37974990 Free PMC article. Review.
- Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity.
Ding Y, Liao Y, He J, Ma J, Wei X, Liu X, Zhang G, Wang J. Front Genet. 2023 Jun 1;14:1213907. doi: 10.3389/fgene.2023.1213907. eCollection 2023. PMID: 37323665 Free PMC article.
- Nanopore Sequencing Data Analysis of 16S rRNA Genes Using the GenomeSync-GSTK System.
Kryukov K, Imanishi T, Nakagawa S. Methods Mol Biol. 2023;2632:215-226. doi: 10.1007/978-1-0716-2996-3_15. PMID: 36781731