GitHub - bovee/entab: * -> TSV

Entab

What if everything were a/could be turned into a table?

Entab is a parsing framework to turn a variety of record-based scientific file formats into usable tabular data across a variety of programming languages.

Formats

Entab supports reading a variety of bioinformatics, chemoinformatics, and other formats.

CLI

Entab has a CLI that allows piping in arbitrary files and outputs TSVs. Install with:

Example usage to see how many records are in a file:

cat test.fa | entab | sed '1d' | wc -l
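The same count can be made from a script that consumes the CLI's TSV output. The following is a minimal sketch using only Python's standard library. It assumes, as the `sed '1d'` step implies, that the first line of output is a header row. The `count_records` helper and the sample column names are hypothetical and not taken from entab's documentation.

```python
import csv
import io


def count_records(tsv_text: str) -> int:
    """Count data rows in TSV text, skipping the header row."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    next(reader, None)  # drop the header line, like `sed '1d'`
    return sum(1 for _ in reader)  # count what is left, like `wc -l`


# Illustrative stand-in for CLI output from a two-record file;
# the column names here are assumptions made for the example.
sample = "id\tsequence\nseq1\tACGT\nseq2\tGGCC\n"
print(count_records(sample))
```

To use it on real output, the script could read `sys.stdin.read()` instead of the sample string and sit at the end of the pipe in place of `sed '1d' | wc -l`.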

Bindings

There are bindings for Python and JavaScript that support reading data streams and converting them into a series of records. Experimental R bindings are also available.

The JavaScript library can be installed with:

The Python library can be installed with:

The R bindings can be installed from inside R with (note you will need Cargo and a Rust buildchain locally):

library(devtools)
devtools::install_github("bovee/entab", subdir="entab-r")

Priorities

  1. _Handling many formats:_ Support as many record-based, streamable scientific formats as possible. Formats like HDF5 that have complex headers and existing, well-supported parsers are not a priority, though.
  2. _Correctness:_ Formats should be parsed with good error messages, consistent failure states, and well-tested code.
  3. _Language bindings:_ Support using Entab from a decent selection of the programming languages currently used for science, data science, and related fields. Python and JavaScript are currently supported, R support is experimental, and Julia support may follow in the future.
  4. _Speed:_ Entab should be as fast as possible while still prioritizing the above issues. Parsers are split into two forms: a fast one that produces a specialized struct and a slower one that produces a generic record and can be switched to at run time.
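The two parser forms mentioned under _Speed_ can be illustrated with a small sketch. This is not Entab's implementation (which is written in Rust); it is a simplified Python analogy under stated assumptions. `FastaRecord`, `parse_fasta`, `PARSERS`, and `generic_records` are all hypothetical names invented for the example.

```python
from dataclasses import astuple, dataclass, fields
from typing import Iterator, List, Tuple


@dataclass
class FastaRecord:
    """Specialized record: a fixed set of typed fields known ahead of time."""
    id: str
    sequence: str


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Specialized path: yield typed FastaRecord objects."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield FastaRecord(name, "".join(chunks))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield FastaRecord(name, "".join(chunks))


# Generic path: a format name chosen at run time selects the parser.
PARSERS = {"fasta": (FastaRecord, parse_fasta)}


def generic_records(fmt: str, text: str) -> Tuple[List[str], Iterator[tuple]]:
    """Return column headers plus an iterator of untyped value tuples."""
    record_type, parser = PARSERS[fmt]
    headers = [f.name for f in fields(record_type)]
    rows = (astuple(rec) for rec in parser(text))
    return headers, rows


data = ">seq1\nACGT\nGG\n>seq2\nTTAA\n"
headers, rows = generic_records("fasta", data)
print(headers)
print(list(rows))
```

Callers that know the format up front can use the specialized parser directly and work with typed fields. Callers that only learn the format at run time, such as a CLI or a language binding, go through the generic wrapper and receive headers and rows suitable for a table.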

Website

There is a small demo of entab running in the browser that can open small files and plot the data in them.

  1. This format uses multiple files so it's not supported in streaming mode or in e.g. the JS bindings.