GitHub - liquidaty/zsv: zsv+lib: tabular data swiss-army knife CLI + world's fastest (simd) CSV parser

zsv+lib: the world's fastest (simd) CSV parser, with an extensible CLI

Playground (without sheet viewer command): https://liquidaty.github.io/zsv

zsv+lib is the world's fastest CSV parser library and extensible command-line utility. It achieves high performance using SIMD operations, efficient memory use and other optimization techniques, and can also parse generic-delimited and fixed-width formats, as well as multi-row-span headers.

While zsv is written in C, it can be used in other languages such as Ruby. See below for more details.

CLI

The ZSV CLI can be compiled to virtually any target, including WebAssembly, and offers a variety of commands including select, count, direct CSV sql, flatten, serialize, 2json conversion, 2db sqlite3 conversion, stack, pretty, 2tsv, compare, paste, overwrite, check and more.

The ZSV CLI also includes sheet, an in-console interactive grid viewer that includes basic navigation, filtering, and pivot table with drill down, and that supports custom extensions.

Installation

Language Bindings & Wrappers

Binding contributions are welcome!

| Language | Project | Maintainer |
| --- | --- | --- |
| Ruby | https://github.com/sebyx07/zsv-ruby | @sebyx07 |

Note: These projects are maintained independently. Please file issues related to specific bindings in their respective repositories.

Playground

An online playground is available as well (without the sheet feature due to browser limitations).

If you like zsv+lib, do not forget to give it a star! 🌟

Performance

Performance results compare favorably vs other CSV utilities (xsv, tsv-utils, csvkit, mlr (miller) etc).

See benchmarks

Which "CSV"

"CSV" is an ambiguous term. This library uses, by default, the same definition as Excel (the library and app have various options to change this default behavior); a more accurate description of it would be "UTF8 delimited data parser" insofar as it requires UTF8 input and its options support customization of the delimiter and whether to allow quoting.

In addition, zsv provides a row-level (as well as cell-level) API and provides "normalized" CSV output (e.g. input of this"iscell1,"thisis,"cell2 becomes "this""iscell1","thisis,cell2"). Each of these three objectives (Excel compatibility, row-level API and normalized output) has a measurable performance impact; conversely, much faster parsing speeds are achievable, as a number of other CSV parsers demonstrate, if any of these requirements (especially Excel compatibility) is dropped.
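To make the "normalized" output rule concrete, here is a minimal, self-contained sketch of one plausible quoting rule: a cell is wrapped in quotes only if it contains a comma, a quote or a line break, and embedded quotes are doubled. This is not zsv's code; the function name csv_normalize_cell and the exact set of characters that trigger quoting are assumptions chosen to reproduce the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of zsv): write `cell` to `out` in normalized
 * CSV form. Returns the number of bytes written, excluding the terminating
 * NUL, or -1 if `out` is too small. */
int csv_normalize_cell(const char *cell, char *out, size_t out_cap) {
  assert(cell != NULL && out != NULL);

  /* Assumed rule: quote only when the cell holds a comma, quote, CR or LF. */
  int needs_quotes = strpbrk(cell, ",\"\r\n") != NULL;
  size_t n = 0;

  if (needs_quotes) {
    if (n + 1 >= out_cap)
      return -1;
    out[n++] = '"';
  }
  for (const char *p = cell; *p; p++) {
    size_t need = (*p == '"') ? 2 : 1; /* an embedded quote is doubled */
    if (n + need >= out_cap)           /* always keep room for the NUL */
      return -1;
    if (*p == '"')
      out[n++] = '"';
    out[n++] = *p;
  }
  if (needs_quotes) {
    if (n + 1 >= out_cap)
      return -1;
    out[n++] = '"';
  }
  if (n >= out_cap)
    return -1;
  out[n] = '\0';
  return (int)n;
}
```

With a sufficiently large buffer, the cell value this"iscell1 is written as "this""iscell1" and thisis,cell2 as "thisis,cell2", while a value with no special characters is copied unchanged.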

Examples of input that does not comply with RFC 4180

The following is a comprehensive list of all input patterns that are non-compliant with RFC 4180, and how zsv (by default) parses each:

| Input Description | Parser treatment | Example input | How example input is parsed |
| --- | --- | --- | --- |
| Non-ASCII input, UTF8 BOM | BOM at start of the stream is ignored | (0xEF BB BF) | Ignored |
| Non-ASCII input, valid UTF8 | Parsed as UTF8 | 你,好 | cell1 = 你, cell2 = 好 |
| Non-ASCII input, invalid UTF8 | Parsed as UTF8; any non-compliant bytes are retained, or replaced with specified char | aaa,bXb,ccc where X is malformed UTF8 | cell1 = aaa, cell2 = bXb, cell3 = ccc |
| \n, \r, or \r\n newlines | Any non-quote-captured occurrence of \n, \r, \r\n or \n\r is parsed as a row end | 1a,1b,1c\n2a,2b,2c\r3a,3b,3c\n\r4a,4b,4c\r\n5a,"5\nb",5c\n6a,"6b\r","6c"\n7a,7b,7c | Parsed as 7 rows each with 3 cells |
| Unquoted quote | Treated like any other non-delimiter | aaa,b"bb,ccc | Cell 2 value is b"bb, output as CSV "b""bb" |
| Closing quote followed by character other than delimiter (comma) or row end | Treated like any other non-delimiter | "aa"a,"bb"bb"b,ccc | Cell 1 value is aaa, cell2 value is bbbb"b, output as CSV aaa and "bbbb""b" |
| Missing final CRLF | Ignored; end-of-stream is considered end-of-row if not preceded by explicit row terminator | aaa,bbb,ccc | Row with 3 cells, same as if input ended with row terminator preceding EOF |
| Row and header contain different number of columns (cells) | Number of cells in each row is independent of other rows | aaa,bbb\naaa,bbb,ccc | Row 1 = 2 cells; Row 2 = 3 cells |
| Header row contains duplicate cells or embedded newlines | Header rows are parsed the same way as other rows (see NOTE below) | "a\na","a\na" | Two cells of a\na |
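The quoting and row-end treatments in the table above can be illustrated with a small, deliberately naive row parser. It is a sketch written for this README, not zsv's implementation: it has no SIMD, no BOM or UTF8 handling, a fixed comma delimiter and a fixed cell size. The names sketch_parse_row and CELL_MAX are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CELL_MAX 64 /* hypothetical per-cell capacity, including the NUL */

/* Parse one row of `in` starting at *pos, copying unquoted cell values into
 * `cells`. Advances *pos past the row terminator and returns the cell count.
 * Over-long cells are truncated. */
size_t sketch_parse_row(const char *in, size_t len, size_t *pos,
                        char cells[][CELL_MAX], size_t max_cells) {
  size_t n = 0;     /* index of the cell being built */
  size_t used = 0;  /* bytes stored in that cell */
  int at_start = 1; /* nothing consumed yet for the current cell */
  int in_quote = 0; /* inside a quoted section */
  size_t i = *pos;

  assert(max_cells > 0);
  for (; i < len; i++) {
    char c = in[i];
    if (in_quote) {
      if (c == '"') {
        if (i + 1 < len && in[i + 1] == '"') {
          i++; /* "" inside quotes yields one literal quote */
        } else {
          in_quote = 0; /* closing quote: the rest of the cell is literal */
          continue;
        }
      }
    } else if (c == '"' && at_start) {
      in_quote = 1; /* a quote opens a quoted section only at cell start */
      at_start = 0;
      continue;
    } else if (c == ',') {
      cells[n][used] = '\0';
      assert(n + 1 < max_cells); /* sketch: the caller sizes the array */
      n++;
      used = 0;
      at_start = 1;
      continue;
    } else if (c == '\n' || c == '\r') {
      /* \r\n and \n\r each count as a single row end */
      if (i + 1 < len && in[i + 1] != c &&
          (in[i + 1] == '\n' || in[i + 1] == '\r'))
        i++;
      i++;
      break;
    }
    at_start = 0;
    if (used + 1 < CELL_MAX)
      cells[n][used++] = c;
  }
  cells[n][used] = '\0'; /* end of input also ends the row */
  *pos = i;
  return n + 1;
}
```

Calling this function in a loop while *pos is less than the input length splits the seven-row newline example from the table into 7 rows of 3 cells each, and parses "aa"a,"bb"bb"b,ccc into the cells aaa, bbbb"b and ccc.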

The above behavior can be altered with various optional flags:

Built-in and extensible features

zsv is an extensible CSV utility, which uses zsvlib, for tasks such as slicing and dicing, querying with SQL, combining, serializing, flattening, converting between CSV/JSON/sqlite3 and more.

zsv is streamlined for easy development of custom dynamic extensions.

zsvlib and zsv are written in C, but since zsvlib is a library, and zsv extensions are just shared libraries, you can extend zsv with your own code in any programming language, so long as it has been compiled into a shared library that implements the expected interface.

Key highlights

Why another CSV parser/utility?

Our objectives, which we were unable to find in a pre-existing project, are:

There are several excellent tools that achieve high performance. Among those we considered were xsv and tsv-utils. While they met our performance objective, both were designed primarily as utilities rather than libraries. For our needs, they were not easy enough to customize, or to extend with modular customizations that could be maintained (or licensed) independently of the related project. They are also written in Rust and D, respectively, languages with which we lacked deep experience, especially for WebAssembly targeting.

Others we considered were Miller (mlr), csvkit and Go (csv module), which did not meet our performance objective. We also considered various other libraries using SIMD for CSV parsing, but none that we tried met the "real-world CSV" objective.

Hence, zsv was created as a library and a versatile application, both optimized for speed and for ease of extension and customization to your needs.

Batteries included

zsv comes with several built-in commands:

Most of these can also be built as an independent executable named zsv_xxx where xxx is the command name.

Running the CLI

After installing, run zsv help to see usage details. The typical syntax is zsv <command> <parameters>, e.g.:

zsv sql my_population_data.csv "select * from data where population > 100000"

Using the API

Simple API usage examples include:

Pull parsing

zsv_parser parser = zsv_new(NULL);
while (zsv_next_row(parser) == zsv_status_row) { // for each row
  // ...
  const size_t cell_count = zsv_cell_count(parser);
  for (size_t i = 0; i < cell_count; i++) { // for each cell
    struct zsv_cell cell = zsv_get_cell(parser, i);
    // %.*s expects an int precision, so cast the cell length
    printf("cell: %.*s\n", (int)cell.len, cell.str);
    // ...
  }
}

Push parsing

static void my_row_handler(void *ctx) {
  zsv_parser parser = ctx;
  const size_t cell_count = zsv_cell_count(parser);
  for (size_t i = 0; i < cell_count; i++) {
    // ...
  }
}

int main() {
  zsv_parser parser = zsv_new(NULL);
  zsv_set_row_handler(parser, my_row_handler);
  zsv_set_context(parser, parser);
  while (zsv_parse_more(parser) == zsv_status_ok);
  return 0;
}
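The difference between the pull and push styles is mostly about who owns the loop. The sketch below contrasts them using a toy newline splitter that is unrelated to zsv: it ignores quoting, and every name in it (toy_parser, toy_next_row, toy_parse_all, toy_count_row) is hypothetical. In the pull style the caller requests one row at a time; in the push style the parser iterates and invokes a handler with a caller-supplied context pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy newline splitter used only to contrast the two styles.
 * It is NOT zsv and ignores quoting entirely. */
struct toy_parser {
  const char *data; /* whole input buffer */
  size_t len;       /* bytes in data */
  size_t pos;       /* offset of the next unread byte */
  const char *row;  /* start of the current row (set by toy_next_row) */
  size_t row_len;   /* length of the current row, excluding the '\n' */
};

/* Pull style: the caller asks for one row at a time.
 * Returns 1 if a row is available, 0 at end of input. */
int toy_next_row(struct toy_parser *p) {
  assert(p != NULL);
  if (p->pos >= p->len)
    return 0;
  const char *start = p->data + p->pos;
  const char *nl = memchr(start, '\n', p->len - p->pos);
  p->row = start;
  p->row_len = nl ? (size_t)(nl - start) : p->len - p->pos;
  p->pos += p->row_len + (nl ? 1 : 0); /* skip the '\n' if there was one */
  return 1;
}

/* Push style: the parser drives the loop and calls the handler per row. */
void toy_parse_all(struct toy_parser *p, void (*on_row)(void *ctx), void *ctx) {
  while (toy_next_row(p))
    on_row(ctx);
}

/* Example handler: ctx points at a size_t row counter. */
void toy_count_row(void *ctx) {
  (*(size_t *)ctx)++;
}
```

Pull parsing tends to read more naturally when the caller needs control flow between rows (early exit, interleaving with other work), while push parsing keeps per-row state in the context object and leaves buffering decisions to the parser.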

Full application code examples can be found at examples/lib/README.md.

An example of using the API, compiled to wasm and called via JavaScript, is in examples/js/README.md.

For more sophisticated (but at this time, only sporadically commented/documented) use cases, see the various CLI C source files in the app directory such as app/serialize.c.

Creating your own extension

You can extend zsv by providing a pre-compiled shared or static library that defines the functions specified in extension_template.h and which zsv loads in one of three ways:

Example and template

You can build and run a sample extension by running make test from app/ext_example.

The easiest way to implement your own extension is to copy and customize the template files in app/ext_template.

Contribute

Feel free to open an issue or discussion.

Via PR

Language bindings

We currently have community support for Ruby, and we'd love to see a zsv wrapper for Python, Rust, Go, Java or any of your other favorite languages.

License

MIT

The zsv CLI uses some permissively-licensed third-party libraries. See misc/THIRDPARTY.md for details.