GitHub - IlyaGrebnov/libsais: The libsais library provides fast linear-time construction of suffix array (SA), generalized suffix array (GSA), longest common prefix (LCP) array, permuted LCP (PLCP) array, Burrows-Wheeler transform (BWT) and inverse BWT based on the induced sorting algorithm with optional OpenMP support for multi-core parallel construction. (original) (raw)

libsais

The libsais library provides fast (see Benchmarks below) linear-time construction of suffix array (SA), generalized suffix array (GSA), longest common prefix (LCP) array, permuted LCP (PLCP) array, Burrows-Wheeler transform (BWT) and inverse BWT, based on the induced sorting algorithm described in the following papers (with optional OpenMP support for multi-core parallel construction):

Copyright (c) 2021-2025 Ilya Grebnov ilya.grebnov@gmail.com

The libsais is inspired by libdivsufsort, sais libraries by Yuta Mori and msufsort by Michael Maniscalco.

libcubwt

If you are looking for even faster construction times, you can try libcubwt a library for GPU-based suffix array, inverse suffix array and Burrows-Wheeler transform construction.

Introduction

The libsais provides simple C99 API to construct suffix array and Burrows-Wheeler transformed string from a given string over constant-size alphabet. The algorithm runs in a linear time using typically only ~16KB of extra memory (with 2n bytes as absolute worst-case; where n is the length of the string). OpenMP acceleration uses 200KB of addition memory per thread.

License

The libsais is released under the Apache License Version 2.0

Changes

Versions of the libsais

Examples of APIs (see libsais.h, libsais16.h, libsais16x64.h and libsais64.h for complete APIs list)

/**
* Constructs the suffix array of a given string.
* @param T [0..n-1] The input string.
* @param SA [0..n-1+fs] The output array of suffixes.
* @param n The length of the given string.
* @param fs The extra space available at the end of SA array (0 should be enough for most cases).
* @param freq [0..255] The output symbol frequency table (can be NULL).
* @return 0 if no error occurred, -1 or -2 otherwise.
*/
int32_t libsais(const uint8_t * T, int32_t * SA, int32_t n, int32_t fs, int32_t * freq);

/**
* Constructs the suffix array of a given integer array.
* Note, during construction input array will be modified, but restored at the end if no errors occurred.
* @param T [0..n-1] The input integer array.
* @param SA [0..n-1+fs] The output array of suffixes.
* @param n The length of the integer array.
* @param k The alphabet size of the input integer array.
* @param fs Extra space available at the end of SA array (can be 0, but 4k or better 6k is recommended for optimal performance).
* @return 0 if no error occurred, -1 or -2 otherwise.
*/
int32_t libsais_int(int32_t * T, int32_t * SA, int32_t n, int32_t k, int32_t fs);

/**
* Constructs the burrows-wheeler transformed string (BWT) of a given string.
* @param T [0..n-1] The input string.
* @param U [0..n-1] The output string (can be T).
* @param A [0..n-1+fs] The temporary array.
* @param n The length of the given string.
* @param fs The extra space available at the end of A array (0 should be enough for most cases).
* @param freq [0..255] The output symbol frequency table (can be NULL).
* @return The primary index if no error occurred, -1 or -2 otherwise.
*/
int32_t libsais_bwt(const uint8_t * T, uint8_t * U, int32_t * A, int32_t n, int32_t fs, int32_t * freq);

/**
* Constructs the original string from a given burrows-wheeler transformed string (BWT) with primary index.
* @param T [0..n-1] The input string.
* @param U [0..n-1] The output string (can be T).
* @param A [0..n] The temporary array (NOTE, temporary array must be n + 1 size).
* @param n The length of the given string.
* @param freq [0..255] The input symbol frequency table (can be NULL).
* @param i The primary index.
* @return 0 if no error occurred, -1 or -2 otherwise.
*/
int32_t libsais_unbwt(const uint8_t * T, uint8_t * U, int32_t * A, int32_t n, const int32_t * freq, int32_t i);

/**
* Constructs the permuted longest common prefix array (PLCP) of a given string and a suffix array.
* @param T [0..n-1] The input string.
* @param SA [0..n-1] The input suffix array.
* @param PLCP [0..n-1] The output permuted longest common prefix array.
* @param n The length of the string and the suffix array.
* @return 0 if no error occurred, -1 otherwise.
*/
int32_t libsais_plcp(const uint8_t * T, const int32_t * SA, int32_t * PLCP, int32_t n);

Example installation using CPM

CPMAddPackage( NAME libsais GITHUB_REPOSITORY IlyaGrebnov/libsais GIT_TAG v2.8.5 OPTIONS "LIBSAIS_USE_OPENMP OFF" "LIBSAIS_BUILD_SHARED_LIB OFF" )

target_link_libraries( libsais)

Algorithm description

The libsais uses the SA-IS (Suffix Array Induced Sorting) algorithm to construct both the suffix array and the Burrows-Wheeler transform through recursive decomposition and induced sorting:

The SA-IS algorithm is quite elegant, yet implementing it efficiently presents multiple challenges. The primary challenge is that the SA-IS algorithm exhibits random memory access patterns, which can significantly decrease efficiency due to cache misses. Another significant challenge is that the SA-IS algorithm is not a lightweight construction algorithm; it requires additional memory to support positions classification, induced sorting, compacted string representations, and recursive decomposition. To circumvent this, the libsais implements careful optimizations that are worth highlighting:

Benchmarks

Full list of benchmarks are moved to own Benchmarks.md file.