Pizza&Chili Corpus -- Compressed Indexes and their Testbeds (original) (raw)

The new millennium has seen the born of a new class of full-text indexes which are structurally similar to Suffix Trees and Suffix Arrays, in that they support the powerful substring search operation, but are succinct in space, in that it is close to the empirical entropy of the indexed data. They are therefore called compressed Suffix Trees and compressed Suffix Arrays, or in general compressed indexes.

In the literature we counted more than 20 papers authored by more than 20 different researchers. This interest is motivated by the large availability of textual data in electronic form, by the ever increasing gap in performance among the memory levels of current PCs, and by the "non negligible" space occupancy of classic data structures like Suffix Trees and Suffix Arrays which are pervading the BioInformatics and the Text Mining fields.

Don Knuth already observed, in its famous 3rd Volume on the Art of Computer Programming, that "space optimization is closely related to time optimization in a disk memory". So we believe that compressed indexes may become a crucial tool for the design of sophisticated and efficient software solutions given the ubiquity of indexing data structures in them. We nevertheless note that the algorithmic technology underlying these compressed indexes stays not at an undergraduate level. Consequently the implementation of any known compressed index requires much engineering effort, a strong algorithmic background, and still the final program may possibly not achieve its best performance!

This site has two mirrors: one in Italy and one in Chile. Hence you can argue the why of its name ;-) Its ultimate goal is to push towards the technology transfer of this fascinating algorithmic technology lying at the crossing point of data compression and data structure design. In order to achieve this goal the Pizza&Chili site offers publicly available implementations of compressed indexes. Each implementation follows a suitable API of functions which should, in our intention, allow any programmer to plug the provided compressed indexes within their own software and play with their functionalities and efficiency. The site also offers a collection of texts for experimenting and validating compressed indexes. In detail it offers three kinds of material:

A set of compressed indexes which are able to support the same search functionalities of Suffix Trees and Suffix Arrays (e.g., substring searches), but requiring succinct space occupancy and offering, in addition, some text access operations that make them useful within text retrieval and mining software systems.
A set of text collections of various types and sizes to test experimentally the available compressed indexes, or the new compressed indexes that researchers like to submit to this site. The text collections have been selected to form a representative sample of different applications where indexed text searching might be useful. The sizes of these texts are large enough to stress the impact of data compression over memory usage and CPUperformance. The goal of experimenting with this testbed is to conclude whether, or not, compressed indexing is beneficial over uncompressed indexing approaches, like Suffix Trees and Suffix Arrays. And, in case it is beneficial, which compressed index is preferable according to the various applicative scenarios sampled with the testbed.
Additional material useful to experiment with compressed indexes, such as _scripts_for automatic software validation over the available text collections, and to deepen the understanding of this algorithmic technology, such as scientific publications and links to interesting related sites.

It goes without saying that with this site we hope to mimic the success and impact of other initiatives as data-compression.info and the Calgaryand Canterbury corpora, just to cite a few. Actually the Pizza&Chili site is born as a "mix" of both of them, by offering software and testbeds. Several people have already contributed to make this site working and, hopefully, many more people will contribute to turn it into the reference for all researchers and software developers interested in experimenting and developing the compressed-index technology. The APIwe propose is thus intended to provide a reference for any researcher who wishes to contribute to the Pizza&Chili repository with his/her new compressed index.

Have fun!

Paolo Ferragina (University of Pisa)
Gonzalo Navarro (University of Chile)