Pizza&Chili Corpus -- Compressed Indexes and their Testbeds (original) (raw)

The new millennium has seen the born of a new class of full-text indexes which are structurally similar to Suffix Trees and Suffix Arrays, in that they support the powerful substring search operation, but are succinct in space, in that it is close to the empirical entropy of the indexed data. They are therefore called compressed Suffix Trees and compressed Suffix Arrays, or in general compressed indexes.

In the literature we counted more than 20 papers authored by more than 20 different researchers. This interest is motivated by the large availability of textual data in electronic form, by the ever increasing gap in performance among the memory levels of current PCs, and by the "non negligible" space occupancy of classic data structures like Suffix Trees and Suffix Arrays which are pervading the BioInformatics and the Text Mining fields.

Don Knuth already observed, in its famous 3rd Volume on the Art of Computer Programming, that "space optimization is closely related to time optimization in a disk memory". So we believe that compressed indexes may become a crucial tool for the design of sophisticated and efficient software solutions given the ubiquity of indexing data structures in them. We nevertheless note that the algorithmic technology underlying these compressed indexes stays not at an undergraduate level. Consequently the implementation of any known compressed index requires much engineering effort, a strong algorithmic background, and still the final program may possibly not achieve its best performance!

This site has two mirrors: one in Italy and one in Chile. Hence you can argue the why of its name ;-) Its ultimate goal is to push towards the technology transfer of this fascinating algorithmic technology lying at the crossing point of data compression and data structure design. In order to achieve this goal the Pizza&Chili site offers publicly available implementations of compressed indexes. Each implementation follows a suitable API of functions which should, in our intention, allow any programmer to plug the provided compressed indexes within their own software and play with their functionalities and efficiency. The site also offers a collection of texts for experimenting and validating compressed indexes. In detail it offers three kinds of material:

It goes without saying that with this site we hope to mimic the success and impact of other initiatives as data-compression.info and the Calgaryand Canterbury corpora, just to cite a few. Actually the Pizza&Chili site is born as a "mix" of both of them, by offering software and testbeds. Several people have already contributed to make this site working and, hopefully, many more people will contribute to turn it into the reference for all researchers and software developers interested in experimenting and developing the compressed-index technology. The APIwe propose is thus intended to provide a reference for any researcher who wishes to contribute to the Pizza&Chili repository with his/her new compressed index.

Have fun!

Paolo Ferragina (University of Pisa)
Gonzalo Navarro (University of Chile)