Succinct Data Structures

 

Project:†† AF:Small:Compact Data Structures for String Matching and Retrieval (nsf link)

NSF CCF 1527435
May 2015-May 2018
PI: Sukhamay Kundu

 

Abstract:

In the era of big-data, one needs to organize massive amounts of data so that it can be searched quickly. This requires the building of an index over the raw data. In many cases of big-data, like world-wide web or DNA sequencing, the actual information content is low. This data is highly compressible. On the other hand, the indexes require space which is several times the raw data. Thus, indexing and compression are often conflicting goals. The field of compact (or succinct) data structures attempts to achieve both these goals -compression and indexability- simultaneously. This project will address some of the most fundamental open problems in this field. This will have impact on next generation biological sequence mining databases which could simply work within the memory of a PC instead of requiring high-performance clusters. The foundations built by this project will also impact image matching and music retrieval. Since data structures is one of the most fundamental areas in computer science education, research from this project will also impact data structures curriculum.

Suffix trees are central to string indexing and have myriads of applications. However, suffix trees are known to take 15 to 50 times the size of the text they index. This actually stems from a complexity gap in the size of data which is n log s bits compared to the size of the index which is O( n log n) bits, for the text of n characters drawn from alphabet size of s. The techniques of Burrows-Wheeler Transform(BWT) and Phi-function were introduced in the last decade to address this gap. Most subsequent research in this field has treated BWT as a black box, compressing augmenting structures around it to address various applications. However, many problems (like parameterized pattern matching and 2D pattern matching) have remained open in this field. This project will attempt to go deeper and beyond the philosophy of BWT to solve such issues. It will also try to build foundations for deriving lower bounds for problems where compact index would be impossible. To create better understanding of data structure space and query complexity, the project will explore the recent theoretical model called "encoding model". The project will also explore the applied case of compressed indexing for highly repetitive sequences.

People:

Rahul Shah (External Collaborator)
Sudip Biswas (Graduate Student:2015-2016)
Arnab Ganguly (Graduate Student:2015-2017)
Sharma Thankachan (External Collaborator)
Wing-Kai Hon (External Collaborator)
Sandeep Kumar Majji (External Collaborator Undergraduate Student IIT Mumbai - Summer 2017)
BVS Naidu (External Collaborator Undergraduate Student IIT Mumbai - Summer 2017)

Results:

Our goal in this project is to develop succinct indexes for some of the unsolved problems in the area of succinct indexes. Suffix trees have many applications and many of them can be solved by augmenting suffix tree with additional information. The emphasis in the area till now has been to replace suffix tree by it succinct counter-part and then re-permute and compress the augmenting information. However, there are other problems which have their own different variant of suffix tree. We look beyond the existing Burrows-Wheeler Transform (or Phi-function) based approach to address some of these challenges.

One of our initial breakthrough has come the following result (published in SODA 2017) which solves this case for parameterized pattern matching. In parameterized pattern matching the symbols of pattern and the text can be matched under any one-to-one correspondence. At the heart of this solution is a transform called pBWT (parameterized Burrrows-Wheeler Transform) and we show how its corresponding LF-mapping can be computed.

Here is a link to my talk - Parameterized Text Indexing.pptx

Publications:


Arnab Ganguly, Rahul Shah, Sharma V. Thankachan, pBWT: Achieving Succinct Data Structures for Parameterized Pattern Matching and Related Problems, In ACM-SIAM Symposium on Discrete Algorithms (SODA), 397-407, 2017.

Arnab Ganguly, Wing-Kai Hon, Rahul Shah, Stabbing Colors in One Dimension, Data Compression Conference (DCC), 2017.

Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, Yilin Yang, Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching, In Combinatorial Pattern Matching (CPM), 2:1-2:12, 2016.

Arnab Ganguly, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, Space-Time Trade-Offs for the Shortest Unique Substring Problem, International Symposium on Algorithms and Computation (ISAAC), 34:1-34:13, 2016.

Arnab Ganguly, Wing-Kai Hon, Rahul Shah, A Framework for Dynamic Parameterized Dictionary Matching, Symposium on Algorithms Theory (SWAT), 10:1-10:1, 2016.

 

 

Acknowledgement:

This material is based upon work supported by the National Science Foundation under Grant CCF 1527435.

 

Disclaimer:

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and collaborators and do not necessarily reflect the views of the National Science Foundation.

 

Last updated: July 13th, 2017

Contact: visit PIís webpage link at the top of the page