Dina Sokol - Academia.edu

Papers by Dina Sokol

Small-Space 2D Compressed Dictionary Matching

Lecture Notes in Computer Science, 2010

Succinct 2D Dictionary Matching with No Slowdown

Lecture Notes in Computer Science, 2011


2D Lyndon Words and Applications

Approximate Tandem Repeats

Encyclopedia of Algorithms, 2008

Keywords and Synonyms: Approximate repetitions; approximate periodicities. Problem Definition: Identification of periodic structures in words (variants of which are known as tandem repeats, repetitions, powers or runs) is a fundamental algorithmic task (see entry Squares and Repetitions). In many practical applications, such as DNA sequence analysis, the repetitions considered admit a certain variation between copies of the repeated pattern. In other words, the repetitions of interest are approximate tandem repeats, not only exact repeats. The simplest instance of an approximate tandem repeat is an approximate square. An approximate square in a word w is a subword uv, where u and v are within a given distance k according to some distance measure between words, such as Hamming distance or edit (also called Levenshtein) distance. There are several ways to define approximate tandem repeats as successions of approximate squ ...
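
The definition of an approximate square under edit distance can be illustrated with a minimal Python sketch. This is not the method described in the entry; it only checks the definition for a given pair of halves, using the textbook dynamic program for Levenshtein distance. The function names `edit_distance` and `is_approximate_square` are chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, keeping one row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def is_approximate_square(u: str, v: str, k: int) -> bool:
    """True if the subword uv is an approximate square with edit distance at most k."""
    return edit_distance(u, v) <= k
```

Because edit distance allows insertions and deletions, the two halves u and v need not have the same length, which is one way the edit-distance setting differs from the Hamming-distance one.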

Jonathan Badger, Paul Kearney, Ming Li, John Tsang, and Tao Jiang. Selecting the

On Two-Dimensional Lyndon Words

Classification of Tandem Repeats in the Human Genome

International Journal of Knowledge Discovery in Bioinformatics, 2012

Tandem repeats in DNA sequences are extremely relevant in biological phenomena and diagnostic tools. Computational programs that discover these tandem repeats generate a huge volume of data, which is often difficult to decipher without further organization. In this paper, the authors describe a new method for post-processing tandem repeats through clustering and classification. Their work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of the clusters for the tandem repeats in the human genome shows that the method yields a well-defined grouping in which similarity among repeats is apparent. The authors' new, alignment-free method facilitates the analysis of the myriad of tandem repeats that occur in the human genome and they believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats.

Clustering Tandem Repeats via Trinucleotides

2012 IEEE 12th International Conference on Data Mining Workshops, 2012

Tandem repeats in DNA sequences are extremely relevant in biological phenomena and diagnostic tools. Computational programs that discover these tandem repeats generate a huge volume of data, which is often difficult to decipher without further organization. In this paper, we describe a new method for post-processing tandem repeats through clustering. Our work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of these clusters for chromosome 1 of the human genome shows that the clustering of tandem repeats according to 3-grams yields well-defined clusters. Our new, alignment-free method facilitates the analysis of the myriad of tandem repeats that occur in the human genome and we believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats.
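
To make the n-gram representation concrete, here is a minimal Python sketch of how a repeat could be expressed as a 3-gram count profile and compared without alignment. The abstract does not say which distance measures were used, so the cosine distance below is an assumption made for illustration, as are the function names `ngram_profile` and `cosine_distance`.

```python
from collections import Counter
from math import sqrt


def ngram_profile(seq: str, n: int = 3) -> Counter:
    """Count every overlapping n-gram in seq (empty if seq is shorter than n)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine_distance(p: Counter, q: Counter) -> float:
    """1 - cosine similarity of two count profiles.

    Assumed measure for illustration only; the paper's measures are not given here.
    """
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    if norm == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / norm
```

Profiles like these can be fed to any standard clustering routine; the comparison depends only on n-gram counts, which is what makes the approach alignment-free.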

Speeding up the detection of tandem repeats over the edit distance

Theoretical Computer Science, 2014

Inplace run-length 2d compressed search

Theoretical Computer Science, 2003

The recent explosion in the amount of stored data has necessitated the storage and transmission of data in compressed form. The need to quickly access this data has given rise to a new paradigm in searching, that of compressed matching (Proc.

An Algorithm for Approximate Tandem Repeats

Journal of Computational Biology, 2001

A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = ūû, for which the Hamming distance of ū and û is at most k, in O(nk log(n/k)) time, or all those for which the edit distance of ū and û is at most k, in O(nk log k log(n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log(n/k)), where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in humans.
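
The single-repeat definitions above can be checked with a naive Python sketch. It is a brute-force enumeration that takes cubic time in the worst case, not the paper's O(nk log(n/k)) algorithm, and it reports every qualifying substring rather than only the locally optimal ones. The function names `hamming` and `approximate_squares` are chosen here for illustration.

```python
def hamming(u: str, v: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(u, v))


def approximate_squares(s: str, k: int) -> list[tuple[int, int]]:
    """Return (start, half_length) for every substring uv of s with
    |u| == |v| and hamming(u, v) <= k. Brute force, for illustration only."""
    hits = []
    n = len(s)
    for half in range(1, n // 2 + 1):
        for start in range(n - 2 * half + 1):
            u = s[start:start + half]
            v = s[start + half:start + 2 * half]
            if hamming(u, v) <= k:
                hits.append((start, half))
    return hits
```

On the abstract's own examples, abcabc is found as a perfect repeat with k = 0, and abcdaacd is found with k = 1 because abcd and aacd differ in a single position.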

Inplace 2D matching in compressed images

Journal of Algorithms, 2003

The compressed matching problem is the problem of finding all occurrences of a pattern in a compressed text. In this paper we discuss the 2-dimensional compressed matching problem in Lempel-Ziv compressed images. Given a pattern P of (uncompressed) size m × m, and a text T of (uncompressed) size n × n, both in 2D-LZ compressed form, our algorithm finds all occurrences of P in T. The algorithm is strongly inplace, that is, the amount of extra space used is proportional to the best possible compression of a pattern of size m^2. The best compression that the 2D-LZ technique can obtain for a file of size m^2 is O(m). The time for performing the search is O(n^2) and the preprocessing time is O(m^3). Our algorithm is general in the sense that it can be used for any 2D compression which can be sequentially decompressed in small space.

TRedD--A database for tandem repeats over the edit distance

Database, 2010

A 'tandem repeat' in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats are common in the genomes of both eukaryotic and prokaryotic organisms. They are significant markers for human identity testing, disease diagnosis, sequence homology and population studies. In this article, we describe a new database, TRedD, which contains the tandem repeats found in the human genome. The database is publicly available online, and the software for locating the repeats is also freely available. The definition of tandem repeats used by TRedD is a new and innovative definition based upon the concept of 'evolutive tandem repeats'. In addition, we have developed a tool, called TandemGraph, to graphically depict the repeats occurring in a sequence. This tool can be coupled with any repeat finding software, and it should greatly facilitate analysis of results. Database URL: http://tandem.sci.brooklyn.cuny.edu/

Tandem repeats over the edit distance

Bioinformatics, 2007

A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

Filtering Tandem Repeats in DNA Sequences

Finding Repeats Within Strings

Dynamic 2D Dictionary Matching in Small Space

TandemGraph: a graphical tool for modeling string regularities

Dynamic text and static pattern matching

ACM Transactions on Algorithms, 2007

In this paper, we address a new version of dynamic pattern matching. The dynamic text and static pattern matching problem is the problem of finding a static pattern in a text that is continuously being updated. The goal is to report all new occurrences of the pattern in the text after each text update. We present an algorithm for solving the problem, where the text update operation is changing the symbol value of a text location. Given a text of length n and a pattern of length m, our algorithm preprocesses the text in time O(n log log m), and the pattern in time O(m√(log m)). The extra space used is O(n + m√(log m)). Following each text update, the algorithm deletes all prior occurrences of the pattern that no longer match, and reports all new occurrences of the pattern in the text in O(log log m) time. We note that the complexity is not proportional to the number of pattern occurrences since all new occurrences can be reported in a succinct form.

Succinct 2D Dictionary Matching
