Physical complexity of variable length symbolic sequences
Related papers
A complexity measure for symbolic sequences and applications to DNA
2006
We introduce a complexity measure for symbolic sequences. Starting from a segmentation procedure, we define the complexity of a sequence as the entropy of the distribution of lengths of the domains of relatively uniform composition into which the sequence is decomposed. We show that this quantity satisfies the properties usually required of a "good" complexity measure. In particular, it satisfies the one-hump property, is super-additive, and has the important property of depending on the level of detail at which the sequence is analyzed. Finally, we apply it to the evaluation of the complexity profile of some genetic sequences.
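The quantity described above is, at its core, the Shannon entropy of the segment-length distribution. A minimal sketch, assuming the segmentation into uniform domains has already been performed (the segmentation procedure itself is not reproduced here; `length_entropy` is a hypothetical helper name, not the authors' code):

```python
from collections import Counter
from math import log2

def length_entropy(segment_lengths):
    """Shannon entropy (bits) of the empirical distribution of domain lengths."""
    counts = Counter(segment_lengths)
    n = len(segment_lengths)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A decomposition into domains of a single length carries zero entropy,
# while a spread of domain lengths gives positive entropy:
print(length_entropy([1, 2, 4, 8]))  # 2.0 (four equally likely lengths)
```

A sequence segmented into identical-length domains (e.g. `[5, 5, 5, 5]`) yields entropy 0, matching the intuition that a perfectly regular decomposition is "simple".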
Symbolic complexity for nucleotide sequences: a sign of the genome structure
Journal of Physics A, 2016
We introduce a method to estimate the complexity function of a symbolic dynamical system from a finite sequence of symbols. We test this complexity estimator on several symbolic dynamical systems whose complexity functions are known exactly. We then use the technique to estimate the complexity function for the genomes of several organisms, under the assumption that a genome is a sequence produced by an (unknown) dynamical system. We show that the genomes of several organisms share the property that their complexity functions behave exponentially for words of small length ℓ (0 ≤ ℓ ≤ 10) and linearly for word lengths in the range 11 ≤ ℓ ≤ 50. We also find that species which are phylogenetically close to each other have similar complexity functions when calculated from samples of their coding regions.
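The complexity function of a symbolic sequence counts the distinct words ("factors") of each length ℓ that occur in it. A naive estimator over a finite sequence can be sketched as follows (an illustration only, assuming Python; `complexity_function` is a hypothetical name, and the paper's actual estimator may differ):

```python
def complexity_function(seq, max_len):
    """p(l) = number of distinct substrings ("words") of length l in seq,
    for l = 1 .. max_len."""
    return {l: len({seq[i:i + l] for i in range(len(seq) - l + 1)})
            for l in range(1, max_len + 1)}

p = complexity_function("ATGATGCC", 3)
print(p)  # {1: 4, 2: 5, 3: 5}
```

For a genuinely random source, p(ℓ) grows exponentially until it saturates against the finite sequence length, which is why the exponential-then-linear behaviour reported for genomes is informative.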
Entropy and complexity of finite sequences as fluctuating quantities
Biosystems, 2002
The paper is devoted to the analysis of digitized sequences of real numbers and discrete strings by means of the concepts of entropy and complexity. Special attention is paid to the random character of these quantities and to their fluctuation spectrum. As applications, we discuss neural spike trains and DNA sequences. We consider a given sequence as one finite-length realization of a certain random process; the other members of the ensemble are defined by appropriate surrogate sequences and surrogate processes. We show that n-gram entropies and the context-free grammatical complexity have to be considered as fluctuating quantities, and we study the corresponding distributions. Different complexity measures reveal different aspects of a sequence. Finally, we show that the diversity of the entropy (which takes small values for pseudorandom strings) and the context-free grammatical complexity (which takes large values for pseudorandom strings) nonetheless give consistent results when used to rank sample sequences taken from molecular biology, neuroscience, and artificial control sequences.
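The idea that an n-gram entropy is a fluctuating quantity over a surrogate ensemble can be illustrated with a small sketch (assumed Python with hypothetical names; shuffling is one simple way to build surrogates that preserve single-symbol frequencies, and is not necessarily the surrogate construction used in the paper):

```python
import random
from collections import Counter
from math import log2

def ngram_entropy(seq, n):
    """Block entropy H_n: Shannon entropy (bits) of the n-gram distribution."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in Counter(grams).values())

def surrogate_range(seq, n, trials=100, seed=0):
    """Spread of H_n over shuffled surrogates of seq (min, max)."""
    rng = random.Random(seed)
    symbols = list(seq)
    values = []
    for _ in range(trials):
        rng.shuffle(symbols)  # surrogate with the same symbol frequencies
        values.append(ngram_entropy("".join(symbols), n))
    return min(values), max(values)
```

Comparing `ngram_entropy(seq, n)` of the original sequence against the surrogate spread indicates whether the observed value could arise from symbol frequencies alone.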
Compositional complexity of DNA sequence models
Computer Physics Communications, 1999
Recently, we proposed a new measure of complexity for symbolic sequences (Sequence Compositional Complexity, SCC) based on the entropic segmentation of a sequence into compositionally homogeneous domains. Such segmentation is carried out by means of a conceptually simple, computationally efficient heuristic algorithm. SCC is now applied to the sequences generated by several stochastic models which describe the statistical properties of DNA, in particular the observed long-range fractal correlations. This approach allows us to test the capability of the different models in describing the complex compositional heterogeneity found in DNA sequences. Moreover, SCC detects clear differences where conventional standard methods fail.
On the complexity measures of genetic sequences
Bioinformatics, 1999
Motivation: It is well known that the regulatory regions of genomes are highly repetitive. They are rich in direct, symmetric and complemented repeats, and there is no doubt about the functional significance of these repeats. Among known measures of complexity, the Ziv-Lempel complexity measure most adequately reflects repeats occurring in the text. However, this measure does not take isomorphic repeats into account. By isomorphic repeats we mean fragments that are identical (or symmetric) modulo some permutation of the alphabet letters. Results: In this paper, two complexity measures of symbolic sequences are proposed that generalize the Ziv-Lempel complexity measure by taking into account any isomorphic repeats in the text (rather than just direct repeats, as in Ziv-Lempel). The first, the complexity vector, is designed for small alphabets such as the alphabet of nucleotides. The second is based on a search for the longest isomorphic fragment in the history of sequence synthesis and can be used for alphabets of arbitrary cardinality. These measures have been used for the recognition of structural regularities in DNA sequences. Some interesting structures related to the regulatory region of the human growth hormone are reported.
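For reference, the classical Lempel-Ziv (1976) production-count complexity that these measures generalize can be sketched as below. This handles direct repeats only; the isomorphic-repeat generalization described in the paper is not reproduced here:

```python
def lz76_complexity(s):
    """Number of phrases in a Lempel-Ziv 1976 parsing of s.

    Each phrase is extended while it (minus nothing) already occurs
    in the previously seen text; repetitive strings yield few phrases.
    """
    i, c, n = 0, 0, len(s)
    while i < n:
        l = 1
        # grow the phrase while s[i:i+l] occurs earlier in s[:i+l-1]
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

print(lz76_complexity("AAAA"))      # 2  -> parsing A | AAA
print(lz76_complexity("ABABABAB"))  # 3  -> parsing A | B | ABABAB
```

A sequence of isomorphic (but not direct) repeats scores high under this measure, which is exactly the limitation the paper's complexity vector addresses.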
Entropy estimation of symbol sequences
Chaos: An Interdisciplinary Journal of Nonlinear Science, 1996
We discuss algorithms for estimating the Shannon entropy h of finite symbol sequences with long-range correlations. In particular, we consider algorithms which estimate h from the code lengths produced by some compression algorithm. Our interest is in describing their convergence with sequence length, assuming no limits on the space and time complexities of the compression algorithms. A scaling law is proposed for extrapolation from finite sample lengths. This is applied to sequences from dynamical systems in non-trivial chaotic regimes, a 1-D cellular automaton, and written English texts.
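The core idea of bounding the entropy rate h by a compressor's code length per symbol can be sketched with an off-the-shelf compressor (a rough illustration only; for finite sequences this overestimates h because of compressor overhead, which is what the paper's scaling-law extrapolation corrects for, and that extrapolation is not reproduced here):

```python
import zlib

def compression_entropy_estimate(s):
    """Crude upper-bound estimate of the entropy rate (bits/symbol):
    compressed size in bits divided by the number of source symbols."""
    data = s.encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / len(data)

# A constant sequence compresses to almost nothing, so the estimate is
# close to 0 bits/symbol; a high-entropy source approaches log2(alphabet).
print(compression_entropy_estimate("A" * 10000))
```

Any general-purpose compressor yields such an upper bound; the quality of the estimate depends on how well the compressor exploits the sequence's long-range correlations.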
Three subsets of sequence complexity and their relevance to biopolymeric information
Theoretical Biology and Medical …, 2005
Genetic algorithms instruct sophisticated biological organization. Three qualitative kinds of sequence complexity exist: random (RSC), ordered (OSC), and functional (FSC). FSC alone provides algorithmic instruction. Random and Ordered Sequence Complexities lie at opposite ends of the same bi-directional sequence complexity vector. Randomness in sequence space is defined by a lack of Kolmogorov algorithmic compressibility. A sequence is compressible because it contains redundant order and patterns. Law-like cause-and-effect determinism produces highly compressible order. Such forced ordering precludes both information retention and the freedom of selection so critical to algorithmic programming and control. Functional Sequence Complexity requires this added programming dimension of uncoerced selection at successive decision nodes in the string. Shannon information theory measures the relative degrees of RSC and OSC; it cannot measure FSC. FSC is invariably associated with all forms of complex biofunction, including biochemical pathways, cycles, positive and negative feedback regulation, and homeostatic metabolism. The algorithmic programming of FSC, not merely its aperiodicity, accounts for biological organization. No empirical evidence exists of either RSC or OSC ever having produced a single instance of sophisticated biological organization. Organization invariably manifests FSC rather than successive random events (RSC) or low-informational self-ordering phenomena (OSC).