Information of sequences and applications
Related papers
Lossless Compression and Complexity of Chaotic Sequences
Computing Research Repository, 2011
, 2002]). We propose a new measure of complexity - defined as the number of iterations of NSRPS required to transform the input sequence into a constant sequence. We test this measure on symbolic sequences of the Logistic map for various values of the bifurcation parameter. The proposed measure of complexity is easy to compute and is observed to be highly correlated with the Lyapunov exponent of the original non-linear time series, even for very short symbolic sequences (as short as 50 samples). Finally, we construct symbolic sequences from the Skew-Tent map which are incompressible by popular compression algorithms like WinZip, WinRAR and 7-Zip, but compressible by NSRPS.
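The complexity measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that NSRPS (Non-Sequential Recursive Pair Substitution) replaces the most frequent adjacent pair with a fresh symbol on each pass. Counting overlapping pairs and breaking ties by first occurrence are simplifications made here. The function names `nsrps_step` and `nsrps_complexity` are hypothetical.

```python
from collections import Counter


def nsrps_step(seq):
    """One pass: replace the most frequent adjacent pair with a new symbol.

    Assumption: pairs are counted with overlap and ties go to the pair
    seen first; the paper may handle both cases differently.
    """
    pairs = Counter(zip(seq, seq[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    new_symbol = max(seq) + 1  # symbols are assumed to be integers
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(new_symbol)
            i += 2  # consume both members of the pair
        else:
            out.append(seq[i])
            i += 1
    return out


def nsrps_complexity(seq):
    """Number of passes needed to reach a constant sequence."""
    seq = list(seq)
    steps = 0
    while len(set(seq)) > 1:
        seq = nsrps_step(seq)
        steps += 1
    return steps
```

Each pass shortens a non-constant sequence by at least one symbol, so the loop always terminates. For example, `nsrps_complexity([0, 1, 0, 1])` needs a single pass, because replacing the pair (0, 1) leaves a constant sequence.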
Complexity of chaotic binary sequence and precision of its numerical simulation
Nonlinear Dynamics
Numerical simulation is one of the primary methods by which people study the properties of chaotic systems. However, the finite precision of all processors can cause chaos to degenerate into a periodic function or a fixed point. If the precision of the computer processor in binary numerical calculations is neglected, the numerical simulation results may not be accurate due to the chaotic nature of the system under study. New and more accurate methods must be found. A quantitative computable method of sequence complexity evaluation is introduced in this paper. The effect of finite precision is evaluated from the viewpoint of sequence complexity. The simulation results show that the correlation function based on information entropy can effectively reflect the complexity of pseudorandom sequences generated by a chaotic system, and that it is superior to the other entropy-based measures. The finite calculation precision of the processor has a significant effect on the complexity of chaotic binary sequences generated by the Lorenz equation. Pseudorandom binary sequences with high complexity can be generated by a chaotic system as long as a suitable computational precision and quantification algorithm are selected and behave correctly. The new methodology helps to gain insight into systems that may exist in various application domains such as secure communications and spectrum management.
Physical complexity of variable length symbolic sequences
A measure called physical complexity is established and calculated for a population of sequences, based on statistical physics, automata theory, and information theory. It is a measure of the quantity of information in an organism's genome. It is based on Shannon's entropy, measuring the information in a population evolved in its environment, by using entropy to estimate the randomness in the genome. It is calculated from the difference between the maximal entropy of the population and the actual entropy of the population when in its environment, estimated by counting the number of fixed loci in the sequences of a population. Up until now, physical complexity has only been formulated for populations of sequences with the same length. Here, we investigate an extension to support variable length populations. We then build upon this to construct a measure for the efficiency of information storage, which we later use in understanding clustering within populations. Finally, we investigate our extended physical complexity through simulations, showing it to be consistent with the original.
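As a rough illustration of the fixed-length case mentioned above, the sketch below computes the difference between the maximal entropy (the sequence length) and the sum of per-site Shannon entropies of a population. It assumes that entropies are taken with the logarithm base equal to the alphabet size, so that each site contributes at most 1. The function name `physical_complexity` and the sample population are hypothetical, and the variable-length extension discussed in the abstract is not covered.

```python
import math
from collections import Counter


def physical_complexity(population, alphabet_size):
    """Sequence length minus the summed per-site entropies.

    Assumption: all sequences have the same length, and entropy is
    measured with log base `alphabet_size`.
    """
    length = len(population[0])
    total_entropy = 0.0
    for site in range(length):
        counts = Counter(seq[site] for seq in population)
        n = sum(counts.values())
        total_entropy -= sum(
            (c / n) * math.log(c / n, alphabet_size) for c in counts.values()
        )
    return length - total_entropy


# Site 0 is fixed (entropy 0); site 1 is uniform over 4 symbols (entropy 1).
sample = ["AC", "AG", "AT", "AA"]
complexity = physical_complexity(sample, alphabet_size=4)
```

In this toy population only the first site is fixed, so the estimate is one site's worth of information.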
A detailed entropy analysis by the recent novelty of 'lumping' is performed on some DNA sequences. On the basis of this, we first report here a negative answer to the question 'can the DNA sequences at the level of nucleotides be generated by a deterministic finite automaton of essentially a small number of states, in the statistical limit?'. What is observed in all cases is an almost linear scaling of the block entropies (up to the numerical precision), close to that of a mixing ergodic system with a very high topological entropy. The basic result that we report here is that all the examined biological sequences appear to be barely compressible (they lie near the incompressible limit). The topological entropy of coding regions appears to be even higher than that of non-coding regions.
Entropy and complexity of finite sequences as fluctuating quantities
Biosystems, 2002
The paper is devoted to the analysis of digitized sequences of real numbers and discrete strings, by means of the concepts of entropy and complexity. Special attention is paid to the random character of these quantities and their fluctuation spectrum. As applications, we discuss neural spike-trains and DNA sequences. We consider a given sequence as one realization of finite length of certain random process. The other members of the ensemble are defined by appropriate surrogate sequences and surrogate processes. We show that n-gram entropies and the context-free grammatical complexity have to be considered as fluctuating quantities and study the corresponding distributions. Different complexity measures reveal different aspects of a sequence. Finally, we show that the diversity of the entropy (that takes small values for pseudorandom strings) and the context-free grammatical complexity (which takes large values for pseudorandom strings) give, nonetheless, consistent results by comparison of the ranking of sample sequences taken from molecular biology, neuroscience, and artificial control sequences.
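To make the idea of entropy as a fluctuating quantity concrete, the sketch below computes an n-gram (block) entropy and then its distribution over shuffled surrogates. Random shuffling is only one possible surrogate process and is an assumption here, as are the function names `block_entropy` and `surrogate_entropies`.

```python
import math
import random
from collections import Counter


def block_entropy(seq, n):
    """Shannon entropy (in bits) of the overlapping n-grams of seq."""
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def surrogate_entropies(seq, n, trials=100, seed=0):
    """n-gram entropies of shuffled copies of seq.

    Shuffling keeps the symbol frequencies but destroys correlations,
    so the spread of values shows how much the estimate fluctuates.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        shuffled = list(seq)
        rng.shuffle(shuffled)
        values.append(block_entropy(shuffled, n))
    return values
```

Comparing `block_entropy(seq, 2)` with the values returned by `surrogate_entropies(seq, 2)` indicates whether the observed sequence is more ordered than its shuffled counterparts.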
PRELIMINARY INVESTIGATION ON NONLINEAR DYNAMICAL MODELING OF THE BIOLOGICAL SEQUENCES
In this paper, we investigate the chaotic behavior of the biological sequences among the different species. Throughout this work, we have characterized the biological sequences according to their moment invariant, correlation dimension, and largest Lyapunov exponent estimates. We have applied our model to a number of human and mouse genomes encoded into a set of integers (time series) using a plain table mapping scheme. Our results indicate that the nonlinear dynamical characteristics have yielded significant differences between the sequences of the different species. That is, we have been able to classify the different genome sequences according to their chaotic parameters estimates. On the other hand, through our investigation we have found that the use of the chaotic modeling of the biological sequences could open new frontiers in the sequence similarity search techniques.
Dynamical systems and computable information
Discrete and Continuous Dynamical Systems - Series B, 2004
We present some new results which relate information to chaotic dynamics. In our approach the quantity of information is measured by the Algorithmic Information Content (Kolmogorov complexity) or by a sort of computable version of it (Computable Information Content), in which the information is measured by the use of a suitable universal data compression algorithm. We apply these notions to the study of dynamical systems by considering the asymptotic behavior of the quantity of information necessary to describe their orbits. When a system is ergodic, this method provides an indicator which equals the Kolmogorov-Sinai entropy almost everywhere. Moreover, if the entropy is 0, our method gives new indicators which measure the unpredictability of the system and allow us to classify various kinds of weak chaos. Actually, this is the main motivation of this work. The behaviour of a zero entropy dynamical system is far from being completely predictable, except in particular cases. In fact, there are 0 entropy systems which exhibit a sort of weak chaos, where the information necessary to describe the orbit behavior increases with time more than logarithmically (periodic case) even if less than linearly (positive entropy case). Also, we believe that the above method is useful for the classification of zero entropy time series. To support this point of view, we show some theoretical and experimental results in specific cases.
Entropy and compressibility of symbol sequences
1996
The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Besides the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim here is related to the problem of sequence compression. In spite of the great progress in this field achieved in the work of Shannon, Fano, Huffman, Lempel, Ziv and others [1], many questions still remain open. In particular, one must note that since the basic work by Lempel and Ziv the improvement of the standard compression algorithms has been rather slow, not exceeding a few percent per decade. On the other hand, several experts have expressed the idea that the long-range correlations which clearly exist in texts, computer programs, etc. are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in compressibility is twofold: