Dynamical Systems and Average-Case Analysis of General Tries (original) (raw)

Dynamical Systems and Average-Case Analysis of General Tries

Brigitte Vall�e

GREYC, Universit� de Caen, France

Algorithms Seminar

June 9, 1997

[summary by Julien Cl�ment]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract

The three major parameters of a trie(sometimes called digital tree), number of nodes, path length, and height, are analyzed precisely in a general context where words are emitted by a source associated to a dynamical system. The results can all be stated in terms of two intrinsic characteristics of the source: the entropy and the probability of letter coincidence. These characteristics themselves are linked in a natural way to spectral properties of a Ruelle operator associated to the dynamical system.

1 Probabilistic dynamical sources

1.1 Definitions

A dynamical source, in the context of information theory, is a mechanism which produces infinite words over an alphabet M. Such a system is defined by four elements: (i) an alphabet M included in N, (ii) a quasi partition of _I_=]0,1[ with intervals I m, mM, (iii) a mapping s: IM which is constant over each I m and equal to m and finally (iv) a mapping T: I � _I_which satisfies two properties: the restriction of T to I m is a real analytic bijection from I m to I; the mapping T is expansive, i.e. |T'(x)|>1 on I. The words emitted by the source are produced by iterating T and coded thanks to s. The word M(x) of _M_� (an infinite sequence of symbols), where_x_� I, is formed with the symbols

M(x):=(s(x),s(T(x)),s(_T_2(x)),...,s(T k(x)),...).

Each letter of the alphabet is associated to a distinct branch of T or equivalently to a distinct inverse branch of _T_denoted by h m. The bijection h m: II _m_coincides with the inverse of the restriction of T to I m.

1.2 Examples

Figure 1: Graphical representation of the mappings T ands for a source based on the continued fraction expansion (left) and for a memoryless source based on the 5-ary expansion of numbers (right).

A lot of probabilistic dynamical sources can be described in such a framework, including all memoryless sources where letters of the alphabet can be emitted with probabilities {p i} independently of previous letters. This gives a mapping T where branches are affine. In particular this encompasses the _b_-ary expansion model of numbers.Markov sources take into account a finite past for producing words and thus generalize memoryless sources. Finally, in a model where the source is based upon the continued fraction expansion of numbers, the alphabet is N and the probability for a character to be emitted depends on all the previous history.

1.3 Fundamental intervals, entropy, coincidence probability

Numbers which share the same prefix expansion _m_1,...,m k form an interval called fundamental interval of depth k which is exactly, with the preceding notations, In a general context, the interval I is endowed with a continuous density f (so that _F_denotes the associated distribution). Then the word M(x) is produced according to the expansion process (using T and s). In this context, the measure u h of a fundamental interval I _h_associated to _h_=h _m_1 � ... � h m k is

u h:= |F(h(0))-F(h(1))|,

and plays a crucial role since it is the probability that a word begins with a certain prefix. During the analysis, two quantities relative to the source appear naturally. The entropy h( S,F) and the coincidence probability c( S,F) are defined as the limits

| h( S | ,F):= | | | u h log u h, c( S | ,F):= | | 1/k. | | -------- | ------- | | | ----------------------------- | ------- | | ------ |

It is interesting to note that these limits exist and are independent of the distribution F.

2 Tries associated to a general source

Let MN be a set of elements called digits, and _M_� the set of all infinite sequences built over M. For any word produced by the dynamical source M(x)=(s(x),s(Tx),s(T_2_x),...) (with xI), the head and tail functions are defined by

head(M(x))=s(x), tail(M(x))=M(Tx).

Any finite set of infinite words produced by the same source can be written as M(X)={M(x) |xX}, and one associates to X a trie, Trie(X), defined by the following recursive rules

2.1 Parameters

The main parameters of a trie are size (number of internal nodes), height and external length path, which is the sum of all links from the root to each leaf.

The model considers here infinite strings emitted independently by the same dynamical source. Rather than considering a fixed number n of strings, a Poisson model of rate n is used, where the number of strings N is also a random variable which strongly concentrates around n. The strong property of independence of this particular model makes the analysis easier and gives access to the expectations of parameters. The expectations of height, size and external path length under a Poisson model of rate n and distribution F over

I are respectively

D(n)= [1- (1+nu h)e ], S(n)= [1-(1+n u h) e ], P(n)= n u h [1-e ].

2.2 Asymptotics

These quantities are easily recognized as harmonic sums of the form F(x)=�kK l_k_ f(�k x) (excepted for the height which needs a small calculation step before). The best tool to analyze the asymptotics of such harmonic sums is the Mellin transform, which leads to locating poles of the associated Dirichlet series L(s)=�kK l_k_ �k s. Here, the key quantity for the analysis of size and path length is the_series of fundamental intervals_,

| L(F,s)= | u h _s_= | |F(h(0))-F(h(1))|s, | | ----------- | ------------- | ------------------------------ |

considered for complex values of s. For a general source, it is not always easy (or possible) to locate precisely the singularities. In this case a Tauberian theorem, under some constraints, can be used to extract the asymptotic expansion.

3 Generalized Ruelle operators

3.1 Presentation

The generalized Ruelle operators are derived from the original Ruelle operators, and involve secants of the inverse branches_H_(u,v):=|(h(u)-h(v))/(_u_-v)|. The generalized Ruelle operators are defined by

| G | _s_[_F_](_u_,_v_)= | | (u,v)s F(h(u),h(s)), | | ------- | -------------------- | | ------------------------------------ |

where _H_~ is the analytic extension of H and s is a complex parameter. Here the sum is taken over branches of depth 1. If we define the secant L of the distribution F, then the Dirichlet series can be expressed as

| L(F,s)= | u h _s_= | |F(h(0))-F(h(1))|_s_= (_I_-Gs)-1[_L_ _s_](0,1). | | ----------- | ------------- | --------------------------------------------------------------------- |

3.2 Singularities of the quasi inverse (_I_-Gs)-1

Singularities of (_I_-Gs)-1 are of special interest because these are also singularities of L(F,s). These singularities arise for values of s where Gs has an eigenvalue equal to 1. In particular, there is always a pole at_s_=1 (easy to prove from the previous form of L(F,s)). One can derive from strong properties of the operator Gs that there are three different cases, called periodic, quasi-periodic and aperiodic, depending on the precise nature of the eigenvalues of G_s_on the line �(s)=1. The operator Gs has also special properties at _s_=1 and _s_=2 since the entropy of the source is h( S)=-l'(1) (the derivative of the dominant eigenvalue at_s_=1), while the coincidence probability is c( S)=l(2).

4.1 Analysis of height

The analysis of height is based on estimates of the individual probabilities p_k_(n)=�|h|=k (1+n u h) _e_-n u _h_followed by a Mellin analysis. This leads to the asymptotic expansion

D(n)= log n +P F(log n)+g+A F+o(1)

4.2 Analysis of size and path length

The operator (_I_-Gs)-1 has a simple pole at _s_=1, and thus gives the main term of the asymptotic expansion. For a general source, a Tauberian theorem can be applied to estimate the contribution of others poles. Finally one has

P(n)= n log n + o(n log n), S(n)= n + o(n).

5 Some important particular cases

5.1 Bernoulli sources

The Bernoulli source considers a finite alphabet _M_={1,..., r} with probability of emission {_p_1,...,p r} (with _p_1+...+p _r_=1). In this case the entropy and the coincidence probability have classical expressions

5.2 Continued fraction source

The continued fraction expansion of numbers can be considered as a dynamical source over the infinite alphabet _M_=N. The operator of Ruelle is then called the Ruelle-Mayer operator and is defined by The entropy of the source is linked to the so-called L�vy's constant which plays a central r�le in the the analysis of the Euclidean algorithm. The coincidence probability is a constant which intervenes in two-dimensional generalizations of the Euclidean algorithm. These constants are

References

[1]

Brigitte Vall�e. -- Dynamical systems and average-case analysis of general tries. -- Les cahiers du GREYC, 1997.

[2]

Flajolet (Philippe) and Vall�e (Brigitte). --Continued Fraction Algorithms, Functional Operators, and Structure Constants. -- Research Report n°2931, Institut National de Recherche en Informatique et en Automatique, July 1996. 33 pages. Invited lecture at the 7th Fibonacci Conference, Graz, July 1996; to appear in Theoretical Computer Science.

[3]

Herv� Daud� (Philippe Flajolet) and Vall�e (Brigitte). --An Average-case Analysis of the Gaussian Algorithm for Lattice Reduction. -- Research Report n°2798, Institut National de Recherche en Informatique et en Automatique, February 1996. To appear in_Combinatorics, Probability and Computing_, 1997.

[4]

Jacquet (Philippe) and Szpankowski (Wojciech). -- Analysis of digital tries whith Markovian dependency. IEEE Transactions on Information Theory, vol. 37, n°5, September 1991, pp. 1470--1475.

[5]

Vall�e (Brigitte). -- Op�rateurs de Ruelle-Mayer g�n�ralis�s et analyse en moyenne des algorithmes d'Euclide et de Gauss. Acta Arithmetica, vol. LXXXI, n°2, 1997, pp. 101--144.


This document was translated from LATEX byH E V E A.