Networking Phylogeny for Indo-European and Austronesian Languages

Classification of the Indo-European Languages Using a Phylogenetic Network Approach

Studies in classification, data analysis, and knowledge organization, 2010

Discovering the origin of the Indo-European (IE) language family is one of the most intensively studied problems in historical linguistics. Gray and Atkinson [6] inferred a phylogenetic tree (i.e., an additive tree or X-tree [2]) of the IE family, using Bayesian inference and rate-smoothing algorithms, based on the data set of 87 Indo-European languages collected by Dyen et al. [5]. When conducting their classification study, Gray and Atkinson assumed that the evolution of languages was strictly divergent and that the frequency of borrowing (i.e., horizontal transmission of individual words) was very low. As a consequence, their results suggested a predominantly treelike pattern of IE language evolution. In our opinion, only a network model can adequately represent the evolution of the IE languages. We propose to apply a method of horizontal gene transfer (HGT) detection [8] to reconstruct a phylogenetic network depicting the evolution of the IE language family.

Language trees support the express-train sequence of Austronesian expansion

Nature, 2000

Languages, like molecules, document evolutionary history. Darwin observed that evolutionary change in languages greatly resembled the processes of biological evolution: inheritance from a common ancestor and convergent evolution operate in both. Despite many suggestions, few attempts have been made to apply the phylogenetic methods used in biology to linguistic data. Here we report a parsimony analysis of a large language data set. We use this analysis to test competing hypotheses--the "express-train" and the "entangled-bank" models--for the colonization of the Pacific by Austronesian-speaking peoples. The parsimony analysis of a matrix of 77 Austronesian languages with 5,185 lexical items produced a single most-parsimonious tree. The express-train model was converted into an ordered geographical character and mapped onto the language tree. We found that the topology of the language tree was highly compatible with the express-train model.
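To make the idea of a parsimony score concrete, the following is a minimal sketch of Fitch's small-parsimony count for unordered characters on a fixed rooted binary tree. It is not the procedure used in the study (which searched over trees and mapped an ordered character); the function names, the nested-tuple tree encoding, and the four-taxon toy matrix are all assumptions made for illustration.

```python
def fitch_score(tree, states):
    """Return (possible_root_states, changes) for one unordered character.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, matrix):
    """Sum the Fitch change counts over every character (column) in `matrix`."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for k in range(n_chars):
        column = {taxon: row[k] for taxon, row in matrix.items()}
        total += fitch_score(tree, column)[1]
    return total


# Hypothetical toy matrix: 4 taxa, 3 binary lexical characters.
toy_matrix = {
    "A": [1, 0, 1],
    "B": [1, 1, 1],
    "C": [0, 0, 0],
    "D": [0, 1, 0],
}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
print(tree_length(tree_ab_cd, toy_matrix), tree_length(tree_ac_bd, toy_matrix))
```

A parsimony search would evaluate `tree_length` over many candidate topologies and keep the one with the fewest changes; here the grouping of A with B is the shorter of the two candidates.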

Behind Family Trees: Secondary Connections in Uralic Language Networks (with J. Lehtinen, T. Honkola, K. Syrjänen, O. Vesakoski, N. Wahlberg)

Language Dynamics and Change, 2014

Although it has long been recognized that the family tree model is too simplistic to account for historical connections between languages, most computational studies of language history have concentrated on tree-building methods. Here, we employ computational network methods to assess the utility of network models in comparison with tree models in studying the subgrouping of Uralic languages. We also compare basic vocabulary data with words that are more easily borrowed and replaced cross-linguistically (less basic vocabulary) in order to find out how secondary connections affect computational analyses of this language family. In general, the networks support a treelike pattern of diversification, but also provide information about conflicting connections underlying some of the ambiguous divergences in the trees. These are seen as reflections of unclear divergence patterns (either in ancestral protolanguages or between languages closely related at present), which pose problems for a tree model. The networks also show that the relationships of closely related present-day languages are more complex than what the tree models suggest. When comparing less basic with basic vocabulary, we can detect the effect of borrowing between different branches (horizontal transfer) mostly between and within the Finnic and Saami subgroups. We argue that the trees obtained with basic vocabulary provide the primary pattern of the divergence of a language family, whereas networks, especially those constructed with less basic vocabulary, add reality to the picture by showing the effect of more complicated developments affecting the connections between the languages.

Splits or waves? Trees or webs? How divergence measures and network analysis can unravel language histories

Philosophical Transactions of the Royal Society, B: Biological Sciences, 2010

Linguists have traditionally represented patterns of divergence within a language family in terms of either a ‘splits’ model, corresponding to a branching family tree structure, or the wave model, resulting in a (dialect) continuum. Recent phylogenetic analyses, however, have tended to assume the former as a viable idealization also for the latter. But the contrast matters, for it typically reflects different processes in the real world: speaker populations either separated by migrations, or expanding over continuous territory. Since history often leaves a complex of both patterns within the same language family, ideally we need a single model to capture both, and tease apart the respective contributions of each. The ‘network’ type of phylogenetic method offers this, so we review recent applications to language data. Most have used lexical data, encoded as binary or multi-state characters. We look instead at continuous distance measures of divergence in phonetics. Our output networks combine branch- and continuum-like signals in ways that correspond well to known histories (illustrated for Germanic, and particularly English). We thus challenge the traditional insistence on shared innovations, setting out a new, principled explanation for why complex language histories can emerge correctly from distance measures, despite shared retentions and parallel innovations.

UNRAVELING THE TANGLES OF LANGUAGE EVOLUTION

Chaos, Complexity And Transport, 2012

The relationships between languages, molded by extremely complex social, cultural and political factors, are assessed by an automated method in which the distance between languages is estimated by the average normalized Levenshtein distance between words from a list of 200 meanings maximally resistant to change. A sequential process of language classification, described by random walks on the matrix of lexical distances, makes it possible to represent complex relationships between languages geometrically, in terms of distances and angles. We have tested the method on a sample of 50 Indo-European and 50 Austronesian languages. The geometric representation of language taxonomy allows accurate inferences to be made about the most significant events of human history by tracing changes in language families through time. The Anatolian and Kurgan hypotheses of Indo-European origin and the "express train" model of Polynesian origin are thoroughly discussed.
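The lexical distance mentioned in this abstract can be sketched as follows. The abstract does not state how the Levenshtein distance is normalized; dividing by the length of the longer word is an assumption here, as are the function names and the toy word lists, which are not real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: dict, words_b: dict) -> float:
    """Average normalized distance over the meanings attested in both lists."""
    shared = [meaning for meaning in words_a if meaning in words_b]
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_levenshtein(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)


# Hypothetical two-meaning word lists, for illustration only.
lang_x = {"one": "abc", "two": "xy"}
lang_y = {"one": "abd", "two": "xy"}
print(language_distance(lang_x, lang_y))
```

Applying `language_distance` to every pair of languages yields the symmetric matrix of lexical distances on which the classification step would then operate.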

Networks uncover hidden lexical borrowing in Indo-European language evolution

Language evolution is traditionally described in terms of family trees with ancestral languages splitting into descendent languages. However, it has long been recognized that language evolution also entails horizontal components, most commonly through lexical borrowing. For example, the English language was heavily influenced by Old Norse and Old French; eight per cent of its basic vocabulary is borrowed. Borrowing is a distinctly non-tree-like process (akin to horizontal gene transfer in genome evolution) that cannot be recovered by phylogenetic trees. Here, we infer the frequency of hidden borrowing among 2346 cognates (etymologically related words) of basic vocabulary distributed across 84 Indo-European languages. The dataset includes 124 (5%) known borrowings. Applying the uniformitarian principle to inventory dynamics in past and present basic vocabularies, we find that 1373 (61%) of the cognates have been affected by borrowing during their history. Our approach correctly identified 117 (94%) known borrowings. Reconstructed phylogenetic networks that capture both vertical and horizontal components of evolutionary history reveal that, on average, eight per cent of the words of basic vocabulary in each Indo-European language were involved in borrowing during evolution. Basic vocabulary is often assumed to be relatively resistant to borrowing. Our results indicate that the impact of borrowing is far more widespread than previously thought.

[Siva Kalyan, Alexandre François & Harald Hammarström (eds)] Understanding language genealogy: Alternatives to the tree model

Siva Kalyan, Alexandre François & Harald Hammarström (eds), Understanding language genealogy: Alternatives to the tree model. Special issue of Journal of Historical Linguistics 9/1, 2019

There are important reasons to be sceptical of the accuracy and usefulness of the family-tree model in historical linguistics. That model assumes that every linguistic innovation applies to a language considered as an undifferentiated whole, a point with no “width”. But this assumption makes it impossible to use a tree to model the partial diffusion of an innovation within a language community (“internal diffusion”), or the diffusion of an innovation across language communities (“external diffusion”). These limitations have long been noticed by historical linguists (Schmidt 1872, Schuchardt 1900); but they become glaringly obvious in the cases discussed by Ross (1988) and François (2014) under the heading of “linkages”, i.e., language families that arise through the diversification, in situ, of a dialect network. The articles in this special issue all contribute towards addressing this problem, from a range of perspectives.
Contents of the special issue:
Problems with, and alternatives to, the tree model in historical linguistics (Siva Kalyan, Alexandre François & Harald Hammarström)
Non-tree-like signal using multiple tree topologies (Annemarie Verkerk)
Visualizing the Boni dialects with Historical Glottometry (Alexander Elias)
Subgrouping the Sogeram languages (Don Daniels, Danielle Barth & Wolfgang Barth)
Save the trees: Why we need tree models in linguistic reconstruction (Guillaume Jacques & Johann-Mattis List)
When the waves meet the trees: A response to Jacques and List (Siva Kalyan & Alexandre François)

Using hybridization networks to retrace the evolution of Indo-European languages

BMC Evolutionary Biology, 2016

Background: Curious parallels between the processes of species and language evolution have been observed by many researchers. Retracing the evolution of Indo-European (IE) languages remains one of the most intriguing intellectual challenges in historical linguistics. Most of the IE language studies use the traditional phylogenetic tree model to represent the evolution of natural languages, thus not taking into account reticulate evolutionary events, such as language hybridization and word borrowing which can be associated with species hybridization and horizontal gene transfer, respectively. More recently, implicit evolutionary networks, such as split graphs and minimal lateral networks, have been used to account for reticulate evolution in linguistics. Results: Striking parallels existing between the evolution of species and natural languages allowed us to apply three computational biology methods for reconstruction of phylogenetic networks to model the evolution of IE languages. We show how the transfer of methods between the two disciplines can be achieved, making necessary methodological adaptations. Considering basic vocabulary data from the well-known Dyen's lexical database, which contains word forms in 84 IE languages for the meanings of a 200-meaning Swadesh list, we adapt a recently developed computational biology algorithm for building explicit hybridization networks to study the evolution of IE languages and compare our findings to the results provided by the split graph and galled network methods. Conclusion: We conclude that explicit phylogenetic networks can be successfully used to identify donors and recipients of lexical material as well as the degree of influence of each donor language on the corresponding recipient languages. We show that our algorithm is well suited to detect reticulate relationships among languages, and present some historical and linguistic justification for the results obtained. 
Our findings could be further refined if relevant syntactic, phonological and morphological data could be analyzed along with the available lexical data.

On Certain Aspects of Distance-based Models of Language Relationships, with Reference to the Position of Indo-European among other Language Families (2018)

Journal of Indo-European Studies, 2018

The paper explores the informative potential of various distance-based methods of language classification such as cluster analysis, networks, and two-dimensional projections, using lexicostatistical data on 41 languages belonging to seven families (IE, Uralic, Altaic, Yupik-Chukchee, Kartvelian, Semitic, and North Caucasian) represented in the STARLING database. Rooting and weighting are of critical importance, radically affecting the graphic models. Special focus is placed on two-dimensional charts generated by multidimensional scaling and on the little-used minimum spanning tree method. The latter two techniques are employed to test the hybridization/Sprachbund theory of Indo-European origins. The "Semitic" tendency of IE relative to Uralic is significant, whereas neither the "Kartvelian" tendency nor the North Caucasian substratum hypothesis is supported by the two-dimensional models.
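As an illustration of the minimum spanning tree method named in this abstract, here is a minimal sketch using Prim's algorithm on a full symmetric distance matrix. The abstract does not say which algorithm or software was used, so this choice is an assumption; the labels `L1` to `L4` and the distance values are hypothetical and do not come from the STARLING data.

```python
def minimum_spanning_tree(labels, dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of edges (label_a, label_b, distance) in the order
    in which they are added to the tree, starting from labels[0].
    """
    n = len(labels)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        # Find the shortest edge linking the current tree to a new node.
        for i in sorted(in_tree):
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or dist[i][j] < best[2]:
                    best = (i, j, dist[i][j])
        i, j, d = best
        in_tree.add(j)
        edges.append((labels[i], labels[j], d))
    return edges


# Hypothetical lexicostatistical distances between four languages.
toy_labels = ["L1", "L2", "L3", "L4"]
toy_dist = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.7, 0.4, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
for a, b, d in minimum_spanning_tree(toy_labels, toy_dist):
    print(a, b, d)
```

The resulting tree links each language to its nearest neighbour chain without forcing a hierarchical root, which is why it can hint at intermediate positions that a rooted cluster diagram would hide.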