UNRAVELING THE TANGLES OF LANGUAGE EVOLUTION

Automated languages phylogeny from Levenshtein distance

2009

Languages evolve over time in a process in which reproduction, mutation, and extinction are all possible, much as happens with living organisms. Using this similarity it is possible, in principle, to build family trees that show the degree of relatedness between languages. The method used by modern glottochronology, developed by Swadesh in the 1950s, measures distances between languages from the percentage of words in a list that share a common historical origin. The weak point of this method is the degree of subjectivity involved in judging the distance. Recently we proposed an automated method that avoids this subjectivity, whose results can be replicated by studies that use the same database, and that requires no specific linguistic knowledge on the part of the researcher. Moreover, the method allows a quick comparison of a large number of languages. We applied our method to the Indo-European and Austronesian families, considering fifty different languages in each case. The resulting trees are similar to those of previous studies, but with some important differences in the position of a few languages and subgroups. We believe that these differences carry new information on the structure of the tree and on the phylogenetic relationships within the families.

On Certain Aspects of Distance-based Models of Language Relationships, with Reference to the Position of Indo-European among other Language Families (2018)

Journal of Indo-European Studies, 2018

The paper explores the informative potential of various distance-based methods of language classification such as cluster analysis, networks, and two-dimensional projections, using lexicostatistical data on 41 languages belonging to seven families (IE, Uralic, Altaic, Yupik-Chukchee, Kartvelian, Semitic, and North Caucasian) represented in the STARLING database. Rooting and weighting are of critical importance, radically affecting the graphic models. Special focus is placed on two-dimensional charts generated by multidimensional scaling and on the little-used minimum spanning tree method. The latter two techniques are employed to test the hybridization/Sprachbund theory of Indo-European origins. The "Semitic" tendency of IE relative to Uralic is significant, whereas neither the "Kartvelian" tendency nor the North Caucasian substratum hypothesis is supported by the two-dimensional models.
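As a rough illustration of the minimum spanning tree method mentioned in this abstract, the sketch below builds such a tree from a symmetric matrix of pairwise distances using Prim's algorithm. The function name, the labels `L1` to `L4`, and the distance values are hypothetical placeholders, not data from the paper, and the paper's own procedure (including its rooting and weighting choices) may differ.

```python
def minimum_spanning_tree(labels, dist):
    """Return MST edges (label_a, label_b, distance) using Prim's algorithm.

    `dist` is assumed to be a symmetric square matrix (list of lists)
    whose entry dist[i][j] is the distance between labels[i] and labels[j].
    """
    n = len(labels)
    if n == 0:
        return []
    in_tree = {0}          # start growing the tree from the first label
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge joining a tree node to a non-tree node.
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((labels[i], labels[j], d))
        in_tree.add(j)
    return edges


# Hypothetical toy distances between four unnamed languages.
toy_labels = ["L1", "L2", "L3", "L4"]
toy_dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
toy_edges = minimum_spanning_tree(toy_labels, toy_dist)
```

For the toy matrix, the tree links each language to its closest neighbour in a chain, which is the kind of "nearest relative" structure such a tree is meant to expose.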

Indo-European languages tree by Levenshtein distance

EPL (Europhysics Letters), 2008

The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited [1, 2] to construct language trees. The key point is the definition of a distance between all pairs of languages that is the analogue of a genetic distance. Many methods have been proposed to define these distances; one of them, used by glottochronology, computes distance from the percentage of shared "cognates". Cognates are words inferred to have a common historical origin, and subjective judgment plays a relevant role in identifying them. Here we push the analogy with evolutionary biology further and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all the words contained in a Swadesh list. The subjectivity of the process is consistently reduced and reproducibility is greatly facilitated. We test our method on the Indo-European group, considering fifty different languages and the two hundred words of the Swadesh list for each of them. We find a tree that closely resembles the one published in [1], with some significant differences.
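The distance described in this abstract can be sketched as follows. This is a minimal illustration, assuming that "renormalized" means dividing the Levenshtein distance by the length of the longer word; the function names and the two-word toy lists are illustrative and are not taken from the paper's data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length
    (assumed normalization), giving a value between 0 and 1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: dict, words_b: dict) -> float:
    """Average normalized distance over the meanings both lists contain."""
    shared = words_a.keys() & words_b.keys()
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)


# Toy two-meaning lists in plain orthography, for illustration only.
toy_english = {"night": "night", "water": "water"}
toy_german = {"night": "nacht", "water": "wasser"}
toy_distance = language_distance(toy_english, toy_german)
```

Computing `language_distance` for every pair of languages yields the distance matrix from which a tree can then be built with a standard clustering method.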

Geometric Representations of Language Taxonomies

Computer Speech and Language 25 (2011) 679–699, 2011

A Markov chain analysis of a network generated by the matrix of lexical distances allows for representing complex relationships between different languages in a language family geometrically, in terms of distances and angles. The fully automated method for construction of language taxonomy is tested on a sample of fifty languages of the Indo-European language group and applied to a sample of fifty languages of the Austronesian language group. The Anatolian and Kurgan hypotheses of the Indo-European origin and the 'express train' model of the Polynesian origin are thoroughly discussed.

Networking Phylogeny for Indo-European and Austronesian Languages

Nature Precedings, 2009

Harnessing cognitive abilities of many individuals, a language evolves upon their mutual interactions establishing a persistent social environment to which language is closely attuned. Human history is encoded in the rich sets of linguistic data by means of symmetry patterns that are not always feasibly represented by trees.

Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction

Journal of …, 2008

Historical relationships among languages are used as a proxy for social history in many non-linguistic settings, including the fields of cultural and molecular anthropology. Linguists have traditionally assembled this information using the standard comparative method. While providing extremely nuanced linguistic information, this approach is time consuming and labor intensive. Conversely, computational approaches are appreciably quicker, but can potentially introduce significant error. Furthermore, current methods often use cognate sets that were themselves coded by historical linguists, thus reducing the benefit of computational approaches. Here we develop a method, based on the ALINE distance, to extract feature-sensitive relationships from paired glosses, datasets that require minimal contribution from trained linguists beyond transcription from primary sources. We validate our results by comparison with data generated independently via the comparative method, and quantify error rates using consistency indices. To showcase our method's utility and to demonstrate its robustness at local and regional scales, we apply it to two language datasets from eastern Indonesia. As linguistic datasets proliferate, scalable computational methods that mimic historical linguistic reconstruction will become increasingly necessary. Although at present we cannot disentangle all the processes driving linguistic change (e.g. lexical borrowing), our method provides a robust and accurate alternative to manual linguistic analysis. The feature-sensitive method adopted here accurately and automatically identifies emergent patterns hidden in traditional word-lists by analyzing critical phonetic information that is discarded (or required as prerequisite) by many current cognate-based computational methods. 
This approach is not intended to supplant manual linguistic analysis, but has an important role in quickly generating robust data for nonlinguistic fields or interdisciplinary projects that require formal quantitative analysis of historical linguistic relationships. Our approach provides a workable approximate phylogeny in cases where a trained linguist is unavailable, or otherwise significantly reduces the time and effort required for manual classification.

Measures of lexical distance between languages

The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (1832) [13]. He collected comparative word lists for various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work concerning the geographical division of the Pacific, he proposed a method for measuring the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin. Recently, we proposed a new automated method which uses the normalized Levenshtein distances between words with the same meaning and averages over the words contained in a list. Another group of scholars, Bakker et al. (2009) [8] and Holman et al. (2008) [9], then proposed a refined version of our definition that includes a second normalization. In this paper we compare the information content of our definition with that of the refined version in order to decide which of the two can be applied with greater success to resolve relationships among languages.
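The two definitions compared in this abstract can be contrasted with a short sketch. The abstract does not spell out the second normalization, so the code assumes one common reading: the average normalized distance between same-meaning words is divided by the average normalized distance between words with different meanings, which is meant to discount chance resemblance. The names `ldn` and `ldnd` and the toy word lists are illustrative assumptions.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, substitution."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """First normalization: divide by the longer word's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(words_a: dict, words_b: dict) -> float:
    """Second normalization (assumed form): same-meaning average
    divided by different-meaning average."""
    meanings = sorted(words_a.keys() & words_b.keys())
    if len(meanings) < 2:
        raise ValueError("need at least two shared meanings")
    same = mean([ldn(words_a[m], words_b[m]) for m in meanings])
    different = mean([ldn(words_a[m], words_b[n])
                      for m in meanings for n in meanings if m != n])
    if different == 0:
        raise ValueError("different-meaning average is zero")
    return same / different


# Artificial word forms chosen only to make the arithmetic easy to follow.
toy_a = {"one": "ab", "two": "cd"}
toy_b = {"one": "ab", "two": "ce"}
toy_ldnd = ldnd(toy_a, toy_b)
```

In the toy lists the same-meaning average is 0.25 and every different-meaning pair is maximally distant, so the second normalization leaves the value unchanged; for languages with similar sound inventories the denominator would be smaller and the ratio correspondingly larger.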

Language-tree divergence times support the Anatolian theory of Indo-European origin

Nature, 2003

Languages, like genes, provide vital clues about human history. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics". Numerous genetic studies of Indo-European origins have also produced inconclusive results. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the 'Kurgan expansion' and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East by Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000-9,500 years BP. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years BP. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the Bayesian analysis.

Automated Dating of the World’s Language Families Based on Lexical Similarity

This paper describes a computerized alternative to glottochronology for estimating elapsed time since parent languages diverged into daughter languages. The method, developed by the Automated Similarity Judgment Program (ASJP) consortium, is different from glottochronology in four major respects: (1) it is automated and thus is more objective, (2) it applies a uniform analytical approach to a single database of worldwide languages, (3) it is based on lexical similarity as determined from Levenshtein (edit) distances rather than on cognate percentages, and (4) it provides a formula for date calculation that mathematically recognizes the lexical heterogeneity of individual languages, including parent languages just before their breakup into daughter languages. Automated judgments of lexical similarity for groups of related languages are calibrated with historical, epigraphic, and archaeological divergence dates for 52 language groups. The discrepancies between estimated and calibration dates are found to be on average 29% as large as the estimated dates themselves, a figure that does not differ significantly among language families. As a resource for further research that may require dates of known level of accuracy, we offer a list of ASJP time depths for nearly all the world’s recognized language families and for many subfamilies.
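The abstract mentions a date formula but does not state it. The sketch below therefore assumes a glottochronology-style logarithmic form, t = (log s - log s0) / (2 log r), where s is the measured lexical similarity between two languages, s0 is the similarity expected within a single language just before it splits (its internal heterogeneity), and r is the fraction of similarity retained per unit of time. The function name, the parameter names, and the numeric values in the example are all assumptions for illustration, not the calibrated constants of the ASJP method.

```python
import math


def divergence_time(similarity: float, s0: float, r: float) -> float:
    """Elapsed time since divergence, in the time unit over which `r`
    is defined, under the assumed form (log s - log s0) / (2 log r)."""
    if not 0 < similarity <= 1:
        raise ValueError("similarity must be in (0, 1]")
    if not 0 < s0 <= 1:
        raise ValueError("s0 must be in (0, 1]")
    if not 0 < r < 1:
        raise ValueError("r must be in (0, 1)")
    # The factor 2 reflects that both daughter languages change
    # independently after the split.
    return (math.log(similarity) - math.log(s0)) / (2 * math.log(r))


# Illustrative values only: with s0 = 0.9 and r = 0.8, a similarity of
# s0 * r**2 corresponds to exactly one time unit since divergence.
example_time = divergence_time(0.9 * 0.8 ** 2, s0=0.9, r=0.8)
```

Under this form, a pair of languages whose similarity equals s0 gets a divergence time of zero, which is how the heterogeneity of the parent language enters the calculation.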

Mapping the Origins and Expansion of the Indo-European Language Family

There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes about 6000 years ago. An alternative hypothesis claims that the languages spread from Anatolia with the expansion of farming 8000 to 9500 years ago. We used Bayesian phylogeographic approaches, together with basic vocabulary data from 103 ancient and contemporary Indo-European languages, to explicitly model the expansion of the family and test these hypotheses. We found decisive support for an Anatolian origin over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning 8000 to 9500 years ago. These results highlight the critical role that phylogeographic inference can play in resolving debates about human prehistory.