Behind the Sino-Tibetan Database of Lexical Cognates: Introductory remarks

One of the major efforts behind our recently published paper on the origin and spread of the Sino-Tibetan languages (Sagart et al. 2019) was the creation of a database of lexical cognates which was used to run the phylogenetic analyses. The creation of this database started about four years ago, when I joined the Centre de Recherches Linguistiques sur l'Asie Orientale in Paris as a research fellow in January 2015, and Guillaume Jacques and Laurent Sagart approached me with the idea of carrying out a phylogenetic study of Sino-Tibetan languages. In December 2017, almost three years after we had started, our database consisted of 180 concepts translated into 50 different languages. Since creating the database was not straightforward from the beginning, and we repeatedly found ourselves having to re-arrange the data or revise the procedure, I thought it might be useful to share our experience in a series of blog posts, as it may be of interest to scholars who wish to create their own databases.

A database of lexical cognates is nothing more than a comparative wordlist in which cognate relations between words from different languages are annotated. For the database itself, no specific software is needed: spreadsheet editors like LibreOffice, Excel, or Google Sheets can easily be used for this purpose. As a minimal requirement, such a database provides information on how a given language expresses a given concept and with which other words the word denoting this concept is etymologically related. Ideally, of course, more information should be supplied, for example, regarding the source of the information (be it a reference or original fieldwork), whether the word has been borrowed, or how the word is pronounced. If one wants to be very detailed, one can also indicate who made the respective cognate judgments, or even supply alignments that indicate where the experts think the words are cognate. A sketch of what such a wordlist could look like is given below.
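To make this concrete, here is a minimal sketch in Python of such a wordlist, written in roughly the tab-separated format expected by tools such as LingPy and EDICTOR: one row per word, with a language (DOCULECT), a concept, a form, and a numeric cognate-set identifier (COGID), where rows sharing a COGID are judged to be cognate. The three example entries are simplified for illustration and are not taken from our database.

```python
import csv
from collections import defaultdict
from io import StringIO

# A toy wordlist (illustrative forms only): rows that share a COGID
# are annotated as cognate with each other.
WORDLIST = """ID\tDOCULECT\tCONCEPT\tFORM\tCOGID
1\tOld_Chinese\thand\tn̥uʔ\t1
2\tBurmese\thand\tlak\t2
3\tJaphug\thand\ttɯ-jaʁ\t2
"""

def cognate_sets(tsv):
    """Group (language, form) pairs by their cognate-set ID."""
    sets = defaultdict(list)
    for row in csv.DictReader(StringIO(tsv), delimiter="\t"):
        sets[row["COGID"]].append((row["DOCULECT"], row["FORM"]))
    return sets

for cogid, forms in sorted(cognate_sets(WORDLIST).items()):
    print(cogid, forms)
```

Any spreadsheet editor can produce a table of this shape; the tab-separated plain-text form simply makes it easy to feed the data into software later on.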

While it sounds rather straightforward to create such a database at first glance, there are many pitfalls one should be aware of before starting to build one from scratch, and many potential problems one can encounter during the creation process. It can turn out that the data for a key language is insufficient, key collaborators may leave the project, coding the data may turn out to require much more time than estimated, and the results may be disappointing in the end.

In order to be prepared for what can happen, it would be ideal if we had some kind of guideline on how to create datasets of lexical cognates. Given the large dataset projects that were prominently used in the past, such as ABVD (Greenhill et al. 2008), IELex (Dunn 2012), or the datasets published as part of the Global Lexicostatistical Database project (Starostin and Krylov 2011), one might even think that this problem has been discussed long enough that scholars who want to build a new database of their own should have no trouble finding guidance on how to get started. Unfortunately, when going from theory to practice, our own experience in working with lexical data from different scholars as part of large data aggregation projects like CLICS² (List et al. 2018) has shown that this is usually not the case. While fieldworkers have their toolbox for creating dictionaries, there is no equivalent for historical linguists working on comparative databases of lexical cognates. As a result, scholars who start datasets from scratch often reinvent many wheels, and the wheels they reinvent may at times be squared.

Our experience with building the Sino-Tibetan Database of Lexical Cognates underlying the study by Sagart et al. (2019) does not solve these problems. We made use of tools like EDICTOR (List 2017), which facilitates cognate coding and the alignment of the data, but in order to obtain the data in the first instance, we mostly relied on the fact that we had people in our team (often myself) who could quickly prepare custom scripts to parse available sources and extract the data we needed (a sketch of this kind of script is given below). We also profited greatly from the fact that some projects, especially STEDT (Matisoff 2015), but also the Tower of Babel, had digitized large amounts of data in a rather regular form in the past. We were also lucky to have people in our team who had done original fieldwork, which enabled them to quickly fill in lists for certain varieties, consulting informants where data was missing, and to have external collaborators who generously shared their data and answered our queries on specific items (see the list of acknowledgments in Sagart et al. 2019 for details).
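To give an idea of what such throwaway extraction scripts look like, here is a purely hypothetical sketch: the file name, column names, and three-concept list are invented for illustration and do not correspond to any of the actual sources we used, which required far messier handling. The script filters a raw tab-separated export down to the concepts of interest and writes the result in wordlist format.

```python
import csv

# Hypothetical example: in practice this would be our 180-item concept list.
CONCEPTS = {"hand", "eye", "fish"}

# File and column names are invented for this sketch.
with open("raw_source_export.tsv", encoding="utf-8") as infile, \
        open("wordlist.tsv", "w", encoding="utf-8", newline="") as outfile:
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["ID", "DOCULECT", "CONCEPT", "FORM"])
    idx = 1
    for row in reader:
        gloss = row["gloss"].strip().lower()
        if gloss in CONCEPTS:
            writer.writerow([idx, row["language"], gloss, row["form"]])
            idx += 1
```

The real work, of course, lies not in the filtering itself but in normalizing glosses, transcriptions, and language names so that data from different sources becomes comparable at all.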

The way in which we assembled the data for our study was not straightforward, but rather a winding road of many dead ends, some surprises, lots of discussions, some disappointments, and a lot of lessons learned for the future. We did not reach our (or perhaps primarily my) initial goal of providing a database of cognates that would provide fully aligned cognate sets and list all correspondence patterns in an indisputable way, so that it could be inspected, challenged, and improved by our colleagues in every detail a historical linguist could think of. But we did manage to provide a dataset of 180 concepts translated into 50 different varieties of Sino-Tibetan, along with an interface for easily inspecting the data, which goes beyond many of the datasets that have been published in the past.

Our data is open: scholars can easily inspect it in detail from its URL, and they can even correct the cognate judgments, add alignments, and further expand or correct the dataset. In order to do so, however, one needs to learn a bit about the way in which we (1) assembled the data, (2) coded the data, and (3) used the tools for data curation and annotation. In order to make it easier to understand what was going on behind the Sino-Tibetan Database of Lexical Cognates, we plan to write a couple of blog posts in which we will explain how the data was assembled, curated, and analyzed.

References

Dunn, Michael (2012): Indo-European lexical cognacy database (IELex). http://ielex.mpi.nl/.

Greenhill, Simon J. and Blust, Robert and Gray, Russell D. (2008): The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271-283.

List, Johann-Mattis (2017): A web-based interactive tool for creating, inspecting, editing, and publishing etymological datasets. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. System Demonstrations. 9-12.

List, Johann-Mattis and Greenhill, Simon and Anderson, Cormac and Mayer, Thomas and Tresoldi, Tiago and Forkel, Robert (eds.) (2018): CLICS: Database of Cross-Linguistic Colexifications. Max Planck Institute for the Science of Human History. Jena: http://clics.clld.org/.

Matisoff, James A. (2015): The Sino-Tibetan Etymological Dictionary and Thesaurus project. Berkeley: University of California.

Sagart, Laurent and Jacques, Guillaume and Lai, Yunfan and Ryder, Robin and Thouzeau, Valentin and Greenhill, Simon J. and List, Johann-Mattis (2019): Dated language phylogenies shed light on the ancestry of Sino-Tibetan. Proceedings of the National Academy of Sciences of the United States of America. 1-6. DOI: 10.1073/pnas.1817972116

Starostin, George S. and Krylov, Phil (eds.) (2011): The Global Lexicostatistical Database. Compiling, clarifying, connecting basic vocabulary around the world: From free-form to tree-form. http://starling.rinet.ru/new100/main.htm.

