Documenting and researching endangered languages: the Pangloss Collection

Boyd Michailovsky, Martine Mazaudon, Alexis Michaud, Séverine Guillaume, Alexandre François & Evangelia Adamou. 2014. Documenting and Researching Endangered Languages: The Pangloss Collection. _Language Documentation & Conservation_ 8, pp. 119-135.

The Pangloss Collection [http://lacito.vjf.cnrs.fr/pangloss/index_en.htm] is a language archive developed since 1994 at the Langues et Civilisations à Tradition Orale (LACITO) research group of the French Centre National de la Recherche Scientifique (CNRS). It contributes to the documentation and study of the world’s languages by providing free access to documents of connected, spontaneous speech, mostly in endangered or under-resourced languages, recorded in their cultural context and transcribed in consultation with native speakers. The Collection is an Open Archive containing media files (recordings), text annotations, and metadata; it currently contains over 1,400 recordings in 70 languages, including more than 400 transcribed and annotated documents. The annotations consist of transcription, free translation in English, French and/or other languages, and, in many cases, word or morpheme glosses; they are time-aligned with the recordings, usually at the utterance level. A web interface makes these annotations accessible online in an interlinear display format, in synchrony with the sound, using any standard browser. The structure of the XML documents makes them accessible to searching and indexing, always preserving the links to the recordings. Long-term preservation is guaranteed through a partnership with a digital archive. A guiding principle of the Pangloss Collection is that a close association between documentation and research is highly profitable to both. This article presents the collections currently available; it also aims to convey a sense of the range of possibilities they offer to the scientific and speaker communities and to the general public.

Combining Documentation And Research: Ongoing Work On An Endangered Language

2012

Abstract: This paper is intended for an audience of speech technology specialists who believe that “automatic processing of under-resourced languages is a way to study language diversity with a multi-disciplinary view” (L. Besacier, keynote speech at this conference). It aims (i) to provide an illustration of the way in which data are collected in fieldwork on endangered languages, bringing attention to the quality of the transcriptions and annotations created by linguists; (ii) to present the contents and format of a set of ...

How usable are digital collections for endangered languages? A review

Proceedings of the Linguistic Society of America, 2022

Here, we report on pilot research on the extent to which language collections in digital linguistic archives are discoverable, accessible, and usable for linguistic research. Using a test case of common tasks in phonetic and phonological documentation, we evaluate a small random sample of collections and find substantial, striking problems in all domains. Of the original 20 collections, only six had digitized audio files with associated transcripts (preferably phrase-aligned). That is, only 30% of the collections in our sample were even potentially suitable for any type of phonetic work (regardless of quality of recording). Information about the contents of the collection was usually discoverable, though there was variation in the types of information that could be easily searched for in the collection. Though three collections were eventually aligned, only one was successfully force-aligned from the archival materials without substantial intervention. We close with recommendations for archive depositors to facilitate discoverability, accessibility, and functionality of language collections. Consistency and accuracy in file naming practices, data descriptions, and transcription practices are imperative. Providing a collection guide also helps. Including useful search terms about collection contents makes the materials more findable. Researchers need to be aware of the changes to collection structure that may result from archival uploads. Depositors need to consider how their metadata is included in collections and how items in the collection may be matched to each other and to metadata categories. Finally, if our random sample is indicative, linguistic documentation practices for future phonetic work need to change rapidly, if such work from archival collections is to be done in the future.

Internet applications for endangered languages: a talking dictionary of Ainu

2011

There are an estimated 6,900 languages spoken in the world today and at least half of them are under threat of extinction. This is mainly because speakers of smaller languages are switching to other larger languages for economic, social or political reasons, or because they feel ashamed of their ancestral language. A language can thus be lost in one or two generations, often to the great regret of the speakers’ descendants. Over the past ten years a new field of study called “language documentation” has developed. Language documentation is concerned with the methods, tools, and theoretical bases for compiling a representative and lasting multipurpose record of languages. It has developed in response to the urgent need to make an enduring record of the world’s many endangered languages and to support speakers of these languages in their desire to maintain them. It is also fueled by developments in information and media technologies which make documentation and the preservation and dissemin...

Online presentation and accessibility of endangered languages data: The General Portal to the DoBeS Archive

2012

Data depositories containing language documentation corpora are generally well structured, well maintained, and include large collections of many under-researched languages. However, they are not yet conceived of as resources that can be easily consulted on scientific or non-scientific questions pertaining to one of those languages. A general portal to the DoBeS archive has been created to facilitate access to the data, to attract more users to the archive, and to lower the threshold for users outside the linguistic community to access the data. The structure and the main features of this portal will be presented in this paper.