The International Workshop on Language Preservation: An Experiment in Text Collection and Language Technology
Related papers
Language Documentation and Conservation, 2015
In a recent article, Bird et al. (2013) discuss a workshop held at the University of Goroka in Papua New Guinea (PNG) in 2012. The workshop was intended to offer a new methodological framework for language documentation and capacity building that streamlines the documentation process and accelerates the global effort to document endangered languages through machine translation and automated glossing technology developed by computer scientists. As a volunteer staff member at the workshop, in this response to Bird et al. I suggest that it did not in the end provide us with a model that should be replicated in the future. I explain how its failure to uphold fundamental commitments from a documentary linguistic and humanistic perspective can help inform future workshops and large-scale documentary efforts in PNG. Instead of experimenting with technological shortcuts that aim to reduce the role of linguists in language documentation and that construct participants as sources of data, we should implement training workshops geared toward the interests and skills of local participants who are interested in documenting their languages, and focus on building meaningful partnerships with academic institutions in PNG.
The geographical region of Insular South East Asia and New Guinea is well-known as an area of mega-biodiversity. Less well-known is the extreme linguistic diversity in this area: over a quarter of the world's 6,000 languages are spoken here. As small minority languages, most of them will cease to be spoken in the coming few generations. The project described here ensures the preservation of unique records of languages and the cultures encapsulated by them in the region. The language resources were gathered by twenty linguists at, or in collaboration with, Dutch universities over the last 40 years, and were compiled and archived in collaboration with The Language Archive (TLA) at the Max Planck Institute in Nijmegen. The resulting archive constitutes a collection of multimedia materials and written documents from 48 languages in Insular South East Asia and West New Guinea. At TLA, the data was archived according to state-of-the-art standards (TLA holds the Data Seal of Approval): the component metadata infrastructure CMDI was used; all metadata categories as well as relevant units of annotation were linked to the ISO data category registry ISOcat. This guaranteed proper integration of the language resources into the CLARIN framework. Through the archive, future speaker communities and researchers will be able to extensively search the materials for answers to their own questions, even if they do not themselves know the language, and even if the language dies.
In search of island treasures: Language documentation in the Pacific
Language Documentation & Conservation Special Publication, 2018
The Pacific region is home to about 1,500 languages, with a strong concentration of linguistic diversity in Melanesia. The turn towards documentary linguistics, initiated in the 1980s and theorized by N. Himmelmann, has encouraged linguists to prepare, archive and distribute large corpora of audio and video recordings in a broad array of Pacific languages, many of which are endangered. The strength of language documentation is to entail the mutual exchange of skills and knowledge between linguists and speaker communities. Their members can access archived resources, or create their own. Importantly, they can also appropriate the outcome of these documentary efforts to promote literacy within their school systems, and to consolidate or revitalize their heritage languages against the increasing pressure of dominant tongues. While providing an overview of the general progress made in the documentation of Pacific languages in the last twenty years, this paper also reports on my own experience with documenting and promoting languages in Island Melanesia since 1997. François, Alexandre. 2018. In search of island treasures: Language documentation in the Pacific. In McDonnell, Bradley, Andrea L. Berez-Kroeker, and Gary Holton (eds.), Reflections on Language Documentation 20 Years after Himmelmann 1998. Language Documentation & Conservation Special Publication no. 15: 276-294. Honolulu: University of Hawai‘i Press.
Documenting and researching endangered languages: the Pangloss Collection
The Pangloss Collection is a language archive developed since 1994 at the Langues et Civilisations à Tradition Orale (LACITO) research group of the French Centre National de la Recherche Scientifique (CNRS). It contributes to the documentation and study of the world's languages by providing free access to documents of connected, spontaneous speech, mostly in endangered or under-resourced languages, recorded in their cultural context and transcribed in consultation with native speakers. The Collection is an Open Archive containing media files (recordings), text annotations, and metadata; it currently contains over 1,400 recordings in 70 languages, including more than 400 transcribed and annotated documents. The annotations consist of transcription, free translation in English, French and/or other languages, and, in many cases, word or morpheme glosses; they are time-aligned with the recordings, usually at the utterance level. A web interface makes these annotations accessible online in an interlinear display format, in synchrony with the sound, using any standard browser. The structure of the XML documents makes them accessible to searching and indexing, always preserving the links to the recordings. Long-term preservation is guaranteed through a partnership with a digital archive. A guiding principle of the Pangloss Collection is that a close association between documentation and research is highly profitable to both. This article presents the collections currently available; it also aims to convey a sense of the range of possibilities they offer to the scientific and speaker communities and to the general public.
Preserving a living archive of Indigenous language material
2016
This paper describes how Charles Darwin University Library is directly helping to sustain and preserve Aboriginal languages and culture that have been facing hurdles for long-term survival. The Library, in partnership with an ARC-funded research project known as the Living Archive of Aboriginal Languages (www.cdu.edu.au/laal), supports this effort with a repository, web application and digitisation program to preserve endangered Indigenous resources and facilitate both Indigenous community engagement and international linguistic research. The project serves as a rich case study demonstrating how academic libraries can work with researchers to support the archiving of cultural heritage.
Boyd Michailovsky, Martine Mazaudon, Alexis Michaud, Séverine Guillaume, Alexandre François & Evangelia Adamou. 2014. Documenting and Researching Endangered Languages: The Pangloss Collection. _Language Documentation & Conservation_ 8 (2014), pp. 119-135.
The Pangloss Collection [http://lacito.vjf.cnrs.fr/pangloss/index_en.htm] is a language archive developed since 1994 at the Langues et Civilisations à Tradition Orale (LACITO) research group of the French Centre National de la Recherche Scientifique (CNRS). It contributes to the documentation and study of the world’s languages by providing free access to documents of connected, spontaneous speech, mostly in endangered or under-resourced languages, recorded in their cultural context and transcribed in consultation with native speakers. The Collection is an Open Archive containing media files (recordings), text annotations, and metadata; it currently contains over 1,400 recordings in 70 languages, including more than 400 transcribed and annotated documents. The annotations consist of transcription, free translation in English, French and/or other languages, and, in many cases, word or morpheme glosses; they are time-aligned with the recordings, usually at the utterance level. A web interface makes these annotations accessible online in an interlinear display format, in synchrony with the sound, using any standard browser. The structure of the XML documents makes them accessible to searching and indexing, always preserving the links to the recordings. Long-term preservation is guaranteed through a partnership with a digital archive. A guiding principle of the Pangloss Collection is that a close association between documentation and research is highly profitable to both. This article presents the collections currently available; it also aims to convey a sense of the range of possibilities they offer to the scientific and speaker communities and to the general public.
Developing a Living Archive of Aboriginal Languages
Language Documentation & Conservation, 2014
The fluctuating fortunes of Northern Territory bilingual education programs in Australian languages and English have put at risk thousands of books developed for these programs in remote schools. In an effort to preserve such a rich cultural and linguistic heritage, the Living Archive of Aboriginal Languages project is establishing an open access, online repository comprising digital versions of these materials. Using web technologies to store and access the resources makes them accessible to the communities of origin, the wider academic community, and the general public. The process of creating, populating, and implementing such an archive has posed many interesting technical, cultural and linguistic challenges, some of which are explored in this paper.
Indigenous Languages of Indonesia: Creating Language Resources for Language Preservation
2008
In this paper, we report a survey of language resources in Indonesia, primarily of indigenous languages. We look at the official Indonesian language (Bahasa Indonesia) and 726 regional languages of Indonesia (Bahasa Nusantara) and list all the available lexical resources (LRs) that we could gather. This paper suggests that the smaller regional languages may remain relatively unstudied and unknown, but they are still worthy of our attention. Various LRs of these endangered languages are being built and collected by regional language centers for their study and preservation. We will also briefly report on their presence on the Internet.