Creating Corpora of Finland's Sign Languages
Related papers
The Corpus of Finnish Sign Language
Proceedings of the Ninth Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives (LREC2020), 2020
This paper presents the Corpus of Finnish Sign Language (Corpus FinSL), a structured and annotated collection of Finnish Sign Language (FinSL) videos published in May 2019 in FIN-CLARIN's Language Bank of Finland. The corpus is divided into two subcorpora, one of which comprises elicited narratives and the other conversations. All of the FinSL material has been annotated using ELAN and the lexical database Finnish Signbank. Basic annotation includes ID-glosses and translations into Finnish. The anonymized metadata of Corpus FinSL has been organized in accordance with the IMDI standard. Altogether, Corpus FinSL contains nearly 15 hours of video material from 21 FinSL users. Corpus FinSL has already been exploited in FinSL research and teaching, and it is predicted that in the future it will have a significant positive impact on these fields as well as on the status of the sign language community in Finland.
From Design and Collection to Annotation of a Learner Corpus of Sign Language
This paper presents part of the project "From Speech to Sign – learning Swedish Sign Language as a second language", which includes a learner corpus based on data produced by hearing adult L2 signers. The paper describes the design of corpus building and the collection of data for the Corpus in Swedish Sign Language as a Second Language (SSLC-L2). Another component of ongoing work is the creation of a specialized annotation scheme for SSLC-L2, one that differs somewhat from the annotation work in the Swedish Sign Language Corpus (SSLC), where the data is based on performance by L1 signers. We also account for and discuss the methodology used to annotate L2 structures.
The Sign Linguistics Corpora Network: Towards Standards for Signed Language Resources
Proceedings of LREC 2010, 2010
The Sign Linguistics Corpora Network is a three-year network initiative that aims to collect existing knowledge and practices on the creation and use of signed language resources. The concrete goals are to organise a series of four workshops in 2009 and 2010, create a stable Internet location for such knowledge, and generate new ideas for employing the most recent technologies for the study of signed languages. The network covers a wide range of subjects: data collection, metadata, annotation, and exploitation; these are the topics of the four workshops. The outcomes of the first two workshops are summarised in this paper; both workshops demonstrated that the need for dedicated knowledge on sign language corpora is especially salient in countries where researchers work alone or in small groups, which is still quite common in many places in Europe. While the original goal of the network was primarily to focus on corpus linguistics and language documentation, human language technology has gradually been incorporated as a user group of signed language resources.
Metadata for sign language corpora
Background document for an ECHO workshop. …, 2003
The goal of the workshop is to acquaint you with the concept of metadata and the IMDI standard, and to decide on a set of categories specific to our subfield of the language sciences. The list of categories in this document (section 6) is a first proposal for such a set.
STS-korpus: A Sign Language Web Corpus Tool for Teaching and Public Use
Language Resources and Evaluation, 2020
In this paper we describe STS-korpus, a web corpus tool for Swedish Sign Language (STS) which we have built during the past year, and which is now publicly available on the internet. STS-korpus uses the data of the Swedish Sign Language Corpus (SSLC) and is primarily intended for teachers and students of sign language. As such, it is designed to be simple and user-friendly, with no download or setup required. The user interface allows for searching, with search results displayed as a simple concordance, and for viewing videos with annotations. Each annotation also provides additional data and links to the corresponding entry in the online Swedish Sign Language Dictionary. We describe the corpus, its appearance and search syntax, as well as more advanced features like access control and dynamic content. Finally, we say a word or two about the role we hope it will play in the classroom, and something about the development process and the software used. STS-korpus is available here: https://teckensprakskorpus.su.se
Issues underlying a common Sign Language Corpora annotation scheme
Corpus-based Sign Language linguistics has emerged as a new linguistic domain, and as a consequence large-scale and controlled video data repositories are under construction for different Sign Languages. Nevertheless, as pointed out by Johnston (2008), no unified annotation scheme is yet available, which compromises any chance of comparing or reusing corpora across research teams. Another related issue is the comparability of descriptions and formalizations between SL linguistics and mainstream linguistics. In this paper, we address the issue of the definition of a common annotation scheme for Sign Language corpora annotation, distribution, exchange and comparison. In Section 2, we discuss the challenge of building inter-operable corpora for corpus-based linguistics. We also examine existing annotation schemes or strategies proposed for SL linguistics. In Section 3, we propose a small set of annotation tiers, based on Frame-Semantics, as a common annotation scheme. We also propose to add text-level as well as utterance-level metadata to this common annotation scheme, in order to broaden the range of future uses of SL corpora.
Sign Language Resources in Sweden: Dictionary and Corpus
Sign language resources are necessary tools for adequately serving the needs of learners, teachers and researchers of signed languages. Among these resources, the Swedish Sign Language Dictionary was begun in 2008 and has been in development ever since. Today, it has approximately 8,000 sign entries. The Swedish Sign Language Corpus is also an important resource, but it is of a very different kind than the dictionary. Compiled during the years 2009-2011, the corpus consists of video-recorded conversations among 42 informants aged between 20 and 82, from three separate regions in Sweden. With 14% of the corpus having been annotated with glosses for signs, it comprises a total of approximately 3,600 different signs occurring about 25,500 times (tokens) in the 42 annotated sign language discourses/video files. As these two resources sprang from different starting points, they are independent of each other; however, in the late phases of building the corpus the importance of combining work from the two became evident. This presentation will show the development of these two resources and the advantages of combining them.
This paper outlines the establishment of the first digital corpus of Irish Sign Language using a software programme called ELAN. The Signs of Ireland corpus comprises 40 signers, making it the largest digital annotated corpus of a signed language in Europe. This paper describes the way in which such software enhances sign linguistic research, and outlines some of the limitations that arise, in great part, because of the lack of a standardized notation system for signed languages, because of the need for human consistency when working on annotation, and because you will 'get out what you put in' when working with a digital corpus: that is, the decisions made regarding the annotations influence analysis results.