Language Technology Research Papers - Academia.edu

The present study discusses ethics in building and using applications based on natural language processing in electronic nursing documentation. Specifically, we first focus on the question of how patient confidentiality can be ensured in developing language technology for the nursing documentation domain. Then, we identify and theoretically analyze the ethical outcomes that arise when using natural language processing to support clinical judgement and decision-making. In total, we put forward and justify 10 claims related to ethics in applying language technology to nursing documents. A review of recent scientific articles related to ethics in electronic patient records or in the utilization of large databases was conducted. Then, the results were compared with ethical guidelines for nurses and the Finnish legislation covering health care and processing of personal data. Finally, the practical experiences of the authors in applying the methods of natural language processing to nursing documents were added. Patient records supplemented with natural language processing capabilities may help nurses give better, more efficient and more individualized care to their patients. In addition, language technology may make it easier for patients to receive truthful information about their health and improve the nature of narratives. Because of these benefits, research on the use of language technology in narratives should be encouraged. In contrast, privacy-sensitive health care documentation brings specific ethical concerns and difficulties to the natural language processing of nursing documents. Therefore, when developing natural language processing tools, patient confidentiality must be ensured. While using the tools, health care personnel should always be responsible for the clinical judgement and decision-making.
One should also consider that the use of language technology in nursing narratives may threaten patients' rights by using documentation collected for other purposes. Applying language technology to nursing documents may, on the one hand, contribute to the quality of care, but, on the other hand, threaten patient confidentiality. As an overall conclusion, natural language processing of nursing documents holds the promise of great benefits if the potential risks are taken into consideration.

Text type determines the linguistic and paralinguistic means for conveying the message. The present study investigates how to discriminate between text types and which types should be focused on in language teaching. One of the distinctive features of a text type is text formality, which can be determined from its part-of-speech structure. Our formality analysis covers 28 oral and written texts. The results indicate that written texts are more formal than oral ones, and monologues more formal than dialogues. The results are applicable in foreign language teaching (FLT) as well as in language technology for automatic identification of text types.

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find the last of these, MXPOST, to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
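
As a rough illustration of the committee idea, the sketch below combines per-token predictions from several taggers by simple plurality voting. It is not the authors' implementation and does not reproduce their plural voting scheme: the function name `plurality_vote`, the tie-breaking rule (prefer the tagger listed first) and the toy tags are all assumptions made for this example.

```python
from collections import Counter


def plurality_vote(predictions):
    """Combine tag sequences from several taggers, token by token.

    predictions: list of tag sequences, one per tagger, all the same length.
    Ties go to the tag proposed by the earliest tagger in the list
    (an assumption of this sketch; list the most accurate tagger first).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # First tag, in tagger order, that reaches the top vote count.
        winner = next(tag for tag in token_tags if counts[tag] == best)
        combined.append(winner)
    return combined


# Toy example with made-up tags for a three-token sentence.
tagger_a = ["N", "V", "ADJ"]
tagger_b = ["N", "N", "ADJ"]
tagger_c = ["PRON", "V", "ADV"]
print(plurality_vote([tagger_a, tagger_b, tagger_c]))
```

Weighted voting would replace the raw counts with per-tagger weights, for example each tagger's accuracy on held-out data.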

The establishment of the Estonian Emotional Speech Corpus (EESC) began in 2006 within the framework of the National Programme for Estonian Language Technology at the Institute of the Estonian Language. The corpus contains 1,234 Estonian sentences that express anger, joy and sadness, or are neutral. The sentences come from text passages read out by non-professionals who were not given any explicit indication of the target emotion. It was assumed that the content of the text would elicit an emotion in the reader and that this would be expressed in their voice. This avoids the exaggerations of acted speech. The emotion of each sentence in the corpus was then determined by listening tests. The corpus is publicly available at http://peeter.eki.ee:5000/.
This article gives an overview of the theoretical starting-points of the corpus and their usefulness for its implementation.

@Book{LaTeCH-SHELTR:2009, editor = {Lars Borin and Piroska Lendvai}, title = {Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH -- SHELT\&R 2009)}, month = {March}, year = {2009}, address = {Athens, Greece}, publisher = {Association for Computational Linguistics}, url = {http://www.aclweb.org/anthology/W09-03} } @InProceedings{goerz-scholz:2009:LaTeCH-SHELTR, author = {Goerz, Guenther and Scholz, Martin}, title = {Content ...

This paper discusses our efforts to develop a full automatic speech recognition (ASR) system for Scottish Gaelic, starting from limited resources. Building ASR technology is important for documenting and revitalising endangered languages; it enables existing resources to be enhanced with automatic subtitles and transcriptions, improves accessibility for users, and, in turn, encourages continued use of the language. In this paper, we explain the many difficulties faced when collecting minority language data for speech recognition. A novel cross-lingual approach to the alignment of training data is used to overcome one such difficulty, and in this way we demonstrate how majority language resources can bootstrap the development of lower-resourced language technology. We use the Kaldi speech recognition toolkit to develop several Gaelic ASR systems, and report a final word error rate (WER) of 26.30%. This is a 9.50% improvement on our original model.
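
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below is a minimal, generic computation and not the Kaldi scoring pipeline used in the paper; the function name `word_error_rate` and the English example strings are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.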

The article rests on two axioms: gender assignment in Norwegian is rule-governed, not arbitrary, and masculine is the default gender in Norwegian (Norwegian nouns are masculine unless something indicates otherwise). On the basis of these axioms, a set of semantic, morphological and phonological rules is presented that yields the correct gender for about 94% of the nouns in Norwegian. The approach of the article is synchronic, but the final section discusses Olav T. Beito's data on gender shift from Old Norwegian to Nynorsk in light of theories of gender shift in Germanic put forward by Donald Steinmetz.

The paradigm crisis in IE linguistics, resulting from the incapability of monolaryngealism (Szemerényi) and trilaryngealism (Eichner et alii) to reconstruct *h(2) except on the basis of Hitt. ḫ, was preliminarily solved in Pyysalo 2013, 2019, where its cause, de Saussure’s and Møller’s defective ablaut *Ae : *A : *eA, was corrected by adding the long quantity PIE *ē, yielding a pattern
*Aē : *Ae: *A : *eA : *ēA (where *A = Neogr. *ǝ = PIE *ɑ).
The paper at hand defines the sufficient condition for ending the crisis, viz. presenting the critical solutions to all main problems of the (P)IE vowel system in connection to PIE *h = Hitt. ḫ, including the explanation for the correlation between PIE *h and the (IE) ‘a-vocalism’ (Neogr. *ǝ a ā), the reconstruction of the triple representation of schwa in Greek with a single PIE *h, the solution to BRUGMANN’s law, and the description of the maximal PIE ablaut in connection with PIE *h.

In an effort to meet the demands for speed, productivity and low cost, the translation industry has turned to Machine Translation (MT) and Post-editing (PE). Nowadays, MT output is used as raw translation to be further post-edited by a translator (Lommel and DePalma, 2016). Yet, translators still approach PE with caution and scepticism and question its real benefits (Koponen 2012; Gaspari et al 2014; Moorkens 2018). In addition, attitudes to MT and PE seem to affect PE effort and performance (Witczak, 2016; Çetiner and İşisağ, 2019). In that light, this study aims to investigate the attitudes and perceptions of undergraduate translation students towards MT and PE and their performance before and after they receive training in MT and PE. Questionnaires are used to capture their attitudes and perceptions, the technical and temporal effort expended by the students while post-editing is calculated, and a human evaluation of the post-edited output is carried out to assess their performance and the quality of the post-edited texts. The analysis reveals a change in the students' attitudes and perceptions: they report a more positive attitude toward MT and PE, and they are more confident and faster, while avoiding over-editing.

This paper reports on the development of the Cantonese Wordnet, a new wordnet project based on Hong Kong Cantonese. It is built using the expansion approach, leveraging the existing Chinese Open Wordnet and the Princeton Wordnet's semantic hierarchy. The main goal of our project was to produce a high-quality, human-curated resource, and this paper reports on the initial efforts and steady progress of our building method. It is our belief that the lexical data made available by this wordnet, including Jyutping romanization, will be useful for a variety of future uses, including many language processing tasks and linguistic research on Cantonese and its interactions with other Chinese dialects.

We present an experiment designed for extracting construction candidates for a Swedish constructicon from text corpora. We have explored the use of hybrid n-grams with the practical goal to discover previously undescribed partially schematic constructions. The experiment was successful, in that quite a few new constructions were discovered. The precision is low, but as a push-button tool for construction discovery, it has proven a valuable tool for the work on a Swedish constructicon.

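The abstract above does not spell out how the hybrid n-grams were built. One common reading is an n-gram in which each slot holds either a word form or its part-of-speech tag, so that fixed and schematic positions can mix. The sketch below enumerates such n-grams under that assumption; the function name `hybrid_ngrams`, the decision to skip fully lexical and fully tag-based n-grams, and the toy English input are illustrative choices, not details taken from the paper.

```python
from itertools import product


def hybrid_ngrams(tagged_tokens, n):
    """Yield every n-gram in which each slot is a word form or its POS tag.

    tagged_tokens: list of (word, pos) pairs.
    N-grams made only of words or only of tags are skipped, on the
    assumption that only mixed ones suggest partially schematic patterns.
    """
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        # For each slot, choose index 0 (the word) or index 1 (the POS tag).
        for choice in product((0, 1), repeat=n):
            if len(set(choice)) < 2:
                continue  # all words or all tags: not hybrid
            yield tuple(token[i] for token, i in zip(window, choice))


# Toy tagged sentence; the tags are illustrative only.
sentence = [("the", "DET"), ("more", "ADV"), ("the", "DET"), ("merrier", "ADJ")]
for gram in hybrid_ngrams(sentence, 2):
    print(gram)
```

Counting these tuples over a corpus and ranking them by frequency or an association measure would give a candidate list for manual review, which fits the low-precision, push-button use described in the abstract.
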
Language technology plays a key role in making linguistic cultural heritage accessible: its methods give researchers uniform, consistent databases annotated with linguistic information. One of the most important areas of cooperation between historical linguists and language technologists is the building of historical corpora, which provide excellent raw material for theoretical research. The corpus presented in this article, the Parallel Bible Corpus (Párhuzamos Bibliakorpusz), draws on the material of the Old Hungarian Corpus (Ómagyar Korpusz) and contains all the Old and Middle Hungarian Bible translations included in it. The search application built on the Parallel Bible Corpus, the Parallel Bible Reader (Párhuzamos Bibliaolvasó), complements the Old Hungarian Concordance (Régi Magyar Konkordancia), the corpus query interface of the Old Hungarian Corpus: whereas the latter supports searching for specific linguistic elements, their properties and their contexts, the former supports searching for Bible passages and running comparative analyses on the results. The Parallel Bible Reader is available at http://parallelbible.nytud.hu, and the material of the Parallel Bible Corpus can be downloaded from the GitHub repository https://github.com/dlt-rilmta/parallelbible. The content of both the website and the repository is also available in English.

A text usually contains one or a few main topics, which are split up into subtopics, which in turn can be further described by more detailed topics. In this article we describe a system that segments a text into topics and subtopics. Each segment is characterized by important key terms that are extracted from it and by its start and end positions in the text. A table of contents is built by using the hierarchical and sequential relationships between topical segments that are identified in a text. The table of contents generator relies upon universal linguistic theories on the topic and comment of a sentence and on patterns of thematic progression in text. The linguistic theories of topic and comment are modeled both deterministically and probabilistically. The system is applied to English texts (news, World Wide Web and encyclopedia texts) and is evaluated. (Received March 08 2004; revised December 26 2005)

The article gives an analysis of the demographic material for North Sámi in Norway during the last 150 years, and compares it to key tendencies in some of the Uralic languages of the Soviet Union. The present linguistic landscape can be predicted with great accuracy from Friis' survey of 1860. At that time, bilingualism among the Norwegians was widespread in parishes with a predominantly Sámi or Finnish (Kven) population. During the assimilation process, the preservation of Sámi was not due to the absolute size of the Sámi population, but rather to its relative size. Today's Sámi communities are the ones that had the fewest Norwegians one and a half centuries ago. A key factor in the language shift process has been mixed marriages. The Soviet data show a greater degree of language preservation, especially for the Nenets and Mari. The difference is partly a result of the Soviet language policy, but also of the degree of contact between the minority and majority populations.

The collection and annotation of specialized corpora for less-spoken languages such as Greek is a crucial endeavour for the development and growth of language technology research for these languages. This paper presents the design and compilation of a biomedical corpus that took place in the framework of the national R&D project “IATROLEXI” (http://www.iatrolexi.gr). The aim of IATROLEXI is to create the critical infrastructure for the Greek language, i.e. linguistic resources and tools, to be used in advanced natural language processing (NLP) applications, e.g. information extraction and data mining, in the domain of biomedicine. The project will build upon existing resources that have been developed by the project partners, i.e. a Greek morphological lexicon of about 100,000 words, and language processing tools such as a lemmatizer and a morphosyntactic tagger, and it will further develop new resources such as a specialised corpus of biomedical texts that is presented in this p...

Large-scale corpora are becoming an increasingly important resource in language research, including many sub-disciplines within language technology. An initiative has developed over the last year or more which aims to construct such a corpus as a key element of ...

Introduction. Segmentation is the division of a speech file into non-overlapping sections corresponding to physical or linguistic units. Labelling is the assignment of physical or linguistic labels to these units. Both segmentation and labelling form a major part of current work in linguistic databases. 1.1.1 Segmental transcription. The term `transcription' may be used to refer to the representation of a text or an utterance as a string of symbols, without any linkage to the acoustic representation of the utterance. This was the pattern followed by speech and text corpus work during the 1980s, such as the prosodically-transcribed Spoken English Corpus (Knowles et al. 1995). These corpora did not link the symbolic representation with the physical acoustic waveform, and hence were not fully machine-readable. A recent project, MARSEC (Roach et al. 1993), has generated these links for the Spoken English Corpus such that it is now a

In this paper we set out the case for how smart-glasses can be used to augment and improve live Simultaneous Interpreting (SI) of spoken languages. We do this by reviewing the relevant literature and identifying the current challenges faced by professional foreign language interpreters, such as cognitive load, working memory constraints and session dynamics. Finally, we describe our experimental framework and the prototype smart-glasses-based system we are building, which will act as a testbed for research into the use of augmented-reality smart-glasses as an aid to interpreting. The main contributions of this paper are the review of the state of the art in language interpreting technology and the smart-glasses experimental framework, which acts as an aid to Simultaneous Interpreting.

This study sought to answer the following questions: How do dictionaries (monolingual and bilingual) treat verbs which do not passivise? What sort of information do these dictionaries provide for Arab EFL learners on non-passivisable verbs? Does the inadequate syntactic information on these verbs in dictionaries constitute a potential source of error for Arab EFL learners? The researcher focused on a set of unpassivisable verbs. The information on these verbs was evaluated in mono- and bilingual dictionaries. The results showed that monolingual dictionaries provide more information than bilingual ones do. Further, bilingual dictionaries might trigger errors by EFL learners.

Icelandic is a morphologically complex language, for which language technology resources are scarce. Only a few years ago, it could be stated that language technology was practically non-existent in Iceland. In this paper, we describe the development of an NLP toolkit for processing the language, the challenges faced and the decisions made during development. The current version of the toolkit consists of a tokeniser/sentence segmentiser, a morphological analyser, a linguistic rule-based tagger, and a finite-state parser. The development of our toolkit is a step towards building a Basic Language Resource Kit (BLARK) for the Icelandic language.

Research on a number of developments in language technologies, targeted at improving patent processing procedures within patent offices and in subsequent patent database search systems, is described. Aspects of patent processing covered are (1) OCR correction, to ...

We describe the symbolic authoring facilities of the M-PIRO project. M-PIRO is developing technology that allows personalized multilingual object descriptions, in both textual and spoken form, to be produced from symbolic information in a database and small fragments of text. The technology is being tested in the context of electronic museums, where a prototype that dynamically produces multilingual exhibit descriptions for presentations over the web has already been developed. This paper focuses on M-PIRO’s authoring subsystem, which allows domain experts with no language technology expertise to configure the system for new applications. The authoring facilities allow the experts to define or modify the structure of the underlying database, its contents, and the system’s domain-dependent linguistic resources. Previews of the generated texts can also be produced during the authoring process to monitor the content and quality of the resulting descriptions.

Creating common ground is an important element in development processes involving multidisciplinary partners in an innovative project, as is the case in the LTfLL (Language Technologies for Life-long Learning) project. The SDM (Scenario-based Development Method) we adopted provides the means to communicate about the development process with partners. A use case sets the scene for the problem to be solved, while several scenarios describe the development process, from analysis of the problem to be solved to ...