Language Technology Research Papers - Academia.edu

2025, Psychiatry Research

Effective access to knowledge within large declarative memory stores is one challenge in the development and understanding of long-living, generally intelligent agents. We focus on a sub-component of this problem: given a large store of knowledge, how should an agent's task-independent memory mechanism respond to an ambiguous cue, one that pertains to multiple previously encoded memories. A large body of cognitive modeling work suggests that human memory retrievals are biased in part by the recency and frequency of past memory access. In this paper, we evaluate the functional benefit of a set of memory retrieval heuristics that incorporate these biases, in the context of the word sense disambiguation task, in which an agent must identify the most appropriate word meaning in response to an ambiguous linguistic cue. In addition, we develop methods to integrate these retrieval biases within a task-independent declarative memory system implemented in the Soar cognitive architecture and evaluate their effectiveness and efficiency in three commonly used semantic concordances.
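As an illustration of how recency and frequency biases could rank candidate senses for an ambiguous cue, here is a minimal sketch. It assumes the base-level activation equation from the ACT-R literature, B = ln(Σ (now − t_j)^(−d)), as one common formalization; the abstract does not say which heuristics were evaluated, and the function names, the decay value 0.5 and the toy access history are all hypothetical.

```python
import math

def base_level_activation(access_times, now, decay=0.5):
    """ACT-R-style base-level activation: ln(sum((now - t) ** -decay))."""
    ages = [now - t for t in access_times if t < now]
    if not ages:
        return float("-inf")  # never accessed: lowest possible activation
    return math.log(sum(age ** -decay for age in ages))

def retrieve(candidates, history, now, decay=0.5):
    """Return the candidate sense with the highest activation."""
    return max(
        candidates,
        key=lambda sense: base_level_activation(history.get(sense, []), now, decay),
    )

# Hypothetical access history: sense -> times at which it was retrieved.
history = {
    "bank/finance": [1, 2, 3, 4],  # accessed often, but long ago
    "bank/river": [98, 99],        # accessed rarely, but very recently
}
print(retrieve(["bank/finance", "bank/river"], history, now=100))
```

With this toy history the recently used sense wins despite being accessed less often; a larger decay value strengthens that recency effect, a smaller one favors frequency.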

2025, Constructional Approaches to Language

Chapter 4 Towards continuity between the lexicon and the constructicon in FrameNet Brasil

2025, Szegedi Tudományegyetem, Informatikai Intézet

Abstract: In this paper we present StaffTalk, a large, manually annotated corpus of spontaneous Hungarian conversations. Our primary aim in creating the corpus was to provide research material for studying how informal communication and esteem influence the functioning and norm system of closed communities. As a first step, we had the recordings transcribed, asking the annotators to mark not only verbal information but also other, non-verbal information. The transcribed recordings were then annotated at three levels: we marked gossip occurring in the conversations, speech acts and other pragmatic features, and words indicating uncertainty. Thanks to these properties, the corpus is suitable for many kinds of pragmatic analysis, both in connection with the original research question and beyond it. Keywords: corpus, spontaneous speech, manual annotation, pragmatics, semantics, gossip. Nowadays, corpus-based studies and studies applying computational linguistic tools are becoming increasingly popular in the social sciences as well. ...

2025, Language in India

Compounding is a highly fertile process. It is often used in innovative ways to generate new words in most languages. During compounding, the participating members often undergo a morphosyntactic change that forces them to lose much of their lexicosemantic information. In this paper we attempt to capture the lexicosemantic properties that are lost in this process and to identify the factors that play active roles in this metamorphosis of compounds. Our investigation takes Bengali compounds as its central area of study, with occasional reference to English compounds, in order to understand the phenomenon systematically. The study has direct relevance to applied linguistics, mainstream linguistics and language technology.

2025, IEEE Access

Preserving and innovating indigenous languages is crucial for maintaining cultural heritage and facilitating community development in this digital age. The digital landscape of isiXhosa language digitization efforts, however, remains underexplored. This systematic scoping review aims to map the existing literature on digitalization efforts for isiXhosa language preservation and innovation. A scoping review was conducted, guided by PRISMA-ScR, on relevant literature pertaining to isiXhosa digitization, preservation, and innovation. Two databases were searched (Scopus and Web of Science), and additional articles were identified through a grey literature search (Google Scholar). Data were extracted in terms of title, year, country, language, digitization efforts, and summary of main findings and results related to isiXhosa language preservation and innovation. A total of 85 unique articles were included from 479 records, leading to the identification of five themes under the digitization efforts. Most studies were conducted in South Africa, accounting for about 78% of the total articles. Significant efforts were identified in grammar and morphology, speech recognition, and machine translation. There are unresolved gaps and challenges (such as the need for large, high-quality datasets) that must be addressed. Nevertheless, the efforts demonstrate an increased acceptance of the significance of protecting and improving indigenous African languages, such as isiXhosa, in this era of digital technology.

2025

Networking the development of computational resources for African languages can be greatly advanced if researchers aim to develop tools that are to a large extent language-independent and therefore reusable for other languages. In this paper we describe a particular case study, namely the development of an annotated corpus of Gĩkũyũ, using language-independent machine learning techniques. The general aim of our work on Gĩkũyũ is two-fold: on the one hand we wish to digitally preserve this resource-scarce language, while on the other hand it serves as a feasibility study of using language-independent machine learning techniques for linguistic annotation of corpora. To this end we investigate established annotation induction techniques like unsupervised learning and knowledge transfer. These methods can provide interesting perspectives for the linguistic description of many other resource-scarce languages.

2025, Lecture Notes in Computer Science

Current definitions of "software component" are based on abstract data types: collections of functions together with local data. This paper addresses two ways in which this definition is inadequate: it fails to allow for lightweight components (those for which a function call is too inefficient or semantically inappropriate), and it fails to allow for generative components (those in which the component embodies a method of constructing code rather than actual code). We argue that both can be solved by proper use of existing language technologies, by using a higher-order meta-language to compositionally manipulate values of type Code, syntactic fragments of some object language. By defining a client as a function from a component to Code, components can be defined at a very general level without much notational overhead. In this paper, we illustrate this idea entirely at the source-code level, taking Code to be string. Operating at this level is particularly simple, and is useful when the source code is not proprietary. In a companion paper, we define Code as a set of values containing machine-language code (as well as some additional structure), allowing components to be delivered in binary form.
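The idea of a client as a function from a component to Code, with Code taken to be a string, can be sketched roughly as follows. Python stands in here for the paper's higher-order meta-language, and every name (`inline_max`, `call_max`, `client`) is hypothetical rather than taken from the paper.

```python
from typing import Callable

Code = str  # source-level view: a code fragment is just a string

# In this sketch a component is a function that builds Code from argument Code.
Component = Callable[[Code, Code], Code]
# A client is a function from a component to Code.
Client = Callable[[Component], Code]

def inline_max(a: Code, b: Code) -> Code:
    """Lightweight component: expands to an inline expression, no call."""
    return f"(({a}) if ({a}) > ({b}) else ({b}))"

def call_max(a: Code, b: Code) -> Code:
    """Conventional component: emits a call to a library function."""
    return f"max({a}, {b})"

def client(maximum: Component) -> Code:
    """Client code, parameterized over whichever component it is given."""
    return f"result = {maximum('x + 1', 'y')}"

program = client(inline_max)   # compose client and component into source text
namespace = {"x": 3, "y": 10}
exec(program, namespace)       # run the generated source
print(program, "->", namespace["result"])
```

Swapping `inline_max` for `call_max` changes the generated source without touching the client, which is the kind of flexibility the abstract attributes to lightweight and generative components.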

2025, Adapting International Standard for Asian Language Technologies

徳永健伸, Dain Kaplan, Chu-Ren Huang, Hsieh Shu-Kai, Calzolari Nicoletta, Monachini Monica, Soria Claudia, 白井清昭, Sornlertlamvanich Virach, Charoenporn Thatsanee, Xia YingJu. ... Takenobu Tokunaga, Dain Kaplan, Chu-Ren Huang, Shu-Kai Hsieh, Nicoletta Calzolari, Monica Monachini, Claudia Soria, Kiyoaki Shirai, Virach Sornlertlamvanich, Thatsanee Charoenporn, YingJu Xia. ... ©2007 Tokyo Institute of Technology All rights reserved.

2025

When LTC started in Poznań, Poland, as LT Awareness Days (1995), Polish was still a "less-resourced language". This report presents the LRL Workshop Series, organized since 2009 as an integral part of LTC. We present the raison d'être of LRL as contributing to a "roadmap towards supplying LR and LT for all languages". We go through all LRL workshops one by one (up to 2019), presenting the themes suggested by the organizers, the countries of the authors' affiliations, and the languages concerned. We note positive developments such as the appearance of countries and languages that have so far been very rare at international LT conferences.

2025, Lecture Notes in Computer Science

2025

It is widely recognized that good teaching includes instructor-student feedback, and in online courses, feedback takes a variety of forms, including both synchronous and asynchronous interactions. To better understand the types and frequency of instructor-student feedback interactions, this case study used document analysis to examine feedback in an online course over a full semester. Feedback interactions were coded as either individual or team feedback and then as corrective, motivational, or technology-related. Of 1,744 recorded instructor-student feedback interactions, corrective feedback accounted for nearly 70% (given more often to teams than individuals); motivational feedback for 20% (given more often to individuals than teams); and technology feedback for 10% (given more often to individuals than teams). Additionally, feedback differed over the duration of the semester, with motivational feedback being greatest at the beginning of the term. An examination of individual versus team differences revealed that teams tended to receive a greater amount of corrective feedback, whereas individuals required greater motivational feedback. One implication of the study is that instructors may not be conscious of the proportions of corrective versus motivational feedback they give to their online students. Instructors are also encouraged to take measures to reduce the burden of technology feedback required of them, since students will constantly demand such non-pedagogical assistance.

2025

The NLP research group at the University of Szeged took part in the development of the Hungarian WordNet between 2005 and 2007. In 2008, they developed a smaller, domain-specific WordNet on customs law. This knowledge base contains about 650 concepts carefully selected by legal experts from the relevant Hungarian statutory legal texts, above all from two acts and from other laws and decrees. The resulting hierarchical network of concepts is used in an information retrieval system for quick access to documents that ...

2025

Current trends in language technology require treebanks that do not stop at the level of constituent structure, but include deeper and richer levels of analysis, including appropriate meaning structures. Capturing sufficient detail at...

2025, AfLaT 2010

A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging. Guy De Pauw (1, 2), Naomi Maajabu (1), Peter Waiganjo Wagacha (2). (1) CLiPS-Computational Linguistics ... Luo Useweyo/useweyo Chike/chike Nyasaye/nyasaye mi/mi koro/koro...

2025, Annual Meeting of the Special Interest Group on Discourse and Dialogue

This paper introduces a new dialogue management framework for goal-directed conversations. A declarative specification defines the domain-specific elements and guides the dialogue manager, which communicates with the knowledge sources to complete the specified goal. The user is viewed as another knowledge source. The dialogue manager finds the next action by a mixture of rule-based reasoning and a simple statistical model. Implementation in the flight-reservation domain demonstrates that the framework enables the developer to easily build a conversational dialogue system.
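The combination of a declarative specification, rule-based reasoning and a simple statistical model might look roughly like the sketch below. It is not the framework described in the paper: the specification format, the slot names, the probabilities and the `next_action` function are all invented for illustration.

```python
# Hypothetical declarative specification for a flight-reservation goal.
SPEC = {
    "goal": "book_flight",
    "slots": ["origin", "destination", "date"],
}

# Toy statistical model: assumed probability that asking for a slot succeeds.
ASK_SUCCESS = {"origin": 0.9, "destination": 0.8, "date": 0.6}

def next_action(spec, state):
    """Pick the next action: rules narrow the options, statistics rank them."""
    missing = [slot for slot in spec["slots"] if slot not in state]
    # Rule: once every slot is filled, complete the specified goal.
    if not missing:
        return ("execute", spec["goal"])
    # Statistical step: among the permitted questions, ask the one most
    # likely to succeed. The user is treated as a knowledge source to query.
    best = max(missing, key=lambda slot: ASK_SUCCESS.get(slot, 0.0))
    return ("ask_user", best)

state = {"origin": "BOS"}
print(next_action(SPEC, state))
```

Keeping the domain knowledge in `SPEC` means that, in this sketch, moving to another domain only requires a new specification and new statistics, not a new manager.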

2025, Proceedings of the International Summer School of Bilingualism and Multilingualism (ISSBM2022)

This study analyzes the features of contemporary Russian netspeak through linguistic corpora. The analysis focused on the different transliteration processes involving more than 300 loanwords, as well as the investigation of different derivational and inflectional morphemes attached to English transliterated loanwords and roots. This research aimed to establish whether a diachronic corpus-based analysis of Russian netspeak can help us to forecast which standard form will be established over time, making predictions on future developments of the language. An incredibly articulated and multifaceted picture emerges, which shows all the complexity of a new and constantly evolving language, in which foreign loanwords struggle to adapt and integrate into the target language, giving rise to numerous variants of the same word.
Link to the book: https://books.ung.si/issbm2022/.
ISBN: 978-961-7025-30-9

2025, DHNB2022 Conference Proceedings

This study presents the results of an Aanaar Saami pilot project in the Saami Culture Archive, University of Oulu. The project has established a set of conventions to transcribe and annotate Aanaar Saami recordings in the archive's collection and created a mechanism through which grammatically annotated but anonymous versions can be imported to the Korp search interface in the Language Bank of Finland. The practices include wide use of Saami language technology, the use of Finnish computational research infrastructure, and they can be extended later to other Saami languages in the archive.

2025

This paper investigates the usability of ultra-wideband terahertz radar for the inspection and imaging of objects of interest to the aeronautics industry. The frequency-modulated continuous-wave (FMCW) radar principle and systems are detailed, along with their benefits and limitations, which depend on the architecture and characteristics of the system as well as on the materials under inspection. Promising results and advances on the problem of seeing through airplane coverings are also demonstrated through measurements performed with our imaging systems, showing the suitability of FMCW radars as a new tool for Non-Destructive Testing (NDT) applications in the aeronautics industry.
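For readers unfamiliar with the FMCW principle, the standard textbook relations for a linear chirp can be sketched as follows: the target range is R = c · f_b · T / (2B) and the theoretical range resolution is ΔR = c / (2B), where f_b is the beat frequency, T the sweep duration and B the swept bandwidth. The numerical values below are hypothetical and do not describe the authors' system.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def fmcw_range(beat_hz, bandwidth_hz, sweep_s):
    """Target range for a linear chirp: R = c * f_b * T / (2 * B)."""
    return C * beat_hz * sweep_s / (2.0 * bandwidth_hz)

def range_resolution(bandwidth_hz):
    """Theoretical free-space range resolution: dR = c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)

# Hypothetical system: 100 GHz sweep in 1 ms, 200 kHz measured beat frequency.
print(fmcw_range(200e3, 100e9, 1e-3))  # range in metres
print(range_resolution(100e9))         # resolution in metres
```

Both relations assume propagation in free space; inside a material under inspection the wave travels more slowly, so distances scale with the material's refractive index.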

2025

It should also be borne in mind that languages make use of different sets of word classes. Latin, for example, has no articles. This course adopts the standard VISL system for English: 11 word classes. These can be conveniently subclassified into three groups: ... (John M. Dienhart) ... mand/maend/maendene, gås/gaes/gaessene. Note that the gender of the noun is irrelevant in the formation of definite constructions when the plural form of the noun is involved. VISL's colon notation for the articles (whether definite or indefinite) is consistently D:art.

2025

Language is not like the ear, which never grows: as civilization and development come to the world, language grows with them. As scholars have said, human language is an essential pillar for explaining and understanding events. Language is also one of the essential tools of a speech community. Given where the world stands today with artificial intelligence (AI), African languages have not yet reached their potential in technology, and Yorùbá is one of these African languages. For Yorùbá to carry weight in technology, data (Yorùbá words and sentences) are essential for training technology by means of natural language processing. As the proverb says, "láì sí tàgìírì àwọ ò hu": without data, there is no way for technology to understand the Yorùbá language.
This research is an assessment of the use of Yorùbá data for technological work, with natural language processing (NLP) as its focus. Chapter one discusses what motivated the research and its scope. The same chapter describes the methodology we adopted. Chapter two reviews previous work, the technology that exists for the Yorùbá language, and the spread of technology in Yorùbá society. Chapters three and four form the core of the research. Chapter three discusses what data is, how data can be collected, and the role that the use of collected data plays in the development of the Yorùbá language. Chapter four discusses what can be done in technology with the collected data using natural language processing techniques.
Chapter five concludes the work: there we describe our experience during the research and our observations on using Yorùbá language data for natural language processing projects. Finally, we consider the ways in which natural language processing technology can support the development of the Yorùbá language worldwide, and the importance or usefulness of this research to Yorùbá society.

2025

Natural Language Processing (NLP) is an interdisciplinary research area aimed at developing computer programs capable of human-like activities related to understanding or producing texts or speech in a natural language, such as English. Natural language processing has been in existence for more than fifty years. During this time, it has significantly contributed to the field of human-computer interaction in terms of theoretical results and practical applications. As computers continue to become more affordable and accessible, the importance of user interfaces that are effective, robust, unobtrusive, and user-friendly regardless of user expertise or impediment becomes more pronounced. Since natural language usually provides for effortless and effective communication in human-human interaction, its significance and potential in human-computer interaction should not be overlooked: whether spoken or typewritten, it may effectively complement other available modalities, such as windows, icons, menus, and pointing; in some cases, such as for users with disabilities, natural language may even be the only applicable modality. In this paper, we examine the field of natural language processing as it relates to human-computer interaction by focusing on its history, its interactive application areas, and how natural language programming contributes to natural language processing.

2025, CALL for all Languages - EUROCALL 2023 Short Papers

Current theoretical advances in applied linguistics have not yet found wide practical application in the field of language revitalization. In this paper, plans for an open source application for desktop computers and mobile devices for Indigenous language learning settings will be outlined. The app consists of building blocks inspired by cognitive linguistics and task-based language learning. Members of Indigenous language communities can use these to create exercises and assessment modules for their respective languages. In the paper, a mock-up with model exercises will be showcased to illustrate how certain aspects of the afore-mentioned theories can be applied. For example, vocabulary tasks are informed by insights from the analysis of collocations, connotations, frames, metaphors, prototypicality, and semantic relations.

2025, Language and Technology Conference

2025

2025, Language Resources and Evaluation

A speech database has been collected to highlight the importance of the "speaker factor" in forensic voice comparison. FABIOLE was created during the FABIOLE project, funded by the French Research Agency (ANR) from 2013 to 2016. The corpus consists of more than 3,000 excerpts spoken by 130 native French male speakers. The speakers are divided into two categories: 30 target speakers, each with 100 excerpts, and 100 "impostors", each with only one excerpt. The data were collected from 10 different French radio and television shows; each speech turn has a minimum duration of 30 s and good speech quality. The data set is mainly used for investigating the speaker factor in forensic voice comparison and for interpreting unsolved issues such as the relationship between speaker characteristics and system behavior. In this paper, we present the FABIOLE database. Preliminary experiments are then performed to evaluate the effect of the "speaker factor" and of the show on the behavior of a voice comparison system.

2025

Classical intensional semantic frameworks, like Montague's Intensional Logic (IL), identify intensional identity with logical equivalence. This criterion of cointensionality is excessively coarse-grained, and it gives rise to several well-known difficulties. Theories of fine-grained intensionality have been proposed to avoid this problem. Several of these provide a formal solution to the problem, but they do not ground this solution in a substantive account of intensional difference. Applying the distinction between operational and denotational meaning, developed for the semantics of programming languages, to the interpretation of natural language expressions offers the basis for such an account. It permits us to escape some of the complications generated by the traditional modal characterization of intensions.

2025, Informology

In the field of Library and Information Science, the accurate representation and retrieval of information are of utmost importance. Information representation and indexing are critical processes that facilitate efficient access to and use of knowledge. However, these processes are not without challenges. One significant issue is “semantic noise”, a phenomenon that can distort the meaning of information and hinder effective communication between information retrieval (IR) systems and users. This study explores the concept of semantic noise, its causes, and its implications for information representation and indexing. It is primarily theoretical in nature, focusing on the conceptual exploration of semantic noise and its impact on information representation and retrieval. The results highlight that semantic noise, caused by irrelevant, ambiguous, or conflicting elements in information representation and indexing, significantly disrupts the clarity and accuracy of information retrieval. Key causes include ambiguity in language and representation, varying contexts, inconsistent terminology, and cultural or linguistic barriers, which collectively introduce complexity and hinder effective communication between IR systems and users. Semantic noise reduces retrieval accuracy, leads to inefficient query processing, and poses challenges for natural language processing (NLP) systems, often resulting in misinterpretations, user frustration, and diminished trust in IR systems.
Addressing and mitigating semantic noise requires advanced techniques in natural language understanding, such as contextual analysis, semantic search, semantic modeling, and machine learning. These techniques help IR systems bridge the gap between user intent and stored data. These findings underscore the critical need for precision in language, standardized terminology, and context-aware approaches to minimize semantic noise and enhance the reliability of information representation and retrieval.

2025

The core group of the project initially consisted of Even Hovdhaugen (Norway), Carol Henriksen (Denmark), Bengt Sigurd (Sweden), and Kalevi Wiik (Finland). In 1994, Kalevi Wiik had to leave the project due to other commitments and was replaced by Fred Karlsson as the Finnish representative. Kjell Paulsen, Oslo, served as secretary of the project throughout the project period and also for one year as research assistant. The first thorough version of our manuscript was completed in late 1996. The magnitude of our task is aptly illustrated by the fact that we needed three more years of diligent work on top of the originally scheduled project period in order to properly finish the manuscript. The book has been written jointly by the undersigned core group of the project, but we have been very dependent on the help and research of a number of Nordic linguists. First of all, we would like to thank Kjartan Ottósson, who wrote the draft of most of the contributions on Iceland and Icelandic in chapters three, four, and five. Secondly, we thank the participants at the conference we arranged in Oslo in 1994 on the history of linguistics in the Nordic countries. The papers presented at this conference (Henriksen et al., eds. 1996) have been a valuable source in writing this book. Last, but not least, we would like to thank all the linguists who have been willing to answer our many curious questions or helped us find obscure references and forgotten material. The entire manuscript, with the exception of the brief concluding chapter seven, has been read and commented on in detail by nine prominent Nordic linguists, two each from Denmark, Finland, Norway, and Sweden, and one from Iceland. We express our deep indebtedness to Nils Erik Enkvist (Academician in the Academy of Finland, Helsingfors), Frans Gregersen (Professor of Danish Language,

2025, The Routledge Handbook of Emotions in the Ancient Near East

2025, The Expression of Emotions in Ancient Egypt and Mesopotamia

The study of emotions in ancient Mesopotamia is still very much in its infancy. There has been some scholarly interest regarding Mesopotamian emotions, but before this volume much of it has been conducted from a medical or psychological perspective.[1] Understanding emotional practices across such a great geographical, cultural and temporal range is fraught with complications. An important topic that has been hardly touched in Mesopotamian studies yet is the concept of "emotion" itself. The concept of "emotion" is not universal, and in Akkadian there is no term that could serve as a translation of it or other such categories (e.g., "affect," "feeling"). From a comparative perspective, it has been demonstrated that in the Hebrew Bible, lexemes expressing explicit emotions are scarce and other linguistic expressions, such as literary devices, bodily sensations, certain actions or clusters of actions, are used to indicate emotions. Additionally, emotional responses often refer to social situations, suggesting that the social dimension of emotions was more important than the internal feelings of an individual.[2] All these observations should be examined in Akka-

* Work on this article was jointly conducted by all the authors, but for the most part the division of work was as follows: Svärd wrote Sections 1, 2 and 6, Jauhiainen 3 and 4.2, Sahala 4.1, and Alstola, Jauhiainen and Svärd 4.3-4. Svärd and Alstola analyzed the results and wrote Section 5. Jauhiainen processed the dataset. Sahala designed the weighting algorithm for PMI and wrote the tool for calculating the PMI scores. Alstola coordinated the design for the workflow (Section 4.4). Lindén directed the language technological work and was crucial for the success of this cross-disciplinary project. A technical note: in this article we follow the volume convention of using ḫ instead of h, but in our data we use the regular h. The Akkadian spellings of words and their translations follow the lemmas in Oracc.
[1] For a more comprehensive overview of the study of emotions and Mesopotamia, see the Introduction in this volume.
[2] Mirguet 2016.
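The PMI scores mentioned in the contribution note refer to pointwise mutual information, which in its plain form is log2(p(x, y) / (p(x) · p(y))). The sketch below computes that unweighted form over co-occurrence windows. The authors' weighting algorithm is not described in this excerpt and is not reproduced here; the function name, the window size and the placeholder tokens are assumptions.

```python
import math
from collections import Counter

def pmi_scores(sentences, window=2):
    """Plain PMI, log2(p(x, y) / (p(x) * p(y))), over co-occurrence windows."""
    word_counts, pair_counts = Counter(), Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, w in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for v in tokens[i + 1 : i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    return {
        (x, y): math.log2(
            (c / n_pairs) / ((word_counts[x] / n_words) * (word_counts[y] / n_words))
        )
        for (x, y), c in pair_counts.items()
    }

# Placeholder tokens stand in for lemmas; this is not real corpus data.
scores = pmi_scores([["a", "b"], ["a", "b"], ["c", "d"]])
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Note that the rarer pair gets the higher score in this toy input, a known tendency of unweighted PMI and one common motivation for adding a weighting scheme.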

2025, 2012 Proceedings of Picmet 12 Technology Management For Emerging Technologies

The potential applicability of established new product development processes to information and communications technology (ICT)-for-development projects is investigated. The demand for ICT solutions to serve numerous societal information needs in developing regions of the world is increasing rapidly. A number of methods and practices have been used by organizations to develop and deliver such ICT solutions, but a need exists to formalize product development processes for use in the ICT-for-development context. Existing literature on product development in the ICT-for-development context is explored to derive a theoretical model that may be suitable for addressing product development process problems encountered in such projects. An ICT-for-development project to disseminate government information to rural communities with limited literacy is evaluated against the derived theoretical model in a case study. The project was carried out by the CSIR (Council for Scientific and Industrial Research), a government agency in South Africa. The presence and positive effect of certain established product development practices is identified, while the absence or unsatisfactory execution of other established practices are assessed for their contribution to decreased levels of product success.

2025

We present a dependency annotation scheme for Finnish which aims at respecting the multilayered nature of language. We first tackle the annotation of surface-syntactic structures (SSyntS) as inspired by the Meaning-Text framework. Exclusively syntactic criteria are used when defining the tagset of surface-syntactic relations. Our annotation scheme allows for a direct mapping between surface syntax and a more semantics-oriented representation, in particular predicate-argument structures. It has been applied to a corpus of Finnish composed of 2,025 sentences related to weather conditions.

2025, Language & Technology ed., Zygmunt Ventulani

We describe our approach to constructing a phoneme set for polyglot speech synthesis. In polyglot speech synthesis, resources are shared across languages. The goal of this research is to develop a global phoneme set using existing resources, and MBROLA was selected for this purpose. MBROLA contains 72 diphone databases for different languages, each with its own set of phonemes. We selected 31 of these 72 language databases. By reusing existing resources, we can assemble a global phoneme set faster and with wider language coverage. The set can therefore be used for languages with limited linguistic expertise or limited linguistic resources. Our approach includes extracting the phonemes of these languages, clustering them, eliminating and substituting inaccurate phonemes, and finally evaluating the resulting list of phonemes. At the end of this study, we arrive at one complete list forming a global phoneme set. This list can be used as a substitute for unavailable phonemes in future polyglot TTS systems. It is also suitable as a default phoneme set for new languages whose phoneme sets are not yet defined.

2025, Machine Translation

Arabic is rich in morphology and syntax. It is normally written with optional diacritics and without the notion of capitalization. These characteristics make dealing with Arabic a challenge for both learners and researchers. My experience in the Arabic natural language processing (ANLP) research area allows me to say that the key to fostering research or developing an application in the ANLP field lies in gaining insight into the standard layer-based structure of linguistic phenomena (phonology, morphology, syntax and semantics) as well as in recognizing the interaction between them. Until now, no introductory resource was available that could fulfill these requirements. For example, a simple Google search for the term "Arabic Natural Language Processing" returns research groups, research papers, tutorials and presentations, companies, and scholars. Consequently, any beginner in the ANLP field, whether a scientist, linguist, developer, or student, had to spend considerable effort going through different material which might be either irrelevant or too advanced. Therefore, the purpose and the significance of this book are clear from where it stands. Also, it is adequately classified by the publisher as belonging to the "human language technologies" series. This book gives a sufficiently solid introductory background on ANLP. This makes it the first of its kind. The author has a broad background in computational linguistics, in general, and ANLP, in particular. He also has a marvelous research track record and has been very active in serving the research communities. The book is clear about its intended audience. It is most suitable for anyone who would like to get a fundamental background in ANLP for research, study, or development purposes. It is very well written. This book amazingly takes you gradually from the ground

2025

This paper presents a novel approach to dictionary retrieval. The approach is based on a very efficient and scalable theoretical structure called Multi-Terminal Multi-valued Decision Diagrams (MTMDD). Such a tool allows the definition of very large, even multilingual, dictionaries without a significant increase in memory demands, and with virtually no additional processing cost. Besides the general idea of the approach, this paper describes the technologies involved and their implementation in a software package called WAGGER. Finally, we present some examples of usage and possible applications of this dictionary retriever.

2025

This article documents the increasing use of the English curse word fuck worldwide, as well as its degree of adaptation into the host language, its syntactic function, its meaning, and its strength as a taboo. Comparing the use of fuck, with a special focus on the Nordic countries (Norway, Denmark, and Iceland), with its use in Eurasia and Africa (with different scripts, namely Cyrillic in Russia, Devanāgarī in India and Ge’ez script in Ethiopia), we found some similar developmental patterns, but also differences, for example in the degree to which the English loan word has replaced local curses and in the ways it is used among social groups within a country. Comparing the terms used for the same concept was challenging because some countries have better text corpora and more research on written languages and especially on taboos, and those without such resources required additional minor investigations for a baseline. Findings revealed that fuck has spread worldwide from English, and it is commonly u...

2025

The Norwegian language falls into two main variants: bokmål and nynorsk. The majority variant is bokmål, used by over 90% of the population. Historically, bokmål in turn falls into several sub-variants, but now the two main sub-variants, riksmål and bokmål proper, are practically united in one common norm. This norm is being documented in the national dictionary project bearing the symbolically significant name BRO ('bridge'). The article presents the background for the BRO collaboration, and sketches a concrete and feasible plan for the lexicographical documentation of the common norm. A challenge lies in the choice of lemma sign form and the presentation of bokmål's wide variety of optional forms, where style nuances also play a role. The same applies to the choice of examples and collocations and other multi-word lemmas. Both challenges arise from the need for freedom of expression within the norm, which is typical of Norwegians' preference to mark identity through

2025, Acta Linguistica Lithuanica

The article discusses the state of Lithuanian language technologies and introduces the situation of European language equality in the digital environment. Quantitative and qualitative indicators of language equality are examined in the context of the European Union, taking into account the number of speakers, of digital language resources and of technologies, as well as the support provided to them, with particular attention to an analysis of the Lithuanian case. The article aims to highlight the work done so far in the field of language technologies, to identify gaps, and to reveal the challenges faced and addressed by Lithuanian, an official national and European Union language. It presents an up-to-date overview of the state of Lithuanian language technologies, analysing digital language resources and tools/services. The results show that considerable changes have taken place over the past 10 years, but resources are still lacking in education and other areas. KEYWORDS: European languages, language resources, language tools/services, language equality, language technologies.

2025

This paper explores two different methods of learning dialectal morphology from a small parallel corpus of standard and dialect-form text, given that a computational description of the standard morphology is available. The goal is to produce a model that translates individual lexical dialectal items to their standard dialect counterparts in order to facilitate dialectal use of available NLP tools that only assume standard-form input. The results show that a learning method based on inductive logic programming quickly converges to the correct model with respect to many phonological and morphological differences that are regular in nature.

2025, Proc. of Papillon2004, …

The project "Lexica and Corpora for Speech-to-Speech Translation Components" (LC-STAR) aims to develop lexica for automatic speech recognition and text-to-speech synthesis for thirteen languages, and multilingual corpora for speech-centered translation applications for nine languages. The project is led by a consortium comprising two universities and several industrial companies. All resources to be developed are encoded using the Extensible Markup Language (XML). This paper describes XML-related issues in the LC-STAR project from three different perspectives: the XML encoding of the lexica, the XML encoding of the multilingual corpora, and issues regarding the validation of XML encodings such as that of the LC-STAR lexica.

2025, Sign Language Studies

2025

The study examines how native Pashto ESL learners place stress on identical lexemes within English sentences, focusing on pitch, duration, and intensity. The objectives of this study are to analyze the acoustic properties of stressed syllables, to investigate the influence of Pashto on English stress production, and to improve the understanding of prosodic stress to aid Pashto ESL learners. An experimental and quantitative approach was used, involving (20 × 3 × 7 = 420) from (N = 20) Pashto ESL speakers reading pre-selected words in carrier phrases. The voice sample data was collected from the cloud OneDrive corpora datasets (Abbasi, Abbasi, SRSP-Pak-Eng-43, 2023; SHEC unpublished raw datasets, Sindh Madressatul Islam University, 2023-SHEC-SRSP-PaK-Eng-43) of Pashto speakers, using a similar method based on the disyllabic stimuli. Seven pairs of disyllabic words were selected as stimuli following the framework of Beckman & Pierrehumbert (Beckman and

2025, IJASS JOURNAL

Linguistic sustainability delves into both the theoretical frameworks and practical methodologies essential for the revitalization of languages threatened by the pervasive forces of globalization and cultural homogenization. With predictions indicating that nearly half of the 6,000 languages spoken today will face extinction by the end of this century, this article underlines the critical importance of safeguarding linguistic diversity. Our research scrutinizes major factors contributing to language endangerment, such as socio-economic pressures, cultural assimilation, and the lack of institutional support. The principal aim of our study is to identify powerful strategies for language revitalization and to illustrate how Indonesia, a country boasting over 700 languages, employs these strategies to sustain its linguistic heritage. Adopting a qualitative approach anchored in a comprehensive literature review, we endeavor to delineate the predominant factors precipitating language endangerment and extinction, while simultaneously identifying robust revitalization strategies. The article features an intricate case study of Indonesia, elucidating the nation’s multifaceted approach to language preservation. This includes the integration of regional languages into educational curricula, the promotion of linguistic diversity across various platforms, the digitization of linguistic resources, and the establishment of legal protections. Our findings demonstrate that Indonesia’s diverse and nuanced strategies are effective in fostering linguistic diversity, providing a valuable model for other multilingual societies globally. Ultimately, the preservation of linguistic diversity requires a concerted and collaborative effort from governments, educational institutions, communities, and individuals alike.

2024

Developing artificial intelligence systems for African languages presents a critical challenge in contemporary computational linguistics and natural language processing. This research addresses data scarcity in African language AI...