Human Language Technology Research Papers
Signs are everywhere in our lives. They make our lives easier when we are familiar with them, but sometimes they also pose problems. For example, a tourist might not be able to understand signs in a foreign country. In this paper, we present our efforts towards automatic sign translation. We discuss methods for automatic sign detection and describe sign translation using example-based machine translation technology. We take a user-centered approach in developing an automatic sign translation system: the approach takes advantage of human intelligence in selecting an area of interest and a domain for translation where needed. A user can determine which sign is to be translated if multiple signs have been detected within the image. The selected part of the image is then processed, recognized, and translated. We have developed a prototype system that can recognize Chinese signs captured by a video camera, a device commonly carried by tourists, and translate them into English text or a voice stream.
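The abstract above describes a pipeline in which detection proposes candidate signs, the user selects one, and the selected region is recognized, translated and spoken. A minimal sketch of that flow is given below; all component functions are hypothetical placeholders standing in for the actual detector, OCR, example-based MT and speech synthesis modules, not the authors' implementation.

```python
# Hypothetical sketch of the user-centered sign translation pipeline described above.
# Every component passed in is a placeholder for the real module (detector, OCR,
# example-based MT, TTS); only the control flow is illustrated here.
def translate_sign(frame, detect_signs, choose_region, recognize_text,
                   translate_example_based, synthesize_speech):
    regions = detect_signs(frame)                  # automatic sign detection in the camera frame
    region = choose_region(regions)                # the user picks the sign of interest
    chinese_text = recognize_text(frame, region)   # recognition on the selected area only
    english_text = translate_example_based(chinese_text)
    return english_text, synthesize_speech(english_text)
```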
This article talks about how advances in human language technology can help overcome some of the barriers that prevent community participation in cyberspace. Human language technology refers to the set of technologies, such as speech recognition and speech synthesis, that are used to create spoken language systems: systems that allow people to communicate with machines using speech. A significant advantage of using speech as an interface
The NEMLAR project (Network for Euro-Mediterranean LAnguage Resource and human language technology development and support; www.nemlar.org) is a project supported by the EC, with partners from Europe and the Middle East, whose objective is to build a network of specialized partners to promote and support the development of Arabic Language Resources (LRs) in the Mediterranean region. The project focused on identifying the state of the art of LRs in the region, assessing priority requirements through consultations with language industry and communication players, and establishing a protocol for developing and identifying a Basic Language Resource Kit (BLARK) for Arabic. The BLARK is defined as the minimal set of language resources necessary for any pre-competitive research and education, in addition to the development of crucial components for any future NLP industry. Following the identification of high-priority resources, the NEMLAR partners agreed to focus on and produce three main resources: 1) an annotated Arabic written corpus of about 500K words, 2) an Arabic speech corpus for TTS applications of 2x5 hours, and 3) an Arabic broadcast news speech corpus of 40 hours of Modern Standard Arabic. For each resource, the underlying linguistic models and assumptions of the corpus, technical specifications, methodologies for collecting and building the resources, and validation and verification mechanisms were established and applied.
This chapter has two goals. The first goal is to compare Machine Learning (ML) and Knowledge Discovery in Data (KDD, also often called Data Mining, DM), emphasizing how much they actually differ. In order to make my ideas somewhat easier to understand, and as an illustration, I will include a description of several research topics that I find relevant to KDD and to KDD only. The second goal is to show that the definition I give of KDD can be almost directly applied to text analysis, which will lead us to a very restrictive definition of Knowledge Discovery in Texts (KDT). I will provide a compelling example of a real-life set of rules obtained by what I call KDT techniques.
Data Mining has become a buzzword in industry in recent years. It is something that everyone is talking about but few seem to understand. There are two reasons for this lack of understanding. The first is that Data Mining researchers have very diverse backgrounds, such as machine learning, psychology and statistics. This means that the research is often based on different methodologies, and conventions such as notation are often unique to a particular research area, which hampers the exchange of ideas and dissemination to the wider public. The second reason for the lack of understanding is that the main ideas behind Data Mining often run counter to mainstream statistics, and as many companies interested in Data Mining already employ statisticians, such a change of view can create opposition.
Word embeddings are real-valued word representations, trained on natural language corpora, that are able to capture lexical semantics. Models producing these representations have gained popularity in recent years, but the question of the most adequate evaluation method remains open. This paper presents an extensive overview of the field of word embedding evaluation, highlighting the main problems and proposing a typology of approaches to evaluation, summarizing 16 intrinsic methods and 12 extrinsic methods. I describe both widely used and experimental methods, systematize information about evaluation datasets, and discuss some key challenges.
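As an illustration of one of the intrinsic methods surveyed, the sketch below computes a word-similarity score: cosine similarities from an embedding model are correlated with human judgements using Spearman's rho. The `vectors` dictionary and the word-pair data are stand-ins for a real model and a benchmark such as WordSim-353; this is not code from the paper.

```python
# Minimal word-similarity evaluation: Spearman correlation between model cosine
# similarities and human similarity judgements. Out-of-vocabulary pairs are skipped.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_eval(vectors, pairs_with_gold):
    """vectors: word -> np.array; pairs_with_gold: iterable of (word1, word2, human_score)."""
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs_with_gold:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho
```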
Satire is an attractive subject in deception detection research: it is a type of deception that intentionally incorporates cues revealing its own deceptiveness. Whereas other types of fabrications aim to instill a false sense of truth in the reader, a successful satirical hoax must eventually be exposed as a jest. This paper provides a conceptual overview of satire and humor, elaborating and illustrating the unique features of satirical news, which mimics the format and style of journalistic reporting. Satirical news stories were carefully matched and examined in contrast with their legitimate news counterparts in 12 contemporary news topics in 4 domains (civics, science, business, and “soft” news). Building on previous work in satire detection, we proposed an SVM-based algorithm, enriched with 5 predictive features (Absurdity, Humor, Grammar, Negative Affect, and Punctuation) and tested their combinations on 360 news articles. Our best-performing feature combination (Absurdity, Grammar and Punctuation) detects satirical news with 90% precision and 84% recall (F-score = 87%). Our work in algorithmically identifying satirical news pieces can aid in minimizing the potential deceptive impact of satire. [Note: The associated dataset of the Satirical and Legitimate News, S-n-L News DB 2015-2016, is available via http://victoriarubin.fims.uwo.ca/news-verification/ . The set is password-protected to avoid automated harvesting. Please feel free to request the password, if you are interested.]
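To make the classification setup concrete, the sketch below shows a feature-based SVM with cross-validated precision, recall and F-score, in the spirit of the approach described above. It uses scikit-learn, and the feature extractor is a deliberately crude placeholder; the paper's Absurdity, Humor, Grammar, Negative Affect and Punctuation features are far richer, so this is an illustration rather than a reproduction.

```python
# Sketch of a feature-based SVM for satire detection with cross-validated evaluation.
# The two placeholder features only measure punctuation density; real features would
# cover absurdity, humor, grammar and affect as described in the abstract.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score

def extract_features(article_text):
    length = max(len(article_text), 1)
    return [article_text.count("!") / length, article_text.count("?") / length]

def evaluate(articles, labels):           # labels: 1 = satirical, 0 = legitimate
    X = np.array([extract_features(a) for a in articles])
    y = np.array(labels)
    predictions = cross_val_predict(SVC(kernel="linear"), X, y, cv=5)
    return (precision_score(y, predictions),
            recall_score(y, predictions),
            f1_score(y, predictions))
```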
This paper introduces VIP, an R&D project that explores the impact and feasibility of using Human Language Technology (HLT) and Natural Language Processing (NLP) for interpreting training, practice and research. The project aims to fill the gap in, and address the pressing need for, technology for interpreters, which is reported to be scarce. Although most interpreters are unaware of interpreting technologies or are reluctant to use them, some tools and resources are already available, mainly computer-assisted interpreting (CAI) tools. VIP is working on the development of technology and cutting-edge research with the potential to revolutionise the interpreting industry by lowering costs for interpreter training, fostering an online community which shares, generates and cultivates interpreting resources, and providing an efficient interpreter workbench tool (computer-assisted interpreting software). Available at: http://www.tradulex.com/varia/TC39-london2017.pdf
This study focusses on developing statistical POS taggers for Odia using two distinct algorithms: CRF (a probabilistic model) and SVM (a classifier). Approximately 400k tokens have been used to develop both taggers, with the training and testing data amounting to 236k and 123k tokens respectively. For annotating the whole ILCI corpus, the BIS annotation scheme has been adopted with some modifications. As far as the experimental setup is concerned, the same feature set has been selected to train both models. Evaluation has been conducted using precision and recall measures for CRF and known/unknown word accuracy for SVM. A comprehensive error analysis has been carried out to identify the types of errors committed by both taggers in common, based on which 5-fold manual error correction and a final evaluation have been conducted. After identifying and discussing the issues, different solutions have been proposed: formulation of linguistic rules, corpus-driven correction, word sense disambiguation, and application of external tools such as NER, WSD and a morph analyser. Finally, the taggers have been made available online using JSP and JST technology. Both taggers, CRF++ (94.39 and 88.87) and SVM (96.85 and 93.59), have outperformed the existing Odia POS taggers in terms of both reliability and accuracy. To ensure the quality of the output, an inter-annotator (IA) agreement study has been conducted.
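For readers unfamiliar with CRF-based tagging, the sketch below shows the general shape of such a tagger using the sklearn-crfsuite package (rather than the CRF++ toolkit used in the study); the feature template is a simplified illustration, not the one developed for Odia.

```python
# Minimal CRF POS-tagging sketch with sklearn-crfsuite: each token is described by a
# small feature dictionary and the model is evaluated with a flat (token-level) F1 score.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(sentence, i):
    word = sentence[i]
    return {
        "word": word,
        "suffix3": word[-3:],                                  # useful for inflecting languages
        "is_digit": word.isdigit(),
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }

def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]

def train_and_evaluate(train_sents, train_tags, test_sents, test_tags):
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(featurize(train_sents), train_tags)
    predicted = crf.predict(featurize(test_sents))
    return metrics.flat_f1_score(test_tags, predicted, average="weighted")
```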
For the past 39 years the international conference Translating and the Computer (TC) has been a unique forum for academics, users, developers and vendors of translation technology tools. It is a distinctive event where translators, interpreters, researchers and business people from translation companies, international organisations, universities and research labs, as well as freelance professionals, come together to exchange ideas and learn about the latest developments in translation technologies. Over the last two decades, various translation tools, such as the Translation Memory programs presented at previous TC conferences, have revolutionised the work of translators. Regrettably, the same cannot be said for the work of interpreters, who have yet to benefit from suitable language technology tools and resources to assist them in their work. Given this situation, at this year's 39th TC conference we have decided to put an emphasis on the new and emerging language technologies, tools and resources which can support the work of interpreters. The panel 'New Frontiers in Interpreting Technology' features leading experts and practitioners in interpreting, and the TC39 programme offers several talks presenting tools for interpreters. We firmly believe that the presentations and discussions on this topic will encourage the development of innovative tools which will revolutionise the work of interpreters in the near future, as has already been the case for translators. This year's conference also features stimulating talks on Translation Technology topics central to TC conferences, including but not limited to machine translation, post-editing, CAT tools and terminology. We are confident that the presentations and posters, panels and workshops will provide interesting user perspectives and opportunities, and will lead to inspiring discussions. We trust
This incisive paper aims to show that the usual contemporary theoretical assumptions about African tone languages in the fields of phonetics and phonology do not take sufficient account of the cognitive aspect of tone conception, tone production and tone perception.
It goes beyond the recent findings of autosegmental phonology and finite-state phonology to study the tonal phenomenon in the African linguistic context with new scientific paradigms focused on cognition.
This focus on the cognitive aspect can open new perspectives for the design of computational models in the field of automatic speech processing of African tone languages.
In the digital era, language barriers represent a major challenge preventing European citizens and businesses from fully benefiting from a truly integrated Europe. These barriers particularly affect the less educated and older population, as well as speakers of smaller and minority languages, thus creating a notable language divide. Language barriers have a profound effect on (1) cross-border public services, (2) fostering a common European identity, (3) workers’ mobility, and (4) cross-border e-commerce and trade, in the context of a Digital Single Market. The emergence of new technological approaches such as deep-learning neural networks, based on increased computational power and access to sizeable amounts of data, is making Human Language Technologies (HLT) a real solution to overcoming language barriers. However, several factors, such as market fragmentation, uncoordinated research and insufficient funding, are hindering the European HLT industry, while putting under-resourced languages in danger of digital extinction. Moreover, language technologies are not properly represented in the agenda of European policy-makers, although they are likely to be crucial for the construction of a fair and truly integrated European Union. Based on the analysis of the current state of affairs, the study argues for setting up a multidisciplinary large-scale coordinated initiative, the European Human Language Project (HLP). Within the HLP, eleven policies are proposed and assessed. These policies are grouped into: institutional policies, research policies, industry policies, market policies, and public service policies.
In recent years, there has been dramatic progress in both speech and language processing, in many cases leveraging some of the same underlying methods. This progress and the growing technical ties motivate efforts to combine speech and language technologies in spoken document processing applications. This paper outlines some of the issues involved, as well as the opportunities, presenting an overview of the special double session on this topic.
In this paper we analyse the role of Language Resources (LRs) and Language Technologies (LTs) in today's Human Language Technology field and speculate on some of the priorities for the coming years, from the particular perspective of the FLaReNet project, which has been asked to act as an observatory to assess the current status of the field of Language Resources and Technologies and to indicate priorities of action for the future.
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final result, for such applications as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section 1.8.
This paper describes SCANMail, a system that allows users to browse and search their voicemail messages by content through a GUI. Content-based navigation is realized through automatic speech recognition, information retrieval, information extraction and human-computer interaction technology. In addition to the browsing and querying functionalities, acoustics-based caller ID technology is used to propose caller names from existing caller acoustic models trained from user feedback. The GUI browser also provides a note-taking capability. In a user study comparing SCANMail to a regular voicemail interface, SCANMail performed better both on objective measures (time to and quality of solutions) and on subjective measures.
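The sketch below illustrates only the content-based search idea at the core of such a system: ASR transcripts of the messages are indexed with TF-IDF and ranked against a text query. It is a simplified stand-in, not SCANMail's retrieval component, which also involves information extraction and caller identification.

```python
# TF-IDF search over voicemail transcripts: a simplified illustration of content-based
# navigation. `transcripts` would be the ASR output, one string per message.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TranscriptIndex:
    def __init__(self, transcripts):
        self.vectorizer = TfidfVectorizer()
        self.matrix = self.vectorizer.fit_transform(transcripts)

    def search(self, query, top_k=5):
        query_vec = self.vectorizer.transform([query])
        scores = cosine_similarity(query_vec, self.matrix)[0]
        ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]                 # (message index, similarity score) pairs
```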
As part of the DARPA Spoken Language System program, we recently initiated an effort in spoken language understanding. A spoken language system addresses applications in which speech is used for interactive problem solving between a person and a computer. In these applications, not only must the system convert the speech signal into text, it must also understand the linguistic structure of a sentence in order to generate the correct response. This paper describes our early experience with the development of the MIT VOYAGER spoken language system.
This paper describes how Human Language Technologies and linguistic resources are used to support the construction of components of a knowledge organisation system. In particular we focus on methodologies and resources for building a corpus-based domain ontology and extracting relevant metadata information for text chunks from domain-specific corpora.
We argue that the detection of entailment and contradiction relations between texts is a minimal metric for the evaluation of text understanding systems. Intensionality, which is widespread in natural language, raises a number of detection issues that cannot be brushed aside. We describe a contexted clausal representation, derived from approaches in formal semantics, that permits an extended range of intensional entailments and contradictions to be tractably detected.
Language resources are typically defined and created for application in speech technology contexts, but the documentation of languages which are unlikely ever to be provided with enabling technologies nevertheless plays an important role in defining the heritage of a speech community and in providing basic insights into the language-oriented components of human cognition. This is particularly true of endangered languages. The present case study concerns the documentation of both the birth and the endangerment, within a rather short space of time, of a 'spirit language', Medefaidrin, created and used as a vehicular language by a religious community in South-Eastern Nigeria. The documentation shows phonological, orthographic, morphological, syntactic and textual typological features of Medefaidrin which indicate that typological properties of English were a model for the creation of the language, rather than typological properties of the enclaving language, Ibibio. The documentation is designed as part of the West African Language Archive (WALA), following OLAC metadata standards.
In this paper we report on a project that was launched by the Dutch Language Union (Nederlandse Taalunie) with the aim of strengthening the position of Dutch in language and speech technology (Human Language Technologies, HLT). In particular we report on the activities aimed at surveying and evaluating HLT resources to establish priorities for future developments.
This paper presents bRol, the first fully automatic system to be developed for the parsing of syntactic and semantic dependencies in Basque. The parser has been built according to the settings established for the CoNLL-2009 Shared Task (Hajič et al., 2009); therefore, bRol can be thought of as a standard parser with scores comparable to the ones reported in the shared task. A second-order graph-based MATE parser has been used as the syntactic dependency parser. The semantic model, on the other hand, uses the traditional four-stage SRL pipeline. The system has a labeled attachment score of 80.51%, a labeled semantic F1 of 75.10, and a labeled macro F1 of 77.80.
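For clarity, the labeled attachment score reported above is simply the proportion of tokens whose predicted head and dependency label both match the gold standard; a minimal sketch of that computation is shown below (the input format is an illustrative assumption).

```python
# Labeled attachment score (LAS): fraction of tokens with correct head AND correct label.
# Each sentence is assumed to be a list of (head_index, dependency_label) pairs per token.
def labeled_attachment_score(gold_sentences, predicted_sentences):
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
            total += 1
            if g_head == p_head and g_label == p_label:
                correct += 1
    return correct / total if total else 0.0
```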
We describe the use of the Wikitology knowledge base as a resource for a variety of applications with special focus on a cross-document entity coreference resolution task. This task involves recognizing when entities and relations mentioned in different documents refer to the same object or relation in the world. Wikitology is a knowledge base system constructed with material from Wikipedia, DBpedia and Freebase that includes both unstructured text and semi-structured information. Wikitology was used to define features ...
This paper reports results of the 1992 Evaluation of machine translation (MT) systems in the DARPA MT initiative and results of a Pre-test to the 1993 Evaluation. The DARPA initiative is unique in that the evaluated systems differ radically in languages translated, theoretical approach to system design, and intended end-user application. In the 1992 suite, a Comprehension Test compared the accuracy and interpretability of system and control outputs; a Quality Panel for each language pair judged the fidelity of translations from each source version. The 1993 suite evaluated adequacy and fluency and investigated three scoring methods.
In this paper we set out the case for how smart-glasses can be used to augment and improve live Simultaneous Interpreting (SI) of spoken languages. We do this by reviewing the relevant literature and identifying the current challenges faced by professional foreign-language interpreters, such as cognitive load, working memory constraints and session dynamics. Finally, we describe our experimental framework and the prototype smart-glasses based system we are building, which will act as a testbed for research into the use of augmented-reality smart-glasses as an aid to interpreting. The main contributions of this paper are the review of the state of the art in language interpreting technology and the smart-glasses experimental framework, which acts as an aid to Simultaneous Interpreting.
There are more than 6000 languages in the world, but only a small number possess the resources required for the implementation of Human Language Technologies (HLT). Thus, HLT is mostly concerned with languages for which large resources are available or which have suddenly become of interest because of the economic or political scene. By contrast, languages of developing countries or minorities have received less attention in past years. One way of reducing this "language divide" is to do more research on the portability of HLT for multilingual applications.
This paper presents the application of morpheme-based and factored language models in an Amharic speech recognition task. Since using morphemes in both acoustic and language models mostly results in performance degradation due to acoustic confusability, and since it is problematic to use factored language models in standard word decoders, we applied the models in a lattice rescoring framework. Lattices of the 100 best alternatives for each test sentence of the 5k development test set have been generated using a baseline speech recognizer that uses a word-based backoff bigram language model. The lattices have then been rescored with various morpheme-based and factored language models, and a slight improvement in word recognition accuracy has been observed.
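A simplified view of the rescoring step is sketched below: each hypothesis keeps its acoustic score, a stronger language model rescores the word sequence, and the list is re-ranked. The scoring interface and the interpolation weight are illustrative assumptions, not the exact setup of the experiments above.

```python
# N-best rescoring sketch: combine the acoustic score with a new LM score in the log
# domain and return the best-scoring hypothesis. `new_lm_score` stands in for a
# morpheme-based or factored language model.
def rescore_nbest(hypotheses, new_lm_score, lm_weight=0.7):
    """hypotheses: list of (word_sequence, acoustic_log_score) pairs for one utterance."""
    rescored = []
    for words, acoustic in hypotheses:
        total = acoustic + lm_weight * new_lm_score(words)
        rescored.append((total, words))
    rescored.sort(key=lambda pair: pair[0], reverse=True)   # highest combined score first
    return rescored[0][1]
```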
This paper presents our basic approach to creating Proposition Bank, which involves adding a layer of semantic annotation to the Penn English TreeBank. Without attempting to confirm or disconfirm any particular semantic theory, our goal is to provide consistent argument labeling that will facilitate the automatic extraction of relational data. An argument such as the window in John broke the
Projects in human language technologies pose challenges not only for programmers but also for grammarians. In a recent project to develop an automatic lemmatiser for Setswana, the problem arose as to what the lemma in Setswana should be, since no clear-cut definition exists in Bantu language grammars or lexicographic studies. This article aims to determine and discuss the term "lemma" in Setswana as it should be applied in automatic lemmatisation.
This paper describes a small, structured English corpus that is designed for translation into Less Commonly Taught Languages (LCTLs), and a set of re-usable tools for the creation of similar corpora. The corpus systematically explores meanings that are known to affect morphology or syntax in the world's languages. Each sentence is associated with a feature structure showing the elements of meaning that are represented in the sentence. The corpus is highly structured so that it can support machine learning with only a small amount of data. As part of the REFLEX program, the corpus will be translated into multiple LCTLs, resulting in parallel corpora that can be used for training of MT and other language technologies. Only the untranslated English corpus is described in this paper.
Lexical resources are key components for applications related to human language technology. Various models of lexical resources have been designed and implemented during the last twenty years, and the scientific community has now gained enough experience to design a common standard at an international level. This paper thus describes the ongoing activity within ISO/TC 37/SC 4 on LMF (Lexical Markup Framework).
A tension exists between the increasingly rich semantic models in knowledge management systems and the continuing prevalence of human language materials in large organisations. The process of tying semantic models and natural language together is referred to as semantic annotation, which may also be characterized as the dynamic creation of bidirectional relationships between ontologies and unstructured and semi-structured documents.
In this paper we describe some preliminary results of a qualitative evaluation of the question answering system HITIQA (High-Quality Interactive Question Answering), which has been developed over the last 2 years as an advanced research tool for information analysts. ...
The focus of this paper is a national project known as IPSOM, whose main goal is to improve access to digitally stored spoken books, used primarily by the visually impaired community, by providing tools for easily detecting and indexing units (words, sentences, topics). Simultaneously, the project also aims to broaden the usage of multimedia spoken books (for instance in didactic applications, etc.), by providing multimedia interfaces for access and
The availability of an abundance of knowledge sources has spurred a large amount of effort in the development and enhancement of Information Retrieval techniques. Users’ information needs are expressed in natural language and successful retrieval is very much dependent on the effective communication of the intended purpose.
Linguistic characteristics that cause semantic ambiguity and misinterpretation of queries, as well as additional factors such as lack of familiarity with the search environment, affect the users' ability to accurately represent their information needs, a phenomenon referred to as the intention gap. This gap directly affects the relevance of the returned search results, which may not be to the users' satisfaction, and is therefore a major issue impacting the effectiveness of information retrieval systems. Central to our discussion is the identification of the significant constituents that characterize the query intent and their enrichment through the addition of meaningful terms, phrases or even latent representations, either manually or automatically, to capture their intended meaning. Specifically, we discuss techniques to achieve this enrichment, and in particular those utilizing information gathered from statistical processing of term dependencies within a document corpus or from external knowledge sources such as ontologies. We lay down the anatomy of a generic query expansion framework and propose its module-based decomposition, covering topical issues from query processing, information retrieval, computational linguistics and ontology engineering. For each of the modules we review state-of-the-art solutions in the literature, categorized and analyzed in the light of the techniques used.
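As a concrete, deliberately small example of one expansion module of the kind decomposed above, the sketch below enriches query terms with synonyms drawn from an external knowledge source (WordNet via NLTK). Real frameworks would combine this with corpus statistics, ontologies or latent representations; the term splitting and synonym cap here are assumptions for illustration (and the WordNet data must be downloaded first with `nltk.download("wordnet")`).

```python
# Synonym-based query expansion using WordNet: each query term is kept and followed by
# up to `max_synonyms_per_term` WordNet lemma names that differ from the original term.
from nltk.corpus import wordnet

def expand_query(query, max_synonyms_per_term=2):
    expanded_terms = []
    for term in query.lower().split():
        expanded_terms.append(term)
        synonyms = []
        for synset in wordnet.synsets(term):
            for lemma in synset.lemma_names():
                lemma = lemma.replace("_", " ")
                if lemma != term and lemma not in synonyms:
                    synonyms.append(lemma)
        expanded_terms.extend(synonyms[:max_synonyms_per_term])
    return " ".join(expanded_terms)
```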
TermPedia is a human language technology (HLT) application for document enrichment that automatically provides definitions for technical terms (TTs). A technical term may hinder document comprehension if it is introduced without any definition or explanation. In some cases when a term is defined, the definition may contain additional technical terms that create a similar problem. This is why we investigated the possibility of providing contextually relevant information for a technical term by linking it to an encyclopedia. In this way, additional information relating to the technical terms is readily available, hopefully making documents more comprehensible.
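The linking idea can be illustrated with a very small sketch: known technical terms are matched in the text and wrapped with a link to an encyclopedia entry. The term dictionary and URL pattern below are assumptions for illustration only, not TermPedia's actual resources or output format.

```python
# Link technical terms to encyclopedia articles. `term_to_article` is assumed to be a
# non-empty dict mapping lowercase terms to article titles, e.g.
# {"hidden markov model": "Hidden_Markov_model"}.
import re

def link_technical_terms(text, term_to_article):
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in term_to_article) + r")\b",
        flags=re.IGNORECASE,
    )

    def replace(match):
        term = match.group(0)
        article = term_to_article[term.lower()]
        return f'<a href="https://en.wikipedia.org/wiki/{article}">{term}</a>'

    return pattern.sub(replace, text)
```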
We describe a new modality/negation (MN) annotation scheme, the creation of a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation), and a holder (an experiencer of modality). We describe how our MN lexicon was semi-automatically produced and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set.
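A greatly simplified, lexicon-driven sketch of the trigger/target idea is given below: trigger words are looked up in a small lexicon and the nearest following verb is taken as a crude stand-in for the target. The lexicon and the heuristic are illustrative assumptions; the taggers described above operate over richer lexical and syntactic structure and also identify holders.

```python
# Toy modality/negation tagging: find lexicon triggers and attach the next verb as target.
# The lexicon below is a tiny illustrative sample, not the published MN lexicon.
MN_LEXICON = {"must": "obligation", "should": "obligation",
              "might": "possibility", "not": "negation"}

def tag_modality_negation(tokens, pos_tags):
    """tokens and pos_tags are parallel lists; returns (trigger, mn_type, target) tuples."""
    annotations = []
    for i, token in enumerate(tokens):
        mn_type = MN_LEXICON.get(token.lower())
        if mn_type is None:
            continue
        target = next((tokens[j] for j in range(i + 1, len(tokens))
                       if pos_tags[j].startswith("VB")), None)
        annotations.append((token, mn_type, target))
    return annotations
```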