Aleš Horák - Profile on Academia.edu
Papers by Aleš Horák
Cognitive Studies, Dec 20, 2018
Challenging Online Propaganda and Disinformation in the 21st Century, 2021
One of the main aims of logical analysis of natural language expressions lies in the task of capturing the meaning structures independently of the selected “means of transport,” i.e. of the particular natural language used. Logical analysis should just offer a “bridge between language expressions.” In this paper, we show the preliminary results of automated bilingual logical analysis, namely the analysis of English and Czech sentences. The underlying logical formalism, the Transparent Intensional Logic (TIL), is a higher-order temporal logic designed to express the full meaning relations of natural language expressions. We present the details of the current development and preparation of the supportive lexicons for the AST (automated semantic analysis) tool when working with a new language, i.e. English. The AST provides an implementation of the Normal Translation Algorithm for TIL, aiming to offer a normative logical analysis of the input sentences. We show the simila...
Proceedings of the 13th International Conference on Agents and Artificial Intelligence, 2021
Although digital-born documents are generally prevalent nowadays, the exchange of business documents often consists of processing their scanned image form as a general human-readable format with a one-to-one correspondence to paper documents. Bulk processing of such scanned documents then requires human intervention to extract and enter the main document metadata. In this paper, we present the design and evaluation of a contract processing module in the OCRMiner system. The information extraction process combines layout properties with text analysis as input to rule-based extraction with confidence score propagation. The first results, evaluated on public Czech contract documents, reach an item extraction accuracy of almost 88%.
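The rule-based extraction with confidence score propagation mentioned in the abstract can be pictured with a small sketch. The following Python fragment is an illustrative assumption, not OCRMiner's actual API: a hypothetical Token structure and one hand-written rule that combines layout evidence with a text trigger and propagates the OCR confidence into the extracted item's score.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    page: int
    y: float          # vertical position on the page (0.0 = top)
    ocr_conf: float   # OCR confidence for this token, 0..1

def contract_number_rule(tokens: list[Token]) -> list[tuple[str, float]]:
    """Hypothetical rule: a digit string near the top of page 1, preceded
    by a trigger word, is a contract-number candidate. The candidate's
    confidence is propagated from the evidence that fired."""
    candidates = []
    for prev, tok in zip(tokens, tokens[1:]):
        layout_ok = tok.page == 1 and tok.y < 0.25                   # layout evidence
        trigger_ok = prev.text.lower() in {"smlouva", "c.", "no."}   # text evidence
        if layout_ok and trigger_ok and any(ch.isdigit() for ch in tok.text):
            conf = 0.9 * tok.ocr_conf   # rule prior combined with OCR confidence
            candidates.append((tok.text, conf))
    return candidates

tokens = [Token("Smlouva", 1, 0.05, 0.98), Token("2021/044", 1, 0.05, 0.95)]
print(contract_number_rule(tokens))   # [('2021/044', 0.855)]
```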
Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data. In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches that concentrate on preserving high translational quality. The methods were evaluated on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used to measure the quality of fully expanded new translation segments.
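One way to picture linguistically motivated segment combining is the following toy sketch: if two TM pairs start with subsegments whose translations are known, the known subsegments can be swapped to produce new, unseen segment pairs. The data and the splice rule are invented for illustration and are not the paper's exact algorithm.

```python
# Toy TM and a small dictionary of known subsegment translations (invented).
tm = [
    ("the contract is valid", "smlouva je platná"),
    ("the invoice was rejected", "faktura byla zamítnuta"),
]
sub = {"the contract": "smlouva", "the invoice": "faktura"}

def expand(tm, sub):
    """Splice known subsegment translations into existing pairs
    to generate new segment pairs not present in the TM."""
    new_pairs = set()
    for src, tgt in tm:
        for s_src, s_tgt in sub.items():
            if src.startswith(s_src) and tgt.startswith(s_tgt):
                for o_src, o_tgt in sub.items():
                    if o_src != s_src:
                        new_pairs.add((src.replace(s_src, o_src, 1),
                                       tgt.replace(s_tgt, o_tgt, 1)))
    return new_pairs - set(tm)

for pair in sorted(expand(tm, sub)):
    print(pair)
# ('the contract was rejected', 'smlouva byla zamítnuta')
# ('the invoice is valid', 'faktura je platná')
```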
Proceedings - Natural Language Processing in a Deep Learning World, 2019
Propaganda of various pressure groups, ranging from big economies to ideological blocs, is often presented in the form of objective newspaper texts. However, the apparent objectivity is shaded by support for imbalanced views and distorted attitudes by means of various manipulative stylistic techniques. In the project Manipulative Propaganda Techniques in the Age of Internet, a new resource for the automatic analysis of stylistic mechanisms for influencing readers' opinions is being developed. In its current version, the resource consists of 7,494 newspaper articles from four selected Czech digital news servers, annotated for the presence of specific manipulative techniques. In this paper, we present the current state of the annotations and describe the structure of the dataset in detail. We also offer an evaluation of bag-of-words classification algorithms for the annotated manipulative techniques.
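A bag-of-words baseline of the kind evaluated in the paper can be assembled in a few lines with scikit-learn; the toy articles and technique labels below are invented placeholders, not items from the annotated dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: two "articles" with hypothetical technique labels.
articles = ["experts warn of the obvious threat", "the council met on tuesday"]
labels = ["fearmongering", "none"]

# Bag-of-words features fed into a linear classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(articles, labels)
print(clf.predict(["a new warn threat emerges"]))  # predicts 'fearmongering'
```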
Proceedings of the 4th International Conference on Software and Data Technologies, 2009
The vision of the Semantic Web introduced ontologies as the main unifying tool for managing the knowledge and semantic structure of text documents. However, linking real text documents with ontologies (of various kinds and various degrees of complexity) is still a matter of current research in knowledge representation projects. In this paper, we present the results of the KYOTO project database implementation. The goal of the project is to provide a complex system for the automatic processing of documents in order to extract known facts, link them with a shared ontology, and use this knowledge for question answering about the document topic. We give details about the design and implementation of the KYOTO database, which interlinks national WordNet semantic networks with the general SUMO ontology to offer the basis of the future shared ontology.
The semantic network editor DEBVisDic has been used by different development teams to create more than 20 national wordnets. The editor was recently re-developed as a multi-platform web-based application for general semantic network editing. One of its main advantages over the previous implementation is that no client-side installation is needed. Following the successful first phase in building the Open Dutch Wordnet, DEBVisDic was extended with features that allow users to easily create, edit, and share a new (usually national) wordnet without the need for any complicated configuration or advanced technical skills. The DEBVisDic editor provides advanced features for wordnet browsing, editing, and visualization. Apart from the user-friendly web-based application, DEBVisDic also provides an API to integrate the semantic network data into external applications.
Large ontologies and semantic networks are complex multi-level structures that cannot easily be verified by common checking methods. Automatic consistency checks can reveal systemic errors, e.g. missing links, but finding a missing word sense is difficult. Common solutions rely on step-by-step consultation of many information sources within a gradual review process. This article describes a new approach to verifying and extending wordnet data by involving its users. The approach ensures an early release of the full dataset for use by the target group, with subsequent continuous updates based on suggestions from public users and expert review of those suggestions. The expert team receives the proposed corrections in a clear, aggregated form, together with support for revision and editing.
Proceedings of the 12th International Conference on Agents and Artificial Intelligence, 2020
Question answering systems have improved greatly during the last five years by employing deep neural network architectures such as attentive recurrent networks or transformer-based networks with pretrained contextual information. In this paper, we present the results and a detailed analysis of experiments with the largest question answering benchmark dataset for the Czech language. The best results evaluated in the text reach an accuracy of 72%, which is a 4% improvement over the previous best result. We also introduce the newest version of the Czech Question Answering benchmark dataset, SQAD 3.0, which was substantially extended to more than 13,000 question-answer pairs, and we report the first answer selection results on this dataset, which indicate that the size of the training data is important for the task.
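For readers unfamiliar with the answer selection task reported on the new dataset, here is a deliberately simple baseline: rank candidate sentences by TF-IDF cosine similarity to the question. The question and candidates are invented examples, not SQAD data, and the paper's neural models are of course far stronger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "Kdy byla založena Masarykova univerzita?"
candidates = [
    "Masarykova univerzita byla založena v roce 1919.",
    "Univerzita sídlí v Brně.",
]

# Fit TF-IDF on question and candidates, then pick the closest candidate.
vec = TfidfVectorizer().fit([question] + candidates)
q, c = vec.transform([question]), vec.transform(candidates)
scores = cosine_similarity(q, c)[0]
print(candidates[scores.argmax()])  # the sentence most similar to the question
```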
Proceedings of the 4th International Conference on Agents and Artificial Intelligence, 2012
This paper discusses three up-to-date Artificial Intelligence (AI) projects focusing on the question-answering problem: Watson, Aura, and True Knowledge. Besides a quick introduction to the architecture of these systems, we show examples revealing their shortcomings. The goal of the discussion is the necessity of a module that acquires knowledge in a meaningful way and the isolation of the Mind from natural language. We introduce the idea of the GuessME! system that, by playing a simple game, deepens its own knowledge and sheds new light on the question-answering problem.
Proceedings of the 11th International Conference on Agents and Artificial Intelligence, 2019
In this paper, we introduce a new updated version of the Czech Question Answering database, SQAD v2.1 (Simple Question Answering Database), with the update devoted to improved question and answer classification. The SQAD v2.1 database contains more than 8,500 question-answer pairs with all the metadata appropriate for QA training and evaluation. We present the details of and changes in the database structure, as well as a new algorithm for detecting the question type and the actual answer type from the text of the question. The algorithm is evaluated on more than 4,000 question-answer pairs, reaching an F1-measure of 88% for question type detection and 85% for answer type detection.
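Question type detection from the text of the question can be sketched as a small rule-based classifier; the patterns and type labels below are illustrative assumptions, not the actual SQAD v2.1 algorithm.

```python
import re

# Hypothetical mapping from Czech interrogatives to question types.
RULES = [
    (re.compile(r"^\s*kdy\b", re.I), "DATETIME"),    # "when"
    (re.compile(r"^\s*kde\b", re.I), "LOCATION"),    # "where"
    (re.compile(r"^\s*kdo\b", re.I), "PERSON"),      # "who"
    (re.compile(r"^\s*kolik\b", re.I), "NUMERIC"),   # "how many"
]

def question_type(question: str) -> str:
    """Return the first matching question type, or OTHER."""
    for pattern, qtype in RULES:
        if pattern.search(question):
            return qtype
    return "OTHER"

print(question_type("Kdy byla založena Masarykova univerzita?"))  # DATETIME
```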
International Journal on Artificial Intelligence Tools, 2019
This paper describes a new system for semi-automatically building, extending, and managing a terminological thesaurus: a multilingual terminology dictionary enriched with relationships between the terms themselves to form a thesaurus. The system makes it possible to radically enhance the workflow of current terminology expert groups, where most of the editing decisions still come from introspection. The presented system supplements the lexicographic process with natural language processing techniques, which are seamlessly integrated into the thesaurus editing environment. The system's methodology and the resulting thesaurus are closely connected to new domain corpora in the six languages involved. These are used for term usage examples as well as for the automatic extraction of new candidate terms. The terminological thesaurus is now accessible via a web-based application, which (a) presents rich detailed information on each term, (b) visualizes term relations, and (c) displays real-life usage examples...
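The automatic extraction of new candidate terms from domain corpora can be illustrated with a classic keyness heuristic: words markedly more frequent in the domain corpus than in a reference corpus are proposed as term candidates. The corpora, the smoothing, and the threshold below are invented for the sketch.

```python
from collections import Counter

# Toy tokenized corpora (invented).
domain = "katastr katastr parcela parcela parcela mapa".split()
reference = "mapa mapa kniha den den den den".split()

d, r = Counter(domain), Counter(reference)

def relfreq(c, w):
    # add-one smoothed relative frequency
    return (c[w] + 1) / (sum(c.values()) + len(c))

# Rank domain words by their frequency ratio against the reference corpus.
candidates = sorted(
    ((w, relfreq(d, w) / relfreq(r, w)) for w in d),
    key=lambda x: -x[1],
)
print([w for w, keyness in candidates if keyness > 2.0])  # ['parcela', 'katastr']
```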
The Prague Bulletin of Mathematical Linguistics, 2016
The first edition of the Encyclopaedia of the Czech Language was published in 2002, and since that time it has established itself as one of the basic reference books for the study of the Czech language and related linguistic disciplines. However, many new concepts and even new research areas have emerged since that publication. That is why the preparation of a completely new edition of the encyclopaedia started in 2011, rather than just re-printing the previous version with supplements. The new edition covers the current research status of all concepts connected with the linguistic study of (prevalently, but not solely) the Czech language. The project ran for five years and finished at the end of 2015; the printed edition is currently in preparation. An important innovation of the new encyclopaedia lies in the decision to publish the new edition both as a printed book and as an electronic on-line encyclopaedia, utilizing the many advantages of electronic dictionaries. In t...
This paper describes the methodology and development of tools for building and presenting a terminological thesaurus closely connected with a new specialized domain corpus. The thesaurus multiplatform application offers detailed information on each term, visualizes term relations, and displays real-life usage examples of the term in the domain-related documents. Moreover, the specialized corpus is used to detect domain-specific terms and to propose an extension of the thesaurus with new terms. The presented project is aimed at a terminological thesaurus for the land surveying domain; however, the tools are re-usable for other terminological domains.
In this paper, we describe and evaluate current improvements to methods for enlarging translation memories. In comparison with the previous results from 2013, we have improved coverage by almost 35 percentage points on the same test data. The basic subsegment splitting of the translation pairs is done using the Moses and (M)GIZA++ tools, which provide the subsegment translation probabilities. The obtained phrases are then combined with subsegment combination techniques and filtered by large target language models.
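The final filtering step, scoring candidate target segments with target language models, can be pictured with a toy unigram model; the counts and the acceptance threshold are invented, and a real system would use a much larger n-gram or neural model.

```python
import math

# Invented unigram counts standing in for a large target language model.
unigram_counts = {"faktura": 40, "je": 200, "platná": 30, "platný": 5}
total = sum(unigram_counts.values())

def logprob(sentence: str) -> float:
    """Unigram log-probability with add-one smoothing over a
    hypothetical vocabulary of 1000 words."""
    return sum(
        math.log((unigram_counts.get(w, 0) + 1) / (total + 1000))
        for w in sentence.split()
    )

# Keep only candidates the model considers fluent enough.
candidates = ["faktura je platná", "faktura je platný"]
kept = [c for c in candidates if logprob(c) > -10.0]
print(kept)  # ['faktura je platná'] — the ungrammatical variant is filtered out
```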
The paper presents a supervised approach to semantic parsing based on a new semantic resource, the Pattern Dictionary of English Verbs (PDEV). PDEV lists the most frequent patterns of English verbs identified in corpus data. Each argument in a pattern is semantically categorized with semantic types from the PDEV ontology, and each pattern is linked to a set of sentences from the British National Corpus. The article describes PDEV in detail and presents the task of pattern classification. The system described is based on a distributional approach and achieves 66% micro-average F1 across a sample of 25 of the most frequent verbs.
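As a reminder of the reported metric, micro-average F1 pools true positives, false positives, and false negatives over all classes before computing precision, recall, and F1. The counts in the example are invented, not the paper's figures.

```python
def micro_f1(per_class):
    """per_class: list of (tp, fp, fn) triples, one per class.
    Counts are pooled across classes before computing F1."""
    tp = sum(t for t, _, _ in per_class)
    fp = sum(f for _, f, _ in per_class)
    fn = sum(n for _, _, n in per_class)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. three verb patterns with invented pooled counts
print(round(micro_f1([(8, 2, 3), (5, 1, 2), (7, 3, 1)]), 3))  # 0.769
```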
International Journal of Machine Learning and Computing, 2012
This paper describes the design of a knowledge representation and reasoning system named Dolphin, which is based on the higher-order temporal Transparent Intensional Logic (TIL). An intelligent agent (NAM) that is able to read newspaper headlines from a specialized internet server and allows users to ask questions about various world situations is chosen to demonstrate Dolphin's features. Temporal aspects play an essential role in natural language; we therefore present how this phenomenon is handled in the system. The reasoning capabilities of the agent are divided into three individual strategies and described in the text. Finally, we compare NAM's answers to those of one of today's most widely used search engines.
RASLAN 2012: Recent Advances in Slavonic Natural Language Processing
Logical analysis of natural language makes it possible to extract semantic relations that are not revealed by standard full-text search methods. Intensional logic systems, such as the Transparent Intensional Logic (TIL), can rigorously describe even the higher-order relations between the speaker and the content or meaning of the discourse. In this paper, we concentrate on the mechanism of the logical analysis of direct and indirect discourse by means of TIL. We explicate the procedure within the Normal Translation Algorithm (NTA) for TIL, which covers ...
Proceedings of the 18th International Congress of Linguists (CIL18), Seoul, Republic of Korea, Jul 21, 2008
Cornetto is a two-year project funded by the Flemish-Dutch Taalunie in the Stevin programme (project number STE05039). It produces a lexical semantic database for Dutch, combining Wordnet (Fellbaum 1998) with FrameNet-like information. The data is derived from two existing lexical resources: the Dutch Wordnet (DWN, Vossen 1998) and the Referentie Bestand Nederlands (RBN, Maks, Martin and Meerseman 1999). These two resources represent two different perspectives on word meaning. Whereas DWN takes ...