Simone Marchi | Consiglio Nazionale delle Ricerche (CNR)

Papers by Simone Marchi

Improved Written Arabic Word Parsing through Orthographic, Syntactic and Semantic constraints

Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015

The Arabic script omits diacritics, which are essential to fully specify inflected word forms. The extensive homography caused by diacritic omission considerably increases the number of alternative parses returned by any morphological analyzer that makes no use of contextual information. Many such parses are spurious and can be filtered out if diacriticization, i.e. the process of interpolating diacritics into written forms, takes advantage of a number of orthographic, morpho-syntactic and semantic constraints that operate in Arabic at the word level. We show that this strategy reduces parsing time and makes morphological analysis of written texts considerably more accurate.
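
The filtering strategy can be pictured as a pruning step applied before any contextual disambiguation. Below is a minimal Python sketch, assuming a hypothetical analyzer that returns candidate analyses as structured records; the Analysis record and both constraint checks are invented illustrations of the kind of word-level tests the paper describes, not the authors' actual rules.

```python
# A minimal sketch of constraint-based pruning of morphological analyses for
# undiacritized Arabic forms. The Analysis record and both constraint checks
# are hypothetical illustrations, not the authors' actual rules.
from dataclasses import dataclass, field

@dataclass
class Analysis:
    diacritized: str        # fully diacritized form proposed by the analyzer
    pos: str                # part of speech
    features: dict = field(default_factory=dict)  # morpho-syntactic features

def satisfies_word_level_constraints(a: Analysis) -> bool:
    # Orthographic toy rule: a word-final fathatan + 'alif normally signals
    # an indefinite accusative reading.
    if a.diacritized.endswith("\u064B\u0627") and a.features.get("case") != "acc":
        return False
    # Morpho-syntactic toy rule: participles never carry verbal mood.
    if a.pos == "participle" and "mood" in a.features:
        return False
    return True

def prune(candidates: list[Analysis]) -> list[Analysis]:
    # Pruning before contextual disambiguation shrinks the parse space,
    # which is where the reported speed and accuracy gains come from.
    return [a for a in candidates if satisfies_word_level_constraints(a)]
```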

Thinking like the "Modern Operating Systems": The Omega architecture and the Clavius on the Web project

The current digital turn in studying and analyzing historical documents results both in machine-actionable cultural data and in software able to process them. However, these data and services often lack integration strategies that would allow their reuse in contexts different from the original ones. As Franz Fischer points out in a noteworthy article: “There is no out-of-the-box software available for creating truly critical and truly digital editions at the same time” [1]. Likewise, Monica Berti states that it is now important to "build a model for representing quotations and text reuses of lost works in a digital environment” [2]. In this vein, Bridget Almas is in charge of developing Perseids, an integrated platform for collaboratively transcribing, editing, and translating historical documents and texts, through which, she claims, students and scholars are able to create open-source digital scholarly editions [3]. The Literary Computing group of the Institute for Computational Linguistics at the National Research Council of Italy (ILC-CNR) is carrying out a line of research on designing software models for textual scholarship and implementing them with cutting-edge software engineering approaches and technologies. This work aims at providing a general framework, called Omega [4], inherently conceived with the object-oriented paradigm and semantic web technologies, and suitable for studying historical and literary documents and texts.

Dal testo alla conoscenza e ritorno: estrazione terminologica e annotazione semantica di basi documentali di dominio

The paper focuses on the automatic extraction of domain knowledge from Italian legal texts and presents a fully implemented ontology learning system (T2K, Text-2-Knowledge) that includes a battery of tools for Natural Language Processing, statistical text analysis and machine learning. Evaluation results show the considerable potential of systems like T2K, which exploit an incremental interleaving of NLP and machine learning techniques for accurate, large-scale, semi-automatic extraction and structuring of domain-specific knowledge.
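
To make the interleaving of NLP and statistics concrete, here is a minimal Python sketch of POS-pattern-based term candidate extraction with frequency ranking; the pattern and the scoring are illustrative placeholders, not T2K's actual pipeline.

```python
# Toy terminology extraction: collect NOUN and NOUN+ADJ spans (a typical
# Italian term pattern) from POS-tagged text, then rank by frequency.
from collections import Counter

# A POS-tagged sentence: (token, tag) pairs, as a tagger would emit.
tagged = [("estrazione", "NOUN"), ("terminologica", "ADJ"), ("di", "ADP"),
          ("basi", "NOUN"), ("documentali", "ADJ"), ("di", "ADP"), ("dominio", "NOUN")]

def candidate_terms(sent):
    """Yield single nouns and noun+adjective bigrams as term candidates."""
    for i, (tok, tag) in enumerate(sent):
        if tag == "NOUN":
            yield tok
            if i + 1 < len(sent) and sent[i + 1][1] == "ADJ":
                yield f"{tok} {sent[i + 1][0]}"

counts = Counter(candidate_terms(tagged))
# Rank candidates by corpus frequency; a real system would also contrast
# domain frequency against a reference corpus before structuring the termbase.
for term, freq in counts.most_common():
    print(term, freq)
```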

Defining the Core Entities of an Environment for Textual Processing in Literary Computing

One of the main challenges for the DH community is to provide suitable software models and tools. To model the literary domain and the related user requirements, we chose to follow the engineering principles of object-oriented analysis and design. The digital representation of a textual resource is a challenge in itself, as it involves several theoretical and epistemological issues in semiotics, paleography, philology, linguistics, engineering, and computer science. We have designed and implemented a set of core entities as the fundamental data types shared among all the components of the environment.

The BioLexicon: a large-scale terminological resource for biomedical text mining

BMC Bioinformatics, 2011

Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events.

Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized) together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts. In order to foster interoperability, the BioLexicon is modelled using the Lexical Markup Framework, an ISO standard.

Conclusions: The BioLexicon contains over 2.2M lexical entries and over 1.8M terminological variants, as well as over 3.3M semantic relations, including over 2M synonymy relations. Its exploitation can benefit both application developers and users. We demonstrate some such benefits by describing the integration of the resource into a number of different tools, and evaluating the improvements in performance that this can bring.
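
The core idea of unifying written variants under a single entry can be sketched as a variant-indexed lookup. The Python below is a toy illustration in the spirit of the BioLexicon; the entries and the schema are invented and do not reflect the resource's actual LMF serialization.

```python
# Toy variant-aware lexicon: any written form found in text resolves to a
# single unified entry carrying POS and semantic relations. Entries invented.
lexicon = {
    "p53": {"lemma": "TP53", "variants": {"p53", "TP53", "tumor protein p53"},
            "pos": "NOUN", "relations": {"synonym": {"LFS1"}}},
    "bind": {"lemma": "bind", "variants": {"bind", "binds", "binding"},
             "pos": "VERB", "relations": {}},
}

# Invert variants -> entry key so surface forms map to unified entries,
# as the abstract describes for term variants gathered from the literature.
variant_index = {v.lower(): key
                 for key, e in lexicon.items() for v in e["variants"]}

def lookup(surface: str):
    key = variant_index.get(surface.lower())
    return lexicon.get(key) if key else None

print(lookup("binding")["lemma"])   # -> bind
print(lookup("TP53")["relations"])  # synonymy links usable by event extractors
```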

Domain Adaptation for Dependency Parsing at Evalita 2011

The domain adaptation task was aimed at investigating techniques for adapting state-of-the-art dependency parsing systems to new domains. Both the language dealt with, i.e. Italian, and the target domain, namely the legal domain, represent the two main novelties of the task organised at Evalita 2011. In this paper, we define the task and describe how the datasets were created from different resources. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.

NLP-enhanced Content Filtering Within the POESIA Project

This paper introduces the POESIA internet filtering system, which is open source and combines standard filtering methods, such as positive/negative URL lists, with more advanced techniques, such as image processing and NLP-enhanced text filtering. The description here focuses on the components providing textual content filtering for three European languages (English, Italian and Spanish), employing NLP methods to enhance performance. We also address the acquisition of the language data needed to develop these filters, and the evaluation of the system and its components.
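
The layering of cheap list-based checks with an NLP fallback can be sketched as follows. This is a minimal Python illustration assuming scikit-learn; the tiny Naive Bayes model, the toy training data, and the URLs are stand-ins, not the project's actual components.

```python
# POESIA-style layered filtering sketch: a negative URL list is consulted
# first, and a text classifier decides only when the lists are silent.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

blocked_urls = {"bad.example.org"}           # negative URL list (hypothetical)

train_texts = ["family friendly cooking recipes",
               "explicit adult content gambling"]
train_labels = [0, 1]                        # 0 = allow, 1 = block

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_texts),
                                 train_labels)

def allow(url: str, page_text: str) -> bool:
    if url in blocked_urls:                  # fast list-based decision first
        return False
    # Fall back to the NLP-enhanced textual content filter.
    return classifier.predict(vectorizer.transform([page_text]))[0] == 0

print(allow("news.example.com", "cooking recipes for the family"))  # True
```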

Investigating the Application of Distributional Semantics to Stylometry

The inclusion of semantic features in the stylometric analysis of literary texts appears to be poorly investigated. In this work, we experiment with the application of Distributional Semantics to a corpus of Italian literature to test whether word distributions can convey stylistic cues. To verify our hypothesis, we set up an Authorship Attribution experiment. The results we obtained suggest that the style of an author can indeed reveal itself through word distributions too.
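
One plausible reading of such a setup is to average distributional word vectors into document vectors and attribute authorship by similarity to per-author centroids. The Python sketch below illustrates this under that assumption; the toy embeddings and author centroids stand in for vectors trained on the Italian corpus, and this is not necessarily the authors' exact pipeline.

```python
# Authorship attribution from word-distribution features: average word
# vectors into a document vector, then pick the nearest author centroid.
import numpy as np

def doc_vector(tokens, embeddings, dim=50):
    """Average the distributional vectors of a document's known words."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def attribute(doc_vec, author_centroids):
    """Assign the author whose centroid is closest in cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(author_centroids, key=lambda name: cos(doc_vec, author_centroids[name]))

# Toy 50-dimensional embeddings; real ones would come from the literary corpus.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["selva", "oscura", "mare", "infinito"]}
centroids = {"Dante": doc_vector(["selva", "oscura"], emb),
             "Leopardi": doc_vector(["mare", "infinito"], emb)}
print(attribute(doc_vector(["selva", "oscura"], emb), centroids))  # -> Dante
```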

Language Modulation by Hypnotizability

Bellini’s Correspondence: a Digital Scholarly Edition for a Multimedia Museum

Within the “Museo Virtuale della Musica BellinInRete” project, a corpus of letters written by the renowned Catania-born composer Vincenzo Bellini (1801-1835) will be encoded and made publicly available. This contribution illustrates the part of the project regarding the implementation of the prototype for the metadata and text encoding, indexing and visualisation of Bellini’s correspondence. The encoding scheme has been defined according to the latest guidelines of the Text Encoding Initiative and has been instantiated on a sample of letters. At the same time, a first environment has been implemented by customizing two open source tools: Edition Visualization Technology and the Omega scholarly platform. The main objective of the digital edition is to engage the general public with the cultural heritage held by the Belliniano Civic Museum of Catania. This wide access to Bellini’s correspondence has been conceived so as to preserve the scholarly transcriptions of the letters edited by S…

Suscettibilità ipnotica e linguaggio

Towards a Decision Support System for Text Interpretation

This article illustrates the first steps towards the implementation of a Decision Support System aimed at recreating a research environment for scholars and providing them with computational tools that assist in the processing and interpretation of texts. While outlining the general characteristics of the system, the paper presents a minimal set of user requirements and provides a possible use case on Dante’s Inferno.

MultiMedia metadata management: a proposal for an infrastructure

Workshop on Semantic Web …

The management and exchange of multimedia data is a challenging area of research due to the variety of formats and standards and the many intended applications. Semantic web technologies are very promising for enabling interoperability and integration of media. …

Ontology-based Semantic Annotation of Product Catalogues

Creation and use of lexicons and ontologies for NL interfaces to databases

In this paper we present an original approach to natural language query interpretation which has been implemented within the FuLL (Fuzzy Logic and Language) Italian project of BC S.r.l. In particular, we discuss the creation of linguistic and ontological resources, together with the exploitation of existing ones, for natural language-driven database access and retrieval. Both the database and the queries we experiment with are in Italian, but the methodology naturally extends to other languages.
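
The general idea of lexicon/ontology-mediated database access can be sketched as a mapping from surface words to ontology concepts, and from concepts to schema elements. The Python below is a minimal illustration; the mapping tables, the target schema, and the query are invented for this sketch and are not the FuLL project's resources.

```python
# Toy NL-to-SQL interpretation: a lexicon maps Italian words to ontology
# concepts or values, and a schema maps concepts to database tables.
lexicon = {
    "clienti": "Customer",            # entity concept
    "ordini": "Order",                # entity concept
    "milano": ("city", "Milano"),     # value constraining a column
}
schema = {"Customer": "customers", "Order": "orders"}

def to_sql(query: str) -> str:
    table, filters = None, []
    for tok in query.lower().split():
        mapped = lexicon.get(tok)
        if isinstance(mapped, str):            # concept names the target table
            table = schema[mapped]
        elif isinstance(mapped, tuple):        # value becomes a WHERE filter
            filters.append(f"{mapped[0]} = '{mapped[1]}'")
    where = f" WHERE {' AND '.join(filters)}" if filters else ""
    return f"SELECT * FROM {table}{where}"

print(to_sql("clienti di Milano"))
# -> SELECT * FROM customers WHERE city = 'Milano'
```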

Multimedia Information Extraction in Ontology-based Semantic Annotation of Product Catalogues

Proc. of the 3rd Italian Semantic Web Workshop-SWAP 2006, 2006

The demand for efficient methods for extracting knowledge from multimedia content has led to a growing research community investigating the convergence of multimedia and knowledge technologies. In this paper we describe a methodology for extracting multimedia information from product catalogues, empowered by the synergetic use and extension of a domain ontology. The methodology was implemented in the Trade Fair Advanced Semantic Annotation Pipeline of the VIKE framework.

NLP-based metadata extraction for legal text consolidation

Proceedings of the 12th International Conference on Artificial Intelligence and Law - ICAIL '09, 2009

"The terminology of the Babylonian Talmud: Extraction, Representation and Use in the Context of Computational Linguistics" (Materia Giudaica. Rivista dell'associazione italiana per lo studio del giudaismo, xxv, Giuntina, Firenze, 2020, pp. 61-74)

Materia Giudaica, 2020

A formal digital structuring of the terminology of the Talmud is being carried out in the context of the Project for the Translation of the Babylonian Talmud into Italian. Following the principles of Meaning-Text Theory, the terminological resource has been encoded in the form of a multilingual (Hebrew-Aramaic-Italian) Explanatory Combinatorial Dictionary. The construction of this resource was supported by text processing and computational linguistics techniques aimed at automatically extracting terms from the Italian translation of the Talmud and aligning them with the corresponding Hebrew/Aramaic source terms. The paper describes the process that was set up for constructing the terminological resource, with the ultimate goal of illustrating the advantages of adopting a formal linguistic model. The resource aims to be a useful tool for investigating the characteristics of the languages of the Talmud, for helping translators in their work and, more generally, for supporting scholars in their study of the Talmud itself.
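
One simple way to picture the term alignment step is co-occurrence counting over aligned translation segments. The Python below is a crude illustrative stand-in for the project's actual extraction and alignment techniques; the parallel segments are toy data.

```python
# Toy Italian <-> Hebrew/Aramaic term alignment: count how often candidate
# pairs co-occur in aligned segments and propose the most frequent partner.
from collections import Counter
from itertools import product

# Parallel segments: (Italian tokens, source-language tokens). Toy data.
segments = [
    (["preghiera", "della", "sera"], ["תפילת", "ערבית"]),
    (["preghiera", "del", "mattino"], ["תפילת", "שחרית"]),
]

cooc = Counter()
for it_toks, he_toks in segments:
    for pair in product(set(it_toks), set(he_toks)):
        cooc[pair] += 1

def best_alignment(term: str):
    """Propose the source token that most often co-occurs with the term."""
    cands = {he: n for (it, he), n in cooc.items() if it == term}
    return max(cands, key=cands.get) if cands else None

print("preghiera ->", best_alignment("preghiera"))  # תפילת occurs in both segments
```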

"Annotazione Linguistica Automatica dell'ebraico Mishnaico: esperimenti sul Talmud babilonese" (Materia Giudaica. Rivista dell'associazione italiana per lo studio del giudaismo, xxiii, Giuntina, Firenze, 2018, pp. 281-291)

Materia Giudaica, 2018

The automatic linguistic analysis of ancient Hebrew represents a new research opportunity in the field of Jewish studies: very little has been produced so far, both in terms of linguistic resources and, above all, of tools for the analysis of ancient Hebrew. This article illustrates work carried out within the Italian Translation of the Babylonian Talmud Project and aimed at the construction of an automatic linguistic annotator for Mishnaic Hebrew.

La Terminologia del Talmud Babilonese: Estrazione, Rappresentazione e Uso nel Contesto della Linguistica Computazionale (AISG "Ebraismo fra peculiarità e interculturalità", Ravenna, 2-4 settembre 2019)

In the context of the Project for the Translation of the Babylonian Talmud into Italian (PTTB), a formal digital structuring of the Talmud's terminology is under way. The terminological resource has been encoded in the form of a multilingual (Hebrew-Aramaic-Italian) Explanatory Combinatorial Dictionary according to the principles of Meaning-Text Theory. The construction of this resource was supported by text processing and computational linguistics techniques aimed at automatically extracting terms from the Italian translation of the Talmud and aligning them with the corresponding Hebrew/Aramaic terms. The paper describes the process set up for constructing the terminological resource, with the ultimate goal of illustrating the advantages of adopting a formal linguistic model. The resource aims, indeed, to be a useful tool for investigating the characteristics of the languages of the Talmud, for helping translators in their work and, more generally, for the wide community of Talmud scholars.
