MARIA SIMI - Academia.edu
Papers by MARIA SIMI
Accademia University Press eBooks, 2016
Elsevier eBooks, 1994
Abstract: An analysis of some formal proofs that have appeared in recent literature dealing with multiple theories reveals that they are not always accurate: some steps are not properly accounted for, lifting is used improperly, and extra-logical constructions or unnecessary assumptions are made. Many such problems arise from the involved mechanisms of reflection. We show that proof in context can replace the most common uses of reflection principles. Proofs can be carried out by switching to a context and reasoning within it. Context switching, however, does not correspond to reflection or reification but involves changing the level of nesting of a theory within another theory. We introduce a generalised rule for proof in context and a convenient notation to express nesting of contexts, which allows us to carry out reasoning in and across contexts in a safe and natural way.
... by a dependency parser: A dependency parser can be built without using a formal grammar of a ... It will also be worth considering moving the intent classification to indexing time and ... [2] A. Aue, M. Gamon. [2005] Customizing Sentiment Classifiers to New Domains: a Case Study. ...
Accademia University Press eBooks, 2015
Cross-Language Evaluation Forum, 2012
The paper reports our experiments in tackling the CLEF 2012 Pilot Task on Machine Reading for Question Answering. We introduce the technique of index expansion, which relies on building a search index enriched with information gathered from a linguistic analysis of texts. The index provides a highly tangled representation of the sentences where each word is directly connected to others representing both meaning and relations. Instead of keeping the knowledge base separate, the relevant knowledge gets embedded within the text. We can hence use efficient indexing techniques to represent such knowledge and query it very effectively. We explain how index expansion was used in the task and describe the experiments that we performed. The results achieved are quite positive and a final error analysis shows how the technique can be further improved.
This contribution presents the first steps towards the analysis of Leonardo Fibonacci's Liber Abbaci using computational linguistics methods. The work is currently carried out in the context of a joint research project between the Tuscany Region and the University of Pisa with the help of an interdisciplinary team.
Italian Journal of Computational Linguistics, 2021
Today's goal-oriented dialogue systems are designed to operate in restricted domains and with the implicit assumption that the user goals fit the domain ontology of the system. Under these assumptions dialogues exhibit only limited collaborative phenomena. However, this is not necessarily true in more complex scenarios, where user and system need to collaborate to align their knowledge of the domain in order to improve the conversation and achieve their goals. To foster research on data-driven collaborative dialogues, in this paper we present JILDA, a fully annotated dataset of chat-based, mixed-initiative Italian dialogues related to the job-offer domain. As far as we know, JILDA is the first dialogic corpus completely annotated in this domain. The analysis carried out on top of the semantic annotations clearly shows the naturalness and greater complexity of JILDA's dialogues. In fact, the new dataset offers a large number of examples of pragmatic phenomena, such as proactivity (i.e., providing information not explicitly requested) and grounding, which are rarely investigated in AI conversational agents based on neural architectures. In conclusion, the annotated JILDA corpus, given its innovative characteristics, represents a new challenge for conversational agents and an important resource for tackling more complex scenarios, thus advancing the state of the art in this field.
EVALITA. Evaluation of NLP and Speech Tools for Italian
The paper describes our submissions to the task on Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) at Evalita 2016. Our approach relies on a technique of Named Entity tagging that exploits both character-level and word-level embeddings. Character-based embeddings allow learning the idiosyncrasies of the language used in tweets. Using a full-blown Named Entity tagger allows recognizing a wider range of entities than those well known by their presence in a Knowledge Base or gazetteer. Our submissions achieved the first, second and fourth top official scores.
Lecture Notes in Computer Science, 2013
Established in 2007, EVALITA (http://www.evalita.it) is the evaluation campaign of Natural Language Processing and Speech Technologies for the Italian language, organized around shared tasks focusing on the analysis of written and spoken language respectively. EVALITA's shared tasks are aimed at contributing to the development and dissemination of natural language resources and technologies by proposing a shared context for training and evaluation. Following the success of previous editions, we organized EVALITA 2014, the fourth evaluation campaign, with the aim of continuing to provide a forum for the comparison and evaluation of research outcomes on Italian from both academic institutions and industrial organizations. The event has been supported by the NLP Special Interest Group of the Italian Association for Artificial Intelligence (AI*IA) and by the Italian Association of Speech Science (AISV). The novelty of this year is that the final workshop of EVALITA is co-located with the 1st Italian Conference of Computational Linguistics (CLiC-it, http://clic.humnet.unipi.it/), a new event aiming to establish a reference forum for research on Computational Linguistics of the Italian community, with contributions from a wide range of disciplines going from Computational Linguistics, Linguistics and Cognitive Science to Machine Learning, Computer Science, Knowledge Representation, Information Retrieval and Digital Humanities. The co-location with CLiC-it potentially widens the audience of EVALITA. The final workshop, held in Pisa on 11 December 2014 within the context of the XIII AI*IA Symposium on Artificial Intelligence (Pisa, 10-12 December 2014, http://aiia2014.di.unipi.it/), gathers the results of 8 tasks, 4 of which focus on written language and 4 on speech technologies.
In this EVALITA edition, we received 30 expressions of interest, 55 registrations and 43 actual submissions to 8 proposed tasks distributed as follows:
Semantic Processing of Legal Texts (SPLeT-2012) Workshop Programme
The 4th Workshop on "Semantic Processing of Legal Texts" (SPLeT-2012) presents the first multilingual shared task on Dependency Parsing of Legal Texts. In this paper, we define the general task and its internal organization into sub-tasks, describe the datasets and the domain-specific linguistic peculiarities characterizing them. We finally report the results achieved by the participating systems, describe the underlying approaches and provide a first analysis of the final test results. Keywords: Domain Adaptation, Dependency ...
Super-sense tagging (SST) is a Natural Language Processing task that consists of annotating each significant entity in a text, like nouns, verbs, adjectives and adverbs, within a general semantic taxonomy defined by the WordNet lexicographer classes (called super-senses) [1]. SST can be considered a task half-way between Named-Entity Recognition (NER) and Word Sense Disambiguation (WSD): it is an extension of NER, since it uses a larger set of semantic categories, and it is an easier and more practical task than WSD, which deals with very specific senses.
Tanl (Natural Language Text Analytics) is a suite of tools for text analytics based on the software architecture paradigm of data pipelines. Tanl pipelines are data driven, i.e. each stage pulls data from the preceding stage and transforms them for use by the next stage. Since data is processed as soon as it becomes available, processing delay is minimized, improving data throughput. The processing modules can be written in C++ or in Python and can be combined using a few lines of Python script to produce full NLP applications. Tanl provides a set of modules, ranging from tokenization to POS tagging, from parsing to NE recognition. A Tanl pipeline can be processed in parallel on a cluster of computers by means of a modified version of Hadoop streaming. We present the architecture, its modules and some sample applications. Introduction: Text analytics involves many tasks ranging from simple text collection, extraction, and preparation to linguistic syntactic and semantic analysis, cross...
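The pull-based pipeline idea described above can be illustrated with plain Python generators. This is only a minimal sketch of the general pattern, not Tanl's actual API: the stage names `tokenize` and `tag`, the toy tagging rule, and the `source` helper are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

consumed: List[str] = []  # records which input lines have been pulled so far


def source() -> Iterator[str]:
    """Hypothetical input stage: yields raw lines one at a time."""
    for line in ["Tanl has 2 stages", "each stage pulls data"]:
        consumed.append(line)
        yield line


def tokenize(lines: Iterable[str]) -> Iterator[List[str]]:
    """Pulls a line from the preceding stage and yields its tokens."""
    for line in lines:
        yield line.split()


def tag(sentences: Iterable[List[str]]) -> Iterator[List[Tuple[str, str]]]:
    """Pulls a token list and attaches a toy tag (not a real POS tagger)."""
    for tokens in sentences:
        yield [(tok, "NUM" if tok.isdigit() else "WORD") for tok in tokens]


# Stages are combined by nesting; nothing runs until data is requested.
pipeline = tag(tokenize(source()))
first = next(pipeline)              # pulls exactly one line through all stages
consumed_after_first = len(consumed)
rest = list(pipeline)               # drains the remaining input
```

Because each stage is a generator, requesting one result from the last stage pulls just one item through the whole chain, which is the sense in which data is processed as soon as it becomes available.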
BIT, 1978
A function is presented to test equality between lists. In any case the function requires at most one single traversal of the lists and behaves as follows:
- if there are no cycles, it works as standard EQUAL functions;
- if there are cycles, it does not give information about equality, but it can detect cycles, i.e. it signals which lists are cyclic.
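The behaviour listed above can be approximated with the following sketch. It is not the function from the paper: it assumes flat lists built from hypothetical `Cons` cells linked through `cdr`, it detects cycles with a set of visited cell identities (a swapped-in technique), and it reports a cycle only if one is reached before a mismatch settles the comparison.

```python
class Cons:
    """Hypothetical cons cell: `car` holds an atom, `cdr` the next cell or None."""
    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr


def from_items(items):
    """Build an acyclic list of Cons cells from a Python sequence."""
    head = None
    for item in reversed(items):
        head = Cons(item, head)
    return head


def _reaches_cycle(node, seen):
    """Walk forward from `node`; True if an already visited cell is reached."""
    while node is not None:
        if id(node) in seen:
            return True
        seen.add(id(node))
        node = node.cdr
    return False


def compare(a, b):
    """Return ("equal", bool) for acyclic input, or ("cyclic", which) otherwise."""
    seen_a, seen_b = set(), set()
    while a is not None and b is not None:
        if id(a) in seen_a or id(b) in seen_b:
            # A cycle was found: report which lists are cyclic, not equality.
            cyclic = []
            if _reaches_cycle(a, seen_a):
                cyclic.append("first")
            if _reaches_cycle(b, seen_b):
                cyclic.append("second")
            return ("cyclic", cyclic)
        seen_a.add(id(a))
        seen_b.add(id(b))
        if a.car != b.car:
            return ("equal", False)
        a, b = a.cdr, b.cdr
    return ("equal", a is None and b is None)


# A two-cell cyclic list: 1 -> 2 -> back to the first cell.
loop = Cons(1, Cons(2))
loop.cdr.cdr = loop
```

For example, `compare(from_items([1, 2, 3]), from_items([1, 2, 3]))` behaves like an ordinary equality test, while `compare(loop, from_items([1, 2, 1, 2, 1]))` reports that the first list is cyclic instead of answering the equality question.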
Background: Hyperglycemia and obesity are independently associated with a worse prognosis in subjects with COVID-19. Their interaction, as well as the potential modulating effects of additional confounding factors, is poorly understood. Therefore, we aimed to identify and evaluate confounding factors affecting the prognostic value of
We tackle two different problems of text categorization, namely feature selection (FS) and classifier induction. We propose a new FS technique, based on a simplified version of the $\chi^2$ statistics, and a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation of these two methods performed on the Reuters-21578 benchmark. 1. INTRODUCTION Text categorization denotes the activity of automatically building, by means of machine learning techniques, automatic text classifiers (see e.g. [2]). Two key steps are document indexing and classifier induction. Document indexing refers to the task of automatically constructing internal representations of the documents, able to synthesize the meaning of the documents. Usually, a text document is represented as a vector of weights $d_j = \langle w_{1j}, \ldots, w_{rj} \rangle$, where r is the number of features (i.e. words) that occur at least once in at least one document of the coll...
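As background for the feature selection step, the sketch below computes the standard chi-square score of a term with respect to a category from a 2x2 table of document counts. This is the textbook statistic, not the simplified version the abstract proposes, whose definition is not given here; the function name and argument names are assumptions made for illustration.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score for a (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A row or column of the table is empty: no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# A term that occurs in exactly the documents of the category scores high,
# while a term spread evenly across categories scores zero.
perfect = chi_square(10, 0, 0, 10)
uninformative = chi_square(5, 5, 5, 5)
```

In a feature selection setting, terms would typically be ranked by such a score and only the highest-scoring ones kept as features.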
The paper addresses the challenge of converting MIDT, an existing dependency-based Italian treebank resulting from the harmonization and merging of smaller resources, into the Stanford Dependencies annotation formalism, with the final aim of constructing a standard-compliant resource for the Italian language. Achieved results include a methodology for converting treebank annotations belonging to the same dependency-based family, the Italian Stanford Dependency Treebank (ISDT), and an Italian localization of the