Alessandro Maisto | University of Salerno
Papers by Alessandro Maisto
The measurement of machine translation (MT) performance is an unsolved issue in NLP. The task can be done by a human, but the time cost and the need for skilled workers raise the need for automatic ways to measure the quality of a translation. In this work, we aim to develop a new methodology for measuring the quality of MT results from a syntactic point of view. The idea takes as its theoretical framework the work of Harris on the decomposition of sentences into elementary units called kernels. Our model parses Spanish sentences and UNL (Universal Networking Language) graphs with a rule-based methodology and divides them into units of information. By comparing those units, the model measures the quality of the translation. Our results show that decomposing sentences into minimal syntactic units can improve evaluation performance even without a lexical/semantic analysis.
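A minimal sketch of the kernel-comparison idea follows, assuming the source-side UNL graph and the candidate translation have already been decomposed into sets of kernel units; the function name and the F1-style overlap score are illustrative, not the paper's exact metric.

```python
# Toy kernel-level comparison for MT evaluation.
# Assumption: both representations have already been reduced to sets of
# kernel units (plain strings). The score below is an illustrative
# F1-style overlap, not the authors' formula.

def kernel_overlap(reference_kernels: set[str], candidate_kernels: set[str]) -> float:
    """Score a translation by the overlap between its kernel units and
    the kernels extracted from the reference representation."""
    if not reference_kernels or not candidate_kernels:
        return 0.0
    shared = reference_kernels & candidate_kernels
    precision = len(shared) / len(candidate_kernels)
    recall = len(shared) / len(reference_kernels)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Usage with toy kernels: one unit out of two is preserved.
ref = {"juan comprar libro", "libro ser nuevo"}
cand = {"juan comprar libro", "libro ser viejo"}
print(kernel_overlap(ref, cand))  # 0.5
```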
Italian Journal of Computational Linguistics
Distributional Semantics (DS) models are based on the idea that two words which appear in similar contexts, i.e. similar neighborhoods, have similar meanings. This concept was originally presented by Harris in his Distributional Hypothesis (DH) (Harris 1954). Even though the DH forms the basis of the majority of DS models, Harris states in later works that only syntactic analysis allows a more precise formulation of the neighborhoods involved: the arguments and the operators. In this work, we present a DS model based on the concept of Syntactic Distance, inspired by a study of Harris's theories concerning the syntactic-semantic interface. In our model, the context of each word is derived from its dependency network generated by a parser. With this strategy, the co-occurring terms of a target word are calculated on the basis of their syntactic relations, which are also preserved in the event of syntactic transformations. The model, named Syntactic Distance as Word Window (SD-W2), has been tested on three state-of-the-art tasks: Semantic Distance, Synonymy and Single Word Priming, and compared with other classical DS models. In addition, the model has been subjected to a new test based on Operator-Argument selection. Although the results obtained by SD-W2 do not always reach those of modern contextualized models, they are often above average and, in many cases, comparable with the results of GloVe or BERT.
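The sketch below illustrates the general idea of taking a word's context from its dependency relations rather than a linear word window; it assumes a parse is already available as (head, relation, dependent) triples and is not the authors' SD-W2 implementation.

```python
# Dependency-based contexts: each word co-occurs with the words it is
# syntactically related to, whatever their linear distance in the sentence.
from collections import defaultdict

def syntactic_contexts(triples):
    """Map each word to the set of (relation, word) pairs it participates in."""
    contexts = defaultdict(set)
    for head, relation, dependent in triples:
        contexts[head].add((relation, dependent))
        contexts[dependent].add((relation + "_inv", head))
    return contexts

# Toy parse of "the cat chased the mouse": the operator "chased" keeps the
# same arguments as context even under syntactic transformations.
parse = [("chased", "nsubj", "cat"), ("chased", "obj", "mouse"),
         ("cat", "det", "the"), ("mouse", "det", "the")]
print(syntactic_contexts(parse)["chased"])
```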
Lecture Notes in Computer Science, 2023
The research we present in this paper focuses on the automatic management of knowledge about experience goods and services and their features, starting from real texts generated online by internet users. The details of an experiment conducted on a dataset of product reviews, on which we tested a set of rule-based and statistical solutions, are described in the paper. The main goals are review classification, the extraction of relevant product features, and their systematization into product-driven ontologies. Feature extraction is performed through a rule-based strategy grounded on SentIta, an Italian collection of subjective lexical resources. Features and reviews are classified thanks to a Distributional Semantic algorithm. In the end, we face the problem of organizing the extracted knowledge by integrating the subjective information produced by internet users within a product-driven ontology. The Natural Language Processing (NLP) tool exploited in the work is ...
Proceedings of the Fourth Italian Conference on Computational Linguistics CLiC-it 2017, 2017
The present research exploits the large amount of linguistic resources developed within the Lexicon-Grammar paradigm in the domain of Opinion Mining. Grounded on the Semantic Predicates theory, the proposed system is able to automatically match the syntactic structures selected by special classes of verbs, indicating positive or negative Sentiment, Opinion or Physical acts, with the semantic frames evoked by the same lexical items. This method has been tested on a large dataset composed of short texts, such as tweets and news headlines.
Lingue e Linguaggi, 2017
The research we present in this paper focuses on the automatic management of knowledge about experience goods and services and their features, starting from real texts generated online by internet users. The details of an experiment conducted on a dataset of product reviews, on which we tested a set of rule-based and statistical solutions, are described in the paper. The main goals are review classification, the extraction of relevant product features, and their systematization into product-driven ontologies. Feature extraction is performed through a rule-based strategy grounded on SentIta, an Italian collection of subjective lexical resources. Features and reviews are classified thanks to a Distributional Semantic algorithm. In the end, we face the problem of organizing the extracted knowledge by integrating the subjective information produced by internet users within a product-driven ontology. The Natural Language Processing (NLP) tool exploited in the work ...
In this paper we present a hybrid semi-automatic methodology for the construction of a Lexical Knowledge Base. The purpose of the work is to face the challenges related to synonymy in Question Answering systems. We speak of a “hybrid” method for two main reasons: firstly, it includes both a manual annotation process and an automatic expansion phase; secondly, the Knowledge Base data refers, at the same time, to both the syntactic and the semantic properties borrowed from the Lexicon-Grammar theoretical framework. The resulting Knowledge Base allows the automatic recognition of those nouns and adjectives which are not typically related in synonym databases. In detail, we refer to nouns and adjectives that enter into a morpho-phonological relation with verbs, in addition to the classic matching between words based on synsets from the MultiWordNet synonym database.
In this work we present a new framework for the analysis of Italian texts that could help linguists to perform rapid text analysis. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that includes the basic algorithms to perform different kinds of analysis. The framework includes a Preprocessing Module, a POS Tagging and Lemmatization Module, a Statistic Module, a Semantic Module based on Distributional Analysis algorithms, and a Syntactic Module, which analyses the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels. The objective of the framework is to build an “all-in-one” platform for NLP which allows any kind of user to perform basic and advanced text analysis.
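A minimal sketch of the modular design described above, assuming each module exposes a common process() interface; the module names mirror those listed in the abstract, but the implementations are placeholders, not LG-Starship code.

```python
# Toy modular pipeline: a text flows through a configurable chain of modules.

class Module:
    def process(self, text: str) -> str:
        raise NotImplementedError

class PreprocessingModule(Module):
    def process(self, text: str) -> str:
        return text.strip().lower()

class PosTaggingModule(Module):
    def process(self, text: str) -> str:
        # Placeholder: a real module would attach POS tags and lemmas.
        return text

class Pipeline:
    """Run a text through a sequence of analysis modules in order."""
    def __init__(self, modules: list[Module]):
        self.modules = modules

    def run(self, text: str) -> str:
        for module in self.modules:
            text = module.process(text)
        return text

pipeline = Pipeline([PreprocessingModule(), PosTaggingModule()])
print(pipeline.run("  Questa è una frase di prova.  "))
```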
Advances in Intelligent Systems and Computing, 2019
The paper describes a new Text Preprocessing Pipeline based on a hybrid approach which combines rule-based and stochastic techniques. The presented pipeline is part of a larger project, titled Big Data for Multi-Agent Specialized System, developed by Network Contacts in collaboration with the University of Salerno and other institutional partners. The aim of the project is to build a Hybrid Question Answering System composed of sets of Dialog Bots able to process large volumes of data. Due to the importance of unstructured textual data, a particular focus of the project is on the automatic processing of text. The paper describes the three main modules of the preprocessing pipeline: a Style Correction Module, a Clitic Decomposition Module, and a POS Tagging and Lemmatization Module.
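As a rough illustration of what a clitic decomposition step does for Italian verb forms, the sketch below strips clitic pronouns from the end of a token; the clitic list and the splitting rule are assumptions for the example, not the project's actual resources.

```python
# Toy rule-based clitic decomposition for Italian (e.g. "dammelo" -> stem + "me" + "lo").
CLITICS = ["mi", "ti", "si", "ci", "vi", "lo", "la", "li", "le", "ne",
           "me", "te", "ce", "ve", "se", "gli"]

def split_clitics(token: str) -> tuple[str, list[str]]:
    """Strip clitic pronouns from the end of a verb form, longest match first."""
    found = []
    changed = True
    while changed:
        changed = False
        for clitic in sorted(CLITICS, key=len, reverse=True):
            if token.endswith(clitic) and len(token) > len(clitic) + 2:
                token = token[: -len(clitic)]
                found.insert(0, clitic)
                changed = True
                break
    return token, found

print(split_clitics("dammelo"))   # ('damm', ['me', 'lo']) -- stem still needs lemmatization to "dare"
print(split_clitics("parlarne"))  # ('parlar', ['ne'])
```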
The present work faces the problem of the automatic classification and representation of unstructured texts in the Cultural Heritage domain. The research is carried out through a methodology based on the exploitation of machine-readable dictionaries of terminological simple words and multiword expressions. In the paper we discuss the design and population of a domain ontology that enters into a complex interaction with the electronic dictionaries and a network of local grammars. A Max-Ent classifier, based on the ontology schema, aims to assign to each analyzed text an object identifier related to the semantic dimension of the text. In this activity, the unstructured texts are processed through the semantically annotated dictionaries in order to discover the underlying structure which facilitates the classification. The final purpose is the automatic attribution of POIds to texts on the basis of the semantic features extracted from the texts through NLP ...
Stenotyping is a writing method used to transcribe spoken texts rapidly and in real time, using a mechanical or digital device equipped with a special keyboard. This device, called a stenotype, stenotype machine, shorthand machine or steno writer, is a specialized chorded keyboard or typewriter that allows one or more keys to be pressed simultaneously in a single stroke. Stenotyping requires the application of specific coded writing systems intended to limit the number of strokes and accelerate writing. Since high-speed strokes often generate a large number of typos, the creation of a stenotype writing method based on a non-casual combination of morphemes would rely on a defined list of elements to be combined (i.e., the morphemes of a language) together with a production syntax (that is, the morphological rules of a language). Therefore, in this paper, we show how to use NooJ linguistic resources and morphological grammars to build and implement a system for real-time typos automatic...
Due to the importance of the information it conveys, Medical Entity Recognition is one of the most investigated tasks in Natural Language Processing. Much research has aimed at solving the issue of Text Extraction, also in order to develop Decision Support Systems in the field of Health Care. In this paper, we propose a Lexicon-Grammar method for the automatic extraction from raw texts of the semantic information referring to medical entities and, furthermore, for the identification of the semantic categories that describe the located entities. Our work is grounded on an electronic dictionary of neoclassical formative elements of the medical domain, an electronic dictionary of nouns indicating drugs, body parts and internal body parts, and a grammar network composed of morphological and syntactical rules in the form of Finite-State Automata. The outcome of our research is an Extensible Markup Language (XML) annotated corpus of medical reports with information pertaining to t...
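The sketch below gives a flavor of dictionary-driven morphological recognition of medical terms built from neoclassical formative elements; the short prefix/suffix lists and their glosses are illustrative examples, not the authors' electronic dictionaries or grammars.

```python
# Toy morphological analysis: recognize terms built as <body-part element> + <process element>.
import re

PREFIXES = {"cardi": "heart", "gastr": "stomach", "hepat": "liver", "derm": "skin"}
SUFFIXES = {"itis": "inflammation", "ectomy": "surgical removal", "algia": "pain"}

def analyze_medical_term(token: str):
    """Return (body-part meaning, process meaning) if the token is built from a
    known formative prefix and suffix (with an optional linking 'o'), else None."""
    token = token.lower()
    for prefix, organ in PREFIXES.items():
        for suffix, process in SUFFIXES.items():
            if re.fullmatch(prefix + r"o?" + suffix, token):
                return organ, process
    return None

print(analyze_medical_term("gastritis"))     # ('stomach', 'inflammation')
print(analyze_medical_term("hepatectomy"))   # ('liver', 'surgical removal')
print(analyze_medical_term("keyboard"))      # None
```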
Ontologies are powerful instruments for the high-level description of concepts, especially for Semantic Web applications or feature classification. In some cases, ontologies have been used to create high-level descriptions of Virtual World objects in order to simplify the definition of a Virtual Reality. In this paper, we describe a hybrid methodology that, starting from a domain ontology, provides the basis for the creation of a Virtual Reality that can train workers in the correct use of protective equipment. In addition, starting from real data, we generate a Bayesian Network that calculates the probability of death or injury in case of misconduct.
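A minimal sketch of the kind of Bayesian reasoning mentioned above: the probability of injury given whether protective equipment was worn. The two-node structure and every probability value are illustrative placeholders, not the real data the paper starts from.

```python
# Toy Bayesian network with one parent (protection worn) and one child (injury).

# P(protection worn)
p_protection = {True: 0.8, False: 0.2}
# P(injury | protection worn) -- conditional probability table
p_injury_given_protection = {True: 0.02, False: 0.35}

def p_injury() -> float:
    """Marginal probability of injury, summing over the parent node."""
    return sum(p_protection[w] * p_injury_given_protection[w] for w in (True, False))

def p_no_protection_given_injury() -> float:
    """Bayes' rule: probability that protection was not worn, given an injury."""
    return (p_protection[False] * p_injury_given_protection[False]) / p_injury()

print(f"P(injury) = {p_injury():.3f}")                           # 0.086
print(f"P(no protection | injury) = {p_no_protection_given_injury():.3f}")  # 0.814
```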
2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), 2019
The paper describes a linguistic analysis performed over a corpus of Italian political speeches annotated with non-verbal tags. With the purpose of locating lexical features correlated with non-verbal phenomena, we exploited a set of electronic dictionaries of simple and compound words and a network of Local Grammars. A Distributional Analysis of the texts was then carried out in order to automatically explore the corpus.
Internet of Things, 2021
The video game industry represents one of the most profitable activities connected with entertainment and the visual arts. It involves a great number of different professionals who work together to create products expected to reach people in many countries. A substantial part of these people are teenagers, strongly attracted and influenced by video games. For these reasons, various labeling systems have been created. These systems are based on different criteria, but they have in common the presence of descriptors, or labels, which indicate the type of content in the game. These labels give an age range indicator to inform buyers and users of the most suitable age for the product. One of these systems is called PEGI, and we will primarily take it into consideration for the purposes of our study. The rating procedure includes a large process of manual control of each submitted game. In order to help this large and demanding process, we propose a system of video game rating based on automatic classification of the products performed over the “transcript” or script, files that contain the full transcription of the dialogues in a video game. The proposed automatic classification algorithm is based on specialized dictionaries enriched with a vector semantics algorithm, and is able to provide an age rating and a genre classification of video games. It works more efficiently in games with a substantial amount of dialogue. The experimentation of the proposed algorithm has returned encouraging results.
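The sketch below illustrates the dictionary-based part of such a rating scheme over a game transcript; the content dictionaries, category-to-age mapping and thresholds are illustrative placeholders, not the PEGI criteria or the paper's resources (which also include a vector semantics component).

```python
# Toy content rating: count descriptor hits in the dialogue and derive a minimum age.
from collections import Counter

CONTENT_DICTIONARIES = {
    "violence": {"kill", "blood", "weapon", "fight"},
    "fear": {"ghost", "horror", "scream"},
    "bad_language": {"damn"},
}
CATEGORY_MIN_AGE = {"violence": 16, "fear": 12, "bad_language": 12}

def rate_transcript(transcript: str) -> tuple[int, Counter]:
    """Return a minimum-age rating and per-category hit counts for a transcript."""
    tokens = [tok.strip(".,!?") for tok in transcript.lower().split()]
    hits = Counter()
    for category, lexicon in CONTENT_DICTIONARIES.items():
        hits[category] = sum(1 for tok in tokens if tok in lexicon)
    triggered = [c for c, n in hits.items() if n > 0]
    age = max((CATEGORY_MIN_AGE[c] for c in triggered), default=3)
    return age, hits

age, hits = rate_transcript("Take the weapon and fight! The ghost will scream.")
print(age, hits)  # 16 Counter({'violence': 2, 'fear': 2, 'bad_language': 0})
```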
Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection, 2015
This work proposes a new approach to deception detection, based on finding significant differences between liars and truth tellers through the analysis of their verbal and non-verbal behavior. The approach rests on the combination of two factors: multimodal data collection and T-pattern analysis. The multimodal approach has been acknowledged in the literature on deception detection and in several studies concerning the understanding of any communicative phenomenon. We believe a methodology such as T-pattern analysis can draw the greatest advantage from an approach that combines data coming from multiple signaling systems. In fact, T-pattern analysis is a recent methodology for the analysis of behavior that unveils the complex structure at the basis of the organization of human behavior. For this work, we conducted an experimental study and analyzed data related to a single subject. Results showed how T-pattern analysis made it possible to find differences between truth telling and lying. This work aims at making progress in the state of knowledge about deception detection, with the final goal of proposing a useful tool for the improvement of public security and well-being.