Alfio Gliozzo - Academia.edu

Papers by Alfio Gliozzo

KnowGL: Knowledge Generation and Linking from Text

Proceedings of the ... AAAI Conference on Artificial Intelligence, Jun 26, 2023

We propose KnowGL, a tool that allows converting text into structured relational data represented as a set of ABox assertions compliant with the TBox of a given Knowledge Graph (KG), such as Wikidata. We address this problem as a sequence generation task by leveraging pre-trained sequence-to-sequence language models, e.g. BART. Given a sentence, we fine-tune such models to detect pairs of entity mentions and jointly generate a set of facts consisting of the full set of semantic annotations for a KG, such as entity labels, entity types, and their relationships. To showcase the capabilities of our tool, we build a web application consisting of a set of UI widgets that help users to navigate through the semantic data extracted from a given input text. We make the KnowGL model available at https://huggingface.co/ibm/knowgl-large.

Span Selection Pre-training for Question Answering

arXiv (Cornell University), Sep 9, 2019

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pretrained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding. Span Selection Pre-Training (SSPT) poses cloze-like training instances, but rather than draw the answer from the model's parameters, it is selected from a relevant passage. We find significant and consistent improvements over both BERT BASE and BERT LARGE on multiple Machine Reading Comprehension (MRC) datasets. Specifically, our proposed model has strong empirical evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT LARGE by 3 F1 points on short answer prediction. We also show significant impact in HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point and outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.
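
The cloze-like instances described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `[BLANK]` token, the `make_instance` helper, and the exact-string matching are assumptions made only for illustration. The idea shown is that a term is blanked in a sentence and the answer is recorded as a span inside a separate passage rather than being recalled from model parameters.

```python
from typing import Dict, Optional

BLANK = "[BLANK]"  # hypothetical placeholder token, not taken from the paper


def make_instance(sentence: str, passage: str, answer: str) -> Optional[Dict]:
    """Build one cloze-like instance.

    The answer term is blanked in the sentence and located as a character
    span in the passage. Returns None when the term is missing from either.
    """
    if answer not in sentence:
        return None
    start = passage.find(answer)
    if start == -1:
        return None
    return {
        "query": sentence.replace(answer, BLANK, 1),
        "passage": passage,
        "answer_span": (start, start + len(answer)),
    }


instance = make_instance(
    "BERT is pretrained on two auxiliary tasks.",
    "Two auxiliary tasks, Masked Language Model and Next Sentence "
    "Prediction, are used to train BERT.",
    "BERT",
)
```

A model trained on such instances would be asked to point at `answer_span` in the passage given the blanked query, which is the span-selection framing the abstract contrasts with ordinary masked-word prediction.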

When Did that Happen? — Linking Events and Relations to Timestamps

Conference of the European Chapter of the Association for Computational Linguistics, Apr 23, 2012

We present work on linking events and fluents (i.e., relations that hold for certain periods of time) to temporal information in text, which is an important enabler for many applications such as timelines and reasoning. Previous research has mainly focused on temporal links for events, and we extend that work to include fluents as well, presenting a common methodology for linking both events and relations to timestamps within the same sentence. Our approach combines tree kernels with classical feature-based learning to exploit context and achieves competitive F1-scores on event-time linking, and comparable F1-scores for fluents. Our best systems achieve F1-scores of 0.76 on events and 0.72 on fluents. Note: the first author conducted this research during an internship at IBM Research.

Retrieval-Based Transformer for Table Augmentation

Data preparation, also called data wrangling, is considered one of the most expensive and time-consuming steps when performing analytics or building machine learning models. Preparing data typically involves collecting and merging data from complex, heterogeneous, and often large-scale data sources, such as data lakes. In this paper, we introduce a novel approach toward automatic data wrangling in an attempt to alleviate the effort of end-users, e.g. data analysts, in structuring dynamic views from data lakes in the form of tabular data. We aim to address table augmentation tasks, including row/column population and data imputation. Given a corpus of tables, we propose a retrieval augmented self-trained transformer model. Our self-learning strategy consists in randomly ablating tables from the corpus and training the retrieval-based model to reconstruct the original values or headers given the partial tables as input. We adopt this strategy to first train the dense neural retrieval model encoding table parts to vectors, and then the end-to-end model trained to perform table augmentation tasks. We test on EntiTables, the standard benchmark for table augmentation, as well as introduce a new benchmark to advance further research: WebTables. Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
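
The self-learning strategy in this abstract (randomly ablating a table and reconstructing the removed value or header) can be sketched roughly as follows. This is an illustrative assumption, not the paper's code: the `[MASK]` placeholder, the `ablate_cell` helper, and the list-of-rows table layout are all hypothetical.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder for the ablated cell

Table = List[List[str]]  # assumed layout: first row holds the headers


def ablate_cell(table: Table, rng: random.Random) -> Tuple[Table, Tuple[int, int], str]:
    """Blank one randomly chosen cell (header or value).

    Returns the partial table, the (row, column) position of the blanked
    cell, and the original content as the reconstruction target.
    """
    row = rng.randrange(len(table))
    col = rng.randrange(len(table[row]))
    partial = [list(r) for r in table]  # copy so the corpus table is untouched
    target = partial[row][col]
    partial[row][col] = MASK
    return partial, (row, col), target


table = [
    ["city", "country"],
    ["Rome", "Italy"],
    ["Paris", "France"],
]
partial, position, target = ablate_cell(table, random.Random(0))
```

Each `(partial, target)` pair is a training example that needs no manual labels, which is what makes the strategy self-supervised; a retrieval component would additionally fetch related tables from the corpus to help fill the blank.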

KGI: An Integrated Framework for Knowledge Intensive Language Tasks

In this paper, we present a system to showcase the capabilities of the latest state-of-the-art retrieval augmented generation models trained on knowledge-intensive language tasks, such as slot filling, open domain question answering, dialogue, and fact-checking. Moreover, given a user query, we show how the output from these different models can be combined to cross-examine the outputs of each other. Particularly, we show how accuracy in dialogue can be improved using the question answering model. We are also releasing all models used in the demo as a contribution of this paper. A short video demonstrating the system is available at https://ibm.box.com/v/emnlp2022-demo.

Robust Retrieval Augmented Generation for Zero-shot Slot Filling

arXiv (Cornell University), Aug 31, 2021

Automatically inducing high quality knowledge graphs from a given collection of documents still remains a challenging problem in AI. One way to make headway for this problem is through advancements in a related task known as slot filling. In this task, given an entity query in the form of [ENTITY, SLOT, ?], a system is asked to 'fill' the slot by generating or extracting the missing value exploiting evidence extracted from relevant passage(s) in the given document collection. The recent works in the field try to solve this task in an end-to-end fashion using retrieval-based language models. In this paper, we present a novel approach to zero-shot slot filling that extends dense passage retrieval with hard negatives and robust training procedures for retrieval augmented generation models. Our model reports large improvements on both T-REx and zsRE slot filling datasets, improving both passage retrieval and slot value generation, and ranking at the top-1 position in the KILT leaderboard. Moreover, we demonstrate the robustness of our system showing its domain adaptation capability on a new variant of the TACRED dataset for slot filling, through a combination of zero/few-shot learning. We release the source code and pre-trained models.

Beyond Jeopardy! Adapting Watson to new Domains using Distributional Semantics

Ingénierie Des Systèmes D'information, 2013

Semantic Technologies in IBM Watson

This paper describes a seminar course designed by IBM and Columbia University on the topic of Semantic Technologies, in particular as used in IBM Watson™, a large-scale Question Answering system which famously won at Jeopardy!® against two human grand champions. It was first offered at Columbia University during the 2013 spring semester, and will be offered at other institutions starting in the fall semester. We describe the course's first successful run and its unique features: a class centered around a specific industrial technology; a large-scale class project which student teams can choose to participate in and which serves as the basis for an open source project that will continue to grow each time the course is offered; publishable papers, demos and start-up ideas; evidence that the course can be self-evaluating, which makes it potentially appropriate for an online setting; and a unique model where a large company trains instructors and contributes to creating educational material at no charge to qualifying institutions.

Semantic Concept Discovery Over Event Data

ISWC (Posters, Demos & Industry Tracks), 2017

Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets

Lecture Notes in Computer Science, 2018

A Generative Model for Relation Extraction and Classification

arXiv (Cornell University), Feb 26, 2022

Relation extraction (RE) is an important information extraction task which provides essential information to many NLP applications such as knowledge base population and question answering. In this paper, we present a novel generative model for relation extraction and classification (which we call GREC), where RE is modeled as a sequence-to-sequence generation task. We explore various encoding representations for the source and target sequences, and design effective schemes that enable GREC to achieve state-of-the-art performance on three benchmark RE datasets. In addition, we introduce negative sampling and decoding scaling techniques which provide a flexible tool to tune the precision and recall performance of the model. Our approach can be extended to extract all relation triples from a sentence in one pass. Although the one-pass approach incurs certain performance loss, it is much more computationally efficient.

Query Focused Variable Centroid Vectors for Passage Re-ranking in Semantic Search

In this paper, we propose a new approach for passage re-ranking. We show that variable (i.e. non-static) centroid vectors for passages, created based on the given query, significantly improve passage re-ranking results compared to those obtained using static centroid vectors. We also show that the results are comparable to RWMD-Q, an existing (non-centroid based unsupervised) state of the art. The experiments reported are conducted on two different datasets in both neural and co-occurrence based distributional semantics settings.

Semantic Concept Discovery over Event Databases

Lecture Notes in Computer Science, 2018

Applying a Generic Sequence-to-Sequence Model for Simple and Effective Keyphrase Generation

arXiv (Cornell University), Jan 13, 2022

Populating Web Scale Knowledge Graphs using Distantly Supervised Relation Extraction and Validation

arXiv (Cornell University), Aug 21, 2019

In this paper, we propose a fully automated system to extend knowledge graphs using external information from web-scale corpora. The designed system leverages a deep learning based technology for relation extraction that can be trained by a distantly supervised approach. In addition to that, the system uses a deep learning approach for knowledge base completion by utilizing the global structure information of the induced KG to further refine the confidence of the newly discovered relations. The designed system does not require any effort for adaptation to new languages and domains as it does not use any hand-labeled data, NLP analytics, or inference rules. Our experiments, performed on a popular academic benchmark, demonstrate that the suggested system boosts the performance of relation extraction by a wide margin, reporting error reductions of 50%, resulting in relative improvement of up to 100%. Also, a web-scale experiment conducted to extend DBPedia with knowledge from Common Crawl shows that our system is not only scalable but also does not require any adaptation cost, while yielding substantial accuracy gain.
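
How a 50% error reduction can correspond to a relative improvement of up to 100% may be easier to see with a short worked relation. The abstract does not say which score is used or how error is defined, so the following is only a sketch under the assumption that error is one minus a score $a$ in $[0, 1]$; the symbols $a$, $a'$ and $e$ are introduced here for illustration.

```latex
% Assumption: error e = 1 - a for a baseline score a (0 < a < 1).
% Halving the error gives the new score a':
\[
  a' = 1 - \frac{e}{2} = 1 - \frac{1 - a}{2} = \frac{1 + a}{2}
\]
% Relative improvement over the baseline:
\[
  \frac{a' - a}{a} = \frac{1 - a}{2a}
\]
% This ratio equals 1 (a 100% relative improvement) when a = 1/3,
% and is smaller for any higher baseline score.
```

Under that assumption, the two figures are consistent whenever the baseline score on some setting is around one third; higher baselines would yield the same error reduction with a smaller relative gain.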

Dynamic Facet Selection by Maximizing Graded Relevance

A Study on Passage Re-ranking in Embedding based Unsupervised Semantic Search

arXiv (Cornell University), Apr 22, 2018

State of the art approaches for (embedding based) unsupervised semantic search exploit either compositional similarity (of a query and a passage) or pair-wise word (or term) similarity (from the query and the passage). By design, word based approaches do not incorporate similarity in the larger context (query/passage), while compositional similarity based approaches are usually unable to take advantage of the most important cues in the context. In this paper we propose a new compositional similarity based approach, called variable centroid vector (VCVB), that tries to address both of these limitations. We also present results using a different type of compositional similarity based approach by exploiting universal sentence embedding. We provide empirical evaluation on two different benchmarks.

Robust Retrieval Augmented Generation for Zero-shot Slot Filling

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

A Dataset for Web-Scale Knowledge Base Population

Lecture Notes in Computer Science, 2018

Retrieval-Based Transformer for Table Augmentation

arXiv (Cornell University), Jun 20, 2023
