DAGOBAH: Enhanced Scoring Algorithms for Scalable Annotations of Tabular Data
Related papers
JenTab: A Toolkit for Semantic Table Annotations
2021
Tables are a ubiquitous source of structured information. However, their use in automated pipelines is severely affected by conflicts in naming and issues like missing entries or spelling mistakes. The Semantic Web has proven itself a valuable tool in dealing with such issues, allowing the fusion of data from heterogeneous sources. Its usage requires the annotation of table elements like cells and columns with entities from existing knowledge graphs. Automating this semantic annotation, especially for noisy tabular data, remains a challenge, though. JenTab is a modular system that maps table contents onto large knowledge graphs like Wikidata. It starts by creating an initial pool of candidates for possible annotations. Over multiple iterations, context information is then used to eliminate candidates until, eventually, a single annotation is identified as the best match. Based on the SemTab2020 dataset, this paper presents various experiments to evaluate the performance of JenTab. This ...
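The iterative candidate-elimination loop the abstract describes can be sketched roughly as follows. This is a minimal illustration under our own assumptions (the context signal here is simply the column's most common candidate type), not JenTab's actual implementation:

```python
from collections import Counter

# Sketch: each cell starts with a pool of (entity, type) candidates.
# Per iteration, candidates that disagree with the column's dominant
# candidate type are eliminated until one annotation per cell remains.

def prune_candidates(columns):
    """columns: list of columns; each column is a list of cells;
    each cell is a list of (entity, entity_type) candidate pairs."""
    for column in columns:
        while any(len(cell) > 1 for cell in column):
            # Context signal: most frequent candidate type in the column.
            type_counts = Counter(t for cell in column for _, t in cell)
            best_type, _ = type_counts.most_common(1)[0]
            progressed = False
            for i, cell in enumerate(column):
                kept = [c for c in cell if c[1] == best_type]
                if 0 < len(kept) < len(cell):
                    column[i] = kept
                    progressed = True
            if not progressed:
                # Context can no longer discriminate: keep the first
                # candidate of each still-ambiguous cell as a fallback.
                for i, cell in enumerate(column):
                    column[i] = cell[:1]
    return [[cell[0][0] for cell in column] for column in columns]
```

For example, a column containing "Paris" (city or person candidates) next to unambiguous city cells would converge on the city reading after one iteration.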
Semantic Annotation for Tabular Data
ArXiv, 2020
Detecting semantic concept of columns in tabular data is of particular interest to many applications ranging from data integration, cleaning, search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural network based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose C², a column to concept mapper that is based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora in addition to two popular knowledge graphs to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of C² over available technique...
Semantic Concept Annotation for Tabular Data
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021
Determining the semantic concepts of columns in tabular data is of use for many applications ranging from data integration, cleaning, search to feature engineering and model building in machine learning. Several prior works have proposed supervised learning-based or heuristic-based approaches to semantic type annotation. These techniques suffer from poor generalizability over a large number of concepts or examples. Recent neural network based supervised learning methods generalize to different datasets but require large amounts of curated training data and also present scalability issues. Furthermore, none of the known methods works well for numerical data. We present C², a system that maps each column to a concept based on a maximum likelihood estimation approach through ensembles. It is able to effectively utilize vast amounts of, albeit somewhat noisy, openly available table corpora in addition to two popular knowledge graphs (Wikidata and DBpedia), to perform effective and efficient concept annotation for tabular data. Specifically, we utilize a collection of 32 million openly available webtables from several sources. We also present efficient indexing techniques for categorical string, numeric and mixed-type data, and novel techniques for table context utilization. We demonstrate the effectiveness and efficiency of C² over available techniques on 9 real-world datasets containing a wide variety of concepts.
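A maximum-likelihood concept choice over an ensemble of noisy evidence sources, as the abstract outlines, might look roughly like this. The sources, the averaging rule, and the smoothing constant are all illustrative assumptions, not the paper's actual estimator:

```python
import math

# Toy sketch: choose the concept c maximizing the log-likelihood of the
# observed column values under per-source estimates P(value | concept),
# combined across an ensemble of sources (e.g. a web-table corpus and a
# knowledge graph). All probabilities are hypothetical numbers.

def best_concept(values, sources, concepts, smoothing=1e-6):
    """sources: list of dicts mapping (concept, value) -> P(value | concept)."""
    scores = {}
    for concept in concepts:
        log_lik = 0.0
        for value in values:
            # Ensemble: average the per-source probability estimates.
            p = sum(src.get((concept, value), 0.0) for src in sources) / len(sources)
            log_lik += math.log(p + smoothing)  # smooth to avoid log(0)
        scores[concept] = log_lik
    return max(scores, key=scores.get)
```

In this setup a column whose values are frequent under "city" in both the corpus and the knowledge graph outscores a concept supported by only one noisy source.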
Semantically Conceptualizing and Annotating Tables
Lecture Notes in Computer Science, 2008
Enabling a system to automatically conceptualize and annotate a human-readable table is one way to create interesting semantic-web content. But exactly "how?" is not clear. With conceptualization and annotation in mind, we investigate a semantic-enrichment procedure as a way to turn syntactically observed table layout into semantically coherent ontological concepts, relationships, and constraints. Our semantic-enrichment procedure shows how to make use of auxiliary world knowledge to construct rich ontological structures and to populate these ontological structures with instance data. The system uses auxiliary knowledge (1) to recognize concepts and which data values belong to which concepts, (2) to discover relationships among concepts and which data-value combinations represent relationship instances, and (3) to discover constraints over the concepts and relationships that the data values and data-value combinations should satisfy. Experimental evaluations indicate that the automatic conceptualization and annotation processes perform well, yielding F-measures of 90% for concept recognition, 77% for relationship discovery, and 90% for constraint discovery in web tables selected from the geopolitical domain.
Exploiting a Web of Semantic Data for Interpreting Tables
2010
Much of the world's knowledge is contained in structured documents like spreadsheets, database relations and tables in documents found on the Web and in print. The information in these tables might be much more valuable if it could be appropriately exported or encoded in RDF, making it easier to share, understand and integrate with other information. This is especially true if it could be linked into the growing linked data cloud. We describe techniques to automatically infer a (partial) semantic model for information in tables using both table headings, if available, and the values stored in table cells and to export the data the table represents as linked data. The techniques have been prototyped for a subset of linked data that covers the core of Wikipedia.
Entity Linking to Knowledge Graphs to Infer Column Types and Properties
2019
This paper describes our broad goal of linking tabular data to semantic knowledge graphs, as well as our specific attempts at solving the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. Our efforts were split into a Candidate Generation and a Candidate Selection phase. The former involves searching for relevant entities in knowledge bases, while the latter involves picking the top candidate using various techniques such as heuristics (the ‘TF-IDF’ approach) and machine learning (the Neural Network Ranking model). We achieve an F1 score of 0.826 without any training data on the 400000+ cells to be annotated in Round 2 CEA challenge. On CTA and CPA variants, we score 1.099 and 0.790 respectively.
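The 'TF-IDF' candidate-selection heuristic mentioned above can be sketched as scoring each candidate entity by how strongly its descriptive tokens, weighted by TF-IDF over the candidate pool, overlap the cell's row context. The tokenization and candidate representation here are our own illustrative assumptions:

```python
import math
from collections import Counter

# Hypothetical sketch of TF-IDF candidate selection: each candidate entity
# is a bag of descriptive tokens (e.g. from its label and properties); we
# pick the candidate whose TF-IDF-weighted tokens best match the context.

def pick_candidate(context_tokens, candidates):
    """candidates: dict mapping candidate id -> list of descriptive tokens."""
    docs = list(candidates.values())
    n = len(docs)
    # Document frequency of each token across the candidate pool.
    df = Counter(tok for doc in docs for tok in set(doc))
    ctx = set(context_tokens)

    def score(cand_id):
        tf = Counter(candidates[cand_id])
        # Sum TF-IDF weights of tokens shared with the row context.
        return sum(tf[t] * math.log((1 + n) / (1 + df[t]))
                   for t in tf if t in ctx)

    return max(candidates, key=score)
```

Tokens shared by every candidate (here, an ambiguous surface form like "paris") get near-zero IDF, so discriminative context such as "france" or "capital" decides the ranking.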
Towards Capturing Contextual Semantic Information About Statements in Web Tables
2018
Data published on the Web is growing every year. However, most of this data does not have a semantic representation. Web tables are an example of structured data on the Web that has no clear semantics. While there is an emerging research effort in lifting tabular data into semantic web formats, most of the work is focused around entity recognition in tables with simple structure. In this work we explore how to capture the semantics of complex tables and transform them into a knowledge graph. These complex tables include contextual information about statements, such as time or provenance. Hence, we need to use contextualized knowledge graphs to represent the information of the tables. We explore how this contextual information is represented in tables, relate it to previous classifications of web tables, and show how to encode it in RDF using different approaches. Finally, we present a prototype tool that converts web tables from Wikipedia into RDF, trying to cover all existing approaches.
Using linked data to interpret tables
… Linked Data, held in …, 2010
Vast amounts of information are available in structured forms like spreadsheets, database relations, and tables found in documents and on the Web. We describe an approach that uses linked data to interpret such tables and associate their components with nodes in a reference linked data collection. Our proposed framework assigns a class (i.e. type) to table columns, links table cells to entities, and maps inferred relations between columns to properties. The resulting interpretation can be used to annotate tables, confirm existing facts in the linked data collection, and propose new facts to be added. Our implemented prototype uses DBpedia as the linked data collection and Wikitology for background knowledge. We evaluated its performance using a collection of tables from Google Squared, Wikipedia and the Web.
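The column-typing step in such frameworks is often approximated by a majority vote over the types of the entities the cells were linked to. A minimal sketch of that idea, under our own assumptions (a flat type label per cell and a simple support threshold), not the paper's actual algorithm:

```python
from collections import Counter

# Sketch: assign a class to a table column by majority vote over the
# types of the entities its cells were linked to.

def column_class(linked_types, min_support=0.5):
    """linked_types: list of type labels, one per successfully linked cell.
    Returns the majority type, or None if no type reaches min_support."""
    if not linked_types:
        return None
    typ, count = Counter(linked_types).most_common(1)[0]
    return typ if count / len(linked_types) >= min_support else None
```

The threshold keeps a column with no dominant type unannotated rather than mislabeled, which matters when cell linking is noisy.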
T2LD: Interpreting and Representing Tables as Linked Data
Proc. Poster and …, 2010
We describe a framework and prototype system for interpreting tables and extracting entities and relations from them, and producing a linked data representation of the table's contents. This can be used to annotate the table or to add new facts to the linked data collection.
Relation Extraction from Tables using Artificially Generated Metadata
2021
Relation Extraction (RE) from tables is the task of identifying relations between pairs of columns of a table. Generally, RE models for this task require labelled tables for training. These labelled tables can also be generated artificially from a Knowledge Graph (KG), which makes the cost to acquire them much lower in comparison to manual annotations. However, unlike real tables, these synthetic tables lack associated metadata, such as column headers and captions; this is because synthetic tables are created out of KGs that do not store such metadata. Meanwhile, previous works have shown that metadata is important for accurate RE from tables. To address this issue, we propose methods to artificially create some of this metadata for synthetic tables. Afterward, we experiment with a BERT-based model, in line with recently published works, that takes as input a combination of the proposed artificial metadata and table content. Our empirical results show that this leads to an improvemen...
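One simple way to create such artificial metadata is to reuse the labels of the KG entities and properties a synthetic table was sampled from as its column headers. The following is a hypothetical sketch of that idea; the label map and identifiers are toy stand-ins for a real KG, not the method proposed in the paper:

```python
# Sketch: generate artificial column headers for a synthetic table built
# from KG triples, by reusing the labels of the subject type and of the
# properties each column was sampled from.

def synthesize_headers(subject_type, property_ids, label_of):
    """Return a header row: the subject type's label first, then one
    header per property column, falling back to the raw id when the KG
    has no label for it."""
    return [label_of.get(subject_type, subject_type)] + [
        label_of.get(pid, pid) for pid in property_ids
    ]
```

For instance, a synthetic table of people with birth-place and death-date columns would receive the headers "human", "place of birth", and so on, giving the RE model header text comparable to what real web tables carry.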