Johann Gamper | Free University of Bozen-Bolzano
Papers by Johann Gamper
In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationship between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an ϵ-graph and a Δ-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows different summarization methods to be compared, such as piecewise constant approximation, piecewise aggregate approximation, or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.
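The size-versus-error relationship that VISOR visualizes can be reproduced in a few lines. The sketch below is a minimal illustration under our own assumptions (equal-width segments, segment means as the constants, maximum absolute error as ϵ) and is not VISOR's implementation; the function name pca_max_error is ours.

```python
import numpy as np

def pca_max_error(data, k):
    """Split `data` into k equal-width segments, replace each segment by its
    mean (a piecewise constant summary), and return the max absolute error."""
    segments = np.array_split(np.asarray(data, dtype=float), k)
    return max(float(np.abs(seg - seg.mean()).max()) for seg in segments)

# Sweep k to trace the error-vs-size relationship that an epsilon-graph shows.
rng = np.random.default_rng(0)
data = np.sin(np.linspace(0, 6 * np.pi, 500)) + rng.normal(0, 0.05, 500)
for k in (4, 8, 16, 32, 64):
    print(k, round(pca_max_error(data, k), 3))  # error shrinks as k grows
```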
Lecture Notes in Computer Science, 2022
arXiv (Cornell University), Feb 13, 2019
Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove that it correctly encodes snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations and significantly outperforms them on queries that involve aggregation.
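To make the setting concrete, the following sketch evaluates an aggregate under snapshot semantics by expanding interval-timestamped tuples into per-instant snapshots. It is a naive reference evaluation over integer timepoints, assuming half-open validity intervals; it is not the paper's K-relation model or middleware.

```python
from collections import defaultdict

# Interval-timestamped multiset relation: (value, start, end) with
# half-open validity [start, end) over integer timepoints.
sales = [(10, 1, 4), (20, 2, 5), (10, 3, 6)]

def snapshot_sum(relation):
    """Evaluate SUM(value) at every snapshot: expand each tuple to the
    timepoints at which it is valid, then aggregate per timepoint."""
    totals = defaultdict(int)
    for value, start, end in relation:
        for t in range(start, end):
            totals[t] += value
    return dict(sorted(totals.items()))

print(snapshot_sum(sales))
# {1: 10, 2: 30, 3: 40, 4: 30, 5: 10}
```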
arXiv (Cornell University), Feb 27, 2013
The problem of deriving lower and upper bounds for the edit distance between labelled undirected graphs has recently received increasing attention. However, only one algorithm has been proposed that allegedly computes not only an upper but also a lower bound for non-uniform metric edit costs and incorporates information about both node and edge labels. In this paper, we show that this algorithm is incorrect in the sense that, in general, it does not compute a lower bound. We present BRANCH, a corrected version of the algorithm that runs in O(n^5) time. We also develop a faster variant, BRANCHFAST, that runs in O(n^4) time and computes a lower bound which is only slightly less accurate than the one computed by BRANCH. An experimental evaluation shows that BRANCH and BRANCHFAST yield excellent runtime/accuracy trade-offs, as they outperform all existing competitors in terms of runtime or accuracy.
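The flavour of such bounds can be conveyed with a deliberately simplified sketch: decompose each graph into branches (a node together with the multiset of its incident edge labels), price branch-to-branch assignments with edge mismatches discounted by one half (each edge is shared by two branches), and solve the resulting linear sum assignment problem. This assumes uniform unit edit costs and is only an illustration of the branch idea, not the BRANCH algorithm itself.

```python
from collections import Counter
import numpy as np
from scipy.optimize import linear_sum_assignment

def multiset_edit(c1, c2):
    """Minimum number of unit-cost edits turning label multiset c1 into c2."""
    common = sum((c1 & c2).values())
    return max(sum(c1.values()), sum(c2.values())) - common

def branch_lower_bound(nodes1, edges1, nodes2, edges2):
    """Rough branch-style lower bound on GED under unit edit costs.
    nodes*: list of node labels; edges*: dict {(i, j): edge_label}."""
    def branches(nodes, edges):
        inc = [Counter() for _ in nodes]
        for (i, j), lbl in edges.items():
            inc[i][lbl] += 1
            inc[j][lbl] += 1
        return list(zip(nodes, inc))

    b1, b2 = branches(nodes1, edges1), branches(nodes2, edges2)
    n, m = len(b1), len(b2)
    BIG = 1e9
    cost = np.full((n + m, n + m), BIG)
    cost[n:, m:] = 0.0                                    # dummy-to-dummy is free
    for i, (l1, e1) in enumerate(b1):
        cost[i, m + i] = 1 + 0.5 * sum(e1.values())       # delete branch i
        for j, (l2, e2) in enumerate(b2):
            cost[i, j] = (l1 != l2) + 0.5 * multiset_edit(e1, e2)
    for j, (l2, e2) in enumerate(b2):
        cost[n + j, j] = 1 + 0.5 * sum(e2.values())       # insert branch j
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

g1 = (["A", "B", "C"], {(0, 1): "x", (1, 2): "y"})
g2 = (["A", "B"], {(0, 1): "x"})
print(branch_lower_bound(*g1, *g2))  # 2.0, matching the true GED in this toy case
```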
Proceedings of the VLDB Endowment, Sep 1, 2022
Mining time series motifs is a fundamental, yet expensive task in exploratory data analytics. In this paper, we therefore propose a fast method to find the top-k motifs with probabilistic guarantees. Our probabilistic approach is based on Locality Sensitive Hashing and allows us to prune most of the distance computations, leading to huge speedups. We improve on a straightforward application of LSH to time series data by developing a self-tuning algorithm that adapts to the data distribution. Furthermore, we include several optimizations to the algorithm, reducing redundant computations and leveraging the structure of time series data to speed up LSH computations. We prove the correctness of the algorithm and provide bounds on the cost of the basic operations it performs. An experimental evaluation shows that our algorithm is able to tackle time series of one billion points on a single CPU-based machine, performing orders of magnitude faster than the GPU-based state of the art.
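The pruning idea behind LSH-based motif search can be sketched as follows: hash every z-normalized subsequence with random hyperplanes and compute exact distances only for pairs that collide in a bucket. This bare-bones illustration assumes sign-random-projection hashing with a single hash table; the paper's self-tuning algorithm is considerably more refined.

```python
import numpy as np

rng = np.random.default_rng(42)

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-8)

def lsh_motif_candidates(series, w, n_planes=8):
    """Bucket all z-normalized subsequences of length w by the sign pattern
    of random projections; return candidate (i, j) pairs for exact checking."""
    subs = np.array([znorm(series[i:i + w]) for i in range(len(series) - w + 1)])
    planes = rng.normal(size=(n_planes, w))
    signatures = subs @ planes.T > 0                 # boolean LSH signatures
    buckets = {}
    for idx, sig in enumerate(map(tuple, signatures)):
        buckets.setdefault(sig, []).append(idx)
    pairs = set()
    for idxs in buckets.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                i, j = idxs[a], idxs[b]
                if abs(i - j) >= w:                  # skip trivial (overlapping) matches
                    pairs.add((i, j))
    return subs, pairs

# Plant the same sine-shaped motif twice in a noisy series.
series = np.concatenate([rng.normal(size=200), np.sin(np.linspace(0, 4 * np.pi, 60)),
                         rng.normal(size=200), np.sin(np.linspace(0, 4 * np.pi, 60))])
subs, pairs = lsh_motif_candidates(series, w=60)
if pairs:
    best = min(pairs, key=lambda p: np.linalg.norm(subs[p[0]] - subs[p[1]]))
    print("top motif candidate:", best)  # the two planted sine segments collide
```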
Lecture Notes in Computer Science, 2017
The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time, requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach, which is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach to achieve constant query time with sub-linear update and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
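For reference, the classical dense prefix sum technique that this line of work builds on answers a two-dimensional range-sum query with four look-ups via inclusion-exclusion. The sketch below shows that baseline, not the sparse variant proposed in the paper.

```python
import numpy as np

def build_prefix(a):
    """P[i, j] = sum of a[0..i-1, 0..j-1]; one padding row/column of zeros."""
    P = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    P[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum of a[r1..r2, c1..c2] (inclusive) in O(1) via inclusion-exclusion."""
    return P[r2 + 1, c2 + 1] - P[r1, c2 + 1] - P[r2 + 1, c1] + P[r1, c1]

a = np.arange(16).reshape(4, 4)
P = build_prefix(a)
assert range_sum(P, 1, 1, 2, 2) == a[1:3, 1:3].sum()  # exactly four look-ups
```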
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)
Extracting attribute-value information from unstructured product descriptions continues to be of vital importance in e-commerce applications. One of the most important product attributes is the brand, which highly influences customers' purchasing behaviour. Thus, it is crucial to accurately extract brand information while dealing with the main challenge of discovering new brand names. Under the open-world assumption, several approaches have adopted deep learning models to extract attribute values using the sequence tagging paradigm. However, they did not employ finer-grained data representations, such as character-level embeddings, which improve generalizability. In this paper, we introduce OpenBrand, a novel approach for discovering brand names. OpenBrand is a BiLSTM-CRF-Attention model with embeddings at different granularities. Such embeddings are learned using CNN and LSTM architectures to provide more accurate representations. We further propose a new dataset for brand value extraction, with a very challenging task on zero-shot extraction. We have tested our approach through extensive experiments and shown that it outperforms state-of-the-art models in brand name discovery.
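Character-level embeddings of the kind mentioned above can be obtained, for instance, by convolving over character embeddings and max-pooling, as in this hedged PyTorch sketch; the layer sizes, vocabulary, and class name CharCNNEmbedding are placeholders, and OpenBrand's actual architecture may differ.

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Word representation from its characters: embed chars, convolve,
    max-pool over positions. Hyperparameters here are illustrative."""
    def __init__(self, n_chars=128, char_dim=16, out_dim=32, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel, padding=1)

    def forward(self, char_ids):           # (batch, max_word_len)
        x = self.embed(char_ids)           # (batch, len, char_dim)
        x = self.conv(x.transpose(1, 2))   # (batch, out_dim, len)
        return x.max(dim=2).values         # (batch, out_dim): one vector per word

words = torch.tensor([[ord(c) for c in "nike "], [ord(c) for c in "adida"]])
print(CharCNNEmbedding()(words).shape)     # torch.Size([2, 32])
```

In a sequence tagger, such a per-word character vector would typically be concatenated with a word embedding before the BiLSTM-CRF layers.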
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Attribute value extraction from product profiles is essential for many applications, such as product retrieval, comparison, and recommendation. While existing techniques focus mainly on the extraction task, none of them deals with the problem of correcting wrong attribute values. In this paper, we propose CAVE, a novel system for attribute correction and enrichment using the Question Answering (QA) paradigm. CAVE learns information from both titles and attribute tables, using encoder and language models to correct attribute values. It also has the capability to enrich existing product descriptions with new attribute values extracted from titles. To the best of our knowledge, CAVE is the first system that allows users to experiment with a number of powerful QA models and compare their performance on attribute value correction using real-world datasets.
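Framing attribute correction as extractive QA can be tried in a few lines with an off-the-shelf reader, as shown below. This is a generic illustration using the Hugging Face transformers question-answering pipeline with its default model; CAVE's actual encoder and language models are not reproduced here, and the product data is made up.

```python
from transformers import pipeline

# Extractive QA: the product title is the context, the attribute is the question.
qa = pipeline("question-answering")

title = "Samsung Galaxy S21 5G 128GB Phantom Gray Unlocked Smartphone"
stored_value = "64GB"  # suspected-wrong attribute value from the product table

answer = qa(question="What is the storage capacity?", context=title)
if answer["answer"] != stored_value:
    print(f"correction candidate: {stored_value} -> {answer['answer']} "
          f"(confidence {answer['score']:.2f})")
```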
Lecture Notes in Computer Science, 2017
We develop a highly efficient access method, called Delta-Top-Index, to answer top-k subsequence matching queries over a time series data set. Compared to a naive implementation, our index has a storage cost that is up to two orders of magnitude smaller, while providing answers within microseconds. We demonstrate the efficiency and effectiveness of our technique in an experimental evaluation with real-world data.
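As a point of reference, the naive top-k subsequence matching that such an index accelerates scans every window and keeps the k best candidates in a heap. The sketch below is our own O(n·w) baseline and says nothing about the internals of the Delta-Top-Index:

```python
import heapq
import numpy as np

def topk_subsequence_match(series, query, k):
    """Scan all windows of len(query) and return the k smallest Euclidean
    distances with their offsets -- the brute-force baseline an index avoids."""
    w = len(query)
    dists = ((float(np.linalg.norm(series[i:i + w] - query)), i)
             for i in range(len(series) - w + 1))
    return heapq.nsmallest(k, dists)

rng = np.random.default_rng(7)
series = rng.normal(size=10_000)
query = series[4_200:4_264].copy()               # plant an exact match
print(topk_subsequence_match(series, query, 3))  # offset 4200 first, distance 0.0
```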
Information Systems, May 1, 2019
Prefix sums are a powerful technique to answer range-sum queries over multi-dimensional arrays in O(1) time by looking up a constant number of values in an array of size O(N), where N is the number of cells in the multi-dimensional array. However, the technique suffers from O(N) update and storage costs. Relative prefix sums address the high update costs by partitioning the array into blocks, thereby breaking the dependency between cells. In this paper, we present sparse prefix sums, which exploit data sparsity to reduce the high storage costs of relative prefix sums. By building upon relative prefix sums, sparse prefix sums achieve the same update complexity as relative prefix sums. The authors of relative prefix sums erroneously claimed that the update complexity is O(√N) for any number of dimensions. We show that this claim holds only for two dimensions, whereas the correct complexity for an arbitrary number of dimensions d is O(N^((d−1)/d)). To reduce the storage costs, the sparse prefix sums technique exploits sparsity in the data and avoids materializing prefix sums for empty rows and columns in the data grid; instead, look-up tables are used to preserve constant query time. Sparse prefix sums are the first approach to achieve O(1) query time with sub-linear storage costs for range-sum queries over sparse low-dimensional arrays. A thorough experimental evaluation shows that the approach works very well in practice. On the tested real-world data sets, the storage costs are reduced by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
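A quick sanity check of the corrected bound, under the usual assumption that a d-dimensional array with N cells has side length N^(1/d) in each dimension and that an update touches on the order of one (d−1)-dimensional slice of the block structure:

```latex
\[
  \bigl(N^{1/d}\bigr)^{\,d-1} \;=\; N^{\frac{d-1}{d}},
  \qquad d = 2:\; N^{\frac{2-1}{2}} = \sqrt{N}.
\]
```

So the O(√N) claim is recovered exactly when d = 2, consistent with the correction above.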
The graph edit distance (GED) is a widely used distance measure for attributed graphs. It has recently been shown that the problem of computing GED, which is an NP-hard optimization problem, can be formulated as a quadratic assignment problem (QAP). This formulation is useful, since it allows well-performing approximate heuristics for GED to be derived from existing techniques for QAP. In this paper, we focus on the case where the edit costs that underlie GED are quasimetric, which is the case in many applications of GED. We show that, for quasimetric edit costs, it is possible to reduce the size of the corresponding QAP formulation. An empirical evaluation shows that this reduction significantly speeds up the QAP-based approximate heuristics for GED.
2018 22nd International Conference Information Visualisation (IV)
arXiv, 2019
The graph edit distance (GED) is a flexible distance measure which is widely used for inexact graph matching. Since its exact computation is NP-hard, heuristics are used in practice. A popular approach is to obtain upper bounds for GED via transformations to the linear sum assignment problem with error-correction (LSAPE). Typically, local structures and distances between them are employed for carrying out this transformation, but recently machine learning techniques have also been used. In this paper, we formally define a unifying framework LSAPE-GED for transformations from GED to LSAPE. We introduce rings, a new kind of local structure that is able to capture a lot of the information encoded in the input graphs at a low computational cost. Furthermore, we propose two new ring-based heuristics, RING and RING-ML, which instantiate LSAPE-GED using the traditional and the machine-learning-based approach for transforming GED to LSAPE, respectively. Extensive experiments show that using ...
Computer Science and Information Systems, 2020
The VLDB Journal, 2019
Because of its flexibility, intuitiveness, and expressivity, the graph edit distance (GED) is one of the most widely used distance measures for labeled graphs. Since exactly computing GED is NP-hard, various heuristics have been proposed over the past years. They use techniques such as transformations to the linear sum assignment problem with error-correction, local search, and linear programming to approximate GED via upper or lower bounds. In this paper, we provide a systematic overview of the most important heuristics. Moreover, we empirically evaluate all compared heuristics within an integrated implementation.
Multimedia Tools and Applications, 2019
Recently, an increasing need for sophisticated multimedia analytics tools has been observed, triggered by the rapid growth of multimedia collections and by an increasing number of scientific fields embedding images in their studies. Although temporal data is ubiquitous and crucial in many applications, such tools typically do not support the analysis of data along the temporal dimension, especially for time periods. An appropriate visualization and comparison of period data associated with multimedia collections would help users to infer new information from such collections. In this paper, we present a novel multimedia analytics application for summarizing and analyzing temporal data from eye-tracking experiments. The application combines three different visual approaches: TIME•DIFF, the visual-information-seeking mantra, and multi-viewpoint. A qualitative evaluation with domain experts confirmed that our application helps decision makers to summarize and analyze multimedia collections containing period data.
Advances in Intelligent Systems and Computing, 2014
Research on database and information system technologies has been evolving rapidly over the last few years. Advances concern new data types, new management issues, and new kinds of architectures and systems. The 17th East-European Conference on Advances in Databases and Information Systems (ADBIS 2013), held on September 1-4, 2013 in Genova, Italy, and its associated satellite events aimed at covering emerging issues concerning such new trends in database and information system research. The aim of this paper is to present these events, their motivations and topics of interest, as well as to briefly outline the papers selected for presentation. The selected papers are included in the remainder of this volume.