Mangesh Bendre - Academia.edu
Papers by Mangesh Bendre
Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers. When working with large datasets, the imbalance issue can be further exacerbated, making it exceptionally difficult to train classifiers effectively. To address the problem, over-sampling techniques have been developed that linearly interpolate data instances between minority instances and their neighbors. However, in many real-world scenarios such as anomaly detection, minority instances are often dispersed diversely in the feature space rather than clustered together. Inspired by domain-agnostic data mix-up, we propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes. Developing such a framework is non-trivial: the challenges include source sample selection, mix-up strategy selection, and the coordination between the underlying model and mix-up strategies. To tackle these challenges, we formulate the problem of iterative data mix-up as a Markov decision process (MDP) that maps data attributes onto an augmentation strategy. To solve the MDP, we employ an actor-critic framework to adapt to the discrete-continuous decision space. This framework is utilized to train a data augmentation policy and design a reward signal that explores classifier uncertainty and encourages performance improvement, irrespective of the classifier's convergence. We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets using three different types of classifiers. The results of these experiments showcase the potential and promise of our framework in addressing imbalanced datasets with diverse minorities.
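As a rough illustration of the mixing step described in this abstract, the sketch below forms a synthetic sample as a convex combination of one minority and one majority feature vector. It is only a minimal sketch under stated assumptions: the names `mix_up`, `synthesize`, and `lam_low` are hypothetical, and the pairs and mixing ratio are drawn at random here, whereas the paper learns source selection and mix-up strategy with an actor-critic policy that is not reproduced.

```python
import random
from typing import List, Sequence


def mix_up(x_a: Sequence[float], x_b: Sequence[float], lam: float) -> List[float]:
    """Return the convex combination lam * x_a + (1 - lam) * x_b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b):
        raise ValueError("feature vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def synthesize(
    minority: Sequence[Sequence[float]],
    majority: Sequence[Sequence[float]],
    n_samples: int,
    rng: random.Random,
    lam_low: float = 0.5,
) -> List[List[float]]:
    """Mix randomly chosen minority/majority pairs.

    Assumption (not from the paper): lam is drawn uniformly from
    [lam_low, 1], so each synthetic point is weighted toward the
    minority sample.
    """
    samples = []
    for _ in range(n_samples):
        x_min = rng.choice(minority)
        x_maj = rng.choice(majority)
        lam = rng.uniform(lam_low, 1.0)
        samples.append(mix_up(x_min, x_maj, lam))
    return samples
```

For example, `mix_up([0.0, 2.0], [4.0, 6.0], 0.5)` gives the midpoint `[2.0, 4.0]` of the two vectors.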
arXiv (Cornell University), Oct 13, 2019
Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. The generation of these distributed representations is often a data- and computation-expensive process; yet the holistic analysis and adjustment of them after they have been created is still a developing area. In this paper, we first propose a very general quantitative measure for the presence of features in the embedding data based on whether they can be learned. We then devise a method to remove or alleviate undesired features in the embedding while retaining the essential structure of the data. We use a Domain Adversarial Network (DAN) to generate a non-affine transformation, but we add constraints to ensure the essential structure of the embedding is preserved. Our empirical results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art unsupervised algorithm on several data sets, including novel applications from the industry.
2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)
Lecture Notes in Computer Science, 2022
2022 IEEE International Conference on Big Data (Big Data)
We are witnessing the increasing availability of data across a spectrum of domains, necessitating the interactive ad-hoc management and analysis of this data, in order to put it to use. Unfortunately, interactive ad-hoc management of very large datasets presents a host of challenges, ranging from performance to interface usability. This thesis introduces a new research direction of manipulation of large datasets using an interactive interface and makes several steps towards this direction. In particular, we develop DataSpread, a tool that enables users to work with arbitrarily large datasets via a direct manipulation interface. DataSpread holistically unifies spreadsheets and relational databases to leverage the benefits of both. However, this holistic integration is not trivial due to the differences in the architecture and ideologies of the two paradigms: spreadsheets and databases. We have built a prototype of DataSpread, which, in addition to motivating the underlying challenges, demonstrates the feasibility and usefulness of this holistic integration. We focus on the following challenges encountered while developing DataSpread. (i) Representation: here, we address the challenges of flexibly representing ad-hoc spreadsheet data within a relational database; (ii) Indexing: here, we develop indexing data structures for supporting and maintaining access by position; (iii) Formula Computation: here, we introduce an asynchronous formula computation framework that addresses the challenge of ensuring consistency and interactivity at the same time; and (iv) Organization: here, we develop a framework to best organize data based on a workload, e.g., queries specified on the spreadsheet interface.
Proceedings of the VLDB Endowment, 2021
Spreadsheet systems are by far the most popular platform for data exploration on the planet, supporting millions of rows of data. However, exploring spreadsheets that are this large via operations such as scrolling or issuing formulae can be overwhelming and error-prone. Users easily lose context and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. To address these challenges, we introduce dynamic hierarchical overviews that are embedded alongside spreadsheets. Users can employ this overview to explore the data at various granularities, zooming in and out of the spreadsheet. They can issue formulae over data subsets without cumbersome scrolling or range selection, enabling users to gain a high or low-level perspective of the spreadsheet. An implementation of our dynamic hierarchical overview, NOAH, integrated within DataSpread, preserves spreadsheet semantics and look and feel, while introducing such enhancements. Our user studies demonstrate that NOAH makes it more intuitive, easier, and faster to navigate spreadsheet data compared to traditional spreadsheets like Microsoft Excel and spreadsheet plug-ins like Pivot Table, for a variety of exploration tasks; participants made fewer mistakes in NOAH while being faster in completing the tasks.
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DATASPREAD, a data exploration tool that unifies databases and spreadsheets. DATASPREAD continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, Postgres. DATASPREAD retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spre...
Spreadsheet software is the tool of choice for ad-hoc tabular data management, manipulation, querying, and visualization with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. We develop DATASPREAD, a system that holistically unifies databases and spreadsheets with a goal to work with massive spreadsheets: DATASPREAD retains all of the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the scalability and collaboration abilities of traditional relational databases. We design DATASPREAD with a spreadsheet front-end and a regular relational database back-end. To integrate spreadsheets and databases, in this paper, we develop a storage and indexing engine for spreadsheet data. We first formalize and study the problem of representing and manipulating spreadsheet data within a relational database. We demonstrate that identifying the optimal representat...
Spreadsheets are one of the most popular tools for ad-hoc exploration and analysis of data. Despite that, exploring and analyzing spreadsheet datasets that span more than a few screens via operations such as scrolling or issuing formulae is often overwhelming for end-users. Users easily lose context as they explore the data via scrolling and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. We propose integrating a navigation plug-in with spreadsheets to support the seamless exploration of large datasets that are increasingly the norm. Our interface, NOAH, developed using lessons from classical overview+detail interfaces, embeds a multi-granularity zoomable overview alongside the spreadsheet. Users can employ the overview to explore the data at various granularities. Furthermore, they can issue formulae over subsets of data without performing cumbersome scrolling or range selection operations, enabling users to gain a high or low...
Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make the first step towards this vision for relational databases: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a relational database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibl...
Spreadsheet systems enable users to store and analyze data in an intuitive and flexible interface. Yet the scale of data being analyzed often leads to spreadsheets hanging and freezing on small changes. We propose a new asynchronous formula computation framework: instead of freezing the interface we return control to users quickly to ensure interactivity, while computing the formulae in the background. To ensure consistency, we indicate formulae being computed in the background via visual cues on the spreadsheet. Our asynchronous computation framework introduces two novel challenges: (a) How do we identify dependencies for a given change in a bounded time? (b) How do we schedule computation to maximize the number of spreadsheet cells available to the user over time? We bound the dependency identification time by compressing the formula dependency graph lossily, a problem we show to be NP-Hard. A compressed dependency table enables us to quickly identify the spreadsheet cells that ne...
Spreadsheets are widely used for data management and analysis by individuals and teams with varying degrees of programming expertise across a spectrum of domains. While several papers have studied the prevalence of errors on spreadsheets and performed ethnographic studies on spreadsheet use, little is known about how spreadsheet users approach and address computational tasks on spreadsheets, especially on relatively large datasets. To understand how users analyze data on spreadsheets, we conducted a study consisting of eight common analytical tasks, with thirty-two participants. Participants developed an execution strategy for each task and then attempted to operationalize this strategy within the spreadsheet system. From examining the study results and transcripts, we identified the successful and unsuccessful strategies participants adopted in addressing the tasks. In general, we find that unsuccessful spreadsheet users had difficulties mapping spreadsheet models to their predeterm...
Proceedings of the 14th ACM International Conference on Web Search and Data Mining
In this paper, we demonstrate our Global Personalized Recommender (GPR) system for restaurants. GPR does not use any explicit reviews, ratings, or domain-specific metadata but rather leverages over 3 billion anonymized payment transactions to learn user and restaurant behavior patterns. The design and development of GPR have been challenging, primarily due to the scale and skew of the data. Our system supports over 450M cardholders from over 200 countries and 2.5M restaurants in over 35K cities worldwide. Additionally, GPR, being a global recommender system, needs to account for the regional variations in people's food choices and habits. We address the challenges by combining three different recommendation algorithms instead of using a single revolutionary model in the backend. The individual recommendation models are scalable and adapt to varying data skew challenges to ensure high-quality personalized recommendations for any user anywhere in the world.
2019 IEEE 35th International Conference on Data Engineering (ICDE)
Spreadsheet tools are ubiquitous for interactive ad-hoc data management and analysis. With increasing dataset sizes, spreadsheet tools fall short: they freeze during heavy computation within the sheet (interactivity); they are hard to navigate when datasets go beyond a certain size (navigability); they only support cell-at-a-time computation, severely limiting analysis capabilities (expressiveness). We have been developing DATASPREAD to holistically unify databases and spreadsheets to leverage the benefits of both, with a spreadsheet-like front-end and a database-like back-end. We demonstrate three key features of DATASPREAD to address the aforementioned spreadsheet scalability challenges in interactivity, navigability, and expressiveness. Our demonstration will let attendees perform typical analysis tasks on Microsoft Excel and DATASPREAD side-by-side, providing a clear understanding of the improvements offered by DATASPREAD over traditional spreadsheet tools.
2018 IEEE 34th International Conference on Data Engineering (ICDE), Apr 1, 2018
Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DATASPREAD, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make the first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 50% reduction in storage, and up to 50% reduction in formula evaluation time.
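To make the cascading update issue concrete, the sketch below separates a row's position from its stored record: records are keyed by a stable identifier, and a separate ordering maps positions to identifiers, so inserting or deleting a row never rewrites other stored records. This is only an illustrative sketch under assumptions; `PositionalSheet` and its methods are hypothetical names, and it does not reproduce the paper's positional access mechanisms. In particular, a plain Python list still shifts its in-memory entries on insert, so it does not give the access and modification bounds reported in the abstract.

```python
from typing import Dict, List


class PositionalSheet:
    """Rows addressed by position, stored under stable identifiers."""

    def __init__(self) -> None:
        self._order: List[int] = []           # position -> stable row id
        self._rows: Dict[int, List[str]] = {}  # stable row id -> cell values
        self._next_id = 0

    def insert_row(self, position: int, values: List[str]) -> int:
        """Insert a row at a 0-based position; existing records are untouched."""
        row_id = self._next_id
        self._next_id += 1
        self._rows[row_id] = list(values)
        self._order.insert(position, row_id)
        return row_id

    def delete_row(self, position: int) -> None:
        """Delete the row at a position without renumbering stored records."""
        row_id = self._order.pop(position)
        del self._rows[row_id]

    def get_row(self, position: int) -> List[str]:
        """Fetch a row by its current position."""
        return self._rows[self._order[position]]
```

The design choice illustrated here is that a naive table storing the row number inside every record must renumber all later records on each insert, while the ordering above confines the change to the position-to-identifier map.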
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Spreadsheet systems are used for storing and analyzing data across virtually every domain by programmers and nonprogrammers alike. While spreadsheet systems have continued to support storage and analysis of increasingly large scale datasets, they are prone to hanging and freezing while performing computations even on much smaller datasets. We present a benchmarking study that evaluates and compares the performance of three popular spreadsheet systems, Microsoft Excel, LibreOffice Calc, and Google Sheets, on a range of canonical spreadsheet computation operations. We find that spreadsheet systems lack interactivity for several operations, on datasets well below their documented scalability limits. We further evaluate whether spreadsheet systems adopt optimization techniques from the database community such as indexing, intelligent data layout, and incremental and shared computation, to efficiently execute computation operations. We outline several ways future spreadsheet systems can be redesigned to offer interactive response times on large datasets.
Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
Imbalanced datasets are commonly observed in various real-world applications, presenting signific... more Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers. When working with large datasets, the imbalanced issue can be further exacerbated, making it exceptionally difficult to train classifiers effectively. To address the problem, over-sampling techniques have been developed to linearly interpolating data instances between minorities and their neighbors. However, in many realworld scenarios such as anomaly detection, minority instances are often dispersed diversely in the feature space rather than clustered together. Inspired by domain-agnostic data mix-up, we propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes. It is non-trivial to develop such a framework, the challenges include source sample selection, mix-up strategy selection, and the coordination between the underlying model and mix-up strategies. To tackle these challenges, we formulate the problem of iterative data mix-up as a Markov decision process (MDP) that maps data attributes onto an augmentation strategy. To solve the MDP, we employ an actorcritic framework to adapt the discrete-continuous decision space. This framework is utilized to train a data augmentation policy and design a reward signal that explores classifier uncertainty and encourages performance improvement, irrespective of the classifier's convergence. We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets using three different types of classifiers. The results of these experiments showcase the potential and promise of our framework in addressing imbalanced datasets with diverse minorities.
arXiv (Cornell University), Oct 13, 2019
Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are alr... more Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. The generation of these distributed representations is often a dataand computation-expensive process; yet the holistic analysis and adjustment of them after they have been created is still a developing area. In this paper, we first propose a very general quantitatively measure for the presence of features in the embedding data based on if it can be learned. We then devise a method to remove or alleviate undesired features in the embedding while retaining the essential structure of the data. We use a Domain Adversarial Network (DAN) to generate a non-affine transformation, but we add constraints to ensure the essential structure of the embedding is preserved. Our empirical results demonstrate that the proposed algorithm significantly outperforms the state-of-art unsupervised algorithm on several data sets, including novel applications from the industry.
2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)
Lecture Notes in Computer Science, 2022
2022 IEEE International Conference on Big Data (Big Data)
We are witnessing the increasing availability of data across a spectrum of domains, necessitating... more We are witnessing the increasing availability of data across a spectrum of domains, necessitating the interactive ad-hoc management and analysis of this data, in order to put it to use. Unfortunately, interactive ad-hoc management of very large datasets presents a host of challenges, ranging from performance to interface usability. This thesis introduces a new research direction of manipulation of large datasets using an interactive interface and makes several steps towards this direction. In particular, we develop DataSpread, a tool that enables users to work with arbitrary large datasets via a direct manipulation interface. DataSpread holistically unifies spreadsheets and relational databases to leverage the benefits of both. However, this holistic integration is not trivial due to the differences in the architecture and ideologies of the two paradigms: spreadsheets and databases. We have built a prototype of DataSpread, which, in addition to motivating the underlying challenges, demonstrates the feasibility and usefulness of this holistic integration. We focus on the following challenges encountered while developing DataSpread. (i) Representation-here, we address the challenges of flexibly representing ad-hoc spreadsheet data within a relational database; (ii) Indexing-here, we develop indexing data structures for supporting and maintaining access by position; (iii) Formula Computation-here, we introduce an asynchronous formula computation framework that addresses the challenge of ensuring consistency and interactivity at the same time; and (iv) Organization-here, we develop a framework to best organize data based on a workload, e.g., queries specified on the spreadsheet interface.
Proceedings of the VLDB Endowment, 2021
Spreadsheet systems are by far the most popular platform for data exploration on the planet, supp... more Spreadsheet systems are by far the most popular platform for data exploration on the planet, supporting millions of rows of data. However, exploring spreadsheets that are this large via operations such as scrolling or issuing formulae can be overwhelming and error-prone. Users easily lose context and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. To address these challenges, we introduce dynamic hierarchical overviews that are embedded alongside spreadsheets. Users can employ this overview to explore the data at various granularities, zooming in and out of the spreadsheet. They can issue formulae over data subsets without cumbersome scrolling or range selection, enabling users to gain a high or low-level perspective of the spreadsheet. An implementation of our dynamic hierarchical overview, NOAH, integrated within DataSpread, preserves spreadsheet semantics and look and feel, while introducing such enhancements. Our user studies demonstrate that NOAH makes it more intuitive, easier, and faster to navigate spreadsheet data compared to traditional spreadsheets like Microsoft Excel and spreadsheet plug-ins like Pivot Table, for a variety of exploration tasks; participants made fewer mistakes in NOAH while being faster in completing the tasks.
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, ... more Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer sig-nificant power, expressivity, and efficiency over spreadsheet soft-ware for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DATA-SPREAD, a data exploration tool that unifies databases and spreadsheets. DATA-SPREAD continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, Postgres. DATA-SPREAD retains all the ad-vantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spre...
Spreadsheet software is the tool of choice for ad-hoc tabular data management, manipulation, quer... more Spreadsheet software is the tool of choice for ad-hoc tabular data management, manipulation, querying, and visualization with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. We develop DATASPREAD, a system that holistically unifies databases and spreadsheets with a goal to work with massive spreadsheets: DATASPREAD retains all of the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the scalability and collaboration abilities of traditional relational databases. We design DATASPREAD with a spreadsheet front-end and a regular relational database back-end. To integrate spreadsheets and databases, in this paper, we develop a storage and indexing engine for spreadsheet data. We first formalize and study the problem of representing and manipulating spreadsheet data within a relational database. We demonstrate that identifying the optimal representat...
Spreadsheets are one of the most popular tools for ad-hoc exploration and analysis of data. Despi... more Spreadsheets are one of the most popular tools for ad-hoc exploration and analysis of data. Despite that, exploring and analyzing spreadsheet datasets that span more than a few screens via operations such as scrolling or issuing formulae, is often overwhelming for end-users. Users easily lose context as they explore the data via scrolling and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. We propose integrating a navigation plug-in with spreadsheets to support the seamless exploration of large datasets that are increasingly the norm. Our interface, NOAH, developed using lessons from classical overview+detail interfaces, embeds a multi-granularity zoomable overview alongside the spreadsheet. Users can employ the overview to explore the data at various granularities. Furthermore, they can issue formulae over subsets of data without performing cumbersome scrolling or range selection operations, enabling users to gain a high or low...
We are witnessing the increasing availability of data across a spectrum of domains, necessitating... more We are witnessing the increasing availability of data across a spectrum of domains, necessitating the interactive ad-hoc management and analysis of this data, in order to put it to use. Unfortunately, interactive ad-hoc management of very large datasets presents a host of challenges, ranging from performance to interface usability. This thesis introduces a new research direction of manipulation of large datasets using an interactive interface and makes several steps towards this direction. In particular, we develop DataSpread, a tool that enables users to work with arbitrary large datasets via a direct manipulation interface. DataSpread holistically unifies spreadsheets and relational databases to leverage the benefits of both. However, this holistic integration is not trivial due to the differences in the architecture and ideologies of the two paradigms: spreadsheets and databases. We have built a prototype of DataSpread, which, in addition to motivating the underlying challenges, ...
Spreadsheet systems are by far the most popular platform for data exploration on the planet, supporting millions of rows of data. However, exploring spreadsheets that are this large via operations such as scrolling or issuing formulae can be overwhelming and error-prone. Users easily lose context and suffer from cognitive and mechanical burdens while issuing formulae on data spanning multiple screens. To address these challenges, we introduce dynamic hierarchical overviews that are embedded alongside spreadsheets. Users can employ this overview to explore the data at various granularities, zooming in and out of the spreadsheet. They can issue formulae over data subsets without cumbersome scrolling or range selection, enabling users to gain a high or low-level perspective of the spreadsheet. An implementation of our dynamic hierarchical overview, NOAH, integrated within DataSpread, preserves spreadsheet semantics and look and feel, while introducing such enhancements. Our user studie...
Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make the first step towards this vision for relational databases: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a relational database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibl...
Spreadsheet systems enable users to store and analyze data in an intuitive and flexible interface. Yet the scale of data being analyzed often leads to spreadsheets hanging and freezing on small changes. We propose a new asynchronous formula computation framework: instead of freezing the interface we return control to users quickly to ensure interactivity, while computing the formulae in the background. To ensure consistency, we indicate formulae being computed in the background via visual cues on the spreadsheet. Our asynchronous computation framework introduces two novel challenges: (a) How do we identify dependencies for a given change in a bounded time? (b) How do we schedule computation to maximize the number of spreadsheet cells available to the user over time? We bound the dependency identification time by compressing the formula dependency graph lossily, a problem we show to be NP-Hard. A compressed dependency table enables us to quickly identify the spreadsheet cells that ne...
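The asynchronous idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sheet` class and its method names are hypothetical, dependents are found with a plain breadth-first traversal rather than the compressed dependency table the abstract mentions, and background computation is modeled as explicit `step()` calls instead of a real worker thread. It also assumes every formula input already holds a value and that a cell's formula is set only once.

```python
from collections import defaultdict, deque


class Sheet:
    """Toy spreadsheet: edits mark dependents as pending; step() computes them later."""

    def __init__(self):
        self.values = {}                    # cell -> last computed value
        self.formulas = {}                  # cell -> (function, input cells)
        self.dependents = defaultdict(set)  # cell -> cells whose formulas read it
        self.pending = set()                # stale cells (would get a visual cue)
        self.queue = deque()                # pending cells awaiting computation

    def set_formula(self, cell, func, inputs):
        self.formulas[cell] = (func, list(inputs))
        for src in inputs:
            self.dependents[src].add(cell)
        self._mark_pending(cell)

    def set_value(self, cell, value):
        # The edit returns as soon as dependents are marked; nothing is recomputed here.
        self.values[cell] = value
        for dep in list(self.dependents[cell]):
            self._mark_pending(dep)

    def _mark_pending(self, start):
        # Breadth-first walk over transitive dependents (uncompressed graph).
        frontier = deque([start])
        while frontier:
            cell = frontier.popleft()
            if cell in self.pending:
                continue
            self.pending.add(cell)
            self.queue.append(cell)
            frontier.extend(self.dependents[cell])

    def step(self):
        """Compute one pending cell whose inputs are all up to date; return it or None."""
        for _ in range(len(self.queue)):
            cell = self.queue.popleft()
            func, inputs = self.formulas[cell]
            if any(i in self.pending for i in inputs):
                self.queue.append(cell)  # inputs still stale, retry later
                continue
            self.values[cell] = func(*[self.values[i] for i in inputs])
            self.pending.discard(cell)
            return cell
        return None


sheet = Sheet()
sheet.set_value("A1", 2)
sheet.set_formula("B1", lambda a: a * 10, ["A1"])
sheet.set_formula("C1", lambda b: b + 1, ["B1"])
while sheet.step() is not None:
    pass
sheet.set_value("A1", 3)  # B1 and C1 are now pending; their old values remain visible
```

In this sketch the order in which `step()` drains the queue stands in for the scheduling question the abstract raises; a real scheduler would choose that order to maximize the number of cells available to the user over time.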
Spreadsheets are widely used for data management and analysis by individuals and teams with varying degrees of programming expertise across a spectrum of domains. While several papers have studied the prevalence of errors on spreadsheets and performed ethnographic studies on spreadsheet use, little is known about how spreadsheet users approach and address computational tasks on spreadsheets, especially on relatively large datasets. To understand how users analyze data on spreadsheets, we conducted a study consisting of eight common analytical tasks, with thirty-two participants. Participants developed an execution strategy for each task and then attempted to operationalize this strategy within the spreadsheet system. From examining the study results and transcripts, we identified the successful and unsuccessful strategies participants adopted in addressing the tasks. In general, we find that unsuccessful spreadsheet users had difficulties mapping spreadsheet models to their predeterm...
Proceedings of the 14th ACM International Conference on Web Search and Data Mining
In this paper, we demonstrate our Global Personalized Recommender (GPR) system for restaurants. GPR does not use any explicit reviews, ratings, or domain-specific metadata but rather leverages over 3 billion anonymized payment transactions to learn user and restaurant behavior patterns. The design and development of GPR have been challenging, primarily due to the scale and skew of the data. Our system supports over 450M cardholders from over 200 countries and 2.5M restaurants in over 35K cities worldwide. Additionally, GPR, being a global recommender system, needs to account for the regional variations in people's food choices and habits. We address the challenges by combining three different recommendation algorithms instead of using a single revolutionary model in the backend. The individual recommendation models are scalable and adapt to varying data skew challenges to ensure high-quality personalized recommendations for any user anywhere in the world.
2019 IEEE 35th International Conference on Data Engineering (ICDE)
Spreadsheet tools are ubiquitous for interactive ad-hoc data management and analysis. With increasing dataset sizes, spreadsheet tools fall short: they freeze during heavy computation within the sheet (interactivity); they are hard to navigate when datasets go beyond a certain size (navigability); they only support cell-at-a-time computation, severely limiting analysis capabilities (expressiveness). We have been developing DATASPREAD to holistically unify databases and spreadsheets to leverage the benefits of both, with a spreadsheet-like front-end and a database-like back-end. We demonstrate three key features of DATASPREAD to address the aforementioned spreadsheet scalability challenges in interactivity, navigability, and expressiveness. Our demonstration will let attendees perform typical analysis tasks on Microsoft Excel and DATASPREAD side-by-side, providing a clear understanding of the improvements offered by DATASPREAD over traditional spreadsheet tools.
2018 IEEE 34th International Conference on Data Engineering (ICDE), Apr 1, 2018
Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DATASPREAD, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make the first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that do not suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 50% reduction in storage, and up to 50% reduction in formula evaluation time.
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Spreadsheet systems are used for storing and analyzing data across virtually every domain by programmers and nonprogrammers alike. While spreadsheet systems have continued to support storage and analysis of increasingly large-scale datasets, they are prone to hanging and freezing while performing computations even on much smaller datasets. We present a benchmarking study that evaluates and compares the performance of three popular spreadsheet systems, Microsoft Excel, LibreOffice Calc, and Google Sheets, on a range of canonical spreadsheet computation operations. We find that spreadsheet systems lack interactivity for several operations, on datasets well below their documented scalability limits. We further evaluate whether spreadsheet systems adopt optimization techniques from the database community, such as indexing, intelligent data layout, and incremental and shared computation, to efficiently execute computation operations. We outline several ways future spreadsheet systems can be redesigned to offer interactive response times on large datasets.