Pankaj Malhotra - Academia.edu (original) (raw)

Papers by Pankaj Malhotra

Research paper thumbnail of TimeNet: Pre-trained deep recurrent neural network for time series classification

Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extra... more Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence to sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from UCR TSC Archive and an industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddi...

Research paper thumbnail of Graph Neural Networks for Leveraging Industrial Equipment Structure: An application to Remaining Useful Life Estimation

ArXiv, 2020

Automated equipment health monitoring from streaming multisensor time-series data can be used to ... more Automated equipment health monitoring from streaming multisensor time-series data can be used to enable condition-based maintenance, avoid sudden catastrophic failures, and ensure high operational availability. We note that most complex machinery has a well-documented and readily accessible underlying structure capturing the inter-dependencies between sub-systems or modules. Deep learning models such as those based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) fail to explicitly leverage this potentially rich source of domain-knowledge into the learning procedure. In this work, we propose to capture the structure of a complex equipment in the form of a graph, and use graph neural networks (GNNs) to model multi-sensor time-series data. Using remaining useful life estimation as an application task, we evaluate the advantage of incorporating the graph structure via GNNs on the publicly available turbofan engine benchmark dataset. We observe that the propos...

Research paper thumbnail of TimeNet: Pre-trained deep recurrent neural network for time series classification

ArXiv, 2017

Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extra... more Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence to sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from UCR TSC Archive and an industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddi...

Research paper thumbnail of Meta-Learning for Few-Shot Time Series Classification

Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020

Deep neural networks (DNNs) have achieved state-of-the-art results on time series classification ... more Deep neural networks (DNNs) have achieved state-of-the-art results on time series classification (TSC) tasks. In this work, we focus on leveraging DNNs in the often-encountered practical scenario where access to labeled training data is difficult, and where DNNs would be prone to overfitting. We leverage recent advancements in gradientbased meta-learning, and propose an approach to train a residual neural network with convolutional layers as a meta-learning agent for few-shot TSC. The network is trained on a diverse set of fewshot tasks sampled from various domains (e.g. healthcare, activity recognition, etc.) such that it can solve a target task from another domain using only a small number of training samples from the target task. Most existing meta-learning approaches are limited in practice as they assume a fixed number of target classes across tasks. We overcome this limitation in order to train a common agent across domains with each domain having different number of target classes, we utilize a triplet-loss based learning procedure that does not require any constraints to be enforced on the number of classes for the few-shot TSC tasks. To the best of our knowledge, we are the first to use meta-learning based pre-training for TSC. Our approach sets a new benchmark for few-shot TSC, outperforming several strong baselines on few-shot tasks sampled from 41 datasets in UCR TSC Archive. We observe that pre-training under the meta-learning paradigm allows the network to quickly adapt to new unseen tasks with small number of labeled instances.

Research paper thumbnail of NISER: Normalized Item and Session Representations to Handle Popularity Bias

arXiv: Information Retrieval, 2019

The goal of session-based recommendation (SR) models is to utilize the information from past acti... more The goal of session-based recommendation (SR) models is to utilize the information from past actions (e.g. item/product clicks) in a session to recommend items that a user is likely to click next. Recently it has been shown that the sequence of item interactions in a session can be modeled as graph-structured data to better account for complex item transitions. Graph neural networks (GNNs) can learn useful representations for such session-graphs, and have been shown to improve over sequential models such as recurrent neural networks [14]. However, we note that these GNN-based recommendation models suffer from popularity bias: the models are biased towards recommending popular items, and fail to recommend relevant long-tail items (less popular or less frequent items). Therefore, these models perform poorly for the less popular new items arriving daily in a practical online setting. We demonstrate that this issue is, in part, related to the magnitude or norm of the learned item and se...

Research paper thumbnail of Batch-Constrained Distributional Reinforcement Learning for Session-based Recommendation

ArXiv, 2020

Most of the existing deep reinforcement learning (RL) approaches for session-based recommendation... more Most of the existing deep reinforcement learning (RL) approaches for session-based recommendations either rely on costly online interactions with real users, or rely on potentially biased rule-based or data-driven user-behavior models for learning. In this work, we instead focus on learning recommendation policies in the pure batch or offline setting, i.e. learning policies solely from offline historical interaction logs or batch data generated from an unknown and sub-optimal behavior policy, without further access to data from the real-world or user-behavior models. We propose BCD4Rec: Batch-Constrained Distributional RL for Session-based Recommendations. BCD4Rec builds upon the recent advances in batch (offline) RL and distributional RL to learn from offline logs while dealing with the intrinsically stochastic nature of rewards from the users due to varied latent interest preferences (environments). We demonstrate that BCD4Rec significantly improves upon the behavior policy as wel...

Research paper thumbnail of CauSeR

Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021

Recommender Systems (RS) tend to recommend more popular items instead of the relevant long-tail i... more Recommender Systems (RS) tend to recommend more popular items instead of the relevant long-tail items. Mitigating such popularity bias is crucial to ensure that less popular but relevant items are part of the recommendation list shown to the user. In this work, we study the phenomenon of popularity bias in session-based RS (SRS) obtained via deep learning (DL) models. We observe that DL models trained on the historical user-item interactions in session logs (having long-tailed item-click distributions) tend to amplify popularity bias. To understand the source of this bias amplification, we consider potential sources of bias at two distinct stages in the modeling process: i. the data-generation stage (user-item interactions captured as session logs), ii. the DL model training stage. We highlight that the popularity of an item has a causal effect on i. user-item interactions via conformity bias, as well as ii. item ranking from DL models via biased training process due to class (target item) imbalance. While most existing approaches in literature address only one of these effects, we consider a comprehensive causal inference framework that identifies and mitigates the effects at both stages. Through extensive empirical evaluation on simulated and real-world datasets, we show that our approach improves upon several strong baselines from literature for popularity bias and long-tailed classification. Ablation studies show the advantage of our comprehensive causal analysis to identify and handle bias in data generation as well as training stages. CCS CONCEPTS • Information systems → Recommender systems.

Research paper thumbnail of On Practical Aspects of Using RNNs for Fault Detection in Sparsely-labeled Multi-sensor Time Series

Annual Conference of the PHM Society, 2018

In this work, we attempt to address two practical limitations when using Recurrent Neural Network... more In this work, we attempt to address two practical limitations when using Recurrent Neural Networks (RNNs) as classifiers for fault detection using multi-sensor time series data: Firstly, there is a need to understand the classification decisions of RNNs. It is difficult for engineers to diagnose the faults when multiple sensors are being monitored at once. The faults detected by RNNs can be better understood if the sensors carrying the faulty signature are known. To achieve this, we propose a sensor relevance scoring (SRS) approach that scores each sensor based on its contribution to the classification decision by leveraging the hidden layer activations of RNNs. Secondly, lack of labeled training data due to infrequent faults (or otherwise) makes it difficult to train RNNs in a supervised manner. We pre-train an RNN on large unlabeled data via an autoencoder in an unsupervised manner, and then finetune the RNN for the fault detection task using small amount of labeled training data....

Research paper thumbnail of Recurrent Neural Networks for Online Remaining Useful Life Estimation in Ion Mill Etching System

Annual Conference of the PHM Society, 2018

We describe the approach – submitted as part of the 2018 PHM Data Challenge – for estimating time... more We describe the approach – submitted as part of the 2018 PHM Data Challenge – for estimating time-to-failure or Remaining Useful Life (RUL) of Ion Mill Etching Systems in an online fashion using data from multiple sensors. RUL estimation from multi-sensor data can be considered as learning a regression function that maps a multivariate time series to a real-valued number, i.e. the RUL. We use a deep Recurrent Neural Network (RNN) to learn the metric regression function from multivariate time series. We highlight practical aspects of the RUL estimation problem in this data challenge such as i) multiple operating conditions, ii) lack of knowledge of exact onset of failure or degradation, iii) different operational behavior across tools in terms of range of values of parameters, etc. We describe our solution in the context of these challenges. Importantly, multiple modes of failure are possible in an ion mill etching system; therefore, it is desirable to estimate the RUL with respect t...

Research paper thumbnail of Transfer Learning for Clinical Time Series Analysis Using Deep Neural Networks

Journal of Healthcare Informatics Research, 2019

Deep neural networks have shown promising results for various clinical prediction tasks such as d... more Deep neural networks have shown promising results for various clinical prediction tasks such as diagnosis, mortality prediction, predicting duration of stay in hospital, etc. However, training deep networks such as those based on Recurrent Neural Networks (RNNs) requires large labeled data, significant hyper-parameter tuning effort and expertise, and high computational resources. In this work, we investigate as to what extent can transfer learning address these issues when using deep RNNs to model multivariate clinical time series. We consider two scenarios for transfer learning using RNNs: i) domain-adaptation, i.e., leveraging a deep RNN-namely, TimeNet-pre-trained for feature extraction on time series from diverse domains, and adapting it for feature extraction and subsequent target tasks in healthcare domain, ii) task-adaptation, i.e., pre-training a deep RNN-namely, HealthNet-on diverse tasks in healthcare domain, and adapting it to new target tasks in the same domain. We evaluate the above approaches on publicly available MIMIC-III benchmark dataset, and demonstrate that (a) computationally-efficient linear models trained using features extracted via pre-trained RNNs outperform or, in the worst case, perform as well as deep RNNs and statistical hand-crafted features based models trained specifically for target task; (b) models obtained by adapting pre-trained models for target tasks are significantly more robust to the size of labeled data compared to task-specific RNNs, while also being computationally efficient. We, therefore, conclude that pre-trained deep models like TimeNet and HealthNet allow leveraging the advantages of deep learning for clinical time series analysis tasks, while also minimize dependence on hand-crafted features, deal robustly with scarce labeled training data scenarios without overfitting, as well as reduce dependence on expertise and resources required to train deep networks from scratch (e.g. neural network architecture selection and hyper-parameter tuning efforts).

Research paper thumbnail of Graph-Parallel Entity Resolution using LSH & IMM

In this paper we describe graph-based parallel algorithms for entity resolution that improve over... more In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison load is better distributed across processors especially in the presence of severely skewed bucket sizes. We analyze the BCP and RCP approaches analytically as well as empirically using a large synthetically generated dataset. We generalize the lessons learned from our experience and submit that the RCP approach is also applicable in many similar applications that rely on LSH or related grouping strategies to minimize pair-wise comparisons.

Research paper thumbnail of Incremental Entity Resolution from Linked Documents

In many government applications we often find that information about entities, such as persons, a... more In many government applications we often find that information about entities, such as persons, are available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises having multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide his driving-licence number while applying for a passport, or vice-versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph-traversal, locality-sensitive hashing, iterative match-merge, and graph-clustering to discover unique entities based on a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally while having to re-resolve only a small subset of a previously resolved entity-document collection. We present performance and quality results on two data-sets: a real-world database of companies and a large synthetically generated 'population' database. We also demonstrate benefit of using inter-document references for clustering in the form of enhanced recall of documents for resolution.

Research paper thumbnail of TimeNet: Pre-trained deep recurrent neural network for time series classification

Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extra... more Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence to sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from UCR TSC Archive and an industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddi...

Research paper thumbnail of Graph Neural Networks for Leveraging Industrial Equipment Structure: An application to Remaining Useful Life Estimation

ArXiv, 2020

Automated equipment health monitoring from streaming multisensor time-series data can be used to ... more Automated equipment health monitoring from streaming multisensor time-series data can be used to enable condition-based maintenance, avoid sudden catastrophic failures, and ensure high operational availability. We note that most complex machinery has a well-documented and readily accessible underlying structure capturing the inter-dependencies between sub-systems or modules. Deep learning models such as those based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) fail to explicitly leverage this potentially rich source of domain-knowledge into the learning procedure. In this work, we propose to capture the structure of a complex equipment in the form of a graph, and use graph neural networks (GNNs) to model multi-sensor time-series data. Using remaining useful life estimation as an application task, we evaluate the advantage of incorporating the graph structure via GNNs on the publicly available turbofan engine benchmark dataset. We observe that the propos...

Research paper thumbnail of TimeNet: Pre-trained deep recurrent neural network for time series classification

ArXiv, 2017

Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extra... more Inspired by the tremendous success of deep Convolutional Neural Networks as generic feature extractors for images, we propose TimeNet: a deep recurrent neural network (RNN) trained on diverse time series in an unsupervised manner using sequence to sequence (seq2seq) models to extract features from time series. Rather than relying on data from the problem domain, TimeNet attempts to generalize time series representation across domains by ingesting time series from several domains simultaneously. Once trained, TimeNet can be used as a generic off-the-shelf feature extractor for time series. The representations or embeddings given by a pre-trained TimeNet are found to be useful for time series classification (TSC). For several publicly available datasets from UCR TSC Archive and an industrial telematics sensor data from vehicles, we observe that a classifier learned over the TimeNet embeddings yields significantly better performance compared to (i) a classifier learned over the embeddi...

Research paper thumbnail of Meta-Learning for Few-Shot Time Series Classification

Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020

Deep neural networks (DNNs) have achieved state-of-the-art results on time series classification ... more Deep neural networks (DNNs) have achieved state-of-the-art results on time series classification (TSC) tasks. In this work, we focus on leveraging DNNs in the often-encountered practical scenario where access to labeled training data is difficult, and where DNNs would be prone to overfitting. We leverage recent advancements in gradientbased meta-learning, and propose an approach to train a residual neural network with convolutional layers as a meta-learning agent for few-shot TSC. The network is trained on a diverse set of fewshot tasks sampled from various domains (e.g. healthcare, activity recognition, etc.) such that it can solve a target task from another domain using only a small number of training samples from the target task. Most existing meta-learning approaches are limited in practice as they assume a fixed number of target classes across tasks. We overcome this limitation in order to train a common agent across domains with each domain having different number of target classes, we utilize a triplet-loss based learning procedure that does not require any constraints to be enforced on the number of classes for the few-shot TSC tasks. To the best of our knowledge, we are the first to use meta-learning based pre-training for TSC. Our approach sets a new benchmark for few-shot TSC, outperforming several strong baselines on few-shot tasks sampled from 41 datasets in UCR TSC Archive. We observe that pre-training under the meta-learning paradigm allows the network to quickly adapt to new unseen tasks with small number of labeled instances.

Research paper thumbnail of NISER: Normalized Item and Session Representations to Handle Popularity Bias

arXiv: Information Retrieval, 2019

The goal of session-based recommendation (SR) models is to utilize the information from past acti... more The goal of session-based recommendation (SR) models is to utilize the information from past actions (e.g. item/product clicks) in a session to recommend items that a user is likely to click next. Recently it has been shown that the sequence of item interactions in a session can be modeled as graph-structured data to better account for complex item transitions. Graph neural networks (GNNs) can learn useful representations for such session-graphs, and have been shown to improve over sequential models such as recurrent neural networks [14]. However, we note that these GNN-based recommendation models suffer from popularity bias: the models are biased towards recommending popular items, and fail to recommend relevant long-tail items (less popular or less frequent items). Therefore, these models perform poorly for the less popular new items arriving daily in a practical online setting. We demonstrate that this issue is, in part, related to the magnitude or norm of the learned item and se...

Research paper thumbnail of Batch-Constrained Distributional Reinforcement Learning for Session-based Recommendation

ArXiv, 2020

Most of the existing deep reinforcement learning (RL) approaches for session-based recommendation... more Most of the existing deep reinforcement learning (RL) approaches for session-based recommendations either rely on costly online interactions with real users, or rely on potentially biased rule-based or data-driven user-behavior models for learning. In this work, we instead focus on learning recommendation policies in the pure batch or offline setting, i.e. learning policies solely from offline historical interaction logs or batch data generated from an unknown and sub-optimal behavior policy, without further access to data from the real-world or user-behavior models. We propose BCD4Rec: Batch-Constrained Distributional RL for Session-based Recommendations. BCD4Rec builds upon the recent advances in batch (offline) RL and distributional RL to learn from offline logs while dealing with the intrinsically stochastic nature of rewards from the users due to varied latent interest preferences (environments). We demonstrate that BCD4Rec significantly improves upon the behavior policy as wel...

Research paper thumbnail of CauSeR

Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021

Recommender Systems (RS) tend to recommend more popular items instead of the relevant long-tail i... more Recommender Systems (RS) tend to recommend more popular items instead of the relevant long-tail items. Mitigating such popularity bias is crucial to ensure that less popular but relevant items are part of the recommendation list shown to the user. In this work, we study the phenomenon of popularity bias in session-based RS (SRS) obtained via deep learning (DL) models. We observe that DL models trained on the historical user-item interactions in session logs (having long-tailed item-click distributions) tend to amplify popularity bias. To understand the source of this bias amplification, we consider potential sources of bias at two distinct stages in the modeling process: i. the data-generation stage (user-item interactions captured as session logs), ii. the DL model training stage. We highlight that the popularity of an item has a causal effect on i. user-item interactions via conformity bias, as well as ii. item ranking from DL models via biased training process due to class (target item) imbalance. While most existing approaches in literature address only one of these effects, we consider a comprehensive causal inference framework that identifies and mitigates the effects at both stages. Through extensive empirical evaluation on simulated and real-world datasets, we show that our approach improves upon several strong baselines from literature for popularity bias and long-tailed classification. Ablation studies show the advantage of our comprehensive causal analysis to identify and handle bias in data generation as well as training stages. CCS CONCEPTS • Information systems → Recommender systems.

Research paper thumbnail of On Practical Aspects of Using RNNs for Fault Detection in Sparsely-labeled Multi-sensor Time Series

Annual Conference of the PHM Society, 2018

In this work, we attempt to address two practical limitations when using Recurrent Neural Network... more In this work, we attempt to address two practical limitations when using Recurrent Neural Networks (RNNs) as classifiers for fault detection using multi-sensor time series data: Firstly, there is a need to understand the classification decisions of RNNs. It is difficult for engineers to diagnose the faults when multiple sensors are being monitored at once. The faults detected by RNNs can be better understood if the sensors carrying the faulty signature are known. To achieve this, we propose a sensor relevance scoring (SRS) approach that scores each sensor based on its contribution to the classification decision by leveraging the hidden layer activations of RNNs. Secondly, lack of labeled training data due to infrequent faults (or otherwise) makes it difficult to train RNNs in a supervised manner. We pre-train an RNN on large unlabeled data via an autoencoder in an unsupervised manner, and then finetune the RNN for the fault detection task using small amount of labeled training data....

Research paper thumbnail of Recurrent Neural Networks for Online Remaining Useful Life Estimation in Ion Mill Etching System

Annual Conference of the PHM Society, 2018

We describe the approach – submitted as part of the 2018 PHM Data Challenge – for estimating time... more We describe the approach – submitted as part of the 2018 PHM Data Challenge – for estimating time-to-failure or Remaining Useful Life (RUL) of Ion Mill Etching Systems in an online fashion using data from multiple sensors. RUL estimation from multi-sensor data can be considered as learning a regression function that maps a multivariate time series to a real-valued number, i.e. the RUL. We use a deep Recurrent Neural Network (RNN) to learn the metric regression function from multivariate time series. We highlight practical aspects of the RUL estimation problem in this data challenge such as i) multiple operating conditions, ii) lack of knowledge of exact onset of failure or degradation, iii) different operational behavior across tools in terms of range of values of parameters, etc. We describe our solution in the context of these challenges. Importantly, multiple modes of failure are possible in an ion mill etching system; therefore, it is desirable to estimate the RUL with respect t...

Research paper thumbnail of Transfer Learning for Clinical Time Series Analysis Using Deep Neural Networks

Journal of Healthcare Informatics Research, 2019

Deep neural networks have shown promising results for various clinical prediction tasks such as d... more Deep neural networks have shown promising results for various clinical prediction tasks such as diagnosis, mortality prediction, predicting duration of stay in hospital, etc. However, training deep networks such as those based on Recurrent Neural Networks (RNNs) requires large labeled data, significant hyper-parameter tuning effort and expertise, and high computational resources. In this work, we investigate as to what extent can transfer learning address these issues when using deep RNNs to model multivariate clinical time series. We consider two scenarios for transfer learning using RNNs: i) domain-adaptation, i.e., leveraging a deep RNN-namely, TimeNet-pre-trained for feature extraction on time series from diverse domains, and adapting it for feature extraction and subsequent target tasks in healthcare domain, ii) task-adaptation, i.e., pre-training a deep RNN-namely, HealthNet-on diverse tasks in healthcare domain, and adapting it to new target tasks in the same domain. We evaluate the above approaches on publicly available MIMIC-III benchmark dataset, and demonstrate that (a) computationally-efficient linear models trained using features extracted via pre-trained RNNs outperform or, in the worst case, perform as well as deep RNNs and statistical hand-crafted features based models trained specifically for target task; (b) models obtained by adapting pre-trained models for target tasks are significantly more robust to the size of labeled data compared to task-specific RNNs, while also being computationally efficient. We, therefore, conclude that pre-trained deep models like TimeNet and HealthNet allow leveraging the advantages of deep learning for clinical time series analysis tasks, while also minimize dependence on hand-crafted features, deal robustly with scarce labeled training data scenarios without overfitting, as well as reduce dependence on expertise and resources required to train deep networks from scratch (e.g. neural network architecture selection and hyper-parameter tuning efforts).

Research paper thumbnail of Graph-Parallel Entity Resolution using LSH & IMM

In this paper we describe graph-based parallel algorithms for entity resolution that improve over... more In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison load is better distributed across processors especially in the presence of severely skewed bucket sizes. We analyze the BCP and RCP approaches analytically as well as empirically using a large synthetically generated dataset. We generalize the lessons learned from our experience and submit that the RCP approach is also applicable in many similar applications that rely on LSH or related grouping strategies to minimize pair-wise comparisons.

Research paper thumbnail of Incremental Entity Resolution from Linked Documents

In many government applications we often find that information about entities, such as persons, a... more In many government applications we often find that information about entities, such as persons, are available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises having multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide his driving-licence number while applying for a passport, or vice-versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph-traversal, locality-sensitive hashing, iterative match-merge, and graph-clustering to discover unique entities based on a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally while having to re-resolve only a small subset of a previously resolved entity-document collection. We present performance and quality results on two data-sets: a real-world database of companies and a large synthetically generated 'population' database. We also demonstrate benefit of using inter-document references for clustering in the form of enhanced recall of documents for resolution.