Megha Khosla - Academia.edu

Papers by Megha Khosla

Cornell University - arXiv, Jun 28, 2022

The problem of interpreting the decisions of machine learning models is well-researched and important. We are interested in a specific type of machine learning model that deals with graph data, called graph neural networks (GNNs). Evaluating interpretability approaches for GNNs specifically is known to be challenging due to the lack of a commonly accepted benchmark. Given a GNN model, several interpretability approaches exist to explain it, with diverse (sometimes conflicting) evaluation methodologies. In this paper, we propose a benchmark for evaluating explainability approaches for GNNs called BAGEL. In BAGEL, we first propose four diverse GNN explanation evaluation regimes: 1) faithfulness, 2) sparsity, 3) correctness, and 4) plausibility. We reconcile multiple evaluation metrics in the existing literature and cover diverse notions for a holistic evaluation. Our graph datasets range from citation networks and document graphs to graphs from molecules and proteins. We conduct an extensive empirical study on four GNN models and nine post-hoc explanation approaches for node and graph classification tasks. We open both the benchmarks and reference implementations and make them available at https://github.com/Mandeep-Rathee/Bagel-benchmark.

Cornell University - arXiv, Jun 23, 2021

Graph neural networks (GNNs) have achieved great success on various tasks and fields that require relational modeling. GNNs aggregate node features using the graph structure as inductive biases, resulting in flexible and powerful models. However, GNNs remain hard to interpret as the interplay between node features and graph structure is only implicitly learned. In this paper, we propose a novel method called KEdge for explicitly sparsifying the underlying graph by removing unnecessary neighbors. Our key idea is based on a tractable method for sparsification using the Hard Kumaraswamy distribution that can be used in conjunction with any GNN model. KEdge learns edge masks in a modular fashion and is trained with any GNN, allowing for gradient-based optimization in an end-to-end fashion. We demonstrate through extensive experiments that our model KEdge can prune a large proportion of the edges with only a minor effect on the test accuracy. Specifically, in the PubMed dataset, KEdge learns to drop more than 80% of the edges with an accuracy drop of merely ≈ 2%, showing that graph structure has only a small contribution in comparison to node features. Finally, we also show that KEdge effectively counters the over-smoothing phenomenon in deep GNNs by maintaining good task performance with increasing GNN layers.
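
To make the Hard Kumaraswamy idea in the abstract above concrete, here is a minimal sketch of one common construction: sample from a Kumaraswamy(a, b) distribution through its inverse CDF, stretch the sample to an interval slightly wider than [0, 1], and clip. The function name, the stretch bounds `low` and `high`, and the use of Python's `random` module are assumptions made for illustration; this is not taken from the KEdge implementation, which learns the distribution parameters per edge with gradient-based training.

```python
import random


def sample_hard_kumaraswamy(a, b, low=-0.1, high=1.1, rng=random):
    """Draw one mask value in [0, 1] from a stretched and clipped Kumaraswamy(a, b).

    low/high are illustrative stretch bounds (assumed, not from the paper).
    """
    u = rng.random()  # uniform sample in [0, 1)
    # Inverse CDF of Kumaraswamy(a, b), whose CDF is F(x) = 1 - (1 - x^a)^b.
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    # Stretch the sample to (low, high) with low < 0 and high > 1 ...
    stretched = low + (high - low) * k
    # ... then clip, so that exactly 0 (edge dropped) and exactly 1 (edge kept)
    # both occur with non-zero probability.
    return min(1.0, max(0.0, stretched))
```

Because the clipped sample can be exactly 0, a mask value of this kind can remove an edge outright rather than merely down-weighting it, which is what makes the resulting graph sparse.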

Proceedings of the ACM Web Conference 2022

Neural document ranking approaches, specifically transformer models, have achieved impressive gains in ranking performance. However, query processing using such over-parameterized models is both resource and time intensive. In this paper, we propose the Fast-Forward index, a simple vector forward index that facilitates ranking documents using interpolation of lexical and semantic scores, as a replacement for contextual re-rankers and dense indexes based on nearest neighbor search. Fast-Forward indexes rely on efficient sparse models for retrieval and merely look up pre-computed dense transformer-based vector representations of documents and passages in constant time for fast CPU-based semantic similarity computation during query processing. We propose index pruning and theoretically grounded early stopping techniques to improve the query processing throughput. We conduct extensive large-scale experiments on TREC-DL datasets and show improvements over hybrid indexes in performance and query processing efficiency using only CPUs. Fast-Forward indexes can provide superior ranking performance using interpolation due to the complementary benefits of lexical and semantic similarities. CCS CONCEPTS • Information systems → Retrieval models and ranking.
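
The interpolation step described above can be sketched as follows. This is a simplified illustration, not the authors' code: the function name, the weight `alpha`, the dictionary standing in for the forward index, and the use of a plain dot product as the semantic score are all assumptions.

```python
def interpolated_ranking(lexical_scores, doc_vectors, query_vector, alpha=0.5):
    """Rank documents by alpha * lexical score + (1 - alpha) * semantic score.

    lexical_scores: dict doc_id -> score from a sparse retriever (assumed given).
    doc_vectors:    dict doc_id -> pre-computed dense vector (the forward index).
    query_vector:   dense vector of the query, encoded once per query.
    """
    scored = []
    for doc_id, lexical in lexical_scores.items():
        vector = doc_vectors[doc_id]  # constant-time look-up, no model inference
        semantic = sum(q * d for q, d in zip(query_vector, vector))
        scored.append((alpha * lexical + (1.0 - alpha) * semantic, doc_id))
    scored.sort(reverse=True)  # highest interpolated score first
    return [doc_id for _, doc_id in scored]
```

Only the documents returned by the sparse retriever are scored, so the per-query cost is one query encoding plus a look-up and a dot product per candidate.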

2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA)

Graph Neural Networks (GNNs), which generalize traditional deep neural networks to graph data, have achieved state-of-the-art performance on several graph analytical tasks like node classification, link prediction or graph classification. We focus on how trained GNN models could leak information about the member nodes that they were trained on. In particular, we focus on answering the question: given a graph, can we determine which nodes were used for training the GNN model? We operate in the inductive setting for node classification, which means that none of the nodes in the test set (or the non-member nodes) were seen during the training. We propose a simple attack model which is able to distinguish between the member and non-member nodes while having only black-box access to the model. We experimentally compare the privacy risks of four representative GNN models. Our results show that all the studied GNN models are vulnerable to privacy leakage. While in traditional machine learning models overfitting is considered the main cause of such leakage, we show that in GNNs the additional structural information is the major contributing factor. CCS CONCEPTS • Computing methodologies → Neural networks; • Security and privacy → Privacy protections.

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically sequentially processed and parameters are synchronously updated. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by partitioning the input space instead, in order to scale to massive text corpora while not sacrificing the performance of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes, and models trained by our distributed procedure, which requires 1/10 of the time taken by the baseline approach, achieve comparable and sometimes up to 45% better performance on a variety of NLP benchmarks. Finally, we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.

Big Data, 2022

Mining health data can lead to faster medical decisions, improvement in the quality of treatment, disease prevention, and reduced cost, and it drives innovative solutions within the healthcare sector. However, health data is highly sensitive and subject to regulations such as the General Data Protection Regulation (GDPR), which aims to ensure patients' privacy. Anonymization or removal of patient identifiable information, though the most conventional way, is the first important step to adhere to the regulations and incorporate privacy concerns. In this paper, we review the existing anonymization techniques and their applicability to various types (relational and graph-based) of health data. Besides, we provide an overview of possible attacks on anonymized data. We illustrate via a reconstruction attack that anonymization, though necessary, is not sufficient to address patient privacy and discuss methods for protecting against such attacks. Finally, we discuss tools that can be used to achieve anonymization.

ArXiv, 2021

When applying outlier detection in settings where data is sensitive, mechanisms which guarantee the privacy of the underlying data are needed. The k-nearest neighbors (k-NN) algorithm is a simple and one of the most effective methods for outlier detection. So far, there have been no attempts made to develop a differentially private (ε-DP) approach for k-NN based outlier detection. Existing approaches often relax the notion of ε-DP and employ other methods than k-NN. We propose a method for k-NN based outlier detection by separating the procedure into a fitting step on reference inlier data and then applying the outlier classifier to new data. We achieve ε-DP for both the fitting algorithm and the outlier classifier with respect to the reference data by partitioning the dataset into a uniform grid, which yields low global sensitivity. Our approach yields nearly optimal performance on real-world data with varying dimensions when compared to the non-private versions of k-NN.

ArXiv, 2021

Neural approaches, specifically transformer models, for ranking documents have delivered impressive gains in ranking performance. However, query processing using such over-parameterized models is both resource and time intensive. Consequently, to keep query processing costs manageable, trade-offs are made to reduce the number of documents to be re-ranked or consider leaner models with fewer parameters. In this paper, we propose the fast-forward index, a simple vector forward index that facilitates ranking documents using interpolation-based ranking models. Fast-forward indexes pre-compute the dense transformer-based vector representations of documents and passages for fast CPU-based semantic similarity computation during query processing. We propose theoretically grounded index pruning and early stopping techniques to improve the query-processing throughput using fast-forward indexes. We conduct extensive large-scale experiments over the TREC-DL datasets and show up to 75% improveme...

ArXiv, 2021

We study the classical weighted perfect matching problem for bipartite graphs, sometimes referred to as the assignment problem: given a weighted bipartite graph G = (U ∪ V, E) with weights w : E → R, we are interested in finding the maximum matching in G with the minimum/maximum weight. In this work we present a new and arguably simpler analysis of one of the earliest techniques developed for solving the assignment problem, namely the auction algorithm. Using our analysis technique we present tighter and improved bounds on the runtime complexity for finding an approximate minimum weight perfect matching in k-left regular sparse bipartite graphs.
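
For readers unfamiliar with the auction algorithm named above, the following is a minimal sketch of the textbook forward auction for the maximum-weight assignment on a dense n × n value matrix. It is an assumption-laden illustration, not the analysis or variant studied in the paper (which concerns approximate minimum-weight matchings in sparse k-left regular graphs); the function name, the dense-matrix input, and the default `eps` are chosen here for brevity.

```python
def auction_assignment(values, eps=0.01):
    """Forward auction for max-weight assignment; values[i][j] = value of object j to bidder i.

    Returns (assigned, prices) where assigned[i] is the object won by bidder i.
    With eps > 0 the total value is within n * eps of the optimum.
    """
    n = len(values)
    prices = [0.0] * n
    owner = [None] * n      # owner[j]: bidder currently holding object j
    assigned = [None] * n   # assigned[i]: object currently held by bidder i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Find the best and second-best net value (value minus price) for bidder i.
        best_j, best, second = None, float("-inf"), float("-inf")
        for j in range(n):
            net = values[i][j] - prices[j]
            if net > best:
                best_j, second, best = j, best, net
            elif net > second:
                second = net
        if second == float("-inf"):  # only one object exists
            second = best
        # Raise the price by the bidding increment; eps guarantees progress.
        prices[best_j] += best - second + eps
        previous = owner[best_j]
        if previous is not None:     # outbid the previous holder
            assigned[previous] = None
            unassigned.append(previous)
        owner[best_j] = i
        assigned[i] = best_j
    return assigned, prices
```

For integer values, choosing eps below 1/n makes the returned assignment exactly optimal; a minimum-weight version can be obtained by negating the values.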

Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 2021

Learning-to-rank (LTR) is a class of supervised learning techniques that apply to ranking problems dealing with a large number of features. The popularity and widespread application of LTR models in prioritizing information in a variety of domains makes their scrutability vital in today's landscape of fair and transparent learning systems. However, limited work exists that deals with interpreting the decisions of learning systems that output rankings. In this paper we propose a model-agnostic local explanation method that seeks to identify a small subset of input features as explanation to the ranked output for a given query. We introduce new notions of validity and completeness of explanations specifically for rankings, based on the presence or absence of selected features, as a way of measuring goodness. We devise a novel optimization problem to maximize validity directly and propose greedy algorithms as solutions. In extensive quantitative experiments we show that our approach outperforms other model-agnostic explanation approaches across pointwise, pairwise and listwise LTR models in validity while not compromising on completeness.

Companion of the The Web Conference 2018 on The Web Conference 2018 - WWW '18, 2018

Recent works in recommendation systems have focused on diversity in recommendations as an important aspect of recommendation quality. In this work we argue that the post-processing algorithms aimed at only improving diversity among recommendations lead to discrimination among the users. We introduce the notion of user fairness which has been overlooked in literature so far and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among the users.

Companion Proceedings of the Web Conference 2020, 2020

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack in generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.

Studies in Computational Intelligence, 2018

Most real-world graphs collected from the Web, like Web graphs and social network graphs, are incomplete. This leads to inaccurate estimates of graph properties based on link analysis such as PageRank. In this paper we focus on studying such deviations in ordering/ranking imposed by PageRank over incomplete graphs. We first show that deviations in rankings induced by PageRank are indeed possible. We measure how much a ranking, induced by PageRank, on an input graph could deviate from the original unseen graph. More importantly, we are interested in conceiving a measure that approximates the rank correlation among them without any knowledge of the original graph. To this end we formulate the HAK measure that is based on computing the impact redistribution of PageRank according to the local graph structure. Finally, we perform extensive experiments on both real-world Web and social network graphs with more than 100M vertices and 10B edges as well as synthetic graphs to showcase the utility of HAK.
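
As background for the abstract above, here is a minimal power-iteration sketch of standard PageRank, the quantity whose rankings the paper studies on incomplete graphs. It does not implement the HAK measure. The function name, the adjacency-dictionary input, the damping factor of 0.85, and the uniform handling of dangling nodes are assumptions made for this illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Standard PageRank by power iteration.

    out_links: dict node -> list of nodes it links to; every linked node
               must itself appear as a key (possibly with an empty list).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation mass shared equally by all nodes.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its mass uniformly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank
```

Running this on a crawled subgraph and on the full graph generally gives different scores for the same node, because every missing edge changes how rank mass is redistributed; that difference in induced ordering is the deviation the abstract refers to.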

Machine Learning and Knowledge Discovery in Databases, 2020

In this paper we propose and study the novel problem of explaining node embeddings by finding embedded human interpretable subspaces in already trained unsupervised node representation embeddings. We use an external knowledge base that is organized as a taxonomy of human-understandable concepts over entities as a guide to identify subspaces in node embeddings learned from an entity graph derived from Wikipedia. We propose a method that, given a concept, finds a linear transformation to a subspace where the structure of the concept is retained. Our initial experiments show that we obtain low error in finding fine-grained concepts.

Applied Network Science, 2019

Most real-world graphs collected from the Web, like Web graphs and social network graphs, are partially discovered or crawled. This leads to inaccurate estimates of graph properties based on link analysis such as PageRank. In this paper we focus on studying such deviations in ordering/ranking imposed by PageRank over crawled graphs. We first show that deviations in rankings induced by PageRank are indeed possible. We measure how much a ranking, induced by PageRank, on an input graph could deviate from the original unseen graph. More importantly, we are interested in conceiving a measure that approximates the rank correlation among them without any knowledge of the original graph. To this end we formulate the HAK measure that is based on computing the impact redistribution of PageRank according to the local graph structure. We further propose an algorithm that identifies connected subgraphs over the input graph for which the relative ordering is preserved. Finally, we perform extensive experi...

Algorithmica, 2019

Hash tables are ubiquitous in computer science for efficient access to large datasets. However, there is always a need for approaches that offer compact memory utilisation without substantial degradation of lookup performance. Cuckoo hashing is an efficient technique for creating hash tables with high space utilisation that offers a guaranteed constant access time. We are given n locations and m items. Each item has to be placed in one of the k ≥ 2 locations chosen by k random hash functions. By allowing more than one choice for a single item, cuckoo hashing resembles multiple choice allocation schemes. In addition it supports dynamically changing the location of an item among its possible locations. We propose and analyse an insertion algorithm for cuckoo hashing that runs in linear time with high probability and in expectation. Previous work on total allocation time has analysed breadth first search, and it was shown to be linear only in expectation. Our algorithm finds an assignment (with probability 1) whenever it exists. In contrast, the other known insertion method, known as random walk insertion, may run indefinitely even for a solvable instance. We also present experimental results comparing the performance of our algorithm with the random walk method, also for the case when each location can hold more than one item. As a corollary we obtain a linear time algorithm (with high probability and in expectation) for finding perfect matchings in a special class of sparse random bipartite graphs. We support this by performing experiments on a real world large dataset for finding maximum matchings in general large bipartite
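
The random walk insertion method that the abstract uses as a point of comparison can be sketched as below. This is the baseline, not the insertion algorithm proposed in the paper. The function name, the `choices` callback standing in for the k hash functions, the single-item slots, and the `max_steps` cut-off are assumptions for this illustration.

```python
import random


def random_walk_insert(table, item, choices, max_steps=1000, rng=random):
    """Insert item into a cuckoo table by random-walk eviction.

    table:   list of slots, each None or holding one item.
    choices: function item -> list of its k candidate slot indices
             (stands in for the k hash functions).
    Returns None on success, or the item left unplaced after max_steps.
    """
    current = item
    for _ in range(max_steps):
        candidates = choices(current)
        # Take a free candidate slot if one exists.
        for slot in candidates:
            if table[slot] is None:
                table[slot] = current
                return None
        # Otherwise evict the occupant of a random candidate slot and
        # continue the walk with the evicted item.
        slot = rng.choice(candidates)
        current, table[slot] = table[slot], current
    return current  # walk cut off; this item still needs a slot
```

The step limit is what a practical implementation needs because, as the abstract notes, the walk itself has no guaranteed termination even when a valid placement exists.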

ArXiv, 2021

With the increasing popularity of Graph Neural Networks (GNNs) in several sensitive applications like healthcare and medicine, concerns have been raised over the privacy aspects of trained GNNs. More notably, GNNs are vulnerable to privacy attacks, such as membership inference attacks, even if only black-box access to the trained model is granted. To build defenses, differential privacy has emerged as a mechanism to disguise the sensitive data in training datasets. Following the strategy of Private Aggregation of Teacher Ensembles (PATE), recent methods leverage a large ensemble of teacher models. These teachers are trained on disjoint subsets of private data and are employed to transfer knowledge to a student model, which is then released with privacy guarantees. However, splitting graph data into many disjoint training sets may destroy the structural information and adversely affect accuracy. We propose a new graph-specific scheme of releasing a student GNN, which avoids splitting ...

IEEE Transactions on Knowledge and Data Engineering

With the ever-increasing popularity and applications of graph neural networks, several proposals have been made to interpret and understand the decisions of a GNN model. Explanations for a GNN model differ in principle from other input settings. It is important to attribute the decision to input features and other related instances connected by the graph structure. We find the previous explanation generation approaches, which maximize the mutual information between the label distribution produced by the GNN model and the explanation, to be restrictive. Specifically, existing approaches do not enforce explanations to be predictive, sparse, or robust to input perturbations. In this paper, we lay down some of the fundamental principles that an explanation method for GNNs should follow and introduce a metric, fidelity, as a measure of the explanation's effectiveness. We propose a novel approach, Zorro, based on the principles from rate-distortion theory, that uses a simple combinatorial procedure to optimize for fidelity. Extensive experiments on real and synthetic datasets reveal that Zorro produces sparser, more stable, and more faithful explanations than existing GNN explanation approaches.

Micro RNA or miRNA is a highly conserved class of non-coding RNA that plays an important role in many diseases. Identifying miRNA-disease associations can pave the way for better clinical diagnosis and finding potential drug targets. We propose a biologically-motivated data-driven approach for the miRNA-disease association prediction, which overcomes the data scarcity problem by exploiting information from multiple data sources. The key idea is to enrich the existing miRNA/disease-protein-coding gene (PCG) associations via a Message Passing framework, followed by the use of disease ontology information for further feature filtering. The enriched and filtered PCG associations are then used to construct the interconnected miRNA-PCG-disease network to train a structural deep network embedding (SDNE) model. Finally, the pre-trained embedding and the biologically relevant features from the miRNA family and disease semantic similarity are concatenated to form the pair input representation...

Constraint Satisfaction Problems (CSPs) are defined over a set of variables whose state must satisfy a number of constraints. We study a class of algorithms called Message Passing Algorithms, which aim at finding the probability distribution of the variables over the space of satisfying assignments. These algorithms involve passing local messages (according to some message update rules) over the edges of a factor graph constructed corresponding to the CSP. We focus on the Belief Propagation (BP) algorithm, which finds exact solution marginals for tree-like factor graphs. However, convergence and exactness cannot be guaranteed for a general factor graph. We propose a method for improving BP to account for cycles in the factor graph. We also study another message passing algorithm known as Survey Propagation (SP), which is empirically quite effective in solving random K-SAT instances, even when the density is close to the satisfiability threshold. We contribute to the theoretical understanding of SP by deriving the SP equations from the BP message update rules. I would like to thank Prof. Kurt Mehlhorn for giving me the opportunity to pursue this thesis under his supervision and for his timely and valuable inputs. I am grateful to my advisor, Konstantinos Panagiotou, for bringing an interesting and challenging research topic to my attention. I also thank him for his persistent support and patience through the many discussions we had over the course of the thesis. He has been an excellent advisor. I thank my parents for being a source of continued emotional support and teaching me to aim high. I am thankful to all my friends for their support and encouragement.
