Arnab Bhattacharya - Profile on Academia.edu (original) (raw)

Papers by Arnab Bhattacharya

Findings of the Association for Computational Linguistics: ACL 2022

Many populous countries including India are burdened with a considerable backlog of legal cases. ... more Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/ Exploration-Lab/HLDC.

ACM Transactions on the Web

Short Message Service (SMS) is one of the widely used mobile applications for global communicatio... more Short Message Service (SMS) is one of the widely used mobile applications for global communication for personal and business purposes. Its widespread use for customer interaction, business updates and reminders has made it a billion dollars industry in ‘Text Marketing’. Along with valid SMS, a tsunami of spam messages also popup which serve various purposes for the sender and a majority of them are fraudulent. Filtering spam SMS in an accurate manner is a crucial and challenging task which will benefit human lives both mentally and economically. Some of the challenges in the filtering of spam SMS include less number of characters, texts in informal languages and lack of public SMS spam corpus, etc. Focusing solely on the textual features of the SMS is a major handicap of the existing methods as it lacks in dynamically adapting to the increasing number of new keywords and jargons. In this paper, we develop an intention-based approach of SMS spam filtering that efficiently handles dyn...

Knowledge bases (KB) are an important resource in a number of natural language processing (NLP) a... more Knowledge bases (KB) are an important resource in a number of natural language processing (NLP) and information retrieval (IR) tasks, such as semantic search, automated question-answering etc. They are also useful for researchers trying to gain information from a text. Unfortunately, however, the state-of-the-art in Sanskrit NLP does not yet allow automated construction of knowledge bases due to unavailability or lack of sufficient accuracy of tools and methods. Thus, in this work, we describe our efforts on manual annotation of Sanskrit text for the purpose of knowledge graph (KG) creation. We choose the chapter Dhanyavarga from Bhavaprakashanighantu of the Ayurvedic text Bhavaprakasha for annotation. The constructed knowledge graph contains 410 entities and 764 relationships. Since Bhavaprakashanighantu is a technical glossary text that describes various properties of different substances, we develop an elaborate ontology to capture the semantics of the entity and relationship typ...

International Journal of Advance Engineering and Research Development, 2016

Itemset mining has been an active area of research due to its successful application in various d... more Itemset mining has been an active area of research due to its successful application in various data mining scenarios including finding association rules. Though most of the past work has been on finding frequent itemsets, infrequent itemset mining has demonstrated its utility in web mining, bioinformatics and other fields. In this paper, we propose a new algorithm based on the pattern-growth paradigm to find minimally infrequent itemsets. A minimally infrequent itemset has no subset which is also infrequent. We also introduce the novel concept of residual trees. We further utilize the residual trees to mine multiple level minimum support itemsets where different thresholds are used for finding frequent itemsets for different lengths of the itemset. Finally, we analyze the behavior of our algorithm with respect to different parameters and show through experiments that it outperforms the competing ones.

Lecture Notes in Computer Science, 2009

Statistical distance measures have found wide applicability in information retrieval tasks that t... more Statistical distance measures have found wide applicability in information retrieval tasks that typically involve high dimensional datasets. In order to reduce the storage space and ensure efficient performance of queries, dimensionality reduction while preserving the inter-point similarity is highly desirable. In this paper, we investigate various statistical distance measures from the point of view of discovering low distortion embeddings into low-dimensional spaces. More specifically, we consider the Mahalanobis distance measure, the Bhattacharyya class of divergences and the Kullback-Leibler divergence. We present a dimensionality reduction method based on the Johnson-Lindenstrauss Lemma for the Mahalanobis measure that achieves arbitrarily low distortion. By using the Johnson-Lindenstrauss Lemma again, we further demonstrate that the Bhattacharyya distance admits dimensionality reduction with arbitrarily low additive error. We also examine the question of embeddability into metric spaces for these distance measures due to the availability of efficient indexing schemes on metric spaces. We provide explicit constructions of point sets under the Bhattacharyya and the Kullback-Leibler divergences whose embeddings into any metric space incur arbitrarily large distortions. We show that the lower bound presented for Bhattacharyya distance is nearly tight by providing an embedding that approaches the lower bound for relatively small dimensional datasets.

Encyclopedia of Database Systems, 2009

Proceedings of the VLDB Endowment, 2012

The problem of identification of statistically significant patterns in a sequence of data has bee... more The problem of identification of statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. We use the chi-square statistic as a quantitative measure of statistical significance. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substring for which the empirical distribution of single letters deviates the most from the distribution expected from the generative Bernoulli model. This deviation is captured using the chi-square measure. The most significant substring (MSS) of a string is thus defined as the substring having the highest chi-square value. Till date, to the best of our knowledge, there does not...

Proceedings of the 33rd …, 2007

The modeling of high level semantic events from low level sensor signals is important in order to... more The modeling of high level semantic events from low level sensor signals is important in order to understand distributed phenomena. For such content-modeling purposes, transformation of numeric data into symbols and the modeling of resulting symbolic sequences can be achieved using statistical models-Markov Chains (MCs) and Hidden Markov Models (HMMs). We consider the problem of distributed indexing and semantic querying over such sensor models. Specifically, we are interested in efficiently answering (i) range queries: return all sensors that have observed an unusual sequence of symbols with a high likelihood, (ii) top-1 queries: return the sensor that has the maximum probability of observing a given sequence, and (iii) 1-NN queries: return the sensor (model) which is most similar to a query model. All the above queries can be answered at the centralized base station, if each sensor transmits its model to the base station. However, this is communicationintensive. We present a much more efficient alternative-a distributed index structure, MIST (Model-based Index STructure), and accompanying algorithms for answering the above queries. MIST aggregates two or more constituent models into a single composite model, and constructs an in-network hierarchy over such composite models. We develop two kinds of composite models: the first kind captures the average behavior of the underlying models and the second kind captures the extreme behaviors of the underlying models. Using the index parameters maintained at the root of a subtree, we bound the probability of observation of a query sequence from a sensor in the subtree. We also bound the distance of a query model to a sensor model using these parameters. Extensive experimental evaluation on both real-world and synthetic data sets show that the MIST schemes scale well in terms of network size and number of model states. We also show its superior performance over the centralized schemes in terms of update, query, and total communication costs.

Optimizing XML queries is an intensively studied problem in the field of databases of late. The t... more Optimizing XML queries is an intensively studied problem in the field of databases of late. The topic has a host of applications, viz., web-scale XML and keyword search. In this paper, we address the problem of efficient execution of XML path queries (commonly known as XPath queries), branch queries and wild-card queries. Our index structure assists in fast identification of child-parent as well as ancestor-descendant relationship, thus increasing the efficiency of XPath query execution. Both XML data and queries possess an inherent tree structure and, thus, fast child-parent lookup is a necessity to improve performance. We propose a holistic hybrid index structure that combines the Extended Dewey labeling scheme with the CTree index structure to leverage advantages of both the mechanisms. Our index structure is capable of catering to all the queries (single path, branch and wild-card queries), with equal or better performance metrics when compared to the state-of-the-art.

A.: Minimum spanning tree on spatiotemporal networks

Active languages such as Bangla (or Bengali) evolve over time due to a variety of social, cultura... more Active languages such as Bangla (or Bengali) evolve over time due to a variety of social, cultural, economic, and political issues. In this paper, we analyze the change in the written form of the modern phase of Bangla quantitatively in terms of character-level, syllable-level, morpheme-level and word-level features. We collect three different types of corpora-classical, newspapers and blogs-and test whether the differences in their features are statistically significant. Results suggest that there are significant changes in the length of a word when measured in terms of characters, but there is not much difference in usage of different characters, syllables and morphemes in a word or of different words in a sentence. To the best of our knowledge, this is the first work on Bangla of this kind.

Proceedings of the 29th ACM International Conference on Information & Knowledge Management

ArXiv, 2010

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining... more The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions which are unable to handle such quer...

Sanskrit (saṃskṛta) enjoys one of the largest and most varied literature in the whole world. Extr... more Sanskrit (saṃskṛta) enjoys one of the largest and most varied literature in the whole world. Extracting the knowledge from it, however, is a challenging task due to multiple reasons including complexity of the language and paucity of standard natural language processing tools. In this paper, we target the problem of building knowledge graphs for particular types of relationships from saṃskṛta texts. We build a natural language question-answering system in saṃskṛta that uses the knowledge graph to answer factoid questions. We design a framework for the overall system and implement two separate instances of the system on human relationships from mahābhārata and rāmāyaṇa, and one instance on synonymous relationships from bhāvaprakāśa nighaṇṭu, a technical text from āyurveda. We show that about 50% of the factoid questions can be answered correctly by the system. More importantly, we analyse the shortcomings of the system in detail for each step, and discuss the possible ways forward.

Constraint Satisfaction over Generalized Staircase Constraints

ArXiv, 2013

One of the key research interests in the area of Constraint Satisfaction Problem (CSP) is to iden... more One of the key research interests in the area of Constraint Satisfaction Problem (CSP) is to identify tractable classes of constraints and develop efficient solutions for them. In this paper, we introduce generalized staircase (GS) constraints which is an important generalization of one such tractable class found in the literature, namely, staircase constraints. GS constraints are of two kinds, down staircase (DS) and up staircase (US). We first examine several properties of GS constraints, and then show that arc consistency is sufficient to determine a solution to a CSP over DS constraints. Further, we propose an optimal O(cd) time and space algorithm to compute arc consistency for GS constraints where c is the number of constraints and d is the size of the largest domain. Next, observing that arc consistency is not necessary for solving a DSCSP, we propose a more efficient algorithm for solving it. With regard to US constraints, arc consistency is not known to be sufficient to det...

Proc. VLDB Endow., 2020

Subgraph querying is one of the most important primitives in many applications. Although the fiel... more Subgraph querying is one of the most important primitives in many applications. Although the field is well studied for deterministic graphs, in many situations, the graphs are probabilistic in nature. In this paper, we address the problem of subgraph querying in large probabilistic labeled graphs. We employ a novel algorithmic framework, called CHISEL, that uses the idea of statistical significance for approximate subgraph matching on uncertain graphs that have uncertainty in edges. For each candidate matching vertex in the target graph that matches a query vertex, we compute its statistical significance using the chi-squared statistic. The search algorithm then proceeds in a greedy manner by exploring the vertex neighbors having the largest chi-square score. In addition to edge uncertainty, we also show how CHISEL can handle uncertainty in labels and/or vertices. Experiments on large real-life graphs show the efficiency and effectiveness of our algorithm. PVLDB Reference Format: Sh...

Skyline queries retrieve promising data objects that are not dominated in all the attributes of i... more Skyline queries retrieve promising data objects that are not dominated in all the attributes of interest. However, in many cases, a user may not be interested in a skyline set computed over the entire dataset, but rather over a specified range of values for each attribute. For example, a user may look for hotels only within a specified budget and/or in a particular area in the city. This leads to constrained skylines. Even after constraining the query ranges, the size of the skyline set can be impractically large, thereby necessitating the need for approximate or representative skylines. Thus, in this paper, we introduce the problem of finding range-constrained approximate skylines. We design a grid-based framework, called SkyCover, for computing such skylines. Given an approximation error parameter > 0, the SkyCover framework guarantees that every skyline is “covered” by at least one representative object that is not worse by more than a factor of (1 + ) in all the dimensions. T...

The multi-criteria decision making, which is possible with the advent of skyline queries, has bee... more The multi-criteria decision making, which is possible with the advent of skyline queries, has been applied in many areas. Though most of the existing research is concerned with only a single relation, several real world applications require finding the skyline set of records over multiple relations. Consequently, the join operation over skylines where the preferences are local to each relation, has been proposed. In many of those cases, however, the join often involves performing aggregate operations among some of the attributes from the different relations. In this paper, we introduce such queries as "aggregate skyline join queries". Since the naive algorithm is impractical, we propose three algorithms to efficiently process such queries. The algorithms utilize certain properties of skyline sets, and processes the skylines as much as possible locally before computing the join. Experiments with real and synthetic datasets exhibit the practicality and scalability of the alg...

Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020

Knowledge graphs (KGs), that have become the backbone of many critical knowledge-centric applicat... more Knowledge graphs (KGs), that have become the backbone of many critical knowledge-centric applications, are mostly automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. It is, therefore, important to establish the provenance of results for a query to determine how these were computed. Provenance is shown to be useful for assigning confidence scores to the results, for debugging the KG generation itself, and for providing answer explanations. In many such applications, certain queries are registered as standing queries since their answers are needed often. However, KGs keep continuously changing due to reasons such as changes in the source data, improvements to the extraction techniques, refinement/enrichment of information, and so on. This raises the issue of efficiently maintaining the provenance polynomials of complex graph pattern queries for dynamic and large KGs instead of having to recompute them from scratch each time the KG...

Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corp... more Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we neither use any sentence-aligned corpora or document-aligned corpora, nor do we use any language specific resources such as dictionary, thesaurus, or grammar rules. Instead, we use an embedding into a common space and learn word correspondences directly from there. We test our proposed approach for bilingual IR on standard FIRE datasets for Bangla, Hindi and English. The proposed method is superior to the state-of-the-art method not only for IR evaluation measures but also in terms of time requirements. We extend our method successfully to the trilingual setting.