Sudeshna Sarkar | IIT Kharagpur
Papers by Sudeshna Sarkar
The Florida AI Research Society, May 21, 2001
We introduce a new genetic operator, Reduction, that rectifies syntactically incorrect decision trees and at the same time removes their redundant sections, while preserving accuracy. A novel approach to crossover is presented that uses the reduction operator to systematically extract building blocks spread out over the entire second parent to create a subtree that is valid and particularly useful in the context of the subtree it replaces in the first parent. The crossover introduced also removes unexplored code from the offspring and hence prevents redundancy and bloating. Overall, reduction can be viewed as a local optimization step that directs the population, generated initially or over generations through crossovers, to potentially good regions of the search space, so that reproduction is performed in a highly correlated landscape with a global structure. Lexical convergence is also ensured, implying that identical individuals always produce the same offspring.
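The reduction described above is, at its core, a tree-simplification pass. Below is a minimal Python sketch of the redundancy-removal part alone, assuming decision trees over boolean attributes; the `Node` class and `reduce_tree` are hypothetical, and the paper's full operator also repairs syntactic validity, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    attr: Optional[str] = None       # boolean attribute tested here (None = leaf)
    label: Optional[str] = None      # class label, leaves only
    left: Optional["Node"] = None    # branch taken when attr is False
    right: Optional["Node"] = None   # branch taken when attr is True

def reduce_tree(node, bindings=None):
    """Splice out re-tests of attributes whose value is fixed by an ancestor."""
    bindings = bindings or {}
    if node.attr is None:            # leaf: nothing to reduce
        return node
    if node.attr in bindings:        # outcome predetermined: keep one branch only
        branch = node.right if bindings[node.attr] else node.left
        return reduce_tree(branch, bindings)
    return Node(attr=node.attr,
                left=reduce_tree(node.left, {**bindings, node.attr: False}),
                right=reduce_tree(node.right, {**bindings, node.attr: True}))
```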
FIRE (Working Notes), 2020
The goal of the FIRE 2020 EDNIL track was to create a framework that could be used to detect events in news articles in English, Hindi, Bengali, Marathi, and Tamil. The track consisted of two tasks: (i) identifying a piece of text from a news article that contains an event (Event Identification), and (ii) creating an event frame from the news article (Event Frame Extraction). The events covered by the Event Identification task were Man-made Disaster and Natural Disaster. In the Event Frame Extraction task, the event frame consists of Event Type, Casualties, Time, Place, and Reason.
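For illustration, the event frame described above maps naturally onto a small record type. The class below is a hypothetical container, not the track's official format, and the field values are invented examples.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EventFrame:
    event_type: str                    # "Man-made Disaster" or "Natural Disaster"
    casualties: Optional[str] = None
    time: Optional[str] = None
    place: Optional[str] = None
    reason: Optional[str] = None

# Invented example values, purely for illustration:
frame = EventFrame(event_type="Natural Disaster", casualties="12 injured",
                   time="Sunday night", place="coastal Tamil Nadu",
                   reason="cyclone")
```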
ACM Transactions on Asian and Low-Resource Language Information Processing, Dec 17, 2018
We investigate the use of word embeddings for query translation to improve precision in cross-language information retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed in such a way that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings, learned from comparable corpora of Indic languages, for the task of CLIR. We propose a clustering method based on the multilingual word vectors to group similar words across languages. For this, we construct a graph with words from multiple languages as nodes and with edges connecting words with similar vectors. We use the Louvain method for community detection to find communities in this graph. We show that choosing target-language words as query translations from the clusters or communities containing the query terms helps in improving CLIR. We also find that better-quality query translations are obtained when words from more languages are used for the clustering, even when the additional languages are neither the source nor the target languages. This is probably because having more similar words across multiple languages yields well-defined dense subclusters that help us obtain precise query translations. In this article, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
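A minimal sketch of the clustering step described above, assuming pre-trained multilingual word vectors are available as a dict mapping words to numpy arrays; the similarity threshold of 0.5 is an illustrative choice, not the paper's setting.

```python
import itertools
import networkx as nx
import numpy as np

def word_communities(vectors, threshold=0.5):
    """vectors: dict mapping (possibly multilingual) words to numpy arrays."""
    G = nx.Graph()
    G.add_nodes_from(vectors)
    for w1, w2 in itertools.combinations(vectors, 2):
        v1, v2 = vectors[w1], vectors[w2]
        sim = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        if sim > threshold:
            G.add_edge(w1, w2, weight=sim)  # connect words with similar vectors
    # Louvain community detection (available in networkx >= 2.8).
    return nx.community.louvain_communities(G, weight="weight", seed=0)
```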
IEEE Data(base) Engineering Bulletin, 2007
Modern database systems mostly support representation and retrieval of data belonging to different scripts and languages, but the database functions are mostly designed or optimized with respect to the Roman script and English. Most database query languages include support for regular expression matching. However, the matching units are designed for the Roman script and do not satisfy the natural requirements of other scripts. In this paper, we discuss the different scripts and languages in use in the world and recommend the type of regular expression support that will suit the needs of all these scripts. We also discuss cross-lingual match operators and matching with respect to linguistic units.
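As a small illustration of the script problem discussed above: classical regex character classes such as `[a-z]` are Roman-centric, but Unicode block ranges let a pattern target a specific script. The Bengali range below comes from the Unicode standard; the pattern is only an example, not the match operators proposed in the paper.

```python
import re

bengali_word = re.compile(r"[\u0980-\u09FF]+")   # the Bengali Unicode block
text = "The word \u099c\u09b2 means water."
print(bengali_word.findall(text))                # -> ['জল']
```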
ACM Transactions on Asian and Low-Resource Language Information Processing, Jun 1, 2020
Cross-lingual dependency parsing approaches have been employed to develop dependency parsers for languages for which little or no treebank is available, using the treebanks of other languages. The language for which the cross-lingual parser is developed is usually referred to as the target language, and the language whose treebank is used to train the cross-lingual parser model is referred to as the source language. Cross-lingual approaches to dependency parsing may be broadly classified into three categories: model transfer, annotation projection, and treebank translation. This survey provides an overview of the various aspects of the model transfer approach to cross-lingual dependency parsing. We present a classification of model transfer approaches based on the different aspects of the method. We discuss some of the challenges associated with cross-lingual parsing and the techniques used to address them. To address the difference in vocabulary between two languages, some approaches use only non-lexical features of the words to train the models, while others use shared representations of the words. Some approaches address the morphological differences by chunk-level transfer rather than word-level transfer. The syntactic differences between the source and target languages are sometimes addressed by transforming the source-language treebanks or by combining the resources of multiple source languages. Furthermore, a cross-lingual transfer parser model may be developed for a specific target language, or it may be trained to parse sentences of multiple languages. With respect to the above-mentioned aspects, we look at the different ways in which the methods can be classified. We further classify and discuss the different approaches from the perspective of the corresponding aspects. We also demonstrate the performance of the transferred models under different settings corresponding to the classification aspects on a common dataset.
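A tiny sketch of delexicalization, one of the model-transfer techniques surveyed above: every word form is replaced by its POS tag so that a parser trained on the source treebank can be applied across languages. The CoNLL-U handling below is deliberately simplified (multiword-token lines are not treated specially) and the helper itself is hypothetical.

```python
def delexicalize_conllu(lines):
    """Replace FORM and LEMMA with the UPOS tag (0-indexed columns 1, 2 := 3)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            yield line                      # keep comments and blank lines
            continue
        cols = line.rstrip("\n").split("\t")
        cols[1] = cols[2] = cols[3]         # FORM, LEMMA := UPOS
        yield "\t".join(cols) + "\n"
```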
Neural machine translation (NMT) models have recently been shown to be very successful in machine translation (MT). The use of LSTMs in machine translation has significantly improved the translation performance for longer sentences by capturing the context and long-range correlations of the sentences in their hidden layers. The attention-based NMT system (Bahdanau et al., 2014) has become the state of the art, performing equal to or better than other statistical MT approaches. In this paper, we study the performance of the attention-based NMT system (Bahdanau et al., 2014) on the Indian language pair Hindi-Bengali, and analyze the types of errors that occur when the languages are morphologically rich and large parallel training corpora are scarce. We then carry out certain heuristic post-processing steps to improve the quality of the translated sentences and suggest further measures that can be taken.
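For reference, a numpy sketch of the additive attention step underlying the Bahdanau et al. (2014) model mentioned above: alignment energies over the encoder states are softmaxed into weights that form the context vector. The parameter shapes are illustrative.

```python
import numpy as np

def bahdanau_attention(s, H, W, U, v):
    """s: (d,) decoder state; H: (T, d) encoder states; W, U: (d, d); v: (d,)."""
    scores = np.tanh(s @ W + H @ U) @ v       # (T,) alignment energies
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    context = weights @ H                     # weighted sum of encoder states
    return context, weights
```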
One of the most important components of an e-learning system is the learning material. The popularity of e-learning has led to the development of many learning object repositories that store high-quality learning materials specifically created for e-learning. High-quality learning materials are expensive to create, so it is very important to ensure their reuse. Reuse of learning materials is made possible by semantically tagging them with standard metadata. In this chapter, we present a comparative study of available learning object metadata and learning object repositories. The learning material can be tagged either manually or automatically. Manual annotation is a time-consuming and expensive process. We have explored the feasibility of tagging learning materials automatically with a set of IEEE LOM metadata specifications. Here, we present a standard classification approach using a probabilistic neural network to automatically identify the topic of the learning material…
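A compact sketch of a probabilistic neural network (Parzen-window) classifier of the kind mentioned above: each topic's density is estimated with a Gaussian kernel over its training vectors, and the topic with the highest density wins. The feature representation and the smoothing parameter are assumptions.

```python
import numpy as np

def pnn_predict(x, class_examples, sigma=1.0):
    """x: (d,) query vector; class_examples: {topic: (n_i, d) array}."""
    best_topic, best_density = None, -np.inf
    for topic, X in class_examples.items():
        d2 = ((X - x) ** 2).sum(axis=1)                   # squared distances
        density = np.exp(-d2 / (2 * sigma ** 2)).mean()   # Parzen estimate
        if density > best_density:
            best_topic, best_density = topic, density
    return best_topic
```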
Advances in Water Resources, 2021
Groundwater plays a major role in human adaptation and ecological sustainability against climate variability by providing global water and food security. In the Indus-Ganges-Brahmaputra-Meghna aquifers (IGBM), groundwater abstraction has been reported to be one of the primary contributors to groundwater storage variability. However, there is still a lack of understanding of the relative influence of climate and abstraction on groundwater. Data-guided statistical studies are reported to be crucial in understanding the human-natural complex system. Here, we attributed the long-term (1985–2015) impact of local precipitation, global climate cycles, and human influence on multi-depth groundwater levels (n=6753) in the IGBM using lag correlation analysis, wavelet coherence analysis, and regression-based dominance analysis. Our findings highlight the variable patterns of phase lags observed between multi-depth groundwater levels and precipitation, depending on the different nature of climatic and anthropogenic drivers in different parts of the basin. We observed intuitive responses, i.e., rapid response in shallow groundwater and relatively delayed responses to the global climate patterns with increasing depth. However, in the most exploited areas, the hydrological processes governing groundwater recharge are overwhelmed by unsustainable groundwater abstraction, thus decoupling the hydro-climatic continuum. Our results also suggest groundwater abstraction to be the dominant influence in most of the basin, particularly at the greater depths of the aquifer, thus highlighting the importance of understanding multi-depth groundwater dynamics for future groundwater management and policy interventions.
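A sketch of the lag-correlation step, assuming groundwater-level and precipitation time series of equal length as numpy arrays; the maximum lag of 12 steps is illustrative, not the study's configuration.

```python
import numpy as np

def best_lag(gw, precip, max_lag=12):
    """Lag (in time steps) at which |Pearson correlation| peaks."""
    corrs = []
    for lag in range(max_lag + 1):
        a = gw[lag:]                       # groundwater responds after precip
        b = precip[:len(precip) - lag]
        corrs.append(np.corrcoef(a, b)[0, 1])
    best = int(np.argmax(np.abs(corrs)))
    return best, corrs[best]
```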
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2017
This paper describes the IIT Kharagpur dependency parsing system in the CoNLL-2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. We primarily focus on the low-resource (surprise) languages. We have developed a framework to combine multiple treebanks to train parsers for low-resource languages using a delexicalization method. We apply transformations to the source-language treebanks, based on syntactic features of the low-resource language, to improve the performance of the parser. In the official evaluation, our system achieves macro-averaged LAS scores of 67.61 and 37.16 on the entire blind test data and the surprise-language test data, respectively.
Lecture Notes in Computer Science, 2016
Prediction of heavy rainfall is an extremely important problem in the field of meteorology, as it has a great impact on the life and economy of people. Every year, many people in different parts of the world suffer the severe consequences of heavy rainfall, such as floods and the spread of diseases. We propose a model based on a deep neural network to predict extreme rainfall from past climatic parameters. Our model, comprising a stacked auto-encoder, has been tested for Mumbai and Kolkata, India, and found to be capable of predicting heavy rainfall events over both these regions. The model is able to predict extreme rainfall events 6 to 48 h before their occurrence; however, it also produces several false positives. We compare our results with other methods and find that our method performs much better than the other methods reported in the literature. Predicting heavy rainfall 1 to 2 days in advance is a difficult task, and such an early prediction can help in avoiding a lot of damage; this is where our model can offer a promising solution. Compared to the conventional methods, our method reduces the number of false alarms; on further analysis of our results, we find that in many cases a false alarm is raised when there has been rainfall in the surrounding regions. Thus our model generates warnings for heavy rain in the surrounding regions as well.
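A hedged PyTorch sketch of a stacked auto-encoder with a binary heavy-rainfall head, in the spirit of the model described above; the layer sizes, the input dimension (number of climatic parameters), and the joint setup (rather than layer-wise pre-training) are assumptions.

```python
import torch
import torch.nn as nn

class StackedAERainfall(nn.Module):
    def __init__(self, n_features=20):
        super().__init__()
        self.encoder = nn.Sequential(              # stacked encoding layers
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # reconstruction path
            nn.Linear(16, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )
        self.classifier = nn.Linear(16, 1)         # heavy-rainfall logit

    def forward(self, x):
        z = self.encoder(x)
        recon = self.decoder(z)                    # for the auto-encoder loss
        prob = torch.sigmoid(self.classifier(z))   # P(heavy rainfall)
        return recon, prob
```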
Lecture Notes in Computer Science, 2014
In this paper, we present an approach, based on random indexing, to identify semantically related information that effectively disambiguates the user query and improves the retrieval efficiency for news documents. User query terms are expanded with terms of similar word sense, discovered by implicitly considering the “associatedness” of the document context with that of the given query. This type of associatedness is guided by word space models, as described by Kanerva et al. (2000). The word-space model computes the meaning of terms by implicitly utilizing the distributional patterns (contexts) of words collected over large text data. The distributional patterns represent semantic similarity between words in terms of their spatial proximity in the context space. In this space, words are represented by context vectors whose relative directions are assumed to indicate semantic similarity. Motivated by this distributional hypothesis, words with similar meanings are assumed to have similar contexts. For example, if we observe two words that constantly occur with the same context, we are justified in assuming that they mean similar things. Hence the word-space methodology makes semantics computable, and the underlying models do not require any linguistic or semantic expertise. Experimental results on the FIRE news collection show that the proposed approach effectively captures term contexts using higher-order term associations across the collection of news documents and uses such information to assist the retrieval of documents.
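A minimal sketch of random indexing as described above: each word receives a sparse ternary index vector, and a word's context vector accumulates the index vectors of its neighbours within a sliding window. The dimensionality, sparsity, and window size are illustrative defaults, not the paper's settings.

```python
import numpy as np
from collections import defaultdict

def random_indexing(tokens, dim=1000, nonzeros=10, window=2, seed=0):
    rng = np.random.default_rng(seed)
    index = {}                                    # word -> sparse ternary vector
    context = defaultdict(lambda: np.zeros(dim))  # word -> accumulated context

    def index_vector(word):
        if word not in index:
            v = np.zeros(dim)
            pos = rng.choice(dim, size=nonzeros, replace=False)
            v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
            index[word] = v
        return index[word]

    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                            # neighbours only
                context[w] += index_vector(tokens[j])
    return dict(context)
```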
Lecture Notes in Computer Science, 2013
In web search, given a query, a search engine is required to retrieve a set of relevant documents. We wish to rank documents based on their content and look beyond mere relevance. Often, users want comprehensive documents covering a variety of aspects of information relevant to the query topic. Given a query, a document is considered comprehensive only if it covers a large number of aspects of the query. The comprehensiveness of a web document may be estimated by analyzing various parts of its content and checking the diversity and coverage of the content as well as its relevance. In this work, we propose an information retrieval system that ranks documents based on the comprehensiveness of their content. We use pseudo-relevance feedback to score the comprehensiveness of web documents as well as their relevance. Experiments show that the proposed method effectively identifies documents with comprehensive content.
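A rough sketch of one way to score comprehensiveness with pseudo-relevance feedback, in the spirit of the approach above: aspect terms are mined from the top-ranked documents, and each document is scored by the fraction of aspects it covers. The tokenization and cut-offs are assumptions, not the paper's method.

```python
from collections import Counter

def aspect_terms(top_docs, m=20):
    """Mine frequent terms from the pseudo-relevant (top-ranked) documents."""
    counts = Counter(w for doc in top_docs for w in doc.lower().split())
    return {w for w, _ in counts.most_common(m)}

def comprehensiveness(doc, aspects):
    """Fraction of mined aspect terms the document covers."""
    return len(set(doc.lower().split()) & aspects) / len(aspects)
```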
Artificial Intelligence, 1989
... of heuristic search algorithms for both ordinary and AND/OR graphs which not only terminate with admissible solutions in restricted memory, but also fruitfully utilize the available memory. These algorithms take as input the amount of additional memory MAX (over the minimum ...
International conference natural language processing, Dec 1, 2014
This paper presents an accurate identification of the different types of karta (subject) in Bangla. Due to the limited amount of annotated data on dependency relations, we build a baseline parser for Bangla using a data-driven method. A rule-based post-processor is then applied to the output of the baseline parser. As a result, the average labeled attachment score improvements for karta (subject), based on F-measure, on the KGPBenTreeBank and the ICON 2010 Treebank are 25.35% and 9.53%, respectively.
International Conference on Computational Linguistics, Dec 1, 2016
In recent years, there has been a lot of interest in cross-lingual parsing for developing treebanks for languages with small or no annotated treebanks. In this paper, we explore the development of a cross-lingual transfer parser from Hindi to Bengali using a Hindi parser and a Hindi-Bengali parallel corpus. A parser is trained and applied to the Hindi sentences of the parallel corpus, and the parse trees are projected to construct probable parse trees of the corresponding Bengali sentences. Only about 14% of these trees are complete (the transferred trees contain all the target sentence words), and they are used to construct a Bengali parser. We relax the criterion of completeness to also consider well-formed trees (43% of the trees), leading to an improvement. We note that the words often do not have a one-to-one mapping between the two languages, but considering sentences at the chunk level results in better correspondence between the two languages. Based on this, we present a method that uses chunking as a preprocessing step and performs the transfer on the chunk trees. We find that about 72% of the projected parse trees of Bengali are then well-formed. The resultant parser achieves significant improvement in both Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) over the baseline word-level transferred parser.
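A simplified sketch of the projection step described above: each Hindi head-dependent arc is copied onto the aligned Bengali words. The one-to-one `align` map is exactly the assumption that the chunk-level method relaxes; the function itself is hypothetical.

```python
def project_arcs(hindi_arcs, align):
    """hindi_arcs: iterable of (head, dep, label) token indices;
    align: dict mapping Hindi indices to Bengali indices (1-to-1 here)."""
    bengali_arcs = [(align[h], align[d], lab)
                    for h, d, lab in hindi_arcs
                    if h in align and d in align]
    # The projected tree is "complete" only if no arc was dropped.
    return bengali_arcs
```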
ACM Transactions on Asian and Low-Resource Language Information Processing, Feb 21, 2023
International conference natural language processing, Dec 1, 2016
While statistical methods have been very effective in developing NLP tools, the use of linguistic tools and an understanding of language structure can make these tools better. Cross-lingual parser construction has been used to develop parsers for languages with no annotated treebank. Delexicalized parsers that use only POS tags can be transferred to a new target language, but the success of a delexicalized transfer parser depends on the syntactic closeness between the source and target languages. An understanding of the linguistic similarities and differences between the languages can be used to improve the parser. In this paper, we use a method based on cross-lingual model transfer to transfer a Hindi parser to Bengali. The technique does not need any parallel corpora but makes use of chunkers for these languages. We observe that while the two languages share broad similarities, Bengali and Hindi phrases do not have identical constructions. We can improve the transfer-based parser if the parser is transferred at the chunk level. Based on this, we present a method that uses chunkers to develop a cross-lingual parser for Bengali, improving the unlabelled attachment score (UAS) from 65.1 (baseline parser) to 78.2.
Lecture Notes in Computer Science, 2023