Dwaipayan Roy - Academia.edu
Papers by Dwaipayan Roy
Personalized Point of Interest recommendation is very helpful for satisfying users' needs at new places. In this article, we propose a tag-embedding-based method for personalized Point of Interest recommendation. We model the relationship between the tags associated with Points of Interest; the model provides a representative embedding for each tag such that related tags lie closer together. We model each Point of Interest based on its tag embeddings, and also model each user (user profile) based on the Points of Interest they have rated. Finally, we rank a user's candidate Points of Interest by the cosine similarity between the user's embedding and each Point of Interest's embedding. Further, we find the parameters required to model a user by discrete optimization over different measures (e.g., nDCG@5, MRR). We also analyze the results when using the same parameters for all users versus individual parameters for each user. Along with this, we also analyze the effect on the result w...
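The abstract is cut off, but the ranking step it describes is simple enough to sketch. Below is a minimal, hypothetical illustration (not the authors' code): a POI is assumed to be the mean of its tag embeddings, a user profile the mean of the rated POIs' embeddings, and candidates are ranked by cosine similarity. All names (`tag_vectors`, `poi_embedding`, and so on) are made up for this sketch.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def poi_embedding(tag_vectors, tags):
    # Assumption: a POI is represented as the mean of its tag embeddings.
    return np.mean([tag_vectors[t] for t in tags if t in tag_vectors], axis=0)

def user_embedding(rated_poi_tags, tag_vectors):
    # Assumption: a user profile is the centroid of the user's rated POIs.
    return np.mean([poi_embedding(tag_vectors, tags) for tags in rated_poi_tags], axis=0)

def rank_candidates(user_vec, candidates, tag_vectors):
    # Rank candidate POIs (id, tag list) by cosine similarity to the profile.
    scored = [(poi_id, cosine(user_vec, poi_embedding(tag_vectors, tags)))
              for poi_id, tags in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```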
5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD)
The problem of designing recommender systems for scholarly article citations has been actively researched, with more than 200 publications appearing in the last two decades. In spite of this, no definitive results are available about which approaches work best. Arguably the most important reason for this lack of consensus is the dearth of standardised test collections and evaluation protocols, such as those provided by TREC-like forums. CiteSeerX, a "scholarly big dataset", has recently become available. However, this collection provides only the raw material that is yet to be moulded into Cranfield-style test collections. In this paper, we discuss the limitations of test collections used in earlier work, and describe how we used CiteSeerX to design a test collection with a well-defined evaluation protocol. The collection consists of over 600,000 research papers and over 2,500 queries. We report some preliminary experimental results using this collection, which are indicative...
ArXiv, 2016
In this paper, a framework for Automatic Query Expansion (AQE) is proposed using the distributed neural language model word2vec. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Using such a framework, we devise a query expansion technique where terms related to a query are obtained by a K-nearest-neighbour approach. We explore the performance of the AQE methods, with and without feedback query expansion, and a variant of simple K-nearest neighbours within the proposed framework. Experiments on standard TREC ad hoc data (Disks 4 and 5 with query sets 301-450 and 601-700) and web data (WT10G with query set 451-550) show significant improvements over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to statistical co-occurrence-based feedback methods such as RM3. We have also found that the word2vec based query expansion metho...
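A minimal sketch of the nearest-neighbour expansion step described above, using gensim's word2vec interface; the embedding file path and the value of k are placeholders, and the no-feedback (pre-retrieval) variant is assumed:

```python
from gensim.models import KeyedVectors

# Hypothetical path: any word2vec-format vectors trained on the
# target collection (or pre-trained embeddings) would do.
kv = KeyedVectors.load_word2vec_format("collection_vectors.bin", binary=True)

def expand_query(query_terms, k=5):
    # For each query term, add its K nearest neighbours in the embedding
    # space; out-of-vocabulary terms are kept as-is without expansion.
    expanded = list(query_terms)
    for term in query_terms:
        if term in kv:
            expanded += [w for w, _ in kv.most_similar(term, topn=k)]
    return expanded

print(expand_query(["international", "organized", "crime"]))
```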
ArXiv, 2021
In this demo paper, we present ConSTR, a novel Contextual Search Term Recommender that utilises the user's interaction context for search term recommendation and literature retrieval. ConSTR integrates a two-layered recommendation interface: the first layer suggests terms with respect to a user's current search term, and the second layer suggests terms based on the user's previous search activities (interaction context). For the demonstration, ConSTR is built on arXiv, an academic repository consisting of 1.8 million documents.
ArXiv, 2016
A major difficulty in applying word vector embeddings in IR is in devising an effective and efficient strategy for obtaining representations of compound units of text, such as whole documents (in comparison to atomic words), for the purpose of indexing and scoring documents. Instead of striving for a suitable method for obtaining a single vector representation of a large document of text, we rather aim to develop a similarity metric that makes use of the similarities between the individual embedded word vectors in a document and a query. More specifically, we represent a document and a query as sets of word vectors, and use a standard notion of similarity between these sets, computed as a function of the similarities between each constituent word pair drawn from these sets. We then use this similarity measure in combination with standard IR similarities for document ranking. The results of our initial experimental investigations show that our proposed metho...
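One concrete instance of such a set-to-set similarity, sketched under the assumption that the query and document words are available as embedding matrices; the mean of all pairwise cosines and the mixing weight `lam` are illustrative choices, not necessarily the paper's exact formulation:

```python
import numpy as np

def set_similarity(qvecs, dvecs):
    # qvecs, dvecs: 2-D arrays of word embeddings (one row per word).
    # Mean cosine similarity over every (query word, doc word) pair is
    # one simple function of the constituent pairwise similarities.
    Q = qvecs / np.linalg.norm(qvecs, axis=1, keepdims=True)
    D = dvecs / np.linalg.norm(dvecs, axis=1, keepdims=True)
    return float(np.mean(Q @ D.T))

def combined_score(ir_score, qvecs, dvecs, lam=0.8):
    # Linear interpolation with a standard IR score (e.g. LM or BM25);
    # lam is a hypothetical mixing weight to be tuned on held-out data.
    return lam * ir_score + (1 - lam) * set_similarity(qvecs, dvecs)
```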
Journal of the Association for Information Science and Technology, 2021
PSN: Technology (Topic), 2020
Problem definition: The U.S. federal government makes significant investments in technology programs to deliver essential services to the public. The execution of a program is monitored against a baseline—an aggregate plan representing the program's planned budget, schedule and scope. Recent reports suggest that federal technology programs are re-baselined multiple times, resulting in additional spending of taxpayer money. Although a program's scope has often been considered a driver of baseline changes, we have a limited understanding of the execution factors that may affect this relationship. Academic/Practical Relevance: With increasing bipartisan scrutiny of federal spending in technology programs and continuing debate in the media about their execution, a nuanced understanding of the drivers of baseline changes in federal technology programs is a critical and contemporary line of inquiry relevant to both policymakers and managers. Our study also responds to recent calls in the operations management literature for research on public sector operations. Methodology: The study sample comprises detailed archival data on 240 U.S. federal government technology programs across 24 federal agencies. We estimate a negative binomial regression specification that accounts for agency fixed effects and several program-specific characteristics to test four hypotheses on the interrelationships between a program's scope, granularity, management competency, execution methodology and baseline changes. Results: The results indicate that program scope is positively associated with the number of baseline changes. However, increasing levels of program granularity and program management competency attenuate this positive relationship. Additional analysis highlights the significant savings in taxpayer contributions that can occur by reducing baseline changes in programs of greater scope through an increase in the levels of program granularity and program management competency. Managerial Implications: The study results emphasize the need for federal agencies to invest greater effort in granularizing a program and in identifying managers with high levels of program management competency when program scope is high, as such efforts can translate into a reduction in the number of baseline changes. The results also highlight the role of the number of baseline changes as a valuable in-process metric for program managers and federal agencies to monitor the execution of federal technology programs and identify programs with greater potential for experiencing cost overruns.
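As a rough illustration of the stated methodology (not the study's code), a negative binomial specification with agency fixed effects and scope interactions might be estimated with statsmodels as follows; every column name here is hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: one row per program, with a count outcome
# (baseline_changes) and the predictors named in the abstract.
df = pd.read_csv("federal_it_programs.csv")

# Negative binomial count model; C(agency) adds agency fixed effects,
# and the scope:X terms test whether granularity and competency
# attenuate the scope--baseline-changes relationship.
model = smf.negativebinomial(
    "baseline_changes ~ scope + granularity + pm_competency + methodology"
    " + scope:granularity + scope:pm_competency + C(agency)",
    data=df,
)
result = model.fit()
print(result.summary())
```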
Forum for Information Retrieval Evaluation, 2020
This paper presents an overview of the findings of the track 'Causality-driven Ad hoc Information Retrieval' (abbreviated CAIR) at the Forum for Information Retrieval Evaluation (FIRE) 2020. The purpose of the track was to investigate how effectively search systems can retrieve documents that are causally related to a specified query event. Different from standard information retrieval (IR), the criterion of relevance in this search scenario is stricter, in the sense that the documents retrieved at the top ranks should provide information on the potentially relevant causes of a given query event, e.g. retrieve documents on political situations that might have led to 'Brexit'. We released a dataset comprising 25 queries split into train and test sets. We received submissions from two participating groups. The two main observations from the best-performing runs of the two participating groups are that longer queries showed a general trend to yield more ...
Secondary analysis, or the reuse of existing survey data, is a common practice among social scientists. Searching for relevant datasets in Digital Libraries is, however, a somewhat unfamiliar behaviour for this community. Dataset retrieval, especially in the social sciences, incorporates additional material such as codebooks, questionnaires, raw data files and more. Our assumption is that, due to the diverse nature of datasets, document retrieval models often do not work as efficiently for retrieving datasets. One way of enhancing these types of searches is to incorporate the user's interaction context in order to personalise dataset retrieval sessions. As a first step towards this long-term goal, we study characteristics of dataset retrieval sessions from a real-life Digital Library for the social sciences that incorporates both research data and publications. Previous studies reported a way of discerning queries between document search and dataset search by query length. In this paper, we ...
The Background Linking task focuses on providing users with suggestions for articles to read next when they are reading a news article. The suggested articles should provide adequate context and background information for the article the user is currently reading. In this paper, we describe several methods that we explored for this task, and report their results.
Wikipedia is the largest web-based open encyclopedia, covering more than three hundred languages. However, different language editions of Wikipedia differ significantly in terms of their information coverage. We present a systematic comparison of information coverage in English Wikipedia (the most exhaustive) and the Wikipedias in eight other widely spoken languages (Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish and Turkish). We analyze the content present in the respective Wikipedias in terms of the coverage of topics as well as the depth of coverage of the topics included in these Wikipedias. Our analysis quantifies the information gap that exists between different language editions of Wikipedia, provides useful insights, and offers a roadmap for the IR community to bridge this gap.
The Contextual Suggestion Problem focuses on search techniques for complex information needs that are highly dependent on context and user interest. In this paper, we present our approach to providing user- and context-dependent suggestions. The official evaluation scores for our submitted run are reported and compared to the overall median scores (across all 34 runs).
US federal information technology (IT) programs are often baselined multiple times, resulting in wasteful expenditure of taxpayer money. While increasing program scope has often been cited as a key driver of baseline changes, understanding how execution factors associated with a program may interplay with program scope to affect baseline changes has significant potential for improving the utilization of taxpayer contributions. We take a closer look at three execution factors relevant to federal IT programs, namely the granularity of the program in terms of the number of component activities, the project management competency in the program, and the project management methodology pursued in the program. The empirical analysis is carried out using publicly available data from the US federal government on 250 IT programs across 19 federal agencies. Consistent with our predictions, the results indicate that increasing scope of a federal IT program is associated with increasing f...
Crosslingual information retrieval (CLIR) finds application in aligning documents across comparable corpora. However, traditional CLIR, due to the term independence assumption, cannot consider the semantic similarity between the constituent words of a candidate pair of documents in two different languages. Moreover, traditional CLIR models score a document by aggregating only the weights of the constituent terms that match those of the query, while the other, non-matching terms of the document do not significantly contribute to the similarity function. Word vector embeddings make it possible to model the semantic distances between terms by applying standard distance metrics to their corresponding real-valued vectors. This paper develops a word-vector-embedding-based CLIR model that uses the average distances between the embedded word vectors of the source and target language documents to rank candidate document pairs. Our experiments with the WMT bilingu...
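A minimal sketch of the document-pair scoring idea, assuming source- and target-language word vectors already projected into one shared bilingual space; taking the mean cosine distance over all cross-language word pairs is one plausible reading of "average distances", not necessarily the paper's exact metric:

```python
import numpy as np

def avg_crosslingual_distance(src_vecs, tgt_vecs):
    # src_vecs, tgt_vecs: embedding matrices (one row per word) for the
    # source- and target-language documents, assumed to live in one
    # shared bilingual vector space.
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return float(np.mean(1.0 - S @ T.T))

def align(src_vecs, candidates):
    # Rank target-language candidates for one source document:
    # a smaller average distance means a better crosslingual match.
    scored = [(doc_id, avg_crosslingual_distance(src_vecs, vecs))
              for doc_id, vecs in candidates]
    return sorted(scored, key=lambda x: x[1])
```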
Entries in microblogging sites are very short. For example, a 'tweet' (a post or status update on the popular microblogging site Twitter) can contain at most 140 characters. To comply with this restriction, users frequently use abbreviations to express their thoughts, thus producing sentences that are often poorly structured or ungrammatical. As a result, it becomes a challenge to come up with methods for automatically identifying named entities (names of persons, organizations, locations, etc.). In this study, we use a four-step approach to automatic named entity recognition from microposts. First, we preprocess the micropost (e.g. replace abbreviations with actual words). Then we use an off-the-shelf part-of-speech tagger to tag the nouns. Next, we use the Google Search API to retrieve sentences containing the tagged nouns. Finally, we run a standard Named Entity Recognizer (NER) on the retrieved sentences. The tagged nouns are returned along with the tags assigned by the NER. This simple approach, using readily available components, yields promising results on standard benchmark data.
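The four steps map naturally onto off-the-shelf components. A minimal sketch using NLTK follows; the preprocessing rules are elided, and the web-search step is left as a stub since the exact Google Search API call is not specified here:

```python
import nltk
# One-time setup: nltk.download() for "punkt", "averaged_perceptron_tagger",
# "maxent_ne_chunker" and "words".

def extract_nouns(micropost):
    # Steps 1-2: (preprocessing omitted) POS-tag the micropost with an
    # off-the-shelf tagger and keep the tokens tagged as nouns (NN*).
    tokens = nltk.word_tokenize(micropost)
    return [tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

def web_sentences(noun):
    # Step 3 (stub): the paper retrieves sentences containing the noun
    # via the Google Search API; plug in any search backend here.
    raise NotImplementedError("search backend not specified in this sketch")

def tag_entities(sentences):
    # Step 4: run a standard NER over the retrieved sentences and
    # collect the entity label assigned to each chunked phrase.
    labels = {}
    for sent in sentences:
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees():
            if subtree.label() != "S":  # skip the root; keep entity chunks
                phrase = " ".join(tok for tok, _ in subtree.leaves())
                labels[phrase] = subtree.label()
    return labels
```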
Traditional information retrieval systems are primarily focused on finding topically relevant documents, which are descriptive of a particular query concept. However, when working with sources such as collections of news articles, a user might often want to identify not only documents that describe a news event, but also documents that explain the chain of events which potentially led to that event occurring. These associations might be complex, involving a number of causal factors. Motivated by this information need, we formulate the task of causal information retrieval. We provide a literature survey of causality-related research, and explain how the proposed task differs from standard retrieval problems. We then empirically investigate the ability of popular retrieval methods to successfully retrieve causally relevant documents. Our results demonstrate that the performance of traditional methods is not up to the mark for this task, and that causal information retrieval re...