Tanmay Basu - Academia.edu

Papers by Tanmay Basu

A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization

SN Applied Sciences, 2020

The k-nearest-neighbor (kNN) decision rule is a simple and robust classifier for text categorization. The performance of the kNN decision rule depends heavily upon the value of the neighborhood parameter k. The method categorizes a test document even if the difference between the numbers of members of two competing categories is one. Hence, the choice of k is crucial, as different values of k can change the result of text categorization. Moreover, text categorization is a challenging task, as text data are generally sparse and high dimensional. Note that assigning a document to a predefined category for an arbitrary value of k may not be accurate when there is no bound on the margin of majority voting. A method is thus proposed in the spirit of the nearest-neighbor decision rule, using a medoid-based weighting scheme to deal with these issues. In decision making, the method puts more weight on the training documents that not only lie close to the test document but also lie close to the medoid of their corresponding category, unlike the standard nearest-neighbor algorithms that stress only the documents close to the test document. The aim of the proposed classifier is to enrich the quality of decision making. The empirical results on various well-known text collections show that the proposed method performs better than several standard nearest-neighbor decision rules and the support vector machine classifier in terms of macro- and micro-averaged f-measure.
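The weighting idea described above can be sketched in a few lines. The following is a minimal illustration, assuming cosine similarity and a simple product of the two closeness terms as the vote weight; the function name and the exact combination rule are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def medoid_weighted_knn(X_train, y_train, x_test, k=5):
    """Sketch of a medoid-weighted kNN vote: each of the k nearest
    training documents contributes a weight combining its closeness to
    the test document with its closeness to the medoid of its own
    category (the product form is an assumption)."""
    # Cosine similarity of the test document to every training document.
    sims = X_train @ x_test / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_test) + 1e-12)
    # Medoid of each category: the member most similar to the others
    # (dot-product similarity used here for brevity).
    medoids = {}
    for c in np.unique(y_train):
        members = X_train[y_train == c]
        pairwise = members @ members.T
        medoids[c] = members[np.argmax(pairwise.sum(axis=1))]
    # Vote over the k nearest neighbours, weighted by both similarities.
    nearest = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in nearest:
        c = y_train[i]
        m = medoids[c]
        sim_to_medoid = X_train[i] @ m / (
            np.linalg.norm(X_train[i]) * np.linalg.norm(m) + 1e-12)
        votes[c] = votes.get(c, 0.0) + sims[i] * sim_to_medoid
    return max(votes, key=votes.get)
```

A boundary document of the wrong class thus gets a small weight unless it also sits near its category's medoid, which is the intuition the abstract describes.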

A supervised term selection technique for effective text categorization

International Journal of Machine Learning and Cybernetics, Sep 18, 2015

Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection thus focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection technique has been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then all the terms of the corpus are ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed term selection technique is compared with that of nine other term selection methods for categorization of several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
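The general scheme (score every term against the categories, then rank) can be illustrated with a toy scorer. The concentration-based score below is an assumption standing in for the paper's similarity-based score, which is not reproduced here.

```python
from collections import Counter

def term_category_scores(docs, labels):
    """Toy supervised term scorer: a term concentrated in a single
    category gets a high score; a term spread evenly over all
    categories gets a low one. Illustration only, not the paper's
    exact score."""
    categories = set(labels)
    # Document frequency of each term within each category.
    term_cat_counts = {c: Counter() for c in categories}
    for tokens, c in zip(docs, labels):
        term_cat_counts[c].update(set(tokens))
    vocab = set().union(*(tc.keys() for tc in term_cat_counts.values()))
    scores = {}
    for t in vocab:
        counts = [term_cat_counts[c][t] for c in categories]
        # Fraction of the term's occurrences falling in its best category.
        scores[t] = max(counts) / sum(counts)
    return scores
```

Ranking `scores` in descending order and keeping the top terms per category mirrors the selection step the abstract describes.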

IISERB@LT-EDI-ACL2022: A Bag of Words and Document Embeddings Based Framework to Identify Severity of Depression Over Social Media

Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

The DepSign-LT-EDI-ACL2022 shared task focuses on early prediction of the severity of depression from social media posts. The BioNLP group at the Department of Data Science and Engineering, Indian Institute of Science Education and Research Bhopal (IISERB) participated in this challenge and submitted three runs based on three different text mining models. The severity of depression was categorized into three classes, viz., no depression, moderate, and severe, and the data to build models were released as part of this shared task. The objective of this work is to identify relevant features from the given social media texts for effective text classification. As part of our investigation, we explored features derived from the text data using a document embeddings technique and a simple bag-of-words model following different weighting schemes. Subsequently, adaptive boosting, logistic regression, random forest and support vector machine (SVM) classifiers were used to identify the scale of depression from the given texts. The experimental analysis on the given validation data shows that the SVM classifier using the bag-of-words model with term frequency and inverse document frequency weighting outperforms the other models for identifying depression. However, this framework could not achieve a place among the top ten runs of the shared task. This paper describes the potential of the proposed framework as well as the possible reasons behind its mediocre performance on the given data.
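The best-performing run used a TF-IDF-weighted bag of words. A minimal sketch of that weighting scheme (the smoothing choices here are common defaults, assumed rather than taken from the shared-task code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency within a document, scaled by the
    log inverse document frequency across the corpus. Each document is
    returned as a sparse dict {term: weight}."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

A term appearing in every document (here, one occurring in all n documents) gets weight zero, which is why such ubiquitous terms carry no signal for the classifier.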

Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Knowledge and Information Systems, 2022

An Effective Machine Learning Framework for Data Elements Extraction from the Literature of Anxiety Outcome Measures to Build Systematic Review

Business Information Systems, 2019

The process of developing systematic reviews is a well-established method of collecting evidence from publications, following a predefined and explicit protocol designed to promote rigour, transparency and repeatability. The process is manual, involves a lot of time and requires expertise. The aim of this work is to build an effective framework using machine learning techniques to partially automate the process of systematic literature review by extracting the required data elements of anxiety outcome measures. A framework is thus proposed that initially builds a training corpus by extracting different data elements related to anxiety outcome measures from relevant publications. The publications are retrieved from Medline, EMBASE, CINAHL, AHMED and PsycINFO following a given set of rules defined by a research group in the United Kingdom reviewing comfort interventions in health care. Subsequently, the method trains a machine learning classifier on this training corpus to extract the desired data elements from new publications. The experiments are conducted on 48 publications containing anxiety outcome measures, with the aim of automatically extracting the sentences stating the mean and standard deviation of the outcome measures of different types of interventions to lessen anxiety. The experimental results show that the recall and precision of the proposed method using a random forest classifier are 100% and 83% respectively, which indicates that the method is able to extract all required data elements.

On supervised and unsupervised methodologies for mining of text data

The supervised and unsupervised methodologies of text mining using plain English-language text data are discussed. Some new supervised and unsupervised methodologies have been developed for effective mining of text data, after successfully overcoming some limitations of the existing techniques. The problems of unsupervised techniques of text mining, i.e., document clustering methods, are addressed. A new similarity measure between documents has been designed to improve the accuracy of measuring the content similarity between documents. Further, a hierarchical document clustering technique is designed using this similarity measure. The main significance of the clustering algorithm is that the number of clusters can be automatically determined by varying a similarity threshold of the proposed similarity measure. The algorithm experimentally outperforms several other document clustering techniques, but it suffers from high computational cost. Therefore another hybrid document...

Exploring the Performance of Baseline Text Mining Frameworks for Early Prediction of Self Harm Over Social Media

Task 2 of the CLEF eRisk 2021 challenge focuses on early prediction of self-harm based on sequentially processing pieces of text over social media. The workshop organized three tasks this year and released different corpora for the individual tasks; these were developed using posts and comments from Reddit, a popular social media platform. The text mining group at the Center for Computational Biology, University of Birmingham, UK participated in Task 2 of this challenge and submitted five runs for five different text mining frameworks. The paper explores the performance of different text mining techniques for early risk prediction of self-harm. The techniques involve various classifiers and feature engineering schemes. The simple bag-of-words model and Doc2Vec-based document embeddings have been used to build features from free text. Subsequently, AdaBoost, random forest, logistic regression and support vector machine (SVM) classifiers are used to identify self-harm from the ...

Towards Developing Effective Machine Learning Frameworks to Identify Toxic Conversations Over Social Media

The advent of social media has had a great impact on society, and a huge number of conversations take place over social media every day. Unfortunately, many of these conversations amount to personal attacks or include abusive comments, which may have an adverse effect on particular communities or individuals. This may raise doubts about the reliability and popularity of social media forums, which should be prevented. A set of machine learning classifiers is explored here to automatically identify abusive or toxic comments in social media posts. The empirical analysis on a data set of Wikipedia Talk Pages using different such classifiers, including a recurrent neural network, has shown significant improvement towards...

Yet Another Weighting Scheme for Collaborative Filtering Towards Effective Movie Recommendation

In this era of abundance, we humans strive to choose. Recommender systems are widely used to cope with this crisis of abundance by recommending items that we may like based on our previous consumption history. In this paper, a weighting technique is proposed in the spirit of the term weighting scheme of text retrieval systems, for item-based collaborative recommender systems. The proposed scheme has been used for effective movie recommendation. The empirical analysis on the benchmark MovieLens 100K data set has shown improvement over state-of-the-art recommender system algorithms.
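The text-retrieval analogy (users as documents, items as terms) can be sketched as item-based collaborative filtering with an IDF-style damping of very popular items. The damping formula below is an assumption inspired by that analogy, not the paper's exact scheme.

```python
import numpy as np

def item_similarity(ratings):
    """Item-based CF sketch: cosine similarity between item rating
    columns, with ubiquitously rated items down-weighted by an
    IDF-like factor (the smoothing is an illustrative choice)."""
    n_users, _ = ratings.shape
    popularity = (ratings > 0).sum(axis=0)   # raters per item
    idf = np.log(1.0 + n_users / np.maximum(popularity, 1))
    weighted = ratings * idf                 # damp very popular items
    norms = np.linalg.norm(weighted, axis=0) + 1e-12
    return (weighted.T @ weighted) / np.outer(norms, norms)
```

To recommend, one would score an unseen item for a user by a similarity-weighted average of the user's ratings on its most similar items.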

Effective Grouping of Unlabelled Texts using A New Similarity Measure for Spectral Clustering

Text clustering techniques segregate a corpus into several groups such that the documents in one group are close to each other and the documents across groups are dissimilar. The role of the similarity measure is vital to determining meaningful clusters. Existing similarity measures generally use the content similarity between documents to form clusters. Text data are generally sparse and high dimensional, so content similarity may not effectively capture the relatedness between documents. A similarity measure is proposed in this regard, which finds the similarity between two documents by finding their common neighbours in the given corpus. Moreover, it can explicitly identify dissimilar documents to prevent the grouping of dissimilar documents in the same cluster. The similarity measure is used to improve the performance of text clustering using a spectral method. The experimental results on standard corpora show that the proposed method performs better than state-of-the-art text ...
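The common-neighbour idea can be sketched as follows: each document's top-k neighbourhood is computed under some base similarity, and two documents are scored by how many neighbours they share, with a pair sharing none marked explicitly dissimilar. The -1 marker and the choice of k are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def shared_neighbour_similarity(base_sim, k=2):
    """Shared-neighbour similarity sketch over a base similarity
    matrix: out[i, j] = fraction of shared top-k neighbours, or -1
    when i and j share no neighbour (explicit dissimilarity)."""
    n = base_sim.shape[0]
    # Top-k neighbourhood of each document; a document's self-similarity
    # is maximal, so each document belongs to its own neighbourhood.
    neighbours = [set(np.argsort(base_sim[i])[::-1][:k]) for i in range(n)]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            common = len(neighbours[i] & neighbours[j])
            out[i, j] = common / k if common else -1.0
    return out
```

The resulting matrix can then be fed to a spectral clustering routine in place of the raw content similarity.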

Early Detection of Signs of Anorexia and Depression Over Social Media using Effective Machine Learning Frameworks

The CLEF eRisk 2018 challenge focuses on early detection of signs of depression or anorexia using posts or comments over social media. The eRisk lab organized two tasks this year and released two different corpora for the individual tasks. The corpora were developed using posts and comments from Reddit, a popular social media platform. The machine learning group at Ramakrishna Mission Vivekananda Educational and Research Institute (RKMVERI), India participated in this challenge and individually submitted five results to accomplish the objectives of these two tasks. The paper presents different machine learning techniques and analyzes their performance for early risk prediction of anorexia or depression. The techniques involve various classifiers and feature engineering schemes. The simple bag-of-words model has been used with AdaBoost, random forest, logistic regression and support vector machine classifiers to identify documents related to anorexia or depression in the indi...

An Effective Nearest Neighbor Classification Technique Using Medoid Based Weighting Scheme

The k-nearest neighbor decision rule is a simple, robust and widely used classifier. The method puts a point into a particular class if that class has the maximum representation among the k nearest neighbors of the point in the training set. However, determining the value of k is difficult. Moreover, nearest neighbor classification techniques put more stress on the data points that lie in the boundary region of individual classes. These methods rely upon those boundary points to decide the class label of a new data point, but the boundary points may not be a good representation of a particular class. A method is thus proposed here, in the spirit of the nearest neighbor classification technique, using a medoid-based weighting scheme to overcome these limitations. The experimental results on various standard benchmark data sets have shown that the proposed method outperforms different state-of-the-art classifiers.

Effective Text Classification by a Supervised Feature Selection Approach

2012 IEEE 12th International Conference on Data Mining Workshops, 2012

Augmenting Qualitative Text Analysis with Natural Language Processing: Methodological Study (Preprint)

BACKGROUND Qualitative research methods are increasingly being used across disciplines because of their ability to help investigators understand the perspectives of participants in their own words. However, qualitative analysis is a laborious and resource-intensive process. To achieve depth, researchers are limited to smaller sample sizes when analyzing text data. One potential method to address this concern is natural language processing (NLP). Qualitative text analysis involves researchers reading data, assigning code labels, and iteratively developing findings; NLP has the potential to automate part of this process. Unfortunately, little methodological research has been done to compare automatic coding using NLP techniques and qualitative coding, which is critical to establish the viability of NLP as a useful, rigorous analysis procedure. OBJECTIVE The purpose of this study was to compare the utility of a traditional qualitative text analysis, an NLP analysis, and an augmented ap...

A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

Information, 2021

The objective of systematic reviews is to address a research question by summarizing relevant studies following a detailed, comprehensive, and transparent plan and search protocol to reduce bias. Systematic reviews are very useful in the biomedical and healthcare domain; however, the data extraction phase of the systematic review process necessitates substantive expertise and is labour-intensive and time-consuming. The aim of this work is to partially automate the process of building systematic radiotherapy treatment literature reviews by summarizing the required data elements of geometric errors of radiotherapy from relevant literature using machine learning and natural language processing (NLP) approaches. A framework is developed in this study that initially builds a training corpus by extracting sentences containing different types of geometric errors of radiotherapy from relevant publications. The publications are retrieved from PubMed following a given set of rules defined by ...

A Feature Selection Method for Improved Document Classification

Advanced Data Mining and Applications, 2012

The aim of text document classification is to automatically assign a document to a predefined class. The main problems in document classification are the high dimensionality and sparsity of the data matrix. A new feature selection technique using the Google distance has been proposed in this article to effectively obtain a feature subset that improves the classification accuracy. The normalized Google distance can automatically extract the meaning of terms from the World Wide Web. It utilizes the number of hits returned by the Google search engine to compute the semantic relation between two terms. In the proposed approach, only the distance function of the Google distance is used to develop a relation between a feature and a class for document classification, and it is independent of Google search results. Every feature generates a score based on its relation with all the classes, and then all the features are ranked accordingly. The experimental results are presented using the kNN classifier on several TREC and Reuters data sets. Precision, recall, f-measure and classification accuracy are used to analyze the results. The proposed method is compared with four other feature selection methods for document classification: document frequency thresholding, information gain, mutual information and the χ2 statistic. The empirical studies show that the proposed method effectively performs feature selection in most cases, with either an improvement or no change in classification accuracy.
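The normalized Google distance has a well-known closed form, which the approach above applies with corpus counts instead of Google hit counts. A sketch (the mapping of the arguments to term and class counts is an assumption based on the abstract's description):

```python
import math

def ngd_score(f_x, f_y, f_xy, n):
    """Normalized Google distance, NGD(x, y) =
    (max(log f(x), log f(y)) - log f(x, y)) /
    (log N - min(log f(x), log f(y))).
    Here f_x = documents containing the term, f_y = documents in the
    class, f_xy = documents in the class containing the term, n = total
    documents; all counts are assumed positive."""
    if f_xy == 0:
        return float("inf")       # never co-occur: maximally distant
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

A term occurring in exactly the documents of a class gets distance 0, so ranking features by ascending distance to their best class gives the feature ordering the abstract describes.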

Towards effective discovery of natural communities in complex networks and implications in e-commerce

Electronic Commerce Research, 2020

Automated community detection is an important problem in the study of complex networks. The idea of community detection is closely related to the concept of data clustering in pattern recognition. Data clustering refers to the task of grouping similar objects and segregating dissimilar objects. The community detection problem can be thought of as finding groups of densely interconnected nodes with few connections to nodes outside the group. A node similarity measure is proposed here that finds the similarity between two nodes by considering both neighbors and non-neighbors of the two nodes. Subsequently, a method is introduced for identifying communities in complex networks using this node similarity measure and the notion of data clustering. A significant characteristic of the proposed method is that it does not need any prior knowledge about the actual communities of a network. Extensive experiments on several real world and artificial networks with known ground-truth communit...
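One simple way to use both neighbours and non-neighbours is to score two nodes by the fraction of other nodes on which their adjacency rows agree (both linked, or both unlinked). This simple-matching form is an illustration of the idea, not the paper's exact measure.

```python
def node_similarity(adj, u, v):
    """Similarity of nodes u and v in an adjacency matrix (list of
    lists of 0/1): fraction of other nodes w on which u and v agree,
    i.e. both are neighbours of w or both are non-neighbours of w."""
    n = len(adj)
    agree = sum(
        1 for w in range(n)
        if w not in (u, v) and adj[u][w] == adj[v][w]
    )
    return agree / (n - 2)
```

Two nodes in the same dense group agree on almost every other node, while a node and an outsider disagree often, which is the property a clustering step can exploit.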

Semantic Relation between Words with the Web as Information Source

Pattern Recognition and Machine Intelligence, 2009

Semantic relation is an important concept in information science. Nowadays it is widely used in the semantic web. This paper presents a measure to automatically determine the semantic relation between words using the web as a knowledge source. It explores whether two words are related or not, even if they are dissimilar in meaning. The proposed formula is a function of the frequency of occurrence of the words in each document of the corpus. This relationship measure will be useful for extracting semantic information from the web. Experimental evaluation on ten manually selected word pairs using the WebKB data as the information source demonstrates the effectiveness of the proposed semantic relation.

A Similarity Based Supervised Decision Rule for Qualitative Improvement of Text Categorization

Fundamenta Informaticae, 2015

The similarity-based decision rule computes the similarity between a new test document and the existing documents of the training set that belong to various categories. The new document is assigned to the category in which it has the maximum number of similar documents. A document-similarity-based supervised decision rule for text categorization is proposed in this article. The similarity measure determines the similarity between two documents by finding their distances to all the documents of the training set, and it can explicitly identify two dissimilar documents. The decision rule assigns a test document to the best of the competing categories only if the best category beats the next competing category by a previously fixed margin. Thus the proposed rule enhances the certainty of the decision. The salient feature of the decision rule is that it never assigns a document arbitrarily to a category when the decision is not sufficiently certain. The performance of the proposed decision rule for text categorization is compared with some well-known classification techniques, e.g., the k-nearest neighbor decision rule, support vector machine, naive Bayes, etc., using various TREC and Reuters corpora. The empirical results show that the proposed method performs significantly better than the other classifiers for text categorization.
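The margin condition is the heart of the rule and is easy to state in code. A minimal sketch, in which the margin value and the abstention signal (returning None when no category wins clearly) are illustrative assumptions:

```python
def margin_decision(category_scores, margin=2):
    """Assign to the best-scoring category only if it beats the
    runner-up by at least the pre-fixed margin; otherwise abstain
    (return None) rather than decide arbitrarily."""
    ranked = sorted(category_scores.items(),
                    key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None
```

The abstention branch is what the abstract means by never assigning a document arbitrarily when the decision is not sufficiently certain.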

Research paper thumbnail of A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization

SN Applied Sciences, 2020

The k-nearest-neighbor (kNN) decision rule is a simple and robust classifier for text categorizat... more The k-nearest-neighbor (kNN) decision rule is a simple and robust classifier for text categorization. The performance of kNN decision rule depends heavily upon the value of the neighborhood parameter k. The method categorize a test document even if the difference between the number of members of two competing categories is one. Hence, choice of k is crucial as different values of k can change the result of text categorization. Moreover, text categorization is a challenging task as the text data are generally sparse and high dimensional. Note that, assigning a document to a predefined category for an arbitrary value of k may not be accurate when there is no bound on the margin of majority voting. A method is thus proposed in spirit of the nearest-neighbor decision rule using a medoid-based weighting scheme to deal with these issues. The method puts more weightage on the training documents that are not only lie close to the test document but also lie close to the medoid of its corresponding category in decision making, unlike the standard nearest-neighbor algorithms that stress on the documents that are just close to the test document. The aim of the proposed classifier is to enrich the quality of decision making. The empirical results show that the proposed method performs better than different standard nearest-neighbor decision rules and support vector machine classifier using various well-known text collections in terms of macro-and micro-averaged f-measure.

Research paper thumbnail of A supervised term selection technique for effective text categorization

International Journal of Machine Learning and Cybernetics, Sep 18, 2015

Term selection methods in text categorization effectively reduce the size of the vocabulary to im... more Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduces the effectiveness of text categorization. Term selection, thus, focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection technique have been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then all the terms of the corpus are ranked accordingly. Subsequently the significant terms of each category are selected to create the final subset of terms irrespective of the size of the category. The performance of the proposed term selection technique is compared with the performance of nine other term selection methods for categorization of several well known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most of the cases of all the corpora.

Research paper thumbnail of IISERB@LT-EDI-ACL2022: A Bag of Words and Document Embeddings Based Framework to Identify Severity of Depression Over Social Media

Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

The DepSign-LT-EDI-ACL2022 shared task focuses on early prediction of severity of depression over... more The DepSign-LT-EDI-ACL2022 shared task focuses on early prediction of severity of depression over social media posts. The BioNLP group at Department of Data Science and Engineering in Indian Institute of Science Education and Research Bhopal (IISERB) has participated in this challenge and submitted three runs based on three different text mining models. The severity of depression were categorized into three classes, viz., no depression, moderate, and severe and the data to build models were released as part of this shared task. The objective of this work is to identify relevant features from the given social media texts for effective text classification. As part of our investigation, we explored features derived from text data using document embeddings technique and simple bag of words model following different weighting schemes. Subsequently, adaptive boosting, logistic regression, random forest and support vector machine (SVM) classifiers were used to identify the scale of depression from the given texts. The experimental analysis on the given validation data show that the SVM classifier using the bag of words model following term frequency and inverse document frequency weighting scheme outperforms the other models for identifying depression. However, this framework could not achieve a place among the top ten runs of the shared task. This paper describes the potential of the proposed framework as well as the possible reasons behind mediocre performance on the given data.

Research paper thumbnail of Postimpact similarity: a similarity measure for effective grouping of unlabelled text using spectral clustering

Knowledge and Information Systems, 2022

Research paper thumbnail of An Effective Machine Learning Framework for Data Elements Extraction from the Literature of Anxiety Outcome Measures to Build Systematic Review

Business Information Systems, 2019

The process of developing systematic reviews is a well established method of collecting evidence ... more The process of developing systematic reviews is a well established method of collecting evidence from publications, where it follows a predefined and explicit protocol design to promote rigour, transparency and repeatability. The process is manual and involves lot of time and needs expertise. The aim of this work is to build an effective framework using machine learning techniques to partially automate the process of systematic literature review by extracting required data elements of anxiety outcome measures. A framework is thus proposed that initially builds a training corpus by extracting different data elements related to anxiety outcome measures from relevant publications. The publications are retrieved from Medline, EMBASE, CINAHL, AHMED and Pyscinfo following a given set of rules defined by a research group in the United Kingdom reviewing comfort interventions in health care. Subsequently, the method trains a machine learning classifier using this training corpus to extract the desired data elements from new publications. The experiments are conducted on 48 publications containing anxiety outcome measures with an aim to automatically extract the sentences stating the mean and standard deviation of the measures of outcomes of different types of interventions to lessen anxiety. The experimental results show that the recall and precision of the proposed method using random forest classifier are respectively 100% and 83%, which indicates that the method is able to extract all required data elements.

Research paper thumbnail of On supervised and unsupervised methodologies for mining of text data

The supervised and unsupervised methodologies of text mining using the plain text data of English... more The supervised and unsupervised methodologies of text mining using the plain text data of English language have been discussed. Some new supervised and unsupervised methodologies have been developed for effective mining of the text data after successfully overcoming some limitations of the existing techniques. The problems of unsupervised techniques of text mining, i.e., document clustering methods are addressed. A new similarity measure between documents has been designed to improve the accuracy of measuring the content similarity between documents. Further, a hierarchical document clustering technique is designed using this similarity measure. The main significance of the clustering algorithm is that the number of clusters can be automatically determined by varying a similarity threshold of the proposed similarity measure. The algorithm experimentally outperforms several other document clustering techniques, but it suffers from computational cost. Therefore another hybrid document...

Research paper thumbnail of Exploring the Performance of Baseline Text Mining Frameworks for Early Prediction of Self Harm Over Social Media

Task 2 of the CLEF eRisk 2021 challenge focuses on early prediction of self-harm based on sequentially processing pieces of text over social media. The workshop has organized three tasks this year and released different corpora for the individual tasks, developed using posts and comments over Reddit, a popular social media platform. The text mining group at the Center for Computational Biology, University of Birmingham, UK has participated in Task 2 of this challenge and submitted five runs for five different text mining frameworks. The paper explores the performance of different text mining techniques for early risk prediction of self-harm. The techniques involve various classifiers and feature engineering schemes. The simple bag-of-words model and Doc2Vec-based document embeddings have been used to build features from free text. Subsequently, ada boost, random forest, logistic regression and support vector machine (SVM) classifiers are used to identify self-harm from the ...

Research paper thumbnail of Towards Developing Effective Machine Learning Frameworks to Identify Toxic Conversations Over Social Media

The advent of social media has had a great impact on society, and a huge number of conversations take place over social media every day. Unfortunately, many of these conversations involve personal attacks or abusive comments, which may have adverse effects on particular communities or individuals. The same may raise doubts about the liability and popularity of social media forums, which should be prevented. A set of machine learning classifiers has been explored here to automatically identify abusive or toxic comments from social media posts. The empirical analysis on a data set of Wikipedia Talk Pages using different such classifiers, including a recurrent neural network, has shown significant improvement towards

Research paper thumbnail of Yet Another Weighting Scheme for Collaborative Filtering Towards Effective Movie Recommendation

In this era of abundance, we humans strive to choose. Recommender systems are widely used to cope with this crisis of abundance by recommending items that we may like based on our previous consumption history. In this paper, a weighting technique is proposed in the spirit of the term weighting scheme of text retrieval systems for item-based collaborative recommender systems. The proposed scheme has been used for effective movie recommendation. The empirical analysis on the benchmark MovieLens 100K data set has shown improvement over state-of-the-art recommender system algorithms.
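
The paper's exact weighting scheme is not given in the abstract; a hedged sketch of the general idea borrows inverse user frequency from text retrieval's IDF, damping users who rate almost everything when computing item-item similarity. The toy ratings and the particular weighting are invented stand-ins:

```python
# Hedged sketch: item-based collaborative filtering where each user's
# contribution to item-item similarity is damped by an IDF-like inverse user
# frequency weight (users who rate many items carry less signal, just as IDF
# down-weights common terms).
import math

# ratings[user][item] = rating; toy data invented for illustration
ratings = {
    "u1": {"Alien": 5, "Aliens": 4, "Heat": 1},
    "u2": {"Alien": 4, "Aliens": 5},
    "u3": {"Alien": 2, "Heat": 5, "Ronin": 4},
}
n_items = len({i for r in ratings.values() for i in r})

def iuf(user):
    # inverse user frequency, analogous to IDF over terms
    return math.log(n_items / len(ratings[user]))

def item_sim(i, j):
    # weighted cosine over users who rated both items
    num = den_i = den_j = 0.0
    for u, r in ratings.items():
        if i in r and j in r:
            w = iuf(u)
            num += w * r[i] * r[j]
            den_i += w * r[i] ** 2
            den_j += w * r[j] ** 2
    return num / math.sqrt(den_i * den_j) if num else 0.0

print(item_sim("Alien", "Aliens"))
```

Recommendation then proceeds as in standard item-based CF: score unseen items for a user by similarity to the items the user has already rated.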

Research paper thumbnail of Effective Grouping of Unlabelled Texts using A New Similarity Measure for Spectral Clustering

Text clustering techniques segregate a corpus into several groups such that the documents in one group are close to each other and the documents across groups are dissimilar. The role of the similarity measure is vital to determining meaningful clusters. The existing similarity measures generally find the content similarity between documents to form clusters. The text data are generally sparse and high dimensional; therefore content similarity may not effectively find the relatedness between documents. A similarity measure is proposed in this regard, which finds the similarity between two documents by finding their common neighbours in the given corpus. Moreover, it can explicitly identify dissimilar documents to prevent grouping of dissimilar documents in the same cluster. The similarity measure is used to improve the performance of text clustering using a spectral method. The experimental results on standard corpora show that the proposed method performs better than state of the art text ...
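
The common-neighbour idea can be sketched as a shared-nearest-neighbour similarity: represent each document by its k nearest neighbours in the corpus and score a pair by the overlap of those neighbour sets, so an empty overlap explicitly marks the pair dissimilar. The vectors, k, and the exact overlap formula below are invented for illustration:

```python
# Hedged sketch: shared-neighbour similarity between documents.  Two documents
# are similar when their nearest-neighbour sets overlap; a score of 0
# explicitly marks them dissimilar.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def neighbours(vectors, i, k):
    # indices of the k documents most similar to document i
    ranked = sorted(
        (j for j in range(len(vectors)) if j != i),
        key=lambda j: cosine(vectors[i], vectors[j]),
        reverse=True,
    )
    return set(ranked[:k])

def shared_neighbour_sim(vectors, i, j, k=2):
    ni, nj = neighbours(vectors, i, k), neighbours(vectors, j, k)
    return len(ni & nj) / k  # 0 means explicitly dissimilar

docs = [[1, 1, 0], [1, 0.8, 0], [0.9, 1, 0], [0, 0, 1], [0, 0.2, 1]]
print(shared_neighbour_sim(docs, 0, 1))  # same topic: neighbours overlap
print(shared_neighbour_sim(docs, 0, 3))  # different topics: no overlap
```

A matrix of such pairwise scores can then feed a spectral clustering step in place of raw cosine similarities.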

Research paper thumbnail of Early Detection of Signs of Anorexia and Depression Over Social Media using Effective Machine Learning Frameworks

The CLEF eRisk 2018 challenge focuses on early detection of signs of depression or anorexia using posts or comments over social media. The eRisk lab has organized two tasks this year and released two different corpora for the individual tasks. The corpora are developed using posts and comments over Reddit, a popular social media platform. The machine learning group at Ramakrishna Mission Vivekananda Educational and Research Institute (RKMVERI), India has participated in this challenge and individually submitted five results to accomplish the objectives of these two tasks. The paper presents different machine learning techniques and analyzes their performance for early risk prediction of anorexia or depression. The techniques involve various classifiers and feature engineering schemes. The simple bag-of-words model has been used with ada boost, random forest, logistic regression and support vector machine classifiers to identify documents related to anorexia or depression in the indi...
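
The bag-of-words-plus-classifiers pipeline named in the abstract can be sketched as below. The toy posts and labels are invented, and the default hyperparameters are a stand-in for whatever the submitted runs actually used:

```python
# Hedged sketch: bag-of-words features with the four classifier families
# mentioned in the abstract.  Posts and labels are invented toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

posts = [
    "i have been skipping meals again and counting everything",
    "i feel hopeless and cannot get out of bed most days",
    "great run this morning, training for the half marathon",
    "finished the new sci-fi novel, highly recommend it",
]
labels = [1, 1, 0, 0]  # 1 = at-risk post (toy labels)

X = CountVectorizer().fit_transform(posts)
models = {
    "adaboost": AdaBoostClassifier(),
    "rf": RandomForestClassifier(random_state=0),
    "logreg": LogisticRegression(),
    "svm": LinearSVC(),
}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.predict(X))
```

In the early-detection setting, a user's posts arrive sequentially, so such a classifier would be re-applied to the growing concatenation of each user's writings and an alert raised once the predicted risk is confident enough.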

Research paper thumbnail of An Effective Nearest Neighbor Classification Technique Using Medoid Based Weighting Scheme

The k-nearest neighbor decision rule is a simple, robust and widely used classifier. The method puts a point into a particular class if the class has the maximum representation among the k nearest neighbors of the point in the training set. However, determining the value of k is difficult. Moreover, nearest neighbor classification techniques put more stress on the data points that lie on the boundary region of individual classes. These methods rely upon those boundary points to decide the class label of a new data point, but the boundary points may not be a good representation of a particular class. A method is thus proposed here in the spirit of the nearest neighbor classification technique, using a medoid-based weighting scheme to overcome these limitations. The experimental results using various standard benchmark data sets have shown that the proposed method outperforms different state-of-the-art classifiers.
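
The general idea can be sketched as follows: a neighbour's vote counts more when it lies close both to the test point and to the medoid of its own class, so isolated boundary points alone cannot decide the label. The inverse-distance product used as the weight here is an invented stand-in for the paper's actual scheme:

```python
# Hedged sketch of medoid-weighted nearest-neighbour voting.  The weighting
# formula (product of inverse distances) is illustrative, not the paper's.
import math
from collections import defaultdict

def dist(a, b):
    return math.dist(a, b)

def medoid(points):
    # the class member with minimum total distance to its classmates
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def classify(train, test_point, k=3):
    by_class = defaultdict(list)
    for x, y in train:
        by_class[y].append(x)
    medoids = {c: medoid(pts) for c, pts in by_class.items()}
    neighbours = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    scores = defaultdict(float)
    for x, y in neighbours:
        # close to the test point AND close to its own class medoid => big vote
        w = 1.0 / (1e-9 + dist(x, test_point)) / (1e-9 + dist(x, medoids[y]))
        scores[y] += w
    return max(scores, key=scores.get)

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(classify(train, (1, 1)))  # → "a"
```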

Research paper thumbnail of Effective Text Classification by a Supervised Feature Selection Approach

2012 IEEE 12th International Conference on Data Mining Workshops, 2012

Research paper thumbnail of Augmenting Qualitative Text Analysis with Natural Language Processing: Methodological Study (Preprint)

BACKGROUND Qualitative research methods are increasingly being used across disciplines because of their ability to help investigators understand the perspectives of participants in their own words. However, qualitative analysis is a laborious and resource-intensive process. To achieve depth, researchers are limited to smaller sample sizes when analyzing text data. One potential method to address this concern is natural language processing (NLP). Qualitative text analysis involves researchers reading data, assigning code labels, and iteratively developing findings; NLP has the potential to automate part of this process. Unfortunately, little methodological research has been done to compare automatic coding using NLP techniques and qualitative coding, which is critical to establish the viability of NLP as a useful, rigorous analysis procedure. OBJECTIVE The purpose of this study was to compare the utility of a traditional qualitative text analysis, an NLP analysis, and an augmented ap...

Research paper thumbnail of A Sentence Classification Framework to Identify Geometric Errors in Radiation Therapy from Relevant Literature

Information, 2021

The objective of systematic reviews is to address a research question by summarizing relevant studies following a detailed, comprehensive, and transparent plan and search protocol to reduce bias. Systematic reviews are very useful in the biomedical and healthcare domain; however, the data extraction phase of the systematic review process necessitates substantive expertise and is labour-intensive and time-consuming. The aim of this work is to partially automate the process of building systematic radiotherapy treatment literature reviews by summarizing the required data elements of geometric errors of radiotherapy from relevant literature using machine learning and natural language processing (NLP) approaches. A framework is developed in this study that initially builds a training corpus by extracting sentences containing different types of geometric errors of radiotherapy from relevant publications. The publications are retrieved from PubMed following a given set of rules defined by ...

Research paper thumbnail of A Feature Selection Method for Improved Document Classification

Advanced Data Mining and Applications, 2012

The aim of text document classification is to automatically group a document into a predefined class. The main problems of document classification are the high dimensionality and sparsity of the data matrix. A new feature selection technique using the google distance has been proposed in this article to effectively obtain a feature subset which improves the classification accuracy. Normalized google distance can automatically extract the meaning of terms from the world wide web. It utilizes the number of hits returned by the google search engine to compute the semantic relation between two terms. In the proposed approach, only the distance function of google distance is used to develop a relation between a feature and a class for document classification, and it is independent of google search results. Every feature generates a score based on its relation with all the classes, and then all the features are ranked accordingly. The experimental results are presented using a kNN classifier on several TREC and Reuters data sets. Precision, recall, f-measure and classification accuracy are used to analyze the results. The proposed method is compared with four other feature selection methods for document classification: document frequency thresholding, information gain, mutual information and the χ2 statistic. The empirical studies have shown that the proposed method effectively performs feature selection in most of the cases, with either an improvement or no change in classification accuracy.
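
Using the normalized google distance (NGD) formula with corpus document counts instead of search-engine hits, as the abstract describes, can be sketched as follows. NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))); the toy corpus and the class-count substitution are invented for illustration:

```python
# Hedged sketch: score a term against a class with the NGD formula, fed with
# document counts from the corpus rather than google hit counts.
import math

def ngd(f_x, f_y, f_xy, n):
    # normalized google distance over counts; smaller means more related
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# documents as (class, set of terms); toy data
corpus = [
    ("sport", {"match", "goal", "team"}),
    ("sport", {"goal", "score", "team"}),
    ("politics", {"vote", "party", "team"}),
    ("politics", {"vote", "policy"}),
]
n = len(corpus)

def term_class_distance(term, cls):
    f_t = sum(term in terms for _, terms in corpus)      # docs containing term
    f_c = sum(c == cls for c, _ in corpus)               # docs in the class
    f_tc = sum(c == cls and term in terms for c, terms in corpus)
    return ngd(f_t, f_c, f_tc, n) if f_tc else float("inf")

print(term_class_distance("goal", "sport"))  # 0.0: occurs only in this class
print(term_class_distance("vote", "sport"))  # inf: never co-occurs
```

Each term would then be scored against every class, ranked by its best (smallest) distance, and the top-ranked terms retained as the feature subset.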

Research paper thumbnail of Towards effective discovery of natural communities in complex networks and implications in e-commerce

Electronic Commerce Research, 2020

Automated community detection is an important problem in the study of complex networks. The idea of community detection is closely related to the concept of data clustering in pattern recognition. Data clustering refers to the task of grouping similar objects and segregating dissimilar objects. The community detection problem can be thought of as finding groups of densely interconnected nodes with few connections to nodes outside the group. A node similarity measure is proposed here that finds the similarity between two nodes by considering both neighbors and non-neighbors of these two nodes. Subsequently, a method is introduced for identifying communities in complex networks using this node similarity measure and the notion of data clustering. The significant characteristic of the proposed method is that it does not need any prior knowledge about the actual communities of a network. Extensive experiments on several real world and artificial networks with known ground-truth communit...
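
One simple way a similarity measure can use non-neighbours as well as neighbours is to count the fraction of other nodes on which two nodes agree, i.e. both are connected to them or both are not. This simple matching coefficient over adjacency rows is an invented stand-in for the paper's measure:

```python
# Hedged sketch: node similarity that considers neighbours and non-neighbours
# alike -- the fraction of remaining nodes on which two nodes agree.
def node_similarity(adj, u, v):
    others = [w for w in adj if w not in (u, v)]
    agree = sum((w in adj[u]) == (w in adj[v]) for w in others)
    return agree / len(others) if others else 0.0

# toy undirected graph with two obvious communities: {0, 1, 2} and {3, 4, 5}
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(node_similarity(adj, 0, 1))  # same community: full agreement
print(node_similarity(adj, 0, 4))  # different communities: no agreement
```

Clustering nodes on such pairwise scores then recovers the communities without knowing their number in advance.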

Research paper thumbnail of Semantic Relation between Words with the Web as Information Source

Pattern Recognition and Machine Intelligence, 2009

Semantic relation is an important concept of information science. Nowadays it is widely used in the semantic web. This paper presents a measure to automatically determine the semantic relation between words using the web as a knowledge source. It explores whether two words are related or not, even if they are dissimilar in meaning. The proposed formula is a function of the frequency of occurrences of the words in each document in the corpus. This relationship measure will be useful to extract semantic information from the web. Experimental evaluation on ten manually selected word pairs using the WebKb data as information source demonstrates the effectiveness of the proposed semantic relation.

Research paper thumbnail of A medoid-based weighting scheme for nearest-neighbor decision rule toward effective text categorization

SN Applied Sciences, 2020

The k-nearest-neighbor (kNN) decision rule is a simple and robust classifier for text categorization. The performance of the kNN decision rule depends heavily upon the value of the neighborhood parameter k. The method categorizes a test document even if the difference between the number of members of two competing categories is one. Hence, the choice of k is crucial, as different values of k can change the result of text categorization. Moreover, text categorization is a challenging task as the text data are generally sparse and high dimensional. Note that assigning a document to a predefined category for an arbitrary value of k may not be accurate when there is no bound on the margin of majority voting. A method is thus proposed in the spirit of the nearest-neighbor decision rule using a medoid-based weighting scheme to deal with these issues. The method puts more weightage on the training documents that not only lie close to the test document but also lie close to the medoid of its corresponding category in decision making, unlike the standard nearest-neighbor algorithms that stress the documents that are just close to the test document.

Research paper thumbnail of A Similarity Based Supervised Decision Rule for Qualitative Improvement of Text Categorization

Fundamenta Informaticae, 2015

The similarity-based decision rule computes the similarity between a new test document and the existing documents of the training set that belong to various categories. The new document is grouped into the category in which it has the maximum number of similar documents. A document similarity based supervised decision rule for text categorization is proposed in this article. The similarity measure determines the similarity between two documents by finding their distances to all the documents of the training set, and it can explicitly identify two dissimilar documents. The decision rule assigns a test document to the best one among the competing categories if the best category beats the next competing category by a previously fixed margin. Thus the proposed rule enhances the certainty of the decision. The salient feature of the decision rule is that it never assigns a document arbitrarily to a category when the decision is not certain. The performance of the proposed decision rule for text categorization is compared with some well known classification techniques, e.g., the k-nearest neighbor decision rule, support vector machine, naive Bayes etc., using various TREC and Reuters corpora. The empirical results have shown that the proposed method performs significantly better than the other classifiers for text categorization.
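
The margin condition itself is easy to sketch: assign the best category only when it beats the runner-up by a fixed margin, and otherwise withhold the decision. How the per-category counts of similar documents are obtained is the paper's measure; here they are simply given as input:

```python
# Hedged sketch of the margin rule: commit to the top category only when its
# lead over the runner-up reaches a fixed margin, else abstain.
from collections import Counter

def margin_decision(similar_counts, margin=2):
    # similar_counts: Counter mapping category -> number of training documents
    # judged similar to the test document (computation of these counts is the
    # paper's similarity measure, not reproduced here)
    ranked = similar_counts.most_common(2)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None  # decision withheld: not certain enough

print(margin_decision(Counter({"sport": 7, "politics": 3})))  # → "sport"
print(margin_decision(Counter({"sport": 5, "politics": 4})))  # → None
```

Abstaining on narrow votes is what the abstract means by never assigning a document arbitrarily when the decision is not certain.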