Saleti Sumalatha - Academia.edu

Papers by Saleti Sumalatha

Analyzing the Health Data: An Application of High Utility Itemset Mining

Optimizing Predictive Models for Parkinson's Disease Diagnosis

Advances in Medical Technologies and Clinical Practice book series, Feb 23, 2024

An efficient ensemble-based Machine Learning for breast cancer detection

Biomedical Signal Processing and Control

Optimizing fetal health prediction: Ensemble modeling with fusion of feature selection and extraction techniques for cardiotocography data

Computational Biology and Chemistry

An Enhancement in the Efficiency of Disease Prediction Using Feature Extraction and Feature Selection

Advances in Healthcare Information Systems and Administration book series, Jun 30, 2023

A comparison of various machine learning algorithms and execution of Flask deployment on essay grading

International Journal of Power Electronics and Drive Systems, Jun 1, 2023

Students' performance can be assessed by grading the answers they write during examinations. Currently, teachers grade students manually, which is cumbersome given the increasing student-teacher ratio. Moreover, due to the coronavirus disease (COVID-19) pandemic, most educational institutions have adopted online teaching and assessment. Assessment is needed to measure a student's learning ability. Current grading systems work well for multiple-choice questions, but there is no grading system for evaluating essays. In this paper, we studied different machine learning and natural language processing techniques for automated essay scoring/grading (AES/G). Data imbalance, that is, an uneven distribution of essay scores in the training data, makes essay scores harder to predict. We handled this issue using random oversampling, which produces an even distribution of essay scores. We also built a web application using Flask and deployed the machine learning models. All the models were evaluated using accuracy, precision, recall, and F1-score. The random forest algorithm outperformed the other algorithms, with an accuracy of 97.67%, precision of 97.62%, recall of 97.67%, and F1-score of 97.58%.
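The random oversampling step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `random_oversample`, the toy essays, and the scores are hypothetical, and only the Python standard library is used.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: score 3 is over-represented.
essays = ["essay A", "essay B", "essay C", "essay D", "essay E"]
scores = [3, 3, 3, 4, 5]
balanced_essays, balanced_scores = random_oversample(essays, scores)
print(Counter(balanced_scores))  # each score now appears 3 times
```

In practice, oversampling is usually applied to the training split only, so that duplicated samples do not leak into the evaluation data.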

Heart Disease Prediction Using Novel Quine McCluskey Binary Classifier (QMBC)

IEEE Access

Cardiovascular disease is the leading cause of mortality worldwide, responsible for around a third of all deaths. To assist medical professionals in quickly identifying and diagnosing patients, numerous machine learning and data mining techniques are used to predict the disease. Many researchers have developed models to improve the efficiency of these predictions. Feature selection and extraction techniques remove unnecessary features from the dataset, thereby reducing computation time and increasing the efficiency of the models. In this study, we introduce a new ensemble technique, the Quine McCluskey Binary Classifier (QMBC), for distinguishing patients diagnosed with some form of heart disease from those who are not. The QMBC model uses an ensemble of seven models (logistic regression, decision tree, random forest, K-nearest neighbour, naive Bayes, support vector machine, and multilayer perceptron) and performs exceptionally well on binary class datasets. We employ feature selection and feature extraction techniques to accelerate the prediction process. We use Chi-Square and ANOVA approaches to identify the top 10 features and create a subset of the dataset. We then apply Principal Component Analysis to the subset to identify 9 principal components. We use the ensemble of all seven models and the Quine McCluskey technique to obtain the minimum Boolean expression for the target feature. The outputs of the seven models (x0, x1, x2, ..., x6) are treated as independent features, while the target attribute is the dependent feature. We combine the predicted outcomes of the seven ML models with the target feature to form a new dataset. We apply the ensemble model to this dataset, using the Quine McCluskey minimum Boolean equation built with an 80:20 train-to-test ratio. Our proposed QMBC model surpasses all current state-of-the-art models and previously suggested methods put forward by various researchers.

Index terms: Machine learning, chi-square, ANOVA, principal component analysis, Quine McCluskey technique, ensemble approach.
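To make the Quine McCluskey step concrete, here is a small sketch of how base-model outputs could be reduced to a Boolean expression. It is an illustration under assumptions rather than the QMBC implementation: it uses three hypothetical base models instead of seven, treats unseen output combinations as negative, and stops at the prime implicants without the final minimum-cover selection. All function names are made up for this example.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants that differ in exactly one fixed bit, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(minterms):
    """Repeatedly merge implicants; those that can no longer merge are prime."""
    terms, primes = set(minterms), set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= terms - used
        terms = merged
    return primes

def covers(implicant, bits):
    """True if every fixed position of the implicant matches the bit string."""
    return all(p in ("-", b) for p, b in zip(implicant, bits))

def predict(primes, bits):
    """Predict 1 when any prime implicant covers the base-model outputs."""
    return int(any(covers(p, bits) for p in primes))

# Hypothetical rows: outputs of three base models (x0 x1 x2) and the true label.
rows = [("011", 1), ("101", 1), ("110", 1), ("111", 1),
        ("000", 0), ("001", 0), ("010", 0), ("100", 0)]
minterms = {bits for bits, label in rows if label == 1}
primes = prime_implicants(minterms)
print(sorted(primes))  # ['-11', '1-1', '11-']
```

In this toy case the three prime implicants correspond to x1·x2 + x0·x2 + x0·x1, a majority vote. All three are essential here, so the result is already minimal; in general a cover-selection step would follow.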

A Comparative Analysis of the Evolution of DNA Sequencing Techniques along with the Accuracy Prediction of a Sample DNA Sequence Dataset using Machine Learning

2023 2nd International Conference on Paradigm Shifts in Communications Embedded Systems, Machine Learning and Signal Processing (PCEMS)

A Comparison of Various Class Balancing and Dimensionality Reduction Techniques on Customer Churn Prediction

2022 IEEE 7th International Conference on Recent Advances and Innovations in Engineering (ICRAIE)

Exploring Patterns and Correlations Between Cryptocurrencies and Forecasting Crypto Prices Using Influential Tweets

Communications in Computer and Information Science, 2022

Mining Spatio-Temporal Sequential Patterns Using MapReduce Approach

Communications in Computer and Information Science, 2022

Constraint Pushing Multi-threshold Framework for High Utility Time Interval Sequential Pattern Mining

Communications in Computer and Information Science, 2022

Secure Lightweight Macro Payment Protocol for Mobile Payments Using IoT-Based Wearable Devices

Incremental mining of high utility sequential patterns using MapReduce paradigm

A novel Bit Vector Product algorithm for mining frequent itemsets from large datasets using MapReduce framework

Cluster Computing, 2017

Frequent itemset mining (FIM) is an interesting sub-area of research in the field of data mining. As datasets grow, conventional FIM algorithms become unsuitable, and efforts are being made to migrate to big data frameworks and design algorithms using MapReduce-like computing paradigms. We are likewise interested in designing a MapReduce-based algorithm. Initially, our Parallel Compression algorithm makes the data simpler to handle. A novel bit vector data structure is proposed to maintain compressed transactions; it is formed by scanning the dataset only once. Our Bit Vector Product algorithm follows the MapReduce approach and effectively searches for frequent itemsets in a given list of transactions. Experimental results are presented to demonstrate the efficacy of our approach over some recent works.
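The general idea of counting support with bit vectors can be sketched as follows. This is a generic vertical-bitmap illustration, not the paper's Bit Vector Product algorithm or its compression step; the function names and the toy transactions are assumptions made for the example.

```python
def build_bit_vectors(transactions):
    """One pass over the data: bit `tid` of an item's vector is set when
    transaction `tid` contains that item."""
    vectors = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vectors[item] = vectors.get(item, 0) | (1 << tid)
    return vectors

def support(itemset, vectors, n_transactions):
    """Support of an itemset = number of set bits in the AND of its vectors."""
    acc = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        acc &= vectors.get(item, 0)
    return bin(acc).count("1")

# Hypothetical toy dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
vectors = build_bit_vectors(transactions)
print(support({"bread", "milk"}, vectors, len(transactions)))  # 2
```

Python integers serve as arbitrary-length bit vectors here; once the vectors are built, each support query needs only bitwise ANDs and a population count, with no further scan of the transactions.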

A MapReduce solution for incremental mining of sequential patterns from big data

Expert Systems with Applications, 2019

Sequential Pattern Mining (SPM) is a popular data mining task with broad applications. With the advent of big data, traditional SPM algorithms are not scalable. Hence, many researchers have migrated to big data frameworks such as MapReduce and proposed distributed algorithms. However, the existing MapReduce algorithms assume the data to be static and do not handle incremental database updates. Instead, they re-mine the entire updated database whenever new sequences are inserted. In this paper, we propose an efficient distributed algorithm for incremental sequential pattern mining (MR-INCSPM) using the MapReduce framework that can handle big data. The proposed algorithm incorporates a backward mining approach that efficiently makes use of the knowledge obtained during the previous mining process. Also, based on a study of item co-occurrences, we propose the Co-occurrence Reverse Map (CRMAP) data structure, which is used to address the combinatorial explosion of candidate sequences. In addition, novel candidate generation and early pruning mechanisms are designed using CRMAP to speed up the mining process. The proposed algorithm is evaluated on both real and synthetic datasets. The experimental results demonstrate the efficacy of MR-INCSPM with respect to processing time, memory, and pruning efficiency.
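The motivation for incremental mining, updating earlier results from the newly inserted sequences alone instead of re-mining everything, can be shown with a deliberately simple sketch. It tracks only length-2 patterns and does not implement MR-INCSPM, backward mining, or CRMAP; the class and function names are hypothetical.

```python
from collections import Counter

def ordered_pairs(sequence):
    """Distinct (a, b) pairs where a occurs before b in the sequence."""
    pairs = set()
    for i, a in enumerate(sequence):
        for b in sequence[i + 1:]:
            pairs.add((a, b))
    return pairs

class IncrementalPairCounter:
    """Keeps support counts so that an update scans only the new sequences."""

    def __init__(self):
        self.n_sequences = 0
        self.support = Counter()

    def add_sequences(self, new_sequences):
        for seq in new_sequences:
            self.n_sequences += 1
            self.support.update(ordered_pairs(seq))

    def frequent(self, min_support_ratio):
        threshold = min_support_ratio * self.n_sequences
        return {p for p, c in self.support.items() if c >= threshold}

counter = IncrementalPairCounter()
counter.add_sequences([["a", "b", "c"], ["a", "c"]])   # initial database
first = counter.frequent(1.0)
counter.add_sequences([["b", "c"], ["a", "b", "c"]])   # inserted later
second = counter.frequent(0.75)
print(first, second)
```

Because the threshold is relative to the database size, a pattern that was frequent before an update can become infrequent afterwards, so the stored counts have to be re-checked against the new threshold after each insertion.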

A novel MapReduce algorithm for distributed mining of sequential patterns using co-occurrence information

Applied Intelligence, 2018

The Sequential Pattern Mining (SPM) problem has been much studied and extended in several directions. With the tremendous growth in the size of datasets, traditional algorithms are not scalable. To address the scalability issue, a few researchers have recently developed distributed algorithms based on MapReduce. However, the existing MapReduce algorithms require multiple rounds of MapReduce, which increases communication and scheduling overhead. They also do not address the handling of long sequences. They generate a huge number of candidate sequences that do not appear in the input database, which enlarges the search space and increases the number of candidate sequences for support counting. Our algorithm is a two-phase MapReduce algorithm that generates only promising candidate sequences using pruning strategies. It also reduces the search space, making support computation more efficient. We make use of item co-occurrence information, and the proposed Sequence Index List (SIL) data structure helps compute support quickly. The experimental results show that the proposed algorithm performs better than the existing MapReduce algorithms for the SPM problem.
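How co-occurrence information can prune candidate sequences is sketched below in a map/reduce style, simulated in plain Python. This is a generic illustration of co-occurrence pruning, not the paper's algorithm or its SIL data structure; every name and the toy database are assumptions.

```python
from collections import defaultdict

def map_phase(sequence):
    """Emit ((a, b), 1) once per sequence for each a occurring before b."""
    seen = set()
    for i, a in enumerate(sequence):
        for b in sequence[i + 1:]:
            if (a, b) not in seen:
                seen.add((a, b))
                yield (a, b), 1

def reduce_phase(pairs):
    """Sum the emitted counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

def build_cooccurrence_map(database, min_support):
    """Map each item to the items that follow it in >= min_support sequences."""
    counts = reduce_phase(kv for seq in database for kv in map_phase(seq))
    cmap = defaultdict(set)
    for (a, b), count in counts.items():
        if count >= min_support:
            cmap[a].add(b)
    return cmap

def extensions(pattern, cmap):
    """Items that may extend `pattern`: each must frequently follow every
    item already in the pattern, otherwise the extension cannot be frequent."""
    allowed = None
    for item in pattern:
        followers = cmap.get(item, set())
        allowed = followers if allowed is None else allowed & followers
    return allowed or set()

# Hypothetical toy database of item sequences.
database = [["a", "b", "c"], ["a", "c", "b"], ["a", "b"], ["c", "a"]]
cmap = build_cooccurrence_map(database, min_support=2)
print(extensions(("a",), cmap))      # items b and c remain candidates
print(extensions(("a", "b"), cmap))  # empty: no item frequently follows b
```

The pruning is safe because any frequent extended pattern must contain only frequent ordered pairs, so candidates that fail the pair check never need support counting.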

Mining High Utility Time Interval Sequences Using MapReduce Approach: Multiple Utility Framework

IEEE Access

Mining high utility sequential patterns is a significant research topic in data mining. Several methods mine sequential patterns while taking utility values into consideration. Patterns of this type can determine the order in which items were purchased, but not the time interval between them. The time interval between items is important in many real-world applications, including retail market basket analysis, stock market fluctuations, and DNA sequence analysis. Very few algorithms for mining sequential patterns consider both utility and time interval. Moreover, they assume the same threshold for each item, maintaining the same unit profit. In addition, with the rapid growth in data, traditional algorithms cannot handle big data and are not scalable. To address these problems, we propose a distributed three-phase MapReduce framework that considers multiple utilities and is suitable for handling big data. The time constraints are pushed into the algorithm instead of using pre-defined intervals. Also, the proposed upper bound minimizes the number of candidate patterns during the mining process. The approach has been tested, and the experimental results show its efficiency in terms of run time, memory utilization, and scalability.

Index terms: Data mining, MapReduce framework, multiple utility thresholds, sequential pattern mining, time interval patterns.
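The following sketch shows one way to compute the utility of a pattern under a time-interval constraint and to compare it against an item-specific threshold. It rests on several assumptions that are not taken from the paper: events are (timestamp, item, quantity) triples sorted by time, the utility of a pattern in a sequence is the maximum over its valid occurrences, and the threshold of a pattern is the smallest threshold among its items. All names are hypothetical.

```python
def pattern_utility(sequence, pattern, profit, min_gap, max_gap):
    """Best utility of any occurrence of `pattern` whose consecutive events
    are between min_gap and max_gap time units apart; None if none exists."""
    # best[i]: highest utility of the current pattern prefix ending at event i
    best = [q * profit[item] if item == pattern[0] else None
            for _, item, q in sequence]
    for target in pattern[1:]:
        nxt = [None] * len(sequence)
        for j, (t_j, item, q) in enumerate(sequence):
            if item != target:
                continue
            candidates = [best[i]
                          for i, (t_i, _, _) in enumerate(sequence[:j])
                          if best[i] is not None
                          and min_gap <= t_j - t_i <= max_gap]
            if candidates:
                nxt[j] = max(candidates) + q * profit[item]
        best = nxt
    found = [u for u in best if u is not None]
    return max(found) if found else None

def pattern_threshold(pattern, item_thresholds):
    """Assumed convention: the smallest threshold among the pattern's items."""
    return min(item_thresholds[item] for item in pattern)

# Hypothetical purchase history: (timestamp, item, quantity).
sequence = [(1, "a", 2), (3, "b", 1), (4, "a", 1), (9, "b", 3)]
profit = {"a": 5, "b": 2}
thresholds = {"a": 15, "b": 10}

utility = pattern_utility(sequence, ("a", "b"), profit, min_gap=1, max_gap=5)
print(utility, utility >= pattern_threshold(("a", "b"), thresholds))  # 12 True
```

Two occurrences satisfy the gap constraint here (times 1 to 3, utility 12, and times 4 to 9, utility 11); the pair at times 1 and 9 is rejected because its gap of 8 exceeds `max_gap`.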

Student Placement Chance Prediction Model using Machine Learning Techniques

2021 5th Conference on Information and Communication Technology (CICT), 2021

Distributed Mining of High Utility Sequential Patterns with Negative Item Values

Sequential pattern mining has been widely used to solve various business problems, including frequent user click patterns, customer purchase analysis, and gene microarray data analysis. Many studies have applied such pattern mining to extract insightful data. Most of them concentrate on high utility sequential pattern (HUSP) mining with positive values and without a distributed approach. All the existing solutions are centralized, which incurs greater computation and communication costs. In this paper, we introduce a novel algorithm for mining HUSPs, including negative item values, using a distributed approach. We use Hadoop MapReduce to process the data in parallel. Various pruning techniques are proposed to minimize the search space in a distributed environment, thus reducing the processing cost. To our understanding, no algorithm was proposed to mine High Utility Sequential Patterns with negative item values in a distrib...
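Why negative item values need special care in pruning can be seen in a short worked example. The sketch below is not the paper's algorithm: it only shows that, once some unit profits are negative, the total utility of a sequence no longer bounds the utility of its subsequences, and that summing the positive utilities alone gives a safe bound. The function names and the data are hypothetical.

```python
def sequence_utility(sequence, profit):
    """Total utility of a sequence of (item, quantity) pairs."""
    return sum(q * profit[item] for item, q in sequence)

def positive_upper_bound(sequence, profit):
    """Upper bound on the utility of any subsequence: ignore negative items."""
    return sum(q * profit[item] for item, q in sequence if profit[item] > 0)

# Hypothetical data: item "b" is sold at a loss.
profit = {"a": 5, "b": -4}
sequence = [("a", 2), ("b", 1)]

total = sequence_utility(sequence, profit)          # 2*5 + 1*(-4) = 6
sub = sequence_utility([("a", 2)], profit)          # 10, larger than the total
bound = positive_upper_bound(sequence, profit)      # 10
print(total, sub, bound)  # 6 10 10
```

A pruning rule that discarded this sequence because its total utility (6) fell below a threshold of, say, 8 would wrongly lose the subsequence containing only "a" (utility 10); the positive-only bound avoids that.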
