Shalini Batra - Academia.edu
Papers by Shalini Batra
Computational Intelligence
Concept drift refers to the change in data distributions and evolving relationships between input and output variables with the passage of time. Analyzing such variations in learning environments and generating models that can accommodate the changing performance of predictive systems is one of the challenging machine learning applications. In general, the majority of existing schemes consider one specific drift type: gradual, abrupt, recurring, or mixed, with a traditional voting setup. In this work, we propose a novel data stream framework, dynamically adaptive and diverse dual ensemble (DA‐DDE), which responds to multiple drift types in the incoming data streams by combining online and block‐based ensemble techniques. In the proposed scheme, a dual diversified ensemble‐based system is constructed as a combination of active and passive ensembles, updated over a diverse set of resampled input spaces. An adaptive weight-setting method is proposed which utilizes the overall performance of learners on historic as well as recent concepts of distributions. Further, a dual voting system is used for hypothesis generation by considering the dynamic adaptive credibility of ensembles in real time. Comparative analysis with 14 state‐of‐the‐art algorithms on 24 artificial and 11 real datasets shows that DA‐DDE is highly effective in handling various drift types.
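The adaptive weighting and dual voting described above can be illustrated with a minimal sketch. The blending factor `alpha`, the function names, and the flat weighted vote are assumptions for illustration; the paper's actual update and resampling rules are more involved.

```python
import numpy as np

def learner_weight(historic_acc, recent_acc, alpha=0.5):
    """Blend long-term and recent accuracy; alpha is an assumed mixing knob."""
    return alpha * historic_acc + (1 - alpha) * recent_acc

def dual_vote(active_preds, active_w, passive_preds, passive_w, n_classes):
    """Weighted vote pooled across the active and passive ensembles."""
    scores = np.zeros(n_classes)
    for preds, weights in ((active_preds, active_w), (passive_preds, passive_w)):
        for p, w in zip(preds, weights):
            scores[p] += w
    return int(np.argmax(scores))

# Three active and two passive learners voting over two classes.
w_a = [learner_weight(0.9, 0.7), learner_weight(0.6, 0.8), learner_weight(0.8, 0.8)]
w_p = [learner_weight(0.7, 0.9), learner_weight(0.5, 0.6)]
print(dual_vote([0, 1, 1], w_a, [1, 0], w_p, n_classes=2))  # -> 1
```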
GLOBECOM 2017 - 2017 IEEE Global Communications Conference, 2017
With an exponential increase in Internet traffic over the network, there are growing concerns about identifying the legitimate users who are the bulk sources of Internet traffic generation. However, due to the occurrence of anomalies in the network traffic, normal operations and functionalities of the network (traffic classification, resource allocation, and service management) get affected. Thus, in a given time frame, there is a requirement for anomaly detection in the network. The efficiency of any anomaly detection model mainly depends on the selection of relevant features and the learning algorithms used for classification of the network traffic patterns. However, due to the curse of dimensionality, imbalance between classes, and variations in the types of anomalies, most of the existing solutions reported in the literature fail to deal with the problems that occur while detecting anomalies in large-scale network data. To remove these gaps in the existing solutions, we propose a new hybrid anomaly detection scheme called Ensemble-based Classification Model for Network Anomaly Detection (EnClass) to detect anomalies in real-world networking datasets. EnClass has three modules: (i) Hoeffding-bound-based clustering to identify the optimal subset of features to be taken for classification of network traffic, (ii) an eigenvalue computation module to refine the feature set by removing unnecessary attributes, and (iii) a very fast decision tree for network traffic classification. In order to validate the proposed anomaly detection model, experimental evaluation is performed using the real-world Knowledge Discovery and Data Mining (KDD'99) dataset with respect to parameters such as detection rate, false positive rate, and F-score. The comparison with existing approaches clearly demonstrates the effectiveness of EnClass in terms of detection rate (98.58%), false positive rate (0.42%), and F-score (96.06%).
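A rough stand-in for this three-stage pipeline can be sketched with scikit-learn under stated substitutions: synthetic data replaces KDD'99, PCA's eigenvalue spectrum approximates the eigenvalue-refinement module, and a batch decision tree stands in for the streaming VFDT (the Hoeffding-bound clustering of stage (i) is omitted).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for KDD'99 traffic records.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           random_state=0)
# Eigenvalue-style refinement: keep components covering 95% of variance.
X_red = PCA(n_components=0.95, random_state=0).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, random_state=0)
# Batch tree classifier standing in for the streaming VFDT.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"kept {X_red.shape[1]} of 40 features; "
      f"hold-out accuracy: {clf.score(X_te, y_te):.3f}")
```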
2018 IEEE International Conference on Communications (ICC), 2018
With an exponential increase in data generation from various Internet-enabled devices, satisfying end users' demands with respect to Quality of Experience (QoE) has become a prime concern over the past few years. To assure QoE to end users, content delivery networks (CDNs) aim to provide content close to the user's geographical location so as to decrease network congestion and latency along with optimal bandwidth consumption. This paper proposes popular-content storage at the edge nodes/gateways instead of a remote server to increase data availability. For efficient cache management at the edge nodes, data is stored using Quotient Filters (QFs), where the number of QFs is determined by the number of categories taken for data segregation. To improve accuracy and reduce the effort in the caching process, one extra bit, called the timer-based metabit, is used with the QF, which helps to implement least-frequently-used (LFU) caching efficiently. It has been experimentally shown that the proposed scheme gains approximately 8.9% in object hit ratio with respect to existing CDN-based techniques. Moreover, the search time complexity of the proposed edge-based CDN is independent of the number of incoming requests.
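The timer-based metabit idea can be sketched with a toy LFU cache. A plain dictionary stands in for the quotient filter, and the tick/eviction semantics here are assumptions, not the paper's exact design.

```python
class EdgeCache:
    """Toy LFU cache; a plain dict stands in for the quotient filter."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}                     # key -> (frequency, recency bit)

    def tick(self):
        """Periodic timer: clear recency bits so stale items lose priority."""
        for k, (f, _) in list(self.store.items()):
            self.store[k] = (f, 0)

    def access(self, key):
        f, _ = self.store.get(key, (0, 0))
        if key not in self.store and len(self.store) >= self.capacity:
            # Evict the least frequently used key, breaking ties on the bit.
            victim = min(self.store,
                         key=lambda k: (self.store[k][0], self.store[k][1]))
            del self.store[victim]
        self.store[key] = (f + 1, 1)        # count the access, set the bit

cache = EdgeCache(capacity=2)
for k in ["a", "a", "b", "c"]:              # "b" has the lowest frequency
    cache.access(k)
print(sorted(cache.store))                  # ['a', 'c']
```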
2017 International Conference on Computing, Communication and Automation (ICCCA), 2017
Many applications utilize Probabilistic Data Structures (PDS) to reduce data storage and data processing costs. PDS use probabilistic approaches and approximation principles along with hashing techniques for fast processing of data. In recent years, they have been gaining popularity because they can be used efficiently for big data processing and streaming applications. A Bloom filter is a probabilistic data structure that represents a set S of N elements in very little space and supports set-membership testing. Compared with storing the original set, the space requirement of a Bloom filter is very low. Bloom filters find application in many domains of computer science. In this survey, several variants and applications of the Bloom filter in different domains are discussed.
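A minimal Bloom filter sketch in Python, illustrating the standard construction the abstract refers to: k hash probes into an m-bit array give compact set storage with one-sided (false-positive-only) error. The parameter defaults are illustrative.

```python
import hashlib

class BloomFilter:
    """k hash probes into an m-bit array; false positives are possible,
    false negatives are not."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _probes(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._probes(item))

bf = BloomFilter()
bf.add("10.0.0.1")
print("10.0.0.1" in bf, "10.0.0.2" in bf)   # True, (almost surely) False
```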
2017 4th International Conference on Signal Processing, Computing and Control (ISPCC), 2017
Process mining is a bridge between performance and compliance, a super glue between process- and data-oriented analysis. Healthcare organizations face the challenge of extracting dynamic, complex, and cross-functional processes, and issues related to improper data management, collaboration, and coordination make careflows convoluted to perceive. In this paper, we examine an event log comprising the events of sepsis cases from a hospital; sepsis is a life-threatening condition typically caused by an infection. The events were recorded by the ERP system of the hospital. The research objective is to investigate this healthcare process from two stances: control-flow analysis and conformance checking. Varied process mining techniques are used to explore the event log and compare how they operate on it. We extract the useful facts from event logs in MXML/XES format, import them into the ProM framework, and scrutinize the pathways followed by distinct processes. The results of these analyses provide new insight that facilitates refinement of the prevailing procedure.
Data mining and knowledge engineering, 2010
Interoperability and integration of data sources are becoming ever more challenging issues with the increase in both the amount of data and the number of data producers. Interoperability not only has to resolve differences in data structures, it also has to deal with semantic heterogeneity. Taking semantically heterogeneous databases as the prototypical situation, this paper describes how ontology (in the traditional metaphysical sense) can contribute to delivering a more efficient and effective matching process by providing a framework for the analysis, and so the basis for a methodology. It delivers not only a better process for matching; the process also gives a better result.
Artificial Intelligence and Speech Technology, 2021
Proceedings of the International Conference on Advances in Information Communication Technology & Computing, 2016
This paper presents a new color image encoding and decoding technique using the Fractional Fourier Transform (FrFT) and the Discrete Wavelet Transform (DWT). In the proposed work, all three planes of a color image are encoded using DWT subbands and FrFT parameters. The selection of the subband (among those obtained after applying the DWT) to which the FrFT is applied, together with the FrFT parameters, serves as the security key for encoding and decoding all three color channels. Correct decoding of a color image requires knowledge of which DWT subband the FrFT was applied to and the exact FrFT parameter values; decoding is not possible without them. The proposed technique is compared with a recent existing technique, and experimental results demonstrate its effectiveness.
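A sketch of the key-driven flow for one channel, assuming a 'haar' wavelet and substituting a simple masking step where the actual scheme applies the FrFT (no standard Python library provides a 2-D FrFT); `encode_channel` and the subband naming are illustrative, not the paper's implementation.

```python
import numpy as np
import pywt

def encode_channel(ch, subband_key, a_key):
    """Encode one colour plane; (subband_key, a_key) plays the secret key."""
    cA, (cH, cV, cD) = pywt.dwt2(ch.astype(float), "haar")
    bands = {"LL": cA, "LH": cH, "HL": cV, "HH": cD}
    # Placeholder for an order-a_key FrFT on the key-selected subband.
    mask = np.cos(a_key * np.arange(bands[subband_key].size))
    bands[subband_key] = (bands[subband_key].ravel() * mask).reshape(
        bands[subband_key].shape)
    return pywt.idwt2((bands["LL"], (bands["LH"], bands["HL"], bands["HH"])),
                      "haar")

img = np.random.rand(64, 64)                # stand-in for one colour plane
enc = encode_channel(img, subband_key="HH", a_key=0.7)
print(enc.shape)                            # (64, 64)
```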
The percentage loss of electric energy due to transmission and distribution (T&D) is notable. As urbanization expands, connectivity becomes more complex and T&D losses increase further. These losses can be minimized significantly by optimizing the connectivity of electricity lines. The major objective of this work is to optimize the placement of transformers in residential areas to minimize transmission losses. To achieve this goal, three approaches have been considered: deterministic, clustering, and stochastic. The deterministic approach applies a brute-force method, the clustering approach applies K-means unsupervised clustering, and the stochastic approach uses simulated annealing. A comparative analysis of all three approaches has been done on real-time data collected from PSPCL to calculate the percentage reduction in T&D loss achieved by each.
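Of the three approaches, simulated annealing is the easiest to sketch. The following toy places a single transformer by minimizing total squared distance to houses, a stand-in objective; the actual PSPCL cost model and multi-transformer setting are not reproduced here.

```python
import math
import random

random.seed(0)
houses = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(50)]

def loss(p):
    # Total squared distance from transformer position p to every house.
    return sum((p[0] - x) ** 2 + (p[1] - y) ** 2 for x, y in houses)

pos, T = (0.0, 0.0), 10.0                   # start in a corner, hot
while T > 1e-3:
    cand = (pos[0] + random.gauss(0, 0.5), pos[1] + random.gauss(0, 0.5))
    delta = loss(cand) - loss(pos)
    # Accept improvements always, worse moves with Boltzmann probability.
    if delta < 0 or random.random() < math.exp(-delta / T):
        pos = cand
    T *= 0.95                               # geometric cooling schedule
print(f"placed at ({pos[0]:.2f}, {pos[1]:.2f}), loss = {loss(pos):.1f}")
```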
2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), 2018
The categories and quantity of data are expanding exponentially with the ongoing wave of connectivity. A number of connected devices and data sources continuously generate huge amounts of data at very high speed. This paper investigates various methods that have been used for streaming data processing, such as the Naive Bayes classifier, Very Fast Decision Trees (VFDT), ensemble methods, and clustering-based methods. In this paper, a recurrent neural network (RNN) is implemented to predict the next sequence of a data stream. Three types of sequential data streams are considered: uniform rectangular data, uniform sinusoidal data, and non-uniform sinc-pulse data. Various RNN architectures, namely a simple RNN, an RNN with long short-term memory (LSTM), an RNN with gated recurrent units (GRU), and an RNN optimized with a Genetic Algorithm (GA), are implemented for various combinations of network hyper-parameters such as the number of hidden layers, the number of neurons per layer, the activation function, and the optimizer. The optimal combination of hyper-parameters is selected using the GA. With the sample data streams, the simple RNN shows better prediction accuracy than LSTM and GRU for a single-hidden-layer architecture; as the RNN architectures get deeper, LSTM and GRU outperform the simple RNN. The optimized version of the RNN has been experimentally observed to be 78.13% faster than a single-layered LSTM architecture and 82.76% faster than the LSTM model with 4 hidden layers, with declines in accuracy of 8.67% and 12.67%, respectively.
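A minimal next-step prediction sketch on a sinusoidal stream, comparing the three recurrent cells named above with Keras. The window size, unit count, and epoch budget are arbitrary stand-ins for the GA-tuned hyper-parameters.

```python
import numpy as np
import tensorflow as tf

series = np.sin(np.arange(0, 100, 0.1))     # the sinusoidal stream
win = 20
X = np.stack([series[i:i + win] for i in range(len(series) - win)])[..., None]
y = series[win:]                            # next value after each window

for cell in (tf.keras.layers.SimpleRNN, tf.keras.layers.LSTM,
             tf.keras.layers.GRU):
    model = tf.keras.Sequential([tf.keras.Input(shape=(win, 1)),
                                 cell(16),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(X, y, epochs=3, verbose=0)
    print(cell.__name__, f"final MSE: {hist.history['loss'][-1]:.4f}")
```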
2017 4th International Conference on Signal Processing, Computing and Control (ISPCC), 2017
Diagnosing diabetes is one of the problems that require a high level of accurate analysis and prediction. Traditional techniques for clinical decision support systems are grounded on a single classifier, or a combination of classifiers, used for diagnosis of the disease and its prediction. Recently, much attention has been paid to improving the performance of disease prediction with ensemble-based methods; using ensemble methods in decision support systems assists in analyzing these types of diseases. To improve the performance of weak classifiers, boosting and bagging can be used. These techniques are based on combining the outputs and functionality of the classifiers; a weighted majority vote or a simple majority vote, both used in this study, are the most common combination rules for bagging and boosting. In this paper, we compare the performance of bagging and boosting with our hybrid approach, called Hierarchical and Progressive Combination of Classifiers (HPCC), through a study of the well-known Pima Indians Diabetes Dataset, and the best classifier is chosen on the basis of the accuracy achieved.
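For a baseline of the kind compared here, bagging and boosting can be run side by side with scikit-learn. A synthetic 8-feature set stands in for the actual Pima dataset; HPCC itself is not reimplemented.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 768 samples / 8 features mirrors the Pima dataset's shape only.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
ensembles = [("bagging", BaggingClassifier(shallow_tree, n_estimators=50,
                                           random_state=42)),
             ("boosting", AdaBoostClassifier(n_estimators=50,
                                             random_state=42))]
for name, clf in ensembles:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```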
Expert Systems, 2020
Mining data streams for predictive analysis is one of the most interesting topics in machine learning. With drifting data distributions, it becomes important to build adaptive systems that are dynamic and accurate. Although ensembles are powerful in improving the accuracy of incremental learning, it is crucial to maintain the set of best-suited learners in the ensemble while considering the diversity between them. By adding diversity‐based pruning to traditional accuracy‐based pruning, this paper proposes a novel concept drift handling approach named Two‐Level Pruning based Ensemble with Abstained Learners (TLP‐EnAbLe). In this approach, deferred similarity‐based pruning delays the removal of underperforming similar learners until it is assured that they are no longer fit for prediction. The proposed scheme retains diverse learners that are well suited to the current concept. Two‐level abstaining monitors the performance of learners and chooses the best set of competent learners to participate in decision making. This is an enhancement to the traditional majority voting system: it dynamically chooses high-performing learners and abstains those not suitable for prediction. In our experiments, it has been demonstrated that TLP‐EnAbLe handles concept drift more effectively than other state‐of‐the‐art algorithms on nineteen artificially drifting and ten real‐world datasets. Further, statistical tests conducted on various drift patterns, including gradual, abrupt, recurring, and their combinations, prove the efficiency of the proposed approach.
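The abstaining vote can be sketched in a few lines: learners whose tracked accuracy falls below a threshold sit out of the accuracy-weighted vote. The threshold and weighting here are assumptions, not the paper's two-level rules.

```python
from collections import Counter

def abstained_vote(predictions, accuracies, threshold=0.6):
    """Accuracy-weighted majority vote in which weak learners abstain."""
    votes = Counter()
    for pred, acc in zip(predictions, accuracies):
        if acc >= threshold:
            votes[pred] += acc              # competent learners vote, weighted
    if not votes:                           # everyone abstained: plain majority
        votes = Counter(predictions)
    return votes.most_common(1)[0][0]

# Learner 2 (accuracy 0.4) abstains; the remaining votes pick "spam".
print(abstained_vote(["spam", "ham", "spam"], [0.9, 0.4, 0.7]))
```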
Electronic Workshops in Computing, 2010
Knowledge-Based Systems, 2019
Advanced Techniques in Computing Sciences and Software Engineering, 2009
Latent Semantic Indexing (LSI), a well-known technique in Information Retrieval, has been partially successful in text retrieval, and no major breakthrough has been achieved in text classification as yet. A significant step forward in this regard was made by Hofmann [3], who presented the probabilistic LSI (PLSI) model as an alternative to LSI. PLSI is not successful if we wish to consider exchangeable representations for documents and words, which further led to the Latent Dirichlet Allocation (LDA) model [4]. A new local Latent Semantic ...
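An LDA topic model of the kind this passage builds up to takes only a few lines with scikit-learn; the toy corpus is illustrative, and the paper's local latent-semantic variant is not reimplemented.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["wavelet transform image encoding",
        "image encryption fourier transform",
        "ensemble drift data stream learning",
        "stream mining concept drift ensembles"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)            # term-frequency matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
vocab = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [vocab[j] for j in comp.argsort()[-3:][::-1]]
    print(f"topic {i}: {top}")              # top words per latent topic
```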