Tushaar Gangavarapu | Cornell University

inproceedings by Tushaar Gangavarapu

Coherence-based Modeling of Clinical Concepts Inferred from Heterogeneous Clinical Notes for ICU Patient Risk Stratification

In hospitals, critical care patients are often susceptible to various complications that adversely affect their morbidity and mortality. Digitized patient data from Electronic Health Records (EHRs) can be utilized to facilitate accurate risk stratification and provide prioritized care. Existing clinical decision support systems are heavily reliant on the structured nature of the EHRs. However, the valuable patient-specific data contained in unstructured clinical notes are often manually transcribed into EHRs. The prolific use of extensive medical jargon, heterogeneity, sparsity, rawness, inconsistent abbreviations, and complex structure of the clinical notes poses significant challenges, and also results in a loss of information during the manual conversion process. In this work, we employ two coherence-based topic modeling approaches to model the free-text in the unstructured clinical nursing notes and capture its semantic textual features with an emphasis on human interpretability. Furthermore, we present FarSight, a long-term aggregation mechanism intended to detect the onset of disease with the earliest recorded symptoms and infections. We utilize the predictive capabilities of deep neural models for the clinical task of risk stratification through ICD-9 code group prediction. Our experimental validation on the MIMIC-III (v1.4) database underlined the efficacy of FarSight with coherence-based topic modeling in extracting discriminative clinical features from the unstructured nursing notes. The proposed approach achieved superior predictive performance when benchmarked against the state-of-the-art model based on structured EHR data, with an improvement of 11.50% in AUPRC and 1.16% in AUROC.

Deep Neural Learning for Automated Diagnostic Code Group Prediction Using Unstructured Nursing Notes

Disease prediction, a central problem in clinical care and management, has gained much significance over the last decade. Nursing notes documented by caregivers contain valuable information concerning a patient's state, which can aid in the development of intelligent clinical prediction systems. Moreover, due to the limited adoption of structured electronic health records in developing countries, the need for disease prediction from such clinical text has garnered substantial interest from the research community. The availability of large, public databases such as MIMIC-III, and advancements in machine and deep learning models with high predictive capabilities have further facilitated research in this direction. In this work, we model the latent knowledge embedded in the unstructured clinical nursing notes to address the clinical task of disease prediction as a multi-label classification of ICD-9 code groups. We present EnTAGS, which facilitates aggregation of the data in the clinical nursing notes of a patient by modeling them independently of one another. To handle the sparsity and high dimensionality of clinical nursing notes effectively, our proposed EnTAGS is built on the topics extracted using Non-negative Matrix Factorization. Furthermore, we explore the applicability of deep learning models for the clinical task of disease prediction, and assess the reliability of the proposed models using standard evaluation metrics. Our experimental evaluation revealed that the proposed approach consistently exceeded the state-of-the-art prediction model by 1.87% in accuracy, 12.68% in AUPRC, and 11.64% in MCC score.

A Novel Bio-inspired Hybrid Metaheuristic for Unsolicited Bulk Email Detection

With the recent influx of technology, Unsolicited Bulk Emails (UBEs) have become a potential problem, leaving computer users and organizations at the risk of brand, data, and financial loss. In this paper, we present a novel bio-inspired hybrid parallel optimization algorithm (Cuckoo-Firefly-GR), which combines Genetic Replacement (GR) of low-fitness individuals with a hybrid of Cuckoo Search (CS) and Firefly (FA) optimizations. Cuckoo-Firefly-GR not only employs the random walk in CS, but also uses mechanisms in FA to generate and select fitter individuals. The content- and behavior-based features of emails used in the existing works, along with Doc2Vec features of the email body, are employed to extract the syntactic and semantic information in the emails. By establishing an optimal balance between intensification and diversification, and reaching global optimization using two metaheuristics, we argue that the proposed algorithm significantly improves the performance of UBE detection by selecting the most discriminative feature subspace. This study presents significant observations from the extensive evaluations on UBE corpora of 3,844 emails that underline the efficiency and superiority of our proposed Cuckoo-Firefly-GR over the base optimizations (Cuckoo-GR and Firefly-GR), dense autoencoders, recurrent neural autoencoders, and several state-of-the-art methods. Furthermore, the instructive feature subset obtained using the proposed Cuckoo-Firefly-GR, when classified using a dense neural model, achieved an accuracy of 99%.

Parallel OpenMP and CUDA Implementations of the N-Body Problem

The N-body problem, in the field of astrophysics, predicts the movements of the planets and their gravitational interactions. This paper aims at developing efficient and high-performance implementations of two versions of the N-body problem. Adaptive tree structures are widely used in N-body simulations. Building and storing the tree and the need for work-load balancing pose significant challenges in high-performance implementations. Our implementations use multiple CPU and GPU cores via efficient work-load balancing with data and task parallelization. The contributions include OpenMP and Nvidia CUDA implementations that parallelize force computation and mass distribution, and achieve competitive performance in terms of speedup and running time, which is empirically justified and graphed. This research serves as an alternative not only for complex simulations but also for other big data applications requiring work-load distribution and computationally expensive procedures.
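To make the force computation mentioned in the abstract concrete, here is a minimal sketch of a direct pairwise (O(N^2)) gravitational force sum. It is an illustration only: it is written in serial Python, not OpenMP or CUDA, it does not use the adaptive tree structures the paper discusses, and the function name `pairwise_forces` and the `softening` parameter are assumptions introduced for this example.

```python
import math

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2


def pairwise_forces(positions, masses, softening=1e-9):
    """Return the net gravitational force vector on each body.

    positions: list of (x, y, z) tuples in metres
    masses:    list of masses in kilograms
    softening: small length (assumed for this sketch) that avoids a
               division by zero when two bodies coincide
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Vector pointing from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            dist2 = sum(c * c for c in d) + softening ** 2
            dist = math.sqrt(dist2)
            magnitude = G * masses[i] * masses[j] / dist2
            for k in range(3):
                f = magnitude * d[k] / dist
                forces[i][k] += f  # body i is pulled towards j
                forces[j][k] -= f  # Newton's third law
    return forces
```

The outer loop over `i` is the part a data-parallel implementation would typically distribute across threads; how the paper balances that work is not reproduced here.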

TAGS: Towards Automated Classification of Unstructured Clinical Nursing Notes

Accurate risk management and disease prediction are vital in intensive care units to channel prompt care to patients in critical conditions and aid medical personnel in effective decision making. Clinical nursing notes document subjective assessments and crucial information about a patient's state, which is mostly lost when transcribed into Electronic Medical Records (EMRs). The Clinical Decision Support Systems (CDSSs) in the existing body of literature are heavily dependent on the structured nature of EMRs. Moreover, works that aim at benchmarking deep learning models are limited. In this paper, we aim at leveraging the underutilized treasure-trove of patient-specific information present in the unstructured clinical nursing notes towards the development of CDSSs. We present a fuzzy token-based similarity approach to aggregate the voluminous clinical documentation of a patient. To structure the free-text in the unstructured notes, vector space and coherence-based topic modeling approaches that capture the syntactic and latent semantic information are presented. Furthermore, we utilize the predictive capabilities of deep neural architectures for disease prediction as ICD-9 code groups. Experimental validation revealed that the proposed Term weighting of nursing notes AGgregated using Similarity (TAGS) model outperformed the state-of-the-art model by 5% in AUPRC and 1.55% in AUROC.

A Single Program Multiple Data Algorithm for Feature Selection

Feature selection is a critical component in data science and has been a topic of research for many years. Advances in hardware and the availability of better multiprocessing platforms have enabled parallel computing to reach very high levels of performance. Minimum Redundancy Maximum Relevance (mRMR) is a powerful feature selection technique used in many applications. In this paper, we present a novel optimized Single Program Multiple Data (SPMD) approach to implement the mRMR algorithm with synchronous computation, optimum load balancing, and greater speedup than task-parallel approaches. The experimental results presented using multiple synthesized datasets demonstrate the efficiency and scalability of the proposed technique over the original mRMR.
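For readers unfamiliar with mRMR, the sketch below shows the serial greedy selection that the technique is built on: pick the feature with the highest relevance to the target, then repeatedly add the feature whose relevance minus average redundancy with the already selected features is largest. This is not the SPMD implementation from the paper; the function names, the dictionary-based input layout, and the restriction to discrete-valued features are assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedily select up to k feature names.

    features: dict mapping a feature name to a list of discrete values
    target:   list of discrete class labels
    Score used here: relevance minus mean redundancy (the "difference" form).
    """
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The relevance and redundancy terms for different candidate features are independent of one another, which is what makes the scoring step a natural target for data-parallel execution.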

articles by Tushaar Gangavarapu

Predicting ICD-9 code groups with fuzzy similarity based supervised multi-label classification of unstructured clinical nursing notes

In hospitals, caregivers are trained to chronicle the subtle changes in the clinical conditions of a patient at regular intervals to enable decision-making. Caregivers’ text-based clinical notes are a significant source of rich patient-specific data that can facilitate effective clinical decision support; despite this, the treasure-trove of data remains largely unexplored for supporting the prediction of clinical outcomes. The application of sophisticated data modeling and prediction algorithms with greater computational capacity has made disease prediction from raw clinical notes a relevant problem. In this paper, we propose an approach based on vector space and topic modeling to structure the raw clinical data by capturing the semantic information in the nursing notes. A fuzzy similarity based data cleansing approach was used to merge anomalous and redundant patient data. Furthermore, we utilize eight supervised multi-label classification models to facilitate disease (ICD-9 code group) prediction. We present an exhaustive comparative study to evaluate the performance of the proposed approaches using standard evaluation metrics. Experimental validation on MIMIC-III, an open database, underscored the superior performance of the proposed Term weighting of unstructured notes AGgregated using fuzzy Similarity (TAGS) model, which consistently outperformed the state-of-the-art structured data based approach by 7.79% in AUPRC and 1.24% in AUROC.
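The idea of merging redundant notes by similarity can be sketched as follows. The abstract does not specify the similarity measure, so this example substitutes token-set Jaccard similarity as a stand-in; the function names, the whitespace tokenization, and the 0.8 threshold are all assumptions for illustration and should not be read as the TAGS method itself.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty notes are treated as identical
    return len(ta & tb) / len(ta | tb)


def aggregate_notes(notes, threshold=0.8):
    """Keep a note only if it is not a near-duplicate of an earlier kept note.

    notes:     list of note strings for one patient, in chronological order
    threshold: similarity at or above which a note counts as redundant
               (hypothetical value chosen for this sketch)
    """
    kept = []
    for note in notes:
        if all(jaccard(note, earlier) < threshold for earlier in kept):
            kept.append(note)
    return kept
```

A call such as `aggregate_notes(["pt stable on room air", "Pt stable on room air", "pt febrile overnight"])` drops the second note, which differs from the first only in capitalization, and keeps the other two.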

A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets

The predictive accuracy achievable on high-dimensional biomedical datasets is often diminished by many irrelevant and redundant molecular disease diagnosis features. Dimensionality reduction aims at finding a feature subspace that preserves the predictive accuracy while eliminating noise and curtailing the high computational cost of training. The applicability of a particular feature selection technique is heavily reliant on the ability of that technique to match the problem structure and to capture the inherent patterns in the data. In this paper, we propose a novel filter–wrapper hybrid ensemble feature selection approach based on the weighted occurrence frequency and the penalty scheme, to obtain the most discriminative and instructive feature subspace. The proposed approach engenders an optimal feature subspace by greedily combining the feature subspaces obtained from various predetermined base feature selection techniques. Furthermore, the base feature subspaces are penalized based on specific performance-dependent penalty parameters. We leverage effective heuristic search strategies, including the greedy parameter-wise optimization and the Genetic Algorithm (GA), to optimize the subspace ensembling process. The effectiveness, robustness, and flexibility of the proposed hybrid greedy ensemble approach in comparison with the base feature selection techniques, and prolific filter and state-of-the-art wrapper methods, are justified by empirical analysis on three distinct high-dimensional biomedical datasets. Experimental validation revealed that the proposed greedy approach, when optimized using GA, outperformed the selected base feature selection techniques by 4.17%–15.14% in terms of the prediction accuracy.

Multi-channel, convolutional attention based neural model for automated diagnostic coding of unstructured patient discharge summaries

Effective coding of patient records in hospitals is an essential requirement for epidemiology, billing, and managing insurance claims. The prevalent practice of manual coding, carried out by trained medical coders, is error-prone and time-consuming. Mitigating this labor-intensive process by developing diagnostic coding systems built on patients’ Electronic Medical Records (EMRs) is vital. However, developing nations with low digitization rates have limited availability of structured EMRs, thereby necessitating systems that leverage unstructured data sources. Despite the rich clinical information available in such unstructured data, modeling them is complex, owing to the variety and sparseness of diagnostic codes, complex structural and temporal nature of summaries, and prolific use of medical jargon. This work proposes a context-attentive network to facilitate automatic diagnostic code assignment as a multi-label classification problem. The proposed model facilitates information aggregation across a patient’s discharge summary via multi-channel, variable-sized convolutional filters to extract multi-granular snippets. The attention mechanism enables selecting vital segments in those snippets that map to the clinical codes. The model’s superior performance underscores its effectiveness compared to the state-of-the-art on the MIMIC-III database. Additionally, experimental validation using the CodiEsp dataset exhibited the model’s interpretability and explainability.

FarSight: Long-Term Disease Prediction Using Unstructured Clinical Nursing Notes

Applicability of machine learning in spam and phishing email filtering: review and approaches

With the influx of technological advancements and the increased simplicity in communication, especially through emails, the upsurge in the volume of unsolicited bulk emails (UBEs) has become a severe threat to global security and economy. Spam emails not only waste users' time, but also consume a lot of network bandwidth, and may also include malware as executable files. Phishing emails, in contrast, fraudulently solicit users' personal information to facilitate identity theft and are comparatively more dangerous. Thus, there is an intrinsic need for the development of more robust and dependable UBE filters that facilitate automatic detection of such emails. There are several countermeasures to spam and phishing, including blacklisting and content-based filtering. However, in addition to content-based features, behavior-based features are well-suited to the detection of UBEs. Machine learning models are being extensively used by leading internet service providers like Yahoo, Gmail, and Outlook to filter and classify UBEs successfully. There are far too many options to consider, owing to the need to facilitate UBE detection and the recent advances in this domain. In this paper, we aim at elucidating how to extract email content- and behavior-based features, which features are appropriate in the detection of UBEs, and how to select the most discriminating feature set. Furthermore, to accurately handle the menace of UBEs, we facilitate an exhaustive comparative study using several state-of-the-art machine learning algorithms. Our proposed models resulted in an overall accuracy of 99% in the classification of UBEs. The text is accompanied by snippets of Python code, to enable the reader to implement the approaches elucidated in this paper.

Papers by Tushaar Gangavarapu

An Empirical Study to Detect the Collision Rate in Similarity Hashing Algorithm Using MD5

2019 International Conference on Data Science and Engineering (ICDSE), 2019

Similarity Hashing (SimHash) is a widely used locality-sensitive hashing algorithm employed in the detection of similarity, in large-scale data processing, including plagiarism detection and near-duplicate web document detection. Collision resistance is a crucial property of cryptographic hash algorithms that are used to verify the message integrity in internet security applications. A hash function is said to be collision-resistant if it is hard to find two different inputs that hash to the same output. In this paper, we present an empirical study to facilitate the detection of collision rate when SimHash is employed to check the integrity of the message. The analysis was performed using bit sequences with length varying from 2 to 32 and Message Digest 5 (MD5) as the internal hash function. Furthermore, to enable faster collision detection with more significant speedup and efficient space utilization, we parallelized the process using a distributed data-parallel approach with synch...
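As a rough illustration of the setup the abstract describes, the following sketch computes a SimHash fingerprint using MD5 as the internal hash and counts how often distinct inputs share a fingerprint. It is a serial example under stated assumptions: the whitespace tokenization, the truncation of the MD5 digest to its low `bits` bits, and the helper names `simhash`, `hamming_distance`, and `collision_rate` are choices made for this example and are not taken from the paper.

```python
import hashlib


def simhash(tokens, bits=32):
    """Return a `bits`-wide SimHash fingerprint for an iterable of string tokens."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Keep only the low `bits` bits of the 128-bit MD5 value.
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def collision_rate(documents, bits=32):
    """Fraction of documents whose fingerprint matches an earlier, different document."""
    first_seen = {}
    collisions = 0
    for doc in documents:
        fp = simhash(doc.split(), bits)
        if fp in first_seen and first_seen[fp] != doc:
            collisions += 1
        first_seen.setdefault(fp, doc)
    return collisions / len(documents) if documents else 0.0
```

Because SimHash is designed to map similar inputs to nearby fingerprints, collisions between different inputs are expected to become more frequent as `bits` shrinks, which is why it behaves differently from a collision-resistant cryptographic hash.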

Research paper thumbnail of Coherence-Based Modeling of Clinical Concepts Inferred from Heterogeneous Clinical Notes for ICU Patient Risk Stratification

In hospitals, critical care patients are often susceptible to various complications that adversel... more In hospitals, critical care patients are often susceptible to various complications that adversely affect their morbidity and mortality. Digitized patient data from Electronic Health Records (EHRs) can be utilized to facilitate risk stratification accurately and provide prioritized care. Existing clinical decision support systems are heavily reliant on the structured nature of the EHRs. However, the valuable patient-specific data contained in unstructured clinical notes are often manually transcribed into EHRs. The prolific use of extensive medical jargon, heterogeneity, sparsity, rawness, inconsistent abbreviations, and complex structure of the clinical notes poses significant challenges, and also results in a loss of information during the manual conversion process. In this work, we employ two coherence-based topic modeling approaches to model the free-text in the unstructured clinical nursing notes and capture its semantic textual features with the emphasis on human interpretability. Furthermore, we present FarSight, a long-term aggregation mechanism intended to detect the onset of disease with the earliest recorded symptoms and infections. We utilize the predictive capabilities of deep neural models for the clinical task of risk stratification through ICD-9 code group prediction. Our experimental validation on MIMIC-III (v1.4) database underlined the efficacy of FarSight with coherence-based topic modeling, in extracting discriminative clinical features from the unstructured nursing notes. The proposed approach achieved a superior predictive performance when benchmarked against the structured EHR data based state-of-the-art model, with an improvement of 11.50% in AUPRC and 1.16% in AUROC.

Research paper thumbnail of Coherence-based Modeling of Clinical Concepts Inferred from Heterogeneous Clinical Notes for ICU Patient Risk Stratification

In hospitals, critical care patients are often susceptible to various complications that adversel... more In hospitals, critical care patients are often susceptible to various complications that adversely affect their morbidity and mortality. Digitized patient data from Electronic Health Records (EHRs) can be utilized to facilitate risk stratification accurately and provide prioritized care. Existing clinical decision support systems are heavily reliant on the structured nature of the EHRs. However, the valuable patient-specific data contained in unstructured clinical notes are often manually transcribed into EHRs. The prolific use of extensive medical jargon, heterogeneity, sparsity, rawness, inconsistent abbreviations, and complex structure of the clinical notes poses significant challenges, and also results in a loss of information during the manual conversion process. In this work, we employ two coherence-based topic modeling approaches to model the free-text in the unstructured clinical nursing notes and capture its semantic textual features with the emphasis on human interpretability. Furthermore, we present FarSight, a long-term aggregation mechanism intended to detect the onset of disease with the earliest recorded symptoms and infections. We utilize the predictive capabilities of deep neural models for the clinical task of risk stratification through ICD-9 code group prediction. Our experimental validation on MIMIC-III (v1.4) database underlined the efficacy of FarSight with coherence-based topic modeling, in extracting discriminative clinical features from the unstructured nursing notes. The proposed approach achieved a superior predictive performance when benchmarked against the structured EHR data based state-of-the-art model, with an improvement of 11.50{\%} in AUPRC and 1.16{\%} in AUROC.

Research paper thumbnail of Deep Neural Learning for Automated Diagnostic Code Group Prediction Using Unstructured Nursing Notes

Disease prediction, a central problem in clinical care and management, has gained much significan... more Disease prediction, a central problem in clinical care and management, has gained much significance over the last decade. Nursing notes documented by caregivers contain valuable information concerning a patient's state, which can aid in the development of intelligent clinical prediction systems. Moreover, due to the limited adaptation of structured electronic health records in developing countries, the need for disease prediction from such clinical text has garnered substantial interest from the research community. The availability of large, publicly available databases such as MIMIC-III, and advancements in machine and deep learning models with high predictive capabilities have further facilitated research in this direction. In this work, we model the latent knowledge embedded in the unstructured clinical nursing notes, to address the clinical task of disease prediction as a multi-label classification of ICD-9 code groups. We present EnTAGS, which facilitates aggregation of the data in the clinical nursing notes of a patient, by modeling them independent of one another. To handle the sparsity and high dimensionality of clinical nursing notes effectively, our proposed EnTAGS is built on the topics extracted using Non-negative matrix factorization. Furthermore, we explore the applicability of deep learning models for the clinical task of disease prediction, and assess the reliability of the proposed models using standard evaluation metrics. Our experimental evaluation revealed that the proposed approach consistently exceeded the state-of-the-art prediction model by 1.87% in accuracy, 12.68% in AUPRC, and 11.64% in MCC score.

Research paper thumbnail of A Novel Bio-inspired Hybrid Metaheuristic for Unsolicited Bulk Email Detection

With the recent influx of technology, Unsolicited Bulk Emails (UBEs) have become a potential prob... more With the recent influx of technology, Unsolicited Bulk Emails (UBEs) have become a potential problem, leaving computer users and organizations at the risk of brand, data, and financial loss. In this paper, we present a novel bio-inspired hybrid parallel optimization algorithm (Cuckoo-Firefly-GR), which combines Genetic Replacement (GR) of low fitness individuals with a hybrid of Cuckoo Search (CS) and Firefly (FA) optimizations. Cuckoo-Firefly-GR not only employs the random walk in CS, but also uses mechanisms in FA to generate and select fitter individuals. The content- and behavior-based features of emails used in the existing works, along with Doc2Vec features of the email body are employed to extract the syntactic and semantic information in the emails. By establishing an optimal balance between intensification and diversification, and reaching global optimization using two metaheuristics, we argue that the proposed algorithm significantly improves the performance of UBE detection, by selecting the most discriminative feature subspace. This study presents significant observations from the extensive evaluations on UBE corpora of 3, 844 emails, that underline the efficiency and superiority of our proposed Cuckoo-Firefly-GR over the base optimizations (Cuckoo-GR and Firefly-GR), dense autoencoders, recurrent neural autoencoders, and several state-of-the-art methods. Furthermore, the instructive feature subset obtained using the proposed Cuckoo-Firefly-GR, when classified using a dense neural model, achieved an accuracy of {\$}{\$}99{\backslash}{\%}{\$}{\$}99{\%}.

Research paper thumbnail of Parallel OpenMP and CUDA Implementations of the N-Body Problem

The N-body problem, in the field of astrophysics, predicts the movements of the planets and their... more The N-body problem, in the field of astrophysics, predicts the movements of the planets and their gravitational interactions. This paper aims at developing efficient and high-performance implementations of two versions of the N-body problem. Adaptive tree structures are widely used in N-body simulations. Building and storing the tree and the need for work-load balancing pose significant challenges in high-performance implementations. Our implementations use various cores in CPU and GPU via efficient work-load balancing with data and task parallelization. The contributions include OpenMP and Nvidia CUDA implementations to parallelize force computation and mass distribution, and achieve competitive performance in terms of speedup and running time which is empirically justified and graphed. This research not only aids as an alternative to complex simulations but also to other big data applications requiring work-load distribution and computationally expensive procedures.

Research paper thumbnail of TAGS: Towards Automated Classification of Unstructured Clinical Nursing Notes

Accurate risk management and disease prediction are vital in intensive care units to channel prom... more Accurate risk management and disease prediction are vital in intensive care units to channel prompt care to patients in critical conditions and aid medical personnel in effective decision making. Clinical nursing notes document subjective assessments and crucial information of a patient's state, which is mostly lost when transcribed into Electronic Medical Records (EMRs). The Clinical Decision Support Systems (CDSSs) in the existing body of literature are heavily dependent on the structured nature of EMRs. Moreover, works which aim at benchmarking deep learning models are limited. In this paper, we aim at leveraging the underutilized treasure-trove of patient-specific information present in the unstructured clinical nursing notes towards the development of CDSSs. We present a fuzzy token-based similarity approach to aggregate voluminous clinical documentations of a patient. To structure the free-text in the unstructured notes, vector space and coherence-based topic modeling approaches that capture the syntactic and latent semantic information are presented. Furthermore, we utilize the predictive capabilities of deep neural architectures for disease prediction as ICD-9 code group. Experimental validation revealed that the proposed Term weighting of nursing notes AGgregated using Similarity (TAGS) model outperformed the state-of-the-art model by 5{\%} in AUPRC and 1.55{\%} in AUROC.

Research paper thumbnail of A Single Program Multiple Data Algorithm for Feature Selection

Feature selection is a critical component in data science and has been the topic of research for ... more Feature selection is a critical component in data science and has been the topic of research for many years. Advances in hardware and the availability of better multiprocessing platforms have enabled parallel computing to reach very high levels of performance. Minimum Redundancy Maximum Relevance (mRMR) is a powerful feature selection technique used in many applications. In this paper, we present a novel optimized Single Program Multiple Data (SPMD) approach to implement the mRMR algorithm with synchronous computation, optimum load balancing and greater speedup than task-parallel approaches. The experimental results presented using multiple synthesized datasets prove the efficiency and scalability of the proposed technique over original mRMR.

Predicting ICD-9 code groups with fuzzy similarity based supervised multi-label classification of unstructured clinical nursing notes

In hospitals, caregivers are trained to chronicle the subtle changes in the clinical conditions of a patient at regular intervals, for enabling decision-making. Caregivers’ text-based clinical notes are a significant source of rich patient-specific data that can facilitate effective clinical decision support; despite this, the treasure-trove of data remains largely unexplored for supporting the prediction of clinical outcomes. The application of sophisticated data modeling and prediction algorithms with greater computational capacity has made disease prediction from raw clinical notes a relevant problem. In this paper, we propose an approach based on vector space and topic modeling, to structure the raw clinical data by capturing the semantic information in the nursing notes. A fuzzy similarity based data cleansing approach was used to merge anomalous and redundant patient data. Furthermore, we utilize eight supervised multi-label classification models to facilitate disease (ICD-9 code group) prediction. We present an exhaustive comparative study to evaluate the performance of the proposed approaches using standard evaluation metrics. Experimental validation on MIMIC-III, an open database, underscored the superior performance of the proposed Term weighting of unstructured notes AGgregated using fuzzy Similarity (TAGS) model, which consistently outperformed the state-of-the-art structured data based approach by 7.79% in AUPRC and 1.24% in AUROC.

A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets

The predictive accuracy of high-dimensional biomedical datasets is often diminished by many irrelevant and redundant molecular disease diagnosis features. Dimensionality reduction aims at finding a feature subspace that preserves the predictive accuracy while eliminating noise and curtailing the high computational cost of training. The applicability of a particular feature selection technique is heavily reliant on the ability of that technique to match the problem structure and to capture the inherent patterns in the data. In this paper, we propose a novel filter–wrapper hybrid ensemble feature selection approach based on the weighted occurrence frequency and the penalty scheme, to obtain the most discriminative and instructive feature subspace. The proposed approach engenders an optimal feature subspace by greedily combining the feature subspaces obtained from various predetermined base feature selection techniques. Furthermore, the base feature subspaces are penalized based on specific performance dependent penalty parameters. We leverage effective heuristic search strategies including the greedy parameter-wise optimization and the Genetic Algorithm (GA) to optimize the subspace ensembling process. The effectiveness, robustness, and flexibility of the proposed hybrid greedy ensemble approach in comparison with the base feature selection techniques, and prolific filter and state-of-the-art wrapper methods are justified by empirical analysis on three distinct high-dimensional biomedical datasets. Experimental validation revealed that the proposed greedy approach, when optimized using GA, outperformed the selected base feature selection techniques by 4.17%–15.14% in terms of the prediction accuracy.

Multi-channel, convolutional attention based neural model for automated diagnostic coding of unstructured patient discharge summaries

Effective coding of patient records in hospitals is an essential requirement for epidemiology, billing, and managing insurance claims. The prevalent practice of manual coding, carried out by trained medical coders, is error-prone and time-consuming. Mitigating this labor-intensive process by developing diagnostic coding systems built on patients’ Electronic Medical Records (EMRs) is vital. However, developing nations with low digitization rates have limited availability of structured EMRs, thereby necessitating systems that leverage unstructured data sources. Despite the rich clinical information available in such unstructured data, modeling them is complex, owing to the variety and sparseness of diagnostic codes, complex structural and temporal nature of summaries, and prolific use of medical jargon. This work proposes a context-attentive network to facilitate automatic diagnostic code assignment as a multi-label classification problem. The proposed model facilitates information aggregation across a patient’s discharge summary via multi-channel, variable-sized convolutional filters to extract multi-granular snippets. The attention mechanism enables selecting vital segments in those snippets that map to the clinical codes. The model’s superior performance underscores its effectiveness compared to the state-of-the-art on the MIMIC-III database. Additionally, experimental validation using the CodiEsp dataset exhibited the model’s interpretability and explainability.

FarSight: Long-Term Disease Prediction Using Unstructured Clinical Nursing Notes

Applicability of machine learning in spam and phishing email filtering: review and approaches

With the influx of technological advancements and the increased simplicity in communication, especially through emails, the upsurge in the volume of unsolicited bulk emails (UBEs) has become a severe threat to global security and economy. Spam emails not only waste users' time, but also consume a lot of network bandwidth, and may also include malware as executable files. Alternatively, phishing emails falsely claim users' personal information to facilitate identity theft and are comparatively more dangerous. Thus, there is an intrinsic need for the development of more robust and dependable UBE filters that facilitate automatic detection of such emails. There are several countermeasures to spam and phishing, including blacklisting and content-based filtering. However, in addition to content-based features, behavior-based features are well-suited in the detection of UBEs. Machine learning models are being extensively used by leading internet service providers like Yahoo, Gmail, and Outlook, to filter and classify UBEs successfully. There are far too many options to consider, owing to the need to facilitate UBE detection and the recent advances in this domain. In this paper, we aim to elucidate how to extract email content and behavior-based features, which features are appropriate for detecting UBEs, and how to select the most discriminating feature set. Furthermore, to accurately handle the menace of UBEs, we facilitate an exhaustive comparative study using several state-of-the-art machine learning algorithms. Our proposed models resulted in an overall accuracy of 99% in the classification of UBEs. The text is accompanied by snippets of Python code, to enable the reader to implement the approaches elucidated in this paper.

Parallel OpenMP and CUDA Implementations of the N-Body Problem

The N-body problem, in the field of astrophysics, predicts the movements of the planets and their gravitational interactions. This paper aims at developing efficient and high-performance implementations of two versions of the N-body problem. Adaptive tree structures are widely used in N-body simulations. Building and storing the tree and the need for work-load balancing pose significant challenges in high-performance implementations. Our implementations use multiple CPU and GPU cores via efficient work-load balancing with data and task parallelization. The contributions include OpenMP and Nvidia CUDA implementations to parallelize force computation and mass distribution, and achieve competitive performance in terms of speedup and running time, which is empirically justified and graphed. This research serves as an alternative not only for complex simulations but also for other big data applications requiring work-load distribution and computationally expensive procedures.

An Empirical Study to Detect the Collision Rate in Similarity Hashing Algorithm Using MD5

2019 International Conference on Data Science and Engineering (ICDSE), 2019

Similarity Hashing (SimHash) is a widely used locality-sensitive hashing algorithm employed in the detection of similarity, in large-scale data processing, including plagiarism detection and near-duplicate web document detection. Collision resistance is a crucial property of cryptographic hash algorithms that are used to verify the message integrity in internet security applications. A hash function is said to be collision-resistant if it is hard to find two different inputs that hash to the same output. In this paper, we present an empirical study to facilitate the detection of collision rate when SimHash is employed to check the integrity of the message. The analysis was performed using bit sequences with length varying from 2 to 32 and Message Digest 5 (MD5) as the internal hash function. Furthermore, to enable faster collision detection with more significant speedup and efficient space utilization, we parallelized the process using a distributed data-parallel approach with synch...
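As a rough illustration of the technique named in the abstract, a SimHash fingerprint with MD5 as the internal hash can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's experimental code: every token is given unit weight, the fingerprint is taken from the low `bits` bits of each MD5 digest, and the function names `simhash` and `hamming` are hypothetical.

```python
import hashlib


def simhash(tokens, bits=32):
    """Return a `bits`-wide SimHash fingerprint of an iterable of string tokens."""
    counts = [0] * bits
    mask = (1 << bits) - 1
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask  # assumption: keep the low `bits` bits
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming(simhash(doc1), simhash(doc2)))  # small distance suggests near-duplicates
```

Because the fingerprint is a per-bit majority vote, distinct inputs can map to the same value far more easily than with the underlying MD5 digest, which is the kind of collision behaviour an integrity check would need to account for.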

Coherence-Based Modeling of Clinical Concepts Inferred from Heterogeneous Clinical Notes for ICU Patient Risk Stratification

In hospitals, critical care patients are often susceptible to various complications that adversely affect their morbidity and mortality. Digitized patient data from Electronic Health Records (EHRs) can be utilized to facilitate risk stratification accurately and provide prioritized care. Existing clinical decision support systems are heavily reliant on the structured nature of the EHRs. However, the valuable patient-specific data contained in unstructured clinical notes are often manually transcribed into EHRs. The prolific use of extensive medical jargon, heterogeneity, sparsity, rawness, inconsistent abbreviations, and complex structure of the clinical notes poses significant challenges, and also results in a loss of information during the manual conversion process. In this work, we employ two coherence-based topic modeling approaches to model the free-text in the unstructured clinical nursing notes and capture its semantic textual features with the emphasis on human interpretability. Furthermore, we present FarSight, a long-term aggregation mechanism intended to detect the onset of disease with the earliest recorded symptoms and infections. We utilize the predictive capabilities of deep neural models for the clinical task of risk stratification through ICD-9 code group prediction. Our experimental validation on MIMIC-III (v1.4) database underlined the efficacy of FarSight with coherence-based topic modeling, in extracting discriminative clinical features from the unstructured nursing notes. The proposed approach achieved a superior predictive performance when benchmarked against the structured EHR data based state-of-the-art model, with an improvement of 11.50% in AUPRC and 1.16% in AUROC.