Amir Hossein Razavi - Academia.edu

Papers by Amir Hossein Razavi

Report on formal analysis of autopoietic P2P network, together with predictions of performance (Deliverable D3.2/Open Philosophies for Associative Autopoietic Digital Ecosystems Contract n° 034824)

A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

Artificial Intelligence in Medicine, 2005

In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extracting rules to predict the outcomes of new patients. However, incompleteness and high dimensionality of stored data are a problem. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique to preserve the character of the original data by omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by tenfold cross validation and the effect of pre-processing was analysed by applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.

eDoctor: machine learning and the future of medicine

Journal of Internal Medicine, Sep 3, 2018

Machine learning (ML) is a burgeoning field of medicine with huge resources being applied to fuse computer science and statistics to medical problems. Proponents of ML extol its ability to deal with large, complex and disparate data, often found within medicine, and feel that ML is the future for biomedical research, personalized medicine and computer-aided diagnosis to significantly advance global health care. However, the concepts of ML are unfamiliar to many medical professionals and there is untapped potential in the use of ML as a research tool. In this article, we provide an overview of the theory behind ML, explore the common ML algorithms used in medicine including their pitfalls and discuss the potential future of ML in medicine.

Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods

AJR. American journal of roentgenology, Jan 17, 2018

Machine learning (ML) and artificial intelligence (AI) are rapidly becoming the most talked about and controversial topics in radiology and medicine. Over the past few years, the number of ML- or AI-focused studies in the literature has increased almost exponentially, and ML has become a hot topic at academic and industry conferences. However, despite the increased awareness of ML as a tool, many medical professionals have a poor understanding of how ML works and how to critically appraise studies and tools that are presented to us. Thus, we present a brief overview of ML, explain the metrics used in ML and how to interpret them, and explain some of the technical jargon associated with the field so that readers with a medical background and basic knowledge of statistics can feel more comfortable when examining ML applications. Attention to sample size, overfitting, underfitting, cross validation, as well as a broad knowledge of the metrics of machine learning, can help those with ...
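As a rough illustration of the kind of metrics this overview refers to, the following sketch computes accuracy, precision, recall (sensitivity), specificity and F1 from a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Common evaluation metrics derived from a 2x2 confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives. Ratios with an empty
    denominator are reported as 0.0 (an assumption of this sketch).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for a test set of 200 cases.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, as in this made-up example, accuracy can look good while recall is modest, which is one reason to read several metrics together.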

Text Representation and General Topic Annotation Based on Latent Dirichlet Allocation

We propose a low-dimensional text representation method for topic classification. The Latent Dirichlet Allocation (LDA) model is built on a large amount of unlabelled data, in order to extract potential topic clusters. Each document is represented as a distribution over these clusters. We experiment with two datasets. We collected the first dataset from the FriendFeed social network and we manually annotated part of it with 10 general classes. The second dataset is a standard text classification benchmark, Reuters 21578, the R8 subset (annotated with 8 classes). We show that classification based on the LDA representation leads to acceptable results, while combining a bag-of-words representation with the LDA representation leads to further improvements. We also propose a multi-level LDA representation that catches topic cluster distributions from generic ones to more specific ones.

Personal Health Information detection in unstructured web documents

Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems, 2013

This paper describes our study of the incidence of Personal Health Information (PHI) on the Web. PHI is usually shared under conditions of confidentiality, protection and trust, and should not be disclosed or available to unrelated third parties or the general public. We first analyzed the characteristics that potentially make systems successful in identification of unsolicited or unjustified PHI disclosures. In the next stage, we designed and implemented an integrated Natural Language Processing/Machine Learning (NLP/ML)-based system that detects disclosures of personal health information, specifically according to the above characteristics including detected patterns. This research is regarded as the first step toward a learning system that will be trained based on a limited training set built on the result of the processing chain described in the paper in order to generally detect the PHI disclosures over the web.

Modeling and predicting cascading removal phenomenon over social networks

Social Network Analysis and Mining, 2014

Innovations, opinions, ideas, recommendations or tendencies emerge in a variety of social networks. They can either disappear quickly or propagate and create considerable impact on the network. Their disappearance may also spread from one node to another across the network, creating cascading behavior. Cascading phenomena are mainly analyzed by identifying the most influential nodes according to their features in the network, by quickly detecting the phenomenon, or by targeting a minimum set of nodes that could maximize the spread of influence or minimize the propagation of a rumor or an outbreak. The objective of the present work is to predict the nodes to be deleted in cascade following the disappearance of one or many nodes. The cascading removal phenomenon is imitated by three well-known influence maximization cascading models in addition to two variants of a new cascading strategy which sound more consistent with human intuition over cascading removals. The prediction is done for an individual iteration of the cascading models, with the ability to be projected over the entire course of cascades without any loss of generality. We compare the prediction accuracy over three real-life networks and five synthetically generated schemas that imitate real social networks.
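To make the idea of a single cascade iteration concrete, the following is a minimal sketch of one removal step under a simple threshold rule: a node leaves when at least a given fraction of its neighbours has already left. This rule, the function name and the toy graph are assumptions made for illustration; the abstract does not specify the models used in the paper.

```python
def removal_step(adjacency, removed, threshold=0.5):
    """Return the set of nodes newly removed in one cascade iteration.

    adjacency: dict mapping each node to the set of its neighbours.
    removed:   set of nodes that have already left the network.
    threshold: fraction of departed neighbours that triggers removal
               (a hypothetical parameter of this sketch).
    """
    newly_removed = set()
    for node, neighbours in adjacency.items():
        if node in removed or not neighbours:
            continue
        departed = sum(1 for n in neighbours if n in removed)
        if departed / len(neighbours) >= threshold:
            newly_removed.add(node)
    return newly_removed


# Toy undirected network (hypothetical).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b", "e"},
    "e": {"d"},
}

removed = {"a"}                      # the initial disappearance
step1 = removal_step(graph, removed)
step2 = removal_step(graph, removed | step1)
print(step1, step2)
```

Applying the step repeatedly until it returns an empty set projects the single-iteration rule over the whole course of a cascade.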

Applications of Knowledge Discovery in Quality Registries-Predicting Recurrence of Breast Cancer and Analyzing Non-compliance with a Clinical Guideline

LINKOPING UNIVERSITY MEDICAL …, 2007

In medicine, data are produced from different sources and continuously stored in data depositories. Examples of these growing databases are quality registries. In Sweden, there are many cancer registries where data on cancer patients are gathered and recorded and are used mainly for reporting survival analyses to high level health authorities. In this thesis, a breast cancer quality registry operating in the south-east of Sweden is used as the data source for newer analytical techniques, i.e. data mining as a part of knowledge discovery in databases (KDD) methodology. Analyses are done to sift through these data in order to find interesting information and hidden knowledge. KDD consists of multiple steps, starting with gathering data from different sources and preparing them in data pre-processing stages prior to the main analysis with data mining. Data were cleaned from outliers and noise and missing values were handled. Then a proper subset of the data was chosen by canonical correlation analysis (CCA) in a dimensionality reduction step. This technique was chosen because there were multiple outcomes, and variables had complex relationships to one another. After data were prepared, they were analyzed with a data mining method. Decision tree induction as a simple and efficient method was used to mine the data. To show the benefits of proper data pre-processing, results from data mining with pre-processing of the data were compared with results from data mining without data pre-processing. The comparison showed that data pre-processing results in a more compact model with a better performance in predicting the recurrence of cancer. An important part of knowledge discovery in medicine is to increase the involvement of medical experts in the process. This starts with enquiry about current problems in their field, which leads to finding areas where computer support can be helpful. The experts can suggest potentially important variables and should then approve and validate new patterns or knowledge as predictive or descriptive models. If it can be shown that the performance of a model is comparable to domain experts, it is more probable that the model will be used to support physicians in their daily decision-making. In this thesis, we validated the model by comparing ... The thesis is based on five papers, which are referred to in the text by their Roman numerals.

ArcView GIS/avenue programmer's reference

Concurrency Control and Recovery Management for Open e-Business Transactions

Concurrency control mechanisms such as turn-taking, locking, serialization, transactional locking mechanism, and operational transformation try to provide data consistency when concurrent activities are permitted in a reactive system. Locks are typically used in transactional models for assurance of data consistency and integrity in a concurrent environment. In addition, recovery management is used to preserve atomicity and durability in transaction models. Unfortunately, conventional lock mechanisms severely (and intentionally) limit concurrency in a transactional environment. Such lock mechanisms also limit recovery capabilities. Finally, existing recovery mechanisms themselves impose a considerable overhead on concurrency. This paper describes a new transaction model that supports release of early results inside and outside of a transaction, decreasing the severe limitations of conventional lock mechanisms, yet still guarantees consistency and recoverability of released resources (results). This is achieved through use of a more flexible locking mechanism and by using two types of consistency graph. This provides an integrated solution for transaction management, recovery management and concurrency control. We argue that these are necessary features for management of long-term transactions within "digital ecosystems" of small to medium enterprises.

General Topic Annotation in Social Networks: A Latent Dirichlet Allocation Approach

Lecture Notes in Computer Science, 2013

Text Representation Using Multi-level Latent Dirichlet Allocation

Lecture Notes in Computer Science, 2014

We introduce a novel text representation method to be applied on corpora containing short / medium length textual documents. The method applies Latent Dirichlet Allocation (LDA) on a corpus to infer its major topics, which will be used for document representation. The representation that we propose has multiple levels (granularities) by using different numbers of topics. We postulate that interpreting data in a more general space, with fewer dimensions, can improve the representation quality. Experimental results support the informative power of our multi-level representation vectors. We show that choosing the correct granularity of representation is an important aspect of text classification. We propose a multi-level representation, at different topical granularities, rather than choosing one level. The documents are represented by topical relevancy weights, in a low-dimensional vector representation. Finally, the proposed representation is applied to a text classification task using several well-known classification algorithms. We show that it leads to very good classification performance. Another advantage is that, with a small compromise on accuracy, our low-dimensional representation can be fed into many supervised or unsupervised machine learning algorithms that empirically cannot be applied on the conventional high-dimensional text representation methods.
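One simple way to picture a multi-level representation is to concatenate a document's topic distributions from several LDA models trained with different numbers of topics. The sketch below assumes those per-level distributions have already been inferred elsewhere; the function name, the concatenation strategy and the example weights are illustrative assumptions, not necessarily the construction used in the paper.

```python
def multi_level_vector(topics_by_level):
    """Concatenate one document's topic weights from coarse to fine levels.

    topics_by_level: dict mapping the number of topics of an LDA model
    to this document's topic distribution under that model.
    """
    vector = []
    for num_topics in sorted(topics_by_level):   # coarse levels first
        weights = topics_by_level[num_topics]
        if len(weights) != num_topics:
            raise ValueError(
                f"expected {num_topics} weights, got {len(weights)}"
            )
        vector.extend(weights)
    return vector


# Hypothetical topic weights for one document at two granularities.
doc_topics = {
    4: [0.60, 0.30, 0.05, 0.05],
    2: [0.90, 0.10],
}
vec = multi_level_vector(doc_topics)
print(len(vec), vec)
```

The resulting vector has as many dimensions as the sum of the topic counts across levels, which stays far below a typical bag-of-words vocabulary size.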

Classifying Biomedical Abstracts Using Committees of Classifiers and Collective Ranking Techniques

Lecture Notes in Computer Science, 2009

The purpose of this work is to reduce the workload of human experts in building systematic reviews from published articles, used in evidence-based medicine. We propose to use a committee of classifiers to rank biomedical abstracts based on the predicted relevance to the topic under review. In our approach, we identify two subsets of abstracts: one that represents the top, and another that represents the bottom of the ranked list. These subsets, identified using machine learning (ML) techniques, are considered zones where abstracts are labeled with high confidence as relevant or irrelevant to the topic of the review. Early experiments with this approach using different classifiers and different representation techniques show significant workload reduction.
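The ranking-and-zones idea can be sketched as follows, assuming each committee member outputs a relevance score per abstract and the committee's verdict is the mean score. The averaging rule, the fixed zone size, the function name and the scores are all assumptions made for illustration; the abstract does not state how the paper combines votes or sizes the zones.

```python
def rank_with_zones(committee_scores, zone_size):
    """Rank abstracts by mean committee score and cut top/bottom zones.

    committee_scores: dict mapping an abstract id to the list of
    relevance scores given by each classifier in the committee.
    Returns (ranking, top_zone, bottom_zone), best-ranked first.
    """
    mean_score = {
        doc_id: sum(scores) / len(scores)
        for doc_id, scores in committee_scores.items()
    }
    ranking = sorted(mean_score, key=mean_score.get, reverse=True)
    top_zone = ranking[:zone_size]
    bottom_zone = ranking[-zone_size:] if zone_size else []
    return ranking, top_zone, bottom_zone


# Hypothetical scores from a committee of three classifiers.
scores = {
    "abs1": [0.90, 0.80, 0.95],
    "abs2": [0.20, 0.10, 0.30],
    "abs3": [0.60, 0.50, 0.40],
    "abs4": [0.05, 0.10, 0.00],
    "abs5": [0.70, 0.90, 0.80],
}
ranking, top_zone, bottom_zone = rank_with_zones(scores, zone_size=2)
print(ranking, top_zone, bottom_zone)
```

In this reading, abstracts in the two zones would be labeled automatically, and only the middle of the ranking would be left for human reviewers.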

Parameterized Contrast in Second Order Soft Co-occurrences: A Novel Text Representation Technique in Text Mining and Knowledge Extraction

2009 IEEE International Conference on Data Mining Workshops, 2009

In this article, we present a novel statistical representation method for knowledge extraction from a corpus containing short texts. Then we introduce the contrast parameter, which can be adjusted for targeting different conceptual levels in text mining and knowledge extraction. The method is based on second order co-occurrence vectors whose efficiency for representing meaning has been established in many applications, especially for representing word senses in different contexts and for disambiguation purposes. We evaluate our method on two tasks: classification of textual descriptions of dreams, and classification of medical abstracts for systematic reviews.
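For readers unfamiliar with second order co-occurrence vectors, the following minimal sketch builds first-order co-occurrence counts over a tiny corpus and then represents a text as the average of the first-order vectors of its words. It deliberately omits the paper's contrast parameter and "soft" weighting, whose definitions are not given in the abstract; the function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def first_order(corpus):
    """Map each word to counts of the other words sharing a text with it."""
    cooc = defaultdict(Counter)
    for text in corpus:
        words = text.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    cooc[word][other] += 1
    return cooc


def second_order(text, cooc):
    """Average the first-order vectors of the known words in `text`."""
    words = [w for w in text.lower().split() if w in cooc]
    if not words:
        return {}
    total = Counter()
    for word in words:
        total.update(cooc[word])
    return {term: count / len(words) for term, count in total.items()}


# Toy corpus of short texts (hypothetical).
corpus = ["dream of flying", "flying over water", "water and fire"]
cooc = first_order(corpus)
vec = second_order("dream water", cooc)
print(vec)
```

Note that "flying" receives the highest weight even though it does not occur in the represented text: the text is described by the contexts of its words rather than by the words themselves.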

Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)

In this dissertation, we introduce a novel text representation method mainly used for text classification purposes. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike the traditional representation methods, in which all the documents are represented solely based on the constituent words of the documents, and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning p...

Classifying Malicious Domains using DNS Traffic Analysis

2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2021

Lightweight Hybrid Detection of Data Exfiltration using DNS based on Machine Learning

2021 the 11th International Conference on Communication and Network Security, 2021

Leakage Detection of Confidential Information in Unstructured Web Documents

We study the presence of Personal Health Information (PHI) on the Internet. We survey the existing methods and systems used to detect PHI. We analyse what characteristics make the systems successful in the given tasks. Identification of such characteristics is an important step for the design and implementation of new systems which can detect an unsolicited PHI disclosure on the Internet. PHI, which is usually shared on conditions of confidentiality, protection and trust, should not be disclosed to unrelated third parties or the general public.

Topic Classification using Latent Dirichlet Allocation at Multiple Levels

We propose a novel low-dimensional text representation method for topic classification. Several Latent Dirichlet Allocation (LDA) models are built on a large amount of unlabelled data, in order to extract potential topic clusters, at different levels of generalization. Each document is represented as a distribution over these topic clusters. We experiment with two datasets. We collected the first dataset from the FriendFeed social network and we manually annotated part of it with 10 general classes. The second dataset is a standard text classification benchmark, Reuters 21578, the R8 subset (annotated with 8 classes). We show that classification based on our multi-level LDA representation leads to improved results for both datasets. Our representation catches topic distributions from generic ones to more specific ones and allows the machine learning algorithm to choose the appropriate level of generalization for the task. Another advantage is the dimensionality reduction, which per...

Search Engine Optimization: A New Method

Research paper thumbnail of Report on formal analysis of autopoietic P2P network, together with predictions of performance (Deliverable D3.2/Open Philosophies for Associative Autopoietic Digital Ecosystems Contract n° 034824)

Research paper thumbnail of A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

Artificial Intelligence in Medicine, 2005

In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extract... more In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extracting rules to predict the outcomes of new patients. However, incompleteness and high dimensionality of stored data are a problem. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique to preserve the character of the original data by omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by tenfold cross validation and the effect of pre-processing was analysed by applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.

Research paper thumbnail of eDoctor: machine learning and the future of medicine

Journal of Internal Medicine, Sep 3, 2018

Machine learning (ML) is a burgeoning field of medicine with huge resources being applied to fuse... more Machine learning (ML) is a burgeoning field of medicine with huge resources being applied to fuse computer science and statistics to medical problems. Proponents of ML extol its ability to deal with large, complex and disparate data, often found within medicine and feel that ML is the future for biomedical research, personalized medicine, computer-aided diagnosis to significantly advance global health care. However, the concepts of ML are unfamiliar to many medical professionals and there is untapped potential in the use of ML as a research tool. In this article, we provide an overview of the theory behind ML, explore the common ML algorithms used in medicine including their pitfalls and discuss the potential future of ML in medicine.

Research paper thumbnail of Peering Into the Black Box of Artificial Intelligence: Evaluation Metrics of Machine Learning Methods

AJR. American journal of roentgenology, Jan 17, 2018

Machine learning (ML) and artificial intelligence (AI) are rapidly becoming the most talked about... more Machine learning (ML) and artificial intelligence (AI) are rapidly becoming the most talked about and controversial topics in radiology and medicine. Over the past few years, the numbers of ML- or AI-focused studies in the literature have increased almost exponentially, and ML has become a hot topic at academic and industry conferences. However, despite the increased awareness of ML as a tool, many medical professionals have a poor understanding of how ML works and how to critically appraise studies and tools that are presented to us. Thus, we present a brief overview of ML, explain the metrics used in ML and how to interpret them, and explain some of the technical jargon associated with the field so that readers with a medical background and basic knowledge of statistics can feel more comfortable when examining ML applications. Attention to sample size, overfitting, underfitting, cross validation, as well as a broad knowledge of the metrics of machine learning, can help those with ...

Research paper thumbnail of Text Representation and General Topic Annotation Based on Latent Dirichlet Allocation

ABSTRACT We propose a low-dimensional text representation method for topic classifi�cation. The L... more ABSTRACT We propose a low-dimensional text representation method for topic classifi�cation. The Latent Dirichet Allocation (LDA) model is built on a large amount of unlabelled data, in order to extract potential topic clusters. Each document is represented as a distribution over these clusters. We experiment with two datasets. We collected the �first dataset from the FriendFeed social network and we manually annotated part of it with 10 general classes. The second dataset is a standard text classi�fication benchmark, Reuters 21578, the R8 subset (annotated with 8 classes). We show that classi�fication based on the LDA representation leads to acceptable results, while combining a bag-of-words representation with the LDA representation leads to further improvements. We also propose a multi-level LDA representation that catches topic cluster distributions from generic ones to more specifi�c ones.

Research paper thumbnail of Personal Health Information detection in unstructured web documents

Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems, 2013

This paper describes our study of the incidence of Personal Health Information (PHI) on the Web. ... more This paper describes our study of the incidence of Personal Health Information (PHI) on the Web. PHI is usually shared under conditions of confidentiality, protection and trust, and should not be disclosed or available to unrelated third parties or the general public. We first analyzed the characteristics that potentially make systems successful in identification of unsolicited or unjustified PHI disclosures. In the next stage, we designed and implemented an integrated Natural Language Processing/Machine Learning (NLP/ML)-based system that detects disclosures of personal health information, specifically according to the above characteristics including detected patterns. This research is regarded as the first step toward a learning system that will be trained based on a limited training set built on the result of the processing chain described in the paper in order to generally detect the PHI disclosures over the web.

Research paper thumbnail of Modeling and predicting cascading removal phenomenon over social networks

Social Network Analysis and Mining, 2014

Innovations, opinions, ideas, recommendations or tendencies emerge in a variety of social network... more Innovations, opinions, ideas, recommendations or tendencies emerge in a variety of social networks. They can either disappear quickly or propagate and create considerable impact on the network. Their disappearance may also spread from one node to another across the network creating cascading behavior. Cascading phenomenon is mainly analyzed either by identifying the most influential nodes according to their features in the network, detecting quickly the phenomenon or targeting a minimum set of nodes that could maximize the spread of influence or minimize the propagation of a rumor or an outbreak. The objective of the present work is to predict the nodes to be deleted in cascade following the disappearance of one or many nodes. The cascading removal phenomenon is imitated by three well-known influence maximization cascading models in addition to two variants of a new cascading strategy which sound more consistent with human intuition over cascading removals. The prediction is done for an individual iteration of the cascading models, with the ability to be projected over the entire course of cascades without any loss of generality. We compare the prediction accuracy over three real-life networks and five synthetically generated schemas that imitate real social networks.

Research paper thumbnail of Applications of Knowledge Discovery in Quality Registries-Predicting Recurrence of Breast Cancer and Analyzing Non-compliance with a Clinical Guideline

LINKOPING UNIVERSITY MEDICAL …, 2007

In medicine, data are produced from different sources and continuously stored in data depositorie... more In medicine, data are produced from different sources and continuously stored in data depositories. Examples of these growing databases are quality registries. In Sweden, there are many cancer registries where data on cancer patients are gathered and recorded and are used mainly for reporting survival analyses to high level health authorities. In this thesis, a breast cancer quality registry operating in SouthEast of Sweden is used as the data source for newer analytical techniques, i.e. data mining as a part of knowledge discovery in databases (KDD) methodology. Analyses are done to sift through these data in order to find interesting information and hidden knowledge. KDD consists of multiple steps, starting with gathering data from different sources and preparing them in data pre-processing stages prior to the main analysis with data mining. Data were cleaned from outliers and noise and missing values were handled. Then a proper subset of the data was chosen by canonical correlation analysis (CCA) in a dimensionality reduction step. This technique was chosen because there were multiple outcomes, and variables had complex relationship to one another. After data were prepared, they were analyzed with a data mining method. Decision tree induction as a simple and efficient method was used to mine the data. To show the benefits of proper data pre-processing, results from data mining with pre-processing of the data were compared with results from data mining without data pre-processing. The comparison showed that data pre-processing results in a more compact model with a better performance in predicting the recurrence of cancer. An important part of knowledge discovery in medicine is to increase the involvement of medical experts in the process. 
This starts with enquiry about current problems in their field, which leads to finding areas where computer support can be helpful. The experts can suggest potentially important variables and should then approve and validate new patterns or knowledge as predictive or descriptive models. If it can be shown that the performance of a model is comparable to domain experts, it is more probable that the model will be used to support physicians in their daily decision-making. In this thesis, we validated the model by comparing List of Publications This thesis is based on five papers, which will be referred to in the text by their roman numerals.

Research paper thumbnail of ArcView GIS/avenue programmer's reference

Research paper thumbnail of Concurrency Control and Recovery Management for Open e-Business Transactions

Concurrency control mechanisms such as turn-taking, locking, serialization, transactional locking... more Concurrency control mechanisms such as turn-taking, locking, serialization, transactional locking mechanism, and operational transformation try to provide data consistency when concurrent activities are permitted in a reactive system. Locks are typically used in transactional models for assurance of data consistency and integrity in a concurrent environment. In addition, recovery management is used to preserve atomicity and durability in transaction models. Unfortunately, conventional lock mechanisms severely (and intentionally) limit concurrency in a transactional environment. Such lock mechanisms also limit recovery capabilities. Finally, existing recovery mechanisms themselves afford a considerable overhead to concurrency. This paper describes a new transaction model that supports release of early results inside and outside of a transaction, decreasing the severe limitations of conventional lock mechanisms, yet still warranties consistency and recoverability of released resources (results). This is achieved through use of a more flexible locking mechanism and by using two types of consistency graph. This provides an integrated solution for transaction management, recovery management and concurrency control. We argue that these are necessary features for management of long-term transactions within "digital ecosystems" of small to medium enterprises

Research paper thumbnail of General Topic Annotation in Social Networks: A Latent Dirichlet Allocation Approach

Lecture Notes in Computer Science, 2013

Text Representation Using Multi-level Latent Dirichlet Allocation

Lecture Notes in Computer Science, 2014

We introduce a novel text representation method for corpora containing short- to medium-length textual documents. The method applies Latent Dirichlet Allocation (LDA) to a corpus to infer its major topics, which are then used for document representation. The representation that we propose has multiple levels (granularities), obtained by using different numbers of topics. We postulate that interpreting data in a more general space, with fewer dimensions, can improve the representation quality. Experimental results support the informative power of our multi-level representation vectors. We show that choosing the correct granularity of representation is an important aspect of text classification. We propose a multi-level representation, at different topical granularities, rather than choosing one level. The documents are represented by topical relevancy weights, in a low-dimensional vector representation. Finally, the proposed representation is applied to a text classification task using several well-known classification algorithms. We show that it leads to very good classification performance. Another advantage is that, with a small compromise on accuracy, our low-dimensional representation can be fed into many supervised or unsupervised machine learning algorithms that, in practice, cannot be applied to conventional high-dimensional text representations.
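As a rough illustration of the multi-level idea described in this abstract, the sketch below fits LDA at several topic counts and concatenates each document's topic distributions into one low-dimensional vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the collapsed Gibbs sampler, the function names `lda_doc_topics` and `multi_level_representation`, the hyperparameters `alpha` and `beta`, and the chosen levels are all hypothetical.

```python
import random
from collections import Counter


def lda_doc_topics(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method)
    and return one topic distribution per tokenized document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # count of topic k in doc d
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total words in topic k
    assign = []                                          # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed, normalized topic proportions per document.
    result = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * num_topics
        result.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return result


def multi_level_representation(docs, levels=(2, 4)):
    """Concatenate per-document topic distributions from several granularities."""
    per_level = [lda_doc_topics(docs, k) for k in levels]
    return [sum((level[d] for level in per_level), []) for d in range(len(docs))]
```

Each document's vector has one block per level, and each block sums to 1, so a downstream classifier can weight whichever granularity turns out to be most useful.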

Classifying Biomedical Abstracts Using Committees of Classifiers and Collective Ranking Techniques

Lecture Notes in Computer Science, 2009

The purpose of this work is to reduce the workload of human experts in building systematic reviews from published articles, used in evidence-based medicine. We propose to use a committee of classifiers to rank biomedical abstracts based on the predicted relevance to the topic under review. In our approach, we identify two subsets of abstracts: one that represents the top, and another that represents the bottom of the ranked list. These subsets, identified using machine learning (ML) techniques, are considered zones where abstracts are labeled with high confidence as relevant or irrelevant to the topic of the review. Early experiments with this approach using different classifiers and different representation techniques show significant workload reduction.
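The ranking-and-zones idea in this abstract can be sketched as follows. This is a minimal sketch, assuming that each committee member outputs a relevance score per abstract and that scores are combined by a simple average; the abstract does not specify the combination rule, and the names `rank_with_committee`, `confidence_zones`, `top_n`, and `bottom_n` are hypothetical.

```python
from statistics import mean


def rank_with_committee(scores_by_member):
    """Rank abstract ids by the average relevance score across the committee.

    scores_by_member: list of dicts mapping abstract id -> score in [0, 1],
    one dict per classifier (averaging is an assumed combination rule).
    """
    ids = scores_by_member[0].keys()
    combined = {a: mean(member[a] for member in scores_by_member) for a in ids}
    return sorted(combined, key=combined.get, reverse=True)


def confidence_zones(ranked_ids, top_n, bottom_n):
    """Split a ranked list into a top zone (treated as relevant), a bottom zone
    (treated as irrelevant), and a middle zone left for human review."""
    n = len(ranked_ids)
    top = ranked_ids[:top_n]
    bottom = ranked_ids[n - bottom_n:]
    middle = ranked_ids[top_n:n - bottom_n]
    return top, middle, bottom


# Three hypothetical classifiers scoring five abstracts.
committee = [
    {"a": 0.90, "b": 0.2, "c": 0.6, "d": 0.10, "e": 0.5},
    {"a": 0.80, "b": 0.3, "c": 0.7, "d": 0.20, "e": 0.4},
    {"a": 0.95, "b": 0.1, "c": 0.5, "d": 0.15, "e": 0.6},
]
ranked = rank_with_committee(committee)
top, middle, bottom = confidence_zones(ranked, top_n=1, bottom_n=2)
```

Only the middle zone would then need manual screening, which is where the workload reduction would come from in a setup like this.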

Parameterized Contrast in Second Order Soft Co-occurrences: A Novel Text Representation Technique in Text Mining and Knowledge Extraction

2009 IEEE International Conference on Data Mining Workshops, 2009

In this article, we present a novel statistical representation method for knowledge extraction from a corpus containing short texts. We then introduce the contrast parameter, which can be adjusted to target different conceptual levels in text mining and knowledge extraction. The method is based on second order co-occurrence vectors, whose efficiency for representing meaning has been established in many applications, especially for representing word senses in different contexts and for disambiguation purposes. We evaluate our method on two tasks: classification of textual descriptions of dreams, and classification of medical abstracts for systematic reviews.
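For readers unfamiliar with second order co-occurrence vectors, the sketch below shows the generic construction: each word gets a first-order vector of the words it co-occurs with, and a text is represented by averaging the first-order vectors of its words. It is a minimal sketch under assumptions: co-occurrence is counted per document, the function names are hypothetical, and the paper's "soft" weighting and contrast parameter are not modeled because the abstract does not define them.

```python
from collections import Counter, defaultdict
from itertools import permutations


def first_order_vectors(corpus):
    """Map each word to counts of the words sharing a document with it."""
    vectors = defaultdict(Counter)
    for tokens in corpus:
        # Count each ordered pair of distinct words once per document.
        for word, context in permutations(set(tokens), 2):
            vectors[word][context] += 1
    return vectors


def second_order_vector(tokens, vectors):
    """Average the first-order vectors of the words in a text.

    Words never seen co-occurring in the corpus are ignored.
    """
    known = [w for w in tokens if w in vectors]
    if not known:
        return {}
    total = Counter()
    for w in known:
        total.update(vectors[w])
    return {context: count / len(known) for context, count in total.items()}


corpus = [
    ["dream", "flying", "sky"],
    ["dream", "falling"],
    ["sky", "blue"],
]
vectors = first_order_vectors(corpus)
representation = second_order_vector(["flying", "blue"], vectors)
```

Here the text "flying blue" is represented through "dream" and "sky", words it does not contain but that its own words co-occur with, which is what lets short texts with little direct overlap still be compared.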

Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)

In this dissertation, we introduce a novel text representation method used mainly for text classification. The representation is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike traditional representation methods, in which all the documents are represented solely based on their constituent words and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning p...

Classifying Malicious Domains using DNS Traffic Analysis

2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2021

Lightweight Hybrid Detection of Data Exfiltration using DNS based on Machine Learning

2021 the 11th International Conference on Communication and Network Security, 2021

Leakage Detection of Confidential Information in Unstructured Web Documents

We study the presence of Personal Health Information (PHI) on the Internet. We survey the existing methods and systems used to detect PHI. We analyse what characteristics make the systems successful in the given tasks. Identification of such characteristics is an important step for the design and implementation of new systems which can detect an unsolicited PHI disclosure on the Internet. PHI, which is usually shared on conditions of confidentiality, protection and trust, should not be disclosed to unrelated third parties or the general public.

Topic Classification using Latent Dirichlet Allocation at Multiple Levels

We propose a novel low-dimensional text representation method for topic classification. Several Latent Dirichlet Allocation (LDA) models are built on a large amount of unlabelled data, in order to extract potential topic clusters, at different levels of generalization. Each document is represented as a distribution over these topic clusters. We experiment with two datasets. We collected the first dataset from the FriendFeed social network and we manually annotated part of it with 10 general classes. The second dataset is a standard text classification benchmark, Reuters 21578, the R8 subset (annotated with 8 classes). We show that classification based on our multi-level LDA representation leads to improved results for both datasets. Our representation captures topic distributions from generic ones to more specific ones and allows the machine learning algorithm to choose the appropriate level of generalization for the task. Another advantage is the dimensionality reduction, which per...

Search Engine Optimization: A New Method