Flavio Villanustre - Academia.edu
Papers by Flavio Villanustre
Journal of Big Data, Mar 9, 2024
Springer eBooks, 2018
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Journal of Big Data
Fraud datasets oftentimes lack consistent and accurate labels and are characterized by high class imbalance, where fraudulent examples are far fewer than normal ones. Designing machine learning that effectively detects fraud is an important task, since fraudulent behavior can have significant financial or health consequences, but it faces significant challenges due to the class imbalance and the limited availability of reliable labels. This paper presents an unsupervised fraud detection method that uses an iterative cleaning process for effective fraud detection. We measure our method's performance using a newly created Medicare fraud big dataset and a widely used credit card fraud dataset. Additionally, we detail the process of creating the highly imbalanced Medicare dataset from multiple publicly available sources, how additional trainable features were added, and how fraudulent labels were assigned for final model performance measurements. The results are...
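The abstract does not spell out the iterative cleaning loop, so the following is only a minimal sketch of one plausible reading: an unsupervised anomaly scorer (here scikit-learn's IsolationForest, a stand-in choice, not necessarily the paper's model) is refit repeatedly while the most anomalous rows are dropped, so later rounds train on progressively cleaner data; the final model then scores every transaction. Function names and parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def iterative_clean_scores(X, rounds=3, drop_frac=0.02, seed=0):
    """Score each row for anomalousness, iteratively dropping the most
    anomalous rows so later rounds fit on progressively 'cleaner' data.
    Hypothetical illustration only; not the paper's exact procedure."""
    keep = np.arange(len(X))
    model = None
    for _ in range(rounds):
        model = IsolationForest(random_state=seed).fit(X[keep])
        scores = -model.score_samples(X[keep])        # higher = more anomalous
        cutoff = np.quantile(scores, 1.0 - drop_frac)
        keep = keep[scores < cutoff]                   # drop suspected fraud/noise
    # final model (fit on cleaned data) scores the full dataset
    return -model.score_samples(X)

# usage: rank all transactions and inspect the top-k as fraud candidates
X = np.random.RandomState(0).normal(size=(1000, 8))
fraud_scores = iterative_clean_scores(X)
top_suspects = np.argsort(fraud_scores)[::-1][:20]
```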
International Workshop on Big Data Software Engineering, May 16, 2015
Big Data Analytics in particular and Data Science in general have become key disciplines in the last decade. The convergence of Information Technology, Statistics, and Mathematics to explore and extract information from Big Data has challenged the way many industries used to operate, shifting the decision-making process in many organizations. A new breed of Big Data platforms has appeared to fulfill the need to process data that is large, complex, variable, and rapidly generated. The author describes the experience in this field from a company that provides Big Data analytics as its core business.
Journal of Big Data, 2015
Big Data Analytics and Deep Learning are two high-focus areas of data science. Big Data has become important as many organizations, both public and private, have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and un-categor...
2017 IEEE International Conference on Big Data (Big Data), 2017
The proliferation of Big Data processing environments such as Hadoop, Apache Spark, and HPCC Systems is driving the development of performance analysis tools in these distributed systems. The goal is to achieve high performance through the optimization of Big Data applications. However, tuning performance in a fine-grained manner is quite challenging due to the high complexity and massive size of the distributed systems. ECL-Watch is a data-flow-based, fine-grained, comprehensive Big Data performance analysis tool built around the high-level declarative dataflow programming language ECL in HPCC Systems. As a case study, we implement and optimize the Yinyang K-Means machine learning algorithm in ECL on HPCC Systems. The experimental results show that the performance of the native ECL version of the Yinyang K-Means algorithm increased significantly after tuning: from being about three times slower than the standard K-Means implementation in ECL, to becoming roughly 15% faster than standard K...
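For context on why Yinyang K-Means can outperform standard K-Means once tuned: it maintains distance bounds (per point, and per group of centers in the full algorithm) so most points can skip the full distance computation on each iteration. The sketch below is a simplified, single-group (global-bound) variant in Python, not the ECL implementation from the paper; all names are illustrative.

```python
import numpy as np

def yinyang_style_kmeans(X, k, iters=20, seed=0):
    """Simplified, single-group sketch of the Yinyang K-Means idea:
    keep per-point upper/lower distance bounds and only recompute
    distances for points whose bounds cannot rule out a reassignment."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    assign = d.argmin(axis=1)
    upper = d[np.arange(len(X)), assign]           # distance to own center
    lower = np.partition(d, 1, axis=1)[:, 1]       # distance to 2nd-closest center
    for _ in range(iters):
        new_centers = np.array([X[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])
        drift = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        upper += drift[assign]                     # bounds loosen as centers move
        lower -= drift.max()
        redo = upper > lower                       # only these points need full distances
        if redo.any():
            d = np.linalg.norm(X[redo][:, None] - centers[None], axis=2)
            new_assign = d.argmin(axis=1)
            assign[redo] = new_assign
            upper[redo] = d[np.arange(len(new_assign)), new_assign]
            lower[redo] = np.partition(d, 1, axis=1)[:, 1]
    return assign, centers
```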
2019 4th International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), 2019
Clustering algorithms are an important part of unsupervised machine learning. With Big Data, applying clustering algorithms such as KMeans has become a challenge due to the significantly larger volume of data and the computational complexity of the standard approach, Lloyd's algorithm. This work aims to tackle this challenge by transforming the classic KMeans clustering algorithm to be highly scalable and able to operate on Big Data. We leverage the distributed computing environment of the HPCC Systems platform. The presented KMeans algorithm adopts a hybrid parallelism method to achieve a massively scalable parallel KMeans. Our approach can save a significant amount of time for researchers and machine learning practitioners who train hundreds of models on a daily basis. The performance is evaluated with datasets and cluster counts of different sizes, and the results show the significant scalability of the parallel KMeans algorithm.
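The paper's hybrid-parallel ECL implementation is not reproduced here, but the core data-parallel pattern behind scalable K-Means is standard: each partition (one per node) computes local per-cluster sums and counts for its rows, and a global reduce recomputes the centroids. A minimal Python sketch of that pattern, with hypothetical names, follows.

```python
import numpy as np

def parallel_kmeans_step(partitions, centers):
    """One data-parallel Lloyd's iteration: each partition produces local
    per-cluster sums and counts, which are then reduced globally.
    A sketch of the general pattern, not the ECL/HPCC implementation."""
    k, dim = centers.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for part in partitions:                        # in a cluster this runs on each node
        d = np.linalg.norm(part[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            mask = labels == c
            sums[c] += part[mask].sum(axis=0)
            counts[c] += mask.sum()
    nonempty = counts > 0
    centers = centers.copy()
    centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centers

# usage: split the data into per-node partitions and iterate to convergence
rng = np.random.RandomState(0)
data = rng.normal(size=(10000, 4))
partitions = np.array_split(data, 8)               # e.g. 8 worker nodes
centers = data[rng.choice(len(data), 3, replace=False)]
for _ in range(10):
    centers = parallel_kmeans_step(partitions, centers)
```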
Infrastructure as a Service, one of the most disruptive aspects of cloud computing, enables configuring a cluster for each application for each workload. When the workload changes, a cluster will be either underutilized (wasting resources) or unable to meet demand (incurring opportunity costs). Consequently, efficient cluster resizing requires proper data replication and placement. Our work reveals that coarse-grain, workload-aware replication addresses over-utilization but cannot resolve under-utilization. With fine-grain partitioning of the dataset, data replication can reduce both under- and over-utilization. In our empirical studies, compared to a naïve uniform data replication, a coarse-grain workload-aware replication increases throughput by 81% on a highly skewed workload. A fine-grain scheme further reaches a 166% increase. Furthermore, a surprisingly small increase in granularity is sufficient to obtain most benefits. Evaluations also show that maximizing the number of unique ...
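As a rough illustration of workload-aware replication (not the paper's placement algorithm), one simple policy assigns each partition a baseline replica count and distributes the remaining replicas in proportion to how often the workload accesses that partition, so hot partitions in a skewed workload get more copies. All names and numbers below are hypothetical.

```python
from collections import Counter

def workload_aware_replicas(access_log, total_replicas, min_replicas=1):
    """Assign replica counts to data partitions in proportion to how often
    the workload touches them (hot partitions get more copies). Hypothetical
    illustration of the general idea, not the paper's placement scheme."""
    freq = Counter(access_log)                      # partition_id -> access count
    total = sum(freq.values())
    plan = {p: min_replicas for p in freq}
    extra = max(0, total_replicas - min_replicas * len(freq))
    pool = extra
    for p, count in freq.most_common():
        share = min(int(pool * count / total), extra)  # proportional share of the extra pool
        plan[p] += share
        extra -= share
    plan[freq.most_common(1)[0][0]] += extra        # any leftover goes to the hottest partition
    return plan

# usage: a skewed workload concentrates the extra replicas on the hot partition
log = ["p1"] * 80 + ["p2"] * 15 + ["p3"] * 5
print(workload_aware_replicas(log, total_replicas=12))
```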
Graph theory and the study of networks can be traced back to Leonhard Euler’s original paper on the Seven Bridges of Königsberg in 1736 [1]. Although the mathematical foundations for understanding graphs have been laid out over the last few centuries [2, 3, 4], it wasn’t until recently, with the advent of modern computers, that parsing and analysis of large-scale graphs became tractable [5]. In the last decade, graph theory gained mainstream popularity following the adoption of graph models for new application domains, including social networks and the web of data, both generating extremely large and dynamic graphs that cannot be adequately handled by legacy graph management applications [6].
For several decades, LexisNexis Risk Solutions has provided real-time risk assessment and management services via its easy-to-use big data analytics solutions, building a reputation for precision, speed, and breadth over the years. Now a global organization headquartered in Alpharetta, Georgia, LexisNexis provides services and solutions such as identity management, risk scoring, fraud detection/analytics, as well as data aggregation and management, to some of the world’s largest banking institutions, retail establishments, and insurance companies. LexisNexis Risk Solutions is an established industry leader with more than $1.5 billion in annual revenues and an ever-expanding global presence.
2015 IEEE International Symposium on Workload Characterization, 2015
Infrastructure as a Service, one of the distinguishing characteristics of cloud computing, enables configuring a cluster for each application for each workload. When the workload changes, a statically sized cluster with a fixed capacity will be either underutilized (wasting resources) or unable to meet demand (incurring opportunity costs). In cloud computing, a new cluster configuration can easily be used for the next run of the application. As the workload increases, the cluster can expand. However, efficient cluster expansion requires proper data replication and placement. This paper focuses on workload-aware data replication and placement to support efficient cloud computing. It examines the tradeoffs between replication factors, partition granularity, and placement strategy. It shows that coarse-grain, workload-aware replication is able to improve performance over a naïve uniform data placement. Dividing the dataset into small sets, fine-grain replication, improves performance be...
Journal of Big Data
This project is funded by the US National Science Foundation (NSF) through its RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health...
Journal of Big Data
The increasing reliance on electronic health records (EHRs) in areas such as medical research should be addressed by using ample safeguards for patient privacy. These records often tend to be big data, and given that a significant portion is stored as free (unstructured) text, we decided to examine relevant work on automated free-text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used for the removal of protected health information (PHI) from free text. Our survey work produced several informative findings. Firstly, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and also other systems, namely rule-based algorithms. Secondly, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Thirdly, overfitting may be an issue when customized de-identification d...
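To make the RNN side of the survey concrete, below is a minimal BiLSTM token tagger that frames PHI de-identification as sequence labeling over BIO-style tags; the surveyed systems typically add a CRF decoding layer on top, which is omitted here for brevity. This is an illustrative PyTorch sketch with made-up dimensions and data, not any system evaluated in the survey.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token tagger for PHI de-identification framed as
    sequence labeling (BIO tags over NAME, DATE, ID, ...). Real systems
    in the survey typically add a CRF layer over these per-token scores."""
    def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, tagset_size)

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)                         # per-token tag scores

# toy training step on hypothetical tokenized notes and BIO tag ids
model = BiLSTMTagger(vocab_size=5000, tagset_size=9)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 5000, (4, 20))           # 4 notes, 20 tokens each
tags = torch.randint(0, 9, (4, 20))
logits = model(tokens)
loss = loss_fn(logits.view(-1, 9), tags.view(-1))
loss.backward()
opt.step()
```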