Umit Demirbaga | University of Cambridge
Papers by Umit Demirbaga
IEEE Internet of Things Journal
IEEE Transactions on Network and Service Management, 2022
IEEE Transactions on Computers, 2021
Big data processing systems, such as Hadoop and Spark, usually work in large-scale, highly concurrent, multi-tenant environments that can easily cause hardware and software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect performance degradation in big data processing systems, perform root-cause analysis, and even overcome the issues causing such degradation. However, these solutions focus on specific problems such as stragglers and inefficient resource utilization. There is a lack of a generic and extensible framework to support the real-time diagnosis of big data systems. In this paper, we propose, develop and validate AutoDiagn, a generic and flexible framework that provides holistic monitoring of a big data system while detecting performance degradation and enabling root-cause analysis. We present an implementation and evaluation of AutoDiagn that interacts with a Hadoop cluster deployed on a public cloud and tested with real-world benchmark applications. Experimental results show that AutoDiagn offers high-accuracy root-cause analysis with a small resource footprint, high throughput and low latency.
Twitter, a micro-blogging service, generates a large amount of data every minute because it gives people the chance to express their thoughts and feelings quickly and clearly about any topic. Obtaining the desired information from this big data requires high-performance parallel computing tools supported by machine learning algorithms. Emerging big data processing frameworks (e.g. Hadoop) can handle such data effectively. In this paper, we first introduce a novel approach to automatically classify Twitter data obtained from the British Geological Survey (BGS), collected using specific keywords such as landslide, landslides, mudslide, landfall, landslip and soil sliding, based on tweet post date and the countries where tweets are posted, using a MapReduce algorithm. We then propose a model to determine whether tweets are landslide-related using a Naïve Bayes machine learning algorithm with an n-gram language model on Mahout. This paper also describes ...
2019 38th Symposium on Reliable Distributed Systems (SRDS)
Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multiple tenants. Thus, many failures and performance reductions happen only at run-time and are very difficult to capture. Moreover, some issues may be triggered only when certain components are executed. To analyze the root cause of these types of issues, we have to capture the dependencies of each component in real time. In this paper, we propose SmartMonit, a real-time big data monitoring system, which collects infrastructure information such as the process status of each task. At the same time, we develop a real-time stream processing framework to analyze the coordination among the tasks and the infrastructure. This coordination information is essential for troubleshooting the causes of failures and performance reduction, especially those propagated from other causes.
Neural Computing and Applications
Twitter's popularity produces a massive amount of data, which is one of the reasons underlying big data problems. One of those problems is the classification of tweets, whose sophisticated and complex language makes current tools insufficient. We present our framework HTwitt, built on top of the Hadoop ecosystem, which consists of a MapReduce algorithm and a set of machine learning techniques embedded within a big data analytics platform to efficiently address the following problems: (1) traditional data processing techniques are inadequate to handle big data; (2) data preprocessing needs substantial manual effort; (3) domain knowledge is required before the classification; (4) semantic explanation is ignored. In this work, these challenges are overcome by using different algorithms combined with a Naïve Bayes classifier to ensure reliability and highly precise recommendations in virtualization and cloud environments. These features make HTwitt different fro...
Journal of Systems Architecture
The osmotic computing paradigm sets out the principles and algorithms for simplifying the deployment of Internet of Things (IoT) applications in integrated edge-cloud environments. Various existing simulation frameworks can be used to support integration of cloud and edge computing environments. However, none of these can directly support an osmotic computing environment due to the complexity of IoT applications and the heterogeneity of integrated edge-cloud environments. Osmotic computing suggests the migration of workload to/from a cloud data center to edge devices, based on performance and security trigger events. We propose IoTSim-Osmosis, a simulation framework to support the testing and validation of osmotic computing applications. In particular, our detailed analysis of related work shows that IoTSim-Osmosis is the first simulation framework to enable unified modeling and simulation of complex IoT applications over heterogeneous edge-cloud environments. IoTSim-Osmosis is demonstrated using an electricity management and billing application case study, for benchmarking various run-time QoS parameters, such as IoT battery use, execution time, network transmission time and consumed energy.
Neural Computing and Applications, 2021
Twitter's popularity produces a massive amount of data, which is one of the reasons underlying big data problems. One of those problems is the classification of tweets, whose sophisticated and complex language makes current tools insufficient. We present our framework HTwitt, built on top of the Hadoop ecosystem, which consists of a MapReduce algorithm and a set of machine learning techniques embedded within a big data analytics platform to efficiently address the following problems: (1) traditional data processing techniques are inadequate to handle big data; (2) data preprocessing needs substantial manual effort; (3) domain knowledge is required before the classification; (4) semantic explanation is ignored. In this work, these challenges are overcome by using different algorithms combined with a Naïve Bayes classifier to ensure reliability and highly precise recommendations in virtualization and cloud environments. These features make HTwitt different from others in terms of having an effective and practical design for text classification in big data analytics. The main contribution of the paper is a framework for building landslide early warning systems by pinpointing useful tweets and visualizing them along with the processed information. We present experimental results that quantify the levels of overfitting in the training stage of the model using real-world datasets of different sizes in the machine learning phases. Our results demonstrate that the proposed system provides high-quality results with a score of nearly 95% and meets the requirements of a Hadoop-based classification system.
IoTSim-Osmosis: A Framework for Modelling & Simulating IoT Applications Over an Edge-Cloud Continuum, 2020
The osmotic computing paradigm sets out the principles and algorithms for simplifying the deployment of Internet of Things (IoT) applications in integrated edge-cloud environments. Various existing simulation frameworks can be used to support integration of cloud and edge computing environments. However, none of these can directly support an osmotic computing environment due to the complexity of IoT applications and the heterogeneity of integrated edge-cloud environments. Osmotic computing suggests the migration of workload to/from a cloud data center to edge devices, based on performance and security trigger events. We propose IoTSim-Osmosis, a simulation framework to support the testing and validation of osmotic computing applications. In particular, our detailed analysis of related work shows that IoTSim-Osmosis is the first simulation framework to enable unified modeling and simulation of complex IoT applications over heterogeneous edge-cloud environments. IoTSim-Osmosis is demonstrated using an electricity management and billing application case study, for benchmarking various run-time QoS parameters, such as IoT battery use, execution time, network transmission time and consumed energy.
AutoDiagn: An Automated Real-time Diagnosis Framework for Big Data Systems, 2021
Big data processing systems, such as Hadoop and Spark, usually work in large-scale, highly concurrent, multi-tenant environments that can easily cause hardware and software malfunctions or failures, thereby leading to performance degradation. Several systems and methods exist to detect performance degradation in big data processing systems, perform root-cause analysis, and even overcome the issues causing such degradation. However, these solutions focus on specific problems such as stragglers and inefficient resource utilization. There is a lack of a generic and extensible framework to support the real-time diagnosis of big data systems. In this paper, we propose, develop and validate AutoDiagn, a generic and flexible framework that provides holistic monitoring of a big data system while detecting performance degradation and enabling root-cause analysis. We present an implementation and evaluation of AutoDiagn that interacts with a Hadoop cluster deployed on a public cloud and tested with real-world benchmark applications. Experimental results show that AutoDiagn offers high-accuracy root-cause analysis with a small resource footprint, high throughput and low latency.
Computers & Electrical Engineering, 2019
Cyber-physical systems (CPS) integrate cyber-infrastructure comprising computers and networks with physical processes. The cyber components monitor, control, and coordinate the physical processes, typically via actuators. As CPS are characterized by reliability, availability, and performance, they are expected to have a tremendous impact not only on industrial systems but also on our daily lives. We have started to witness the emergence of cloud-based CPS. However, cloud systems are prone to stochastic conditions that may lead to quality of service degradation. In this paper, we propose M2CPA, a novel framework for multi-virtualization and multi-cloud monitoring in cloud-based cyber-physical systems. M2CPA monitors the performance of application components running inside multiple virtualization platforms deployed on multiple clouds. M2CPA is validated through extensive experimental analysis using a real testbed comprising multiple public clouds and multi-virtualization technologies.
Conference Presentations by Umit Demirbaga
SmartMonit: Real-time Big Data Monitoring System, 2019
Modern big data processing systems are becoming very complex in terms of large scale, high concurrency and multiple tenants. Thus, many failures and performance reductions happen only at run-time and are very difficult to capture. Moreover, some issues may be triggered only when certain components are executed. To analyze the root cause of these types of issues, we have to capture the dependencies of each component in real time. In this paper, we propose SmartMonit, a real-time big data monitoring system, which collects infrastructure information such as the process status of each task. At the same time, we develop a real-time stream processing framework to analyze the coordination among the tasks and the infrastructure. This coordination information is essential for troubleshooting the causes of failures and performance reduction, especially those propagated from other causes.
Twitter, a micro-blogging service, generates a large amount of data every minute because it gives people the chance to express their thoughts and feelings quickly and clearly about any topic. Obtaining the desired information from this big data requires high-performance parallel computing tools supported by machine learning algorithms. Emerging big data processing frameworks (e.g. Hadoop) can handle such data effectively. In this paper, we first introduce a novel approach to automatically classify Twitter data obtained from the British Geological Survey (BGS), collected using specific keywords such as landslide, landslides, mudslide, landfall, landslip and soil sliding, based on tweet post date and the countries where tweets are posted, using a MapReduce algorithm. We then propose a model to determine whether tweets are landslide-related using a Naïve Bayes machine learning algorithm with an n-gram language model on Mahout. This paper also describes an algorithm for the pre-processing steps that make the semi-structured Twitter text data ready for classification. The proposed methods allow the BGS and other interested parties to see the names and number of the countries from which tweets are sent, the number of tweets sent from each country, and the dates and time intervals of the tweets, and to classify whether the tweets are related to landslides.