Ernst Gran - Academia.edu
Papers by Ernst Gran
2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), 2016
Reconfiguration of high-performance lossless interconnection networks is a cumbersome and time-consuming task. For that reason, reconfiguration in large networks is typically limited to situations where it is absolutely necessary, for instance when severe faults occur. At the same time, due to the shared and dynamic nature of modern cloud infrastructures, performance-driven reconfigurations are necessary to ensure efficient utilization of resources. In this work we present a scheme that allows for fast reconfigurations by limiting the task to sub-parts of the network that can benefit from a local reconfiguration. Moreover, our method is able to use different routing algorithms for different sub-parts within the same subnet. We also present a Fat-Tree routing algorithm that reconfigures a network given a user-provided node ordering. Hardware experiments and large-scale simulation results show that we are able to reduce reconfiguration times by 50% to as much as 98.7% for very large topologies, while improving performance.
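The local-reconfiguration idea lends itself to a simple illustration. The following is a minimal, hypothetical sketch, not the paper's implementation: it assumes forwarding state is a per-switch table mapping destinations to output links, recomputes routes only for switches whose tables reference a changed link, and lets each sub-part use its own routing algorithm, as the abstract describes. All names and data structures are invented for illustration.

```python
# Hypothetical sketch of local reconfiguration: recompute forwarding state
# only for the switches affected by the changed links. Not the paper's method.

def affected_switches(forwarding_tables, changed_links):
    """Switches whose tables forward traffic over any changed link."""
    return {
        switch
        for switch, table in forwarding_tables.items()
        if any(out_link in changed_links for out_link in table.values())
    }

def local_reconfigure(forwarding_tables, changed_links, routing_algorithms,
                      default_algorithm):
    # Each sub-part may use its own routing algorithm (per the abstract);
    # here the choice is modelled as a per-switch lookup for simplicity.
    for switch in affected_switches(forwarding_tables, changed_links):
        algorithm = routing_algorithms.get(switch, default_algorithm)
        forwarding_tables[switch] = algorithm(switch)
    return forwarding_tables
```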
2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2019
Cloud computing has revolutionised the development and deployment of applications by running them cost-effectively in remote data centres. With the increasing need for mobility and micro-services, particularly with the emerging 5G mobile broadband networks, there is also a strong demand for mobile edge computing (MEC), which enables applications to run in small cloud systems in close proximity to the user in order to minimise latencies. Both cloud computing and MEC have their own advantages and disadvantages, and combining the two paradigms in a unified multi-cloud platform has the potential to obtain the best of both worlds. However, a comprehensive study is needed to evaluate the performance gains and the overheads that this combination imposes on real-world cloud applications. In this paper, we introduce a baseline performance evaluation in order to identify the fallacies and pitfalls of combining multiple cloud systems and MEC into a unified MEC-multi-cloud platform. For this purpose, we analyse the basic, application-independent performance metrics of average round-trip time (RTT) and average application payload throughput in a setup consisting of two private cloud systems and one public cloud. This baseline performance analysis confirms the feasibility of MEC-multi-cloud and provides guidelines for designing an autonomic resource provisioning solution, in the form of a proposed extension to our existing MELODIC middleware platform for multi-cloud applications.
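As a rough illustration of the two baseline metrics the paper measures, the sketch below probes average RTT and average payload throughput over plain TCP. It assumes a simple echo service listening at the given host and port; the hosts, ports, payload sizes, and the probe itself are placeholders, not the paper's measurement harness.

```python
# Hedged sketch of application-independent baseline probes: average RTT and
# average payload throughput. Assumes an echo service at (host, port).

import socket
import time

def avg_rtt(host, port, probes=20):
    """Average round-trip time of a 1-byte echo probe, in seconds."""
    rtts = []
    with socket.create_connection((host, port), timeout=5) as s:
        for _ in range(probes):
            start = time.perf_counter()
            s.sendall(b"x")
            s.recv(1)                        # assumes the service echoes back
            rtts.append(time.perf_counter() - start)
    return sum(rtts) / len(rtts)

def avg_throughput(host, port, payload=b"x" * (64 * 1024), rounds=10):
    """Average application payload throughput, in bytes per second."""
    # 64 KiB payloads are assumed small enough to avoid send/receive deadlock
    # against a simple single-socket echo service.
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5) as s:
        for _ in range(rounds):
            s.sendall(payload)
            received = 0
            while received < len(payload):   # read back the echoed payload
                chunk = s.recv(65536)
                if not chunk:
                    break
                received += len(chunk)
    return rounds * len(payload) / (time.perf_counter() - start)
```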
2011 International Conference on Parallel Processing, 2011
Existing congestion control mechanisms in interconnects can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion; the other is to isolate the congested traffic in specially designated resources. The two approaches have different, non-overlapping weaknesses. In this paper we present in detail a method that combines injection throttling and congested-flow isolation. Through simulation studies we first demonstrate the respective flaws of injection throttling and of flow isolation. Thereafter we show that our combined method extracts the best of both approaches: it reacts quickly to congestion, it is scalable, and it has good fairness properties with respect to the congested flows.
IEEE Transactions on Parallel and Distributed Systems, 2015
Interconnection networks are key components in high-performance computing (HPC) systems, and their performance has a strong influence on that of the overall system. However, at high load, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and with it that of the entire system. Congestion control (CC) is therefore crucial to ensure efficient utilization of the interconnection network during congestion situations. Moreover, as one major trend is to reduce the effective wiring in interconnection networks to cut cost and power consumption, the network will operate very close to its capacity, making congestion control essential. Existing CC techniques can be divided into two general approaches: one is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources. Both approaches have different, non-overlapping weaknesses: injection throttling reacts slowly to congestion, while isolating traffic in special resources may lead the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. The strategy is suitable for current commercial switch architectures, where it could be implemented without significant added complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique improves performance by up to 55 percent over some of the most successful congestion control techniques.
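The complementarity the abstract describes can be made concrete with a small sketch. The following hypothetical per-switch policy isolates a congested flow while special queues remain, and falls back to source throttling once they run out; the thresholds, queue counts, and names are assumptions for illustration and do not reproduce EcoCC itself.

```python
# Toy model of combined congestion control: isolate congested flows while
# special queues are available, then throttle sources. Not EcoCC itself.

QUEUE_THRESHOLD = 0.8     # assumed occupancy fraction that signals congestion
NUM_SPECIAL_QUEUES = 2    # assumed number of set-aside isolation queues

class CombinedCC:
    """Per-switch decision logic combining the two CC approaches."""

    def __init__(self):
        self.isolated = {}   # flow id -> isolation queue index

    def on_packet(self, flow_id, queue_occupancy, notify_source):
        if queue_occupancy < QUEUE_THRESHOLD:
            return "forward"              # no congestion detected
        if flow_id in self.isolated:
            return "isolated"             # flow already in a special queue
        if len(self.isolated) < NUM_SPECIAL_QUEUES:
            # Isolate the congested flow so it cannot head-of-line block others
            self.isolated[flow_id] = len(self.isolated)
            return "isolated"
        # Isolation resources exhausted: fall back to throttling the source,
        # covering the weakness of a pure isolation scheme
        notify_source(flow_id)
        return "throttled"
```

In this toy model, isolation provides the fast, local reaction, while throttling caps resource usage once the designated queues are exhausted, which is the trade-off the paper exploits.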
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011
Anomaly detection is the process of identifying unexpected events or abnormalities in data, and it has been applied in many different areas such as system monitoring, fraud detection, healthcare, intrusion detection, etc. Providing real-time, lightweight, and proactive anomaly detection for time series with neither human intervention nor domain knowledge is highly valuable, since it reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous event occurs. To our knowledge, RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all the abovementioned features. To achieve real-time and lightweight detection, RePAD utilizes Long Short-Term Memory (LSTM) to detect whether or not each upcoming data point is anomalous based on short-term historical data points. However, it is unclear how different amounts of historical data points affect the performance of RePAD. Therefore, in this paper, we investigate the impact of ...
Exascale computing systems are being built with thousands of nodes. The high number of components in these systems significantly increases the probability of failure. A key component for them is the interconnection network; if failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect (KNS) family, which provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. In order to tolerate network failures, the methodology uses a simple mechanism: for any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) with the aim of circumventing faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are ten faults in a 3-D network with 1,000 nodes using only one intermediate node, and more than 99.98% if two intermediate nodes are used. Furthermore, the methodology offers graceful performance degradation; as an example, performance degrades by only 1% for a 2-D network with 1,024 nodes and 1% faulty links.
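The intermediate-node mechanism can be illustrated with a toy example. The sketch below uses dimension-order routing (DOR) on a small 2-D mesh as a stand-in for the paper's KNS routing: if the deterministic route crosses a faulty link, it searches for an intermediate node whose two DOR segments both avoid the faults. The mesh model and all names are assumptions for illustration, not the paper's methodology.

```python
# Toy illustration of fault circumvention via one intermediate node, using
# dimension-order routing (DOR) on a 2-D mesh as a stand-in for KNS routing.

def dor_path(src, dst):
    """Dimension-order route on a 2-D mesh: correct X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def links(path):
    """The set of undirected links a path traverses."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def route(src, dst, faulty_links, nodes):
    direct = dor_path(src, dst)
    if not links(direct) & faulty_links:
        return direct
    for mid in nodes:                     # try one intermediate node
        if mid in (src, dst):
            continue
        a, b = dor_path(src, mid), dor_path(mid, dst)
        if not (links(a) | links(b)) & faulty_links:
            return a + b[1:]              # detour around the faulty link
    return None

# Example: a 4x4 mesh with one faulty link on the direct route
nodes = [(i, j) for i in range(4) for j in range(4)]
faulty = {frozenset({(1, 0), (2, 0)})}
print(route((0, 0), (3, 0), faulty, nodes))
```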
Keywords: Testbed, Multi-homing, Routing, Transport, Applications. Abstract: Over the last decade, the Internet has grown at a tremendous speed in both size and complexity. Nowadays, a large number of important services - for instance e-commerce, healthcare and many others - depend on the availability of the underlying network. Clearly, service interruptions due to network problems may have a severe impact. On the long way towards the Future Internet, the complexity will grow even further. Therefore, new ideas and concepts must be evaluated thoroughly, and particularly in realistic, real-world Internet scenarios, before they can be deployed for production networks. For this purpose, various testbeds - for instance PLANETLAB, GPENI or G-LAB - have been established and are intensively used for research. However, all of these testbeds lack support for so-called multi-homing. Multi-homing denotes the connection of a site to multiple Internet service providers, in order to achieve redundancy....
Transportation Research Record: Journal of the Transportation Research Board
Over the past decade, many approaches have been introduced for traffic speed prediction. However, providing fine-grained, accurate, time-efficient, and adaptive traffic speed prediction for a growing transportation network, where the size of the network keeps increasing and new traffic detectors are constantly deployed, has not been well studied. To address this issue, this paper presents DistTune, based on long short-term memory (LSTM) and the Nelder-Mead method. When encountering an unprocessed detector, DistTune decides if it should customize an LSTM model for this detector by comparing the detector with other processed detectors in the normalized traffic speed patterns they have observed. If a similarity is found, DistTune directly shares an existing LSTM model with this detector to achieve time-efficient processing. Otherwise, DistTune customizes an LSTM model for the detector to achieve fine-grained prediction. To make DistTune even more time-efficient, DistTune performs on a clus...
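A hedged sketch of the share-or-customize decision follows. The similarity test (mean absolute distance between min-max normalized speed patterns of equal length) and its threshold are assumptions, as is the training stub; DistTune's actual criterion and its Nelder-Mead tuning step may differ.

```python
# Illustrative share-or-customize decision in the spirit of the abstract.
# The similarity measure, threshold, and training stub are assumptions.

import numpy as np

SIMILARITY_THRESHOLD = 0.1   # assumed cut-off on mean absolute distance

def normalize(pattern):
    """Min-max normalize a traffic speed pattern (assumed fixed length)."""
    p = np.asarray(pattern, dtype=float)
    return (p - p.min()) / (p.max() - p.min() + 1e-9)

def train_custom_lstm(pattern):
    # Placeholder for per-detector LSTM training; the Nelder-Mead
    # hyperparameter tuning mentioned in the abstract would happen here.
    return {"trained_on": pattern.tolist()}

def assign_model(new_pattern, processed):
    """processed: list of (normalized_pattern, model) for handled detectors."""
    q = normalize(new_pattern)
    for pattern, model in processed:
        if np.mean(np.abs(q - pattern)) < SIMILARITY_THRESHOLD:
            return model, "shared"       # similar detector: reuse its model
    model = train_custom_lstm(q)         # no match: customize a new model
    processed.append((q, model))
    return model, "customized"
```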
2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)
Anomaly detection is an active research topic in many different fields such as intrusion detection, network monitoring, system health monitoring, IoT healthcare, etc. However, many existing anomaly detection approaches require either human intervention or domain knowledge, and may suffer from high computational complexity, consequently hindering their applicability in real-world scenarios. Therefore, a lightweight and ready-to-go approach that is able to detect anomalies in real time is highly sought after. Such an approach could be easily and immediately applied to perform time series anomaly detection on any commodity machine, and could provide timely anomaly alerts that enable appropriate countermeasures to be undertaken as early as possible. With these goals in mind, this paper introduces ReRe, a Real-time Ready-to-go proactive Anomaly Detection algorithm for streaming time series. ReRe employs two lightweight Long Short-Term Memory (LSTM) models to predict and jointly determine whether or not an upcoming data point is anomalous, based on short-term historical data points and two long-term self-adaptive thresholds. Experiments based on real-world time-series datasets demonstrate the good performance of ReRe in real-time anomaly detection without requiring human intervention or domain knowledge.
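The joint decision the abstract describes might look like the sketch below: each of two lightweight predictors flags a point when its prediction error exceeds a self-adaptive threshold, and an anomaly is reported only when both agree. The relative-error measure and the mean-plus-three-standard-deviations threshold rule are assumptions for illustration, not ReRe's published definitions.

```python
# Sketch of a two-detector joint decision; thresholds are assumptions.

from collections import deque

class ErrorDetector:
    """Flags a point when its prediction error exceeds an adaptive bound."""

    def __init__(self, window=30):
        self.errors = deque(maxlen=window)   # recent prediction errors

    def flag(self, predicted, actual):
        err = abs(predicted - actual) / (abs(actual) + 1e-9)
        anomalous = False
        if len(self.errors) >= 3:
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            # Self-adaptive threshold: tracks the recent error distribution
            anomalous = err > mean + 3 * var ** 0.5
        self.errors.append(err)
        return anomalous

def rere_step(d1, d2, pred1, pred2, actual):
    # Evaluate both detectors before combining, so each keeps its history
    f1 = d1.flag(pred1, actual)
    f2 = d2.flag(pred2, actual)
    return f1 and f2      # report an anomaly only when both detectors agree
```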
Advanced Information Networking and Applications
During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require an understanding of the data patterns, and often need a long off-line period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable, since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous damage, failure, or other harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historical data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern changes in the time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time, without human intervention or domain knowledge.
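A minimal sketch of the dynamically adjusted threshold follows, assuming the error measure is an average absolute relative error (AARE) over a short window and the threshold is the mean plus three standard deviations of recent AARE values; these formulas are illustrative assumptions, and the paper should be consulted for RePAD's exact definitions.

```python
# Sketch of a dynamically adjusted detection threshold over recent errors.
# The AARE measure and the mean + 3*std rule are assumptions for illustration.

import statistics

def aare(predicted, observed):
    """Average absolute relative error over paired data points."""
    return sum(abs(p - o) / (abs(o) + 1e-9)
               for p, o in zip(predicted, observed)) / len(observed)

def is_anomalous(aare_history, current_aare):
    """Flag the current point when its error exceeds the adaptive threshold."""
    if len(aare_history) < 2:
        return False     # not enough history to form a threshold yet
    threshold = (statistics.mean(aare_history)
                 + 3 * statistics.pstdev(aare_history))
    return current_aare > threshold
```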
2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)
Euro-Par 2020: Parallel Processing
2016 IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI)
IEEE Transactions on Parallel and Distributed Systems