Ernst Gran - Academia.edu
Papers by Ernst Gran
2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), 2016
Reconfiguration of high-performance lossless interconnection networks is a cumbersome and time-consuming task. For that reason, reconfiguration in large networks is typically limited to situations where it is absolutely necessary, for instance when severe faults occur. At the same time, due to the shared and dynamic nature of modern cloud infrastructures, performance-driven reconfigurations are necessary to ensure efficient utilization of resources. In this work we present a scheme that allows for fast reconfigurations by limiting the task to sub-parts of the network that can benefit from a local reconfiguration. Moreover, our method is able to use different routing algorithms for different sub-parts within the same subnet. We also present a Fat-Tree routing algorithm that reconfigures a network given a user-provided node ordering. Hardware experiments and large-scale simulation results show that we are able to reduce reconfiguration times by 50% to as much as 98.7% for very large topologies, while improving performance.
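The local-reconfiguration idea lends itself to a simple illustration. The following is a minimal, hypothetical sketch, not the paper's implementation: it assumes forwarding state is a per-switch table mapping destinations to output links, recomputes routes only for switches whose tables reference a changed link, and lets each sub-part use its own routing algorithm, as the abstract describes. All names and data structures are invented for illustration.

```python
# Hypothetical sketch of local reconfiguration: recompute forwarding state
# only for the switches affected by the changed links. Not the paper's method.

def affected_switches(forwarding_tables, changed_links):
    """Switches whose tables forward traffic over any changed link."""
    return {
        switch
        for switch, table in forwarding_tables.items()
        if any(out_link in changed_links for out_link in table.values())
    }

def local_reconfigure(forwarding_tables, changed_links, routing_algorithms,
                      default_algorithm):
    # Each sub-part may use its own routing algorithm (per the abstract);
    # here the choice is modelled as a per-switch lookup for simplicity.
    for switch in affected_switches(forwarding_tables, changed_links):
        algorithm = routing_algorithms.get(switch, default_algorithm)
        forwarding_tables[switch] = algorithm(switch)
    return forwarding_tables
```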
2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2019
Cloud computing has revolutionised the development and deployment of applications by running them cost-effectively in remote data centres. With the increasing need for mobility and micro-services, particularly with the emerging 5G mobile broadband networks, there is also a strong demand for mobile edge computing (MEC), which enables applications to run in small cloud systems in close proximity to the user in order to minimise latencies. Both cloud computing and MEC have their own advantages and disadvantages, and combining the two paradigms in a unified multi-cloud platform has the potential to obtain the best of both worlds. However, a comprehensive study is needed to evaluate the performance gains and the overheads that this combination imposes on real-world cloud applications. In this paper, we introduce a baseline performance evaluation in order to identify the fallacies and pitfalls of combining multiple cloud systems and MEC into a unified MEC-multi-cloud platform. For this purpose, we analyse the basic, application-independent performance metrics of average round-trip time (RTT) and average application payload throughput in a setup consisting of two private cloud systems and one public cloud. This baseline performance analysis confirms the feasibility of MEC-multi-cloud and provides guidelines for designing an autonomic resource provisioning solution, in the form of a proposed extension to our existing MELODIC middleware platform for multi-cloud applications.
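As a rough illustration of the two baseline metrics the paper measures, the sketch below probes average RTT and average payload throughput over plain TCP. It assumes a simple echo service listening at the given host and port; the hosts, ports, payload sizes, and the probe itself are placeholders, not the paper's measurement harness.

```python
# Hedged sketch of application-independent baseline probes: average RTT and
# average payload throughput. Assumes an echo service at (host, port).

import socket
import time

def avg_rtt(host, port, probes=20):
    """Average round-trip time of a 1-byte echo probe, in seconds."""
    rtts = []
    with socket.create_connection((host, port), timeout=5) as s:
        for _ in range(probes):
            start = time.perf_counter()
            s.sendall(b"x")
            s.recv(1)                        # assumes the service echoes back
            rtts.append(time.perf_counter() - start)
    return sum(rtts) / len(rtts)

def avg_throughput(host, port, payload=b"x" * (64 * 1024), rounds=10):
    """Average application payload throughput, in bytes per second."""
    # 64 KiB payloads are assumed small enough to avoid send/receive deadlock
    # against a simple single-socket echo service.
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5) as s:
        for _ in range(rounds):
            s.sendall(payload)
            received = 0
            while received < len(payload):   # read back the echoed payload
                chunk = s.recv(65536)
                if not chunk:
                    break
                received += len(chunk)
    return rounds * len(payload) / (time.perf_counter() - start)
```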
2011 International Conference on Parallel Processing, 2011
Existing congestion control mechanisms in interconnects can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion; the other is to isolate the congested traffic in specially designated resources. The two approaches have different, non-overlapping weaknesses. In this paper we present in detail a method that combines injection throttling and congested-flow isolation. Through simulation studies we first demonstrate the respective flaws of injection throttling and of flow isolation. Thereafter we show that our combined method extracts the best of both approaches: it reacts quickly to congestion, it is scalable, and it has good fairness properties with respect to the congested flows.
IEEE Transactions on Parallel and Distributed Systems, 2015
Interconnection networks are key components in high-performance computing (HPC) systems, and their performance has a strong influence on that of the overall system. However, at high load, congestion and its negative effects (e.g., head-of-line blocking) threaten the performance of the network, and with it that of the entire system. Congestion control (CC) is therefore crucial to ensure efficient utilization of the interconnection network during congestion situations. Moreover, as one major trend is to reduce the effective wiring in interconnection networks to cut cost and power consumption, the network will operate very close to its capacity, making congestion control essential. Existing CC techniques can be divided into two general approaches: one is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources. Both approaches have different, non-overlapping weaknesses: injection throttling reacts slowly to congestion, while isolating traffic in special resources may lead the system to run out of those resources. In this paper we propose EcoCC, a new Efficient and Cost-Effective CC technique that combines injection throttling and congested-flow isolation to minimize their respective drawbacks and maximize overall system performance. The strategy is suitable for current commercial switch architectures, where it could be implemented without significant added complexity. Experimental results, using simulations under synthetic and real trace-based traffic patterns, show that this technique improves performance by up to 55 percent over some of the most successful congestion control techniques.
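The complementarity the abstract describes can be made concrete with a small sketch. The following hypothetical per-switch policy isolates a congested flow while special queues remain, and falls back to source throttling once they run out; the thresholds, queue counts, and names are assumptions for illustration and do not reproduce EcoCC itself.

```python
# Toy model of combined congestion control: isolate congested flows while
# special queues are available, then throttle sources. Not EcoCC itself.

QUEUE_THRESHOLD = 0.8     # assumed occupancy fraction that signals congestion
NUM_SPECIAL_QUEUES = 2    # assumed number of set-aside isolation queues

class CombinedCC:
    """Per-switch decision logic combining the two CC approaches."""

    def __init__(self):
        self.isolated = {}   # flow id -> isolation queue index

    def on_packet(self, flow_id, queue_occupancy, notify_source):
        if queue_occupancy < QUEUE_THRESHOLD:
            return "forward"              # no congestion detected
        if flow_id in self.isolated:
            return "isolated"             # flow already in a special queue
        if len(self.isolated) < NUM_SPECIAL_QUEUES:
            # Isolate the congested flow so it cannot head-of-line block others
            self.isolated[flow_id] = len(self.isolated)
            return "isolated"
        # Isolation resources exhausted: fall back to throttling the source,
        # covering the weakness of a pure isolation scheme
        notify_source(flow_id)
        return "throttled"
```

In this toy model, isolation provides the fast, local reaction, while throttling caps resource usage once the designated queues are exhausted, which is the trade-off the paper exploits.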
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010
2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012
2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011
Anomaly detection is the process of identifying unexpected events or abnormalities in data, and it has been applied in many different areas such as system monitoring, fraud detection, healthcare, intrusion detection, etc. Providing real-time, lightweight, and proactive anomaly detection for time series with neither human intervention nor domain knowledge is highly valuable, since it reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous event occurs. To our knowledge, RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all the abovementioned features. To achieve real-time and lightweight detection, RePAD utilizes Long Short-Term Memory (LSTM) to detect whether or not each upcoming data point is anomalous based on short-term historical data points. However, it is unclear how different amounts of historical data points affect the performance of RePAD. Therefore, in this paper, we investigate the impact of ...
Exascale computing systems are being built with thousands of nodes. The high number of components in these systems significantly increases the probability of failure. A key component for them is the interconnection network; if failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect (KNS) family, which provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. In order to tolerate network failures, the methodology uses a simple mechanism: for any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) with the aim of circumventing faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are ten faults in a 3-D network with 1,000 nodes using only one intermediate node, and more than 99.98% if two intermediate nodes are used. Furthermore, the methodology offers graceful performance degradation; as an example, performance degrades by only 1% for a 2-D network with 1,024 nodes and 1% faulty links.
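The intermediate-node mechanism can be illustrated with a toy example. The sketch below uses dimension-order routing (DOR) on a small 2-D mesh as a stand-in for the paper's KNS routing: if the deterministic route crosses a faulty link, it searches for an intermediate node whose two DOR segments both avoid the faults. The mesh model and all names are assumptions for illustration, not the paper's methodology.

```python
# Toy illustration of fault circumvention via one intermediate node, using
# dimension-order routing (DOR) on a 2-D mesh as a stand-in for KNS routing.

def dor_path(src, dst):
    """Dimension-order route on a 2-D mesh: correct X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def links(path):
    """The set of undirected links a path traverses."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def route(src, dst, faulty_links, nodes):
    direct = dor_path(src, dst)
    if not links(direct) & faulty_links:
        return direct
    for mid in nodes:                     # try one intermediate node
        if mid in (src, dst):
            continue
        a, b = dor_path(src, mid), dor_path(mid, dst)
        if not (links(a) | links(b)) & faulty_links:
            return a + b[1:]              # detour around the faulty link
    return None

# Example: a 4x4 mesh with one faulty link on the direct route
nodes = [(i, j) for i in range(4) for j in range(4)]
faulty = {frozenset({(1, 0), (2, 0)})}
print(route((0, 0), (3, 0), faulty, nodes))
```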
Keywords: Testbed, Multi-homing, Routing, Transport, Applications. Abstract: Over the last decade, the Internet has grown at a tremendous speed in both size and complexity. Nowadays, a large number of important services - for instance e-commerce, healthcare and many others - depend on the availability of the underlying network. Clearly, service interruptions due to network problems may have a severe impact. On the long way towards the Future Internet, the complexity will grow even further. Therefore, new ideas and concepts must be evaluated thoroughly, and particularly in realistic, real-world Internet scenarios, before they can be deployed for production networks. For this purpose, various testbeds - for instance PLANETLAB, GPENI or G-LAB - have been established and are intensively used for research. However, all of these testbeds lack support for so-called multi-homing. Multi-homing denotes the connection of a site to multiple Internet service providers, in order to achieve redundancy....
Transportation Research Record: Journal of the Transportation Research Board
Over the past decade, many approaches have been introduced for traffic speed prediction. However, providing fine-grained, accurate, time-efficient, and adaptive traffic speed prediction for a growing transportation network, where the size of the network keeps increasing and new traffic detectors are constantly deployed, has not been well studied. To address this issue, this paper presents DistTune, based on long short-term memory (LSTM) and the Nelder-Mead method. When encountering an unprocessed detector, DistTune decides if it should customize an LSTM model for this detector by comparing the detector with other processed detectors in the normalized traffic speed patterns they have observed. If a similarity is found, DistTune directly shares an existing LSTM model with this detector to achieve time-efficient processing. Otherwise, DistTune customizes an LSTM model for the detector to achieve fine-grained prediction. To make DistTune even more time-efficient, DistTune performs on a clus...
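A hedged sketch of the share-or-customize decision follows. The similarity test (mean absolute distance between min-max normalized speed patterns of equal length) and its threshold are assumptions, as is the training stub; DistTune's actual criterion and its Nelder-Mead tuning step may differ.

```python
# Illustrative share-or-customize decision in the spirit of the abstract.
# The similarity measure, threshold, and training stub are assumptions.

import numpy as np

SIMILARITY_THRESHOLD = 0.1   # assumed cut-off on mean absolute distance

def normalize(pattern):
    """Min-max normalize a traffic speed pattern (assumed fixed length)."""
    p = np.asarray(pattern, dtype=float)
    return (p - p.min()) / (p.max() - p.min() + 1e-9)

def train_custom_lstm(pattern):
    # Placeholder for per-detector LSTM training; the Nelder-Mead
    # hyperparameter tuning mentioned in the abstract would happen here.
    return {"trained_on": pattern.tolist()}

def assign_model(new_pattern, processed):
    """processed: list of (normalized_pattern, model) for handled detectors."""
    q = normalize(new_pattern)
    for pattern, model in processed:
        if np.mean(np.abs(q - pattern)) < SIMILARITY_THRESHOLD:
            return model, "shared"       # similar detector: reuse its model
    model = train_custom_lstm(q)         # no match: customize a new model
    processed.append((q, model))
    return model, "customized"
```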
2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)
Anomaly detection is an active research topic in many different fields such as intrusion detection, network monitoring, system health monitoring, IoT healthcare, etc. However, many existing anomaly detection approaches require either human intervention or domain knowledge, and may suffer from high computational complexity, consequently hindering their applicability in real-world scenarios. Therefore, a lightweight and ready-to-go approach that is able to detect anomalies in real time is highly sought after. Such an approach could be easily and immediately applied to perform time series anomaly detection on any commodity machine, and could provide timely anomaly alerts that enable appropriate countermeasures to be undertaken as early as possible. With these goals in mind, this paper introduces ReRe, a Real-time Ready-to-go proactive Anomaly Detection algorithm for streaming time series. ReRe employs two lightweight Long Short-Term Memory (LSTM) models to predict and jointly determine whether or not an upcoming data point is anomalous, based on short-term historical data points and two long-term self-adaptive thresholds. Experiments based on real-world time-series datasets demonstrate the good performance of ReRe in real-time anomaly detection without requiring human intervention or domain knowledge.
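The joint decision the abstract describes might look like the sketch below: each of two lightweight predictors flags a point when its prediction error exceeds a self-adaptive threshold, and an anomaly is reported only when both agree. The relative-error measure and the mean-plus-three-standard-deviations threshold rule are assumptions for illustration, not ReRe's published definitions.

```python
# Sketch of a two-detector joint decision; thresholds are assumptions.

from collections import deque

class ErrorDetector:
    """Flags a point when its prediction error exceeds an adaptive bound."""

    def __init__(self, window=30):
        self.errors = deque(maxlen=window)   # recent prediction errors

    def flag(self, predicted, actual):
        err = abs(predicted - actual) / (abs(actual) + 1e-9)
        anomalous = False
        if len(self.errors) >= 3:
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            # Self-adaptive threshold: tracks the recent error distribution
            anomalous = err > mean + 3 * var ** 0.5
        self.errors.append(err)
        return anomalous

def rere_step(d1, d2, pred1, pred2, actual):
    # Evaluate both detectors before combining, so each keeps its history
    f1 = d1.flag(pred1, actual)
    f2 = d2.flag(pred2, actual)
    return f1 and f2      # report an anomaly only when both detectors agree
```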
Advanced Information Networking and Applications
During the past decade, many anomaly detection approaches have been introduced in different fields such as network monitoring, fraud detection, and intrusion detection. However, they require an understanding of the data patterns, and often need a long off-line period to build a model or network for the target data. Providing real-time and proactive anomaly detection for streaming time series without human intervention and domain knowledge is highly valuable, since it greatly reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous damage, failure, or other harmful event occurs. However, this issue has not been well studied yet. To address it, this paper proposes RePAD, a Real-time Proactive Anomaly Detection algorithm for streaming time series based on Long Short-Term Memory (LSTM). RePAD utilizes short-term historical data points to predict and determine whether or not the upcoming data point is a sign that an anomaly is likely to happen in the near future. By dynamically adjusting the detection threshold over time, RePAD is able to tolerate minor pattern changes in the time series and detect anomalies either proactively or on time. Experiments based on two time series datasets collected from the Numenta Anomaly Benchmark demonstrate that RePAD is able to proactively detect anomalies and provide early warnings in real time, without human intervention or domain knowledge.
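A minimal sketch of the dynamically adjusted threshold follows, assuming the error measure is an average absolute relative error (AARE) over a short window and the threshold is the mean plus three standard deviations of recent AARE values; these formulas are illustrative assumptions, and the paper should be consulted for RePAD's exact definitions.

```python
# Sketch of a dynamically adjusted detection threshold over recent errors.
# The AARE measure and the mean + 3*std rule are assumptions for illustration.

import statistics

def aare(predicted, observed):
    """Average absolute relative error over paired data points."""
    return sum(abs(p - o) / (abs(o) + 1e-9)
               for p, o in zip(predicted, observed)) / len(observed)

def is_anomalous(aare_history, current_aare):
    """Flag the current point when its error exceeds the adaptive threshold."""
    if len(aare_history) < 2:
        return False     # not enough history to form a threshold yet
    threshold = (statistics.mean(aare_history)
                 + 3 * statistics.pstdev(aare_history))
    return current_aare > threshold
```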
2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)
Euro-Par 2020: Parallel Processing
2016 IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI)
IEEE Transactions on Parallel and Distributed Systems