Matti Hiltunen - Academia.edu

Papers by Matti Hiltunen

RIC: A RAN Intelligent Controller Platform for AI-Enabled Cellular Networks

IEEE Internet Computing, 2021

With the emergence of 5G, network densification, and richer and more demanding applications, the radio access network (RAN), a key component of the cellular network infrastructure, will become increasingly complex. To tackle this complexity, it is critical for the RAN to be able to automate the process of deploying, optimizing, and operating while leveraging novel data-driven technologies to ultimately improve the end-user quality of experience. In this article, we disaggregate the traditional monolithic control plane (CP) RAN architecture and introduce a RAN Intelligent Controller (RIC) platform that decouples the control and data planes of the RAN, driving an intelligent and continuously evolving radio network by fostering network openness and empowering network intelligence with AI-enabled applications. We provide functional and software architectures of the RIC and discuss its design challenges. We elaborate on how the RIC can enable near-real-time network optimization in 5G for the dual-connectivity use case using machine learning control loops. Finally, we provide preliminary results to evaluate the performance of our open-source RIC platform.

Transaction Dependency Graph Construction Using Signal Injection

Understanding the runtime behavior and dependencies between components in complex transaction-based enterprise systems enables system administrators to identify performance bottlenecks, allocate resources, and detect failures. This paper introduces a novel method for extracting dependency information between system components at runtime by using delay injection on individual links and Fast Fourier Transforms. Our proposed method introduces minimal disturbance in the system, and its execution time is independent of the system workload. Thus, it can be used at runtime in production systems. Furthermore, it avoids false positives introduced by other methods. We present preliminary experimental results that demonstrate that our approach is able to identify dependencies and avoid false positives while ensuring low perturbation to the target system.
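The idea of combining delay injection with a Fourier transform can be illustrated with a small sketch. This is not the authors' implementation: it assumes a periodic square-wave delay injected on one link, a response-time series sampled at a downstream component, and a simple peak-versus-median test on the spectrum. The names `fft`, `depends_on_link`, `simulate`, and the `ratio` threshold are hypothetical, and the series length must be a power of two.

```python
import cmath
import random
from typing import List


def fft(x: List[float]) -> List[complex]:
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def depends_on_link(latencies: List[float], inject_bin: int, ratio: float = 5.0) -> bool:
    """Report a dependency if the spectrum peaks at the injected frequency.

    inject_bin is the number of injection periods within the observation
    window; ratio is an assumed detection threshold, not a published value.
    """
    n = len(latencies)
    mean = sum(latencies) / n
    spectrum = [abs(c) for c in fft([v - mean for v in latencies])]
    # Median magnitude of the other positive-frequency bins as the noise floor.
    others = sorted(spectrum[k] for k in range(1, n // 2) if k != inject_bin)
    noise_floor = others[len(others) // 2]
    return spectrum[inject_bin] > ratio * noise_floor


def simulate(n: int, inject_bin: int, dependent: bool, seed: int = 0) -> List[float]:
    """Synthetic response times (ms): noisy baseline plus, if the component
    depends on the perturbed link, a 10 ms square-wave delay."""
    rng = random.Random(seed)
    series = []
    for t in range(n):
        value = 100.0 + rng.gauss(0.0, 5.0)
        if dependent and (t * inject_bin * 2 // n) % 2 == 0:
            value += 10.0
        series.append(value)
    return series


if __name__ == "__main__":
    print(depends_on_link(simulate(256, 16, dependent=True), 16))
    print(depends_on_link(simulate(256, 16, dependent=False), 16))
```

Because the test looks only at the injected frequency, the injected delay can stay small relative to normal latency variation, which is consistent with the low-perturbation goal stated in the abstract.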

Toward Integrating Intelligence and Programmability in Open Radio Access Networks: A Comprehensive Survey

IEEE Access

Open RAN is an emerging vision and an advancement of the Radio Access Network (RAN). Its purpose is to implement a vendor- and network-generation-agnostic RAN, provide networking solutions across all service requests, and implement artificial intelligence solutions in different stages of an end-to-end communication path. The 5th Generation (5G) and beyond the 5th Generation (B5G) of networking introduce and support new use cases, such as tactile internet and autonomous driving. The complexity and innovative nature of these use cases require continuous innovation at a high pace in the RAN. The traditional approach of building end-to-end RAN solutions by only one vendor hampers the speed of innovation. Furthermore, the lack of a standard approach to implementing artificial intelligence complicates the compatibility of products with the RAN ecosystem. The O-RAN Alliance, a community of industry and academic experts in RAN, works on writing Open RAN specifications on top of the 3rd Generation Partnership Project (3GPP) standards. Founded on these specifications, the aim of this paper is to introduce open research topics in Open RAN that overlap the interests of both AI and telecommunication researchers. The paper provides an overview of the architecture and components of Open RAN, then explores AI use cases in Open RAN. This survey also includes some plausible AI deployment scenarios that the specifications have not covered. Open RAN in future cities creates opportunities for various use cases across different sectors, including engineering, operations, and research, which this paper addresses.

Index terms: 5G, B5G, artificial intelligence, intelligent systems, machine learning, open RAN, radio access networks.

Utilizando a RFT para a Detecção de Falhas de Microsserviços do Controlador O-RAN (Using RFT to Detect Microservice Failures in the O-RAN Controller)

Anais do XXII Workshop de Testes e Tolerância a Falhas (WTF 2021), 2021

The O-RAN (Open Radio Access Network) Alliance is defining a new open-source communication interface (E2) for customizing and controlling the behavior of the RAN. The RIC (RAN Intelligent Controller) platform allows RAN control functions to be implemented as microservices called xApps. This work describes a fault-tolerance strategy for the microservices (xApps) that run on the RIC controller. The proposed strategy consists of state-partitioning techniques with partial replication across groups of xApps and role-aware message rerouting. A library called RFT (RIC Fault Tolerance) was implemented and made available for developing fault-tolerant xApps. Experimental results presented in this paper demonstrate the detection of microservice failures in the RIC controller using RFT.

Virtual Redundancy for Active-Standby Cloud Applications

IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, 2018

VM redundancy is the foundation of resilient cloud applications. While active-active approaches combined with load balancing and autoscaling are usually resource efficient, the stateful nature of many cloud applications often necessitates 1+1 (or 1+n) active-standby approaches. Keeping the standbys, however, could result in inefficient utilization of cloud resources. We explore an intriguing cloud-based solution, where standby VMs from active-standby applications are selectively overbooked to reduce resources reserved for failures. The approach requires careful VM placement to avoid a situation where multiple standby VMs activate simultaneously on the same host and thus cannot get the full resource entitlement. Indeed, today's clouds do not have this visibility into the applications. We rectify this situation through ShadowBox, a novel redundancy-aware VM scheduler that optimizes the placement and activation of standby VMs, while assuring applications' resource entit...

Understanding Membership

A membership service is used in a distributed system to maintain information about which sites are functioning and which have failed at any given time. Such services have proven to be fundamental for constructing distributed applications, with many example services and algorithms defined in the literature. Despite these efforts, however, little has been done on examining the abstract properties commonly guaranteed by membership services independent of a given implementation. Here, a number of these properties are identified and defined. These properties range from agreement among sites on membership changes, consistent ordering of change notifications, and timing properties to various ways for dealing with recoveries and partitions. Message ordering graphs, which are an abstract representation of the set of messages at each site in the system and their potential delivery order, are used to define the properties. Dependency graphs, which are a graphical representation expressing when...

Q-Opt

Proceedings of the 16th Annual Middleware Conference on - Middleware '15, 2015

This paper presents Q-OPT, a system for automatically tuning the configuration of quorum systems in strongly consistent Software Defined Storage (SDS) systems. Q-OPT is able to assign different quorum systems to different items and can be used in a large variety of settings, including systems supporting multiple tenants with different profiles, single-tenant systems running applications with different requirements, or systems running a single application that exhibits non-uniform access patterns to data. Q-OPT supports automatic and dynamic reconfiguration, using a combination of complementary techniques, including top-k analysis to prioritise quorum adaptation, machine learning to determine the best quorum configuration, and a non-blocking quorum reconfiguration protocol that preserves consistency during reconfiguration. Q-OPT has been implemented as an extension to one of the most popular open-source SDS systems, namely OpenStack's Swift.
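To make the notion of a per-item quorum configuration concrete, here is a minimal sketch. It does not reproduce Q-OPT's machine-learning approach: it substitutes a hand-written cost model that assumes an operation's cost grows linearly with the number of replicas contacted. It relies only on the standard intersection rule that a read quorum of size R and a write quorum of size W over N replicas overlap when R + W > N; the function name `choose_quorum` and the cost model are hypothetical.

```python
from typing import Tuple


def choose_quorum(n_replicas: int, read_ratio: float) -> Tuple[int, int]:
    """Pick (read_quorum, write_quorum) with R + W = N + 1 for one item.

    read_ratio is the observed fraction of accesses that are reads.
    Assumed cost model: cost of an operation is proportional to the
    number of replicas it must contact.
    """
    if n_replicas < 1 or not 0.0 <= read_ratio <= 1.0:
        raise ValueError("need n_replicas >= 1 and 0 <= read_ratio <= 1")
    best = None
    for r in range(1, n_replicas + 1):
        w = n_replicas + 1 - r  # smallest W that still intersects every read quorum
        cost = read_ratio * r + (1.0 - read_ratio) * w
        if best is None or cost < best[0]:
            best = (cost, r, w)
    return best[1], best[2]


if __name__ == "__main__":
    # Read-heavy item: small read quorum, large write quorum.
    print(choose_quorum(5, 0.9))
    # Write-heavy item: the opposite trade-off.
    print(choose_quorum(5, 0.1))
```

The sketch only guarantees that reads see the latest write; a real system would also need to handle concurrent writers and switch configurations safely, which is what the reconfiguration protocol mentioned in the abstract addresses.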

Airfoil: A Topology Aware Distributed Load Balancing Service

2015 IEEE 8th International Conference on Cloud Computing, 2015

Load balancing is one of the most basic services needed by cloud applications. While today's clouds and load balancers provide highly customizable load distribution policies, they are forced out of necessity to ignore the impact of load balancing on the network. However, such network-agnostic behavior can lead to inefficient utilization of cloud resources and poor performance, especially for upcoming network-centric applications such as network function virtualization (NFV). But can network topology, a cloud property that is hidden from tenants, be made to effectively influence load balancing, a function that is intimately tied to per-tenant application structure that is largely invisible to the cloud? We answer that question in this paper by presenting Airfoil, a novel topology-aware distributed load balancer as a service (LBaaS) that takes network topology into consideration while providing cloud tenants with application-specific load balancing policies. By finding the best communication pattern with linear programming, the most efficient load balancing strategy is calculated without using any heuristics or approximations. We evaluate Airfoil over a range of scenarios from classical web applications to NFV pipelines, and show that it can decrease the network utilization of the most highly loaded network links by over 50%. For network-constrained applications, such reductions imply a doubling of application capacity without any additional infrastructure.

Position Statement: Supporting Coordinated Adaptation in Networked Systems

While adaptation is widely recognized as valuable, adaptations in most existing systems are limited to changing execution parameters in a single software module or on a single host. Our position is that the true potential of adaptation can only be realized if support is provided for more general solutions, including adaptations that span multiple hosts and multiple system components, and algorithmic adaptations that involve changing the underlying algorithms used by the system at runtime. Such a general solution must, however, address the difficult issues related to these types of adaptations. Adaptation by multiple related components, for example, must be coordinated so that these adaptations work together to implement consistent adaptation policies. Likewise, large-scale algorithmic adaptations need to be coordinated using graceful adaptation strategies in which as much normal processing as possible continues during the changeover. Here, we summarize our approach to addressing the...

Replicating Nondeterministic Services on Grid Environments

2006 15th IEEE International Conference on High Performance Distributed Computing, 2006

Replication is a technique commonly used to increase the availability of services in distributed systems, including grid and web services. While replication is relatively easy for services with fully deterministic behavior, grid and web services often include nondeterministic operations. The traditional way to replicate such nondeterministic services is to use the primary-backup approach. While this is straightforward in synchronous systems with perfect failure detection, typical grid environments are not usually considered to be synchronous systems. This paper addresses the problem of replicating nondeterministic services by designing a protocol based on Paxos and proposing two performance optimizations suitable for replicated grid services. The first improves the performance in the case where some service operations do not change the service state, while the second optimizes grid service requests that use transactions. Evaluations done both on a local cluster and on PlanetLab demonstrate that these optimizations significantly reduce the service response time and increase the throughput of replicated services.

MUSIC: Multi-Site Critical Sections over Geo-Distributed State

2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020

A crucial requirement for many multi-site production services operating at global scale is the need for exclusive access to the latest state. Here, a novel approach to address these requirements through the abstraction of a critical section over geo-distributed state is proposed. This abstraction is realized in a key-value store called MUSIC, which provides critical sections with novel semantics suitable for geo-distributed state, referred to as entry consistency under failures (ECF). The semantics of ECF in MUSIC, its formal verification, and its implementation are presented, along with details of how MUSIC has been used to realize various fundamental geo-distributed structuring paradigms. MUSIC has been deployed in production geo-distributed services at AT&T as part of the Open Network Automation Platform (ONAP). Our evaluation of MUSIC shows that, despite providing additional properties, MUSIC has higher throughput (~1.4-17.17 times) than ZooKeeper for larger critical section sizes an...

Growing

Web Services and their associated technologies (SOAP, XML, WSDL) are reputed inefficient. What is the performance impact on Globus? Globus: large, complex, collaborative middleware. How to extract meaningful profiling data? How to profile a complex piece of software? Both a profiling and a reverse engineering problem. (F. Taïani, IBM HTX'06)

Our First Experiment: create resource, subscribe to changes, add 3, notify ×4, destroy resource (client and container, each in a Java VM; tracing produces execution traces). First attempt: tracing everything (outside the JVM libs). Client: 1,544,734 local method call (sic). Server: 6,466,652 local method calls (sic) [+time out]. How to visualize such results? [Globus 3.9.2, Java 1.4, no security]

Program visualization, a few notions: Problem studied for quite a long time now. Different aspects: collection, manipulation, visualization. Visualization: some form of projection (many proposed). Our goal: understand software structure: lib 1 lib ...

DarkNOC: Dashboard for Honeypot Management

Protecting computer and information systems from security attacks is becoming an increasingly important task for system administrators. Honeypots are a technology often used to detect attacks and collect information about techniques and targets (e.g., services, ports, operating systems) of attacks. However, managing a large and complex network of honeypots becomes a challenge given the amount of data collected as well as the risk that the honeypots may become infected and start attacking other machines. In this paper, we present DarkNOC, a management and monitoring tool for complex honeynets consisting of different types of honeypots as well as other data collection devices. DarkNOC has been actively used to manage a honeynet consisting of multiple subnets and hundreds of IP addresses. This paper describes the architecture and a number of case studies demonstrating the use of DarkNOC.

Nfsight: NetFlow-based Network Awareness Tool

Network awareness is highly critical for network and security administrators. It enables informed planning and management of network resources, as well as detection and a comprehensive understanding of malicious activity. It requires a set of tools to efficiently collect, process, and represent network data. While many such tools already exist, there is no flexible and practical solution for visualizing network activity at various granularities, and quickly gaining insights about the status of network assets. To address this issue, we developed Nfsight, a NetFlow processing and visualization application designed to offer a comprehensive network awareness solution. Nfsight constructs bidirectional flows out of the unidirectional NetFlow flows and leverages these bidirectional flows to provide client/server identification and intrusion detection capabilities. We present in this paper the internal architecture of Nfsight, the evaluation of the service, and intrusion detection algorith...
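The construction of bidirectional flows from unidirectional records can be sketched as follows. This is an illustration rather than Nfsight's actual algorithm: the record fields, the names `Flow`, `BiFlow`, and `pair_flows`, and the rule that the endpoint whose record starts first is the client are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


@dataclass(frozen=True)
class Flow:
    """One unidirectional flow record (hypothetical subset of NetFlow fields)."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: int
    start: float
    nbytes: int


@dataclass(frozen=True)
class BiFlow:
    client: Endpoint
    server: Endpoint
    proto: int
    bytes_to_server: int
    bytes_to_client: int


def pair_flows(flows: List[Flow]) -> Tuple[List[BiFlow], List[Flow]]:
    """Match each record with its reverse-direction counterpart.

    Assumed heuristic: the record that starts first is the client's request.
    A repeated same-direction record replaces the earlier one in this sketch.
    Returns the paired bidirectional flows and the unmatched records.
    """
    pending: Dict[tuple, Flow] = {}
    paired: List[BiFlow] = []
    for f in sorted(flows, key=lambda fl: fl.start):
        reverse_key = (f.dst, f.dport, f.src, f.sport, f.proto)
        first = pending.pop(reverse_key, None)
        if first is None:
            pending[(f.src, f.sport, f.dst, f.dport, f.proto)] = f
            continue
        paired.append(BiFlow(
            client=(first.src, first.sport),
            server=(first.dst, first.dport),
            proto=f.proto,
            bytes_to_server=first.nbytes,
            bytes_to_client=f.nbytes,
        ))
    return paired, list(pending.values())


if __name__ == "__main__":
    records = [
        Flow("10.0.0.2", 80, "10.0.0.1", 1234, 6, 1.1, 4000),   # response
        Flow("10.0.0.1", 1234, "10.0.0.2", 80, 6, 1.0, 500),    # request
        Flow("10.0.0.3", 5555, "10.0.0.4", 22, 6, 2.0, 60),     # no reply seen
    ]
    biflows, unmatched = pair_flows(records)
    print(biflows)
    print(unmatched)
```

Unmatched records are returned separately because one-way traffic with no reply, such as scans, is itself a useful signal for intrusion detection.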

Reflections on Aspects and Configurable Protocols

The goals of aspect-oriented software development (AOSD) and frameworks for configurable protocols (CPs) are similar in many respects. AOSD allows the specification of crosscutting concerns called aspects as separate modules that are woven with the base program as needed. CPs are oriented towards building protocols or services with different quality of service (QoS) properties and attributes out of collections of independent modules, with each configuration customizing the service for a given application and execution environment. As AOSD evolves to address issues in areas such as middleware, operating systems, and distributed computing that have traditionally been the domain of CPs, lessons learned from the development of these frameworks could be useful. The purpose of this paper is to draw parallels between AOSD and CP frameworks, with a specific focus on the Cactus framework and how it compares and contrasts with the aspect-oriented paradigm.

Performance Aware Regeneration in Virtualized Multitier Applications

Virtual machine technology enables highly agile system deployments in which components can be cheaply moved, cloned, and allocated controlled hardware resources. In this paper, we examine, in the context of multitier enterprise applications, how these facilities can be used to provide enhanced solutions to the classic problem of ensuring high availability without a loss in performance on a fixed amount of resources. By using virtual machine clones to restore the redundancy of a system whenever component failures occur, we achieve improved availability compared to a system with a fixed redundancy level. By smartly controlling component placement and colocation using information about the multitier system's flows and predictions made by queuing models, we ensure that the resulting performance degradation is minimized. Simulation results show that our proposed approach provides better availability and significantly lower degradation of system response times compared to traditio...

Community-based Analysis of Netflow for Early Detection of Security Incidents

Detection and remediation of security incidents (e.g., attacks, compromised machines, policy violations) is an increasingly important task of system administrators. While numerous tools and techniques are available (e.g., Snort, nmap, netflow), novel attacks and low-grade events may still be hard to detect in a timely manner. In this paper, we present a novel approach for detecting stealthy, low-grade security incidents by utilizing information across a community of organizations (e.g., banking industry, energy generation and distribution industry, governmental organizations in a specific country). The approach uses netflow, a commonly available non-intrusive data source, analyzes communication to/from the community, and alerts the community members when suspicious activity is detected. Community-based detection has the ability to detect incidents that would fall below local detection thresholds while maintaining the number of alerts at a manageable level for each day.
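The claim that pooled observations can reveal activity that stays below every local threshold can be shown with a small sketch. The scoring rule here (summing per-source contact counts across organizations and requiring a minimum number of reporting organizations) is an assumption for illustration, not the detection algorithm from the paper; `local_alerts`, `community_alerts`, and the threshold values are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def local_alerts(contacts: Dict[str, int], threshold: int) -> Set[str]:
    """Sources that one organization would flag on its own."""
    return {src for src, count in contacts.items() if count >= threshold}


def community_alerts(per_org: Dict[str, Dict[str, int]],
                     threshold: int, min_orgs: int = 2) -> Set[str]:
    """Sources flagged when contact counts are pooled across the community.

    per_org maps an organization name to {source IP: contact count}.
    A source is flagged if its pooled count reaches the threshold and it
    was seen by at least min_orgs organizations (assumed rule).
    """
    total: Counter = Counter()
    orgs_seen: Counter = Counter()
    for contacts in per_org.values():
        for src, count in contacts.items():
            total[src] += count
            orgs_seen[src] += 1
    return {src for src in total
            if total[src] >= threshold and orgs_seen[src] >= min_orgs}


if __name__ == "__main__":
    observed = {
        "bank_a": {"203.0.113.7": 3, "198.51.100.2": 1},
        "bank_b": {"203.0.113.7": 4},
        "bank_c": {"203.0.113.7": 3},
    }
    # No single organization sees enough contacts to alert...
    print([local_alerts(c, threshold=10) for c in observed.values()])
    # ...but the pooled view crosses the same threshold.
    print(community_alerts(observed, threshold=10))
```

Requiring several organizations to report the same source is one simple way to keep the alert volume manageable, in line with the abstract's concern about the number of alerts per day.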

Performance Evaluation of an Alert Dissemination Engine based on the AT & T Enterprise Messaging Network

The recent surge in the variety and number of mobile devices used as communication end points has created a significant challenge for messaging applications that aim to reach their target recipients regardless of their location and available devices. Alerting services used to notify a potentially large number of recipients about an emergency, or other important events, are an important class of applications enabled by the prevalence of such mobile devices. A middleware platform that arbitrates content delivery and adaptation between mobile devices and backend messaging applications is crucial in reducing the software complexity on both the client device and server-side applications. For an alerting application, the Quality of Service (QoS) of the middleware platform becomes crucial: alerts must be delivered quickly, reliably, and securely to all their recipients. As the first step in achieving such QoS, this paper evaluates the performance of a commercial mobile middleware platform, ...

Research paper thumbnail of RIC: A RAN Intelligent Controller Platform for AI-Enabled Cellular Networks

IEEE Internet Computing, 2021

With the emergence of 5G, network densification, and richer and more demanding applications, the ... more With the emergence of 5G, network densification, and richer and more demanding applications, the radio access network (RAN)—a key component of the cellular network infrastructure—will become increasingly complex. To tackle this complexity, it is critical for the RAN to be able to automate the process of deploying, optimizing, and operating while leveraging novel data-driven technologies to ultimately improve the end-user quality of experience. In this article, we disaggregate the traditional monolithic control plane (CP) RAN architecture and introduce a RAN Intelligent Controller (RIC) platform decoupling the control and data planes of the RAN driving an intelligent and continuously evolving radio network by fostering network openness and empowering network intelligence with AI-enabled applications. We provide functional and software architectures of the RIC and discuss its design challenges. We elaborate how the RIC can enable near-real-time network optimization in 5G for the dual-connectivity use case using machine learning control loops. Finally, we provide preliminary results to evaluate the performance of our open-source RIC platform.

Research paper thumbnail of Transaction Dependency Graph Construction Using Signal Injection

Understanding the runtime behavior and dependencies between components in complex transaction-bas... more Understanding the runtime behavior and dependencies between components in complex transaction-based enterprise systems enables the system administrators to identify performance bottlenecks, allocate resources, and detect failures. This paper introduces a novel method for extracting dependency information between system components at runtime by using delay injection on individual links and Fast Fourier Transforms. Our proposed method introduces minimal disturbance in the system and its execution time is independent of the system workload. Thus, it can be used at runtime in production systems. Furthermore, it avoids false positives introduced by other methods. We present preliminary experimental results that demonstrate that our approach is able to identify dependencies, avoid false positives, while ensuring low perturbation to the target system.

Research paper thumbnail of Toward Integrating Intelligence and Programmability in Open Radio Access Networks: A Comprehensive Survey

IEEE Access

Open RAN is an emerging vision and an advancement of the Radio Access Network (RAN). Its purpose ... more Open RAN is an emerging vision and an advancement of the Radio Access Network (RAN). Its purpose is to implement a vendor and network-generation agnostic RAN, provide networking solutions across all service requests, and implement artificial intelligence solutions in different stages of an end-to-end communication path. The 5th Generation (5G) and beyond the 5th Generation (B5G) of networking introduce and support new use cases, such as tactile internet and autonomous driving. The complexity and innovative nature of these use cases require continuous innovation at a high pace in the RAN. The traditional approach of building end-to-end RAN solutions by only one vendor hampers the speed of innovation-furthermore, the lack of a standard approach to implementing artificial intelligence complicates the compatibility of products with the RAN ecosystem. O-RAN Alliance, a community of industry and academic experts in RAN, works on writing Open RAN specifications on top of the 3rd Generation Partnership Project (3GPP) standards. Founded on these specifications, the aim of this paper is to introduce open research topics in Open RAN that overlap the interests of both AI and telecommunication researchers. The paper provides an overview of the architecture and components of Open RAN, then explores AI use cases in Open RAN. Also, this survey includes some plausible AI deployment scenarios that the specifications have not covered. Open RAN in future cities creates opportunities for various use cases across different sectors, including engineering, operations, and research that this paper addresses. INDEX TERMS 5G, B5G, artificial intelligence, intelligent systems, machine learning, open RAN, radio access networks.

Research paper thumbnail of Utilizando a RFT para a Detecção de Falhas de Microsserviços do Controlador O-RAN

Anais do XXII Workshop de Testes e Tolerância a Falhas (WTF 2021), 2021

A O-RAN (Open Radio Access Network) Alliance está definindo uma nova interface de comunicação (E2... more A O-RAN (Open Radio Access Network) Alliance está definindo uma nova interface de comunicação (E2) de código aberto para customizar e controlar o comportamento da RAN. A plataforma RIC (RAN Intelligent Controller) permite implementar funções de controle da RAN por meio de microsserviços chamados xApps. Este trabalho descreve uma estratégia de tolerância a falhas para os microsserviços (xApps) que executam no controlador RIC. A estratégia proposta consiste de técnicas de particionamento de estado com replicação parcial em grupos de xApps e re-roteamento de mensagens com ciência de papel. Uma biblioteca chamada RFT (RIC Fault Tolerance) foi implementada e disponibilizada para o desenvolvimento de xApps tolerantes a falhas. Resultados experimentais apresentados neste artigo demonstram a detecção de falhas de microsserviços do controlador RIC com a RFT.

Research paper thumbnail of Virtual Redundancy for Active-Standby Cloud Applications

IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, 2018

VM redundancy is the foundation of resilient cloud applications. While active-active approaches c... more VM redundancy is the foundation of resilient cloud applications. While active-active approaches combined with load balancing and autoscaling are usually resource efficient, the stateful nature of many cloud applications often necessitates 1+1 (or 1+mathbfn1+\mathbf{n}1+mathbfn) active-standby approaches. Keeping the standbys, however, could result in inefficient utilization of cloud resources. We explore an intriguing cloud-based solution, where standby VMs from active-standby applications are selectively overbooked to reduce resources reserved for failures. The approach requires careful VM placement to avoid a situation where multiple standby VMs activate simultaneously on the same host and thus cannot get the full resource entitlement. Indeed today's clouds do not have this visibility to the applications. We rectify this situation through ShadowBox, a novel redundancy-aware VM scheduler that optimizes the placement and activation of standby VMs, while assuring applications' resource entit...

Research paper thumbnail of Understanding Membership

A membership service is used in a distributed system to maintain information about which sites ar... more A membership service is used in a distributed system to maintain information about which sites are functioning and which have failed at any given time. Such services have proven to be fundamental for constructing distributed applications, with many example services and algorithms defined in the literature. Despite these efforts, however, little has been done on examining the abstract properties commonly guaranteed by membership services independent of a given implementation. Here, a number of these properties are identified and defined. These properties range from agreement among sites on membership changes, consistent ordering of change notifications, and timing properties to various ways for dealing with recoveries and partitions. Message ordering graphs, which are an abstract representation of the set of messages at each site in the system and their potential delivery order, are used to define the properties. Dependency graphs, which are a graphical representation expressing when...

Q-Opt

Proceedings of the 16th Annual Middleware Conference on - Middleware '15, 2015

This paper presents Q-OPT, a system for automatically tuning the configuration of quorum systems in strongly consistent Software Defined Storage (SDS) systems. Q-OPT is able to assign different quorum systems to different items and can be used in a large variety of settings, including systems supporting multiple tenants with different profiles, single tenant systems running applications with different requirements, or systems running a single application that exhibits non-uniform access patterns to data. Q-OPT supports automatic and dynamic reconfiguration, using a combination of complementary techniques, including top-k analysis to prioritise quorum adaptation, machine learning to determine the best quorum configuration, and a non-blocking quorum reconfiguration protocol that preserves consistency during reconfiguration. Q-OPT has been implemented as an extension to one of the most popular open-source SDS systems, namely OpenStack's Swift.

Airfoil: A Topology Aware Distributed Load Balancing Service

2015 IEEE 8th International Conference on Cloud Computing, 2015

Load balancing is one of the most basic services needed by cloud applications. While today's clouds and load balancers provide highly customizable load distribution policies, they are forced out of necessity to ignore the impact of load balancing on the network. However, such network-agnostic behavior can lead to inefficient utilization of cloud resources and poor performance, especially for upcoming network-centric applications such as network function virtualization (NFV). But can network topology, a cloud property that is hidden from tenants, be made to effectively influence load balancing, a function that is intimately tied to per-tenant application structure that is largely invisible to the cloud? We answer that question in this paper by presenting Airfoil, a novel topology-aware distributed load balancer as a service (LBaaS) that takes network topology into consideration while providing cloud tenants with application-specific load balancing policies. By finding the best communication pattern with linear programming, the most efficient load balancing strategy is calculated without using any heuristics or approximations. We evaluate Airfoil over a range of scenarios from classical web applications to NFV pipelines, and show that it can decrease the network utilization of the most highly loaded network links by over 50%. For network-constrained applications, such reductions imply a doubling of application capacity without any additional infrastructure.

Position Statement: Supporting Coordinated Adaptation in Networked Systems

While adaptation is widely recognized as valuable, adaptations in most existing systems are limited to changing execution parameters in a single software module or on a single host. Our position is that the true potential of adaptation can only be realized if support is provided for more general solutions, including adaptations that span multiple hosts and multiple system components, and algorithmic adaptations that involve changing the underlying algorithms used by the system at runtime. Such a general solution must, however, address the difficult issues related to these types of adaptations. Adaptation by multiple related components, for example, must be coordinated so that these adaptations work together to implement consistent adaptation policies. Likewise, large-scale algorithmic adaptations need to be coordinated using graceful adaptation strategies in which as much normal processing as possible continues during the changeover. Here, we summarize our approach to addressing the...

Replicating Nondeterministic Services on Grid Environments

2006 15th IEEE International Conference on High Performance Distributed Computing, 2006

Replication is a technique commonly used to increase the availability of services in distributed systems, including grid and web services. While replication is relatively easy for services with fully deterministic behavior, grid and web services often include nondeterministic operations. The traditional way to replicate such nondeterministic services is to use the primary-backup approach. While this is straightforward in synchronous systems with perfect failure detection, typical grid environments are not usually considered to be synchronous systems. This paper addresses the problem of replicating nondeterministic services by designing a protocol based on Paxos and proposing two performance optimizations suitable for replicated grid services. The first improves the performance in the case where some service operations do not change the service state, while the second optimizes grid service requests that use transactions. Evaluations done both on a local cluster and on PlanetLab demonstrate that these optimizations significantly reduce the service response time and increase the throughput of replicated services.

MUSIC: Multi-Site Critical Sections over Geo-Distributed State

2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020

A crucial requirement for many multi-site production services operating at global scale is the need for exclusive access to the latest state. Here, a novel approach to address these requirements through the abstraction of a critical section over geo-distributed state is proposed. This abstraction is realized in a key-value store called MUSIC, which provides critical sections with novel semantics suitable for geo-distributed state referred to as entry consistency under failures (ECF). The semantics of ECF in MUSIC, its formal verification, and its implementation are presented, along with details of how MUSIC has been used to realize various fundamental geo-distributed structuring paradigms. MUSIC has been deployed in production geo-distributed services at AT&T as part of the Open Network Automation Platform (ONAP). Our evaluation of MUSIC shows that, despite providing additional properties, MUSIC has higher throughput (~1.4-17.17 times) than ZooKeeper for larger critical section sizes an...

Growing

Web Services and their associated technologies (SOAP, XML, WSDL) are reputed inefficient. What is the performance impact on Globus? Globus: large, complex, collaborative middleware. How to extract meaningful profiling data? How to profile a complex piece of software? Both a profiling and a reverse engineering problem. (F. Taïani, IBM HTX'06.) Our first experiment: create resource, subscribe to changes, add 3, notify ×4, destroy resource. First attempt: tracing everything (outside the JVM libs). Client: 1,544,734 local method calls (sic); server: 6,466,652 local method calls (sic) [+time out]. How to visualize such results? [Globus 3.9.2, Java 1.4, no security] Program visualization: a few notions. Problem studied for quite a long time now. Different aspects: collection, manipulation, visualization. Visualization: some form of projection (many proposed). Our goal: understand software structure: lib 1 lib ...

DarkNOC: Dashboard for Honeypot Management

Protecting computer and information systems from security attacks is becoming an increasingly important task for system administrators. Honeypots are a technology often used to detect attacks and collect information about techniques and targets (e.g., services, ports, operating systems) of attacks. However, managing a large and complex network of honeypots becomes a challenge given the amount of data collected as well as the risk that the honeypots may become infected and start attacking other machines. In this paper, we present DarkNOC, a management and monitoring tool for complex honeynets consisting of different types of honeypots as well as other data collection devices. DarkNOC has been actively used to manage a honeynet consisting of multiple subnets and hundreds of IP addresses. This paper describes the architecture and a number of case studies demonstrating the use of DarkNOC.

Nfsight: NetFlow-based Network Awareness Tool

Network awareness is highly critical for network and security administrators. It enables informed planning and management of network resources, as well as detection and a comprehensive understanding of malicious activity. It requires a set of tools to efficiently collect, process, and represent network data. While many such tools already exist, there is no flexible and practical solution for visualizing network activity at various granularities, and quickly gaining insights about the status of network assets. To address this issue, we developed Nfsight, a NetFlow processing and visualization application designed to offer a comprehensive network awareness solution. Nfsight constructs bidirectional flows out of the unidirectional NetFlow flows and leverages these bidirectional flows to provide client/server identification and intrusion detection capabilities. We present in this paper the internal architecture of Nfsight, the evaluation of the service, and intrusion detection algorith...
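
The idea of building bidirectional flows out of unidirectional NetFlow records can be illustrated with a minimal sketch. This is not Nfsight's implementation: the `Flow` record, its field names, and the "earliest sender is the client" heuristic are assumptions made purely for illustration.

```python
from collections import namedtuple

# Hypothetical unidirectional flow record; field names are illustrative only.
Flow = namedtuple("Flow", "src sport dst dport proto start packets")


def flow_key(f):
    """Direction-independent key: both directions of a conversation share it."""
    a, b = (f.src, f.sport), (f.dst, f.dport)
    return (f.proto,) + (a + b if a <= b else b + a)


def merge_bidirectional(flows):
    """Pair unidirectional flows into bidirectional ones.

    Assumption: the endpoint that sent the earliest flow is the client.
    """
    merged = {}
    for f in sorted(flows, key=lambda f: f.start):
        key = flow_key(f)
        if key not in merged:
            merged[key] = {
                "client": (f.src, f.sport),
                "server": (f.dst, f.dport),
                "proto": f.proto,
                "pkts_to_server": 0,
                "pkts_to_client": 0,
            }
        entry = merged[key]
        if (f.src, f.sport) == entry["client"]:
            entry["pkts_to_server"] += f.packets
        else:
            entry["pkts_to_client"] += f.packets
    return list(merged.values())


# Two unidirectional records belonging to the same TCP conversation.
flows = [
    Flow("10.0.0.5", 51000, "10.0.0.9", 80, "tcp", 1.00, 6),
    Flow("10.0.0.9", 80, "10.0.0.5", 51000, "tcp", 1.02, 4),
]
bidir = merge_bidirectional(flows)
```

A timestamp-based guess like this one can mislabel the client when records arrive with clock skew; a real tool would likely combine several heuristics.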

Reflections on Aspects and Configurable Protocols

The goals of aspect oriented software development (AOSD) and frameworks for configurable protocols (CPs) are similar in many respects. AOSD allows the specification of crosscutting concerns called aspects as separate modules that are woven with the base program as needed. CPs are oriented towards building protocols or services with different quality of service (QoS) properties and attributes out of collections of independent modules, with each configuration customizing the service for a given application and execution environment. As AOSD evolves to address issues in areas such as middleware, operating systems, and distributed computing that have traditionally been the domain of CPs, lessons learned from the development of these frameworks could be useful. The purpose of this paper is to draw parallels between AOSD and CP frameworks, with a specific focus on the Cactus framework and how it compares and contrasts with the aspect-oriented paradigm.

Performance Aware Regeneration in Virtualized Multitier Applications

Virtual machine technology enables highly agile system deployments in which components can be cheaply moved, cloned, and allocated controlled hardware resources. In this paper, we examine, in the context of multitier enterprise applications, how these facilities can be used to provide enhanced solutions to the classic problem of ensuring high availability without a loss in performance on a fixed amount of resources. By using virtual machine clones to restore the redundancy of a system whenever component failures occur, we achieve improved availability compared to a system with a fixed redundancy level. By smartly controlling component placement and colocation using information about the multitier system’s flows and predictions made by queuing models, we ensure that the resulting performance degradation is minimized. Simulation results show that our proposed approach provides better availability and significantly lower degradation of system response times compared to traditio...

Community-based Analysis of Netflow for Early Detection of Security Incidents

Detection and remediation of security incidents (e.g., attacks, compromised machines, policy violations) is an increasingly important task of system administrators. While numerous tools and techniques are available (e.g., Snort, nmap, netflow), novel attacks and low-grade events may still be hard to detect in a timely manner. In this paper, we present a novel approach for detecting stealthy, low-grade security incidents by utilizing information across a community of organizations (e.g., banking industry, energy generation and distribution industry, governmental organizations in a specific country, etc.). The approach uses netflow, a commonly available non-intrusive data source, analyzes communication to/from the community, and alerts the community members when suspicious activity is detected. Community-based detection can detect incidents that would fall below local detection thresholds while keeping the number of alerts per day at a manageable level.

Performance Evaluation of an Alert Dissemination Engine based on the AT&T Enterprise Messaging Network

The recent surge in the variety and number of mobile devices used as communication end points has created a significant challenge for messaging applications that aim to reach their target recipients regardless of their location and available devices. Alerting services used to notify a potentially large number of recipients about an emergency, or other important events, are an important class of applications enabled by the prevalence of such mobile devices. A middleware platform that arbitrates content delivery and adaptation between mobile devices and backend messaging applications is crucial in reducing the software complexity on both the client device and server side applications. For an alerting application, the Quality of Service (QoS) of the middleware platform becomes crucial: alerts must be delivered quickly, reliably, and securely to all their recipients. As the first step in achieving such QoS, this paper evaluates the performance of a commercial mobile middleware platform, ...