Quantile sampling for practical delay monitoring in Internet backbone networks

Practical delay monitoring for ISPs

2005

Point-to-point delay is an important network performance measure as well as a key parameter in SLAs. We study how to measure and report delay in a concise and meaningful way for an ISP, and how to monitor it efficiently. We analyze various measurement intervals and potential metric definitions. We find that reporting high quantiles (between 0.95 and 0.99) every 10-30 minutes is the most effective way to summarize the delay in an ISP. We then propose an active probing scheme to estimate a high quantile with bounded error. We show that a small number of probes is sufficient to provide an accurate estimate. We validate the proposed delay monitoring technique on real data collected on the Sprint IP backbone network.
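As a rough illustration of estimating a high delay quantile from a limited number of probes, the sketch below takes the empirical quantile as an order statistic and sizes the probe count with a normal approximation to the binomial. This is a generic distribution-free approach, not the probing scheme of the paper, whose details the abstract does not give. The function names, the `rank_error` parameter, and the 1.96 z-value (about 95% confidence) are assumptions made for illustration, and the probes are assumed to be independent samples of the delay distribution.

```python
import math


def empirical_quantile(samples, p):
    """Return the ceil(n*p)-th smallest sample (a 1-indexed order statistic)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * p))
    return ordered[rank - 1]


def probes_needed(p, rank_error, confidence_z=1.96):
    """Approximate number of independent probes so that the estimate is,
    with the chosen confidence, a true quantile between p - rank_error
    and p + rank_error (normal approximation to the binomial)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if rank_error <= 0.0:
        raise ValueError("rank_error must be positive")
    return math.ceil(confidence_z ** 2 * p * (1.0 - p) / rank_error ** 2)


if __name__ == "__main__":
    # Hypothetical probe delays in milliseconds for one reporting interval.
    delays_ms = [28.1, 28.4, 28.2, 29.0, 28.3, 35.7, 28.2, 28.5, 28.6, 31.2]
    print(empirical_quantile(delays_ms, 0.95))
    print(probes_needed(0.95, 0.01))
```

Note that the error bound here is on the quantile level (the probability), not on the delay value itself. Because p(1 - p) shrinks as p approaches 1, a high quantile needs fewer probes for the same bound on the quantile level than the median does.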

Measuring Latency Variation in the Internet

Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies, 2016

We analyse two complementary datasets to quantify the latency variation experienced by internet end-users: (i) a large-scale active measurement dataset (from the Measurement Lab Network Diagnostic Tool) which sheds light on long-term trends and regional differences; and (ii) passive measurement data from an access aggregation link which is used to analyse the edge links closest to the user. The analysis shows that variation in latency is both common and of significant magnitude, with two thirds of samples exceeding 100 ms of variation. The variation is seen within single connections as well as between connections to the same client. The distribution of experienced latency variation is heavy-tailed, with the most affected clients seeing an order of magnitude larger variation than the least affected. In addition, there are large differences between regions, both within and between continents. Despite consistent improvements in throughput, most regions show no reduction in latency variation over time, and in one region it even increases. We examine load-induced queueing latency as a possible cause for the variation in latency and find that both datasets readily exhibit symptoms of queueing latency correlated with network load. Additionally, when this queueing latency does occur, it is of significant magnitude, more than 200 ms in the median. This indicates that load-induced queueing contributes significantly to the overall latency variation.
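A minimal sketch of how a per-connection latency variation statistic of this kind might be computed. The abstract does not define the metric, so the definition used here (maximum minus minimum RTT within one connection) is an assumption, as are the function names and the sample data; the 100 ms threshold is the one quoted in the abstract.

```python
from statistics import median


def latency_span_ms(rtts_ms):
    """Variation within one connection, taken here as max RTT minus min RTT."""
    if not rtts_ms:
        raise ValueError("need at least one RTT sample")
    return max(rtts_ms) - min(rtts_ms)


def summarize_spans(connections, threshold_ms=100.0):
    """Median span and fraction of connections whose span exceeds the threshold."""
    spans = [latency_span_ms(rtts) for rtts in connections]
    exceeding = sum(1 for span in spans if span > threshold_ms)
    return {
        "median_span_ms": median(spans),
        "fraction_over_threshold": exceeding / len(spans),
    }


if __name__ == "__main__":
    # Hypothetical RTT samples (ms), one list per connection.
    connections = [[20, 25, 30], [40, 180, 60], [15, 300, 16]]
    print(summarize_spans(connections))
```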

Measurement and analysis of single-hop delay on an IP backbone network

IEEE Journal on Selected Areas in Communications, 2003

We measure and analyze the single-hop packet delay through operational routers in the Sprint Internet protocol (IP) backbone network. After presenting our delay measurements through a single router for OC-3 and OC-12 link speeds, we propose a methodology to identify the factors contributing to single-hop delay. In addition to packet processing, transmission, and queueing delay at the output link, we observe the presence of very large delays that cannot be explained within the context of a first-in first-out output queue model. We isolate and analyze these outliers.

Large Scale Internet Queueing Delay Tomography

2006

Queuing delay tomography of the Internet is mostly a theoretical research topic, and measurements have mainly been performed to prove the validity of certain measurement methods. We propose a large-scale Internet tomography survey to map the queueing delay in the European networks in great detail, and the rest of the world to a lesser extent. The measurements will be based on the ETOMIC high-accuracy packet capturing infrastructure and on the vast distributed agent community of DIMES. We present the rationale behind the effort, the new technical tools developed to enable it, and some results from initial trials.

Latency profiles: performance monitoring for wide area applications

Proceedings of the Third IEEE Workshop on Internet Applications (WIAPP 2003), 2003

Recent technological advances have enabled the deployment of wide area applications against Internet-accessible sources. A performance challenge to applications in such a setting is the unpredictable end-to-end latency of accessing these sources. We use passive information gathering mechanisms to learn end-to-end latency distributions and construct Latency Profiles (LPs). We hypothesize that a group of clients within an autonomous system (AS) that are accessing a content server in another AS may be represented by one or more LPs. Related networking research on IDMaps, points of congestion, and BGP routes supports this hypothesis. We develop aggregate LPs to provide coverage of groups (clusters) of client-server pairs. Using data gathered from a (limited) experiment, we demonstrate the feasibility of constructing LPs.

An Optimal Median Calculation Algorithm for Estimating Internet Link Delays from Active Measurements

2007 Workshop on End-to-End Monitoring Techniques and Services, 2007

Delay estimation in the Internet can improve performance of many applications, e.g., Web browsing, peer-to-peer applications, and distributed games. For this purpose, researchers suggested building an Internet distance service that can efficiently supply applications with delay information based on an Internet delay map. This can be achieved by deploying a large scale measurement infrastructure such as the DIMES project where

Scalable and systematic Internet-wide path and delay estimation from existing measurements

Computer Networks, 2011

Internet-wide services and applications depend on accurate information about the internal network state to deliver good performance to end-users. However, today's Internet does not provide such information explicitly and a number of systems have been recently proposed and implemented to provide a shared measurement infrastructure for distributed applications. The goal of this work is to demonstrate that without any new measurement infrastructure or active probing we obtain composite performance estimates from AS-by-AS segments and the estimates are as good as (or even better than) those from existing estimation methodologies that use on-demand, customized active probing. The key idea behind scaling measurements to the size of the Internet is to take advantage of the known underlying structure of the network.

End-to-end QoS measurement: analytic methodology of application response time vs. tunable latency in IP networks

In this paper, we discuss one aspect of the measurement issue: how to measure end-to-end application response time (ART) relative to aggregated “tunable” network latency, or tunable latency. The goal is to enhance our understanding of the relationship between these two metrics for database access applications. Tunable latency is defined as follows: the sum of the “round trip” queuing delay and data transmission/insertion delay from beginning to end of the application transmission. Our problem space concentrates on developing a methodology to graphically characterize response time as a function of tunable latency for existing database access applications in a wired, single-threaded, multi-user, post-deployment client/server environment. A number of tools were used in developing this methodology which was not obvious from the tools' documentation. To test its feasibility before actual field use, we used an experimental setup to emulate the real user environment. In so doing, we n...

End-to-end queuing delay assessment in multi-service IP networks

Journal of Statistical Computation and Simulation, 2002

Packet-based networks are increasingly used to transport interactive streaming services like telephony and videophony. To guarantee good quality for these services, the queuing delay and delay jitter introduced in the transport of voice or video flows over the packet-based network should be kept under control. Because data sources tend to increase their sending rate until (a part of) the network is congested, mixing real-time traffic and data traffic in one queue would lead to unacceptably high delays for real-time services. Voice and video packets therefore need to get preferential treatment (e.g. head-of-line priority) over data packets in the network nodes. As a result, the queuing behavior of the voice and video packets can be studied more or less independently from the traffic generated by data services. Simple methods to assess the end-to-end delay are essential. Since it is well known that an aggregate of voice (and CBR video) sources is accurately modeled by a Poisson arrival process and that delays in consecutive nodes are more or less statistically independent, this boils down to developing methods to calculate quantiles of the total queuing delay through a system of N statistically independent M/G/1 nodes. This paper develops four methods to calculate quantiles of the total queuing delay: a Gaussian method, a method based on the numerical inversion of the moment generating function of the total queuing delay developed by Abate & Whitt, and two methods based on the assumption that the tail distribution of the individual queuing delay of one node is approximately exponential. The Gaussian method is the simplest, but only gives crude results. The method of Abate & Whitt is the most complex and breaks down for large quantiles. The methods based on the assumption of an exponential tail produce results that are more or less equally accurate as long as there is a node where the load is high enough.
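To make the Gaussian idea concrete, the following sketch approximates a quantile of the total queuing delay over independent M/G/1 nodes by matching the mean and variance of the sum to a normal distribution. It is one plausible reading of a "Gaussian method", not necessarily the paper's exact formulation. The per-node moments use the standard M/G/1 results E[W] = λE[S²] / (2(1 − ρ)) and E[W²] = 2E[W]² + λE[S³] / (3(1 − ρ)); the function names and the example node parameters are assumptions.

```python
import math
from statistics import NormalDist


def mg1_wait_mean_var(lam, s1, s2, s3):
    """Mean and variance of the M/G/1 waiting time.

    lam is the Poisson arrival rate; s1, s2, s3 are the first three
    raw moments of the service time.
    """
    rho = lam * s1
    if not 0.0 <= rho < 1.0:
        raise ValueError("load must satisfy 0 <= rho < 1")
    mean = lam * s2 / (2.0 * (1.0 - rho))
    second_moment = 2.0 * mean ** 2 + lam * s3 / (3.0 * (1.0 - rho))
    return mean, second_moment - mean ** 2


def gaussian_total_delay_quantile(nodes, p):
    """Normal approximation of the p-quantile of the summed waiting time.

    nodes is a list of (lam, s1, s2, s3) tuples, one per independent node.
    """
    total_mean = 0.0
    total_var = 0.0
    for lam, s1, s2, s3 in nodes:
        mean, var = mg1_wait_mean_var(lam, s1, s2, s3)
        total_mean += mean  # means add for any sum
        total_var += var    # variances add because nodes are independent
    z = NormalDist().inv_cdf(p)
    return total_mean + z * math.sqrt(total_var)


if __name__ == "__main__":
    # Hypothetical path: four identical nodes with exponential service
    # (mean 1, so raw moments 1, 2, 6) at 50% load.
    path = [(0.5, 1.0, 2.0, 6.0)] * 4
    print(gaussian_total_delay_quantile(path, 0.99))
```

The waiting-time distribution at each node has a point mass at zero and a skewed tail, so with few nodes the normal fit is poor in the far tail, which is consistent with the abstract's remark that the Gaussian method gives only crude results.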

On the Modeling of Multi-Point RTT Passive Measurements for Network Delay Monitoring

IEEE Transactions on Network and Service Management, 2019

Many network management actions need a simultaneous consideration of several elements' state. This is becoming an even more complex matter with the advent of reconfigurable deployments, where scaling functions up can prevent performance bottlenecks. Therefore, fine-grained detection of significant burdens arises as a cornerstone to optimize their monitoring and operation. We present AdPRISMA (Advanced distributed Passive Retrieval of Information, and Statistical Multi-point Analysis), a passive monitoring system intended to fit models for network delay measurements with clustering elements to improve representation of central and extreme behaviors. As distinguishing features, it relies on cost-effective multi-point round-trip time (RTT) passive network measurements, and is able to select a suitable parametric model optimizing the tradeoff between fitting and complexity. AdPRISMA can correlate records collected from several vantage points and detect where performance issues are most likely to appear; adjust alarms in terms of the probability of events; and adapt its behavior to dynamic network conditions while presenting a fair identification of anomalous situations. We evaluate AdPRISMA with experiments both in virtual environments and with real-world data to provide evidence of its applicability and capabilities to represent network elements' delay. Index Terms: network monitoring, network delay, round-trip time, probability, passive measurements, performance management, pro-active management.