Managing Quality of Service Using sFlow

Peter Phaal, sFlow.org

Overview

Managing the utilization of links in a network provides an effective way to ensure that quality of service goals are met. sFlow provides a scalable means of managing utilization across all the links in a network.

1. Over-provision

In the paper The economics of the Internet: Utility, utilization, pricing, and Quality of Service, Andrew Odlyzko (AT&T - Research) argues that most network links are overprovisioned and that the economic forces that shape networks will continue to favor overprovisioning as a means of ensuring quality of service.

An overprovisioned network of wire-speed, non-blocking switches will have very predictable latency and very low packet loss. Latency will be a simple function of switching speed, link speed, and geographic distance. Quality of service can be maintained by ensuring that the network remains overprovisioned as network traffic grows.

2. Monitor Utilization

Since quality of service policies are expressed in terms of delay and packet loss, it is tempting to use delay and loss measurements to manage the network. Unfortunately, this approach tends to provide very little warning of impending performance problems.

The problem with delay as a control metric is its non-linear behavior. Theoretical and empirical studies show that delay stays fairly constant as network utilization increases. It is only when the utilization approaches network capacity that delay rapidly increases (see Analysis of Measured Single-Hop Delay from an Operational Backbone Network, Konstantina Papagiannaki, Sue Moon, Chuck Fraleigh, Patrick Thiran, Fouad Tobagi, and Christophe Diot, Sprint Labs).
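The shape of this curve can be illustrated with a simple queueing model. The sketch below assumes an M/M/1 queue, in which mean delay equals the packet service time divided by (1 - utilization). The model, the 1500-byte packet size, and the 100Mb/s link speed are assumptions chosen for illustration; they are not taken from the cited study, which is based on measurements.

```python
def mm1_delay_ms(utilization, packet_bits=12000, link_bps=100e6):
    """Mean delay (queueing plus transmission) of an M/M/1 queue, in milliseconds.

    packet_bits and link_bps are illustrative assumptions
    (a 1500-byte packet on a 100Mb/s link).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    service_time_s = packet_bits / link_bps
    return 1000 * service_time_s / (1 - utilization)


# Delay changes little at low utilization and climbs steeply near capacity.
for u in (0.1, 0.5, 0.8, 0.9, 0.99):
    print(f"{u:4.0%} utilization: {mm1_delay_ms(u):.2f} ms")
```

Under these assumptions, delay only doubles between an idle link and 50% utilization, but grows by a further factor of 50 between 50% and 99%.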

The following simple example illustrates the problem. Suppose traffic on a 100Mb/s link is growing linearly at a rate of 10Mb/s per month. If the utilization is measured on the link, the growth in traffic is readily apparent.

Typically a utilization limit would be set for the link that ensures minimal delay and packet loss across the link. In this case, quality of service problems occur if the link utilization reaches 80%, so a threshold of 50% is set. In this example, the threshold would trigger three months before any noticeable degradation in quality of service would occur, providing plenty of time for proactive action to be taken.
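The arithmetic behind the three-month warning can be written out as a short sketch. The link speed, growth rate, 50% threshold, and 80% limit come from the example; the starting utilization of 20% is an assumption, and the warning period does not depend on it as long as it is below the threshold.

```python
def months_until(level_pct, current_pct, growth_pct_per_month):
    """Months until linearly growing utilization reaches level_pct."""
    if growth_pct_per_month <= 0:
        raise ValueError("growth must be positive")
    return max(0.0, (level_pct - current_pct) / growth_pct_per_month)


LINK_MBPS = 100
GROWTH_MBPS_PER_MONTH = 10
growth_pct = 100 * GROWTH_MBPS_PER_MONTH / LINK_MBPS  # 10% of capacity per month

current_pct = 20  # assumed starting utilization, not given in the example
to_threshold = months_until(50, current_pct, growth_pct)
to_limit = months_until(80, current_pct, growth_pct)
warning_months = to_limit - to_threshold

print(f"threshold reached in {to_threshold:.0f} months, "
      f"limit in {to_limit:.0f} months, warning of {warning_months:.0f} months")
```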

Suppose that instead of focusing on utilization, the network management system is built around delay measurements. A chart trending delay across the link would look something like:

Suppose the service level agreement required a 2 millisecond per link delay. The chart shows delay rapidly deteriorating within the period of a month, going from under one tenth of the acceptable delay to 10 times the maximum acceptable delay. The delay measurements provide no warning to the network manager: by the time a problem is noticed, it is too late to avoid violating the service level agreement.

Delay measurements tend to identify problems only after they have become serious, resulting in reactive and poor control of the network, whereas utilization measurements tend to provide earlier warnings, allowing proactive actions that avoid quality of service problems.

3. Identify Traffic Sources

Pareto analysis is a basic technique of quality control. The technique involves identifying the major contributors to a problem and fixing them before moving on to smaller contributors. This same strategy is applicable to utilization management.

The first application of this technique is to measure the utilization of all the links in the network, identifying the most heavily utilized links. Typically only a small fraction of the links is busy and attention paid to these links will ensure that quality of service targets are met. sFlow provides real-time link utilization information from all the links in the network, making it easy to identify the busy links.
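A minimal sketch of this first step follows. It assumes that utilization is derived from two readings of an interface octet counter and the interface speed, which is the kind of data carried in interface counter samples. The function names, link names, and counter values are hypothetical, and counter wrap is ignored for brevity.

```python
def utilization_pct(octets_start, octets_end, interval_s, if_speed_bps):
    """Percent utilization from two octet counter readings taken interval_s apart."""
    bits = (octets_end - octets_start) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)


def busiest_links(utilization_by_link, threshold_pct=50.0):
    """Return (link, utilization) pairs at or above the threshold, busiest first."""
    busy = [(link, pct) for link, pct in utilization_by_link.items()
            if pct >= threshold_pct]
    return sorted(busy, key=lambda item: item[1], reverse=True)


# Hypothetical readings: (octets at start, octets at end) over 60 seconds
# on 100Mb/s links.
readings = {
    "core1:eth0": (0, 540_000_000),
    "core1:eth1": (0, 60_000_000),
    "edge3:eth4": (0, 412_500_000),
    "edge7:eth2": (0, 22_500_000),
}
utilization = {link: utilization_pct(start, end, 60, 100e6)
               for link, (start, end) in readings.items()}

for link, pct in busiest_links(utilization):
    print(f"{link}: {pct:.1f}%")
```

Only the two links above the 50% threshold are reported, so attention goes to the small set of links that matter.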

Identification of busy links occurs on two time scales: long-term trending reveals growth in demand that needs to be addressed by capacity planning, while short-term changes in load need to be managed to mitigate their effect.

In addition to providing the accurate, real-time link utilization data needed to identify busy links, sFlow also provides flow data that can be used to identify the applications and sources of traffic on the busy link.

For capacity planning purposes it is very important to be able to perform a Pareto analysis of the sources of traffic on the link. Studies have shown that most traffic is due to small numbers of applications and sources (see An Analysis of Internet Content Delivery Systems, Stefan Saroiu, Krishna P. Gummadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy, University of Washington). Understanding the sources of traffic allows more effective controls to be imposed, including eliminating undesirable traffic (such as peer-to-peer file sharing), rescheduling disk backup traffic, or relocating a server.
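A Pareto analysis of traffic sources might be sketched as follows. The sketch assumes that each packet sample has been reduced to a (source address, frame length) pair and that byte totals are estimated by multiplying the sampled bytes by the sampling rate. The function name, the 1-in-1000 sampling rate, and the sample data are hypothetical.

```python
from collections import Counter


def top_sources(samples, sampling_rate, share=0.8):
    """Smallest set of top sources covering at least `share` of estimated bytes.

    samples: iterable of (source, frame_bytes) pairs, one per sampled packet.
    sampling_rate: N for 1-in-N packet sampling; scales samples to estimates.
    """
    totals = Counter()
    for source, frame_bytes in samples:
        totals[source] += frame_bytes * sampling_rate
    grand_total = sum(totals.values())

    result, running = [], 0
    for source, est_bytes in totals.most_common():
        if grand_total == 0 or running / grand_total >= share:
            break
        running += est_bytes
        result.append((source, est_bytes))
    return result


# Hypothetical samples collected from one busy link.
samples = ([("10.0.0.5", 1500)] * 6 + [("10.0.0.9", 1500)] * 3 +
           [("10.0.0.20", 500), ("10.0.0.31", 500), ("10.0.0.42", 500)])

for source, est_bytes in top_sources(samples, sampling_rate=1000):
    print(f"{source}: about {est_bytes / 1e6:.1f} MB")
```

In this data, two of the five sources account for 90% of the estimated traffic, so any control action would start with them.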

Another option for handling large traffic sources is to charge for network usage above certain limits. Usage-based charges have two beneficial effects: limiting traffic and generating revenue to cover the additional bandwidth costs. The flow data provided by sFlow accurately measures traffic for large sources and can be used for volume-based billing.
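As a rough illustration of charging above a limit, the sketch below bills only the estimated volume that exceeds an included allowance. The allowance, the price per gigabyte, and the function name are hypothetical; they stand in for whatever tariff an operator actually defines.

```python
def usage_charge(est_bytes, included_bytes, price_per_gb):
    """Charge for the estimated volume above the included allowance."""
    excess_gb = max(0, est_bytes - included_bytes) / 1e9
    return excess_gb * price_per_gb


# Hypothetical tariff: 100 GB included, 2.00 per additional GB.
print(usage_charge(150e9, 100e9, 2.0))  # source that exceeded its allowance
print(usage_charge(50e9, 100e9, 2.0))   # source that stayed within it
```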

Unexpected loads can occur because of the sudden popularity of a site, the "Slashdot Effect" (see The Slashdot Effect, Stephen Adler), or because of a denial of service attack, or because of network failure. sFlow provides the real-time flow information needed to accurately characterize the nature and source of these unexpected surges in traffic. Again the technique of Pareto analysis is used to group the traffic in various ways to identify the main cause of the unexpected traffic (see Controlling High Bandwidth Aggregates in the Network, Ratul Mahajan, Steven M. Bellovin, Sally Floyd, John Ioannidis, Vern Paxson and Scott Shenker, ICSI Center for Internet Research (ICIR) and AT&T Labs Research). Once the cause of the excess traffic is identified, appropriate control actions can be implemented. These might include rate limiting, replicating content, changing DNS settings to redirect traffic, or blocking using access control lists.
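Grouping the traffic in several ways can be sketched as below: the same flow records are totalled by each candidate key, and the grouping whose largest member carries the biggest share of the bytes is reported. This is a simplified illustration, not the method of the cited paper. The record fields, the function name, and the data are hypothetical.

```python
from collections import Counter


def dominant_aggregate(flows, keys):
    """Return (key, value, share) for the single largest aggregate across keys.

    flows: list of dicts with a "bytes" field plus the fields named in keys.
    Returns None when there is no traffic to analyze.
    """
    total = sum(flow["bytes"] for flow in flows)
    if total == 0:
        return None
    best = None
    for key in keys:
        by_value = Counter()
        for flow in flows:
            by_value[flow[key]] += flow["bytes"]
        value, nbytes = by_value.most_common(1)[0]
        share = nbytes / total
        if best is None or share > best[2]:
            best = (key, value, share)
    return best


# Hypothetical flow records observed during a surge.
flows = [
    {"src": "198.51.100.1", "dst": "192.0.2.10", "dst_port": 80, "bytes": 400},
    {"src": "198.51.100.2", "dst": "192.0.2.10", "dst_port": 80, "bytes": 350},
    {"src": "198.51.100.3", "dst": "192.0.2.10", "dst_port": 443, "bytes": 150},
    {"src": "198.51.100.1", "dst": "192.0.2.20", "dst_port": 25, "bytes": 100},
]

key, value, share = dominant_aggregate(flows, ["src", "dst", "dst_port"])
print(f"largest aggregate: {key}={value} carries {share:.0%} of bytes")
```

Here no single source dominates, but one destination receives most of the traffic, which points toward controls such as replicating content or redirecting traffic for that destination rather than blocking one sender.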

Conclusion

Managing network traffic and link utilization is the key to ensuring quality of service on a network. sFlow provides scalable, real-time information from all the links in the network, simplifying the task of bandwidth management and ensuring that quality of service goals can be met.