Posted by: Eric Siegel
End of the month, and I've been going over the list of the "dialogues" I had with Burton Group clients to see if there's something that might be of general interest.
So I uncovered one that I participated in with my colleague Ken Agress of Burton Group's consulting organization, and it's interesting because I think that the topic of QoS congestion on links into remote branches isn't given enough consideration in QoS network design.
The basic symptom of the problem was that despite QoS on the routers, there was excessive packet loss on high-priority flows into the branches on a hub-and-spoke network design.
The cause was that the server-room router was trying to push more data into the network cloud than the access link of a specific branch office could carry. The discards weren't occurring at the server-room boundary router into the cloud, because the fat access link from the server room will easily accommodate lots of traffic to the branch, especially if the other branches aren't receiving much traffic at that time.
If you have a large bandwidth access link into a cloud, but just a small bandwidth access link out of the cloud to a branch office destination, the cloud will need to discard data packets on high-bandwidth flows or during bursts, because the cloud itself (i.e., the backbone routers) has minimal buffering.
True, a single TCP flow will self-adjust (by "self-clocking") to a bandwidth-restricted exit from the cloud because its acknowledgements will be delayed, but if there are a lot of parallel flows, or there are non-flow-controlled UDP flows, you'll be in trouble. TCP flows will get the "global synchronization problem," in which all TCP flows ramp up in bandwidth concurrently, lose packets concurrently, reset their flow rates downwards concurrently, and then repeat. RED or WRED or FRED mechanisms can help avoid global synchronization, but you'll still lose packets.
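To make the RED idea a little more concrete, here is a minimal Python sketch of the classic RED drop decision. It is an illustration only, not any vendor's implementation: the function names, thresholds, and weight are assumptions chosen for the example. RED tracks a smoothed average of the queue depth and starts dropping a small random fraction of packets before the queue is full, so that flows back off at different moments instead of all at once.

```python
import random


def update_average(avg_queue, current_queue, weight=0.002):
    """Exponentially weighted moving average of the queue depth.

    The weight is an assumed example value; a small weight makes the
    average ignore short bursts.
    """
    return (1.0 - weight) * avg_queue + weight * current_queue


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Classic RED: no drops below min_th, a linear ramp up to max_p
    between min_th and max_th, and every packet dropped at or above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def should_drop(avg_queue, min_th, max_th, max_p, rng=random.random):
    """Randomized drop decision for one arriving packet."""
    return rng() < red_drop_probability(avg_queue, min_th, max_th, max_p)


# WRED applies the same curve with different thresholds per traffic class.
# These two profiles are hypothetical: low-priority traffic starts being
# dropped at a much shallower average queue depth than high-priority traffic.
WRED_PROFILES = {
    "low": {"min_th": 10, "max_th": 30, "max_p": 0.2},
    "high": {"min_th": 30, "max_th": 60, "max_p": 0.05},
}
```

With these assumed profiles, an average queue depth of 25 packets already exposes low-priority packets to random drops while high-priority packets are untouched. Either way, packets are still being lost; RED only spreads the losses out in time.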
So, how should this be handled? If you own the cloud or can signal to the cloud's administrator (e.g., by setting DSCP fields or EXP tags on packets going into an MPLS cloud), then the cloud can pick which packets to discard if the exit bandwidth isn't sufficient at the branch office. And the tags will, we hope, tell the cloud to discard packets from low-importance flows. RED / WRED / FRED on those flows will then get those low-importance flows to cut their bandwidth use. Those low-importance flows will be losing packets and will be miserable, but the higher-importance flows will get through.
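As a rough illustration of what "setting DSCP fields" means at the bit level, here is a small Python sketch. Marking is normally done on routers or switches at the network edge, so this host-side version is only an assumption-laden example: the helper names are hypothetical, and whether the operating system and the cloud honor the marking depends on the platform and on the provider's policy. The DSCP value occupies the upper six bits of the former IPv4 TOS byte, which is why it is shifted left by two.

```python
import socket

# Standard DSCP code points (decimal values).
DSCP_EF = 46           # Expedited Forwarding, typically voice
DSCP_AF11 = 10         # Assured Forwarding class 1, low drop precedence
DSCP_BEST_EFFORT = 0   # Default forwarding


def dscp_to_tos(dscp):
    """Convert a 6-bit DSCP value to the 8-bit TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def mark_socket(sock, dscp):
    """Ask the OS to mark outgoing IPv4 packets on this socket.

    Hypothetical helper for illustration; support for IP_TOS varies
    by operating system.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
```

For example, `mark_socket(sock, DSCP_EF)` on a UDP socket would request a TOS byte of 184 (0xB8) on its outgoing packets.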
A better idea, we think, is to avoid letting the cloud do your packet discards -- especially if you don't actually own/administer the cloud directly. Don't give it more traffic than it can handle; use traffic shaping on subinterface definitions to ensure that you never feed more data into the cloud than can exit at the destination. You will probably do a better job of discriminating among flow types and of smoothing brief traffic spikes (e.g., your routers probably have larger buffers than the cloud's routers have).
For example, see the following Cisco documents:
Regulating Packet Flow Using Traffic Shaping: http://www.cisco.com/en/US/docs/ios/qos/configuration/guide/reg_pkt_flow_shaping.html
Applying QoS Features to Ethernet Subinterfaces: http://www.cisco.com/en/US/tech/tk543/tk545/technologies_tech_note09186a0080114326.shtml
Or you could install a Blue Coat PacketShaper in the server room's exit path to do the traffic shaping for both TCP and UDP. And the new PacketShapers can also do compression.
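The shaping idea above can be sketched in a few lines of Python. This is a simplified token-bucket model, not router or PacketShaper configuration: the class name, the 1.5 Mbps branch rate, and the burst size are assumptions picked for the example. One shaper per branch holds packets in a queue at the server-room side and releases them no faster than the branch's access link can drain them, so the excess waits in your buffer instead of being discarded in the cloud.

```python
from collections import deque


class TokenBucketShaper:
    """Per-branch shaper: queues packets and releases them at a capped rate."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8.0
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.queue = deque()              # packet sizes in bytes, FIFO
        self.last_time = 0.0

    def enqueue(self, packet_len):
        """Hold a packet (represented by its size) until tokens allow it."""
        self.queue.append(packet_len)

    def release(self, now):
        """Return the sizes of the packets that may be sent at time `now` (seconds)."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill tokens for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        sent = []
        while self.queue and self.queue[0] <= self.tokens:
            size = self.queue.popleft()
            self.tokens -= size
            sent.append(size)
        return sent


# Assumed example: a 1.5 Mbps branch link and a 15,000-byte burst allowance.
shaper = TokenBucketShaper(rate_bps=1_500_000, burst_bytes=15_000)
for _ in range(1000):
    shaper.enqueue(1500)               # a large burst arrives at once

sent_bytes = sum(shaper.release(0.0))  # only the initial burst goes out
for tick in range(1, 101):             # then poll every 10 ms for one second
    sent_bytes += sum(shaper.release(tick * 0.01))
```

Over that one simulated second the shaper sends at most the burst allowance plus one second's worth of the branch rate, no matter how much is queued behind it. A real deployment would add per-class queues in front of the shaper so that the high-priority flows are released first.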

I completely agree. If some QoS mechanisms are not employed, then packet loss in the cloud is almost inevitable. When using a WAN optimization appliance to deliver QoS, the best method is to match the traffic entering the WAN cloud at the headquarters router to what the remote end is capable of receiving. As an example, if the headquarters router has a 45 Mbps WAN link and the branch WAN bandwidth is only 1.5 Mbps, then the headquarters end should limit the traffic destined for the branch to 1.5 Mbps. This prevents indiscriminate packet loss in the cloud and lets the WAN optimization device decide intelligently how much bandwidth the various applications should receive.
Patrick Wood
www.exinda.com
Posted by: Patrick Wood | November 02, 2009 at 09:53 AM
In a modern MPLS network, it's very important to consider the fact that "any to any" means exactly that. Most organisations, especially larger ones, no longer have only one active data centre - they have load balanced or load shared data centres. Consider what happens when you have (at least) two traffic sources pushing traffic to the branch. It's even worse! Neither site knows what the other is sending so you can't even simply restrict the traffic volume to the capacity of the branch site. What's required is an intelligent "air traffic control" approach, where the traffic is queued at the source, according to business priority - critical traffic takes precedence (of course!). Not sending the traffic in the first place (until you know it will arrive successfully) is much better than sending (and losing) it over and over again. Developing this approach with co-operation between the queueing devices in the data centres leads to a properly managed uncongested branch site receiving traffic at very close to line speed. Indeed, in this situation, RED (or WRED or FRED) is no longer required to provoke TCP to control itself as the control is applied before the traffic ever hits the WAN. Also, you don't need a device on the branch site unless you have a specific requirement for compression or acceleration services, so it can be very cost-efficient too.
Mark Burton. Product Management Director, Ipanema Technologies
Posted by: Mark Burton | December 14, 2009 at 05:31 AM