On Bufferbloat

It has been some time since the bufferbloat problem was detected and described. Nevertheless, some people still doubt its existence or deny that it is in fact a problem. The best way to convince these people is to demonstrate the effects of bufferbloat. At least there is hope that a demonstration is effective, since very few people grasp the issue from an intellectual explanation alone.

Demonstration

Residential Internet

All you need is a BitTorrent client and ping. Take a baseline measurement by checking the round trip time to a Google Public DNS server:

$ ping -c4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=48 time=12.2 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=48 time=12.2 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=48 time=13.7 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=48 time=10.1 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 10.172/12.126/13.755/1.284 ms
    
Now start some heavy BitTorrent traffic. This can be done (legally) by downloading some of the popular films from VODO. (Depending on when you read this, you might need to find a different example.) Do not restrict the up- or download rate. Wait a bit for BitTorrent to establish connections and ramp up, then repeat the ping:

$ ping -c4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=48 time=4857 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=48 time=3854 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=48 time=2851 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=48 time=1844 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3023ms
rtt min/avg/max/mdev = 1844.230/3351.948/4857.814/1122.911 ms, pipe 4
    
With lots of upload connections I have observed RTTs of more than 30s using a cable Internet connection (32Mbps down / 2Mbps up).
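
To watch the effect over a longer period, you can record a continuously running ping with timestamps while the upload saturates the link. The following is only a sketch using the Linux iputils ping; the log file name and the 0.2s interval are arbitrary choices of mine:

# record timestamped round trip times until interrupted with Ctrl-C
ping -D -i 0.2 8.8.8.8 | tee bufferbloat-ping.log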

Mobile Internet

With UMTS you can experience bufferbloat under the lightest of network loads; no BitTorrent is needed at all. The following was recorded while reading mail with mutt over an SSH connection:

--- 131.246.1.1 ping statistics ---
2562 packets transmitted, 2539 received, 0% packet loss, time 2568076ms
rtt min/avg/max/mdev = 195.489/1723.379/19891.544/1790.888 ms, pipe 20
    

As you can see, at least one packet was buffered for nearly 20 seconds.

Another statistic over UMTS, this time while downloading some files. There was more than one minute of buffering in the network!

--- 131.246.1.1 ping statistics ---
83010 packets transmitted, 80953 received, 2% packet loss, time 83439375ms
rtt min/avg/max/mdev = 116.252/2372.160/76504.755/2679.173 ms, pipe 76
    

Yet another statistic, showing 84s of buffering and 9% packet loss. This session consisted of some web surfing and SSH connections. The interactive SSH sessions were nearly unusable, while web surfing worked astonishingly well (HTTP prioritization by the ISP?).

--- 131.246.1.1 ping statistics ---
3446 packets transmitted, 3131 received, 9% packet loss, time 3461484ms
rtt min/avg/max/mdev = 111.743/2533.525/83806.501/8805.208 ms, pipe 84
    

Mobile Internet not only buffers (i.e. delays) packets, it sometimes duplicates them as well, and a little packet loss remains. These measurements were taken with SSH traffic, reading and writing email using mutt on a remote computer.

--- 8.8.8.8 ping statistics ---
761 packets transmitted, 761 received, +6 duplicates, 0% packet loss, time 2282381ms
rtt min/avg/max/mdev = 75.969/811.447/25267.421/2580.335 ms, pipe 9
--- sushi.unix-ag.uni-kl.de ping statistics ---
1769 packets transmitted, 1767 received, +9 duplicates, 0% packet loss, time 1770575ms
rtt min/avg/max/mdev = 65.530/567.948/21371.611/2010.884 ms, pipe 22

Enterprise Switch

With all the network equipment vendors' claims about huge packet buffers in their switches, I wanted to test how this actually works out. The device under test (DUT) is an Enterasys (now Extreme) Networks SSA model SSA-T1068-0652 switch. According to the data sheet, this switch has a packet buffer of 1.5GB.

To test delays introduced by packet buffers, I connected an Extreme X460 switch to a Gigabit port of the SSA, but configured the port to use 100Mbps only. I then connected an Extreme X460-G2 switch to a Gigabit port of the SSA using 1Gbps port speed. I connected my laptop to the X460-G2 switch and used ping from the laptop to the X460 to measure delay.

+-------------+                +-------------+                +-------------+
|    Switch   |                |    Switch   |                |    Switch   |
|   X460-G2   |     1Gbps      | SSA150 Class|     100Mbps    |     X460    |
| Traffic Gen.+----------------+             +----------------+  Ping Resp. |
|             |                |     DUT     |                |  10.0.10.11 |
+------+------+                +-------------+                +-------------+
       |
       | 100Mbps
       |
+------+------+
|    Laptop   |
| Ping Sender |
+-------------+
    

Without any other traffic, the ping measurement looked as follows. This provides the baseline for the subsequent measurements.

--- 10.0.10.11 ping statistics ---
60 packets transmitted, 60 received, 0% packet loss, time 59011ms
rtt min/avg/max/mdev = 0.723/1.112/3.895/0.735 ms
    

To generate traffic I used the EXOS Service Verification Tool (ESVT) on the X460-G2 switch to create 10 minutes of 110Mbps traffic towards the X460 switch. When the traffic generator started, the ping latencies increased roughly linearly before stabilizing at about 340ms.

PING 10.0.10.11 (10.0.10.11) 56(84) bytes of data.
64 bytes from 10.0.10.11: icmp_seq=1 ttl=62 time=1.57 ms
64 bytes from 10.0.10.11: icmp_seq=2 ttl=62 time=0.819 ms
64 bytes from 10.0.10.11: icmp_seq=3 ttl=62 time=0.834 ms
64 bytes from 10.0.10.11: icmp_seq=4 ttl=62 time=0.893 ms
64 bytes from 10.0.10.11: icmp_seq=5 ttl=62 time=0.860 ms
64 bytes from 10.0.10.11: icmp_seq=6 ttl=62 time=81.7 ms
64 bytes from 10.0.10.11: icmp_seq=7 ttl=62 time=169 ms
64 bytes from 10.0.10.11: icmp_seq=8 ttl=62 time=276 ms
64 bytes from 10.0.10.11: icmp_seq=10 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=13 ttl=62 time=353 ms
64 bytes from 10.0.10.11: icmp_seq=14 ttl=62 time=341 ms
64 bytes from 10.0.10.11: icmp_seq=15 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=16 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=17 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=19 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=20 ttl=62 time=340 ms
    

As can be seen from the missing sequence numbers, a couple of packets were lost. At the end of the measurement the latencies dropped from 340ms back to below 1ms. This is expected, because a 340ms backlog drains completely within the default ping interval of 1s once the traffic generator stops.

64 bytes from 10.0.10.11: icmp_seq=601 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=603 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=604 ttl=62 time=0.880 ms
64 bytes from 10.0.10.11: icmp_seq=605 ttl=62 time=0.844 ms
    

The ping statistics show an essentially constant backlog of 340ms, or roughly 4MB, of packet buffer while the 100Mbps port was overloaded. Packet loss is evident as well: once the buffer was full, additional packets were dropped, and a dropped packet could be either a ping or a traffic generator packet.

--- 10.0.10.11 ping statistics ---
606 packets transmitted, 346 received, 42% packet loss, time 606765ms
rtt min/avg/max/mdev = 0.819/332.236/374.480/53.640 ms
    

This test shows a queueing delay of 340ms induced by uncontrolled packet buffers, more than twice the delay commonly considered acceptable for VoIP (150ms). On a positive note, this is far less than the worst case buffering of 1.5GB (about 129s at 100Mbps) inferred from the data sheet. Perhaps the buffer allocation depends on the port speed in order to bound the added queueing delay, but that is pure speculation, as the vendor does not publicly document the buffering behavior.
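
The arithmetic behind these numbers is simple: buffered bytes times eight, divided by the port speed, gives the added delay. A small shell helper (the function name is mine) reproduces both figures:

# queueing delay = buffered bytes * 8 / port speed in bits per second
qdelay() { awk -v bytes="$1" -v bps="$2" 'BEGIN { printf "%.2f s\n", bytes*8/bps }'; }
qdelay 4250000    100000000   # ~4.25MB backlog at 100Mbps -> 0.34 s (the observed 340ms)
qdelay 1610612736 100000000   # 1.5GiB buffer at 100Mbps   -> 128.85 s (~129s worst case)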

Remarks

Under normal consumer use, i.e. surfing the web using a broadband connection (not UMTS or the like), bufferbloat does not show itself this drastically. That is one reason why it was detected only recently and why its effects are still doubted or downplayed by many.

Residential Internet

One single upload of sufficient size can render a residential Internet connection useless, but most users do not do this (or do nothing else during the upload). VoIP uses small packets and little bandwidth, thus it usually works well. The BitTorrent protocol has been blamed, but the root cause is bloated and uncontrolled buffers that prevent TCP's drop-based congestion control from working. The LEDBAT protocol was developed for BitTorrent: it reacts to delay increases instead of packet drops and thus notices filling buffers (an approach that does not work well together with good AQM). Those buffers will not drain, however, because LEDBAT keeps them filled up to its configured delay target. Other software, e.g. rsync, has gained options to manually limit the bandwidth used, thus neither filling the (bloated) buffers nor utilizing the available bandwidth. Special home router configurations have been used to mitigate the problem by manually restricting the upload bandwidth to less than what is available. TCP's congestion control should do this automatically, but bloated buffers broke it.
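
As a rough illustration of these manual workarounds, the following sketch shows an application-level limit and a simple shaper on a Linux host. The interface name, host names and rates are assumptions, not recommendations; the shaping rate has to stay below the actual uplink rate:

# limit rsync itself to about 150 kB/s of upload (assumed remote path)
rsync --bwlimit=150 -a data/ user@server:backup/
# or shape all uploads on eth0 to stay below an (assumed) 2Mbps uplink
tc qdisc add dev eth0 root tbf rate 1600kbit burst 32kbit latency 50ms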

Internet of Things

The Internet of Things (IoT), the Internet of Everything (IoE, IoX), and all the other names given to devices connected to the Internet that are neither traditional computers nor so-called smart phones, and not part of the network infrastructure, provide additional bufferbloat examples.

In one case I witnessed, access control card readers would contact a server via an IP based network. The software would declare the server unreachable if the server's answer did not arrive within 800ms. One day the card readers seemingly lost connectivity to the server. The network provided a round trip time of 7ms (the server was located in another city), but the response time observed by the card readers was above 2000ms. The responses did actually reach the card readers, but they were discarded there. This was caused by combining a normal Linux kernel with a low power CPU and a half-duplex 10Mbps network interface. The broadcast load in the network segment used to connect the card readers created a steady 2 to 5 second queue in the network stack. The card readers lacked the processing capability to cope with that network, and the bloated buffers did not help, because they held packets for far longer than the application was willing to wait. (Note that some buffering is needed to absorb traffic bursts, e.g. a couple of ARP broadcasts followed by an access control server response.)

At 10Mbps, a buffer of 100kB holds 80ms worth of traffic, while an 800ms response budget allows for at most 1MB of buffered data. 100kB corresponds to between 67 and 1563 IP packets, depending on the packet size (assuming 64B to 1500B packets). The Linux default packet queue of 1000 packets corresponds to anything from 64kB (for 64B packets) to 1.5MB (for 1500B packets), i.e. from about 50ms up to 1.2s of queueing delay at 10Mbps; for average packet sizes above 1000B the queue alone exceeds the 800ms response time. Counting variable sized packets is an especially bad way to dimension a queue, even worse than a fixed byte size. The only useful unit for network queues is time, providing an upper bound for the latency added by queueing delays.
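
The same byte/time conversion, applied to the default Linux queue of 1000 packets at 10Mbps, shows how quickly a packet-counted queue turns into a large delay. This is a small sketch; the packet sizes are just sample values:

# delay of a full 1000-packet queue at 10Mbps for different packet sizes
for sz in 64 100 1000 1500; do
  awk -v n=1000 -v s="$sz" -v bps=10000000 \
    'BEGIN { printf "%4dB packets: %6.0f kB buffered, %6.0f ms queueing delay\n", s, n*s/1000, n*s*8/bps*1000 }'
done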

The solution for the specific implementation was to use a dedicated VLAN (network segment) for the card readers. This reduced the broadcast load to practically zero and resulted in access control response times two orders of magnitude better than before.

Buffer Sizes

Just reducing the buffers to sizes that guarantee low latency does not work well. Buffers are needed to accommodate traffic bursts (as in the infamous microburst problem caused by seemingly too small buffers in some enterprise-range switches, e.g. from Cisco; see http://people.ucsc.edu/~warner/buffer.html for a list of switch buffer sizes). Multiple possible interface speeds compound the problem: 10ms of buffer capacity at 10Gbps is 1s (1000ms) at 100Mbps.

Arista's claimed 100ms buffer at 10Gbps translates to 10s (10000ms) at 100Mbps. But this claim implicitly assumes that all ports are congested at the same time. Arista's 10Gbps BASE-T ports support the usual speeds from 100Mbps to 10Gbps. A buffer of a fixed number of bytes (or packets) does not fit interface speeds varying by two orders of magnitude. A fixed buffer time could be a useful method for triple-speed interfaces (10/100/1000Mbps, 100/1000/10000Mbps), but it is not used. With Arista, the bufferbloat issue is exacerbated by using a shared buffer for all ports (per line card in modular switches). The Arista 7280SE-64 with 9GB of shared packet buffer could delay a packet for up to 7.7s if only one port running at 10Gbps is oversubscribed. That would be 77s (well over a minute) if that port runs at 1Gbps.

In a recent webinar I saw Brocade claim that their top-of-rack Ethernet switches sport the biggest buffers in the industry. This claim shows the naive views many in the industry still hold regarding packet buffers. It shows a lack of market knowledge, too, since the Extreme S-Series (formerly Enterasys) has buffer sizes two orders of magnitude bigger, and Arista provides three orders of magnitude bigger buffers in ToR (a.k.a. leaf) switches.

Microbursts

The dreaded microbursts describe a situation where the long term traffic rate, e.g. measured over 5 minute intervals, is well below the line rate of the interfaces involved, but over much shorter timescales the traffic rate sent towards one interface exceeds that interface's line rate. Switch buffers are used to absorb those fluctuations, but for every given buffer size there exists a burst that is too big to fit into the buffer. In reality, such situations occur and packets are lost, resulting in less goodput than theoretically possible.

Microbursts do happen in reality. One well documented case from 2015 affected the M-Lab. The paper Traffic Microbursts and their Effect on Internet Measurement clearly shows how puzzled network engineers are when observing the effects of bursty traffic. Sadly, real-life IP traffic is very bursty. The burstiness is actually exploited by statistical multiplexing, but who understands statistics anyway?

A colleague told me about another vivid example. Video streams showed 30 second periods of corruption, although the data rate was well below the available bandwidth (on the order of 6Mbps per camera). But every once in a while key frames from several cameras would be sent to the recording server at about the same time. Every key frame, at a size of several megabytes, comprised many IP packets sent at line rate. This resulted in microbursts the switch buffers could not absorb. Losing even one packet of a key frame corrupts the whole frame, resulting in corrupt video until the next complete key frame is received; in this case the key frame interval was 30s. The solution was to condition each camera's traffic by configuring its switch port to 100Mbps instead of 1Gbps. With the uplinks still at 1Gbps, all bursts could then be easily absorbed by the sufficiently sized (shallow) buffers.
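
To see why conditioning the camera ports helps, consider a rough back-of-the-envelope calculation with assumed numbers (4MB key frames, three cameras bursting at once, a 1Gbps uplink to the recording server); these values are illustrative, not from the actual incident:

awk 'BEGIN {
  kf = 4 * 8 * 1024 * 1024      # assumed key frame size: 4MB, in bits
  n  = 3                        # assumed number of cameras bursting simultaneously
  up = 1e9                      # 1Gbps uplink towards the recording server
  t  = kf / 1e9                 # burst duration per camera at 1Gbps access speed
  printf "burst length per camera at 1Gbps:     %3.0f ms\n", t * 1000
  printf "uplink buffer needed at 1Gbps access: %.1f MB\n", (n*1e9 - up) * t / 8 / 1e6
  printf "burst length per camera at 100Mbps:   %3.0f ms (aggregate 300Mbps, no uplink queue builds)\n", kf / 1e8 * 1000
}'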

Network equipment engineers seem to have learned a simple lesson from all the appalling microburst problems. For every well behaved traffic pattern, i.e. a traffic pattern that is below line rate if observed over a sufficiently long time, there is an ideal buffer size absorbing every burst. Since no one knows the exact traffic patterns the equipment will have to endure in the wild, buffers need to be as big as economics allow.

TCP Incast

The relatively "recent" TCP Incast problem is another manifestation of microbursts, this time caused by specific traffic patterns observed inside "web-scale" data centers. The reflexive answer of network device vendors is to increase buffer sizes by providing big buffer or deep buffer switches. A better approach inside the controlled data center environment is the use of Data Center TCP (DCTCP), which uses ECN marking to control the sender rate, resulting in very little buffer use. End-system based mitigations like packet pacing and random delays before sending have proven effective as well.
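
On Linux hosts whose traffic stays within the data center, DCTCP can be enabled as sketched below (kernel 3.18 or later ships the tcp_dctcp module); the switches additionally need ECN marking configured, which is vendor specific and not shown here:

# enable ECN and DCTCP congestion control (only sensible for intra-DC traffic)
modprobe tcp_dctcp
sysctl -w net.ipv4.tcp_ecn=1
sysctl -w net.ipv4.tcp_congestion_control=dctcp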

Since incast problems are usually observed with TCP traffic, where lost packets create both additional latency through the retransmission timeout and reduced goodput because the congestion window collapses after such a timeout, buffering can mitigate the impact of TCP incast problems. Buffering helps only as long as the added queueing delay stays below the minimum retransmission timeout (RTOmin) of the operating system, e.g. 200ms for Linux and 300ms for Windows. If the added queueing delay is greater than that, the retransmission timeout fires and the effect is the same as that of a lost packet. Dynamic buffer sizing with a configurable maximum queueing delay (e.g. 150ms if VoIP needs to be supported, 200ms for pure TCP use, 5ms for synchronously mirrored storage) might be a useful strategy to address the opposing goals of low latency and high goodput in statistically multiplexed packet networks. Two congested ports on the path, or congestion in both directions (delaying ACKs), halves the above numbers.

RFC 970

It is quite simple to come up with traffic patterns that need arbitrarily big buffers to work perfectly. Thinking this through leads to the observation that buffers would need to be infinite. This idea has been researched and documented in RFC 970, On Packet Switches With Infinite Storage. Back in 1985 the TTL field of the IP header actually meant time to live: routers would deduct the time in seconds a packet waited in buffers (but at least 1) from the TTL value. This is not the case anymore. Routers now always deduct 1 from the TTL field when forwarding a packet, ignoring the time it waited in a queue. Thus the fundamental mechanism that bounded the usable buffer size in RFC 970 is gone. Today's switches and routers do not impose a bound on buffer use. An Internet using switches with infinite storage would impose an infinite delay on packet delivery: no message sent would ever reach its destination. Well, the bufferbloat engineers are obviously inspired by that vision. Or they just do not understand that traffic patterns in the real world differ dramatically from their simplified models and overly specific scenarios.

Mitigations

Most real world traffic patterns can be tamed on the sending end systems, without involving the network, by using packet pacing or random delays before sending. One might feel reminded of the end-to-end principle as documented in the 1980s (End-to-End Arguments in System Design).
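
On Linux, sender-side pacing is readily available via the fq qdisc. The following is just a sketch; the interface name and the per-flow rate cap are choices of mine:

# pace outgoing flows; fq uses TCP's internally computed pacing rate
tc qdisc replace dev eth0 root fq
# optionally cap each flow to a fixed rate as well
tc qdisc change dev eth0 root fq maxrate 500mbit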

On the network side, using active queue management or its younger sibling, smart queue management, can improve the network's performance by orders of magnitude, especially concerning latency. Sadly, for every AQM or SQM method, there exists a pathological traffic pattern that works better without AQM/SQM. Thus, the perfect being the enemy of the good, neither AQM nor SQM is widely deployed in 2016.

AQM and SQM

The amount of buffered data needs to be controlled, and feedback for the congestion control algorithm needs to be provided. This can be done using active queue management (AQM). A promising recent AQM algorithm, at least for wired networks, is CoDel, especially in the fq_codel variant as implemented in the Linux kernel. The combination of flow queueing and per-queue AQM as done by fq_codel seems very effective. Since many home routers are based on Linux, this might tackle the problem where it is needed most.
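
On a Linux based router, fq_codel can be combined with a shaper so that the queue forms where the AQM controls it instead of in the modem. This is only a sketch; the interface name and the upstream rate (set slightly below the real uplink rate) are assumptions:

# shape egress to just below an (assumed) 2Mbps uplink, then apply fq_codel
tc qdisc replace dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1800kbit ceil 1800kbit
tc qdisc add dev eth0 parent 1:10 handle 110: fq_codel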

P.S. All assertions in this article are trivially proven by construction, thus I leave all proofs to the interested reader. ;-)

We're all in this bloat together.

