On Bufferbloat

It has been some time since the bufferbloat problem was detected and described. Nevertheless, some people still doubt its existence or deny that it is in fact a problem. The best way to convince these people is to demonstrate the effects of bufferbloat. At least there is hope that a demonstration is effective, since very few people grasp the issue from an intellectual explanation alone.

Demonstration

Residential Internet

All you need is a BitTorrent client and ping. Take a baseline measurement by checking the round trip time to a Google Public DNS server:

$ ping -c4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=48 time=12.2 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=48 time=12.2 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=48 time=13.7 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=48 time=10.1 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 10.172/12.126/13.755/1.284 ms
    
Now start some heavy BitTorrent traffic. This can be done (legally) by downloading some of the popular films from VODO. (Depending on when you read this, you might need to find a different example.) Do not restrict the up- or download rate. Wait a bit for BitTorrent to establish connections and ramp up, then repeat the ping:

$ ping -c4 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=48 time=4857 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=48 time=3854 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=48 time=2851 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=48 time=1844 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3023ms
rtt min/avg/max/mdev = 1844.230/3351.948/4857.814/1122.911 ms, pipe 4
    
With lots of upload connections I have observed RTTs of more than 30s using a cable Internet connection (32Mbps down / 2Mbps up).
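
To watch the effect over a longer period, you can record a continuously running ping with timestamps while the upload saturates the link. The following is only a sketch using the Linux iputils ping; the log file name and the 0.2s interval are arbitrary choices of mine:

# record timestamped round trip times until interrupted with Ctrl-C
ping -D -i 0.2 8.8.8.8 | tee bufferbloat-ping.log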

Mobile Internet

With UMTS you can experience bufferbloat under the lightest of network loads; no BitTorrent is needed at all. The following was recorded while reading mail with mutt over an SSH connection:

--- 131.246.1.1 ping statistics ---
2562 packets transmitted, 2539 received, 0% packet loss, time 2568076ms
rtt min/avg/max/mdev = 195.489/1723.379/19891.544/1790.888 ms, pipe 20
    

As you can see, at least one packet was buffered for nearly 20 seconds.

Another statistic over UMTS, this time while downloading some files. There was more than one minute of buffering in the network!

--- 131.246.1.1 ping statistics ---
83010 packets transmitted, 80953 received, 2% packet loss, time 83439375ms
rtt min/avg/max/mdev = 116.252/2372.160/76504.755/2679.173 ms, pipe 76
    

Yet another statistic, showing 84s of buffering and 9% packet loss. This session consisted of some web surfing and SSH connections. The interactive SSH sessions were nearly unusable, while web surfing worked astonishingly well (HTTP prioritization by the ISP?).

--- 131.246.1.1 ping statistics ---
3446 packets transmitted, 3131 received, 9% packet loss, time 3461484ms
rtt min/avg/max/mdev = 111.743/2533.525/83806.501/8805.208 ms, pipe 84
    

Mobile Internet not only buffers (i.e. delays) packets, it sometimes duplicates them as well, and a little packet loss remains. These measurements were taken with SSH traffic, reading and writing email using mutt on a remote computer.

--- 8.8.8.8 ping statistics ---
761 packets transmitted, 761 received, +6 duplicates, 0% packet loss, time 2282381ms
rtt min/avg/max/mdev = 75.969/811.447/25267.421/2580.335 ms, pipe 9
--- sushi.unix-ag.uni-kl.de ping statistics ---
1769 packets transmitted, 1767 received, +9 duplicates, 0% packet loss, time 1770575ms
rtt min/avg/max/mdev = 65.530/567.948/21371.611/2010.884 ms, pipe 22

Enterprise Switch

With all the network equipment vendors' claims about huge packet buffers in their switches, I wanted to test how this actually works out. The device under test (DUT) is an Enterasys (now Extreme) Networks SSA model SSA-T1068-0652 switch. According to the data sheet, this switch has a packet buffer of 1.5GB.

To test delays introduced by packet buffers, I connected an Extreme X460 switch to a Gigabit port of the SSA, but configured the port to use 100Mbps only. I then connected an Extreme X460-G2 switch to a Gigabit port of the SSA using 1Gbps port speed. I connected my laptop to the X460-G2 switch and used ping from the laptop to the X460 to measure delay.

+-------------+                +-------------+                +-------------+
|    Switch   |                |    Switch   |                |    Switch   |
|   X460-G2   |     1Gbps      | SSA150 Class|     100Mbps    |     X460    |
| Traffic Gen.+----------------+             +----------------+  Ping Resp. |
|             |                |     DUT     |                |  10.0.10.11 |
+------+------+                +-------------+                +-------------+
       |
       | 100Mbps
       |
+------+------+
|    Laptop   |
| Ping Sender |
+-------------+
    

Without any other traffic, the ping measurement looked as follows. This provides the baseline for the subsequent measurements.

--- 10.0.10.11 ping statistics ---
60 packets transmitted, 60 received, 0% packet loss, time 59011ms
rtt min/avg/max/mdev = 0.723/1.112/3.895/0.735 ms
    

To generate traffic I used the EXOS Service Verification Tool (ESVT) on the X460-G2 switch to create 10 minutes of 110Mbps traffic towards the X460 switch. When the traffic generator started, the ping latencies increased roughly linearly before stabilizing at about 340ms.

PING 10.0.10.11 (10.0.10.11) 56(84) bytes of data.
64 bytes from 10.0.10.11: icmp_seq=1 ttl=62 time=1.57 ms
64 bytes from 10.0.10.11: icmp_seq=2 ttl=62 time=0.819 ms
64 bytes from 10.0.10.11: icmp_seq=3 ttl=62 time=0.834 ms
64 bytes from 10.0.10.11: icmp_seq=4 ttl=62 time=0.893 ms
64 bytes from 10.0.10.11: icmp_seq=5 ttl=62 time=0.860 ms
64 bytes from 10.0.10.11: icmp_seq=6 ttl=62 time=81.7 ms
64 bytes from 10.0.10.11: icmp_seq=7 ttl=62 time=169 ms
64 bytes from 10.0.10.11: icmp_seq=8 ttl=62 time=276 ms
64 bytes from 10.0.10.11: icmp_seq=10 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=13 ttl=62 time=353 ms
64 bytes from 10.0.10.11: icmp_seq=14 ttl=62 time=341 ms
64 bytes from 10.0.10.11: icmp_seq=15 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=16 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=17 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=19 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=20 ttl=62 time=340 ms
    

As can be seen from the missing sequence numbers, a couple of packets were lost. At the end of the measurement the latencies dropped from 340ms back to below 1ms. This is expected, because a 340ms backlog drains completely within the default ping interval of 1s once the traffic generator stops.

64 bytes from 10.0.10.11: icmp_seq=601 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=603 ttl=62 time=340 ms
64 bytes from 10.0.10.11: icmp_seq=604 ttl=62 time=0.880 ms
64 bytes from 10.0.10.11: icmp_seq=605 ttl=62 time=0.844 ms
    

The ping statistics show an essentially constant backlog of 340ms, or roughly 4MB, of packet buffer while the 100Mbps port was overloaded. Packet loss is evident as well: once the buffer was full, additional packets were dropped, and a dropped packet could be either a ping or a traffic generator packet.

--- 10.0.10.11 ping statistics ---
606 packets transmitted, 346 received, 42% packet loss, time 606765ms
rtt min/avg/max/mdev = 0.819/332.236/374.480/53.640 ms
    

This test shows a queueing delay of 340ms induced by uncontrolled packet buffers, more than twice the delay commonly considered acceptable for VoIP (150ms). On a positive note, this is far less than the worst case buffering of 1.5GB (about 129s at 100Mbps) inferred from the data sheet. Perhaps the buffer allocation depends on the port speed in order to bound the added queueing delay, but that is pure speculation, as the vendor does not publicly document the buffering behavior.
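
The arithmetic behind these numbers is simple: buffered bytes times eight, divided by the port speed, gives the added delay. A small shell helper (the function name is mine) reproduces both figures:

# queueing delay = buffered bytes * 8 / port speed in bits per second
qdelay() { awk -v bytes="$1" -v bps="$2" 'BEGIN { printf "%.2f s\n", bytes*8/bps }'; }
qdelay 4250000    100000000   # ~4.25MB backlog at 100Mbps -> 0.34 s (the observed 340ms)
qdelay 1610612736 100000000   # 1.5GiB buffer at 100Mbps   -> 128.85 s (~129s worst case)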

Remarks

Under normal consumer use, i.e. surfing the web using a broadband connection (not UMTS or the like), bufferbloat does not show itself this drastically. That is one reason why it was detected only recently and why its effects are still doubted or downplayed by many.

Residential Internet

One single upload of sufficient size can render a residential Internet connection useless, but most users do not do this (or do nothing else during the upload). VoIP uses small packets and little bandwidth, thus it usually works well. The BitTorrent protocol has been blamed, but the root cause is bloated and uncontrolled buffers that prevent TCP's drop-based congestion control from working. The LEDBAT protocol was developed for BitTorrent: it reacts to delay increases instead of packet drops and thus notices filling buffers (an approach that does not work well together with good AQM). Those buffers will not drain, however, because LEDBAT keeps them filled up to its configured delay target. Other software, e.g. rsync, has gained options to manually limit the bandwidth used, thus neither filling the (bloated) buffers nor utilizing the available bandwidth. Special home router configurations have been used to mitigate the problem by manually restricting the upload bandwidth to less than what is available. TCP's congestion control should do this automatically, but bloated buffers broke it.
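
As a rough illustration of these manual workarounds, the following sketch shows an application-level limit and a simple shaper on a Linux host. The interface name, host names and rates are assumptions, not recommendations; the shaping rate has to stay below the actual uplink rate:

# limit rsync itself to about 150 kB/s of upload (assumed remote path)
rsync --bwlimit=150 -a data/ user@server:backup/
# or shape all uploads on eth0 to stay below an (assumed) 2Mbps uplink
tc qdisc add dev eth0 root tbf rate 1600kbit burst 32kbit latency 50ms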

Internet of Things

The Internet of Things (IoT), the Internet of Everything (IoE, IoX), and all the other names given to devices connected to the Internet that are neither traditional computers nor so-called smart phones, and not part of the network infrastructure, provide additional bufferbloat examples.

In one case I witnessed, access control card readers would contact a server via an IP based network. The software would declare the server unreachable if the server's answer did not arrive within 800ms. One day the card readers seemingly lost connectivity to the server. The network provided a round trip time of 7ms (the server was located in another city), but the response time observed by the card readers was above 2000ms. The responses did actually reach the card readers, but they were discarded there. This was caused by combining a normal Linux kernel with a low power CPU and a half-duplex 10Mbps network interface. The broadcast load in the network segment used to connect the card readers created a steady 2 to 5 second queue in the network stack. The card readers lacked the processing capability to cope with that network, and the bloated buffers did not help, because they held packets for far longer than the application was willing to wait. (Note that some buffering is needed to absorb traffic bursts, e.g. a couple of ARP broadcasts followed by an access control server response.)

At 10Mbps, a buffer of 100kB holds 80ms worth of traffic, while an 800ms response budget allows for at most 1MB of buffered data. 100kB corresponds to between 67 and 1563 IP packets, depending on the packet size (assuming 64B to 1500B packets). The Linux default packet queue of 1000 packets corresponds to anything from 64kB (for 64B packets) to 1.5MB (for 1500B packets), i.e. from about 50ms up to 1.2s of queueing delay at 10Mbps; for average packet sizes above 1000B the queue alone exceeds the 800ms response time. Counting variable sized packets is an especially bad way to dimension a queue, even worse than a fixed byte size. The only useful unit for network queues is time, providing an upper bound for the latency added by queueing delays.
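
The same byte/time conversion, applied to the default Linux queue of 1000 packets at 10Mbps, shows how quickly a packet-counted queue turns into a large delay. This is a small sketch; the packet sizes are just sample values:

# delay of a full 1000-packet queue at 10Mbps for different packet sizes
for sz in 64 100 1000 1500; do
  awk -v n=1000 -v s="$sz" -v bps=10000000 \
    'BEGIN { printf "%4dB packets: %6.0f kB buffered, %6.0f ms queueing delay\n", s, n*s/1000, n*s*8/bps*1000 }'
done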

The solution for the specific implementation was to use a dedicated VLAN (network segment) for the card readers. This reduced the broadcast load to practically zero and resulted in access control response times two orders of magnitude better than before.

Buffer Sizes

Just reducing the buffers to sizes that guarantee low latency does not work well. Buffers are needed to accommodate traffic bursts (as in the infamous microburst problem caused by seemingly too small buffers in some enterprise-range switches, e.g. from Cisco; see http://people.ucsc.edu/~warner/buffer.html for a list of switch buffer sizes). Multiple possible interface speeds compound the problem: 10ms of buffer capacity at 10Gbps is 1s (1000ms) at 100Mbps.

Arista's claimed 100ms buffer at 10Gbps translates to 10s (10000ms) at 100Mbps. But this claim implicitly assumes that all ports are congested at the same time. Arista's 10Gbps BASE-T ports support the usual speeds from 100Mbps to 10Gbps. A buffer of a fixed number of bytes (or packets) does not fit interface speeds varying by two orders of magnitude. A fixed buffer time could be a useful method for triple-speed interfaces (10/100/1000Mbps, 100/1000/10000Mbps), but it is not used. With Arista, the bufferbloat issue is exacerbated by using a shared buffer for all ports (per line card in modular switches). The Arista 7280SE-64 with 9GB of shared packet buffer could delay a packet for up to 7.7s if only one port running at 10Gbps is oversubscribed. That would be 77s (well over a minute) if that port runs at 1Gbps.

In a recent webinar I saw Brocade claim that their top-of-rack Ethernet switches sport the biggest buffers in the industry. This claim shows the naive views many in the industry still hold regarding packet buffers. It shows a lack of market knowledge, too, since the Extreme S-Series (formerly Enterasys) has buffer sizes two orders of magnitude bigger, and Arista provides three orders of magnitude bigger buffers in ToR (a.k.a. leaf) switches.

Microbursts

The dreaded microbursts describe a situation where the long term traffic rate, e.g. measured over 5 minute intervals, is well below the line rate of the interfaces involved, but over much shorter timescales the traffic rate sent towards one interface exceeds that interface's line rate. Switch buffers are used to absorb those fluctuations, but for every given buffer size there exists a burst that is too big to fit into the buffer. In reality, such situations occur and packets are lost, resulting in less goodput than theoretically possible.

Microbursts do happen in reality. One well documented case from 2015 affected the M-Lab. The paper Traffic Microbursts and their Effect on Internet Measurement clearly shows how puzzled network engineers are when observing the effects of bursty traffic. Sadly, real-life IP traffic is very bursty. The burstiness is actually exploited by statistical multiplexing, but who understands statistics anyway?

A colleague told me about another vivid example. Video streams showed 30 second periods of corruption, although the data rate was well below the available bandwidth (on the order of 6Mbps per camera). But every once in a while key frames from several cameras would be sent to the recording server at about the same time. Every key frame, at a size of several megabytes, comprised many IP packets sent at line rate. This resulted in microbursts the switch buffers could not absorb. Losing even one packet of a key frame corrupts the whole frame, resulting in corrupt video until the next complete key frame is received; in this case the key frame interval was 30s. The solution was to condition each camera's traffic by configuring its switch port to 100Mbps instead of 1Gbps. With the uplinks still at 1Gbps, all bursts could then be easily absorbed by the sufficiently sized (shallow) buffers.
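
To see why conditioning the camera ports helps, consider a rough back-of-the-envelope calculation with assumed numbers (4MB key frames, three cameras bursting at once, a 1Gbps uplink to the recording server); these values are illustrative, not from the actual incident:

awk 'BEGIN {
  kf = 4 * 8 * 1024 * 1024      # assumed key frame size: 4MB, in bits
  n  = 3                        # assumed number of cameras bursting simultaneously
  up = 1e9                      # 1Gbps uplink towards the recording server
  t  = kf / 1e9                 # burst duration per camera at 1Gbps access speed
  printf "burst length per camera at 1Gbps:     %3.0f ms\n", t * 1000
  printf "uplink buffer needed at 1Gbps access: %.1f MB\n", (n*1e9 - up) * t / 8 / 1e6
  printf "burst length per camera at 100Mbps:   %3.0f ms (aggregate 300Mbps, no uplink queue builds)\n", kf / 1e8 * 1000
}'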

Network equipment engineers seem to have learned a simple lesson from all the appalling microburst problems. For every well behaved traffic pattern, i.e. a traffic pattern that is below line rate if observed over a sufficiently long time, there is an ideal buffer size absorbing every burst. Since no one knows the exact traffic patterns the equipment will have to endure in the wild, buffers need to be as big as economics allow.

TCP Incast

The relatively "recent" TCP Incast problem is another manifestation of microbursts, this time caused by specific traffic patterns observed inside "web-scale" data centers. The reflexive answer of network device vendors is to increase buffer sizes by providing big buffer or deep buffer switches. A better approach inside the controlled data center environment is the use of Data Center TCP (DCTCP), which uses ECN marking to control the sender rate, resulting in very little buffer use. End-system based mitigations like packet pacing and random delays before sending have proven effective as well.
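
On Linux hosts whose traffic stays within the data center, DCTCP can be enabled as sketched below (kernel 3.18 or later ships the tcp_dctcp module); the switches additionally need ECN marking configured, which is vendor specific and not shown here:

# enable ECN and DCTCP congestion control (only sensible for intra-DC traffic)
modprobe tcp_dctcp
sysctl -w net.ipv4.tcp_ecn=1
sysctl -w net.ipv4.tcp_congestion_control=dctcp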

Since incast problems are usually observed with TCP traffic, where lost packets create both additional latency through the retransmission timeout and reduced goodput because the congestion window collapses after such a timeout, buffering can mitigate the impact of TCP incast problems. Buffering helps only as long as the added queueing delay stays below the minimum retransmission timeout (RTOmin) of the operating system, e.g. 200ms for Linux and 300ms for Windows. If the added queueing delay is greater than that, the retransmission timeout fires and the effect is the same as that of a lost packet. Dynamic buffer sizing with a configurable maximum queueing delay (e.g. 150ms if VoIP needs to be supported, 200ms for pure TCP use, 5ms for synchronously mirrored storage) might be a useful strategy to address the opposing goals of low latency and high goodput in statistically multiplexed packet networks. Two congested ports on the path, or congestion in both directions (delaying ACKs), halves the above numbers.

RFC 970

It is quite simple to come up with traffic patterns that need arbitrarily big buffers to work perfectly. Thinking this through leads to the observation that buffers would need to be infinite. This idea has been researched and documented in RFC 970, On Packet Switches With Infinite Storage. Back in 1985 the TTL field of the IP header actually meant time to live: routers would deduct the time in seconds a packet waited in buffers (but at least 1) from the TTL value. This is not the case anymore. Routers now always deduct 1 from the TTL field when forwarding a packet, ignoring the time it waited in a queue. Thus the fundamental mechanism that bounded the usable buffer size in RFC 970 is gone. Today's switches and routers do not impose a bound on buffer use. An Internet using switches with infinite storage would impose an infinite delay on packet delivery: no message sent would ever reach its destination. Well, the bufferbloat engineers are obviously inspired by that vision. Or they just do not understand that traffic patterns in the real world differ dramatically from their simplified models and overly specific scenarios.

Mitigations

Most real world traffic patterns can be tamed on the sending end systems, without involving the network, by using packet pacing or random delays before sending. One might feel reminded of the end-to-end principle as documented in the 1980s (End-to-End Arguments in System Design).
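
On Linux, sender-side pacing is readily available via the fq qdisc. The following is just a sketch; the interface name and the per-flow rate cap are choices of mine:

# pace outgoing flows; fq uses TCP's internally computed pacing rate
tc qdisc replace dev eth0 root fq
# optionally cap each flow to a fixed rate as well
tc qdisc change dev eth0 root fq maxrate 500mbit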

On the network side, using active queue management or its younger sibling, smart queue management, can improve the network's performance by orders of magnitude, especially concerning latency. Sadly, for every AQM or SQM method, there exists a pathological traffic pattern that works better without AQM/SQM. Thus, the perfect being the enemy of the good, neither AQM nor SQM is widely deployed in 2016.

AQM and SQM

The amount of buffered data needs to be controlled, and feedback for the congestion control algorithm needs to be provided. This can be done using active queue management (AQM). A promising recent AQM algorithm, at least for wired networks, is CoDel, especially in the fq_codel variant as implemented in the Linux kernel. The combination of flow queueing and per-queue AQM as done by fq_codel seems very effective. Since many home routers are based on Linux, this might tackle the problem where it is needed most.
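
On a Linux based router, fq_codel can be combined with a shaper so that the queue forms where the AQM controls it instead of in the modem. This is only a sketch; the interface name and the upstream rate (set slightly below the real uplink rate) are assumptions:

# shape egress to just below an (assumed) 2Mbps uplink, then apply fq_codel
tc qdisc replace dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1800kbit ceil 1800kbit
tc qdisc add dev eth0 parent 1:10 handle 110: fq_codel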

P.S. All assertions in this article are trivially proven by construction, thus I leave all proofs to the interested reader. ;-)

We're all in this bloat together.

