Tuesday, May 20, 2014

10G > 1G

10Gb/s ain't what it used to be

It was only a few years ago that 10Gb/s kit cost tens of thousands of dollars and needed massive XenPaks to plug in as optics. It's now 2014, 10Gb/s SFPs cost about $200 each, and the closest I get to a XenPak is the broken one I use as a bottle opener.

Because it's so cheap, it's a no-brainer to put 10Gb/s NICs in your servers, but there's no guarantee that your network can support 10Gb/s the whole way through. You might think that a 1Gb/s bottleneck in your network isn't a big deal, and that TCP can fumble around and find the top speed for your connection, but you might be disappointed to hear that it's not that easy.

TCP is dumb

TCP doesn't have any parameters, internal or external, for how fast it's sending data. It has a window of how much unacknowledged data has been sent, and this window moves along as acknowledgements are received. The size of this window sets an upper bound on the average speed (how tight a bound depends on latency and packet loss), but not on the maximum speed: everything that fits in the window can leave the sender at full line rate. This becomes a problem as bandwidth and latency both increase.
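
To put a number on that upper bound: with no loss, a sender can deliver at most one window per round trip. A minimal shell sketch, assuming a fixed window and ignoring loss and slow start (the function name and the 1MB / 160ms figures are purely illustrative, not from any measurement):

```shell
# Upper bound on average TCP throughput: one window per round trip.
# Usage: max_kbit WINDOW_BYTES RTT_MS  -> prints kbit/s (integer)
max_kbit() {
    # bytes * 8 = bits; bits per millisecond is the same as kbit/s
    echo $(( $1 * 8 / $2 ))
}

max_kbit 1000000 160   # 1MB window, 160ms round trip
```

That works out to 50000 kbit/s, or 50Mb/s, no matter how fast the NIC is.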

Long fat networks

The TCP window keeps track of every byte "in flight" - in other words, all data that has been sent and not acknowledged. Once the window is full, the sender can't send any more data until the oldest data has been acknowledged, and it needs a buffer (window) to track this. To keep the link busy, the smallest this buffer can be is latency x bandwidth, and this number can get very big very quickly. If you're trying to send at 1Gb/s to a destination 160ms away, you need a window of 20MB - if you want to do this at 10Gb/s, you need a window of 200MB! Compared to the 128GB flash drives that clip onto your keyring, this doesn't seem like a huge amount, but to the switches and routers in your network, this is a lot to soak up if your traffic has to go across a slow part of the network.
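
The arithmetic behind those figures, as a small shell sketch. It uses decimal megabytes and treats the 160ms as the round-trip time, which is an assumption; the function name is made up for illustration.

```shell
# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
# Usage: bdp_bytes RATE_MBIT RTT_MS
bdp_bytes() {
    # Mbit/s * ms = kbit; 1 kbit = 125 bytes
    echo $(( $1 * $2 * 125 ))
}

bdp_bytes 1000 160    # 1Gb/s at 160ms
bdp_bytes 10000 160   # 10Gb/s at 160ms
```

The two calls print 20000000 and 200000000, matching the 20MB and 200MB above.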

How 10 goes into 1

If your sender and receiver have 10Gb/s connections, but the network has a 1Gb/s segment in the middle, you can run into interesting problems. With 160ms of latency in the way, your sender dumps 20MB onto the network at 10Gb/s. That burst takes 16ms to arrive at the start of the 1Gb/s segment, and in those 16ms the 1Gb/s segment can only send 2MB - leaving 18MB to be dealt with. If you have big enough buffers, then this will eke out into the network at 1Gb/s and everything will be fine!

However, 20MB is a big buffer - at 1Gb/s it holds 160ms of data. We've all seen buffer bloat (when buffers fill up and stay full and add extra latency to the network), and heard about it being a bad thing, but this is an instance where buffers are *very* important. If you have no buffers, your TCP stream starts up and immediately drops 90% of its packets, and things go very bad, very fast.
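
The split between what gets forwarded, what has to be buffered and what is lost with no buffer can be worked out directly. A rough sketch, assuming the whole burst arrives back-to-back at the faster rate (function and variable names are illustrative):

```shell
# What a slower segment does with a line-rate burst.
# Usage: burst_at_bottleneck BURST_BYTES IN_MBIT OUT_MBIT
burst_at_bottleneck() {
    burst=$1; in_rate=$2; out_rate=$3
    # Bytes the slow segment forwards while the burst is still arriving
    forwarded=$(( burst * out_rate / in_rate ))
    # The remainder must sit in a buffer, or be dropped
    queued=$(( burst - forwarded ))
    # Share of the burst lost if there is no buffer at all
    drop_pct=$(( (in_rate - out_rate) * 100 / in_rate ))
    echo "forwarded=$forwarded queued=$queued drop_pct=$drop_pct"
}

burst_at_bottleneck 20000000 10000 1000
```

For the 20MB burst this prints forwarded=2000000 queued=18000000 drop_pct=90.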

Labbing this up

You can simulate this yourself between two Linux machines with tc and iperf. First, plug them into each other at 10Gb/s, make sure they're tuned (net.ipv4.tcp_rmem/wmem need to have the last parameter set to about 64MB), and test between them. Assuming sub-millisecond latency, you should see very close to 10Gb/s of TCP throughput (if not, the servers are mistuned, or underpowered).

box1: iperf -i1 -s
box2: iperf -i1 -c box1 -t300

Looking good? If not, you're out of luck in this instance - TCP tuning is outside the scope of this post, go and ask ESnet what to do.
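
As a starting point only, here is what "last parameter set to about 64MB" could look like written out. The first two fields of each setting (minimum and default buffer size) are assumptions picked for illustration, not values from this post, and the function name is made up.

```shell
# Print sysctl settings with a ceiling (in MB) on TCP buffer autotuning.
# The min and default values (first two fields) are assumptions.
# Usage: tcp_buffer_sysctls MAX_MB
tcp_buffer_sysctls() {
    max_bytes=$(( $1 * 1024 * 1024 ))
    echo "net.ipv4.tcp_rmem = 4096 87380 $max_bytes"
    echo "net.ipv4.tcp_wmem = 4096 65536 $max_bytes"
}

tcp_buffer_sysctls 64
```

The output is in sysctl.conf format, so it could be saved to a file and loaded as root with sysctl -p FILE. Other limits such as net.core.rmem_max may also matter if the application sets its own buffer sizes.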

Assuming this is all working, we'll add some latency on egress from box1, and see what happens.

sudo tc qdisc add dev eth0 netem delay 50ms limit 10000

Try the iperf again - is it still working well? If you're not averaging 7-8Gb/s then you might want to do some tuning, and come back when it's looking better.
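
One thing worth checking if the numbers fall short is the netem limit, which is counted in packets. netem holds every packet for the length of the delay, so at full rate it needs room for everything sent during 50ms. A back-of-envelope sketch, with the packet sizes as assumptions and the function name made up:

```shell
# Packets netem holds at once: everything sent during one delay period.
# Usage: netem_packets RATE_MBIT DELAY_MS PACKET_BYTES
netem_packets() {
    # Mbit/s * ms * 125 = bytes in the delay line; divide by packet size
    echo $(( $1 * $2 * 125 / $3 ))
}

netem_packets 10000 50 9000   # jumbo frames
netem_packets 10000 50 1500   # standard frames
```

With 9000-byte frames that's 6944 packets, inside the limit of 10000; with 1500-byte frames it's 41666, which is over. Segmentation offload can change what netem counts as one packet, and this sketch ignores that, so treat the result as a hint about where to look rather than a prediction.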

Now we've got a known-good at 50ms, let's try simulating a 1Gb/s bottleneck in the network. Apply an ingress policer to box2 as follows:

sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match ip src 0.0.0.0/0 police rate 1000mbit burst 100k mtu 100k drop flowid :1

Try your iperf again - how does it look? When I tested this, I couldn't get more than 10Mb/s - something is seriously wrong! Let's try and send some UDP through to see what's happening.

box1: iperf -i1 -s -u -l45k
box2: iperf -i1 -u -l45k -c box1 -t300 -b800m

Odd... we can send 800Mb/s of UDP with little or no packet loss, but TCP is stuck at a tiny fraction of that?!
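
One plausible reading of this result (an interpretation, not something measured here) is that it comes down to burst size. iperf spaces its UDP datagrams out, and each one is smaller than the policer's burst allowance, whereas TCP sends a large chunk of its window back-to-back at 10Gb/s. A sketch of that check, assuming the bucket starts full and using round decimal sizes for the 45k datagram and 100k burst:

```shell
# Does a back-to-back burst fit through a token-bucket policer?
# Assumes the bucket starts full and the burst arrives at line rate.
# Usage: policer_verdict BURST_BYTES LINE_MBIT POLICE_MBIT BUCKET_BYTES
policer_verdict() {
    # Tokens refill while the burst arrives, so the bucket only has to
    # cover the excess over the policed rate
    needed=$(( $1 * ($2 - $3) / $2 ))
    if [ "$needed" -le "$4" ]; then
        echo "fits (needs $needed of $4 bytes)"
    else
        echo "overflows (needs $needed of $4 bytes)"
    fi
}

policer_verdict 45000 10000 1000 100000     # one iperf UDP datagram
policer_verdict 6250000 10000 1000 100000   # 1Gb/s x 50ms of TCP data
```

The single datagram fits comfortably; a window's worth of TCP data needs more than fifty times what the bucket holds, so most of it is dropped.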

The fix for this is adding a shaper to make sure nothing gets dropped by the policer. We can add this on box1 in this instance as follows:

sudo tc qdisc del dev eth0 root
sudo tc qdisc add dev eth0 root handle 1: tbf rate 900mbit burst 12000 limit 500m mtu 100000
sudo tc qdisc add dev eth0 parent 1: netem delay 50ms limit 10000

You'll notice we deleted and then re-added the latency - this is just a limitation of how we chain these qdiscs together. But give it a shot - try an iperf between the two boxes with TCP, and magic - you can get 850Mb/s now!
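
The shaper works here because its queue (the limit parameter) holds on to what the policer would otherwise have dropped. To get a feel for how generous that queue is, this sketch estimates how long a completely full queue would take to drain. It treats 500m as 500,000,000 bytes for round numbers (tc's own unit parsing may differ slightly), and the function name is made up.

```shell
# Worst-case time a packet waits in a full shaper queue.
# Usage: queue_delay_ms LIMIT_BYTES RATE_MBIT
queue_delay_ms() {
    # bits / (Mbit/s) = microseconds; divide by 1000 for milliseconds
    echo $(( $1 * 8 / $2 / 1000 ))
}

queue_delay_ms 500000000 900
```

That comes to 4444ms. A single TCP flow can't queue more than its window, so the real figure should be much lower; the limit is simply set high enough that the shaper never has to drop.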

A sustainable fix

We're not going to add shapers everywhere just because we know that some parts of our network have bad bottlenecks. It's okay though - smart people are working on a fix. By adding a knob to TCP where we explicitly say how fast we want it to go, we can make TCP pace itself, instead of emptying out its window onto the network, and then getting upset when it doesn't get magically taken care of. This is still experimental, but I'm keen to hear if anyone has had any luck with it - this is a very important step forward for TCP, and will become gradually more important as our networks get longer and fatter.
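
On Linux, one knob along these lines is the fq qdisc, which paces each flow and accepts a maxrate cap. Whether that is exactly the mechanism meant above is an assumption, and it needs a kernel and iproute2 recent enough to include fq. This sketch only builds the command, using an example device name and rate, so you can look it over before running it:

```shell
# Build a tc command that enables per-flow pacing with a rate cap.
# Assumes the fq qdisc is available; device and rate are examples.
# Usage: fq_pacing_cmd DEVICE RATE_MBIT
fq_pacing_cmd() {
    echo "tc qdisc replace dev $1 root fq maxrate ${2}mbit"
}

fq_pacing_cmd eth0 900
```

Run the printed command as root to apply it. It replaces the root qdisc, so it would remove the netem and tbf setup from the lab above. Applications can also ask for a per-socket rate with the SO_MAX_PACING_RATE socket option.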