It's Always TCP_NODELAY

brooker.co.za

450 points by eieio a day ago


anonymousiam - a day ago

The Nagle algorithm was created back in the day of multi-point networking. Multiple hosts were all tied to the same communications (Ethernet) channel, so they would use CSMA (https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...) to avoid collisions.

CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel. (Each host can have any number of "users.") In fact, most modern (copper, Gigabit+) Ethernet connections have both ends transmitting and receiving AT THE SAME TIME ON THE SAME WIRES. A hybrid is used on the PHY at each end to subtract what is being transmitted from what is being received. Older (10/100 Base-T) gear can do the same thing because each end has dedicated TX/RX pairs. Fiber optic Ethernet can use either the same fiber with different wavelengths, or separate TX/RX fibers. I haven't seen a 10Base-2 Ethernet/DECnet interface for more than 25 years. If any are still operating somewhere, they are still using CSMA.

CSMA is also still used for digital radio systems (WiFi and others). CSMA includes a "random exponential backoff timer" which does a (poor) job of managing congestion. (More modern congestion control methods exist today.) Back in the day, disabling the random backoff timer was somewhat equivalent to setting TCP_NODELAY.

Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense, and TCP_NODELAY should be enabled by default.

eieio - a day ago

I found this article while debugging some networking delays for a game that I'm working on.

It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!

But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.

There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.

drfignewton - 11 hours ago

I swear, it seems like I’ve seen some variation of this 50 times on HN in the past 15 years.

The core issue with Nagle’s algorithm (TCP_NODELAY off) is its interaction with TCP Delayed ACK. Nagle prevents sending small packets if an ACK is outstanding, while the receiver delays that ACK to piggyback it on a response. When both are active, you get a 200ms "deadlock" where the sender waits for an ACK and the receiver waits for more data. This is catastrophic for latency-sensitive applications like gaming, SSH, or high-frequency RPCs.

In modern times, the bandwidth saved by Nagle is rarely worth the latency cost. You should almost always set TCP_NODELAY = 1 for any interactive or request-response protocol. The "problem" only shifts to the application layer: if you disable Nagle and then perform many small write() calls (like writing a single byte at a time), you will flood the network with tiny, inefficient packets.

Proper usage means disabling Nagle at the socket level but managing your own buffering in user-space. Use a buffered writer to assemble a logical message into a single memory buffer, then send it with one system call. This ensures your data is dispatched immediately without the overhead of thousands of tiny headers. Check the Linux tcp(7) man page for implementation details; it is the definitive reference for these behaviors.
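
Something like the following C sketch is what I mean (untested; the 4-byte length-prefix framing is just an illustrative assumption, not something any particular protocol mandates):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Disable Nagle so the kernel sends as soon as we hand it data. */
    static int disable_nagle(int fd)
    {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }

    /* Instead of many small write() calls, assemble the logical message
     * (here an illustrative 4-byte length prefix plus payload) and hand
     * it to the kernel in one writev() call.  Real code still has to
     * handle short writes and errors. */
    static ssize_t send_message(int fd, const void *payload, uint32_t len)
    {
        uint32_t hdr = htonl(len);
        struct iovec iov[2] = {
            { .iov_base = &hdr,            .iov_len = sizeof(hdr) },
            { .iov_base = (void *)payload, .iov_len = len },
        };
        return writev(fd, iov, 2);
    }

Call disable_nagle() once after connect(); each send_message() then puts a whole logical message on the wire with a single syscall.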

dllthomas - a day ago

Wildly, the Polish word "nagle" (pronounced differently) means "suddenly" or "all at once", which is just astonishingly apropos for what I'm almost certain is pure coincidence.

skrebbel - 14 hours ago

Nagle himself says (or said 10y ago) that the real culprit is delayed ACK: https://news.ycombinator.com/item?id=10608356

I’m no expert by any means, but this makes sense to me. Plus, I can’t come up with many modern workloads where delayed ACK would result in significant improvement. That said, I feel the same about Nagle’s algorithm - if most packets are big, it seems to me that both features solve problems that hardly exist anymore.

Wouldn't the modern http-dominated best practice be to turn both off?

x2rj - a day ago

I've always thought a problem with Nagle's algorithm is that the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you could use it after messages that require a timely answer.

For stuff where no answer is required, Nagle's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and others that are more asynchronous (from a user's point of view, not a programmer's).

Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...
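
For what it's worth, on Linux there is at least a crude workaround for the missing flush: tcp(7) documents that setting TCP_NODELAY forces an explicit flush of pending output, so toggling it acts as a poor man's flush(). A minimal sketch (error handling omitted):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Push out anything Nagle is still holding back, then return to the
     * default (Nagle on) behaviour for subsequent writes.  Per tcp(7),
     * setting TCP_NODELAY forces an explicit flush of pending output. */
    static void tcp_flush(int fd)
    {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on,  sizeof(on));
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
    }

TCP_CORK works the other way around: cork, write the pieces, then uncork to release any queued partial frame.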

dan-robertson - 4 hours ago

A few random thoughts:

1. Perhaps on more modern hardware the thing to do with badly behaved senders is not ‘hang on to unfull packets for 40ms’; another policy could still work, e.g. eagerly send the underfilled packet, but wait the amount of time it would take to send a full packet (and prioritize sending other flows) before sending the next underfull packet.

2. In Linux there are packets and then there are (jumbo)packets. The networking stack has some per-packet overhead so much work is done to have it operate on bigger batches and then let the hardware (or a last step in the OS) do segmentation. It’s always been pretty unclear to me how all these packet-oriented things (Nagle’s algorithm, tc, pacing) interact with jumbo packets and the various hardware offload capabilities.

3. This kind of article comes up a lot (mystery 40ms latency -> set TCP_NODELAY). In the past I’ve tried to write little test programs in a high level language to listen on tcp and respond quickly, and in some cases (depending on response size) I’ve seen strange ~40ms latencies despite TCP_NODELAY being set. I didn’t bother looking in huge detail (eg I took a strace and tcpdump but didn’t try to see non-jumbo packets) and failed to debug the cause. I’m still curious what may have caused this?

otterley - a day ago

(2024) - previously discussed at https://news.ycombinator.com/item?id=40310896

rwmj - 15 hours ago

I'm surprised the article didn't also mention MSG_MORE. On Linux it hints to the kernel that "more is to follow" (when sending data on a socket) so it shouldn't send it just yet. Maybe you need to send a header followed by some data. You could copy them into one buffer and use a single sendmsg call, but it's easier to send the header with MSG_MORE and the data in separate calls.

(io_uring is another method that helps a lot here, and it can be combined with MSG_MORE or with preallocated buffers shared with the kernel.)
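
For anyone who hasn't used it, the pattern is roughly this (illustrative sketch, error handling omitted):

    #include <stddef.h>
    #include <sys/socket.h>

    /* Queue the header with MSG_MORE (Linux-specific; per send(2) it has
     * the same effect as TCP_CORK but on a per-call basis), then let the
     * payload release header and data together. */
    static void send_framed(int fd, const void *hdr, size_t hdrlen,
                            const void *data, size_t datalen)
    {
        send(fd, hdr,  hdrlen,  MSG_MORE);
        send(fd, data, datalen, 0);
    }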

kazinator - a day ago

> The bigger problem is that TCP_QUICKACK doesn’t fix the fundamental problem of the kernel hanging on to data longer than my program wants it to.

Well, of course not; it tries to reduce the problem of your kernel hanging on to an ack (or generating an ack) longer than you would like. That pertains to received data. If the remote end is sending you data, and is paused because its buffers have filled while it waits for an ack from you, it behooves you to send an ack ASAP.

The original Berkeley Unix implementation of TCP/IP, I seem to recall, had a single global 500 ms timer for sending out acks. So when your TCP connection received new data eligible for acking, it could be as long as 500 ms before the ack was sent. Reframed in modern terms, where every other delay is negligible and data is arriving at the line rate of a multi-gigabit connection, 500 ms represents a lot of unacknowledged bits.

Delayed acks are similar to Nagle in spirit in that they promote coalescing at the possible cost of performance. Under the assumption that the TCP connection is bidirectional and "chatty" (so that even when the bulk of the data transfer is happening in one direction, there are application-level messages in the other direction), the delayed ack creates opportunities for the TCP ACK to be piggybacked on a data transfer. A TCP segment carrying no data, only an ACK, is avoided.

As far as portability of TCP_QUICKACK goes, in C code it is as simple as #ifdef TCP_QUICKACK. If the constant exists, use it; otherwise you're out of luck. If you're in another language, you have to go through some hoops depending on whether the network-related runtime exposes nonportable options in a way you can test, or whether you are on your own.
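
Concretely, something like this small sketch:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static void enable_quickack(int fd)
    {
    #ifdef TCP_QUICKACK
        /* Linux-only; tcp(7) notes the flag is not permanent, so it is
         * typically re-applied after each read. */
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    #else
        (void)fd;   /* platform doesn't expose it; out of luck */
    #endif
    }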

medoc - 5 hours ago

Around 1999, I was testing a still-young MySQL as an INFORMIX replacement, and network queries took a very suspect and quite exact 100 ms. A bug report and a message to mysql@lists.mysql.com later, and that is how MySQL came to set TCP_NODELAY on its network sockets...

vsgherzi - a day ago

https://oxide-and-friends.transistor.fm/episodes/mr-nagles-w...

Oxide and Friends episode on it! It's quite good.

martingxx - a day ago

I've always thought that Nagle's algorithm is putting policy in the kernel where it doesn't really belong.

If userspace applications want to make latency/throughput tradeoffs, they can already do that with full awareness and control using their own buffers, which will often mean fewer syscalls too.

mgaunard - 13 hours ago

I've always found Nagle's algorithm being a kernel-level default quite silly. It should be up to the application to decide when to send and when to buffer and defer.

foltik - a day ago

Why doesn’t Linux just add a kconfig option that enables TCP_NODELAY system-wide? It could be enabled by default on modern distros.

Veserv - a day ago

The problem is actually that nobody uses the generic solution to these classes of problems and then everybody complains that the special-case for one set of parameters works poorly for a different set of parameters.

Nagle’s algorithm is just a special case solution of the generic problem of choosing when and how long to batch. We want to batch because batching usually allows for more efficient batched algorithms, locality, less overhead etc. You do not want to batch because that increases latency, both when collecting enough data to batch and because you need to process the whole batch.

One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).

Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being however long it takes for all outstanding data to get an ack (you might already see how this degree of dynamism in your timeout poses a problem), which results in the fallback timer of 500 ms when delayed ack is on. It should be obvious that this is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component but punts on the “Time” component, allowing nonsense like delayed ack helpfully “configuring” your effective “Time” component to an eternity, resulting in “stuck” buffers, which is exactly what a timeout is supposed to avoid. I will decline to discuss the other aspect, choosing when to buffer and how much, of which Nagle’s algorithm is again a special case.

Delayed ack is, funnily enough, basically the exact same problem but done on the receive side. So both sides set timeouts based on the other side going first, which is obviously a recipe for disaster. They both set a fixed “Work” but no fixed “Time”, resulting in the situation where both drivers are too polite to go first.

What should be done is use the generic solutions that are parameterized by your system and channel properties which holistically solve these problems which would take too long to describe in depth here.
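
To make the “Work or Time” idea concrete, a rough C sketch (the thresholds are purely illustrative, not recommendations):

    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define BATCH_BYTES    1400   /* "work" threshold: roughly one MSS */
    #define BATCH_DELAY_MS 1      /* "time" threshold: worst-case added latency */

    struct batcher {
        char    buf[BATCH_BYTES];
        size_t  used;
        int64_t deadline_ms;      /* valid only while used > 0 */
    };

    static int64_t now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* Append data if it fits; return 1 when the caller should flush,
     * because the buffer is full ("work"), the deadline passed ("time"),
     * or this chunk did not fit and the buffer must be drained first. */
    static int batcher_add(struct batcher *b, const void *data, size_t len)
    {
        if (len > sizeof(b->buf) - b->used)
            return 1;
        if (b->used == 0)
            b->deadline_ms = now_ms() + BATCH_DELAY_MS;
        memcpy(b->buf + b->used, data, len);
        b->used += len;
        return b->used >= BATCH_BYTES || now_ms() >= b->deadline_ms;
    }

The caller also needs an actual timer (e.g. a poll() timeout set from deadline_ms) so a lone small message still goes out once the “Time” budget expires rather than waiting for the next add.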

bullen - 10 hours ago

What we need is configurable ack packet counts.

Then we can make TCP become UDP.

And then we solved everything.

Both Linux and Windows have this config, but it's buggy, so we're back to TCP and UDP.

hathawsh - a day ago

Ha ha, well that's a relief. I thought the article was going to say that enabling TCP_NODELAY is causing problems in distributed systems. I am one of those people who just turn on TCP_NODELAY and never look back because it solves problems instantly and the downsides seem minimal. Fortunately, the article is on my side. Just enable TCP_NODELAY if you think it's a good idea. It apparently doesn't break anything in general.

wyldfire - 16 hours ago

Nagle's algorithm is just a special case of TCP worst case latency. Packet loss and congestion also cause significant latency.

If you care about latency, you should consider something datagram oriented like UDP or SCTP.

buybackoff - 21 hours ago

Then at a lower level and smaller latencies it's often interrupt moderation that must be disabled. Conceptually similar idea to the Nagle algo - coalesce overheads by waiting, but on the receiving end in hardware.

saghm - a day ago

I first ran into this years ago after working on a database client library as an intern. Having not heard of this option beforehand, I didn't think to enable it in the connections the library opened, and in practice that often led to messages in the wire protocol being entirely ready for sending without actually getting sent immediately. I only found out about it later when someone using it investigated why the latency was much higher than they expected, and I guess either they had run into this before or were able to figure out that it might be the culprit, and it turned out that pretty much all of the existing clients in other languages set NODELAY unconditionally.

harikb - 17 hours ago

Somewhat related, from 3 years ago. Unfortunately, original blog is gone.

"Golang disables Nagle's Algorithm by default"

1. https://news.ycombinator.com/item?id=34179426

jonstewart - a day ago

<waits for animats to show up>

carlsborg - 16 hours ago

Unless you're cross-platform on Windows too, and then there's also a vast number of random registry settings.

jurabek - 12 hours ago

TCP_NODELAY is enabled by default in most modern languages, no?

joelthelion - 18 hours ago

What happens when you change the default when building a Linux distro? Did anyone try it?

TacticalCoder - 11 hours ago

It's a bit tricky in that browsers may be using TCP_NODELAY anyway or use QUIC (UDP) and whatnot, BUT, when in doubt, I've got a wrapper script around my browser's launcher script that does LD_PRELOAD with TCP_NODELAY correctly configured.

Dunno if it helps but it helps me feel better.

What speeds up browsing the most though IMO is running your own DNS resolver, null routing a big part of the Internet, firewalling off entire countries (no really I don't need anything from North Korea, China or Russia for example), and then on top of that running dnsmasq locally.

I run the unbound DNS (on a little Pi so it's on 24/7) with gigantic killfiles, then I use 1.1.1.3 on top of that (CloudFlare's DNS that filters out known porn and known malware: yes, it's CloudFlare and, yes, I own shares of NET).

Some sites complain I use an "ad blocker" but it's really just null routing a big chunk of the interwebz.

That and LD_PRELOAD a lib with TCP_NODELAY: life is fast and good. Very low latency.
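
In case anyone wants to replicate it, the shim can be tiny. An illustrative sketch (interposing connect(); glibc/Linux assumed, error handling omitted):

    /* nodelay.c -- build: gcc -shared -fPIC -o nodelay.so nodelay.c -ldl
     * run e.g.:          LD_PRELOAD=./nodelay.so firefox */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    typedef int (*connect_fn)(int, const struct sockaddr *, socklen_t);

    int connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        static connect_fn real_connect;
        if (!real_connect)
            real_connect = (connect_fn)dlsym(RTLD_NEXT, "connect");

        /* Only touch stream sockets; QUIC/UDP traffic is left alone. */
        int type = 0;
        socklen_t optlen = sizeof(type);
        if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &optlen) == 0 &&
            type == SOCK_STREAM) {
            int one = 1;
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        }
        return real_connect(fd, addr, len);
    }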

hsn915 - a day ago

Wouldn't distributed systems benefit from using UDP instead of TCP?

rowanG077 - a day ago

I fondly remember a simple simulation project we had to do with a group of 5 students in a second-year class, which had a simulation and some kind of scheduler that communicated via TCP. I was appalled at the performance we were getting. Even on the same machine it was way too slow for what it was doing. After hours of debugging it turned out it was indeed Nagle's algorithm causing the slowness, which I had never heard of at the time. Fixed instantly with TCP_NODELAY. It was one of the first times it was made abundantly clear to me that the teachers at that institution didn't know what they were teaching. Apparently we were the only group that had noticed the slow performance, and the teachers had never even heard of TCP_NODELAY.

TZubiri - 13 hours ago

> , suggesting that the default behavior is wrong, and perhaps that the whole concept is outmoded

While outmoded might be the case, wrong probably is not.

There are some features of network protocols that are designed to improve the network, not the individual connection. It's not novel that you can improve your own connection by disabling "good neighbour" features.

mmaunder - 20 hours ago

PSA: UDP exists.
