What came first: the CNAME or the A record?

blog.cloudflare.com

398 points by linolevan 21 hours ago


steve1977 - 20 hours ago

I don't find the wording in the RFC to be that ambiguous actually.

> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".

patrickmay - 20 hours ago

A great example of Hyrum's Law:

"With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody."

combined with failure to follow Postel's Law:

"Be conservative in what you send, be liberal in what you accept."

colmmacc - 13 hours ago

I am very petty about this one bug and have a very old axe to grind that this reminded me of! Way back in 2011 CloudFlare launched an incredibly poorly researched feature to just return CNAME records at a domain apex ... RFCs be damned.

https://blog.cloudflare.com/zone-apex-naked-domain-root-doma... , and I quote directly ... "Never one to let a RFC stand in the way of a solution to a real problem, we're happy to announce that CloudFlare allows you to set your zone apex to a CNAME."

The problem? CNAMEs are name level aliases, not record level, so this "feature" would break the caching of NS, MX, and SOA records that exist at domain apexes. Many of us warned them at the time that this would result in a non-deterministic issue. At EC2 and Route 53 we weren't supporting this just to be mean! If a user's DNS resolver got an MX query before an A query, things might work ... but the other way around, they might not. An absolute nightmare to deal with. But move fast and break things, so hey :)
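To make the ordering hazard concrete, here is a toy model (not any real resolver's logic) of a cache sitting in front of a zone that publishes a CNAME next to other records at the same name. The zone data, the `ToyCache` class, and the "exact type first, then cached CNAME" lookup order are all assumptions made for illustration; real resolvers differ in detail and also honour TTLs.

```python
# Toy model of a caching resolver in front of a zone that (against RFC 1034)
# publishes a CNAME alongside other records at the same name.
# All names and record data are made up for illustration.

AUTHORITATIVE = {
    ("example.com", "CNAME"): ["lb.example.net"],
    ("example.com", "MX"): ["10 mail.example.com"],
    ("lb.example.net", "A"): ["192.0.2.10"],
}


class ToyCache:
    def __init__(self):
        self.cache = {}

    def resolve(self, name, rtype):
        # 1. An exact (name, type) hit is served straight from the cache.
        if (name, rtype) in self.cache:
            return self.cache[(name, rtype)]
        # 2. A cached CNAME is treated as an alias for the whole name,
        #    so it answers queries of every other type as well.
        if (name, "CNAME") in self.cache:
            return self.resolve(self.cache[(name, "CNAME")][0], rtype)
        # 3. Cache miss: consult the authoritative data.
        if (name, rtype) in AUTHORITATIVE:
            self.cache[(name, rtype)] = AUTHORITATIVE[(name, rtype)]
            return self.cache[(name, rtype)]
        if (name, "CNAME") in AUTHORITATIVE:
            self.cache[(name, "CNAME")] = AUTHORITATIVE[(name, "CNAME")]
            return self.resolve(name, rtype)
        return []


mx_first = ToyCache()
print(mx_first.resolve("example.com", "MX"))  # MX is found and cached
print(mx_first.resolve("example.com", "A"))   # A is reached through the CNAME

a_first = ToyCache()
print(a_first.resolve("example.com", "A"))    # caches the apex CNAME
print(a_first.resolve("example.com", "MX"))   # follows the CNAME, finds no MX
```

In this model the same zone yields a working MX lookup or an empty one depending only on which query reached the cache first.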

In earnest though ... it's great to see how CloudFlare are now handling CNAME chains and A record ordering issues in this kind of detail. I never would have thought of this implicit contract they've discovered, and it makes sense!

linsomniac - 16 hours ago

>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order

That seems like some doubling-down BS to me, since they earlier say "It's ambiguous because it doesn't use MUST or SHOULD, which was introduced a decade after the DNS RFC." The RFC says:

>The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

How do you get to interpreting that, in the face of "MUST" being defined a decade later, as "I guess I can append the CNAME to the answer"?

Holding onto "we still think the RFC allows it" is a problem. The world is a lot better if you can just admit to your mistakes and move on. I try to model this at home and at work, because trying to "language lawyer" your way out of being wrong makes the world a worse place.

NelsonMinar - 19 hours ago

It's remarkable that the ordinary DNS lookup function in glibc doesn't work if the records aren't in the right order. It's amazing to me we went 20+ years without that causing more problems. My guess is most people publishing DNS records just sort of knew that the order mattered in practice, maybe figuring it out in early testing.

bwblabs - 16 hours ago

I will hijack this post to point out CloudFlare really doesn't understand RFC1034, their DNS authoritative interface only blocks A and AAAA if there is a CNAME defined, e.g. see this:

  $ echo "A AAAA CAA CNAME DS HTTPS LOC MX NS TXT" | sed -r 's/ /\n/g' | sed -r 's/^/rfc1034.wlbd.nl /g' | xargs dig +norec +noall +question +answer +authority @coco.ns.cloudflare.com
  ;rfc1034.wlbd.nl.  IN A
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN AAAA
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN CAA
  rfc1034.wlbd.nl. 300 IN CAA 0 issue "really"
  ;rfc1034.wlbd.nl.  IN CNAME
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN DS
  rfc1034.wlbd.nl. 300 IN DS 0 13 2 21A21D53B97D44AD49676B9476F312BA3CEDB11DDC3EC8D9C7AC6BAC A84271AE
  ;rfc1034.wlbd.nl.  IN HTTPS
  rfc1034.wlbd.nl. 300 IN HTTPS 1 . alpn="h3"
  ;rfc1034.wlbd.nl.  IN LOC
  rfc1034.wlbd.nl. 300 IN LOC 0 0 0.000 N 0 0 0.000 E 0.00m 0.00m 0.00m 0.00m
  ;rfc1034.wlbd.nl.  IN MX
  rfc1034.wlbd.nl. 300 IN MX 0 .
  ;rfc1034.wlbd.nl.  IN NS
  rfc1034.wlbd.nl. 300 IN NS rfc1034.wlbd.nl.
  ;rfc1034.wlbd.nl.  IN TXT
  rfc1034.wlbd.nl. 300 IN TXT "Check my cool label serving TXT and a CNAME, in violation with RFC1034"

The result is that DNS resolvers (including CloudFlare Public DNS) will give a cache-dependent result if you query e.g. a TXT record (depending on whether the CNAME is cached). At internet.nl (https://github.com/internetstandards/) we found out because some people claimed to have a TXT DMARC record while also CNAMEing the same name (which gives cache-dependent results, and since internet.nl uses RFC 9156 QName Minimisation, it first resolves A, therefore caches the CNAME, and will never see the TXT). People configure things similar to the https://mxtoolbox.com/dmarc/dmarc-setup-cname instructions (which I find in conflict with RFC1034).

netfortius - 2 hours ago

Why couldn't a "code specialized" LLM/AI be added to the change flow, in the cloudflare process, and asked to check against all known implementations of name resolution stubs, dns clients, etc., etc. If not in such cases, then when?

forinti - 20 hours ago

> While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.

That's the only reasonable conclusion, really.

seiferteric - 19 hours ago

Now that I have seemingly taken on managing DNS at my current company I have seen several inadequacies of DNS that I was not aware of before. Main one being that if an upstream DNS server returns SERVFAIL, there is no distinction really between if the server you are querying is failed, or the actual authoritative server upstream is broken (I am aware of EDEs but doesn't really solve this). So clients querying a broken domain will retry each of their configured DNS servers, and our caching layer (Unbound) will also retry each of their upstreams etc... Results in a bunch of pointless upstream queries like an amplification attack. Also have issue with the search path doing stupid queries with NXDOMAIN like badname.company.com, badname.company.othername.com... etc..
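For anyone who hasn't run into the search path behaviour, here is a simplified sketch of how a stub resolver can expand one lookup into several queries. It follows the `search` / `ndots` rule described in resolv.conf(5) but is not glibc's actual code; the function name and the search list in the example are hypothetical.

```python
def candidate_queries(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver might try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is not applied
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the bare name.
    return expanded + [absolute]


# Hypothetical search list; each candidate that comes back NXDOMAIN
# triggers the next query in the list.
for query in candidate_queries("badname", ["company.com", "company.othername.com"]):
    print(query)
```

Every entry in the search list costs one extra upstream query for each name that does not exist, which is where the pile of NXDOMAIN traffic comes from.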

teddyh - 15 hours ago

Cloudflare is well known for breaking DNS standards, and also then writing a new RFC to justify their broken behavior, and getting IETF to approve it. (The existence of RFC 8482 is a disgrace to everyone involved.)

> To prevent any future incidents or confusion, we have written a proposal in the form of an Internet-Draft to be discussed at the IETF

Of course.

mdavid626 - 18 hours ago

I would expect, that dns servers like 1.1.1.1 at this scale have integration tests running real resolvers, like the one in glibc. How come this issue was discovered only in production?

wolttam - 17 hours ago

My take is quite cynical on this.. This post reads to me like a post-justification of some strange newly introduced behaviour.

Please order the answer in the order the resolutions were performed to arrive at the final answer (regardless of cache timings). Anything else makes little sense, especially not in the name of some micro-optimization (which could likely be approached in other ways that don’t alter behaviour).

tuetuopay - 19 hours ago

Many rightfully interpret the RFC as "CNAME have to be before A", but the issue persists in between CNAMEs in the chain, as noted in the article. If a record in the middle of the chain expires, glibc would still fail if the "middle" record was to be inserted between CNAMEs and A records.
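A minimal sketch of why a single-pass parser is sensitive to both kinds of reordering. It is loosely modelled on the behaviour the article describes for glibc, but it is not glibc's code; the function name, the tuple layout, and the sample records are made up for illustration.

```python
def single_pass_addresses(qname, answer_section):
    """Walk the answer section once, tracking the owner name expected next."""
    expected = qname
    addresses = []
    for owner, rtype, rdata in answer_section:
        if owner != expected:
            continue  # record for a name we are not (yet) looking for: skipped
        if rtype == "CNAME":
            expected = rdata  # follow the alias; earlier records are never revisited
        elif rtype == "A":
            addresses.append(rdata)
    return addresses


in_order = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "CNAME", "edge.example.org"),
    ("edge.example.org", "A", "192.0.2.1"),
]
a_record_first = [in_order[2], in_order[0], in_order[1]]
cnames_swapped = [in_order[1], in_order[0], in_order[2]]

print(single_pass_addresses("www.example.com", in_order))        # address found
print(single_pass_addresses("www.example.com", a_record_first))  # empty
print(single_pass_addresses("www.example.com", cnames_swapped))  # empty
```

The third case keeps the A record last and still comes back empty, because the second link of the chain was seen before the parser knew to look for it.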

It’s always DNS.

m3047 - 17 hours ago

DNS is a wire protocol, payload specification, and application protocol. For all of that, I personally wonder whether its enduring success isn't that it's remarkably underspecified when you get to the corner cases.

There's also so much of it, and it mostly works, most of the time. This creates a hysteresis loop in human judgement of efficacy: even a blind chicken gets corn if it's standing in it. Cisco bought cisco., but (a decade ago, when I had access to the firehose) on any given day belkin. would be in the top 10 TLDs if you looked at the NXDOMAIN traffic. Clients don't opportunistically try TCP (which they shouldn't, according to the specification...), but we have DoT (...but should in practice). My ISP's reverse DNS implementation is so bad that qname minimization breaks... but "nobody should be using qname minimization for reverse DNS", and "Spamhaus is breaking the law by casting shades at qname minimization".

"4096 ought to be enough for anybody" (no, frags are bad. see TCP above). There is only ever one request in a TCP connection... hey, what are these two bytes which are in front of the payload in my TCP connection? People who want to believe that their proprietary headers will be preserved if they forward an application protocol through an arbitrary number of intermediate proxy / forwarders (because that's way easier than running real DNS at the segment edge and logging client information at the application level).

Tangential, but: "But there's more to it, because people doing these things typically describe how it works for them (not how it doesn't work) and onlookers who don't pay close attention conclude "it works"." http://consulting.m3047.net/dubai-letters/dnstap-vs-pcap.htm...

purwantoroa73 - 14 minutes ago

Have you guys used Vercel + Cloudflare?

sebastianmestre - 20 hours ago

I kind of wish they start sending records in randomized order to take out all the broken implementations that depend on such a fragile property

peanut-walrus - 7 hours ago

I've always found it weird that CNAMEs get resolved and lumped into the answer section in the first place. While helpful, this is not what you asked for and it makes much more sense to me to stick that in additional section instead.

As an aside, I am super annoyed at Cloudflare for calling their proxy records "CNAME" in their UI. Those are nothing like CNAMEs and have caused endless confusion.

danepowell - 18 hours ago

Doesn't the precipitating change optimize memory on the DNS server at the expense of additional memory usage across millions of clients that now need to parse an unordered response?

esotericwarfare - 6 hours ago

CloudFlare is a terrorist organization destroying the web.

mintflow - 13 hours ago

After reading the article, I am wondering: is there no test case covering a change to the CNAME order in the response? I think it should be simple to run a fleet of various OS/DNS client combinations to test the behavior.

And I was also shocked that Cisco switches go into a reboot loop because of this DNS ordering issue.

mcfedr - 8 hours ago

everything about this reads like an excuse from a team that doesnt want to admit they screwed up

nitpicking at the RFCs when everyone knows DNS is a big old thing with lots going on

how do they not have basic integration tests to check how clients resolve

it seems very unlike cloudflare of old that was much more up front - there is no talk of the need to improve process, just blaming other people

kayson - 20 hours ago

> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.

Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.

It seems like an oversight, and that's totally fine.

ShroudedNight - 20 hours ago

I'm not an IETF process expert. Would this be worth filing errata against the original RFC in addition to their new proposed update?

Also, what's the right mental framework behind deciding when to release a patch RFC vs obsoleting the old standard for a comprehensive update?

0xbadcafebee - 12 hours ago

It's kind of weird that they didn't expect this. DNS resolvers are famously inconsistent, with changes sometimes working or not working, breaking or not breaking. Virtually any change you make to what DNS serves or how will cause inconsistent behavior somewhere. (DNS encompasses hundreds of RFCs)

runningmike - 18 hours ago

The end of this blog is … “To learn more about our mission to help build a better Internet,”

Reminds me of https://news.ycombinator.com/item?id=37962674 or see https://tech.tiq.cc/2016/01/why-you-shouldnt-use-cloudflare/

Ericson2314 - 11 hours ago

It's a pity they have to make an entirely new RFC, rather than amend the old RFC. Having independent RFCs and not a single unified "internet standard" under version control is a bit of a bummer in this manner.

paulddraper - 20 hours ago

> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:

> If recursive service is requested and available, the recursive response to a query will be one of the following:

> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.

It's pretty clear that CNAME is at the beginning.

The "possibly" does not refer to the order but rather to the presence.

If they are present, they are first.

albert_e - 10 hours ago

The kind of "optimization" that Cloudflare is attempting to do here ... doesnt that transfer the burden of more expensive parsing downstream to all the DNS clients instead?

Sounds low key selfish / inconsiderate to me

... to push such a change without adequate thought or informed buy in by consumers of that service.

skywhopper - 2 hours ago

This all reads like an embarrassed engineer who can’t admit they neglected to have a comprehensive to-the-byte test suite for their second-most-important-on-the-Internet DNS server, overcompensating by blaming a 40-year-old standard that (1) they probably hadn’t consulted, and (2) no one else seems to have issues with; and proposing to update core Internet standards, rather than just accept that they made a mistake when they assumed they could just append to what any regular user of DNS expects to be a meaningfully-ordered list.

urbandw311er - 16 hours ago

I feel like they fucked it up then, when writing the post-mortem, went hunting for facts to retrospectively justify their previous decisions.

frumplestlatz - 20 hours ago

Given my years of experience with Cisco "quality", I'm not surprised by this:

> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.

... but I am surprised by this:

> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.

Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.

therein - 20 hours ago

After the release got reverted, it took 1hr28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.

renewiltord - 20 hours ago

Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.

dudeinjapan - 10 hours ago

Philosophers have agonized over this question since time immemorial.

inkyoto - 12 hours ago

This could be a great fit for Prolog, in fact, as it excels at the search.

Each resolved record would be asserted as a fact, and a tiny search implementation would run after all assertions have been made to resolve the IP address irrespective of the order in which the RRsets have arrived.
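The same "assert everything, then search" idea can be sketched without Prolog. The version below is Python rather than Prolog, and it is only an illustration: the function name, the tuple layout, and the `max_hops` limit are assumptions, and it ignores details such as case-insensitive name comparison and record types other than A.

```python
import random


def resolve_any_order(qname, answer_section, max_hops=16):
    """Index every record first, then follow the CNAME chain from qname."""
    cnames = {}
    addresses = {}
    for owner, rtype, rdata in answer_section:
        if rtype == "CNAME":
            cnames[owner] = rdata
        elif rtype == "A":
            addresses.setdefault(owner, []).append(rdata)

    name = qname
    for _ in range(max_hops):  # hop limit guards against CNAME loops
        if name in addresses:
            return addresses[name]
        if name not in cnames:
            return []  # chain ends without an address
        name = cnames[name]
    return []  # gave up: loop or overly long chain


records = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "CNAME", "edge.example.org"),
    ("edge.example.org", "A", "192.0.2.1"),
]
random.shuffle(records)  # arrival order no longer matters
print(resolve_any_order("www.example.com", records))
```

The trade-off is that the whole answer section has to be indexed before anything can be returned, instead of being handled in one streaming pass.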

A micro Prolog implementation could be rolled into glibc's resolver (or a DNS resolver in general) to solve the problem once and for all.

PunchyHamster - 14 hours ago

TL;DR: everyone implemented the RFC properly (if missing some defensive coding). Cloudflare decided the ordering was optional, then learned that everyone really had implemented the RFC properly; some had just done additional work to make sure that servers which got it wrong were still supported.

torstenvl - 15 hours ago

EDIT: Why the drive-by downvotes? If someone thinks I'm wrong, I'm happy to hear why.

> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.

> Most DNS clients don’t have this issue.

The most widespread implementation on the most widespread server operating system has the issue. I'm skeptical of what the author means by "Most DNS clients."

Also, what is the point of deploying to test if you aren't going to test against extremely common scenarios (like getaddrinfo)?

> To prevent any future incidents or confusion, we have written a proposal in the form of an Internet-Draft to be discussed at the IETF. If consensus is reached...

Pretty sure both Hyrum's Law and Postel's Law have reached the point of consensus.

Being conservative in what you emit means following the spec's most conservative interpretation, even if you think the way it's worded gives you some wiggle room. And the fact that your previous implementation did it that way for a decade means people have come to rely on it.

1vuio0pswjnm7 - 17 hours ago

"One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution. When looking at its getanswer_r implementation, we can indeed see it expects to find the CNAME records before any answers:"

Wherever possible I compile with gethostbyname instead of getaddrinfo. I use musl instead of glibc

Nothing against IPv6 but I do not use it on the computers and networks I control

charcircuit - 20 hours ago

Random DNS servers and clients being broken in weird ways is such a common problem and will probably never go away unless DNS is abandoned altogether.

It's surprising how something so simple can be so broken.