My Favorite Bugs: Invalid Surrogate Pairs

george.mand.is

73 points by meysamazad 7 hours ago


chrismorgan - 4 hours ago

A CRDT library working at the code unit level? Ouch. Of course that’s going to go wrong; it was inevitable.

As for using extended grapheme clusters, it sounds a little iffy (maybe possible to use correctly, maybe not) because they’re not stable across Unicode versions. That sort of thing has produced some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes.

Unicode scalar values are technically safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense.

> We made emoji an atomic node type.

That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically.

> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.

I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.

UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to a single logical unit was always inevitable.
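
To make the three levels concrete, here’s a minimal sketch in JavaScript (the grapheme step assumes a runtime with Intl.Segmenter, which current engines ship):

    const s = '😀';               // one grapheme, one scalar, two code units
    console.log(s.length);        // 2 -- UTF-16 code units
    console.log(s.slice(0, 1));   // lone high surrogate U+D83D: invalid Unicode

    // Scalar values: iteration is surrogate-aware, so no invalid halves...
    console.log([...s].length);   // 1

    // ...but multi-scalar graphemes still come apart. 'é' as e + combining accent:
    const e = 'e\u0301';
    console.log([...e].slice(0, 1).join('')); // 'e' -- valid Unicode, yet not 'é'

    // Grapheme clusters: Intl.Segmenter keeps the logical character whole.
    const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    console.log([...seg.segment(e)].map(x => x.segment)); // ['é']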

dimes - 2 hours ago

I have a similar, Unicode-related “favorite bug”.

We were expanding our product to a new language that used non-ASCII code points. Part of the system involved invoking binaries using text as input.

Locally, everything worked great. Once deployed, we got corrupted text output. As soon as we SSH’d onto the server to inspect, everything started working again.

It turns out that SSH sessions can carry their own LANG environment variable: the client forwards it and the server accepts it. The default value on our servers didn’t support Unicode, but LANG was overridden as soon as we connected via ssh. It was a head-scratcher for sure.

georgemandis - 5 hours ago

Just noticed this is getting some traffic! It's a little buried in the post, but I made an interactive tool for exploring surrogate pairs as part of this:

- https://george.mand.is/invalid-surrogate-pairs/

I thought it was the kind of thing that’s easier to play with and get a feel for than to just read about.

jonhohle - 5 hours ago

Once I ran into this, it became hard to treat strings “normally” in any situation; alternatively, I’d force hard encoding requirements in the domain. Either way, handling grapheme clusters properly is hard and easy to get wrong.

I recently ported a program from Python to Rust, and the original author had used string regexes. Input and output document encodings mattered, but the characters that needed to be matched were always lower ASCII. The Python program could have used binary regexes, but instead forced an input encoding (UTF-8) and made the user choose an output encoding. When the input comes from an unknown process or legacy data, however, you don’t always get the luxury of assuming the encoding. Switching to binary regexes and ignoring encoding altogether simplified the logic, eliminated whole classes of errors, and made the program work in scenarios it couldn’t before. Getting rid of the last decoding/encoding code was such a relief, especially when all of the wacky encoding tests I had already written kept passing.

vishnuharidas - 2 hours ago

And here's a UTF-8 Playground: https://utf8-playground.netlify.app

Dwedit - 4 hours ago

Windows allows unpaired surrogates in filenames, which is invalid UTF-16. Likewise, Linux allows invalid UTF-8 byte sequences in filenames.

Because invalid UTF-16 strings can show up in various places within Windows, someone made a UTF-8 variant called "WTF-8", which allows unpaired surrogates to survive a round trip.
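
You can watch the problem WTF-8 solves from JavaScript, where strings may be ill-formed UTF-16 (a small sketch using the standard TextEncoder/TextDecoder, which implement strict UTF-8):

    const lone = '\uD800';  // unpaired high surrogate: valid JS string, invalid UTF-16
    const bytes = new TextEncoder().encode(lone);
    console.log([...bytes].map(b => b.toString(16))); // ['ef', 'bf', 'bd'] -- U+FFFD
    console.log(new TextDecoder().decode(bytes));     // '\uFFFD': the original is gone

Strict UTF-8 has to replace the unpaired surrogate, so the round trip is lossy; WTF-8 encodes the surrogate’s code point directly, which is exactly what lets it survive.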

skybrian - 5 hours ago

Writing property tests on functions that work with strings is a good way to find lots of Unicode issues.
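
For instance, in JavaScript with fast-check (a sketch; assumes fast-check v3, where fc.string16bits() generates arbitrary code units, lone surrogates included):

    import fc from 'fast-check';

    // Property: a UTF-8 round trip should be lossless for any string.
    fc.assert(
      fc.property(fc.string16bits(), (s) => {
        const bytes = new TextEncoder().encode(s);
        return new TextDecoder().decode(bytes) === s;
      })
    );
    // Fails with a shrunk counterexample: the generator eventually emits
    // an unpaired surrogate, which TextEncoder replaces with U+FFFD.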

BobbyTables2 - 5 hours ago

Damn, I’ve never really had to deal with Unicode all that much.

It was already bad enough that, instead of bytes, we have to worry about code points. Now even that isn’t enough?

It would have been expensive, but all characters should have been fixed-size 64-bit values.

impure - 4 hours ago

I had an emoji-cut-in-half problem in Dart. I was a bit surprised because I thought substring operations worked on characters. It only produced an invalid Unicode symbol, though, so it wasn’t too bad.
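
JavaScript has the same pitfall, plus (since ES2024) built-ins to detect and repair it; a quick sketch, assuming a current engine:

    const half = '😀'.slice(0, 1);     // lone high surrogate U+D83D
    console.log(half.isWellFormed());  // false
    console.log(half.toWellFormed());  // '\uFFFD' -- replaced, not restored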

wupatz - 5 hours ago

It’s good to know about surrogate pairs in Unicode. They were new to me too when I was part of tracking down incomplete Unicode flags in the (excellent) Phanpy Mastodon client.

The author went for Intl.Segmenter too: https://github.com/cheeaun/phanpy/issues/1491
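
Flags are a nice stress test because each one is two regional-indicator scalars forming a single grapheme; a sketch, again leaning on Intl.Segmenter:

    const flags = '🇫🇷🇩🇪';              // U+1F1EB U+1F1F7 U+1F1E9 U+1F1EA
    console.log(flags.length);          // 8 code units
    console.log([...flags].length);     // 4 scalars
    const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
    console.log([...seg.segment(flags)].map(x => x.segment)); // ['🇫🇷', '🇩🇪']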

agus4nas - 5 hours ago

Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?

bombela - 4 hours ago

In summary: Unicode code points (loosely, characters) fit in 32 bits (the range actually tops out at U+10FFFF, so 21 bits). JavaScript manipulates Unicode as UTF-16 for historical reasons, because early in Unicode’s life 16 bits was deemed enough (UCS-2). UTF-16 is a variable-length encoding that packs each code point into one or two 16-bit code units. Splitting in the middle of a code point produces one invalid half-string and one semantically different half-string.
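
The packing is simple arithmetic; a quick sketch with U+1F600 ('😀') as the worked example:

    // Encode a code point above U+FFFF as a UTF-16 surrogate pair.
    function toSurrogatePair(codePoint) {
      const v = codePoint - 0x10000;     // 20 significant bits remain
      const high = 0xd800 + (v >> 10);   // top 10 bits -> high surrogate
      const low = 0xdc00 + (v & 0x3ff);  // bottom 10 bits -> low surrogate
      return [high, low];
    }

    const [hi, lo] = toSurrogatePair(0x1f600);
    console.log(hi.toString(16), lo.toString(16)); // d83d de00
    console.log(String.fromCharCode(hi, lo));      // '😀'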

Emoji are sequences of Unicode code points producing a single grapheme. Splitting in the middle of a grapheme produces two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split at grapheme boundaries.
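
For instance (a sketch; the family emoji is the standard U+1F468 ZWJ U+1F469 ZWJ U+1F467 sequence):

    // One grapheme built from five scalars: man + ZWJ + woman + ZWJ + girl.
    const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // '👨‍👩‍👧'
    const scalars = [...family];                 // 5 valid code points
    console.log(scalars.slice(0, 1).join(''));   // '👨' -- valid, but not the original
    console.log(scalars.slice(1).join(''));      // ZWJ + '👩‍👧' -- also valid Unicode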