I rendered 1,418 confusables over 230 fonts. Most aren't confusable to the eye
paultendo.github.io | 100 points by paultendo 2 days ago
About 20 years ago I used Cyrillic confusables to watermark internal documentation that was being leaked by a disgruntled customer service employee. The document would render dynamically and include the employee ID encoded as bits in the text. It survived copy/paste to plain text well.
I did run into some issues in early versions when characters in Linux commands or visible web addresses were replaced. Fortunately the source docs were HTML, and it was easy to exclude code or pre nodes when rendering.
I thought this was so clever, but the leaker was never caught using it, to the best of my knowledge.
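For anyone curious how a watermark like this could work, here is a minimal Python sketch. It is not the commenter's implementation: the lookalike table is a small illustrative subset, and `embed_id` / `extract_id` are hypothetical names. It assumes the source text contains no Cyrillic of its own, and that each swappable Latin letter carries one bit (Latin = 0, Cyrillic = 1).

```python
# Illustrative subset of Latin -> Cyrillic lookalikes (an assumption, not a
# complete confusables table).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "x": "\u0445",
}
CYRILLIC_TO_LATIN = {v: k for k, v in LATIN_TO_CYRILLIC.items()}


def embed_id(text: str, employee_id: int, bits: int = 16) -> str:
    """Hide employee_id in text; each swappable letter carries one bit."""
    if not 0 <= employee_id < (1 << bits):
        raise ValueError("employee_id does not fit in the requested bit width")
    # Most significant bit first.
    payload = [(employee_id >> i) & 1 for i in range(bits - 1, -1, -1)]
    out, used = [], 0
    for ch in text:
        if ch in LATIN_TO_CYRILLIC and used < bits:
            out.append(LATIN_TO_CYRILLIC[ch] if payload[used] else ch)
            used += 1
        else:
            out.append(ch)
    if used < bits:
        raise ValueError("text has too few swappable letters for the payload")
    return "".join(out)


def extract_id(text: str, bits: int = 16) -> int:
    """Read the first `bits` carrier letters back into an integer."""
    value, used = 0, 0
    for ch in text:
        if used == bits:
            break
        if ch in LATIN_TO_CYRILLIC:        # Latin carrier -> bit 0
            value <<= 1
            used += 1
        elif ch in CYRILLIC_TO_LATIN:      # Cyrillic carrier -> bit 1
            value = (value << 1) | 1
            used += 1
    if used < bits:
        raise ValueError("text has too few carrier letters to hold an ID")
    return value


if __name__ == "__main__":
    doc = "please escalate complex cases to the operations team once approved"
    marked = embed_id(doc, 0xBEEF)
    print(extract_id(marked) == 0xBEEF)
```

The sketch works on plain strings only. Skipping `code` and `pre` nodes, as described above, would have to happen before the text reaches `embed_id`.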
This other article from the same author is more interesting: https://paultendo.github.io/posts/unicode-confusables-nfkc-c...
I'm not an expert, I've just been "vibe-R&D"-ing computer vision for a bit now, but I'll guarantee you SSIM is not suitable for this purpose. I've been dabbling in basically this area (comparing small, potentially low-resolution images) and SSIM produces a lot of false negatives and some false positives.
I would recommend template matching using normalized cross-correlation (TM_CCOEFF_NORMED in OpenCV).
Also this paper from Nvidia critically scrutinizing SSIM may be relevant: https://research.nvidia.com/publication/2020-07_Understandin...
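To make that suggestion concrete: when the template and the image are the same size, TM_CCOEFF_NORMED collapses to a single score, the zero-mean normalized cross-correlation of the two pixel grids. Below is a pure-Python sketch of that score on toy bitmaps, so it runs without OpenCV. The `zncc` and `bitmap` helpers and the 3x5 glyphs are made up for illustration and have nothing to do with the article's actual pipeline.

```python
from math import sqrt


def bitmap(rows):
    """Turn rows of '#' and '.' into a grid of 1.0 / 0.0 pixel values."""
    return [[1.0 if c == "#" else 0.0 for c in row] for row in rows]


def zncc(a, b):
    """Zero-mean normalized cross-correlation of two same-sized grids."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("images must be non-empty and the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    da = [p - mean_a for p in flat_a]
    db = [p - mean_b for p in flat_b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat image has no structure to correlate against
    return sum(x * y for x, y in zip(da, db)) / denom


# Toy glyphs: a bare vertical stroke for both "l" and "I", and a box for "O".
LOWER_L = bitmap([".#.", ".#.", ".#.", ".#.", ".#."])
UPPER_I = bitmap([".#.", ".#.", ".#.", ".#.", ".#."])
LETTER_O = bitmap(["###", "#.#", "#.#", "#.#", "###"])

if __name__ == "__main__":
    print(zncc(LOWER_L, UPPER_I))   # identical strokes score at the top of the range
    print(zncc(LOWER_L, LETTER_O))  # dissimilar shapes score much lower
```

With real glyph renders you would rasterize each character at the same size and baseline first; alignment matters a lot for a per-pixel score like this.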
Maybe not at super large font sizes. But even lowercase i and l are easy enough to confuse at a glance mid-word in most sans-serif fonts, not to mention uppercase I and lowercase l. You don’t even need “confusable” glyphs to create a domain name that will stand up to a casual visual confirmation from a busy user in a phishing context.
Every Albert, Alfred, or Alphonso who goes by “Al” getting confused with bots right now…
I often ask my friend Alan to review what I've created, so I can tell people it has been enhanced with Al.
Perhaps there are people named “Alexa” who started using “Al” after Amazon’s launch. Talk about bad luck.
We used to mess with our friends by making AIM screen names that looked identical, or super close, to theirs, then messaging other friends in the group, or going into chat and saying things like "im a big dumb idiot".
This was like 1998-2003, and non technical people were doing it too. I think I am the only one from that friend group who would even consider that as something to watch out for.
But what about 'Ы'? It looks like 'bl', doesn't it? 'Ы' is one codepoint and one glyph, though 'bl' is a sequence of two letters. I believe that the method described will miss such things. Cyrillic also has 'Ю'; I suppose it is possible to design a font that makes it look like 'lO'? Are there any fonts like this in the wild?
> 82 pairs are pixel-identical
> a string like “аpple.com” with Cyrillic а (U+0430) is pixel-identical to “apple.com” in 40+ fonts. The user, the browser’s address bar, and any visual review process all see the same pixels. This is not theoretical. It is a measured property of the font files shipping on every Mac.
Current implementations of "Computer Use" Agentic AI tools mostly use visuals -- screenshotting of a computer screen and interpreting it.
These pixel-identical character pairs will be a straight failure mode for those automations and could possibly be a threat vector if crafted well.
I don’t think a human could tell the difference either. This will make phishing emails much more effective.
Thanks for the effort!
I'm always intrigued by how the German FE-Schrift ("fälschungserschwerende Schrift", "more-difficult-to-forge font") chooses shapes for characters that make it hard for them to be turned into one another (like a 3 into an 8 or so).
As a youth in the DOS era, I was always enamored of fonts like OCR-A. There is some overlap between the problems of "make it easy to distinguish" and "make it hard to maliciously corrupt", although I can imagine some cases where they might be in conflict, especially if adding ink is asymmetrically easier than removing or covering it.
What I have always wondered about with FE-Schrift: they painstakingly made all glyphs distinguishable, but completely f'ed it up with V and Y: the "stalk" of the Y is vertical and so short that they're very easy to confuse. They could have made the "stalk" slanted, or even curved like in lowercase "g", and most people would have still recognized it as a "Y"...
A slanted stalk might have made it too close to an X with a removed lower left arm. But a curved stalk does seem like it would have been an easy improvement
That's super interesting, but at the same time, I think the primary concern is not whether they are literally the same, but whether a user is likely to confuse them in a small font you don't control, in a place they are unlikely to pay attention to (e.g. the address bar).
Even if the two characters look quite different, it is a problem if they both look like the same letter in different fonts. It doesn't matter if you can tell the difference between the glyphs in a side-by-side comparison. What matters is what letter the user interprets the glyph as.
An interesting attempt, Claude. However, your prompt is missing an important step to measure effectiveness against humans: wait 40-60 years for your vision to degrade naturally, and check the confusables again, preferably on a small phone screen. Bonus points if you can find someone with visual disabilities from birth. Obviously most attacks aren't pixel-perfect, but that's not the point; all you need to confuse are human eyes.
Things like the Fraktur characters are obvious mismatches in any font I know; I do wonder why they're on the list.
I think we'll have to start configuring our client tools (e.g. browser, email client, etc) to render domain names with annotations for different character classes. E.g. our native character set is a standard color (blue/black) and then other character sets would have to stand out (purple background?).
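A rough sketch of what that annotation could look like, in Python. Loud assumptions: the standard library does not expose the Unicode Script property, so `script_of` guesses the script from the first word of the character's Unicode name (a heuristic, not the real property), and `annotate` uses bracket markers as a stand-in for the colored background a real client would draw. Both function names are hypothetical.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Guess a character's script from its Unicode name (heuristic only)."""
    if not ch.isalpha():
        return "COMMON"  # digits, dots, hyphens: shared by all scripts
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def annotate(domain: str, native: str = "LATIN") -> str:
    """Wrap each run of non-native letters in a [SCRIPT:...] marker."""
    parts = []
    for script, run in groupby(domain, key=script_of):
        text = "".join(run)
        if script in (native, "COMMON"):
            parts.append(text)
        else:
            parts.append(f"[{script}:{text}]")
    return "".join(parts)


if __name__ == "__main__":
    # First letter is Cyrillic U+0430, the rest is plain Latin.
    print(annotate("\u0430pple.com"))
    # Second letter is Greek rho, U+03C1.
    print(annotate("j\u03c1morgan.com"))
```

The `native` parameter is the knob here: a reader whose everyday script is Cyrillic would set it to `"CYRILLIC"` and have Latin runs flagged instead.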
i'm pretty sure Mox (email server with included webui written in Go) does that - at least the Umlauts in mails i get from Hetzner seem to always stand out.
it also defaults to not loading HTML in emails, which i love. really opened my eyes to how dumb it really is to just accept all kinds of dynamic content in unknown messages. (kinda same as how the modern web relies on remote code execution to work)
Hmm, is SSIM a good metric for comparing fonts? I'd imagine it isn't ideal, as fonts are mostly textureless and SSIM has no concept of glyph identity or typographic intent.
You're right, it's not. I just posted this comment: https://news.ycombinator.com/item?id=47182655
0 and O, and l and I that look the same in a single font is a crime of modern typography.
Also, I remember 8x16 VGA font that came with KeyRus had some slight differences between Cyrillic and Latin lookalikes, that brought some strange sense of comfort when reading, and especially typing the letter c, because its Cyrillic lookalike is located on the same key.
The font the arduino editor uses renders l and 1 exactly the same. Utter madness in a beginner programming context.
Good read (as is the next article in the series), but you can tell it hasn't been proofread due to "paypa׀.com" being described as a danger. Maybe in a different font than the website's, but in that case, maybe this should have been rendered out.
> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical.
that is very interesting.
I imagine the browser could take some context clues and switch rendering to Punycode if the locale of the user is nowhere near a Cyrillic region. But that is only going to patch some edge cases and miss others.
Ideally, the solution is password managers everywhere, which don't have this vulnerability, instead of relying on human eyes to visually recognize web URLs, which do.
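For reference, this is what the Punycode form looks like. The sketch below uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) and a deliberately blunt policy that ignores locale entirely: show the ASCII form whenever the hostname contains any non-ASCII character. `display_form` is a hypothetical helper, and real browsers use more selective heuristics than this.

```python
def to_ascii(hostname: str) -> str:
    """ASCII (Punycode) form of a hostname via the built-in idna codec."""
    return hostname.encode("idna").decode("ascii")


def display_form(hostname: str) -> str:
    """Blunt policy: keep pure-ASCII names, show everything else as Punycode."""
    return hostname if hostname.isascii() else to_ascii(hostname)


if __name__ == "__main__":
    spoof = "\u0430pple.com"        # Cyrillic U+0430 followed by Latin "pple.com"
    print(display_form("apple.com"))  # apple.com
    print(display_form(spoof))        # xn--pple-43d.com
```

A password manager effectively does the same comparison on the underlying code points rather than the pixels, which is why it is not fooled by the lookalike.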
> I imagine the browser could take some context clues and switch rendering to puny code if the locale of the user is nowhere near a cyrillic region.
Anyone reading this - please, please, please do not make any assumptions based on the end-user's geography.
Signed, someone who can cross 3 national and 4 language borders within a few hours of driving.
The article mentions this only briefly, but browsers already do this kind of heuristic protection! See https://en.wikipedia.org/wiki/IDN_homograph_attack#Defending... or https://chromium.googlesource.com/chromium/src/+/main/docs/i... for a Chrome-specific blog post.
I think the lack of exploration of the context around the problem and current mitigations is an issue with the article - it spends a lot of time talking about the possible threat, but very little time on whether the attack is actually practical with modern mitigations.
Not to mention it would only apply to clicking spoofed links. Unless the keyboard mapping was compromised, those letters won’t be typed.
>> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical
Here you go:
https:// аррlе.соm
(using English "l" and "m" here, Russian м looks different)
Was it a demo site? The font looks very wonky, not sure if I should copy-paste from it.
This seems misguided. The fact that 'ρ' isn't a pixel for pixel match for 'p' doesn't mean they're not confusable. The threat model is not being unable to solve a spot-the-difference puzzle. Unless you are familiar with every pixel of your system fonts, and carefully scrutinize every character on your screen, the lack of an exact match in jρmorgan[.]com in a URL is going to do very little for you. There are many english characters that have multiple totally distinct ways to write them, so you can have two 'a' variants that are distinct but equally 'normal' looking. I guess if you get an LLM to write your blog posts they don't have to make much sense to begin with.
To be fair, the correlation threshold they used was 0.7 for confusable, and 0.3-0.7 for contextually confusable. But I definitely would have liked to see some examples of glyph pairs at around 0.5 correlation. And at small font sizes realistic in actual threat scenarios.
This is really cool. I loved the technical breakdown and side by side comparisons. Surprised to hear that Microsoft and MacOS default fonts didn't score so well!
Oof, I couldn't get far in this; the font is giving me motion sickness somehow.
Was that the intention?
Why are all the descending letters truncated in the titles? Not sure if it's a css glitch or terrible font choice. A bit ironic on an article about fonts.
It appears to be part of the font[0]. It looks a bit weird, but display fonts usually can get away with being more eccentric.
well, you didn't really do anything, did you? Claude Code rendered these things and wrote the blog post haha
> "This is not theoretical. It is a measured property of the font files shipping on every Mac."
some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.
I don't know how people read this sort of LLM output without their eyes glazing over and tuning it out. Every blog post authored or substantially edited by Claude sounds the same sort of vaguely pompous and stilted, surely people are bored of it by now? But apparently not.
Going off on a bit of a tangent here..
> some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.
The problem for them is the market. Those who actually want to buy AI detection tools usually want the impossible - detecting any kind of AI-written text, or even AI-written-human-edited text.
You're right in that many HN articles (not going to comment on this one specifically) are very easy to detect. But that's just because these article writers are too lazy to even use any of the plethora of tools that remove the smells automatically, or tools that write without them in the first place (I've made such a tool myself), or even just adjusting the prompt to write in a different style that avoids them.
Most people who would be interested in paying for AI detection tools want them to detect all of the above cases too, which is of course impossible.
Yes, some patterns of speech are recognizable … The "That's LLM generated" pattern is one of those. And while I can understand the motivation behind this, I now find it more irritating than LLM texts, as long as those contain useful information that makes me curious.
This text made me curious, I liked the approach the author has taken. And it made me think how I would do it. My first idea would be to use ImageMagick to render text and then use ImageMagick's https://imagemagick.org/script/compare.php to somehow calculate the risk of confounding glyphs.
So: Don't be snarky? Maybe we need another rule here, to limit comments on "LLM style" https://news.ycombinator.com/newsguidelines.html
However it was written, it’s a useful and well structured article. I thought it was a good read
I mean, no shit Sherlock, Cyrillic letters being indistinguishable from English ones is what Russian speakers have been using to get around braindead keyword сеnsоrshір¹ forever, same way kids type "de@th" on TikTok to avoid automoderation.
Most of the added value in this article can be summed up by saying that the Cyrillic glyphs are identical to the similar English ones in the fonts that the author looked at (which isn't true for all fonts), and the author didn't find many other such examples.
_______
¹ Try matching that word with "censorship" for fun
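Taking the footnote up on its challenge: a naive substring check fails, but folding lookalikes to a common "skeleton" before matching catches it. The Python sketch below uses a tiny hand-written fold table, which is an illustrative assumption and nowhere near the full Unicode confusables data; `fold` and `contains_keyword` are hypothetical names.

```python
# Illustrative subset of Cyrillic -> Latin lookalikes (assumption: a real
# filter would load the full Unicode confusables data instead).
FOLD = {
    "\u0430": "a",
    "\u0441": "c",
    "\u0435": "e",
    "\u043e": "o",
    "\u0440": "p",
    "\u0445": "x",
    "\u0443": "y",
    "\u0456": "i",
    "\u0455": "s",
    "\u0458": "j",
}


def fold(text: str) -> str:
    """Lowercase the text and replace known lookalikes with Latin letters."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())


def contains_keyword(text: str, keyword: str) -> bool:
    """Keyword match on folded text instead of raw code points."""
    return fold(keyword) in fold(text)


if __name__ == "__main__":
    # The footnoted word: Cyrillic c, e, o, i, p mixed with Latin n, s, r, s, h.
    disguised = "\u0441\u0435ns\u043ersh\u0456\u0440"
    print("censorship" in disguised)                  # False: raw match misses it
    print(contains_keyword(disguised, "censorship"))  # True: folded match finds it
```

The obvious cost is false positives on legitimate Cyrillic text, since ordinary Russian words made of these letters also fold to Latin strings.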
[flagged]
Maybe not. I checked OP's blog and he seems to be putting up 2-3 longer posts per day. Since it is LLM content, I have no idea whether it's mainly hallucinations or based on facts. So what did I learn from reading the article? Maybe nothing, maybe it's just made up.
If you have a Mac you can follow the steps at the end of the post and reproduce the results https://paultendo.github.io/posts/confusable-vision-visual-s...
I don't have a Mac.
This is very cool, impressive piece of work Paul.