WorstFit: Unveiling Hidden Transformers in Windows ANSI

blog.orange.tw

373 points by notmine1337 6 months ago

This is a tough one. It’s systemic —- MS provides a “best fit” code mapping from wide Unicode to ASCII, which is a known, published, “vibes-based” mapper. This best fit parser is used a lottt of places, and I’m sure that it’s required for ongoing inclusion based on how MS views backward compatibility. It’s linked in by default everywhere, whether or not you know you included it.

The exploits largely revolved around either speccing an unusual code point that “vibes” into say a slash or a hyphen or quotes. These code points are typically evaluated one way (correct full Unicode evaluation) inside a modern programming language, but when passed to shell commands or other Win32 API things are vibes-downed. Crucially this happens after you check them, since it’s when you’ve passed control.

To quote the curl maintainer “curl is a victim” here — but who is the culprit? It seems certain that curl will be used to retrieve user supplied data automatically by a server in the future. When that server mangles user input in one way for validation and another when applied to system libraries, you’re going to have a problem.

It seems to me like maybe the solution is to provide an opt-out of “best fit” munging in the Win32 space, but I’m not a Windows guy, so I speculate. At least then open source providers could just add the opt out to best practices, and deal with the many terrible problems that things like a Unicode wide variant of “ or \ delivers to them.

And of course even if you do that, you’ll interact with officially shipped APIs and software that has not opted out.

wongarsu - 6 months ago

The opt-out is to use the unicode windows APIs (the functions ending in "w" instead of "a"). This also magically fixes all issues with paths longer than 260 characters (if you add a "\\?\" prefix or set you manifest correctly), and has been available and recommended since Windows XP.
I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.
- comex - 6 months ago
  
  _Or_ set your application to use UTF-8 for the "A" APIs. Apparently this is supported as of a Windows 10 update from 2019. [1]
  [1] https://learn.microsoft.com/en-us/windows/apps/design/global...
  - asveikau - 6 months ago
    
    It should have been supported approximately 20 years earlier than that. I was coding against Win32 looong before 2019 and wondering for years why they wouldn't let you.
    An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
    
    cesarb - 6 months ago
    
    > An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
    What I've heard is that the issue is not with the CRT, but with applications using fixed-size byte buffers. IIRC, converting from UTF-16 to any of the traditional Windows code pages requires at most two bytes for each UTF-16 code unit, while the UTF-8 "code page" can need three bytes. That would lead to buffer overflows in these legacy applications if the "ANSI" code page was changed to UTF-8.
    
    pjmlp - 6 months ago
    
    Not sure what that has to do with CRT, given that it isn't part of Win32.
    
    garganzol - 6 months ago
    
    CRT in a form of msvcrt.dll file had a de-facto presence in Windows since the end of 1990s. Later on, since 2018 or so, CRT availability was formalized in Windows API in form of ucrtbase.dll module.
    
    ygra - 6 months ago
    
    msvcrt was never for applications to use: https://devblogs.microsoft.com/oldnewthing/20140411-00/?p=12...
    
    okanat - 6 months ago
    
    The bundled one with Windows wasn't. However the same "feature" exists in redistributed versions of msvcrt.
    
    pjmlp - 6 months ago
    
    Which doesn't change the fact that Win32 doesn't depend on it.
    
    okanat - 6 months ago
    
    It is extremely hard to create an application that doesn't depend on CRT on Windows. CRT provides tables for handlers of SEH exceptions and provides the default event handlers. Win32 headers have hard dependencies of the handler tables CRT provides. So you need to go quite a bit out of your way to hack deep Win32 headers. Loading DLLs etc also may call CRT functions.
    You can read Mingw64 source to see how many hacks they had to do to make it work.
    
    ChrisSD - 6 months ago
    
    That's the "vcruntime" not the "ucrt". There has been a distinction since the ucrt was made an official part of the OS.
    It's very easy to make a win32 program without the ucrt filesytems APIs so long as you don't mind being platform-specific (or making your own cross-platform wrappers).
    
    pjmlp - 6 months ago
    
    I have been developing for Microsoft platforms since MS-DOS 3.3, Win16 and Win32 development without any function from standard C library has been a thing for decades, for those doing C development without caring about portability, like demoscene competitions and gaming.
    Using C++ is another matter.
    
    asveikau - 6 months ago
    
    It's still an important piece of the app compatibility story.
  - kazinator - 6 months ago
    
    Does that mean that in this UTF-8 mode, GetCommandLineA would, when the full-width double quote occurs in the command line, return the UTF-8 bytes for that double quote, rather than steamrolling it to an ASCII double quote with the WorstFit mapping?
    
    iforgotpassword - 6 months ago
    
    Yes, I wanted to suggest the same. I modified some old tools I wrote 15 years ago to do that a while ago. Not because I was aware of any vulnerability, but because a few places still used char* and I figured this would basically make it never fail with any weird filenames regardless of the code page.
    So now it seems even if you think your app is fully Unicode, still do this just in case? :)
    
    account42 - 6 months ago
    
    > I figured this would basically make it never fail with any weird filenames regardless of the code page.
    Windows filenames are not guaranteed to be valid UTF-16 so A functions with UTF-8 code page can still fail to access some files. If you want 100% compatibility you need to realize that Windows is a WTF-16 system and make your own compatibility wrappers for the W functions under that assumption.
    
    kazinator - 6 months ago
    
    It sounds like something Cygwin ought to do across their ecosystem.
  - account42 - 6 months ago
    
    UTF-8 ACP might fix these exploits but it doesn't fix the root issue that your application encoding can't represent the whole internal system encoding (WTF-16, NOT UTF-16 despite what it claims).
- Sharlin - 6 months ago
  
  As mentioned elsewhere in this discussion, 99% of the time the cause is likely the use of standard C functions (or C++ `std::string`…) instead of MS's nonstandard wide versions. Which of course is a ubiquitous practice in portable command-line software like curl.
  - smatija - 6 months ago
    
    A lot of details is in linked curl hackerone: https://hackerone.com/reports/2550951
  - account42 - 6 months ago
    
    std::string is not an issue, how you get strings from the environment into it is.
    You can use W functions and convert the WTF-16 strings you get to WTF-8 and use that in std::string without problems.
  - pishpash - 6 months ago
    
    So the culprit is still the software writer. They should have wrapped the C++ library for OS-specific behavior on Windows. Because they are publishing buggy software and calling it cross-platform.
    
    bayindirh - 6 months ago
    
    curl first released in 1996, shortly after Windows 95 has born and runs on numerous Windows versions even today. So, how many different versions shall be maintained? Are you going to help one of these versions?
    On top of that, how many new gotchas these “modern” Windows functions hide, and how many fix cycles are required to polish them to the required level?
    
    thrdbndndn - 6 months ago
    
    If we're talking about curl specifically, I absolutely think they would (NOT "should") fix/workaround it if there are actually common problems caused by it.
    Yes it would have required numerous fix cycles, but curl in my mind is such a polished product and they would have bit the bullet.
    
    bayindirh - 6 months ago
    
    You're right, if the problems created by this are big enough, the team will fix them without any fanfare and whining.
    However, in neither case this is a shortcoming of curl. They'd be responding to a complicated problem caused by the platform they're running on.
    
    8n4vidtmkvmk - 6 months ago
    
    Why would/should they? I've never paid for curl. Who even develops it? Sounds like a thankless job to fix obscure worstfit bugs.
    
    bayindirh - 6 months ago
    
    > Why would/should they?
    Because they care. That's it.
    > I've never paid for curl.
    I'm sure people who develop it doesn't want money and fame, but they're just doing what they like. However, curl has commercial support contracts if you need.
    > Who even develops it?
    Daniel Stenberg et al. Daniel can be found at https://daniel.haxx.se.
    > Sounds like a thankless job to fix obscure worstfit bugs.
    It may look thankless, but it's not. curl is critical infrastructure at this point. While https://xkcd.com/2347/ applies squarely to cURL, it's actually nice that the lead developer is making some money out of his endeavor.
    
    thrdbndndn - 6 months ago
    
    Why would they develop curl at all by your logic?
    They fix bugs because they simply want their product to be better, if if I were to take a guess? Like, I'm sure curl's contributors worked on OS-specific problems before, and it wouldn't be the last.
    > to fix obscure worstfit bugs.
    Again my premise is "if there are actually common problems caused by it". This specific bug does sound like that, at least not for now.
    
    account42 - 6 months ago
    
    I'm sure those software writers will be happy to refund your purchase. It's not their fault that Microsofts standard C implementation is faulty.
    
    - 6 months ago
    
    [deleted]
- Thorrez - 6 months ago
  
  >I'm not sure why the non-unicode APIs are still so commonly used.
  Even argv is affected on Windows. That's part of the C and C++ standard, not really a Windows API. Telling all C/C++ devs they need to stop using argv is kind of a tough ask.
- shakna - 6 months ago
  
  You also have to use wmain instead of main, with a wchar_t argv, otherwise the compiled-in argparser will be calling the ANSI version. In other words... Anyone using MSVC and the cross-platform standardised and normal C system, are hit by this.
  Oh, and wmain is a VisualC thing. It isn't found on other platforms. Not standardised.
  - mort96 - 6 months ago
    
    Writing cross platform code which consistently uses UCS-2 wchar_t* on Windows and UTF-8 char* on UNIX-like systems sounds like absolute hell
    
    account42 - 6 months ago
    
    It's not that bad really - you just convert at the win32 API call boundary.
    Also, it's not UCS-2. Also not UTF-16. Windows uses WTF-16 internally and if you want 100% compatibility that's what you need to target.
    
    lmz - 6 months ago
    
    A wchar_t "native" libc implementation would be an interesting thing.
- vessenes - 6 months ago
  
  I think the issue is that native OS things like the windows command line, say, don’t always do this. Check the results of their ‘cd’ commands with Japanese Yen characters introduced. You can see that the path descriptor somehow has updated to a directory name with Yen (or a wide backslash) in it, while the file system underneath has munged, and put them into an actual directory. It’s precisely the problem that you can’t control the rest of the API surface to use W that is the source of the difficulties.
- ack_complete - 6 months ago
  
  Using \\?\ has a downside: since it bypasses Win32's path processing, it also prevents relative paths like d:test.txt from working. Kind of annoying on the command line with tools like 7z.exe.
  - account42 - 6 months ago
    
    Sounds more like an upside TBH.
- cesarb - 6 months ago
  
  > I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.
  Nowadays, it's either for historical reasons (code written back when supporting Windows 9x was important, or even code migrated from Windows 3.x), or out of a desire to support non-Windows systems. Most operating systems use a byte-based multi-byte encoding (nowadays usually UTF-8) as their native encoding, instead of UTF-16.
- asveikau - 6 months ago
  
  I share your recommendations of always using PWSTR when using windows apis.
  > I'm not sure why the non-unicode APIs are still so commonly used
  I think because the rest of the C world uses char* with utf-8, so that is what people are habituated to. Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
  - account42 - 6 months ago
    
    > Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
    It wouldn't solve all encoding problems though because most Windows APIs can store/return invalid UTF-16 which you can't represent in CP_UTF8 - you'd need a CP_WTF8 for that which doesn't even exist so you have to use the W APIs and do the conversion yourself.
- account42 - 6 months ago
  
  > I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.
  A lot of the uses are indirectly via standard C API functions that are effectively broken on Windows but work just well enough (i.e. work with ASCII) that their use goes unnoticed when someone ports something to Windows.
- cryptonector - 6 months ago
  
  > I'm not sure why the non-unicode APIs are still so commonly used.
  Simple: portable code meant to run on Unix (where UTF-8 is king) and Windows -> want to use UTF-8 codepage on Windows and the "A" APIs.
- cryptonector - 6 months ago
  
  The other opt-out might be to opt into UTF-8 support for the "A" functions.
- p_ing - 6 months ago
  
  > paths longer than 260 characters (if you add a "\\?\" prefix or set you manifest correctly)
  A long ago released build of Windows 10 did this automatically so no need for adjustments anymore, 32k is the max....
  ...except for Office! It can't handle long paths. But Office has always been hacky (the title bar, for example).
captainmuon - 6 months ago

Windows has a way of opting out of legacy behavior since Windows XP - manifest files. If you don't include a manifest, even GetWindowsVersion will not return the current version IIRC. It should be not too hard to add an opt-out in there (and at some point make it default in Visual Studio).
I think what is also needed is some kind of linting - there is usually no need to call ANSI WinAPI functions in a modern application (unless you set the locale to UTF-8 and only use the 8-bit functions, but I don't know how well that works). I think there are also a couple of settings and headers to include to make everything "just work" - meaning argv, printf and std::cout work with UTF-8, you get no strange conversions, and you just have functions to convert between UTF-8 and UTF-16 to use WinAPI. I'm pretty sure I have a Visual Studio project lying around somewhere where it works. But all those steps necessary need to be documented and put in one place by MS.
- account42 - 6 months ago
  
  > If you don't include a manifest, even GetWindowsVersion will not return the current version IIRC.
  Worse than that, even reading relevant registry keys will be faked.
- Arwill - 6 months ago
  
  Using UTF8 internally and converting strings for W API calls is a way to gain some performance.
  - cryptonector - 6 months ago
    
    More like it's a way to keep your Windows port code to a minimum so that the rest can run on Unix. I.e., you want to use UTF-8 because that's the standard on Unix, and you don't want to have completely different versions of your code for Windows and Unix because now you have twice the maintenance trouble.
  - account42 - 6 months ago
    
    *WTF-8 unless you want to not be able to handle all possible filenames.
account42 - 6 months ago

> To quote the curl maintainer “curl is a victim” here — but who is the culprit?
Security vulnerability or not, it's a bug with curl on windows as it doesn't correctly handle unicode arguments.
UltraSane - 6 months ago

The loosey-goosey mapping of code points to characters has always bothered me about Unicode.
- cryptonector - 6 months ago
  
  This isn't about Unicode having "loosey-goosey" anything. It's about aa mapping that Microsoft came up with to map Unicode to non-Unicode.
  - SAI_Peregrinus - 6 months ago
    
    Yeah, they could have mapped code points to their textual descriptions. That'd require reallocations, but converting ＂to UNICODE_FULLWIDTH_QUOTATION_MARK_U+FF02 would be unambiguous. Ugly, but obvious what happened. Better than � IMO!
    
    cryptonector - 6 months ago
    
    Since there's two possible antecedents for "they" (the Unicode Consortium, and Microsoft) here you'll have to clarify. Also, my question really was for u/UltraSane.
    Microsoft should just never have created Best-Fit -- it's a disaster. If you have to lose information, use an ASCII character to denote loss of information and be done. (I hesitate to offer `?` as that character.) Or fail to spawn the process with an error indicating the impossibility of transcoding. Failure is better actually.
    
    SAI_Peregrinus - 6 months ago
    
    Oh, failure is way better. But a lot of the original APIs didn't have failures, they just returned a value, and MS doesn't like to break backwards compatibility even when it'd be easier for everyone if they did.
    For "they", I mean MS could have made BestFit work as follows: if an input string contains characters not in the user's code page, then return a new string with characters replaced by with the name of that code point as assigned by the Unicode consortium (and maybe also the textual code point number U+<number>). This requires a new allocation and copies of the parts of the string not needing replacement, but loses no information and creates no security holes.
    
    cryptonector - 6 months ago
    
    CreateProcess() and related APIs can fail.
    
    account42 - 6 months ago
    
    CreateProcess() doesn't know what functions the resulting process will use to access the command-line and environment.

mmastrac - 6 months ago

This is kind of unsurprising, but still new to me even as someone who did Windows development (and some Wine API hacking) for a decade around when this W/A mess came about.

Windows is like the card game Munchkin, where a whole bunch of features can add up to a completely, unbelievably random over-powered exploit because of unintentional synergy between random bits.

I'm happy to see that they are converting the ANSI subsystem to UTF-8, which should, in theory, mitigate a lot of these problems.

I wonder if the Rust team is going to need YetAnotherFix to the process spawning API to fix this...

ChrisSD - 6 months ago

Rust's standard library basically never uses ANSI APIs. The article doesn't demonstrate any attack that works against Rust. If they do have one I'd highly recommend reporting it.
Of course, Rust can't control what happens on the other side of side of a process boundary. So if an application invoked by Rust uses ANSI APIs then they'll have a problem. But also that's their responsibility.
- okanat - 6 months ago
  
  What about the entry point? Because one of the issues mentioned in the article is about mainCRTStartup calling an ANSI API. Most of the Rust programs are linked with the C runtime. Does Rust make sure that the C library initialization is also done in Unicode APIs?
  - ChrisSD - 6 months ago
    
    No but it doesn't use C library values so it doesn't matter. E.g. getting the command line arguments is done via calling `GetCommandLineW` so it doesn't use argv or argc.
    This is actually necessary because Rust cannot assume it owns the entry point. E.g. a Rust library could be called from a C++ application or in a DLL, etc. So when someone calls `std::env::args` it asks the OS directly for the arguments instead of getting them from C.

Joker_vD - 6 months ago

> the only thing we can do is to encourage everyone, the users, organizations, and developers, to gradually phase out ANSI and promote the use of the Wide Character API,

This has been Microsoft's official position since NT 3.5, if I remember correctly.

Sadly, one of the main hurdles is the way Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented. Its non-standard "wide" functions like _wfopen(), _wgetenv(), etc. internally use W-functions from Win API. But the standard, "narrow" functions like fopen(), getenv(), etc., instead of using the "wide" versions and converting to-from Unicode themselves (and reporting conversion failures), simply use A-functions. Which, as you see, generally don't report any Unicode conversion failures but instead try to gloss over them using best-fit approach.

And of course, nobody who ports software, written in C, to Windows wants to rewrite all of the uses of standard functions to use Microsoft's non-portable functions because at this point, it becomes a full-blown rewrite.

terinjokes - 6 months ago

The position I got reading documentation Microsoft has written in the last two years is the opposite: set activeCodePage in your application manifest to UTF-8 and only ever use the "ANSI" functions.
- ziml77 - 6 months ago
  
  Yes that does seem to be the way going forward. Makes it a lot easier to write cross-platform code. Though library code still has to use the Wide Character APIs because it's up to the application as a whole to opt into UTF-8. Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.
  - terinjokes - 6 months ago
    
    Many libraries I've encountered have defines available now to use the -A APIs; previously they were using -W APIs and converting to/from UTF-8 internally.
    As for my application, any wchar conversions being done by the runtime are a drop in the bucket compared to the actual compute.
  - account42 - 6 months ago
    
    > Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.
    OTOH you need ~twice as much memory / copy ~twice as much data around than if you converted to WTF-8 internally.
- Joker_vD - 6 months ago
  
  Ah, so they've finally given up? Interesting to hear. But I guess the app manifests does give them a way to move forward this way while maintaining the backward-compatible behaviour (for apps without this setting in their manifests).
- dataflow - 6 months ago
  
  Despite whatever Microsoft may seem to be suggesting, you don't want to do this. Just use the wide APIs. Lots of reasons why UTF-8'ing the narrow APIs is a bad idea:
  - The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.
  - You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
  - Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).
  - Compatibility with previous Windows versions, obviously.
  - Performance
  - cesarb - 6 months ago
    
    > You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
    I want to emphasize this point. From what I've heard, on Windows it's very common for DLLs from who knows where to end up loaded in your process. Not only the things you'd also find on other operating systems like the user-space component of graphics APIs like OpenGL and Vulkan, but also things like printer drivers, shell extensions, "anti-malware" stuff, and I've even heard of things like RGB LED control software injecting their DLLs into every single process. It's gotten so bad that browsers like Firefox and Chrome use fairly elaborate mechanisms to try to prevent arbitrary DLLs from being injected into their sandbox processes, since they used to be a common source of crashes.
  - account42 - 6 months ago
    
    > The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.
    There's WTF-8 - too bad that's not what Microsoft chose to use for their universal 8-bit codepage.
    
    - 6 months ago
    
    [deleted]
  - cryptonector - 6 months ago
    
    Disagree. At least in the context of Unix utilities portable to Windows. We are NOT going to be forking those to use wchar_t on Windows and char on Unix -that's a non-starter- and we're also not going to be switching to wchar_t on both because wchar_t is a second-class citizen on Unix.
    Using UTF-8 with the "A" Windows APIs is the only reasonable solution, and Microsoft needs to commit to that.
    > - The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.
    This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix. So you're damned if you do and damned if you don't, so just accept this loss and say "don't do that".
    > - You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
    In some cases you do have such control, but if some DLL unknown to you uses "W" APIs then.. it doesn't matter because if it's unknown to you then you're not interacting with it, or if you are interacting with it via another DLL that is known to you then it's that DLL's responsibility to convert between char and wchar_t as needed. I.e., this is not your problem -- I get that other people's bugs have a way of becoming your problem, but strictly speaking it's their problem not yours.
    > - Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).
    True, but these can be wrapped with code that converts as needed. This is a lot better from a portability point of view than to fork your entire code into Windows and Unix versions.
    > - Compatibility with previous Windows versions, obviously.
    Sigh. At some point people (companies, contractors/consultants, ...) need to put their feet down and tell the U.S. government to upgrade their ancient Windows systems.
    > - Performance
    The performance difference between UTF-8 and UTF-16 is in the noise, and it depends greatly on context. But it doesn't matter. UTF-8 could be invariably slower than UTF-16 and it would still be better to move Windows code to UTF-8 than to move Unix to UTF-16 or lose portability between Windows and Unix.
    In case you and others had not noticed Linux has a huge share of the market on servers while Windows has a huge share of the market on laptops, which means that giving up on portability is not an option.
    The advice we give developers here has to include advice we give to developers who have to write and look after code that is meant to be portable to Windows and Unix. Sure, if you're talking to strictly-Windows-only devs, the advice you give is alright enough, but if later their code needs porting to Unix they'll be sad.
    The reality is that UTF-8 is superior to UTF-16. UTF-8 has won. There's just a few UTF-16 holdouts: Windows and JavaScript/ECMAScript. Even Java has moved to UTF-8. And even Microsoft seems to be heading in the direction of making UTF-8 a first-class citizen on Windows.
    
    account42 - 6 months ago
    
    > This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix. So you're damned if you do and damned if you don't, so just accept this loss and say "don't do that".
    The problem is that you can't roundtrip all filenames. CP_UTF8 doesn't solve that only pretends to. For a full solution you need to use the W functiosn and then convert between WTF-16 and WTF-8 yourself.
    
    dataflow - 6 months ago
    
    Hard disagree:
    > At least in the context of Unix utilities portable to Windows. We are NOT going to be forking those to use wchar_t on Windows and char on Unix -that's a non-starter- and we're also not going to be switching to wchar_t on both because wchar_t is a second-class citizen on Unix.
    Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)
    Or you could e.g. get upstream to start caring about their users on other platforms, and play ball.
    > This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix.
    Er, no. First, if you're actually writing portable code, TCHAR is the solution, not wchar_t. Second, if you can't fork others' code, at the very least you can produce errors to avoid silent bugs (see above). And finally, "this problem also exists with char" is just wrong. In a lot of cases the problem doesn't exist as long as you're using the same representation and avoiding lossy conversion, whatever the data type is. If (say) the file path is invalid UTF, and you save it somewhere and reuse it, or pass it to some program and then have it passed back to you, you won't encounter any issues -- the data is whatever it was. The issues only come up with lossy conversions in any direction.
    > if some DLL unknown to you uses "W" APIs then.. it doesn't matter because if it's unknown to you then you're not interacting with it, or if you are interacting with it via another DLL
    I don't think you're understanding the problem here. Interaction is not part of the picture at all. You might not be loading the DLL yourself at all. DLLs get loaded by the OS and user for all sorts of reasons (antiviruses, shell extensions, etc.) and they easily run in the background without anything else in the process "knowing" anything about the at all. Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.
    > Sigh. At some point people (companies, contractors/consultants, ...) need to put their feet down and tell the U.S. government to upgrade their ancient Windows systems.
    USG? Ancient? These are systems less than 10 years old. We're not talking floppy-controlled nukes here.
    > The performance difference between UTF-8 and UTF-16 is in the noise, and it depends greatly on context.
    "Depends greatly on the context" kinda makes my point. It can turn a zero-copy program into single- or double-copy. Generally not a showstopper by any means, but it sure as heck can impact some programs. And if that program is a DLL people use - well now you can't work around. (Yes, there's a reason I listed this last. But there's a reason I listed it at all.)
    > The reality is that UTF-8 is superior to UTF-16. UTF-8 has won.
    The reality is Windows isn't UTF-16 and nix isn't UTF-8, which was the crux of most of my points.
    
    account42 - 6 months ago
    
    > Er, no. First, if you're actually writing portable code, TCHAR is the solution, not wchar_t.
    TCHAR is a Microsoftism, it's NOT portable at all.
    
    dataflow - 6 months ago
    
    I didn't mean "portable" in the same sense you're using it. Maybe "cross-platform", if you will. Or insert whatever word you want that would get my point across.
    
    cryptonector - 6 months ago
    
    > Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)
    That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.
    > Or you could e.g. get upstream to start caring about their users on other platforms, and play ball.
    The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it.
    Microsoft has to make UTF-8 a first-class citizen.
    > I don't think you're understanding the problem here. Interaction is not part of the picture at all. You might not be loading the DLL yourself at all. DLLs get loaded by the OS and user for all sorts of reasons (antiviruses, shell extensions, etc.) and they easily run in the background without anything else in the process "knowing" anything about the at all. Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.
    You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.
    > "Depends greatly on the context" kinda makes my point. It can turn a zero-copy program into single- or double-copy. Generally not a showstopper by any means, but it sure as heck can impact some programs. And if that program is a DLL people use - well now you can't work around. (Yes, there's a reason I listed this last. But there's a reason I listed it at all.)
    I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.
    > The reality is Windows isn't UTF-16 and nix isn't UTF-8, which was the crux of most of my points.
    Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.
    
    terinjokes - 6 months ago
    
    >> Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)
    > That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.
    I found out about activeCodePages thanks to developers of those compatibility layers documenting the option and recommending it over their own solutions.
    > The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it
    The project I work on is an MFC application stemming from 9x and early XP and abandoned for 15 years. Before I touched it it had no Unicode support at all. I'm definitely not being paid to work on it, let alone the effort to convert everything to UTF-16 when the tide seems to be going the other direction.
    > Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.
    Programs much, much, much more popular than mine written by the largest companies in the world, and many programs you likely use as a developer on Windows, set activeCodePage to UTF-8. Notwithstanding the advice in the article to set it globally for all applications (and it implies it already is the default in some locales). Those DLLs will be upgraded, removed, or replaced.
    
    Joker_vD - 6 months ago
    
    Forget it, you ain't gonna make Linux-centric open-source community to really care about Windows (or other un-POSIX-like OSes, of which today there is almost none). The others have to give in and accomodate to their ways if those others want to use their code.
    And since Windows-centric developers, when porting their apps to Linux, are generally willing to accomodate for Linux-specific idiosyncrasies (that's what porting is about, after all) if they care abour that platform enough, the dynamic will generally stay the same: people porting from Windows to Linux will keep making compatibility shims, people porting from Linux to Windows will keep telling you "build it with MinGW or just run it in WSL2, idgaf".
    
    dataflow - 6 months ago
    
    > That's akin to writing a partial C library.
    Not really. It's just writing an encoding layer for the APIs. For most APIs it doesn't actually matter what they're doing at all; you don't have to actually care what their behaviors are. In fact you could probably write compiler tooling to do automatically analyze the APIs and generate code for most functions so you don't have to do this manually.
    > If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.
    "Well enough" as in, with all the warts I'm pointing out? Their current solution is all-or-nothing for the whole process. They haven't provided a module-by-module solution and I don't expect them to. They haven't provided a way to avoid information loss and I don't expect them to.
    > You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.
    "Changing" the code page? No, I'm just saying any DLL that calls FooA() without realizing FooA() can now accept UTF-8 could easily break. You're just praying that they don't.
    > I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.
    Nothing? How do you say this with such confidence? What about, say, IDWriteFactory::CreateTextLayout(const wchar_t*) (to give just one random example)?
    And literally everything that interacts with other apps/libraries/etc. that use Unicode (which at least includes the OS itself) will have to encode/decode. Like the console, clipboard, or WM_GETTEXT, or whatever.
    The whole underlying system is based on 16-bit code units. You're going to get a performance hit in some places, it's just unavoidable. And performance isn't just throughput, it's also latency.
    > Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.
    Yes, and you completely missed the point. I was replying to your claim that "UTF-8 has won" over UTF-16. I was pointing out that what you have here is neither UTF-8 on one side nor UTF-16 on the other. Going with who "won" makes no sense when neither is the one you're talking about, and you're hitting information loss during conversions. If you were actually dealing with UTF-16 and UTF-8, that would be a very different story.
- SleepyMyroslav - 6 months ago
  
  In gamedev a lot of people read those docs but not a lot of them shipped anything using it. The reason is that file paths are not everything that has A/W versions. There is user input, window message handling ... The API is a maze.
  I really would like to learn otherwise. But when I have to suggest fixes my old opinion stays. Dropping any C runtime use and going from API macro or A version to W is the solution to all weird and hard to repro problems on platforms from Ms.
- 7bit - 6 months ago
  
  Not a Programmer. Wouldn't manifests risk the application breaking, if the manifest is not copied with the exe file? As a power user, I see the manifests sometimes, but honestly ,if I download e.g., bun.exe I would just copy the bun.exe without any manifest that the downloaded archive would contain.
  That does not sound like a good solution.
  - lmz - 6 months ago
    
    You can embed manifests in the exe.
    
    terinjokes - 6 months ago
    
    Expanding on this a bit, if the manifest is available at compile time it's included as a resource in the executable by the RC resource compiler. You can embed a manifest into an existing executable with mt.exe. Embedding the application manifest is recommended.
    If you can't embed it for some reason, then you can distribute the application manifest side-by-side with the executable by appending ".manifest" to the binary filename. In this case probably already have defensive checks for other resources not being found if a user copies just the exe, and if not can add one and exit.
masfuerte - 6 months ago

In my portable code I #define standard functions like main and fopen to their wide equivalents when building on Windows.
This does mean I can't just use char* and unadorned string literals, so I define a tchar type (which is char on Linux and wchar_t on Windows) and an _T() macro for string literals.
This mostly works without thinking about it.
dblohm7 - 6 months ago

What really annoys me these days is that if you search for a Win32 API on Google, it will always come up with the -A variant, not the -W variant. I don't know if they've got something weird in their robots.txt or what, but I find it bizarre that an API whose guidelines desire developers to use the -W variants in all greenfield code, instead returns the legacy APIs by default.
- ack_complete - 6 months ago
  
  They did a strange reorg of the API docs at one point. Not only does it now have functions split by A/W (mostly unnecessarily), it also groups them by header file instead of feature reference, which is kind of annoying. It used to be just that the function doc would note at the bottom if A/W variants were present and they were grouped under Functions in the feature/subsystem area of the docs tree.
  - dblohm7 - 6 months ago
    
    Yeah, that new content management system is awful too -- it doesn't grok preprocessor stuff at all, so sometimes you get nonsensical struct definitions, kernel-mode structs instead of user-mode structs, etc.
delta_p_delta_x - 6 months ago

> Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented
This has been superseded by the Universal C runtime (UCRT)[1] which is C99-compliant.
- pjmlp - 6 months ago
  
  Mostly C99 compliant, some things are left out.
  https://learn.microsoft.com/en-us/cpp/c-runtime-library/comp...
  - sigsev_251 - 6 months ago
    
    I think the documentation is outdated given that C11 atomics [1] and threads [2] are available for more than a year now. Same goes for pretty much everything MSVC frontend related stuff (I've yet to try which C++23 features are supported at the moment, but they've secretly added support for C23 features like typeof and attributes, as well as GNU Statement Expressions).
    [1]: https://devblogs.microsoft.com/cppblog/c11-atomics-in-visual...
    [2]: https://devblogs.microsoft.com/cppblog/c11-threads-in-visual...
    
    pjmlp - 6 months ago
    
    Outdated documentation is pretty normal unfortunely, even .NET suffers from that nowadays.
    Not as bad as Apple nowadays though, quite far from Inside Inside Macintosh days.
    Glad to know about C23 features, as they went silent on C23 plans.
    C++23 looks quite bad for anything that requires frontend changes, there are even developer connection issues for us to tell what to prioritise, as if it wasn't logically all of it. There is another one for C++26 as well.
    Personally, I think that with the improvements on low level coding and AOT compilation from managed languages, we are reaching local optimum, where C and C++ are good enough for the low level glue, C23 and C++23 (eventually C++26 due to static reflection) might be the last ones that are actually relevant.
    Similar to how although COBOL and Fortran standards keep being updated, how many ISO 2023 revision compliant compilers are you going to find out for portable code?
    
    sigsev_251 - 6 months ago
    
    > Outdated documentation is pretty normal unfortunely, even .NET suffers from that nowadays.
    That's really unfortunate.
    > Not as bad as Apple nowadays though, quite far from Inside Inside Macintosh days.
    Funny story, I know a guy who wanted to write a personal Swift project for an esoteric spreadsheet format and the quality of the documentation of SwiftUI made him ragequit. After that, he switched to kotlin native and gtk and he is much happier.
    > Personally, I think that with the improvements on low level coding and AOT compilation from managed languages, we are reaching local optimum, where C and C++ are good enough for the low level glue, C23 and C++23 (eventually C++26 due to static reflection) might be the last ones that are actually relevant.
    I agree on the managed language thing but, I mean, the fact that other languages are getting more capable with low level resources does not mean that improvements in C/C++ are a bad idea and will not be used. In fact, I think that features like the transcoding functions in <stdmchar.h> in C2y (ironically those are relevant to the current HN post) are useful to those languages too! So even if C, C++ and fortran are just used for numerical kernels, emulators, hardware stuff, glue code and other "dirty" code advancements made to them are not going wasted.
    
    - 6 months ago
    
    [deleted]
nialv7 - 6 months ago

Windows really should provide an API that treats path names as just bytes, without any of these stupid encoding stuff. Could probably have done that when they introduced UNC paths.
- Dwedit - 6 months ago
  
  Ever since Windows 95 Long File Names for FAT, filenames have been 16-bit characters in their on-disk format. So passing "bytes" means that they need to become wide characters before the filesystem can act on them. And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)
  It is possible to request for a directory to contain case-sensitive files though, and the filesystem will respect that. And if you use the NT Native API, you have no restrictions on filenames, except for the Backslash character. You can even use filenames that Win32 doesn't allow (name with a ":", name with a null byte, file named "con" etc), and every Win32 program will break badly if it tries to access such a file.
  It's also possible to use unpaired surrogate characters (D800-DFFF without the matching second part) in a filename. Now you have a file on the disk whose name can't be represented in UTF-8, but the filename is still sitting happily in the filesystem. So people invented "WTF-8" encoding to allow those characters to be represented.
  - cesarb - 6 months ago
    
    > And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)
    AFAIK, it's even worse: it uses the rules for the locale which was in use when the filesystem was created (it's stored in the $UpCase table in NTFS, or its equivalent in EXFAT). So you could have different case-insensitive rules in a single system, if it has more than one partition and they were formatted with different locales.
    IMO, case-insensitive filesystems are an abomination; the case-insensitivity should have been done in the user interface layer, not in the filesystem layer.
    
    cryptonector - 6 months ago
    
    > IMO, case-insensitive filesystems are an abomination; the case-insensitivity should have been done in the user interface layer, not in the filesystem layer.
    Implementing case-insensitivity in a file picker or something is OK, but doing that throughout your app's runtime is insane since you'd have to hook every file open and then list the directory, whereas in a file picker you're probably listing the directory anyways.
    
    cesarb - 6 months ago
    
    The file picker is precisely where case-insensitivity should be done; the rest of the application should already have the correct file name.
    
    cryptonector - 6 months ago
    
    Though you best not have a million files in that directory...
    
    Dwedit - 6 months ago
    
    Did not know about $UpCase, the only part I knew was that the FAT16/32 driver from Microsoft (Which has the source code officially released, it's used as an example for how to implement a filesystem on Windows NT) uses locale-specific case-sensitivity tests.
    
    cesarb - 6 months ago
    
    You're right, in the case of FAT16/FAT32 it AFAIK has to use the current system locale, since unlike EXFAT or NTFS there isn't a place in the filesystem to store that locale table.
- Joker_vD - 6 months ago
  
  Windows does treat path names as just sequences of uint16_t (which is how NTFS stores them) if you use W-functions and prepend the paths with "\\?\".
  - nialv7 - 6 months ago
    
    oh, that's interesting. do UNC paths not have to be valid UTF-16?
    
    Dwedit - 6 months ago
    
    "\\?\" is strange, because it looks just like a UNC path. But it actually isn't. It's actually a way for Win32 programs to request a path in the NT Object Namespace.
    What's the NT Object Namespace? You can use "WinObj" from SysInternals to see it.
    The NT Object Namespace uses its own special paths called NT-Native paths. A file might be "C:\hello.txt" as a Win32 path, but as an NT-Native path, it's "\??\C:\hello.txt". "\??\" isn't a prefix, or a escape or anything like that. It's a real directory sitting in the NT Object Namespace named "\??", and it's holding symbolic links to all your drive letters. For instance, on my system, "\??\C:" is a symbolic link that points to "\Device\HarddiskVolume4".
    Just like Linux has the "/dev/" directory that holds devices, the NT Object Namespace has a directory named "\Device\" that holds all the devices. You can perform File IO (open files, memory map, device IO control) on these devices, just like on Linux.
    "\??\" in addition to your drive letters, also happens to have a symbolic link named "GLOBALROOT" that points back to the NT-Native path "\".
    Anyway, back to "\\?\". This is a special prefix that when Win32 sees it, it causes the path to be parsed differently. Many of the checks are removed, and the path is rewritten as an NT-Native path that begins with "\??\". You can even use the Win32 Path "\\?\GLOBALROOT\Device\HarddiskVolume4\" (at least on my PC) as another way to get to your C:\ drive. *Windows Explorer and File Dialogs forbid this style of path.* But 7-Zip File Manager allows it! And regular programs will accept a filename as a command line argument in that format.
    Another noteworthy path in "\??\" is "\??\UNC\". It's a symbolic link to "\Device\Mup". From there, you can add on the hostname/IP address, and share name, and access a network share. So in addition to the classic UNC path "\\hostname\sharename", you can also access the share with "\\?\UNC\hostname\sharename" or "\\?\GLOBALROOT\Device\Mup\hostname\sharename".
    
    jeroenhd - 6 months ago
    
    I don't believe they do. Maybe the documentation will tell you it must be, but in practice file names with broken UTF-16 can be created.
    
    cryptonector - 6 months ago
    
    It's the same on Unix.
    On Unix the reason for this is that the kernel has no idea what codeset you're using for your strings in user-land, so filesystem-related system calls have to limit themselves to treating just a few ASCII codepoints as such (mainly NUL, `/`, and `.`).
userbinator - 6 months ago

And of course making everything twice as big as it needs to be is also extremely repugnant.
- Joker_vD - 6 months ago
  
  Not everyone uses Latin-based scripts, you know. Most of the symbols in the BMP (including Brahmic scripts) take two bytes in either UTF-8 or UTF-16, and CJK symbols take 3 bytes in UTF-8 instead of 2 in UTF-16. Emojis, again, are 4 bytes long in either encoding. So for the most people in the world, UTF-16 is either slightly more compact encoding, or literally the same as UTF-8.
  - account42 - 6 months ago
    
    > Not everyone uses Latin-based scripts
    Actually, everyne does use Latin-based scripts extensively. Maybe not exclusively but your almost all text-like data intended to be consumed by programs will mainly be Latin-based scripts. So even for languages where you have characters that need 3-bytes in UTF-8 but two in UTF-16 you can still end up saving memory with UTF-8 because all the boilerplate syntax around your fancy characters is ASCII.

Dwedit - 6 months ago

There are two ways to force the "Ansi" codepage to actually be UTF-8 for an application that you write (or an EXE that you patch).

One way is with a Manifest file, and works as of a particular build of Windows 10. This can also be applied to any EXE after building it. So if you want a program to gain UTF-8 support, you can hack it in. Most useful for console-mode programs.

The other way is to use the hacks that "App Locale" type tools use. One way involves undocumented function calls from NTDLL. I'm not sure exactly which functions you need to call, but I think it might involve "RtlInitNlsTables" and "RtlResetRtlTranslations" (not actually sure).

layer8 - 6 months ago

> until Microsoft chooses to enable UTF-8 by default in all of their Windows editions.

I don’t know how likely this is. There are a lot of old applications that assume a particular code page, or assume 1 byte per character, that this would break. There are also more subtle variations of this, like applications assuming that converting from wide characters to ANSI can’t increase the number of bytes (and hence an existing buffer can be safely reused), which isn’t the case for UTF-8 (but for all, or almost all, existing code pages). It can open up new vulnerabilities.

It would probably cause much less breakage to remove the Best-Fit logic from the win32 xxxA APIs, and instead have all unmappable characters be replaced by a character without any common meta semantics, like “x”.

tambre - 6 months ago

One example of such an application is Adobe After Effects [0]. Or at least used to be, I no longer use Windows.
[0] https://tambre.ee/blog/adobe_after_effects_windows_utf-8/
kgeist - 6 months ago

Maybe they can introduce OS API versions (if there's no such thing yet) and require new (or updated) apps targetting new API versions/newer SDKs to assume UTF8 by default? So everything below a certain API version is emulated legacy mode. Windows already has the concept of shims to emulate behavior of different Windows versions.
- layer8 - 6 months ago
  
  Apps can already opt-in to UTF-8 for the ANSI APIs (see https://news.ycombinator.com/item?id=42649122), or use the wide-character APIs.
cryptonector - 6 months ago

You already had this problem pre-UTF-8 in Windows: changing your default codepage could cause app fuckiness. So giving the user the option to use UTF-8 is reasonable. Making it the default is also reasonable given the problems that the Best-Fit mapping are causing, though Microsoft would have to do something to help users easily figure out how to run older code.
Another not-so-reasonable thing would be to drop all mappings to "special" ASCII characters from the Best-Fit mappings, though this wouldn't help apps linked statically with the CRT. Not so reasonable because it doesn't fix the vulnerabilities.
Sometimes security vulnerabilities motivate backwards-compatibility breakage.

garganzol - 6 months ago

Microsoft was aware of this issue at least 1 year ago. I know this because they released a special code analysis rule CA2101 [1] that explicitly discouraged the use of the best-fit mapping. They mentioned security vulnerabilities in the rule’s description, but they were purposefully vague in details though.

[1] https://learn.microsoft.com/en-us/dotnet/fundamentals/code-a...

cesarb - 6 months ago

> However, resolving this problem isn’t that as simple as just replacing the main() with its wide-character counterpart. Since the function signature has been changed, maintainers would need to rewrite all variable definitions and argument parsing logics, converting everything from simple char * to wchar_t *. This process can be painful and error-prone.

You don't need to convert everything from char * to wchar *. You can instead convert the wide characters you received to UTF-8 (or to something like Rust's WTF-8, if you want to also allow invalid sequences like unpaired surrogates), and keep using "char" everywhere; of course, you have to take care to not mix ANSI or OEMCP strings with UTF-8 strings, which is easy if you simply use UTF-8 everywhere. This is the approach advocated by the classic https://utf8everywhere.org/ site.

segasaturn - 6 months ago

I've been inadvertantly safe from this bug on my personal Windows computer for years thanks to having the UTF-8 mode set, as shown at the bottom of the article. I had it set due to some old, foreign games showing garbled nonsense text on my computer. Have not noticed any bugs or side effects despite it being labelled as "Beta".

numpad0 - 6 months ago

Interesting, to me that checkbox have done nothing but crashing too many random apps. I guess whether it works depends on the user's home codepage with it off.
UltraSane - 6 months ago

I just enabled the "Beta: Use Unicode UTF-8 for worldwide language support" option. Going to be interesting to see how many apps this breaks.
- cryptonector - 6 months ago
  
  Please come back and tell us!
  - UltraSane - 6 months ago
    
    Nothing so far.

scoopr - 6 months ago

I was wondering if the beta checkbox the same thing as setting the ActiveCodePage to UTF-8 in the manifest, but the docs[0] clarify that GDI doesn't adhere to per-process codepage, but only a single global one, which is what the checkbox sets.

Bit of a shame that you can't fully opt-in to be UTF-8 with the *A API, for your own apps. But I think for the issues highlighted in the post, I think it would still be a valid workaround/defence-in-depth thing.

[0] https://learn.microsoft.com/en-us/windows/apps/design/global...

lifthrasiir - 6 months ago

Oh, my, freaking, god. I knew Windows API provides that sort of best-fit conversions, but didn't realize that it was a default behavior for several ANSI functions in my native code page (949 [1])! At this point they should be just banned like gets.

[1] Yes, I know there is a UTF-8 code page (65001). That was really unusable for a long time and still is suffering compatibility issues to this day.

mouse_ - 6 months ago

Unicode on modern systems is absolutely terrifying. Anyone remember the black dot of death? https://mashable.com/article/black-dot-of-death-unicode-imes...

kazinator - 6 months ago

HN, Help! Before I dive into this, does anyone know whether this affects the argument parsing in Cygwin, that prepares the arguments for a regular int main(int argc, char *argv)?

TXR Lisp uses wchar_t strings, and the "W" functions on Windows. So that's well and good. But it does start with a regular C main, relying on the Cygwin run-time for that.

If that's vulnerable, I will hack it to have its own argument parsing, using the wide char command line.

Maybe I should ask this on the Cygwin mailing list.

shakna - 6 months ago

Cygwin has been using the W variant of most things. [0]
Unfortunately, they're using it on "__argv", not "__wargv".
Which means they are probably vulnerable to this.
[0] https://cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/ker...
kazinator - 6 months ago

Cygwin evidently contains its own versions of certain Win32 functions, like in this source file:
https://github.com/cygwin/cygwin/blob/main/winsup/cygwin/ker...
If we jump to the very bottom, we see that GetCommandLineA is one of the functions implemented in this file. It uses something called RtlUnicodeStringToAnsiString.
Microsoft declares that one in <wdm.h> and places it in NTOSKRNL.EXE.
That very function is mentioned in the submitted article as having the problem!
If Cygwin handles arguments through this function in preparation for main(), and it's the Microsoft one, it has the issue.
cryptonector - 6 months ago

Starting with main() instead of wmain() is enough to make it vulnerable :(
- kazinator - 6 months ago
  
  It's enough to make it suspicious, but there is a way for a C compiler on Windows to support regular main startup such that it is not vulnerable. It depends on which/whose piece of code takes the WCHAR-based command line and produces the char-based argv[]. Does that code do the "WorstFit" thing, or not.
  In my program, I could fix it (if necessary) by taking the WCHAR command line and parsing it myself, using my own UTF8 encoding routines to make the multi-byte strings.
  - cryptonector - 6 months ago
    
    Yes, that's what I'd do: write a `wmain()` that converts UTF-16 arguments to UTF-8 then calls the real `main()` (which you'll have to rename).
    
    kazinator - 6 months ago
    
    Except, in the Cygwin ecosystem, it doesn't look as if you can do that. All the "crt" stuff for setting up the executable environment is built around regular main.
    Luckily, I maintain a fork of the Cygwin DLL for my project called Cygnal, where I can make improvements. The focus has not been security up to now, but rather restoring some "native like" behaviors in the Cygwin run-time that are "too POSIXy" for native Windows users.
    I could hack the CRT stuff in the Cygnal fork to avoid pitfalls in producing the main() arguments.
    At the moment, I only looked at this issue fairly superficially, so I don't have a full picture of how what piece are doing exactly what in this area.
    
    kazinator - 5 months ago
    
    To close this loose end in the HN thread:
    The above guess turns out to be false. You can link a program in Cygwin that has WinMain.
    If you don't include <windows.h>, you can give it whatever argument list you like, such as (void), and it will work. But in 32 bit Cygwin (which I still support), it will complain about _WinMain@16 missing (as a warning only).
    If you give it the correct signature declared in <winbase.h> then it links on 32 or 64 bit Cygwin just fine. without warnings.
    In this WinMain, you can use the wide-string functions for obtaining the command line and parsing to arguments.
    I am going with this fix for TXR.
    
    kazinator - 5 months ago
    
    Looks like this is unnecessary. Cygwin in fact uses GetCommandLineW and does its own parsing.
    See reply by Corinna Vinschen:
    https://cygwin.com/pipermail/cygwin/2025-January/257074.html

bangaladore - 6 months ago

I tend to agree that this is not an issue with many of the applications that are mentioned in the post.

Fundamentally this boils down to essentially bugs in functions that are supposed to transform untrusted into trusted input like the example they gave:

`system("wget.exe -q " . escapeshellarg($url));`

`escapeshellarg` is not producing a trusted output with some certain inputs.

blibble - 6 months ago

the escaping rules for windows are so complicated (and can vary with configuration) such that it's not possible to do it securely
vs. posix that just dumps the arguments directly into argv
- hnlmorg - 6 months ago
  
  Windows doesn’t really have an ARGV though. It’s a user space abstraction for compatibility with POSIX.
  Windows technically just works on the principle of an executable name + a single argument. And it does this for compatibility with DOS.
  So you end up with this stupid escaping rules you’ve described so there are compatibility conventions at the kernel level with earlier implementations of Windows, which in turn maintained compatibility with MS-DOS. While providing a C abstraction that’s compatible with POSIX.
  Which is just one of many reasons why it’s a nightmare to write cross platform shells that also target Windows.
- bangaladore - 6 months ago
  
  > the escaping rules for windows are so complicated (and can vary with configuration) such that it's not possible to do it securely
  This is bold claim.
  Is it not possible? Or not easy to do correctly?
  - blibble - 6 months ago
    
    all the kernel passes to executables is one long string
    and then every program handles it in whatever way it feels is best
    as examples: go/java/python all process arguments slightly differently
    even microsoft's libc changes handling between versions
    given it's not possible to know what parser a specific target program is going to use: it's not possible to generically serialise an array safely
    
    Aachen - 6 months ago
    
    Okay so it's not that the rules are too complicated for mortals, it's that there are no rules and so you can't know how programs will interpret the data, so you can't write anything that would universally work. Am I understanding that right?

sharpshadow - 6 months ago

That’s amazing great read. According to for example this[0] post it’s possible to change code pages in windows in various ways and would allow the use of multiple BestFit scenarios on the same OS without reboot. Even combining them should be possible.

Aachen - 6 months ago

(you forgot the link)
- sharpshadow - 6 months ago
  
  Right.
  https://stackoverflow.com/a/56323296

pornel - 6 months ago

It would be easily fixable if CommandlineToArgvA was obtaining the command line itself. Then instead of converting to ANSI and then parsing that, it could parse args in Unicode, and then convert argument by argument to ANSI. The output would be ANSI compatible, but split and unescaped in the true form.

Unfortunately, the parsing is a two-step operation, with the application calling GetCommandLineA itself first and passing that to the parser, so a fix would need a hack to correlate the versions of the command line input without breaking when it's given a different string.

nitwit005 - 6 months ago

There are presumably some similar .Net COM issues when communicating with unmanaged code, as there is an attribute for controlling this conversion: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....

It directly mentions: "Setting BestFitMappingAttribute parameters in this manner provides an added measure of security."

LudwigNagasena - 6 months ago

What would even be the proper way to do `system("wget.exe -q " . escapeshellarg($url))`? It’s ridiculous that plaintext IPC is still the primary interface for many tools.

cryptonector - 6 months ago

Parse the URI query parameters and construct the command-line. "Parse, don't validate." Though still, that's just not enough here. If the command is a "main()" style command then you'll lose. You'll need to make sure that the command is safe to use with Unicode data regardless of codepage in use.
rubatuga - 6 months ago

Agreed, Windows should just make some breaking changes already and adopt unix style arguments.
- ygra - 6 months ago
  
  So we end up in a world 20 years from now where most applications still don't use that. I guess the main problem as described here is the mapping, as argument splitting was just one of the possible things that break (next to argument validation or bad file names).

layer8 - 6 months ago

> And yes, Python’s subprocess module can’t prevent this.

A reasonably sane solution would be for it to reject command line arguments on Windows that contain non-ASCII characters or ASCII characters that aren’t portable across code pages (not all code pages are a superset of US-ASCII), by default, and to support an optional parameter to allow the full range, documenting the risk.

veltas - 6 months ago

That's not sane at all, because then you can't send e.g. Japanese arguments to tools that support wide chars or have UTF-8 codepage in their manifest. And then there's yet another difference between Python versions to trip you up. And why should the default not allow internationalisation? Doesn't fit with the idea of Python 3.

radarsat1 - 6 months ago

Seems like a another possible fix would be to change the best fit mapping table to never generate control characters, but only alphanumerics. So map quote-like characters to 'q' and so on.

This might be uglier and slightly change behaviour, but only for vulnerable applications.

ok123456 - 6 months ago

Bush hid the facts

cesarb - 6 months ago

> Bush hid the facts
For those who don't know the reference: https://en.wikipedia.org/wiki/Bush_hid_the_facts it's a vaguely related issue, in which a Windows component misinterprets a sequence of ASCII characters as a sequence of UTF-16 characters. Windows just seems full of these sorts of character-handling bugs, in part due to its long history as a descendant of the codepage-using MS-DOS and 16-bit Windows operating systems.
- 6 months ago

[deleted]

lilyball - 6 months ago

> Worse still, as the attack exploits behavior at the system level during the conversion process, no standard library in any programming language can fully stop our attack!

What happens if the standard library updates its shell escaping to also escape things like the Yen character and any other character that has a Best-Fit translation into a quote or backslash? Which is to say, what does Windows do for command-line splitting if it encounters a backslash-escaped nonspecial character in a quoted string? If it behaves like sh and the backslash simply disables special handling of the next character, then backslash-escaping any threat characters should work.

david2ndaccount - 6 months ago

Backslash is a valid path character (you can use / or \ as the path separator on windows) so if the backslash isn’t actually escaping anything it is left as-is.
cryptonector - 6 months ago

> What happens if the standard library updates its shell escaping
If the executable is linked statically with the CRT then nothing changes until you re-link it with the newer CRT. If it links with the UCRT then if the UCRT changes its rules then the program will too.

account42 - 6 months ago

Window's A APIs and conversion functions are best ingored entirely.

Always use W functions and use your own converions (that can round-trip invalid UTF-16 like WTF-8) if you want to use an 8-bit encoding internally.

Most (all?) of the exploits here are already bugs because the applications don't properly handle unicode data.

rubatuga - 6 months ago

From what I can tell the largest vulnerability is argument passing to executables in Windows. Essentially it is very difficult to safeguard it. I've seen some CLI programs use the '--' to signify user input at the end, maybe this would solve this for a single argument scenario. Overall, this is an excellent article and vulnerability discovery.

ppp999 - 6 months ago

Character encoding has been such a mess for so long it's crazy.

Aachen - 6 months ago

To be fair, I do enjoy that you can start programs like Rollercoaster Tycoon, written mostly in assembly, still in Windows 7 (maybe even newer, I haven't used Windows in a while) and it mostly all works. I'm not aware that there are characters that can't be represented by the UTF-16 encoding (that will, according to the article, also widen further when necessary) which has been in use for decades now. I don't feel like this is character encoding still being a mess as much as legacy causing a mess: without those legacy binaries, translation wouldn't be needed and there wouldn't be a problem since the newer encoding is long there
I find encoding to be a mostly solved problem in any software that supports /UTF-[0-9]+/. Can't remember the last time I ran into an encoding issue, maybe it was when I ran PHP on a Windows server years and years ago, which defaulted to passing query parameters in ISO-8859-15 or something, which didn't work with html escaping (returned a blank string, so failing in a safe and obvious way iirc) if you didn't specify the character set. I personally converted, or set, everything I create or use to UTF-8 by now

est - 6 months ago

I remember typing some prefix character in notepad.exe then your hole txt became messed up. Funny unicode times.

jeroenhd - 6 months ago

That didn't have anything to do with the mechanism at play here. The "bush hid the facts" example interfered with the unicode detection mechanism (IsTextUnicode) built into Notepad itself. If it just defaulted to the current code page, it wouldn't have had this problem, but because IsTextUnicode misdetects various word length sequences, it needlessly converted the file from ASCII into UTF-16LE, which would often produce Chinese characters by chance.
https://devblogs.microsoft.com/oldnewthing/20040324-00/?p=40...

UltraSane - 6 months ago

The loosey-goosey mapping of code points to characters has always bothered me about Unicode.

To guard against this nasty issue that is going to take years to fix you can enable global UTF-8 support by doing

Settings > Time & language > Language & region > Administrative language settings > Change system locale, and check Beta: Use Unicode UTF-8 for worldwide language support. Then reboot the PC for the change to take effect.

cryptonector - 6 months ago

What "loosey-goosey mapping of code points to characters" are you talking about?
- perching_aix - 6 months ago
  
  I'd imagine the same one the article is talking about [0].
  [0] https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/Windo...
  - cryptonector - 6 months ago
    
    Thanks for that link (bookmarked). GP Had written:
    | The loosey-goosey mapping of code points to characters has always bothered me about Unicode.
    but the ones you posted are Microsoft's Best-Fit mappings. I'm going to conclude that GP referred to just that.

Randor - 6 months ago

That was a long read. Just be happy that you never had to deal with Trigraphs. https://learn.microsoft.com/en-us/cpp/c-language/trigraphs?v...

EdSharkey - 6 months ago

Distributing native binaries is so dangerous!

garganzol - 6 months ago

Any executable code is dangerous if isolation assumptions are violated. JavaScript, Python, or anything Turing-complete. It does not matter if the code is native or interpreted.
- perching_aix - 6 months ago
  
  Distributing any data is dangerous. Programs and devices have been compromised by malicious text, audio, images, video and other kinds of binary media (data), so even (especially?) programs. So I'd argue this - on it's own - is not really a useful thing to remark. The question is the characteristics and interactions of the dangers.
  The comment you responded to is weird because of exactly this reason. Very terse without a whole lot of substance, coming off as interaction bait / similar. All too often do I see comments on social media where people post something that will obviously lead others into being cornered arguing something that was misleading in the first place. Best not to entertain these unless confidence can be had that they're being genuine and are phrasing like this by mistake / not knowing better.

tiahura - 6 months ago

Imagine no Unicode, It’s easy if you try, No bytes that bloat our systems, No errors make us cry. Imagine all the coders, Living life in ASCII…

Imagine no emojis, Just letters, plain and true, No accents to confuse us, No glyphs in Sanskrit too. Imagine all the programs, Running clean and fast…

You may say I’m a dreamer, But I’m not the only one. I hope someday you’ll join us, And encoding wars will be done.

Ndymium - 6 months ago

I can appreciate the funny lyrics, but in real life I appreciate being able to write in my own language on a computer. Or even a mix of my language and another non-English language!
- lmm - 6 months ago
  
  > in real life I appreciate being able to write in my own language on a computer.
  I do too, which is why I hate the "unicode only for everything everywhere" narrative that's taken hold. My language can't be written properly in Unicode, so support for traditional codepages and encodings is really important!
  - klibertp - 6 months ago
    
    What language is that? Given that there are glyphs for Ancient Egyptian in Unicode now, the claim that there's a language you can't write in Unicode needs some backing :)
    
    Ndymium - 6 months ago
    
    I presume they mean Han unification[0], which to an uneducated outsider sure sounds like a fuckup. I wonder if it's possible to fix it inside Unicode without throwing the rest of it away. Reserve new blocks for all the needed graphemes and deprecate the unified ones?
    [0] https://en.wikipedia.org/wiki/Han_unification
    
    lmm - 6 months ago
    
    Japanese. If you try to write it in Unicode you get (in the overwhelming majority of applications) Chinese and some vaporware excuses about font regions.
nonrandomstring - 6 months ago

Nah, I'm just shaving my head and going full-on ultra-nationalist...
Speak ASCII or Die [0]
[0] https://en.wikipedia.org/wiki/Speak_English_or_Die

JimmyWilliams1 - 6 months ago

[dead]