Microsoft open-sources "the earliest DOS source code discovered to date"
arstechnica.com
377 points by DamnInteresting 17 hours ago
https://opensource.microsoft.com/blog/2026/04/28/continuing-...
Looking through the source is a great reminder of how constrained early computing was. It's amazing how much of this architecture still influences modern systems.

It is rare that I say this, but thanks MS! Arguably just as, if not more, important is the BASIC that they wrote. That was what they actually wanted to do. DOS just got them the contract with IBM. For decades MS was really a developer tools company with a side biz of writing operating systems and other misc software. They open sourced that BASIC code too [1].

[1] https://opensource.microsoft.com/blog/2025/09/03/microsoft-o...

I don't think I've ever seen a commit that says "49 years ago". Damn.

Not quite as old, but brl-cad is still in active development and has commits from 1983.

https://github.com/BRL-CAD/brlcad/graphs/contributors?all=1

I remember when I realized I had been using Microsoft all along through my Commodore 64.

What's interesting is that Microsoft BASIC itself was derived from BASIC-PLUS, which itself was derived from Dartmouth BASIC (which evolved into a structured programming language called SBASIC, or Structured BASIC). But the popularity of Microsoft BASIC actually halted the standardisation of SBASIC as an ANSI standard.

https://en.wikipedia.org/wiki/Microsoft_BASIC

The Altair BASIC interpreter was developed by Microsoft founders Paul Allen and Bill Gates using a self-written Intel 8080 emulator running on a PDP-10 minicomputer.[1] The MS dialect is patterned on Digital Equipment Corporation's BASIC-PLUS on the PDP-10, which Gates had used in high school.

https://en.wikipedia.org/wiki/Dartmouth_BASIC

Dartmouth BASIC is the original version of the BASIC programming language. It was designed by two professors at Dartmouth College, John G. Kemeny and Thomas E. Kurtz. With the underlying Dartmouth Time-Sharing System (DTSS), it offered an interactive programming environment to all undergraduates as well as the larger university community.
Dartmouth also introduced a dramatically updated version known as Structured BASIC (or SBASIC) in 1975, which added various structured programming concepts. SBASIC formed the basis of the American National Standards Institute (ANSI) "Standard BASIC" efforts in the early 1980s.

In contrast to the Dartmouth compilers, most other BASICs were written as interpreters. This decision allowed them to run in the limited main memory of early microcomputers. Microsoft's Altair BASIC is one example: it was designed to run in only 4 KB of memory (interestingly, it was delivered on paper tape).

Kemeny became involved in an effort to produce an ANSI standard BASIC in an attempt to bring together the many small variations of the language that had developed through the late 1960s and early 1970s. This effort initially focused on a system known as Minimal BASIC that was similar to earliest versions of Dartmouth BASIC, while later work was aimed at a Full BASIC that was essentially SBASIC with various extensions.

But by the late 1980s, tens of millions of home computers were running some variant of the MS BASIC interpreter. It had become the de facto standard for BASIC, which eventually led to the abandonment of the ANSI SBASIC efforts. Kemeny and Kurtz, however, decided to continue their efforts to introduce the concepts from SBASIC and the ANSI Standard BASIC efforts. This became True BASIC.

https://en.wikipedia.org/wiki/True_BASIC

There are versions of the True BASIC compiler for MS-DOS, Microsoft Windows, and Classic Mac OS. At one time, versions for TRS-80 Color Computer, Amiga and Atari ST computers were offered, as well as a UNIX command-line compiler. After several years of inactivity, as of February 2026, the TrueBASIC website is officially closed.

I cannot describe to you how jealous I am of the fact that back then writing a few thousand lines of assembly was what it took to launch a successful software company.
> writing a few thousand lines of assembly was what it took to launch a successful software company.

Yes, but that assembly was not DOS, and it wasn’t easy. Microsoft purchased the DOS code, they didn’t write it. Of course, they did develop and modify DOS. But that was a clever (and lucky) business deal, not a technological accomplishment.

The real beginning of Microsoft was earlier, with Allen, Gates and Davidoff writing the Altair BASIC interpreter. That was a serious achievement. They had never seen the computer they were writing that assembly code for. They did not even own any computers. It took them 8 weeks on a university computer they were not supposed to be using for that.

“Altair agreed to meet them to possibly buy a BASIC interpreter… Gates and Allen had neither a BASIC interpreter nor even an Altair system on which to develop and test one. However, Allen had written an Intel 8008 emulator that ran on a PDP-10 time-sharing computer. Allen adapted this emulator based on the Altair programmer guide, and they developed and tested the interpreter on Harvard's PDP-10. The finished interpreter, including its own I/O system and line editor, fit in only four kilobytes of memory, leaving plenty of room for the interpreted program. In preparation for the demo, they stored the finished interpreter on a punched tape that the Altair could read, and Paul Allen flew to Albuquerque to meet with Altair… While on final approach into the Albuquerque airport, Allen realized that they had forgotten to write a bootloader to read the tape into memory. Writing in 8080 machine language, Allen finished the program before the plane landed.
Only when they loaded the program onto an Altair and saw a prompt asking for the system's memory size did Gates and Allen know that their interpreter worked on the Altair hardware.”

Imagine if the University had sued for their share of the IP that was created using their resources… It’s funny because I thought Jobs/Wozniak got their initial funding from selling phreaking boxes. And more recently, Anthropic engaged in criminal copyright violations with only a slap on the wrist. Feels like a common theme of every “great” company having its origins from a “boost” resulting from criminal activity. (After all, that’s where the money is!) Just imagine the criminal penalties possible for pirating and selling one copy of a movie or making one long distance phone call with phreaking.

In the case of Microsoft, I'm not seeing it. Being born into a 1% household and understanding the asymmetric upside of having the money and the time to speculate is far more significant than the civil and criminal legal violations on the way. The most common way to go from one-percenter rich to .001% rich is to already have enough wealthy people generating capital in your personal network that you can raise capital on sweetheart terms to buy the labor of people who don't. Then you sell it at a massive premium and repeat.

I think it's empirically dubious to identify the UW mainframes as the secret sauce instead of "being able to ask your mom for a meeting with the chairman of IBM followed by asking her for 80,000 dollars ASAP." If the original creators of DOS were born into a wealthy family and on a first name basis with the chairman of IBM, do you think they would've sold it to Gates? Trying to attribute the tech business "founding crime" feels like displacement for what is perfectly legal and accepted cultural practice.

To be fair, I think you needed a cutthroat businessman leading the company.
Which I guess is more or less the same today.

This too, but early MS to their employees was closer to a hipster SV vibe coding in a coffee shop a decade ago.

And for such simple processors and systems no less! No descriptor tables to deal with, no memory management to configure. These days it takes a little processor inside the main processor, just to get things started. Those were golden times.

Replace Assembly with TypeScript/Rust/Go/whatever and as long as the idea is good and useful, same thing applies today.

Except the competition was essentially nonexistent and no one would copy your product with an LLM in a day.

The "competition" has never been just a different codebase; that's one of the smallest pieces you'd have to actually build if you want to build a product people actually want to buy and use. The magic is basically all around it, multiplied by the code, but you really must have everything else down pretty tight before the codebase even starts mattering. But once it does matter, it matters a lot, hence the difficult balancing.

More than a few people would rather die in poverty than put in the effort today even if you offered to time-machine them back with their finished product.

Discussion, on the source, at the time (79 points, 24 days ago, 19 comments) https://news.ycombinator.com/item?id=47957494

Or on the GitHub clone (162 points, 15 comments) https://news.ycombinator.com/item?id=47946813

wow, they had to OCR it back in from paper printouts

> This source code is old enough that it hadn’t been stored digitally. “A dedicated team of historians and preservationists led by Yufeng Gao and Rich Cini,” calling itself the “DOS Disassembly Group,” painstakingly transcribed and scanned in code from paper printouts provided by Paterson. This process was made even more difficult because modern OCR software struggled with the quality of the decades-old printout.

I'd like to hear more about what works in OCR of dot-matrix fonts.
I've been able to OCR letter-quality printer output to 97% (mostly Os and Xs problems). But it seems that machine-learning text-recognition is also now biased to reject computer code because it doesn't look like human language.

There's a writeup here from one of the people on the team about the work it took to go from the listings to source code.

http://cini.classiccmp.org/recoveryblog.htm

> With less-than-satisfactory OCR output, I resorted to a process I used many years ago when converting scans made of old Commodore ROM dumps printed on a Commodore 1515 dot-matrix printer. The process relies on the ASCII OCR output having the same repetitive errors. "B" and "8", "S" and "5" are good examples, as are "l" and "1", and "O" and "0". There are many other similar single-character errors and, when working with x86 code, there are similar errors with instructions like "MOV". This process naturally works better if the output file is monolithic rather than single-page OCR conversions because you can do substitutions across the entire converted printout and not 75 separate files.

> The next formatting hassle was the spacing. This required repetitive substitutions of a descending numbers of spaces to tabs (i.e., replace 8 spaces with a tab, 7, 6, etc.). Then if you want to return it to fixed spaces (which is likely how the original printer printed it -- spaces and not vertical tabs), you can. For pure re-creation work, spaces produce absolute column formatting while tabs can move around depending on the program displaying the file.

> Once you run thought the 15 or so common global substitutions and tab conversion, it's a lot easier to work with the file to fix formatting and perform other cleanup. This is then followed by a line-by-line comparison against the original printouts. Overall I'd say the conversion output quality with this method is very good.

Hmm, doesn't say anything about what OCR tools they used. I've got a 4" stack of wide-carriage COBOL.
I guess it's two revisions of the same system so I only need to scan the newer half. It's probably from a TI Omni 810.

On the other hand, I've got 100 pages of code printed in compressed font by someone wanting to make sure that 80+ char lines fit within margins. So a lot of words just don't come out at all. A frequent error is "A" becomes "H", "O" becomes "U" because the top dots aren't "attached".

And columns of line numbers starting with 0001, or hex? The most confounding thing is OCR that thinks 00 is a sideways 8, and that dominates the uniform block, so it tries to interpret the whole column as sideways text. In another situation, it interprets two stacked lines (each starting with 0) as one line starting with 8 and it just goes off the rails.

So I've been working with automatic skew correction, then clipping it into rows, in order to get each line of text isolated from the surrounding context. When I do that, I get better results, but it is not great either.

I'm considering going all-in on training a new recognizer on snippets. For that, I'll be constructing "The Set of All As" and so on.

Pretty interesting. I wonder if a whitelist against certain columns in the output could help, e.g. this column can only contain valid x86 instructions (e.g. MOV is allowed, M0V is not), this column can only contain hexadecimal (1 is allowed but never "l"), etc. Probably more work than it's worth given the final line-by-line comparison that happens anyway.

Boring reply perhaps, but I've had wild success with adding even a tiny LLM afterwards to do "fixups" over OCRd text. It works great for the typical O/0 issues and similar: just pass it the scrambled OCRd text together with the text around it, and even dumb and tiny 7b models running on CPU do a pretty fine job.
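For anyone curious what the per-column whitelist idea above might look like, here is a rough Python sketch. It is not what the recovery team did; the mnemonic list is a tiny hypothetical sample, the confusable pairs are the ones named in the quoted writeup ("O"/"0", "l"/"1", "S"/"5", "B"/"8"), and the function names are made up for illustration. Anything it cannot resolve unambiguously is returned as None so it can be checked by hand against the printout.

```python
import re
from itertools import product

# Hypothetical whitelist: a small sample of 8086 mnemonics, not a complete set.
MNEMONICS = {"MOV", "ADD", "SUB", "CMP", "JMP", "JZ", "JNZ",
             "CALL", "RET", "PUSH", "POP", "INT", "XOR"}

# Digits that OCR commonly produces in place of letters, with the letters
# each one might really be (assumed from the pairs named in the writeup).
CONFUSABLE = {"0": "O", "1": "IL", "5": "S", "8": "B"}

# In a hex-only column these letters can never be valid, so map them to digits.
HEX_FIX = str.maketrans("OoIlSs", "001155")


def fix_mnemonic(token):
    """Return the single whitelisted mnemonic the token could be, else None."""
    upper = token.upper()
    # For each character, allow the character itself plus its look-alikes.
    options = [ch + CONFUSABLE.get(ch, "") for ch in upper]
    matches = {"".join(combo) for combo in product(*options)} & MNEMONICS
    # Accept only an unambiguous match; otherwise flag for manual review.
    return matches.pop() if len(matches) == 1 else None


def fix_hex(token):
    """Return the token as upper-case hex after look-alike fixes, else None."""
    fixed = token.translate(HEX_FIX).upper()
    return fixed if re.fullmatch(r"[0-9A-F]+", fixed) else None
```

With this sketch, fix_mnemonic("M0V") gives "MOV" and fix_hex("0lFE") gives "01FE", while tokens that match nothing in the whitelist come back as None. "B" is left alone in the hex column because it is a valid hex digit, which is exactly the kind of case that still needs the line-by-line comparison.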
ABBYY has a specific module for dot matrix printouts so I’m surprised it was a struggle for them, but every document is different.

I've recovered some ancient software I wrote via scanning in listings I found among my dad's papers.

Yet another case where text printed on paper outlived any digital storage.

Seems like it was never digitally stored in the first place, and the printed text was barely readable due to age. Not really a big win for paper.

Well it had to have been on disk or tape at some point. It wasn't all typed in by hand every time they needed to build a new version.

unless they used punch cards

Punch cards are still a form of digital storage, mind.

Also a form of storing things on paper

Reminds me of an old fortune cookie message or meme, something like "digital data is made from analog parts".

I threw out all my punch cards. Wish I'd kept at least a listing!

I find punch cards being used in old engineering books I buy from the 60s. Maybe write them again?

> unless they used punch cards

For MS-DOS?

Not likely. Punch cards disappeared around the end of 1976.

My first job out of college in the early 1990s was at an equipment manufacturer who was still using them. They had a big chart on the wall titled "punch-card elimination" and a line trending down, but it wasn't at zero yet. My work there was all new code and didn't involve any of that, however.

I remember seeing stacks of cards being carried into/out of the university "computing center" in the mid 1980s, on more than a couple of occasions. Though in retrospect, these were probably just old programs that had been in various professors' offices since the mid 70s, being taken to get read into some disk in the mainframe.

We still learned how to use them in the 80’s high school computer classes, mostly because we had a balance of CP/M plus card-reader/early DOS machines, eventually .. in the labs. Rich kid schools had Apples though, and some of them also had card readers for BASIC ..

"[..] card readers for BASIC"

Finally, a sensible use case for BASIC's "READ" and "DATA" commands. Learning BASIC as a kid on a micro, it always struck me as an odd way to get input into a program. Sure, with INPUT, you'd have to hand enter your input every time, but baking it into the program meant that you'd have to edit your program any time you wanted to change anything. But with a card reader, you could "cut the deck". Keep the program cards, and then just stack on whatever set of data cards you wanted. From this vantage point, in the 21st century with our flying cars and whatnot, it seems really quirky that back then, even your data could be a tangible thing.
danborn26 - 4 minutes ago
jmward01 - 15 hours ago
ramon156 - 10 hours ago
RobotToaster - 2 hours ago
steve1977 - 10 hours ago
vee-kay - 10 hours ago
nananana9 - 9 hours ago
curiousObject - 7 hours ago
BobbyTables2 - 3 hours ago
areweai - 23 minutes ago
yokoprime - 8 hours ago
justsomehnguy - 5 hours ago
greenbit - 8 hours ago
embedding-shape - 8 hours ago
risyachka - 7 hours ago
embedding-shape - 7 hours ago
avadodin - 8 hours ago
gnabgib - 17 hours ago
locusofself - 16 hours ago
FarmerPotato - 15 hours ago
ndiddy - 4 hours ago
FarmerPotato - 3 hours ago
accrual - 2 hours ago
embedding-shape - 8 hours ago
bob778 - 8 hours ago
WalterBright - 10 hours ago
SoftTalker - 15 hours ago
jshier - 15 hours ago
SoftTalker - 15 hours ago
debesyla - 11 hours ago
Sharlin - 9 hours ago
wongarsu - 8 hours ago
accrual - an hour ago
WalterBright - 10 hours ago
genxy - 9 minutes ago
andsoitis - 11 hours ago
WalterBright - 10 hours ago
SoftTalker - 2 hours ago
greenbit - 7 hours ago
MomsAVoxell - 8 hours ago
greenbit - 7 hours ago