Finding and fixing Ghostty's largest memory leak
mitchellh.com621 points by thorel 3 days ago
621 points by thorel 3 days ago
This is great news! Well done to everyone who helped sort it out. It was a problem noted by users in a thread here just last week, https://news.ycombinator.com/item?id=46460319
While Claude Code might have been the reason this bug became triggered by more people, there are some of us who were hitting it without ever having used Claude Code at all. Maybe the assumption about what makes a page non-standard, isn't as black-and-white as presumed. And I wonder if the leak would have been triggered more often for people who use scrollback-limit = 0, or something very small.
Probably not a huge deal, but it does seem the fix will needlessly delete and recreate non-standard pages in the case where the new page needs to be non-standard, and the oldest one (that needs to be pruned) already is non-standard and could be reused.
> Probably not a huge deal, but it does seem the fix will needlessly delete and recreate non-standard pages in the case where the new page needs to be non-standard, and the oldest one (that needs to be pruned) already is non-standard and could be reused.
This is addressed in the blog post.
It is how the PageList has always worked, and also how it worked before with the bug, because during capacity adjustment we would see the wrong size. This shouldn't change any perceived performance.
And as I note in the blog post, there are alternative approaches such as the one you suggested, but we don't have enough empirical data to support changing our viewpoint on that whereas our current viewpoint (standard sizes are common) is well supported by known benchmarks. I'm open to changing my mind here, but I didn't want to change worldviews AND fix the leak in the same go.
How come this isn't released as a hotfix / out of band patch but will follow the standard release cycle in March?
Of all the things to be impressed by you about, your patience is commendable. I'd be losing my shit if someone couldn't be bothered to read what I wrote and just spout off about something I'd addressed in my writing, but I suppose that's why your bank account has two commas and a bunch more. Thank you for everything. Can we go flying sometime?
> I'd be losing my shit if someone couldn't be bothered to read what I wrote and just spout off about something I'd addressed in my writing
In my experience that’s a universal feature of comment sections everywhere, and HN is not an exception. This is very common in HN comments which is why it’s important to always read the article, not just the comments.
[flagged]
Hell yeah! You don't got any heros? No body you look up to or respect? Not even a little bit?
> Well done to everyone who helped sort it out. It was a problem noted by users in a thread here just last week
I'm feeling a bit lucky I was able to sneak in an issue during the beta phase, but it was a real reproducible one that led to a segfault.
The thread about memory leak is here: https://news.ycombinator.com/item?id=46461061
And the same diagnosis in the blog post was reported by a user in discussions a month ago but ignored https://github.com/ghostty-org/ghostty/discussions/9786#disc...
That doesn't sound like the actual issue, or am I not understanding it correctly?
I think you’re correct. the reproduction isn’t very precise and the solution doesn’t seem right (I’m not seeing anything about the non-standard pages not being freed). I’d guess this was ignored because it was wrong…
As a side note - Claude Code is making the CLI attractive in a renewed fashion - more than anything else did it last 20years.
Great write-up. And, thanks mitchellh for Ghostty, I switched to it last year, and have not regretted it.
However, I am a somewhat surprised that the fix is reserved for a feature release in a couple of months. I would have expected this to be included in a bug fix release.
It's already released in the latest nightly build.
Are the nightly releases the expected way to get timely bugfixes?
That is how software releases generally work. AFAICT this is not a bug with broad impact or security implications.
I guess thats arguable, a memory leak can make a system unpleasant to use although I accept it can be solved by repeatedly restarting the offending app.
The moment you started talking about pages, I was like: “Ok, obviously memory pooled” and yup, it is. Then I said “obviously ring buffered” and yeah, essentially your scroll back reuse. Then I knew exactly where the bug was before getting to that part, not freeing the pages memory properly and sure enough - bingo! With some great looking diagrams of memory space alignment.
Kudos, that was a good read. Just remember that every time you do something novel, there’s potential for leaks :D
Funny timing, I moved to Ghostty this week and just today I ran into OOM crashes in Ghostty while developing a terminal UI app. Coincidentally this TUI has a tab bar that looks like this, where UTF8 icons are used for recognizability and activity indicators (using © and € as placeholders here):
1|Flakes © 2|Installed © 3|Store © € 4|Security © €
──────────────────────────────────────────────────────────────
This works fine normally, but resizing the terminal would quickly trigger the crash - easy to avoid but still annoying!I was already preparing myself to file a bug report with the easy repro, but this sounds suspiciously close to what the blog post is describing. Fingers crossed :)
(EDIT: HN filters unicode, booo :( )
Why would I move to GhosTTY versus the terminal emulator that comes with my OS as it's not clear to me from the documentation?
I don't think I can do a better overview than https://ghostty.org/docs/about . It's not world-changing but simply a very polished, well-executed terminal.
GPU rendering virtually eliminates typing latency. Most terminals that have it don't support native content like tabs, but Ghostty gets minimal latency without having to compromise on essentials since it uses native toolkits under the hood.
The modern TTY has lots of protocol extensions that allow your CLI tools to do things like display high-resolution images. There's tons of good-quality color themes out-of-the-box (with a built-in browser for preview).
Configuration is highly customizable but the defaults are good enough that you barely need it.
This feels like a case of guessing at something you could know. There are two types of allocations that each have a size and free method. The free method is polymorphic over the allocations type. Instead of using a tag to know absolutely which type an object it is you guess based on some other factor, in this case a size invariant which was violated. It also doesn't seem like this invariant was ever codified otherwise the first time a large alloc was modified to a standard size it would've blown up. It's worth asking yourself if your distinguishing factor is the best you can use or perhaps there is a better test. Maybe in this case a tag would've been too expensive.
@mitchellh what did you use for the memory visualizations? Looks nice, and the website plays well with mobile. Whats the stack?
Static HTML/CSS generated by Opus 4.5.
I like using AI for visualizations because it is one-time use throwaway code, so the quality doesn't matter at all (above not being TOTALLY stupid), it doesn't need to be maintained. I review the end result carefully for correctness because it's on a topic I'm an expert of.
I produce non-reusable diagrams namespaced by blog post (so they're never used by any other post). I just sanity check that the implementation isn't like... mining bitcoin or leaking secrets (my personal site has no secrets to build) or something. After that, I don't care at all about that quality.
The information is conveys is the critical part, and diagrams like this make it so much more consumable for people.
That's really cool. I was looking at them and thinking "I could probably make these with vanilla html/css but it'd be pretty tedious." Perfect use case for AI. I need to work on developing a reflex for it.
I've also started doing this, and it's surprisingly enjoyable to both do and even to read. The end result is often more readable to me than using a 3rd-party JS visualization library, because I only need to know standard HTML/CSS concepts to understand what's going on. And a side benefit is smaller pages with less bitrot due to being able to skip the dependencies.
I've been following the development of Ghostty for a while and while I have the feeling that there is a bit of over-engineering in this project, I find this kind of bug post mortem to be extremely valuable for anyone in love with the craft.
Over-engineered in what way?
Having to introduce a new language stack to distributioms just to be able to build a terminal emulator is what I would consider over-engineering.
So anything that uses a less popular language is considered over engineering? Distros support lots of different languages already and there are likely other packages built with zig already.
A 50-ish MB build time dependency that doesn't need any special privileges or installation to run? That's over engineering? A binary release of just CMake is bigger than all of Zig.
Super accessible write up as someone unfamiliar with Ghostty and terminal emulators in general. Thanks!
Reliable reproductions are so valuable.
Let me see if I can understand this properly:
There's a linear buffer of pages, most of which come from the pool. It's not clear to me under what conditions these are returned to the pool? Is it when the specific session terminates?
When a non-standard page reaches the point of being recycled, it'll instead be re-added to the list but with a standard size. That effectively leaks the extra space above the standard size. But when the buffer is released (because the session ends?) the pool is also released, which releases all the standard sized pages but leaks the custom-sized ones?
Which suggests that the issue may be even rarer than it initially looked to me: I tend to open a small number of sessions and then use them continuously, rather than starting new sessions during the lifetime of the process. If I never terminated a session, I would never fully leak the memory?
Why not just use a circular buffer for the scroll back? Why use blocks at all if you’re just going to recycle them anyway? That said, great write-up.
It started that way, and that's a common way to do this. One of the reasons is to avoid large pre-allocations OR large copies. A few other notes over on lobsters: https://lobste.rs/s/vlzg2m/finding_fixing_ghostty_s_largest_...
waiting for someone to say "this wouldn't have happen if you chose rust"
You’ll probably be waiting a long time, since Rust very explicitly doesn’t have “leak safety” as a constructive property. Safe Rust programs are allowed to leak memory, because memory leaks themselves don’t cause safety issues.
There’s even a standard, non-unsafe API for leaking memory[1].
(What Rust does do is make it harder to construct programs that leak memory unintentionally. It’s possible but not guaranteed that a similar leak would be difficult to express idiomatically in Rust.)
[1]: https://doc.rust-lang.org/std/boxed/struct.Box.html#method.l...
The specific language feature you want if you insist that you don't want this kind of leak is Linear Types.
Rust has Affine Types. This means Rust cares that for any value V of type T, Rust can see that we did not destroy V twice (or more often).
With Linear Types the compiler checks that you destroyed V exactly once, not less and not more.
However, one reason I don't end up caring about Leak Safety of this sort is that in fact users do not care that you didn't "leak" data in this nerd sense. In this nerd sense what matters is only leaks where we lost all reference to the heap data. But from a user's perspective it's just as bad if we did have the reference but we forgot - or even decided explicitly not - to throw it away and get back the RAM.
The obvious way to make this mistake "by accident" in Rust is to have two things which keep each other alive via reference counting and yet have been disconnected and forgotten by the rest of the system. A typical garbage collected language would notice that these are garbage and destroy them both, but Rust isn't a GC language of course. Calling Box::leak isn't likely to happen by accident (though you might mistakenly believe you will call it only once but actually use it much more often)
I think the main part of Ghostty's design mentioned here that - as a Rust programmer - I think is probably a mistake is the choice to use a linked list. To me this looks exactly like it needs VecDeque, a circular buffer backed by a growable array type. Their "clever" typical case where you emit more text and so your oldest page is scrapped and re-used to form your newest page, works very nicely in VecDeque, and it seems like they never want the esoteric fast things a linked list can do, nor do they need multi-writer concurrency like the guts of an OS kernel, they want O(1) pop & push from opposite ends. Zig's Deque is probably that same thing but in Zig.
The issue isn’t linked list vs dequeue but type confusion about what was in the container. They didn’t forget to drop it - they got confused about which type was in the list when popping and returned it to the pool instead of munmap.
The way to solve this in Rust would be to put this logic in the drop and hide each page type in an enum. That way you can’t ever confuse the types or what happens when you drop.
Was going to say this, but I don't think anyone actually wants to hear that Rust actually would have helped here.
As you're saying, the bug was the equivalent of an incorrectly written Drop implementation.
Nothing against Zig, and people not using Rust is just fine, but this is what happens when you want C-like feel for your language. You miss out on useful abstractions along with the superfluous ones.
"We don't need destructors, defer/errdefer is enough" is Zig's stance, and it was mostly OK.
Impossible to predict this kind of issue when choosing a project language (and it's already been discussed why Zig was chosen over Rust for Ghostty, which is fine!), so it's not a reason to always choose Rust over Zig, but sometimes that slightly annoying ceremony is useful!
Maybe some day I'll be smart enough to write Zig as a default over Rust, but until that day I'm going to pay the complexity price to get more safety and keep more safety mechanisms on the shotgun aimed at my foot. I've got plenty of other bugs I can spend time writing.
Another good example is the type vs type alias vs wrapper type debate. It's probably not reasonable to use a wrapper type every single time (e.g. num_seconds probably can probably be a u32 and not a Seconds type), but it's really a Rorschach test because some people lean towards one end versus the other for whatever reason, and the plusses/minuses are different depending on where you land on the spectrum.
[EDIT] also some good discussion here
https://ziggit.dev/t/zig-what-i-think-after-months-of-using-...
> "We don't need destructors, defer/errdefer is enough" is Zig's stance, and it was mostly OK.
There's more than that. Zig has leak detecting memory allocators as well, but they only detect the leak if it happens. Nobody had a reliable reproduction method until recently.
If you wanted to match Ghostty's performance in Rust, you'd need to use unsafe in order to use these memory mapping APIs, then you'd be in the exact same boat. Actually you'd be in a worse boat because Zig is safer than unsafe Rust.
> If you wanted to match Ghostty's performance in Rust, you'd need to use unsafe in order to use these memory mapping APIs, then you'd be in the exact same boat.
Yea, but not for all the parts — being able to isolate the unsafe and build abstractions that ensure certain usage parts of the unsafe stuff is a key part of high quality rust code that uses unsafe.
In this case though I think the emphasis is on the fact that there is a place where that code should have been in Rust land, and writing that function would have made it clear and likely avoided the confusion.
Less about unsafe and more about the resulting structure of code.
> Actually you'd be in a worse boat because Zig is safer than unsafe Rust
Other people have mentioned it but I disagree with this assertion.
Its a bit simplistic but I view it this way — every line of C/Zig is unsafe (lots of quibbling to do about what “unsafe” means of course) while some lines of rust are unsafe. Really hard for that assertion to make sense under that world view.
That said, I’m not gonna miss this chance to thank you and the Zig foundation and ecosystem for creating and continuously improving Zig! Thanks for all the hard work and thoughtful API design that has sparked conversation and progress.
Thank you for the kind words.
> every line of C/Zig is unsafe
This is trivially false... for instance here's a line:
const pi = 3.14;
It's actually a pretty small subset of the language that can cause unchecked illegal behavior.Also IMO the word "safety" should include integer overflow. I don't agree that those kind of bugs are so unimportant as to not be checked in safe builds.
> Thank you for the kind words.
Absolutely, I meant them.
> This is trivially false... for instance here's a line:
Yep, that was really wrongly stated on my part -- what I meant is that the kind of protections that "safe" Rust provides are not available anywhere in average lines of Zig code (though they can be detected with tooling, etc).
What I should have written is that I could easily write unsafe code anywhere in Zig (as in C). In practice of course most people don't because they're not trying to destroy their own computers, and most code is benign. Rust will at least save me from myself some of the time.
> Also IMO the word "safety" should include integer overflow. I don't agree that those kind of bugs are so unimportant as to not be checked in safe builds.
Rust does do some work to catch trivial overflows, but you're right that it does not catch any slightly more complex overflows, and that is certainly unsafe in a sense. I don't think any reasonable person would disagree with that.
Rust's answer to this of course is checked_{op}/wrapping_{op}/etc options, and that's what I often see in high quality codebases where it matters. Of course, this is a footgun that could have had a safety applied and it's too late now (AFAIK) to change the default to be always wrapping or something (also, I think people may oppose always checked for perf reasons).
[EDIT] Just to compare/make this more concrete, playgrounds:
https://zig.fly.dev/p/LGnrBGXPlVJ
https://play.rust-lang.org/?version=stable&mode=release&edit...
Rust in this case of doing something obviously wrong is at least a little more helpful -- the obvious overflow does not compile.
And of course you can get rust to do it like it allows (and what would be present in any codebase with real complexity):
https://play.rust-lang.org/?version=stable&mode=release&edit...
It's just that little bit of safety that makes it easy for me (personally) to default to Rust. Very possible that someday that won't be true.
[EDIT2] Also, somewhat under-discussed, but if Zig supported a bolt-on a "safety check compile mode" that ran with some stricter (maybe not quite borrow checking level) semantics, that would be pretty dope. Of course not something anyone should devote any real time to for a long time (or ever?) BUT it would trivialize a lot of these discussions maybe.
But in the mean time people just using what they're comfortable with/the feel they want is obviously fine.