A bug that taught me more about PyTorch than years of using it

elanapearl.github.io

416 points by bblcla 4 days ago


montebicyclelo - a day ago

Incorrect PyTorch gradients with Apple MPS backend...

Yep, this kind of thing can happen. I found and reported incorrect gradients for Apple's Metal-backed TensorFlow conv2d in 2021 [1].

(Pretty sure I've seen incorrect gradients with another Pytorch backend, but that was a few years ago and I don't seem to have raised an issue to refer to... )

One might think this class of error would be caught by a test suite: autodiff can be tested quite comprehensively against numerical differentiation [2]. (Although this example is from a much simpler lib than PyTorch, so I could be missing something.)

[1] https://github.com/apple/tensorflow_macos/issues/230

[2] https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...
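
Roughly what that kind of check looks like, boiled down to a sketch (not taken from either library; the op and tolerances are just illustrative, and PyTorch is used only because it's the topic here):

```python
import torch

def finite_difference_check(f, x, eps=1e-3, atol=1e-2):
    """Compare autograd's gradient of sum(f(x)) against central differences."""
    x = x.detach().clone().requires_grad_(True)
    f(x).sum().backward()
    autograd_grad = x.grad.clone()

    numeric_grad = torch.zeros_like(x)
    flat = x.detach().clone().reshape(-1)
    for i in range(flat.numel()):
        orig = flat[i].item()
        flat[i] = orig + eps
        plus = f(flat.reshape(x.shape)).sum().item()
        flat[i] = orig - eps
        minus = f(flat.reshape(x.shape)).sum().item()
        flat[i] = orig
        numeric_grad.view(-1)[i] = (plus - minus) / (2 * eps)

    assert torch.allclose(autograd_grad, numeric_grad, atol=atol), \
        (autograd_grad - numeric_grad).abs().max()

# Point it at the op/backend you care about, e.g.:
finite_difference_check(lambda t: torch.nn.functional.softmax(t, dim=-1), torch.randn(4, 5))
```

PyTorch itself ships torch.autograd.gradcheck, which is a more careful version of the same idea (it wants double-precision inputs), so in principle the machinery exists; the gap is presumably running it per backend and per memory layout.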

dangoodmanUT - a day ago

The tinygrad folks talk about this a lot.

Not that I understand much of what they say, but it appears there are a lot of correctness bugs in PyTorch flying under the radar, probably with a measurable impact on model quality.

It would be interesting to compare the weights of the same model trained with each of the two, to see if the models exhibit meaningfully different behavior.
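
Even a crude diff would be informative. Something like this, assuming both runs dumped their weights to .npz files with matching parameter names (the file names here are made up):

```python
import numpy as np

# Hypothetical weight dumps from the two training runs
run_a = np.load("weights_pytorch.npz")
run_b = np.load("weights_tinygrad.npz")

for name in run_a.files:
    a, b = run_a[name].astype(np.float64), run_b[name].astype(np.float64)
    max_abs = np.abs(a - b).max()
    cosine = np.dot(a.ravel(), b.ravel()) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    print(f"{name}: max abs diff {max_abs:.3e}, cosine similarity {cosine:.4f}")
```

Though even two runs in the same framework won't match exactly (nondeterministic kernels, data order), so comparing eval metrics is probably more meaningful than comparing raw weights.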

CaptainOfCoit - a day ago

Only slightly related, but how common are bugs in GPUs and/or CUDA? I'm currently on day 5 of trying to debug why the GPT-OSS implementation I've made from scratch (not using PyTorch) isn't working correctly. I have it somewhat working with some naive and slow methods, but now that I'm implementing the tensor-core path I've been stuck for 2-3 days on a small numerical difference I can't explain.

Every day I'm getting closer to believing this is some sort of hardware bug in Blackwell or in CUDA itself, but as we know, the bug is (almost) never in the compiler or in the hardware. Until it is...
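
For context on "small numerical difference": some divergence from a naive path is expected once tensor cores (FP16/TF32 accumulation) are involved. A rough way to gauge what magnitude is plausible from precision alone, with NumPy standing in for a reference path (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512)).astype(np.float32)
B = rng.standard_normal((512, 512)).astype(np.float32)

reference = A.astype(np.float64) @ B.astype(np.float64)      # FP64 "ground truth"
fp32 = (A @ B).astype(np.float64)                            # naive full-FP32 path
fp16 = (A.astype(np.float16) @ B.astype(np.float16)).astype(np.float64)  # crude FP16 proxy

for name, result in [("fp32", fp32), ("fp16 proxy", fp16)]:
    rel_err = np.abs(result - reference).max() / np.abs(reference).max()
    print(f"{name}: max relative error {rel_err:.2e}")
```

If the tensor-core path disagrees with the naive path by roughly that order of magnitude, it's probably just precision; if it's off by much more, or only for certain tiles or shapes, that points at an indexing/layout bug rather than the hardware.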

jebarker - a day ago

This is a great write-up and I'd love to see more like it. Debugging this sort of thing in the Megatron->PyTorch->CUDA stack is what my team, an ML research team, spends more than half of its time on.

farhanhubble - 14 hours ago

Great work hunting the bug down the stack, and the writeup is top notch. I wish I had documented some of the nastiest bugs I've found in this much detail.

Funnily enough, only a few days ago I was thinking about just how far the field has come since 2014 or so, when you'd build a computational graph, initialize weights manually, and so on, versus now, when most of the time you just use a library like Ultralytics or Hugging Face. Then I thought about just how many deep, undetected bugs must be sitting in this mountain of abstraction. Bugs that make the computation invalid.

albertzeyer - 16 hours ago

The bug was with non-contiguous data in tensors.

I also had a very similar bug a while ago, broken gradients due to non-contiguous data for masked_select: https://github.com/pytorch/pytorch/issues/99638

In my case it was easier to identify: I had an earlier implementation of my loss function that did not use masked_select. Then I thought I could be clever and use masked_select to take out the non-masked frames and calculate the loss only on those, but it wasn't working. It also only happened for some models, not all of them. It turned out it always happened when the data coming out of the model was non-contiguous.

I think bugs with non-contiguous data are not so uncommon. I wonder how many of them we still have.
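
A cheap regression test for this whole class of bug is to run the same op on a contiguous tensor and on a non-contiguous view of the same values, then compare the gradients. A minimal sketch (shapes and the op are just for illustration):

```python
import torch

def grads_match(op):
    """Run op on a contiguous tensor and on a non-contiguous view holding
    the same values, and check that the gradients agree."""
    base = torch.randn(8, 16)

    # Contiguous path
    a = base.clone().requires_grad_(True)
    op(a).sum().backward()

    # Non-contiguous path: store the data transposed, then view it back,
    # so the values match `a` but the strides don't.
    b_storage = base.t().clone().requires_grad_(True)
    b = b_storage.t()
    assert not b.is_contiguous()
    op(b).sum().backward()

    return torch.allclose(a.grad, b_storage.grad.t())

mask = torch.rand(8, 16) > 0.5
print(grads_match(lambda x: torch.masked_select(x, mask)))
```

Creating base and the mask with device="mps" (or "cuda") points the same check at a specific backend; torch.autograd.gradcheck is the heavier-duty version of the idea.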

EdwardDiego - 15 hours ago

Kudos to Elana for a) such a thorough deep dive and b) a great write-up of it. I understand very little about ML libraries, but was able to follow this easily :)

cadamsdotcom - a day ago

Sounds like Placeholder should somehow be split into InputPlaceholder and OutputPlaceholder, based on the usage.

Even if the two classes were identical, the split could help future folks realize that copying back is platform-specific: “hm, we wrote to an OutputPlaceholder but didn’t read back from it, that seems wrong”.
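
I haven't read the actual MPS backend code, so this is just a sketch of the idea in Python standing in for the real C++, with the class and method names invented for illustration:

```python
class InputPlaceholder:
    """Wraps a tensor fed into an MPS graph. If the tensor is
    non-contiguous, stage a contiguous copy for the kernel to read."""
    def __init__(self, tensor):
        self.original = tensor
        self.staged = tensor if tensor.is_contiguous() else tensor.contiguous()


class OutputPlaceholder:
    """Wraps a tensor an MPS graph writes into. If a contiguous buffer had
    to be staged, copy the results back when the kernel finishes."""
    def __init__(self, tensor):
        self.original = tensor
        self.needs_copy_back = not tensor.is_contiguous()
        self.staged = tensor if tensor.is_contiguous() else tensor.contiguous()

    def finalize(self):
        # The step that was silently skipped in the bug: without this,
        # results written to the staged buffer never reach the caller's tensor.
        if self.needs_copy_back:
            self.original.copy_(self.staged)
```

With the copy-back living in finalize(), forgetting it becomes a "we never called finalize()" bug, which is easier to spot, or even assert on, than a missing stride check at every call site.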

hobom - a day ago

What a fantastic way to write a post mortem, pedagogically very useful.

ipsum2 - 20 hours ago

Apple used to contribute to the PyTorch MPS backend, but decided to create their own framework (MLX) instead, fragmenting the ecosystem for very little gain. (MLX is basically PyTorch, but invented-at-apple)

Meta, the creator and main contributor to PyTorch, does not use Macs for their day-to-day ML work (they focus on GPUs and CPUs), so the MPS backend is sadly incomplete and has errors like the one you see here.

hinkley - 21 hours ago

Reminds me of the largest AJAX app I worked on, back when jquery was still hot and IE6 still existed as a problem.

The landing page in our app used jQuery UI's drag-and-drop support, back around the time they declared bankruptcy on the confusing, buggy code and wouldn't even accept bug fixes because they were replacing it component by component (which was taking almost 3x as long as predicted). We had columns you could drag items between, but they had a max height and scroll bars, and it turned out jQuery UI would let you drag items into a different row if the overflow area of an adjacent drag target overlapped your row.

The person who found it couldn’t fix it. The other fixer couldn’t fix it. I diagnosed it but the spaghetti code was a recursive mess and I could not find a spot where I could fix it. Especially given I couldn’t send in a patch to them.

So I spent half of my free time on the last day of every (two-week) sprint for almost six months before I finally found a small function I could monkey-patch, wrapping it in a short-circuit check against the clipping region. I spent maybe 20-30 hours on this, a lot of it just getting back into the same state to debug. But it felt like it took forever to fix.

The short circuit also made drag and drop faster, which had been getting to the edge of distracting, particularly on a crowded page.

brilee - a day ago

Great write-up, but I admit I found the interweaving of human and AI-written content/headlines/summaries pretty distracting. I kept wanting to scroll past, but then had to backtrack to find the human thread again.

I think if you want to give your reader a quick intro to, e.g., what is the Adam optimizer, a simple link to Wikipedia is fine. No need to copy-paste an AI tutorial on Adam into the blog post.

airza - a day ago

I too have been insanely burned by an MPS bug. I wish Apple would throw an engineer or two at making sure their hardware works with PyTorch.

Rileyen - 11 hours ago

Just read the article and it instantly brought back memories of when I spent days trying to fix a broken loss in a PyTorch model. Turned out I had passed the wrong optimizer parameters. I ended up digging all the way from the model to the CUDA kernel. Debugging took longer than training.

What’s the trickiest bug you’ve ever run into?

dcl - 15 hours ago

Is this why I can't seem to fine-tune YOLO models on an Apple M4? The loss hits NaN after a few batches. The same code is fine on a Windows PC and on Google Colab, both CPU and GPU...
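
I guess one way to narrow it down is to run the same step with the same seed and the same CPU-generated batch on both devices, with anomaly detection on, and see whether only the MPS side blows up. A toy sketch of what I mean (a dummy model stands in for YOLO):

```python
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # raise at the op that first produces a NaN/Inf gradient

def one_step(device):
    torch.manual_seed(0)                 # same init and same data on both devices
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    model, x, y = model.to(device), x.to(device), y.to(device)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss.item()

cpu_loss = one_step("cpu")
if torch.backends.mps.is_available():
    print(cpu_loss, one_step("mps"))     # diverging values point at the backend
```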

nraynaud - 21 hours ago

Naive question: do ML tensor libraries not use a Z-order memory layout the way textures do? Is it just not beneficial the way it is for textures?
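
For anyone who hasn't run into the term: Z-order (Morton order) interleaves the bits of the row and column indices, so elements that are close in 2D stay close in memory. A tiny sketch:

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order index."""
    idx = 0
    for i in range(bits):
        idx |= ((row >> i) & 1) << (2 * i + 1)
        idx |= ((col >> i) & 1) << (2 * i)
    return idx

# A 4x4 tile visits memory in the characteristic "Z" pattern:
for r in range(4):
    print([morton_index(r, c) for c in range(4)])
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```

As far as I know the big frameworks stick to plain strided layouts plus a few blocked formats (NCHW vs NHWC and so on), since GEMM-style kernels already do their own explicit tiling, but I'd be curious whether anyone has measured Z-order for tensors.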

gugagore - a day ago

This is the first time I've seen "SGD" used to mean "standard gradient descent" rather than "stochastic gradient descent".

kccqzy - a day ago

This is a minor quibble but I don't really like the author calling Placeholder a leaky abstraction. It's just straight up an incomplete abstraction that only handles inputs but not outputs. As the author says, Placeholder should know about the difference and do the copy-back itself.

mirekrusin - 20 hours ago

Nice work, and surprising. I'd have imagined implementations are cross-tested all the time and this kind of bug would have no way of appearing?

dataflow - a day ago

Dumb question: why isn't there some kind of assertion to sanity-check some of the GPU results against the CPU's?
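
Something like this is roughly what I have in mind; just a sketch, assuming an MPS machine, with guessed tolerances:

```python
import torch

def check_against_cpu(op, *cpu_tensors, device="mps", atol=1e-5, rtol=1e-4):
    """Run op on the CPU and on `device`, and assert the results agree."""
    cpu_out = op(*cpu_tensors)
    dev_out = op(*(t.to(device) for t in cpu_tensors))
    torch.testing.assert_close(dev_out.cpu(), cpu_out, atol=atol, rtol=rtol)

if torch.backends.mps.is_available():
    x = torch.randn(32, 64).t()   # deliberately non-contiguous, like in the bug
    check_against_cpu(lambda t: torch.nn.functional.softmax(t, dim=-1), x)
```

(I realize doing this for every op on every step would roughly double the work, so presumably it only makes sense in a test suite or behind a debug flag.)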

cryber - 19 hours ago

this is a great writeup! methodical without being pedantic.

modeless - 20 hours ago

Another reason people use Nvidia: you know it's the most-used backend and therefore the most likely to have this kind of bug found and fixed before you encounter it.

saagarjha - a day ago

Non-contiguous tensors have to be the #1 source of bugs in PyTorch lol

anal_reactor - 17 hours ago

If I understand correctly, the root cause of the bug was improper use of object-oriented programming: a `Placeholder` object behaves differently depending on how it was created, and requires the user to be aware of that. The check `if is_contiguous` should only ever exist inside the code of the `Placeholder` class.

hershyb_ - 17 hours ago

awesome read!