How I solved PyTorch's cross-platform nightmare

svana.name

73 points by msvana 5 days ago


lynndotpy - 2 days ago

> Setting up a Python project that relies on PyTorch, so that it works across different accelerators and operating systems, is a nightmare.

I would like to add some anecdata to this.

When I was a PhD student, I already had 12 years of using and administrating Linuxes as my personal OS, and I'd already had my share of package manager and dependency woes.

But managing Python, PyTorch, and CUDA dependencies was relatively new to me. Sometimes I'd lose an evening here or there to something silly. But I had one week especially dominated by these woes, to the point where I'd have dreams about package management problems at the terminal.

They were mundane dreams but I'd chalk them up as nightmares. The worst was having the pleasant dream where those problems went away forever, only to wake up to realize that was not the case.

di - 2 days ago

Note that https://peps.python.org/pep-0440/#direct-references says:

> Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended as a tool for software integrators rather than publishers.

This means that PyPI will not accept your project metadata as you currently have it configured. See https://github.com/pypi/warehouse/issues/7136 for more details.

mdaniel - 2 days ago

> Cross-Platform

  cpu = [
  "torch @ https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl ; python_version == '3.12'",
  "torch @ https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl ; python_version == '3.13'",
  ]
:-/ It reminds me of Microsoft calling their thing "cross platform" because it works on several copies of Windows

In all seriousness, I get the impression that pytorch is such a monster PITA to manage because it cares so much about the target hardware. It'd be like a blog post saying "I solved the assembly language nightmare"

kwon-young - 2 days ago

In my opinion, anything that touches compiled packages like PyTorch should be packaged with conda/mamba on conda-forge. I've found it is the only package manager for Python that will reliably detect my hardware and install the correct version of every dependency.
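For reference, a minimal environment.yml along these lines might look like the sketch below (the environment name and version pins are illustrative; on conda-forge the solver selects the CPU or CUDA build variant of pytorch based on the machine):

```yaml
# environment.yml -- minimal sketch; create with `mamba env create -f environment.yml`
name: ml
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pytorch
```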

cmdr2 - 2 days ago

https://pypi.org/p/torchruntime might help here; it's designed precisely for this purpose.

`pip install torchruntime`

`torchruntime install torch`

It figures out the correct torch to install on the user's PC, factoring in the OS (Win, Linux, Mac), the GPU vendor (NVIDIA, AMD, Intel) and the GPU model (especially for ROCm, whose configuration varies per generation and ROCm version).

And it tries to support quite a number of older GPUs as well, which are pinned to older versions of torch.

It's used by a few cross-platform torch-based consumer apps, running on quite a number of consumer installations.

arun-mani-j - 2 days ago

This is so nice, I wish more packages followed something like this. I'm on an AMD integrated GPU (it doesn't even support ROCm). Whenever I install a Python package that depends on PyTorch, it automatically installs some GBs of CUDA-related packages.

This ends up wasting space and slowing down installation :(

Speaking of PyTorch and CUDA, I wish the Vulkan backend would become stable, but that seems like a far-off dream...

https://docs.pytorch.org/executorch/stable/backends-vulkan.h...

zbowling - 2 days ago

Check out Pixi! Pixi is an alternative to the common conda and pypi frontends and has a better system for hardware feature detection, getting the best version of Torch for your hardware that is compatible across your packages (except for AMD at the moment). It can pull in the conda-forge or PyPI builds of PyTorch and help you manage things automagically across platforms. https://pixi.sh/latest/python/pytorch/
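As a rough sketch of what that looks like, a pixi.toml can declare the CUDA requirement so the solver picks a matching build (field names follow the pixi docs; the project name and pins here are made up, and exact keys may vary between pixi versions):

```toml
# pixi.toml -- minimal sketch
[project]
name = "myapp"
channels = ["conda-forge"]
platforms = ["linux-64", "win-64", "osx-arm64"]

[dependencies]
python = "3.12.*"
pytorch = "*"

# Tells the solver this machine provides a CUDA driver,
# so it can select a CUDA build variant of pytorch.
[system-requirements]
cuda = "12.0"
```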

It doesn't solve how you package your wheels specifically; that problem is still pushed onto your downstream users because of boneheaded packaging decisions by PyTorch itself. But as a consumer, Pixi softens the blow. The conda-forge builds of PyTorch are also a bit more sane.

ashvardanian - 2 days ago

Related, but wasn’t broadly discussed on HN: https://astral.sh/blog/wheel-variants

Simulacra - 2 days ago

Good writeup. PyTorch has generally been very good to me, as long as I can mitigate its occasional resource hogging. Production can be a little wonky, but for everything else it works.

tuna74 - 2 days ago

Is there a problem using distro packages for Pytorch? What are the downsides of using the official Fedora Pytorch for example?

antimora - 2 days ago

Check out https://github.com/tracel-ai/burn project! It makes deploying models across different platforms easy. It uses Rust instead of Python.

userabchn - 2 days ago

I maintain a package that provides some PyTorch operators that are written in C/C++/CUDA. I have tried various approaches over the years (including the ones endorsed by PyTorch), but the only solution I have found that seems to work flawlessly for everyone who uses it is to have no Python or PyTorch dependence in the compiled code, and to load the compiled libraries using ctypes. I use an old version of nvcc to compile the CUDA, use manylinux2014 for the Linux builds, and ask users to install PyTorch themselves before installing my package.