Reverse engineering GitHub Actions cache to make it fast

blacksmith.sh

154 points by tsaifu 3 days ago


junon - 3 days ago

> iptables was already doing heavy lifting for other subsystems inside our environment, and with each VM adding or removing its own set of rules, things got messy fast, and extremely flakey

We saw the same thing at Vercel. Back when we were still doing docker-as-a-service we used k8s for both internal services as well as user deployments. The latter led to master deadlocks and all sorts of SRE nightmares (literally).

So I was tasked with writing a service scheduler from scratch to replace k8s. By the time we got to manhandling IP address allocations, deep into the rabbit hole, we had already written our own Redis-backed DHCP implementation and needed to insert those IPs into the firewall tables ourselves, since Docker couldn't really do much of anything concurrently.

Iptables was VERY fragile. Aside from the fact that it didn't even have a stable programmatic interface, it was a race-condition nightmare: rules were strictly ordered, there was no compositional or destruction-free way to manage them (namespacing, layering, etc.), and it was just all around the worst tool for the job.
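To make the race concrete: driving iptables from scripts means every "ensure this rule exists" is a non-atomic check-then-add. A minimal sketch of that failure mode (the chain, address, and rule here are made up):

```python
import subprocess

# Made-up rule: allow traffic from one VM's address through FORWARD.
RULE = ["FORWARD", "-s", "10.0.0.42", "-j", "ACCEPT"]

def rule_exists() -> bool:
    # `iptables -C` exits 0 if the rule exists, non-zero otherwise.
    return subprocess.run(
        ["iptables", "-C", *RULE], stderr=subprocess.DEVNULL
    ).returncode == 0

def ensure_rule() -> None:
    if not rule_exists():
        # Another process can insert, delete, or flush rules between the
        # check and this append, so concurrent callers end up with duplicate
        # or missing rules, and ordering relative to other subsystems' rules
        # is pure luck.
        subprocess.run(["iptables", "-A", *RULE], check=True)
```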

Unfortunately not much else existed at the time, and given that we didn't have time to spend implementing our own kernel modules for this system, and that Docker itself had a slew of ridiculous behaviors, we ended up scrapping the project.

Learned a lot though! We were almost done, until we weren't :)

zamalek - 3 days ago

I'm currently migrating some stuff from azdo to GHA, and have been putting past lessons to serious use:

* Perf: don't use "install X" (Node, .Net, Ruby, Python, etc.) tasks. Create a container image with all your deps and use that instead.

* Perf: related to the last, keep multiple utility container images around of varying degrees of complexity. For example, in our case, I decided on PowerShell because we have some devs on Windows and it's the easiest to get working across Linux+Windows - so my simplest container has pwsh and some really basic tools (git, curl, etc.). I build another container on top of that which has the .Net deps, and each .Net repo builds on that one.

* Perf: don't use the cache action at all. Run a nightly job that pulls your code into a container, restores/installs dependencies to warm the cache, then deletes the code. `RUN --mount` is a good way to avoid creating a layer with your code in it.

* Maintainability: don't write big scripts in your workflow file. Create scripts as files that can also be executed on your local machine, and keep only the "glue code" between GHA and your script in the workflow file. I'm slightly lying here: I do source a single utility script that reads the GHA env vars and has functions to set CI variables and so forth (and does sensible things when run locally).
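A minimal sketch of that kind of dual-mode helper, in Python rather than the PowerShell mentioned above; the function names and local-mode behaviour are illustrative, only the GITHUB_* variables are real GHA conventions:

```python
import os
import sys

def in_ci() -> bool:
    # GitHub Actions sets GITHUB_ACTIONS=true for every workflow run.
    return os.environ.get("GITHUB_ACTIONS") == "true"

def set_output(name: str, value: str) -> None:
    """Expose a step output in CI; just log it when run locally."""
    if in_ci():
        # Step outputs are appended to the file GITHUB_OUTPUT points at.
        with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    else:
        print(f"[local] output {name}={value}", file=sys.stderr)

def set_env(name: str, value: str) -> None:
    """Export an env var to later steps in CI; set it in-process locally."""
    if in_ci():
        with open(os.environ["GITHUB_ENV"], "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    os.environ[name] = value
```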

Our CI builds are stupid fast. Comparatively speaking.

For the OP (I just sent your pricing page to my manager ;) ): having a colocated container registry for these types of things would be super useful. I would say you don't need to expose it to the internet, but sometimes you do need to be able to `podman run` into an image for debug purposes.

[1]: https://docs.github.com/en/actions/how-tos/writing-workflows...

bob1029 - 3 days ago

I struggle to justify CI/CD pipelines so complex that this kind of additional tooling becomes necessary.

There are ways to refactor your technology so that you don't have to suffer so much at integration and deployment time. For example, using containers and hosted SQL where neither is required can instantly 10x+ the complexity of deploying your software.

The last few B2B/SaaS projects I worked on had CI/CD built into the actual product. Writing a simple console app that polls SCM for commits, runs dotnet build and then performs a filesystem operation is approximately all we've ever needed. The only additional enhancement was zipping the artifacts to an S3 bucket so that we could email the link out to the customer's IT team for install in their secure on-prem instances.
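A rough sketch of that kind of poll-build-publish loop, with made-up paths and no error handling:

```python
import subprocess
import time

REPO_DIR = "/srv/build/myapp"      # placeholder checkout path
DEPLOY_DIR = "/srv/deploy/myapp/"  # placeholder deploy target
POLL_SECONDS = 60

def rev(ref: str) -> str:
    return subprocess.run(
        ["git", "-C", REPO_DIR, "rev-parse", ref],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

last_built = rev("HEAD")
while True:
    # Poll SCM for new commits.
    subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
    remote = rev("origin/main")
    if remote != last_built:
        subprocess.run(["git", "-C", REPO_DIR, "checkout", remote], check=True)
        subprocess.run(["dotnet", "build", "-c", "Release"], cwd=REPO_DIR, check=True)
        # "Deploy" is a plain filesystem operation (or zip + upload to S3).
        subprocess.run(["rsync", "-a", f"{REPO_DIR}/bin/Release/", DEPLOY_DIR], check=True)
        last_built = remote
    time.sleep(POLL_SECONDS)
```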

I would propose a canary: if your proposed CI/CD process is so complicated that you couldn't write a script by hand to replicate it in an afternoon or two, you should seriously question bringing the rest of the team into that coal mine.

tagraves - 3 days ago

It's pretty amazing to see what Blacksmith, Depot, Actuated, etc. have been able to build on top of GitHub Actions. At RWX we got a bit tired of constantly trying to work around the limitations of the platform with self-hosted runners, so we just built an entirely new CI platform on a brand new execution model with support for things like lightning-fast caching out of the box. Plus, there are some fundamental limitations that are impossible to work around, like the retry behavior [0]. Still, I have a huge appreciation for the patience of the Blacksmith team to actually dig in and improve what they can with GHA.

[0] https://www.rwx.com/blog/retry-failures-while-run-in-progres...

jchw - 3 days ago

Oh, this is pretty interesting. One thing that's also worth noting is that the Azure Blob Storage version of GitHub Actions Cache is actually a sort of v2, although internally it is just a brand new service with the internal version of v1. The old service was a REST-ish API that abstracted the storage backend, and it is still used by GitHub Enterprise. The new service is a Twirp-based system where you store things directly into Azure using signed URLs handed out on the Twirp side.

I reverse-engineered this to implement support for the new cache API in Determinate Systems' Magic Nix Cache, which abruptly stopped working earlier this year when GitHub disabled the old API on GitHub.com. One annoying thing is that GitHub seems to continue to tacitly allow people to use the cache internals, but stops short of providing useful things like the protobuf files used to generate the Twirp clients. I wound up reverse engineering them from the actions/cache action's gencode, tweaking the reconstructed protobuf files until I got a byte-for-byte match.
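For a sense of what "signed URLs from the Twirp side" means mechanically: Twirp is just POST {base}/twirp/&lt;package&gt;.&lt;Service&gt;/&lt;Method&gt; with a protobuf or JSON body. A sketch of the shape of that exchange, where every service, method, and field name is a placeholder rather than GitHub's real API:

```python
import requests

# Everything here is a placeholder shape, not GitHub's actual cache service:
# the point is only that a Twirp call returns a signed blob URL and the
# client then uploads the archive straight to Azure, not through the service.
base_url = "https://results.example.invalid"
token = "<runtime token from the job environment>"

resp = requests.post(
    f"{base_url}/twirp/example.cache.v1.CacheService/CreateCacheEntry",
    json={"key": "my-cache-key", "version": "abc123"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
signed_upload_url = resp.json()["signedUploadUrl"]  # hypothetical field name
```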

On the flip side, I did something that might break Blacksmith: I used append blobs instead of block blobs. Why? ... Because it was simpler. For block blobs you have to construct this silly XML payload with the block list or whatever; with append blobs you can just keep appending chunks of data and then seal the blob when you're done. I have always wondered whether being responsible for some of GitHub Actions Cache using append blobs would ever come back to bite me, but as far as I can tell it makes very little difference from the Azure PoV (pricing seems the same, at least). But either way, they probably need to support append blobs now. Sorry :)
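For the curious, a rough sketch of the two upload styles at the Blob Storage REST level, against a placeholder SAS URL with toy chunks (chunk sizing, retries, and the optional seal step omitted):

```python
import base64
import requests

# Placeholder signed (SAS) blob URL and toy payload; a real client would
# stream a cache archive in sizable chunks.
sas_url = "https://example.blob.core.windows.net/cache/entry?sv=...&sig=..."
chunks = [b"first chunk", b"second chunk"]

def upload_as_append_blob(url: str) -> None:
    # Create the append blob, then append chunks in order; no manifest needed.
    requests.put(url, headers={"x-ms-blob-type": "AppendBlob"}, data=b"").raise_for_status()
    for chunk in chunks:
        requests.put(url + "&comp=appendblock", data=chunk).raise_for_status()

def upload_as_block_blob(url: str) -> None:
    # Stage each chunk as a block, then commit the block list -- this is the
    # XML payload referred to above.
    block_ids = []
    for i, chunk in enumerate(chunks):
        block_id = base64.b64encode(f"block-{i:06d}".encode()).decode()
        requests.put(f"{url}&comp=block&blockid={block_id}", data=chunk).raise_for_status()
        block_ids.append(block_id)
    xml = (
        "<?xml version='1.0' encoding='utf-8'?><BlockList>"
        + "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
        + "</BlockList>"
    )
    requests.put(url + "&comp=blocklist", data=xml.encode()).raise_for_status()
```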

(If you are wondering why not use the Rust Azure SDK, as far as I can tell the official Rust Azure SDK does not support using signed URLs for uploading. And frankly, it would've brought a lot of dependencies and been rather complex to integrate for other Rust reasons.)

(It would also be possible, by setting env variables a certain way, to get virtually all workflows to behave as if they're running under GitHub Enterprise, and get the old REST API. However, Azure SDK with its concurrency features probably yields better performance.)

movedx01 - 3 days ago

Anything for artifacts perhaps? ;) We use external runners (not Blacksmith) and had to work around this manually. https://github.com/actions/download-artifact/issues/362#issu...

crohr - 3 days ago

At this point this is considered a baseline feature of every good GitHub Actions third-party provider, but nice to see the write-up and solution they came up with!

Note that GitHub Actions Cache v2 is actually very good in terms of download/upload speed right now when running on GitHub-managed runners. The low speeds Blacksmith was seeing before are just due to their slow (Hetzner?) network.

I benchmarked most providers (I maintain RunsOn) with regards to their cache performance here: https://runs-on.com/benchmarks/github-actions-cache-performa...

sameermanek - 3 days ago

Is it similar to this article posted a year ago:

https://depot.dev/blog/github-actions-cache

kylegalbraith - 3 days ago

Fun to see folks replicating what we’ve done with Depot for GitHub Actions [0]. Going as far as using a similar title :)

Forking the ecosystem of actions to plug in your cache backend isn't a good long-term solution.

[0] https://depot.dev/blog/github-actions-cache

EdJiang - 3 days ago

Related read: Cirrus Labs also wrote a drop-in replacement for GH Actions cache on their platform.

https://cirrus-runners.app/blog/2024/04/23/speeding-up-cachi...

https://github.com/cirruslabs/cache

nodesocket - 3 days ago

I'm currently using Blacksmith for my arm64 Docker builds. Unfortunately my workflow currently requires invoking a custom bash script that executes the Docker commands. Does this mean I can now use Docker image caching without needing to migrate to useblacksmith/build-push-action?

esafak - 3 days ago

Caching gets trickier when tasks spin up containers.