Reverse engineering GitHub Actions cache to make it fast

blacksmith.sh

154 points by tsaifu 3 days ago


junon - 3 days ago

> iptables was already doing heavy lifting for other subsystems inside our environment, and with each VM adding or removing its own set of rules, things got messy fast, and extremely flakey

We saw the same thing at Vercel. Back when we were still doing docker-as-a-service we used k8s for both internal services as well as user deployments. The latter led to master deadlocks and all sorts of SRE nightmares (literally).

So I was tasked with writing a service scheduler from scratch to replace k8s. By the time we got to manhandling IP address allocations, deep into the rabbit hole, we had already written our own Redis-backed DHCP implementation and needed to insert those IPs into the firewall tables ourselves, since Docker couldn't really do much of anything concurrently.

Iptables was VERY fragile. Aside from the fact that it didn't even have a stable programmatic interface, it was a race-condition nightmare: rules were strictly ordered, there was no compositional or destruction-free way to manage them (namespacing, layering, etc.), and it was just all around the worst tool for the job.
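To make the race concrete: driving iptables from scripts means every "ensure this rule exists" is a non-atomic check-then-add. A minimal sketch of that failure mode (the chain, address, and rule here are made up):

```python
import subprocess

# Made-up rule: allow traffic from one VM's address through FORWARD.
RULE = ["FORWARD", "-s", "10.0.0.42", "-j", "ACCEPT"]

def rule_exists() -> bool:
    # `iptables -C` exits 0 if the rule exists, non-zero otherwise.
    return subprocess.run(
        ["iptables", "-C", *RULE], stderr=subprocess.DEVNULL
    ).returncode == 0

def ensure_rule() -> None:
    if not rule_exists():
        # Another process can insert, delete, or flush rules between the
        # check and this append, so concurrent callers end up with duplicate
        # or missing rules, and ordering relative to other subsystems' rules
        # is pure luck.
        subprocess.run(["iptables", "-A", *RULE], check=True)
```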

Unfortunately not much else existed at the time, and given that we didn't have time to spend implementing our own kernel modules for this system, and that Docker itself had a slew of ridiculous behaviors, we ended up scrapping the project.

Learned a lot though! We were almost done, until we weren't :)

zamalek - 3 days ago

I'm currently migrating some stuff from azdo to GHA, and have been putting past lessons to serious use:

* Perf: don't use "install X" (Node, .Net, Ruby, Python, etc.) tasks. Create a container image with all your deps and use that instead.

* Perf: related to the last, keep multiple utility container images around of varying degrees of complexity. For example, in our case, I decided on PowerShell because we have some devs on Windows and it's the easiest to get working across Linux+Windows - so my simplest container has pwsh and some really basic tools (git, curl, etc.). I build another container on top of that which has the .Net deps, and each .Net repo builds on that one.

* Perf: don't use the cache action at all. Run a nightly job that pulls your code into a container, restores/installs dependencies to warm the cache, then deletes the code. `RUN --mount` is a good way to avoid creating a layer with your code in it.

* Maintainability: don't write big scripts in your workflow file. Create scripts as files that can also be executed on your local machine, and keep only the "glue code" between GHA and your script in the workflow file. I'm slightly lying here: I do source a single utility script that reads the GHA env vars and has functions to set CI variables and so forth (and does sensible things when run locally).
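A minimal sketch of that kind of dual-mode helper, in Python rather than the PowerShell mentioned above; the function names and local-mode behaviour are illustrative, only the GITHUB_* variables are real GHA conventions:

```python
import os
import sys

def in_ci() -> bool:
    # GitHub Actions sets GITHUB_ACTIONS=true for every workflow run.
    return os.environ.get("GITHUB_ACTIONS") == "true"

def set_output(name: str, value: str) -> None:
    """Expose a step output in CI; just log it when run locally."""
    if in_ci():
        # Step outputs are appended to the file GITHUB_OUTPUT points at.
        with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    else:
        print(f"[local] output {name}={value}", file=sys.stderr)

def set_env(name: str, value: str) -> None:
    """Export an env var to later steps in CI; set it in-process locally."""
    if in_ci():
        with open(os.environ["GITHUB_ENV"], "a", encoding="utf-8") as f:
            f.write(f"{name}={value}\n")
    os.environ[name] = value
```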

Our CI builds are stupid fast. Comparatively speaking.

For the OP (I just sent your pricing page to my manager ;) ): having a colocated container registry for these types of things would be super useful. I would say you don't need to expose it to the internet, but sometimes you do need to be able to `podman run` into an image for debug purposes.

[1]: https://docs.github.com/en/actions/how-tos/writing-workflows...

bob1029 - 3 days ago

I struggle to justify CI/CD pipelines so complex that this kind of additional tooling becomes necessary.

There are ways to refactor your technology so that you don't have to suffer so much at integration and deployment time. For example, using containers and hosted SQL where neither is required can instantly 10x+ the complexity of deploying your software.

The last few B2B/SaaS projects I worked on had CI/CD built into the actual product. Writing a simple console app that polls SCM for commits, runs dotnet build and then performs a filesystem operation is approximately all we've ever needed. The only additional enhancement was zipping the artifacts to an S3 bucket so that we could email the link out to the customer's IT team for install in their secure on-prem instances.
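A rough sketch of that kind of poll-build-publish loop, with made-up paths and no error handling:

```python
import subprocess
import time

REPO_DIR = "/srv/build/myapp"      # placeholder checkout path
DEPLOY_DIR = "/srv/deploy/myapp/"  # placeholder deploy target
POLL_SECONDS = 60

def rev(ref: str) -> str:
    return subprocess.run(
        ["git", "-C", REPO_DIR, "rev-parse", ref],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

last_built = rev("HEAD")
while True:
    # Poll SCM for new commits.
    subprocess.run(["git", "-C", REPO_DIR, "fetch", "origin"], check=True)
    remote = rev("origin/main")
    if remote != last_built:
        subprocess.run(["git", "-C", REPO_DIR, "checkout", remote], check=True)
        subprocess.run(["dotnet", "build", "-c", "Release"], cwd=REPO_DIR, check=True)
        # "Deploy" is a plain filesystem operation (or zip + upload to S3).
        subprocess.run(["rsync", "-a", f"{REPO_DIR}/bin/Release/", DEPLOY_DIR], check=True)
        last_built = remote
    time.sleep(POLL_SECONDS)
```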

I would propose a canary: if your proposed CI/CD process is so complicated that you couldn't write a script by hand to replicate it in an afternoon or two, you should seriously question bringing the rest of the team into that coal mine.

tagraves - 3 days ago

It's pretty amazing to see what Blacksmith, Depot, Actuated, etc. have been able to build on top of GitHub Actions. At RWX we got a bit tired of constantly trying to work around the limitations of the platform with self-hosted runners, so we just built an entirely new CI platform on a brand new execution model with support for things like lightning-fast caching out of the box. Plus, there are some fundamental limitations that are impossible to work around, like the retry behavior [0]. Still, I have a huge appreciation for the patience of the Blacksmith team to actually dig in and improve what they can with GHA.

[0] https://www.rwx.com/blog/retry-failures-while-run-in-progres...

jchw - 3 days ago

Oh, this is pretty interesting. One thing that's also worth noting is that the Azure Blob Storage version of GitHub Actions Cache is actually a sort of v2, although internally it is just a brand new service with the internal version of v1. The old service was a REST-ish API that abstracted the storage backend, and it is still used by GitHub Enterprise. The new service is a Twirp-based system where you store things directly into Azure using signed URLs handed out on the Twirp side.

I reverse-engineered this to implement support for the new cache API in Determinate Systems' Magic Nix Cache, which abruptly stopped working earlier this year when GitHub disabled the old API on GitHub.com. One annoying thing is that GitHub seems to continue to tacitly allow people to use the cache internals, but stops short of providing useful things like the protobuf files used to generate the Twirp clients. I wound up reverse engineering them from the actions/cache action's gencode, tweaking the reconstructed protobuf files until I got a byte-for-byte match.
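For a sense of what "signed URLs from the Twirp side" means mechanically: Twirp is just POST {base}/twirp/&lt;package&gt;.&lt;Service&gt;/&lt;Method&gt; with a protobuf or JSON body. A sketch of the shape of that exchange, where every service, method, and field name is a placeholder rather than GitHub's real API:

```python
import requests

# Everything here is a placeholder shape, not GitHub's actual cache service:
# the point is only that a Twirp call returns a signed blob URL and the
# client then uploads the archive straight to Azure, not through the service.
base_url = "https://results.example.invalid"
token = "<runtime token from the job environment>"

resp = requests.post(
    f"{base_url}/twirp/example.cache.v1.CacheService/CreateCacheEntry",
    json={"key": "my-cache-key", "version": "abc123"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
signed_upload_url = resp.json()["signedUploadUrl"]  # hypothetical field name
```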

On the flip side, I did something that might break Blacksmith: I used append blobs instead of block blobs. Why? ... Because it was simpler. For block blobs you have to construct this silly XML payload with the block list or whatever; with append blobs you can just keep appending chunks of data and then seal the blob when you're done. I have always wondered whether being responsible for some of GitHub Actions Cache using append blobs would ever come back to bite me, but as far as I can tell it makes very little difference from the Azure PoV (pricing seems the same, at least). But either way, they probably need to support append blobs now. Sorry :)
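For the curious, a rough sketch of the two upload styles at the Blob Storage REST level, against a placeholder SAS URL with toy chunks (chunk sizing, retries, and the optional seal step omitted):

```python
import base64
import requests

# Placeholder signed (SAS) blob URL and toy payload; a real client would
# stream a cache archive in sizable chunks.
sas_url = "https://example.blob.core.windows.net/cache/entry?sv=...&sig=..."
chunks = [b"first chunk", b"second chunk"]

def upload_as_append_blob(url: str) -> None:
    # Create the append blob, then append chunks in order; no manifest needed.
    requests.put(url, headers={"x-ms-blob-type": "AppendBlob"}, data=b"").raise_for_status()
    for chunk in chunks:
        requests.put(url + "&comp=appendblock", data=chunk).raise_for_status()

def upload_as_block_blob(url: str) -> None:
    # Stage each chunk as a block, then commit the block list -- this is the
    # XML payload referred to above.
    block_ids = []
    for i, chunk in enumerate(chunks):
        block_id = base64.b64encode(f"block-{i:06d}".encode()).decode()
        requests.put(f"{url}&comp=block&blockid={block_id}", data=chunk).raise_for_status()
        block_ids.append(block_id)
    xml = (
        "<?xml version='1.0' encoding='utf-8'?><BlockList>"
        + "".join(f"<Latest>{bid}</Latest>" for bid in block_ids)
        + "</BlockList>"
    )
    requests.put(url + "&comp=blocklist", data=xml.encode()).raise_for_status()
```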

(If you are wondering why not use the Rust Azure SDK, as far as I can tell the official Rust Azure SDK does not support using signed URLs for uploading. And frankly, it would've brought a lot of dependencies and been rather complex to integrate for other Rust reasons.)

(It would also be possible, by setting env variables a certain way, to get virtually all workflows to behave as if they're running under GitHub Enterprise, and get the old REST API. However, Azure SDK with its concurrency features probably yields better performance.)

movedx01 - 3 days ago

Anything for artifacts perhaps? ;) We use external runners (not Blacksmith) and had to work around this manually. https://github.com/actions/download-artifact/issues/362#issu...

crohr - 3 days ago

At this point this is considered a baseline feature of every good GitHub Actions third-party provider, but nice to see the write-up and solution they came up with!

Note that GitHub Actions Cache v2 is actually very good in terms of download/upload speed right now when running on GitHub-managed runners. The low speeds Blacksmith was seeing before are just due to their slow (Hetzner?) network.

I benchmarked most providers (I maintain RunsOn) with regards to their cache performance here: https://runs-on.com/benchmarks/github-actions-cache-performa...

sameermanek - 3 days ago

Is it similar to this article posted a year ago:

https://depot.dev/blog/github-actions-cache

kylegalbraith - 3 days ago

Fun to see folks replicating what we’ve done with Depot for GitHub Actions [0]. Going as far as using a similar title :)

Forking the ecosystem of actions to plug in your cache backend isn't a good long-term solution.

[0] https://depot.dev/blog/github-actions-cache

EdJiang - 3 days ago

Related read: Cirrus Labs also wrote a drop-in replacement for GH Actions cache on their platform.

https://cirrus-runners.app/blog/2024/04/23/speeding-up-cachi...

https://github.com/cirruslabs/cache

nodesocket - 3 days ago

I'm currently using Blacksmith for my arm64 Docker builds. Unfortunately my workflow currently requires invoking a custom bash script that executes the Docker commands. Does this mean I can now use Docker image caching without needing to migrate to useblacksmith/build-push-action?

esafak - 3 days ago

Caching gets trickier when tasks spin up containers.