WebSockets cost us $1M on our AWS bill
recall.ai
249 points by tosh 8 hours ago
Classic story of a startup taking a "good enough" shortcut and then coming back later to optimize.
---
I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.
Because this cluster of VMs was doing batch processing that the founder believed should be CPU-intensive, everyone just assumed that increasing load came with increasing customer size, and that this was just an annoyance we could get to after we shipped one more feature.
But, at one point, the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided that I could take some time to hook this very CPU-intensive cluster up to a profiler.
I thought I was in for a 1-2 week project and that I'd end up chasing down a few cans of worms. Instead, the CPU load was because we were constantly loading an entire table, one we never deleted from, into the application's process. The table held transient data that should only last a few hours at most.
I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.
I also made a pull request that added a simple filter so the query only loads the last 3 days of data, and introduced a background job to clean out the table periodically.
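For the curious, a minimal sketch of what that kind of fix looks like, assuming a hypothetical transient_jobs table with a unix-timestamp created_at column (the real schema and stack were different):

    import sqlite3
    import time

    THREE_DAYS = 3 * 24 * 60 * 60  # retention window in seconds

    conn = sqlite3.connect("app.db")
    cutoff = time.time() - THREE_DAYS

    # Before: the hot path effectively ran "SELECT * FROM transient_jobs"
    # and loaded a decade of dead rows into the process on every pass.

    # After: only load rows inside the retention window.
    recent = conn.execute(
        "SELECT id, payload FROM transient_jobs WHERE created_at >= ?",
        (cutoff,),
    ).fetchall()

    # Periodic cleanup so the table can't grow without bound again.
    conn.execute("DELETE FROM transient_jobs WHERE created_at < ?", (cutoff,))
    conn.commit()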
As much as you can say (perhaps not hard numbers, but as a percentage), what was the savings to the bottom line / cloud costs?
Probably ~5% of cloud costs. Combined with the prior round of optimizations, it was substantial.
I was really disappointed when my wife couldn't get the night off from work when the company took everyone out to a fancy steak house.
So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
I'm not sure how they feel, but when it happens to me, it's not a big deal because it's my job to do things like that. If I fuck up and cost them $10k/month I'm certainly not going to offer to reimburse them.
They're more pissed about the 1.2M they spent than about the 10k a month they saved.
What should they have gotten?
They are in theory owed nothing more than their salary, but it can be very good for morale to reward that type of thing (assuming they are not introducing a perverse incentive)
> So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
You seem to be assuming that a $200 meal was the only compensation the person received, and they weren't just getting a nice meal as a little something extra on top of getting paid for doing their job competently and efficiently.
But that's the kind of deal I make when I take a job: I do the work (pretty well most of the time), and I get paid. If I stop doing the work, I stop getting paid. If they stop paying, I stop doing the work. (And bonus, literally, if I get a perk once in a while like a free steak dinner that I wasn't expecting)
It doesn't have to be more complicated than that.
Yeah? Well, proper rewards make those savings and optimizations more common. Otherwise most people will do the work needed just to have work tomorrow.
It creates a perverse incentive to deliberately do things a more expensive way at the beginning and then be a hero 6 months down the line by refactoring it to be less expensive.
Ha ha, software developers already have this incentive. Viz: "superhero 10x programmer" writing unmaintainable code to provide some desirable features, whose work later turns out to be much less awesome than originally billed.
Of course the truth is more complicated than the sound bite, but still...
Depends. There are people who put in the absolute minimum work they can get away with, and there are people who have pride in their profession.
That's independent of pay scale.
Granted, if you pay way below expectations, you'll lose the professionals over time. But if you pay lavishly no matter what, you get the 2021/2022 big tech hiring cycle instead. Neither one is a great outcome.
99% of the time, it's either a quadratic (or exponential) algorithm or a really bad DB query.
can also be a linear algorithm that does N+1 queries. ORMs can be very good at hiding this implementation detail
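To make that concrete, here's a toy sketch of the N+1 shape (plain sqlite3 instead of an ORM, with made-up users/orders tables, but lazily touching a relation inside a loop in an ORM generates exactly this):

    import sqlite3

    conn = sqlite3.connect("shop.db")

    # One query for the parent rows...
    users = conn.execute("SELECT id, name FROM users").fetchall()

    # ...then one more query per parent row. The code looks linear,
    # but it's N+1 round trips to the database.
    for user_id, name in users:
        order_count = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE user_id = ?", (user_id,)
        ).fetchone()[0]
        print(name, order_count)

    # The fix is a single query (a JOIN, or an IN over the collected ids).
    rows = conn.execute(
        "SELECT u.name, COUNT(o.id) FROM users u"
        " LEFT JOIN orders o ON o.user_id = u.id GROUP BY u.id"
    ).fetchall()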
> a linear algorithm that does N+1 queries.
That's what quadratic means.
Not really what they're talking about https://stackoverflow.com/questions/97197/what-is-the-n1-sel...
Typically implemented accidentally: https://www.tumblr.com/accidentallyquadratic
> One complicating factor here is that raw video is surprisingly high bandwidth.
It's weird to be living in a world where this is a surprise but here we are.
Nice write-up though. WebSockets has a number of nonsensical design decisions, but I wouldn't have expected this to be the one chewing up all your CPU.
I think it's just rare for a lot of people to be handling raw video. Most people interact with highly efficient (lossy) codecs on the web.
I was surprised when calculating and sizing the shared memory for my gaming VM for use with "Looking-Glass". At 165 Hz, 2K HDR is many gigabytes per second; that's why HDMI and DisplayPort are specced so high.
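Back-of-the-envelope, assuming 2K means 2560x1440 and HDR works out to roughly 4 bytes per pixel (with raw 1080p30 thrown in for comparison):

    # Uncompressed video bandwidth = width * height * fps * bytes_per_pixel
    def raw_bandwidth(width, height, fps, bytes_per_pixel):
        return width * height * fps * bytes_per_pixel  # bytes per second

    gaming = raw_bandwidth(2560, 1440, 165, 4)  # ~2.4 GB/s for 2K 165 Hz HDR
    web = raw_bandwidth(1920, 1080, 30, 3)      # ~187 MB/s for raw 1080p30

    print(f"2K @ 165 Hz HDR: {gaming / 1e9:.1f} GB/s")
    print(f"1080p30 raw:     {web / 1e6:.0f} MB/s")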
> It's weird to be living in a world where this is a surprise but here we are.
I think it's because the cost of it is so abstracted away with free streaming video all across the web. Once you take a look at the egress and ingress sides you realize how quickly it adds up.
I always knew video was "expensive", but my mark for what expensive meant was a good few orders of magnitude off when I researched the topic for a personal project.
I can easily imagine the author being in a similar boat, knowing that it isn't cheap, but then not realizing that expensive in this context truly does mean expensive until they actually started seeing the associated costs.
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
> Is this really an AWS issue?
I doubt they would have even noticed this outrageous cost if they were running on bare-metal Xeon or Ryzen colo'd servers. You can rent real 44-core Xeon servers for like, $250/month.
So yes, it's an AWS issue.
> You can rent real 44-core Xeon servers for like, $250/month.
Where, for instance?
Hetzner, for example. An EPYC 48c (96t) goes for 230 euros.
Hetzner's network is a complete dog. They also sell you machines that should have long since been EOL'ed. No serious business should be using them.
What CPU do you think your workload is using on AWS?
GCP exposes their CPU models, and they have some Haswell and Broadwell microarchitectures in service.
That's a 10+ year old part, for those paying attention.
I think they meant that Hetzner is offering customers specific machines they know to be faulty and should have EOL'd, not that they use deprecated CPUs.
That's scary if true; any sources? My google-fu is failing me. :/
It's not scary, it's part of the value proposition.
I used to work for a company that rented lots of Hetzner boxes. Consumer-grade hardware with frequent disk failures was just what we accepted for saving a buck.
Most GCP and some AWS instances will migrate to another node when the host is faulty. Also, the disk is virtual. None of this applies to bare-metal Hetzner.