WebSockets cost us $1M on our AWS bill
recall.ai | 362 points by tosh 8 months ago
Classic story of a startup taking a "good enough" shortcut and then coming back later to optimize.
---
I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.
Because this cluster of VMs was doing batch processing that the founder believed should be CPU intense, everyone just assumed that increasing load came with increasing customer size; and that this was just an annoyance that we could get to after we made one more feature.
But, at one point the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided that I could take some time to hook this very CPU intense cluster up to a profiler.
I thought I was going to be in for a 1-2 week project and would follow a few worms. Instead, the CPU load was because we were constantly loading an entire table, that we never deleted from, into the application's process. The table had transient data that should only last a few hours at most.
I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.
I also made a pull request that introduced a simple filter to the query to only load 3 days of data; and then introduced a background operation to clean out the table periodically.
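For readers who want to picture that fix, here is a minimal sketch of the two pieces (a bounded query plus a periodic purge), using SQLite from the Python standard library. The table name `transient_jobs`, its columns, and the helper names are all hypothetical; the original system's schema and database are not described in the comment.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=3)  # assumed window, matching the "3 days" mentioned above


def cutoff(now: datetime) -> str:
    """ISO-8601 timestamp of the oldest row worth keeping."""
    return (now - RETENTION).isoformat()


def load_recent(conn: sqlite3.Connection, now: datetime) -> list:
    # Filter in the database instead of pulling the whole table into the process.
    return conn.execute(
        "SELECT id, payload FROM transient_jobs WHERE created_at >= ? ORDER BY id",
        (cutoff(now),),
    ).fetchall()


def purge_old(conn: sqlite3.Connection, now: datetime) -> int:
    # Intended to run periodically (cron job, scheduler thread, etc.).
    cur = conn.execute("DELETE FROM transient_jobs WHERE created_at < ?", (cutoff(now),))
    conn.commit()
    return cur.rowcount


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transient_jobs (id INTEGER PRIMARY KEY, payload TEXT, created_at TEXT)"
    )
    # An index on the timestamp keeps both the filter and the purge cheap.
    conn.execute("CREATE INDEX idx_created ON transient_jobs (created_at)")
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    ages_in_days = [0, 1, 2, 5, 400, 3650]
    conn.executemany(
        "INSERT INTO transient_jobs (payload, created_at) VALUES (?, ?)",
        [(f"job-{d}", (now - timedelta(days=d)).isoformat()) for d in ages_in_days],
    )
    loaded = len(load_recent(conn, now))
    deleted = purge_old(conn, now)
    remaining = conn.execute("SELECT COUNT(*) FROM transient_jobs").fetchone()[0]
    return loaded, deleted, remaining
```

Calling `demo()` loads only the three recent rows and purges the three stale ones. Comparing timestamps as strings works here only because every value is written in the same ISO-8601 format and timezone.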
Hah. At a previous place I found that our cloud cost consisted of 90% storage costs. The data? Tens of thousands of incomplete backups of the in-office file server. 3 years of the NAS continuously trying to back itself up to S3 and failing every time.
I love these stories. I have a few as well. In the end I know we're all just doing our job, but I've been tempted at times to say to my manager: "I will save the company $10k/month tomorrow if you give me a cut of the pie."
This should be the norm, actually.
It drives ya nuts to read a story where some guy on the shop floor saves his employer ten million dollars and to reward him they give him a 20% off coupon for Home Depot.
This sounds nice in theory, but would incentivize people to introduce unnecessarily expensive things initially and optimize them later to claim some of the savings. We would like to think that nobody would do something like that, but the sad reality is that there are plenty, especially as the potential reward goes up high enough.
Boss makes a dollar, I make a dime, that's why my code runs in exponential time.
> It drives ya nuts to read a story where some guy on the shop floor saves his employer ten million dollars and to reward him they give him a 20% off coupon for Home Depot.
It’s your job as an employee, it’s why you get paid in the first place
Is it? Or is an employee's job to just do the work they're asked to do?
"It depends" is, of course, the common answer, but in most places I've worked, "please help find operational optimizations that can have a positive impact for the team, department or organization" has certainly never been an explicit ask.
Asking teammates to change something to help on a project may fall under the same category, but usually the effect isn't felt beyond that project.
My default mode when coming in to jobs is to try to get a 'full company' view, because I want to know how things work and how they might be made better. That approach is usually not met with much enthusiasm, and usually more with "that's not your job, you don't need to know that", etc.
I took a daily import routine that was taking 25+ hours (meaning we couldn't show 'daily info' because it was out of date before finishing import) and got it down to 30 minutes. This was after having to fight/argue to see the data, and being told for a couple weeks "it can't be sped up, we'll have to buy faster hardware" ($8-$10k min, but they were looking at $15-20k IIRC). I spent a few hours over a weekend and got it down to 30 minutes, and saved the company minimum $8k. But I had to fight/argue to even do that ("that's not your job", "Charles is taking care of that", "the client will just have to deal with more delays while we upgrade", etc).
Uh, no. Have you ever been employed? Your job is what is laid out in your contract, period. You are paid to do that specific set of tasks and nothing else. "Other duties as required" be damned.
Fixing the business is very explicitly not your job, and is absolutely not what you're paid for. Any value you create for the business outside of those bounds is at your own cost and you absolutely will not be compensated unless the business is so small you don't have six layers of management trying to extract any kind of promotion.
99% of the time, it's either a quadratic (or exponential) algorithm or a really bad DB query.
Can also be a linear algorithm that does N+1 queries. ORMs can be very good at hiding this implementation detail.
> a linear algorithm that does N+1 queries.
That's what quadratic means.
Typically implemented accidentally: https://www.tumblr.com/accidentallyquadratic
Not really what they're talking about https://stackoverflow.com/questions/97197/what-is-the-n1-sel...
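To make the term concrete: "N+1" usually refers to one query that fetches a list plus one more query per row in that list, so the number of round trips grows linearly with the rows, and the per-query overhead is what hurts. A minimal sketch with SQLite, where the `authors`/`posts` schema and the function names are invented for illustration:

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)", [(i, f"author-{i}") for i in range(1, 6)]
    )
    conn.executemany(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)",
        [(a, f"post-{a}-{n}") for a in range(1, 6) for n in range(2)],
    )
    return conn


def titles_n_plus_one(conn: sqlite3.Connection):
    """One query for the authors, then one more per author: N+1 round trips."""
    queries = 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn: sqlite3.Connection):
    """Same data in a single round trip using a JOIN."""
    rows = conn.execute(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1
```

With five authors, the first version issues six queries and the second issues one, for the same result. An ORM's lazy-loaded relationship can produce the first pattern without any visible loop over SQL in the application code.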
It wasn't a financial cost, but the biggest single performance improvement I've seen firsthand came from optimizing a SQL query. One of our Professional Services people had written a query that did repeated self-joins on a fairly large table, which took ~15 minutes to run. A DBA-turned-dev on our team rewrote it using MSSQL's PIVOT operator, and the query started executing in less than a second.
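The original query and schema aren't given, so the following is only a sketch of the general shape of that rewrite. SQLite has no PIVOT operator, so conditional aggregation (`MAX(CASE WHEN ...)`) stands in for it here; the `attrs` table and its contents are made up.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE attrs (entity_id INTEGER, attr TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO attrs VALUES (?, ?, ?)",
        [
            (1, "color", "red"), (1, "size", "L"), (1, "shape", "round"),
            (2, "color", "blue"), (2, "size", "S"), (2, "shape", "square"),
        ],
    )
    return conn


# One self-join per attribute: the table is visited once per output column.
SELF_JOINS = """
SELECT c.entity_id, c.value, s.value, h.value
FROM attrs c
JOIN attrs s ON s.entity_id = c.entity_id AND s.attr = 'size'
JOIN attrs h ON h.entity_id = c.entity_id AND h.attr = 'shape'
WHERE c.attr = 'color'
ORDER BY c.entity_id
"""

# Pivot-style rewrite: a single grouped pass turns rows into columns.
SINGLE_PASS = """
SELECT entity_id,
       MAX(CASE WHEN attr = 'color' THEN value END),
       MAX(CASE WHEN attr = 'size'  THEN value END),
       MAX(CASE WHEN attr = 'shape' THEN value END)
FROM attrs
GROUP BY entity_id
ORDER BY entity_id
"""


def run(conn: sqlite3.Connection, sql: str) -> list:
    return conn.execute(sql).fetchall()
```

Both queries return the same rows on this toy data. How much faster the single-pass form is depends on table size, indexes, and the query planner, so the sketch makes no claim about reproducing the speedup described above.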
As much as you can say (perhaps not hard numbers, but as a percentage), what was the savings to the bottom line / cloud costs?
Probably ~5% of cloud costs. Combined with the prior round of optimizations, it was substantial.
I was really disappointed when my wife couldn't get the night off from work when the company took everyone out to a fancy steak house.
So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
I'm not sure how they feel, but when it happens to me, it's not a big deal because it's my job to do things like that. If I fuck up and cost them $10k/month I'm certainly not going to offer to reimburse them.
They're more pissed about the 1.2M they spent than about the 10k a month they saved.
> So you saved the company $10k a month and got a $200 meal in gratitude? Awesome.
You seem to be assuming that a $200 meal was the only compensation the person received, and they weren't just getting a nice meal as a little something extra on top of getting paid for doing their job competently and efficiently.
But that's the kind of deal I make when I take a job: I do the work (pretty well most of the time), and I get paid. If I stop doing the work, I stop getting paid. If they stop paying, I stop doing the work. (And bonus, literally, if I get a perk once in a while like a free steak dinner that I wasn't expecting)
It doesn't have to be more complicated than that.
Yeah? Well, proper rewards make those savings and optimizations more common. Otherwise most people will do the work needed just to have work tomorrow.
Depends. There are people who put in the absolute minimum work they can get away with, and there are people who have pride in their profession.
That's independent of pay scale.
Granted, if you pay way below expectations, you'll lose the professionals over time. But if you pay lavishly no matter what, you get the 2021/2022 big tech hiring cycle instead. Neither one is a great outcome.
A business that relies on people having pride in their profession won't scale. Proper rewards scale.
We demonstrated in 2022 that they don't.
There is no single mechanism that does. Paying well is always a component in attracting talent (see my original comment)
It is not a guarantor of quality/motivation. That's ongoing leadership work. And part of that is maintaining/kindling pride in people's work (and firing the ones who are just there for the money)
It creates a perverse incentive to deliberately do things a more expensive way at the beginning and then be a hero 6 months down the line by refactoring it to be less expensive.
Ha ha, software developers already have this incentive. Viz: "superhero 10x programmer" writing unmaintainable code to provide some desirable features, whose work later turns out to be much less awesome than originally billed.
Of course the truth is more complicated than the sound bite, but still...
The dinner was to celebrate getting acquired, which was a wide team effort. The cost savings were one of the pieces I contributed.
Don't assume that a steak dinner was the only recognition we got.
As far as comp: I was well taken care of, and I won't discuss more in a public forum.
What should they have gotten?
They are in theory owed nothing more than their salary, but it can be very good for morale to reward that type of thing (assuming they are not introducing a perverse incentive).
It is a good problem to have for a startup; most startups are struggling to find customers to use their thing. Better to go with a "good enough" shortcut and prioritize growth. Recall is a YC company, so I am sure they took advantage of a huge amount of AWS credits in the first few years.
Great.
It would be great if someone wrote a checklist (playbook) for diagnosing CPU, memory, disk, IO and network issues.
> One complicating factor here is that raw video is surprisingly high bandwidth.
It's weird to be living in a world where this is a surprise but here we are.
Nice write-up though. WebSockets has a number of nonsensical design decisions, but I wouldn't have expected that this is the one that would be chewing up all your CPU.
I think it's just rare for a lot of people to be handling raw video. Most people interact with highly efficient (lossy) codecs on the web.
I was surprised when calculating and sizing the shared memory for my gaming VM for use with "Looking-Glass". At 165 Hz 2k HDR it's many gigabytes per second; that's why HDMI and DisplayPort are specced really high.
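The arithmetic is easy to check. A rough sketch, assuming "2k" means 2560x1440 and a 4-byte pixel (for example 10-bit HDR packed into 32 bits), neither of which the comment specifies. The second figure assumes 1080p at 30 fps in a 1.5-bytes-per-pixel YUV 4:2:0 layout, as a typical raw video-call frame format:

```python
def raw_video_bytes_per_second(
    width: int, height: int, fps: int, bytes_per_pixel: float
) -> float:
    """Uncompressed video data rate: every pixel of every frame, no codec."""
    return width * height * bytes_per_pixel * fps


def describe(label: str, rate: float) -> str:
    return f"{label}: {rate / 1e9:.2f} GB/s ({rate * 8 / 1e9:.2f} Gbit/s)"


# Assumed formats; see the note above.
gaming_vm = raw_video_bytes_per_second(2560, 1440, 165, 4)
video_call = raw_video_bytes_per_second(1920, 1080, 30, 1.5)

if __name__ == "__main__":
    print(describe("2560x1440 @ 165 Hz, 4 B/px", gaming_vm))
    print(describe("1920x1080 @ 30 fps, 1.5 B/px", video_call))
```

Under these assumptions the gaming case is about 2.4 GB/s and the 1080p case about 93 MB/s per stream. A wider pixel format (say 8 bytes per pixel) doubles the first number.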
I always knew video was "expensive", but my mark for what expensive meant was a good few orders of magnitude off when I researched the topic for a personal project.
I can easily imagine the author being in a similar boat, knowing that it isn't cheap, but then not realizing that expensive in this context truly does mean expensive until they actually started seeing the associated costs.
> It's weird to be living in a world where this is a surprise but here we are.
I think it's because the cost of it is so abstracted away with free streaming video all across the web. Once you take a look at the egress and ingress sides you realize how quickly it adds up.
I’m so confused… they were sending uncompressed video to an AWS server?
If so, they deserve a $1M bill.
It was on a loopback interface. The problem was CPU usage, not bandwidth costs.
Let me rephrase. They were processing uncompressed video via a loopback interface?
Is this really an AWS issue? Sounds like you were just burning CPU cycles, which is not AWS related. WebSockets makes it sound like it was a data transfer or API gateway cost.
Neither the title nor the article are painting it as an AWS issue, but as a websocket issue, because the protocol implicitly requires all transferred data to be copied multiple times.
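One concrete source of per-byte work in the protocol is client-to-server masking: RFC 6455 requires every frame a client sends to have its payload XORed with a 4-byte masking key, which the server then has to undo. Whether this was the dominant cost in the article's case is not something this thread establishes. A minimal sketch of the transform itself (the function name is illustrative, not from any library):

```python
def apply_mask(payload: bytes, key: bytes) -> bytes:
    """XOR each payload byte with key[i % 4], as in RFC 6455 section 5.3.

    The same function masks and unmasks, because XOR is its own inverse.
    Note that it builds a full new copy of the payload.
    """
    if len(key) != 4:
        raise ValueError("WebSocket masking key must be exactly 4 bytes")
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


if __name__ == "__main__":
    key = bytes([0x37, 0xFA, 0x21, 0x3D])
    masked = apply_mask(b"Hello", key)
    print(masked.hex())                  # masked bytes as sent on the wire
    print(apply_mask(masked, key))       # receiver recovers the payload
```

For a few-kilobyte chat message this is negligible; for megabytes of raw video per frame, touching every byte on send and again on receive adds up, even over loopback.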
I disagree. Like @turtlebits, I was waiting for the part of the story where websocket connections between their AWS resources somehow got billed at Amazon's internet data egress rates.
If you call out your vendor, the issue usually lies with some specific issue with them or their service. The title obviously states AWS.
If I said that "childbirth cost us 5000 on our <hospital name> bill", you assume the issue is with the hospital.
Only for people that just read headlines and make technical decisions based on them. Are we catering to them now? The title is factual and straightforward.
And also highlights a meaningful irrelevance.
The idea that clearer titles are just babying some class of people is perverse.
Titles are the foremost means of deciding what to read, for anyone of any sophistication. Clearer titles benefit everyone.
The subject matter is meaningful to more than AWS users, but non-AWS users are going to be less likely to read it based on the title.
> Is this really an AWS issue?
I doubt they would have even noticed this outrageous cost if they were running on bare-metal Xeons or Ryzen colo'd servers. You can rent real 44-core Xeon servers for like, $250/month.
So yes, it's an AWS issue.
> You can rent real 44-core Xeon servers for like, $250/month.
Where, for instance?
Hetzner, for example. An EPYC 48c (96t) goes for 230 euros.
I checked here: https://www.hetzner.com/managed-server/
I see "AMD EPYC 7502P 32-Core" for 236 EUR per month. Can you tell me where you see 48c/96t?
EDIT
I found it! Unbelievable that it is so cheap.
https://www.hetzner.com/dedicated-rootserver/#cores_threads_...
Hetzner's network is a complete dog. They also sell you machines that should have been EOL'ed long ago. No serious business should be using them.
What CPU do you think your workload is using on AWS?
GCP exposes their CPU models, and they have some Haswell and Broadwell lithographies in service.
That's a 10+ year old part, for those paying attention.
I think they meant that Hetzner is offering specific machines they know to be faulty and should have EOLd to customers, not that they use deprecated CPUs.
That's scary if true, any sources? My google-fu is failing me. :/
It's not scary, it's part of the value proposition.
I used to work for a company that rented lots of Hetzner boxes. Consumer-grade hardware with frequent disk failures was just what we expected for saving a buck.
Sorry, I have no idea if this is true. I was just pointing out what the GP was trying to claim.
Most GCP instances and some AWS instances will migrate to another node when the current one is faulty. Also, the disk is virtual. None of this applies to bare-metal Hetzner.